
Encyclopedia of Optimization Second Edition

C. A. Floudas and P. M. Pardalos (Eds.)


With 613 Figures and 247 Tables


CHRISTODOULOS A. FLOUDAS Department of Chemical Engineering Princeton University Princeton, NJ 08544-5263 USA [email protected]

PANOS M. PARDALOS Center for Applied Optimization Department of Industrial and Systems Engineering University of Florida Gainesville, FL 32611-6595 USA [email protected]

Library of Congress Control Number: 2008927531

ISBN: 978-0-387-74759-0 The print publication is available under ISBN: 978-0-387-74758-3 The print and electronic bundle is available under ISBN: 978-0-387-74760-6 © 2009 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. springer.com Printed on acid-free paper


Preface to the Second Edition

Optimization may be regarded as the cornerstone of many areas of applied mathematics, computer science, engineering, and a number of other scientific disciplines. Among other things, optimization plays a key role in finding feasible solutions to real-life problems, from mathematical programming to operations research, economics, management science, business, medicine, life science, and artificial intelligence, to mention only a few. Optimization entails taking action to find the best solution. As a flourishing research activity, it has led to theoretical and computational advances, new technologies, and new methods for developing better designs of systems, for improving efficiency and robustness, for minimizing the costs of operations in a process, and for maximizing the profits of a company.

The first edition of the Encyclopedia of Optimization was well received by the scientific community and has been an invaluable source of scientific information for researchers, practitioners, and students. Given the enormous yearly growth of this field since the appearance of the first edition, additional optimization knowledge has been added to this second edition. As before, entries are arranged in alphabetical order, and the style of the entries has been retained to emphasize the expository and survey-type nature of the articles. Many older entries have been updated and revised in light of new developments. Finally, several improvements have been made in the format to allow for more links to appropriate internet sites and for electronic availability.

Acknowledgments

We wish to thank all invited contributors for their excellent efforts in writing their articles in an expository way so that they are accessible to most scientists. The editors also take this opportunity to express gratitude to all the contributors, their research groups, their families for their support and encouragement, their research sponsors, and to Princeton University and the University of Florida. Many thanks go especially to Kerstin Kindler at Springer for her efficiency, terrific organizational skills, and friendly spirit throughout the entire project. We would also like to thank the advisory board for their support and suggestions. Finally, we would like to thank the editors at Springer, Ann Kostant and Elizabeth Loew, for their thoughtful guidance, assistance, and support during the planning and preparation of the second edition.

C. A. Floudas and P. M. Pardalos
Editors

About the Editors

Christodoulos A. Floudas Christodoulos A. Floudas is the Stephen C. Macaleer ’63 Professor in Engineering and Applied Science and Professor of Chemical Engineering at Princeton University. Concurrent faculty positions include the Center for Quantitative Biology at the Lewis-Sigler Institute, the Program in Applied and Computational Mathematics, and the Department of Operations Research and Financial Engineering. He has held Visiting Professor positions at Imperial College, the Swiss Federal Institute of Technology (ETH), the University of Vienna, and the Chemical Process Engineering Research Institute (CPERI), Thessaloniki, Greece. Dr. Floudas obtained his Ph.D. from Carnegie Mellon University, and today is a world authority in mathematical modeling and optimization of complex systems. His research interests lie at the interface of chemical engineering, applied mathematics, and operations research, with principal areas of focus including chemical process synthesis and design, process control and operations, discrete-continuous nonlinear optimization, local and global optimization, and computational chemistry and molecular biology. He has received numerous awards for teaching and research, including the NSF Presidential Young Investigator Award, the Engineering Council Teaching Award, the Bodossaki Foundation Award in Applied Sciences, the Best Paper Award in Computers and Chemical Engineering, the AspenTech Excellence in Teaching Award, the 2001 AIChE Professional Progress Award for Outstanding Progress in Chemical Engineering, the 2006 AIChE Computing in Chemical Engineering Award, and the 2007 Graduate Mentoring Award. Dr. Floudas has served on the editorial boards of Industrial & Engineering Chemistry Research, the Journal of Global Optimization, Computers and Chemical Engineering, and various book series. He has authored 2 graduate textbooks, has co-edited several volumes, has published over 200 articles, and has delivered over 300 invited lectures and seminars.


Panos M. Pardalos Panos M. Pardalos is Distinguished Professor of Industrial and Systems Engineering at the University of Florida and the director of the Center for Applied Optimization. He is also an affiliated faculty member of the Computer Science Department, the Hellenic Studies Center, and the Biomedical Engineering Program. Pardalos has held visiting appointments at Princeton University, the DIMACS Center, the Institute for Mathematics and its Applications, the Fields Institute, AT&T Labs Research, Trier University, Linköping Institute of Technology, and universities in Greece. Dr. Pardalos obtained his Ph.D. from the University of Minnesota, and today is a world-leading expert in global and combinatorial optimization. His primary research interests include network design problems, optimization in telecommunications, e-commerce, data mining, biomedical applications, and massive computing. He has been an invited lecturer at several universities and research institutes around the world and has organized several international conferences. He has received numerous awards, including University of Florida Research Foundation Professor, the UF Doctoral Dissertation Advisor/Mentoring Award, Foreign Member of the Royal Academy of Doctors (Spain), Foreign Member of the Lithuanian Academy of Sciences, Foreign Member of the Ukrainian Academy of Sciences, Foreign Member of the Petrovskaya Academy of Sciences and Arts (Russia), and Honorary Member of the Mongolian Academy of Sciences. He has also received an Honorary Doctorate degree from Lobachevski University, he is a fellow of AAAS, a fellow of INFORMS, and in 2001 he was awarded the Greek National Award and Gold Medal for Operations Research. Panos Pardalos is the editor-in-chief of the Journal of Global Optimization, Optimization Letters, and Computational Management Science. Dr. Pardalos is also the managing editor of several book series, and a member of the editorial board of various international journals. He has authored 8 books, has edited several volumes, has written numerous articles, and has also developed several well-known software packages.

Editorial Board Members

DIMITRI P. BERTSEKAS McAfee Professor of Engineering Massachusetts Institute of Technology Cambridge, MA, USA

WALTER MURRAY Department of Management Science and Engineering Stanford University Stanford, CA, USA

JOHN R. BIRGE Jerry W. and Carol Lee Levin Professor of Operations Management The University of Chicago Graduate School of Business Chicago, IL, USA

GEORGE L. NEMHAUSER Chandler Chaired Professor School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA, USA

JONATHAN M. BORWEIN FRSC, Canada Research Chair Computer Science Department Dalhousie University Halifax, NS, Canada

JAMES B. ORLIN Edward Pennell Brooks Professor of Operations Research MIT Sloan School of Management Cambridge, MA, USA

VLADIMIR F. DEMYANOV Applied Mathematics Department St. Petersburg State University St. Petersburg, Russia

FRED GLOVER University of Colorado Boulder, CO, USA

OLVI L. MANGASARIAN Computer Sciences Department University of Wisconsin Madison, WI, USA

ROBERT R. MEYER Computer Sciences Department University of Wisconsin Madison, WI, USA

BORIS MORDUKHOVICH Department of Mathematics Wayne State University Detroit, MI, USA

J. BEN ROSEN Computer Science Department UCSD and University of Minnesota La Jolla, CA, and Minneapolis, MN, USA

ROBERT B. SCHNABEL Computer Science Department University of Colorado Boulder, CO, USA

HANIF D. SHERALI W. Thomas Rice Chaired Professor of Engineering Virginia Polytechnic Institute and State University Blacksburg, VA, USA

RALPH E. STEUER Department of Banking & Finance Terry College of Business University of Georgia Athens, GA, USA


TAMÁS TERLAKY Canada Research Chair in Optimization Director, McMaster School of Computational Engineering and Science McMaster University Hamilton, ON, Canada

J.-PH. VIAL Operations Management University of Geneva Geneva, Switzerland

HOANG TUY Department of Optimization and Control Institute of Mathematics, VAST Hanoi, Vietnam

HENRY WOLKOWICZ Faculty of Mathematics University of Waterloo Waterloo, ON, Canada

List of Contributors

ACKERMANN, JUERGEN DLR Oberpfaffenhofen Wessling, Germany

ANASTASSIOU, GEORGE A. University of Memphis Memphis, TN, USA

ADJIMAN, CLAIRE S. Imperial College London, UK

ANDERSEN, ERLING D. Odense University Odense M, Denmark

AGGOUN, ABDERRAHMANE KLS-OPTIM Villebon sur Yvette, France

ANDREI, NECULAI Center for Advanced Modeling and Optimization and Academy of Romanian Scientists Bucharest, Romania

AHMED, SHABBIR University of Illinois Urbana-Champaign, IL, USA AHUJA, RAVINDRA K. University of Florida Gainesville, FL, USA AIT SAHLIA, FARID University of Florida Gainesville, FL, USA ALEVRAS, DIMITRIS IBM Corporation West Chester, PA, USA ALEXANDROV, NATALIA M. NASA Langley Res. Center Hampton, VA, USA

ANDRICIOAEI, IOAN Boston University Boston, MA, USA

ANDROULAKIS, IOANNIS P. Rutgers University Piscataway, NJ, USA

ANSTREICHER, KURT M. University of Iowa Iowa City, IA, USA

ARABIE, P. Rutgers University Newark, NJ, USA

ALIZAMIR, SAED University of Florida Gainesville, FL, USA

AURENHAMMER, FRANZ Graz University of Technology Graz, Austria

ALKAYA, DILEK Carnegie Mellon University Pittsburgh, PA, USA

AUSTIN-RODRIGUEZ, JENNIFER Louisiana State University Baton Rouge, LA, USA

ALVES, MARIA JOÃO University Coimbra and INESC Coimbra, Portugal

BAGIROV, ADIL University of Ballarat Victoria, Australia


BALAS, EGON Carnegie Mellon University Pittsburgh, PA, USA

BENDSØE, MARTIN P. Technical University of Denmark Lyngby, Denmark

BALASUNDARAM, B. Texas A&M University College Station, TX, USA

BENHAMOU, FRÉDÉRIC Université de Nantes Nantes, France

BANERJEE, IPSITA Rutgers University Piscataway, NJ, USA

BENSON, HAROLD P. University of Florida Gainesville, FL, USA

BAO, GANG Michigan State University East Lansing, MI, USA

BEN-TAL, AHARON Technion – Israel Institute of Technology Haifa, Israel

BARD, JONATHAN F. University of Texas Austin, TX, USA

BERTSEKAS, DIMITRI P. Massachusetts Institute of Technology Cambridge, MA, USA

BARDOW, ANDRÉ RWTH Aachen University Aachen, Germany

BIEGLER, L. T. Carnegie Mellon University Pittsburgh, PA, USA

BARNETTE, GREGORY University of Florida Gainesville, FL, USA

BILLUPS, STEPHEN C. University of Colorado, Denver Denver, CO, USA

BARNHART, CYNTHIA Massachusetts Institute of Technology Cambridge, MA, USA

BIRGE, JOHN R. Northwestern University Evanston, IL, USA

BATTITI, ROBERTO Universitá Trento Povo (Trento), Italy

BIRGIN, ERNESTO G. University of São Paulo São Paulo, Brazil

BEASLEY, JOHN E. The Management School, Imperial College London, England

BISCHOF, CHRISTIAN H. RWTH Aachen University Aachen, Germany

BECKER, OTWIN University of Heidelberg Heidelberg, Germany

BJÖRCK, ÅKE Linköping University Linköping, Sweden

BEHNKE, HENNING Institute Math. TU Clausthal Clausthal, Germany

BOARD, JOHN L.G. London School of Economics and Political Sci. London, UK

BELIAKOV, GLEB Deakin University Victoria, Australia

BOMZE, IMMANUEL M. University of Vienna Wien, Austria


BORCHERS, BRIAN New Mexico Institute of Mining and Technology Socorro, NM, USA

BUTENKO, S. Texas A&M University College Station, TX, USA

BORGWARDT, KARL HEINZ University of Augsburg Augsburg, Germany

CALVETE, HERMINIA I. Universidad de Zaragoza Zaragoza, Spain

BOUYSSOU, DENIS LAMSADE University Paris Dauphine Paris, France

CALVIN, J. M. New Jersey Institute of Technology Newark, NJ, USA

BOYD, STEPHEN Stanford University Stanford, CA, USA

CAMBINI, ALBERTO University of Pisa Pisa, Italy

BRANDEAU, MARGARET Stanford University Stanford, CA, USA

CAPAR, ISMAIL Texas A&M University College Station, TX, USA

BRÄNNLUND, ULF Kungliga Tekniska Högskolan Stockholm, Sweden

CARLSON, DEAN A. University of Toledo Toledo, OH, USA

BREZINSKI, CLAUDE Université Sci. et Techn. Lille Flandres–Artois Lille, France

CARON, RICHARD J. University of Windsor Windsor, ON, Canada

BRIMBERG, JACK Royal Military College of Canada Kingston, ON, Canada

CERVANTES, ARTURO Carnegie Mellon University Pittsburgh, PA, USA

BRUALDI, RICHARD A. University of Wisconsin Madison, WI, USA

CHA, MEEYOUNG Korean Advanced Institute of Science and Technology Daejeon, Korea

BRUCKER, PETER Universität Osnabrück Osnabrück, Germany

CHANG, BYUNGMAN Seoul National University of Technology Seoul, Korea

BRUGLIERI, MAURIZIO University of Camerino Camerino, Italy

CHAOVALITWONGSE, PAVEENA University of Florida Gainesville, FL, USA

BURKARD, RAINER E. Technical University of Graz Graz, Austria

CHAOVALITWONGSE, W. ART Rutgers University Piscataway, NJ, USA

BUSYGIN, STANISLAV University of Florida Gainesville, FL, USA

CHARDAIRE, PIERRE University of East Anglia Norwich, UK


CHEN, ANTHONY Utah State University Logan, UT, USA

CORLISS, GEORGE F. Marquette University Milwaukee, WI, USA

CHEN, JIANER Texas A&M University College Station, TX, USA

COTTLE, RICHARD W. Stanford University Stanford, CA, USA

CHEN, QING Louisiana State University Baton Rouge, LA, USA

CRAVEN, B. D. University of Melbourne Melbourne, VIC, Australia

CHRISTIANSEN, MARIELLE Norwegian University of Science and Technology Trondheim, Norway CIESLIK, DIETMAR University of Greifswald Greifswald, Germany CIFARELLI, C. Università di Roma “La Sapienza” Roma, Italy CIRIC, AMY University of Cincinnati Cincinnati, OH, USA CLÍMACO, JOÃO University Coimbra and INESC Coimbra, Portugal COMBETTES, PATRICK L. University Pierre et Marie Curie Paris, France and New York University New York, NY, USA

CSENDES, TIBOR University of Szeged Szeged, Hungary CUCINIELLO, SALVATORE Italian Research Council Napoli, Italy DADUNA, JOACHIM R. Fachhochschule für Wirtschaft (FHW) Berlin Berlin, Germany DANNINGER-UCHIDA, GABRIELE E. University of Vienna Vienna, Austria DASCI, ABDULLAH York University Toronto, ON, Canada DE ANGELIS, PASQUALE L. Naval Institute Naples, Italy DE LEONE, R. University degli Studi di Camerino Camerino, Italy

COMMANDER, CLAYTON W. University of Florida Gainesville, FL, USA

DEMPE, STEPHAN Freiberg University of Mining and Technology Freiberg, Germany

CONEJEROS, RAÚL Cambridge University Cambridge, UK

DEMYANOV, VLADIMIR F. St. Petersburg State University St. Petersburg, Russia


DENG, NAIYANG China Agricultural University Beijing, China


DENG, XIAOTIE City University of Hong Kong Kowloon, China

DYE, SHANE University of Canterbury Christchurch, New Zealand

DIDERICH, CLAUDE G. Swiss Federal Institute of Technology Lausanne, Switzerland

EDIRISINGHE, CHANAKA University of Tennessee Knoxville, TN, USA

DI GIACOMO, LAURA Università di Roma “La Sapienza” Roma, Italy

EDMONSON, WILLIAM Hampton University Hampton, VA, USA

DIMAGGIO JR., PETER A. Princeton University Princeton, NJ, USA

EGLESE, RICHARD Lancaster University Lancaster, UK

DIXON, LAURENCE University of Hertfordshire Hatfield, England


DOUMPOS, MICHAEL Financial Engineering Lab. Techn. University Crete Chania, Greece DREZNER, TAMMY California State University Fullerton, CA, USA DUA, PINKY Imperial College London, UK and GlaxoSmithKline Research & Development Limited Harlow, UK

EKSIOGLU, BURAK Mississippi State University Mississippi State, MS, USA EKSIOGLU, SANDRA DUNI University of Florida Gainesville, FL, USA ELHEDHLI, SAMIR McGill University Montréal, QC, Canada EMMONS, HAMILTON Case Western Reserve University Cleveland, OH, USA

DUA, VIVEK Imperial College London, UK

ENGE, ANDREAS University of Augsburg Augsburg, Germany

DU, DING-ZHU University of Texas at Dallas Richardson, TX, USA

ERLEBACH, THOMAS University of Leicester Leicester, UK

DUNN, JOSEPH C. North Carolina State University Raleigh, NC, USA

ERMOLIEV, YURI International Institute for Applied Systems Analysis Laxenburg, Austria

DUPAČOVÁ, JITKA Charles University Prague, Czech Republic

ESCUDERO, LAUREANO F. M. Hernández University Elche, Spain


ESPOSITO, WILLIAM R. Princeton University Princeton, NJ, USA

FIACCO, A. V. George Washington University Washington, DC, USA

EVSTIGNEEV, IGOR Russian Academy of Sciences Moscow, Russia

FISCHER, HERBERT Technical University of Munich München, Germany

FÁBIÁN, CSABA I. Eötvös Loránd University Budapest, Hungary

FLÅM, SJUR DIDRIK University of Bergen Bergen, Norway

FAGERHOLT, KJETIL Norwegian University of Science and Technology and Norwegian Marine Technology Research Institute (MARINTEK) Trondheim, Norway

FLOUDAS, CHRISTODOULOS A. Princeton University Princeton, NJ, USA

FAÍSCA, NUNO P. Imperial College London, UK FANG, SHU-CHERNG North Carolina State University Raleigh, NC, USA FANG, YUGUANG University of Florida Gainesville, FL, USA FAN, YA-JU Rutgers University Piscataway, NJ, USA

FORSGREN, ANDERS Royal Institute of Technology (KTH) Stockholm, Sweden FOULDS, L. R. University Waikato Waikato, New Zealand FRAUENDORFER, KARL University of St. Gallen St. Gallen, Switzerland FRENK, HANS Erasmus University Rotterdam, The Netherlands FU, CHANG-JUI National Tsing Hua University Hsinchu, Taiwan

FEMINIANO, DAVIDE Italian Research Council Napoli, Italy

FUNG, HO KI Princeton University Princeton, NJ, USA

FENG, F. J. University of Sussex Sussex, England

FÜRER, MARTIN Pennsylvania State University University Park, PA, USA

FERREIRA, AFONSO INRIA Sophia Antipolis Sophia–Antipolis, France

GAGANIS, CHRYSOVALANTIS Technical University of Crete Chania, Greece

FESTA, PAOLA Universitá Salerno Baronissi, Italy

GALÉ, CARMEN Universidad de Zaragoza Zaragoza, Spain


GAO, DAVID YANG Virginia Polytechnic Institute and State University Blacksburg, VA, USA

GRIPPO, LUIGI Università di Roma “La Sapienza” Roma, Italy

GARCÍA-PALOMARES, UBALDO M. University of Simón Bolívar Caracas, Venezuela

GROSSMANN, IGNACIO E. Carnegie Mellon University Pittsburgh, PA, USA

GEHRLEIN, WILLIAM V. University of Delaware Newark, DE, USA

GRUNDEL, DON 671 ARSS/SYEA Eglin AFB, FL, USA

GENGLER, MARC Université de la Méditerranée Marseille, France

GUARRACINO, MARIO R. Italian Research Council Napoli, Italy

GEROGIORGIS, DIMITRIOS I. Imperial College London, UK GEUNES, JOSEPH University of Florida Gainesville, FL, USA GIACOMO, LAURA DI Università di Roma “La Sapienza” Roma, Italy

GUDDAT, JÜRGEN Humboldt University Berlin, Germany GUERRA, FRANCISCO University of the Americas Cholula, Mexico

GIANNESSI, FRANCO University of Pisa Pisa, Italy

GÜMÜŞ, ZEYNEP H. University of Cincinnati Cincinnati, OH, USA and Cornell University New York, NY, USA

GOELEVEN, DANIEL I.R.E.M.I.A. University de la Réunion Saint-Denis, France

GUPTA, KAPIL Georgia Institute of Technology Atlanta, GA, USA

GOFFIN, JEAN-LOUIS McGill University Montréal, QC, Canada

GÜRSOY, KORHAN University of Cincinnati Cincinnati, OH, USA

GOUNARIS, CHRYSANTHOS E. Princeton University Princeton, NJ, USA

GUSTAFSON, SVEN-ÅKE Stavanger University Stavanger, Norway

GRAMA, ANANTH Purdue University West Lafayette, IN, USA

GUTIN, GREGORY University of London Egham, UK

GRIEWANK, ANDREAS Technical University of Dresden Dresden, Germany

HADDAD, CAROLINE N. State University of New York Geneseo, NY, USA


HADJISAVVAS, NICOLAS University of the Aegean Hermoupolis, Greece

HEARN, DONALD W. University of Florida Gainesville, FL, USA

HAFTKA, RAPHAEL T. University of Florida Gainesville, FL, USA

HEINKENSCHLOSS, MATTHIAS Rice University Houston, TX, USA

HAMACHER, HORST W. Universität Kaiserslautern Kaiserslautern, Germany

HERTZ, DAVID RAFAEL Department 82 Haifa, Israel

HAN, CHI-GEUN Kyung Hee University Seoul, Korea

HETTICH, RAINER University of Trier Trier, Germany

HANSEN, PIERRE GERAD and HEC Montréal Montréal, QC, Canada

HE, XIAOZHENG University of Minnesota Minneapolis, MN, USA

HANSMANN, ULRICH H.E. Michigan Technological University Houghton, MI, USA

HICKS, ILLYA V. Rice University Houston, TX, USA

HARDING, S. T. Princeton University Princeton, NJ, USA

HIGLE, JULIA L. University of Arizona Tucson, AZ, USA

HARHAMMER, PETER G. Technical University of Vienna Vienna, Austria

HOFFMAN, KARLA George Mason University Fairfax, VA, USA

HARJUNKOSKI, IIRO Åbo Akademi University Turku, Finland

HOLDER, ALLEN University Colorado Denver, CO, USA

HASLINGER, JAROSLAV Charles University Prague, Czech Republic

HOLMBERG, KAJ Linköping Institute of Technology Linköping, Sweden

HATZINAKOS, DIMITRIOS University of Toronto Toronto, ON, Canada

HOOKER, J. N. Carnegie Mellon University Pittsburgh, PA, USA

HAUPTMAN, HERBERT A. Hauptman–Woodward Medical Research Institute Inc. Buffalo, NY, USA

HOVLAND, PAUL D. Argonne National Lab. Argonne, IL, USA

HAURIE, ALAIN B. University of Geneva Geneva, Switzerland

HUANG, HONG-XUAN Tsinghua University Beijing, P. R. China


HUANG, XIAOXIA University of Florida Gainesville, FL, USA

JANSSON, C. Techn. Universität Hamburg-Harburg Hamburg, Germany

HUBERT, L. J. University of Illinois Champaign, IL, USA

JEFFCOAT, DAVID AFRL/RWGN Eglin AFB, FL, USA

HUNT III, H. B. University at Albany Albany, NY, USA

JEYAKUMAR, V. University of New South Wales Sydney, NSW, Australia

HÜRLIMANN, TONY University of Fribourg Fribourg, Switzerland

HURSON, CHRISTIAN University Rouen, CREGO Mont Saint Aignan, France

IERAPETRITOU, MARIANTHI Rutgers University Piscataway, NJ, USA

IRI, MASAO Chuo University Tokyo, Japan

ISAC, GEORGE Royal Military College of Canada Kingston, ON, Canada

İZBIRAK, GÖKHAN Eastern Mediterranean University Mersin-10, Turkey

JACOBSEN, STEPHEN E. University of California Los Angeles, CA, USA

JAHN, JOHANNES University of Erlangen–Nürnberg Erlangen, Germany

JHA, KRISHNA C. GTEC Gainesville, FL, USA JHONES, ALINA RUIZ University of Havana San Lázaro y L Ciudad Habana, Cuba JIA, ZHENYA Rutgers University Piscataway, NJ, USA JONES, DONALD R. General Motors Corp. Warren, MI, USA JONGEN, HUBERTUS T. RWTH Aachen University Aachen, Germany JUDSON, RICHARD S. Genaissance Pharmaceuticals New Haven, CT, USA KAKLAMANIS, CHRISTOS University of Patras Patras, Greece

JANAK, STACY L. Princeton University Princeton, NJ, USA

KALLRATH, JOSEF BASF Aktiengesellschaft Ludwigshafen, Germany and University of Florida Gainesville, FL, USA

JANSEN, KLAUS Universität Kiel Kiel, Germany

KALMAN, DAN American University Washington, DC, USA


KAMMERDINER, ALLA R. University of Florida Gainesville, FL, USA

KLAFSZKY, EMIL Technical University Budapest, Hungary

KAPLAN, ALEXANDER University of Trier Trier, Germany

KLATTE, DIETHARD University of Zurich Zurich, Switzerland

KAPLAN, UĞUR Koç University Istanbul, Turkey

KLEPEIS, JOHN L. Princeton University Princeton, NJ, USA

KASAP, SUAT University of Oklahoma Norman, OK, USA

KNIGHT, DOYLE Rutgers University New Brunswick, NJ, USA

KAS, PÉTER Eastern Mediterranean University Mersin-10, Turkey

KOBLER, DANIEL Swiss Federal Institute of Technology Lausanne, Switzerland

KATOH, NAOKI Kyoto University Kyoto, Japan

KOHOUT, LADISLAV J. Florida State University Tallahassee, FL, USA

KEARFOTT, R. BAKER University of Louisiana at Lafayette Lafayette, LA, USA

KOMÁROMI, ÉVA Budapest University of Economic Sciences Budapest, Hungary

KELLEY, C. T. North Carolina State University Raleigh, NC, USA

KONNOV, IGOR V. Kazan University Kazan, Russia

KENNINGTON, JEFFERY L. Southern Methodist University Dallas, TX, USA KESAVAN, H. K. University of Waterloo Waterloo, ON, Canada

KORHONEN, PEKKA Internat. Institute Applied Systems Analysis Laxenburg, Austria and Helsinki School Economics and Business Adm. Helsinki, Finland

KHACHIYAN, LEONID Rutgers University Piscataway, NJ, USA

KOROTKICH, VICTOR Central Queensland University Mackay, QLD, Australia

KIM, DUKWON University of Florida Gainesville, FL, USA

KORTANEK, K. O. University of Iowa Iowa City, IA, USA

KISIALIOU, MIKALAI University of Minnesota Minneapolis, MN, USA

KOSMIDIS, VASSILEIOS D. Imperial College London, UK


KOSTREVA, MICHAEL M. Clemson University Clemson, SC, USA

KYPARISIS, GEORGE J. Florida International University Miami, FL, USA

KOURAMAS, K.I. Imperial College London, UK

KOSMIDOU, KYRIAKI Athens University of Economics and Business Athens, Greece

KRABS, W. University Darmstadt Darmstadt, Germany

LAMAR, BRUCE W. The MITRE Corp. Bedford, MA, USA

KRARUP, JAKOB DIKU Universitetsparken 1 Copenhagen, Denmark

LANCASTER, LAURA C. Clemson University Clemson, SC, USA

KRISHNAN, NIRANJAN Massachusetts Institute of Technology Cambridge, MA, USA

LAPORTE, GILBERT HEC Montréal Montréal, QC, Canada

KROKHMAL, PAVLO A. University of Iowa Iowa City, IA, USA

LAURENT, MONIQUE CWI Amsterdam, The Netherlands

KRUGER, ALEXANDER Y. University of Ballarat Ballarat, VIC, Australia

LAVOR, CARLILE State University of Campinas (IMECC-UNICAMP) Campinas, Brazil

KUBOTA, KOICHI Chuo University Tokyo, Japan

LAWPHONGPANICH, SIRIPHONG Naval Postgraduate School Monterey, CA, USA

KUEHRER, MARTIN Siemens AG (NYSE: SI) Wien, Austria

LECLERC, ANTHONY P. College of Charleston Charleston, SC, USA

KUMAR, ARVIND GTEC Gainesville, FL, USA

LEDZEWICZ, URSZULA Southern Illinois University at Edwardsville Edwardsville, IL, USA

KUMAR, VIPIN University of Minnesota Minneapolis, MN, USA

LEE, EVA K. Georgia Institute of Technology Atlanta, GA, USA

KUNDAKCIOGLU, O. ERHUN University of Florida Gainesville, FL, USA

LEE, WEN University of Florida Gainesville, FL, USA

KUNO, TAKAHITO University of Tsukuba Ibaraki, Japan

LEOPOLD-WILDBURGER, ULRIKE Karl-Franzens University of Graz Graz, Austria


LEPP, RIHO Tallinn Technical University Tallinn, Estonia

LIU, WENBIN University of Kent Canterbury, England

LETCHFORD, ADAM Lancaster University Lancaster, UK

LIWO, ADAM Cornell University Ithaca, NY, USA

LEWIS, KAREN R. Southern Methodist University Dallas, TX, USA


LEYFFER, S. University of Dundee Dundee, UK

LOCKHART BOGLE, IAN DAVID University College London London, UK

LIANG, ZHE Rutgers University Piscataway, NJ, USA

LOUVEAUX, FRANCOIS University of Namur Namur, Belgium

LIANG, ZHIAN Shanghai University of Finance and Economics Shanghai, P.R. China

LOWE, TIMOTHY J. University of Iowa Iowa City, IA, USA

LIBERTI, LEO LIX Palaiseau, France

LU, BING University of Minnesota Minneapolis, MN, USA

LI, GUANGYE Silicon Graphics, Inc. Houston, TX, USA

LUCIA, ANGELO University of Rhode Island Kingston, RI, USA

LI, HAN-LIN National Chiao Tung University Hsinchu, Taiwan

LUO, ZHI-QUAN University of Minnesota Minneapolis, MN, USA

LIM, GINO J. University of Houston Houston, TX, USA

LUUS, REIN University of Toronto Toronto, ON, Canada

LINDBERG, P. O. Linköping University Linköping, Sweden

MAAREN, HANS VAN Delft University of Technology Delft, The Netherlands

LIN, YOUDONG University of Notre Dame Notre Dame, IN, USA

MACULAN, NELSON Federal University of Rio de Janeiro (COPPE-UFRJ) Rio de Janeiro, Brazil

LISSER, ABDEL France Telecom Issy les Moulineaux, France

MAGNANTI, THOMAS L. Massachusetts Institute of Technology Cambridge, MA, USA


MAIER, HELMUT University of Ulm Ulm, Germany

MARTI, KURT University of Munich Neubiberg, Germany

MÄKELÄ, MARKO M. University of Jyväskylä Jyväskylä, Finland

MARTÍNEZ, J. M. University of Campinas Campinas, Brazil

MÁLYUSZ, LEVENTE Technical University Budapest, Hungary

MATOS, ANA C. Université Sci. et Techn. Lille Flandres–Artois Lille, France

MAMMADOV (MAMEDOV), MUSA University of Ballarat Ballarat, VIC, Australia

MAVRIDOU, THELMA D. University of Florida Gainesville, FL, USA

MAPONI, PIERLUIGI University of Camerino Camerino, Italy

MAVROCORDATOS, P. Algotheque and Université Paris 6 Paris, France

MARANAS, COSTAS D. Pennsylvania State University University Park, PA, USA

MCALLISTER, S. R. Princeton University Princeton, NJ, USA

MARAVELIAS, CHRISTOS University of Wisconsin – Madison Madison, WI, USA

MCDONALD, CONOR M. E.I. DuPont de Nemours & Co. Wilmington, DE, USA

MARCOTTE, PATRICE University of Montréal Montréal, QC, Canada

MEDVEDEV, VLADIMIR G. Byelorussian State University Minsk, Republic Belarus

MARINAKI, MAGDALENE Technical University of Crete Chania, Greece

MEULMAN, J. Leiden University Leiden, The Netherlands

MARINAKIS, YANNIS Technical University of Crete Chania, Greece

MIETTINEN, MARKKU University of Jyväskylä Jyväskylä, Finland

MARINO, MARINA University of Naples ‘Federico II’ and CPS Naples, Italy

MINOUX, MICHEL University of Paris Paris, France

MARQUARDT, WOLFGANG RWTH Aachen University Aachen, Germany

MISSEN, RONALD W. University of Toronto Toronto, ON, Canada

MARTEIN, LAURA University of Pisa Pisa, Italy

MISTAKIDIS, EURIPIDIS University of Thessaly Volos, Greece


MITCHELL, JOHN E. Math. Sci. Rensselaer Polytechnic Institute Troy, NY, USA

MUROTA, KAZUO Res. Institute Math. Sci. Kyoto University Kyoto, Japan

MLADENOVIĆ, NENAD Brunel University Uxbridge, UK

MURPHEY, ROBERT US Air Force Research Labor. Eglin AFB, FL, USA

MOCKUS, JONAS Institute Math. and Informatics Vilnius, Lithuania MOHEBI, HOSSEIN University of Kerman Kerman, Iran MONDAINI, RUBEM P. Federal University of Rio de Janeiro, Centre of Technology/COPPE Rio de Janeiro, Brazil MONGEAU, MARCEL University of Paul Sabatier Toulouse, France MÖNNIGMANN, MARTIN Technische Universität Braunschweig Braunschweig, Germany MOON, SUE B. Korean Advanced Institute of Science and Technology Daejeon, Korea MOORE, RAMON E. Worthington, OH, USA MORTON, DAVID P. University of Texas at Austin Austin, TX, USA MOTREANU, DUMITRU University of Alexandru Ioan Cuza Iasi, Romania MULVEY, JOHN M. Princeton University Princeton, NJ, USA MURLI, ALMERICO University of Naples Federico II and Center for Research on Parallel Computing and Supercomputers of the CNR (CPS-CNR) Napoli, Italy

MURRAY, WALTER Stanford University Stanford, CA, USA MURTY, KATTA G. University of Michigan Ann Arbor, MI, USA MUTZEL, PETRA University of Wien Wien, Austria NAGURNEY, ANNA University of Massachusetts Amherst, MA, USA NAHAPETYAN, ARTYOM G. University of Florida Gainesville, FL, USA NAKAYAMA, HIROTAKA Konan University Kobe, Japan NAZARETH, J. L. Washington State University Pullman, WA, USA and University of Washington Seattle, WA, USA NEMIROVSKI, ARKADI Technion: Israel Institute Technology Technion-City, Haifa, Israel NGO, HUANG State University of New York at Buffalo Buffalo, NY, USA NICKEL, STEFAN Universität Kaiserslautern Kaiserslautern, Germany


NIELSEN, SØREN S. University of Copenhagen Copenhagen, Denmark

PALAGI, LAURA Universitá di Roma “La Sapienza” Roma, Italy

NIÑO-MORA, JOSÉ Universidad Carlos III de Madrid Getafe, Spain

PANAGOULI, OLYMPIA Aristotle University Thessaloniki, Greece

NOOR, MUHAMMAD ASLAM Dalhousie University in Halifax Halifax, NS, Canada

PANICUCCI, BARBARA University of Pisa Pisa, Italy

NOWACK, DIETER Humboldt University Berlin, Germany

PAPAGEORGIOU, LAZAROS G. UCL (University College London) London, UK

OKAMOTO, YUKO Graduate University of Adv. Studies Okazaki, Japan

PAPAJORGJI, PETRAQ University of Florida Gainesville, FL, USA

OLAFSSON, SIGURDUR Iowa State University Ames, IA, USA

PAPALEXANDRI, KATERINA P. bp Upstream Technology Middlesex, UK

OLSON, DAVID L. University of Nebraska Lincoln, NE, USA ONN, SHMUEL Technion – Israel Institute of Technology Haifa, Israel ORLIK, PETER University of Wisconsin Madison, WI, USA ORLIN, JAMES B. Massachusetts Institute of Technology Cambridge, MA, USA OULTON, R.F. University of California at Berkeley Berkeley, CA, USA

PAPARRIZOS, KONSTANTINOS University of Macedonia Thessaloniki, Greece PAPPALARDO, MASSIMO University of Pisa Pisa, Italy PARDALOS, PANOS M. University of Florida Gainesville, FL, USA PARPAS, PANOS Imperial College London, UK PASIOURAS, FOTIOS University of Bath Bath, UK

PACHTER, RUTH Air Force Research Laboratory Materials & Manufacturing Directorate Wright–Patterson AFB, USA

PATRIKSSON, MICHAEL Chalmers University of Technology Göteborg, Sweden

PADBERG, MANFRED New York University New York, NY, USA

PATRIZI, GIACOMO Università di Roma “La Sapienza” Roma, Italy


PELILLO, MARCELLO Università Ca’ Foscari di Venezia Venice, Italy

POTVIN, JEAN-YVES University of Montréal Montréal, QC, Canada

PERSIANO, GIUSEPPE Università di Salerno Fisciano, Italy

POURBAIX, DIMITRI Royal Observatory of Belgium Brussels, Belgium

PFLUG, GEORG University of Vienna Vienna, Austria

PRÉKOPA, ANDRÁS RUTCOR, Rutgers Center for Operations Research Piscataway, NJ, USA

PHILLIPS, ANDREW T. University of Wisconsin–Eau Claire Eau Claire, WI, USA

PROKOPYEV, OLEG University of Pittsburgh Pittsburgh, PA, USA

PIAO TAN, MENG Princeton University Princeton, NJ, USA

PICKENHAIN, SABINE Brandenburg Technical University Cottbus Cottbus, Germany

PINAR, MUSTAFA Ç. Bilkent University Ankara, Turkey

QI, LIQUN University of New South Wales Sydney, NSW, Australia QUERIDO, TANIA University of Florida Gainesville, FL, USA QUEYRANNE, MAURICE University of British Columbia Vancouver, BC, Canada

PINTÉR, JÁNOS D. Pintér Consulting Services, Inc., and Dalhousie University Halifax, NS, Canada

RADZIK, TOMASZ King’s College London London, UK

PISTIKOPOULOS, EFSTRATIOS N. Imperial College London, UK

RAGLE, MICHELLE A. University of Florida Gainesville, FL, USA

PITSOULIS, LEONIDAS Princeton University Princeton, NJ, USA

RAI, SANATAN Case Western Reserve University Cleveland, OH, USA

POLYAKOVA, LYUDMILA N. St. Petersburg State University St. Petersburg, Russia

RAJGARIA, R. Princeton University Princeton, NJ, USA

POPOVA, ELMIRA University of Texas at Austin Austin, TX, USA

RALL, L. B. University of Wisconsin–Madison Madison, WI, USA

PÖRN, RAY Åbo Akademi University Turku, Finland

RALPH, DANIEL University of Melbourne Melbourne, VIC, Australia


RAPCSÁK, TAMÁS Hungarian Academy of Sciences Budapest, Hungary

ROMA, MASSIMO Universitá di Roma “La Sapienza” Roma, Italy

RASSIAS, THEMISTOCLES M. University Athens Zografou Campus Athens, Greece

ROMEIJN, H. EDWIN University of Florida Gainesville, FL, USA

RATSCHEK, H. Universität Düsseldorf Düsseldorf, Germany

ROOS, KEES Delft University of Technology AJ Delft, The Netherlands

RAYDAN, MARCOS Universidad Central de Venezuela Caracas, Venezuela

RUBINOV, A. M. School Inform. Techn. and Math. Sci. University Ballarat Ballarat, VIC, Australia

REBENNACK, STEFFEN University of Florida Gainesville, FL, USA RECCHIONI , MARIA CRISTINA University of Ancona Ancona, Italy REEMTSEN, REMBERT Brandenburg Technical University Cottbus Cottbus, Germany

RUBIN, PAUL A. Michigan State University East Lansing, MI, USA RUBIO, J. E. University of Leeds Leeds, UK RÜCKMANN, JAN-J. Ilmenau University of Technology Ilmenau, Germany

REINEFELD, ALEXANDER ZIB Berlin Berlin, Germany

RUSTEM, BERÇ Imperial College London, UK

RESENDE, MAURICIO G.C. AT&T Labs Res. Florham Park, NJ, USA

SAFONOV, MICHAEL G. University Southern California Los Angeles, CA, USA

RIBEIRO, CELSO C. Catholic University Rio de Janeiro Rio de Janeiro, Brazil

SAGASTIZÁBAL, CLAUDIA IMPA Jardim Botânico, Brazil

RIPOLL, DANIEL R. Cornell University Ithaca, NY, USA

SAHINIDIS, NIKOLAOS V. University of Illinois Urbana-Champaign, IL, USA


SAHIN, KEMAL University of Cincinnati Cincinnati, OH, USA

ROKNE, J. University of Calgary Calgary, AB, Canada

SAKIZLIS, V. Bechtel Co. Ltd. London, UK


SAMARAS, NIKOLAOS University of Macedonia Thessaloniki, Greece

SCHWEIGER, CARL A. Princeton University Princeton, NJ, USA

SANCHEZ, SALVADOR NIETO Louisiana State University Baton Rouge, LA, USA

SEN, SUVRAJEET University of Arizona Tucson, AZ, USA

SARAIVA, PEDRO M. Imperial College London, UK

SHAIKH, AMAN AT&T Labs – Research Florham Park, NJ, USA

SAVARD, GILLES École Polytechnique Montréal, QC, Canada

SHAIK, MUNAWAR A. Princeton University Princeton, NJ, USA

SAVELSBERGH, MARTIN W.P. Georgia Institute of Technology Atlanta, GA, USA

SHALLOWAY, DAVID Cornell University Ithaca, NY, USA

SAYIN, SERPIL Koç University İstanbul, Turkey

SHAPIRO, ALEXANDER Georgia Institute of Technology Atlanta Atlanta, GA, USA

SAYYADY, FATEMEH North Carolina State University Raleigh, NC, USA

SHERALI, HANIF D. Virginia Polytechnic Institute and State University Blacksburg, VA, USA

SCHAIBLE, SIEGFRIED University of California Riverside, CA, USA

SHETTY, BALA Texas A&M University College Station, TX, USA

SCHÄTTLER, HEINZ Washington University St. Louis, MO, USA

SHI, LEYUAN University of Wisconsin Madison, WI, USA

SCHERAGA, HAROLD A. Cornell University Ithaca, NY, USA

SIM, MELVYN NUS Business School Singapore

SCHOEN, FABIO Universita Firenze Firenze, Italy

SIMONE, VALENTINA DE University Naples ‘Federico II’ and CPS Naples, Italy

SCHULTZ, RÜDIGER Gerhard-Mercator University Duisburg, Germany

SIMONS, STEPHEN University of California Santa Barbara, CA, USA

SCHÜRLE, MICHAEL University of St. Gallen St. Gallen, Switzerland

SIRLANTZIS, K. University of Kent Canterbury, England


SISKOS, YANNIS Technical University of Crete Chania, Greece

STAVROULAKIS, GEORGIOS E. Carolo Wilhelmina Technical University Braunschweig, Germany

SIVAZLIAN, B. D. University of Florida Gainesville, FL, USA

STEIN, OLIVER University of Karlsruhe Karlsruhe, Germany

SLOWINSKI, ROMAN Poznań University of Technology Poznań, Poland

STILL, CLAUS Åbo Akademi University Åbo, Finland

SMITH, ALEXANDER BARTON Carnegie Mellon University Pittsburgh, PA, USA

STILL, GEORG University of Twente Enschede, The Netherlands

SMITH, J. MACGREGOR University of Massachusetts, Amherst Amherst, MA, USA

STRAUB, JOHN Boston University Boston, MA, USA

SMITH, WILLIAM R. University of Guelph Guelph, ON, Canada

STRONGIN, ROMAN G. Nizhni Novgorod State University Nizhni Novgorod, Russia

SO, ANTHONY MAN-CHO The Chinese University of Hong Kong Hong Kong, China

SUGIHARA, KOKICHI Graduate School of Engineering, University of Tokyo Tokyo, Japan

SOBIESZCZANSKI-SOBIESKI, JAROSLAW NASA Langley Research Center Hampton, VA, USA

SUN, DEFENG University of New South Wales Sydney, NSW, Australia

SOKOLOWSKI, JAN Université Henri Poincaré Nancy, France

SUTCLIFFE, CHARLES M.S. University of Southampton Southampton, UK

SOLODOV, MICHAEL V. Institute Mat. Pura e Apl. Rio de Janeiro, Brazil

SVENSSON, LARS Royal Institute of Technology (KTH) Stockholm, Sweden

SPEDICATO, EMILIO University of Bergamo Bergamo, Italy

SZÁNTAI, TAMÁS Technical University Budapest, Hungary

SPIEKSMA, FRITS Maastricht University Maastricht, The Netherlands

TAN, MENG PIAO Princeton University Princeton, NJ, USA

STADTHERR, MARK A. University of Notre Dame Notre Dame, IN, USA

TAWARMALANI, MOHIT University of Illinois Urbana-Champaign, IL, USA


TEBOULLE, MARC Tel-Aviv University Ramat-Aviv, Tel-Aviv, Israel

TRAUB, J. F. Columbia University New York, NY, USA

TEGHEM, JACQUES Polytechnique Mons Mons, Belgium

TRIANTAPHYLLOU, EVANGELOS Louisiana State University Baton Rouge, LA, USA

TERLAKY, TAMÁS McMaster University Hamilton, ON, Canada

TRLIFAJOVÁ, KATEŘINA Charles University Prague, Czech Republic

TESFATSION, LEIGH Iowa State University Ames, IA, USA

TSAI, JUNG-FA National Taipei University of Technology Taipei, Taiwan

THENGVALL, BENJAMIN CALEB Technologies Corp. Austin, TX, USA

TSAO, JACOB H.-S. San José State University San José, CA, USA

THOAI, NGUYEN V. University of Trier Trier, Germany

TSENG, PAUL University of Washington Seattle, WA, USA

THOMAS, REKHA R. University of Washington Seattle, WA, USA

TSIPLIDIS, KONSTANTINOS University of Macedonia Thessaloniki, Greece

TICHATSCHKE, RAINER University of Trier Trier, Germany

TUNÇEL, LEVENT University of Waterloo Waterloo, ON, Canada

TIND, JØRGEN University of Copenhagen Copenhagen, Denmark

TÜRKAY, METIN Koç University Istanbul, Turkey

TITS, ANDRÉ L. University of Maryland College Park, MD, USA

TUY, HOANG Vietnamese Academy of Science and Technology Hanoi, Vietnam

TORALDO, GERARDO University of Naples ‘Federico II’ and CPS Naples, Italy

UBHAYA, VASANT A. North Dakota State University Fargo, ND, USA

TORVIK, VETLE I. Louisiana State University Baton Rouge, LA, USA

URYASEV, S. University of Florida Gainesville, FL, USA

TRAFALIS, THEODORE B. University of Oklahoma Norman, OK, USA

VAIDYANATHAN, BALACHANDRAN University of Florida Gainesville, FL, USA


VAIRAKTARAKIS, GEORGE Case Western Reserve University Cleveland, OH, USA

VISWESWARAN, VISWANATHAN SCA Technologies, LLC Pittsburgh, PA, USA

VANCE, PAMELA H. Emory University Atlanta, GA, USA

VLADIMIROU, HERCULES University of Cyprus Nicosia, Cyprus

VANDENBERGHE, LIEVEN University of California Los Angeles, CA, USA

VOPĚNKA, PETR Charles University Prague, Czech Republic

VAN DEN HEEVER, SUSARA Carnegie Mellon University Pittsburgh, PA, USA

VOSS, STEFAN University of Hamburg Hamburg, Germany

VASANTHARAJAN, SRIRAM Mobil Technology Company Dallas, TX, USA

VURAL, ARIF VOLKAN Mississippi State University Mississippi State, MS, USA

VASSILIADIS, VASSILIOS S. Cambridge University Cambridge, UK VAVASIS, STEPHEN A. Cornell University Ithaca, NY, USA VAZACOPOULOS, ALKIS Dash Optimization Englewood Cliffs, NJ, USA VEMULAPATI, UDAYA BHASKAR University of Central Florida Orlando, FL, USA VERTER, VEDAT McGill University Montréal, QC, Canada VIAL, JEAN-PHILIPPE University of Genève Geneva, Switzerland VICENTE, LUIS N. University of Coimbra Coimbra, Portugal VINCKE, PH. Université Libre de Bruxelles, Gestion Brussels, Belgium

WALLACE, STEIN W. Norwegian University Sci. and Techn. Trondheim, Norway WALTERS, JAMES B. Marquette University Milwaukee, WI, USA WANG, YANJUN Shanghai University of Finance and Economics Shanghai, China WANG, ZHIQIANG Air Force Research Laboratory Materials & Manufacturing Directorate Wright-Patterson AFB, OH, USA WATSON, LAYNE T. Virginia Polytechnic Institute and State University Blacksburg, VA, USA WEI, JAMES Princeton University Princeton, NJ, USA WERSCHULZ, A. G. Fordham University and Columbia University New York, NY, USA


WESOLOWSKY, GEORGE O. McMaster University Hamilton, ON, Canada

XUE, JUE City University of Hong Kong Kowloon, Hong Kong

WESTERLUND, TAPIO Åbo Akademi University Åbo, Finland

XU, YINFENG Xian Jiaotong University Xian, China

WOLKOWICZ, HENRY University of Waterloo Waterloo, ON, Canada

YAJIMA, YASUTOSHI Tokyo Institute of Technology Tokyo, Japan

WOOD, GRAHAM Massey University Palmerston North, New Zealand

YANG, ERIC Rutgers University Piscataway, NJ, USA

WU, SHAO-PO Stanford University Stanford, CA, USA WU, TSUNG-LIN Georgia Institute of Technology Atlanta, GA, USA WU, WEILI University of Minnesota Minneapolis, MN, USA WYNTER, LAURA University of Versailles Versailles-Cedex, France XANTHOPOULOS, PETROS University of Florida Gainesville, FL, USA XIA, ZUNQUAN Dalian University of Technology Dalian, China

YATES, JENNIFER AT&T Labs – Research Florham Park, NJ, USA

YAVUZ, MESUT University of Florida Gainesville, FL, USA

YE, YINYU University of Iowa Iowa City, IA, USA

YU, GANG University of Texas at Austin Austin, TX, USA

YÜKSEKTEPE, FADIME ÜNEY Koç University Istanbul, Turkey

ZABINSKY, ZELDA B. University of Washington Seattle, WA, USA

XIE, WEI American Airlines Operations Research and Decision Support Group Fort Worth, TX, USA

ZAMORA, JUAN M. Universidad Autónoma Metropolitana-Iztapalapa Mexico City, Mexico

XI, SHAOLIN Beijing Polytechnic University Beijing, China

ZELIKOVSKY, ALEXANDER Georgia State University Atlanta, GA, USA

XU, CHENGXIAN Xian Jiaotong University Xian, China

ZENIOS, STAVROS A. University of Cyprus Nicosia, Cyprus


ZHANG, JIANZHONG City University of Hong Kong Kowloon Tong, Hong Kong ZHANG, LIWEI Dalian University of Technology Dalian, China ZHANG, QINGHONG University of Iowa Iowa City, IA, USA ZHU, YUSHAN Tsinghua University Beijing, China ZIEMBA, WILLIAM T. University of British Columbia Vancouver, BC, Canada ŽILINSKAS, ANTANAS Institute of Mathematics and Informatics Vilnius, Lithuania ŽILINSKAS, JULIUS Institute of Mathematics and Informatics Vilnius, Lithuania

ZIRILLI, FRANCESCO Università di Roma “La Sapienza” Roma, Italy ZISSOPOULOS, DIMITRIOS Technical Institute of West Macedonia Kozani, Greece ZLOBEC, SANJO McGill University West Montréal, QC, Canada ZOCHOWSKI, ANTONI Systems Research Institute of the Polish Academy of Sciences Warsaw, Poland ZOPOUNIDIS, CONSTANTIN Technical University of Crete Chania, Greece ZOWE, JOCHEM University of Nürnberg–Erlangen Erlangen, Germany


A

ABS Algorithms for Linear Equations and Linear Least Squares

EMILIO SPEDICATO
Department of Mathematics, University of Bergamo, Bergamo, Italy

MSC2000: 65K05, 65K10

Article Outline

Keywords
Synonyms
The Scaled ABS Class: General Properties
Subclasses of the ABS Class
The Implicit LU Algorithm and the Huang Algorithm
Other ABS Linear Solvers
ABS Methods for Linear Least Squares
See also
References

Keywords

Linear algebraic equations; Linear least squares; ABS methods; Abaffian matrices; Huang algorithm; Implicit LU algorithm; Implicit LX algorithm

Synonyms

Abaffi–Broyden–Spedicato algorithms for linear equations and linear least squares

The Scaled ABS Class: General Properties

ABS methods were introduced by [1], in a paper dealing originally only with solving linear equations via what is now called the basic or unscaled ABS class. The basic ABS class was later generalized to the so-called scaled ABS class and subsequently applied to linear least squares, nonlinear equations and optimization problems, see [2]. Preliminary work has also been initiated concerning Diophantine equations, with possible extensions to combinatorial optimization, and the eigenvalue problem. There are presently (1998) over 350 papers in the ABS field, see [11]. In this contribution we review the basic properties and results of ABS methods for solving determined or underdetermined linear systems and for solving overdetermined linear systems in the least squares sense.

Let us consider the determined or underdetermined linear system, where rank(A) is arbitrary,

$$ A x = b, \qquad x \in \mathbf{R}^n, \; b \in \mathbf{R}^m, \; m \le n, \tag{1} $$

or

$$ a_i^\top x - b_i = 0, \qquad i = 1, \dots, m, \tag{2} $$

where

$$ A = \begin{pmatrix} a_1^\top \\ \vdots \\ a_m^\top \end{pmatrix}. \tag{3} $$

The steps of the scaled ABS class algorithm are as follows:
A) Let $x_1 \in \mathbf{R}^n$ be arbitrary, let $H_1 \in \mathbf{R}^{n \times n}$ be arbitrary nonsingular, and let $v_1$ be an arbitrary nonzero vector in $\mathbf{R}^m$; set $i = 1$.
B) Compute the residual $r_i = A x_i - b$. If $r_i = 0$, stop ($x_i$ solves the problem); else compute $s_i = H_i A^\top v_i$. If $s_i \neq 0$, go to C). If $s_i = 0$ and $\tau = v_i^\top r_i = 0$, set $x_{i+1} = x_i$, $H_{i+1} = H_i$ and go to F); else stop (the system has no solution).


C) Compute the search vector $p_i$ by
$$ p_i = H_i^\top z_i, \tag{4} $$
where $z_i \in \mathbf{R}^n$ is arbitrary save for the condition
$$ v_i^\top A H_i^\top z_i \neq 0. \tag{5} $$
D) Update the estimate of the solution by
$$ x_{i+1} = x_i - \alpha_i p_i, \tag{6} $$
where the stepsize $\alpha_i$ is given by
$$ \alpha_i = \frac{v_i^\top r_i}{v_i^\top A p_i}. \tag{7} $$
E) Update the matrix $H_i$ by
$$ H_{i+1} = H_i - \frac{H_i A^\top v_i w_i^\top H_i}{w_i^\top H_i A^\top v_i}, \tag{8} $$

where $w_i \in \mathbf{R}^n$ is arbitrary save for the condition
$$ w_i^\top H_i A^\top v_i \neq 0. \tag{9} $$

F) If $i = m$, then stop ($x_{m+1}$ solves the system); else define $v_{i+1}$ as an arbitrary vector in $\mathbf{R}^m$, linearly independent from $v_1, \dots, v_i$, increment $i$ by one and go to B).

(10)

For several choices of the matrix V the matrix L is diagonal, hence formula (10) gives a fully explicit factorization of the inverse as a byproduct of the ABS solution of a linear system, a property that does not hold for the classical solvers. It can also be shown that all possible factorizations of the form (10) can be obtained by proper parameter choices in the scaled ABS class, another completeness result.  Define Si and Ri by Si = (s1 , . . . , si ), Ri = (r1 , . . . , ri ), where si = H i A| vi , ri = H > i wi . Then the Abaffian can be written in the form H i+1 = H 1  Si R> i and the vectors si , ri can be built via a Gram–Schmidt type iterations involving the previous vectors (the search vector pi can be built in a similar way). This representation of the Abaffian in terms of 2i vectors is computationally convenient when the number of equations is much less than the number of variables. Notice that there is also a representation in terms of n  i vectors.

ABS Algorithms for Linear Equations and Linear Least Squares

 A compact formula of the Abaffian in terms of the parameter matrices is the following H iC1 D H1  H1 A> Vi (Wi> H1 A> Vi )1 Wi> H1 : (11) Letting V = V m , W = W m , one can show that the parameter matrices H 1 , V, W are admissible (i. e. are such that condition (9) is satisfied) if and only if the matrix Q = V | AH > 1 W is strongly nonsingular (i. e. is LU factorizable). Notice that this condition can always be satisfied by suitable exchanges of the columns of V or W, equivalent to a row or a column pivoting on the matrix Q. If Q is strongly nonsingular and we take, as is done in all algorithms insofar considered, zi = wi , then condition (5) is also satisfied. It can be shown that the scaled ABS class corresponds to applying (implicitly) the unscaled ABS algorithm to the scaled (or preconditioned) system V | Ax = V | b, where V is an arbitrary nonsingular matrix of order m. Therefore we see that the scaled ABS class is also complete with respect to all possible left preconditioning matrices, which in the ABS context are defined implicitly and dynamically (only the ith column of V is needed at the ith iteration, and it can also be a function of the previous column choices). Subclasses of the ABS Class In [1], nine subclasses are considered of the scaled ABS class. Here we quote three important subclasses.  The conjugate direction subclass. This class is well defined under the condition (sufficient but not necessary) that A is symmetric and positive definite. It contains the implicit Choleski algorithm, the Hestenes–Stiefel and the Lanczos algorithms. This class generates all possible algorithms whose search directions are A-conjugate. The vector xi + 1 minimizes the energy or A-weighted Euclidean norm of the error over x1 + Span(p1 , . . . , pi ). If x1 = 0, then the solution is approached monotonically from below in the energy norm.  The orthogonally scaled subclass. This class is well defined if A has full column rank and remains well defined even if m is greater than n. It contains the ABS formulation of the QR algorithm (the socalled implicit QR algorithm), of the GMRES and of

A

the conjugate residual algorithms. The scaling vectors are orthogonal and the search vectors are AA| conjugate. The vector xi + 1 minimizes the Euclidean norm of the residual over x1 + Span(p1 , . . . , pi ). In general, the methods in this class can be applied to overdetermined systems to obtain the solution in the least squares sense.  The optimally scaled subclass. This class is obtained by the choice vi = A | pi . The inverse of A| disappears in the actual formulas, if we make the change of variables zi = A| ui , ui being now the parameter that defines the search vector. For ui = ei the Huang method is obtained and for ui = ri a method equivalent to Craig’s conjugate gradient type algorithm. From the general implicit factorization relation one obtains P| P = D or V | AA| V = D, a relation which was shown in [5] to characterize the optimal choice of the parameters in the general Petrov–Galerkin process in terms of minimizing the effect of a single error in xi on the final computed solution. Such a property is therefore satisfied by the Huang (and the Craig) algorithm, but not, for instance, by the implicit LU or the implicit QR algorithms. A. Galantai [8] has shown that the condition characterizing the optimal choice of the scaling parameters in terms of minimizing the final residual Euclidean norm is V | V = D, a condition satisfied by the implicit QR algorithm, the GMRES method, the implicit LU algorithm and again by the Huang algorithm, which therefore satisfies both conditions). The methods in the optimally stable subclass have the property that xi + 1 minimizes the Euclidean norm of the error over x1 + Span(p1 , . . . , pi ). The Implicit LU Algorithm and the Huang Algorithm Specific algorithms of the scaled ABS class are obtained by choosing the available parameters. The implicit LU algorithm is given by the choices H 1 = I, zi = wi = vi = ei . We quote the following properties of the implicit LU algorithm. a) The algorithm is well defined if and only if A is regular (i. e. all principal submatrices are nonsingular). Otherwise column pivoting has to be performed (or, if m = n, equations pivoting).


b) The Abaffian H_{i+1} has the following structure, with K_i ∈ R^{(n−i)×i}:

   H_{i+1} = ( 0     0
               K_i   I_{n−i} ) .   (12)

c) Only the first i components of p_i can be nonzero, and the ith component is one. Hence the matrix P_i is unit upper triangular, so that the implicit factorization A = L P^{−1} is of the LU type, with units on the diagonal, justifying the name.
d) Only K_i has to be updated. The algorithm requires nm² − 2m³/3 multiplications plus lower order terms, hence, for m = n, n³/3 multiplications plus lower order terms. This is the same overhead required by the classical LU factorization or by Gaussian elimination (two essentially equivalent processes).
e) The main storage requirement is the storage of K_i, whose maximum size is n²/4. This is half the storage needed by Gaussian elimination and a quarter of that needed by the LU factorization algorithm (assuming that A is not overwritten). Hence the implicit LU algorithm is computationally better than classical Gaussian elimination or the LU algorithm, having the same overhead but a lower memory cost.

The implicit LU algorithm, implemented in the case m = n with row pivoting, has been shown in experiments of M. Bertocchi and Spedicato [3] to be numerically stable, and in experiments of E. Bodon [4] on the vector processor Alliant FX 80 with 8 processors to be about twice as fast as the LAPACK implementation of the classical LU algorithm.

The Huang algorithm is obtained by the parameter choices H_1 = I, z_i = w_i = a_i, v_i = e_i. A mathematically equivalent, but numerically more stable, formulation of this algorithm is the so-called modified Huang algorithm, where the search vectors and the Abaffians are given by the formulas p_i = H_i (H_i a_i) and H_{i+1} = H_i − p_i p_i^T / (p_i^T p_i). Some properties of this algorithm follow.
• The search vectors are orthogonal and are the same vectors obtained by applying the classical Gram–Schmidt orthogonalization procedure to the rows of A. The modified Huang algorithm is related, but not numerically identical, to the Daniel–Gragg–Kaufman–Stewart reorthogonalized Gram–Schmidt algorithm [6].
• If x_1 is the zero vector, then the vector x_{i+1} is the solution with least Euclidean norm of the first i equations, and the solution x⁺ of least Euclidean norm of the whole system is approached monotonically and from below by the sequence x_i. L. Zhang [17] has shown that the Huang algorithm can be applied, via the Goldfarb–Idnani active set strategy [9], to systems of linear inequalities. The process in a finite number of steps either finds the solution with least Euclidean norm or determines that the system has no solution.
• While the error growth in the Huang algorithm is governed by the square of the number κ_i = ‖a_i‖ / ‖H_i a_i‖, which is certainly large for some i if A is ill conditioned, the error growth depends only on κ_i if p_i or H_i are defined as in the modified Huang algorithm, and, at first order, there is no error growth for the modified Huang algorithm.
• Numerical experiments, see [15], have shown that the modified Huang algorithm is very stable, usually giving better accuracy in the computed solution than both the implicit LU algorithm and the classical LU factorization method.

The implicit LX algorithm is defined by the choices H_1 = I, v_i = e_i, z_i = w_i = e_{k_i}, where k_i is an integer, 1 ≤ k_i ≤ n, such that

e_{k_i}^T H_i a_i ≠ 0 .   (13)

Notice that by a general property of the ABS class, for A with full rank there is at least one index k_i such that (13) is satisfied. For stability reasons it may be recommended to select k_i such that ν_i = |e_{k_i}^T H_i a_i| is maximized. The following properties are valid for the implicit LX algorithm. Let N be the set of integers from 1 to n, N = (1, ..., n). Let B_i be the set of indices k_1, ..., k_i chosen for the parameters of the implicit LX algorithm up to step i. Let N_i be the set N \ B_i. Then:
• The index k_i is selected in the set N_{i−1}.
• The rows of H_{i+1} with index k ∈ B_i are null rows.
• The vector p_i has n − i zero components; its k_i-th component is equal to one.
• If x_1 = 0, then x_{i+1} is a basic type solution of the first i equations, whose nonzero components may lie only in the positions corresponding to the indices k ∈ B_i.
• The columns of H_{i+1} with index k ∈ N_i are the unit vectors e_k, while the columns of H_{i+1} with index k ∈ B_i have zero components in the jth position for j ∈ B_i, implying that only i(n − i) elements of such columns have to be computed.
• At the ith step, i(n − i) multiplications are needed to compute H_i a_i and i(n − i) to update the nontrivial part of H_i. Hence the total number of multiplications is the same as for the implicit LU algorithm (i.e. n³/3), but no pivoting is necessary, reflecting the fact that no condition is required on the matrix A.
• The storage requirement is the same as for the implicit LU algorithm, i.e. at most n²/4. Hence the implicit LX algorithm shares the storage advantage of the implicit LU algorithm over the classical LU algorithm, with the additional advantage of not requiring pivoting.
• Numerical experiments by K. Mirnia [10] have shown that the implicit LX method usually gives better accuracy, in terms of error in the computed solution, than the implicit LU algorithm, and often even better than the modified Huang algorithm. In terms of size of the final residual, its accuracy is comparable to that of the LU algorithm as implemented (with row pivoting) in the MATLAB or LAPACK libraries, but it is again better in terms of error in the solution.
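As an illustration of the modified Huang recursion just described, the following is a minimal numpy sketch (the function name and error handling are ours, and no attempt is made at blocked or restarted variants):

```python
import numpy as np

def modified_huang(A, b):
    """Sketch of the modified Huang ABS algorithm (H1 = I, z_i = w_i = a_i),
    with the reprojection p_i = H_i (H_i a_i) for numerical stability."""
    m, n = A.shape
    x = np.zeros(n)            # x1 = 0 gives least-norm partial solutions
    H = np.eye(n)              # Abaffian, initialized to the identity
    for i in range(m):
        a_i = A[i]
        s = H @ a_i
        if np.allclose(s, 0.0):
            if np.isclose(a_i @ x, b[i]):
                continue       # redundant equation, skip it
            raise ValueError("system has no solution")
        p = H @ s              # reprojected search vector p_i = H_i (H_i a_i)
        x = x - (a_i @ x - b[i]) / (a_i @ p) * p
        H = H - np.outer(p, p) / (p @ p)   # rank-one Abaffian update
    return x
```

Started from x_1 = 0, each iterate is the least Euclidean norm solution of the equations processed so far, in line with the property quoted above.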

Other ABS Linear Solvers

ABS reformulations have been obtained for most algorithms proposed in the literature. The availability of several formulations of the linear algebra of the ABS process allows alternative formulations of each method, with possibly different values of overhead and storage and different properties of numerical stability, vectorization and parallelization. The reprojection technique, already seen in the case of the modified Huang algorithm and based upon the identities H_i q = H_i (H_i q) and H_i^T q = H_i^T (H_i^T q), valid for any vector q if H_1 = I, remarkably improves the stability of the algorithms. The ABS versions of the Hestenes–Stiefel and the Craig algorithms, for instance, are very stable under the above reprojection. The implicit QR algorithm, defined by the choices H_1 = I, v_i = A p_i, z_i = w_i = e_i, can be implemented in a very stable way using the reprojection in both the definition of the search vector and the scaling vector. It should also be noticed that the classical iterative refinement procedure, which amounts to a Newton iteration on the system Ax − b = 0 using the approximate factors of A, can be reformulated in the ABS context using the previously defined search vectors p_i. Experiments of Mirnia [11] have shown that ABS refinement works excellently.

For problems with special structure, ABS methods can often be implemented taking into account the effect of the structure on the Abaffian matrix, which often tends to reflect the structure of the matrix A. For instance, if A has a banded structure, the same is true for the Abaffian matrix generated by the implicit LU, the implicit QR and the Huang algorithms, albeit with an increased band size. If A is SPD and has an ND structure, the same is true for the Abaffian matrix. In this case the implementation of the implicit LU algorithm has a much smaller storage cost, for large n, than an implementation of the Choleski algorithm. For matrices having the Kuhn–Tucker structure (KT structure), large classes of ABS methods have been devised, see ► ABS Algorithms for Optimization. For matrices with general sparsity patterns, little is presently known about minimizing the fill-in in the Abaffian matrix. Careful use of BLAS4 routines can however substantially reduce the number of operations and make the ABS implementation competitive with a sparse implementation of, say, the LU factorization (e.g. by the code MA28) for values of n that are not too big.

It is possible to implement the ABS process also in block form, where several equations, instead of just one, are dealt with at each step. The block formulation does not deteriorate the numerical accuracy and can lead to a reduction of overhead on special problems or to faster implementations on vector or parallel computers.

Finally, infinite iterative methods can be obtained from the finite ABS methods via two approaches. The first one consists in restarting the iteration after k < m steps, so that the storage will be of order 2kn if the representation of the Abaffian in terms of 2i vectors is used. The second approach consists in using only a limited number of terms in the Gram–Schmidt type processes that are alternative formulations of the ABS procedure. For both cases convergence at a linear rate has been established using the technique developed in [7]. The infinite iteration methods obtained by these approaches define a very large class of methods that contains not only all Krylov space type methods of the literature, but also non-Krylov type methods such as the Gauss–Seidel, the De La Garza and the Kaczmarz methods, with their generalizations.

ABS Methods for Linear Least Squares

There are several ways of using ABS methods for solving an overdetermined linear system in the least squares sense without forming the normal equations of Gauss, which are usually avoided on account of their higher condition number. One possibility is to compute explicitly the factors associated with the implicit factorization and then use them in the standard way. From the results of [14], the obtained methods work well, usually giving better results than the methods using the QR factorization computed in the standard way. A second possibility is to use the representation of the Moore–Penrose pseudoinverse that is provided explicitly by the ABS technique described in [13]. Again this approach has given very good numerical results. A third possibility is based upon the equivalence of the normal system A^T A x = A^T b with the extended system in the variables x ∈ R^n, y ∈ R^m, given by the two subsystems Ax = y, A^T y = A^T b. The first of the subsystems is overdetermined but must be solvable. Hence y must lie in the range of A, which means that y must be the solution of least Euclidean norm of the second, underdetermined subsystem. Such a solution is computed by the Huang algorithm. Then the ABS algorithm, applied to the first subsystem, in step B) recognizes and eliminates the m − k dependent equations, where k is the rank of A. If k < n there are infinitely many solutions, and the one of least Euclidean norm is obtained by using again the Huang algorithm on the first subsystem. Finally, a large class of ABS methods can be applied directly to an overdetermined system, stopping after n iterations at a least squares solution. The class is obtained by defining V = AU, where U is an arbitrary nonsingular matrix in R^{n×n}. Indeed, at the point x_{n+1} the satisfied Petrov–Galerkin condition is just equivalent to the normal equations of Gauss. If U = P, then the orthogonally scaled class is obtained, implying, as already stated above, that the methods of this class can be applied to solve linear least squares problems (but a suitable modification has to be made for the GMRES method). A version of the implicit QR algorithm, with reprojection on both the search vector and the scaling vector, tested in [12], has outperformed other ABS algorithms for linear least squares as well as methods in the LINPACK and NAG libraries based upon the classical QR factorization via Householder matrices.
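The third possibility above can be illustrated schematically. In the following numpy sketch, library least-norm solves stand in for the Huang algorithm, so it shows only the algebra of the extended system, not an actual ABS implementation:

```python
import numpy as np

def lls_via_extended_system(A, b):
    """Least squares solution of Ax ~ b via the two subsystems
    A^T y = A^T b (least-norm y, here via pinv instead of Huang) and
    A x = y (compatible by construction, least-norm x taken)."""
    y = np.linalg.pinv(A.T) @ (A.T @ b)   # least-norm y lies in range(A)
    x = np.linalg.pinv(A) @ y             # solve the now compatible Ax = y
    return x

# sanity check against a standard least squares solver
A = np.random.rand(8, 3); b = np.random.rand(8)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(lls_via_extended_system(A, b), x_ref)
```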

See also

► ABS Algorithms for Optimization
► Cholesky Factorization
► Gauss–Newton Method: Least Squares, Relation to Newton's Method
► Generalized Total Least Squares
► Interval Linear Systems
► Large Scale Trust Region Problems
► Large Scale Unconstrained Optimization
► Least Squares Orthogonal Polynomials
► Least Squares Problems
► Linear Programming
► Nonlinear Least Squares: Newton-type Methods
► Nonlinear Least Squares Problems
► Nonlinear Least Squares: Trust Region Methods
► Orthogonal Triangularization
► Overdetermined Systems of Linear Equations
► QR Factorization
► Solving Large Scale and Sparse Semidefinite Programs
► Symmetric Systems of Linear Equations

References
1. Abaffy J, Broyden CG, Spedicato E (1984) A class of direct methods for linear systems. Numerische Math 45:361–376
2. Abaffy J, Spedicato E (1989) ABS projection algorithms: Mathematical techniques for linear and nonlinear equations. Horwood, Westergate
3. Bertocchi M, Spedicato E (1989) Performance of the implicit Gauss–Choleski algorithm of the ABS class on the IBM 3090 VF. In: Proc. 10th Symp. Algorithms, Strbske Pleso, pp 30–40
4. Bodon E (1993) Numerical experiments on the ABS algorithms for linear systems of equations. Report DMSIA Univ Bergamo 93(17)
5. Broyden CG (1985) On the numerical stability of Huang's and related methods. JOTA 47:401–412
6. Daniel J, Gragg WB, Kaufman L, Stewart GW (1976) Reorthogonalized and stable algorithms for updating the Gram–Schmidt QR factorization. Math Comput 30:772–795
7. Dennis J, Turner K (1987) Generalized conjugate directions. Linear Alg & Its Appl 88/89:187–209
8. Galantai A (1991) Analysis of error propagation in the ABS class. Ann Inst Statist Math 43:597–603
9. Goldfarb D, Idnani A (1983) A numerically stable dual method for solving strictly convex quadratic programming. Math Program 27:1–33
10. Mirnia K (1996) Numerical experiments with iterative refinement of solutions of linear equations by ABS methods. Report DMSIA Univ Bergamo 32/96
11. Nicolai S, Spedicato E (1997) A bibliography of the ABS methods. OMS 8:171–183
12. Spedicato E, Bodon E (1989) Solving linear least squares by orthogonal factorization and pseudoinverse computation via the modified Huang algorithm in the ABS class. Computing 42:195–205
13. Spedicato E, Bodon E (1992) Numerical behaviour of the implicit QR algorithm in the ABS class for linear least squares. Ricerca Oper 22:43–55
14. Spedicato E, Bodon E (1993) Solution of linear least squares via the ABS algorithm. Math Program 58:111–136
15. Spedicato E, Vespucci MT (1993) Variations on the Gram–Schmidt and the Huang algorithms for linear systems: A numerical study. Appl Math 2:81–100
16. Wedderburn JHM (1934) Lectures on matrices. Colloq Publ Amer Math Soc
17. Zhang L (1995) An algorithm for the least Euclidean norm solution of a linear system of inequalities via the Huang ABS algorithm and the Goldfarb–Idnani strategy. Report DMSIA Univ Bergamo 95/2

ABS Algorithms for Optimization

EMILIO SPEDICATO¹, ZUNQUAN XIA², LIWEI ZHANG²
¹ Department of Mathematics, University of Bergamo, Bergamo, Italy
² Department of Applied Mathematics, Dalian University of Technology, Dalian, China

MSC2000: 65K05, 65K10

Article Outline

Keywords
A Class of ABS Projection Methods for Unconstrained Optimization
Applications to Quasi-Newton Methods
ABS Methods for Kuhn–Tucker Equations
Reformulation of the Simplex Method via the Implicit LX Algorithm
ABS Unification of Feasible Direction Methods for Minimization with Linear Constraints
See also
References

Keywords

Linear equations; Optimization; ABS methods; Quasi-Newton methods; Linear programming; Feasible direction methods; KT equations; Interior point methods

The scaled ABS (Abaffy–Broyden–Spedicato) class of algorithms, see [1] and ► ABS Algorithms for Linear Equations and Linear Least Squares, is a very general process for solving linear equations, realizing the so-called Petrov–Galerkin approach. In addition to solving general determined or underdetermined linear systems Ax = b, x ∈ R^n, b ∈ R^m, m ≤ n, rank(A) ≤ m, A = [a_1, ..., a_m]^T, ABS methods can also solve linear least squares problems and nonlinear algebraic equations. In this article we will consider applications of ABS methods to optimization problems. We will consider only the so-called basic ABS class, defined by the following procedure for solving Ax = b:

A) Let x_1 ∈ R^n be arbitrary, let H_1 ∈ R^{n×n} be an arbitrary nonsingular matrix, and set i = 1.
B) Compute s_i = H_i a_i. If s_i ≠ 0, go to C). If s_i = 0 and τ = a_i^T x_i − b_i = 0, then set x_{i+1} = x_i, H_{i+1} = H_i and go to F); else stop, the system has no solution.
C) Compute the search vector p_i by p_i = H_i^T z_i, where z_i ∈ R^n is arbitrary save for the condition a_i^T H_i^T z_i ≠ 0.
D) Update the estimate of the solution by x_{i+1} = x_i − α_i p_i, where the stepsize α_i is given by α_i = (a_i^T x_i − b_i)/(a_i^T p_i).
E) Update the matrix H_i by H_{i+1} = H_i − H_i a_i w_i^T H_i / (w_i^T H_i a_i), where w_i ∈ R^n is arbitrary save for the condition w_i^T H_i a_i ≠ 0.
F) If i = m, then stop; x_{m+1} solves the system. Else increment i by one and go to B).

Among the properties of the ABS class the following is fundamental in the applications to optimization. Let m < n and, for simplicity, assume that rank(A) = m. Then the linear variety containing all solutions of the underdetermined system Ax = b is represented by the vectors x of the form

x = x_{m+1} + H_{m+1}^T q ,   (1)

where q ∈ R^n is arbitrary. In the following the matrices generated by the ABS process will be called Abaffians. It is recalled that the matrix H_{i+1} can be represented in terms of either 2i vectors or n − i vectors, which is also true for the representation of the search vector p_i. The first representation is computationally convenient for systems where the number of equations is small (less than n/2), while the second one is suitable for problems where m is close to n. In the applications to optimization, the first case corresponds to problems with few constraints (many degrees of freedom), the second case to problems with many constraints (few degrees of freedom).

Among the algorithms of the basic ABS class, the following are particularly important.
a) The implicit LU algorithm is given by the choices H_1 = I, z_i = w_i = e_i, where e_i is the ith unit vector in R^n. This algorithm is well defined if and only if A is regular (otherwise pivoting of the columns, or of the equations if m = n, has to be performed). Due to the special structure of the Abaffian induced by the parameter choices (the first i rows of H_{i+1} are identically zero, while the last n − i columns are unit vectors), the maximum storage is n²/4, hence 4 times less than for the classical LU factorization and twice less than for Gaussian elimination; the number of multiplications is nm² − 2m³/3, hence, for m = n, n³/3, i.e. the same as for Gaussian elimination or the LU factorization algorithm.
b) The Huang algorithm is obtained by the parameter choices H_1 = I, z_i = w_i = a_i. A mathematically equivalent, but numerically more stable, formulation of this algorithm is the so-called modified Huang algorithm, where the search vectors and the Abaffians are given by the formulas p_i = H_i (H_i a_i) and H_{i+1} = H_i − p_i p_i^T / (p_i^T p_i). The search vectors are orthogonal and are equal to the vectors obtained by applying the classical Gram–Schmidt orthogonalization procedure to the rows of A. If x_1 is the zero vector, then the vector x_{i+1} is the solution of least Euclidean norm of the first i equations, and the solution x⁺ of least Euclidean norm of the whole system is approached monotonically and from below by the sequence x_i.
c) The implicit LX algorithm, where 'L' refers to the lower triangular left factor and 'X' refers to the right factor, a matrix obtainable by row permutation of an upper triangular matrix, considered by Z. Xia, is defined by the choices H_1 = I, z_i = w_i = e_{k_i}, where k_i is an integer, 1 ≤ k_i ≤ n, such that

e_{k_i}^T H_i a_i ≠ 0 .   (2)

If A has full rank, from a property of the basic ABS class the vector H_i a_i is nonzero, hence there is at least one index k_i such that (2) is satisfied. The implicit LX algorithm has the same overhead as the implicit LU algorithm, hence the same as Gaussian elimination, and the same storage requirement, i.e. less than Gaussian elimination or the LU factorization algorithm. It has the additional advantage of not requiring any condition on the matrix A, hence pivoting is not necessary. The structure of the Abaffian matrix is somewhat more complicated than for the implicit LU algorithm, the zero rows of H_{i+1} being now in the positions k_1, ..., k_i and the columns that are unit vectors being in the positions that do not correspond to the already chosen indices k_i. The vector p_i has n − i zero components, and its k_i-th component is equal to one. It follows that if x_1 = 0, then x_{i+1} is a basic type solution of the first i equations, whose nonzero components correspond to the chosen indices k_i.

In this article we will present the following applications of ABS methods to optimization problems. First we describe a class of ABS related methods for the unconstrained optimization problem. We then show how ABS methods provide the general solution of the quasi-Newton equation, also with sparsity and symmetry, and we discuss how SPD solutions can be obtained. Next we present several special ABS methods for solving the Kuhn–Tucker equations and consider the application of the implicit LX algorithm to the linear programming (LP) problem. Finally we present ABS approaches to the general linearly constrained optimization problem, which unify linear and nonlinear problems.
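Before turning to these applications, the following is a minimal numpy sketch of steps A)-F) above; the function and parameter names are ours, and the parameter choices z_i, w_i are supplied as callbacks (unit vectors reproduce the implicit LU algorithm on a regular matrix; no pivoting is attempted):

```python
import numpy as np

def basic_abs(A, b, choose_z, choose_w, H1=None):
    """Sketch of the basic ABS procedure; choose_z/choose_w supply the
    arbitrary parameter vectors z_i and w_i (e.g. unit vectors for the
    implicit LU algorithm, rows of A for the Huang algorithm)."""
    m, n = A.shape
    x = np.zeros(n)
    H = np.eye(n) if H1 is None else H1.copy()
    for i in range(m):
        a = A[i]
        s = H @ a                                 # step B)
        if np.allclose(s, 0.0):
            if np.isclose(a @ x, b[i]):
                continue                          # equation i is redundant
            raise ValueError("incompatible system")
        z, w = choose_z(i, H, a), choose_w(i, H, a)
        p = H.T @ z                               # step C): search vector
        x = x - (a @ x - b[i]) / (a @ p) * p      # step D)
        H = H - np.outer(H @ a, H.T @ w) / (w @ H @ a)   # step E)
    return x

# implicit LU choices z_i = w_i = e_i (A must be regular; no pivoting here)
e = lambda i, H, a: np.eye(len(a))[i]
A = np.array([[2., 1.], [1., 3.]]); b = np.array([3., 5.])
print(basic_abs(A, b, e, e))   # solution of Ax = b
```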


A Class of ABS Projection Methods for Unconstrained Optimization

ABS methods can be applied directly to solve unconstrained optimization problems via the iteration x_{i+1} = x_i − α_i H_i^T z_i, where H_i is reset after n or fewer steps and z_i is chosen so that the descent condition holds, i.e. g_i^T H_i^T z_i > 0, with g_i the gradient of the function at x_i. If the function to be minimized is quadratic, one can identify the matrix A in the Abaffian update formula with the Hessian of the quadratic function. Defining a perturbed point x' by x' = x_i − β v_i, one has on quadratic functions g' = g_i − β A v_i, hence the update of the Abaffian takes the form H_{i+1} = H_i − H_i y_i w_i^T H_i / (w_i^T H_i y_i), where y_i = g' − g_i. The above defined class has termination on quadratic functions and a local superlinear (n-step Q-quadratic) rate of convergence on general functions. It is a special case of a class of projection methods developed in [7]. Almost no numerical results are available on the performance of the methods in this class.
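A rough sketch of one step of this iteration, under stated assumptions (we take z_i = H_i g_i so that the descent condition holds, use a unit placeholder stepsize instead of a line search, and assume the update denominator is nonzero; all names are ours):

```python
import numpy as np

def abs_unconstrained_step(x, H, grad_f, beta=1e-4):
    """One step of the ABS projection iteration for minimization: with
    z = H grad the descent condition holds since grad^T H^T z = ||H grad||^2,
    and the Abaffian is updated from a finite-difference gradient change."""
    g = grad_f(x)
    z = H @ g
    d = H.T @ z                            # descent direction H^T z
    x_new = x - 1.0 * d                    # placeholder stepsize (line search in practice)
    v = d / np.linalg.norm(d)
    y = grad_f(x - beta * v) - g           # y_i = g' - g_i, proportional to A v on quadratics
    w = y                                  # one admissible choice, assuming w^T H y != 0
    H_new = H - np.outer(H @ y, H.T @ w) / (w @ H @ y)
    return x_new, H_new
```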

Applications to Quasi-Newton Methods

ABS methods have been used to provide the general solution of the quasi-Newton equation, also with the additional conditions of symmetry, sparsity and positive definiteness. While the general solution of only the quasi-Newton equation was already known from [2], the explicit formulas obtained for the sparse symmetric case are new, and so is the way of constructing sparse SPD updates. Let us consider the quasi-Newton equation defining the new approximation to a Jacobian or a Hessian, in the transpose form

d^T B' = y^T ,   (3)

where d = x' − x, y = g' − g. We observe that (3) can be seen as a set of n linear underdetermined systems, each one having just one equation and differing only in the right-hand side. Hence the general solution can be obtained by one step of the ABS method. It can be written in the following way:

B' = B − s (B^T d − y)^T / (d^T s) + ( I − s d^T / (d^T s) ) Q ,   (4)
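As a small numerical illustration of formula (4) (the helper name and default parameter choices are ours), one can verify that any admissible choice of the free parameters s and Q yields a matrix satisfying the quasi-Newton equation:

```python
import numpy as np

def qn_general_solution(B, d, y, s=None, Q=None):
    """General solution (4) of the quasi-Newton equation d^T B' = y^T.
    s (with s^T d != 0) and Q are the free ABS parameters; the defaults
    s = d, Q = 0 give one rank-one member of the family."""
    n = B.shape[0]
    s = d if s is None else s
    Q = np.zeros((n, n)) if Q is None else Q
    sd = s @ d
    correction = np.outer(s, B.T @ d - y) / sd
    projector = np.eye(n) - np.outer(s, d) / sd
    return B - correction + projector @ Q

B = np.eye(3); d = np.array([1., 2., 0.]); y = np.array([3., 1., 2.])
Bp = qn_general_solution(B, d, y)
assert np.allclose(d @ Bp, y)   # the quasi-Newton equation holds
```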


where Q ∈ R^{n×n} is arbitrary and s ∈ R^n is arbitrary subject to s^T d ≠ 0. Formula (4), derived in [9], is equivalent to the formula in [2]. Now the conditions that some elements of B' should be zero, or have constant value, or that B' should be symmetric can be written as the additional linear constraints, where b'_i is the ith column of B',

(b'_i)^T e_k = η_{ij} ,   (5)

where η_{ij} = 0 implies sparsity, η_{ij} = const implies that some elements do not change their value, and η_{ij} = η_{ji} implies symmetry. The ABS algorithm can deal with these extra conditions, see [11], giving the solution in explicit form, columnwise in the presence of symmetry. By adding the additional condition that the diagonal elements be sufficiently large, it is possible to obtain formulas where B' is quasi positive definite or quasi diagonally dominant, in the sense that the principal submatrix of order n − 1 is positive definite or diagonally dominant. It is not possible in general to force B' to be SPD, since SPD solutions may not exist; this is reflected in the fact that no additional conditions can be put on the last diagonal element, since the last column is fully determined by the n − 1 symmetry conditions and the quasi-Newton equation. This result can however be exploited to provide SPD approximations by imbedding the original minimization problem of n variables in a problem of n + 1 variables, whose solution with respect to the first n variables is the original solution (just set, for instance, f'(x, x_{n+1}) = f(x) + x_{n+1}²). This imbedding modifies the quasi-Newton equation so that SPD solutions exist.

ABS Methods for Kuhn–Tucker Equations

The Kuhn–Tucker equations (KT equations), which should more appropriately be named Kantorovich–Karush–Kuhn–Tucker equations (KKKT equations), are a special linear system, obtained by writing the optimality conditions of the problem of minimizing a quadratic function with Hessian G subject to the linear equality constraint Cx = b. They are the system Ax = b, where A is a symmetric indefinite matrix of the following form, with G ∈ R^{n×n}, C ∈ R^{m×n}:

A = ( G   C^T
      C   0  ) .   (6)


If G is nonsingular, then A is nonsingular if and only if CG^{−1}C^T is nonsingular. Usually G is nonsingular, symmetric and positive definite, but this assumption, required by several classical solvers, is not necessary for the ABS solvers. ABS classes for solving the KT problem can be derived in several ways. Observe that system (6) is equivalent to the two subsystems

G p + C^T z = g ,   (7)
C p = c ,   (8)

where x = (p^T, z^T)^T and b = (g^T, c^T)^T. The general solution of subsystem (8) has the form, see (1),

p = p_{m+1} + H_{m+1}^T q ,   (9)

with q arbitrary. The parameter choices made to construct p_{m+1} and H_{m+1} are arbitrary and define therefore a class of algorithms. Since the KT equations have a unique solution, there must be a choice of q in (9) which makes p be the unique n-dimensional subvector defined by the first n components of the solution x. Notice that since H_{m+1} is singular, q is not uniquely defined (but it would be uniquely defined if one took the representation of the Abaffian in terms of n − m vectors). By multiplying equation (7) on the left by H_{m+1} and using the ABS property H_{m+1} C^T = 0, we obtain the equation

H_{m+1} G p = H_{m+1} g ,   (10)

which does not contain z. Now there are two possibilities to determine p:
A1) Consider the system formed by equations (8) and (10). Such a system is solvable but overdetermined. Since rank(H_{m+1}) = n − m, m equations are recognized as dependent and are eliminated in step B) of any ABS algorithm applied to this system.
A2) In equation (10) substitute p with the expression of the general solution (9), obtaining

H_{m+1} G H_{m+1}^T q = H_{m+1} g − H_{m+1} G p_{m+1} .   (11)

The above system can be solved by any ABS method for a particular solution q, m equations being again removed at step B) of the ABS algorithm as linearly dependent.

Once p is determined, there are two approaches to determine z, namely:
B1) Solve by any ABS method the overdetermined compatible system

C^T z = g − G p   (12)

by removing at step B) of the ABS algorithm the n − m dependent equations.
B2) Let P = (p_1, ..., p_m) be the matrix whose columns are the search vectors generated on the system Cp = c. Then CP = L, with L nonsingular lower triangular. Multiplying equation (12) on the left by P^T, we obtain a triangular system, defining z uniquely:

L^T z = P^T g − P^T G p .   (13)

Extensive numerical testing has evaluated the accuracy of the above ABS algorithms for KT equations for certain choices of the ABS parameters (corresponding to the implicit LU algorithm with row pivoting and the modified Huang algorithm). The methods have been tested against classical methods, in particular the method of Aasen and methods using the QR factorization. The experiments have shown that some ABS methods are the most accurate, in both residual and solution error; moreover, some ABS algorithms are cheaper in storage and in overhead, up to one order of magnitude, especially when m is close to n. In many interior point methods the main computational cost is to compute the solution of a sequence of KT problems where only G, which is diagonal, changes. In such a case the ABS methods, which initially work on the matrix C, which is unchanged, are at an advantage, particularly when m is large, where the dominant cubic term decreases with m and disappears for m = n, so that the overhead is dominated by second order terms. Again, numerical experiments show that some ABS methods are more accurate than the classical ones. For details see [8].
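The following numpy sketch mirrors the A2)/B2) construction with standard library calls replacing the ABS subsystem solvers: an explicit null-space basis of C plays the role of H_{m+1}^T, and full row rank of C and a nonsingular reduced Hessian are assumed. It illustrates the algebra, not an actual ABS implementation:

```python
import numpy as np

def kt_solve(G, C, g, c):
    """Solve G p + C^T z = g, C p = c by the null-space analogue of (9)-(12):
    p = p_part + N q with N spanning null(C), q from the analogue of (11),
    and z recovered from the compatible system (12)."""
    p_part = np.linalg.lstsq(C, c, rcond=None)[0]    # particular solution of Cp = c
    _, _, Vt = np.linalg.svd(C)
    N = Vt[C.shape[0]:].T                            # null-space basis (C full row rank assumed)
    q = np.linalg.solve(N.T @ G @ N, N.T @ (g - G @ p_part))
    p = p_part + N @ q
    z = np.linalg.lstsq(C.T, g - G @ p, rcond=None)[0]
    return p, z

G0 = np.diag([2., 3., 4.]); C0 = np.array([[1., 1., 0.]])
g0 = np.ones(3); c0 = np.array([1.])
p, z = kt_solve(G0, C0, g0, c0)
assert np.allclose(C0 @ p, c0) and np.allclose(G0 @ p + C0.T @ z, g0)
```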

Reformulation of the Simplex Method via the Implicit LX Algorithm

The implicit LX algorithm has a natural application to a reformulation of the simplex method for the LP problem in standard form, i.e. the problem

min_x c^T x  subject to  A x = b ,  x ≥ 0 .

The applicability of the implicit LX method is a consequence of the fact that the iterate x_{i+1} generated by the method, started from the zero vector, is a basic type vector, with a unit component in the position k_i and possibly nonzero components only at indices j ∈ B_i, where B_i is the set of indices of the unit vectors chosen as the z_i, w_i parameters, i.e. the set B_i = (k_1, ..., k_i), while the components of x_{i+1} with indices in the set N_i = N \ B_i are identically zero, where N = (1, ..., n). Therefore, if the nonzero components are nonnegative, the point defines a vertex of the polytope containing the feasible points defined by the constraints of the LP problem.

In the simplex method one moves from one vertex to another, according to some rules and usually reducing at each step the value of the function c^T x. The direction along which one moves from a vertex to another is an edge direction of the polytope and is determined by solving a linear system whose coefficient matrix A_B, the basic matrix, is defined by m linearly independent columns of the matrix A, called the basic columns. Usually such a system is solved by the LU factorization method or occasionally by the QR method, see [5]. The new vertex is associated with a new basic matrix A_B', which is obtained by substituting one of the columns of A_B by a column of the matrix A_N, which comprises the columns of A that do not belong to A_B. The most efficient algorithm for solving the modified system, after the column interchange, is the Forrest–Goldfarb method [6], requiring m² multiplications. Notice that the classical simplex method requires m² storage for the matrix A_B plus mn storage for the matrix A, which must be kept in general to provide the columns for the exchange.

The application of the implicit LX method to the simplex method, developed in [4,10,13,17], exploits the fact that in the implicit LX algorithm the interchange of a jth column in A_B with a kth column in A_N corresponds to the interchange of a previously chosen parameter vector z_j = w_j = e_j with a new parameter z_k = w_k = e_k.

This operation is a special case of the perturbation of the Abaffian after a change in the parameters and can be done using a general formula of [15], without explicit use of the kth column of A_N. Moreover, since all quantities needed for the construction of the search direction (the edge direction) and for the interchange criteria can likewise be computed without explicit use of the columns of A, the ABS approach needs only the storage of the matrix H_{m+1}, which, in the case of the implicit LX algorithm, has a cost of at most n²/4. Therefore, for values of m close to n, the storage required by the ABS formulation is about 8 times less than for the classical simplex method.

Here we give the basic formulas of the simplex method in the classical and in the ABS formulation. The column of A_N substituting an old column of A_B is often taken as the column with minimal relative cost. In terms of the ABS formulation this is equivalent to minimizing with respect to i ∈ N_m the scalar γ_i = c^T H^T e_i. Let N* be the index chosen in this way. The column of A_B to be exchanged is usually chosen with the criterion of the maximum displacement along an edge which keeps the basic variables nonnegative. Define ω_i = x^T e_i / (e_i^T H^T e_{N*}), where x is the current basic feasible solution. Then the above criterion is equivalent to minimizing ω_i with respect to the set of indices i ∈ B_m such that

e_i^T H^T e_{N*} > 0 .   (14)

Notice that H^T e_{N*} ≠ 0 and that an index i such that (14) is satisfied always exists, unless x is a solution of the LP problem. The update of the Abaffian after the interchange of the unit vectors, which corresponds to the update of the LU factors after the interchange of the basic with the nonbasic column, is given by the following formula:

H' = H − (H e_{B*} − e_{B*}) (e_{N*}^T H) / (e_{N*}^T H e_{B*}) .   (15)

The search direction d, which in the classical formulation is obtained by solving the system A_B d = −A e_{N*}, is given by d = H_{m+1}^T e_{N*}, hence at no cost. Finally, the relative cost vector r, classically given by r = c − A^T A_B^{−T} c_B, where c_B consists of the components of c with indices corresponding to those of the basic columns, is simply given by r = H_{m+1} c.
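Formula (15) is cheap to realize; a minimal sketch follows (0-based indices, names ours, and the pivot entry e_{N*}^T H e_{B*} assumed nonzero, as guaranteed by the exchange criteria):

```python
import numpy as np

def lx_exchange_update(H, k_in, k_out):
    """Update (15) of the Abaffian H = H_{m+1} when the unit parameter
    vector e_{k_out} (leaving) is exchanged with e_{k_in} (entering)."""
    u = H[:, k_out] - np.eye(H.shape[0])[:, k_out]   # H e_B* - e_B*
    row = H[k_in, :]                                  # e_N*^T H
    return H - np.outer(u, row) / H[k_in, k_out]      # pivot e_N*^T H e_B*
```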


Let us now consider the computational cost of update (15). Since H e_{B*} has at most n − m nonzero components, while H^T e_{N*} has at most m, no more than m(n − m) multiplications are required. The update is most expensive for m = n/2 and gets cheaper the smaller m is or the closer it is to n. In the dual steepest edge Forrest–Goldfarb method [6] the overhead for replacing a column is m², hence formula (15) is faster for m > n/2 and is recommended on overhead considerations for m sufficiently large. However, we notice that ABS updates having an O(m²) cost can also be obtained by using the representation of the Abaffian in terms of 2m vectors. No computational experience has been obtained so far with the new ABS formulation of the simplex method. Finally, a generalization of the simplex method, based upon the use of the Huang algorithm started with a suitable singular matrix, has been developed in [16]. In this formulation the solution is approached by points lying on a face of the polytope; whenever the point hits a vertex, the remaining iterates move among vertices and the method reduces to the simplex method.

ABS Unification of Feasible Direction Methods for Minimization with Linear Constraints

ABS algorithms can be used to provide a unification of feasible point methods for nonlinear minimization with linear constraints, including as a special case the LP problem. Let us first consider the problem with only linear equality constraints:

min_{x∈R^n} f(x)  subject to  A x = b ,  A ∈ R^{m×n} ,  m ≤ n ,  rank(A) = m .

Let x_1 be a feasible starting point; then for an iteration procedure of the form x_{i+1} = x_i − α_i d_i, the search direction will generate feasible points if and only if

A d_i = 0 .   (16)

Solving the underdetermined system (16) for d_i by the ABS algorithm, the solution can be written in the following form, taking, without loss of generality, the zero vector as a special solution:

d_i = H_{m+1}^T q ,   (17)

where the matrix H_{m+1} depends on the arbitrary choice of the parameters H_1, w_i and v_i used in solving (16), and q ∈ R^n is arbitrary. Hence the general feasible direction iteration has the form

x_{i+1} = x_i − α_i H_{m+1}^T q .   (18)
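For illustration, one step of iteration (18) with W = I and the Huang-type (orthogonal projection) choice of H_{m+1}, i.e. Rosen's gradient projection; the projector is formed with a pseudoinverse for simplicity, and all names are ours:

```python
import numpy as np

def feasible_descent_step(x, grad, A, alpha):
    """One iteration of (18): H is the orthogonal projector onto null(A),
    q = H grad, so d = H^T q = H grad and A d = 0 keeps feasibility."""
    H = np.eye(A.shape[1]) - np.linalg.pinv(A) @ A   # projector, A H = 0
    d = H @ grad                                     # descent: grad^T d = ||H grad||^2
    return x - alpha * d
```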

The search direction is a descent direction if and only if d^T ∇f(x) = q^T H_{m+1} ∇f(x) > 0. Such a condition can always be satisfied by a choice of q unless H_{m+1} ∇f(x) = 0, which implies, from the null space structure of H_{m+1}, that ∇f(x) = A^T λ for some λ, hence that x_{i+1} is a KT point and λ is the vector of the Lagrange multipliers. When x_{i+1} is not a KT point, it is immediate to see that the search direction is a descent direction if we select q as q = W H_{m+1} ∇f(x), where W is a symmetric and positive definite matrix. Particular well-known algorithms from the literature are obtained by the following choices of q, with W = I:
• The Wolfe reduced gradient method. Here H_{m+1} is constructed by the implicit LU (or the implicit LX) algorithm.
• The Rosen gradient projection method. Here H_{m+1} is built using the Huang algorithm.
• The Goldfarb–Idnani method. Here H_{m+1} is built via the modification of the Huang algorithm in which H_1 is a symmetric positive definite matrix approximating the inverse Hessian of f(x).
If there are inequalities, two approaches are possible:
A) The active set approach. In this approach the set of linear equality constraints is modified at every iteration by adding and/or dropping some of the linear inequality constraints. Adding or deleting a single constraint can be done, for every ABS algorithm, in O(n²) operations, see [15]. In the ABS reformulation of the Goldfarb–Idnani method, the initial matrix is related to a quasi-Newton approximation of the Hessian, and an efficient update of the Abaffian after a change in the initial matrix is discussed in [14].


B) The standard form approach. In this approach, by introducing slack variables, the problem with both types of linear constraints is written in the equivalent form

min_x f(x)  subject to  A x = b ,  x ≥ 0 ,

and the stepsize α_i > 0 is selected to avoid that the new point has some negative components. If f(x) is nonlinear, then H_{m+1} can be determined once and for all at the first step, since ∇f(x) generally changes from iteration to iteration, thereby modifying the search direction. If, however, f(x) = c^T x is linear (we then have the LP problem), to modify the search direction we need to change H_{m+1}. As observed before, the simplex method is obtained by constructing H_{m+1} with the implicit LX algorithm, every step of the method corresponding to a change of the parameters e_{k_i}. It can be shown, see [13], that the method of Karmarkar (equivalent to an earlier method of Evtushenko [3]) corresponds to using the generalized Huang algorithm, with initial matrix H_1 = Diag(x_i) changing from iteration to iteration. Another method, faster than Karmarkar's and having superlinear versus linear rate of convergence and O(√n) versus O(n) complexity, again first proposed by Y. Evtushenko, is obtained by the generalized Huang algorithm with initial matrix H_1 = Diag(x_i²).

See also

► ABS Algorithms for Linear Equations and Linear Least Squares
► Gauss–Newton Method: Least Squares, Relation to Newton's Method
► Generalized Total Least Squares
► Least Squares Orthogonal Polynomials

References
1. Abaffy J, Spedicato E (1989) ABS projection algorithms: Mathematical techniques for linear and nonlinear equations. Horwood, Westergate
2. Adachi N (1971) On variable metric algorithms. JOTA 7:391–409
3. Evtushenko Y (1974) Two numerical methods of solving nonlinear programming problems. Soviet Dokl Akad Nauk 251:420–423
4. Feng E, Wang XM, Wang XL (1997) On the application of the ABS algorithm to linear programming and linear complementarity. Optim Methods Softw 8:133–142
5. Fletcher R (1997) Dense factors of sparse matrices. In: Buhmann MD, Iserles A (eds) Approximation Theory and Optimization. Cambridge Univ. Press, Cambridge, pp 145–166
6. Forrest JH, Goldfarb D (1992) Steepest edge simplex algorithms for linear programming. Math Program 57:341–374
7. Psenichny BN, Danilin YM (1978) Numerical methods in extremal problems. MIR, Moscow
8. Spedicato E, Chen Z, Bodon E (1996) ABS methods for KT equations. In: Di Pillo G, Giannessi F (eds) Nonlinear Optimization and Applications. Plenum, New York, pp 345–359
9. Spedicato E, Xia Z (1992) Finding general solutions of the quasi-Newton equation in the ABS approach. Optim Methods Softw 1:273–281
10. Spedicato E, Xia Z, Zhang L (1995) Reformulation of the simplex algorithm via the ABS algorithm. Preprint Univ Bergamo
11. Spedicato E, Zhao J (1992) Explicit general solution of the quasi-Newton equation with sparsity and symmetry. Optim Methods Softw 2:311–319
12. Xia Z (1995) ABS generalization and formulation of the interior point method. Preprint Univ Bergamo
13. Xia Z (1995) ABS reformulation of some versions of the simplex method for linear programming. Report DMSIA Univ Bergamo 10/95
14. Xia Z, Liu Y, Zhang L (1992) Application of a representation of ABS updating matrices to linearly constrained optimization. Northeast Oper Res 7:1–9
15. Zhang L (1995) Updating of Abaffian matrices under perturbation in W and A. Report DMSIA Univ Bergamo 95/16
16. Zhang L (1997) On the ABS algorithm with singular initial matrix and its application to linear programming. Optim Methods Softw 8:143–156
17. Zhang L, Xia ZH (1995) Application of the implicit LX algorithm to the simplex method. Report DMSIA Univ Bergamo 9/95


Adaptive Convexification in Semi-Infinite Optimization

OLIVER STEIN
School of Economics and Business Engineering, University of Karlsruhe, Karlsruhe, Germany

MSC2000: 90C34, 90C33, 90C26, 65K05

Article Outline

Synonyms
Introduction
Feasibility in Semi-Infinite Optimization
Convex Lower Level Problems
The αBB Method
  Formulation
  αBB for the Lower Level
  The MPCC Reformulation
Method
  Refinement Step
  The Algorithm
  A Consistent Initial Approximation
  A Certificate for Global Optimality
Conclusions
See also
References

Synonyms

ACA

Introduction

The adaptive convexification algorithm is a method to solve semi-infinite optimization problems via a sequence of feasible iterates. Its main idea [6] is to adaptively construct convex relaxations of the lower level problem, replace the relaxed lower level problems equivalently by their Karush–Kuhn–Tucker conditions, and solve the resulting mathematical programs with complementarity constraints. The convex relaxations are constructed with ideas from the αBB method of global optimization.

Feasibility in Semi-Infinite Optimization

In a (standard) semi-infinite optimization problem a finite-dimensional decision variable is subject to infinitely many inequality constraints. For adaptive convexification one assumes the form

SIP : min_{x∈X} f(x)  subject to  g(x,y) ≤ 0 for all y ∈ [0,1] ,

with objective function f ∈ C²(R^n, R), constraint function g ∈ C²(R^n × R, R), a box constraint set X = [x^ℓ, x^u] ⊂ R^n with x^ℓ < x^u ∈ R^n, and the set of infinitely many indices Y = [0,1]. Adaptive convexification easily generalizes to problems with additional inequality and equality constraints, a finite number of semi-infinite constraints, as well as higher-dimensional box index sets [6]. Reviews on semi-infinite programming are given in [8,13], and [9,14,15] overview the existing numerical methods.

Classical numerical methods for SIP suffer from the drawback that their approximations of the feasible set X ∩ M with

M = { x ∈ R^n | g(x,y) ≤ 0 for all y ∈ [0,1] }

may contain infeasible points. In fact, discretization and exchange methods approximate M by finitely many inequalities corresponding to finitely many indices in Y = [0,1], yielding an outer approximation of M, and reduction based methods solve the Karush–Kuhn–Tucker system of SIP by a Newton-SQP approach. As a consequence, the iterates of these methods are not necessarily feasible for SIP, but only their limit might be. On the other hand, a first method producing feasible iterates for SIP was presented in the articles [3,4], where a branch-and-bound framework for the global solution of SIP generates convergent sequences of lower and upper bounds for the globally optimal value.

In fact, checking feasibility of a given point x̄ ∈ R^n is the crucial problem in semi-infinite optimization. Clearly we have x̄ ∈ M if and only if φ(x̄) ≤ 0 holds with the function

φ : R^n → R ,  x ↦ max_{y∈[0,1]} g(x,y) .

The latter function is the optimal value function of the so-called lower level problem of SIP,

Q(x) : max_y g(x,y)  subject to  0 ≤ y ≤ 1 .

The difficulty lies in the fact that φ(x̄) is the globally optimal value of Q(x̄), which might be hard to determine numerically. In fact, standard NLP solvers can


only be expected to produce a local maximizer y_loc of Q(x̄) which is not necessarily a global maximizer y_glob. Even if g(x̄, y_loc) ≤ 0 is satisfied, x̄ might be infeasible, since g(x̄, y_loc) ≤ 0 < φ(x̄) = g(x̄, y_glob) may hold.

Convex Lower Level Problems

Assume for a moment that Q(x) is a convex optimization problem for all x ∈ X, that is, g(x,·) is concave on Y = [0,1] for these x. An approach developed for so-called generalized semi-infinite programs in [18,19] then takes advantage of the fact that the solution set of a differentiable convex lower level problem satisfying a constraint qualification is characterized by its first order optimality condition. In fact, SIP and the Stackelberg game

SG : min_{x,y} f(x)  subject to  g(x,y) ≤ 0  and y solves Q(x)

are equivalent problems, and the restriction 'y solves Q(x)' in SG can be equivalently replaced by its Karush–Kuhn–Tucker condition. For this reformulation we use that the Lagrange function of Q(x),

L(x, y, γ_ℓ, γ_u) = g(x,y) + γ_ℓ y + γ_u (1 − y) ,

satisfies ∇_y L(x, y, γ_ℓ, γ_u) = ∇_y g(x,y) + γ_ℓ − γ_u, and obtain that the Stackelberg game is equivalent to the following mathematical program with complementarity constraints:

MPCC : min_{x,y,γ_ℓ,γ_u} f(x)  subject to
   g(x,y) ≤ 0
   ∇_y g(x,y) + γ_ℓ − γ_u = 0
   γ_ℓ y = 0 ,  γ_u (1 − y) = 0
   γ_ℓ , γ_u ≥ 0 ,  y , 1 − y ≥ 0 .

Overviews of solution methods for MPCC are given in [10,11,17]. One approach to solve MPCC is the reformulation of the complementarity constraints by a so-called NCP function, that is, a function φ : R² → R with φ(a,b) = 0 if and only if

a ≥ 0 ,  b ≥ 0 ,  ab = 0 .
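A standard example of such a function is the Fischer–Burmeister function; a small sketch:

```python
import numpy as np

def fischer_burmeister(a, b):
    """Fischer-Burmeister NCP function: phi(a,b) = sqrt(a^2 + b^2) - a - b.
    phi(a,b) = 0 holds exactly when a >= 0, b >= 0 and a*b = 0."""
    return np.sqrt(a**2 + b**2) - a - b

print(fischer_burmeister(0.0, 3.0))   # 0.0: complementary pair
print(fischer_burmeister(2.0, 0.0))   # 0.0: complementary pair
print(fischer_burmeister(1.0, 1.0))   # nonzero: complementarity violated
```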


For numerical purposes one can regularize these nondifferentiable NCP functions. Although MPCC does not necessarily have to be solved via the NCP function formulation, in the following we will use NCP functions to keep the notation concise. In fact, MPCC can be equivalently rewritten as the nonsmooth problem

P : min_{x,y,γ_ℓ,γ_u} f(x)  subject to
   g(x,y) ≤ 0
   ∇_y g(x,y) + γ_ℓ − γ_u = 0
   φ(γ_ℓ, y) = 0 ,  φ(γ_u, 1 − y) = 0 .

The αBB Method

In αBB, a convex underestimator of a nonconvex function is constructed by decomposing it into a sum of nonconvex terms of special type (e.g., linear, bilinear, trilinear, fractional, fractional trilinear, convex, univariate concave) and nonconvex terms of arbitrary type. The first type is then replaced by its convex envelope or by very tight convex underestimators which are already known. A complete list of the tight convex underestimators of the above special type nonconvex terms is provided in [5]. For ease of presentation, here we will treat all terms as arbitrarily nonconvex. For these terms, αBB constructs convex underestimators by adding a quadratic relaxation function ψ. With the obvious modification we use this approach to construct a concave overestimator for a nonconcave function g : [y^ℓ, y^u] → R being C² on an open neighborhood of [y^ℓ, y^u]. With

ψ(y; α, y^ℓ, y^u) = (α/2) (y − y^ℓ)(y^u − y)   (1)

we put

g̃(y; α, y^ℓ, y^u) = g(y) + ψ(y; α, y^ℓ, y^u) .

In the sequel we will suppress the dependence of g̃ on y^ℓ, y^u. For α ≥ 0 the function g̃ clearly is an overestimator of g on [y^ℓ, y^u], and it coincides with g at the endpoints y^ℓ, y^u of the domain. Moreover, g̃ is twice continuously differentiable with second derivative

∇²_y g̃(y; α) = ∇² g(y) − α


on [y^ℓ, y^u]. Consequently g̃ is concave on [y^ℓ, y^u] for

α ≥ max_{y∈[y^ℓ,y^u]} ∇² g(y)   (2)

(cf. also [1,2]). The computation of α thus involves a global optimization problem itself. Note, however, that one may use any upper bound for the right-hand side in (2). Such upper bounds can be provided by interval methods (see, e.g., [5,7,12]). An α satisfying (2) is called a convexification parameter. Combining these facts shows that for

α ≥ max( 0 , max_{y∈[y^ℓ,y^u]} ∇² g(y) )

the function g̃(y; α) is a concave overestimator of g on [y^ℓ, y^u].

k 2 K;

for all

Yk :

k2K

A trivial but very useful observation is that the single semi-infinite constraint g(x; y)  0

for all

y2Y

is equivalent to the finitely many semi-infinite constraints g(x; y)  0

for all

y 2 Y k; k 2 K ;

entail x¯ 2 M. ˛BB for the Lower Level For the construction of these overestimators one uses ideas of the ˛BB method. In fact, for each k 2 K we put

g k : XY k ! R; (x; y) 7! g(x; y)C (y; ˛ k ;  k1 ;  k ) (3) with the quadratic relaxation function

from (1) and !

max

(x;y)2XY k

r y2 g(x; y)

:

(4)

Note that the latter condition on ˛ k is uniform in x. We emphasize that with the single bound   (5) ˛¯ > max 0; max r y2 g(x; y) (x;y)2XY

the choices ˛ k :D ˛¯ satisfy (4) for all k 2 K. Moreover, ¯ k 2 K. the ˛ k can always be chosen such that ˛ k  ˛, k The following properties of g are easily verified.

we have [

g k (x¯ ; y)  0

˛ k > max 0;

Formulation

Y D

Lemma 1 For each k 2 K let g k : X  Y k ! R, and let x¯ 2 X be given such that for all k 2 K and all y 2 Y k we have g(x¯ ; y)  g k (x¯; y). Then the constraints

y 2 Y k; k 2 K :

Given a subdivision, one can construct concave overestimators for each of these finitely many semiinfinite constraints, solve the corresponding optimization problem, and adaptively refine the subdivision. The following lemma formulates the obvious fact that replacing g by overestimators on each subdivision node Y k results in an approximation of M by feasible points.

Lemma 2 ([6]) For each k 2 K let g k be given by (3). Then the following holds: (i) For all (x; y) 2 X  Y k we have g(x; y)  g k (x; y). (ii) For all x 2 X, the function g k (x; ) is concave on Y k . Now consider the following approximation of the feasible set M, where E D f k j k 2 Kg denotes the set of subdivision points, and ˛ the vector of convexification parameters: M˛BB (E; ˛) D f x 2 Rn j g k (x; y)  0 ; for all y 2 Y k ; k 2 K g : By Lemma 1 and Lemma 2(i) we have M˛BB (E; ˛)  M. This means that any solution concept for SIP˛BB (E; ˛) : min f (x) subject to x2X

x 2 M˛BB (E; ˛) ;

A

Adaptive Convexification in Semi-Infinite Optimization

be it global solutions, local solutions or stationary points, will at least lead to feasible points of SIP (provided that SIP˛BB (E; ˛) is consistent). The problem SIP˛BB (E; ˛) has finitely many lower level problems Qk (x), k 2 K; with Q k (x) : max g k (x; y) subject to y2R

 k1  y   k :

Since the inequality (4) is strict, the convex problem Qk (x) has a unique solution yk (x) for each k 2 K and x 2 X. Recall that y 2 Y k is called active for the constraint max y2Y k g k (x; y)  0 at x¯ if g k (x¯ ; y) D 0 holds. By the uniqueness of the global solution of Q k (x¯ ) there exists at most one active index for each k 2 K, namely y k (x¯ ). Thus, one can consider the finite active index sets K0 (x¯ ) D f k 2 Kj g k (x¯ ; y k (x¯ )) D 0 g; Y0˛BB (x¯ ) D f y k (x¯ )j k 2 K0 (x¯ ) g :

Following the ideas to treat convex lower level problems, yk solves Qk (x) if and only if (x; y k ; `k ; uk ) solves the system r y g k (x; y) C `  u D 0 (` ; y   k1 ) D 0 (u ;  k  y) D 0 with some `k , uk , and  denoting some NCP function. With w :D (x; y

; `k ; uk ; k

2 K)

The main idea of the adaptive convexification algorithm is to compute a stationary point x¯ of SIP˛BB (E; ˛) by the approach from the previous section, and terminate if x¯ is also stationary for SIP within given tolerances. If x¯ is not stationary it refines the subdivision E in the spirit of exchange methods [8,15] by adding the active indices Y0˛BB (x¯ ) to E, and constructs ˜ by the fola refined problem SIP˛BB (E [ Y0˛BB (x¯ ); ˛) lowing procedure. Note that, in view of Carathéodory’s theorem, the number of elements of Y0˛BB (x¯) may be bounded by n C 1.

For any ˜ 2 Y0˛BB (x¯ ), let k 2 K be the index with ˜ 2 [ k1 ;  k ]. Put Y k;1 D [ k1 ; ˜], Y k;2 D [˜;  k ], let ˛ k;1 and ˛ k;2 be the corresponding convexification parameters, put ˛ k;1 (y   k1 )(˜  y); g k;1 (x; y) D g(x; y) C 2 ˛ k;2 (y  ˜)( k  y); g k;2 (x; y) D g(x; y) C 2 ˜ by replacing the conand define M˛BB (E [ f˜g; ˛) straint g k (x; y)  0;

˛k k (y   k1 )( k  y k ) 2

H k (w; E; ˛) :D  k1 k  0 1 r y g(x; y k ) C ˛ k  2C  y k C `k  uk B C @ A (`k ; y k   k1 ) (uk ;  k  y k ) one can thus replace SIP˛BB (E; ˛) equivalently by the nonsmooth problem

g k;i (x; y)  0;

w k

H (w; E; ˛) D 0;

y 2 Y k;i ;

i D 1; 2 ;

The Algorithm The point x¯ is stationary for SIP˛BB (E; ˛) (in the sense of Fritz John) if x¯ 2 M˛BB (E; ˛) and if there exist y k 2 Y0˛BB (x¯ ), 1  k  n C 1, and (; ) 2 S nC1 (the (n C 1)dimensional standard simplex) with r f (x¯) C

G (w; E; ˛)  0;

for all

and by replacing the entry ˛ k of ˛ by the two new entries ˛ k,i , i D 1; 2.

P(E; ˛) : min F(w) subject to

k

for all y 2 Y k

in M˛BB (E; ˛) by the two new constraints

F(w) :D f (x) G k (w; E; ˛) :D g(x; y k ) C

Method

Refinement Step

The MPCC Reformulation

k

The latter problem can be solved to local optimality by MPCC algorithms [10,11,17]. For a local minimizer w¯ of P(E, ˛) the subvector x¯ of w¯ is a local minimizer and, hence, a stationary point of SIP˛BB (E; ˛).

nC1 X

 k rx g(x¯ ; y k ) D 0

kD1

k 2 K:

 k  g k (x¯ ; y k ) D 0; 1  k  n C 1 :

17

18

A

Adaptive Convexification in Semi-Infinite Optimization

For the adaptive convexification algorithm the notions of active index, stationarity, and set unification are relaxed by certain tolerances. Definition 1 For "act , "stat , "[ > 0 we say that (i) yk is "act-active for g k at x¯ if g k (x¯ ; y k ) 2 ["act; 0], (ii) x¯ is "stat -stationary for SIP with "act-active indices if x¯ 2 M and if there exist y k 2 Y, 1  k  n C 1, and (; ) 2 S nC1 such that ˇˇ ˇˇ nC1 ˇˇ ˇˇ X ˇˇ k ˇˇ  k rx g(x¯ ; y )ˇˇ  "stat ˇˇr f (x¯) C ˇˇ ˇˇ kD1

k

 k  g(x¯ ; y ) 2 [ k  "act ; 0];

1  k  n C 1;

hold, and (iii) the "[ -union of E and ˜ is E [ f˜g if minf˜   k1 ;  k  ˜g > "[  ( k   k1 ) holds for the k 2 K with ˜ 2 [ k1 ;  k ], and E otherwise (i. e., ˜ is not unified with E if its distance from E is too small). In [6] it is shown that Algorithm 1 is well-defined, convergent and finitely terminating. Furthermore, the following feasibility result holds. Theorem 2 ([6]) Let (x ) be a sequence of points generated by Algorithm 1. Then all x  ;  2 N, are feasible for SIP, the sequence (x ) has an accumulation point, each such accumulation point x  is feasible for SIP, and f (x  ) provides an upper bound for the optimal value of SIP. Numerical examples for the performance of the method from Chebyshev approximation and design centering are given in [6]. A Consistent Initial Approximation Even if the feasible set M of SIP is consistent, there is no guarantee that its approximations M˛BB (E; ˛) are also consistent. For Step 1 of Algorithm 1 [6] suggests the following phase I approach: use Algorithm 1 to construct adaptive convexifications of SIP ph:I :

min

(x;z)2XR

z

subject to

g(x; y)  z for all y 2 [0; 1]

Algorithm 1 (Adaptive convexification algorithm) Step 1: Determine a uniform convexification parameter ˛N with (5), choose N 2 N,  k 2 Y and ˛ k  ˛, N k 2 K = f1; : : : ; Ng, such that SIP˛BB (E; ˛) is consistent, as well as tolerances "act ; "stat ; "[ > N 0 with "[  2"act /˛. Step 2: By solving P(E; ˛), compute a stationary point x of SIP˛BB (E; ˛) with "act active indices y k , 1  k  n + 1, and multipliers (; ). Step 3: Terminate if x is "stat stationary for SIP with (2"act )-active indices y k ; 1  k  n + 1, from Step 2 and multipliers (; ) from Step 2. Otherwise construct a new set EQ of subdivision points as the "[ -union of E and fy k j 1  k  n + 1g, and perform a refinement step for the elements in EQ n E to construct a new feaQ ˛). Q sible set M˛BB (E; Q Step 4: Put E = E, ˛ = ˛, Q and go to Step 2.

Adaptive Convexification in Semi-Infinite Optimization, Algorithm 1

until a feasible point (x̄, z̄) with z̄ ≤ 0 of SIP^{ph.I}_{αBB}(E, α) is found with some subdivision E and convexification parameters α. The point x̄ is then obviously also feasible for SIP_αBB(E, α) and can be used as an initial point to solve the latter problem. Due to the possible nonconvexity of the upper level problem of SIP, this phase I approach is not necessarily successful, but possible remedies for this situation are given in [6].

To initialize Algorithm 1 for phase I, select some point x̄ in the box X and put E^1 = {0, 1}, that is, Y^1 = Y = [0, 1]. Compute α^1 according to (4) and solve the convex optimization problem Q^1(x̄) with standard software. With its optimal value z̄, the point (x̄, z̄) is feasible for SIP^{ph.I}_{αBB}(E^1, α^1).

A Certificate for Global Optimality

After termination of Algorithm 1 one can exploit the fact that the set E* ⊂ [0, 1] contains indices that should also yield a good outer approximation of M. The optimal value of the problem
\[ P_{outer}: \quad \min_{x \in X} f(x) \quad \text{subject to} \quad g(x, \eta) \leq 0, \ \eta \in E^*, \]


yields a rigorous lower bound for the optimal value of SIP. If P_outer can actually be solved to global optimality (e.g., due to convexity with respect to x, if a standard NLP solver is used), then a comparison of this lower bound with the upper bound from Algorithm 1 can yield a certificate of global optimality for SIP up to some tolerance.

Conclusions

The adaptive convexification algorithm provides an easily implementable way to solve semi-infinite optimization problems with feasible iterates. To explain its basic ideas, the algorithm is presented in its simplest form in [6]. It can be improved in a number of ways, for example in the magnitude of the convexification parameters and in their adaptive refinement, or by using other convexification techniques. Although the numerical results from [6] are very promising, further work is needed on error estimates for the numerical solution of the auxiliary problem P(E, α), which is assumed to be solved to exact local optimality by the present adaptive convexification algorithm.

See also
– αBB Algorithm
– Bilevel Optimization: Feasibility Test and Flexibility Index
– Convex Discrete Optimization
– Generalized Semi-infinite Programming: Optimality Conditions

References
1. Adjiman CS, Androulakis IP, Floudas CA (1998) A global optimization method, αBB, for general twice-differentiable constrained NLPs – I: theoretical advances. Comput Chem Eng 22:1137–1158
2. Adjiman CS, Androulakis IP, Floudas CA (1998) A global optimization method, αBB, for general twice-differentiable constrained NLPs – II: implementation and computational results. Comput Chem Eng 22:1159–1179
3. Bhattacharjee B, Green WH Jr, Barton PI (2005) Interval methods for semi-infinite programs. Comput Optim Appl 30:63–93
4. Bhattacharjee B, Lemonidis P, Green WH Jr, Barton PI (2005) Global solution of semi-infinite programs. Math Program 103:283–307
5. Floudas CA (2000) Deterministic global optimization, theory, methods and applications. Kluwer, Dordrecht


6. Floudas CA, Stein O (2007) The adaptive convexification algorithm: a feasible point method for semi-infinite programming. SIAM J Optim 18:1187–1208
7. Hansen E (1992) Global optimization using interval analysis. Dekker, New York
8. Hettich R, Kortanek KO (1993) Semi-infinite programming: theory, methods, and applications. SIAM Rev 35:380–429
9. Hettich R, Zencke P (1982) Numerische Methoden der Approximation und semi-infiniten Optimierung. Teubner, Stuttgart
10. Kočvara M, Outrata J, Zowe J (1998) Nonsmooth approach to optimization problems with equilibrium constraints: theory, applications and numerical results. Kluwer, Dordrecht
11. Luo Z, Pang J, Ralph D (1996) Mathematical programs with equilibrium constraints. Cambridge University Press, Cambridge
12. Neumaier A (1990) Interval methods for systems of equations. Cambridge University Press, Cambridge
13. Polak E (1987) On the mathematical foundation of nondifferentiable optimization in engineering design. SIAM Rev 29:21–89
14. Polak E (1997) Optimization, algorithms and consistent approximations. Springer, Berlin
15. Reemtsen R, Görner S (1998) Numerical methods for semi-infinite programming: a survey. In: Reemtsen R, Rückmann J-J (eds) Semi-infinite programming. Kluwer, Boston, pp 195–275
16. Reemtsen R, Rückmann J-J (eds) (1998) Semi-infinite programming. Kluwer, Boston
17. Scholtes S, Stöhr M (1999) Exact penalization of mathematical programs with equilibrium constraints. SIAM J Control Optim 37:617–652
18. Stein O (2003) Bi-level strategies in semi-infinite programming. Kluwer, Boston
19. Stein O, Still G (2003) Solving semi-infinite optimization problems with interior point techniques. SIAM J Control Optim 42:769–788

Adaptive Global Search

J. M. CALVIN
Department Computer and Information Sci., New Jersey Institute Techn., Newark, USA

MSC2000: 60J65, 68Q25

Article Outline
Keywords
See also
References


Keywords
Average case complexity; Adaptive algorithm; Wiener process; Randomized algorithms

This article contains a survey of some well known facts about the complexity of global optimization, and also describes some results concerning the average-case complexity.

Consider the following optimization problem. Given a class F of objective functions f defined on a compact subset of d-dimensional Euclidean space, the goal is to approximate the global minimum of f based on evaluation of the function at sequentially selected points. The focus will be on the error after n observations,
\[ \Delta_n = \Delta_n(f) = f_n - f^*, \]
where f_n is the smallest of the first n observed function values (other approximations besides f_n are often considered).

Complexity of optimization is usually studied in the worst- or average-case setting. In order for a worst-case analysis to be useful, the class of objective functions F must be quite restricted. Consider the case where F is a subset of the continuous functions on a compact set. It is convenient to consider the class F = C^r([0, 1]^d) of real-valued functions on [0, 1]^d with continuous derivatives up to order r ≥ 0. Suppose that r > 0 and f^(r) is bounded. In this case Θ(ε^{−d/r}) function evaluations are needed to ensure that the error is at most ε for any f ∈ F; see [8].

An adaptive algorithm is one for which the (n + 1)st observation point is determined as a function of the previous observations, while a nonadaptive algorithm chooses each point independently of the function values. In the worst-case setting, adaptation does not help much under quite general assumptions. If F is convex and symmetric (in the sense that F = −F), then the maximum error under an adaptive algorithm with n observations is not smaller than the maximum error of a nonadaptive method with n + 1 observations; see [4]. Virtually all global optimization methods in practical use are adaptive. For a survey of such methods see [6,9]. The fact that the worst-case performance cannot be significantly improved with adaptation leads to consideration of alternative settings that may be more

appropriate. One such setting is the average-case setting, in which a probability measure P on F is chosen. The object of study is then the sequence of random variables Δ_n(f), and the questions include under what conditions (for what algorithms) the error converges to zero and, for convergent algorithms, the speed of convergence. While the average-case error is often defined as the mathematical expectation of the error, it is useful to take a broader view, and consider for example convergence in probability of a_n Δ_n for some normalizing sequence {a_n}.

With the average-case setting one can consider less restricted classes F than in the worst-case setting. As F gets larger, the worst case deviates more and more from the average case, but may occur on only a small portion of the set F. Even for continuous functions the worst case is arbitrarily bad.

Most of what is known about the average-case complexity of optimization is in the one-dimensional setting under the Wiener probability measure on C([0, 1]). Under the Wiener measure, the increments f(t) − f(s) have a normal distribution with mean zero and variance t − s, and are independent for disjoint intervals. Almost every f is nowhere differentiable, and the set of local minima is dense in the unit interval. One can thus think of the Wiener measure as corresponding to assuming 'only' continuity; i.e., a worst-case probabilistic assumption.

K. Ritter proved [5] that the best nonadaptive algorithms have error of order n^{−1/2} after n function evaluations; the optimal order is achieved by observing at equally spaced points. Since the choice of each new observation point does not depend on any of the previous observations, the computation can be carried out in parallel. Thus under the Wiener measure, the optimal nonadaptive order of convergence can be accomplished with an algorithm that has computational cost that grows linearly with the number of observations and uses constant storage. This gives the base on which to compare adaptive algorithms.

Recent studies (as of 2000) have formally established the improved power of adaptive methods in the average-case setting by analyzing the convergence rates of certain adaptive algorithms. A randomized algorithm is described in [1] with the property that for any 0 < δ < 1, a version can be constructed so that under the Wiener measure, the error converges to zero at rate n^{−1+δ}. This algorithm maintains a memory of two past observation values, and the computational cost grows linearly with the number of iterations. Therefore, the convergence rate of this adaptive algorithm improves from the nonadaptive n^{−1/2} rate to n^{−1+δ} with only a constant increase in storage.

Algorithms based on a random model for the objective function are well-suited to average-case analysis. H. Kushner proposed [3] a global optimization method based on modeling the objective function as a Wiener process. Let {z_n} be a sequence of positive numbers, and let the (n + 1)st point be chosen to maximize the probability that the new function value is less than the previously observed minimum minus z_n. This class of algorithms, often called P-algorithms, was given a formal justification by A. Žilinskas [7]. By allowing the {z_n} to depend on the past observations instead of being a fixed deterministic sequence, it is possible to establish a much better convergence rate than that of the randomized algorithm described above. In [2] an algorithm was constructed with the property that the error converges to zero for any continuous function and furthermore, the error is of order e^{−n c_n}, where {c_n} (a parameter of the algorithm) is a deterministic sequence that can be chosen to approach zero at an arbitrarily slow rate. Notice that the convergence rate is now almost exponential in the number of observations n. The computational cost of the algorithm grows quadratically, and the storage increases linearly, since all past observations must be stored.
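As an illustration, the following Python sketch performs one step of such a P-algorithm under the Wiener-process model (a minimal sketch with hypothetical names, assuming a unit diffusion coefficient; a grid search stands in for the closed-form maximization over each subinterval):

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_algorithm_step(ts, fs, z_n, grid=50):
    """One step of a P-algorithm: ts are the sorted observation points,
    fs the observed values.  Between two observations the conditioned
    Wiener process is a Brownian bridge, so f(t) is normal with linearly
    interpolated mean and variance (t - a)(b - t)/(b - a); we return the t
    maximizing P(f(t) < min(fs) - z_n)."""
    target = min(fs) - z_n
    best_t, best_p = None, -1.0
    for (a, fa), (b, fb) in zip(zip(ts, fs), zip(ts[1:], fs[1:])):
        for j in range(1, grid):
            t = a + (b - a) * j / grid
            mean = fa + (fb - fa) * (t - a) / (b - a)
            std = math.sqrt((t - a) * (b - t) / (b - a))
            p = normal_cdf((target - mean) / std)
            if p > best_p:
                best_t, best_p = t, p
    return best_t
```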

See also
– Adaptive Simulated Annealing and its Application to Protein Folding
– Global Optimization Based on Statistical Models

References
1. Calvin J (1997) Average performance of a class of adaptive algorithms for global optimization. Ann Appl Probab 7:711–730
2. Calvin J (2001) A one-dimensional optimization algorithm and its convergence rate under the Wiener measure. J Complexity
3. Kushner H (1962) A versatile stochastic model of a function of unknown and time varying form. J Math Anal Appl 5:150–167
4. Novak E (1988) Deterministic and stochastic error bounds in numerical analysis. Lecture Notes in Mathematics, vol 1349. Springer, Berlin
5. Ritter K (1990) Approximation and optimization on the Wiener space. J Complexity 6:337–364
6. Törn A, Žilinskas A (1989) Global optimization. Springer, Berlin
7. Žilinskas A (1985) Axiomatic characterization of global optimization algorithm and investigation of its search strategy. OR Lett 4:35–39
8. Wasilkowski G (1992) On average complexity of global optimization problems. Math Program 57:313–324
9. Zhigljavsky A (1991) Theory of global random search. Kluwer, Dordrecht

Adaptive Simulated Annealing and its Application to Protein Folding

ASA

RUTH PACHTER, ZHIQIANG WANG
Air Force Research Laboratory, Materials & Manufacturing Directorate, Wright–Patterson AFB, USA

MSC2000: 92C05

Article Outline
Keywords
The ASA Method
Monte-Carlo Configurations
Annealing Schedule
Re-Annealing
Application to Protein Folding
Computational Details
Met-Enkephalin
Poly(L-Alanine)
Conclusion
Recent Studies and Future Directions
See also
References

Keywords
Optimization; Adaptive simulated annealing; Protein folding; Met-Enkephalin; Poly(L-Alanine)

The adaptive simulated annealing (ASA) algorithm [3] has been shown to be faster and more efficient than simulated annealing and genetic algorithms [4]. In this article we first outline some of the aspects of the method and specific computational details, and then review the application of the ASA method to biomolecular structure determination [15], specifically for Met-Enkephalin and a model of the poly(L-Alanine) system.

The ASA Method

For a system described by a cost function E({p_i}), where all p_i (i = 1, …, D) are parameters (variables) having ranges [A_i, B_i], the ASA procedure to find the global optimum of 'E' contains the following elements.

Monte-Carlo Configurations

As the kth point is saved in a D-dimensional configuration space, the new point p_i^{k+1} is generated by
\[ p_i^{k+1} = p_i^k + y^i (B_i - A_i), \tag{1} \]
where the random variables y^i in [−1, 1] (non-uniform) are generated from a random number u^i uniformly distributed in [0, 1], and the temperature T_i associated with parameter p_i, as follows:
\[ y^i = \mathrm{sgn}(u^i - 0.5)\, T_i \left[ \left(1 + \frac{1}{T_i}\right)^{|2u^i - 1|} - 1 \right]. \tag{2} \]
Note that if p_i^{k+1} is outside the range of [A_i, B_i] it will be disregarded, with the process being repeated until it falls within the range. The choice of y^i is made so that the probability density distribution of the D parameters will satisfy the distribution of each parameter,
\[ g^i(y^i; T_i) = \frac{1}{2(|y^i| + T_i)\ln(1 + 1/T_i)}, \tag{3} \]
which is chosen to ensure that any point in configuration space can be sampled infinitely often in annealing time with the cooling schedule outlined below. Thus, at any annealing time k_0, the probability of not generating a global optimum, given infinite time, is zero:
\[ \prod_{k=k_0}^{\infty} (1 - g_k) = 0, \tag{4} \]
where g_k is the distribution function at time step k. Note that all atoms move at each Monte-Carlo step in ASA. A Boltzmann acceptance criterion is then applied to the difference in the cost function.
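The generation step of (1)–(2) is simple to implement. The following minimal Python sketch (hypothetical names, not the ASA reference code of [3]) draws one new configuration:

```python
import numpy as np

def asa_generate(p, T, A, B, rng):
    """Draw one new configuration via Eqs. (1)-(2): every parameter is
    perturbed with the fat-tailed ASA generating distribution at its own
    temperature T[i]; out-of-range trial values are simply redrawn."""
    p_new = np.array(p, dtype=float)
    for i in range(len(p)):
        while True:
            u = rng.uniform()
            y = np.sign(u - 0.5) * T[i] * ((1.0 + 1.0 / T[i]) ** abs(2.0 * u - 1.0) - 1.0)
            trial = p[i] + y * (B[i] - A[i])
            if A[i] <= trial <= B[i]:
                p_new[i] = trial
                break
    return p_new
```

For example, `asa_generate([0.0], [1e-2], [-1.0], [1.0], np.random.default_rng())` perturbs a single parameter at temperature 10^{−2}; at small T the distribution concentrates near the current point while keeping heavy tails, which is what allows (4) to hold.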

Annealing Schedule

The annealing schedule for each parameter temperature, from a starting temperature T_{0i}, and similarly for the cost temperature, is given by
\[ T_i(k_i) = T_{0i} \exp\left(-c_i\, k_i^{1/D}\right), \tag{5} \]
where c_i and k_i are the annealing scale and ASA step of parameter p_i. The index for re-annealing the cost function is determined by the number of accepted points instead of the number of generated points as is used for the parameters. This choice was made since the Boltzmann acceptance criterion uses an exponential distribution which is not as 'fat-tailed' as the ASA distribution used for the parameters.

Re-Annealing

The temperatures may be periodically re-annealed or re-scaled according to the sensitivity of the cost function. At any given annealing time, the temperature range is 'stretched out' over the relatively insensitive parameters, thus guiding the search 'fairly' among the parameters. The sensitivity of the energy to each parameter is calculated by
\[ S_i = \frac{\partial E}{\partial p_i}, \tag{6} \]
while the re-annealing temperature is determined by
\[ T_i(k') = T_i(k)\, \frac{S_i}{S_{\max}}. \tag{7} \]
In this way, less sensitive parameters anneal faster. This is done approximately every 100 accepted events.

For comparison, within conventional simulated annealing [6] the cooling schedule is given by
\[ T_k = T_0\, e^{-(1-c)k} \quad (0 < c < 1), \tag{8} \]
where trial and error are applied to determine the annealing rate c as well as the starting temperature T_0. A Monte-Carlo simulation is carried out at each temperature step k with temperature T_k. This cooling schedule is equivalent to T_{k+1} = T_k c. The ASA algorithm is mostly suited to problems for which little is known about the system, and has proven to be more robust than other simulated annealing techniques for complex problems with multiple local minima, e.g., as compared to Cauchy annealing, where T_i = T_0/k, and Boltzmann annealing, where T_i = T_0/ln k. The annealing schedule in (8), faster than ASA for a large dimension D, does not pass the infinitely-often annealing-time test in (4), and is therefore referred to as simulated quench in the terminology of ASA.

Application to Protein Folding

Computational Details

A protein can be defined as a biopolymer of hundreds of amino acids bonded by peptide bonds, while the test models in this article contain fewer amino acids, namely oligopeptides. The Met-Enkephalin model was constructed as (H-Tyr-Gly-Gly-Phe-Met-OH). For 14(L-Alanine), the neutral –NH2 and –COOH end groups were substituted at the termini. The conformation of a protein is described by the dihedral angles of the backbone (φ_i, ψ_i), side-chains (χ_i^j), and peptide bond (ω_i, often very close to 180°). Therefore, the conformation determination of the most stable protein is to find the set of {φ, ψ, χ, ω} which gives the global minimal potential energy E(φ, ψ, χ, ω). Within the ASA nomenclature, the 'cost function' is the potential energy, while a 'parameter' is a dihedral angle variable.

Conformational analyses using conventional simulated annealing were carried out previously [9,11]. The modifications in these works include moving a number of dihedral angles in a Monte-Carlo step; adjusting the maximum deviation of the variables as the temperature decreases to ensure that the acceptance ratio is more than 25%; and treating the variables differently according to their importance in the folding process, e.g., by increasing sampling for the backbone dihedral angles as compared to those of the side-chains. It is interesting to point out that within ASA these modifications are implicitly included.

Each ASA run in our work was started from a random initial configuration {φ, ψ, χ}. The dihedral angle ω was fixed to 180° in all of the ASA runs. The initial temperature was determined by the average energy of 5 or 10 random samplings, and a full search range of the dihedral angles was set. The typical maximum number of calls to the energy function was 30000. An ASA run was terminated if it repeated the best energy value for 3 or 5 re-annealing cycles (each cycle generates 100 configurations). Further refinement of the final ASA optimized configuration was carried out by using

the local minimizer SUMSL [1], or the conjugate gradient method. The combination of the ASA application and a local minimizer improved the efficiency of the search.

The ASA calculation is governed by various control parameters [3], for which the most important setting is the annealing rate for the temperatures of 'cost' and 'parameters', determined by the so-called 'temperature-ratio-scale' (the ratio of the final to the initial temperature after a certain number of annealing steps) and the 'cost-parameter-scale'. The control parameters were varied to improve the search efficiency. Adequate control parameters used for obtaining the results reported in this study were: 'temperature-ratio-scale' = 10^{−4}; 'cost-parameter-scale' = 0.5. These parameter settings correspond to an annealing rate for energy of c_cost = 3.6, and for all dihedral angles of c_parameter = 7.2. Note that the annealing rate for all dihedral angles was chosen to be the same.

Met-Enkephalin

Met-Enkephalin has a complicated energy surface [11,16]. The lowest energy for Met-Enkephalin was found to be −12.9 kcal/mol with the force field being ECEPP/2 (Empirical Conformation Energy Program for Peptides) [8]. With all ω fixed, the lowest energy was found to be −10.7 kcal/mol by MCM [14]. Using different initial conformations and control parameter settings of the cooling schedule as described above, 55 independent ASA runs were carried out. Table 1 summarizes the energy distribution of these calculations. Most of the ASA calculations result in energies in the range of −8 to −3 kcal/mol, with 7 of the results determining conformations having energies that are only 3 kcal/mol above the known lowest energy, thus exhibiting the effectiveness of the approach. Moreover, as the range of search was somewhat narrowed, almost all of the ASA runs reached the global energy minimum.

Adaptive Simulated Annealing and its Application to Protein Folding, Table 1
The energy (in kcal/mol) distribution of ASA runs for Met-Enkephalin using a full search range

Energy      | < −8 | (−8, −5) | (−5, −3) | > −3
No. of runs |  7   |   19     |   19     |  10


Adaptive Simulated Annealing and its Application to Protein Folding, Table 2
Energies (in kcal/mol) and RMSD values of the lowest energy conformations of Met-Enkephalin calculated by ASA. RMSD1 is the root-mean-square deviation (in Å) for backbone atoms, while RMSD2 is for all atoms

      | A0    | A     | #1    | #2    | #3    | #4
E     | −12.9 | −10.7 | −10.6 | −10.4 | −10.1 | −8.5
RMSD1 |       | 0     | 0.04  | 0.07  | 0.51  | 0.26
RMSD2 |       | 0     | 2.52  | 1.92  | 2.08  | 1.29

For the full range search, we identified three conformations with energies of −10.6, −10.4, and −10.1 kcal/mol, that exhibit the configuration of the known lowest geometry of −10.7 kcal/mol. Table 2 lists these lowest energy configurations, as well as an additional low energy structure. Conformations A0 and A are the lowest-energy conformations with ω non-fixed and fixed, respectively, taken from [11,14]. The first two conformations, #1 and #2, have almost the same backbone configuration as that of A (−10.7 kcal/mol), with a backbone root-mean-square deviation (RMSD) of only 0.04 and 0.07 Å, respectively. The all-atom RMSDs of the listed conformations with energies ranging from −8.5 to −10.6 kcal/mol are about 2 Å. For conformations #1 and #2, the noted differences are in the side-chains, corresponding to a 0.1 and 0.3 kcal/mol difference in energy, respectively.





Adaptive Simulated Annealing and its Application to Protein Folding, Table 3
The conformation of a model 14(L-Alanine) peptide as calculated by ASA

Residue | φ     | ψ
2       | −99.4 | −158.1
3       | −68.2 | −34.3
4       | −68.0 | −38.8
5       | −69.3 | −38.5
6       | −66.9 | −38.6
7       | −68.3 | −39.2
8       | −66.7 | −38.0
9       | −68.8 | −38.7
10      | −67.1 | −37.7
11      | −69.4 | −39.6
12      | −65.0 | −40.0
13      | −67.2 | −44.6
14      | −87.7 | −65.8
15      | −75.9 | −40.1

Poly(L-Alanine)

The ASA algorithm was applied to a model of (L-Alanine) that is known to assume a dominant right-handed α-helical structure [13]. For a search range of the dihedral angles (φ, ψ) chosen to include both the right-handed (RH) α-helix and the β-sheet region in the Ramachandran diagram, it was significant to find RH α-helices with φ ≈ −68° and ψ ≈ −38° in all backbones except those near the end-groups, as shown in Table 3. The energy of such a geometry is typically −10.2 kcal/mol after a local minimization. The energy surfaces of the RH α-helical regions were found to be less complex than those of Met-Enkephalin. These results are consistent with a previous study [16].

Conclusion

The adaptive simulated annealing method as a global optimization method intrinsically includes some of the modifications of conventional simulated annealing used for biomolecular structure determination. As applied to Met-Enkephalin, the performance of ASA is comparable to the simulated annealing study reported in [12], while better than the one reported in [11], although some differences other than the algorithms are noted. Utilizing a partial search range improves the efficiency significantly, showing that ASA may be useful for refinement of a molecular structure predicted or measured by other methods. A dominant right-handed α-helical conformation was found for the 14 residue (L-Alanine) model, with deviations observed only near the end groups.


Recent Studies and Future Directions

Recent studies have shown improved efficiency in the conformational search of Met-Enkephalin, e.g., the so-called conformation space annealing (CSA), which combines the ideas of genetic algorithms, simulated annealing, a build-up procedure, and local minimization [7]. The use of the multicanonical ensemble algorithm (ME) (one of the generalized-ensemble algorithms [2]) allows free random walks in energy space, escaping from any energy barrier. Both the ME and CSA algorithms outperform genetic algorithms (GA), simulated annealing (SA), GA with minimization (GAM) and Monte-Carlo with minimization (MCM). Our own work (unpublished) and the work in ref. [5] both show that simple GA alone underperforms simulated annealing for the Met-Enkephalin conformational search problem. Table 4 compares these algorithms for efficiency (the number of evaluations of energy and energy gradient, or the number of local minimizations) and effectiveness (the number of runs reaching the ground state conformation (hits) versus the number of total independent runs). Caution should be exercised since some differences exist between these studies, such as the version of the ECEPP potential used, the treatment of the peptide dihedral angle ω, etc. Ground state conformations are those having energy within approximately 1 eV from the known global minimum energy. Note that the generalized-ensemble method can be carried out with both Monte-Carlo and molecular dynamics.

Adaptive Simulated Annealing and its Application to Protein Folding, Table 4
Comparison of the conformation search efficiency and effectiveness of Met-Enkephalin using different algorithms. N_E, N_∇E, and N_minz are the number of evaluations of energy, energy gradient, and number of local minimizations of each run, in units of 10^3

Algorithm | hits/total | N_E    | N_∇E | N_minz
ME [2]    | 10/10      | < 1900 | 0    | 0
MCM [11]  | 24/24      | †      | †    | 15
GAM [10]  | 5/5        | †      | †    | 50
ME [2]    | 18/20      | 950    | 0    | 0
CSA [7]   | 99/100     | 300    | 250  | 5
ME [2]    | 21/50      | 400    | 0    | 0
CSA [7]   | 50/100     | 170    | 130  | 2.6
SA [2]    | 8/20       | 1000   | 0    | 0
GA [5]    | < 1/27     | 100    | 0    | 0.001

†: The total number of E, ∇E evaluations is not given, but can be estimated based on roughly 100 evaluations for each minimization.

In comparison to the studies summarized in Table 4, ASA seems to be using too small a number of function evaluations. Optimizing control parameters such as the annealing schedule and increasing the number of energy evaluations may improve the effectiveness. Search efficiency could also be improved by adopting parallelization to achieve scalable simulation for various algorithms. Extensive research on the protein conformational search using various hybrids of genetic algorithms and parallelization is in progress (as of 1999).

See also
– Adaptive Global Search
– Bayesian Global Optimization
– Genetic Algorithms
– Genetic Algorithms for Protein Structure Prediction
– Global Optimization Based on Statistical Models
– Global Optimization in Lennard–Jones and Morse Clusters
– Global Optimization in Protein Folding
– Molecular Structure Determination: Convex Global Underestimation
– Monte-Carlo Simulated Annealing in Protein Folding
– Multiple Minima Problem in Protein Folding: αBB Global Optimization Approach
– Packet Annealing
– Phase Problem in X-ray Crystallography: Shake and Bake Approach
– Protein Folding: Generalized-ensemble Algorithms
– Random Search Methods
– Simulated Annealing
– Simulated Annealing Methods in Protein Folding
– Stochastic Global Optimization: Stopping Rules
– Stochastic Global Optimization: Two-phase Methods

References
1. Gay DM (1983) Subroutines for unconstrained minimization using a model/trust-region approach. ACM Trans Math Softw 9:503


2. Hansmann UHE (1998) Generalized ensembles: A new way of simulating proteins. Phys A 254:15
3. Ingber L (1989) Very fast simulated re-annealing. Math Comput Modelling 12:967. ASA code is available from: ftp.alumni.caltech.edu:pub/ingber
4. Ingber L, Rosen B (1992) Genetic algorithm and very fast simulated re-annealing: A comparison. Math Comput Modelling 16:87
5. Jin AY, Leung FY, Weaver DF (1997) Development of a novel genetic algorithm search method (GAP1.0) for exploring peptide conformational space. J Comput Chem 18:1971
6. Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671
7. Lee J, Scheraga HA, Rackovsky S (1997) New optimization method for conformational energy calculations on polypeptides: conformational space annealing. J Comput Chem 18:222
8. Li Z, Scheraga HA (1987) Monte Carlo-minimization approach to the multiminima problem in protein folding. Proc Natl Acad Sci USA 84:6611
9. Li Z, Scheraga HA (1988) Structure and free energy of complex thermodynamic systems. J Mol Struct (Theochem) 179:333
10. Merkle LD, Lamont GB, Gates GH, Pachter R (May 1996) Hybrid genetic algorithms for minimization of polypeptide specific energy model. Proc. IEEE Int. Conf. Evolutionary Computation, p 192
11. Nayeem A, Vila J, Scheraga HA (1991) A comparative study of the simulated-annealing and Monte Carlo-with-minimization approaches to the minimum-energy structures of polypeptides: Met-Enkephalin. J Comput Chem 12:594
12. Okamoto Y, Kikuchi T, Kawai H (1992) Prediction of low-energy structure of Met-Enkephalin by Monte Carlo simulated annealing. Chem Lett (Chem Soc Japan):1275
13. Piela L, Scheraga HA (1987) On the multiple-minima problem in the conformational analysis of polypeptides: I. backbone degrees of freedom for a perturbed α-helix. Biopolymers 26:S33
14. Vasquez M (1999) Private communication
15. Wang Z, Pachter R (1997) Prediction of polypeptide conformation by the adaptive simulated annealing approach. J Comput Chem 18:323
16. Wilson SR, Cui W (1990) Applications of simulated annealing to peptides. Biopolymers 29:225

Affine Sets and Functions

LEONIDAS PITSOULIS
Princeton University, Princeton, USA

MSC2000: 51E15, 32B15, 51N20

Article Outline
Keywords
See also
References

Keywords
Linear algebra; Convex analysis

A subset S of ℝ^n is an affine set if
\[ (1 - \lambda)x + \lambda y \in S \]
for any x, y ∈ S and λ ∈ ℝ. For example, any line in ℝ^n is an affine set, as is any translate of a linear subspace. A function f : ℝ^n → ℝ is an affine function if f is finite, convex and concave (cf. Convex max-functions).

See also
– Linear Programming
– Linear Space

References
1. Rockafellar RT (1970) Convex analysis. Princeton Univ. Press, Princeton

Airline Optimization

GANG YU 1, BENJAMIN THENGVALL 2
1 Department Management Sci. and Information Systems, Red McCombs School of Business, University Texas at Austin, Austin, USA
2 CALEB Technologies Corp., Austin, USA

MSC2000: 90B06, 90C06, 90C08, 90C35, 90C90

Article Outline
Keywords
See also
References


Network design and schedule construction; Fleet assignment; Aircraft routing; Crew scheduling; Revenue management; Irregular operations; Air traffic control and ground delay programs

The airline industry was one of the first to apply operations research methodology and techniques on a large scale. As early as the late 1950s, operations researchers were beginning to study how the developing fields of mathematical programming could be used to address a number of very difficult problems faced by the airline industry. Since that time many airline related problems have been the topics of active research [26]. Most optimization-related research in the airline industry can be placed in one of the following areas:
– network design and schedule construction;
– fleet assignment;
– aircraft routing;
– crew scheduling;
– revenue management;
– irregular operations;
– air traffic control and ground delay programs.
In the following, each of these problem areas will be defined along with a brief discussion of some of the operations research techniques that have been applied to solve them. The majority of applications utilize network-based models. Solutions of these models range from traditional mathematical programming approaches to a variety of novel heuristic approaches. A very brief selection of references is also provided.

Construction of flight schedules is the starting point for all other airline optimization problems and is a critical operational planning task faced by an airline. The flight schedule defines a set of flight segments that an airline will service along with corresponding origin and destination points and departure and arrival times for each flight segment. An airline's decision to offer certain flights will depend in large part on market demand forecasts, available aircraft operating characteristics, available manpower, and the behavior of competing airlines [11,12]. Of course, prior to the construction of flight schedules, an airline must decide which markets it will serve. Before the 1978 'Airline Deregulation Act', airlines had to fly routes as assigned by the Civil Aeronautics Board regardless of the demand for service. During this period, most airlines emphasized long point-to-point routes. Since deregulation, airlines have gained the freedom to choose which markets to serve and how often to serve them. This change led to a fundamental shift in most airlines' routing strategies from point-to-point flight networks to hub-and-spoke oriented flight networks. This, in turn, led to new research activities for finding optimal hub [3,18] and maintenance base [13] locations.

Following network design and schedule construction, an aircraft type must be assigned to each flight segment in the schedule. This is called the fleet assignment problem. Airlines generally operate a number of different fleet types, each having different characteristics and costs such as seating capacity, landing weights, and crew and fuel costs. The majority of fleet assignment methods represent the flight schedule via some variant of a time-space network with flight arcs between stations and inventory arcs at each station. A multicommodity network flow problem can then be formulated with arcs and nodes duplicated as appropriate for all fleets that can take a particular flight. Side constraints must be implemented to ensure each flight segment is assigned to only one fleet. In domestic fleet assignment problems, a common simplifying assumption is that every flight is flown every day of the week. Under this assumption, the network model need only account for one day's flights, and a looping arc connects the end of the day with the beginning. The resulting models are mixed integer programs [1,16,27,30].

Aircraft routing is a fleet-by-fleet process of assigning individual aircraft to fly each flight segment assigned to a particular fleet. A primary consideration at this stage is the maintenance requirements mandated by the Federal Aviation Administration. There are different types of maintenance activities that must be performed after a given number of flight hours. The majority of these maintenance activities can be performed overnight; however, not all stations are equipped with proper maintenance facilities for all fleets. During the aircraft routing process, individual aircraft from each fleet must be assigned to fly all flight segments assigned to that fleet in a manner that provides maintenance opportunities for all aircraft at appropriate stations within the required time intervals. This problem has been formulated and solved in a number of ways, including as a general integer programming problem solved by Lagrangian relaxation [9] and as a set partitioning problem solved with a branch and bound algorithm [10]. As described above, the problems of fleet assignment and aircraft routing have historically been solved in a sequential manner. Recently, work has been done to solve these problems simultaneously using a string-based model and a branch and price solution approach [5].

Crew scheduling, like aircraft routing, is done following fleet assignment. The first of two sequentially solved crew scheduling problems is the crew pairing problem. A crew pairing is a sequence of flight legs beginning and ending at a crew base that satisfies all governmental and contractual restrictions (sometimes called legalities). These crew pairings generally cover a period of 2–5 days. The problem is to find a minimum cost set of such crew pairings such that all flight segments are covered. This problem has generally been modeled as a set partitioning problem in which pairings are enumerated or generated dynamically [15,17]. Other attempts to solve this problem have employed a decomposition approach based on graph partitioning [4] and a linear programming relaxation of a set covering problem [21]. Often a practice called deadheading is used to reposition flight crews, in which a crew will fly a flight segment as passengers. Therefore, in solving the crew-pairing problem, all flight segments must be covered, but they may be covered by more than one crew.
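To make the crew pairing model concrete, a standard set partitioning formulation (a generic textbook form, not necessarily the exact model of [15,17]) reads:
\[ \min \sum_{p \in P} c_p x_p \quad \text{s.t.} \quad \sum_{p \in P} a_{fp}\, x_p = 1 \ \ \forall f \in F, \qquad x_p \in \{0, 1\} \ \ \forall p \in P, \]
where P is the set of legal pairings, F the set of flight segments, c_p the cost of pairing p, and a_{fp} = 1 if pairing p covers flight f. When deadheading is allowed, the equalities are relaxed to "≥ 1", turning the model into a set covering problem as in [21].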

The second problem to be solved relating to crew scheduling is the monthly crew rostering problem. This is the problem of assigning individual crew members to crew pairings to create their monthly schedules. These schedules must incorporate time off, training periods, and other contractual obligations. Generally, a preferential bidding system is used to make the assignments, in which each personalized schedule takes into account an employee's pre-assigned activities and weighted bids representing their preferences. While the crew pairing problem has been widely studied, a limited number of publications have dealt with the monthly crew rostering problem. Approaches include an integer programming scheme [14] and a network model [24].

Revenue management is the problem of determining fare classes for each flight in the flight schedule as well as the allocation of available seats to each fare class. Not only are seats on an airplane partitioned physically into sections such as first class and coach, but seats in the same section are also generally priced at many different levels. The goal is to maximize the expected revenue from a particular flight segment by finding the proper balance between gaining additional revenue by selling more inexpensive seats and losing revenue by turning away higher fare customers. A standard assumption is that fare classes are filled sequentially from the lowest to the highest. This is often the case where discounted fares are offered in advance, while last minute tickets are sold at a premium. Recent research includes a probabilistic decision model [6], a dynamic programming formulation [31] and some calculus-based booking policies [8].

When faced with a lack of resources, airlines often are not able to fly their published flight schedule. This is frequently the result of aircraft mechanical difficulties, inclement weather, or crew shortages. As situations like these arise, decisions must be made to deal with the shortage of resources in a manner that returns the airline to the originally planned flight schedule in a timely fashion while attempting to reduce operational cost and keep passengers satisfied. This general situation is called the airline irregular operations problem, and it involves aircraft, crew, gate, and passenger recovery. The aircraft schedule recovery problem deals with re-routing aircraft during irregular operations. This problem has received significant attention among irregular operations topics; papers dealing with crew scheduling during irregular operations have only recently started to appear [28,35]. Most approaches for dealing with aircraft schedule recovery have been based on network models. Some early models were pure networks [19]. Recently, more comprehensive models have been developed that better represent the problem, but are more difficult to solve as side constraints have been added to the otherwise network structure of these problems [2,33,36]. In practice, many airlines use heuristic methods to solve these problems as their real-time nature does not allow for lengthy optimization run times.

Closely related to the irregular operations problem is the ground delay problem in air traffic control. Ground delay is a program implemented by the Federal Aviation Administration in cases of station congestion. During ground delay, aircraft departing for a congested station are held on the ground before departure. The rationale for this policy is that ground delays are less expensive and safer than airborne delays. Several optimization models have been formulated to decrease the total minutes of delay experienced throughout the system during a ground delay program. These problems have generally been modeled as integer programs [22,23], but the problem has also been solved using stochastic linear programming [25] and by heuristic methods [34].

Optimization based methods have also been applied to a myriad of other airline related topics such as gate assignment [7], fuel management [29], short term fleet assignment swapping [32], demand modeling [20], and others. The airline industry is an exciting arena for the interplay between optimization theory and practice. Many more optimization applications in the airline industry will evolve in the future.

See also
– Integer Programming
– Vehicle Scheduling

References
1. Abara J (1989) Applying integer linear programming to the fleet assignment problem. Interfaces 19(4):20–28
2. Argüello MF, Bard JF, Yu G (1997) Models and methods for managing airline irregular operations aircraft routing. In: Yu G (ed) Operations Research in Airline Industry. Kluwer, Dordrecht
3. Aykin T (1994) Lagrangian relaxation based approaches to capacitated hub-and-spoke network design problem. Europ J Oper Res 79(3):501–523
4. Ball M, Roberts A (1985) A graph partitioning approach to airline crew scheduling. Transport Sci 19(2):107–126
5. Barnhart C, Boland NL, Clarke LW, Johnson EL, Nemhauser G, Shenoi RG (1998) Flight string models for aircraft fleeting and routing. Transport Sci 32(3):208–220
6. Belobaba PP (1989) Application of a probabilistic decision model to airline seat inventory control. Oper Res 37:183–197
7. Brazile RP, Swigger KM, Wyatt DL (1994) Selecting a modelling technique for the gate assignment problems: Integer programming, simulation, or expert system. Internat J Modelling and Simulation 14(1):1–5
8. Brumelle SL, McGill JI (1993) Airline seat allocation with multiple nested fare classes. Oper Res 41(1):127–137
9. Daskin MS, Panagiotopoulos ND (1989) A Lagrangian relaxation approach to assigning aircraft to routes in hub and spoke networks. Transport Sci 23(2):91–99
10. Desaulniers G, Desrosiers J, Dumas Y, Solomon MM, Soumis F (1997) Daily aircraft routing and scheduling. Managem Sci 43(6):841–855
11. Dobson G, Lederer PJ (1993) Airline scheduling and routing in a hub-and-spoke system. Transport Sci 27(3):281–297
12. Etschamaier MM, Mathaisel DFX (1985) Airline scheduling: An overview. Transport Sci 9(2):127–138


13. Feo TA, Bard JF (1989) Flight scheduling and maintenance base planning. Managem Sci 35(12):1415–1432
14. Gamache M, Soumis F, Villeneuve D, Desrosiers J (1998) The preferential bidding system at Air Canada. Transport Sci 32(3):246–255
15. Graves GW, McBride RD, Gershkoff I, Anderson D, Mahidhara D (1993) Flight crew scheduling. Managem Sci 39(6):736–745
16. Hane CA, Barnhart C, Johnson EL, Marsten RE, Nemhauser GL, Sigismondi G (1995) The fleet assignment problem: Solving a large-scale integer program. Math Program 70(2):211–232
17. Hoffman KL, Padberg M (1993) Solving airline crew scheduling problems by branch-and-cut. Managem Sci 39(6):657–680
18. Jaillet P, Song G, Yu G (1997) Airline network design and hub location problems. Location Sci 4(3):195–212
19. Jarrah AIZ, Yu G, Krishnamurthy N, Rakshit A (1993) A decision support framework for airline flight cancellations and delays. Transport Sci 27(3):266–280
20. Jorge-Calderon JD (1997) A demand model for scheduled airline services on international European routes. J Air Transport Management 3(1):23–35
21. Lavoie S, Minoux M, Odier E (1988) A new approach for crew pairing problems by column generation with an application to air transportation. Europ J Oper Res 35:45–58
22. Luo S, Yu G (1997) On the airline schedule perturbation problem caused by the ground delay program. Transport Sci 31(4):298–311
23. Navazio L, Romanin-Jacur G (1998) The multiple connections multi-airport ground holding problem: Models and algorithms. Transport Sci 32(3):268–276
24. Nicoletti B (1975) Automatic crew rostering. Transport Sci 9(1):33–48
25. Richetta O, Odoni AR (1993) Solving optimally the static ground-holding policy problem in air traffic control. Transport Sci 27(3):228–238
26. Richter H (1989) Thirty years of airline operations research. Interfaces 19(4):3–9
27. Rushmeier RA, Kontogiorgis SA (1997) Advances in the optimization of airline fleet assignment. Transport Sci 31(2):159–169
28. Stojkovic M, Soumis F, Desrosiers J (1998) The operational airline crew scheduling problem. Transport Sci 32(3):232–245
29. Stroup JS, Wollmer RD (1992) A fuel management model for the airline industry. Oper Res 40(2):229–237
30. Subramanian R, Scheff RP Jr, Quillinan JD, Wiper DS, Marsten RE (1994) Coldstart: Fleet assignment at delta air lines. Interfaces 24(1):104–120
31. Tak TC, Hersh M (1993) A model for dynamic airline seat inventory control with multiple seat bookings. Transport Sci 27(3):252–265
32. Talluri KT (1993) Swapping applications in a daily airline fleet assignment. Transport Sci 30(3):237–248


33. Thengvall BT, Bard JF, Yu G (2000) Balancing user preferences for aircraft schedule recovery during airline irregular operations. IIE Trans Oper Eng 32(3):181–193
34. Vranas PB, Bertsimas DJ, Odoni AR (1994) The multiairport ground-holding problem in air traffic control. Oper Res 42(2):249–261
35. Wei G, Song G, Yu G (1997) Model and algorithm for crew management during airline irregular operations. J Combin Optim 1(3):80–97
36. Yan S, Tu Y (1997) Multifleet routing and multistop flight scheduling for schedule perturbation. Europ J Oper Res 103(1):155–169

Algorithmic Improvements Using a Heuristic Parameter, Reject Index for Interval Optimization

TIBOR CSENDES
University of Szeged, Szeged, Hungary

MSC2000: 65K05, 90C30

Article Outline
Keywords and Phrases
Introduction
Subinterval Selection
Multisection
Heuristic Rejection
References

Keywords and Phrases
Branch-and-bound; Interval arithmetic; Optimization; Heuristic parameter

Introduction

Interval optimization methods (cf. interval analysis: unconstrained and constrained optimization) have the guarantee not to lose global optimizer points. To achieve this, a deterministic branch-and-bound framework is applied. Still, heuristic algorithmic improvements may increase the convergence speed while keeping the guaranteed reliability. The indicator parameter called RejectIndex,
\[ p_{f^*}(X) = \frac{f^* - \underline{F}(X)}{\overline{F}(X) - \underline{F}(X)}, \]
was suggested by L.G. Casado as a measure of the closeness of the interval X to a global minimizer point [1]. It was first applied to improve the work load balance of global optimization algorithms.

A subinterval X of the search space with the minimal value of the inclusion function F(X) is usually considered as the best candidate to contain a global minimum. However, the larger the interval X, the larger the overestimation of the range f(X) on X compared to F(X). Therefore a box could be considered as a good candidate to contain a global minimum just because it is larger than the others. To compare subintervals of different sizes we normalize the distance between the global minimum value f* and F(X). The idea behind p_{f*} is that in general we expect the overestimation to be symmetric, i.e., the overestimation above f(X) is closely equal to the overestimation below f(X) for small subintervals containing a global minimizer point. Hence, for such intervals X the relative place of the global optimum value inside the F(X) interval should be high, while for intervals far from global minimizer points p_{f*} must be small. Obviously, there are exceptions, and there exists no theoretical proof that p_{f*} would be a reliable indicator of nearby global minimizer points. The value of the global minimum is not available in most cases. A generalized expression for a wider class of indicators is
\[ p(\hat f; X) = \frac{\hat f - \underline{F}(X)}{\overline{F}(X) - \underline{F}(X)}, \]

where the f̂ value is a kind of approximation of the global minimum. We assume that f̂ ∈ F(X), i.e., this estimation is realistic in the sense that f̂ is within the known bounds of the objective function on the search region. According to the numerical experience collected, we need a good approximation of the f* value to improve the efficiency of the algorithm.

Subinterval Selection

I. Among the possible applications of these indicators the most promising and straightforward is in the subinterval selection. The theoretical and computational properties of the interval branch-and-bound optimization have been investigated extensively [6,7,8,9]. The most important statements proved are the following, for algorithms with balanced subdivision direction selection:

1. Assume that the inclusion function of the objective function is isotone, it has the zero convergence property, and the p(f_k, Y) parameters are calculated with the f_k parameters converging to f̂ > f*, for which there exists a point x̂ ∈ X with f(x̂) = f̂. Then the branch-and-bound algorithm that selects that interval Y from the working list which has the maximal p(f_i, Z) value can converge to a point x̂ ∈ X for which f(x̂) > f*, i.e., to a point which is not a global minimizer point of the given problem.
2. Assume that the inclusion function of the objective function has the zero convergence property and f_k converges to f̂ < f*. Then the optimization branch-and-bound algorithm will produce an everywhere dense sequence of subintervals converging to each point of the search region X regardless of the objective function value.
3. Assume that the inclusion function of the objective function is isotone and has the zero convergence property. Consider the interval branch-and-bound optimization algorithm that uses the cutoff test, the monotonicity test, the interval Newton step, and the concavity test as accelerating devices, and that selects as the next leading interval that interval Y from the working list which has the maximal p(f_i, Z) value. A necessary and sufficient condition for the convergence of this algorithm to a set of global minimizer points is that the sequence {f_i} converges to the global minimum value f*, and there exist at most a finite number of f_i values below f*.
4. If our algorithm applies the interval selection rule of maximizing the p(f*, X) = p_{f*}(X) values for the members of the list L (i.e., if we can use the known exact global minimum value), then the algorithm converges exclusively to global minimizer points.
5. If our algorithm applies the interval selection rule of maximizing the p(f̃, X) values for the members of the list L, where f̃ is the best available upper bound for the global minimum, and its convergence to f* can be ensured, then the algorithm converges exclusively to global minimizer points.
6. Assume that for an optimization problem min_{x ∈ X} f(x) the inclusion function F(X) of f(x) is isotone and α-convergent with given positive constants α and C. Assume further that the p_{f*} parameter is less than 1 for all the subintervals of X. Then an arbitrarily large number N (> 0) of consecutive leading intervals of the basic B&B algorithm that selects the subinterval with the smallest lower bound as the next leading interval may have the following properties:
 i. None of these processed intervals contains a stationary point.
 ii. During this phase of the search the p_{f*} values are maximal for these intervals.
7. Assume that the inclusion function of the objective function is isotone and it has the zero convergence property. Consider the interval branch-and-bound optimization algorithm that uses the cutoff test, the monotonicity test, the interval Newton step, and the concavity test as accelerating devices, and that selects as the next leading interval that interval Y from the working list which has the maximal p(f_k, Z) value.
 i. The algorithm converges exclusively to global minimizer points if
\[ \underline{f}_k \leq f_k < \delta\,(\tilde f_k - \underline{f}_k) + \underline{f}_k \]
 holds for each iteration number k, where 0 < δ < 1.
 ii. The above condition is sharp in the sense that δ = 1 allows convergence to nonoptimal points.
Here f̲_k = min{F̲(Y^l); l = 1, …, |L_k|} ≤ f_k < f̃_k, where |L| stands for the cardinality of the elements of the list L.

II. These theoretical results are in part promising (e.g., 7), in part disappointing (5 and 6). The conclusions of the detailed numerical comparisons were that if the global minimum value is known, then the use of the p_{f*} parameter in the described way can accelerate the interval optimization method by orders of magnitude, and this improvement is especially strong for hard problems. In case the global minimum value is not available, its estimation f_k, which fulfills the conditions of 7, can be utilized with similar efficacy; again the best results were achieved on difficult problems.
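A minimal Python sketch of this interval selection rule follows (the names are hypothetical; lo and up stand for F̲(X) and F̄(X)):

```python
def reject_index(f_hat, lo, up):
    """p(f_hat; X) = (f_hat - lo)/(up - lo) for the inclusion F(X) = [lo, up]."""
    return (f_hat - lo) / (up - lo)

def select_leading_box(work_list, f_hat):
    """Return the work-list entry (box, lo, up) with maximal RejectIndex."""
    return max(work_list, key=lambda entry: reject_index(f_hat, entry[1], entry[2]))
```

For instance, with `work_list = [("X1", -3.0, 1.0), ("X2", -2.5, -1.0)]` and `f_hat = -2.4`, the second box is selected because −2.4 lies much higher inside its inclusion interval.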


Multisection

I. The multisection technique is a way to accelerate branch-and-bound methods by subdividing the actual interval into several subintervals in a single algorithm step. In the extreme case half of the function evaluations can be saved [5,10]. On the basis of the RejectIndex value of a given interval it is decided whether simple bisection or two higher-degree multisections are to be applied [2,11]. Two threshold values, 0 < P1 < P2 < 1, are used for selecting the proper multisection type. This algorithmic improvement can also be cheated in the sense that there exist global optimization problems for which the new method will follow, for an arbitrarily long number of iterations, an embedded interval sequence that contains no global minimizer point, or for which intervals in which there is a global minimizer have misleading indicator values. According to the numerical tests, the new multisection strategies result in a substantial decrease both in the number of function evaluations and in the memory complexity.

II. The multisection strategy can also be applied to constrained global optimization problems [11]. The feasibility degree index for the constraint g_j(x) ≤ 0 can be formulated as
\[ pu_{G_j}(X) = \min\left\{ \frac{-\underline{G}_j(X)}{w(G_j(X))},\ 1 \right\}. \]
Notice that if pu_{G_j}(X) < 0, then the box is certainly infeasible, and if pu_{G_j}(X) = 1 then X certainly satisfies the constraint. Otherwise, the box is undetermined for that constraint. For boxes that are not certainly infeasible, i.e., for which pu_{G_j}(X) ≥ 0 for all j = 1, …, r holds, the total infeasibility index is given by
\[ pu(X) = \prod_{j=1}^{r} pu_{G_j}(X). \]
We must only define the index for such boxes since certainly infeasible boxes are immediately removed by the algorithm from further consideration. With this definition,
– pu(X) = 1 ⟺ X is certainly feasible, and
– pu(X) ∈ [0, 1) ⟺ X is undetermined.
Using the pu(X) index, we now propose the following modification of the RejectIndex for constrained problems:
\[ pup(\hat f; X) = pu(X) \cdot p(\hat f; X), \]
where f̂ is a parameter of this indicator, which is usually an approximation of f*. This new index works like p(f̂; X) if X is certainly feasible, but if the box is undetermined, then it takes the feasibility degree of the box into account: the less feasible the box is, the lower the value of pu(X) is.

A careful theoretical analysis proved that the new interval selection and multisection rules enable the branch-and-bound interval optimization algorithm to converge to a set of global optimizer points assuming we have a proper sequence of {f_k} parameter values. The convergence properties obtained were very similar to those proven for the unconstrained case, and they give a firm basis for computational implementation. A comprehensive numerical study on standard global optimization test problems and on facility location problems indicated [11] that the constrained version interval selection rules and, to a lesser extent, also the new adaptive multisection rules have several advantageous features that can contribute to the efficiency of the interval optimization techniques.

Heuristic Rejection

The RejectIndex can also be used to improve the efficiency of interval global optimization algorithms on very hard to solve problems by applying a rejection strategy to get rid of subintervals not containing global minimizer points. This heuristic rejection technique selects those subintervals on the basis of a typical pattern of changes in the p_{f*} values [3,4].

The RejectIndex is not always reliable: assume that the inclusion function F(X) of f(x) is isotone and α-convergent. Assume further that the RejectIndex parameter p_{f*} is less than 1 for all the subintervals of X. Then an arbitrarily large number N (> 0) of consecutive leading intervals may have the following properties:
 i. none of these processed intervals contains a stationary point, and
 ii. during this phase of the search the p_{f*} values are maximal for these intervals as compared with the subintervals of the current working list.
Also, when a global optimization problem has a unique global minimizer point x*, there always exists an isotone and α-convergent inclusion function F(X) of f(x) such that the new algorithm does not converge to x*.


In spite of the possibility of losing the global minimum, there obviously exist implementations that allow heuristic rejection to be used in a safe way. For example, the selected subintervals can be saved on a hard disk for further processing if necessary. Although the above theoretical results were not encouraging, the computational tests on very hard global optimization problems were convincing: when the whole list of subintervals produced by the B&B algorithm is too large for the given computer memory, the use of the suggested heuristic rejection technique decreases the number of working-list elements without missing the global minimum. The new rejection test may also make it possible to solve hard-to-solve problems that are otherwise unsolvable with the usual techniques.

References
1. Casado LG, García I (1998) New load balancing criterion for parallel interval global optimization algorithms. In: Proceedings of the 16th IASTED International Conference on Applied Informatics, Garmisch-Partenkirchen, pp 321–323
2. Casado LG, García I, Csendes T (2000) A new multisection technique in interval methods for global optimization. Computing 65:263–269
3. Casado LG, García I, Csendes T (2001) A heuristic rejection criterion in interval global optimization algorithms. BIT 41:683–692
4. Casado LG, García I, Csendes T, Ruiz VG (2003) Heuristic rejection in interval global optimization. JOTA 118:27–43
5. Csallner AE, Csendes T, Markót MC (2000) Multisection in interval branch-and-bound methods for global optimization I. Theoretical results. J Global Optim 16:371–392
6. Csendes T (2001) New subinterval selection criteria for interval global optimization. J Global Optim 19:307–327
7. Csendes T (2003) Numerical experiences with a new generalized subinterval selection criterion for interval global optimization. Reliab Comput 9:109–125
8. Csendes T (2004) Generalized subinterval selection criteria for interval global optimization. Numer Algorithms 37:93–100
9. Kreinovich V, Csendes T (2001) Theoretical justification of a heuristic subbox selection criterion for interval global optimization. CEJOR 9:255–265
10. Markót MC, Csendes T, Csallner AE (2000) Multisection in interval branch-and-bound methods for global optimization II. Numerical tests. J Global Optim 16:219–228
11. Markót MC, Fernández J, Casado LG, Csendes T (2006) New interval methods for constrained global optimization. Math Program 106:287–318


Algorithms for Genomic Analysis

EVA K. LEE, KAPIL GUPTA
Center for Operations Research in Medicine and HealthCare, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA

MSC2000: 90C27, 90C35, 90C11, 65K05, 90-08, 90-00

Article Outline
Abstract
Introduction
Phylogenetic Analysis
  Methods Based on Pairwise Distance
  Parsimony Methods
  Maximum Likelihood Methods
Multiple Sequence Alignment
  Scoring Alignment
  Alignment Approaches
  Progressive Algorithms
  Graph-Based Algorithms
  Iterative Algorithms
Novel Graph-Theoretical Genomic Models
  Definitions
  Construction of a Conflict Graph from Paths of Multiple Sequences
  Complexity Theory
  Special Cases of MWCMS
  Computational Models: Integer Programming Formulation
Summary
Acknowledgement
References

Abstract

The genome of an organism not only serves as its blueprint, holding the key for diagnosing and curing diseases, but also plays a pivotal role in obtaining a holistic view of its ancestry. Recent years have witnessed a large number of innovations in this field, as exemplified by the Human Genome Project. This chapter provides an overview of popular algorithms used in genome analysis and in particular explores two important and deeply interconnected problems: phylogenetic analysis and multiple sequence alignment. We also describe our novel graph-theoretical approach that encompasses a wide variety of genome sequence analysis problems within a single model.

Introduction

Genomics encompasses the study of the genome in human and other organisms. The rate of innovation in this field has been breathtaking over the last decade, especially with the completion of the Human Genome Project. The purpose of this chapter is to review some well-known algorithms that facilitate genome analysis. The material is presented in a way that is interesting both to specialists working in this area and to others; thus, this review includes a brief sketch of the algorithms to facilitate a deeper understanding of the concepts involved. The list of problems related to genomics is very extensive; hence, the scope of this chapter is restricted to the following two related problems: (1) phylogenetic analysis and (2) multiple sequence alignment. Readers interested in algorithms used in other fields of computational biology are referred to the reviews by Abbas and Holmes [1] and Blazewicz et al. [7].

Genome refers to the complete DNA sequence contained in the cell. The DNA sequence consists of the four nucleotides adenine (A), thymine (T), cytosine (C), and guanine (G). Associated with each DNA strand (sequence) is a complementary DNA strand of the same length. The strands are complementary in that each nucleotide in one strand uniquely defines an associated nucleotide in the other: A and T are always paired, and C and G are always paired. Each pairing is referred to as a base pair, and bound complementary strands make up a DNA molecule. Typically, the number of base pairs in a DNA molecule is between thousands and billions, depending on the complexity of a given organism. For example, a bacterium contains about 600,000 base pairs, while human and mouse have some three billion base pairs. Among humans, 99.9% of base pairs are the same between any two unrelated persons, but that still leaves millions of single-letter differences, which provide the genetic variation between people.

Understanding the DNA sequence is extremely important. It is considered the blueprint for an organism's structure and function. The sequence order underlies all of life's diversity, even dictating whether an organism is human or another species such as yeast or

a fruit fly. It helps in understanding the evolution of mankind, identifying genetic diseases, and creating new approaches for treating and controlling those diseases. In order to achieve these goals, research in genome analysis has progressed rapidly over the last decade. The rest of this chapter is organized as follows. Section "Phylogenetic Analysis" discusses techniques used to infer the evolutionary history of species, and Sect. "Multiple Sequence Alignment" presents the multiple sequence alignment problem and recent advances. In Sect. "Novel Graph-Theoretical Genomic Models", we describe our research effort for advancing genomic analysis through the design of a novel graph-theoretical approach for representing a wide variety of genomic sequence analysis problems within a single model. We summarize our theoretical findings and present computational models based on two integer programming formulations. Finally, Sect. "Summary" summarizes the interdependence and the pivotal role played by the above-mentioned two problems in computational biology.

Phylogenetic Analysis

Phylogenetic analysis is a major aspect of genome research. It refers to the study of evolutionary relationships of a group of organisms. These hierarchical relationships among organisms arising through evolution are usually represented by a phylogenetic tree (Fig. 1). The idea of using trees to represent evolution dates back to Darwin. Both rooted and unrooted tree representations have been used in practice [17]. The branches of a tree represent the time of divergence and the root represents the ancestral sequence (Fig. 2).

Algorithms for Genomic Analysis, Figure 1 An example of an evolutionary tree
Algorithms for Genomic Analysis, Figure 2 Tree terminology

The study of phylogenies and processes of evolution by the analysis of DNA or amino acid sequence data is called molecular phylogenetics. In this study, we will focus on methods that use DNA sequence data. There are two processes involved in inferring both rooted and unrooted trees. The first is estimating the branching structure, or topology, of the tree. The second is estimating the branch lengths for a given tree. Currently, a wide variety of methods is available to conduct this analysis [16,19,55,79]. The available approaches can be classified into three broad groups: (1) distance methods; (2) parsimony methods; and (3) maximum likelihood methods. Below, we discuss each of them in detail.

Methods Based on Pairwise Distance

In distance methods, an evolutionary distance $d_{ij}$ is computed between each pair i, j of sequences, and a phylogenetic tree is constructed from these pairwise distances. There are many different ways of defining the pairwise evolutionary distance used for this purpose. Most of the approaches estimate the number of nucleotide substitutions per site, but other measures have also been used [70,71]. The most popular one is the Jukes–Cantor distance [37], which defines

$$ d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}f\right), $$

where f is the fraction of sites where the nucleotides differ in the pairwise alignment. There are a large number of distance methods for constructing evolutionary trees [78]. In this article, we discuss methods based on cluster analysis and neighbor joining.
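As a small illustration (our sketch, not code from the chapter), the Jukes–Cantor distance can be computed directly from an aligned pair of sequences:

# Jukes-Cantor distance between two aligned DNA sequences.
import math

def jukes_cantor_distance(s1: str, s2: str) -> float:
    """d = -(3/4) ln(1 - (4/3) f), where f is the observed fraction
    of differing sites in the pairwise alignment."""
    assert len(s1) == len(s2)
    f = sum(a != b for a, b in zip(s1, s2)) / len(s1)
    if f >= 0.75:                 # correction undefined for f >= 3/4
        return float("inf")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * f)

print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))  # one differing site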


Cluster Analysis: Unweighted Pair Group Method Using Arithmetic Averages

The conceptually simplest and most widely known distance method is the unweighted pair group method using arithmetic averages (UPGMA), developed by Sokal and Michener [66]. Given a matrix of pairwise distances between each pair of sequences, it starts by assigning each sequence to its own cluster. The distances between clusters are defined as

$$ d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d(p, q), $$

where $C_i$ and $C_j$ denote the sequences in clusters i and j, respectively. At each stage in the process, the least distant pair of clusters is merged to create a new cluster. This process continues until only one cluster is left. Given n sequences, the general schema of UPGMA is shown in Algorithm 1.

Algorithm 1 (UPGMA)
1. Input: Distance matrix $d_{ij}$, $1 \le i, j \le n$
2. For i = 1 to n do
3.   Define a singleton cluster $C_i$ comprising sequence i
4.   Place cluster $C_i$ as a tree leaf at height zero
5. End for
6. Repeat
7.   Determine the two clusters i, j such that $d_{ij}$ is minimal
8.   Merge these two clusters to form a new cluster k, having a distance from other clusters defined as the weighted average of the two comprising clusters: if $C_k$ is the union of clusters $C_i$ and $C_j$, and $C_l$ is any other cluster, then $d_{kl} = \dfrac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$
9.   Define a node k at height $d_{ij}/2$ with daughter nodes i and j
10. Until just a single cluster remains
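A minimal Python sketch of Algorithm 1 (ours, not code from the chapter; the example data are illustrative):

# UPGMA: repeatedly merge the two closest clusters, maintaining
# inter-cluster distances as weighted averages (step 8).
def upgma(d, labels):
    """d: symmetric pairwise distance matrix (list of lists).
    Returns a nested tuple ((left, right), height)."""
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    nxt = len(labels)
    while len(clusters) > 1:
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        node = ((ti, tj), dij / 2.0)       # new node at height d_ij / 2
        for k in clusters:
            dik = dist.pop((min(i, k), max(i, k)))
            djk = dist.pop((min(j, k), max(j, k)))
            dist[(k, nxt)] = (dik * ni + djk * nj) / (ni + nj)
        del dist[(i, j)]
        clusters[nxt] = (node, ni + nj)
        nxt += 1
    return next(iter(clusters.values()))[0]

d = [[0, 4, 8], [4, 0, 8], [8, 8, 0]]
print(upgma(d, ["S1", "S2", "S3"]))  # S1 and S2 merge first, at height 2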

The time and space complexity of UPGMA is $O(n^2)$, since there are n − 1 iterations, each of complexity O(n). A number of approaches motivated by UPGMA have been developed. Li [52] developed a similar approach which also makes corrections for unequal rates of evolution among lineages. Klotz and Blanken [43] presented a method where a present-day sequence serves as an ancestor in order to determine the tree regardless of the rates of evolution of the sequences involved.


Neighbor Joining

Neighbor joining is another very popular algorithm based on pairwise distances [63]. This approach yields an unrooted tree and overcomes the assumption of the UPGMA method that the same rate of evolution applies to each branch. Given a matrix of pairwise distances $d_{ij}$ between each pair of sequences, it first defines the modified distance matrix $\bar{d}_{ij}$. This matrix is calculated by subtracting the average distances to all other sequences from $d_{ij}$, thus compensating for long edges. In each stage, the two nearest nodes (those with minimal $\bar{d}_{ij}$) are chosen and defined as neighbors in the tree. This is done recursively until all of the nodes are paired together. Given n sequences, the general schema of neighbor joining is shown in Algorithm 2.

Algorithm 2 (Neighbor joining)
1. Input: Distance matrix $d_{ij}$, $1 \le i, j \le n$
2. For i = 1 to n
3.   Assign sequence i to the set of leaf nodes of the tree (T)
4. End for
5. Set the list of active nodes (L) = T
6. Repeat
7.   Calculate the modified distance matrix $\bar{d}_{ij} = d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L|-2}\sum_{k \in L} d_{ik}$
8.   Find the pair i, j in L having the minimal value of $\bar{d}_{ij}$
9.   Define a new node u and set $d_{uk} = \frac{1}{2}(d_{ik} + d_{jk} - d_{ij})$ for all k in L
10.  Add u to T, joining nodes i, j with edges of length given by $d_{iu} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{ju} = d_{ij} - d_{iu}$
11.  Remove i and j from L and add u
12. Until only two nodes remain in L
13. Connect the remaining two nodes i and j by a branch of length $d_{ij}$

Neighbor joining has an execution time of $O(n^2)$, like UPGMA. It has given extremely good results in practice and is computationally efficient [63,72]. Many practitioners have developed algorithms based on this approach. Gascuel [24] improved the neighbor-joining approach by using a simple first-order model of the variances and covariances of evolutionary distance estimates. Bruno et al. [10] developed a weighted neighbor joining using a likelihood-based approach. Goeffon et al. [25] investigated a local search algorithm under the maximum parsimony criterion by introducing a new subtree swapping neighborhood with an effective array-based tree representation.

Parsimony Methods

In science, the notion of parsimony refers to the preference of simpler hypotheses over complicated ones. In the parsimony approach for tree building, the goal is to identify the phylogeny that requires the fewest necessary changes to explain the differences among the observed sequences. Of the existing numerical approaches for reconstructing ancestral relationships directly from sequence data, this approach is the most popular one. Unlike distance-based methods, which build trees, it evaluates all possible trees and gives each a score based on the number of evolutionary changes that are needed to explain the observed sequences. The most parsimonious tree is the one that requires the fewest evolutionary changes for all sequences to derive from a common ancestor [69]. As an example, consider the trees in Fig. 3 and Fig. 4. The tree in Fig. 3 requires only one evolutionary change (marked by the star) compared with the tree in Fig. 4, which requires two changes. Thus, Fig. 3 shows the more parsimonious tree.

Algorithms for Genomic Analysis, Figure 3 Parsimony tree 1
Algorithms for Genomic Analysis, Figure 4 Parsimony tree 2
Algorithms for Genomic Analysis, Figure 5 The sets $R_k$ for the first site of the given three sequences

There are two distinct components in parsimony methods: given a labeled tree, determine the score; and determine the global minimum score by evaluating all possible trees. Both are discussed below.

Score Computation

Given a set of nucleotide sequences, parsimony methods treat each site (position) independently. The algorithm evaluates the score at each position and then sums the scores over all the positions. As an example, suppose we have the following three aligned nucleotide sequences:

CCC
GGC
CGC

Then, for a given tree topology, we would calculate the minimal number of changes required at each of the three sites and then sum them up. Here, we investigate a traditional parsimony algorithm developed by Fitch [21], where the number of substitutions required is taken as the score. For a particular topology, this approach starts by placing nucleotides at the leaves and traverses toward the root of the tree. At each node, the set of nucleotides common to all of the descendant nodes is placed. If this set is empty, then the union set is placed at this node. This continues until the root of the tree is reached. The number of union sets equals the number of substitutions required. The general scheme for every position is shown in Algorithm 3.

Algorithm 3 (Parsimony: score computation)
1. Each leaf l is labeled with the set $R_l$ containing the observed nucleotide at that position
2. Score S = 0
3. For all internal nodes k with children i and j having labels $R_i$ and $R_j$ do
4.   $R_k = R_i \cap R_j$

5.   if $R_k = \emptyset$ then
6.     $R_k = R_i \cup R_j$
7.     S = S + 1
8.   end if
9. End for
10. Minimal score = S
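A compact recursive sketch of this score computation (ours, not code from the chapter; a tree is a leaf nucleotide given as a string, or a (left, right) pair):

# Fitch's parsimony score for one site.
def fitch(tree):
    """Returns (R, score): the root label set and the number of union
    events, i.e. the minimal number of substitutions."""
    if isinstance(tree, str):          # leaf: R_l = observed nucleotide
        return {tree}, 0
    (Ri, si), (Rj, sj) = fitch(tree[0]), fitch(tree[1])
    Rk = Ri & Rj
    if Rk:
        return Rk, si + sj
    return Ri | Rj, si + sj + 1        # empty intersection: take union

# First site of CCC, GGC, CGC with S1 and S3 grouped together:
print(fitch((("C", "C"), "G")))        # ({'C', 'G'}, 1), as in Fig. 5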

Figure 5 shows the set Rk obtained by Algorithm 3. The computation is done for the first site of the three sequences shown above. The minimal score given by the algorithm is 1. A wide variety of approaches have been developed by modifying Fitch’s algorithm [68]. Sankoff and Cedergren [64] presented a generalized parsimony method which does not just count the number of substitutions, but also assigns a weighted cost for each substitution. Ronquist [62] improved the computational time by including strategies for rapid evaluation of tree lengths and increasing the exhaustiveness of branch swapping while searching topologies. Search of Possible Tree Topologies The number of possible tree topologies dramatically increases with the number of sequences. Consequently, in practice usu-


ally only a subset of them is examined using efficient search strategies. The most commonly used strategy is branch and bound, applied to select branching patterns [60]. For large-scale problems, heuristic methods are typically used [69]. These exact and heuristic tree search strategies are implemented in various programs such as PHYLIP (phylogeny inference package) and MEGA (molecular evolutionary genetic analysis) [20,47].

Maximum Likelihood Methods

The method of maximum likelihood is one of the most popular statistical tools used in practice. In molecular phylogenetics, maximum likelihood methods find the tree that has the highest probability of generating the observed sequences, given an explicit model of evolution. The method was first introduced by Felsenstein [18]. We discuss herein both the evolution models and the calculation of the tree likelihood.

Model of Evolution

A model of evolution refers to various events, like mutation, which change one sequence into another over a period of time. It is required in order to determine the probability of a sequence S² arising from an ancestral sequence S¹ over a period of time t. Various sophisticated models of evolution have been suggested, but simple models like the Jukes–Cantor model are preferred in maximum likelihood methods. The Jukes–Cantor [37] model assumes that all nucleotides (A, C, T, G) undergo mutation with equal probability and change to each of the other three possible nucleotides with the same probability. If the mutation rate is 3α per unit time per site, the mutation matrix $P_{ij}$ (the probability that nucleotide i changes to nucleotide j in unit time) takes the form

$$ P = \begin{pmatrix}
1-3\alpha & \alpha & \alpha & \alpha \\
\alpha & 1-3\alpha & \alpha & \alpha \\
\alpha & \alpha & 1-3\alpha & \alpha \\
\alpha & \alpha & \alpha & 1-3\alpha
\end{pmatrix}. $$

The above matrix is integrated to evaluate mutation rates over a time t, and is then used to calculate $P(n_{t_2} \mid n_{t_1}, t)$, defined as the probability of nucleotide $n_{t_1}$ being substituted by nucleotide $n_{t_2}$ over time t.
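Under the Jukes–Cantor model this integration has a well-known closed form; the following sketch (ours, not code from the chapter) builds the resulting substitution matrix:

# P(same) = 1/4 + (3/4) e^{-4 alpha t};  P(diff) = 1/4 - (1/4) e^{-4 alpha t}
import math

def jc_transition_matrix(alpha, t):
    same = 0.25 + 0.75 * math.exp(-4.0 * alpha * t)
    diff = 0.25 - 0.25 * math.exp(-4.0 * alpha * t)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

# Each row sums to 1; as t grows, every entry tends to 1/4.
print(jc_transition_matrix(0.1, 1.0)[0])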

Algorithms for Genomic Analysis, Figure 6 A simple tree

Various other evolution models, like the Kimura model, have also been mentioned in the literature [9,42].

Likelihood of a Tree

The likelihood of a tree is calculated as the probability of observing a set of sequences given the tree, L(tree) = P(sequences | tree). We begin with the simple case of two sequences S¹ and S² of length n having a common ancestor a, as shown in Fig. 6. It is assumed that all sites (positions) evolve independently, and thus the total likelihood is calculated as the product of the likelihoods of all sites [15]. Here, the likelihood of each site is obtained using substitution probabilities based on an evolution model. Given that $q_a$ is the equilibrium distribution of nucleotide a, the likelihood for the simple tree in Fig. 6 is calculated as

$$ L(\text{tree}) = P(S^1, S^2) = \prod_{i=1}^{n} P(S^1_i, S^2_i), \qquad P(S^1_i, S^2_i) = \sum_a q_a \, P(S^1_i \mid a) \, P(S^2_i \mid a). $$

To generalize this approach for m sequences, it is assumed that diverged sequences evolve independently after diverging. Hence, the likelihood for every node in a tree depends only on its immediate ancestral node, and a recursive procedure is used to evaluate the likelihood of the tree. The conditional likelihood $L_{k,a}$ is defined as the likelihood of the subtree rooted at node k, given that the nucleotide at node k is a. The general schema for every site is shown in Algorithm 4. The likelihood is then maximized over all possible tree topologies and branch lengths.


Algorithm 4 (Likelihood: computation at a given site)
1. For all leaves l do
2.   if leaf l has nucleotide a at that site then
3.     $L_{l,a} = 1$
4.   else
5.     $L_{l,a} = 0$
6.   end if
7. End for
8. For all internal nodes k with children i and j do
9.   define the conditional likelihood $L_{k,a} = \left[\sum_b P(b \mid a) L_{i,b}\right]\left[\sum_c P(c \mid a) L_{j,c}\right]$
10. End for
11. Likelihood at the given site $= \sum_a q_a L_{\text{root},a}$
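A recursive sketch of Algorithm 4 (ours, not code from the chapter; a tree is a leaf nucleotide string or a (left, right) pair, and for brevity the same substitution matrix P is assumed on every edge):

import math

NUC = "ACGT"

def site_likelihood(tree, P, q):
    """P[a][b]: probability of observing b given ancestor a on an edge;
    q: equilibrium distribution over NUC."""
    def L(node):
        if isinstance(node, str):   # leaf: L_{l,a} = 1 iff a is observed
            return {a: 1.0 if a == node else 0.0 for a in NUC}
        Li, Lj = L(node[0]), L(node[1])
        # Step 9: L_{k,a} = (sum_b P(b|a) L_{i,b}) * (sum_c P(c|a) L_{j,c})
        return {a: sum(P[a][b] * Li[b] for b in NUC) *
                   sum(P[a][c] * Lj[c] for c in NUC) for a in NUC}
    Lroot = L(tree)
    return sum(q[a] * Lroot[a] for a in NUC)   # step 11

# Jukes-Cantor edge probabilities for alpha * t = 0.1 (see above):
same = 0.25 + 0.75 * math.exp(-0.4)
diff = 0.25 - 0.25 * math.exp(-0.4)
P = {a: {b: same if a == b else diff for b in NUC} for a in NUC}
q = {a: 0.25 for a in NUC}
print(site_likelihood((("A", "A"), "C"), P, q))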

Recent Improvements

The maximum likelihood approach has received great attention owing to the existence of powerful statistical tools. It has been made more sophisticated using advanced tree search algorithms, sequence evolution models, and statistical approaches. Yang [80] extended it to the case where the rates of nucleotide substitution differ over sites. Huelsenbeck and Crandall [34] incorporated the improvements in substitution models. Piontkivska [59] evaluated the use of various substitution models in the maximum likelihood approach and inferred that simple models are comparable in terms of both efficiency and reliability with complex models. The enormously large number of possible tree topologies, especially while working with a large number of sequences, makes this approach computationally intensive [72]. It has been proved that reconstructing the maximum likelihood tree is NP-hard, even for certain approximations [14]. In order to reduce the computational time, Guindon and Gascuel [31] developed a simple hill-climbing algorithm based on the maximum-likelihood principle that adjusts tree topology and branch lengths simultaneously. Recently, parallel computation has been used to address the huge computational requirements. Stamatakis et al. [67] have used OpenMP parallelization for symmetric multiprocessing machines, and Keane et al. [39] developed a distributed platform for phylogeny reconstruction by maximum likelihood.

Multiple Sequence Alignment

Multiple sequence alignment is arguably among the most studied and difficult problems in computational biology. It is a vital tool because it compactly represents conserved or variable features among the family members. Alignment also allows character-based analysis, compared with distance-based analysis, and thus helps to elucidate evolutionary relationships better. Consequently, it plays a pivotal role in a wide range of sequence analysis problems, such as identifying conserved motifs among given sequences, predicting secondary and tertiary structures of protein sequences, and molecular phylogenetic analysis. It is also used for sequence comparison to find the similarity of a new sequence with pre-existing ones. This helps in gathering information about the function and structure of newly found sequences from existing ones in databases like GenBank in the USA and EMBL in Europe.

The multiple sequence alignment problem can be stated formally as follows. Let $\Sigma$ be the alphabet and let $\hat{\Sigma} = \Sigma \cup \{-\}$, where "–" is a symbol used to represent gaps in sequences. For DNA sequences, the alphabet is $\hat{\Sigma} = \{A, T, C, G, -\}$. An alignment for N sequences $S^1, \ldots, S^N$ is given by a set $\hat{S} = \{\hat{S}^1, \ldots, \hat{S}^N\}$ of strings over the alphabet $\hat{\Sigma}$ which satisfies the following two properties: (1) the strings in $\hat{S}$ are all of the same length; (2) $S^i$ can be obtained from $\hat{S}^i$ by removing the gaps. Thus, an alignment in which each string $\hat{S}^i$ has length K can be interpreted as an alignment matrix of N rows and K columns, where row i corresponds to sequence $S^i$. Characters that are placed in the same column of the alignment matrix are said to be aligned with each other. Figure 7 shows two possible alignments for the given three sequences $S^1 = CCC$, $S^2 = CGGC$ and $S^3 = CGC$.

Algorithms for Genomic Analysis, Figure 7 Two possible alignments for given three sequences

For two sequences, an optimal alignment is easily obtained using dynamic programming (the Needleman–Wunsch algorithm).
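A compact dynamic-programming sketch of the pairwise case (ours, not code from the chapter; the match/mismatch/gap values are illustrative):

def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): F[i][0] = i * gap
    for j in range(1, m + 1): F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i-1] == t[j-1] else mismatch
            F[i][j] = max(F[i-1][j-1] + sub, F[i-1][j] + gap, F[i][j-1] + gap)
    # Traceback to recover one optimal alignment.
    a, b, i, j = "", "", n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           F[i][j] == F[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch):
            a, b, i, j = s[i-1] + a, t[j-1] + b, i - 1, j - 1
        elif i > 0 and F[i][j] == F[i-1][j] + gap:
            a, b, i = s[i-1] + a, "-" + b, i - 1
        else:
            a, b, j = "-" + a, t[j-1] + b, j - 1
    return F[n][m], a, b

print(needleman_wunsch("CCC", "CGGC"))  # optimal score and one alignment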


Unfortunately, the problem becomes much harder for more than two sequences, and the optimal solution can be found only for a limited number of sequences of moderate length (approximately 100) [8]. Researchers have tried to solve it by generalizing the dynamic programming approach to a multidimensional space. However, this approach has huge time and memory requirements and thus cannot be used in practice even for small problems of five sequences of length 100 each. This algorithm has been improved by identifying the portion of the hyperspace which does not contribute to the solution and excluding it from the computation [11]. But even this approach of Carrillo and Lipman, implemented in the program MSA, can only align up to ten sequences [53]. Although Gupta et al. [32] improved the space and time usage of this approach, it still cannot align large data sets. To reduce the huge time and memory expenses, a wide variety of heuristic approaches for multiple sequence alignment have been developed [56]. There are two components in finding a multiple sequence alignment: (1) searching over all the possible multiple alignments; (2) scoring each of them to find the best one. The problem becomes more complex for remotely related homologous sequences, i.e., sequences which are not derived from a common ancestor [28]. Numerous approaches have been proposed, but the quest for an approach which is both accurate and fast continues. It must be remembered that even the choice of sequences and the calculation of the score of an alignment are nontrivial tasks and an active research field in themselves.

Scoring Alignment

There is no unanimous way of characterizing an alignment as the correct one, and the strategy depends on the biological context. Different alignments are possible, and we never know for sure which alignment is correct. Thus, one scores every alignment according to an appropriate objective function, and alignments with higher scores are deemed to be better. A typical alignment scoring scheme consists of the following steps.

Independent Columns

The score of an alignment is calculated in terms of the columns of the alignment. The individual columns are assumed to be independent, and

thus the total score of an alignment is a simple summation over the column scores, $\text{score}(A) = \sum_j \text{score}(A_j)$, where $A_j$ is column j of the multiple alignment A. The score for every column j is then calculated as the "sum-of-pairs" function using the scoring matrices, $\text{score}(A_j) = \sum_{k<l} s(A_j^k, A_j^l)$, where s gives the pairwise score of two aligned characters.

αBB Algorithm

$$ L(x) = c^{\top}x + \sum_{i=1}^{bt} b_i w_{B_i} + \sum_{i=1}^{tt} t_i w_{T_i} + \sum_{i=1}^{ft} f_i w_{F_i} + \sum_{i=1}^{ftt} ft_i w_{FT_i} + \sum_i \widehat{UT}_i(x) + \sum_i C_i(x) + \sum_i \Big( NC_i(x) - \sum_j \alpha_{ij}(x_j - x_j^L)(x_j^U - x_j) \Big), \qquad (11) $$

where the notation is as defined for (2), $\widehat{UT}_i(x)$ denotes the linearization of the ith univariate concave term, and $C_i(x)$ is the ith convex term. The introduction of the new variables $w_{B_i}$, $w_{T_i}$, $w_{F_i}$ and $w_{FT_i}$ is accompanied by the addition of convex inequalities of the type given in (3), (4), (5) and (6). For the trilinear, fractional and fractional trilinear terms, the specific form of these equations depends on the sign of the term coefficients and the variable bounds. The form given by (11) can be used to construct convex underestimators for the objective function and the inequality constraints.

If a nonlinear equality constraint contains only linear, bilinear, trilinear, fractional and fractional trilinear terms, it can be replaced by the linear equality

$$ c^{\top}x + \sum_{i=1}^{bt} b_i w_{B_i} + \sum_{i=1}^{tt} t_i w_{T_i} + \sum_{i=1}^{ft} f_i w_{F_i} + \sum_{i=1}^{ftt} ft_i w_{FT_i} = 0, \qquad (13) $$

with the addition of convex inequalities of the type given by (3), (4), (5) and (6). If the nonlinear equality contains at least one convex, univariate concave or general nonconvex term, the convexification/relaxation strategy must first transform the equality constraint h(x) = 0 into a set of two equivalent inequality constraints,

$$ h(x) \le 0 \quad \text{and} \quad -h(x) \le 0, \qquad (14) $$

which can then be convexified and underestimated independently using (11).
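As an illustration of the α-based relaxation of a general nonconvex term (our sketch, not the authors' implementation; the α values are computed here with the scaled Gershgorin bound, with $d_i = x_i^U - x_i^L$, corresponding to Method II.1b in the computational results below):

# L(x) = f(x) - sum_j alpha_j (x_j - xL_j)(xU_j - x_j), with alpha from
# an interval Hessian [H_lo, H_hi] over the box [xL, xU].
def gershgorin_alphas(H_lo, H_hi, xL, xU):
    n = len(xL)
    d = [xU[j] - xL[j] for j in range(n)]
    alphas = []
    for i in range(n):
        off = sum(max(abs(H_lo[i][j]), abs(H_hi[i][j])) * d[j] / d[i]
                  for j in range(n) if j != i)
        alphas.append(max(0.0, -0.5 * (H_lo[i][i] - off)))
    return alphas

def alpha_underestimator(f, alphas, xL, xU):
    def L(x):
        relax = sum(a * (x[j] - xL[j]) * (xU[j] - x[j])
                    for j, a in enumerate(alphas))
        return f(x) - relax
    return L

# One-dimensional example: f(x) = -x^2 on [0, 1] has Hessian -2, so
# alpha = 1 and L(0.5) = -0.5 <= f(0.5) = -0.25.
f = lambda x: -x[0] ** 2
alphas = gershgorin_alphas([[-2.0]], [[-2.0]], [0.0], [1.0])
L = alpha_underestimator(f, alphas, [0.0], [1.0])
print(alphas, L([0.5]))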


The transformation of a nonconvex twice-differentiable problem into a convex lower bounding problem described in this section allows the generation of valid and increasingly tight lower bounds on the global optimum solution.

Branching Variable Selection

Once upper and lower bounds have been obtained for all the existing nodes of the branch and bound tree, the region with the smallest lower bound is selected for branching. The partitioning of the solution space can have a significant effect on the quality of the lower bounds obtained, because of the strong dependence of the convex underestimators described by (3)–(8) on the variable bounds. It is therefore important to identify the variables which contribute most to the separation between the original problem and the convex lower bounding problem at the current node. Several branching variable selection criteria have been designed for this purpose [1].

Least Reduced Axis Rule

The first strategy leads to the selection of the variable that has least been branched on to arrive at the current node. It is characterized by the largest ratio

$$ \frac{x_i^U - x_i^L}{x_{i,0}^U - x_{i,0}^L}, $$

where $x_{i,0}^L$ and $x_{i,0}^U$ are the lower and upper bounds on variable $x_i$ at the first node of the branch and bound tree, and $x_i^L$ and $x_i^U$ are the current lower and upper bounds on variable $x_i$. The main disadvantage of this simple rule is that it does not account for the specifics of the participation of each variable in the problem, and therefore cannot accurately identify the critical variables that determine the quality of the underestimators.

Term Measure

A more sophisticated rule is based on the computation of a term measure $\mu_{t_j}$ for term $t_j$, defined as

$$ \mu_{t_j} = t_j(x^*) - \breve{t}_j(x^*, w^*), \qquad (15) $$

where $t_j(x)$ is a bilinear, trilinear, fractional, fractional trilinear, univariate concave or general nonconvex term, $\breve{t}_j(x, w)$ is the corresponding convex underestimator, $x^*$ is the solution vector corresponding to the minimum of the convex lower bounding problem, and $w^*$ is the solution vector for the new variables at the minimum of the convex lower bounding problem. One of the variables participating in the term with the largest measure $\mu_{t_j}$ is selected for branching.

Variable Measure

A third strategy is based on a variable measure $v_i$ which is computed from the term measures $\mu_{t_j}$. For variable $x_i$, this measure is

$$ v_i = \sum_{j \in T_i} \mu_{t_j}, \qquad (16) $$

where $T_i$ is the set of terms in which $x_i$ participates. The variable with the largest measure $v_i$ is branched on.

Variable Bound Updates

The effect of the variable bounds on the convexification/relaxation procedure motivates the tightening of the variable bounds. However, the trade-off between tight underestimators generated at a large computational cost and looser underestimators obtained more rapidly must be taken into account when designing a variable bound update strategy. For this reason, one of several approaches can be adopted, depending on the degree of nonconvexity of the problem [1,3]:
– variable bound updates at the beginning of the algorithmic procedure only, or at each iteration;
– bound updates for all variables in the problem, or only for those variables that most affect the quality of the lower bounds, as measured by the variable measure $v_i$.
Two different techniques can be used to tighten the variable bounds. The first is based on the generation and solution of a series of convex optimization problems, while the second is an iterative procedure relying on the interval evaluation of the functions in the nonconvex NLP.

Optimization-Based Approach

In the optimization approach, a new lower or upper bound for variable $x_i$ is obtained by solving the convex problem


$$ \min_{x,w} \ (\text{or } \max_{x,w}) \quad x_i $$

subject to the convexified constraints of the current node and the additional requirement that the convex underestimator of the objective function does not exceed the current best upper bound $f^*$, with $x \in [x^L, x^U]$.

Interval-Based Approach

In the interval-based approach, the feasibility of the constraints over the current domain is tested through interval arithmetic. An inequality constraint $g(x) \le 0$ is infeasible in this domain if its range $[g^L, g^U]$, computed so that $g(x) \in [g^L, g^U]$, $\forall x \in [x^L, x^U]$, is such that $g^L > 0$. Similarly, an equality constraint h(x) = 0 is infeasible in this domain if its range $[h^L, h^U]$, computed so that $h(x) \in [h^L, h^U]$, $\forall x \in [x^L, x^U]$, is such that $0 \notin [h^L, h^U]$. The variable bounds are updated based on the feasibility of the constraints in the original problem and the additional constraint that the objective function should be less than or equal to the current best upper bound $f^*$. The feasible region is therefore defined as

$$ \{ x \in [x^L, x^U] : g(x) \le 0,\ h(x) = 0,\ f(x) \le f^* \}. $$

In general, the interval-based bound update strategy is less computationally expensive than the optimization-based approach. However, at the beginning of the branch and bound search, when the bound updates are most critical and the variable ranges are widest, the overestimations inherent in interval computations often lead to looser updated bounds in the interval-based approach than in the optimization-based technique.
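A tiny sketch of this interval feasibility test (ours; the interval operations shown are only the minimal ones needed for the example):

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    ps = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(ps), max(ps))

def infeasible(g_range=None, h_range=None):
    if g_range is not None and g_range[0] > 0:          # g(x) <= 0 violated
        return True
    if h_range is not None and not (h_range[0] <= 0 <= h_range[1]):
        return True                                     # h(x) = 0 violated
    return False

# g(x1, x2) = x1 * x2 - 1 on the subdomain [2, 3] x [1, 2]:
X1, X2 = (2.0, 3.0), (1.0, 2.0)
print(infeasible(g_range=iadd(imul(X1, X2), (-1.0, -1.0))))  # True: discard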

Algorithmic Procedure Based on the developments presented in previous sections, the procedure for the ˛BB algorithm can be summarized by the following pseudocode:


PROCEDURE αBB algorithm()
  Decompose the functions in the problem;
  Set tolerance ε;
  Set $\underline{f} = \underline{f}^0 = -\infty$ and $\bar{f} = \bar{f}^0 = +\infty$;
  Initialize the list of lower bounds $\{\underline{f}^0\}$;
  DO WHILE $\bar{f} - \underline{f} > \varepsilon$
    Select the node k with the smallest lower bound, $\underline{f}^k$, from the list of lower bounds;
    Set $\underline{f} = \underline{f}^k$;
    (Optional) Update the variable bounds for the current node using the optimization or interval approach;
    Select the branching variable;
    Partition to create new nodes;
    DO for each new node i
      Generate the convex lower bounding NLP:
        introduce new variables and constraints;
        linearize univariate concave terms;
        compute interval Hessian matrices;
        compute α values;
      Find the solution $\underline{f}^i$ of the convex lower bounding NLP;
      IF infeasible or $\underline{f}^i > \bar{f} + \varepsilon$
        Fathom node;
      ELSE
        Add $\underline{f}^i$ to the list of lower bounds;
        Find a solution $\bar{f}^i$ of the nonconvex NLP;
        IF $\bar{f}^i < \bar{f}$
          Set $\bar{f} = \bar{f}^i$;
    OD;
  OD;
  RETURN($\bar{f}$ and the variable values at the corresponding node);
END αBB algorithm;

A pseudocode for the αBB algorithm

Computational Experience Significant computational experience with the ˛BB algorithm has been acquired through the solution of a wide variety of problems involving different types of nonconvexities and up to 16000 variables [1,2, 3,4,6,9,12]. These include problems such as pooling/blending, design of reactor networks, design of batch plants under uncertainty [9], stability studies belonging to the class of generalized geometric program-

˛BB Algorithm, Figure 1 Simplified alkylation process flowsheet

ming problems, characterization of phase equilibrium using activity coefficient models, identification of stable molecular conformations, and the determination of all solutions of systems of nonlinear equations. In order to illustrate the performance of the algorithm and the importance of variable bound updates, a medium-size example is presented. The objective is to maximize the profit for the simplified alkylation process presented in [7] and shown in Fig. 1. An olefin feed (100% butene), a pure isobutane recycle and a 100% isobutane make-up stream are introduced into a reactor together with an acid catalyst. The reactor product stream is then passed through a fractionator, where the isobutane and the alkylate product are separated. The spent acid is also removed from the reactor. The formulation used here includes 7 variables and 16 constraints, 12 of which are nonlinear. The variables are defined as follows: x1 is the olefin feed rate in barrels per day; x2 is the acid addition rate in thousands of pounds per day; x3 is the alkylate yield in barrels per day; x4 is the acid strength (weight percent); x5 is the motor octane number; x6 is the external isobutane-to-olefin ratio; x7 is the F-4 performance number. The profit maximization problem is then expressed as:

$$ \text{Profit} = -\min\ \left( 1.715x_1 + 0.035x_1x_6 + 4.0565x_3 + 10.0x_2 - 0.063x_3x_5 \right) $$

subject to:

$0.0059553571x_6^2x_1 + 0.88392857x_3 - 0.1175625x_6x_1 - x_1 \le 0$;
$1.1088x_1 + 0.1303533x_1x_6 - 0.0066033x_1x_6^2 - x_3 \le 0$;
$6.66173269x_6^2 + 172.39878x_5 - 56.596669x_4 - 191.20592x_6 \le 10000$;
$1.08702x_6 + 0.32175x_4 - 0.03762x_6^2 - x_5 \le -56.85075$;
$0.006198x_7x_4x_3 + 2462.3121x_2 - 25.125634x_2x_4 - x_3x_4 \le 0$;
$161.18996x_3x_4 + 5000.0x_2x_4 - 489510.0x_2 - x_3x_4x_7 \le 0$;
$0.33x_7 - x_5 + 44.333333 \le 0$;
$0.022556x_5 - 0.007595x_7 \le 1$;
$0.00061x_3 - 0.0005x_1 \le 1$;
$0.819672x_1 - x_3 + 0.819672 \le 0$;
$24500.0x_2 - 250.0x_2x_4 - x_3x_4 \le 0$;
$1020.4082x_4x_2 + 1.2244898x_3x_4 - 100000x_2 \le 0$;
$6.25x_1x_6 + 6.25x_1 - 7.625x_3 \le 100000$;
$1.22x_3 - x_6x_1 - x_1 + 1 \le 0$;
$1500 \le x_1 \le 2000$; $1 \le x_2 \le 120$; $3000 \le x_3 \le 3500$; $85 \le x_4 \le 93$; $90 \le x_5 \le 95$; $3 \le x_6 \le 12$; $145 \le x_7 \le 162$.

The maximum profit is $1772.77 per day, and the optimal variable values are x1 = 1698.18, x2 = 53.66, x3 = 3031.30, x4 = 90.11, x5 = 95.00, x6 = 10.50, x7 = 153.53. In this example, variable bound tightening is performed using the optimization-based approach. An update of all the variable bounds therefore involves the solution of 14 convex NLPs. The computational cost is significant and may not always be justified by the corresponding decrease in the number of iterations. Two extreme tightening strategies were used to illustrate this trade-off: an update of all variable bounds at the onset of the algorithm only ('Single Up'), or an update of all bounds at each iteration of the αBB algorithm ('One Up/Iter'). An intermediate strategy might involve bound updates for those variables that affect the underestimators most significantly, or bound updates at only a few levels of the branch and bound tree. The results of runs performed on an HP9000/730 are summarized in the table below. $t_U$ denotes the percentage of CPU time devoted to the construction of the convex underestimating problem. Although the approach relying most heavily on variable bound updates results in tighter underestimators, and hence a smaller number of iterations, the time requirements per iteration are significantly larger than when no bound updates are performed. Thus, the overall CPU requirements often increase when all variable bounds are updated at each iteration.

Meth    Single Up                      One Up/Iter
        Iter.  CPU (s)  tU (%)         Iter.  CPU (s)  tU (%)
I.1     74     37.5     0.5            31     41.6     0.0
I.2a    61     30.6     1.6            25     37.2     0.2
I.2b    61     29.2     1.0            25     35.4     0.1
I.3     69     32.8     1.9            25     31.5     0.2
I.4     61     31.6     1.4            25     33.1     0.2
I.5     61     32.8     12.3           25     36.7     1.7
I.6     59     32.9     1.4            25     32.8     0.5
II.1a   56     24.9     0.3            30     36.5     0.3
II.1b   38     13.6     1.7            17     19.9     0.5
II.2    62     32.7     0.6            25     34.5     0.3
II.3    54     21.8     16.7           23     30.4     5.0

Alkylation process design results

In order to determine the best technique for the construction of convex underestimators, the percentage of computational effort dedicated to this purpose, $t_U$, is tracked. As can be seen in the above table, the generation of the convex lower bounding problem does not consume a large share of the computational cost, regardless of the method. It is, however, significantly larger for Methods I.5 and II.3, as they require the solution of a polynomial problem and a semidefinite programming problem, respectively. $t_U$ decreases when bound updates are performed at each iteration, as a large amount of time is


spent solving the bound update problems. In this example, the scaled Gershgorin approach with $d_i = x_i^U - x_i^L$ (Method II.1b) gives the best results, both in terms of the number of iterations and in CPU time.

Conclusions

The αBB algorithm is guaranteed to identify the global optimum solution of problems belonging to the broad class of twice continuously differentiable NLPs. It is a branch and bound approach based on a rigorous convex relaxation strategy, which involves the decomposition of the functions into a sum of terms with special mathematical structure and the construction of different convex underestimators for each class of term. In particular, the treatment of general nonconvex terms requires the analysis of their Hessian matrix through interval arithmetic. Efficient branching and variable bound update strategies can be used to enhance the performance of the algorithm.

See also
Bisection Global Optimization Methods
Continuous Global Optimization: Applications
Continuous Global Optimization: Models, Algorithms and Software
Convex Envelopes in Optimization Problems
D.C. Programming
Differential Equations and Global Optimization
DIRECT Global Optimization Algorithm
Eigenvalue Enclosures for Ordinary Differential Equations
Generalized Primal-relaxed Dual Approach
Global Optimization Based on Statistical Models
Global Optimization in Batch Design Under Uncertainty
Global Optimization in Binary Star Astronomy
Global Optimization in Generalized Geometric Programming
Global Optimization Methods for Systems of Nonlinear Equations
Global Optimization in Phase and Chemical Reaction Equilibrium
Global Optimization Using Space Filling
Hemivariational Inequalities: Eigenvalue Problems
Interval Analysis: Eigenvalue Bounds of Interval Matrices
Interval Global Optimization
MINLP: Branch and Bound Global Optimization Algorithm
MINLP: Global Optimization with αBB
Reformulation-linearization Methods for Global Optimization
Reverse Convex Optimization
Semidefinite Programming and Determinant Maximization
Smooth Nonlinear Nonconvex Optimization
Topology of Global Optimization

References
1. Adjiman CS, Androulakis IP, Floudas CA (1998) A global optimization method, αBB, for general twice-differentiable constrained NLPs – II. Implementation and computational results. Comput Chem Eng 22:1159
2. Adjiman CS, Androulakis IP, Maranas CD, Floudas CA (1996) A global optimization method, αBB, for process design. Comput Chem Eng 20:S419–S424
3. Adjiman CS, Dallwig S, Floudas CA, Neumaier A (1998) A global optimization method, αBB, for general twice-differentiable constrained NLPs – I. Theoretical advances. Comput Chem Eng 22:1137
4. Adjiman CS, Floudas CA (1996) Rigorous convex underestimators for twice-differentiable problems. J Global Optim 9:23–40
5. Al-Khayyal FA, Falk JE (1983) Jointly constrained biconvex programming. Math Oper Res 8:273–286
6. Androulakis IP, Maranas CD, Floudas CA (1995) αBB: A global optimization method for general constrained nonconvex problems. J Global Optim 7:337–363
7. Bracken J, McCormick GP (1968) Selected applications of nonlinear programming. Wiley, New York
8. Deif AS (1991) The interval eigenvalue problem. Z Angew Math Mech 71:61–64
9. Harding ST, Floudas CA (1997) Global optimization in multiproduct and multipurpose batch design under uncertainty. Ind Eng Chem Res 36:1644–1664
10. Hertz D (1992) The extreme eigenvalues and stability of real symmetric interval matrices. IEEE Trans Autom Control 37:532–535
11. Kharitonov VL (1979) Asymptotic stability of an equilibrium position of a family of systems of linear differential equations. Differential Eq:1483–1485
12. Maranas CD, Floudas CA (1994) Global minimum potential energy conformations of small molecules. J Global Optim 4:135–170
13. Maranas CD, Floudas CA (1995) Finding all solutions of nonlinearly constrained systems of equations. J Global Optim 7:143–182


14. Maranas CD, Floudas CA (1997) Global optimization in generalized geometric programming. Comput Chem Eng 21:351–370
15. McCormick GP (1976) Computability of global solutions to factorable nonconvex programs: part I – Convex underestimating problems. Math Program 10:147–175
16. McDonald CM, Floudas CA (1994) Decomposition based and branch and bound global optimization approaches for the phase equilibrium problem. J Global Optim 5:205–251
17. McDonald CM, Floudas CA (1995) Global optimization and analysis for the Gibbs free energy function for the UNIFAC, Wilson, and ASOG equations. Ind Eng Chem Res 34:1674–1687
18. McDonald CM, Floudas CA (1995) Global optimization for the phase and chemical equilibrium problem: Application to the NRTL equation. Comput Chem Eng 19:1111–1141
19. McDonald CM, Floudas CA (1995) Global optimization for the phase stability problem. AIChE J 41:1798–1814
20. McDonald CM, Floudas CA (1997) GLOPEQ: A new computational tool for the phase and chemical equilibrium problem. Comput Chem Eng 21:1–23
21. Mori T, Kokame H (1994) Eigenvalue bounds for a certain class of interval matrices. IEICE Trans Fundam E77-A:1707–1709
22. Neumaier A (1992) An optimality criterion for global quadratic optimization. J Global Optim 2:201–208
23. Rohn J (1996) Bounds on eigenvalues of interval matrices. Techn Report Inst Computer Sci Acad Sci Prague 688
24. Stephens C (1997) Interval and bounding Hessians. In: Bomze IM et al (eds) Developments in Global Optimization. Kluwer, Dordrecht, pp 109–199
25. Vandenberghe L, Boyd S (1996) Semidefinite programming. SIAM Rev 38:49–95


Alternative Set Theory

AST

PETR VOPĚNKA, KATEŘINA TRLIFAJOVÁ
Charles University, Prague, Czech Republic

MSC2000: 03E70, 03H05, 91B16

Article Outline
Keywords
Classes, Sets and Semisets
Infinity
Axiomatic System of AST
Rational and Real Numbers
Infinitesimal Calculus
Topology
  Basic Definitions
Motion
Utility Theory
Conclusion
See also
References

Keywords
Sets; Semisets; Infinity; Countability; Continuum; Topology; Indiscernibility; Motion; Utility theory

Alternative set theory (AST) has been created by P. Vopěnka since the 1970s and developed together with his colleagues at Charles University. In agreement with Husserl's phenomenology, he based his theory on the natural world and the human view thereof. The most important feature of any set theory is the way it treats infinity. A different approach to infinity forms the key difference between AST and the classical set theories based on Cantor set theory (CST). Cantor's approach led to the creation of a rigid, abstract world with an enormous scale of infinite cardinalities, while Vopěnka's infinity, based on the notion of horizon, is more natural and acceptable. Another source of inspiration were nonstandard models of Peano arithmetic with infinitely large (nonstandard) numbers. The way to build them in AST is easy and natural. The basic references are [9,10,11].

Classes, Sets and Semisets AST, as well as CST, builds on notions of ‘set’, ‘class’, ‘element of a set’ and, in addition, introduces the notion of ‘semiset’. A class is the most general notion used for any collection of distinct objects. Sets are such classes that are so clearly defined and clean-cut that their elements could be, if necessary, included in a list. Semisets are classes which are not sets, because their borders are vague, however, they are parts of sets. For example, all living people in the world form a class—some are being born, some are dying, we do not know where all of them are. The citizens of Prague, registered at the given moment in the register, form a set. However, all the beautiful women in Prague or brave men in Prague


form a semiset, since it is not clear who belongs to this collection and who does not. In the real world, we may find many other semisets. Almost every property defines a semiset of objects, e.g., people who are big, happy or sick. Many properties are naturally connected with vagueness. Also, what we see and perceive can be vague and limited by a horizon. Objects described in this way may form a semiset, e.g., the flowers I can see in a blooming meadow, all my friends, the sounds I can hear.

Infinity

This interpretation differs from the usual one and corresponds more closely to the etymological origin of the word infinity. We call finite those classes every part of which is surveyable and forms a set; any finite class is a set:

$$ \mathrm{Fin}(X) \iff (\forall Y)(Y \subseteq X \implies \mathrm{Set}(Y)). $$

On the other side, infinite classes include ungrasped parts, semisets. This phenomenon may occur also when watching large sets, in the case when it is not possible to capture them clearly as a whole. There are two different forms of infinity, traditionally called denumerability and continuum.

A countable (denumerable) class, in a way, represents a road towards the horizon. Its beginning is clear and definite, but it becomes less and less clear, and its end is lost in vagueness. A countable class is defined as an infinite class with a linear ordering such that each initial part (segment) is finite. For instance, a railway track with cross-ties leading straight to the horizon, the days of our life we are yet to live, or the ever smaller reflections in two mirrors facing each other. The most important example is the class of natural numbers, which will be discussed later. The phenomenon of denumerability corresponds to a road towards the horizon. Though we get to the last point we can see, we can still go a bit further; the road does not disappear immediately. People have always tried to look a bit behind the horizon, to gain understanding and to overcome it in their thinking. This experience is expressed here by the important axiom of prolongation (see Axiom A6).

The other type of infinity, continuum, is based on the following experience: we watch an object but are not able to distinguish the individual elements which form it, since they lie beyond the horizon of our perception. For example, the class of all geometric points in the plane, the class of all atoms forming a table, or the grains of sand which together form a heap. In fact, classical infinite mathematics, when applied to the real world, applies solely to the above two types of infinity. The intention of AST is to build on the natural world and human intuition. There is no reason for other types of infinity, which are enforced in CST by its assumption that the natural numbers form a set and that a power set is a set. That is why there are only two infinite cardinalities in AST: denumerability and continuum (see Axiom A8).

All examples from the mathematical and real worlds are intentionally set out here together. They serve the purpose of inspiration, to see where the idea of infinity comes from, and they should be kept in mind when one deals with infinity. The mathematical world is an ideal one; it is a perfect world of objective truths abstracted from all that is external. There is only little space for subjectivity of perception in it. That is why not all semisets from the real world may be interpreted directly. The axiomatic system below describes that part of AST which can be expressed in a strictly formal way. This basis provides space for extending AST by semisets which are parts of big, however classically finite, sets and thus makes a lot of applications possible.

Axiomatic System of AST [3]

The language of AST uses the symbols ∈ and =, the symbols X, Y, Z, ... for class variables and the symbols x, y, z, ... for set variables. Sets are created by iteration from the empty set by Axiom A3. Classes are defined by formulas by Axiom A2. Every set is a class; formally, a set is a class that is a member of another class:

$$ \mathrm{Set}(X) \iff (\exists Y)(X \in Y). $$

AST is a theory with the following axioms:

A1 (extensionality). $(X = Y) \iff (\forall Z)((Z \in X) \iff (Z \in Y))$.

A2 (existence of classes). If $\varphi$ is a formula, then $(\exists Y)(\forall x)(x \in Y \iff$

$\varphi(x, X_1, \ldots, X_n))$.

A3 (existence of sets). $\mathrm{Set}(\emptyset) \wedge (\forall x, y)\,\mathrm{Set}(x \cup \{y\})$.

A set-formula is a formula in which only set variables and constants occur.

A4 (induction). If $\varphi$ is a set-formula, then $(\varphi(\emptyset) \wedge (\forall x, y)(\varphi(x) \implies \varphi(x \cup \{y\}))) \implies (\forall x)\,\varphi(x)$.

A5 (regularity). If $\varphi$ is a set-formula, then $(\exists x)\,\varphi(x) \implies (\exists x)(\varphi(x) \wedge (\forall y \in x)\,\neg\varphi(y))$.

As usual, the class of natural numbers N is defined in the von Neumann way:

$$ N = \{ x : (\forall y \in x)(y \subseteq x) \wedge (\forall y, z \in x)(y \in z \vee y = z \vee z \in y) \}. $$

The class of finite natural numbers (FN) consists of the numbers represented by a finite set. They are accessible, easy to overlook, and lie before the horizon:

$$ FN = \{ x \in N : \mathrm{Fin}(x) \}. $$

FN forms a countable class in the sense described above. The class FN corresponds to the classical natural numbers and the class N to their nonstandard model. Both N and FN satisfy the axioms of Peano arithmetic. Two classes X, Y are equivalent if there is a one-one mapping of X onto Y, i.e., $X \approx Y$.

A6 (prolongation). Every countable function can be prolonged to a function which is a set, i.e., $(\forall F)((\mathrm{Fnc}(F) \wedge F \approx FN) \implies (\exists f)(\mathrm{Fnc}(f) \wedge F \subseteq f))$.

An easy corollary is that a countable class is a semiset. Also, FN is a semiset, and it can be prolonged to a set which is an element of N and which is greater than all finite natural numbers; it thus represents an infinitely large natural number. Consequently, the class N is not countable. The universal class V includes all sets created by iteration from the empty set.

A7 (choice). The universal class V can be well ordered.

A8 (two cardinalities). Every two infinite classes that are not countable are equivalent.

Thus, any infinite class is equivalent either to FN or to N. Using ultrapowers, the relative consistency of AST can be proved.

Rational and Real Numbers

Rational numbers Q are constructed in the usual way from N, as the quotient field of the class $N \cup \{-n : n \in N\}$. Because N includes infinitely large numbers, Q includes infinitely small numbers. Finite rational numbers FQ are similarly constructed from the finite natural numbers FN. They include the quantities that are before the horizon with respect to both distance and depth. Surely $FQ \subseteq Q$. We define that x, y ∈ Q are infinitely near, $x \doteq y$, by

$$ x \doteq y \iff \begin{cases} (\forall n \in FN,\ n > 0)\ |x - y| < \frac{1}{n}, \\ \vee\ (\forall n \in FN)\ (x > n \wedge y > n), \\ \vee\ (\forall n \in FN)\ (x < -n \wedge y < -n). \end{cases} $$

This relation is an equivalence. The corresponding partition classes are called monads. For x ∈ Q,

$$ \mathrm{Mon}(x) = \{ y : y \doteq x \}. $$

Rational numbers x that are elements of Mon(0), i.e., $x \doteq 0$, are infinitely small. All monads are of the same nature except for the two limit ones, which consist of the infinitely large positive and negative numbers. The class of bounded rational numbers is

$$ BQ = \{ x \in Q : (\exists n)((n \in FN) \wedge (|x| < n)) \}. $$

Now, it is easy and natural to construct the real numbers:

$$ R = \{ \mathrm{Mon}(x) : x \in BQ \}. $$

Real numbers built in this way display the same characteristics as real numbers in CST. This motivation for expressing real numbers as monads of rational numbers corresponds rather to etymology than to the traditional interpretation. Rational numbers are constructed by reason, perfectly exact; their existence is purely abstract. On the other hand, real numbers are more similar to those that are used in the real world. If we say "one eighth of a cake", we surely do not expect it to be the ideal eighth; it is rather a portion which differs from the ideal one by a difference which is beyond the horizon of our perception. A similar situation occurs in the case of a pint of milk or twenty miles.
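As a small worked example (ours, not from the article), consider an infinitely large $\nu \in N \setminus FN$; the quantity $1/\nu$ is then infinitesimal and determines the real number zero:

\[
  \nu \in N \setminus FN
  \;\Longrightarrow\;
  \frac{1}{\nu} \doteq 0,
  \qquad\text{since } \left|\tfrac{1}{\nu} - 0\right| < \tfrac{1}{n}
  \text{ for every } n \in FN,\ n > 0.
\]
\[
  \text{Hence } \tfrac{1}{\nu} \in \operatorname{Mon}(0), \qquad
  \operatorname{Mon}(0) \subseteq BQ, \qquad
  \operatorname{Mon}(0) \text{ is the real number } 0 \in R.
\]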


Infinitesimal Calculus [12]

Infinitesimal calculus in AST is based on the same point of view and intuition as that of its founders, I. Newton and G.W. Leibniz. This is so because infinitely small (infinitesimal) quantities are naturally available in AST. For example, the limit of a function and continuity at a ∈ Q are defined, respectively, by

$$ \lim_{x \to a} f(x) = b \iff (\forall x)((x \doteq a \wedge x \neq a) \implies f(x) \doteq b), $$
$$ (\forall x)(x \doteq a \implies f(x) \doteq f(a)). $$

This topic is discussed in detail in [9]. As a method, these definitions were successfully used for teaching students.

Topology

Classes described by arbitrary formulas can be complex and difficult to capture. The easiest are sets; classes described using set-formulas, the so-called set-definable classes (Sd-classes), can also be described well. Semisets which are defined by a positive property (big, blue or happy, and also distinguishable or being a finite natural number) can be described as countable unions of Sd-classes, the so-called σ-classes. On the other hand, classes whose definition is based on negation (not big, not happy, indiscernible) are the so-called π-classes: countable intersections of Sd-classes. A class which is at the same time σ and π is an Sd-class. Using combinations of σ and π, a class hierarchy can be described.

One of the most important tasks of mathematics is to handle the notion of the continuum. AST is based on the assumption that this phenomenon is caused by the indiscernibility of the elements of the observed class. That is why, for the study of topology, the basic notion is a certain relation of indiscernibility (≐). Two elements are indiscernible if, when they are observed, the available criteria that might distinguish them fail. This is a negative feature; therefore it must be a π-class. The relation of indiscernibility is naturally reflexive and symmetric. In pure mathematics it is in addition transitive (because FN is closed under addition), and thus it is an equivalence. This relation must also be compact, i.e., for each infinite set $u \subseteq \mathrm{dom}(\doteq)$ there are x, y ∈ u such that $x \neq y \wedge x \doteq y$. The corresponding topological space is a compact metric space. The relation of infinite nearness in the rational numbers represents a special case of an equivalence of indiscernibility. Monads and figures correspond to the phenomena of points and shapes, respectively:

$$ \mathrm{Mon}(x) = \{ y : y \doteq x \}, \qquad \mathrm{Fig}(X) = \{ y : (\exists x \in X)(y \doteq x) \}. $$

Basic Definitions

Two classes X, Y are separable, Sep(X, Y), iff $(\exists Z)(\mathrm{Sd}(Z) \wedge \mathrm{Fig}(X) \subseteq Z \wedge \mathrm{Fig}(Y) \cap Z = \emptyset)$. The closure $\overline{X}$ of a class X is defined as $\overline{X} = \{ x : \neg\mathrm{Sep}(\{x\}, X) \}$. A class X is closed if $X = \overline{X}$. A set u is connected if $(\forall w)(\emptyset \neq w \subsetneq u \implies \mathrm{Fig}(w) \cap (u \setminus w) \neq \emptyset)$.

It is quite easy to prove the basic topological theorems; also, the proofs of some classical theorems are much simpler here. For instance, the Sierpinski theorem: if v is a connected set, then Fig(v) cannot be expressed as a countable union of disjoint closed sets.

The fundamental indiscernibility $\leftrightarrow_c$ is defined as follows: if c is a set, then $x \leftrightarrow_c y$ if for any set-formula $\varphi$ with constants from c, $\varphi(x) \iff \varphi(y)$. This relation has a special position: for any relation of indiscernibility $\doteq$ there is a set c such that $\leftrightarrow_c$ is finer than $\doteq$, i.e., $\leftrightarrow_c \subseteq \doteq$.

Motion

Unlike classical mathematics, motion is captured in AST by any relation of indiscernibility ≐. Everybody knows the way films work: the pictures coming one after another are almost indiscernible from each other; however, when shown in a rapid sequence, the pictures start to move. Continuous motion may be viewed like this, as a sequence of indiscernible stages in certain time intervals. A function d is a motion of a point in the time $\delta \in N$ if $\mathrm{dom}(d) = \delta \wedge (\forall \alpha < \delta)(d(\alpha) \doteq d(\alpha + 1))$. If δ ∈ FN, then the point does not move; it can move only in an infinitely big time interval.

Alternative Set Theory

A sequence {d(˛): ˛ 2 dom(d)} is a sequence of states. The number ı = dom(d) is the number of moments and rng(d) is the trace of a moving point. A trace is a connected set and for each nonempty connected set u there is a motion of a point such that u is the trace of d. A motion of a set is defined similarly, only the last condition is different: (8˛ < ı)(Fig(d(˛)) = Fig(d(˛+1))). The following theorem is proved in [10,11]: Each motion of a set may be divided into motions of points. This does not involve only the mechanical motion, but any motion describing a continuous change. Thus, for example, even the growth of a tree from a planted seed may be divided into movements of individual points while all of their initial stages are already contained in the seed. In addition, it is possible to describe conditions under which such a change is still continuous. Utility Theory [7] The utility theory is one of nice examples of applying AST. Its aim is to find a valuation of elements of a class S. There is a preference relation on linear combinations of elements of S with finite rational coefficients, i. e. on the class 9 =

8 n

m X

ˇ j F(u j ):

jD1

It is not necessary to require the so-called Archimedes property on the relation of preference thanks to the possibility of using infinitely small and infinitely large rational numbers. It is possible to capture finer and more complex relations than in classic mathematics, e. g. the fact that the value of one element is incom-

A

parably higher than that of another element or it is possible to compare infinitely small differences of values. For each class S with a preference relation a valuation may be found. Such a valuation is not uniquely defined, it is possible to construct it so that rng(F) N. Conclusion The aim of this short survey is to demonstrate the basic ideas of AST. Yet, there are other areas of mathematics which were studied in it, for instance measurability [8], ultrafilters [6], endomorphic universes [5] and automorphisms of natural numbers [2], representability [1] metamathematics [3] and models of AST [4]. See also  Boolean and Fuzzy Relations  Checklist Paradigm Semantics for Fuzzy Logics  Finite Complete Systems of Many-valued Logic Algebras  Inference of Monotone Boolean Functions  Optimization in Boolean Classification Problems  Optimization in Classifying Text Documents References 1. Mlˇcek J (1979) Valuation of structures. Comment Math Univ Carolinae 20:681–695 2. Mlˇcek J (1985) Some automorphisms of natural numbers in AST. Comment Math Univ Carolinae 26:467–475 3. Sochor A (1992) Metamathematics of AST. From the logical point of view 1:61–75 4. Sochor A, Pudlák P (1984) Models of AST. J Symbolic Logic 49:570–585 5. Sochor A, Vopˇenka P (1979) Endomorfic universes and their standard extensions. Comm Math Univ Carolinae 20:605–629 6. Sochor A, Vopˇenka P (1981) Ultrafilters of sets. Comment Math Univ Carolinae 22:698–699 7. Trlifajová K, Vopeˇ nka P (1985) Utility theory in AST. Comment Math Univ Carolinae 26:699–711 ˇ 8. Cuda K (1986) The consistency of measurability of projective semisets. Comment Math Univ Carolinae 27:103–121 ˇ 9. Cuda K, Sochor A, Vopeˇ nka P, Zlatoš P (1989) Guide to AST. Proc. First Symp. Mathematics in AST, Assoc. Slovak Mathematicians and Physicists, Bratislava 10. Vopénka P (1979) Mathematics in AST. Teubner, Leipzig 11. Vopénka P (1989) Introduction to mathematics in AST. Alfa Bratislava, Bratislava 12. Vopénka P (1996) Calculus infinitesimalis-pars prima. Práh Praha, Praha

77

78

A

Approximation of Extremum Problems with Probability Functionals, APF

Approximation of Extremum Problems with Probability Functionals APF RIHO LEPP Tallinn Technical University, Tallinn, Estonia MSC2000: 90C15 Article Outline Keywords See also References Keywords Discrete approximation; Probability functionals To ensure a certain level of reliability for the solution of an extremum problem under uncertainty it has become a spread approach to introduce probabilistic (chance) cost and/or constraints into the model. The stability analysis of chance constraint problems is rather complicated due to complicated properties of the probability function vt (x), defined as v t (x) D P fs : f (x; s)  tg :

(1)

Here f (x, s) is a real valued function, defined on Rr × Rr , t is a fixed level of reliability, s is a random vector and P denotes probability. The function vt (x) is never convex, only in some cases (e. g., f (x, s) linear in s and distribution of the random parameter s normal), it is quasiconvex. Note that for a fixed x function vt (x), as a function of t, is the distribution function of the random variable f (x, s). The ‘inverse’, the quantile function w˛ (x), to the probability function vt (x) is defined in such a way that the probability level ˛, 0 < ˛ < 1, is fixed earlier, and the purpose is to minimize the reliability level t: w˛ (x) D min ft : P fs : f (x; s)  tg  ˛g : t

(2)

Varied examples of extremum problems with probability and quantile functions are presented in [7] and

in [8]. Some of these models have such a complicated structure, see [8, Chap. 1.8], about correction of a satellite orbit, that we are forced to look for a solution x from a certain class of strategies, that means, the solution x itself depends on the random parameter s, x = x(s). This class of probability functions was introduced to stochastic programming by E. Raik, and lower semicontinuity and continuity properties of vt (x) and w˛ (x) in Lebesgue Lp -spaces, 1  p < 1, were studied in [12]. Simultaneously, in [4] problems with various classes of solutions x(s) (measurable, continuous, linear, etc) were considered. Since the paper [4] solutions x(s) are called decision rules, and we will follow also this terminology. Differently from [4], here we will consider approximation of a decision rule x(s) by sequences of vectors {xn }, xn = (x1n , . . . , xnn ), n = 1, 2, . . . , with increasing dimension in order to maximize the value of the probability functional v(x) under certain set C of decision rules. It will be assumed that the set C will be bounded in the space L1 (S, ˙, ) = L1 () of integrable functions x(s), x 2 L1 (): max v t (x) D max P fs : f (x(s); s)  tg : x2C

x2C

(3)

Here S is the support of random variable s with distribution (probability measure) () and ˙ denotes the sigma-algebra of Borel measurable sets from Rr . Due to technical reasons we are forced to assume that the random parameter s has bounded support S  Rr , diam S < 1, and its distribution  is atomless,  fs : js  s0 j D constg D 0;

8s0 2 Rr :

(4)

Since the problem (3) is formulated in the function space L1 () of -integrable functions, the first step in its solution is the approximation step where we will replace the initial problem (3) by a sequence of finitedimensional optimization problems with increasing dimension. Second step, solution methods were considered in a series of papers of the author (see, e. g., [9]), where the gradient projection method was suggested together with simultaneous Parzen–Rosenblatt kerneltype smooth approximation of the discontinuous integrand from (1). There are several ways to divide the support S of the probability measure  into smaller parts in discretization, e. g., taking disjoint subsets Sj , j = 1, . . . , k, of S

Approximation of Extremum Problems with Probability Functionals, APF

from the initial sigma-algebra ˙ as in [11], or using in the partition of S only convex sets from ˙, as in [5]. We will divide the support S into smaller parts by using only sets Ain , i = 1, . . . , n, n 2 N = {1, 2, . . . }, with -measure zero of their boundary, i. e.,  (intAin ) =  (Ain ) =  (clAin ), where int A and cl A denote topological interior and closure of a set A, respectively. Such division is equivalent to weak convergence of a sequence of discrete measures {(mn , sn )} to the initial probability measure , see, e. g. [14]: n X iD1

Z h(s i n )m i n !

h(s) (ds);

n 2 N;

(5)

S

for any continuous on S function h(s), h 2 C(S). The usage of the weak convergence of discrete measures in stochastic programming has its disadvantages and advantages. An example in [13] shows that, in general, the stability of a probability function with respect to weak convergence cannot be expected without additional smoothness assumptions on the measure . This is one of the reasons, why we should use only continuous measures with the property (4). An advantage of the usage of the weak convergence is that it allows us to apply in the approximation process instead of conditional means [11] the more simple, grid point approximation scheme. Since the functional vt (x) is not convex, we are not able to exploit in the stability analysis of discrete approximation of the problem (3) the more convenient, weak topology, but only the strong (norm) topology. As the first step we will approximate vt (x) so, that the discrete analogue of continuous convergence of a sequence of approximate functionals will be guaranteed. Schemes of stability analysis (e. g., finite-dimensional approximations) of extremum problems in Banach spaces require from the sequence of solutions of ‘approximate’ problems certain kind of compactness. Assuming that the constraint set C is compact in L1 (), we, as the second step, will approximate the set C by a sequence of finite-dimensional sets {Cn } with increasing dimension so, that the sequence of solutions of approximate problems is compact in a certain (discrete convergence) sense in L1 (). Then the approximation scheme for the discrete approximation of (3) will follow formed schemes of approximation of extremum problems in Banach spaces, see e. g. [2,3,15].

A

Redefine the functional vt (x) by using the Heaviside zero-one function : Z v t (x) D (t  f (x(s); s)) (ds); (6) S

where ( (t  f (x(s); s)) D

1

if f (x(s); s)  t;

0

if f (x(s); s) > t:

Since the integrand () itself, as a zero-one function, is discontinuous, we will assume that the function f (x, s) is continuous both in (x, s) and satisfies following growth and ‘platform’ conditions: j f (x; s)j  a(s) C ˛ jxj ; a 2 L1 ();

˛ > 0;

 fs : f (x; s) D constg D 0; 8(x; s) 2 Rr  S:

(7) (8)

The continuity assumption is technical in order to simplify the description of the approximation scheme below. The growth condition (7) is essential: without it the superposition operator f (x) = f (x(s), s) will not map an element from L1 to L1 (is even not defined). Condition (8) means that the function f (x, s) should not have horizontal platforms with positive measure. Constraint set C is assumed to be a set of integrable functions x(s), x 2 L1 (), with properties Z (9) jx(s)j (ds)  M < 1; 8x 2 C S

for some M > 0 (C is bounded in L1 ()); Z jx(s)j  K(D); 8x 2 C; D 2 ˙

(10)

D

for some K > 0; (x(s)  x(t); s  t)  0 for a.a. s; t 2 S

(11)

(functions x 2 C are monotone almost everywhere and a.a. denotes abbreviation of ‘almost all’). Conditions (9), (10) guarantee that the set C is weakly compact (i. e., compact in the (L1 , L1 )-topology, see, e. g., [6, Chap. 9.1.2]). Condition (11) guarantees now, following [1, Lemma 3], that the set C is strongly compact in L1 (). Then, following [11], we can conclude that assumptions (7)–(11) together with

79

80

A

Approximation of Extremum Problems with Probability Functionals, APF

atomless assumption (4) for the measure  guarantee the existence of a solution of problem (3) in the Banach space L1 () of -integrable functions (the cost functional vt (x) is continuous in x and the constraint set C is compact in L1 ()). Since approximate problems will be defined in Rrn , we should define a system of connection operators P = {pn } between spaces L1 () and Rrn , n 2 N. In Lp -spaces, 1  p  1, systems of connection operators should be defined in a piecewise integral form (as conditional means): Z x(s) (ds); (12) (p n x) i n D (A i n )1 A in

where i = 1, . . . , n, and sets Ain , i = 1, . . . , n, n 2 N, that define connection operators (12), satisfy following conditions A1)–A7): A1) (Ain )> 0; A2) Ain \ Ajn = ;, i 6D j; A3) [niD1 Ain = S; Pn A4) iD1 |min  (Ain )| ! 0, n 2 N; A5) maxi diamAin ! 0, n 2 N; A6) sin 2 Ain ; A7) (intAin ) = (Ain ) = (clAin ). Remark 1 Weak convergence (5) is equivalent to the partition {An } of S, An = {A1n , . . . , Ann }, with properties A1)–A7), see [14]. Remark 2 Collection of sets {Ain } with the property A7) constitutes an algebra ˙ 0  ˙, and if S = [0, 1] and if  is Lebesgue measure on [0, 1], then integrability relative to |˙0 means Riemann integrability. Define now the discrete convergence for the space L1 () of -integrable functions. Definition 3 A sequence of vectors {xn }, xn 2 Rrn , P converges (or converges discretely) to an integrable function x(s), if n X

jx i n  (p n x) i n j m i n ! 0;

n 2 N:

(13)

iD1

Remark 4 Note that in the space L1 () of -integrable functions we are also able to use the projection methods approach, defining convergence of {xn } to x(s) as follows: ˇ Z ˇˇ n ˇ X ˇ ˇ x i n A in (s)ˇ (ds) ! 0; n 2 N: ˇx(s)  ˇ ˇ S iD1

Remark 5 Projection methods approach does not work in the space L1 () of essentially bounded measurable functions with vraisup-norm topology (L1 () is a nonseparable Banach space and the space C(S) of continuous functions is not dense there). We need the space L1 (), which is the topological dual to the space L1 () of -integrable functions, in order to define also the discrete analogue of the weak convergence in L1 (). Definition 6 A Sequence of vectors {xn }, xn 2 Rrn , n 2 N, wP -converges (or converges weakly discretely) to an integrable function x(s), x 2 L1 (), if Z n X (z i n ; x i n )m i n ! (z(s); x(s)) (ds); S (14) iD1 n 2 N; for any sequence {zn } of vectors, zn 2 Rrn , n 2 N, and function z(s), z 2 L1 (), such that max jz i n  (p n z) i n j ! 0;

n 2 N:

1in

(15)

In order to formulate the discretized problem and to simplify the presentation, we will assume that in partition {An } of S, where An = { A1n , . . . , Ann }, with properties A1)–A7), in property A4) we will identify min and (Ain ), i. e. min = (Ain ) (e. g. squares with decreasing diagonal in R2 ). Discretize now the probability functional vt (x): v tn (x n ) D

n X

(t  f (x i n ; s i n ))m i n ;

(16)

iD1

and formulate the discretized problem: max v tn (x n )

x n 2C n

D max

x n 2C n

n X

(17) (t  f (x i n ; s i n ))m i n ;

iD1

where constraint set Cn will satisfy discrete analogues of conditions (9)–(11), covered to the set C: n X

jx i n j m i n  M

iD1

X i2I n

jx i n j m i n  K

8x n 2 C n ;

X

(18)

mi n ;

i2I n

8x n 2 C n ;

8I n  f1; : : : ; ng;

(19)

Approximation of Extremum Problems with Probability Functionals, APF

r X

(x ikk n  x kjk n )(i k  j k )  0;

8i k ; j k :

ik < jk ;

kD1

(20) and such that 0  ik , jk  n, 8n 2 N. Definition 7 A sequence of sets {Cn }, Cn  Rrn , n 2 N, converges to the set C  L1 () in the discrete Mosco sense if 1) for any subsequence {xn }, n 2 N0  N, such that xn 2 Cn , from convergence wP -lim xn = x, n 2 N, it follows that x 2 C; 2) for any x 2 C there exists a sequence {xn }, xn 2 Cn , which P -converges to x, P -lim xn = x, n 2 N. Remark 8 If in the above definition also ‘for any’ part 1) is defined for P -convergence of vectors, then it is said that sequence of sets {Cn } converges to the set C in the discrete Painlevé–Kuratowski sense. Denote optimal values and optimal solutions of problems (3) and (17) by v , x and vn , xn , respectively. Let function f (x, s) be continuous in both variables (x, s) and satisfy growth and platform conditions (7) and (8). Then from convergence P -lim xn = x, n 2 N, for any monotone a.e. function x(s), it follows convergence vn (xn ) ! v(x), n 2 N. Verification of this statement is quite lengthy and technically complicated: we should first approximate discontinuous function (t  f (x, s)) by continuous function c (t  f (x, s)) in the following way: c (t  f (x; s)) 8 ˆ if f (x; s)  t; ˆ t C ı for some (small) ı, and then a discontinuous solution x(s), x 2 L1 (), by continuous function xc (s) (in L1 norm topology). Let constraint sets C and Cn satisfy conditions (9)– (11) and (18)–(20), respectively. Let discrete measures {(mn , sn )} converge weakly to the measure . Then the sequence of sets {Cn } converges to the set C in the discrete Painlevé–Kuratowski sense. Verification of this statement relies on the two following convergences:

A

1) sequence of sets, determined by inequalities (18), (19) converges, assuming weak convergence of discrete measures (5), in discrete Mosco sense to the weakly compact in L1 () set, determined by inequalities (9), (10); 2) adding to both, approximate and initial sets of admissible solutions monotonicity conditions (20) and (11), respectively, we can guarantee the discrete convergence of sequence {Cn } to C in Painlevé– Kuratowski sense. Now we can formulate the discrete approximation conditions for a stochastic programming problem with probability cost function in the class of integrable decision rules. Let function f (x, s) be continuous in both variables (x, s) and satisfy growth and platform conditions (7) and (8), constraint set C satisfy conditions (9)–(11) and let discrete measures {(mn , sn )} converge weakly to the atomless measure . Then vn ! v , n 2 N, and sequence of solutions {xn } of approximate problems (17) has a subsequence, which converges discretely to a solution of the initial problem (3). Remark 9 The usage of the space L1 () of integrable functions is essential. In reflexive Lp -spaces, 1 < p < 1, serious difficulties arise with application of the strong (norm) compactness criterion for a maximizing sequence. As a rule, problems with probability cost function are maximized, whereas stochastic programs with quantile cost are minimized, see, e. g., [8,10]. Consider at last discrete approximation of the quantile minimization problem (2): min w˛ (x) x2C

D min minfP( f (x(s); s)  t)  ˛g; x2C

t

(21)

It was verified in [10] that under certain (quasi)convexity-concavity assumptions the quantile minimization problem (21) is equivalent to the following Nash game: max v t (x) D J1 ;

(22)

min(v t (x)  ˛)2 D J2 :

(23)

x2C

t

81

82

A

Approximation of Extremum Problems with Probability Functionals, APF

Discretizing vt (x) as in (16) and w˛ (x) as ( w˛n (x n ) D min t

n X

) (t  f (x i n ; s i n ))m i n  ˛ ;

iD1

we can, analogously to the probability functional approximation, approximate the quantile minimization problem (21) too. In other words, to replace the Nash game (22), (23) with the following finite-dimensional game:  ; max v tn (x n ) D J1n

(24)

 min(v tn (x n )  ˛)2 D J2n :

(25)

x n 2C n

t

Verification of convergences J 1n ! J 1 and J 2n ! J 2 , n 2 N, is a little bit more labor-consuming compared with approximate maximization of probability functional vt (x), since we should guarantee also convergence of the sequence of optimal quantiles {t n } of minimization problems (25).

See also  Approximation of Multivariate Probability Integrals  Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points  Extremum Problems with Probability Functions: Kernel Type Solution Methods  General Moment Optimization Problems  Logconcave Measures, Logconvexity  Logconcavity of Discrete Distributions  L-Shaped Method for Two-Stage Stochastic Programs with Recourse  Multistage Stochastic Programming: Barycentric Approximation  Preprocessing in Stochastic Programming  Probabilistic Constrained Linear Programming: Duality Theory  Probabilistic Constrained Problems: Convexity Theory  Simple Recourse Problem: Dual Method  Simple Recourse Problem: Primal Method  Stabilization of Cutting Plane Algorithms for Stochastic Linear Programming Problems  Static Stochastic Programming Models

 Static Stochastic Programming Models: Conditional Expectations  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Stochastic Linear Programming: Decomposition and Cutting Planes  Stochastic Linear Programs with Recourse and Arbitrary Multivariate Distributions  Stochastic Network Problems: Massively Parallel Solution  Stochastic Programming: Minimax Approach  Stochastic Programming Models: Random Objective  Stochastic Programming: Nonanticipativity and Lagrange Multipliers  Stochastic Programming with Simple Integer Recourse  Stochastic Programs with Recourse: Upper Bounds  Stochastic Quasigradient Methods in Minimax Problems  Stochastic Vehicle Routing Problems  Two-stage Stochastic Programming: Quasigradient Method  Two-Stage Stochastic Programs with Recourse

References 1. Banaš J (1989) Integrable solutions of Hammerstein and Urysohn integral equations. J Austral Math Soc (Ser A) 46:61–68 2. Daniel JW (1971) The approximate minimization of functionals. Prentice-Hall, Englewood Cliffs 3. Esser H (1973) Zur Diskretisierung von Extremalproblemen. Lecture Notes Math, vol 333. Springer, Berlin, pp 69– 88 4. Garstka J, Wets RJ-B (1974) On decision rules in stochastic programming. Math Program 7:117–143 5. Hernandez-Lerma O, Runggaldier W (1994) Monotone approximations for convex stochastic control problems. J Math Syst, Estimation and Control 4:99–140 6. Ioffe AD, Tikhomirov VM (1979) Theory of extremal problems. North-Holland, Amsterdam 7. Kall P, Wallace SW (1994) Stochastic programming. Wiley, New York 8. Kibzun AI, Kan YS (1995) Stochastic programming problems with probability and quantile functions. Wiley, New York 9. Lepp R (1983) Stochastic approximation type algorithm for the maximization of the probability function. Proc Acad Sci Estonian SSR Phys Math 32:150–156

A

Approximation of Multivariate Probability Integrals

10. Malyshev VV, Kibzun AI (1987) Analysis and synthesis of high precision aircraft control. Mashinostroenie, Moscow, Moscow 11. Olsen P (1976) Discretization of multistage stochastic programming problems. Math Program Stud 6:111–124 12. Raik E (1972) On stochastic programming problem with probability and quantile functionals. Proc Acad Sci Estonian SSR Phys Math 21:142–148 13. Römisch W, Schultz R (1988) On distribution sensitivity in chance constrained programming. Math Res 45:161–168. Advances in Mathematical Optimization, In: Guddat J et al (eds) 14. Vainikko GM (1971) On convergence of the method of mechanical curvatures for integral equations with discontinuous kernels. Sibirsk Mat Zh 12:40–53 15. Vasin VV (1982) Discrete approximation and stability in extremal problems. USSR Comput Math Math Phys 22:57–74

Approximation of Multivariate Probability Integrals

of the probability integral is multidimensional interval, then the problem reduces to the approximation of multivariate probability distribution function values.

Lower and Upper Bounds Let  | = ( 1 , . . . ,  n ) be a random vector with given multivariate probability distribution. Introducing the events A1 D f1 < x1 g; : : : ; A n D fn < x n g; where x1 , . . . , xn are arbitrary real values the multivariate probability distribution function of the random vector  can be expressed in the following way: F(x1 ; : : : ; x n ) D P(1 < x1 ; : : : ; n < x n ) D P(A1 \    \ A n )

TAMÁS SZÁNTAI Technical University, Budapest, Hungary

D 1  P(A1 [    [ An ) n

MSC2000: 65C05, 65D30, 65Cxx, 65C30, 65C40, 65C50, 65C60, 90C15 Article Outline Keywords Lower and Upper Bounds Monte-Carlo Simulation Algorithm One- and Two-Dimensional Marginal Distribution Functions Examples Remarks See also References

Keywords Boole–Bonferroni bounds; Hunter–Worsley bounds; Approximation; Probability integrals; Variance reduction; Probabilistic constrained stochastic programming Approximation of multivariate probability integrals is a hard problem in general. However, if the domain

D 1  S 1 C S 2     C (1)n S ; where Ai D f i  x i g;

i D 1; : : : ; n;

and Sk D

X

P(Ai 1 \    \ Ai k ); k D 1; : : : ; n :

1i 1 1 then one has F(z1 ; : : : ; z n ) D 1  n C

Example 3 n X

n D 15; Fi (z i ):

iD1

b) If z1 + z2 + z3 > 1 then one has F(z1 ; : : : ; z n ) n X X n1 (n2) D Fi (z i )C Fi j (z i ; z j ): 2 iD1 1i< jn

Here F i (zi ) and F ij (zi , zj ) are the one- and twodimensional marginal probability distribution functions. This theorem was formulated and proved by Szántai in [13]. It also can be found in [11]. Examples For illustrating the lower and upper bounds on the multivariate normal probability distribution function value and the efficiency of the variance reduction technique described before one can regard the following examples. Example 2 n D 10; x1 D 1:7;

x2 D 0:8;

x3 D 5:1;

x4 D 3:2;

x5 D 2:4;

x6 D 1:8;

x7 D 2:7;

x8 D 1:5;

x9 D 1:2;

x10 D 2:6; r i j D 0:0;

i D 2; : : : ; 10;

j D 1; : : : ; i  1;

except r21 =  0.6, r43 = 0.9, r65 = 0.4, r87 = 0.2, r10, 9 =  0.8. Number of trials: 10000.

Lower bound by S1, S2 Lower bound by Hunter Upper bound by S1, S2 Estimated value Standard deviation Time in seconds (PC-586) Efficiency

0:524736 0:563719 0:588646 0:582743 0:000608 0:77 65:73

x1 D 2:9;

x2 D 2:9;

x3 D 2:9;

x4 D 2:9;

x5 D 2:9;

x6 D 2:9;

x7 D 2:9;

x8 D 2:9;

x9 D 2:9;

x10 D 2:9;

x11 D 2:9;

x12 D 2:7

x13 D 1:6;

x14 D 1:2;

x15 D 2:1;

r i j D 0:2;

i D 2; : : : ; 10;

r i j D 0:0;

i D 11; : : : ; 15;

j D 1; : : : ; i  1; j D 1; : : : ; i  1

except r13, 12 = 0.3, r15, 14 =  0.95. Number of trials = 10000. Lower bound by S1, S2 Lower bound by Hunter Upper bound by S1, S2 Estimated value Standard deviation Time in seconds (PC-586) Efficiency

0:790073 0:798730 0:801745 0:801304 0:000193 1:38 417:84

Both of the above examples are taken from [2, Exam. 4; 6] and they are according to standard multivariate normal probability distributions, i. e. all components of the normally distributed random vector have expected value zero and variance one. The efficiency of the Monte-Carlo simulation algorithm was calculated according to the crude Monte-Carlo algorithm in the usual way, i. e. it equals to the fraction (t 0  20 )/(t 1  21 ) where t 0 , t 1 are the calculation times and  20 ,  21 are the variances of the crude and the compared simulation algorithms. Remarks In many applications one may need finding the gradient of multivariate distribution functions, too. As one has the general formula @F(z1 ; : : : ; z n ) @z i D F(z1 ; : : : ; z i1 ; z iC1 ; : : : ; z n jz i )  f i (z i ); where F(z1 , . . . , zi1 , zi+ 1 , . . . , zn | zi ) is the conditional probability distribution function of the random variables  1 , . . . ,  i 1 ,  i+ 1 , . . . ,  n , given that  i = zi , and

Approximation of Multivariate Probability Integrals

f i (z) is the probability density function of the random variable  i , finding the gradient of a multivariate probability distribution function can be reduced to finding conditional distribution functions. In the cases of multivariate normal and Dirichlet distributions the conditional distributions are also multivariate normal and Dirichlet, and in the case of multivariate gamma distribution they are different and more complicated as it was obtained by Prékopa and Szántai [12]. In the case of multivariate normal probability distribution I. Deák [2] proposed another simulation technique which proved to be as efficient as the method described here. The main advantage of Deák’s method is that it easily can be generalized for calculation the probability content of more general sets in the multidimensional space, like convex polyhedrons, hyperellipsoids, circular cones, etc. Its main drawback is that it works only for the multivariate normal probability distribution. The methods of Szántai and Deák have been combined by H. Gassmann to compute the probability of an n-dimensional rectangle in the case of multivariate normal distribution (see [3]). Also in the case of multivariate normal probability distribution A. Genz proposed the transformation of the original integration region to the unit hypercube [0, 1]n and then the application of a crude Monte-Carlo method or some lattice rules for the numerical integration of the resulting multidimensional integral. A comparison of methods for the computation of multivariate normal probabilities can be found in [4]. When the three-dimensional marginal probability distribution function values are also calculated by numerical integration there exist some new, sharper bounds. See [16] for these bounds and their effect on the efficiency of the Monte-Carlo simulation algorithm. Approximation of multivariate probability integrals has a central role in probabilistic constrained stochastic programming when the probabilistic constraints are joint. The computer code PCSP (probabilistic constrained stochastic programming) originally was developed for handling the multivariate normal probability distributions in this framework (see [15]). A new version of the code now can handle multivariate gamma and Dirichlet distributions as well. The calculation procedures of this paper also has been applied by J. Mayer in his code solving this type of stochastic programming problems by reduced gradient algorithm (see [10]).

A

These codes have been integrated by P. Kall and Mayer into a more advanced computer system for modeling in stochastic linear programming (see [7]). See also  Approximation of Extremum Problems with Probability Functionals  Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points  Extremum Problems with Probability Functions: Kernel Type Solution Methods  General Moment Optimization Problems  Logconcave Measures, Logconvexity  Logconcavity of Discrete Distributions  L-shaped Method for Two-stage Stochastic Programs with Recourse  Multistage Stochastic Programming: Barycentric Approximation  Preprocessing in Stochastic Programming  Probabilistic Constrained Linear Programming: Duality Theory  Probabilistic Constrained Problems: Convexity Theory  Simple Recourse Problem: Dual Method  Simple Recourse Problem: Primal Method  Stabilization of Cutting Plane Algorithms for Stochastic Linear Programming Problems  Static Stochastic Programming Models  Static Stochastic Programming Models: Conditional Expectations  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Stochastic Linear Programming: Decomposition and Cutting Planes  Stochastic Linear Programs with Recourse and Arbitrary Multivariate Distributions  Stochastic Network Problems: Massively Parallel Solution  Stochastic Programming: Minimax Approach  Stochastic Programming Models: Random Objective  Stochastic Programming: Nonanticipativity and Lagrange Multipliers  Stochastic Programming with Simple Integer Recourse  Stochastic Programs with Recourse: Upper Bounds

89

90

A

Approximations to Robust Conic Optimization Problems

 Stochastic Quasigradient Methods in Minimax Problems  Stochastic Vehicle Routing Problems  Two-stage Stochastic Programming: Quasigradient Method  Two-stage Stochastic Programs with Recourse

functions. Ann Oper Res (to appear) Special Issue: Research in Stochastic Programming (Selected refereed papers from the VII Internat. Conf. Stochastic Programming, Aug. 10– 14, Univ. British Columbia, Vancouver, Canada). 17. Takács L (1955) On the general probability theorem. Comm Dept Math Physics Hungarian Acad Sci 5:467–476 (In Hungarian.) 18. Worsley KJ (1982) An improved Bonferroni inequality and applications. Biometrika 69:297–302

References 1. Bonferroni CE (1937) Teoria statistica delle classi e calcolo delle probabilita. Volume in onordi Riccardo Dalla Volta: 1–62 2. Deák I (1980) Three digit accurate multiple normal probabilities. Numerische Math 35:369–380 3. Gassmann H (1988) Conditional probability and conditional expectation of a random vector. In: Ermoliev Y, Wets RJ-B (eds) Numerical Techniques for Stochastic Optimization. Springer, Berlin, pp 237–254 4. Genz A (1993) Comparison of methods for the computation of multivariate normal probabilities. Computing Sci and Statist 25:400–405 5. Hunter D (1976) Bounds for the probability of a union. J Appl Probab 13:597–603 6. IMSL (1977) Library 1 reference manual. Internat. Math. Statist. Library 7. Kall P, Mayer J (1995) Computer support for modeling in stochastic linear programming. In: Marti K, Kall P (eds) Stochastic Programming: Numerical Methods and Techn. Applications. Springer, Berlin, pp 54–70 8. Kennedy WJ Jr, Gentle JE (1980) Statistical computing. M. Dekker, New York 9. Kruskal JB (1956) On the shortest spanning subtree of a graph and the travelling salesman problem. Proc Amer Math Soc 7:48–50 10. Mayer J (1988) Probabilistic constrained programming: A reduced gradient algorithm implemented on PC. Working Papers IIASA WP-88-39 11. Prékopa A (1995) Stochastic programming. Akad. Kiadó and Kluwer, Budapest–Dordrecht 12. Prékopa A, Szántai T (1978) A new multivariate gamma distribution and its fitting to empirical streamflow data. Water Resources Res 14:19–24 13. Szántai T (1985) Numerical evaluation of probabilities concerning multivariate probability distributions. Thesis Candidate Degree Hungarian Acad Sci (in Hungarian) 14. Szántai T (1986) Evaluation of a special multivariate gamma distribution function. Math Program Stud 27:1–16 15. Szántai T (1988) A computer code for solution of probabilistic-constrained stochastic programming problems. In: Ermoliev Y, Wets RJ-B (eds) Numerical Techniques for Stochastic Optimization. Springer, Berlin, pp 229–235 16. Szántai T: Improved bounds and simulation procedures on the value of multivariate normal probability distribution

Approximations to Robust Conic Optimization Problems MELVYN SIM NUS Business School, National University of Singapore, Singapore, Republic of Singapore

Article Outline Introduction Formulation Affine Data Dependency Tractable Approximations of a Conic Chance Constrained Problem

References Introduction We consider a general conic optimization problem under parameter uncertainty is as follows: max s.t.

c0 x n P

˜ 2K ˜ jxj  B A

(1)

jD1

x2X; where the cone K is a regular cone, i. e., a closed, convex and pointed cone. The space of the data ˜ n ; B) ˜ depends on the cone, K. The most ˜ 1; : : : ; A (A m common cone is the cone of non-negative orthant, 0 : (11)

bounded as   1 a1   Zd;n CO nd2 nd1   1 a1 de d1 e ;  CO nd2 nd1

n ! 1 ; (14)

which immediately yields the rate of convergence to  as n approaches infinity: zero for Zd;n

sD1

Then, for any integer m  1, lower and upper bounds   of the Z d;n ; Z d;n (4) on the expected optimal cost Z d;n MAP can be asymptotically evaluated as Z d;n

  n ( C 1) s C 1   Dan C as   C s C 1 sD1  !  ( C 1) m n;d

C1   ;  ! 1 ; CO n m  C C1 m1 X

(12a)  Z d;n

  n ( C 1) s C ˛   Dan C as  (˛)  C s C 1 sD1  !  ( C 1) m n;d

C˛   ;  ! 1 : CO n m  (˛)  C C 1 m1 X

(12b) It can be shown that the lower and upper bounds de fined by (12a, 12b) are convergent, i. e., jZ d;n  Z d;n j ! n;d

0;  ! 1; whereas the corresponding asymptotical bounds for the case of distributions with support unbounded from below may be divergent in the sense that 

n;d

jZ d;n  Z d;n j ¹ 0 when  ! 1. The asymptotical representations (12a, 12b) for the  bounds Z d;n and Z d;n are simplified when the inverse F 1 of the c.d.f. of the distribution has a regular power series expansion in the vicinity of zero. Assume, for example, that function F 1 can be written as F 1 (u) D a1 u C O(u 2 );

u ! 0+ :

(13)

It is then easy to see that for n  1 and d fixed the expected optimal value of the MAP is asymptotically

Corollary 2. Consider a d  3; n  3 MAP (1) with cost coefficients that are iid random variables from an absolutely continuous distribution with existing first moment. Let the inverse F 1 of the c.d.f. of the distribution satisfy (13). Then, for a fixed d and n ! 1 the ex of the MAP converges to zero pected optimal value Zd;n  (d2)  . as O n For example, the expected optimal value of 3-dimensional (d D 3) MAP with uniform U(0; 1) or exponential distributions converges to zero as O(n1 ) when n ! 1. We illustrate the tightness of the developed bounds (12a, 12b) by comparing them to the computed expected optimal values of MAPs with coefficients c i 1 i d drawn from the uniform U(0; 1) distribution and exponential distribution with mean 1. It is elementary that the inverse functions F 1 () of the c.d.f.’s for both these distributions are representable in form (13) with a1 D 1. The numerical experiments involved solving multiple instances of randomly generated MAPs with the number of dimensions d ranging from 3 to 10, and the number n of elements in each dimension running from 3 to 20. The number of instances generated for estimation of the expected optimal value of the MAP with a given distribution of cost coefficients varied from 1000 (for smaller values of d and n) to 50 (for problems with largest n and d). To solve the problems to optimality, we used a branch-and-bound algorithm that navigated through the index tree representation of the MAP. Figures 1 and 2 display the obtained expected optimal values of MAP with uniform and exponential iid cost coefficients when d is fixed at d D 3 or 5 and n D 3; : : : ; 20, and when n D 3 or 5 and d runs from 3 to 10. This “asymmetry” in reporting of the results is explained by

Asymptotic Properties of Random Multidimensional Assignment Problem

A

Asymptotic Properties of Random Multidimensional Assignment Problem, Figure 1  , lower and upper bounds Z  ; Z  of an MAP with fixed d D 3 (left) and d D 5 (right) for uniExpected optimal value Zd;n d;n d;n form U(0; 1) and exponential (1) distributions

Asymptotic Properties of Random Multidimensional Assignment Problem, Figure 2  , lower and upper bounds Z  ; Z  of an MAP with fixed n D 3 (left) and n D 5 (right) for uniform Expected optimal value Zd;n d;n d;n U(0; 1) and exponential(1) distributions

the fact that the implemented branch-and-bound algorithm based on index tree is more efficient in solving “shallow” MAPs, i. e., instances that have larger n and smaller d. The solution times varied from several seconds to 20 hours on a 2GHz PC. The conducted numerical experiments suggest that the constructed lower and upper bounds for the expected optimal cost of random MAPs are quite tight,  with the upper bound Z d;n being tighter for the case of fixed n and large d (see Figs. 1, 2).

Expected Number of Local Minima in Random MAP Local Minima and p-exchange Neighborhoods in MAP As it has been mentioned in the Introduction, we consider local minima of a MAP with respect to a local neighborhood, in the sense of [15]. For any p D 2; : : : ; n, we define the p-exchange local neighborhood N p (i) of the ith feasible solu-

119

120

A

Asymptotic Properties of Random Multidimensional Assignment Problem

tion fi1(1)    id(1) ; : : : ; i1(n)    id(n) g of the MAP (1) as the set of solutions obtained from i by permuting p or less elements in one of the dimensions 1; : : : ; d. More formally, N p (i) is the set of n-tuples (1) (n) (n) (1) (n) f j(1) 1    j d ; : : : ; j 1    j d g such that f j k ; : : : ; j k g is a permutation of f1; : : : ; ng for all 1  k  d, and, furthermore, there exists only one k0 2 f1; : : : ; dg such that 2

n X

ı¯i (r) j(r)  p;

while

k0 k0

rD1

n X rD1

for all

ı¯i (r) j(r) D 0 k

k

k 2 f1; : : : ; dgnk0 ; (15)

where ı¯i j is the negation of the Kroneker delta, ı¯i j D 1  ı i j . As an example, consider the following feasible solution to a d D 3, n D 3 MAP: f111; 222; 333g. Then, one of its 2-exchange neighbors is f111; 322; 233g, another one is {131, 222, 313}; a 3-exchange neighbor is given by {311, 122, 233}, etc. Evidently, one has N p  N pC1 for p D 2; : : : ; n  1. Proposition 1. For any p D 2; : : : ; n, the size jN p j of the p-exchange local neighborhood of a feasible solution of a MAP (1) is equal to jN p j D d

p X kD2

D(k) D

 D(n) D n! 1 

1 1!

C

1 2!



1 3!

CC



n! ; e n1:

(1) n n!



The definition of a local minimum with respect to the p-exchange neighborhood is then straightforward. The kth feasible solution with cost z k is a p-exchange local minimum iff z k  z j for all j 2 N p (k). Continuing the example above, the solution f111; 222; 333g is a 2-exchange local minimum iff its cost z1 D c111 C c222 C c333 is less than or equal to costs of all of its 2-exchange neighbors. The number M p of local minima of the MAP is obtained by counting the feasible solutions that are local minima with respect to neighborhoods N p . In a random MAP, where the assignment costs are random variables, M p becomes a random quantity itself. In this paper we are interested in determining the expected number E[M p ] of local minima in random MAPs that have iid assignment costs with continuous distribution. Expected Number of Local Minima in MAP with n = 2

! n D(k) ; k

where

p, jN p j is either polynomial or exponential in the number of elements n per dimension, as follows from the representation

k X jD0

!

(1) k j

k j! : j

(16)

The quantity D(k) in (16) is known as the number of derangements of a k-element set [29], i. e., the number of permutations f1; 2; : : : ; kg 7! fi (1) ; i (2) ; : : : ; i (k) g such that i (1) ¤ 1; : : : ; i (k) ¤ k, and can be easily calculated by means of the recurrent relation (see [29]) D(k) D kD(k  1) C (1) k ;

D(1) D 0 ;

so that, for example, D(2) D 1, D(3) D 2, D(4) D 9, and so on. Then, according to Proposition  the size of  n 1, of a 2-exchange neighborhood is jN2 j D d 2, the  size   a 3-exchange neighborhood is jN3 j D d n2 C 2 n3 , etc. Note also that size of the p-exchange neighborhood is linear in the number of dimensions d. Depending on

As it was noted above, in the special case of random MAP with n D 2, d  3, the costs of feasible solutions are iid random variables with distribution F F, where F is the distribution of the assignment costs. This special structure of the feasible set allows for a closed-form expression for the expected number of local minima E[M] (note that in a n D 2 MAP the largest local neighborhood is N2 , thus M D M2 ), as established in [11]. Theorem 2. In a n D 2, d  3 MAP with cost coefficients that are iid continuous random variables, the expected number of local minima is given by E[M] D

2d1 : dC1

(17)

Equality (17) implies that in a n D 2; d  3 MAP the number of local minima E[M] is exponential in d, when the cost coefficients are independently drawn from any continuous distribution.

Asymptotic Properties of Random Multidimensional Assignment Problem

A

Expected Number of Local Minima in a Random MAP with Normally Distributed Costs

Vector Z has a normal distribution N(0; ˙) with the covariance matrix ˙ defined as

Our ability to derive a closed-form expression (17) for the expected number of local minima E[M] in the previous section has relied on the independence of feasible solution costs (2) in a n D 2 MAP. As it is easy to verify directly, in the case n  3 the costs of feasible solutions are generally not independent. This complicates analysis significantly if an arbitrary continuous distribution for assignment costs c i 1 i d in (1) is assumed. However, as we show below, one can derive upper and lower bounds for E[M] in the case when the costs coefficients of (1) are independent normally distributed random variables. First, we develop bounds for the number of local minima E[M2 ] defined with respect to 2-exchange neighborhoods N2 that are most widely used in practice.

Cov(Zrsq ; Z i jk ) D 8 4 2 ; if i D r; j D s; q D k; ˆ ˆ < 2 2 ; if i D r; j D s; q ¤ k; ˆ  2 ; if (i D r; j ¤ s) or (i ¤ r; j D s) ; ˆ : 0; if i ¤ r; j ¤ s:

2-exchange Local Neighborhoods Noting that in the general case the number N of the feasible solutions to MAP (1) is equal to N D (n!)d1 , the expected number of local minima E[M2 ] with respect to local 2-exchange neighborhoods can be written in the form

E[M2 ] D

N h \ i X P zk  z j  0 ; kD1

(18)

j2N2 (k)

(21) While the value of F˙ (0) in (19) is difficult to compute exactly for large d and n, lower and upper bounds can be constructed using Slepian’s inequality [30]. To this end, we introduce covariance matrices ˙ D ( i j ) and ˙ D (¯ i j ) as 8 4 2 ; ˆ ˆ < 2 2 ; ij D ˆ ˆ : 0;

if i D j; if i ¤ j and (i  1) div d D ( j  1) div d otherwise

(22a)  ¯ i j D

4 2 ; if i D j; 2 2 ; otherwise

i h \ z k  z j  0 D F˙ (0) ; P

;

(22b)

so that  i j   i j   i j holds for all 1  i; j  jN2 j, with  i j being the components of the covariance matrix ˙ (21). Then, Slepian’s inequality claims that F˙ (0)  F˙ (0)  F˙ (0) ;

where N2 (k) is the 2-exchange neighborhood of the kth feasible solution, and z i is the cost of the ith feasible solution, i D 1; : : : ; N: If we allow the nd cost coefficients c i 1 i d of the MAP to be independent standard normal N(;  2 ) random variables, then the probability term in (18) can be expressed as

;

(23)

where F˙ (0) and F˙ (0) are c.d.f.’s of random variables X˙  N(0; ˙) and X˙  N(0; ˙) respectively. The structure of matrices ˙ and ˙ allows the corresponding values F˙ (0) and F˙ (0) to be computed in a closed form, which leads to the following bounds for the expected number of local minima in random MAP with iid normal coefficients:

(19)

Theorem 3. In a n  3; d  3 MAP with iid normal cost coefficients, the expected number of 2-exchange local minima is bounded as

where F˙ is the c.d.f. of the jN2 j-dimensional random vector

2(n!)d1 (n!)d1 : (24)  E[M ]  2 (d C 1)n(n1)/2 n(n  1)d C 2

j2N2 (k)

Z D (Z121 ; : : : ; Z12d ; Z131 ; : : : ; Z13d ;   

    ; Zrs1 ; : : : ; Zrsd ;    ; Z n1;n;1 ; : : : ; Z n1;n;d ; r Ax  b > x s.t.

x 2 Rn ;

where A is an n × n positive definite symmetric matrix, and b 2 Rn . A gradient iteration of the form x :D (I   A)x C  b will be convergent provided that the maximum row sum of I   A is less than 1, i. e.: ˇ X ˇ ˇ ˇ ˇ1   ˛ i j ˇ C  ˇ a i j ˇ < 1; i D 1; : : : ; n; j: j¤i

implying the diagonal dominance condition: Xˇ ˇ ˇa i j ˇ ; 8i: ai j > j: j¤i

If we consider the general nonlinear unconstrained optimization problem: ( min g(x) s.t.

x 2 Rn ;

where g: Rn ! R is a twice-differentiable convex function, with Hessian matrix r 2 g(x) which is positive definite. If one considers a Newton mapping given by: f (x) D x  [r 2 g(x)]1 r g(x) The norm k x k = maxi |xi | makes f a contraction mapping in the neighborhood of x? (the optimal point). Extensions of the ordinary gradient method f (x) D x  ˛r g(x)

125

126

A

Asynchronous Distributed Optimization Algorithms

are also discussed in [5]. The shortest path problem is defined in terms of a directed graph consisting of n nodes. We denote by A(i) the set of all nodes j for which there is an outgoing arc (i, j) from node i. The problem is to find a path of minimum length starting at node i and ending at node j. [4] considered the application of the asynchronous convergence theorem to fixed point iterations involving monotone mappings by considering the Bellman–Ford algorithm, [3], applied to the shortest path problem. This takes the form: x i (t C 1) D min (a i j C x j ( ji (t)); j2A(i)

i D 2; : : : ; n;

t 2 Ti;

x1 (t C 1) D 0: A(i) is the set of all nodes j for which there exists an arc (i, j). Linear network flow problems are discussed in [8] and asynchronous distributed versions of the auction algorithm are discussed. In the general linear network flow problem we are given a set of N nodes and a set of arcs A, each arc (i, j) has associated with it an integer aij , referred to as the cot coefficient. The problem is to optimally assign flows, f ij to each one of the arcs, and the problem is represented mathematically as follows: 8 X ˆ min ai j fi j ˆ ˆ ˆ ˆ (i; j)2A < X X s.t. fi j  f ji D s i ; 8i 2 N; ˆ ˆ ˆ j:(i; j)2A j:( j;i) ˆ ˆ : b i j  f i j  c i j ; 8(i; j) 2 A; where aij , bij , cij and si are integers. Extensions of the sequential auction algorithms are discussed in [6], in which asynchronism manifests itself in the sense that certain processors may be calculating actions bids which other update object prices. [7] extended the analysis to cover certain classes of nonlinear network flow problems in which the costs aij are functions of the flows f ij : 8 X ˆ min ai j ( fi j) ˆ ˆ ˆ ˆ (i; j)2A < X X s.t. fi j  f ji D s i ; 8i 2 N; ˆ ˆ ˆ j:(i; j)2A j:( j;i) ˆ ˆ : b i j  f i j  c i j ; 8(i; j) 2 A: Imposing additional reasonable assumptions to the general framework of totally asynchronous iterative algorithms can substantially increase the applicability of

the concept. A natural extension is therefore the partially asynchronous iterative methods, whereby two major assumptions are be satisfied: a) each processor performs an update at least once during any time interval of length B; b) the information used by any processor is outdated by at most B time units. In other words, the partial asynchronism assumption extends the original model of computation by stating that: There exists a positive integer B such that:  For every i and for every t  0, at least one of the elements of the set {t, . . . , t + B  1} belongs to T i .  There holds: t  B   ji (t)  t; for all i and j, and all t  0 belonging to T i .  There holds  ii (t) = t for all i and t 2 T i . [17] developed a very elegant framework with important implications on the asynchronous minimization of continuous functions. It was established that, while minimize function F(x), the asynchronous implementation of a gradient-based algorithm: x :D x   rF(X) is convergent if and only if the stepsize  is small compared to the inverse of the asynchronism measure B. Specifically, let F: Rn ! R be a cost function to be minimized subject to no constraints. It will be further assumed that: 1) F(x) > 0, 8x 2 Rn ; 2) F(x) is Lipschitz continuous: krF(x)  rF(y)k  K1 kx  yk ; 8x; y; 2 Rn : The asynchronous gradient algorithm of the synchronous iteration: x :D x   rF(x) is denoted by: x i (t C 1) :D x i (t)   s i (t);

i D 1; : : : ; n;

Asynchronous Distributed Optimization Algorithms

where  is a positive stepsize, and si (t) is the update direction. It will be assumed that s i (t) D 0;

8t … T i :

It is important to realize that processor i at time time t has knowledge of a vector xi (t) that is a, possibly, outdated version of x(t). In other words: xi (t) = ((x1 ( 1i (t)), . . . , xn ( ni (t))). It is further assumed that when xi is being updated, the update direction si is a descent direction: For every i and t: s i (t)r i F(x i (t))  0 there exists positive constants K 2 , K 3 such that ˇ ˇ ˇ ˇ K1 ˇr i F(x i (t))ˇ  js i (t)j  K3 ˇr i F(x i (t))ˇ ; 8t 2 T i ;

8i:

If all of the above is satisfied, then for the asynchronous gradient iteration it can be shown that: There exists some  0 , depending on n, B, K 1 , K 3 , such that if 0 <  <  0 , then limt ! 1  F(x(t)) = 0. It can actually be further shown that the choice D

1 K3 K1 (1 C B C nB)

can guarantee convergence of the asynchronous algorithm. This results clearly states that one can always, in principle, identify an adequate stepsize for any finite delay. Furthermore, [14] elaborated on the use of gradient projection algorithm, within the asynchronous iterative framework, for addressing certain classes of constraint nonlinear optimization problems. The constrained optimization problems considered, is that of minimizing a convex function F: Rn ! R, defined over the space Q X = niD1 X i of lower-dimensional sets X i  Rn i , and Pm iD1 ni = n. The ith component of the solution vector is now updated by x i (t C 1) D [x i (t)   r i F(x i (t))]C where []+ denotes the projection on the set X i . Once again: xi (t + 1) = xi (t), t 62 T i . Once again, a gradient based algorithm is defined, for which 8  1 ˆ i C ˆ ˆ <  [x i (t)   r i F(x (t))]  x i (t) ; s i (t) D t 2 Ti; ˆ ˆ ˆ :0 t … T i :

A

It can actually be shown that for, provided that the partial asynchronism assumption holds, one can always define, in principle, a suitable stepsize  0 such that for any 0 <  < 0 the limit point, x , of the sequence generated by the partially asynchronous gradient projection iteration minimizes the Lipschitz continuous, convex function F over the set X. Recently, [2], analyzed asynchronous algorithms for minimizing a function when the communication delays among processors are assumed to be stochastic with Markovian character. The approach is also based on a gradient projection algorithm and was used to address a an optimal routing problem. A major consideration in asynchronous distributed computing is the fact that since no globally controlling mechanism exists makes the use of any termination criterion which is based on local information obsolete. Clearly, when executing asynchronously a distributed iteration of the form xi  f i (x) local error estimates can, and will be, misleading in terms of the global state of the system. Recently [13] made several suggestions as to how the standard model can be supplemented with an additional interprocessor communication protocol so as to address the issue of finite termination of asynchronous iterative algorithms. See also  Automatic Differentiation: Parallel Computation  Heuristic Search  Interval Analysis: Parallel Methods for Global Optimization  Load Balancing for Parallel Optimization Techniques  Parallel Computing: Complexity Classes  Parallel Computing: Models  Parallel Heuristic Search  Stochastic Network Problems: Massively Parallel Solution References 1. Baudet GM (1978) Asynchronous iterative methods for multiprocessors. J ACM 25:226–244 2. Beidas BF, Papavassilopoulos GP (1995) Distributed asynchronous algorithms with stochastic delays for constrained optimization problems with conditions of time drift. Parallel Comput 21:1431–1450

127

128

A

Auction Algorithms

3. Bellman R (1957) Dynamic programming. Princeton University Press, Princeton 4. Bertsekas DP (1982) Distributed dynamic programming. IEEE Trans Autom Control AC-27:610–616 5. Bertsekas DP (1983) Distributed asynchronous computation of fixed points. Math Program 27:107–120 6. Bertsekas DP, Eckstein J (1987) Distributed asynchronous relaxation methodfs for linear network flow problems. Proc IFAC:39–56 7. Bertsekas DP, El Baz D (1987) Distributed asynchronous relaxation methods for convex network flow problems. SIAM J Control Optim 25:74–85 8. Bertsekas DP, Tsitsiklis JN (1989) Parallel and distributed computation: Numerical methods. Prentice-Hall, Englewood Cliffs, NJ 9. Chazan D, Miranker W (1968) Chaotic reaxation. Linear Alg Appl 2:199–222 10. Ferreira A, Pardalos PM (eds) (1997) Solving combinatorial optimization problems in parallel. Springer, Berlin 11. Pardalos PM, Phillips AT, Rosen JB (eds) (1992) Topics in parallel computing in mathematical programming. Sci Press, Marrickville, Australia 12. Pardalos PM (ed) (1992) Advances in optimization and parallel computing. North-Holland, Amsterdam 13. Savari SA, Bertsekas DP (1996) Finite termination of asynchronous iterative algorithms. Parallel Comput 22:39–56 14. Tseng P (1991) On the rate of convergence of a partially asynchronous gradient projection algorithm. SIAM J Optim 1:603–619 15. Tsitsiklis JN (1987) On the stability of asynchronous iterative processes. Math Syst Theory 20:137–153 16. Tsitsiklis JN (1989) A comparison of Jacobi and Gauss– Seidel parallel iterations. Appl Math Lett 2:167–170 17. Tsitsiklis JN, Bertsekas DP, Athans M (1986) Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans Autom Control ac-31:803– 813 18. Tsitsiklis JN, Bertsekas DP, Athnas M (1986) Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans Autom Control AC-31:803– 812 19. Zenios AS (1994) Parallel numerical optimization: Current status and annotated bibliography. ORSA J Comput 1: 20–42

Auction Algorithms DIMITRI P. BERTSEKAS Labor. Information and Decision Systems, Massachusetts Institute Technol., Cambridge, USA MSC2000: 90C30, 90C35

Article Outline

Keywords
The Auction Process
Optimality Properties at Termination
Computational Aspects: ε-Scaling
Parallel and Asynchronous Implementation
Variations and Extensions
See also
References

Keywords

Linear programming; Optimization; Assignment problem; Transshipment problem

The auction algorithm is an intuitive method for solving the classical assignment problem. It substantially outperforms its main competitors for important types of problems, both in theory and in practice, and it is also naturally well suited for parallel computation. In this article, we will sketch the basic principles of the algorithm, explain its computational properties, and discuss its extensions to more general network flow problems. For a detailed presentation, see the survey paper [3] and the textbooks [2,4]. For an extensive computational study, see [8]. The algorithm was first proposed in the 1979 report [1].

In the classical assignment problem there are n persons and n objects that we have to match on a one-to-one basis. There is a benefit $a_{ij}$ for matching person i with object j, and we want to assign persons to objects so as to maximize the total benefit. Mathematically, we want to find a one-to-one assignment [a set of person–object pairs $(1, j_1), \dots, (n, j_n)$, such that the objects $j_1, \dots, j_n$ are all distinct] that maximizes the total benefit $\sum_{i=1}^{n} a_{i j_i}$.

The assignment problem is important in many practical contexts. The most obvious ones are resource allocation problems, such as assigning personnel to jobs, machines to tasks, and the like. There are also situations where the assignment problem appears as a subproblem in various methods for solving more complex problems. The assignment problem is also of great theoretical importance because, despite its simplicity, it embodies a fundamental linear programming structure.
The most important type of linear programming problems, the linear network flow problem, can be reduced to the assignment problem by means of a simple reformulation. Thus, any method for solving the assignment problem can be generalized to solve the linear network flow problem, and in fact this approach is particularly helpful in understanding the extension of auction algorithms to network flow problems that are more general than assignment.

The classical methods for assignment are based on iterative improvement of some cost function; for example a primal cost (as in primal simplex methods), or a dual cost (as in Hungarian-like methods, dual simplex methods, and relaxation methods). The auction algorithm departs significantly from the cost improvement idea; at any one iteration, it may deteriorate both the primal and the dual cost, although in the end it finds an optimal assignment. It is based on a notion of approximate optimality, called ε-complementary slackness, and while it implicitly tries to solve a dual problem, it actually attains a dual solution that is not quite optimal.

The Auction Process

To develop an intuitive understanding of the auction algorithm, it is helpful to introduce an economic equilibrium problem that turns out to be equivalent to the assignment problem. Let us consider the possibility of matching the n objects with the n persons through a market mechanism, viewing each person as an economic agent acting in his own best interest. Suppose that object j has a price $p_j$ and that the person who receives the object must pay the price $p_j$. Then, the (net) value of object j for person i is $a_{ij} - p_j$, and each person i would logically want to be assigned to an object $j_i$ with maximal value, that is, with

$$a_{i j_i} - p_{j_i} = \max_{j = 1, \dots, n} \{ a_{ij} - p_j \}. \tag{1}$$

We will say that a person i is 'happy' if this condition holds, and we will say that an assignment and a set of prices are at equilibrium when all persons are happy. Equilibrium assignments and prices are naturally of great interest to economists, but there is also a fundamental relation with the assignment problem; it turns out that an equilibrium assignment offers maximum total benefit (and thus solves the assignment problem), while the corresponding set of prices solves an associated dual optimization problem. This is a consequence of the celebrated duality theorem of linear programming.

Let us consider now a natural process for finding an equilibrium assignment. We will call this process the naive auction algorithm, because it has a serious flaw, as will be seen shortly. Nonetheless, this flaw will help motivate a more sophisticated and correct algorithm.

The naive auction algorithm proceeds in 'rounds' (or 'iterations'), starting with any assignment and any set of prices. There is an assignment and a set of prices at the beginning of each round, and if all persons are happy with these, the process terminates. Otherwise some person who is not happy is selected. This person, call him i, finds an object $j_i$ which offers maximal value, that is,

$$j_i \in \arg\max_{j = 1, \dots, n} \{ a_{ij} - p_j \}, \tag{2}$$

and then:
a) Exchanges objects with the person assigned to $j_i$ at the beginning of the round;
b) Sets the price of the best object $j_i$ to the level at which he is indifferent between $j_i$ and the second best object, that is, he sets $p_{j_i}$ to

$$p_{j_i} + \gamma_i, \tag{3}$$

where

$$\gamma_i = v_i - w_i, \tag{4}$$

$v_i$ is the best object value,

$$v_i = \max_{j} \{ a_{ij} - p_j \}, \tag{5}$$

and $w_i$ is the second best object value,

$$w_i = \max_{j \neq j_i} \{ a_{ij} - p_j \}, \tag{6}$$

that is, the best value over objects other than $j_i$. (Note that $\gamma_i$ is the largest increment by which the best object price $p_{j_i}$ can be increased, with $j_i$ still being the best object for person i.) This process is repeated in a sequence of rounds until all persons are happy.

We may view this process as an auction, where at each round the bidder i raises the price of his or her preferred object by the bidding increment $\gamma_i$. Note that $\gamma_i$ cannot be negative since $v_i \ge w_i$ (compare (5) and (6)), so the object prices tend to increase. Just as in a real auction, bidding increments and price increases spur competition by making the bidder's own preferred object less attractive to other potential bidders.

Does this auction process work? Unfortunately, not always. The difficulty is that the bidding increment $\gamma_i$ is zero when more than one object offers maximum value for the bidder i (cf. (4) and (6)). As a result, a situation may be created where several persons contest a smaller number of equally desirable objects without raising their prices, thereby creating a never ending cycle. To break such cycles, we introduce a perturbation mechanism, motivated by real auctions where each bid for an object must raise its price by a minimum positive increment, and bidders must on occasion take risks to win their preferred objects. In particular, let us fix a positive scalar $\varepsilon$ and say that a person i is 'almost happy' with an assignment and a set of prices if the value of its assigned object $j_i$ is within $\varepsilon$ of being maximal, that is,

$$a_{i j_i} - p_{j_i} \ge \max_{j = 1, \dots, n} \{ a_{ij} - p_j \} - \varepsilon. \tag{7}$$

We will say that an assignment and a set of prices are almost at equilibrium when all persons are almost happy. The condition (7), introduced first in 1979 in conjunction with the auction algorithm, is known as ε-complementary slackness and plays a central role in several optimization contexts. For $\varepsilon = 0$ it reduces to ordinary complementary slackness (compare (1)). We now reformulate the previous auction process so that the bidding increment is always at least equal to $\varepsilon$. The resulting method, the auction algorithm, is the same as the naive auction algorithm, except that the bidding increment $\gamma_i$ is

$$\gamma_i = v_i - w_i + \varepsilon \tag{8}$$

(rather than $\gamma_i = v_i - w_i$ as in (4)). With this choice, the bidder of a round is almost happy at the end of the round (rather than happy). The particular increment $\gamma_i = v_i - w_i + \varepsilon$ used in the auction algorithm is the maximum amount with this property. Smaller increments $\gamma_i$ would also work as long as $\gamma_i \ge \varepsilon$, but using the largest possible increment accelerates the algorithm. This is consistent with experience from real auctions,

which tend to terminate faster when the bidding is aggressive.

We can now show that this reformulated auction process terminates in a finite number of rounds, necessarily with an assignment and a set of prices that are almost at equilibrium. To see this, note that once an object receives a bid for the first time, the person assigned to the object at every subsequent round is almost happy; the reason is that a person is almost happy just after acquiring an object through a bid, and continues to be almost happy as long as he holds the object (since the other object prices cannot decrease in the course of the algorithm). Therefore, the persons that are not almost happy must be assigned to objects that have never received a bid. In particular, once each object receives at least one bid, the algorithm must terminate. Next note that if an object receives a bid in m rounds, its price must exceed its initial price by at least $m\varepsilon$. Thus, for sufficiently large m, the object will become 'expensive' enough to be judged 'inferior' to some object that has not received a bid so far. It follows that only for a limited number of rounds can an object receive a bid while some other object still has not received any bid. Therefore, there are two possibilities: either
a) the auction terminates in a finite number of rounds, with all persons almost happy, before every object receives a bid; or
b) the auction continues until, after a finite number of rounds, all objects receive at least one bid, at which time the auction terminates.
(This argument assumes that any person can bid for any object, but it can be generalized for the case where the set of feasible person–object pairs is limited, as long as at least one feasible assignment exists.)

Optimality Properties at Termination

When the auction algorithm terminates, we have an assignment that is almost at equilibrium, but does this assignment maximize the total benefit? The answer here depends strongly on the size of $\varepsilon$. In a real auction, a prudent bidder would not place an excessively high bid for fear that he might win the object at an unnecessarily high price. Consistent with this intuition, we can show that if $\varepsilon$ is small, then the final assignment will be 'almost optimal'. In particular, we can show that the total benefit of the final assignment is within $n\varepsilon$ of being optimal. To see this, note that an assignment and a set of prices that are almost at equilibrium may be viewed as being at equilibrium for a slightly different problem, where all benefits $a_{ij}$ are the same as before, except for the n benefits of the assigned pairs, which are modified by an amount no more than $\varepsilon$. Suppose now that the benefits $a_{ij}$ are all integer, which is the typical practical case (if $a_{ij}$ are rational numbers, they can be scaled up to integer by multiplication with a suitable common number). Then, the total benefit of any assignment is integer, so if $n\varepsilon < 1$, a complete assignment that is within $n\varepsilon$ of being optimal must be optimal. It follows that if $\varepsilon < 1/n$ and the benefits are integer, the auction algorithm terminates with an optimal assignment.
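To make the bidding mechanics concrete, here is a minimal sketch of the auction process in Python for a dense n × n benefit matrix. The function name, the initially empty starting assignment, and the dense data layout are illustrative choices, not from the original article:

import math

def eps_auction(a, eps):
    # Maximize sum_i a[i][j_i] over one-to-one assignments.
    # a: n x n benefit matrix; eps: positive scalar. With integer
    # benefits and eps < 1/n the final assignment is optimal.
    n = len(a)
    prices = [0.0] * n            # p_j
    owner = [None] * n            # person currently holding object j
    assigned = [None] * n         # object currently held by person i
    unhappy = list(range(n))      # persons without an object

    while unhappy:
        i = unhappy.pop()
        values = [a[i][j] - prices[j] for j in range(n)]
        ji = max(range(n), key=values.__getitem__)
        vi = values[ji]                                   # best value
        wi = max((values[j] for j in range(n) if j != ji),
                 default=vi)                              # second best
        prices[ji] += vi - wi + eps   # bid: gamma_i = v_i - w_i + eps
        if owner[ji] is not None:     # displace the previous owner
            assigned[owner[ji]] = None
            unhappy.append(owner[ji])
        owner[ji], assigned[i] = i, ji
    return assigned, prices

For example, eps_auction([[1, 2], [3, 4]], eps=0.4) returns an optimal assignment for this 2 × 2 instance, in line with the termination argument above.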
$(I - L)^{\top} \bar X = e_{M+m}$, then

$$\frac{\partial s_m}{\partial x_i} = \bar X_i.$$

So both the calculation of the Jacobian by the forward and the backward method are equivalent to solving a very sparse set of equations. If the Wengert list is used, each row of L contains at most two nonzeros. It has therefore been suggested that methods for solving linear equations with sparse matrices could be used to calculate J. A. Griewank and S. Reese [14] suggested using the Markowitz rule, while U. Geitner, J. Utke and Griewank [11] applied the method of Newsam and Ramsdell.
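As a small numerical illustration of the forward variant, the sketch below (Python/NumPy; the example function and all names are illustrative, not from the article) builds L for a short Wengert list and recovers a directional derivative by solving the sparse triangular system:

import numpy as np

# Wengert list for s = F(x1, x2) = x1*x2 + sin(x1), with
# X1 = x1, X2 = x2, X3 = X1*X2, X4 = sin(X1), X5 = X3 + X4.
# L[k, m] holds the local partial d(phi_k)/d(X_m) at the current point.
x1, x2 = 0.5, 2.0
L = np.zeros((5, 5))
L[2, 0], L[2, 1] = x2, x1          # derivatives of X1*X2
L[3, 0] = np.cos(x1)               # derivative of sin(X1)
L[4, 2], L[4, 3] = 1.0, 1.0        # derivatives of X3 + X4

# Forward mode: solve (I - L) Y = p', with p' = (p, 0, ..., 0);
# the directional derivative appears in the last entries of Y.
p = np.array([1.0, 0.0])           # differentiate with respect to x1
p_full = np.concatenate([p, np.zeros(3)])
Y = np.linalg.solve(np.eye(5) - L, p_full)
print(Y[-1], x2 + np.cos(x1))      # both print ds/dx1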


Hessian Calculations

The calculation of the Hessian, as discussed in Automatic Differentiation: Calculation of the Hessian, can also be formulated as a sparse matrix calculation. Using the notation of that article, if the calculation of f(x) consists of

$$X_k = \varphi_k(X_m;\; m < k,\; m \in M_k), \qquad k = 1, \dots, M,$$

then the reverse gradient calculation consists of

$$\bar X_m = \bar X_m + \bar X_k \frac{\partial \varphi_k}{\partial X_m}, \qquad m \in M_k.$$

If now we denote

$$Y_k = \frac{\partial X_k}{\partial x_i}, \qquad k = 1, \dots, M,$$

and

$$Y_k = \frac{\partial \bar X_{2M-k+1}}{\partial x_i}, \qquad k = M + 1, \dots, 2M,$$

then we obtain

$$Y_k = \sum_{m \in M_k} \frac{\partial \varphi_k}{\partial X_m} Y_m, \qquad k = 1, \dots, M,$$

and

$$Y_{2M+1-m} = Y_{2M+1-m} + Y_{2M+1-k} \frac{\partial \varphi_k}{\partial X_m} + \bar X_k \sum_j \frac{\partial^2 \varphi_k}{\partial X_m \, \partial X_j} Y_j.$$

The second derivatives are 1 if $\varphi$ is a multiplication, 0 if $\varphi$ is an addition, and if $\varphi$ is unary they are nonzero only if j = m. If we denote these second order terms by B, the calculation of $H e_i$ is equivalent to solving

$$\begin{pmatrix} I - L & 0 \\ B & I - L^S \end{pmatrix} Y = \begin{pmatrix} e_i \\ 0 \end{pmatrix}.$$

Here the superscript S indicates that L has been transposed through both diagonals. The ith column of the Hessian is then the last n values of Y.

For the illustrative example

$$f(x) = (x_1 x_2 + \sin x_1 + 4)(3 x_2^2 + 6)$$

used in Automatic Differentiation: Calculation of the Hessian, the off-diagonal nonzeros in the matrix, which we will denote by K, are

$$K_{3,1} = K_{20,18} = X_2, \qquad K_{3,2} = K_{19,18} = X_1,$$
$$K_{4,1} = K_{20,17} = \cos X_1, \qquad K_{5,3} = K_{18,16} = 1,$$
$$K_{5,4} = K_{17,16} = 1, \qquad K_{6,5} = K_{16,15} = 1,$$
$$K_{7,2} = K_{19,14} = 2 X_2, \qquad K_{8,7} = K_{14,13} = 3,$$
$$K_{9,8} = K_{13,12} = 1, \qquad K_{10,6} = K_{15,11} = X_9,$$
$$K_{10,9} = K_{12,11} = X_6, \qquad K_{12,6} = K_{14,9} = \bar X_{10},$$
$$K_{19,1} = K_{20,2} = \bar X_3, \qquad K_{19,2} = 2 \bar X_7,$$
$$K_{20,1} = -\bar X_4 \sin X_1.$$

L contains 11 nonzeros and B contains 6. The matrix is very sparse, and the same sparse matrix techniques could be used to solve this system of equations.
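The block system above can be checked numerically on a small example. The sketch below (Python/NumPy) uses the tiny function f(x1, x2) = x1*x2 + sin(x1) rather than the article's example, and arranges the sign of the B block so that the derivation is self-consistent; these are illustrative conventions, not necessarily the article's exact ones:

import numpy as np

# Wengert list: X1, X2, X3 = X1*X2, X4 = sin X1, X5 = X3 + X4 = f.
x1, x2 = 0.7, 1.3
X3bar, X4bar = 1.0, 1.0            # adjoints from the reverse sweep

M, n = 5, 2
L = np.zeros((M, M))
L[2, 0], L[2, 1] = x2, x1          # X3 = X1*X2
L[3, 0] = np.cos(x1)               # X4 = sin X1
L[4, 2], L[4, 3] = 1.0, 1.0        # X5 = X3 + X4

# L transposed through both diagonals: LS[p, q] = L[M-1-q, M-1-p].
LS = L[::-1, ::-1].T

B = np.zeros((M, M))               # second-order coupling terms
B[3, 0] = X3bar                    # from d2(X1*X2)/dX1 dX2 = 1
B[4, 1] = X3bar
B[4, 0] = -X4bar * np.sin(x1)      # from d2(sin X1)/dX1^2 = -sin X1

K = np.block([[np.eye(M) - L, np.zeros((M, M))],
              [-B, np.eye(M) - LS]])
Y = np.linalg.solve(K, np.eye(2 * M)[:, 0])   # right-hand side e_1
print(Y[-n:])                      # [H21, H11] = [1, -sin(x1)]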

The Newton Step

As the notation is easier, we will consider the Jacobian case. We have shown that if we solve $(I - L) Y = e_m$, then column m of the Jacobian J is in the last n terms of Y. If we wish to evaluate $Jp$ we simply have to solve

$$(I - L) Y = p'$$

where $p'$ has its first n terms equal to p and the remaining terms zero. Then the solution is again in the last n terms of Y.

To calculate the Newton step we know $Jd$, as it must be equal to $-s$, but we do not know d. We must therefore add the equations $Y_{M+i} = -s_i$ to the system, and delete the equations $Y_i = p_i$. For convenience we will partition L, putting the first n columns into A and retaining L for the remainder. So we have to solve

$$\begin{pmatrix} -A & I - L \\ 0 & E \end{pmatrix} \begin{pmatrix} d \\ Y \end{pmatrix} = \begin{pmatrix} 0 \\ -s \end{pmatrix}$$

for d. The matrix E is rectangular and is full of zeros except for the diagonals, which are 1. Solving for d gives

$$E (I - L)^{-1} A \, d = -s,$$

so

$$J = E (I - L)^{-1} A,$$

which is also the Schur complement of the sparse set of equations. One popular way of solving a sparse set of equations is to form the Schur complement and solve the resulting equations; in this instance this becomes 'form J and solve $Jd = -s$', which would be the normal indirect method. This also justifies the attention given in this article to the efficient calculation of J.

Griewank [12] observed that it may be possible to calculate the Newton step more cheaply than by forming J and then solving the Newton equations. Utke [16] demonstrated that a number of ways of solving the sparse set of equations were indeed quicker. His implementation was compatible with ADOL-C and included many rules for eliminating variables. This approach was motivated by noting that if the Jacobian $J = D + a b^{\top}$, where D is diagonal and a and b are vectors, then J is full, and so solving $Jx = -s$ is an $O(n^3)$ operation. However, introducing one extra variable $z = b^{\top} x$ enables the extended system

$$b^{\top} x - z = 0, \qquad D x + a z = -s$$

to be solved very cheaply:

$$x = -D^{-1}(a z + s), \qquad z = -b^{\top} D^{-1}(a z + s),$$

so

$$z = -(1 + b^{\top} D^{-1} a)^{-1} b^{\top} D^{-1} s,$$

and then x may be determined by substitution, which is an O(n) operation. The challenge to find an automatic process that finds such short cuts is still open.

L.C.W. Dixon [6] noted that the extended matrix is an echelon form. An echelon matrix of degree k has ones on the kth super-diagonal and zeros above it. If the lower part is sparse and contains NNZ nonzeros, then the Schur complement can be computed in kNNZ operations and the Newton step obtained by solving the resulting equations in $O(k^3)$ steps. The straightforward sparse system is an echelon form with k = n, so he suggested that by rearranging rows and columns it might be possible to reduce k. This would reduce the operations needed for both parts of the calculation. Many sorting algorithms have been proposed for reducing the echelon index of sparse matrices. I.S. Duff et al. [9] discuss the performance of methods known as P4 and P5. R. Fletcher [10] introduced SPK1. Dixon and Z. Maany [8] introduced another which, when applied to the extended matrix of the extended Rosenbrock function, reduces the echelon index from n to n/2 and gives a diagonal Schur complement. It follows that this method, too, has considerable potential. All these approaches still require further research.

Truncated Methods

Experience using the truncated Newton code has led many researchers to doubt the wisdom of calculating accurate Newton steps. Approximate solutions are often preferred, in which the conjugate gradient method is applied to $Hd = -g$; this can be implemented by calculating $Hp$ at each inner iteration. $Hp$ can be calculated very cheaply by a single forward doublet pass, with initial values set at p, through the list for g obtained by reverse differentiation. The operations required to compute $Hp$ are therefore bounded by 15M. If an iterative method is used to solve $Jd = -s$, the products $Jp$ and $J^{\top} v$ can both be obtained cheaply, the first by forward, the second by reverse automatic differentiation.
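Returning to the rank-one shortcut above, a quick numerical check (Python/NumPy; the data are illustrative, not from the article) confirms that the O(n) extended-variable solution matches a dense solve:

import numpy as np

n = 6
d = np.full(n, 2.0)                 # diagonal of D
a = np.ones(n)
b = np.arange(1.0, n + 1.0)
s = np.linspace(-1.0, 1.0, n)

# Dense reference solve of J x = -s: O(n^3).
J = np.diag(d) + np.outer(a, b)
x_dense = np.linalg.solve(J, -s)

# Shortcut: z = -(1 + b^T D^{-1} a)^{-1} b^T D^{-1} s,
# then x = -D^{-1}(a z + s) by substitution: O(n) work.
z = -(b @ (s / d)) / (1.0 + b @ (a / d))
x_fast = -(a * z + s) / d
print(np.allclose(x_dense, x_fast))   # True

The same design idea underlies the echelon-form methods cited above: keep the system sparse and extended rather than forming the dense Schur complement.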

See also

Automatic Differentiation: Calculation of the Hessian
Automatic Differentiation: Geometry of Satellites and Tracking Stations
Automatic Differentiation: Introduction, History and Rounding Error Estimation
Automatic Differentiation: Parallel Computation
Automatic Differentiation: Point and Interval
Automatic Differentiation: Point and Interval Taylor Operators


Automatic Differentiation: Root Problem and Branch Problem
Dynamic Programming and Newton's Method in Unconstrained Optimal Control
Interval Newton Methods
Nondifferentiable Optimization: Newton Method
Nonlocal Sensitivity Analysis with Automatic Differentiation
Unconstrained Nonlinear Optimization: Newton–Cauchy Framework

References
1. Berz M, Bischof Ch, Corliss G, Griewank A (eds) (1996) Computational differentiation: techniques, applications, and tools. SIAM, Philadelphia
2. Christianson DB, Dixon LCW (1992) Reverse accumulation of Jacobians in optimal control. Techn Report Numer Optim Centre, School Inform Sci Univ Hertfordshire 267
3. Coleman TF, Cai JY (1986) The cyclic coloring problem and estimation of sparse Hessian matrices. SIAM J Alg Discrete Meth 7:221–235
4. Coleman TF, Verma A (1996) Structure and efficient Jacobian calculation. In: Berz M, Bischof Ch, Corliss G, Griewank A (eds) Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia, pp 149–159
5. Curtis AR, Powell MJD, Reid JK (1974) On the estimation of sparse Jacobian matrices. J Inst Math Appl 13:117–119
6. Dixon LCW (1991) Use of automatic differentiation for calculating Hessians and Newton steps. In: Griewank A, Corliss GF (eds) Automatic Differentiation of Algorithms: Theory, Implementation, and Application. SIAM, Philadelphia, pp 114–125
7. Dixon LCW (1993) On automatic differentiation and continuous optimization. In: Spedicato E (ed) Algorithms for continuous optimisation: the state of the art. NATO ASI series. Kluwer, Dordrecht, pp 501–512
8. Dixon LCW, Maany Z (Feb. 1988) The echelon method for the solution of sparse sets of linear equations. Techn Report Numer Optim Centre, Hatfield Polytechnic NOC TR177
9. Duff IS, Anoli NI, Gould NIM, Reid JK (1987) The practical use of the Hellerman–Rarick P4 algorithm and the P5 algorithm of Erisman and others. Techn Report AERE Harwell CSS213
10. Fletcher R, Hall JAJ (1991) Ordering algorithms for irreducible sparse linear systems. Techn Report Dundee Univ NA/131
11. Geitner U, Utke J, Griewank A (1991) Automatic computation of sparse Jacobians by applying the method of Newsam and Ramsdell. In: Griewank A, Corliss GF (eds) Automatic Differentiation of Algorithms: Theory, Implementation, and Application. SIAM, Philadelphia, pp 161–172
12. Griewank A (1991) Direct calculation of Newton steps without accumulating Jacobians. In: Griewank A, Corliss GF (eds) Automatic Differentiation of Algorithms: Theory, Implementation, and Application. SIAM, Philadelphia, pp 126–137
13. Griewank A, Corliss GF (eds) (1991) Automatic differentiation of algorithms: theory, implementation, and application. SIAM, Philadelphia
14. Griewank A, Reese S (1991) On the calculation of Jacobian matrices by the Markowitz rule. In: Griewank A, Corliss GF (eds) Automatic Differentiation of Algorithms: Theory, Implementation, and Application. SIAM, Philadelphia, pp 126–135
15. Parkhurst SC (Dec. 1990) The evaluation of exact numerical Jacobians using automatic differentiation. Techn Report Numer Optim Centre, Hatfield Polytechnic, NOC TR224
16. Utke J (1996) Efficient Newton steps without Jacobians. In: Berz M, Bischof Ch, Corliss G, Griewank A (eds) Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia, pp 253–264

Automatic Differentiation: Geometry of Satellites and Tracking Stations

DAN KALMAN
American University, Washington, DC, USA

MSC2000: 26A24, 65K99, 85-08

Article Outline

Keywords
Geometric Models
Sample Optimization Problems
  Minimum Range
  Direction Angles and Their Derivatives
  Design Parameter Optimization
Automatic Differentiation
  Scalar Functions and Operations
  Vector Functions and Operations
  Implementation Methods
Summary
See also
References

Keywords

Astrodynamics; Automatic differentiation; Satellite orbit; Vector geometry


Satellites are used in a variety of systems for communication and data collection. Familiar examples of these systems include satellite networks for broadcasting video programming, meteorological and geophysical data observation systems, the global positioning system (GPS) for navigation, and military surveillance systems. Strictly speaking, these are systems in which satellites are just one component, and in which there are other primary subsystems that have no direct involvement with satellites. Nevertheless, they will be referred to as satellite systems for ease of reference. Simple geometric models are often incorporated in simulations of satellite system performance. Important operational aspects of these systems, such as the times when satellites can communicate with each other or with installations on the ground (e. g. tracking stations), depend on dynamics of satellite and station motion. The geometric models represent these motions, as well as constraints on communication or data collection. For example, the region of space from which an antenna on the ground can receive a signal might be modeled as a cone, with its vertex centered on the antenna and axis extending vertically upward. The antenna can receive a signal from a satellite only when the satellite is within the cone. Taking into account the motions of the satellite and the earth, the geometric model predicts when the satellite and tracking station can communicate. Elementary optimization problems often arise in these geometric models. It may be of interest to determine the closest approach of two satellites, or when a satellite reaches a maximum elevation as observed from a tracking station, or the extremes of angular velocity and acceleration for a rotating antenna tracking a satellite. Optimization problems like these are formulated in terms of geometric variables, primarily distances and angles, as well as their derivatives with respect to time. The derivatives appear both in the optimization algorithms, as well as in functions to be optimized. One of the previously mentioned examples illustrates this. When a satellite is being tracked from the ground, the antenna often rotates about one or more axes so as to remain pointed at the satellite. The angular velocity and acceleration necessary for this motion are the first and second derivatives of variables expressed as angles in the geometric configuration of the antenna and satellite. Determining the extreme values of these


derivatives is one of the optimization problems mentioned earlier. Automatic differentiation is a feature that can be included in a computer programming language to simplify programs that compute derivatives. In the situation described above, satellite system simulations are developed as computer programs that include computed values for the distance and angle variables of interest. With automatic differentiation, the values of derivatives are an automatic by-product of the computation of variable values. As a result, the computer programmer does not have to develop and implement the computer instructions that go into calculating derivative values. As a specific example of this idea, consider again the rotating antenna tracking a satellite. Imagine that the programmer has worked out the proper equations to describe the angular position of the antenna at any time. The simulation also needs to compute values for the angular velocity and acceleration, the first and second derivatives of angular position. However, the programmer does not need to work out the proper equations for these derivatives. As soon as the equations for angular position are included in the computer program, the programming language provides for the calculation of angular velocity and acceleration automatically. That is the effect of automatic differentiation. Because the derivatives of geometric variables such as distances and angles can be quite involved, automatic differentiation results in computer programs that are much easier to develop, debug, and maintain. The preceding comments have provided a brief overview of geometric models for satellite systems, as well as associated optimization problems and the use of automatic differentiation. The discussion will now turn to a more detailed examination of these topics. Geometric Models The geometric models for satellite systems are formulated in the context of three-dimensional real space. A conventional rectangular coordinate system is defined by mutually perpendicular x, y, and z axes. The earth is modeled as a sphere or ellipsoid centered at the origin (0, 0, 0), with the north pole on the positive z axis, and the equator in the xy plane. The coordinate axes are considered to retain a constant orientation rel-


ative to the fixed stars, so that the earth rotates about the z axis.

In this setting, tracking station and satellite locations are represented by points moving in space. Each such moving point is specified by a vector valued function r(t) = (x(t), y(t), z(t)), where t represents time. Geometric variables such as angles and distances can be determined using standard vector operations:

$$c(x, y, z) = (cx, cy, cz),$$
$$(x, y, z) \pm (u, v, w) = (x \pm u,\; y \pm v,\; z \pm w),$$
$$(x, y, z) \cdot (u, v, w) = xu + yv + zw,$$
$$(x, y, z) \times (u, v, w) = (yw - zv,\; zu - xw,\; xv - yu),$$
$$\|(x, y, z)\| = \sqrt{x^2 + y^2 + z^2} = \sqrt{(x, y, z) \cdot (x, y, z)}.$$

The distance between two points r and s is then given by $\|r - s\|$. The angle $\theta$ defined by rays from point r through points p and q is determined by

$$\cos \theta = \frac{(p - r) \cdot (q - r)}{\|p - r\| \, \|q - r\|}.$$
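As a concrete rendering of these operations, here is a minimal sketch in Python using NumPy (the function name is illustrative, not from the article):

import numpy as np

def angle_between(r, p, q):
    # Angle theta at r defined by rays from r through p and q.
    u, v = p - r, q - r
    cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

r = np.array([0.0, 0.0, 0.0])
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 2.0, 0.0])
print(np.degrees(angle_between(r, p, q)))   # 90.0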

A more complete discussion of vector operations, their properties, and geometric interpretation can be found in any calculus textbook; [9] is one example. There are a variety of models for the motions of points representing satellites and tracking stations. The familiar conception of a uniformly rotating earth circled by satellites that travel in stable closed orbits is only approximately correct. For qualitative simulations of the performance of satellite systems, particularly at preliminary stages of system design, these models may be adequate. More involved models can take into account such effects as the asphericity of the gravitational field of the earth, periodic wobbling of the earth's axis of rotation, or atmospheric drag, to name a few. Modeling the motions of the earth and satellites with high fidelity is a difficult endeavor, and one that has been studied extensively. Good general references for this subject are [1,2,3,10]. For illustrative purposes, a few of the details will be presented for the simplest models, circular orbits

Automatic Differentiation: Geometry of Satellites and Tracking Stations, Figure 1 Earth rotation angle

around a spherical earth, uniformly spinning on a fixed axis. The radius of the earth will be denoted $R_e$. As a starting point, the rotation of the earth can be specified by a single function of time, $\Omega(t)$, representing the angular displacement of the prime meridian from a fixed direction, typically the direction specified by the positive x axis (see Fig. 1). At any time, the positive x axis emerges from the surface of the earth at some point on the equator. Suppose that at a particular time t, the point where the positive x axis emerges happens to be on the prime meridian, located at latitude 0 and longitude 0. Then $\Omega(t) = 0$ for that t. As time progresses, the prime meridian rotates away from the x axis, counter-clockwise as viewed by an observer above the north pole. The function $\Omega$ measures the angle of rotation, starting at 0 each time the prime meridian is aligned with the x axis, and increasing toward a maximum of 360° ($2\pi$ in radian measure) with each rotation of the earth. With a uniformly spinning earth, $\Omega$ increases linearly with t during each rotation.

Once $\Omega$ is specified, any terrestrial location given by a latitude $\varphi$, longitude $\lambda$, and altitude a can be transformed into absolute coordinates in space, according to the equations

$$\theta = \lambda + \Omega(t), \tag{1}$$
$$r = R_e + a, \tag{2}$$
$$x = r \cos\theta \cos\varphi, \tag{3}$$
$$y = r \sin\theta \cos\varphi, \tag{4}$$
$$z = r \sin\varphi. \tag{5}$$
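In code, equations (1)–(5) amount to a few lines. The sketch below (Python/NumPy) assumes a uniformly spinning earth, so that $\Omega(t)$ is linear in t; the constants and names are illustrative, not from the article:

import numpy as np

R_E = 6378.0                     # earth radius in km (illustrative)
OMEGA_RATE = 2.0 * np.pi / 24.0  # rad/hour for a uniformly spinning earth

def station_position(lat, lon, alt, t):
    # Equations (1)-(5): absolute (x, y, z) of a fixed terrestrial
    # location; lat/lon in radians, alt in km, t in hours.
    theta = lon + OMEGA_RATE * t             # (1)
    r = R_E + alt                            # (2)
    return np.array([r * np.cos(theta) * np.cos(lat),   # (3)
                     r * np.sin(theta) * np.cos(lat),   # (4)
                     r * np.sin(lat)])                   # (5)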

Holding latitude, longitude, and altitude constant, these equations express the position in space of a fixed location on the earth for any time, thereby modeling the point's motion. It is also possible to develop models for tracking stations that are moving on the surface of the earth, say on an aircraft or on a ship in the ocean. For example, if it is assumed that the moving craft is traveling at constant speed on a great circle arc or along a line of constant latitude, it is not difficult to express latitude and longitude as functions of time. In this case, the equations above reflect a dependence on t in $\varphi$ and $\lambda$, as well as in $\Omega$. A more complicated example would be to model the motion of a missile or rocket launched from the ground. This can be accomplished in a similar way: specify the trajectory in earth relative terms, that is, using latitude, longitude, and altitude, and then compute the absolute spatial coordinates (x, y, z). In each case, the rotation of the earth is accounted for solely by the effect of $\Omega(t)$.

For a satellite in circular orbit, the position at any time is specified by an equation of the following form:

$$r(t) = r[\cos(\omega t)\, u + \sin(\omega t)\, v].$$

In this equation, $\omega t$ is understood as an angle in radian measure for the sin and cos operations; r, $\omega$, u, and v are constants. The first, r, is the length of the orbit circle's radius. It is equal to the sum of the earth's radius $R_e$ and the satellite's altitude. The constant $\omega$ is the angular speed of the satellite. The satellite completes an orbit every $2\pi/\omega$ units of time, thus giving the orbital period. Both u and v are unit vectors: u is parallel to the initial position of the satellite; v is parallel to the initial velocity. See Fig. 2.

Mathematically, the equation above describes some sort of orbit no matter how the constants are selected. But not all of these are accurate descriptions of a free falling satellite in circular orbit. For one thing, u and v must be perpendicular to produce a circular orbit. In addition, there is a physical relationship linking r and $\omega$. Assuming that the circular orbit follows Newton's laws of motion and gravitation, r and $\omega$ satisfy

$$\omega = K r^{-3/2} \tag{6}$$


Automatic Differentiation: Geometry of Satellites and Tracking Stations, Figure 2 Circular orbit

where K is a physical constant that depends on both Newton's universal gravitational constant and the mass of the earth. Its numerical value also depends on the units of measurement used for time and distance. For units of hours and kilometers, the value of K is $2.27285 \times 10^6$. As this relationship shows, for a given altitude (and hence a given value of r), there is a unique angular speed at which a satellite will maintain a circular orbit. Equivalently, the altitude of a circular orbit determines the constant speed of the satellite, as well as the period of the satellite.

Generally, constants are chosen for a circular orbit based on some geometric description. Here is a typical approach. Assume that the initial position of the satellite is directly above the equator, with latitude 0, a given longitude, and a given altitude. In other words, assume that the initial position is in the plane of the equator, and so has a z coordinate of 0. (This is the situation depicted in Fig. 2.) Moreover, the initial heading of the satellite can be specified in terms of the angle it makes with the xy plane (which is the plane of the equator). Call that angle $\delta$. From these assumptions we can determine values for the constants r, $\omega$, u, and v in the equation for r(t). Now the altitude for the orbit is constant, so the initial altitude determines r, as well as $\omega$ via equation (6). The initial latitude, longitude, and altitude also provide enough information to determine absolute coordinates (x, y, z) for the initial satellite position using equations (1)–(5). Accordingly, the unit vector u is given by

$$u = \frac{(x, y, z)}{\|(x, y, z)\|}.$$

As already observed, the z coordinate of u will be 0. Finally, the unit vector v is determined from the initial position and heading. It is known that v makes an angle


of $\delta$ with the xy plane, and hence makes an angle of $\pi/2 - \delta$ with the z axis. This observation can be expressed as the equation

$$v \cdot (0, 0, 1) = \sin\delta.$$

It is also known that v must be perpendicular to u, so

$$v \cdot u = 0.$$

Finally, since v is a unit vector,

$$v \cdot v = 1.$$

If $u = (u_1, u_2, 0)$, then these three equations lead to $v = (\pm u_2 \cos\delta, \mp u_1 \cos\delta, \sin\delta)$. The ambiguous sign can be resolved by assuming that the direction of orbit is either in agreement with or contrary to the direction of the earth's rotation. Assuming that the orbit is in the same direction as the earth's rotation, $v = (-u_2 \cos\delta, u_1 \cos\delta, \sin\delta)$. The alternative possibility, that the satellite orbit opposes the rotation of the earth, is generally not practically feasible, so is rarely encountered.

The preceding paragraphs are intended to provide some insight about the mathematics used to describe the movement of satellites and terrestrial observers. Although the models presented here are the simplest ones available, they appear in the same general framework as much more sophisticated models. In particular, in any of these models, it is necessary to be able to compute instantaneous positions for satellites and terrestrial observers at any time during a simulation. Moreover, the use of vector algebra and geometry to set up the simple models is representative of the methods used in more complicated cases.

Sample Optimization Problems

Computer simulations of satellite system performance provide one tool for comparing alternative designs and making cost/benefit trade-offs in the design process. Optimization problems contribute both directly and indirectly. In many cases, system performance is characterized in terms of extreme values of variables: what is the maximum number of users that can be accommodated by a communications system? At a given latitude, what is the longest period of time during which at most three satellites can be detected from some point on the ground? In these examples, the optimization problems are directly connected with the goals of the simulation.

Optimization problems also arise indirectly as part of the logistics of the simulation software. This is particularly the case when a simulation involves events that trigger some kind of system response. Examples of such events include the passage of a satellite into or out of sunlight, reaching a critical level of some resource such as power or data storage, or the initiation or termination of radio contact with a tracking station. The detection of these events typically involves either root location or optimization. These processes are closely related: the root of an equation can usually be characterized as an extreme value of a variable within a suitable domain; conversely, optimization algorithms often generate candidate solutions by solving equations. In many of these event identification problems, the independent variable is time. The objective functions ultimately depend on the geometric models for satellite and tracking station motion, and so can be formulated in terms of explicit functions of time. In contrast, some of the optimization problems that concern direct estimation of system performance seek to optimize that performance by varying design parameters. A typical approach to this kind of problem is to treat performance measures as functions of the parameters, where the values of the functions are determined through simulation. Both kinds of optimization are illustrated in the following examples.

Minimum Range

As a very simple example of an optimization problem, it is sometimes of interest to determine the closest approach of two orbiting bodies. Assume that a model has been developed, with r(t) and s(t) representing the positions at time t for the two bodies. The distance between them is then expressed as $\|r(t) - s(t)\|$. This is the objective function to be minimized. Observe that it is simply expressed as a composition of vector operations and the motion models for the two bodies. A variation of this problem occurs when several satellites are required to stay in radio communication. In that case, an antenna on one satellite (at position A, say) may need to detect signals from two others (at positions B and C). In this setting, the measure of $\angle BAC$ is of interest. If the angle is wide, the antenna requires a correspondingly wide field of view. As the satellites proceed in their orbits, what is the maximum value of


the angle? Equivalently, what is the minimum value of the cosine of the angle? As before, the objective function in this minimization problem is easily expressed by applying vector operations to the position models for the satellites. If a(t), b(t), and c(t) are the position functions for the three satellites, then

$$\cos \angle BAC = \frac{(b - a) \cdot (c - a)}{\|b - a\| \, \|c - a\|}.$$

This is a good example of combining vector operations with the models for satellite motion to derive the objective function in an optimization problem. The next example is similar in style, but mathematically more involved.

Direction Angles and Their Derivatives

A common aspect of satellite system simulation is the representation of sensors of various kinds. The images that satellites beam to earth of weather systems and geophysical features are captured by sensors. Sensors are also used to locate prominent astronomical features such as the sun, the earth, and in some cases bright stars, in order to evaluate and control the satellite's attitude. Even the antenna used for communication is a kind of sensor. It is frequently convenient to define a coordinate system that is attached to a sensor, that is, define three mutually perpendicular axes which intersect at the sensor location, and which can be used as an alternate means to assign coordinates to points in space. Such a coordinate system is then used to describe the vectors from the sensor to other objects, and to model sensor sensitivity to signals arriving from various directions. With several different coordinate systems in use, it is necessary to transform information described relative to one system into a form that makes sense in the context of another system. This process also often involves what are called direction angles.

As a concrete example, consider an antenna at a fixed location on the earth, tracking a satellite in orbit. The coordinate system attached to the tracking antenna is the natural map coordinate system at that point on the earth: the local x and y axes point east and north, respectively, and the z axis points straight up (Fig. 3). The direction from the station to the satellite is expressed in terms of two angles: the elevation $\delta$ of the satellite above the local xy plane, and the compass angle $\alpha$ measured clockwise from north. (See Fig. 4.) To illustrate,

Automatic Differentiation: Geometry of Satellites and Tracking Stations, Figure 3 Local map coordinates

here is the meaning of an elevation of 30 degrees and a compass angle of 270 degrees. Begin by looking due north. Turn clockwise through 270 degrees, maintaining a line of sight that is parallel to the local xy plane. At that point you are looking due west. Now raise the line of sight until it makes a 30 degree angle with the local xy plane. This direction of view, with elevation 30 and compass angle 270 degrees, might thus be described as 30 degrees above a ray 270 degrees clockwise from due north. The elevation and compass angle are examples of direction angles. Looked at another way, if a spherical coordinate system is imposed on the local rectan-

Automatic Differentiation: Geometry of Satellites and Tracking Stations, Figure 4 Compass and elevation angles


gular system at the antenna, then every point in space is described by a distance and two angles. The angles are direction angles. Direction angles can be defined in a similar way for any local coordinate system attached to a sensor.

How are direction angles computed? In general terms, the basic idea is to define the local coordinate system in terms of moving vectors, and then to use vector operations to define the instantaneous value of direction angles. Here is a formulation for the earth based antenna. First, the local z axis points straight up. That means the vector from the center of the earth to the location of the antenna on the surface is parallel to the z axis. Given the latitude, longitude, and altitude of the antenna, its absolute position r(t) = (x, y, z) is computed using equations (1)–(5), as discussed earlier. The parallel unit vector is then given by $r/\|r\|$. To distinguish this from the global z axis, we denote it as the vector up. The vector pointing due east must be perpendicular to the up direction. It also must be parallel to the equatorial plane, and hence perpendicular to the global z axis. Using properties of vector cross products, a unit vector pointing east can therefore be expressed as

$$\mathit{east} = \frac{(0, 0, 1) \times \mathit{up}}{\|(0, 0, 1) \times \mathit{up}\|}.$$

Finally, the third perpendicular vector is given by the cross product of the other two: north = up × east. Note that these vectors are defined as functions of time. At each value of t the earth motion model gives an instantaneous value for r(t), and that, in turn, determines the vectors up, east, and north.

Next, suppose that a satellite is included in the model, with instantaneous position s(t). The view vector from the antenna to the satellite is given by $v(t) = [s(t) - r(t)]/\|s(t) - r(t)\|$. The goal is to calculate the direction angles $\alpha$ and $\delta$ for v. Since $\delta$ measures the angle between v and the plane of east and north, the complementary angle can be measured between v and up. This leads to the equation

$$\sin\delta = \mathit{up} \cdot v.$$

The angle $\alpha$ is found from

$$v_n = v \cdot \mathit{north}, \qquad v_e = v \cdot \mathit{east}$$

according to the equations

$$\cos\alpha = \frac{v_n}{\sqrt{v_n^2 + v_e^2}}, \qquad \sin\alpha = \frac{v_e}{\sqrt{v_n^2 + v_e^2}}.$$

These follow from the fact that the projection of v into the local xy plane is given by $v_e\, \mathit{east} + v_n\, \mathit{north}$.

In this example, direction angles play a role in several optimization problems. First, it may be of interest to predict the maximum value of $\delta$ as a satellite passes over the tracking station. This maximum value of elevation is an indication of how close the satellite comes to passing directly overhead, and may be used to determine whether communication will be possible between satellite and tracking station. Additional optimization problems concern the derivatives of $\alpha$ and $\delta$. In many designs, an antenna can turn about horizontal and vertical axes to point the center of the field of view in a particular direction. In order to stay pointed at a passing satellite, the antenna must be rotated on its axes so as to match the motion of the satellite, and $\alpha$ and $\delta$ specify exactly how far the antenna must be rotated about each axis at each time. However, there are mechanical limits on how fast the antenna can turn and accelerate. For this reason, during the time that the satellite is in view, the maximum values of the first and second derivatives of $\alpha$ and $\delta$ are of interest. If the first derivatives exceed the antenna's maximum turning speed, or if the second derivatives exceed the antenna's maximum acceleration, the antenna will not be able to remain pointed at the satellite.

Design Parameter Optimization

The preceding examples all involve simple kinds of optimization problems with objective functions depending only on time. There are also many situations in which system performance variables are optimized over some domain of design parameters. As one example of this, consider a system with a single satellite traveling in a circular orbit. Assume that the initial point of the orbit falls on the equator, with angle $\delta$ between the initial heading and the xy plane, as in Fig. 2. In this example, the object is to choose an optimal value of $\delta$. The optimization problem includes several tracking stations on the ground that are capable of communicating with the


satellite. As it orbits, there may be times when the satellite cannot communicate with any of the tracking stations. At other times, one or more stations may be accessible. Over the simulation period, the total amount of time during which at least one tracking station is accessible will depend on the value of $\delta$. It is this total amount of access time (denoted A) that is to be maximized. In this problem, the objective function A is not given as a mathematical expression involving the variable $\delta$. An appropriate simulation can be created to compute A for any particular $\delta$ of interest. This can then be used in conjunction with an optimization algorithm, with the simulation executed each time it is necessary to calculate $A(\delta)$.

The preceding example is a simple one, and the execution time required to compute $A(\delta)$ is small. For more complicated situations, each execution of the simulation can require a significant amount of time. In these cases, it may be more practical to use some sort of interpolation scheme. The idea would be to run the simulation for some values of the parameter(s), and to interpolate between these values as needed during the optimization process.

In some situations, there is a resource allocation problem that can add yet another level of complexity to optimizing system performance. For example, if there are several satellites that must compete for connection time with the various tracking stations, just determining how to assign the tracking stations to the satellites is not a simple matter. In this situation, there may be one kind of optimization problem performed during the simulation to make the resource allocations, and then a secondary optimization that considers the effect of changing system design parameters. An example of this kind of problem is described in detail in [6].

The preceding examples have been provided to illustrate the kinds of optimization problems that arise in simulations of satellite systems. Although there has been very little discussion of methods to solve these optimization problems, it should be clear that standard methods apply, especially in the cases for which the independent variable is time. In that context, the ability to compute derivatives relative to time for the objective function is of interest. In addition, it sometimes occurs that the objective function is, itself, defined as a derivative of some geometric variable, providing an-


other motivation for computing derivatives. The next topic of discussion concerns the use of automatic differentiation for computing the desired derivatives.

Automatic Differentiation

Automatic differentiation refers to a family of techniques for automatically computing derivatives as a byproduct of function evaluation. A survey of different approaches and applications can be found in [5], and an in-depth treatment appears in [4]. For the present discussion, attention will be restricted to what is called the forward mode of automatic differentiation, and in particular, the approach described in [8]. In this approach, to provide automatic calculation of the first m derivatives of real valued expressions of a single variable x, an algebraic system is defined consisting of real (m + 1)-tuples, to which the familiar binary operations and elementary functions generally defined on real variables are extended. For concreteness, m will be assumed to be 3 below, but the discussion can be generalized to other values in an obvious way.

With m = 3, the objects manipulated by the automatic differentiation system are 4-tuples. The idea is that each 4-tuple represents the value of a function and its first 3 derivatives, and that the operations on tuples preserve this interpretation. Thus, if $a = (a_0, a_1, a_2, a_3)$ consists of the values of $f(t)$, $f'(t)$, $f''(t)$, and $f'''(t)$ at some t, and if $b = (b_0, b_1, b_2, b_3)$ is similarly defined for a function g, then the product ab that is defined for the automatic differentiation system will consist of the value at t of fg and its first 3 derivatives. Similarly, the extension of the squareroot function to 4-tuples is so contrived that $\sqrt{a}$ will consist of the value of $\sqrt{f(t)}$ and its first 3 derivatives.

In the preceding remarks, the functions f and g are assumed to be real valued, but similar ideas work for vector valued functions. The principal difference is this: when f(t) is a vector, then so are its derivatives, and the $a_i$ referred to above are then vectors rather than scalars. In addition, for vector valued functions, there are different operations than for scalar valued functions. For example, vector functions may be combined with a dot product, as opposed to the conventional product of real scalars, and while the squareroot operation is not defined for vector valued functions, the norm operation $\|f(t)\|$ is.


In an automatic differentiation system built along these lines, there must be some functions that are evaluated directly to produce 4-tuples. For example, the constant function with value c can be evaluated directly to produce the tuple (c, 0, 0, 0), and the identity function I(t) = t can be evaluated directly to produce (t, 1, 0, 0). For geometric satellite system simulations, it is also convenient to provide direct evaluation of tuples for the motion models. For example, let r(t) be the position vector for a tracking station, as developed in equations (1)–(5). It is a simple matter to work out appropriate formulas for the first three derivatives of r(t), each of which is also a vector. This is included in the automatic differentiation system so that when a particular value of t is given, the motion model computes the 4-tuple $(r(t), r'(t), r''(t), r'''(t))$. A similar arrangement is made for every moving object represented in the simulation, including satellites, tracking stations, ships, aircraft, and so on.

Here is a simple example of how automatic differentiation is used. In the earlier discussion of optimization problems, there appeared the following equation:

$$\cos \angle BAC = \frac{(b - a) \cdot (c - a)}{\|b - a\| \, \|c - a\|}.$$

Using automatic differentiation, a, b, and c would be 4-tuples, each consisting of four vectors. These are produced by the motion models for three satellites, as the values of position and its first three derivatives at a specific time. The operations used in the equation, vector difference, dot product, and norm, as well as scalar multiplication and division, are all special modified operations that work directly on 4-tuples. The end result is also a 4-tuple, consisting of the cosine of angle BAC, as well as the first three derivatives of that function, all at the specified value of t. As a result, the programmer can obtain computed values for the derivatives of the function without explicitly coding equations for these derivatives. More generally, after defining appropriate 4-tuples for all of the motion models, the programmer automatically obtains derivatives for any function that is defined by operating on the motion models, just by defining the operations. No explicit representation of the derivatives of the operations is needed. Some details of how the system works follow.

Scalar Functions and Operations

Consider first operations which apply to scalars. There are two basic types: binary operations (+, −, ×, ÷) and elementary functions (squareroot, exponential and logarithm, trigonometric functions, etc.). These operations must be defined for the 4-tuples of the automatic differentiation system in such a way that derivatives are correctly propagated. The definition for multiplication will illustrate the general approach for binary operations.

Suppose that (a, b, c, d) and (u, v, w, x) are two 4-tuples of scalars. They represent values of functions and their derivatives, say, $(a, b, c, d) = (f(t), f'(t), f''(t), f'''(t))$ and $(u, v, w, x) = (g(t), g'(t), g''(t), g'''(t))$. The product is supposed to give $((fg)(t), (fg)'(t), (fg)''(t), (fg)'''(t))$. Each of these derivatives can be computed using the derivatives of f and g:

$$(fg)(t) = f(t) g(t),$$
$$(fg)'(t) = f'(t) g(t) + f(t) g'(t),$$
$$(fg)''(t) = f''(t) g(t) + 2 f'(t) g'(t) + f(t) g''(t),$$
$$(fg)'''(t) = f'''(t) g(t) + 3 f''(t) g'(t) + 3 f'(t) g''(t) + f(t) g'''(t).$$

On the right side of each equation, now substitute the entries of (a, b, c, d) and (u, v, w, x):

$$(fg)(t) = au, \quad (fg)'(t) = av + bu, \quad (fg)''(t) = aw + 2bv + cu, \quad (fg)'''(t) = ax + 3bw + 3cv + du.$$

This shows that 4-tuples must be multiplied according to the rule

$$(a, b, c, d)(u, v, w, x) = (au,\; av + bu,\; aw + 2bv + cu,\; ax + 3bw + 3cv + du).$$

For addition, subtraction, and division a similar approach can be used. All that is required is that successive derivatives of the combination of f and g be expressed in terms of the derivatives of f and g separately. Replacing these derivatives with the appropriate components of (a, b, c, d) and (u, v, w, x) produces the desired formula for operating on 4-tuples.
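A minimal sketch of this product rule in Python (the function name is illustrative; a real system would wrap this in an operator-overloading class):

def mul4(f, g):
    # 4-tuple product rule: each tuple is (value, f', f'', f''').
    a, b, c, d = f
    u, v, w, x = g
    return (a * u,
            a * v + b * u,
            a * w + 2 * b * v + c * u,
            a * x + 3 * b * w + 3 * c * v + d * u)

# The identity function at t = 2 is (2, 1, 0, 0); squaring it gives
# t^2, 2t, 2, 0 evaluated at t = 2.
t = (2.0, 1.0, 0.0, 0.0)
print(mul4(t, t))   # (4.0, 4.0, 2.0, 0.0)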

Automatic Differentiation: Geometry of Satellites and Tracking Stations

To define the operation on a 4-tuple of an elementary function, a similar approach will work. Consider defining how a function h should apply to a 4-tuple (a, b, c, d) = (f (t), f0 (t), f 00 (t), f00 (t). This time, the desired end result should contain derivatives for the composite function h ı f , and so should have the form ((h ı f ) (t), (h ı f )0 (t), (h ı f )00 (t), (h ı f )000 (t)) The derivative of h ı f is given by h0 (f (t)) f 0 (t), which becomes h0 (a) b after substitution. Similar computations produce expressions for the second and third derivatives: (h ı f )00 (t) D h 00 ( f (t)) f 0(t)2 C h 0 ( f (t)) f 00(t) D h 00 (a)b 2 C h 0 (a)c and 000

(h ı f ) (t) D h 000 ( f (t)) f 0(t)3 C 3h 00 ( f (t)) f 0(t) f 00 (t) C h 0 ( f (t)) f 000(t) D h 000 (a)b 3 C 3h 00 (a)bc C h 0 (a)d: These results lead to h(a; b; c; d) D (h(a); h 0(a)b; h 00 (a)b 2 C h 0 (a)c; h 000 (a)b 3 C 3h 00 (a)bc C h 0 (a)d): As an example of how this is applied, let h(t) = et . Then h(a) = h0 (a) = h00 (a) = h000 (a) = ea so e (a;b;c;d) D (e a ; e a b; e a b 2 C e a c; e a b 3 C 3e a bc C e a d) D e a (1; b; b 2 C c; b 3 C 3bc C d): Other functions are a little more complicated, but the overall approach is generally correct. The preceding discussion indicates how operations on 4-tuples would be built into an automatic differentiation system. However, the user of such a system would simply apply the operations. So, if an appropriate definition has been provided for ˝(t) as discussed earlier, along with the derivatives, the program would compute a 4-tuple for ˝ and its derivatives at a particular time. Say that is represented in the program by the variable W. If the program later includes the call sin(W), the result would be a 4-tuple with values for sin(˝(t)), and the first three derivatives.
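The elementary-function rule is just as short in code. Here is a sketch for $h(t) = e^t$ (Python; the function name is illustrative):

import math

def exp4(f):
    # Apply exp to a 4-tuple (value, f', f'', f''').
    a, b, c, d = f
    ea = math.exp(a)
    return (ea, ea * b, ea * (b * b + c), ea * (b ** 3 + 3 * b * c + d))

# For g(t) = e^(2t) at t = 0, feed in the 4-tuple (0, 2, 0, 0) for 2t;
# the kth derivative of e^(2t) at 0 is 2^k.
print(exp4((0.0, 2.0, 0.0, 0.0)))   # (1.0, 2.0, 4.0, 8.0)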


Vector Functions and Operations

The approach for vector functions is basically the same as for scalar functions. The only modification that is needed is to recognize that the components of 4-tuples are now vectors. Because the rules for computing derivatives of vector operations are so similar to those for scalar operations, there is little difference in the appearance of the definitions. For example, here is the definition for the dot product of two 4-tuples, whose components are vectors:

$$(a, b, c, d) \cdot (u, v, w, x) = (a \cdot u,\; a \cdot v + b \cdot u,\; a \cdot w + 2b \cdot v + c \cdot u,\; a \cdot x + 3b \cdot w + 3c \cdot v + d \cdot u).$$

The formulation for vector cross product is virtually identical, as is the product of a scalar 4-tuple with a vector 4-tuple. For the vector norm, simply define

$$\|(a, b, c, d)\| = \sqrt{(a, b, c, d) \cdot (a, b, c, d)}.$$

Since both the dot product of vector 4-tuples and the squareroot of scalar 4-tuples have already been defined in the automatic differentiation system, this equation will propagate derivatives correctly.

With a full complement of scalar and vector operations provided by the automatic differentiation system, all of the geometric variables discussed in previous examples can be included in a computer program, with derivatives generated automatically. As a particular case, reconsider the discussion earlier of computing elevation $\delta$ and compass angle $\alpha$ for a satellite as viewed from a tracking station. Assuming that r and s have been defined as 4-tuples for the vector positions of that station and satellite, the following fragment of pseudocode would carry out the computations described earlier:

up    = r/norm(r)
east  = cross(pole, up)
east  = east/norm(east)
north = cross(up, east)
v     = (s - r)/norm(s - r)
vn    = dot(v, north)
ve    = dot(v, east)
vu    = dot(v, up)
delta = asin(vu)
alpha = atan2(ve, vn)

151

152

A

Automatic Differentiation: Geometry of Satellites and Tracking Stations

Executed in an automatic differentiation system, this code produces not just the instantaneous values of the angles ˛ and ı, but their first three derivatives, as well. The programmer does not need to derive and code explicit equations for these derivatives, a huge savings in this problem. And all of the derivative information is useful. Recall that the first and second derivatives are of interest for their physical interpretations as angular velocities and accelerations. The third derivatives are used in finding the maximum values of the second derivatives (accelerations). Implementation Methods One of the simplest ways to implement automatic differentiation is to use a language like C++ that supports the definition of abstract data types and operator overloading. Then the automatic differentiation system would be implemented as a series of data types and operations, and included as part of the code for a simulation. A discussion of one such implementation can be found in [7]. Another approach is to develop a preprocessor that automatically augments code with the steps needed to compute derivatives. With such a system, the programmer develops code in a conventional language such as FORTRAN, with some additional features that control the application of automatic differentiation. Next, this code is operated on by the preprocessor, producing a modified program. That is then compiled and executed in the usual way. Examples of this approach can be found in [5]. Summary Geometric models are very useful in representing the motions of satellites and terrestrial objects in simulations of satellite systems. These models are defined in terms of vector operations, which permit the convenient formulation of equations for geometric constructs such as distances and angles arising in the satellite system configuration. Equations which specify instantaneous positions in space of moving objects are a fundamental component of the geometric modeling framework. Optimization problems occur in this framework in two guises. First, there are problems in which the objective functions are directly defined as features of the

geometric setting. An example of this would be to find the minimum distance between two satellites. Second, measures of system performance are derived via simulation as a function of design parameters, and these measures are optimized by varying the parameters. An example of this kind of problem would be to seek a particular orbit geometry in order to maximize the total amount of time a satellite has available to communicate with a network of tracking stations. Automatic differentiation is a feature of an environment for implementing simulations as computer programs. In an automatic differentiation system, the equations which define values of variables automatically produce the values of the derivatives, as well. In the geometric models of satellite systems, derivatives of some variables are of intrinsic interest as velocities and accelerations. Derivatives are also useful in solving optimization problems. Automatic differentiation can be provided by replacing single operands with tuples, representing the operands and their derivatives. For some tuples, the derivatives must be explicitly provided. This is the case for the motion models. For tuples representing combinations of the motion models, the derivatives are generated automatically. These combinations can be defined using any of the supported operations provided by the automatic differentiation system, typically including the operations of scalar and vector arithmetic, as well as scalar functions such as exponential, logarithmic, and trigonometric functions. Languages which support abstract data types and operator overloading are a convenient setting for implementing an automatic differentiation system.

See also  Automatic Differentiation: Calculation of the Hessian  Automatic Differentiation: Calculation of Newton Steps  Automatic Differentiation: Introduction, History and Rounding Error Estimation  Automatic Differentiation: Parallel Computation  Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators

Automatic Differentiation: Introduction, History and Rounding Error Estimation

 Automatic Differentiation: Root Problem and Branch Problem  Nonlocal Sensitivity Analysis with Automatic Differentiation References 1. Bate RR, Mueller DD, White JE (1971) Fundamentals of astrodynamics. Dover, Mineola, NY 2. Battin RH (1987) An introduction to the mathematics and methods of astrodynamics. AIAA Education Ser. Amer. Inst. Aeronautics and Astronautics, Reston, VA 3. Escobal PR (1965) Methods of orbit determination. R.E. Krieger, Huntington, NY 4. Griewank A (2000) Evaluating derivatives: Principles and techniques of algorithmic differentiation. SIAM, Philadelphia 5. Griewank A, Corliss GF (eds) (1995) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia 6. Kalman D (1999) Marriages made in the heavens: A practical application of existence. Math Magazine 72(2):94–103 7. Kalman D, Lindell R (1995) Automatic differentiation in astrodynamical modeling. In: Griewank A and Corliss GF (eds) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia, pp 228– 241 8. Rall LB (Dec. 1986) The arithmetic of differentiation. Math Magazine 59(5):275–282 9. Thomas GB Jr, Finney RL (1996) Calculus and analytic geometry, 9th edn. Addison-Wesley, Reading, MA 10. Wertz JR (ed) (1978) Spacecraft attitude determination and control. Reidel, London

Automatic Differentiation: Introduction, History and Rounding Error Estimation MASAO IRI, KOICHI KUBOTA Chuo University, Tokyo, Japan MSC2000: 65D25, 26A24 Article Outline Keywords Introduction Algorithms Complexity

History

A

Estimates of Rounding Errors See also References

Keywords Differentiation; System analysis; Error analysis Introduction Most numerical algorithms for analyzing or optimizing the performance of a nonlinear system require the partial derivatives of functions that describe a mathematical model of the system. The automatic differentiation (abbreviated as AD in the following), or its synonym, computational differentiation, is an efficient method for computing the numerical values of the derivatives. AD combines advantages of numerical computation and those of symbolic computation [2,4]. Given a vector-valued function f: Rn ! Rm : 1 0 f 1 (x1 ; : : : ; x n ) C B :: (1) y D f(x) @ A : f m (x1 ; : : : ; x n ) of n variables represented by a big program with hundreds or thousands of program statements, one often had encountered (before the advent of AD) some difficulties in computing the partial derivatives @f i /@xj with conventional methods (as will be shown below). Now, one can successfully differentiate them with AD, deriving from the program for f another program that efficiently computes the numerical values of the partial derivatives. AD is entirely different from the well-known numerical approximation with quotients of finite differences, or numerical differentiation. The quotients of finite differences, such as (f (x + h) f (x))/h and (f (x + h)  f (x  h))/2h, approximate the derivative f 0 (x), where truncation errors are of O(h) and O(h2 ), respectively, but there is an insurmountable difficulty to compute better and better approximation. For, although an appropriately small value of h is chosen, it may fail to compute the values of the function when x ˙ h is out of the domain of f , and, furthermore, the effect of rounding errors in computing the values of the functions is of problem.

153

154

A

Automatic Differentiation: Introduction, History and Rounding Error Estimation

AD is also different from symbolic differentiation with a symbolic manipulator. The symbolic differentiation derives the expressions of the partial derivatives rather than the values. The mathematical model of a large scale system may be described in thousands of program statements so that it becomes very difficult to handle whole of them with an existing symbolic manipulator. (There are a few manipulators combined with AD, which can handle such large scale programs. They should be AD regarded as a symbolic manipulator.) Example 1 Program 1 computes an output value y1 as a composite function f 1 for given input values x1 = 2, x2 = 3, x3 = 4: y1 D f 1 (x1 ; x2 ; x3 ) D

x1 (x2  x3 ) : exp(x1 (x2  x3 )) C 1

(2)

The execution of this program is traced by a sequence of assignment statements (Program 2). y1 y1

x1 (x2  x3 ), y1 /(exp(y1 ) + 1).

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Program 2 Program 1 expanded to straight line program for the specified input values

A set of unary or binary arithmetic operators (+, ,

, /) and elementary transcendental functions (exp, log, sin, cos, . . . ) that may be used in the programs will be called basic operations. (Some special operations such as those generating ‘constant’ and ‘input’ are also to be counted among basic operations.) Program 2 can be expanded into a sequence of assignment statements each of whose right side has only one basic operation (Program 3), where z1 , . . . , zs are temporary variables (s = 2 for this example).

x2  x3 , x 1 z1 , exp(z1 ), z2 + 1, z1 /z2 .

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Program 3 Expanded history of execution with each line having only one basic operation

Moreover, it is useful to rewrite Program 3 into a sequence of single assignment statements, in which each variable appears at most once in the left sides (Program 4), hence, ‘ ’ can be replaced by ‘ = ’. v1 v2 v3 v4 v5

1 2 3 4 5

IF (x2 .le.x3 ) THEN y1 = x1 (x2  x3 ) ELSE y1 = x1 (x2 + x3 ) ENDIF y1 = y1 /(exp(y1 ) + 1). Automatic Differentiation: Introduction, History and Rounding Error Estimation, Program 1 Example

z1 z1 z2 z2 z1

1 2 3 4 5

x2  x3 , x 1 v1 , exp(v2 ), v3 + 1;, v2 /v4 ,

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Program 4 Computational process

The sequence is called a computational process, where the additional variables v1 , . . . , v5 are called intermediate variables that keep the intermediate results. A graph called a computational graph, G = (V, A), may be used to represent the process (see Fig. 1). Algorithms There are two modes for AD algorithm, forward mode and reverse mode. The forward mode is to compute @yi /@xj (i = 1, . . . , m) for a fixed j, whereas the reverse mode is to compute @yi /@xj (j = 1, . . . , n) for a fixed i. The forward mode corresponds to tracing an expanded program such as Program 3 in the natural order. Assume that execution of the kth assignment in the program is represented as zc

k (z a ; z b )

:

(3)

When the values of both @za /@xj and @zb /@xj are known, @zc /@xj can be computed by applying the chain rule of

A

Automatic Differentiation: Introduction, History and Rounding Error Estimation

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Table 1 Elementary partial derivatives

zc D za ˙ zb

# #za 1

# #zb ˙1

zc D za  zb

zb

za

zc D za /zb p zc D za

1/zb za /zb2 1 p 1 2 / za (D 2 /zc ) –

zc D log(za )

1/za



zc D exp(za )

exp(za )(D zc )



zc D cos(za )

 sin(za )



zc D sin(za ) :: :

cos(za ) :: :

– :: :

zc D

1 10 20 2 3 30 4 40 5 50

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Figure 1 Computational graph

differentiation to (3): @z c @x j

@ k @z a @ k @zb C : @z a @x j @zb @x j

(4)

@ k /@za and @ k /@zb are called elementary partial derivatives, and are computed by Table 1 for various k. Introducing new variables z1 ; : : : ; z s , x 1 ; : : : ; x n corresponding to @z1 /@xj , . . . , @zs /@xj , @x1 /@xj , . . . , 0 (1  k @xn /@xj , respectively, and initializing x k 1, we may express (4) as  n, k 6D j) and x j zc

@ k @ k za C zb : @z a @zb

(5)

Thus, we can write down the whole program for the forward mode as shown in Program 5. The reverse mode corresponds to tracing a computational process such as Program 4 backwards. The kth computational step, i. e., execution of the kth assignment in the program, can be written in general as vk D

k (u k1 ; u k2 )ju k1 Dv ˛ k ;u k2 Dv ˇ k ;

(6)

(za ; zb )

(D zc /zb )

Initialization xj 1, x 0 (1  k  n; k ¤ j), Forward algorithm: z1 x2  x3 , z1 1 x2  1 x3, z1 z1 x 1 + x 1 z 1 , z1 x 1 z1 , z2 exp(z1 ), z2 z2 z 1 , z2 z2 + 1, z2 1 z2, z1 z1 /z2 , z1 (1/z2 ) z 1  (z1 /z2 ) z2

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Program 5 Forward mode program for differentation

where uk, 1 and uk, 2 are formal parameters, v˛k and vˇk are real parameters representing some of x1 , . . . , xn , v1 , . . . , vk  1 . If k is unary, uk, 2 and vˇk are omitted. Let r be the total number of computational steps. In Program 4, we have r = 5 and, for k = 2, e. g., 2 = ‘ ’, v˛2 = x1 and vˇ2 = v1 . The total differentiation of (6) yields the relations among dx1 , . . . , dxn , dv1 , . . . , dvr such as follows: dv k D

@ k @ k dv˛k C dvˇk @u k;1 @u k;2

(k D 1; : : : ; r) : (7)

The computation of the partial derivatives of the ith component of the final result yi = f i (x1 , . . . , xn ) in (1)

155

156

A

Automatic Differentiation: Introduction, History and Rounding Error Estimation

with respect to x1 , . . . , xn is that of the coefficients of the relation among dx1 , . . . , dxn and dyi . Here, new variables x 1 ; : : : ; x n , v 1 ; : : : ; v r are introduced for the computation of those coefficients. Without loss of generality, we may assume that the value of yi is computed at vr . After Program 4 is executed in the natural order with all the information on intermediate results preserved, these new variables are initialized as xj 0 (j = 1, . . . , n), v k 0 (k = 1, . . . , r 1) and vr 1, then the relation dy D

n X

x j dx j C

jD1

r X

v k dv k

100

holds. Secondly, dvr , dvr  1 , . . . , dvk can be eliminated from (8) in this order by modifying v ˛k

v ˛k

v ˇk

v ˇk

(9)

n X

x j dx j :

Automatic Differentiation: Introduction, History and Rounding Error Estimation, Program 6 Reverse mode program

(10)

Finally, if we change k in the reverse order, i. e., k = r, r  1, . . . , 1, we can successfully eliminate all the dvk (k = 1, . . . , r) to have dy D

400 300 200

(8)

kD1

@ k C vk ; @v˛k @ k C vk : @vˇk

500

Forward sweep: (insert Program 4 here) Initialization: (n = 3; r = 5) xj 0 ( j = 1; : : : ; n), vk 0 (k = 1; : : : ; r  1), vr 1, Reverse elimination: v2 v 2 + (1/v4 ) v 5 , v4 v 4 + (v5 /v4 ) v 5 , v3 v3 + 1 v4, v2 v 2 + v3 v 3 , x1 x 1 + v1 v 2 , v1 v 1 + x1 v 2 , x2 x2 + 1 v1, x3 x 3 + (1) v 1 .

(11)

jD1

The final coefficient x j indicates the value of @f i /@xj (j = 1, . . . , n). Program 6 in which modifications (9) and (10) are embedded is the reverse mode program, which is sometimes called the adjoint program of Program 4. It is easy to extend the algorithms for computing a linear combination of the column vectors of the Jacobian matrix J with the forward mode, and a linear combination of the row vectors of J with the reverse mode. Complexity It is proved that, for a constant C ( = 4  6, varying under different computational models), the total operation count for @yi /@xj ’s with a fixed j in the forward mode algorithm, as well as that for @yi /@ xj ’s with a fixed i in the reverse mode algorithm, is at most C  r, i. e., in O(r). Roughly speaking, r is proportional to the execution time T of the given program, so that the time complexity is in O(T). Furthermore, we have to repeat such computation n times to get all the required partial

derivatives by the forward mode, and m times by the reverse mode. What should be noted here is that the computational time of the forward or reverse mode algorithm for one set of derivatives does not depend on m or n but only on r. Denoting the spatial complexity of the original program by S, that of the forward mode algorithm is in O(S). However, the spatial complexity of the reverse mode is in O(T), since the reverse mode requires a history of the forward sweep recorded in storage whose size is in O(T). A rough sketch of the proof is as follows. Without loss of generality, assume that the given program is expanded into a sequence of single assignment statements with a binary or unary basic operation as shown in Program 3 and 4. The operation count for computing the elementary partial derivatives (Table 1) is bounded by a constant. The additional operation count for modifying v k ’s and x j ’s in (5), (9) and (10) is also bounded since there are at most two additions and two multiplications. There are r operations in the original program, so that the total operation count in the forward mode algorithm as well as that in the reverse mode algorithm is in O(r). Note that the computational complexities of the forward mode and the reverse mode may not be optimal, but at least one can compute them in time proportional

Automatic Differentiation: Introduction, History and Rounding Error Estimation

to that for the computation of the given original program. One can extend the AD algorithms to compute higher derivatives. In particular, it is well known how to compute a truncated Taylor series to get arbitrarily higher-order derivatives of a function with one variable [14]. One may regard a special function such as a Bessel function or a block of several arithmetic operations, such as the inner product of vectors, as a basic operation if the corresponding elementary partial derivatives are given with computational definitions. An analogy is pointed out in [7] between the algorithms for the partial derivatives and those of the computation of the shortest paths in an acyclic graph. It has also been pointed out that there may be pitfalls in the derived program with AD. For example, a tricky program IF (x.ne.1.0) THEN y = x x ELSE y = 1.0 + (x  1.0) b ENDIF can compute the value of a function f (x) = x2 correctly for all x. However, the derived program fails to compute f 0 (1.0), because the differentiation of the second assignment with respect to x is not 2.0 but b. Thus conditional branches (or equations equivalent to conditional branches) should be carefully dealt with. History A brief history of AD is as follows. There were not a few researchers in the world who had more or less independently proposed essentially the same algorithms. The first publication on the forward mode algorithm was presumably the paper by R.E. Wengert in 1964 [16]. After 15 years, books were published by L.B. Rall [14] and by H. Kagiwada et al. [9] which have been influential on the numerical-computational circle. The practical and famous software system for the forward mode automatic differentiation was Pascal-SC, and its descendants Pascal-XSC and C-XSC are popular now. The paper [13] might be the first to propose systematically the reverse mode algorithm. But there are many ways through which to approach the reverse mode algorithm. In fact, it is related to Lagrange multipliers, error analysis, generation of adjoint systems, reduction of computational complexity of computing the gradi-

A

ent, neural networks, etc. Of course, the principles of the derived algorithms are the same. Some remarkable works on the reverse mode algorithm had been done by S. Linnainmaa [11] and W. Miller and C. Wrathall [12] from the viewpoint of the error analysis, by W. Baur and V. Strassen [1] from that of complexity, and by P.J. Werbos [17] from that of the optimization of neural networks. A practical program had been developed by B. Speelpenning in 1980 [15] and it was rewritten into Fortran by K.E. Hillstrom in 1985 (now registered in Netlib [5,6]). Two proceedings of the international workshops held in 1991 and 1996 collect all the theories, techniques, practical programs, current works, and future problems as well as history on automatic differentiation [2,4]. It should be noted that, in 1992, A. Griewank proposed a drastic improvement of the reverse mode algorithm using the so-called checkpointing technique. He succeeded in reducing the order of the size of storage required for the reverse mode algorithm [3]. Several software tools for automatic differentiation have been developed and popular in the world, e. g., ADIC, ADIFOR, ADMIT-1, ADOL-C, ADOL-F, FADBAD, GRESS, Odyssée, PADRE2, TAMC, etc. (See [2,4].) Estimates of Rounding Errors In order to solve practical real-world problems, the approximation with floating-point numbers is inevitable so that it is important to analyze and estimate the accumulated rounding errors in a big numerical computation. Moreover, in terms of estimates of the accumulated rounding errors, one can define a normalized (or weighted) norm for a numerically computed vector, that is useful for checking whether the computed vector can be regarded as zero or not from the viewpoint of numerical computation [8]. For the previous example, let us denote as ı k the rounding error generated at the execution of the basic operation to compute the value of vk . Then, the rounding errors in the example is explicitly written: 1 2 3 4 5

e v1 e v2 e v3 e v4 e v5

=e x2  e x 3 + ı1 ; =e x1 e v 1 + ı2 ; = exp(e v 2 ) + ı3 ; =e v 3 + 1 + ı4 ; =e v 2 /e v 4 + ı5 :

157

158

A

Automatic Differentiation: Introduction, History and Rounding Error Estimation

Here,e v k is the value with accumulated rounding errors. Defining a function e f as e f (x1 ; x2 ; x3 ; ı1 ; ı2 ; ı3 ; ı4 ; ı5 ) x1 (x2  x3 C ı1 ) C ı2 Cı5 ; D exp(x1 (x2  x3 C ı1 ) C ı2 ) C ı3 C 1 C ı4

ing error. Regarding the locally generated errors ı k ’s as pseudo-probabilistic variables uniformly distributed over [ |vk | "M , |vk |"M ]’s, [f ]P , called probabilistic estimate, is defined by v !2 u r u 1 X @e f  vk : [ f ]P " M t 3 @ı k

(15)

kD1

one has e f (x1 ; x2 ; x3 ; ı1 ; : : : ; ı5 ); v5 D e e v5 D f (x1 ; x2 ; x3 ; 0; : : : ; 0) : Here, e v 5  v5 is the accumulated rounding error in the function value. For v5 = v2 /v4 = ' 5 (v2 , v4 ), one has e v 5  v5 D '5 (e v 2 ;e v 4 )  '5 (v2 ; v4 ) C ı5 D

@'5 (2 ; 4 )  (e v 2  v2 ) @v2 @'5 C (2 ; 4 )  (e v 4  v 4 ) C ı5 ; @v4

where 2 D  0e v 2 C(1 0 )v2 and 4 D  00e v 4 C(1 00 )v4 0 00 for 0 <  ;  < 1. Expandinge v 2  v2 ande v 4  v4 similarly and expanding the other intermediate variables sequentially, the approximation: e v 5  v5 '

5 X @e f ık @ı k

(12)

kD1

@e f is derived [10]. Note that @ık are computed as v k in Program 6, which are the final results of (9) and (10). The locally generated rounding error ı k for the floating-point number system is bounded by

jı k j  c  jv k j  " M ;

(13)

where "M indicates so-called ‘machine epsilon’ and c = 1 may be adopted for arithmetic operations according to IEEE754 standard. Then [f ]A , called absolute estimation, is defined by ˇ ˇ r ˇ eˇ X ˇ @f ˇ [ f ]A ˇ ˇ  jv k j  " M ; ˇ @ı k ˇ

(14)

kD1

which is an upper bound on the accumulated round-

There are several reports in which these estimates give quite good approximations to the actual accumulated rounding errors [8]. Moreover, one could answer the problem how to choose a norm for measuring the size of numerically computed vector. By means of the estimates of the rounding errors, a weighted norm of a vector f = [f 1 , . . . , f m ] whose components are numerically computed is defined by

 



f1 fm

;

;:::; jjfjjN

[ f ] [ f ]

1 A

m A

(16)

p

(p = 1,2 or 1). This weighted norm is called normalized norm, because it is normalized with respect to accumulated rounding errors. With this normalized norm, one can determine whether a computed vector approaches to zero or not in reference to the rounding errors accumulated in the components. Note that, since all the components of the vector are divided by the estimates of accumulated rounding errors, they have no physical dimension. The normalized norm may be used effectively as stopping criteria for iterative methods like the Newton–Raphson method.

See also  Automatic Differentiation: Calculation of the Hessian  Automatic Differentiation: Calculation of Newton Steps  Automatic Differentiation: Geometry of Satellites and Tracking Stations  Automatic Differentiation: Parallel Computation  Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators

Automatic Differentiation: Parallel Computation

 Automatic Differentiation: Root Problem and Branch Problem  Nonlocal Sensitivity Analysis with Automatic Differentiation References 1. Baur W, Strassen V (1983) The complexity of partial derivatives. Theor Comput Sci 22:317–330 2. Berz M, Bischof C, Corliss G, Griewank A (eds) (1996) Computational differentiation: Techniques, applications, and tools. SIAM, Philadelphia 3. Griewank A (1992) Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optim Methods Soft 1:35–54 4. Griewank A, Corliss GF (eds) (1991) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia 5. Hillstrom KE (1985) Installation guide for JAKEF. Techn Memorandum Math and Computer Sci Div Argonne Nat Lab ANL/MCS-TM-17 6. Hillstrom KE (1985) User guide for JAKEF. Techn Memorandum Math and Computer Sci Div Argonne Nat Lab ANL/MCS-TM-16 7. Iri M (1984) Simultaneous computation of functions, partial derivatives and estimates of rounding errors – Complexity and practicality. Japan J Appl Math 1:223–252 8. Iri M, Tsuchiya T, Hoshi M (1988) Automatic computation of partial derivatives and rounding error estimates with applications to large-scale systems of nonlinear equations. J Comput Appl Math 24:365–392 9. Kagiwada H, Kalaba R, Rasakhoo N, Spingarn K (1986) Numerical derivatives and nonlinear analysis. Math. Concepts and Methods in Sci. and Engin., vol 31. Plenum, New York 10. Kubota K, Iri M (1991) Estimates of rounding errors with fast automatic differentiation and interval analysis. J Inform Process 14:508–515 11. Linnainmaa S (1976) Taylor expansion of the accumulated rounding error. BIT 16:146–160 12. Miller W, Wrathall C (1980) Software for roundoff analysis of matrix algorithms. Acad Press, New York 13. Ostrovskii GM, Wolin JM, Borisov WW (1971) Über die Berechnung von Ableitungen. Wiss Z Techn Hochschule Chemie 13:382–384 14. Rall LB (1981) Automatic differentiation – Techniques and applications. Lecture Notes Computer Science, vol 120. Springer, Berlin 15. Speelpenning B (1980) Compiling fast partial derivatives of functions given by algorithms. Report Dept Computer Sci Univ Illinois UIUCDCS-R-80-1002 16. Wengert RE (1964) A simple automatic derivative evaluation program. Comm ACM 7:463–464 17. Werbos P (1974) Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD Thesis, Appl. Math. Harvard University

A

Automatic Differentiation: Parallel Computation CHRISTIAN H. BISCHOF1 , PAUL D. HOVLAND2 1 Institute Sci. Computing, University Technol., Flachen, Germany 2 Math. and Computer Sci. Div., Argonne National Lab., Argonne, USA MSC2000: 65Y05, 68N20, 49-04 Article Outline Keywords Background Implementation Approaches AD of Parallel Programs AD-Enabled Parallelism Data Parallelism Time Parallelism

Parallel AD Tools Summary See also References Keywords Automatic differentiation; Parallel computing; MPI Research in the field of automatic differentiation (AD) has blossomed since A. Griewank’s paper [15] in 1989 and the Breckenridge conference [17] in 1991. During that same period, the power and availability of parallel machines have increased dramatically. A natural consequence of these developments has been research on the interplay between AD and parallel computations. This relationship can take one of two forms. One can examine how AD can be applied to existing parallel programs. Alternatively, one can consider how AD introduces new potential for parallelism into existing sequential programs. Background Automatic differentiation relies upon the fact that all programming languages are based on a finite number of elementary functions. By providing rules for the differentiation of these elementary functions, and by com-

159

160

A

Automatic Differentiation: Parallel Computation

bining these elementary derivatives according to the chain rule of differential calculus, an AD system can differentiate arbitrarily complex functions. The chain rule is associative—partial derivatives can be combined in any order. The forward mode of AD combines the partial derivatives in the order of evaluation of the elementary functions to which they correspond. The reverse mode combines them in the reverse order. For systems with a large ratio of dependent to independent variables, the reverse mode offers lower operation counts, at the cost of increased storage costs [15]. The forward and the reverse mode are the extreme ends of a wide algorithmic spectrum of accumulating derivatives. Recently, hybrid approaches have been developed which combine the forward and the reverse mode [5,10], or apply them in a hierarchical fashion [8,25]. In addition, efficient checkpointing schemes have been developed which address the potential storage explosion of the reverse mode by judicious recomputation of intermediate states [16,19]. Viewing the problem of automatic differentiation as an edge elimination problem on the program graph corresponding to a particular code, one can in fact show that the problem of computing derivatives with minimum cost is NP-hard [21]. The development of more efficient heuristics is an area of active research (see, for example, several of the papers in [3]). Implementation Approaches Automatic differentiation is a particular instantiation of a rule-based semantic transformation process. That is, whenever a floating-point variable changes, an associated derivative object must be updated according to the chain rule of differential calculus. For example, in the forward mode of AD, a derivative object carries the partial derivative(s) of an associated variable with respect to the independent variable(s). In the reverse mode of AD, a derivative object carries the partial derivative(s) of the dependent variable(s) with respect to an associated variable. Thus, any AD tool must provide an instantiation of a ‘derivative object’, maintain the association between an original variable and its derivative object, and update derivative objects in a timely fashion. Typically AD is implemented in one of two ways: operator overloading or source transformation. In languages that allow operator overloading, such as C++

and Fortran90, each elementary function can be redefined so that in addition to the normal function, derivatives are computed as well, and either saved for later use or propagated by the chain rule. A simple class definition using the forward mode might be implemented as follows:

class adouble{ private: double value, grad[GRAD_ LENGTH]; public: /* constructors omitted */ friend adouble operator*(const adouble &, const adouble &); /* similar decs for other ops */ } adouble operator*(const adouble &g1, const adouble &g2){ int i; double newgrad[GRAD_ LENGTH]; for (i=0; i m, then it is more efficient to consider the input variables to be independent and then compose rf by the standard formula given below. This limits the computational effort for the forward mode to an amount essentially proportional to nm. The reverse mode is another way to apply the chain rule. Instead of propagating the seed gradients rt1 , . . . , rt m throughout the computation, differentiation is applied to the code list in reverse order. In the case of

Automatic Differentiation: Point and Interval

a single output variable t n , first t n is differentiated with respect to itself, then with respect to t n1 , . . . , t 1 . The resulting adjoints @t n / @t m , . . . , @t n / @t 1 and the seed gradients then give

r tn D

m X @t n iD1

@t i

r ti :

Formally, the adjoints are given by @t n D 1; @t n

X @t k @t i @t n D ; @t k @t i @t k i2I k

k = n 1, . . . , 1, where I k is the set of indices i> k such that t i depends explicitly on t k . It follows that the computational effort to obtain adjoints in the reverse mode is proportional to n, the length of the code list, and is essentially independent of the number of input variables and the dimensionalities of the seed gradients. This can result in significant savings in computational time. In the general case of several output variables, the same technique is applied to each to obtain their gradients. The reverse mode applied to the example code list gives @t10 @t10 @t10 @t9 @t10 @t8 @t10 @t7 @t10 @t6 @t10 @t5 @t10 @t4 @t10 @t3 @t10 @t2

D 1; D t6 ; @t10 @t9 D t6  1; @t9 @t8 @t10 @t8 D D t6  3; @t8 @t7 D

D t9 ; @t10 @t6 D t9  1; @t6 @t5 @t10 @t5 D D t9  1; @t5 @t4 @t10 @t5 D D t9  1; @t5 @t3 @t10 @t7 @t10 @t3 D C @t7 @t2 @t3 @t2 D (3t6 )  (2t2 ) C t9  t1 ; D

@t10 @t10 @t4 @t10 @t3 D C D t9  cos t1 C t9  t2 : @t1 @t4 @t1 @t3 @t1

A

Although this computation appears to be complicated, a comparison of operation counts in the case x, y are independent variables shows that even for this lowdimensional example, the reverse mode requires 13 operations to evaluate rf in addition to the operations required to evaluate f itself, while the forward mode requires 22 = 2 + 10 m. In reverse mode, the entire code list has to be evaluated and its values stored before the reverse sweep begins. In forward mode, since the computation of t i and each component of rt i can be carried out independently, a parallel computer with a sufficient number of processors could compute t n , rt n in a single pass through the code list, that is, with effort proportional to n. A more detailed comparison of forward and reverse modes for calculating gradients can be found in the tutorial article [1, pp. 1–18] and the book [3]. Implementation of automatic differentiation can be by interpretation, operator overloading, or code transformation. Early software for automatic differentiation simply interpreted a code list by calling the appropriate subroutines for each arithmetic operation or library function. Although inefficient, this approach is still useful in interactive applications in which functions entered from the keyboard are parsed to form code lists, which are then interpreted to evaluate the functions and their derivatives. Operator overloading is a familiar concept in mathematics, as the symbol ‘+’ is used to denote addition of such disparate objects as integers, real or complex numbers, vectors, matrices, functions, etc. It follows that a code list as defined above can be evaluated in any mathematical system in which the required arithmetic operations and library function are available, including differentiation arithmetics [14, pp. 73–90]. These arithmetics can be used to compute derivatives or Taylor coefficients of any order of sufficiently smooth functions. In optimization, gradient and Hessian arithmetics are most frequently used. In gradient arithmetic, the basic data type is the ordered pair (f , rf ) of a number and a vector representing values of a function and its gradient vector. Arithmetic operations in this system are defined by ( f ; r f ) ˙ (g; r g) D ( f ˙ g; r f ˙ r g); ( f ; r f )(g; r g) D ( f g; f r g C gr f );   f gr f  f r g (f;r f) D ; ; (g; r g) g g2

167

168

A

Automatic Differentiation: Point and Interval

division by 0 excluded. If  is a differentiable library function, then its extension to gradient arithmetic is defined by ( f ; r f ) D (( f );  0 ( f )r f ); which is just the chain rule. Hessian arithmetic extends the same idea to triples (f , rf , H f ), where H f is a matrix representing the value of the Hessian of f , H f = [ @2 f / @xi @xj ]. Programming differentiation arithmetic is convenient in modern computer languages which support operator overloading [9, pp. 291–309]. In this setting, the program is written with expressions or routines for functions in the regular form, and the compiler produces executable code for evaluation of these functions and the desired derivatives. For straightforward implementations such as the one cited above, the differentiation mode will be forward, which has implications for efficiency. Code transformation essentially consists of analyzing the code for functions to generate code for derivatives. This results in a new computer program which then can be compiled and run as usual. To illustrate this idea, note that in the simple example given above, the expressions

can then be appended to the code list for the function to obtain a routine with output values t 10 = f (x, y), tx3 = f x (x, y), and ty4 = f y (x, y). Further, automatic differentiation can be applied to this list to obtain routines for higher derivatives of f [13]. As a practical matter, duplicate assignments can be removed from such lists before compilation. Up to this point, the discussion has been of point AD, values have been assumed to be real or complex numbers with all operations and library functions evaluated exactly. In reality, the situation is quite different. Expressions, meaning their equivalent code lists, are evaluated in an approximate computer arithmetic known as floating-point arithmetic. This often yields very accurate results, but examples of simple expressions are known for which double and even higher precision calculation gives an answer in which even the sign is wrong for certain input values. Furthermore, such failures can occur without any outward indication of trouble. In addition, values of input variables may not be known exactly, thus increasing the uncertainty in the accuracy of outputs. The use of interval arithmetic (abbreviated IA) provides a computational way to attack these problems [11]. The basic quantities in interval arithmetic are finite closed real intervals X = [x1 , x2 ], which represent all real numbers x such that x1  x  x2 . Arithmetic operations ı on intervals are defined by

f x (x; y) D t9 (t2 C cos t1 ); f y (x; y) D 6t2 t6 C t1 t9 ; were obtained for the partial derivatives of the function in either forward or reverse mode. This differs from symbolic differentiation in that values of intermediate entries in the code list for f (x, y) are involved rather than the variables x, y. The corresponding lists for these expressions tx1 D cos t1 ; tx2 D t2 C tx1 ; tx3 D t9 tx2 ; t y1 D t2 t6 ; t y2 D 6t y1 ; t y3 D t1 t9 ; t y4 D t y2 C t y3 ;

X ı Y D fx ı y : x 2 X; y 2 Yg ; again an interval, division by an interval containing zero excluded. Library functions  are similarly extended to interval functions ˚ such that (x) 2 ˚(X) for all x 2 X with ˚(X) expected to be an accurate inclusion of the range (X) of  on X. Thus, if f (x) is a function defined by a code list, then assignment of the interval value X to the input variable and evaluation of the entries in interval arithmetic yields the output F(X) such that f (x) 2 F(X) for all x 2 X. The interval function F obtained in this way is called the united extension of f [11]. In the floating-point version of interval arithmetic, all endpoints are floating-point numbers and hence exactly representable in the computer. Results of arithmetic operations and calls of library functions are

Automatic Differentiation: Point and Interval

rounded outwardly (upper endpoints up, lower endpoints down) to the closest or very close floating-point numbers to maintain the guarantee of inclusion. Thus, one is still certain that for the interval extension F of f actually computed, f (x) 2 F(X) for all x 2 X. Thus, for example, an output interval F(X) which is very wide for a point input interval X = [x, x] would serve as a warning that the algorithm is inappropriate or illconditioned, in contrast to the lack of such information in ordinary floating-point arithmetic. Automatic differentiation carried out in interval arithmetic is called interval automatic differentiation. Interval computation has numerous implications for optimization, with or without automatic differentiation [6]. Maxima and minima of functions can ‘slip through’ approximate sampling of values at points of the floating-point grid, but have to be contained in the computable interval inclusion F(X) of f (x) over the same interval region X, for example. Although interval arithmetic properly applied can solve many optimization and other computational problems, a word of warning is in order. The properties of interval arithmetic differ significantly from those of real arithmetic, and simple ‘plugging in’ of intervals for numbers will not always yield useful results. In particular, interval arithmetic lacks additive and multiplicative inverses, and multiplication is only subdistributive across addition, X(Y+ Z)  XY+ XZ [11]. A real algorithm which uses one or more of these properties of real arithmetic is usually inappropriate for interval computation, and should be replaced by one that is suitable if possible. To this point, automatic differentiation has been applied only to code lists, which programmers customarily refer to as ‘straight-line code’. Automatic differentiation also applies to subroutines and programs, which ordinarily contain loops and branches in addition to expressions. These latter present certain difficulties in many cases. A loop which is traversed a fixed number of times can be ‘unrolled,’ and thus is equivalent to straight-line code. However, in case the stopping criterion is based on result values, the derivatives may not have achieved the same accuracy as the function values. For example, if the inverse function of a known function is being computed by iterative solution of the equation f (x) = y for x = f 1 (y), then automatic differentiation should be applied to f and the derivative

A

of the inverse function obtained from the standard formula (f 1 )0 (y) = (f 0 (x))1 . Branches essentially produce piecewise defined functions, and automatic differentiation then provides the derivative of the function defined by whatever branch is taken. This can create difficulties as described by H. Fischer [4, pp. 43–50], especially since a smooth function can be approximated well in value by highly oscillatory or other nonsmooth functions such as result from table lookups and piecewise rational approximations. For example, one would not expect to obtain an accurate approximation to the cosine function by applying automatic differentiation to the library subroutine for the sine. As with any powerful tool, automatic differentiation should not be expected to provide good results if applied indiscriminately, especially to ‘legacy’ code. As with interval arithmetic, automatic differentiation will yield the best results if applied to programs written with it in mind. Current state of the art software for point automatic differentiation of programs are ADOL-C, for programs written in C/C++ [5], and ADIFOR for programs in Fortran 77 [1, pp. 385–392]. Numerous applications of automatic differentiation to optimization and other problems can be found in the conference proceedings [1,4], which also contain extensive bibliographies. An important result with implications for optimization is that automatic differentiation can be used to obtain Newton steps without forming Jacobians and solving linear systems, see [1, pp. 253– 264]. From a historical standpoint, the principles of automatic differentiation go back to the early days of calculus, but implementation is a product of the computer age, hence the designation ‘automatic’. The terminology ‘algorithmic differentiation’, to which the acronym automatic differentiation also applies, is perhaps better. Since differentiation is widely understood, automatic differentiation literature contains many anticipations and rediscoveries. The 1962 Stanford Ph.D. thesis of R.E. Moore deals with both interval arithmetic and automatic differentiation of code lists to obtain Taylor coefficients of series solution of systems of ordinary differential equations. In 1964, R.E. Wengert [15] published on automatic differentiation of code lists and noted that derivatives could be recovered from Taylor coefficients. Early results in automatic differentiation were applied to code lists in forward mode, as described

169

170

A

Automatic Differentiation: Point and Interval

in [13]. G. Kedem [8] showed that automatic differentiation applies to subroutines and programs, again in forward mode. The reverse mode was anticipated by S. Linnainmaa in 1976 [10], and in the Ph.D. thesis of B. Speelpenning (Illinois, 1980), and published in more complete form by M. Iri in 1984 [7]. automatic differentiation via operator overloading and the concept of differentiation arithmetics, which are commutative rings with identity, were introduced by L.B. Rall [9, pp. 291– 309], [14, pp. 73–90], [4, pp. 17–24]. For additional information about the early history of automatic differentiation, see [13] and the article by Iri [4, pp. 3–16] for later developments. Analysis of algorithms for automatic differentiation has been carried out on the basis of graph theory by Iri [7], A. Griewank [12, pp. 128–161], [3], and equivalent matrix formulation by Rall [2, pp. 233–240]. See also  Automatic Differentiation: Calculation of the Hessian  Automatic Differentiation: Calculation of Newton Steps  Automatic Differentiation: Geometry of Satellites and Tracking Stations  Automatic Differentiation: Introduction, History and Rounding Error Estimation  Automatic Differentiation: Parallel Computation  Automatic Differentiation: Point and Interval Taylor Operators  Automatic Differentiation: Root Problem and Branch Problem  Bounding Derivative Ranges  Global Optimization: Application to Phase Equilibrium Problems  Interval Analysis: Application to Chemical Engineering Design Problems  Interval Analysis: Differential Equations  Interval Analysis: Eigenvalue Bounds of Interval Matrices  Interval Analysis: Intermediate Terms  Interval Analysis: Nondifferentiable Problems  Interval Analysis: Parallel Methods for Global Optimization  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods

 Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility  Interval Constraints  Interval Fixed Point Theory  Interval Global Optimization  Interval Linear Systems  Interval Newton Methods  Nonlocal Sensitivity Analysis with Automatic Differentiation References 1. Berz M, Bischof Ch, Corliss G, Griewank A (eds) (1996) Computational differentiation, techniques, applications, and tools. SIAM, Philadelphia 2. Fischer H, Riedmueller B, Schaeffler S (eds) (1996) Applied mathematics and parallel computing. Physica Verlag, Heidelberg 3. Griewank A (2000) Evaluating derivatives: Principles and techniques of algorithmic differentiation. SIAM, Philadelphia 4. Griewank A, Corliss GF (eds) (1991) Automatic differentiation of algorithms, theory, implementation, and application. SIAM, Philadelphia 5. Griewank A, Juedes D, Utke J (1996) ADOL-C, a package for the automatic differentiation of programs written in C/C++. ACM Trans Math Softw 22:131–167 6. Hansen E (1992) Global optimization using interval analysis. M. Dekker, New York 7. Iri M (1984) Simultaneous computation of functions, partial derivatives, and rounding errors: complexity and practicality. Japan J Appl Math 1:223–252 8. Kedem G (1980) Automatic differentiation of computer programs. ACM Trans Math Softw 6:150–165 9. Kulisch UW, Miranker WL (eds) (1983) A new approach to scientific computation. Acad. Press, New York 10. Linnainmaa S (1976) Taylor expansion of the accumulated rounding error. BIT 16:146–160 11. Moore RE (1979) Methods and applications of interval analysis. SIAM, Philadelphia 12. Pardalos PM (eds) (1993) Complexity in nonlinear optimization. World Sci, Singapore 13. Rall LB (1981) Automatic differentiation: Techniques and applications. Springer, Berlin 14. Ullrich C (eds) (1990) Computer arithmetic and selfvalidating numerical methods. Acad Press, New York 15. Wengert RE (1964) A simple automatic derivative evaluation program. Comm ACM 7:463–464

Automatic Differentiation: Point and Interval Taylor Operators

Automatic Differentiation: Point and Interval Taylor Operators AD, Computational Differentiation JAMES B. WALTERS, GEORGE F. CORLISS Marquette University, Milwaukee, USA MSC2000: 65K05, 90C30

Article Outline Keywords Introduction Operator Overloading Automatic Differentiation

Taylor Coefficients Point and Interval Taylor Operators Design of Operators Use of Interval Operators

One-at-a-Time Coefficient Generation Trade-Offs See also References

Keywords Automatic differentiation; Code list; Interval arithmetic; Overloaded operator; Taylor series

Frequently of use in optimization problems, automatic differentiation may be used to generate Taylor coefficients. Specialized software tools generate Taylor series approximations, one term at a time, more efficiently than the general AD software used to compute (partial) derivatives. Through the use of operator overloading, these tools provide a relatively easy-to-use interface that minimizes the complications of working with both point and interval operations.

f is an analytic function f : R ! R. Automatic differentiation (AD or computational differentiation) is the process of computing the derivatives of a function f at a point t = t 0 by applying rules of calculus for differentiation [9,10,17,18]. One way to implement AD uses overloaded operators.

Operator Overloading An overloaded (or generic) operator invokes a procedure corresponding to the types of its operands. Most programming languages implement this technique for arithmetic operations. The sums of two floating point numbers, two integers, or one floating point number and one integer are computed using three different procedures for addition. Fortran 77 or C denies the programmer the ability to replace or modify the various routines used implicitly for integer, floating point, or mixed-operand arithmetic, but Fortran 95, C++, and Ada support operator overloading for user-defined types. Once we have defined an overloaded operator for each rule of differentiation, AD software performs those operations on program code for f , as shown below. The operators either propagate derivative values or construct a code list for their computation. We give prototypical examples of operators overloaded to propagate Taylor coefficients below.

Automatic Differentiation The AD process requires that we have f in the form of an algorithm (e. g. computer program) so that we can easily separate and order its operations. For example, given f (t) = et /(2 + t), we can express f as an algorithm in Fortran 95 or in C++ (using an assumed AD module or class): In this section, we use AD to compute first derivatives. In the next section, we extend to point- and interval-valued Taylor series. To understand the AD process, we parse the program above into a sequence of unary and binary operations, called a code list, computational graph, or ‘tape’ [9]:

Introduction First, we briefly survey the tools of automatic differentiation and operator overloading used to compute pointand interval-valued Taylor coefficients. We assume that

A

x0 D t0 ;

x2 D 2 C x0 ; x1 x3 D : x1 D exp(x0 ); x2

171

172

A

Automatic Differentiation: Point and Interval Taylor Operators

program Example1 use AD_Module type(AD_Independent) :: t AD_Independent(0) type(AD_Dependent) :: f f = exp(t)/(2 + t) end program Example1 #include ‘AD_class.h’ void main (void) { AD_Independent t(0); AD_Dependent f ; f = exp(t)/(2 + t); } Automatic Differentiation: Point and Interval Taylor Operators, Figure 1 Fortran and C++ calls to AD operators

Differentiation is a simple mechanical process for propagating derivative values. Let t = t 0 represent the value of the independent variable with respect to which we differentiate. We know how to take the derivative of a variable, a constant, and unary and binary operations (i. e. +, , , /, sin, cos, exp, etc.). Then AD software annotates the code list: x0 D t0 ; rx0 D 1; x1 D exp(x0 ); rx1 D exp(x0 ) rx0 ; x2 D 2 C x0 ; rx2 D 0 C rx0 ; x1 x3 D ; x2 (rx1  rx2 x3 ) rx3 D : x2 AD propagates values of derivatives, not expressions as symbolic differentiation does. AD values are exact (up to round-off), not approximations of unknown quality as finite differences. For more information regarding AD and its applications, see [2,8,9,10, 17,18], or the bibliography [21].

AD software can use overloaded operators in two different ways. Operators can propagate both the value xi and its derivative rxi , as suggested by the annotated code list above. This approach is easy to understand and to program. We give prototypical Taylor operators of this flavor below. The second approach has the operators construct and store the code list. Various optimizations and parallel scheduling [1,4,12] may be applied to the code list. Then the code list is interpreted to propagate derivative values. This is the approach of AD tools such as ADOL-C [11], ADOL-F [20], AD01 [16], or INTOPT_90 [13]. The second approach is much more flexible, allowing the code list to be traversed in either the forward or reverse modes of AD (see [9]) or with various arithmetics (e. g. point- or interval-valued series). AD may be applied to functions of more than one variable, in which partial derivatives with respect to each are computed in turn, and to vector functions, in which the component functions are differentiated in succession. In addition, we can compute higher order derivative values. One application of AD involving higher order derivatives of f is the computation of Taylor (series) coefficients to which we turn in the next section. Source code transformation is a third approach to AD software used by ATOMFT [5] for Taylor series and by ADIFOR [3], PADRE2 [14], or Odyssée [19] for partial derivatives. Such tools accept the algorithm for f as data, rather than for execution, and produce code for computing the desired derivatives. The resulting code often executes more rapidly than code using overloaded operators.

Taylor Coefficients We define the Taylor coefficients of the analytic function f at the point t = t0 :

( f jt0 ) i :D

1 d i f (t0 ) ; i! dt i

for i = 0, 1, . . . , and let F := ((f |t 0 )i ) denote the vector of Taylor coefficients. Then Taylor’s theorem says

Automatic Differentiation: Point and Interval Taylor Operators

that there exists some point  (usually not practically obtainable) between t and t 0 such that

f (t) D

p X ( f jt0 ) i (t  t0 ) i iD0

1 d pC1 f () C (t  t0 ) pC1 : (p C 1)! dt pC1

(1)

Computation of Taylor coefficients requires differentiation of f . We generate Taylor coefficients automatically using recursion formulas for unary and binary operations. For example, the recurrences we need for our example f (t) = et /(2 + t) are x(t) D exp u(t) ) x 0 D xu 0 ; (x)0 D exp(u)0 ; (x) i D

i1 X

(x) j (u) i j

jD0

(i  j) ; i

x(t) D u(t) C v(t); (x) i D (u) i C (v) i ; u(t) ) xv D u; v(t)   P (u) i  i1 jD0 (x) j (v) i j (x) i D : (v)0

x(t) D

The recursion relations are described in more detail in [17]. Except for + and , each recurrence follows from Leibniz’ rule for the Taylor coefficients of a product. The relations can be viewed as a lower triangular system. The recurrence represents a solution by forward substitution, but there are sometimes accuracy or stability advantages in an iterative solution to the lower triangular system. The recurrences for each operation can be evaluated in floating-point, complex, interval, or other appropriate arithmetic. To compute the formal series for f (t) = et /(2 + t) expanded at t = 0, 8 ˆ X0 ˆ ˆ ˆ 0, the program Q0 will produce f 0 (x). And for x 2 D with A(x) = 0, the program Q0 fails because of division by zero. The case in which x 2 D with A(x ) = 0 is ambiguous. It says nothing about the existence of f 0 (x ). In this case, we distinguish the following four situations: A) f 0 (x ) does not exist, for instance n = 2, A(x) = x21 + x22 and B(x, y) = y, x = 0. B) A alone guarantees existence of f 0 (x ), for instance n = 2, A(x) = x41 + x42 , x = 0. C) B alone guarantees existence of f 0 (x ), for instance B(x, y) = y2 . D) A and B together guarantee existence of f 0 (x ), for instance n = 2, A(x) = x21 + x22 and B(x, y) = x1  x2  y, x = 0. What can be done to resolve the root problem? The use of AD tools for higher derivatives may be helpful. Consider the simple case n = 1, A 2 C1 , DB = Rn+1 , B(x, y) = y. So we have D :D fx : x 2 D A ; A(x)  0g p and f : D R ! R with f (x) D A(x). Assume that for x 2 R it can be decided whether or not x 2 D, for instance by testing x in a program for evaluating A.

For x* ∈ D, we require the value of the derivative f'(x*). Below, we list the relevant implications:
- A(x*) > 0 ⇒ f'(x*) = A'(x*) / (2√(A(x*))).
- A(x*) = 0 ⇒ no answer possible.
- A(x*) = 0, A'(x*) ≠ 0 ⇒ f'(x*) does not exist.
- A(x*) = 0, A'(x*) = 0 ⇒ no answer possible.
- A(x*) = 0, A'(x*) = 0, A''(x*) ≠ 0 ⇒ f'(x*) does not exist.
- A(x*) = 0, A'(x*) = 0, A''(x*) = 0 ⇒ no answer possible.
- A(x*) = 0, A'(x*) = 0, A''(x*) = 0, A'''(x*) ≠ 0 ⇒ f'(x*) does not exist.
- A(x*) = 0, A'(x*) = 0, A''(x*) = 0, A'''(x*) = 0 ⇒ no answer possible.
- A(x*) = 0, A'(x*) = 0, A''(x*) = 0, A'''(x*) = 0, A^(4)(x*) > 0 ⇒ f'(x*) = 0.
- A(x*) = 0, A'(x*) = 0, A''(x*) = 0, A'''(x*) = 0, A^(4)(x*) < 0 ⇒ f'(x*) does not exist.
- A(x*) = 0, A'(x*) = 0, A''(x*) = 0, A'''(x*) = 0, A^(4)(x*) = 0 ⇒ no answer possible.

Let n ∈ {1, 2, 3, ...} and A^(k)(x*) = 0 for k = 0, ..., 2n. Then:
- A^(2n+1)(x*) ≠ 0 ⇒ f'(x*) does not exist.
- A^(2n+1)(x*) = 0, A^(2n+2)(x*) > 0 ⇒ f'(x*) = 0.
- A^(2n+1)(x*) = 0, A^(2n+2)(x*) < 0 ⇒ f'(x*) does not exist.
- A^(2n+1)(x*) = 0, A^(2n+2)(x*) = 0 ⇒ no answer possible.

For a nonstandard treatment of these implications see [6]. Of course, in the general situation given in Table 3, the classification of cases is more problematic.
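This decision list is mechanical enough to code directly. The following Python sketch (our own illustration; the function name and input convention are assumptions) applies it, given the derivatives A^(k)(x*), which could be supplied by an AD tool for higher derivatives:

```python
# Decide what can be said about f'(x*) for f = sqrt(A), following the
# implication list above; derivs[k] = k-th derivative of A at x*.
def root_problem_diagnosis(derivs):
    if derivs[0] > 0:
        return "f'(x*) = A'(x*) / (2*sqrt(A(x*)))"
    if len(derivs) < 2:
        return "no answer possible"
    if derivs[1] != 0:
        return "f'(x*) does not exist"
    if len(derivs) >= 3 and derivs[2] != 0:
        return "f'(x*) does not exist"        # A''(x*) != 0
    # A, A', A'' all vanish: scan pairs (A^(2n+1), A^(2n+2)), n >= 1
    k = 3
    while k + 1 < len(derivs):
        odd, even = derivs[k], derivs[k + 1]
        if odd != 0:
            return "f'(x*) does not exist"
        if even > 0:
            return "f'(x*) = 0"
        if even < 0:
            return "f'(x*) does not exist"
        k += 2                                # both vanish: next pair
    return "no answer possible (need higher derivatives)"

# A(x) = x^4 at x* = 0: derivatives 0,0,0,0,24, so f(x) = x^2, f'(0) = 0
print(root_problem_diagnosis([0, 0, 0, 0, 24]))
```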

Branch Problem

A typical example of the branch problem is Gauss elimination for solving a system of linear equations with parameters. For illustrative purposes, it suffices to consider two equations with a two-dimensional parameter x (see Table 4). Here, it is assumed that:
a) D is a nonempty open subset of R²;
b) the function M : D ⊆ R² → R^{2×2} is differentiable;
c) the function R : D ⊆ R² → R² is differentiable;
d) x ∈ D ⇒ the matrix M(x) is regular.
The program GAUSS defines the function F : D ⊆ R² → R² with M(x) · F(x) = R(x).


Automatic Differentiation: Root Problem and Branch Problem, Table 4 Program GAUSS for evaluating F at x

input: x ∈ D
  M11 ← M11(x);  M12 ← M12(x)
  M21 ← M21(x);  M22 ← M22(x)
  R1 ← R1(x);    R2 ← R2(x)
  IF M11 ≠ 0 THEN
    S1: E ← M21/M11
        M22 ← M22 − E·M12
        R2 ← R2 − E·R1
        F2 ← R2/M22
        F1 ← (R1 − M12·F2)/M11
  ELSE
    S2: F2 ← R1/M12
        F1 ← (R2 − M22·F2)/M21
output: F(x) = (F1, F2)

Since the matrix M(x) is regular for x ∈ D, the program GAUSS and the function F are well-defined. Furthermore, the function F is differentiable. Standard AD (in the forward mode) transforms GAUSS into a new program by inserting assignment statements for derivatives in the proper places. The resulting program GAUSS' is also well-defined, and for x ∈ D it is supposed to produce F(x) and F'(x). Now choose

\[ D = \{ x \in \mathbb{R}^2 : 0 < x_1 < 2, \; 0 < x_2 < 2 \}, \]

\[ M(x) = \begin{pmatrix} M_{11}(x) & M_{12}(x) \\ M_{21}(x) & M_{22}(x) \end{pmatrix} = \begin{pmatrix} x_1 - x_2 & 1 \\ 10 & x_1 + x_2 \end{pmatrix}, \]

\[ R(x) = \begin{pmatrix} R_1(x) \\ R_2(x) \end{pmatrix} = \begin{pmatrix} 100\,(x_1 + 2 x_2) \\ 100\,(x_1 - 2 x_2) \end{pmatrix}. \]

It is easy to see that D is a nonempty open subset of R², that the functions M and R are differentiable, and that M(x) is regular for x ∈ D. GAUSS' produces

\[ F'(1, 1) = \begin{pmatrix} -40 & -90 \\ 100 & 200 \end{pmatrix}, \]

but the correct value is

\[ F'(1, 1) = \begin{pmatrix} -54 & -76 \\ 170 & 130 \end{pmatrix}. \]

One can easily check that the wrong result is not limited to the forward mode: the reverse mode yields exactly the same wrong result. To better understand the situation we define

D1 := {x ∈ D : M11(x) ≠ 0},   D2 := {x ∈ D : M11(x) = 0}.

The program GAUSS can be considered as a piecewise definition of the function F:

F(x) = F(x) according to S1 for x ∈ D1;   F(x) according to S2 for x ∈ D2.

Normally, one is not too concerned about the domain of a function. But indeed in this case, we must be concerned. Let F|D1 denote the restriction of F to D1 and let F|D2 denote the restriction of F to D2. Then, of course,

F(x) = (F|D1)(x) for x ∈ D1;   (F|D2)(x) for x ∈ D2.

The domain D1 of the function F|D1 is an open set, x ∈ D1 is an interior point of D1, and hence

F'(x) = (F|D1)'(x) for x ∈ D1,

and this is the value GAUSS' produces. The domain D2 of the function F|D2 is too thin: it has no interior points, and hence F|D2 is not differentiable. In other words, the function F|D2 does not provide enough information to obtain F'(x) for x ∈ D2. Thus GAUSS' cannot produce F'(x) for x ∈ D2. What GAUSS' actually presents for F'(x) is the value of the derivative of another function, which is of no interest here. For more see [1]. In [4] it is claimed that the use of a certain branching function method makes the branch problem vanish.
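The failure is easy to reproduce with a few lines of forward-mode AD. The sketch below (our own illustration; the minimal Dual class is an assumption, not one of the cited tools) differentiates GAUSS at x = (1, 1) along the first coordinate; the ELSE branch is taken, and the propagated derivatives give the wrong column (−40, 100) instead of the correct (−54, 170):

```python
# Forward-mode AD run straight through the branches of GAUSS.
class Dual:
    def __init__(self, v, d): self.v, self.d = v, d
    def __add__(s, o): return Dual(s.v + o.v, s.d + o.d)
    def __sub__(s, o): return Dual(s.v - o.v, s.d - o.d)
    def __mul__(s, o): return Dual(s.v * o.v, s.d * o.v + s.v * o.d)
    def __truediv__(s, o):
        return Dual(s.v / o.v, (s.d * o.v - s.v * o.d) / o.v ** 2)

def gauss(x1, x2):
    M11, M12 = x1 - x2, Dual(1.0, 0.0)
    M21, M22 = Dual(10.0, 0.0), x1 + x2
    R1 = Dual(100.0, 0.0) * (x1 + Dual(2.0, 0.0) * x2)
    R2 = Dual(100.0, 0.0) * (x1 - Dual(2.0, 0.0) * x2)
    if M11.v != 0:                       # branch S1
        E = M21 / M11
        M22, R2 = M22 - E * M12, R2 - E * R1
        F2 = R2 / M22
        F1 = (R1 - M12 * F2) / M11
    else:                                # branch S2
        F2 = R1 / M12
        F1 = (R2 - M22 * F2) / M21
    return F1, F2

# directional derivative along e1 at x = (1, 1): branch S2 is taken
F1, F2 = gauss(Dual(1.0, 1.0), Dual(1.0, 0.0))
print(F1.d, F2.d)   # (-40, 100), although the correct column is (-54, 170)
```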


This is true in certain cases; in our example, however, the branching function method fails because it encounters division by zero. At least this suggests that something went wrong. For a partial solution to the branch problem, see [1], and for a nonstandard treatment of the branch problem, see [6].

A simple example of the branch problem is shown in the informal program

IF x ≠ 1 THEN f(x) ← x · x ELSE f(x) ← 1.

This program defines the function f : R → R with f(x) = x². Of course, f is differentiable; in particular we have f'(1) = 2. Standard AD software produces the wrong result f'(1) = 0. It is not surprising that symbolic manipulation packages produce the same wrong result. Here it is obvious that the else-branch does not carry enough information for computing the correct f'(1).

Sometimes branching is done to save work. Consider the function f : D ⊆ Rⁿ → R with

f(x) = s(x) + c(x) · E(x),

where D is an open set. The real-valued functions s, c, E may be given explicitly or by subroutines. Assume that f(x) has to be evaluated many times for varying x's, that c(x) = 0 for many interesting values of x, and that E(x) is computationally costly. Then it is effective to set up a program for computing f(x) as shown in Table 5.

Automatic Differentiation: Root Problem and Branch Problem, Table 5 Program SW for computing f(x)

input: x ∈ D
  c ← c(x)
  IF c ≠ 0 THEN
    S1: r(x) ← s(x) + c(x) · E(x)
        f(x) ← r(x)
  ELSE
    S2: f(x) ← s(x)
output: f(x)

Assume that the functions s, c, E are differentiable. Then f is differentiable too. For given x ∈ D we ask for f'(x). Standard AD (in the forward mode) transforms SW into a new program by inserting assignment statements concerning derivatives. The resulting program SW' is well-defined, and for given x ∈ D it is supposed to produce f(x) and f'(x). Define the sets

D1 := {x ∈ D : c(x) ≠ 0},   D2 := {x ∈ D : c(x) = 0}.

SW' works correctly to produce

f'(x) = r'(x) for x ∈ D1.

Looking at SW, it is tempting to assume f'(x) = s'(x) for x ∈ D2, and SW' actually follows this assumption. But it is clear that

f'(x) = s'(x) + E(x) · c'(x) + c(x) · E'(x) for x ∈ D,

and in particular

f'(x) = s'(x) + E(x) · c'(x) for x ∈ D2.

If x ∈ D2, and if either E(x) = 0 or c'(x) = 0, then SW' produces the correct f'(x); otherwise SW' fails.

See also
- Automatic Differentiation: Calculation of the Hessian
- Automatic Differentiation: Calculation of Newton Steps
- Automatic Differentiation: Geometry of Satellites and Tracking Stations
- Automatic Differentiation: Introduction, History and Rounding Error Estimation
- Automatic Differentiation: Parallel Computation


- Automatic Differentiation: Point and Interval
- Automatic Differentiation: Point and Interval Taylor Operators
- Nonlocal Sensitivity Analysis with Automatic Differentiation

References
1. Beck T, Fischer H (1994) The if-problem in automatic differentiation. J Comput Appl Math 50:119–131
2. Berz M, Bischof Ch, Corliss GF, Griewank A (eds) (1996) Computational differentiation: Techniques, applications, and tools. SIAM, Philadelphia


3. Griewank A, Corliss GF (eds) (1991) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia
4. Kearfott RB (1996) Rigorous global search: Continuous problems. Kluwer, Dordrecht
5. Rall LB (1981) Automatic differentiation: Techniques and applications. Lecture Notes in Computer Science, vol 120. Springer, Berlin
6. Shamseddine K, Berz M (1996) Exception handling in derivative computation with nonarchimedean calculus. In: Berz M, Bischof Ch, Corliss GF, Griewank A (eds) Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia, pp 37–51



Bayesian Global Optimization
JONAS MOCKUS
Institute of Mathematics and Informatics, Vilnius, Lithuania

MSC2000: 90C26, 90C10, 90C15, 65K05, 62C10

Article Outline
Keywords
See also
References

Keywords
Global optimization; Discrete optimization; Bayesian approach; Heuristics

The traditional numerical analysis considers optimization algorithms which guarantee some accuracy for all functions to be optimized. This includes exact algorithms (that is, the worst-case analysis). Limiting the maximal error requires a computational effort that often increases exponentially with the size of the problem. An alternative is average case analysis, where the average error is made as small as possible. The average is taken over a set of functions to be optimized. The average case analysis is called the Bayesian approach (BA) [7,14].

There are several ways of applying the BA in optimization. The direct Bayesian approach (DBA) is defined by fixing a prior distribution P on a set of functions f(x) and by minimizing the Bayesian risk function R(x) [6,14]. The risk function describes the average deviation from the global minimum. The distribution P is regarded as a stochastic model of f(x), x ∈ R^m, where f(x) might be a deterministic or a stochastic function. In the Gaussian case, assuming (see [14]) that the (n+1)st observation is the last one,

\[ R(x) = \frac{1}{\sqrt{2\pi}\, s_n(x)} \int_{-\infty}^{+\infty} \min(c_n, z) \, e^{-\frac{1}{2} \left( \frac{z - m_n(x)}{s_n(x)} \right)^2} \, dz . \tag{1} \]

Here, c_n = min_i z_i − ε, z_i = f(x_i), m_n(x) is the conditional expectation given the values of z_i, i = 1, ..., n, d_n(x) is the conditional variance, and ε > 0 is a correction parameter. The objective of DBA (used mainly in continuous cases) is to provide as small an average error as possible while keeping the convergence conditions.

The Bayesian heuristic approach (BHA) means fixing a prior distribution P on a set of functions f_K(x) that define the best values obtained using K times some heuristic h(x) to optimize a function v(y) of variables y ∈ R^n [15]. As usual, the components of y are discrete variables. The heuristic h(x) defines an expert opinion about the decision priorities. It is assumed that the heuristics or their 'mixture' depend on some continuous parameters x ∈ R^m, where m < n.

The Bayesian stopping rules (BSR) [3] define the best on-average stopping rule. In the BSR, the prior distribution is determined regarding only those features of the objective function f(x) which are relevant for the stopping of the algorithm of global optimization.

Now all these ways will be considered in detail, starting from the DBA. The Wiener process is common [11,16,19] as a stochastic model for applying the DBA in the one-dimensional case m = 1. The Wiener model implies that almost all the sample functions f(x) are continuous, that the increments f(x4) − f(x3) and f(x2) − f(x1), x1 < x2 < x3 < x4, are stochastically independent, and that f(x) is Gaussian (0, σx) at any fixed x > 0.


Note that the Wiener process originally provided a mathematical model of a particle in Brownian motion. The Wiener model is extended to the multidimensional case, too [14]. However, simple approximate stochastic models are preferable if m > 1. These models are designed by replacing the traditional Kolmogorov consistency conditions, because those conditions require the inversion of matrices of order n for computing the conditional expectation m_n(x) and variance d_n(x). The favorable exception is the Markov process, including the Wiener one; but extending the Wiener process to m > 1, the Markovian property disappears. Replacing the regular consistency conditions by
- continuity of the risk function R(x);
- convergence of x_n to the global minimum;
- simplicity of the expressions of m_n(x) and s_n(x),
the following simple expression of R(x) is obtained using the results of [14]:

\[ R(x) = \min_{1 \le i \le n} z_i \; - \; \min_{1 \le i \le n} \frac{\| x - x_i \|^2}{z_i - c_n} . \]
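Selecting the next observation point by minimizing this simple risk is only a few lines of code. The following Python sketch (our own illustration; the candidate-grid interface and data are assumptions) evaluates R(x) over a set of candidates:

```python
import numpy as np

def next_point(X, z, candidates, eps=0.1):
    """x_{n+1} = argmin R(x) for the simple risk above."""
    X, z = np.asarray(X, float), np.asarray(z, float)
    c_n = z.min() - eps                          # c_n = min_i z_i - eps
    def R(x):
        d2 = ((X - x) ** 2).sum(axis=1)          # ||x - x_i||^2
        return z.min() - (d2 / (z - c_n)).min()
    return min(candidates, key=R)

# five observations of a one-dimensional function, candidates on a grid
X = [[0.1], [0.3], [0.5], [0.7], [0.9]]
z = [0.8, 0.2, 0.5, 0.9, 0.4]
grid = [[t] for t in np.linspace(0.0, 1.0, 101)]
print(next_point(X, z, grid))  # balances exploration against good z_i
```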

The aim of the DBA is to minimize the expected deviation. In addition, DBA has some good asymptotic properties. It is shown in [14] that

\[ d^* = d_a \, (f_a - f^* + \varepsilon)^{1/2}, \qquad n \to \infty, \]

where d* is the density of the x_i around the global optimum f*, d_a and f_a are the average density of the x_i and the average value of f(x), and ε is the correction parameter in expression (1). That means that DBA provides convergence to the global minimum for any continuous f(x), and a greater density of observations x_i around the global optimum, if n is large. Note that the correction parameter ε has a similar influence as the temperature in simulated annealing. However, that is a superficial similarity. Using DBA, the good asymptotic behavior should be regarded just as an interesting 'by-product': Bayesian decisions are applied for small samples, where asymptotic properties are not noticeable.

Bayesian Global Optimization, Figure 1 The Wiener model

Choosing the optimal point x_{n+1} for the next iteration by DBA, one solves a complicated auxiliary optimization problem minimizing the expected deviation R(x) from the global optimum (see Fig. 1). That makes the DBA useful mainly for computationally expensive functions of a few (m < 20) continuous variables. This happens in a wide variety of problems, such as maximization of the yield of differential amplifiers, optimization of the mechanical system of a shock absorber, optimization of composite laminates, estimation of parameters of an immunological model and of nonlinear time series, and planning of extremal experiments on thermostable polymeric composition [14]. Using DBA, the expert knowledge is included by defining the prior distribution. In BHA the expert knowledge is involved by defining the heuristics and optimizing their parameters using DBA. If the number of variables is large and the objective function is not expensive, the Bayesian heuristic approach is preferable. That is the case in many discrete optimization problems.


As usual, these problems are solved using heuristics based on an expert opinion. Heuristics often involve randomization procedures depending on some empirically defined parameters. Examples of such parameters are the initial temperature, if simulated annealing is applied, or the probabilities of different randomization algorithms, if their mixture is used. In these problems, the DBA is a convenient tool for optimization of the continuous parameters of various heuristic techniques. That is the Bayesian heuristic approach [15].

The example of the knapsack problem illustrates the basic principles of BHA in discrete optimization. Given a set of objects j = 1, ..., n with values c_j and weights g_j, find the most valuable collection of limited weight:

\[ \max_y \; v(y) = \sum_{j=1}^{n} c_j y_j \qquad \text{s.t.} \qquad \sum_{j=1}^{n} g_j y_j \le g . \]

Here the objective function v(y) depends on n Boolean variables y = (y_1, ..., y_n), where y_j = 1 if object j is in the collection, and y_j = 0 otherwise. The well-known greedy heuristic h_j = c_j/g_j is the specific value of object j. The greedy heuristic algorithm, 'take the greatest feasible h_j', is very fast, but it may get stuck in some nonoptimal decision. A way to force the heuristic algorithm out of such nonoptimal decisions is to make decision j with probability r_j = π_x(h_j), where π_x(h_j) is an increasing function of h_j and x = (x_1, ..., x_m) is a parameter vector. The DBA is used to optimize the parameters x by minimizing the best result f_K(x) obtained by applying K times the randomized heuristic algorithm π_x(h_j). That is the most expensive operation of BHA; therefore, parallel computation of f_K(x) should be used when possible, reducing the computing time in proportion to the number of parallel processors. Optimization of x adapts the heuristic algorithm π_x(h_j) to a given problem. Let us illustrate the parameterization of π_x(h_j) using three randomization functions:

\[ r_i^{l} = h_i^{l} \Big/ \sum_j h_j^{l}, \qquad l = 0, 1, \infty . \]

Here, the upper index l = 0 denotes the uniformly distributed component and l = 1 defines the linear component of the randomization. The index ∞ denotes the pure heuristics with no randomization, where r_i^∞ = 1 if h_i = max_j h_j and r_i^∞ = 0 otherwise. Here, the parameter x = (x_0, x_1, x_∞) defines the probabilities of using the randomizations l = 0, 1, ∞, correspondingly; a sketch of the resulting randomized algorithm is given after the table below. The optimal x may be applied in different but related problems, too [15]. That is very important in 'on-line' optimization, adapting the BHA algorithms to some unpredicted changes.

Another simple example of BHA application is trying different permutations of some feasible solution y^0. Then heuristics are defined as the difference h_i = v(y^i) − v(y^0) between the permuted solution y^i and the original one y^0. The well-known simulated annealing algorithm illustrates the parameterization of π_x(h_j) related to a single parameter x: here the probability of accepting a worse solution is equal to e^{−h_i/x}, where x is the 'annealing temperature'.

The comparison of BHA with exact branch and bound algorithms solving a set of flow-shop problems is shown by the table from [15] (R = 100; K = 1; J = 10; S = 10; O = 10):

Technique | f_B   | d_B  | x_0  | x_1  | x_∞
BHA       | 6.18  | 0.13 | 0.28 | 0.45 | 0.26
CPLEX     | 12.23 | 0.00 | −    | −    | −

Here S is the number of tools, J is the number of jobs, O is the number of operations; f_B, x_0, x_1, x_∞ are the mean results, d_B is the variance, and 'CPLEX' denotes the standard MILP technique truncated after 5000 iterations. The table shows that, in the randomly generated flow-shop problems, the average make-span obtained by BHA was almost half that obtained by the exact branch and bound procedure truncated at the same time as BHA. The important conclusion is that stopping the exact methods before they reach the exact solution is not a good way to obtain an approximate solution.


The BHA has been used to solve batch scheduling [15] and clustering (parameter grouping) problems. In the clustering problem the only parameter x was the initial annealing temperature [8]. The main objective of BHA is to improve any given heuristic by defining the best parameters and/or the best 'mixtures' of different heuristics. Heuristic decision rules mixed and adapted by BHA often outperform (in terms of speed) even the best individual heuristics, as judged by the considered examples. In addition, BHA provides almost sure convergence. However, the final results of BHA depend on the quality of the specific heuristics, including the expert knowledge. That means the BHA should be regarded as a tool for enhancing the heuristics, but not for replacing them.

Many well-known optimization algorithms, such as genetic algorithms (GA) [10], GRASP [13], and tabu search (TS) [9], may be regarded as generalized heuristics that can be improved using BHA. There are many heuristics tailored to fit specific problems. For example, the Gupta heuristic was the best one while applying BHA to the flow-shop problem [15]. Genetic algorithms [10] are an important 'source' of interesting and useful stochastic search heuristics. It is well known [2] that the results of genetic algorithms depend on the mutation and cross-over parameters; the Bayesian heuristic approach could be used in optimizing those parameters. In the GRASP system [13] the heuristic is repeated many times. During each iteration a greedy randomized solution is constructed and the neighborhood around that solution is searched for the local optimum. The 'greedy' component constructs a solution one element at a time until a solution is complete. A possible application of BHA in GRASP is in optimizing the random selection of a candidate to be in the solution, because different random selection rules could be used and their best parameters should be defined. BHA might be useful as a local component, too, by randomizing the local decisions and optimizing the corresponding parameters. In tabu search, the best combinations of short and long term memory and the best balances of intensification and diversification strategies may be obtained using BHA. Hence the Bayesian heuristic approach may be considered when applying almost any stochastic or heuristic algorithm of discrete optimization. The proven convergence of a discrete search method (see, for example, [1]) is an asset; otherwise, the convergence conditions are provided by tuning the BHA [15], if needed.

The third way to apply the Bayesian approach is the Bayesian stopping rules (BSR) [3]. The first way, the DBA, considers a stochastic model of the whole function to be optimized. In the BSR, the stochastic models regard only the features of the objective function which are relevant for the stopping of the multistart algorithm.

In [20] a statistical estimate of the structure of multimodal problems is investigated. The results are applied in developing BSR for the multistart global optimization methods [4,5,18]. Besides these three ways, there are other ways to apply the Bayesian approach in global optimization. For example, the Bayes theorem was used to derive the posterior distribution of the values of parameters in the simulated annealing algorithm, to make an optimal choice in the trade-off between small steps in the control parameter with short Markov chains and large steps with long Markov chains [12]. In the information approach [17], a prior distribution is considered on the location parameter α of the global optimum of a one-dimensional objective function. Then an estimate of α is obtained by maximizing the likelihood function after a number of evaluations of the objective function, and this estimate is taken as the next search point. For the solution of multidimensional problems, it is proposed to transform the problem into a one-dimensional one by means of Peano maps.

See also
- Adaptive Simulated Annealing and its Application to Protein Folding
- Genetic Algorithms for Protein Structure Prediction
- Global Optimization Based on Statistical Models
- Monte-Carlo Simulated Annealing in Protein Folding
- Packet Annealing
- Random Search Methods
- Simulated Annealing
- Simulated Annealing Methods in Protein Folding
- Stochastic Global Optimization: Stopping Rules
- Stochastic Global Optimization: Two-Phase Methods

References
1. Andradottir S (1996) A global search method for discrete stochastic optimization. SIAM J Optim 6:513–530
2. Androulakis IP, Venkatasubramanian V (1991) A genetic algorithmic framework for process design and optimization. Comput Chem Eng 15:217–228
3. Betro B (1991) Bayesian methods of global optimization. J Global Optim 1:1–14


4. Betro B, Schoen F (1987) Sequential stopping rules for the multistart algorithm in global optimization. Math Program 38:271–286
5. Boender G, Rinnooy Kan A (1987) Bayesian stopping rules for multi-start global optimization methods. Math Program 37:59–80
6. DeGroot M (1970) Optimal statistical decisions. McGraw-Hill, New York
7. Diaconis P (1988) Bayesian numerical analysis. In: Statistical Decision Theory and Related Topics. Springer, Berlin, pp 163–175
8. Dzemyda G, Senkiene E (1990) Simulated annealing for parameter grouping. Trans Inform Th, Statistical Decision Th, Random Processes, 373–383
9. Glover F (1994) Tabu search: improved solution, alternatives. In: Mathematical Programming. State of the Art. Univ. Michigan, Ann Arbor, MI, pp 64–92
10. Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA
11. Kushner HJ (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J Basic Eng 86:97–100
12. van Laarhoven PJM, Boender CGE, Aarts EHL, Rinnooy Kan AHG (1989) A Bayesian approach to simulated annealing. Probab Eng Inform Sci 3:453–475
13. Mavridou T, Pardalos PM, Pitsoulis LS, Resende MGC (1997) A GRASP for the biquadratic assignment problem. Europ J Oper Res
14. Mockus J (1989) Bayesian approach to global optimization. Kluwer, Dordrecht
15. Mockus J, Eddy W, Mockus A, Mockus L, Reklaitis G (1997) Bayesian heuristic approach to discrete and global optimization. Kluwer, Dordrecht
16. Saltenis VR (1971) On a method of multiextremal optimization. Automatics and Computers (Avtomatika i Vychislitelnaya Tekhnika) 3:33–38 (In Russian)
17. Strongin RG (1978) Numerical methods in multi-extremal problems. Nauka, Moscow
18. Timmer GT (1984) Global optimization: A stochastic approach. PhD Thesis, Erasmus Univ. Rotterdam, The Netherlands
19. Törn A, Žilinskas A (1989) Global optimization. Springer, Berlin
20. Zielinski R (1981) A statistical estimate of the structure of multiextremal problems. Math Program 21:348–356

Bayesian Networks
ALLA R. KAMMERDINER
Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA


Article Outline
Keywords
Synonyms
Introduction
Definitions
The Chain Rule for Bayesian Networks
Cases/Models
Methods
Applications
See also
References

Keywords
Graphical models; Joint probability distribution; Bayesian statistics; Data mining; Optimization

Synonyms
Bayes nets

Introduction

After their initial introduction in 1982, Bayesian networks (BN) have quickly developed into a dynamic area of research. This is largely due to the special structure of Bayesian networks, which allows them to be very efficient in modeling domains with inherent uncertainty. In addition, there is a strong connection between Bayesian networks and other adjacent areas of research, including data mining and optimization.

Bayesian networks have their lineage in statistics, and were first formally introduced in the field of artificial intelligence and expert systems by Pearl [17] in 1982 and Spiegelhalter and Knill-Jones [21] in 1984. The first real-life applications of Bayesian networks were Munin [1] in 1989 and Pathfinder [7] in 1992. Since the 1990s, the amount of research in Bayesian networks has increased dramatically, resulting in many modern applications of Bayesian networks to various problems of data mining, pattern recognition, image processing and data fusion, engineering, etc.

Bayesian networks comprise a class of interesting special cases, many of which were in consideration long before the first introduction of Bayesian networks. Among such interesting cases are some frequently used types of model simplifying assumptions, including naïve Bayes, the noisy-OR and noisy-AND


models, as well as different models with specialized structure, in particular the time-stamped models, the strictly repetitive models, dynamic Bayesian networks, hidden Markov models, the Kalman filter, and Markov chains. Artificial neural networks are another subclass of Bayesian networks, one which has many applications, in particular in biology and computer science.

Definitions

Based on classical probability calculus, the idea of a Bayesian network has its early origins in Bayesian statistics. On the other hand, it has the added benefit of incorporating the notions of graph theory and networks, which allows us to visualize the relationships between the variables represented by the nodes of a Bayesian network. In other words, a Bayesian network is a graphical model providing a compact representation for communicating causal relationships in a knowledge domain.

Below we introduce two alternative definitions of the general notion of a Bayesian network, based on the usual concepts of probability and graph theory (e.g. joint probability distribution, conditional probability distribution; nodes and edges of a graph, a parent of a node, a child of a node, etc.). Roughly speaking, a Bayesian network can be viewed as an application of Bayesian calculus on a causal network. More precisely, one can describe a Bayesian network as a mathematical model representing the joint distribution of some set of random variables as a graph, with the edges characterized by the conditional distributions of each variable given its parents in the graph. Given a finite collection of random variables X = {X1, X2, ..., Xn}, the formal definition of a Bayesian network can be stated as follows:

Definition 1 A Bayesian network is an ordered pair (G, D), where
- The first component G represents a directed acyclic graph with nodes, which correspond to the random variables X1, X2, ..., Xn, and directed arcs, which symbolize conditional dependencies between the variables. The set of all the arcs of G satisfies the following assumption: each random variable in the graph is conditionally independent of its nondescendants in G, given its parents in G.

- The second component D corresponds to the set of parameters that, for each variable Xi, 1 ≤ i ≤ n, define its conditional distribution given its parents in the graph G.

Note that the variables in a Bayesian network can follow discrete or continuous distributions. Clearly, for continuously distributed variables, there is a corresponding conditional probability density function f(xi | Pa(xi)) of Xi given its parents Pa(Xi). (From now on we denote by xi the realization of the corresponding random variable Xi.) In many real-life applications modeled by Bayesian networks the set of states for each variable (node) in the network is finite. In the special case when all variables have finite sets of mutually exclusive states and follow discrete distributions, the previous definition of a Bayesian network can be reformulated in the following fashion:

Definition 2 A Bayesian network is a structure that consists of the following elements:
- a collection of variables with a finite set of mutually exclusive states;
- a set of directed arcs between the variables symbolizing conditional independence of variables;
- a directed acyclic graph formed by the variables and the arcs between them;
- a potential table Pr(Xi | Pa(Xi)) associated with each variable Xi having a set of parent variables denoted by Pa(Xi).

Observe that we do not require causality in Bayesian networks, i.e. the arcs of a graph do not have to symbolize causal relationships between the variables. However, it is imperative that the so-called d-separation rules implied by the structure are satisfied [12,19]. If variables X and Y are d-separated in a Bayesian network under the presence of evidence e, then Pr(X | Y, e) = Pr(X | e), i.e. the variables are conditionally independent given the evidence. Furthermore, the d-separation rules are applied to prove one of the key laws used in Bayesian networks, the so-called chain rule for Bayesian networks.

The joint probability table Pr(X) = Pr(X1, X2, ..., Xn) sufficiently describes the belief structure on the set X = {X1, X2, ..., Xn} of variables in the model. In particular, for each variable Xi, using the joint probability table, one can easily calculate the prior


probabilities Pr(Xi) as well as the conditional probability Pr(Xi | e) given an evidence e. Nevertheless, with an increase in the number of variables, the joint probability table quickly becomes unmanageably large, since the table size grows exponentially with the size n of the variable set. Thus, it is necessary to find another representation which adequately and more efficiently describes the belief structure in the model. A Bayesian network over X = {X1, X2, ..., Xn} provides such a representation. In fact, the graph in a Bayesian network gives a compact representation of the conditional dependencies in the network, which allows one to compute the joint probability table from the conditional probabilities specified by the network using the chain rule below.

The Chain Rule for Bayesian Networks [8]

The joint probability distribution Pr(X) = Pr(X1, X2, ..., Xn) of the variables X = {X1, X2, ..., Xn} in a Bayesian network is given by the formula

\[ \Pr(X) = \prod_{i=1}^{n} \Pr(X_i \mid Pa(X_i)) , \tag{1} \]

where Pa(Xi) denotes the set of all parents of variable Xi. The chain rule for Bayesian networks also provides an efficient way of probability updating when new information is received about the model. There is a variety of different types of such new information, i.e. evidence. The two most common types of evidence are findings and likelihood evidence. A finding is evidence that specifies which states are possible for some variables, while likelihood evidence gives a proportion between the probabilities of two given states. Note that some types of evidence, including likelihood evidence, cannot be given in the form of findings.
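The chain rule (1) is simple to apply in code. The following Python sketch (a toy network of our own; all names and numbers are made up) computes the joint probability as a product of local conditional tables:

```python
# Toy network: Rain -> Sprinkler, and Rain, Sprinkler -> WetGrass.
P_R = {True: 0.2, False: 0.8}                       # Pr(R)
P_S_given_R = {True: {True: 0.01, False: 0.99},     # Pr(S | R)
               False: {True: 0.40, False: 0.60}}
P_W_given_SR = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.90, False: 0.10},
                (False, True): {True: 0.80, False: 0.20},
                (False, False): {True: 0.00, False: 1.00}}

def joint(r, s, w):
    # chain rule (1): Pr(R, S, W) = Pr(R) * Pr(S | R) * Pr(W | S, R)
    return P_R[r] * P_S_given_R[r][s] * P_W_given_SR[(s, r)][w]

# the joint table recovered this way sums to one, as it must
print(sum(joint(r, s, w) for r in (True, False)
          for s in (True, False) for w in (True, False)))  # 1.0
```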

Cases/Models

Bayesian networks provide a general framework for a number of specialized models, many of which were identified long before the concept of a Bayesian network was proposed. Such special cases of BN vary in their graph structures as well as in their probability distributions. The probability distributions for a Bayesian network can be defined in several ways. In some situations, it is possible to use theoretically well-defined distributions. In others, the probabilities can be estimated from data as frequencies. In addition, purely subjective probability estimates are often used for practical purposes. For instance, when the number of conditional probability distributions to acquire from the data is very large, some simplifying assumptions may be appropriate.

The simplest Bayesian network model is the well-known naïve Bayes (or simple Bayes) model [4], which can be summarized as follows:
- The graph structure of the model consists of one hypothesis variable H and a finite set of information variables I = {I1, I2, ..., In}, with arcs from H to every Ik, 1 ≤ k ≤ n. In other words, the variables form a diverging connection, where the hypothesis variable H is a common parent of the variables I1, I2, ..., In;
- The probability distributions are given by the values Pr(Ik | H), for every information variable Ik, 1 ≤ k ≤ n.

The probability updating procedure based on the naïve Bayes model works in the following manner. Given a collection of observations e1, e2, ..., en on the variables I1, I2, ..., In respectively, the likelihood of H given e1, e2, ..., en is computed:

\[ L(H \mid e_1, e_2, \ldots, e_n) = \prod_{i=1}^{n} \Pr(e_i \mid H) . \tag{2} \]

Then the posterior probability of H is obtained from the formula

\[ \Pr(H \mid e_1, e_2, \ldots, e_n) = C \cdot \Pr(H) \cdot L(H \mid e_1, e_2, \ldots, e_n) , \tag{3} \]

where C is a normalization constant.

Another special case of BNs is the model based on the simplifying assumption called noisy-OR [18]. This model can be constructed as follows. Let A1, A2, ..., An represent binary variables listing all parents of a binary variable B. Each event Ai = x, x ∈ {0, 1}, causes B = x except when an inhibitor prevents it, with probability pi, i.e. Pr(B = 1 − x | Ai = x) = pi. Suppose that all inhibitors are independent.


Then the graph of the corresponding Bayesian network is represented by the converging connection with B as the child node of A1, A2, ..., An, while the conditional probabilities are given by Pr(B = x | Ai = x) = 1 − pi. Since the conditional distributions are independent of each other,

\[ \Pr(B = 1 - x \mid A_1, A_2, \ldots, A_n) = \prod_{i=1}^{n} p_i . \tag{4} \]
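Equation (4) is a one-line computation. A minimal Python sketch (the inhibitor probabilities below are made up):

```python
# Noisy-OR combination: with independent inhibitors, the probability
# that B is NOT turned on by its active parents is the product of the
# inhibitor probabilities p_i, as in equation (4).
def noisy_or(p, active):
    """p[i] = inhibitor probability of parent i; active = parents that are on."""
    q = 1.0
    for i in active:
        q *= p[i]          # every active cause must be inhibited
    return 1.0 - q         # Pr(B on | active parents)

p = [0.2, 0.1, 0.4]                 # made-up inhibitor probabilities
print(noisy_or(p, active=[0, 2]))   # 1 - 0.2*0.4 = 0.92
```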

The noisy-OR assumption gives a significant advantage for efficient probability updating, since the number of distributions increases only linearly with the number of parents. The construction complementary to noisy-OR is called noisy-AND. In the noisy-AND model, the graph is the converging connection just as in the noisy-OR model, all the causes are required to be on in order to have an effect, and all the causes have mutually independent random inhibitors. Both noisy-OR and noisy-AND are special cases of a general method called noisy functional dependence.

Many modeling approaches have been developed which employ the introduction of mediating variables in a Bayesian network. One of these methods, called divorcing, is the process of separating the parents A1, A2, ..., Ai and A_{i+1}, ..., An of a node B by introducing a mediating variable C as a child of the divorced parent nodes A1, A2, ..., Ai and a parent of the initial child node B. The divorcing of A1, A2, ..., Ai is possible if the following condition is satisfied: the set of all configurations of A1, A2, ..., Ai can be partitioned into sets c1, c2, ..., cs so that for every 1 ≤ j ≤ s, any two configurations π1, π2 ∈ cj have the same conditional probabilities:

\[ \Pr(B \mid \pi_1, A_{i+1}, \ldots, A_n) = \Pr(B \mid \pi_2, A_{i+1}, \ldots, A_n) . \tag{5} \]

Other modeling methods which engage mediating variables involve modeling undirected relations and situations with expert disagreement. Various types of undirected dependencies, including logical constraints, are represented by adding an artificial child C of the constrained nodes A1, A2, ..., An so that the conditional probability Pr(C | A1, A2, ..., An) emulates the relation. The situation where k experts disagree

on the conditional probabilities for different variables B1, B2, ..., Bn in the model can be modeled by introducing a mediating node M with k states m1, m2, ..., mk, so that the variables B1, B2, ..., Bn, on whose probabilities the experts disagree, become the only children of the expert node M. Another approach to modeling expert disagreements is introducing alternative models with weights assigned to each model.

An important type of Bayesian networks are the so-called time-stamped models [10]. These models reflect a structure which changes over time. By introducing a discrete time stamp into such structures, the time-stamped models are partitioned into submodels for every unit of time. Each local submodel is called a time slice. The complete time-stamped model consists of all its time slices connected to each other by temporal links. A strictly repetitive model is a special case of a time-stamped model such that all its time slices have the same structure and all the temporal links are alike. The well-studied hidden Markov models are a special class of strictly repetitive time-stamped models for which the Markov property holds, i.e. given the present, the past is independent of the future. A hidden Markov model with only one variable in each time slice connected to the variables outside the time slice is a Kalman filter. Furthermore, a Markov chain can be represented as a Kalman filter with only one variable in every time slice. It is possible to convert a hidden Markov model into a Markov chain by cross-multiplying all variables in each time slice. The time-stamped models can have either a finite or an infinite horizon. An infinite Markov chain would be an example of a time-stamped model with an infinite horizon. Furthermore, the repetitive time-stamped models with infinite horizon are also known as dynamic Bayesian networks. By utilizing the special structure of many repetitive temporal models, they can be compactly represented [2]. Such special representation can often facilitate the design of efficient algorithms in updating procedures.

Artificial neural networks can also be viewed as a special case of Bayesian networks, where the nodes are partitioned into n mutually exclusive layers, and the set of arcs is represented by the links from the nodes on layer i to the nodes on layer i + 1, 1 ≤ i ≤ n. Layer 1 is usually called the input layer, while layer n is known as the output layer.


Methods

Just as BNs have their roots in statistics, the approaches for discovering a BN structure utilize statistical methods. That is why a database of cases is instrumental for discovery of the graph configuration of a Bayesian network as well as for probability updating. There are three basic types of approaches to extracting BNs from data: batch learning, adaptation, and tuning.

Batch Learning

Batch learning is the process of extracting the information from a database of collected cases in order to establish a graph structure and the probability distributions for a certain Bayesian network. Often there are many ways to model a Bayesian network. For example, we may obtain two different probability distributions to model the true distribution of a variable in the network. To make an intelligent choice between two available distributions, it is important to have an appropriate measure of their accuracy. A logical way to approach this subject is by assigning penalties for a wrong forecast on the basis of a specified distribution. Two widely accepted ways of assigning penalties are the quadratic (Brier) scoring rule and the logarithmic scoring rule. Given the true distribution p = (p1, p2, ..., pm) of a discrete random variable with m states, and some approximate distribution q = (q1, q2, ..., qm), the quadratic scoring rule assigns the expected penalty

\[ ES_Q(p, q) = \sum_{i=1}^{m} p_i \Big( (1 - q_i)^2 + \sum_{j \ne i} q_j^2 \Big) . \tag{6} \]

The distance between the true distribution p and the approximation q is given by the formula

\[ d_Q(p, q) = ES_Q(p, q) - ES_Q(p, p) . \tag{7} \]

Hence, from (6) we have

\[ d_Q(p, q) = \sum_{i=1}^{m} (p_i - q_i)^2 . \tag{8} \]

The distance d_Q(p, q) given in (8) is called the Euclidean distance. The logarithmic scoring rule assigns to each outcome i the corresponding penalty S_L(q, i) = −log q_i. Hence, the expected penalty is calculated as

\[ ES_L(p, q) = - \sum_{i=1}^{m} p_i \log q_i . \tag{9} \]

From (7), we obtain an expression for the distance between the true distribution p and the approximation q:

\[ d_L(p, q) = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i} , \tag{10} \]

which is called the Kullback–Leibler divergence. Note that both definitions, the Euclidean distance and the Kullback–Leibler divergence, can easily be extended to the case of continuous random variables. Moreover, both scoring rules, the quadratic and the logarithmic, possess the following useful property: only the true distribution minimizes the score. The scoring rules that exhibit this property are called strictly proper. Since the quadratic and the logarithmic scoring rules are strictly proper, the corresponding distance measures d_Q and d_L both satisfy

d(p, q) = 0 if and only if p = q.
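Both distances (8) and (10) are straightforward to compute. A short Python sketch (our own illustration; the distributions are made up):

```python
import math

def d_quadratic(p, q):     # Euclidean distance, equation (8)
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def d_logarithmic(p, q):   # Kullback-Leibler divergence, equation (10)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
# both distances are positive for q != p and zero for q = p
print(d_quadratic(p, q), d_logarithmic(p, q), d_logarithmic(p, p))
```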

Different scoring rules and corresponding distance measures for discrete and continuous random variables have been extensively studied in statistics. A comprehensive review of strictly proper scoring rules is given in [6].

Naturally, among several different Bayesian networks that model the situation equally closely, the one of the smallest "size" would be preferred. Let M denote a Bayesian network over the variable set X = {X1, X2, ..., Xn}. Then the size of M is given by

\[ \mathrm{Size}(M) = \sum_{i=1}^{n} s(X_i) , \tag{11} \]

where s(Xi) denotes the number of entries in the conditional probability table Pr(Xi | Pa(Xi)), and Pa(Xi) is the set of parents of Xi. The following measure accounts for both the size of the model and its accuracy. Given a Bayesian network M over X with the true probability distribution p, and an approximate Bayesian network model N with distribution q, we define the acceptance measure as

\[ \alpha(p, N) = \mathrm{Size}(N) + C \cdot d(p, q) , \tag{12} \]


where Size(·) is the network size defined by (11), d(p, q) is a distance measure between the probability distributions p and q, and C is a positive real constant. The general approach to batch learning a Bayesian network from a data set of cases can be summarized as follows:
- select an appropriate threshold ε for the distance measure d(p, q) between two distributions;
- fix a suitable constant C in the definition of the acceptance measure α(p, N);
- among all Bayesian network models over X with distribution q such that d(p, q) < ε, select the model that minimizes α(p, N).

Although simple, this approach has many practical issues. The data sets in batch learning are usually very large, the model space grows exponentially in the number of variables, there may be missing data in the data set, etc. To extract structure from such data, one often has to employ special heuristics for searching the model space. For instance, causality can be used to cluster the variables according to a causal hierarchy. In other words, we partition the variable set X into subsets S1, S2, ..., Sk, so that the arcs satisfy a partial order relation. If we find a model N having distance d(p, q) < ε, the search stops; otherwise we consider a submodel of N.

Adaptation

It is often desirable to build a system capable of automatically adapting to different settings. Adaptation is the process of adjusting a Bayesian network model so that it is better able to accommodate new accumulated cases. When building a Bayesian network, usually there is uncertainty whether the chosen conditional probabilities are correct. This is called second-order uncertainty. Suppose that we are not sure which table out of m different conditional probability tables T1, T2, ..., Tm represents the true distribution Pr(Xi | Pa(Xi)) for some variable Xi in a network. By introducing a so-called type variable T with states t1, t2, ..., tm into the graph, so that T is a parent of Xi, we can model this uncertainty in the network. Then the prior probability Pr(t1, t2, ..., tm) represents our belief about the correctness of the tables T1, T2, ..., Tm respectively. Next, we set Pr(Xi | Pa(Xi), tj) = Tj. Our belief about the correctness of the tables is updated each time we receive new evidence e. In other words, for the next case, we use Pr(t1, t2, ..., tm | e) as the new prior probability of the tables' accuracy.

Sometimes the second-order uncertainty about the conditional probabilities cannot be modeled by introducing type variables. In such cases, various statistical methods can be applied. Normally such methods exploit various properties of the parameters, such as global independence, local independence, etc. The property of global independence states that the second-order uncertainty for the variables is independent, i.e. the probability tables for the variables can be adjusted independently of each other. The local independence property holds if and only if, for any two different parent configurations π1, π2, the second-order uncertainty on Pr(A | π1) is independent of the second-order uncertainty on Pr(A | π2), and the two distributions can be updated independently of each other. In other words, local independence means the independence of the uncertainties of the distributions for different configurations of the parents.

The fractional updating scheme [22] is an algorithm for reducing the second-order uncertainty about the distributions based on the received evidence. Suppose that the properties of global and local independence for the second-order uncertainty hold simultaneously. For every configuration π of the parents of variable Xi, the certainty about Pr(Xi | π) is given through an artificially selected sample size parameter ni, and for any state xi^j of variable Xi we have a corresponding count ni^j = ni · Pr(xi^j | π). After receiving an evidence e, we compute the probabilities Pr(xi^j, π | e). Then the updated count ni^j is the sum of Pr(xi^j, π | e) and the old ni^j. Since ni = Σj ni^j, the old sample size parameter ni becomes ni + Pr(π | e). Although efficient in reducing the uncertainty about the distributions, this scheme has some serious drawbacks: it tends to reduce the second-order uncertainty too fast, by overestimating the counts. In order to avoid this, one can introduce a so-called fading factor f. Then, after receiving an evidence, the sample size ni is changed to f · ni + Pr(π | e), and the counts ni^j are updated to f · ni^j + Pr(xi^j, π | e). Therefore, the fading factor f ensures that the influence of the past decreases exponentially [16].
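A minimal Python sketch of fractional updating with fading, for one parent configuration π (the interface and numbers are illustrative assumptions):

```python
def fractional_update(counts, pr_state_and_pi, pr_pi, fading=1.0):
    """counts[j] ~ n_i^j; pr_state_and_pi[j] = Pr(x_i^j, pi | e);
    pr_pi = Pr(pi | e). Returns updated counts, sample size, probabilities."""
    new_counts = [fading * n + p for n, p in zip(counts, pr_state_and_pi)]
    new_n = fading * sum(counts) + pr_pi        # n_i <- f*n_i + Pr(pi | e)
    probs = [n / new_n for n in new_counts]     # updated Pr(x_i^j | pi)
    return new_counts, new_n, probs

counts = [6.0, 4.0]   # n_i = 10 with current Pr = (0.6, 0.4)
print(fractional_update(counts, [0.9, 0.0], 0.9, fading=0.99))
```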


After describing some approaches to adapting a Bayesian network to different settings of the distribution parameters, it is equally important to discuss uncertainty in the graph structure. In many cases, we can compensate for variability in the graph structure of a Bayesian network just by modifying the parameters of the distributions in the network. Sometimes, however, adjusting the distribution parameters is not sufficient to account for the change in the model: the difference in the graph structure may be so significant that it becomes impossible to accurately reflect the situation by a mere parameter change. There are two main approaches to graph structure adaptation in Bayesian networks. The first method works by collecting the cases and re-running the batch learning procedure to update the graph structure. The second method, also known as the expert disagreement approach, works simultaneously with a set of different models, and updates the weight of each model according to the evidence. More precisely, suppose there are m alternative models M1, M2, ..., Mm with corresponding initial weights w1, w2, ..., wm that express our certainty in the models. Let Y be some variable in the network. After receiving an evidence e, we obtain the probabilities Pr_i(Y | e) := Pr(Y | e, M_i) and Pr_i(e) := Pr(e | M_i) according to each model M_i, for 1 ≤ i ≤ m. Then

\[ \Pr(Y \mid e) := \sum_{i=1}^{m} w_i \cdot \Pr\nolimits_i(Y \mid e) , \tag{13} \]

and the updated weights w_i are computed as the probabilities of the corresponding models M_i given the past evidence: w_i = Pr(M_i | e). Hence, by the well-known Bayes formula,

\[ w_i = \frac{\Pr(e \mid M_i) \Pr(M_i)}{\sum_j w_j \Pr\nolimits_j(e)} . \tag{14} \]

Note that the expert disagreement approach to graph structure adaptation can be further extended to include the adaptation of the distribution parameters based on the above methods, such as fractional updating.

Tuning

Tuning is the process of adjusting the distribution parameters so that some prescribed requests for the model distributions are satisfied. The commonly used approach to tuning is gradient descent on the parameters, similar to training in neural networks.


Let Θ represent the set of parameters which are chosen to be altered, let p(Θ) denote the current model distribution, and let q be the target distribution. Suppose d(p, q) represents the distance between the two distributions. The following gradient descent tuning algorithm is given in [9]:
- compute the gradient of d(p, q) with respect to the parameters Θ;
- select a step size α > 0, and let ΔΘ = −α · ∇d(p, q)(Θ0), i.e. give Θ0 a displacement ΔΘ in the direction opposite to the gradient of d(p, q)(Θ0);
- repeat this procedure until the gradient is sufficiently close to zero.

Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are among other commonly used methods for tuning or training Bayesian networks. A minimal sketch of the gradient descent loop is given below.
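In this sketch (our own illustration, with a numerical gradient) a single parameter θ defines a Bernoulli model distribution p(θ), and the loop drives the Kullback–Leibler distance to the target q toward zero:

```python
import math

def d(p, q):                     # Kullback-Leibler divergence as distance
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def p_of(theta):                 # current model distribution p(theta)
    return [theta, 1.0 - theta]

def tune(theta, q, alpha=0.1, tol=1e-8, h=1e-6, max_iter=10000):
    for _ in range(max_iter):
        # central-difference estimate of the gradient of d(p, q)
        grad = (d(p_of(theta + h), q) - d(p_of(theta - h), q)) / (2 * h)
        if abs(grad) < tol:
            break
        theta -= alpha * grad    # displacement opposite to the gradient
    return theta

print(tune(0.3, q=[0.7, 0.3]))   # converges to 0.7, where p = q
```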

Applications

The concept of a Bayesian network can be interpreted in different contexts. From a statistical point of view, a Bayesian network can be defined as a compact representation of the joint probability over a given set of variables. From a broader point of view, a Bayesian network is a special type of graphical model capable of reflecting causality, as well as updating its beliefs in view of received evidence. All these features make a Bayesian network a versatile instrument that can be used for various purposes, including facilitating communication between human and computer, extracting hidden information and patterns from data, simplifying decision making, etc.

Due to their special structure, Bayesian networks have found many applications in various areas such as artificial intelligence and expert systems, machine learning and data mining. Bayesian networks are used for modeling knowledge in text analysis, image processing, speech pattern analysis, data fusion, engineering, biomedicine, gene and protein regulatory networks, and even meteorology. Furthermore, it has been suggested that the inductive inference procedures based on Bayesian networks can be used to introduce inductive reasoning into such a previously strictly deductive science as mathematics.

The large scope of different applications of Bayesian networks is especially impressive when taking into account that the theory of Bayesian networks has only been around for about a quarter of a century.

Next, several examples of recent real-life applications of Bayesian networks are considered to illustrate this point. Recent research in the field of automatic speech recognition [13] indicates that dynamic Bayesian networks can effectively model hidden features in speech, including articulatory and other phonological features. Both hidden Markov models (HMM), which are a special case of dynamic Bayesian networks (DBN), and more general dynamic Bayesian networks have been applied to modeling audio-visual speech recognition. In particular, a paper by A.V. Nefian et al. [15] describes an application of the coupled HMM and the factorial HMM as two suitable statistical models for audio-video integration. The factorial HMM is a generalization of the HMM, where the hidden state is represented by a collection of variables also called factors. These factors, although independent of each other, all impact the observations, and hence become connected indirectly. The coupled HMM is a DBN represented as two regular HMMs whose hidden state nodes have links to the hidden state nodes of the next time slice. The coupled HMM has also been applied to model hand gestures, the interaction between speech and hand gestures, etc. In addition, face detection and recognition problems have been studied with the help of Bayesian networks. Note that different fields of application may call for specialized employment of Bayesian network methods, and conversely, similar approaches can be successfully used in different application areas. For instance, along with the applications to speech recognition above, coupled hidden Markov models have been employed in modeling multi-channel EEG (electroencephalogram) data.

An interesting example of the application of a Bayesian network to expert systems is the development of strategies for troubleshooting complex electromechanical systems, presented in [23]. The constructed Bayesian network has the structure of a naïve Bayes model. In the decision tree for the troubleshooting model, the utility function is given by the cost of repair; hence, the goal is to find a strategy minimizing the expected cost of repair.

An interesting recent study [3] describes some applications of Bayesian networks in meteorology from

a data mining point of view. A large database of daily observations of precipitation levels and maximum wind speed is collected. The Bayesian network structure is constructed from the meteorological data by using various approaches, including a batch learning procedure and simulation techniques. In addition, an important data mining application of Bayesian networks is illustrated by an example of estimating missing data values from the evidence received.

Applications of Bayesian Networks to Data Mining; Naïve Bayes Rapid progress in data collection techniques and data storage has enabled an accumulation of huge amounts of experimental, observational and operational data. As the result, massive data sets containing a large amount of information can be found almost everywhere. A well-known example is the data set containing the observed information about the human genome. The need to quickly and correctly analyze or manipulate such enormous data sets facilitated the development of data mining techniques. Data mining is research aimed at discovery of various types of knowledge from large data warehouses. Data mining can also be seen as an integral part of the more general process of knowledge discovery in databases. Two other parts of this knowledge discovery are preprocessing and postprocessing. As seen above, Bayesian networks can also extract knowledge from data, which is called evidence in the Bayesian framework. In fact, the Bayesian network techniques can be applied to solve data mining problems, in particular, classification. Many effective techniques in data mining utilize methods from other multidisciplinary research areas such as database systems, pattern recognition, machine learning, and statistics. Many of these areas have a close connection to Bayesian networks. In actuality, data mining utilizes a special case of Bayesian networks, namely, naïve Bayes, to perform effective classification. In a data mining context, classification is the task of assigning objects to their relevant categories. The incentive for performing classification of data is to attain a comprehensive understanding of differences and similarities between the objects in different classes. In the Bayesian framework, the data mining classification problem translates into finding the class param-


parameter that maximizes the posterior probability of the unknown instance. This statement is called the maximum a posteriori principle. As mentioned earlier, naïve Bayes is an example of a simple Bayesian network model. Similarly to the naïve Bayes classifier, classification by way of building suitable Bayesian networks is capable of handling the presence of noise in the data as well as missing values. Artificial neural networks can serve as an example of a Bayesian network classifier designed for a special case.
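To make the maximum a posteriori principle concrete, the following is a minimal sketch of a naïve Bayes classifier over discrete features. The class and helper names are illustrative, and the add-one smoothing is a standard simplification rather than part of any method described above.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Picks the class maximizing the posterior P(class | features)
    under the naive independence assumption (the MAP principle)."""

    def fit(self, rows, labels):
        self.classes = Counter(labels)      # class frequencies
        self.total = len(labels)
        self.counts = defaultdict(int)      # (class, slot, value) counts
        for row, label in zip(rows, labels):
            for slot, value in enumerate(row):
                self.counts[(label, slot, value)] += 1

    def predict(self, row):
        def log_posterior(label):
            # log P(label) + sum_i log P(value_i | label),
            # with crude add-one smoothing to avoid log(0).
            score = math.log(self.classes[label] / self.total)
            for slot, value in enumerate(row):
                seen = self.counts[(label, slot, value)]
                score += math.log((seen + 1) / (self.classes[label] + 2))
            return score
        return max(self.classes, key=log_posterior)

clf = NaiveBayes()
clf.fit([("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")],
        ["no", "yes", "yes"])
print(clf.predict(("sunny", "cool")))   # -> "yes"
```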

Application to Global and Combinatorial Optimization

In the late 1990s, a number of studies were conducted that described how BN methodology can be applied to solve problems of global and combinatorial optimization. The connection between graphical models (e.g. Bayesian networks) and evolutionary algorithms (applied to optimization problems) was established. In particular, P. Larrañaga et al. combined some techniques for learning a BN's structure from data with an evolutionary computation procedure called the Estimation of Distribution Algorithm [11] to devise a procedure for solving combinatorial optimization problems. R. Etxerberria and P. Larrañaga proposed a similar approach for global optimization [5]. Another method based on learning and simulation of BNs, known as the Bayesian Optimization Algorithm (BOA), was suggested by M. Pelikan et al. [20]. The method works by randomly generating an initial population of solutions and then updating the population by using selection and variation. The operation of selection makes multiple copies of better solutions and removes the worst ones. The operation of variation first constructs a Bayesian network as a model of the promising solutions following selection. Then new candidate solutions are obtained by sampling the constructed Bayesian network. New solutions are incorporated into the population in place of some old candidate solutions, and the next iteration is executed until a termination criterion is met. For additional information on some real-world applications of Bayesian networks to classification, reliability analysis, image processing, data fusion and bioinformatics, see the recent book edited by A. Mittal et al. [14].
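The select/model/sample loop just described can be sketched as follows. This is only a schematic: a real BOA fits a Bayesian network over the promising solutions, whereas here the model is collapsed to independent per-bit marginals (a UMDA-style stand-in) to keep the sketch short, so it should not be read as Pelikan et al.'s full algorithm.

```python
import random

def boa_like(fitness, n_bits=20, pop_size=100, generations=50):
    """Schematic select/model/sample loop for binary strings."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population.
        pop.sort(key=fitness, reverse=True)
        promising = pop[: pop_size // 2]
        # Variation, step 1: build a probabilistic model of the
        # promising solutions (per-bit frequencies stand in for a BN).
        probs = [sum(ind[i] for ind in promising) / len(promising)
                 for i in range(n_bits)]
        # Variation, step 2: sample new candidates from the model and
        # let them replace the worst old solutions.
        offspring = [[int(random.random() < p) for p in probs]
                     for _ in range(pop_size // 2)]
        pop = promising + offspring
    return max(pop, key=fitness)

best = boa_like(sum)   # maximize the number of ones
print(best)
```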


See also

Bayesian Global Optimization
Evolutionary Algorithms in Combinatorial Optimization
Neural Networks for Combinatorial Optimization

References
1. Andreassen S (1992) Knowledge representation by extended linear models. In: Keravnou E (ed) Deep Models for Medical Knowledge Engineering. Elsevier, pp 129–145
2. Bangsø O, Wuillemin PH (2000) Top-down construction and repetitive structures representation in Bayesian networks. In: Proceedings of the Thirteenth International FLAIRS Conference. AAAI Press, Cambridge, MA
3. Cano R, Sordo C, Gutierrez JM (2004) Applications of Bayesian networks in meteorology. In: Gamez et al (eds) Advances in Bayesian Networks. Springer, pp 309–327
4. de Dombal F, Leaper D, Staniland J, McCan A, Harrocks J (1972) Computer-aided diagnostics of acute abdominal pain. Brit Med J 2:9–13
5. Etxerberria R, Larrañaga P (1999) Global optimization with Bayesian networks. In: II Symposium on Artificial Intelligence, CIMAF-99, Special Session on Distribution and Evolutionary Optimization. ICIMAF, La Habana, Cuba, pp 332–339
6. Gneiting T, Raftery AE (2005) Strictly proper scoring rules, prediction, and estimation. Technical Report no. 463R, Department of Statistics, University of Washington
7. Heckerman D, Horvitz E, Nathwani B (1992) Towards normative expert systems: Part I, the Pathfinder project. Method Inf Med 31:90–105
8. Jensen FV (1996) An Introduction to Bayesian Networks. UCL Press, London
9. Jensen FV (1999) Gradient descent training of Bayesian networks. In: Proceedings of the Fifth European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU). Springer, Berlin, pp 190–200
10. Kjærulff U (1995) HUGS: Combining exact inference and Gibbs sampling in junction trees. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, pp 368–375
11. Larrañaga P, Etxeberria R, Lozano JA, Peña JM (1999) Optimization by learning and simulation of Bayesian and Gaussian networks. Technical Report EHU-KZAA-IK-4/99, Department of Computer Science and Artificial Intelligence, University of the Basque Country
12. Lauritzen SL (1996) Graphical Models. Oxford University Press, Oxford
13. Livescu K, Glass J, Bilmes J (2003) Hidden feature modeling for speech recognition using dynamic Bayesian networks. In: Proc. EUROSPEECH, Geneva, Switzerland, August–September





14. Mittal A, Kassim A, Tan T (2007) Bayesian Network Technologies: Applications and Graphical Models. Interface Graphics, Inc., Minneapolis, USA
15. Nefian AV, Liang L, Pi X, Liu X, Murphy K (2002) Dynamic Bayesian networks for audio-visual speech recognition. J Appl Signal Proc 11:1–15
16. Olesen KG, Lauritzen SL, Jensen FV (1992) aHUGIN: A system creating adaptive causal probabilistic networks. In: Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, pp 223–229
17. Pearl J (1982) Reverend Bayes on inference engines: A distributed hierarchical approach. In: National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, pp 133–136
18. Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288
19. Pearl J (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Series in Representation and Reasoning. Morgan Kaufmann, San Francisco
20. Pelikan M, Goldberg DE, Cantú-Paz E (1999) BOA: The Bayesian optimization algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, vol 1. Morgan Kaufmann, San Francisco
21. Spiegelhalter DJ, Knill-Jones RP (1984) Statistical and knowledge-based approaches to clinical decision-support systems. J Royal Stat Soc A147:35–77
22. Spiegelhalter D, Lauritzen SL (1990) Sequential updating of conditional probabilities on directed graphical structures. Networks 20:579–605
23. Vomlel J (2003) Two applications of Bayesian networks. In: Proceedings of the conference Znalosti, Ostrava, Czech Republic, pp 73–82

Beam Selection in Radiotherapy Treatment Design

ALLEN HOLDER
Department of Mathematics, Trinity University, San Antonio, and Department of Radiological Sciences, The University of Texas Health Science Center at San Antonio, San Antonio, USA

Article Outline

Synonyms
Introduction
Definitions
Formulation
Models
Conclusions

See also
References

Synonyms

Beam orientation optimization; Beam angle optimization

Introduction

Cancer is typically treated with three standard procedures: 1) surgery – the intent of which is to physically remove the disease, 2) chemotherapy – drug treatment that attacks fast-proliferating cells, and 3) radiotherapy – the targeted treatment of cancer with ionizing beams of radiation. About half of all cancer patients receive radiotherapy, which is delivered by focusing high-energy beams of radiation on a patient's tumor(s). Treatment design is traditionally considered in three phases:

Beam Selection The process of deciding the number and trajectory of the beams that will pass through the patient.
Fluence Optimization Calculating the amount of dose to deliver along each of the selected beams so that the patient is treated as well as possible.
Delivery Optimization Deciding how to best deliver the treatment designed in the first two steps.

The fundamental question in optimizing radiotherapy treatments is how to best treat the patient, and such research requires detailed knowledge of medical physics and optimization. Unlike the numerous research pursuits within the field of optimization that require a specific expertise, the goals of this research rely on an overriding understanding of modeling, solving and analyzing optimization problems as well as an understanding of medical physics. The necessary spectrum of knowledge is commonly collected into a research group that is comprised of medical physicists, operations researchers, computer scientists, industrial engineers, and mathematicians. In a modern clinic, the first phase of selecting beams is accomplished by a treatment planner, and hence, the quality of the resulting treatment depends on the expertise of this person. Fluence optimization is automatically conducted once beams are selected, and the resulting treatment is judged with a variety of metrics and visualization tools. If the treatment is acceptable,


the process ends. However, unacceptable treatments are common, and in this scenario the collection of beams is updated and fluence optimization is repeated with the new beams. This trial-and-error approach oscillates between the first two phases of treatment design and often continues for hours until an acceptable treatment is rendered. The third phase of delivery optimization strives to orient the treatment machinery so that the patient is treated as efficiently as possible, where efficiency is interpreted as shortest delivery time, shortest exposure time, etc. The focus of this entry is Beam Selection, which has a substantial literature in the medical physics community and a growing one in the operations research community. As one would expect, no single phase of treatment design exists in isolation, and although the three-phase approach pervades contemporary thinking, readers should be aware that future efforts to optimize the totality of treatment design are being discussed. The presentation below is viewed as part of this bigger goal.


Beam Selection in Radiotherapy Treatment Design, Figure 1 A typical treatment configuration

Definitions

An understanding of the technical terms used to describe radiotherapy is needed to understand the scope of Beam Selection. Patient images such as CAT scans or MRI images are used to identify and locate the extent of the disease. Treatment design begins with the tedious task of delineating the target and surrounding tissues on each of the hundreds of images. The resulting 3D structures are individually classified as either a target, a critical structure, or normal tissue. An oncologist prescribes a goal dose for the target and upper bounds on the remaining tissues. This prescription is tailored to the optimization model used in the second phase of treatment design and is far from unique. A discussion of the myriad of models used for fluence optimization exceeds the confines of this article and is fortunately not needed. The method of treatment depends on the clinic's technology, and we begin with the general concepts common to all modalities. A patient lies on a treatment couch that can be moved vertically and horizontally and rotated in the plane horizontal to the floor. A gantry rotates around the patient in a great circle, the head of which is used to focus the beam on the patient; see Fig. 1. Shaping and modulating the beam is important

Beam Selection in Radiotherapy Treatment Design, Figure 2 A multileaf collimator

in all forms of treatment, and although these tasks are accomplished differently depending on the technology, it is common to control smaller divisions of each beam called sub-beams. As an example, the gantry's head often contains a multileaf collimator that is capable of dividing the beam (Fig. 2), a technology that is modeled by replacing the whole beam with a grid of rectangular sub-beams. Previous technology shaped and modulated the beam without a collimator, but the concept of a sub-beam remains appropriate. The center of the gantry's rotation is called the isocenter, a point that is placed near the center of the target by repositioning the patient via couch adjustments.





The beam can essentially be focused on the patient from any point on a sphere with a one meter radius that encompasses the patient, although some positions are not possible due to patient-gantry interference. The beam selection problem is to choose a few of these positions so that the resulting treatment is of high quality. If the selection process is restricted to a single great circle, then the term beam is often replaced with angle (in fact, these terms are used synonymously in much of the literature). The collection of positions on the sphere from which we are allowed to select is denoted by A. This set contains every point of the sphere in the continuum, but in practice A is a finite set of candidate beams. The problem of selecting beams depends on a judgment function, which is a mapping from the power set of A, denoted 𝒫(A), into the nonnegative extended reals, denoted ℝ̄₊ = {x ∈ ℝ : x ≥ 0} ∪ {∞}. Assuming that low values correspond with high-quality treatments, we have that a judgment function is a mapping f : 𝒫(A) → ℝ̄₊ with the monotonicity property that if A′ and A″ are subsets of A such that A′ ⊆ A″, then f(A′) ≥ f(A″). The monotonicity condition guarantees that treatment quality cannot degrade if beams are added to an existing treatment. The judgment function is commonly the optimal value from the second phase of treatment design, and for any A′ ∈ 𝒫(A), we let X(A′) be the feasible region of the optimization problem that decides fluences. An algebraic description of this set relies on the fact that we can accurately model how radiation is deposited as it passes through the anatomy. There are several competing radiobiological models that accomplish this task, each of which produces the rate coefficient A₍j,a,i₎, which is the rate at which sub-beam i in beam a deposits energy into the anatomical position j. These values form a dose matrix A, with rows indexed by j and columns by (a, i). The term used to measure a sub-beam's energy is fluence, and experimentation validates that anatomical dose, which is measured in Grays (Gy), is linear in fluence. So, if x₍a,i₎ is the fluence of sub-beam i in beam a, then the linear map x ↦ Ax transforms fluence values into anatomical dose. We partition the rows of the dose matrix into those that correspond with anatomical positions in the target – forming the submatrix A_T, in a critical structure – forming the submatrix A_C, and in normal tissue – forming the submatrix A_N.

Beam Selection in Radiotherapy Treatment Design, Figure 3 A dose-volume histogram, the horizontal axis is the anatomical dose (measured in Grays) and the vertical axis is the percent of volume

With this notation, A_T x, A_C x, and A_N x are the delivered doses to the target, the critical structures, and the normal tissues under treatment x. Treatment planners use visual and numerical methods to evaluate treatments. The two most common visual tools are the dose-volume histogram (DVH) and a collection of isocontours. A DVH is a plot of dose versus volume and allows a treatment planner to quickly gauge the extent to which each structure is irradiated; an example is found in Fig. 3. The curve in the upper right side of the figure corresponds to the target, which is the growth to the left of the brain stem in Fig. 4. The ideal curve for the target would be one that remains at 100% until the desired dose and then falls immediately to zero, and the ideal curves for the remaining structures would be ones that fall immediately to zero. The curve passing through the middle of Fig. 3 corresponds to the brain stem and indicates that approximately 80% of the brain stem is receiving half of the target dose. What a DVH lacks is spatial detail about the anatomical dose, but this information is provided by the isocontours, which are level curves drawn on each of the patient images; a small computational sketch of the DVH follows.
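The following is a minimal sketch of how a cumulative DVH can be computed from per-voxel doses. The function name and the synthetic array of target doses are illustrative assumptions, not part of any clinical system described here.

```python
import numpy as np

def dvh(dose, n_bins=100):
    """Cumulative dose-volume histogram for one structure.

    dose: 1D array with the dose (Gy) received by each voxel of the
    structure. Returns dose levels and, for each level d, the percent
    of the structure's volume receiving at least d."""
    levels = np.linspace(0.0, dose.max(), n_bins)
    volume_pct = [(dose >= d).mean() * 100.0 for d in levels]
    return levels, volume_pct

# Hypothetical voxel doses for a target prescribed 80 Gy.
rng = np.random.default_rng(0)
target_dose = rng.normal(loc=78.0, scale=3.0, size=10_000)
levels, volume = dvh(target_dose)
print(levels[50], volume[50])
```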



For example, if the target's goal is 80 Gy, then the 90% isocontour contains the anatomical region that receives at least 0.9 × 80 = 72 Gy. Figure 4 illustrates the 100%, 90%, ..., 10% isocontours on a single patient image. One would hope that these isocontours would tightly contain the target on each of the patient images, a goal commonly referred to as conformality. Although a DVH is often used to decide if a treatment is unacceptable, both the DVH and the isocontours are used to decide if a treatment is acceptable. Although treatments are commonly evaluated exclusively with a DVH and the isocontours, there are well-established numerical scores that are also used. Such scores are called conformality indices and consider the ratios of under- and over-irradiated tissue, and as such, these values collapse the DVH into a numerical value. We do not discuss these measures here, but the reader should be aware that they exist.

Beam Selection in Radiotherapy Treatment Design, Figure 4 A collection of isocontours on a single patient image

Formulation

The N-beam selection problem for the judgment function f and candidate set of beams A is
\[ \min\{ f(A') : A' \in \mathcal{P}(A),\; |A'| = N \}. \tag{1} \]
The parameter N is provided by the treatment planner and is intended to control the complexity of the treatment. The prevailing thought is that fewer beams are preferred if all other treatment goals remain satisfactory, and if f adequately measures treatment quality, a model that represents this sentiment is
\[ \min\{ N : \min\{ f(A') : A' \in \mathcal{P}(A),\; |A'| = N \} \le \varepsilon \}, \]
where ε defines the quality of an acceptable treatment. As mentioned in the previous section, the judgment function is typically the objective value from fluence optimization. A common least-squares approach defines X(A′) to be
\[ \Bigl\{ x : x \ge 0,\; \sum_i x_{(a,i)} = 0 \text{ for } a \in A \setminus A' \Bigr\} \]
and f(A′) to be
\[ \min\{ \omega_T \|A_T x - TG\|^2 + \omega_C \|A_C x\|^2 + \omega_N \|A_N x\|^2 : x \in X(A') \}, \tag{2} \]
where TG is a vector that expresses the target's treatment goal and ω_T, ω_C and ω_N weight the objective terms to express clinical desires. The prescription for this model is TG, but more complicated models with sophisticated prescriptions are common. In particular, dose-volume constraints that restrict the amount of each structure that is permitted to violate a bound are common. Readers interested in fluence optimization are directed to the entry on Cancer Radiation Treatment: Optimization Models.

Models

The N-beam selection problem is often addressed as a mixed integer problem. As an example, for the judgment function in (2) the N-beam selection problem can be expressed as
\[
\begin{aligned}
\min\;& \omega_T \|A_T x - TG\|^2 + \omega_C \|A_C x\|^2 + \omega_N \|A_N x\|^2\\
\text{subject to:}\;& \textstyle\sum_i x_{(a,i)} \le M \cdot y_a \quad \text{for } a \in A,\\
& \textstyle\sum_a y_a \le N,\\
& x \ge 0,\; y \in \{0,1\}^{|A|},
\end{aligned} \tag{3}
\]
where M is an arbitrarily large value that bounds each beam's fluence. This is one of many possible models, with simple adjustments including the replacement of





the 2-norm with the 1- and ∞-norms, both of which result in a linear mixed integer problem. A modest discretization of the sphere, with 72 great circles through the north and south poles equally spaced at 5 degrees at the equator and each great circle having beams equally spaced at 5 degrees, produces a set of 4902 candidate beams. This means the search tree associated with the mixed integer model above has \(\binom{4902}{N}\) terminal nodes, which for the clinically valid N = 10 is approximately 2.2 × 10³⁰. Beyond the immenseness of this search space, branch-and-bound procedures are difficult for two reasons: 1) the number of N-element subsets leading to near-optimal solutions is substantial, and 2) the evaluation of the judgment function at each node requires the solution to an underlying fluence model, which in itself is time consuming. This inherent difficulty has driven the development of heuristic approaches, which separate into the two steps of: 1) assigning each beam a value that measures its worth to the overall treatment, and 2) using the individual beam values to select a collection of N beams. As a simple example, a scoring technique evaluates each beam and then simply selects the top N beams. The remainder of this section discusses several of the common heuristics. A selection technique is called informed if it requires the evaluation of the underlying judgment function. One example would be to iteratively let A′ be the singleton beam sets and evaluate f(A′) for each. The N beams with the best scores would be selected for the treatment. If a selection method uses the data forming the optimization problem that defines f but fails to evaluate f, then the technique is called weakly informed. The preponderance of techniques suggested in the medical physics literature fall into this category. An example based solely on the dose matrix A is to value beam a with
\[ \frac{\max_{(i,j)} \{A_{(j,a,i)} : j \in T\}}{\min_{(i,j)} \{A_{(j,a,i)} : j \in C \cup N\}}, \]
where we assume the minimums in the denominator are nonzero. This ratio is high if a beam can deliver large amounts of dose to the target without damaging other tissues. A scoring technique based on this would terminate with the collection of N beams with the highest values; a sketch appears below. Since weakly informed methods do not require the solution of an optimization problem, they tend to be fast.
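A minimal sketch of this weakly informed scoring heuristic follows. The dense-matrix layout, the column-index bookkeeping, and the function names are illustrative assumptions; a clinical dose matrix would be far larger and sparse.

```python
import numpy as np

def score_beams(A, target_rows, risk_rows, beams):
    """Value each beam by the ratio of its best dose rate into the
    target to its smallest dose rate into critical/normal tissue
    (assumed nonzero). `beams` maps a beam id to its column indices
    (one column per sub-beam) in the dose matrix A."""
    scores = {}
    for a, cols in beams.items():
        num = A[np.ix_(target_rows, cols)].max()
        den = A[np.ix_(risk_rows, cols)].min()
        scores[a] = num / den
    return scores

def top_n(scores, n):
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Tiny example: rows 0-1 are target voxels, row 2 is risk tissue.
A = np.array([[5.0, 1.0], [4.0, 2.0], [1.0, 2.0]])
beams = {0: [0], 1: [1]}
print(top_n(score_beams(A, [0, 1], [2], beams), 1))  # -> [0]
```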

The concern about the size of the underlying fluence model has led to a sampling heuristic that reduces the accuracy of the radiobiological model. Clinical relevance mandates that the anatomy be discretized so that dose is measured at distances no greater than 2 mm. For a 20 cm × 20 cm × 20 cm portion of the anatomy, roughly the volume of the cranium, this means the coarsest 3D grid permitted in the clinic divides the anatomy into 10⁶ sub-regions called voxels, which are indexed by j. Cases in the chest and abdomen are substantially larger and require a significant increase in the number of voxels. A natural question is whether or not all of these regions are needed for beam selection. One approach is to repeatedly sample these regions together with the candidate set of beams and solve (1). Each beam is valued by the number of times it has a high fluence. Beams with high values form the candidate set A in (1), with j being indexed over all regions. The goal of this technique is to identify a candidate set of beams whose size is slightly larger than N, which keeps the search space manageable with the full complement of voxels. The sampling procedure is crucial to the success of the procedure since it is known that beam selection depends on the collection of voxels.

Once beams are valued, there are many ways to use this information to construct a collection of favorable beams. As already discussed, common scoring methods select the best N beams. Another approach is based on set covering, which uses a high-pass filter to decide if a beam adequately treats the target. Allowing ε to be the threshold at which we say beam a treats position j within the target, we let
\[
U_{(j,a)} =
\begin{cases}
1, & \sum_i A_{(j,a,i)} \ge \varepsilon\\
0, & \sum_i A_{(j,a,i)} < \varepsilon,
\end{cases}
\]
for each j ∈ T. If each beam has a value of c_a, where low values are preferred, the set cover heuristic forms a collection of beams by solving
\[ \min\Bigl\{ \sum_a c_a y_a : \sum_a U_{(j,a)}\, y_a \ge 1 \text{ for each } j \in T,\; y_a \in \{0,1\} \Bigr\}. \tag{4} \]

This in itself is a binary optimization problem, and if ε is small enough to guarantee that every beam treats the target, which is typical, then the size of the search space is the same as the original problem in (1). However, the set cover problem has favorable solution properties, and this problem solves efficiently in practice. The search space decreases in size as ε increases, and designing an appropriate heuristic requires both a judicious selection of ε and an appropriate objective. This method can be informed or weakly informed depending on how the objective coefficients are constructed; a greedy sketch of the covering step appears below.
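The following sketch uses the classical greedy heuristic as a stand-in for solving (4); the article notes that the exact binary program solves efficiently in practice, so the greedy loop is only an illustration, and the input encoding is assumed.

```python
def greedy_set_cover(targets, covers, cost):
    """Greedy heuristic for the set-cover model (4): repeatedly pick
    the beam with the best cost per newly covered target voxel.

    covers[a]: set of target voxels j with U_(j,a) = 1
    cost[a]:   the beam's value c_a (low is preferred)"""
    uncovered, chosen = set(targets), []
    while uncovered:
        a = min((b for b in covers if covers[b] & uncovered),
                key=lambda b: cost[b] / len(covers[b] & uncovered))
        chosen.append(a)
        uncovered -= covers[a]
    return chosen

beams = {1: {"j1", "j2"}, 2: {"j2", "j3"}, 3: {"j3"}}
print(greedy_set_cover({"j1", "j2", "j3"}, beams,
                       {1: 1.0, 2: 1.0, 3: 0.2}))   # -> [3, 1]
```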

Beam Selection in Radiotherapy Treatment Design

ever, the set cover problem has favorable solution properties, and this problem solves efficiently in practice. The search space decreases in size as " increases, and designing an appropriate heuristic requires both a judicious selection of " and an appropriate objective. This method can be informed or weakly informed depending on how the objective coefficients are constructed. Another approach is to use the beam values as a probability distribution upon normalization. This allows one to address the problem probabilistically, a perspective that has been suggested within column generation and vector quantization. The column generation approach prices beams with respect to the likelihood that they will improve the judgment function, and beams with high probabilities are added to the current collection. The process of adding and deleting beams produces a sequence of beam sets A1 ; A2 ; : : : ; An , and problem (1) is solved with A replaced with A k , k D 1; 2; : : : ; n. Although it is possible for this technique to price all subsets of A whose cardinality is greater than N, which is significantly greater than the size of the original search space in (1), the pricing scheme tends to limit the number A k s. The probabilistic perspective is further incorporated with heuristics based in information science. In particular, a method based on vector quantization, which is a modeling and solution procedure used in data compression, has been suggested. Allowing ˛(a) to be the probability associated with beam a, this heuristic constructs a collection of beams by solving nX o ˛(a)(a; Q(a)) : jQ(A)j D N ; min Q

(5)

a

where Q is a mapping from A into itself and ρ is a metric appropriate to the application. A common metric is to let ρ(a, Q(a)) be the arc length between a and Q(a). In the finite case, each N-element subset, say A′, of A uniquely defines Q by setting Q(A) = A′. Assuming this equality, we complete the definition by setting Q(a) = a′ ∈ A′ if and only if ρ(a, a′) ≤ ρ(a, a″) for all a″ ∈ A′, a condition referred to as the nearest neighbor condition. Since the optimization problem in (5) is defined over the collection of these functions, the size of the feasible region is the same as the original beam selection problem in (1).


Unlike the set cover approach, which solves (4) to optimality, and the column generation technique, which repeatedly solves (1) to optimality with a restricted beam set, the vector quantization method often solves (5) heuristically. The most common heuristic is the Lloyd algorithm, a technique that begins with an initial collection of N beams and then iterates between 1) defining Q with the nearest neighbor condition, and 2) forming a new collection of beams with the centroids of Q⁻¹(a), where beam a is in the current collection. This technique guarantees that the objective in (5) decreases with each new collection; a sketch of the iteration follows.
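The sketch below restricts the problem to a single great circle, so each beam is an angle and ρ is taken as arc length; the full problem lives on a sphere, and the function and parameter names are illustrative.

```python
import math

def lloyd_beams(candidates, alpha, init, iters=50):
    """Lloyd-style iteration for (5) on a single great circle.
    candidates/alpha: all beam angles in [0, 2*pi) and their
    probabilities; init: starting collection of N angles."""
    centers = list(init)
    for _ in range(iters):
        # 1) Nearest-neighbor step: assign each candidate to Q(a).
        cells = {c: [] for c in centers}
        for a, w in zip(candidates, alpha):
            arc = lambda c: min(abs(a - c) % (2 * math.pi),
                                2 * math.pi - abs(a - c) % (2 * math.pi))
            cells[min(centers, key=arc)].append((a, w))
        # 2) Centroid step: replace each beam by the weighted mean
        #    direction of its cell (computed on the unit circle).
        new = []
        for c, cell in cells.items():
            if not cell:
                new.append(c)
                continue
            sx = sum(w * math.cos(a) for a, w in cell)
            sy = sum(w * math.sin(a) for a, w in cell)
            new.append(math.atan2(sy, sx) % (2 * math.pi))
        centers = new
    return centers
```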





Conclusions

Selecting beams is one of the three sub-problems in the design of radiotherapy treatments, a problem that currently does not have an appropriate solution outside the clinical practice of manually selecting beams through trial-and-error. However, research into automating the selection of beams with optimization is promising. We conclude with a few words on the totality of treatment design. The overriding goal of treatment design is to remove the threat of cancer while sparing non-cancerous tissues. The status quo is to assume that a patient is static while designing a treatment. Indeed, treatment planners expand targeted regions to address dynamic patient movement in the static approach, i.e. the target is increased to include the gross volume that contains the estimated movement of the actual target. The primary goal of the third phase of treatment design is to deliver the treatment as efficiently as possible to limit patient movement. This leads to a dilemma. The monotonicity property of the judgment function encourages treatments with many beams, but conventional wisdom dictates that the number of beams and the efficiency of the delivery are inversely proportional. However, in many settings the number of beams is a poor surrogate of efficiency. As an example, the most time-demanding maneuver is to rotate the couch, since it requires a technician to enter the treatment vault. So, treatments with many beams but fewer couch rotations are preferred to treatments with fewer beams but more couch rotations.

The point to emphasize from the previous paragraph is that the problem of selecting beams is always expressed in terms of the number of beams, which is a byproduct of the three-phase approach. Although the separation of the design process into phases is natural and useful for computation, the division has drawbacks. Fluence models are large and difficult to solve, and every attempt is made to reduce their size. As already discussed, the voxels need to be under 2 mm³ to reach clinical viability, and hence, the index set for j is necessarily large. The number and complexity of the sub-beams has increased dramatically with advanced technology, similarly making the index set for i large. This leaves the number of beams as the only control, and treatment designers are asked to select beams so that the fluence model is manageable. Years of experience have developed standard collections for many cancers, but asking a designer to select one of the 2.2 × 10³⁰ possible collections for a 10-beam treatment in a non-standard case is daunting. A designer's instinct is to value a beam individually rather than as part of a collection. Several of the weakly informed selection methods from the medical physics literature have the same weakness. Such individual valuation typically identifies all but a few beams of a quality solution, but the last few are often unintuitive. Automating beam selection with an optimization process so that beams are considered within a collection is a step in the right direction. The future of treatment design is to build global models and solution procedures that simultaneously address all three phases of treatment design. Such models are naturally viewed from the perspective of beam selection. What is missing is a judgment function that includes both fluence and delivery optimization. Learning how to model and solve these holistic models would alleviate the design process from a designer's (lack of) expertise and would provide a uniform level of care available to clinics with comparable technology. Such improvements are the promise of the field.

See also

Credit Rating and Optimization Methods
Evolutionary Algorithms in Combinatorial Optimization
Optimization Based Framework for Radiation Therapy

The literature on beam selection is mature within the medical physics community but is in its infancy within optimization. The five citations below cover the topics

discussed in this article and contain bibliographies that adequately cite the work in medical physics. References 1. Acosta R, Ehrgott M, Holder A, Nevin D, Reese J, Salter B (2007) Comparing Beam Selection Strategies in Radiotherapy Treatment Design: The Influence of Dose Point Resolution. In: Alves C, Pardalos P, Vicente L (eds) Optimization in Medicine, International Center for Mathematics, Springer Optimization and Its Applications. Springer, pp 1–25 2. Aleman D, Romeijn E, Dempsey J (2006) Beam orientation optimization methods in intensity modulated radiation therapy. IIE Conference Proceedings 3. Ehrgott M, Holder A, Reese J (2008) Beam Selection in Radiotherapy Design. In: Linear Algebra and Its Applications, vol 428. pp 1272–1312. doi:10.1016/j.laa.2007.05.039 4. Lim G, Choi J, Mohan R Iterative Solution Methods for Beam Angle and Fluence Map Optimization in Intensity Modulated Radiation Therapy Planning. to appear in OR Spectrum. doi:10.1007/s00291-007-0096-1 5. Lim G, Ferris M, Shepard D, Wright S, Earl M (2007) An Optimization Framework for Conformal Radiation Treatment Planning. INFORMS J Comput 19(3):366–380

Best Approximation in Ordered Normed Linear Spaces

HOSSEIN MOHEBI
Mahani Mathematical Research Center and Department of Mathematics, University of Kerman, Kerman, Iran

MSC2000: 90C46, 46B40, 41A50, 41A65

Article Outline

Keywords and Phrases
Introduction
Metric Projection onto Downward and Upward Sets
Sets Z₊ and Z₋
Downward Hull and Upward Hull
Metric Projection onto a Closed Set
Best Approximation in a Class of Normed Spaces with Star-Shaped Cones
Characterization of Best Approximations
Strictly Downward Sets and Their Best Approximation Properties
References



Keywords and Phrases

Best approximation; Downward and upward sets; Global minimum; Necessary and sufficient conditions; Star-shaped set; Proximinal set

Introduction

We study the minimization of the distance to an arbitrary closed set in a class of ordered normed spaces (see [8]). This class is broad enough. It contains the space C(Q) of all continuous functions defined on a compact topological space Q and the space L∞(S, Σ, μ) of all essentially bounded functions defined on a measure space (S, Σ, μ). It is assumed that these spaces are equipped with the natural order relation and the uniform norm. This class also contains direct products X = ℝ × Y, where Y is an arbitrary normed space, with the norm ‖(c, y)‖ = |c| + ‖y‖. The space X is equipped with the order relation induced by the cone K = {(c, y) : c ≥ ‖y‖}. Let U be a closed subset of X, where X is a normed space from the given class, and let t ∈ X. We consider the problem Pr(U, t):

minimize

‖u − t‖ subject to u ∈ U.

(1)

It is assumed that there exists a solution of Pr(U, t). This solution is called a metric projection of t onto U, or a best approximation of t by elements of U. We use the structure of the objective function in order to present necessary and sufficient conditions for the global minimum of Pr(U, t) that give a clear understanding of the structure of a metric projection and can be easily verified for some classes of problems under consideration. We use the so-called downward and upward subsets of a space X as a tool for analysis of Pr(U, t). A set U ⊂ X is called downward if (u ∈ U, x ≤ u) ⟹ x ∈ U. A set V ⊂ X is called upward if (v ∈ V, x ≥ v) ⟹ x ∈ V. Downward and upward sets have a simple structure, so the problem Pr(U, t) can be easily analyzed for these sets U. If U is an arbitrary closed subset of X, we can consider its downward hull U_* = U − K and upward hull U* = U + K, where K = {x ∈ X : x ≥ 0} is the cone of positive elements. These hulls can be used for examination of Pr(U, t). We also suggest an approach based on a division of a normed space under consideration into two

homogeneous, not necessarily linear, subspaces. A combination of this approach with the downward-upward technique allows us to give simple proofs of the proposed necessary and sufficient conditions. Properties of downward and upward sets play a crucial role in this article. These properties have been studied in [6,13] for X = ℝⁿ. We show that some results obtained in [6,13] are valid in a much more general case. In fact, the first necessary and sufficient conditions for metric projection onto closed downward sets in ℝⁿ have been given in [1, p. 132, Theorem 9]. Proposition 1(1) and (2) are extensions, for the case X = ℝⁿ and 1 = (1, ..., 1), of [1, Proposition 1(a) and (b)], respectively. Also, Propositions 2 and 3 are extensions of [1, p. 116, Proposition 2]. Furthermore, Corollary 3 is an extension of [1, p. 116, Corollary 2 and p. 117, Remark 2]. In connection with Proposition 6, the downward hull U_* has been introduced in [1, Sect. 1], where the first results on the connection between d(t, U) and d(t, U_*) have been given for the particular case where U is a normal subset of ℝⁿ₊. We use methods of abstract convexity and monotonic analysis (see [11]) in this study. Let X be a normed space. Let K ⊂ X be a closed convex and pointed cone. (The latter means that K ∩ (−K) = {0}.) The cone K generates the order relation ≥ on X. By definition, x ≥ y ⟺ x − y ∈ K. We say that x is greater than y and write x > y if x − y ∈ K ∖ {0}. Assume that K is solid, that is, the interior int K of K is nonempty. Let 1 ∈ int K. Using 1 we can define the following function:
p(x) = inf{λ ∈ ℝ : x ≤ λ1},

(x ∈ X).

(2)

It is easy to check that p is finite. It follows from (2) that x ≤ p(x)1,

(x ∈ X).

(3)

It is easy to check (and well known) that p is a sublinear function, that is,
\[ p(\lambda x) = \lambda p(x) \quad (\lambda > 0,\; x \in X), \qquad p(x + y) \le p(x) + p(y) \quad (x, y \in X). \]

We need the following definition (see [13] and references therein). A function s : X → ℝ is called topical if s is increasing (x ≤ y implies s(x) ≤ s(y)) and s(x + λ1) = s(x) + λ for all x ∈ X and λ ∈ ℝ.





It follows from the definition of p that p is topical. Consider the function
\[ \|x\| := \max(p(x), p(-x)). \tag{4} \]
It is easy to check (and well known) that ‖·‖ is a norm on X. In what follows we assume that the norm (4) coincides with the norm of the space X. It follows from (3) that
\[ x \le \|x\|\mathbf{1}, \quad (x \in X). \tag{5} \]
The ball B(t, r) = {x ∈ X : ‖x − t‖ ≤ r} has the form
\[ B(t, r) = \{x \in X : t - r\mathbf{1} \le x \le t + r\mathbf{1}\}. \tag{6} \]
We now present three examples of spaces under consideration.

Example 1 Let X be a vector lattice with a strong unit 1. The latter means that for each x ∈ X there exists λ ∈ ℝ such that |x| ≤ λ1. Then
\[ \|x\| = \inf\{\lambda > 0 : |x| \le \lambda\mathbf{1}\}, \]
where the norm ‖·‖ is defined by (4). It is well known (see, for example, [21]) that each vector lattice X with a strong unit is isomorphic as a vector-ordered space to the space C(Q) of all continuous functions defined on a compact topological space Q. For a given strong unit 1 the corresponding isomorphism φ can be chosen in such a way that φ(1)(q) = 1 for all q ∈ Q. The cone φ(K) coincides with the cone of all nonnegative functions defined on Q. If X = C(Q) and 1(q) = 1 for all q, then
\[ p(x) = \max_{q \in Q} x(q) \quad\text{and}\quad \|x\| = \max_{q \in Q} |x(q)|. \]
A well-known example of a vector lattice with a strong unit is the space L∞(S, Σ, μ) of all essentially bounded functions defined on a measure space (S, Σ, μ). If 1(s) = 1 for all s ∈ S, then p(x) = ess sup_{s∈S} x(s) and ‖x‖ = ess sup_{s∈S} |x(s)|.

Example 2 Let X = ℝ × Y, where Y is a normed space with a norm ‖·‖, and let K ⊂ X be the epigraph of the norm, K = {(λ, x) : λ ≥ ‖x‖}. The cone K is closed, solid, convex, and pointed. It is easy to check and well known that 1 = (1, 0) is an interior point of K. For each (c, y) ∈ X we have
\[ p(c, y) = \inf\{\lambda \in \mathbb{R} : (c, y) \le \lambda\mathbf{1}\} = \inf\{\lambda \in \mathbb{R} : (\lambda, 0) - (c, y) \in K\} = \inf\{\lambda \in \mathbb{R} : (\lambda - c, -y) \in K\} = \inf\{\lambda \in \mathbb{R} : \lambda - c \ge \|-y\|\} = c + \|y\|. \]
Hence ‖(c, y)‖ = max(p(c, y), p(−(c, y))) = max(c + ‖y‖, −c + ‖y‖) = |c| + ‖y‖.

Example 3 Consider the space ℓ¹ of all summable sequences with the usual norm. Let Y = {x = (x_i) ∈ ℓ¹ : x₁ = 0}. Then we can identify ℓ¹ with the space ℝ × Y. Let y ∈ Y and x = (x₁, y) ∈ ℓ¹. Then ‖x‖ = |x₁| + ‖y‖. Let K = {x = (x_i) ∈ ℓ¹ : x₁ ≥ Σ_{i=2}^∞ |x_i|}. Assume that ℓ¹ is equipped with the order relation ≥ generated by K: if x = (x_i) and z = (z_i), then
\[ x \ge z \iff x_1 - z_1 \ge \sum_{i=2}^{\infty} |x_i - z_i|. \]
Let 1 = (1, 0, ..., 0, ...). Consider the function p defined on ℓ¹ by
\[ p(x) = x_1 + \sum_{i=2}^{\infty} |x_i|, \qquad x = (x_1, x_2, \ldots) \in \ell^1. \]
Then (see the previous example) p(x) = inf{λ ∈ ℝ : x ≤ λ1}, and ‖x‖ = Σ_{i=1}^∞ |x_i| coincides with max(p(x), p(−x)).

Let X be a normed vector space. For a nonempty subset U of X and t ∈ X, define d(t, U) = inf_{u∈U} ‖t − u‖. A point u₀ ∈ U is called a metric projection of t onto U, or a best approximation of t by elements of U, if ‖t − u₀‖ = d(t, U). Let U ⊂ X. For t ∈ X, denote by P_U(t) the set of all metric projections of t onto U:
\[ P_U(t) = \{u \in U : \|t - u\| = d(t, U)\}. \tag{7} \]

It is wellknown that PU (t) is a closed and bounded subset of X. If t … U, then PU (t) is located in the boundary of U. We shall use the following definitions. A pair (U, t) where U  X and t 2 X is called proximinal if there exists a metric projection of t onto U. A pair (U, t) is called



Chebyshev if there exists a unique metric projection of t onto U. A set U ⊂ X is called proximinal if the pair (U, t) is proximinal for all t ∈ X. A set U ⊂ X is called Chebyshev if the pair (U, t) is Chebyshev for all t ∈ X. A set U ⊂ X is called boundedly compact if the set U_r = {u ∈ U : ‖u‖ ≤ r} is compact for each r > 0. (This is equivalent to the following: the intersection of a closed neighborhood of a point u ∈ U with U is compact.) Each boundedly compact set is proximinal. For any subset U of a normed space X we shall denote by int U, cl U, and bd U the interior, the closure, and the boundary of U, respectively.

Metric Projection onto Downward and Upward Sets

Definition 1 A set U ⊂ X is called downward if (u ∈ U, x ≤ u) ⟹ x ∈ U.

First we describe some simple properties of downward sets.

Proposition 1 Let U be a downward subset of X and x ∈ X. Then the following assertions are true:
(1) If x ∈ U, then x − ε1 ∈ int U for all ε > 0.
(2) int U = {x ∈ X : x + ε1 ∈ U for some ε > 0}.

Proof (1) Let ε > 0 be given and x ∈ U. Let N = {y ∈ X : ‖y − (x − ε1)‖ < ε} be an open neighborhood of x − ε1. Then, by (6), N = {y ∈ X : x − 2ε1 < y < x}. Since U is a downward set and x ∈ U, it follows that N ⊂ U, and so x − ε1 ∈ int U.
(2) Let x ∈ int U. Then there exists ε₀ > 0 such that the closed ball B(x, ε₀) ⊂ U. In view of (6), we get x + ε₀1 ∈ U. Conversely, suppose that there exists ε > 0 such that x + ε1 ∈ U. Then, by (1), x = (x + ε1) − ε1 ∈ int U, which completes the proof. □

Corollary 1 Let U be a closed downward subset of X and u ∈ U. Then u ∈ bd U if and only if λ1 + u ∉ U for all λ > 0.

Lemma 1 The closure cl U of a downward set U is downward.

Proof Let x_k ∈ U, k = 1, 2, ..., and x_k → x as k → +∞. Let ‖x_k − x‖ = ε_k (k = 1, 2, ...). Using (6) we get x − ε_k 1 ≤ x_k for all k ≥ 1. Since U is a downward set and x_k ∈ U for all k ≥ 1, we conclude that x − ε_k 1 ∈ U for all k ≥ 1. Let y ≤ x be arbitrary and y_k = y − ε_k 1 ≤ x − ε_k 1 (k = 1, 2, ...). Then y_k ∈ U (k = 1, ...). Since y_k → y as k → +∞, it follows that y ∈ cl U. □

Proposition 2 A closed downward subset U of X is proximinal.

Proof Let t ∈ X ∖ U be arbitrary and r := d(t, U) = inf_{u∈U} ‖t − u‖ > 0. This implies that for each ε > 0 there exists u_ε ∈ U such that ‖t − u_ε‖ < r + ε. Then, by (6),
\[ -(r + \varepsilon)\mathbf{1} \le u_\varepsilon - t \le (r + \varepsilon)\mathbf{1}. \tag{8} \]
Let u₀ = t − r1. Then ‖t − u₀‖ = ‖r1‖ = r = d(t, U). In view of (8), we have u₀ − ε1 = t − r1 − ε1 ≤ u_ε. Since U is a downward set and u_ε ∈ U, it follows that u₀ − ε1 ∈ U for all ε > 0. The closedness of U implies u₀ ∈ U, and so u₀ ∈ P_U(t). Thus the result follows. □

Remark 1 We proved that for each t ∈ X ∖ U the set P_U(t) contains the element u₀ = t − r1 with r = d(t, U). If t ∈ U, then u₀ = t and P_U(t) = {u₀}.

Proposition 3 Let U be a closed downward subset of X and t ∈ X. Then there exists the least element u₀ := min P_U(t) of the set P_U(t), namely, u₀ = t − r1, where r := d(t, U).

Proof If t ∈ U, then the result holds. Assume that t ∉ U and u₀ = t − r1. Then, by Remark 1, u₀ ∈ P_U(t). Applying (6) and the equality ‖t − u₀‖ = r we get x ≥ t − r1 = u₀ for all x ∈ B(t, r). This implies that u₀ is the least element of the closed ball B(t, r). Now, let u ∈ P_U(t) be arbitrary. Then ‖t − u‖ = r, and so u ∈ B(t, r). Therefore, u ≥ u₀. Hence, u₀ is the least element of the set P_U(t). □

Corollary 2 Let U be a closed downward subset of X, t ∈ X, and u₀ = min P_U(t). Then u₀ ≤ t.

Corollary 3 Let U be a closed downward subset of X and t ∈ X be arbitrary. Then
\[ d(t, U) = \min\{\lambda \ge 0 : t - \lambda\mathbf{1} \in U\}. \]





Proof Let A = {λ ≥ 0 : t − λ1 ∈ U}. If t ∈ U, then t − 0·1 = t ∈ U, and so min A = 0 = d(t, U). Suppose that t ∉ U; then r := d(t, U) > 0. Let λ > 0 be arbitrary such that t − λ1 ∈ U. Thus
\[ \lambda = \|\lambda\mathbf{1}\| = \|t - (t - \lambda\mathbf{1})\| \ge d(t, U) = r. \]
Since, by Proposition 3, t − r1 ∈ U, it follows that r ∈ A. Hence, min A = r, which completes the proof. □

The results obtained demonstrate that for the search of a metric projection of an element t onto a downward set U we need to solve the following optimization problem:
\[ \text{minimize } \lambda \quad \text{subject to} \quad t - \lambda\mathbf{1} \in U,\; \lambda \ge 0. \tag{9} \]

This is a one-dimensional optimization problem that is much easier than the original problem Pr(U, t). Problem (9) can be solved, for example, by a common bisection procedure: first find numbers λ₁ and μ₁ such that t − λ₁1 ∈ U and t − μ₁1 ∉ U. Let k ≥ 1. Assume that numbers λ_k and μ_k are known such that t − λ_k 1 ∈ U and t − μ_k 1 ∉ U. Then consider the number ν_k = (λ_k + μ_k)/2. If t − ν_k 1 ∈ U, then put λ_{k+1} = ν_k and μ_{k+1} = μ_k. If t − ν_k 1 ∉ U, then put μ_{k+1} = ν_k and λ_{k+1} = λ_k. The number r = lim_k λ_k = lim_k μ_k is the optimal value of (9); a sketch of this procedure appears below. The following necessary and sufficient conditions for the global minimum easily follow from the results obtained.
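The bisection just described can be sketched as follows. The membership oracle `contains`, which reports whether t − λ1 lies in U, is a hypothetical interface introduced for illustration (the Greek-letter bookkeeping above is folded into two variables).

```python
def metric_projection_downward(t, contains, tol=1e-8):
    """Bisection for problem (9): find r = min{lam >= 0 : t - lam*1 in U}
    for a closed downward set U. By Proposition 3, the projection is
    then u0 = t - r*1; t enters only through the `contains` test."""
    if contains(0.0):          # t itself belongs to U
        return 0.0
    lam, mu = 1.0, 0.0         # want t - lam*1 in U, t - mu*1 not in U
    while not contains(lam):   # grow lam until feasible
        lam *= 2.0
    while lam - mu > tol:
        nu = 0.5 * (lam + mu)
        if contains(nu):
            lam = nu
        else:
            mu = nu
    return lam

# Example: U = {x in R^2 : x_1 + x_2 <= 1} is downward for the
# componentwise order, t = (2, 3), 1 = (1, 1); then t - r*1 in U
# iff (2 - r) + (3 - r) <= 1, so r = 2.
r = metric_projection_downward((2.0, 3.0),
                               lambda lam: (2.0 - lam) + (3.0 - lam) <= 1.0)
print(round(r, 6))   # -> 2.0
```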

Theorem 1 Let U be a closed downward set and t ∉ U. Then u₀ ∈ U is a solution of the problem Pr(U, t) if and only if
(i) u₀ ≥ ū := t − r1, where r = min{λ ≥ 0 : t − λ1 ∈ U};
(ii) p(t − u₀) ≥ p(u₀ − t).

Proof Let u₀ ∈ P_U(t). Since ū := t − r1 is the least element of P_U(t), it follows that u₀ ≥ ū, so (i) is proved. We now demonstrate that (ii) is valid. In view of the equality r = ‖t − u₀‖ = max(p(t − u₀), p(u₀ − t)), we conclude that p(u₀ − t) ≤ r and p(t − u₀) ≤ r. We need to prove that p(t − u₀) = r. Assume on the contrary that p(t − u₀) := inf{λ : t − u₀ ≤ λ1} < r. Then there exists ε > 0 such that t − u₀ ≤ (r − ε)1. This implies that u₀ ≥ t − r1 + ε1 = ū + ε1. Since u₀ ∈ U and U is downward, it follows that ū + ε1 ∈ U, so ū is an interior point of U. This contradicts the fact that ū is a best approximation of t by U. Assume now that both items (i) and (ii) hold. It follows from (i) that t − u₀ ≤ r1. Since p is a topical function, we conclude that p(t − u₀) ≤ r. Item (ii) implies ‖t − u₀‖ = p(t − u₀) ≤ r. Since r = min_{u∈U} ‖t − u‖, we conclude that u₀ ∈ P_U(t). □

We now turn to upward sets.

Definition 2 A set V ⊂ X is called upward if (v ∈ V, x ≥ v) ⟹ x ∈ V.

Clearly V is upward if and only if U = −V is downward, so all results obtained for downward sets can be easily reformulated for upward sets.

Proposition 4 A closed upward subset V of X is proximinal.

Proof This is an immediate consequence of Proposition 2. □

Theorem 2 Let V be a closed upward set and t ∉ V. Then u₀ is a solution of the problem Pr(V, t) if and only if
(i) u₀ ≤ t + r1, where r = min{λ ≥ 0 : t + λ1 ∈ V};
(ii) p(u₀ − t) ≥ p(t − u₀).

Proof The result can be obtained by application of Theorem 1 to the problem Pr(−V, −t). □

Corollary 4 Let V ⊂ X be a closed upward set and t ∈ X. Then d(t, V) = min{λ ≥ 0 : t + λ1 ∈ V}.

Sets Z₊ and Z₋

Consider the function s defined on X by
\[ s(x) = \tfrac{1}{2}\,(p(x) - p(-x)). \]
We now indicate some properties of the function s.
(1) s is homogeneous of degree one, that is, s(λx) = λs(x) for λ ∈ ℝ. Indeed, we need to check that s(λx) = λs(x) for all x ∈ X and λ ≥ 0 and that s(−x) = −s(x) for all x ∈ X. Both assertions directly follow from the definition of s.
(2) s is topical. It follows directly from the definition of s that s is increasing. We now check that s(x + λ1) = s(x) + λ for all x ∈ X and all λ ∈ ℝ.

Best Approximation in Ordered Normed Linear Spaces

Indeed, 1 s(x C 1) D (p(x C 1)  (p(x  1)) 2 1 D (p(x)  p(x) C 2) 2 D s(x) C  : We will be interested in the level sets ZC D fx 2 X : s(x)  0g and Z D fx 2 X : s(x)  0g of function s. The following holds: x 2 ZC () p(x)  p(x) () p(x) D kxk :

B

Let Q consist of two points. Then C(Q) coincides with R2 and s(x) D x1 C x2 , that is, s is a linear function. If Q contains more than two points, then s is not linear. Example 5 Let X D R  Y, where Y is a normed space (Example 2). Let x D (c; y); then p(x) D c C kyk. Hence 1 s(x) D [(c C kyk)  (c C k  yk)] D c ; 2 so s is linear. The following holds: Z0 D f(c; y) : c D 0g;

ZC D f(c; y) : c  0g;

Z D f(c; y) : c  0g: x 2 Z () p(x)  p(x) () p(x) D kxk : Since s is homogeneous, it follows that Z D ZC . Let Z0 D fx : s(x) D 0g. Then ZC \ Z D Z0 ;

Z [ ZC D X :

Since s is continuous, it follows that Z+ and Z are closed subsets of X. Note that both Z+ and Z are conic sets. (Recall that a set C  X is called conic if (x 2 C;  > 0) H) x 2 C). Since s is increasing, it follows that Z+ is upward and Z is downward. Let R D f1 :   0g be the ray passing through 1. In view of the topicality of s, ZC D Z0 C R;

Z D Z0  R :

Indeed, let x 2 ZC ; then s(x) :D   0. Let u D x  1. Then s(u) D 0, hence u 2 Z0 . We demonstrated that x 2 Z0 C R, so ZC  Z0 C R. The opposite inclusion trivially holds. Thus, ZC D Z0 C R. We also have Z D Z0  R D Z0  R. We now give some examples. Example 4 Let X D C(Q) be the space of all continuous functions defined on a compact topological space Q and p(x) D max q2Q x(q). Then s(x) D max q2Q x(q) C min q2Q x(q); therefore Z0 D fx 2 C(Q) : maxq2Q x(q) D  min q2Q x(q)g. Thus x 2 Z0 if and only if there exist points qC ; q 2 Q such that jx(qC )j D jx(q )j D kxk and x(qC ) > 0, x(q ) < 0. Further, x 2 ZC if and only if kxk D max q2Q x(q) >  min q2Q x(q) and x 2 Z if and only if kxk D maxq2Q (x(q)) >  min q2Q (x(q)) D max q2Q x(q).

Example 6 Let X D l 1 (see Example 3). Then s(x) D x1 and Z0 D fx D (x i ) 2 l 1 : x1 D 0g; ZC D fx D (x i ) 2 l 1 : x1  0g; Z D fx D (x i ) 2 l 1 : x1  0g:

Downward Hull and Upward Hull

Let U be a subset of X. The intersection U_* of all downward sets that contain U is called the downward hull of U. Since the intersection of an arbitrary family of downward sets is downward, it follows that U_* is downward. Clearly U_* is the least (by inclusion) downward set which contains U. The intersection U* of all upward sets containing U is called the upward hull of U. The set U* is upward and is the least (by inclusion) upward set containing U.

Proposition 5 ([15], Proposition 3) Let U ⊂ X. Then
\[ U_* = U - K := \{u - v : u \in U,\; v \in K\}, \qquad U^* = U + K := \{u + v : u \in U,\; v \in K\}. \]

We need the following result:

Proposition 6 Consider a closed subset U of X.
(1) Let t ∈ X be an element such that t − U ⊂ Z₊. Then d(t, U) = d(t, U_*).
(2) Let t ∈ X be an element such that t − U ⊂ Z₋. Then d(t, U) = d(t, U*).





Proof We shall prove only the first part of the proposition. The second part can be proved in a similar way. Let r D d(t; U ). Since U  U , it follows that r  d(t; U), so we need only check the reverse inequality. Let u 2 U be arbitrary. Then, by Proposition 5, there exist u 2 U and v 2 K such that u D u  v. Hence t  u D t  u C v D x  u with x :D t C v  t : By hypothesis, t  u 2 ZC . Since x  t and Z+ is upward, it follows that x  u 2 ZC . Since kzk D p(z) for all z 2 ZC and p is increasing, we have kt u k D kx uk D p(x u)  p(t u) D kt uk : Thus for each u 2 U there exists u 2 U such that kt  u k  kt  uk. This means that r :D d(t; U )  d(t; U). We proved that d(t; U) D r.  Proposition 7 (1) Let t 2 X be an element such that t  U  ZC and let U  be a closed set. Then (U, t) is a proximinal pair. (2) Let t 2 X be an element such that t  U  Z and let U  be a closed set. Then (U, t) is a proximinal pair. Proof We shall prove only the first part of the proposition. Since U  is a closed downward set in X, it follows, by Proposition 3, that the least element u0 of the set PU  (t) exists and u0 D t  r1, where r D d(t; U ). In view of Proposition 6, r D d(t; U). Since u0 2 U , by Proposition 5, there exist u 2 U and v 2 K such that u0 D t  r1 D u  v. Then t  u D r1  v and p(t  u) D p(r1  v)  p(r1) D r :

t  U  ZC and U  is closed. In particular, we can give the following necessary and sufficient conditions for a solution of the problem Pr(U, t) for these sets. Theorem 3 (1) Let t  U  ZC and U is closed. Then u0 2 U a solution of Pr(U, t) if and only if (i) u0  t  r1 where r D minf  0 : t  1 U  Kg. (ii) p(t  u0 )  p(u0  t); (2) Let t  U  Z and U  is closed. Then u0 2 U a solution of Pr(U, t) if and only if (i0 ) u0  t C r1 where r D minf  0 : t C 1 U C Kg. (ii0 ) p(u0  t)  p(t  u0 ).

is 2

is 2

Proof We again prove only the first part of the theorem. Due to Proposition 6, we get d(t; U) D d(t; U ) D r. Since U  is closed and downward, it follows (Proposition 3) that u¯ :D t  r1 2 PU  (t). Let u0  u¯ and u0 2 U. Then u0 2 U and in view of Proposition 6, it holds: d(u0 ; U) D d(u0 ; U ) D r D minf  0 : t  1 2 U g : Applying Theorem 1 we conclude that u0 is a best approximation of t by U  . Since u0 2 U, it follows that u0 is a best approximation of t by U. Consider now a best approximation u0 of t by U. Applying again Proposition 6 we deduce that ktu0 k D d(t; U) D d(t; U ) D r. Theorem 1 demonstrates that both (i) and (ii) hold.  Metric Projection onto a Closed Set

Since, by hypothesis, t  u 2 ZC , it follows that kt  uk D p(t  u)  r. On the other hand, kt  uk  d(t; U) D r. Hence kt  uk D r, and so u 2 PU (t), which completes the proof. 

Downward and upward sets can be used for examination of best approximations by arbitrary closed sets (it is assumed that a metric projection exists). We start with the following assertion.

Remark 2 Let U  X be a closed set. Assume that there exists a set V  X such that V  U  V and V  is closed. Then U D V ; hence U  is closed. In particular, U  is closed if there exists a compact set V such that V  U  V .

Proposition 8 Let U be a closed subset of X and t 2 X. Consider the following sets:

Proposition 7 can be used for the search of a metric projection of an element t onto a set U such that

U tC D U \ (t  ZC );

U t D U \ (t  Z ) : (10)

Then (1) t  U tC  ZC ; t  U t  Z . (2) U tC [ U t D U.

Best Approximation in Ordered Normed Linear Spaces

(3) U tC \ U t D U \ (t  Z0 ), where Z0 D fx 2 X : s(x) D 0g. (4) U tC and U t are closed. (5) If U is downward, then U tC is downward; if U is upward, then U t is upward. Proof (1) It is easy to check (t  U) \ ZC D t  [U \ (t  ZC )] D t  U tC : Hence t  U tC  ZC . A similar argument shows that t  U t  Z . (2) The following holds: U tC [ U t D [(t  ZC ) \ U] [ [(t  Z ) \ U] D [(t  ZC ) [ (t  Z )] \ U

it

follows

t

that

U tC \ U t D [U \ [(t  ZC )] \ [U \ (t  Z )] D U \ [(t  ZC ) \ (t  Z )]

Consider a fixed proximinal pair (U, t). Let U tC and U t be the sets defined by (10). Since U tC [ U t D U, it follows that inf kt uk D min( inf kt u C k; inf  kt u  k) : u 2U t

(11) It follows from (11) that at least one of the pairs (U tC ; t) and (U t ; t) is proximinal and a metric projection of t onto U coincides with a metric projection onto at least one of the sets U tC or U t . Let rC D inf kt  uk; u2U tC

r D inf kt  uk; u2U t

r D inf kt  uk D min(rC ; r ) : u2U

If r < rC , then a metric projection of t onto U coincides with a metric projection of t onto U t . If the set (U t ) is closed, we can assert that

p(u  t)  p(t  u)g :

Since ZC \ Z D Z0 , the result follows. (4) This is clear. (5) It follows from the fact that t  ZC is downward  and t  Z is upward.

u C 2U tC

p(t  u)  p(u  t)g :

PU (t) D PU t (t) D fu 2 U t : u  t C r1;

D U \ [t  (ZC \ Z )] :

u2U

For examination of metric projections of t onto U we need to find numbers r+ and r . The number r+ can be found by solving a one-dimensional optimization problem of the form (9); r can be found by solving a similar problem. If rC < r , then a metric projection of t onto U coincides with a metric projection of t onto U tC . Since t  U tC  ZC , we can use the results of this section for analyzing the problem Pr(U, t) and its solution. In particular, if the downward hull (U tC ) of the set U tC is closed, we can assert that the set PU (t) coincides with the set PU C (t). Using Theorem 3 we can give necessary t and sufficient conditions for the global minimum in this case in terms of the set U tC . They can be expressed in the following form: PU (t) D PU C (t) D fu 2 U tC : u  t  rC 1;

D [t  (ZC [ Z )] \ U : Since ZC [ Z D X; U tC [ U t D U. (3) The following holds:

B

If r D rC ; then we can use both sets U tC and U t . We assume in the rest of this section that both pairs (U tC ; t); (U t ; t) are proximinal. In particular, these pairs are proximinal for arbitrary t, if U is a locally compact set. We are now interested in metric projections u of t onto U such that s(u  t) D 0. We introduce the following definition. Definition 3 A pair (U, t) with U  X, t 2 X is called strongly proximinal if s(u  t) D 0 for each metric projection u of t onto U. Recall that s(u  t) D 0 if and only if u  t 2 ZC \ Z . Proposition 9 The following assertions (i) and (ii) are equivalent: (i) (U, t) is a strongly proximinal pair; (ii) PU (t) D PU C (t) \ PU t (t). t

(12)

Proof (i) H) (ii). Let u 2 PU (t). Since u  t 2 Z D ZC and u 2 U, it follows that u 2 U \ (t  ZC ) D U tC .





Then kt  uk D minu 0 2U kt  u 0 k  minu 0 2U C kt  t

U tC ,

0

u k. Since u 2 we conclude that the equality kt  uk D minu 0 2U C kt  u 0 k holds. Thus u 2 t PU C (t). A similar argument shows that u 2 PU t (t). t Let u 2 PU C (t) \ PU t (t). Then t

ku  tk D

d(t; U tC )

D

d(t; U t ) :

Combining the equality U D U tC [ U t with (11), we get ku  tk D minu 0 2U ku 0  tk, and hence u 2 PU (t). (ii) H) (i). Since (ii) holds, it follows that PU (t) D PU C (t) \ PU t (t) t

u¯ 2 U for all small enough ı > 0. Since v˜  ıq D t  ¯ we obtain u˜  ıq D t  u, ¯ D k˜v  ıqk < k˜v k D kt  uk ˜ : min kt  uk  kt  uk u2U

This is a contradiction because u˜ 2 PU (t).



Example 7 Let U 0  X be a locally compact set and q 2 int K. Consider the set U D U 0 Cfq :  2 Rg D fu 0 Cq : u 0 2 U 0 ;  2 Rg : Clearly U is a locally compact set and U is weakly Kopen. Then for each t 2 X the pair (U, t) is strongly proximinal.

D fu 2 U tC : t  r1  ug \ fu 2 U t : u  t C r1g D fu 2 U tC \ U t : t  r1  u  t C r1g : Applying Proposition 8 (3), we conclude that PU (t) D fu 2 U \ (t  Z0 ) : t  r1  u  t C r1g D U \ (t  Z0 ) \ B(t; r) : Since PU (t) D U \ B(t; r) (by definition), it follows that PU (t)  t  Z0 , that is, the pair (U, t) is strongly proximinal.  Let (U, t) be a proximinal pair. We are interested in a description of conditions that guarantee that ˜ where u˜ is a metric projection of t onto U, v˜ :D t  u, belongs to ZC \ Z D Z0 . First, we give the following definition: Definition 4 We say that a set U  X is weakly Kopen if for each u 2 U there exists an element q 2 int K such that u C ıq 2 U for all ı with a small enough jıj. Proposition 10 Assume that (U, t) is a proximinal pair such that the set U is weakly K-open. Let u˜ 2 PU (t). Then v˜ :D t  u˜ 2 Z0 . Proof Let v˜ … Z0 ; then v˜ … (ZC \ Z ). Assume for the sake of definiteness that v˜ 2 Z C , that is, k˜v k D p(˜v ) > p(˜v ). Since U is weakly K-open and u˜ 2 U, it follows that there exists q 2 int K such that u˜ C ıq 2 U for all small enough ı > 0. Then: p(˜v ) > p(˜v  ıq)  p(˜v C ıq) D p((˜v  ıq)) : Hence k˜v  ıqk D p(˜v  ıq) < p(˜v ) D k˜v k. Let u¯ D u˜ C ıq. Because U is weakly K-open, we conclude that

Best Approximation in a Class of Normed Spaces with Star-Shaped Cones The theory of best approximation by elements of convex sets in normed linear spaces is well developed and has found many applications [1,2,4,5,10, 16,17,18,19,20]. However, convexity is sometimes a restrictive assumption, and therefore the problem arises of how to examine best approximation by not necessarily convex sets. Special tools for this are needed. The aim of the present article is to develop a theory of best approximation by elements of closed sets in a class of normed spaces with star-shaped cones (see [9]). A star-shaped cone K in a normed space X generates a relation K on X, which is an order relation if and only if K is convex. It can be shown that each star-shaped cone K, such that the interior of the kernel K is not empty, can be represented as the union of closed solid convex pointed cones K i (i 2 I, where I is an index set) such that the interior of the cone K :D \ i2I K i is not empty. A point 1 2 int K generates the norm k  k* on X, where kxk D inff > 0 : x K  1; x K  1g, and we assume that X is equipped with this norm. In the special case I D f1g (that is, K is a closed convex solid pointed cone) the class of spaces under consideration contains such Banach lattices as the space L1 (S; ˙; ) of all essentially bounded functions defined on a measure space (S, ˙, ) and the space C(Q) of all continuous functions defined on a compact topological space Q. Now, let X be a normed space and U  X. The set kern U consisting of all u 2 U such that

Best Approximation in Ordered Normed Linear Spaces

(x 2 U; 0  ˛  1) H) u C ˛(x  u) 2 U is called the convex kernel of U. A nonempty set U is called starshaped if kern U is not empty. It is known (see, for example, [12]) that kern U is convex for an arbitrary starshaped set U. If U is closed, then kern U is also closed. Indeed, let u k 2 kern U; k D 1; : : : and u k ! u. For each k D 1; 2; : : :, x 2 U and ˛ 2 [0; 1], we have u k C ˛(x  u k ) 2 U, and so u C ˛(x  u) 2 U. This means that u 2 kern U. We need the following statement. Proposition 11 Let U  X be a set and let u 2 U. Then the following assertions are equivalent: (i) There exists " > 0, an index set I, and a family of convex sets (U i ) i2I such that [ UD U i and U i  B(u; ") (i 2 I): (13)

B

In the sequel, we shall study star-shaped cones. Recall that a set K  X is called a cone (or conic set) if ( > 0; x 2 K) H) x 2 K. Let K be a star-shaped cone and K D kern K. Then, K  is also a cone. Indeed, let u 2 K ,  > 0 and x 2 K. Let x 0 D x/. Then, x 0 2 K, and so u C ˛(x 0  u) 2 K for all ˛ 2 [0; 1]. We have u C ˛(x 0  u) D u C ˛(x  u) 2 K. Since x is an arbitrary element of K, it follows that u 2 kern K D K . We now give an example. Example 7 Let X coincide with the space C(Q) of all continuous functions defined on a compact metric space Q and K D fx 2 C(Q) : max q2Q x(q)  0g. Clearly K is a nonconvex cone. It is easy to check that K is a star-shaped cone and kern K D KC , where KC D fx 2 C(Q) : x(q)  0 for all q 2 Qg D fx 2 C(Q) : min x(q)  0g :

i2I

x2Q

(ii) U is a star-shaped set and u 2 int kern U. Proof (i) H) (ii). Let z 2 B(u; ") and let x 2 U; ˛ 2 [0; 1]. It follows from (13) that there exists i 2 I such that x 2 U i . Since U i is convex and z 2 B(u; ")  U i , we conclude that z C ˛(x  z) 2 U i  U. Hence, z 2 kern U for each z 2 B(u; "), and so B(u; ")  kern U. (ii) H) (i). Let I D U. Since u 2 int kern U, it follows that there exists " > 0 such that B(u; ")  kern U. Let x 2 U and U x D co(x [ B(u; ")). Then the set U x is convex and closed and x 2 U x . Hence, S U  x2U U x . Applying the definition of the convex kernel we conclude that U x  U. Hence, S  x2U U x  U.

Indeed, let u 2 KC . Consider a point x 2 K. Then there exists a point q0 2 Q such that x(q0 )  0. Since u(q)  0 for all q 2 Q, it follows that ˛u(q0 ) C (1  ˛)x(q0 )  0 for all ˛ 2 [0; 1]. Therefore, ˛u C (1  ˛)x 2 K. We proved that KC  kern K. Now, consider u … KC . Then there exists a point q0 such that u(q0 ) < 0. Since u is continuous, we can find an open set G  Q such that u(q) < 0 for q 2 G. Let x 2 K be a function such that x(q) < 0 for all q … G (such a function exists). Since the set Q n G is compact, it follows that max q…G x(q) < 0; hence ˛x(q) C (1  ˛)u(q) < 0 for all q 2 Q and small enough ˛ > 0. Therefore ˛x C (1  ˛)u … K for these numbers ˛. The equality kern K D KC has been proved. Note that int kern K ¤ ;.

If 0 2 kern U, then the Minkowski gauge U of U can be defined as follows:

The following statement plays an important role in this paper.

U (x) D inff > 0 : x 2 Ug :

(14)

(It is assumed that inf ; D 0.) Let u 2 kern U. Then, 0 2 kern (U  u), and so we can consider the Minkowski gauge U u of the set U  u.

Theorem 5 Let K  X be a closed cone and let u 2 K. Then the following assertions are equivalent: (i) There exists " > 0, an index set I and a family of closed convex cones (K i ) i2I such that [ KD K i and K i  B(u; ") (i 2 I) : (15) i2I

Theorem 4 Let u 2 int kern U. Then the Minkowski gauge U u of the set U  u is Lipschitz. Theorem 4 has been proved in [11] (Theorem 5.2) for finite-dimensional spaces. The proof from [11] holds for an arbitrary normed space and we omit it.

(ii) K is a star-shaped cone and u 2 int kern K. Proof (i) H) (ii). It follows from Proposition 11 that K is a star-shaped set and u 2 int kern K. Since K i is a cone for each i 2 I, it follows that K is a cone.

211

212

B

Best Approximation in Ordered Normed Linear Spaces

(ii) H) (i). In view of Proposition 11, there exists a family of convex sets U i ; (i 2 I) such S U i  B(u; ") and K D i2I U i . Let K i be the S closed conic hull of U i : K i D cl > 0 U i . Then S K D i2I K i .  Remark 3 (1) Let K be a closed star-shaped cone with int kern K ¤ ;. Then the set K D kern K is a closed solid convex cone. (Recall that a convex cone K is called solid if int K ¤ ;.) (2) Note that in Theorem 5, the family (K i ) i2I can be chosen such that each K i is a closed solid pointed convex cone. Indeed, if u 2 int kern K, then u ¤ 0 and a neighborhood B(u; ")  kern K can be chosen in such a way that 0 … B(u; "). Then the closed S conic hull K i D cl > 0 U i is a closed solid pointed convex cone. S Let K be a star-shaped cone and K D i2I K i , T where K i is a convex cone and K D i2I K i . Then kern K  K . Indeed, let u 2 K and x 2 K. Then there exists j 2 I such that x 2 K j . The inclusion T u 2 i2I K i implies that u 2 K j . Since K j is a convex cone, it follows that ˛x C (1  ˛)u 2 K j for all ˛ 2 (0; 1). This means that u 2 kern K. Let K be a closed star-shaped cone and u 2 int kern K. Consider the function p u;K (x) D inff 2 R : u  x 2 Kg :

(16)

Functions (16) are well known if K is a convex cone. These functions have been defined and studied in [12] for the so-called strongly star-shaped cones (see [11] for the definition of strongly star-shaped sets). Each star-shaped set U with int kern U ¤ ; is strongly starshaped. (It was shown in [11] for finite-dimensional space; however, the same argument is valid for arbitrary normed spaces.) It was shown [12] that pu,K is a finite positively homogeneous function of the first degree and the infimum in (16) is attained, so p u;K (x)u  x 2 int K. The following equality holds: p u;K (x   u) D Ku ( u  x) ;

(17)

where Ku is the Minkowski gauge of K  u. In view of Theorem 4, the function Ku is Lipschitz, therefore pu,K is also Lipschitz. If K is a convex cone, then pu,K is a sublinear function. This function is also increasing in

the sense of the order relation induced by the convex cone K. The following assertion holds (see [12]). Proposition 12 Let K be a star-shaped cone and u 2 int kern U. Then: p u;K (x Cu) D p u;K (x)C;

x 2 X;  2 R (18)

and fx : p u;K (x)  g D u  K;

 2 R:

(19)

We also need the following assertion. Proposition 13 Let (K i ) i2I be a family of closed T star-shaped cones such that i2I int kern K i ¤ ;. Let S T u 2 i2I int kern K i . Let K D i2I K i and K  D T K . Then i2I i p u;K (x) D inf p u;K i (x) ; i2I

p u;K  (x) D sup p u;K i (x);

(x 2 X) :

i2I

Proof Let L be a cone such that u 2 int kern L. For each x 2 X consider the set x (L) D f 2 R : u  x 2 Lg. It was proved in [12], Proposition 1, that this set is a closed segment of the form [x ; C1), where x D p u;L (x). We have [ Ki g x;K D f 2 R : u 2 x C D f 2 R : u 2 D

[

[

i2I

(x C K i )g

i2I

f 2 R : u 2 x C K i g D

i2I

[

x;K i :

i2I

Hence p u;K (x) D inf x;K D inf

[

x;K i

i2I

D inf inf x;K i D inf p u;K i (x) : i2I

i2I

The second part of the proposition can be proved by a similar argument.  Let K be a closed star-shaped cone with int kern K ¤ ;. Then K can be represented as the union of a family of closed convex cones (K i ) i2I . One such family has been described in the proofs of Proposition 11 and Theorem 5: I D K, K i D cl cone co fi [ B(u; ")g,

B

Best Approximation in Ordered Normed Linear Spaces

where u 2 int kern K and " > 0 so small such that B(u; ")  kern K. This family is very large; often we can find a much simpler presentation. For example, assume that a cone K is given as the union of a family of T closed convex cones (K i ) i2I such that the cone i2I K i has a nonempty interior. Then this cone is contained in kern K; we can use the given cones K i in such a case. We always assume that cones K i are pointed for all i 2 I, that is, K i \ (K i ) D f0g. An arbitrary star-shaped cone K induces a relation K on X, where x K y means that y  x 2 K. This relation is a preorder relation if and only if K is a convex set. Although K is not necessarily an order relation, we will say that x is greater than or equal to y in the sense of K if x K y. We say that x is greater than y and write S x > K y if x  y 2 K n f0g. Let K D i2I K i , where K i is a convex cone. The cone K i induces the order relation K i . The relation K , which is induced by cone K, can be represented in the following form: x K y

if and only if there exists

i2I

such that x K i y :

(20)

In the rest of this article, we assume that X is equipped with a closed star-shaped cone K with int kern K ¤ ;. We also assume that a family (K i ) i2I of closed solid convex pointed cones K i is given such that T S K D i2I K i and K D i2I K i has a nonempty interior. Let an element 1 2 int K be fixed. It is clear that 1 2 int K i for all i 2 I. We will also use the following notations: p1;K D p;

p1;K i D p i ;

p1;K  D p :

(21)

i2I

kxk i :D max(p i (x); p i (x))

x 2 X:

(24)

Let kxk D sup kxk i

(x 2 X; i 2 I) :

(25)

i2I

We now show that kxk < C1 for each x ¤ 0. Indeed, since 1 2 int K  int K i , it follows that there exists " > 0 such that 1 C "B˜  K i for all i 2 I, where B˜ D fx 2 X : kxk  1g is the closed unit ball with respect to the initial norm k  k of the normed space X. Let ˜ hence 1  x 0 2 K i . x ¤ 0. Then x 0 D ("/kxk)x 2 "B; This implies that p i (x 0 ) D inff 2 R : 1  x 0 2 K i g  1 : Since pi is a positively homogeneous function, it follows that   kxk 0 x p i (x) D p i " kxk kxk p i (x 0 )  : D " " The same argument demonstrates that p i (x)  kxk/". Hence kxk D sup kxk i i2I

D sup max(p(x i ); p(x i ))  i2I

kxk < C1 : "

Clearly k  k* is a norm on X. It is easy to see that

It follows from Proposition 13 that p(x) D inf p i (x);

Since K i is a closed solid convex pointed cone, it is easy to check that Bi (i 2 I) can be considered as the unit ball of the norm k  ki defined on X by

p (x) D sup p i (x) :

(22)

kxk D max(p (x); p (x))

x 2 X:

(26)

i2I

A function f : X ! R is called plus-homogeneous (with respect to 1) if f (x C 1) D f (x) C  for all x 2 X and  2 R : (The term plus homogeneous was coined in [13].) It follows from (18) that p i (i 2 I); p and p are plushomogeneous functions. Let B i D fx 2 X : 1 K i x K i 1g

i 2 I:

(23)

Due to (23), we have B i (x; r) :D fy 2 X : ky  xk i  rg D fy 2 X : x C r1 K i y K i x  r1g ; (27) where x 2 X, i 2 I and r > 0. Let x 2 X and r > 0. Consider the closed ball B(x, r) with center x and radius r with respect to k  k* : B(x; r) :D fy 2 X : ky  xk  rg D fy 2 X : x C r1 K  y K  x  r1g : (28)

213

214

B

Best Approximation in Ordered Normed Linear Spaces

It follows from (20), (27), and (28) that \ B(x; r) D B i (x; r) ;

Characterization of Best Approximations (29)

Let ' : X  X ! R be a function defined by

i2I

'(x; y) :D supf 2 R : x C y K 1g (x; y 2 X) :

and

(31)

B(x; r) fy 2 X : x C r1 K y K x  r1g : (30) We now present an example. Example 8 Let X D R2 . Consider the cones A D f(x; y) 2 X : x  0 and y  2xg;  1 B D (x; y) 2 X : x  0 and y  x ; 2  1 C D (x; y) 2 X : x  0 and y   x ; 2 D D f(x; y) 2 X : x  0 and y  2xg : Set K1 D A [ B, K2 D C [ D, K D K1 [ K2 , and K :D K1 \ K2 D A [ D. It is easy to check that K is not a convex set while K 1 , K 2 and K  are convex sets. We also have: p (x) D max(y2x; yC2x) for all x D (x; y) 2 X ; kxk D jyj C 2jxj

Since 1 2 int K , it follows that the set f 2 R : x C y K 1g is nonempty and bounded from above (by the number kx C yk ). Clearly this set is closed. It follows from the definition of ' that the function ' has the following properties:  1 < '(x; y)  kx C yk

(32) x C y K '(x; y)1 for all '(x; y) D '(y; x) for all

x; y 2 X ; x; y 2 X ;

D0

for all

x 2 X;

'(x; y C 1) D '(x; y) C 

In the remainder of the article, we consider a normed space X with a closed star-shaped cone K such that int kern K is not empty. Assume that K is given as S K D i2I K i , where  I is an arbitrary index set;  K i ; (i 2 I) is a closed solid convex pointed cone; T  The interior int K of the cone K D i2I K i is nonempty. In the sequel, assume that the norm k  k of X coincides with the norm k  k* defined by (26).

'(x C 1; y) D '(x; y) C 

 2 R;

'( x;  y) D '(x; y) for all

(36)

x; y 2 X

for all and

(35)

x; y 2 X

for all and

(33) (34)

'(x; x) D supf 2 R : 0 D x  x K 1g

for all x D (x; y) 2 X :

Example 9 Let X be a normed space with a norm k  k. Let Y D X  R and K :D epik  k  Y be the epigraph of k  k. (Recall that epik  k D f(x; ) 2 Y :   kxkg.) Then K is a convex closed cone and (0; 1) 2 int K. Assume now that X is equipped with two equivalent norms k  k1 and k  k2 . Let K i D epik  k i ; i D 1; 2, and K D K1 [ K2 . If there exist x 0 2 X and x 00 2 X such that kx 0 k1 < kx 0 k2 and kx 00 k1 > kx 00 k2 , then K is not convex. Clearly K is a pointed cone. The set int K contains (0, 1); hence it is nonempty. Clearly K n f0g is contained in the open half-space f(x; ) :  > 0g. Cone K is star-shaped. It can be proved that kern K D K1 \ K2 .

x; y 2 X ;

for each

 2 R;

(37)

x; y 2 X and

 > 0 : (38)

Proposition 14 Let ' be the function defined by (31). Then '(x; y) D p(x  y); (x; y 2 X) ;

(39)

and hence '(x; y) D sup[p i (x  y)] (x; y 2 X) :

(40)

i2I

Proof For each x; y 2 X, we have '(x; y) D  supf 2 R :  (x C y) K 1g D inff 2 R :  (x C y) K 1g D inff0 2 R :  (x C y) K 0 1g D inff0 2 R : 0 1 K x C yg D p(x C y) :

B

Best Approximation in Ordered Normed Linear Spaces

Hence '(x; y) D p(x  y). In view of (21), we get (40). 

Proof The proof is similar to the proof of Lemma 4.3 in [7]. 

Now, consider x; y 2 X. We define the functions 'x : X ! R and ' y : X ! R by

For x 2 X and a nonempty subset W of X, we will use the following notations:

'x (t) D '(x; t)

t2X

(41)

d i (x; W) :D inf kx  wk i w2W

and

i2I

and

' y (t) D '(t; y)

t 2 X:

(42)

Note that 'x and ' y are nonincreasing functions with respect to the relation generated by K on X. We have the following result: Corollary 5 Let ' be the function defined by (31). Then ' is Lipschitz continuous. Proof This is an immediate consequence of Lipschitz continuity of p and Proposition 14.  Corollary 6 For each x; y 2 X, the functions defined by (41) and (42) are Lipschitz continuous. Proof It follows from Corollary 5.



Proposition 15 Let ' be the function defined by (31) and set (y; ˛) D fx 2 X : '(x; y)  ˛g (y 2 X; ˛ 2 R) :

i PW (x) D fw 2 W : kx  wk i D d i (x; W)g

i 2 I:

Lemma 3 Let W be a closed downward subset of X, x 2 X n W, r > 0, and i 2 I. Then r D d i (x; W) if and only if x  r1 2 W and p i (x  w  r1))  0 for all w 2 W. Proof Let r D d i (x; W). In a manner analogous to the proof of Proposition 3, one can prove i i (x)  W. Since PW (x) bd W, it that x  r1 2 PW follows from Lemma 2 and Proposition 14 that p i (x  w  r1)  0 for all w 2 W. Conversely, suppose that x  r1 2 W and p i (x  w  r1)  0 for all w 2 W. Let w 2 W be arbitrary. Since pi is plushomogeneous and p i (x  w  r1) D p i (x  w)  r, it follows from (24) that kx  wk i  p i (x  w)  r :

Then, (y; ˛) D K C ˛1  y for all y 2 X and all ˛ 2 R. 

Since kx  (x  r1)k i D r and x  r1 2 W, we con clude that r D d i (x; W).

Proof Fix y 2 X and ˛ 2 R. Then

Lemma 4 Let W be a closed downward subset of X, x 2 X n W, and r > 0. Then r D d(x; W) if and only if x  r1 2 W and for some i 2 I, p i (x  w  r1)  0 for all w 2 W.

x 2 (y; ˛) () '(x; y)  ˛ : Due to Proposition 14, this happens if and only if p(x  y)  ˛, and hence by Proposition 12, if and only if x  y 2 ˛1  K. This is equivalent to x 2 K C ˛1  y, which completes the proof.  Corollary 7 Under the hypotheses of Proposition 15, we have '(x; y)  ˛

x C y K ˛1

if and only if

(x; y 2 X; ˛ 2 R) : Lemma 2 Let W be a closed downward subset of X, y0 2 bd W and ' be the function defined by (31). Then '(w; y0 )  0 D '(y0 ; y0 )

8w 2W:

(43)

Proof Let r D d(x; W). By Proposition 3 we have x  r1 2 PW (x) bd W. Then it follows from Lemma 3 that '(w; r1  x)  0 for all w 2 W. In view of (40), we get p i (x  w  r1)  0 for all w 2 W and all i 2 I. Conversely, suppose that x  r1 2 W and for some i 2 I, p i (x  w  r1)  0 for all w 2 W. Consider w 2 W. Since pi is plus-homogeneous and p i (x  w  r1) D p i (x  w)  r, it follows from (24) and (25) that kx  wk  kx  wk i  p i (x  w)  r : Since r D kx  (x  r1)k and x  r1 2 W, one thus has r D d(x; W). 

215

216

B

Best Approximation in Ordered Normed Linear Spaces

The following result is an immediate consequence of Lemmas 3 and 4. Corollary 8 Let W be a closed downward subset of X, x 2 X n W. Then d(x; W) D d i (x; W)

for all

i 2 I:

(44)

Corollary 9 Let W be a closed downward subset of X, x 2 X n W, and w0 2 W. Then, w0 2 PW (x) if and only i (x) for each i 2 I. if w0 2 PW Proof Let w0 2 PW (x). Then kx  w0 k D d(x; W). In view of (25) and (44), we have kx  w0 k i D i (x) for each d i (x; W) for each i 2 I. Therefore, w0 2 PW i i 2 I. Conversely, let w0 2 PW (x) for each i 2 I. Then kx  w0 k i D d i (x; W) for each i 2 I. Hence, by (44), we get kx  w0 k D max i2I kx  w0 k i D d(x; W), that  is, w0 2 PW (x). Theorem 6 Let W be a closed downward subset of X, x0 2 X n W, y0 2 W, and r0 :D kx0  y0 k . Assume that ' is the function defined by (31). Then the following assertions are equivalent: (1) y0 2 PW (x0 ). (2) There exists l 2 X such that '(w; l)  0  '(y; l);

8 w 2 W; y 2 B(x0 ; r0 ) : (45)

Moreover, if (45) holds with l D y0 , then y0 D w0 D minPW (x0 ), where w0 D x0  r1 is the least element of the set PW (x0 ) and r :D d(x0 ; W). Proof (1) H) (2). Suppose that y0 2 PW (x0 ). Then r0 D kx0  y0 k D d(x0 ; W) D r. Since W is a closed downward subset of X, it follows from Proposition 3 that the least element w0 D x0  r0 1 of the set PW (x0 ) exists. Let l D w0 and y 2 B(x0 ; r0 ) be arbitrary. Then, by (30), we have y K l or y C l K 0. It follows from Corollary 7 that '(y; l)  0. On the other hand, since w0 2 PW (x0 ), it follows that w0 2 bd W. Hence, by Lemma 2 we have '(w; l)  0 for all w 2 W. (2) H) (1). Assume that (2) holds. By (28) it is clear that x0  r0 1 2 B(x0 ; r0 ). Therefore, by (45) we have '(x0  r0 1; l)  0. Due to Corollary 7, we get

x0  r0 1 C l K 0, and so l  r0 1 K x0 . Hence there exists j 2 I such that l  r0 1 K j x0 :

(46)

Now, let w 2 W be arbitrary. Since pj is topical and (21), (39), and (45) hold, it follows from (46) that p j (x0  w)  p j (r0 1  l  w) D p j (l  w) C r0  p(l  w) C r0 D '(w; l) C r0  0 C r0 D r0 : Then, by (24) and (25), we have r0  p j (x0  w)  kx0  wk j  kx0  wk for all w 2 W : Thus kx0  y0 k D d(x0 ; W). Consequently, y0 2 PW (x0 ). Finally, suppose that (45) holds with l D y0 . Then, by the implication (2) H) (1), we have y0 2 PW (x0 ), and so r0 D kx0  y0 k D d(x0 ; W) and y0 K w0 , where w0 D x0  r1 is the least element of the set PW (x0 ) and r :D d(x0 ; W). Now, let w 2 PW (x0 ) be arbitrary. Then kx0  wk D d(x0 ; W) D r0 , that is, w 2 B(x0 ; r0 ). It follows from (45) that '(w; y0 )  0. In view of Corollary 7, we have w  y0 K 0, and so w K y0 . This means that y0 D minPW (x0 ) D w0 . This completes the proof.  Strictly Downward Sets and Their Best Approximation Properties We start with the following definitions, which were introduced in [7] for downward subsets of a Banach lattice. Definition 5 A downward subset W of X is called strictly downward if for each boundary point w0 of W the inequality w > K w0 implies w … W. Definition 6 Let W be a downward subset of X. We say that W is strictly downward at a point w 0 2 bd W if for all w0 2 bd W with w 0 K w0 the inequality w > K w0 implies w … W. The following lemmas have been proved in [7]; however, those proofs hold for the case under consideration.

Best Approximation in Ordered Normed Linear Spaces

Lemma 5 Let f : X ! R be a continuous strictly increasing function. Then all nonempty level sets S c ( f )(c 2 R) of f are strictly downward. Lemma 6 Let W be a closed downward subset of X. Then W is strictly downward at w 0 2 bd W if and only if (i) w > K w 0 H) w … W; (ii) (w 0 K w0 ; w0 2 bd W) H) w0 D w 0 . Lemma 7 Let W be a closed downward subset of X. Then W is strictly downward if and only if W is strictly downward at each of its boundary points. Lemma 8 Let ' be the function defined by (31) and W be a closed downward subset of X that is strictly downward at a point w 0 2 bd W. Then there exists unique l 2 X such that '(w; l)  0 D '(w 0 ; l);

B

Since y > K i w0 for some i 2 I and pi is increasing, it follows from (21), (39), and(48) that 0 D p i (w0  l 0 )  p i (y  l 0 )  p(y  l 0 ) D '(y; l 0 )  0. This, together with (48), implies that '(w; l 0 )  0 D '(y; l 0 ) 8 w 2 W :

(49)

Since w0 ¤ y, it follows that l 0 ¤ l. Hence (47) and (49) contradict the uniqueness of l. We have demonstrated that the assumption y 2 W leads to a contradiction. Thus y … W. This means that W is strictly downward.  Corollary 10 Let f : X ! R be a continuous strictly increasing function and ' be the function defined by (31). Then for each x 2 X there exists unique l D x such that

8w2W:

Theorem 7 Let ' be the function defined by (31). Then for a closed downward subset W of X the following assertions are equivalent: (1) W is strictly downward. (2) For each w0 2 bd W there exists unique l 2 X such that '(w; l)  0 D '(w0; l) 8 w 2 W : Proof The implication (1) H) (2) follows from Lemma 8. We now prove the implication (2) H) (1). Assume that for each w0 2 bd W there exists unique l 2 X such that '(w; l)  0 D '(w0; l) 8 w 2 W : Let w0 2 bd W and y 2 X with y > K w0 . Assume that y 2 W. We claim that y C 1 … W for all  > 0. Suppose that there exists 0 > 0 such that y C 0 1 2 W. Since y C 0 1 > K w0 C 0 1 and W is a downward set, we have w0 C 0 1 2 W. In view of Corollary 1, it contradicts with w0 2 bd W, and so the claim is true. Then, by Corollary 1, we have y 2 bd W. Let l D y. It follows from Lemma 2 that '(w; l)  0 D '(y; l) 8 w 2 W :

(47)

On the other hand, applying Lemma 2 to the point w0 we have for l 0 D w0 : '(w; l 0 )  0 D '(w0 ; l 0 ) 8 w 2 W :

(48)

'(w; l)  0 D '(x; l) 8 w 2 Sc ( f ) ; where c D f (x). Proof This is an immediate consequence of Lemma 5 and Theorem 7.  Definition 7 Let W be a downward subset of X. A point w 0 2 bd W is said to be a Chebyshev point if for each w0 2 bd W with w 0 K w0 and for each x0 … W such that w0 2 PW (x0 ) it follows that PW (x0 ) D fw0 g, that is, the best approximation of x0 is unique. Definition 7 was introduced in [7] for a downward subset of a Banach lattice. Definition 8 Let W be a downward subset of X. A point w 0 2 bd W is said to be a Chebyshev point of W with respect to each K i (i 2 I) if for each w0 2 bd W with w 0 K w0 and for each x0 … W such that w0 2 P i W (x0 ) for each i 2 I it follows that P i W (x0 ) D fw0 g for each i 2 I. Remark 4 In view of Corollary 8, we have that Definitions 7 and 8 are equivalent. Theorem 8 Let W be a closed downward subset of X and w 0 2 bd W. If w0 is a Chebyshev point of W with respect to each K i (i 2 I), then W is a strictly downward set at w0 .

217

218

B

Best Approximation in Ordered Normed Linear Spaces

Proof Suppose that w0 is a Chebyshev point of W with respect to each K i (i 2 I). Assume, if possible, that W is not strictly downward at w0 . Then we can find w0 2 bd W and w 2 W such that w 0 K w0 and w > K w0 . Let r  kw  w0 k > 0. It follows from (27) that r1 K i w  w0

8 i 2 I:

Thus, w0 C r1 K i w for all i 2 I. Set x0 D w0 C r1 2 X. Since w0 2 bd W, by Lemma 6 we have '(y; w0 )  0 for all y 2 W. Also, x0  r1 D w0 2 W. Thus, by (21), Proposition 14, and Lemma 4 we get r D d(x0 ; W). Since kx0  w0 k i D kr1k i D r for all i 2 I, it follows from (25) that kx0  w0 k D r, and hence w0 2 PW (x0 ). In view of Corollary 9, we obtain i (x0 ) for all i 2 I. w0 2 PW On the other hand, we have x0 D w0 C r1 K i w for all i 2 I. Since w > K w0 , we conclude that there exists j 2 I such that w > K j w0 . It follows that r1 D x0  w0 > K j x0  w K j 0. Hence kx0  wk j  kr1k j D r D d j (x0 ; W)  kx0  wk j : Thus kx0  wk j D d j (x0 ; W), and so w 2 P j W (x0 ) with w ¤ w0 . Whence there exist a point x0 2 X n W and a point w0 2 bd W with w 0 K w0 such that w0 2 P i W (x0 ) for each i 2 I and P j W (x0 ) contains at least one point different from w0 . This is a contradiction because w0 is a Chebyshev point of W with respect  to each K i (i 2 I), which completes the proof. Proposition 16 Let W be a closed downward subset of X and w 0 2 bd W. If W is a strictly downward set at w0 , then w0 is a Chebyshev point of W. Proof The proof is similar to that of Theorem 4.2 (the implication (2) H) (1)) in [7].  Corollary 11 Let f : X ! R be a continuous strictly increasing function. Then Sc ( f )(c 2 R) is a Chebyshev subset of X. Proof This is an immediate consequence of Lemma 5 and Proposition 16.  References 1. Chui CK, Deutsch F, Ward JD (1990) Constrained best approximation in Hilbert space. Constr Approx 6:35–64

2. Chui CK, Deutsch F, Ward JD (1992) Constrained best approximation in Hilbert space II. J Approx Theory 71:213– 238 3. Deutch F (2000) Best approximation in inner product spaces. Springer, New York 4. Deutsch F, Li W, Ward JD (1997) A dual approach to constrained interpolation from a convex subset of a Hilbert space. J Approx Theory 90:385–414 5. Deutsch F, Li W, Ward JD (2000) Best approximation from the intersection of a closed convex set and a polyhedron in Hilbert space, weak Slater conditions, and the strong conical hull intersection property. SIAM J Optim 10: 252–268 6. Martinez-Legaz J-E, Rubinov AM, Singer I (2002) Downward sets and their separation and approximation properties. J Global Optim 23:111–137 7. Mohebi H, Rubinov AM (2006) Best approximation by downward sets with applications. J Anal Theory Appl 22(1):1–22 8. Mohebi H, Rubinov AM (2006) Metric projection onto a closed set: necessary and sufficent conditions for the global minimum. J Math Oper Res 31(1):124–132 9. Mohebi H, Sadeghi H, Rubinov AM (2006) Best approximation in a class of normed spaces with star-shaped cones. J Numer Funct Anal Optim 27(3–4):411–436 10. Mulansky B, Neamtu M (1998) Interpolation and approximation from convex sets. J Approx Theory 92:82–100 11. Rubinov AM (2000) Abstract convex analysis and global optimization. Kluwer, Boston Dordrecht London 12. Rubinov AM, Gasimov RN (2004) Scalarization and nonlinear scalar duality for vector optimization with preferences that are not necessarily a pre-order relation. J Glob Optim 29:455–477 13. Rubinov AM, Singer I (2001) Topical and sub-topical functions, downward sets and abstract convexity. Optimization 50:307–351 14. Rubinov AM, Singer I (2000) Best approximation by normal and co-normal sets. J Approx Theory 107:212–243 15. Singer I (1997) Abstract convex analysis. Wiley-Interscience, New York 16. Singer I (1970) Best approximation in normed linear spaces by elements of linear subspaces. Springer, New York 17. Jeyakumar V, Mohebi H (2005) A global approach to nonlinearly constrained best approximation. J Numer Funct Anal Optim 26(2):205–227 18. Jeyakumar V, Mohebi H (2005) Limiting and "-subgradient characterizations of constrained best approximation. J Approx Theory 135:145–159 19. Vlasov LP (1967) Chebyshev sets and approximatively convex sets. Math Notes 2:600–605 20. Vlasov LP (1973) Approximative properties of sets in normed linear spaces. Russ Math Surv 28:1–66 21. Vulikh BZ (1967) Introduction to the theory of partially ordered vector spaces. Wolters-Noordhoff, Groningen

Bilevel Fractional Programming

Bilevel Fractional Programming HERMINIA I. CALVETE, CARMEN GALÉ Dpto. de Métodos Estadísticos, Universidad de Zaragoza, Zaragoza, Spain

convex on S; and S D f(x1 ; x2 ) : A1 x1 C A2 x2  b; x1  0; x2  0g, which is assumed to be nonempty and bounded. Let S1 be the projection of S on Rn 1 . For each x˜1 2 S1 provided by the upper level decision maker, the lower level one solves the fractional problem:

MSC2000: 90C32, 90C26 min x2

Article Outline

s.t.

Keywords Introduction Formulation Theoretical Results Algorithms References

Fractional bilevel programming; Hierarchical optimization; Nonconvex optimization Introduction Fractional bilevel programming (FBP), a class of bilevel programming [6,10], has been proposed as a generalization of standard fractional programming [9] for dealing with hierarchical systems with two decision levels. FBP problems assume that the objective functions of both levels are ratios of functions and the common constraint region to both levels is a nonempty and compact polyhedron. Formulation Using the common notation in bilevel programming, the FBP problem [1] can be formulated as: x 1 ;x 2

f 1 (x1 ; x2 ) D

h1 (x1 ; x2 ) ; g1 (x1 ; x2 )

where x2 solves min

f 2 (x1 ; x2 ) D

s.t.

(x1 ; x2 ) 2 S ;

x2

f 2 (x˜1 ; x2 ) D

h2 (x˜1 ; x2 ) g2 (x˜1 ; x2 )

A2 x2  b  A1 x˜1 x2  0 :

Keywords

min

B

h2 (x1 ; x2 ) g2 (x1 ; x2 )

where x1 2 Rn 1 and x2 2 Rn 2 are the variables controlled by the upper level and the lower level decision maker, respectively; hi and g i are continuous functions, hi are nonnegative and concave and g i are positive and

Let M(x˜1 ) denote the set of optimal solutions to this problem. In order to ensure that the FBP problem is well posed it is also assumed that M(x˜1 ) is a singleton for all x˜1 2 S1 . The feasible region of the upper level decision maker, also called the inducible region (IR), is implicitly defined by the lower level decision maker: IR D f(x˜1 ; x˜2 ) : x˜1  0; x˜2 D argmin f f 2(x˜1 ; x2 ) : A1 x˜1 C A2 x2  b; x2  0gg : Therefore, the FBP problem can be stated as: h1 (x1 ; x2 ) g1 (x1 ; x2 )

min

f 1 (x1 ; x2 ) D

s.t.

(x1 ; x2 ) 2 IR :

x 1 ;x 2

Theoretical Results The FBP problem is a nonconvex optimization problem but, taking into account the quasiconcavity of f 2 and the properties of polyhedra, in [1] it was proved that the inducible region is formed by the connected union of faces of the polyhedron S. One of the main features of FBP problems is that, even with the more complex objective functions, they retain the most important property related to the optimal solution of linear bilevel programming problems. That is, there is an extreme point of S which solves the FBP problem [1]. This result is a consequence of the properties of IR as well as of the fact of f 1 being quasiconcave. The same conclusion is also obtained when both level objective functions are defined as the minimum of a finite number of functions which are ratios with the previously stated conditions or, in general, if they are quasiconcave.

219

220

B

Bilevel Fractional Programming

Under the additional assumption that the upper level objective function is explicitly quasimonotonic, another geometrical property of the optimal solution of the FBP problem can be obtained by introducing the concept of boundary feasible extreme point. According to [7], a point (x1 ; x2 ) 2 IR is a boundary feasible extreme point if there exists an edge E of S such that (x1 ; x2 ) is an extreme point of E, and the other extreme point of E is not an element of IR. Let us consider the relaxed problem: min

f 1 (x1 ; x2 ) D

s.t.

(x1 ; x2 ) 2 S :

x 1 ;x 2

h1 (x1 ; x2 ) ; g1 (x1 ; x2 )

hedron defined by the common constraints. However, in [3] an example of the FBP problem is proposed in which M(x˜1 ) is not single-valued for given x˜1 2 S1 and this assertion is not true. Firstly, IR no longer consists of the union of faces of the polyhedron S. Secondly, if the pessimistic approach is used, then an optimal solution to the example does not exist. Finally, if the optimistic approach is taken the optimal solution to the example is not an extreme point of the polyhedron S. Algorithms

(1)

Since f 1 is a quasiconcave function and S is a nonempty and compact polyhedron, an extreme point of S exists which solves (1). Obviously, if an optimal solution of (1) is a point of IR, then it is an optimal solution to the FBP problem. However, in general, this will not be true, since both decision makers usually have conflicting objectives. Hence, if f 1 is explicitly quasimonotonic and there exists an extreme point of S not in IR that is an optimal solution of the relaxed problem (1), then a boundary feasible extreme point exists that solves the FBP problem [3]. Although FBP problems retain some important properties of linear bilevel problems, it is worth pointing out at this time some differences related to the existence of multiple optima when solving the lower level problem for given x1 2 S1 . Different approaches have been proposed in the literature to make sure that the bilevel problem is well posed [6]. The most common one is to assume that M(x1 ) is single-valued for all x1 2 S1 . Other approaches give rules for selecting x2 2 M(x1 ) in order to be able to evaluate the upper level objective function f1 (x1 ; x2 ). The optimistic approach assumes that the upper level decision maker has the right to influence the lower level decision maker so that the latter selects x2 to provide the best value of f 1 . On the contrary, the pessimistic approach assumes that the lower level decision maker always selects x2 which gives the worst value of f 1 . It is well-known [8] that, under the optimistic approach, at least one optimal solution of the linear bilevel problem is obtained at an extreme point of the poly-

Bearing in mind that there is an extreme point of S which solves the FBP problem, an enumerative algorithm can be devised which examines the set of extreme points of S in order to identify the best one regarding f 1 , which is a point of IR. The bottleneck of the algorithm would be the generally large number of extreme points of a polyhedron together with the process of checking if an extreme point of S is a point of IR or not. In the particular case in which f 1 is linear and f 2 is linear fractional (LLFBP problem), in [2] an enumerative algorithm has been proposed which finds a global optimum in a finite number of stages by examining implicitly only bases of the matrix A2 . This algorithm connects the points of IR with the bases of A2 , by applying the parametric approach to solve the fractional problem of the lower level. One of the main advantages of the procedure is that only linear problems have to be solved. When f 1 is linear fractional and f 2 is linear (LFLBP problem), the algorithm developed in [2] combines local search in order to find an extreme point of IR with a better value of f 1 than any of its adjacent extreme points in IR and a penalty method when looking for another point of IR from which a new local search can start. The Kth-best algorithm has been proposed in [3] to globally solve the FBP problem when both objective functions are linear fractional (LFBP). It essentially asserts that the best (in terms of the upper level objective function) of the extreme points of S which is a point of IR is an optimal solution to the problem. Moreover, the search for this point can be made sequentially by computing adjacent extreme points to the incumbent extreme point.

Bilevel Linear Programming

B

Finally, recently two genetic algorithms have been proposed [4,5] which allow us to solve LLFBP, LFBP and LFLBP problems. Both algorithms provide excellent results in terms of both accuracy of the solution and time invested, proving that they are effective and useful approaches for solving those problems. Both algorithms associate chromosomes with extreme points of S. The fitness of a chromosome evaluates its quality and penalizes it if the associated extreme point is not in IR. The algorithms mainly differ in the procedure of checking if an extreme point is in IR. When f 2 is linear, all lower level problems have the same dual feasible region, so it is possible to prove several properties which simplify the process.

Article Outline

References

Many hierarchical optimization problems involving two or more decision makers can be modeled as a multilevel mathematical program. The two-level structure is commonly known as a Stackelberg game where a leader and a follower try to minimize their individual objective functions F(x, y) and f (x, y), respectively, subject to a series of interdependent constraints [2,9]. Play is defined as sequential and the mood as noncooperative. The decision variables are partitioned between the players in such a way that neither can dominate the other. The leader goes first and through his choice of x 2 Rn is able to influence but not control the actions of the follower. This is achieved by reducing the set of feasible choices available to the latter. Subsequently, the follower reacts to the leader’s decision by choosing a y 2 Rm in an effort to minimizes his costs. In so doing, he indirectly affects the leader’s solution space and outcome. Two basic assumptions underlying the Stackelberg game are that full information is available to the players and that cooperation is prohibited. This precludes the use of correlated strategies and side payments. The vast majority of research on this problem has centered on the linear case known as the linear bilevel program (BLP) [3,6]. Relevant notation, the basic model, and a discussion of its theoretical properties follow. For x 2 X  Rn , y 2 Y  Rm , F: X × Y ! R1 , and f : X × Y ! R1 , the linear bilevel programming problem can be written as follows:

1. Calvete HI, Galé C (1998) On the quasiconcave bilevel programming problem. J Optim Appl 98(3):613–622 2. Calvete HI, Galé C (1999) The bilevel linear/linear fractional programming problem. Eur J Oper Res 114(1):188–197 3. Calvete HI, Galé C (2004) Solving linear fractional bilevel programs. Oper Res Lett 32(2):143–151 4. Calvete HI, Galé C, Mateo PM (2007) A genetic algorithm for solving linear fractional bilevel problems. To appear in Annals Oper Res 5. Calvete HI, Galé C, Mateo PM (2008) A new approach for solving linear bilevel problems using genetic algorithms. Eur J Oper Res 188(1):14–28 6. Dempe S (2003) Annotated bibliography on bilevel programming and mathematical programs with equilibrium constraints. Optimization 52:333–359 7. Liu YH, Hart SM (1994) Characterizing an optimal solution to the linear bilevel programming problem. Eur J Oper Res 73(1):164–166 8. Savard G (1989) Contribution à la programmation mathématique à deux niveaux. PhD thesis, Ecole Polytechnique de Montréal, Université de Montréal, Montréal, Canada 9. Schaible S (1995) Fractional programming. In: Horst R, Pardalos PM (eds) Handbook of global optimization. Kluwer, Dordrecht, pp 495–608 10. Vicente LN, Calamai PH (1994) Bilevel and multilevel programming: a bibliography review. J Glob Optim 5:291–306

Bilevel Linear Programming JONATHAN F. BARD University Texas, Austin, USA MSC2000: 49-01, 49K10, 49M37, 90-01, 91B52, 90C05, 90C27

Keywords Definitions Theoretical Properties Algorithmic Approaches See also References

Keywords Bilevel linear programming; Hierarchical optimization; Stackelberg game; Multiple objectives; Complementarity

min x2X

F(x; y) D c1 x C d1 y;

(1)

221

222

B

Bilevel Linear Programming

s.t.

A 1 x C B1 y  b 1 ;

(2)

min

f (x; y) D c2 x C d2 y;

(3)

s.t.

A 2 x C B2 y  b 2 ;

(4)

y2Y

where c1 , c2 2 Rn , d1 , d2 2 Rm , b1 2 Rp , b2 2 Rq , A1 2 Rp × n , B1 2 Rp × m , A2 2 Rq ×n , B2 2 Rq × m . The sets X and Y place additional restrictions on the variables, such as upper and lower bounds or integrality requirements. Of course, once the leader selects an x, the first term in the follower’s objective function becomes a constant and can be removed from the problem. In this case, we replace f (x, y) with f (y). The sequential nature of the decisions in (1)–(4) implies that y can be viewed as a function of x; i. e., y = y(x). For convenience, this dependence will not be written explicitly. Definitions a) Constraint region of the linear BLP: S D f(x; y) :

x 2 X; y 2 Y;

A 1 x C B1 y  b 1 ; A 2 x C B2 y  b 2 g : b) Feasible set for follower for each fixed x 2 X:

IR represents the set over which the leader may optimize. Thus in terms of the above notation, the BLP can be written as min fF(x; y) : (x; y) 2 IRg :

(5)

Even with the stated assumptions, problem (5) may not have a solution. In particular, if P(x) is not singlevalued for all permissible x, the leader may not achieve his minimum payoff over IR. To avoid this situation in the development of algorithms, it is usually assumed that P(x) is a point-to-point map. Because a simple check is available to see whether the solution to (1)–(4) is unique (see [2]) this assumption does not appear to be unduly restrictive. It should be mentioned that in practice the leader will incur some cost in determining the decision space S(X) over which he may operate. For example, when BLP is used as a model for a decentralized firm with headquarters representing the leader and the divisions representing the follower, coordination of lower level activities by headquarters requires detailed knowledge of production capacities, technological capabilities, and routine operating procedures. Up-to-date information in these areas is not likely to be available to corporate planners without constant monitoring and oversight.

S(x) D fy 2 Y : A2 x C B2 y  b2 g : Theoretical Properties c) Projection of S onto the leader’s decision space: S(X) D fx 2 X :

9 y 2 Y;

A 1 x C B1 y  b 1 ; A 2 x C B2 y  b 2 g : d) Follower’s rational reaction set for x 2 S(X): P(x) D fy 2 Y : ˚ y 2 arg min f (x;b y) : b y 2 S(x) : e) Inducible region: IR D f(x; y) : (x; y) 2 S; y 2 P(x)g : To ensure that (1)–(4) is well posed it is common to assume that S is nonempty and compact, and that for all decisions taken by the leader, the follower has some room to respond; i. e., P(x) 6D ;. The rational reaction set P(x) defines the response while the inducible region

The linear bilevel program was first shown to be NPhard by R.G. Jeroslow [7] using satisfiability arguments common in computer science. The complexity of the problem is further elaborated in  Bilevel linear programming: Complexity, equivalence to minmax, concave programs. Issues related to the geometry of the solution space are now discussed. The main result is that when the linear BLP is written as a standard mathematical program (5), the corresponding constraint set or inducible region is comprised of connected faces of S and that a solution occurs at a vertex (see [1] or [8] for the proofs). For ease of presentation, it will be assumed that P(x) is single-valued and bounded, S is bounded and nonempty, and that Y = {y : y  0}. Theorem 1 The inducible region can be written equivalently as a piecewise linear equality constraint comprised of supporting hyperplanes of S.

B

Bilevel Linear Programming

A straightforward corollary of this theorem is that the linear BLP is equivalent to minimizing F over a feasible region comprised of a piecewise linear equality constraint. In general, because a linear function F = c1 x + d1 y is being minimized over IR, and because F is bounded below on S by, say, min{c1 x + d1 y: (x, y) 2 IR}, it can also be concluded that the solution to the linear BLP occurs at a vertex of IR. An alternative proof of this result was given by W.F. Bialas and M.H. Karwan [4] who noted that (5) could be written equivalently as min fc1 x C d1 y : (x; y) 2 co IRg ; where co IR is the convex hull of the inducible region. Of course, co IR is not the same as IR, but the next theorem states their relationship with respect to BLP solutions. Theorem 2 The solution (x , y ) of the linear BLP occurs at a vertex of S. In general, at the solution (x , y ) the hyperplane {(x, y): c1 x + d1 y = c1 x + d1 y } will not be a support of the set S. Furthermore, a by-product of the proof of Theorem 2 is that any vertex of IR is also a vertex of S, implying that IR consists of faces of S. Comparable results were derived by Bialas and Karwan who began by showing that any point in S that strictly contributes in any convex combination of points in S to form a point in IR must also be in IR. This leads to the fact that if x is an extreme point of IR, then it is an extreme point of S. A final observation about the solution of the linear BLP can be inferred from this last assertion. Because the inducible region is not in general convex, the set of optimal solutions to (1)–(4) when not single-valued is not necessarily convex. In searching for a way to solve the linear BLP, it would be helpful to have an explicit representation of IR rather than the implicit representation given by Definition e). This can be achieved by replacing the follower’s problem (3)-(4) with his Kuhn–Tucker conditions and appending the resultant system to the leader’s problem. Letting u 2 Rq and v 2 Rm be the dual variables associated with constraints (4) and y  0, respectively, leads to the proposition that a necessary condition for (x , y ) to solve the linear BLP is that there exists (row) vectors

u and v such that (x , y , u , v ) solves: min

c1 x C d1 y;

(6)

s.t.

A 1 x C B1 y  b 1 ; uB2  v D d2 ; u(b2  A2 x  B2 y) C v y D 0;

(7) (8) (9)

A 2 x C B2 y  b 2 ; x  0; y  0; u  0;

v  0:

(10) (11)

This formulation has played a key role in the development of algorithms. One advantage that it offers is that it allows for a more robust model to be solved without introducing any new computational difficulties. In particular, by replacing the follower’s objective function (3) with a quadratic form f (x; y) D c2 x C d2 y C x > Q1 y C

1 > y Q2 y; 2

(12)

where Q1 is an n × m matrix and Q2 is an m × m symmetric positive semidefinite matrix, the only thing that changes in (6)–(11) is constraint (8). The new constraint remains linear but now includes all problem variables; i. e., x > Q1 C y> Q2 C uB2  v D d2 :

(13)

From a conceptual point of view, (6)–(11) is a standard mathematical program and should be relatively easy to solve because all but one constraint is linear. Nevertheless, virtually all commercial nonlinear codes find complementarity terms like (9) notoriously difficult to handle so some ingenuity is required to maintain feasibility and guarantee global optimality. Algorithmic Approaches There have been nearly two dozen algorithms proposed for solving the linear BLP since the field caught the attention of researchers in the mid-1970s. Many of these are of academic interest only because they are either impractical to implement or highly inefficient. In general, there are three different approaches to solving (1)–(4) that can be considered workable. The first makes use of Theorem 2 and involves some form of vertex enumeration in the context of the simplex method. W. Candler and R. Townsely [5] were the first to develop an algorithm that was globally optimal. Their scheme repeatedly solves two linear programs, one for the leader in

223

224

B

Bilevel Linear Programming

all of the x variables and a subset of the y variables associated with an optimal basis to the follower’s problem, and the other for the follower with all the x variables fixed. In a systematic way they explore optimal bases of the follower’s problem for x fixed and then return to the leader’s problem with the corresponding basic y variables. By focusing on the reduced cost coefficients of the y variables not in an optimal basis of the follower’s problem, they are able to provide a monotonic decrease in the number of follower bases that have to be examined. Bialas and Karwan [4] offered a different approach that systematically explores vertices beginning with the basis associated with the optimal solution to the linear program created by removing (3). This is known as the high point problem. The second and most popular method for solving the linear BLP is known as the Kuhn–Tucker approach and concentrates on (6)–(11). The fundamental idea is to use a branch and bound strategy to deal with the complementarity constraint (9). Omitting or relaxing this constraint leaves a standard linear program which is easy to solve. The various methods proposed employ different techniques for assuring that complementarity is ultimately satisfied (e. g., see [3,6]). The third method is based on some form of penalty approach. E. Aiyoshi and K. Shimizu (see [8, Chap. 15]) addressed the general BLP by first converting the follower’s problem to an unconstrained mathematical program using a barrier method. The corresponding stationarity conditions are then appended to the leader’s problem which is solved repeatedly for decreasing values of the barrier parameter. To guarantee convergence the follower’s objective function must be strictly convex. This rules out the linear case, at least in theory. A different approach using an exterior penalty method was proposed by Shimizu and M. Lu [8] that simply requires convexity of all the functions to guarantee global convergence. In the approach of D.J. White and G. Anandalingam [10], the gap between the primal and dual solutions of the follower’s problem for x fixed is used as a penalty term in the leader’s problem. Although this results in a nonlinear objective function, it can be decomposed to provide a set of linear programs conditioned on either the decision variables (x, y) or the dual variables (u, v) of the follower’s problem. They show that an exact penalty function exists that yields the global solution.

Related theory and algorithmic details are highlighted in [8, Chap. 16], along with presentations of several vertex enumeration and Kuhn–Tucker-based implementations.

See also  Bilevel Fractional Programming  Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs  Bilevel Optimization: Feasibility Test and Flexibility Index  Bilevel Programming  Bilevel Programming: Applications  Bilevel Programming: Applications in Engineering  Bilevel Programming: Implicit Function Approach  Bilevel Programming: Introduction, History and Overview  Bilevel Programming in Management  Bilevel Programming: Optimality Conditions and Duality  Multilevel Methods for Optimal Design  Multilevel Optimization in Mechanics  Stochastic Bilevel Programs

References 1. Bard JF (1984) Optimality conditions for the bilevel programming problem. Naval Res Logist Quart 31:13–26 2. Bard JF, Falk JE (1982) An explicit solution to the multi-level programming problem. Comput Oper Res 9(1):77–100 3. Bard JF, Moore JT (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J Sci Statist Comput 11(2):281–292 4. Bialas WF, Karwan MH (1984) Two-level linear programming. Managem Sci 30(8):1004–1020 5. Candler W, Townsely R (1982) A linear two-level programming problem. Comput Oper Res 9(1):59–76 6. Hansen P, Jaumard B, Savard G (1992) New branch-andbound rules for linear bilevel programming. SIAM J Sci Statist Comput 13(1):1194–1217 7. Jeroslow RG (1985) The polynomial hierarchy and a simple model for competitive analysis. Math Program 32:146–164 8. Shimizu K, Ishizuka Y, Bard JF (1997) Nondifferentiable and two-level mathematical programming. Kluwer, Dordrecht 9. Simaan M (1977) Stackelberg optimization of two-level systems. IEEE Trans Syst, Man Cybern SMC-7(4):554–556 10. White DJ, Anandalingam G (1993) A penalty function for solving bi-level linear programs. J Global Optim 3:397–419

Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs

Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs JONATHAN F. BARD University Texas, Austin, USA MSC2000: 49-01, 49K45, 49N10, 90-01, 91B52, 90C20, 90C27 Article Outline Keywords Related Optimization Problems Complexity of the Linear BLPP Problem See also References

B

where c1 , c2 2 Rn , d1 , d2 2 Rm , b1 2 Rp , b2 2 Rq , A1 2 Rp × n , B1 2 Rp × m , A2 2 Rq × n , B2 2 Rq × m . The sets X and Y place additional restrictions on the variables, such as upper and lower bounds. Note that it is always possible to drop components separable in x from the follower’s objective function (3). Out of practical considerations, it is further supposed that the feasible region given by (2), (4), X and Y is nonempty and compact, and that for each decision taken by the leader, the follower has some room to respond. The rational reaction set, P(x), defines these responses while the inducible region, IR, represents the set over which the leader may optimize. These terms are defined precisely in  Bilevel linear programming. In the play, y is restricted to P(x). Given these assumptions, the BLPP may still not have a well-defined solution. In particular, difficulties may arise when P(x) is multivalued and discontinuous. This is illustrated by way of example in [2,3].

Keywords Bilevel linear programming; Hierarchical optimization; Stackelberg game; Computational complexity; Concave programming; Minmax problem; Bilinear programming A sequential optimization problem in which independent decision makers act in a noncooperative manner to minimize their individual costs, may be categorized as a Stackelberg game. The bilevel programming problem (BLPP) is a static, open loop version of this game where the leader controls the decision variables x 2 X Rn , while the follower separately controls the decision variables y 2 Y Rm (e. g., see [3,9]). In the model, it is common to assume that the leader goes first and chooses an x to minimize his objective function F(x, y). The follower then reacts by selecting a y to minimize his individual objective function f (x, y) without regard to the impact this choice has on the leader. Here, F: X × Y ! R1 and f : X × Y ! R1 . The focus of this article is on the linear case introduced in  Bilevel linear programming and given by: min

F(x; y) D c1 x C d1 y;

(1)

s.t.

A 1 x C B1 y  b 1 ;

(2)

min

f (x; y) D c2 x C d2 y;

(3)

s.t.

A 2 x C B2 y  b 2 ;

(4)

x2X

y2Y

Related Optimization Problems The linear minmax problem (LMMP) is a special case of (1)–(4) obtained by omitting constraint (2) and setting c2 =  c1 , d2 =  d1 . It is often written compactly without the subscripts as min max fcx C d y : Ax C By D bg x2X y2Y

(5)

or equivalently as 



min cx C max d y ; x2X

y2S(x)

(6)

where S(x) = {y 2 Y : By  b  Ax}. Several restrictive versions of (5) where, for example, X and Y are polyhedral sets and Ax + By  b is absent, as well as related optimality conditions are discussed in [8]. Although important in its own right, the LMMP plays a key role in determining the computational complexity of the linear BLPP. This is shown presently. Consider now the inner maximization problem in (6) with Y = {y = 0}. Its dual is: min{u> (b  Ax): u 2 U}, where u is a q-dimensional decision vector and U = {u : u> B  d, u  0}. Note that the dual objective function is parameterized with respect to the vector x. Replacing the inner maximization problem with its dual leads to

225

226

B

Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs

a second representation of (5): min (cx  u > Ax C u > b);

(7)

x2X;u2U

which is known as a disjoint bilinear programming problem. The theoretical properties of (7) along with its relationship to other optimization problems are highlighted in [1]. A more general version of a bilinear programming problem can be obtained directly from the linear BLPP. To see this, it is necessary to examine the Kuhn–Tucker formulation of the latter given by (6)–(11) in  Bilevel linear programming. Placing the complementarity constraint in the objective function as a penalty term gives the following bilinear programming problem: 8 ˆ min ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ (b2  A2 x  B2 y) C v > y]; A 1 x C B1 y  b 1 ;

(8)

u > B2  v > D d2 ; A 2 x C B2 y  b 2 ; x  0;

y  0;

u  0;

v  0;

where M is a sufficiently large constant. In [10] it is shown that a finite M exists for the solution of (8) to be a solution of (1)–(4), and that (8) is a concave program; that is, its objective function is concave. This point is further elaborated in the next section. Complexity of the Linear BLPP Problem (1)–(4) can be classified as NP-hard which loosely means that no polynomial time algorithm exists for solving it unless P = NP. To substantiate this claim, it is necessary to demonstrate that through a polynomial transformation, some known NP-hard problem can be reduced to a linear BLPP. This will be done below constructively by showing that the problem of minimizing a strictly concave quadratic function over a polyhedron (see [5]) is equivalent to solving a linear minmax problem (cf. [4]). For an alternative proof based on satisfiability arguments from computer science see [7]. Theorem 1 The linear minmax problem is NP-hard. To begin, let x be an n-dimensional vector of decision variables, and c 2 Rn , b 2 Rq , A 2 Rq × n , D 2 Rn × n be

constant arrays. For A of full row rank and D positive definite, it will be shown that the following minimization problem can be transformed into a LMMP: 8 Dx; 2 x (9) :s.t. Ax  b; where it is assumed that the feasible region in (9) is bounded and contains all nonnegativity constraints on the variables. The core argument centers on the fact that the Kuhn–Tucker conditions associated with the concave program (9) must necessarily be satisfied at optimality. These conditions may be stated as follows: Ax  b;

(10)

x > D  u > A D c;

(11)

u > (b  Ax) D 0;

(12)

u  0;

(13)

where u is a q-dimensional vector of dual variables. Now, multiplying (11) on the right by x/2, adding cx/2 to both sides of the equation, and rearranging gives 1 1 (cx  u > Ax) D cx  x > Dx: 2 2

(14)

From (12) we observe that u> b = u> Ax, so (14) becomes 1 1 (cx  u > b) D cx  x > Dx: 2 2

(15)

Replacing the objective function in (9) with the lefthand side of (15), and appending the Kuhn–Tucker conditions to (9) results in 8 ˆ   D min cx  u > b; ˆ ˆ x;u ˆ ˆ ˆ ˆ s.t. Ax  b; < (16) x > D  u > A D c; ˆ ˆ ˆ > ˆ u (b  Ax) D 0; ˆ ˆ ˆ : u  0; which is an alternative representation of (9). Thus a quadratic objective function in (9) has been traded for a complementarity constraint in (16). Turning attention to this term, let z be a q-dimensional nonnegative vector and note that u> (b  Ax)

B

Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs

can be replaced by zi = min[ui , (b  Ax)i ], i = 1, . . . , m, where (b  Ax)i is the ith component of b  Ax, as long P as i zi = 0. The aim is to show that the following linear minmax problem is equivalent to (16): 8 ˆ ˆ ˆ  0 D min ˆ ˆ x;u ˆ ˆ ˆ ˆ ˆ ˆ s.t. ˆ ˆ ˆ ˆ ˆ ˆ < ˆ max ˆ ˆ z ˆ ˆ ˆ ˆ ˆ ˆ s.t. ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ :

cx  u > b C

q X

or (cx   b > u  )  (cx 0  b > u 0 ) > M

M

z i  0;

i D 1; : : : ; q;

i D 1; : : : ; q;

where M in the objective functions of problem (17) is a sufficiently large constant whose value must be determined. Before proceeding, observe that an optimal solution to (16), call it (x , u ), is feasible to (17) and yields the same value for the first objective function in (17). This P follows because i zi = 0, where zi = zi (x , u ). It must now be shown that (x , u , z ) also solves (17). Assume the contrary; i. e., there exists a vector (x0 , u0 , z0 ) in the inducible region of (17) such that  0 <   and P 0 P 0 0  i z i > 0. (Of course, if i z i = 0 and  <  this   would contradict the optimality of (x , u ).) To exhibit a contradiction an appropriate value of M is needed. Accordingly, let S be the polyhedron defined by all the constraints in (17) and let (18)

Evidently, because S is compact,  + in (18) is finite. Compactness follows from the assumption that {x: Ax  b} is bounded, and the fact that A has full row rank which implies that u is bounded in the second constraint in (17). Now define:   C  0; M> P 0 i zi  P where is any value in (0, i z0i ). This leads to the following series of inequalities:  0 D cx 0  b > u 0 C M

X i

z0i  M >     C

 (cx   b > u  )  (cx 0  b > u 0 ) > M

X

z0i

i

i D 1; : : : ; q;

˚  C D min cx  u > b : (x; u; z) 2 S :

X i

(17)

iD1

z i  (b  Ax) i ;

(19)

But from the definition of M along with (19), one has

Mz i ;

Ax  b;

zi  ui ;

z0i :

i

iD1

x > D  u > A D cu  0; q X Mz i ;

X

z0i < cx   b > u  D  

P which implies that the open interval (0, i z0i ) does not P 0 exist so i z i = 0, the desired contradiction. Similar arguments can be used to show the reverse; therefore, if (x , u ) solves (16), it also solves (17) and vice versa. Finally, note that the transformation from (9) to (17) is polynomial because it only involves the addition of 2q variables and 2q + n constraints to the formulation. The statement of the theorem follows from these developments. A straightforward corollary is that the linear BLPP is NP-hard. In describing the size of a problem instance, I, it is common to reference two variables: 1) its Length[I], which is an integer corresponding to the number of symbols required to describe I under some reasonable encoding scheme, and 2) its Max[I], also an integer, corresponding to the magnitude of the largest number in I. When a problem is said to be solvable in polynomial time, it means that an algorithm exists that will return an optimal solution in an amount of time that is a polynomial function of the Length[I]. A closely related concept is that of a pseudopolynomial time algorithm whose time complexity is bounded above by a polynomial function of the two variables Length[I] and Max[I]. By definition, any polynomial time algorithm is also a pseudopolynomial time algorithm because it runs in time bounded by a polynomial in Length[I]. The reverse is not true. The theory of NP-completeness states that NP-hard problems are not solvable with polynomial time algorithms unless P = NP; however, a certain subclass may be solvable with pseudopolynomial time algorithms. Problems that do not yield to pseudopolynomial time algorithms are classified as NP-hard in the strong sense.

227

228

B

Bilevel Optimization: Feasibility Test and Flexibility Index

The linear BLPP falls into this category. The proof in [6], once again, is actually a corollary to the following theorem. Theorem 2 The linear minmax problem is strongly NPhard. The proof is based on the notion of a kernel K of a graph G = (V, E) which is a vertex set that is stable (no two vertices of K are adjacent) and absorbing (any vertex not in K is adjacent to a vertex of K). It is shown that the strongly NP-hard problem of determining whether or not G has a kernel (see [5]) is equivalent to determining whether or not a particular LMMP has an optimal objective function value of zero. See also  Bilevel Fractional Programming  Bilevel Linear Programming  Bilevel Optimization: Feasibility Test and Flexibility Index  Bilevel Programming  Bilevel Programming: Applications  Bilevel Programming: Applications in Engineering  Bilevel Programming: Implicit Function Approach  Bilevel Programming: Introduction, History and Overview  Bilevel Programming in Management  Bilevel Programming: Optimality Conditions and Duality  Concave Programming  Minimax: Directional Differentiability  Minimax Theorems  Minimum Concave Transportation Problems  Multilevel Methods for Optimal Design  Multilevel Optimization in Mechanics  Nondifferentiable Optimization: Minimax Problems  Stochastic Bilevel Programs  Stochastic Programming: Minimax Approach  Stochastic Quasigradient Methods in Minimax Problems References 1. Audet C, Hansen P, Jaumard B, Savard G (1996) On the linear maxmin and related programming problems. GERAD École des Hautes Études Commerciales, Montreal, Working paper G-96–15

2. Bard JF (1991) Some properties of the bilevel programming problem. J Optim Th Appl 68(2):371–378 3. Bard JF, Falk JE (1982) An explicit solution to the multi-level programming problem. Comput Oper Res 9(1):77–100 4. Ben-Ayed O, Blair CE (1990) Computational difficulties of bilevel linear programming. Oper Res 38(1):556–560 5. Garey MR, Johnson DS (1979) Computers and intractability: A guide to the theory of NP-completeness. Freeman, New York 6. Hansen P, Jaumard B, Savard G (1992) New branch-andbound rules for linear bilevel programming. SIAM J Sci Statist Comput 13(1):1194–1217 7. Jeroslow RG (1985) The polynomial hierarchy and a simple model for competitive analysis. Math Program 32:146–164 8. Shimizu K, Ishizuka Y, Bard JF (1997) Nondifferentiable and two-level mathematical programming. Kluwer, Dordrecht 9. Simaan M (1977) Stackelberg optimization of two-level systems. IEEE Trans Syst, Man Cybern SMC-7(4):554–556 10. White DJ, Anandalingam G (1993) A penalty function approach for solving bi-level linear programs. J Global Optim 3:397–419

Bilevel Optimization: Feasibility Test and Flexibility Index MARIANTHI IERAPETRITOU Department Chemical and Biochemical Engineering, Rutgers University, Piscataway, USA MSC2000: 90C26 Article Outline Keywords Problem Statement Local Optimization Framework Design Optimization Bilevel Optimization Global Optimization Framework Feasibility Test Flexibility Index Illustrative Example Conclusions See also References Keywords Bilevel optimization; Uncertainty; Flexibility

B

Bilevel Optimization: Feasibility Test and Flexibility Index

Production systems typically involve significant uncertainty in their operation due to either external or internal resources. Variability of process parameters during operation and plant model mismatch (both parametric and structural) could give rise to suboptimality and even infeasibility of the deterministic solutions. Consequently, plant flexibility has been recognized to represent one of the important components in the operability of the production processes. In a broad sense the area covers  a feasibility test that requires constraint satisfaction over a specified space of uncertain parameters;  a flexibility index associated with a given design that represents a quantitative measure of the range of uncertainty space that satisfies the feasibility requirement; and  the integration of design and operations where trade-offs between design cost and plant flexibility are considered. K.P. Halemane and I.E. Grossmann [21] proposed a feasibility measure for a given design based on the worst points for feasible operation, which can be mathematically formulated as a max-min-max optimization problem as will be discussed in detail in the next section. Different approaches exist in the literature that quantify the flexibility for a given design involve the deterministic measures such as the resilience index, RI, proposed in [38], the flexibility index proposed in [41,42] and the stochastic measures such as the design reliability proposed in [27] and the stochastic flexibility index proposed in [37] and [40]. The incorporation of uncertainty into design optimization problems transforms the deterministic process models to stochastic/parametric problems, the solution of which requires the application of specialized optimization techniques. The consideration of the feasibility objective within the design optimization can be targeted towards the following two design capabilities. The first one concerns the design with fixed degree of flexibility that has the capability to cope with a finite number of different operating conditions ([19,20,32,34,40]). The second one considers the design optimization with optimal degree of flexibility that can be achieved by the trade-off of the cost of the plant and its flexibility ([22,33,35,36]). In the next section the feasibility test and the flexibility index problem will be considered in detail.

Problem Statement The design problem can be described by a set of equality constraints I and inequality constraints J, representing plant operation and design specifications: h i (d; z; x; ) D 0;

i 2 I;

g j (d; z; x; )  0;

j 2 J;

(1)

where d corresponds to the vector of design variables, z the vector of control variables, x the state variables and  the vector of uncertain parameters. As has been shown in [21] for a specific design d, given this set of constraints, the design feasibility test problem can be formulated as the max-min-max problem: (d)



D max min max

2T

z

j2J; i2I

h i (d; z; x; ) D 0; g j (d; z; x; )  0

;

(2)

where the function (d) represents a feasibility measure for design d. If (d)  0, design d is feasible for all  2 T, whereas if (d) > 0, the design cannot operate for at least some values of  2 T. The above max-minmax problem defines a nondifferentiable global optimization problem which however can be reformulated as the following two-level optimization problem: 8 ˆ (d) ˆ ˆ ˆ ˆ ˆ ˆ ˆ y k?,

3

where  is a small positive number. Set k = k+1, go to Step 1. If k = 0, then BLPP has no solution. Otherwise, x k ; y k?is an -global?optimum to BLPP,where =  ?c1> x k + d1> y k ?.

The key to this algorithm is the ability to efficiently solve LCP(k ) in Step 1. J. Judice and A. Faustino [12]

˛ i  1:

(4)

iD1

Moreover, it can be shown that the following conditions must hold [11]: X ˛ i  if d j < 0; (5) n

i : B 2 i j >0

o

X n

i : B 2 i j 0;

(6)

for j = 1, . . . , n2 . It is possible to use (4)–(6) as branching criteria in a branch and bound tree. Each of these conditions, when tight, can be used to eliminate a variable from the inner constraints. By combining these conditions with the use of linear relaxations to obtain lower

Bilevel Programming: Global Optimization

B

bounds, a branch and bound algorithm can be developed to solve the BLPP [11]. An alternate method to the use of binary variables is to establish a one-to-one correspondence between each ˛ i and each i , as follows:

It is then possible to solve a dual problem 8 x C d1> y ˆ ˆ x ˆ ˆ ˆ ˆ s.t. A 1 x C B1 y  b 1 ˆ ˆ ˆ ˆ ˆ A 2 x C B2 y  b 2 ˆ ˆ ˆ < ˛(A2 x C B2 y  b2 ) D 0 ˆ ˆ d2 C A> 2D 0 ˆ ˆ ˆ ˆ ˆ  i  M˛ i ˆ ˆ ˆ ˆ ˆ ˛ i  M i ˆ ˆ ˆ :   0; ˛ D f0; 1g:

which provides a lower bound on the global solution. The dual problem is actually solved by partitioning the y-space using the gradients of L and solving a relaxed dual subproblem in each region. In [16] it has been shown that for the bilevel problems, only one dual subproblem needs to be solved at each iteration. This approach can also be used when the inner problem objective function is quadratic. Another approach, proposed in [2], can also be used when the inner level problem has a convex quadratic objective function. The basic idea is to first solve the one-level linear problem by dropping the complementarity conditions. At each iteration, a check is made to see if the complementarity condition is satisfied. If it is, the corresponding solution is in the inducible region IR, and hence a candidate solution for BPP. If not, a branch and bound scheme is used to implicitly examine all combinations of complementary slackness. Let W 1 = {i: i = 0}, W 2 = {i: g i = 0}, W 3 = {i: i 62 W 1 [ W 2 }.

By partitioning the variables into x D (x; y) and y D (; ˛), it can be seen that this problem is of the form 8 ˆ min f (x; y) ˆ < x;y s.t. g(x; y)  0 ˆ ˆ : h(x; y) D 0; where f (x; y), g(x; y) and h(x; y) are bilinear functions. Thus, the GOP algorithm of [8,9] can be applied to solve this problem. The algorithm works by solving a set of primal and relaxed dual problems that bound the global solution. The primal problem is 8 ˆ min f (x; y k ) ˆ < x s.t. g(x; y k )  0 ˆ ˆ : h(x; y k ) D 0; where yk is a fixed number. Because this problem is linear, it can be solved for its global solution, and yields an upper bound on the global solution. It also provides multipliers for the constraints, k and k , which can be used to construct a Lagrange function of the form L(x; y;  k ;  k ) D f (x; y k ) C  k g(x; y k ) C  k h(x; y k ):

y

:s.t.

0 1

2

3 4

5

u  L(x; y;  k ;  k );

Set k = 0, W1 = W2 = ;, W3 = i, F = 1. Set  i = 0, i 2 W1 , g i = 0, i 2 W2 . Solve the relaxed system. Let (x k ; y k ;  k ) be the solution. If no solution exists, or if F(x k ; y k )  F, go to Step 4. If  i g i = 0, 8i, go to Step 3. Otherwise seˆ Let lect i such that  i g i is maximal, say i. ˆ ˆ W1 = W1 [ i, W3 = W3 [ i, and go to Step 1. Update F = F(x k ; y k ). If all nodes in the three have been exhausted, go to Step 5. Else, branch to the newest unfathomed node, say j, and set W1 = W1 [ j, W2 = W2 [ j. Go to Step 1. If F = 1, no solution exists to BPP. Otherwise, the point corresponding to F is the optimum.

Computational Results and Test Problems The difficulty of solving bilevel problems depends on a number of factors, including the number of inner

259

260

B

Bilevel Programming: Implicit Function Approach

versus outer level variables, degree of cooperation between the leader and follower objective functions, number of inner level constraints and the density of the constraints. Computational results have been reported by many authors, including [2,11,12] and [16]. Generally, these have so far been limited to problems involving up to 100 inner level variables and constraints. See [5,6] for methods for automatically generating linear and quadratic bilevel problems which can be used to test any of these and other algorithms for bilevel programming.

See also  Bilevel Fractional Programming  Bilevel Linear Programming  Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs  Bilevel Optimization: Feasibility Test and Flexibility Index  Bilevel Programming  Bilevel Programming: Applications  Bilevel Programming: Applications in Engineering  Bilevel Programming: Implicit Function Approach  Bilevel Programming: Introduction, History and Overview  Bilevel Programming: Optimality Conditions and Duality  Multilevel Methods for Optimal Design  Multilevel Optimization in Mechanics  Stochastic Bilevel Programs References 1. Bard JF (1991) Some properties of the bilevel programming problem. J Optim Th Appl 68:371–378 2. Bard JF, Moore J (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J Sci Statist Comput 11:281–292 3. Ben-Ayed O, Blair C (1990) Computational difficulties of bilevel linear programming. Oper Res 38:556–560 4. Bialas W, Karwan M (1984) Two-level linear programming. Managem Sci 30:1004–1020 5. Calamai P, Vicente L (1993) Generating linear and linearquadratic bilevel programming problems. SIAM J Sci Statist Comput 14:770–782 6. Calamai P, Vicente L (1994) Generating quadratic bilevel programming problems. ACM Trans Math Softw 20: 103–119

7. Edmunds T, Bard J (1991) Algorithms for nonlinear bilevel mathematical programming. IEEE Trans Syst, Man Cybern 21:83–89 8. Floudas CA, Visweswaran V (1990) A global optimization algorithm (GOP) for certain classes of nonconvex NLPs: I. theory. Comput Chem Eng 14:1397 9. Floudas CA, Visweswaran V (1993) A primal-relaxed dual global optimization approach. J Optim Th Appl 78(2):187 10. Fortuny-Amat J, McCarl B (1981) A representation and economic interpretation of a two-level programming problem. J Oper Res Soc 32:783–792 11. Hansen P, Jaumard B, Savard G (1992) New branching and bounding rules for linear bilevel programming. SIAM J Sci Statist Comput 13:1194–1217 12. Júdice J, Faustino A (1992) A sequential LCP method for bilevel linear programming. Ann Oper Res 34:89–106 13. Migdalas A, Pardalos PM, Värbrand P (1998) Multilevel optimization: Algorithms and applications. Kluwer, Dordrecht 14. Vicente LN, Calamai PH (1994) Bilevel and multilevel programming: A bibliography review. J Global Optim 5: 291–306 15. Vicente L, Savard G, Júdice (1994) Descent approaches for quadratic bilevel programming. J Optim Th Appl 81: 379–399 16. Visweswaran V, Floudas CA, Ierapetritou MG, Pistikopoulos EN (1996) A decomposition-based global optimization approach for solving bilevel linear and quadratic programs. In: Floudas CA, Pardalos PM (eds) State of the Art in Global Optimization. Kluwer, Dordrecht, pp 139–162

Bilevel Programming: Implicit Function Approach BP STEPHAN DEMPE Freiberg University Mining and Technol., Freiberg, Germany MSC2000: 90C26, 90C31, 91A65 Article Outline Keywords Reformulation as a One-Level Problem Properties of the Solution Function Optimality Conditions Conditions Using the Directional Derivative of the Solution Function Conditions Using the Generalized Jacobian of the Solution Function

Bilevel Programming: Implicit Function Approach

Solution Algorithms Descent Algorithms Bundle Algorithms

See also References Keywords Bilevel programming problem; Stackelberg game; Implicit function approach; Strongly stable solution; Piecewise continuously differentiable function; Necessary optimality conditions; Sufficient optimality conditions; Solution algorithms The bilevel programming problem is a hierarchical problem in the sense that its constraints are defined in part by a second parametric optimization problem. Let  (x) be the solution set of this second problem (the socalled lower level problem):  (x) :D Argmin f f (x; y) : g(x; y)  0g ;

(1)

B

about the follower’s responses y(x) for all x 2 X the leader’s task is it to minimize the function G(x) := F(x, y(x)) over the set X, i. e.to solve problem (2). The bilevel programming problem has a large number of applications e. g.in economics, natural sciences, technology (cf. [17,25] and the references therein). The quotation marks in (2) have been used to indicate that, due to minimization only with respect to x in the upper level problem (2), this problem is not well defined in the case that the lower level problem (1) has not a uniquely determined optimal solution for all values of x [6]. Minimization only with respect to x in (2) takes place in many applications of bilevel programming, e. g.in the cases when the lower level problem represents the reactions of the nature on the leader’s actions. If  (x) does not reduce to a singleton for all parameter values x 2 X, either an optimistic or a pessimistic approach has to be used to obtain a well defined auxiliary problem. In the optimistic case, problem (2) is replaced by

y

where f , g i 2 C2 (Rn × Rm , R), i = 1, . . . , p. Then, the bilevel programming problem is defined as “ min ” fF(x; y) : y 2  (x); x 2 Xg x

(2)

with F 2 C1 (Rn × Rm , R) and X Rn is closed. Problem (2) is also called the upper level problem. The inclusion of equality constraints in the problem (1) is possible without difficulties. If inequalities and/or equations in both x and y appear in the problem (2), this problem becomes even more difficult since these constraints restrict the set  (x) after a solution y out of it has been chosen. This can make the selection of y 2  (x) a posteriori infeasible [6]. The bilevel programming problem can easily be interpreted in terms of Stackelberg games which are a special case of them widely used in economics. In Stackelberg games the inclusion of lower level constraints g(x, y)  0 is replaced by y 2 Y where Y Rm is a fixed closed set. Consider two decision makers which select their actions in an hierarchical manner. First the leader chooses x 2 X and announces his selection to the follower. Knowing the selection x the follower computes his response y(x) on it by solving the problem (1). Now, the leader is able to evaluate the value of his initial choice by computing F(x, y(x)). Having full knowledge

min fF(x; y) : y 2  (x); x 2 Xg x;y

(3)

[6,11], where minimization is taken with respect to both x and y. The use of (3) instead of (2) means that the leader is able to influence the choice of the follower. If the leader is not able to force the follower to take that solution y 2  (x) which is the best possible for him, he has to bound the damage resulting from an unwelcome choice of the follower. Hence, the leader has to take the worst solution in  (x) into account for computing his decision. This leads to the auxiliary problem in the pessimistic case:  (4) min max fF(x; y) : y 2  (x)g : x 2 X x

y

[15,16]. In the sequel it is assumed that the lower level problem (1) has a unique (global) optimal solution y(x) for all x 2 X. This is guaranteed to be true at least if the assumptions C), SCQ), and SSOC) below are satisfied. Then, the implicit function approach to bilevel programming can be used which means that problem (2) (and equivalently (3)) is replaced by min fG(x) :D F(x; y(x)) : x 2 Xg : x

C)

(5)

The functions f (x, ), g i (x, ): Rm ! R are convex in y for each x 2 X.

261

262

B

Bilevel Programming: Implicit Function Approach

SCQ) For each x 2 X there exists a pointe y(x) such that g(x;e y(x)) < 0. For convex problems, Slater’s condition SCQ) implies that a feasible point y(x) to (1) is optimal if and only if the Karush–Kuhn–Tucker conditions for this problem are valid: There exists a point  2 (x; y(x)), where

If at an optimal solution y(x) of the convex problem (1) at x D x the assumptions SCQ) and SSOC) are satisfied, then y(x) is a strongly stable optimal solution in the sense of M. Kojima [13]. This means that there exists an open neighborhood U of x and a uniquely determined continuous function y: U ! Rm such that y(x) is the uniquely determined optimal solution of (1) for all x 2 U. Hence, for convex problems (1), the assumptions (x; y(x)) ˚ SCQ) and SSOC) imply that there is a uniquely deterD   0 : r y L(x; y(x)) D 0; > g(x; y(x)) D 0 mined implicit function y(x) describing the unique op(6) timal solution of the problem (1) for all x 2 X. This > with L(x, y) = f (x, y) +  g(x, y) denoting the Lagrange function can be inserted into the problem (2) which results in the third equivalent one-level problem (5). function of the problem (1). Problem (5) consists in minimizing the implicitly determined, generally nonsmooth, nonconvex objective Reformulation as a One-Level Problem function F(x, y(x)) on the set X. It has an optimal soluThere are several methods to reformulate (3) as an tion if the set X is compact or the function F(, ) satisfies equivalent one-level problem. some coercivity assumption [11]. The first possibility consists in replacing the lower Under suitable assumptions, the parametric comlevel problem (1) by its Karush–Kuhn–Tucker condi- plementarity problem as well as the parametric variations (6): tional inequality describing the constraints in a mathematical program with equilibrium constraints also pos8 9 r L(x; y) D 0; y ˆ > sess a uniquely determined continuous solution funcˆ > < = > g(x; y) D 0; : (7) tion [17]. Then, the implicit function approach can also min F(x; y) : x;y ˆ g(x; y)  0; > ˆ > be used to investigate MPECs. : ;   0; x 2 X

This is an optimization problem with constraints given in part by a parametric complementarity condition. A second possibility is to use a variational inequality describing the set  (x). Let assumption C) be satisfied. Then, the problem (3) is equivalent to 8
0; the following inequality holds: d > r y2 y L(x; y; )d > 0:

Properties of the Solution Function For the investigation of bilevel programming problems via (5) the knowledge of properties of the solution function y: X ! Rm is needed. If the assumptions C), SCQ), and SSOC) are satisfied, this function is continuous [13], upper Lipschitz continuous [22], Hölder continuous with exponent 1/2 [9] and directionally differen˚ tiable [3,24]. Let z D (x; y(x)), I :D j : g j (z) D 0 , J() := {j : j > 0}. The directional derivative y0 (x; r) D lim t 1 [y(x C tr)  y(x)] t!0C

of the function y() at a point x can be computed as the unique optimal solution y0 (x; r) of the convex quadratic problem 1 > 2 d r y y L(z; )d C d > rx2 y L(z; )r ! min; d 2 rx g i (z)r C r y g i (z)d D 0; 8 i 2 J(); rx g i (z)r C r y g i (z)d  0;

8 i 2 I n J();

(9)

Bilevel Programming: Implicit Function Approach

for some suitably chosen Lagrange multiplier  2 Argmax frx L(z; )r :  2 (z)g

(10)



[3]. The correct choice of  is a rather difficult task since it possibly belongs to the relative interior of some facet of the polyhedral set (z) [3]. For making the application of these properties of the solution function easier, a further assumption is used: CR) For each pair (x; y), x 2 X; y 2  (x), there is an open neighborhood V Rn × Rm of (x; y) such that, for all I I, the family of gradients {r y g i (x, y) : i 2 I} has constant rank on V. If the assumptions C), SCQ), SSOC), and CR) are satisfied, the function y: X ! Rm is a piecewise continuously differentiable function [21], i. e. it is continuous and there exist an open neighborhood U of x and a finite number of continuously differentiable functions yi U ! Rm , i = 1, . . . , k, such that y() is a selection of the yi : ˚ y(x) 2 y i (x) : i D 1; : : : ; k ; 8 x 2 U: The functions yi : U ! Rm describe locally optimal solutions of auxiliary problems ˚ min f (x; y) : g j (x; y) D 0; j 2 I i ; y

where the sets I i , i = 1, . . . , k, satisfy the following two conditions:  there exists a vertex  2 (x; y(x)) such that J() I i I; and ˚  the gradients r y g j (x; y(x)) : j 2 I i are linearly independent [14]. Let IS(x) denote the family of all sets I i having these two properties. Then, k is the cardinality of IS(x). The functions yi : U ! Rm are continuously differentiable at x [7]. For the computation of the Jacobian of the function yi () at x D x the unique solution of a system of linear equations is to be computed. Moreover, the directional derivative y0 (x; r) is equal to the unique optimal solution of the quadratic problem (9) for each optimal solution  of the linear problem (10) [21]. For fixed x, it is a continuous, piecewise linear function of the direction r. The quadratic problem (9) has an optimal solution if and only if  solves the linear problem (10). Hence, for computing a linear approximation of the function y: X ! Rm it is sufficient to

B

solve the parametric quadratic optimization problems (9) for all vertices  2 (x; y(x)). Piecewise continuously differentiable functions are locally Lipschitz continuous [10]. The generalized Jacobian [1] of the function y() satisfies ˚ @y(x) conv r y i (x) : i D 1; : : : ; k

(11)

[14]. Let g I (z) = (g i (z))i 2 I . If the assumption FRR) For each x 2 X, for each vertex  2 (z) with z D (x; y(x)), the matrix ! r y2 y L(z; ) r y> g J( ) (z) rx2 y L(z; ) r y g I (z) 0 rx g I (z) has full row rank is added to C), SCQ), SSOC), and CR), then equality holds in (11) [5]. Optimality Conditions Even under very restrictive assumptions, problem (5) is a nondifferentiable, nonconvex optimization problem. For the derivation of necessary and sufficient optimality conditions, various approaches of nondifferentiable optimization can be used. Conditions Using the Directional Derivative of the Solution Function Let X = {x: hk (x)  0, k 2 K}, where hk 2 C1 (Rn , R), k 2 K and K is a finite set. Generalizations of the following results to larger classes of constraint sets are obvious. Let x 2 X, y(x) 2  (x), z D (x; y(x)). Let the assumptions C), SCQ), SSOC), and CR) as well as MFCQ) There exists a direction d such that r h k (x)d < 0 for all k 2 K :D fl : h l (x) D 0g be valid. Then, if x is a locally optimal solution of the problem (5) (and thus of the bilevel problem (2)), there cannot exist a feasible direction of descent, i. e. rx F(z)r C r y F(z)y0 (x; r)  0

(12)

for all directions r satisfying r h k (x)  0, k 2 K. By use of the above approach for computing the directional derivative of the solution function y(), the verification of this necessary optimality condition can be done by solving a bilevel optimization problem of minimizing the function (12) subject to the condition that y0 (x; r)

263

264

B

Bilevel Programming: Implicit Function Approach

is an optimal solution of the problem (9). By replacing problem (9) with its Karush–Kuhn–Tucker conditions and applying an active index set strategy the following condition is obtained: If x is a locally optimal solution of the problem (2) then v :D min f'(x; I) : I 2 IS(x)g  0;

(13)

Conditions Using the Generalized Jacobian of the Solution Function By [1], the generalized differential of the function G(x) := F(x, y(x)) is equal to ˚ @G(x) D conv rx F(z) C r y F(z)! : ! 2 @y(x) ; (14)

where '(x; I) denotes the optimal objective function value of the problem rx F(z)r C r y F(z)d ! min; d;r;˛

rx h k (x)r  0; rx2 y L(z; )r

C

k 2 K; r y2 y L(z; )d

C r y> g I (z)˛ D 0;

rx g i (z)r C r y g i (z)d D 0;

i 2 I;

rx g i (z)r C r y g i (z)d  0;

i 2 I n I;

˛ i  0;

i 2 I n J();

krk D 1;

and  is the unique vertex of (z) with J() I [2]. Problem (13) is a combinatorial optimization problem and can be solved by enumeration algorithms. In [2] a more general necessary optimality condition is given even without assuming CR). Then, the directional derivative of the solution function is in general discontinuous with respect to perturbations of the direction and is to be replaced by the contingent derivative of the solution function. In [17] it is shown that nonexistence of directions of descent in the tangent cone to the feasible set is also a necessary optimality condition for MPECs. In general, this tangent cone is not convex. Using a so-called basic constraint qualification it is shown that it is equal to the union of a finite number of polyhedral cones. The resulting condition is similar to (13). Dualizing this condition, some kind of a Karush–Kuhn–Tucker condition for MPECs is obtained. It is also possible to obtain a sufficient optimality condition by use of the directional derivative. Namely, if for the optimal function value in (13) the strict inequality v > 0 holds then, for each c 2 (0, v), there exists " > 0 such that F(x; y(x))  F(x; y(x)) C c kx  xk for all x satisfying h(x)  0 and kx  xk  " [2]. Necessary and sufficient optimality conditions of second order based on the implicit function approach (applied to the more general MPEC formulation) are given in [17].

provided that the conditions C), SCQ), SSOC), and CR) are satisfied. Hence, the application of the necessary optimality conditions from Lipschitz optimization to problem (5) leads to necessary optimality conditions for the bilevel problem (2). Thus, if x is a locally optimal solution of the problem (2) and the assumptions C), SCQ), SSOC), CR), and MFCQ) are satisfied, then there exist Lagrange multipliers  i  0, i 2 K, such that 0 2 @G(x) C

X

 i fr h i (x)g:

i2K

This is an obvious generalization of the necessary optimality condition given in [4], where no upper level constraints in (2) appeared, and is also a special case of the results in [19], where the general constraint set x 2 X in the upper level problem (2) together with more restrictive assumptions for the lower level problem are used. For the use of this necessary optimality condition in computations the explicit description of the generalized Jacobian in (11) (with equality instead of inclusion) is needed.

Solution Algorithms The implicit function approach leads to the problem (5) of minimizing a nondifferentiable, nonconvex, implicitly determined function on a fixed set. Any algorithm solving nonsmooth optimization problems can be applied to this problem. Due to the structure of (5) the computation of function values and derivative information for the objective function is expensive. Two types of algorithms are proposed: descent and bundle algorithms. The convergence proofs show that the algorithms converge to points where the above optimality conditions are satisfied, i. e. to solutions where no descent direction exists respectively to Clarke stationary points.

Bilevel Programming: Implicit Function Approach

where

Descent Algorithms

n ˛ k;i D max G(x k )  v(z i )> (x k  z i )  G(z i );

o



c0 x k  z i ;

Let X D fx : h k (x)  0; k 2 Kg : Descent algorithms are iterative methods which compute a sequence of feasible points {xi }i 2 N by xi + 1 = xi + t i ri , 8i, where ri is a feasible direction of descent and t i is a stepsize. For bilevel problems a feasible direction of descent is obtained by minimizing the function (12) rx F(z)r C r y F(z)y0 (x; r) subject to r being an inner direction of the cone of feasible directions to X: ˚ min ˛ : rx F(z)r C r y F(z)y0 (x; r)  ˛; ˛;r

r h i (x)r  ˛;

i 2 K;

krk  1 :

Inserting the Karush–Kuhn–Tucker conditions of the quadratic optimization problem (9) for the computation of y0 (x; r) and again using an active set strategy this problem is converted into an equivalent combinatorial optimization problem. For the computation of a stepsize, e. g., Armijo’s rule can be applied. Such an algorithm is described in [6,8,17]. In [6] it is also investigated how this idea can be generalized to the case when the lower level problem (1) is not assumed to have a uniquely determined optimal solution for all values of the parameter. In [17] this approach is applied to the more general MPEC. Bundle Algorithms Let X = Rn . Different constraint sets can be treated by use of approaches in [12]. As in descent algorithms, in bundle algorithms for minimizing Lipschitz nonconvex functions a sequence of iterates {xi }i 2 N with xi + 1 = xi + t i ri , 8i, is computed. For computing a direction a model of the function to be minimized is used. In the paper [23], the following bundle algorithm has been proposed. Let two sequences of points {xi } kiD1 , {zi } kiD1 have already been computed. Then, for minimizing a nonconvex function G(x), this model has the form max fv(z i )> d  ˛ k;i g C

1ik

B

u k d> d ; 2

(15)

v(zi ) is a subgradient of the function G(x) at x = zi and uk is a weight. If the direction computed by minimizing the model function (15) realizes a sufficient decrease, a serious step is made (i. e. t k = 1 is used). Otherwise, either a short step (which means that t k is computed according to a stepsize rule) or a null step (only the model is updated by computing a new subgradient) is made. For updating the model (15), in each iteration of the bundle algorithm a subgradient of the objective function is needed. For its computation formula (14) can be used. The bundle algorithm is applied to problem (5) in [4,18,20]. In [4], the lower level problem is not assumed to have a uniquely determined optimal solution for all parameter values. The Lipschitz optimization problem (5) is obtained via a regularization approach in the lower level problem (1). Numerical experience for solving bilevel problems (in the formulation (2) as well as in the more general MPEC formulation) with the bundle algorithm is reported in [18,20]. See also  Bilevel Fractional Programming  Bilevel Linear Programming  Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs  Bilevel Optimization: Feasibility Test and Flexibility Index  Bilevel Programming  Bilevel Programming: Applications  Bilevel Programming: Applications in Engineering  Bilevel Programming: Introduction, History and Overview  Bilevel Programming in Management  Bilevel Programming: Optimality Conditions and Duality  Multilevel Methods for Optimal Design  Multilevel Optimization in Mechanics  Stochastic Bilevel Programs

265

266

B

Bilevel Programming: Introduction, History and Overview

References 1. Clarke FH (1983) Optimization and nonsmooth analysis. Wiley, New York 2. Dempe S (1992) A necessary and a sufficient optimality condition for bilevel programming problems. Optim 25:341–354 3. Dempe S (1993) Directional differentiability of optimal solutions under Slater’s condition. Math Program 59: 49–69 4. Dempe S (1997) An implicit function approach to bilevel programming problems. In: Migdalas A, Pardalos PM, Värbrand P (eds) Multilevel Optimization: Algorithms, Complexity and Applications. Kluwer, Dordrecht 5. Dempe S, Pallaschke D (1997) Quasidifferentiability of optimal solutions in parametric nonlinear optimization. Optim 40:1–24 6. Dempe S, Schmidt H (1996) On an algorithm solving twolevel programming problems with nonunique lower level solutions. Comput Optim Appl 6:227–249 7. Fiacco AV, McCormic GP (1968) Nonlinear programming: Sequential unconstrained minimization techniques. Wiley, New York 8. Gauvin J, Savard G (1994) The steepest descent direction for the nonlinear bilevel programming problem. Oper Res Lett 15:265–272 9. Gfrerer H (1987) Hölder continuity of solutions of perturbed optimization problems under MangasarianFromowitz constraint qualification. In: Guddat J et al (eds) Parametric Optimization and Related Topics. Akad Verlag, Berlin, pp 113–127 10. Hager WW (1979) Lipschitz continuity for constrained processes. SIAM J Control Optim 17:321–328 11. Harker PT, Pang J-S (1988) Existence of optimal solutions to mathematical programs with equilibrium constraints. Oper Res Lett 7:61–64 12. Kiwiel KC (1985) Methods of descent for nondifferentiable optimization. Springer, Berlin 13. Kojima M (1980) Strongly stable stationary solutions in nonlinear programs. In: Robinson SM (ed) Analysis and Computation of Fixed Points. Acad Press, New York pp 93–138 14. Kummer B (1988) Newton’s method for non-differentiable functions. Adv Math Optim, In: Math Res, vol 45. Akad Verlag, Berlin 15. Loridan P, Morgan J (1989) -regularized two-level optimization problems: Approximation and existence results. In: Optimization: Fifth French-German Conf (Varez), In: Lecture Notes Math, vol 1405. Springer, Berlin, pp 99–113 16. Lucchetti R, Mignanego F, Pieri G (1987) Existence theorem of equilibrium points in Stackelberg games with constraints. Optim 18:857–866 17. Luo Z-Q, Pang J-S, Ralph D (1996) Mathematical programs with equilibrium constraints. Cambridge Univ Press, Cambridge

18. Outrata J (1990) On the numerical solution of a class of Stackelberg problems. ZOR: Methods and Models of Oper Res 34:255–277 19. Outrata JV (1993) Necessary optimality conditions for Stackelberg problems. J Optim Th Appl 76:305–320 20. Outrata J, Zowe J (1995) A numerical approach to optimization problems with variational inequality constraints. Math Program 68:105–130 21. Ralph D, Dempe S (1995) Directional derivatives of the solution of a parametric nonlinear program. Math Program 70:159–172 22. Robinson SM (1982) Generalized equations and their solutions, Part II: Applications to nonlinear programming. Math Program Stud 19:200–221 23. Schramm H, Zowe J (1992) A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM J Optim 2:121– 152 24. Shapiro A (1988) Sensitivity analysis of nonlinear programs and differentiability properties of metric projections. SIAM J Control Optim 26:628–645 25. Vicente LN, Calamai PH (1994) Bilevel and multilevel programming: A bibliography review. J Global Optim 5(3)

Bilevel Programming: Introduction, History and Overview BP LUIS N. VICENTE Department Mat., University de Coimbra, Coimbra, Portugal MSC2000: 90C26, 90C30, 90C31 Article Outline Keywords See also References Keywords Bilevel programming; Multilevel programming; Hierarchical optimization; Nondifferentiable optimization; Game theory; Stackelberg problems The bilevel programming (BP) problem is a hierarchical optimization problem where a subset of the variables is constrained to be a solution of a given optimization

Bilevel Programming: Introduction, History and Overview

problem parameterized by the remaining variables. The BP problem is a multilevel programming problem with two levels. The hierarchical optimization structure appears naturally in many applications when lower level actions depend on upper level decisions. The applications of bilevel and multilevel programming include transportation (taxation, network design, trip demand estimation), management (coordination of multidivisional firms, network facility location, credit allocation), planning (agricultural policies, electric utility), and optimal design. In mathematical terms, the BP problem consists of finding a solution to the upper level problem 8 : : : > c an a . Let K a D f1; 2; : : : ; n a g. Because of the concavity of f a (x a ), it can be written in the following alternative form f a (x a ) D min f f ak (x a )g D min fc ak x a C s ak g : k2K a

k2K a

(4)

By introducing additional variables y ak 2 [0; 1], k 2 K a , construct the following bilinear problem.

min g(x; y) D x;y

X

2 4

a2A

D

X

3 c ak y ak 5 x a C

k2K a

X X

X X

s ak y ak

a2A k2K a

f ak (x a )y ak

a2A k2K a

(5) s.t. X

Bx D b y ak D 1

(6) 8a 2 A

(7)

k2K a

x a 2 [0;  a ]; y ak  0

8a 2 A and

k 2 Ka

(8)

In [13], the authors show that at any local minima of the bilinear problem, (xˆ ; yˆ), yˆ is either binary vector or can be used to construct a binary vector with the same objective function value. Although the vector yˆ may have a fractional components, the authors note that in practical problems it is highly unlikely. The proof of the theorem below follows directly from (4). Details on the proof as well as transformation of the problem (1)–(3) into (5)–(8) can be found in [13]. Theorem 1 If (x* ,y* ) is a global optima of the problem (5)–(8) then x* is a solution of the problem (1)–(3). According to the theorem, the concave piecewise linear network flow problem is equivalent to a bilinear problem in a sense that the solution of the later is a solution of the former. It is important to notice that the problem (5)–(8) does not have binary variables, i. e., all variables are continuous. However, at optimum y* is a binary vector, which makes sure that in the objective only one linear piece is employed. Fixed Charge Network Flow Problem In the case of the fixed charge network flow problem, we assume that the function f a (x a ) has the following structure.  c a x a C s a x a 2 (0;  a ] f a (x a ) D ; 0 xa D 0 Observe that the function is discontinuous at the origin and linear on the interval (0;  a ].

283

284

B

Bilinear Programming: Applications in the Supply Chain Management

where V(x) denotes the set of vertices of the polyhedra (10)–(11). Observe that ı is the minimum among all positive components of all vectors x v 2 V (x); therefore, ı > 0. Theorem 3 (see [14]) For all " such that " a 2 (0; ı] for all a 2 A,  " (x " ) D f (x  ).

Bilinear Programming: Applications in the Supply Chain Management, Figure 1 Approximation of function fa (xa )

x;y

Let " a 2 (0;  a ], and define  a"a (x a )

 D

ca xa C sa c "aa x a

c "aa

s.t.

x a 2 [" a ;  a ] x a 2 [0; " a )

D c a C s a /" a . It is easy to see that D where S f a (x a ), 8x a 2 f0g [" a ;  a ] and  a"a (x a ) < f a (x a ), 8x a 2 (0; " a ), i. e.,  a"a (x a ) approximates the function f a (x a ) from below. (see Fig. 1). Let us construct the following concave two-piece linear network flow problem.

x

s.t.

X

 a"a (x a )

(9)

a2A

Bx D b;

x a 2 [0;  a ];

8a 2 A ;

a2A

Bx D b ;

xa  0 ;  a"a (x a )

min  " (x) D

Theorem 3 proves the equivalence between the fixed charge network flow problem and the concave twopiece linear network flow problem (9)–(11) in a sense that the solution of the later is a solution of the former. As we have seen in the previous section, concave piecewise linear network flow problems are equivalent to bilinear problems. In particular, problem (9)–(11) is equivalent to the following bilinear problem. X   (12) min [c a x a C s a ] y a C c "aa x a 1  y a

(10) (11)

where " denotes the vector of " a . Function  " (x) as well as the problem (9)–(11) depends on the value of the vector ". In the paper [14], the authors show that for any value of " a 2 (0;  a ], a global solution of the problem (9)–(11) provides a lower bound for the fixed charge network flow problem, i. e.,  " (x " )  f (x  ), where x " and x* denote the solutions of the corresponding problems. Theorem 2 (see [14]) For all " such that " a 2 (0;  a ] for all a 2 A,  " (x " )  f (x  ). Furthermore, by choosing a sufficiently small value for " a one can ensure that both problems have the same solution. Let ı D minfx va jx v 2 V(x); a 2 A; x va > 0g,

and

(13) y a 2 [0; 1] ;

8a 2 A ;

(14)

where " a 2 (0; ı]. Capacitated Multi-Item Dynamic Pricing Problem In the problem, we assume that a company during a discrete time period  is able to produce different commodities from a set P. In addition, we assume that at each point of time j 2  and for each product p 2 P a functional relationship f (p; j) (d(p; j) ) between the satisfied demand and the price is given, i. e., in order to satisfy the demand d(p; j) of the product p, the price of the product at time j should be equal to f(p; j) (d(p; j) ). As a result, the revenue generated from the sales of the product p at time j is g(p; j) (d(p; j) ) D f (p; j) (d(p; j) )d(p; j) . Although we do not specify the function f(p; j) (d(p; j) ), it should ensure that g(p; j) (d(p; j) ) is a concave function (see Fig. 2a). Because of the concavity of g(p; j) (d(p; j) ), there exists a point d˜(p; j) , such that the function reaches its maximum, and producing and selling more than d˜(p; j) is not profitable. Therefore, without lost of generality, we can assume that d(p; j) 2 [0; d˜(p; j) ]. According to the definition of g(p; j) (d(p; j) ), it is a concave monotone function on the interval [0; d˜(p; j) ]. To avoid nonlinearity in the objective, one can approximate it by a concave piecewise linear function. Doing so, divide

B

Bilinear Programming: Applications in the Supply Chain Management

Bilinear Programming: Applications in the Supply Chain Management, Figure 2 The revenue function and its approximation

k [0; d˜(p; j) ] into intervals of equal length, and let d(p; j) , S S k 2 f1; : : : ; Ng f0g D K f0g, denote the end points of the intervals. Then the approximation can be defined as

g˜(p; j) ((p; j) ) D

N X kD1

p2P i2 j2ji j k2K

(15) X

X

X

p2P j2ji j k2K

k x(p;i; j)

8p 2 P

 C i ; 8i 2  ;

(16)

and

i 2;

(17)

and

j 2;

(18)

8p 2 P; i; j 2  and

k2K;

(19)

k X x(p;i; j) k d(p; j)

k2K i2ji j

k k k k where g(p; f (p; j) (d(p; j) D g(p; j) (d(p; j) ) D j) )d(p; j) , PN k k kD0 (p; j) D 1, and (p; j)  0; 8p 2 P; j 2  (see Fig. 2b). k Let x(p;i; j) denote the amount of product p that is produced at time i and sold at time j using the unit k k k k price g(p; j) /d(p; j) D f (p; j) D f (p; j) (d(p; j) ). In addition, let y(p;i) denote a binary variable, which equals one if P P k k j x(p;i; j) > 0 and zero otherwise. Costs associated with the production process include inventory costs pr in st c(p;i; j) , production costs c(p;i) , and setup costs c(p;i) . At last, let Ci represent the production capacity at time i, which is “shared” by all products. Using those definitions, one can construct a linear mixed integer formulation of the problem. Below we provide a simplified formulation of the problem, where the variables (p; j) are eliminated from the formulation. For the details on the mathematical formulation of the problem and its simplification we refer to [15]. 2 3 XX X X k k st 4 5 q(p;i; max j) x(p;i; j)  c(p;i) y(p;i) x;y

k x(p;i; j)  C i y(p;i) ;

j2ji j k2K

X

k k g(p; j) (p; j) ;

X

X

1; 8p 2 P

k x(p;i; j)  0 ;

y(p;i) 2 f0; 1g ;

pr

k k in where q(p;i; j) D f (p; j)  c(p;i; j)  c(p;i) . k Let X D fxjx  0 and x(p;i; j) be feasible to (16)

and (18)g, and Y D [0; 1]jPjjj . Consider the following bilinear problem. max '(x; y) D 2 3 X X XX k k st 5 4 q(p;i; j) x(p;i; j)  c(p;i) y(p;i)

x2X;y2Y

p2P i2

j2ji j k2K

(20) Theorem 4 (see [15]) A global maximum of the bilinear problem (20) is a solution or can be transformed into a solution of the problem (15)–(19). Methods In the previous section, we have discussed several problems arising in the supply chain management. To solve the bilinear formulations of the problems, one can em-

285

286

B

Bilinear Programming: Applications in the Supply Chain Management

ploy techniques applicable for general bilinear problems. In particular, a cutting plain algorithm proposed by Konno can be applied to find a global solution of the problems. In addition, he proposes an iterative procedure, which converges to a local minimum of the problem in a finite number of iterations. For details on the procedure, which is also known as “mountain climbing” procedure (MCP), and the cutting plain algorithm we refer to the paper [11] or Bilinear Programming section of this encyclopedia. Below, we discuss problem specific difficulties of applying the above mentioned algorithms and some effective heuristic procedures, which are able to provide a near optimum solution using negligible computer resources. The MCP, which is used by the heuristics to find a local minimum/maximum of the problems, is very fast due to a special structure of both LP problems employed by the procedure. However, to obtain a high quality solution, in some problems it is necessary to solve a sequence of approximate problems. The bilinear formulations of the supply chain problems typically have many local minima. Therefore, cutting plain algorithms may require many cuts to converge. By combining the heuristic procedures with the cutting plain algorithm, one can reduce the number of cuts by generating deep cuts. One of the main properties of a bilinear problem with a disjoint feasible region is that by fixing vectors x or y to a particular value, the problem reduces to a linear one. The “mountain climbing” procedure employs this property and iteratively solves two linear problems by fixing the corresponding vectors to the solution of the corresponding linear programs. In the case of concave piecewise linear network flow problem, given the vector xˆ , the problem (5)–(8) can be decomposed into jAj problems, X [c ak xˆ a C s ak ]y ak min fy ak jk2K a g

s.t.

k2K a

X

y ak D 1 ;

y ak  0

8k 2 K a :

k2K a

Furthermore, it can be shown that a solution of the problem is a binary vector, which has to satisfy the inequality X X  ak1 y ak  xˆ a   ak y ak : k2K a

k2K a

As a result, one can employ a search technique by asˆ ˆ ˆ signing y ak D 1 if  ak1  xˆ a   ak and y ak D 0, ˆ On the other hand, by fixing the vector 8k 2 K a , k ¤ k. y to the value of the constructed vector yˆ, the problem (5)–(8) reduces to the following network flow problem.

min x

s.t.

X a2A

2 4

X

3 c ak yˆ ak 5 x a

k2K a

Bx D b;

x a  0;

8a 2 A

P ˆ Observe that k2K a c ak yˆ ak D c ak , and different vectors yˆ change the cost vector in the problem. Although the MCP converges to a local minimum, it can provide a near optimum solution for the problem (5)–(8) if the initial vector yˆ is such that yˆ na a D 1 and yˆ ak D 0, 8k 2 K a , k ¤ n a . The effectiveness of the procedure is partially due to the fact that in the supply chain problems f a (x a ) is an increasing function. In addition, the procedure requires less computer resources to converge because both linear problems are relatively easy to solve. A detailed description of the procedure, properties of the linear problems, and computational experiments can be found in [13]. In the case of fixed charge network flow problems, it is not obvious how to choose the vector ". Theorem 3 guarantees the equivalence between the fixed charge network flow problem and the bilinear problem (12)– (14) if " a 2 (0; ı]. However, according to the definition, it is necessary to find all vertices of the feasible region to compute the value of ı, which is computationally expensive. Even if the correct value of ı is known, typically it is a very small number. As a result, the value of " a is close to zero, and c "aa is very large compared to the value of ca . The later creates some difficulties for finding a global solution of the bilinear problem. In particular, the MCP may converge to a local minimum, which is far from being a global solution. To overcome those difficulties, [14] proposes a procedure where it gradually decreases the value of " (see Algorithm 1). The algorithm starts from an initial value for the vector ", i. e., " a D  a . After constructing the corresponding bilinear problem, it employs the MCP to find a local minimum of the problem. If the stopping criteria is not satisfied, the value of " is updated, i. e., " a D ˛" a where ˛ 2 (0; 1), and the algorithm again

B

Bilinear Programming: Applications in the Supply Chain Management

solves the updated bilinear problem using the current solution as an initial vector for the MCP. The choice of ˛ has a direct influence on the CPU time of the algorithm and the quality of the solution. Specifically, if the value of ˛ is closer to one, then due to the fact that " decreases slowly, the algorithm requires many iterations to stop. On the other hand, if the values of the parameter is closer to zero, it may worsen the quality of the solution. A proper choice of the parameter depends on the problem, and it should be chosen by trials and errors. In the paper [14], the authors test the algorithm on various randomly generated test problems and found satisfactory to choose ˛ D 0:5. As for the stopping criteria, it is possible to show that the solution of the final bilinear problem is the solution of the fixed charge network flow problem if on Step 2 one is able to find a global solution of the corresponding bilinear problems. For details on the numerical experiments, stopping criteria and other properties of the algorithm, we refer to [14]. In the problems with pricing decisions, one may also experience some difficulties to employ the MCP for finding a near optimum solution. To explore the properties of the problem, consider the following two linear problems, which are constructed from the problem (20) by fixing either vector x or y to the value of the vector xˆ or yˆ, respectively. LP1 :

x2X

0, y0a

0, and

Step 2: Find a local minimum of the problem (12)(14) using the MCP. Let (x m ; y m ) denote the solution found by the algorithm. Step 3: If 9a 2 A such that x am 2 (0; "m a ) then ˛" a , m m + 1, and go to step 2. Other"a wise, stop. Bilinear Programming: Applications in the Supply Chain Management, Algorithm 1

k it is likely that at optimum of LP2 , xˆ(p;i; j) D 0, 8 j 2 , k 2 K. From the later, it follows that yˆ(p;i) D 0 during the next iteration, and one concludes that if some products are eliminated from the problem during the iterative process, the MCP does not consider them again. Therefore, it is likely that the solution returned by the algorithm is far from being a global one. To avoid zero coefficients in the objective of LP2 , [15] proposes an approximation to the problem (20), which can be used in the MCP to find a near optimum solution. To construct the approximate problem, let X X 1 k k st (x(p;i) ) D q(p;i; '(p;i) j) x(p;i; j)  c(p;i) ; j2ji j k2K

X

3 k k ˆ(p;i; q(p;i; j) x j)



st 5 c(p;i) y(p;i)

2 (x(p;i) ) D '(p;i)

X "(p;i) st "(p;i) C c(p;i)

X

k k q(p;i; j) x(p;i; j) ;

j2ji j k2K

p2P i2 j2ji j k2K

LP2 : max

 a , x 0a

and 2

XX X 4 max y2Y

Step 1: Let " a m 1.

XX X

Xh

i k k ˆ y q(p;i; x(p;i; (p;i) j) j) :

p2P i2 j2ji j k2K

The MCP solves iteratively LP1 and LP2 problems, where the solution of the first problem is used to fix the corresponding vector in the second problem. However, if one of the components of the vector y equals to zero during one of the iterations, e. g., yˆ(p;i) D 0, then in the second problem coefficients of the corresponding k variables x(p;i; j) are equal to zero as well. As a result, changes in the values of those variables do not have any influence on the objective function value. Furthermore, because the products “share” the capacity and other products may have positive coefficients in the objective,

Step 1: Let "(p;i) be a sufficiently large number, 0 = 1, 8p 2 P, i 2 , and m 0. y(p;i) Step 2: Construct the approximation problem (21), and find a local maximum of the problem using the MSP. Let (x m+1 ; y m+1 ) denote the solution returned by the algorithm. Step 3: If 9p 2 P and i 2  such that P P (m+1)k k st m j2ji j k2K q (p;i; j) x(p;i; j)  c(p;i)  "(p;i) and P P (m+1)k ˛", m j2ji j k2K x(p;i; j) > 0 then " m + 1 and go to Step 2. Otherwise, stop. Bilinear Programming: Applications in the Supply Chain Management, Algorithm 2

287

288

B

Bi-objective assignment problemBi-Objective Assignment Problem

k where "(p;i) > 0, and x(p;i) is the vector of x(p;i; j) . Using those functions, construct the following bilinear problem

max ' " (x; y) D i XXh 1 2 (x(p;i) )y(p;i) C '(p;i) (x(p;i) )(1  y(p;i) ) ; '(p;i) x2X;y2Y

p2P i2

(21) where the feasible region is the same as in the problem (20). The authors show that ' " (x; y) approximates the function '(x; y) from above. Theorem 5 (see [15]) There exists a sufficiently small " > 0 such that a solution of the problem (20) is a solution of the problem (21). Algorithm 2 starts from a sufficiently large value of "(p;i) and finds a local maximum of the corresponding bilinear problem (21) using the MCP. If the stopping criteria is not satisfied then it updates the value of " to ˛", updates the bilinear problem (21), and employs the MCP to find a better solution. Similar to the fixed charge network flow problem, the choice of ˛ has a direct influence on the CPU time of the algorithm and the quality of the returned solution. The running time of the algorithm and the quality of the solution for the different values of ˛ are studied in [15]. In addition to ˛, one has to find a proper initial value for the parameter "(p;i) . Ideally, it should be equal to the maximum profit that can be generated by producing only product p at time i. However, it requires solving a linear problem for each pair (p; i) 2 P  , which is computationally expensive. On the other hand, it is not necessary to find an exact solution of those LPs, and one might consider a heuristic procedure which provides a quality solution within a reasonable time. One of such procedures is discussed in [15]. References 1. Barr R, Glover F, Klingman D (1981) A New Optimization Method for Large Scale Fixed Charge Transportation Problems. Oper Res 29:448–463 2. Cabot A, Erenguc S (1984) Some Branch-and-Bound Procedures for Fixed-Cost Transportation Problems. Nav Res Logist Q 31:145–154 3. Cooper L, Drebes C (1967) An Approximate Solution Method for the Fixed Charge Problem. Nav Res Logist Q 14:101–113

4. Diaby M (1991) Successive Linear Approximation Procedure for Generalized Fixed-Charge Transportation Problem. J Oper Res Soc 42:991–1001 5. Gray P (1971) Exact Solution for the Fixed-Charge Transportation Problem. Oper Res 19:1529–1538 6. Guisewite G, Pardalos P (1990) Minimum concave-cost network flow problems: applications, complexity, and algorithms. Ann Oper Res 25:75–100 7. Kennington J, Unger V (1976) A New Branch-and-Bound Algorithm for the Fixed Charge Transportation Problem. Manag Sci 22:1116–1126 8. Khang D, Fujiwara O (1991) Approximate Solution of Capacitated Fixed-Charge Minimum Cost Network Flow Problems. Netw 21:689–704 9. Kim D, Pardalos P (1999) A Solution Approach to the Fixed Charge Network Flow Problem Using a Dynamic Slope Scaling Procedure. Oper Res Lett 24:195–203 10. Kim D, Pardalos P (2000) Dynamic Slope Scaling and Trust Interval Techniques for Solving Concave Piecewise Linear Network Flow Problems. Netw 35:216–222 11. Konno H (1976) A Cutting Plane Algorithm for Solving Bilinear Programs. Math Program 11:14–27 12. Kuhn H, Baumol W (1962) An Approximate Algorithm for the Fixed Charge Transportation Problem. Nav Res Logist Q 9:1–15 13. Nahapetyan A, Pardalos P (2007) A Bilinear Relaxation Based Algorithm for Concave Piecewise Linear Network Flow Problems. J Ind Manag Optim 3:71–85 14. Nahapetyan A, Pardalos P (2008) Adaptive Dynamic Cost Updating Procedure for Solving Fixed Charge Network Flow Problems. Comput Optim Appl 39:37–50. doi:10. 1007/s10589-007-9060-x 15. Nahapetyan A, Pardalos P (2008) A Bilinear Reduction Based Algorithm for Solving Capacitated Multi-Item Dynamic Pricing Problems. Comput Oper Res J 35:1601–1612. doi:10.1016/j.cor.2006.09.003 16. Palekar U, Karwan M, Zionts S (1990) A Branch-and-Bound Method for Fixed Charge Transportation Problem. Manag Sci 36:1092–1105

Bi-Objective Assignment Problem JACQUES TEGHEM Lab. Math. & Operational Research Fac., Polytechn. Mons, Mons, Belgium MSC2000: 90C35, 90C10 Article Outline Keywords Direct Methods

Bi-Objective Assignment Problem

Two-Phase Methods First Step Second Step

Heuristic Methods Preliminaries Determination of PE( (l) ), l = 1, . . . , L Generation of E (P) Concluding Remarks

b

See also References Keywords Multi-objective programming; Combinatorial optimization; Assignment Until recently (1998), multi-objective combinatorial optimization (MOCO) did not receive much attention in spite of its potential applications. The reason is probably due to specific difficulties of MOCO models as pointed out in  Multi-objective combinatorial optimization. Here we consider a particular bi-objective MOCO problem, the assignment problem (AP). This is a basic well-known combinatorial optimization problem, important for applications and as a subproblem of more complicated ones, like the transportation problem, distribution problem or traveling salesman problem. Moreover, its mathematical structure is very simple and there exist efficient polynomial algorithms to solve it in the single objective case, like the Hungarian method. In a bi-objective framework, the assignment problem can be formulated as: 8 n n X X ˆ ˆ 0 0 ˆ min z (X) D c (k) ˆ k i j xi j; ˆ ˆ ˆ iD1 jD1 ˆ ˆ ˆ ˆ ˆ k D 1; 2; ˆ ˆ ˆ n < X x i j D 1; i D 1; : : : ; n; (P) ˆ ˆ jD1 ˆ ˆ ˆ n ˆ X ˆ ˆ ˆ x i j D 1; j D 1; : : : ; n; ˆ ˆ ˆ ˆ iD1 ˆ ˆ : x 2 f0; 1g ij

where c kij are nonnegative integers and X = (x11 , . . . , xnn ). Our aim is to generate the set of efficient solutions E(P). It is important to stress that the distinction between the supported efficient solutions (belonging to SE (P)), i. e. those which are optimal solutions

B

of the single objective problem obtained by a linear aggregation of the objectives, and the nonsupported efficient solutions (belonging to NSE(P) = E(P)\SE(P)) (see  Multi-objective integer linear programming) is still necessary even if the constraints of the problem satisfy the so-called ‘totally unimodular’ or ‘integrality’ property: when this property is verified, the integrality constraints of the single objective problem can be relaxed without any deterioration of the objective function, i. e. the optimal values of the variables are integer even if only the linear relaxation of the problem is solved. It is well known that the single objective assignment problem satisfies this integrality property, and thus this is true for the problem (see  Multi-objective combinatorial optimization): 8 ˆ ˆ ˆmin z n (X) D 1 z1 (X) C 2 z2 (X) ˆ ˆ X ˆ ˆ ˆ x i j D 1; i D 1; : : : ; n; ˆ ˆ ˆ ˆ jD1 < n X (P ) ˆ x i j D 1; j D 1; : : : ; n; ˆ ˆ ˆ ˆ iD1 ˆ ˆ ˆ ˆ x i j 2 f0; 1g ˆ ˆ ˆ : 1  0; 2  0: Nevertheless, in the multi-objective framework, there exist nonsupported efficient solutions, as indicated by the following didactic example: 0 1 5 1 4 7 B 6 2 2 6C C C (1) D B @ 2 8 4 4A ; 3 5 7 1 0 1 3 6 4 2 B 1 3 8 3C C C (2) D B @ 5 2 2 3A : 4 2 3 5 The values of the feasible solutions are represented in the objective space in Fig. 1 There are four supported efficient solutions, corresponding to points Z1 , Z2 , Z3 and Z4 ; two nonsupported efficient solutions corresponding to points Z5 and Z6 ; the eighteen other solutions are nonefficient. Remark 1 In [7], D.J. White analyzes a particular case of problem (P) corresponding to c (k) i j D c i j ı jk

289

290

B

Bi-Objective Assignment Problem

generacy of the assignment problem are not taken into account in [1]. Two-Phase Methods The principle of this approach, and the first phase designed to generate SE(P), are described in  Multi-objective combinatorial optimization; by complementary, we analyse here the second phase [3]. The purpose is to examine each triangle MZr Zs determined by two successive solutions X r and X s of SE(P) (see Fig. 2) and to determine the possible nonsupported solutions whose image lies inside this triangle. We note that z (X) D 1 z1 (X) C 2 z2 (X) Bi-Objective Assignment Problem, Figure 1 The feasible points in the (z1 , z2 )-space for the didactic example

where ( ı jk D

1 if j D k; 0 if j ¤ k:

For this particular problem, he proves that E(P) = SE(P). We consider the problem to generate E(P) and (see  Multi-objective combinatorial optimization) we can distinguish three methodologies: direct methods; twophase methods and heuristic methods. Direct Methods In [1], the authors propose a theoretical enumerative procedure to generate E(P) in the order of increasing values of z1 : at each step they consider the admissible edges incident at the current basis and among the set of possible new bases, they selected the one with the best value of z1 : they affirm that this basis corresponds to a new efficient solution. As proved by the example described above, this procedure appears false: for instance from point Z5 = (16, 11), corresponding to the solution x14 = x22 = x33 = x41 = 1, it is impossible to obtain by an unique change of basis the following point Z6 = (19, 10), corresponding to the solution x13 = x21 = x34 = x42 = 1. Moreover the real difficulties induced by the high de-

(1) with 1 = z2r  z2s and 2 = z1s  z1r and c( ) i j = 1 c i j

+ 2 c(2) ij . In the first phase, the objective function z (X) has been optimized by the Hungarian method giving  e z D 1 z1r C 2 z2r D 1 z1s C 2 z2s , the optimal value of z (X); ( )  the optimal value of the reduced cost c( ) i j D ci j  (u i C v j ), where ui and vj are the dual variables associated respectively to constraints i and j of problem (P ).  0 and e xi j D 1 ) At optimality, we have c( ) ij c ( ) i j D 0. First Step

n o We consider L D x i j : c ( ) i j > 0 . To generate nonsupported efficient solution in triangle MZr Zs , each variable xij 2 L is candidate to be fixed to 1. Nevertheless, a variable can be eliminated if we are sure that the reoptimization of problem (P) will provide a dominated point in the objective space. If xij 2 L is set to 1, z is given by a lower bound lij of the increase ofe  ( ) ( ) C min c ( ) l i j D c ( ) ij i r j r ; min c i r k C min c k j r ; k¤ j

k¤i

 ( ) ( ) c ( ) ; min c C min c is js is k k js ; k¤ j

k¤i

where the indices ir and jr (is and js ) are such that in the solution X r (respectively, X s ) we have x i r j D x i j r D 1;

(x i s j D x i j s D 1):

Bi-Objective Assignment Problem

Bi-Objective Assignment Problem, Figure 2 Test 1

Effectively, to re-optimize problem (P ) with xij = 1, in regard with its optimal solution X r (respectively, X s ), it is necessary to determine, at least, a new assignment in the line ir (respectively, is ) and in the column jr (respectively, js ). But clearly, to be inside the triangle MZr Zs , we must have (see Fig. 2) e z C l i j < 1 z1s C 2 z2r : Consequently, we obtain the following fathoming test:  (Test 1): xij 2 L can be eliminated if e z + lij  1 z1s + 2 z2r or, equivalently, if lij  1 2 . So in this first step, the lower bound lij is determined for all xij 2 L; the list is ordered by increasing values of lij . Only the variables not eliminated by test 1 are kept. Problem (P ) is re-optimized successively for each noneliminated variable; let us note that only one iteration of the Hungarian method is needed. After the optimization, the solution is eliminated if its image in the objective space is located outside the triangle MZr Zs . Otherwise, a nondominated solution is obtained and put in a list NSrs ; at this time, the second step is applied. Second Step When nondominated points Z1 , . . . , Zm 2 NSrs are found inside the triangle MZr Zs , then test 1 can be improved. Effectively (see Fig. 3), in this test the value 1 z1s C 2 z2r can be replaced by the lower value   ( ) max 1 z1;iC1 C 2 z2;i ; iDo;:::;m

where Zo Zr , Zm + 1 Zs , with  1 z1, m + 1 + 2 z2, 0 .

B

Bi-Objective Assignment Problem, Figure 3 Test 2

The new value corresponds to an updated upper bound of z (X) for nondominated points. More variables of L can be eliminated with the new test  (Test 2): xij 2 L can be eliminated if   e z C l i j  max 1 z1;iC1 C 2 z2;i : iDo;:::;m

Each time a new nondominated point is obtained, the list NSrs and the test 2 are updated. The procedure stops when all the xij 2 L have been either eliminated or analyzed. At this moment the list NSrs contains the nonsupported solutions corresponding to the triangle MZr Zs . When each triangle have been examined N SE(P) D [rs N Srs : Numerical results are given in [3]. Heuristic Methods As described in  Multi-objective combinatorial optimization, the MOSA method is an adaptation of the simulated annealing heuristic procedure to a multiobjective framework. Its aim is to generate a good approximation, denoted E(P), of E(P) and the procedure is valid for any number K  2 of objectives. Similarly to a single objective heuristic in which a potentially optimal solution emerges, in the MOSA method the set E(P) will contain potentially efficient solutions.

b

b

Preliminaries  A wide diversified set of weights is considered: different weight vectors (l) , l 2 L, are generated where (l) = ((lk ) )k = 1, . . . , K with (lk ) > 0, 8k and K X kD1

(lk ) D 1;

8l 2 L:

291

292

B

Bi-Objective Assignment Problem

 A scalarizing function s(z, ) is chosen, the effect of this choice on the procedure is small due to the stochastic character of the method. The weighted sum is very well known and it is the easiest scalarizing function: s(z; ) D

K X

k z k :

Nc D 0:

Y

X nC1

Else we accept the new solution with a certain probability p = exp( s/T n ): 8 < p Y; Nc D 0; X nC1 1p : X ; N D N C 1: n

kD1

 The three classic parameters of a simulated annealing procedure are initialized – T 0 : initial temperature (or alternatively an initial acceptance probability P0 ); – ˛ (< 1): the cooling factor; – N step : the length of temperature step in the cooling schedule; and the two stopping criteria are fixed: – T stop : the final temperature; – N stop : the maximum number of iterations without improvement  A neighborhood V(X) of feasible solutions in the vicinity of X is defined. This definition is problem dependent. It is particularly easy to define V(X) in the case of the assignment problem: if X is characterized by x i j i = 1, i = 1, . . . , n, then V(X) contains all the solutions Y satisfying y i j i D 1;

If  s  0, we accept the new solution:

i 2 f1; : : : ; ng n fa; bg;

y a j b D yb j a D 1; where a, b are chosen randomly in {1, . . . , n}. Determination of PE((l) ), l = 1, . . . , L For each l 2 L the following procedure is applied to determine a list PE((l) ) of potentially efficient solutions. a) (Initialization): – Draw at random an initial solution X 0 . – Evaluate zk (X 0 ), 8k. – PE((l) ) = {X 0 }; N c = n = 0. b) (Iteration n): – Draw at random a solution Y 2 V(X n ) – evaluate zk (Y) and determine

c

c

– If necessary, update the list PE((l) ) in regard to the solution Y. – n n+1 mod Nstep ) D 0

IF n(

THEN Tn D ˛Tn1 ; ELSE Tn D Tn1 : IF Nc D Nstop OR T < Tstop THEN stop ELSE iterate:

b

Generation of E(P) Because of the use of a scalarizing function, a given set of weights (l) induces a privileged direction on the efficient frontier. The procedure generates only a good subset of potentially efficient solutions in that direction. Nevertheless, it is possible to obtain solutions which are not in this direction, because of the large exploration of D at high temperature; these solutions are often dominated by some solutions generated with other weight sets. To obtain a good approximation E(P) to E(P) it is thus necessary to filter the set

b

(l ) [jLj l D1 PE( )

by pairwise comparisons to remove the dominated solutions. This filtering procedure is denoted by ^ such that

b

(l ) E(P) D ^jLj l D1 PE( ):

A great number of experiments is required to determine the number L of set of weights sufficient to give a good approximation of the whole efficient frontier. Concluding Remarks

z k D z k (Y)  z k (X n );

8k:

– Calculate s D s(z(Y); )  s(z(X n ); ):

Details and numerical results are given in [3] and [5]. Let us add that it is easy to adapt the MOSA method in an interactive way [2]; a special real case study of an assignment problem is treated in this manner in [6].

Biquadratic Assignment Problem

See also  Assignment and Matching  Assignment Methods in Clustering  Communication Network Assignment Problem  Decision Support Systems with Multiple Criteria  Estimating Data for Multicriteria Decision Making Problems: Optimization Techniques  Financial Applications of Multicriteria Analysis  Frequency Assignment Problem  Fuzzy Multi-Objective Linear Programming  Maximum Partition Matching  Multicriteria Sorting Methods  Multi-Objective Combinatorial Optimization  Multi-Objective Integer Linear Programming  Multi-Objective Optimization and Decision Support Systems  Multi-Objective Optimization: Interaction of Design and Control  Multi-Objective Optimization: Interactive Methods for Preference Value Functions  Multi-Objective Optimization: Lagrange Duality  Multi-Objective Optimization: Pareto Optimal Solutions, Properties  Multiple Objective Programming Support  Outranking Methods  Portfolio Selection and Multicriteria Analysis  Preference Disaggregation  Preference Disaggregation Approach: Basic Features, Examples From Financial Decision Making  Preference Modeling  Quadratic Assignment Problem

References 1. Malhotra R, Bhatia HL, Puri MC (1982) Bicriteria assignment problem. Oper Res 19(2):84–96 2. Teghem J, Tuyttens D, Ulungu EL (2000) An interactive heuristic method for multi-objective combinatorial optimization. Comput Oper Res 27:621–624 3. Tuyttens D, Teghem J, Fortemps Ph, Van Nieuwenhuyse K (1997) Performance of the MOSA method for the bicriteria assignment problem. Techn Report Fac Polytechn Mons (to appear in J. Heuristics) 4. Ulungu EL, Teghem J (1994) Multi-objective combinatorial optimization problems: A survey. J Multi-Criteria Decision Anal 3:83–104

B

5. Ulungu EL, Teghem J, Fortemps Ph, Tuyttens D (1999) MOSA method: A tool for solving MOCO problems. J Multi-Criteria Decision Anal 8:221–236 6. Ulungu EL, Teghem J, Ost Ch (1998) Efficiency of interactive multi-objective simulated annealing through a case study. J Oper Res Soc 49:1044–1050 7. White DJ (1984) A special multi-objective assignment problem. J Oper Res Soc 35(8):759–767

Biquadratic Assignment Problem BiQAP LEONIDAS PITSOULIS Princeton University, Princeton, USA MSC2000: 90C27, 90C11, 90C08 Article Outline Keywords See also References Keywords Optimization The biquadratic assignment problem was first introduced by R.E. Burkard, E. Çela and B. Klinz [2], as a nonlinear assignment problem that has applications in very large scale integrated (VLSI) circuit design. Given two fourth-dimensional arrays A = (aijkl ) and B = (bmpst ) with n4 elements each, the nonlinear integer programming formulation of the BiQAP is 8 X X ˆ min a i jk l b m pst x i m x j p x ks x l t ˆ ˆ ˆ ˆ i; j;k;l m;p;s;t ˆ ˆ n ˆ X ˆ ˆ ˆ 0) Probl em

[m; 1)

Epigraph of f f

tm t m Find m

Point

Find m Natural domain (in Rn ) line segment (n = 1) hexagon (n = 2) rhombic dodecahedron (n = 3) ::: Initial bracket (in Rn+1 )

Single interval Interval halving

Union of (n + 1)-dimensional simplexes Bracket reduction Reduction of simplexes, followed by elimination Convergence

\all brackets = fmg Bracket size halves

\all brackets = {all global minima} Bracket depth reduces linearly

in [m, 1) then we retain the lower interval whereas if the midpoint is not in [m, 1) we retain the upper interval. It is this idea that has been generalized to higher dimensions to give the algorithm, detailed here, that has been termed in the literature multidimensional bisection. It can be shown (see [7]) that the analogue in Rn+1 of an upper semi-infinite interval in R is the epigraph (everything above and including the graph) of a Lipschitz continuous function. Multidimensional bisection finds the set of global minima of a Lipschitz continuous function f of n variables over a compact domain, in a manner analogous to the bisection method. At any stage in the iteration the bracket is a union of similar simplexes

in Rn+1 , with the initial bracket a single simplex. (A simplex is a convex hull of affinely independent points, so a triangle, a tetrahedron and so on.) In the raw version of the algorithm the depth of the bracket decreases linearly and the infinite intersection of all brackets is the set of global minima of the graph of the function. The algorithm works thanks to two simple facts and a very convenient piece of geometry. First, however, we note a property of a Lipschitz continuous function with Lipschitz constant M: if x 2 Rn lies in the domain of the function and (x, y) (with y 2 R) lies in the epigraph of the function, then (x, y) + C lies in the epigraph, where C is an upright spherically based cone of slope M, with apex at the origin.

295

296

B

Bisection Global Optimization Methods

Bisection Global Optimization Methods, Figure 1 A standard simplex and the three smaller standard simplexes resulting from reduction; when (x, f (x))   is removed from the standard simplex three similar standard simplexes remain

Now for the two simple facts: if we evaluate the function f at any point in the domain, then no point higher than (x, f (x)) can be the global minimum on the graph of f and no point in the interior of a (x, f (x))  C can be the global minimum. Informally, this means that every evaluation of f lets us slice away an upper half space and an upside down ice-cream cone, with apex at (x, f (x)), from the space Rn+1 ; we are sure the global optima are not there. These two operations coalesce in the familiar bisection method. Now for the convenient geometry, which comes to light as soon as we attempt to generalise the bisection method. Spherically based cones are ideal to use, but hard to keep track of efficiently [3], so we use a simplicial approximation to the spherical base of the cone to make the bookkeeping easy. Such a simplex-based cone, , has a cap which we call a standard simplex; one is shown as the large simplex in Fig. 1, for the case when n = 2. It fits snugly inside C, so the sloping edges have slope M. If we know that the global optimum lies in this simplex bracket and evaluate f at x, then we can remove (x, f (x))   from the space. Conveniently, this leaves three similar standard simplexes whose union must contain the global minima, as shown in Fig. 1. This process is termed reduction of the simplex.

What does a typical iteration of the algorithm do? At the start of each iteration the global minima are held in a multidimensional bracket, a union of similar standard simplexes. We denote this set of simplexes, or system, by S. An iteration consists of reducing some (possibly all) of these simplexes, followed by elimination, or retaining the portions of the bracket at the level of, or below, the current lowest function evaluation. For this reason an iteration can be thought of informally as ‘chop and drop’, or formally as ‘reduce and eliminate’. How do we start off? The algorithm operates on certain natural domains which we must assume contain a global minimizer (just as we begin in the familiar bisection method by containing the point of interest in an interval). For functions of one variable a natural domain is an interval, for functions of two variables it is a hexagon, while for functions of three variables the natural domain is a rhombic dodecahedron (the honeycomb cell). For higher dimensions the pattern continues; in each dimension the natural domains are capable of tiling the space. By means of n + 1 function evaluations at selected vertices of the natural domain it is possible to bracket the global optima over the natural domain in an initial single standard simplex, termed the initial system. In brief, given a Lipschitz continuous function f on a standard domain, the algorithm can be summarised as:

1 2 3

Set i = 0 and form the initial system S0 . Form S i+1 , by applying reduction and then elimination to the system S i . If a stopping criterion is satisfied (such as that the variation of the system is less than a preassigned amount), then stop. Otherwise, increment i and return to Step 2.

Multidimensional bisection

By the variation of the system is meant the height from top to bottom of the current set of simplexes. The following example illustrates the course of a run of multidimensional bisection. 2 Take f (x1 , x2 ) = ex 1 sin x1 + |x2 |, which has a global minimum on its graph at (0.653273, 0, 0.396653). There are also local minima along the

Boolean and fuzzy relationsBoolean and Fuzzy Relations

Bisection Global Optimization Methods, Table 2 Example of a run of multidimensional bisection. Note how the number of simplexes in the system decreases in the 8th iteration; this corresponds to the elimination of simplexes around local, and nonglobal, minima

Iter Simpl. Variat. in the system 0 1 33:300 1 3 20:000 2 9 9:892 5 108 1:959 7 264 0.504 8 39 0:257 15 369 0:007 18 924 0:001 19 1287 0:000

Best point to date (10:000, 10:000, 10:000) (10:000, 6:667, 6:667) (1:340, 1:667, 1:505) (25:637, 0:185, 0:185) (0.839, 0.074, 0:294) (0:649, 0:036, 0:361) (0:669, 0:000, 0:396) (0:653, 0:000, 0:397) (0:651, 0:000, 0:397)

x1 -axis. We use as our standard domain the regular hexagon with center at (10, 10) and radius 20, and use M = 1. Table 2 provides snapshots of the progress of the algorithm to convergence; it stops when the variation is less than 0.001. We carry the best point to date, shown in the final column of the table. In this example we reduced all simplexes in the system at each iteration. This ensures that the infinite intersection of the brackets is the set of global minima. In [6] it is shown that, under certain conditions, the optimal one-step strategy is to reduce only the deepest simplex in each iteration. With this reduction and n = 1 multidimensional bisection is precisely the Piyavskii– Shubert algorithm [4,5]. Raw multidimensional bisection can require a large number of function evaluations, but can be economical with computer time (see [2]). As described so far, the method does not use the full power of the spherical cone, rather a simplicial approximation, and this approximation rapidly worsens as the dimension increases. Fortunately, much of the spherical power can be utilized very simply, by raising the function evaluation to an effective height. This is trivial to implement and has been called spherical reduction [6]. Reduction, as described so far, removes material only from a single simplex, whose apex determines the evaluation point. Simplexes overlap when n  2, and it is possible to re-

B

move material from many simplexes rather than just one. This is harder to implement, but has been carried out in [1] where it is termed complete reduction. The algorithm operates more efficiently when such improved reduction methods are used. Multidimensional bisection collapses to bisection with n = 0 when we use a primitive reduction process, one which depends only on whether the point in Rn+1 considered lies in the epigraph of f ; this is described in [7]. A summary comparison of bisection and multidimensional bisection is given in Table 1. See also  ˛BB Algorithm References 1. Baoping Zhang, Wood GR, Baritompa WP (1993) Multidimensional bisection: the performance and the context. J Global Optim 3:337–358 2. Horst R, Pardalos PM (eds) (1995) Handbook of Global Optimization. Kluwer, Dordrecht 3. Mladineo RG (1986) An algorithm for finding the global maximum of a multimodal, multivariate function. Math Program 34:188–200 4. Piyavskii SA (1972) An algorithm for finding the absolute extremum of a function. USSR Comput Math Math Phys 12: 57–67 5. Shubert BO (1972) A sequential method seeking the global maximum of a function. SIAM J Numer Anal 9:379–388 6. Wood GR (1991) Multidimensional bisection and global optimisation. Comput Math Appl 21:161–172 7. Wood GR (1992) The bisection method in higher dimensions. Math Program 55:319–337

Boolean and Fuzzy Relations LADISLAV J. KOHOUT Department Computer Sci., Florida State University, Tallahassee, USA MSC2000: 03E72, 03B52, 47S40, 68T27, 68T35, 68Uxx, 91B06, 90Bxx, 91Axx, 92C60 Article Outline Keywords Boolean Relations Propositional Form Heterogeneous and Homogeneous Relations

297

298

B

Boolean and Fuzzy Relations

The Satisfaction Set The Extensionality Convention The Digraph Representation Foresets and Aftersets of Relations Matrix Representation

Operations and Inclusions in R(A Ý B) Unary Operations

Binary Operations on Successive Relations Matrix Formulation of the Binary Operations Non-Associative Products of Relations

Characterization of Special Properties of Relations Between Two Sets Relations on a Single Set: Special Properties Partitions IN and ON a Set Tolerances and Overlapping Classes Hierarchies in and on a Set: Local and Global Orders and Pre-orders Fuzzy Relations Definitions

Operations and Inclusion on RF (X Ý Y) Fuzzy Relations with Min, Max Connectives Fuzzy Relations Based on Łukasiewicz Connectives Fuzzy Relations With t-Norms and Co-Norms

Products: RF (X Ý Y) × RF (Y Ý Z) ! RF (X Ý Z) N-ary Relations

Special Properties of Fuzzy Relations Alpha-cuts of Fuzzy Relations Fuzzy Partitions, Fuzzy Clusters and Fuzzy Hierarchies

Closures and Interiors with Special Properties Applications of Relational Methods in Engineering, Medicine and Science Brief Review of Theoretical Development Basic Books and Bibliographies See also References Keywords Fuzzy relations; Local relational properties; Closures; Interiors; Pre-order; Tolerances; Equivalences; BK-products; Relational compositions; Nonassociative products; Generalized morphism; Universal properties of relations; n-ary relation; Scientific applications; Medicine; Psychology; Engineering applications; Artificial intelligence; Value analysis; Decision theory The conventional nonfuzzy relations using the classical two-valued Boolean logic connectives for defining their

operations will be called crisp. The extensions that replace the 2-valued Boolean logic connectives by manyvalued logic connectives will be called fuzzy. A unified approach of relations is provided here, so that the Boolean (crisp, nonfuzzy) relations and sets are just special cases of fuzzy relational structures. The first part of this entry on nonfuzzy relations can be used as reference independently, without any knowledge of fuzzy sets. The second part on fuzzy structures, however, refers frequently to the first part. This is so because most formulas in the matrix notation carry over to the many-valued logics based extensions. In order to make this material useful not only theoretically but also in practical applications, we have paid special attention to the form in which the material is presented. There are seven distinguishing features of our approach that facilitate the unification of crisp and fuzzy relations and enhance their practical applicability: 1) Relations in their predicate forms are distinguished from their satisfaction sets. 2) Foresets and aftersets of relations are used in addition to relational predicates. 3) Relational properties are not only global but also local (important for applications). 4) Nonassociative BK-products are introduced and used both in definitions of relational properties and in computations. 5) The unified treatment of computational algorithms by means of matrix notation is used which is equally applicable to both crisp and fuzzy relations. 6) The theory unifying crisp and fuzzy relations makes it possible to represent a whole finite nested family of crisp relations with special properties as a single cutworthy fuzzy relation for the purpose of computation. After completing the computations, the resulting fuzzy relation is again converted by ˛-cuts to a nested family of crisp relations, thus increasing the computing performance considerably. 7) Homomorphisms between relations are extended from mappings used in the literature to general relations. This yields generalized morphisms important for practical solving of relational inequalities and equations. These features were first introduced in 1977 by W. Bandler and L.J. Kohout [1] and extensively developed over the years both in theory and practical applications [7,30,52].

Boolean and Fuzzy Relations

Boolean Relations Propositional Form A binary relation (from A to B) is given by an open predicate __P__ with two empty slots; when the first is filled with the name a of an element of A and the second with the name b of an element of B, there results a proposition, which is either true or false. If aPb is true, we write aRP b and say that ‘a is RP -related to b’. If a P b is false, we write a : RP b and say that ‘a is not RP related to b’, etc. When it is unnecessary to emphasize the propositional form the subscript is dropped in RP , writing: R, a Rb, a : Rb, respectively. Heterogeneous and Homogeneous Relations The lattice of all binary (two-place, 2-argument) relations from A to B is denoted by R(A Ý B). Relations of this kind are usually called heterogeneous. Nothing forbids the set B to be the same as A, in which case we speak of relations ‘within a set’ or ‘in a set’, or ‘on a set’, and call these homogeneous. Relations from A to B can always be considered as relations within A [ B, but so ‘homogenized’ relations may lose some valuable properties (discussed below), when so viewed. For this reason, we do not attempt to assimilate relations between distinct sets to those within a set. The Satisfaction Set The satisfaction set or representative set or extension set of a relation R 2 R(A ! B) is the set of all those pairs (a, b) 2 A × B for which it holds:

B

the same pairs: RS = RS 0 ) RP = RP 0 . In the set theory, this appears as the axiom of extensionality. This convention is not universally convenient; it is perhaps partly responsible for delays in the application of relation theory in the engineering, social and economical sciences and elsewhere. Once the extensionality convention has been adopted, it becomes a matter of indifference, or mere convenience, whether a relation is given by an open predicate or by the specification of its satisfaction set. There is a one-to-one correspondence between the subsets RS of A × B and the (distinguishable) relations RP in R(A ! B). Since RS and RP now uniquely determine each other, the current fashion for set-theoretical parsimony suggests that they be identified. This view is common in the literature, which often defines relations as being satisfaction sets. We, however, maintain the distinction in principle.

Example of the failure of the extensionality convention R , Q> 2 R(A Ý B); A = {1, 6, 8}, B = { 0, 5, 7}. Predicates: P1 := ‘__  __’ (‘__ is greater than or equal to __’) P2 := ‘__>__’ (‘__ is greater than __’) Relations in their Predicate Form: R = {1  0, 8  0, 8  5, 8  7, 6  0, 6  5 } Q> = {1 > 0, 8 > 0, 8 > 5, 8 > 7, 6 > 0, 6 > 5 } The Satisfaction Sets: RS = QS = {(1, 0), (8, 0), (8, 5), (8, 7), (6, 0), (6, 5)}. By the extensionality convention: RS = QS ) R = Q> . So, R should be the same relation as Q. This is not the case, because the predicates are not equivalent: (8x) x P1 x is true, but (8x) x P2 x is false. Hence the extensionality convention fails for these relations.

R S D f(a; b) 2 A  B : aRbg : Clearly RS is a subset of the Cartesian product A × B. Knowing RP , we know RS ; knowing RS , we know everything about RP except the wording of its ‘name’ __P__. The Extensionality Convention This convention says that, regardless of their propositional wordings, two relations should be regarded as the same if they hold, or fail to hold between exactly

The Digraph Representation When B = A, so that we are dealing with a relation within a set, we may use the digraph RD to represent it; in which an arrow goes from a to a0 if and only if a R a0 . Any relation within a finite or countably infinite set can, in principle, be shown in a digraph; conversely, every digraph (with unlabelled arrows) represents a relation in the set of its vertices. Interesting properties of relations are often derived from digraphical considerations; there is a whole literature on digraphs.

299

300

B

Boolean and Fuzzy Relations

Foresets and Aftersets of Relations

Operations and Inclusions in R(A Ý B)

These are defined for any relation R from A to B.  The afterset of a 2 A is

There are a considerable number of natural and important operations. We begin with unary operations and then proceed to several kinds of binary ones.

aR D fb 2 B : aRbg :  The foreset of b 2 B is Rb D fa 2 A: aRbg : Mnemonically and semantically, an afterset consists of all those elements which can correctly be written after a given element, a foreset of those which can correctly be written before it. An afterset or foreset may well be empty. Clearly, b 2 a R if and only if a 2 R b. A relation is completely known if all its foresets or all its aftersets are known. Matrix Representation Very important computationally and even conceptually, as well as being a useful visual aid, is the incidence matrix RM of a relation R. This arises from a table in which the row-headings are the elements of A and the column-headings are the elements of B, so that the cells represent A × B. In the (a, b)-cell is entered 1 if a R b, and 0 if a : Rb. For visual purposes it is better to suppress the 0s, but they should be understood to be there for computational purposes.

Unary Operations The negated or complementary relation of R 2 R(A ! B) is : R 2 R(A ! B) given by a : R b if and only if it is not the case that aRb. The converse or transposed relation of R 2 R(A ! B) is R| 2 R(B ! A) given by bR> a

,

aRb:

(It is also called the inverse and is therefore often written R1 . In no algebraic sense it is an inverse, in general.) Both operators | and : are involutory, that is, when applied twice they give the original object: (R| )| = R, : (: R) = R. They commute with each other: : (R| ) = (: R)| , so that the parentheses may be omitted safely. One can write: : R| . Definition 1 (Binary operators and a binary relation on R(A ! B))  The intersection or meet or AND-ing: a(R u R0 )b

,

aRb and aR0 b:

 The union or join or OR-ing: a(R t R0 )b

Example: The matrix representation RM and the afterset representation of a relation R

aRb or aR0 b:

 A relation R ‘is contained in’ (is a subrelation of) a relation R0 , and R0 ‘contains’ (is a superrelation of) R0 , R v R0 : R v R0

Clearly there is a one-to-one correspondence (bijection) between distinct tables and distinct relations, and, as soon as there has been agreement on the names and ordering of the row and column headings, between either of these and distinct matrices of size |A| × |B| with entries from {0, 1}. Furthermore, the afterset ai R is in one-to-one correspondence with the nonzero entries of the ith row of RM ; the foreset Rbj is in one-to-one correspondence with the nonzero entries of the jth column of RM .

,

,

,

(8a)(8b)(aRb ! aR0 b)

R u R0 D R

,

R t R0 D R0 ;

where ! is the Boolean implication operator. Definition 2 The relative complement of R with respect to R0 , or difference between R0 and R, is R0 \ R, given by: a(R0 n R)b

,

aR0 b but a:Rb;

that is, by R0 \ R = R0 u : R.

Boolean and Fuzzy Relations

Binary Operations on Successive Relations Definition 3 (Circle and square products) Where R 2 R(A ! B) and S 2 R(B ! C), the following compositions give a relation in R(A ! C):  The circle product or round composition is ı, given by aR ı Sc , aR \ Sc 6D ;.  The square composition or square product is , given by aR  Sc , aR = Sc. The circle product is the usual one, to be found throughout the literature going back at least to the nineteenth century. The square product is a more recent (1977) innovation. The  product belongs to the family of products sometimes called BK-products. Further interesting kinds of BK-products and their uses are discussed in the sequel. Proposition 4 (Properties of -product) 1) (R  S) u (R0  S) v (R u R0 )  S v (R t R0 )  S v (R  S) t (R0  S); 2) (R  S)1 = S1  R1 ; 3) R  S = : R  : S; 4) the square product is not associative. Matrix Formulation of the Binary Operations All of the binary operations on relations have a convenient formulation in matrix terms – using the matrix operations given in Proposition 6. The matrix operations use in their definitions standard Boolean logic connectives for crisp relations. By replacing these by the connectives of suitable many-valued logics, all the formulas easily generalize to fuzzy relations. Thus matrix formulation of binary operations and compositions unifies computationally crisp and fuzzy relations.

B

Proposition 6 (Matrix notation) 1) (R u S)ij = Rij ^ Sij ; 2) (R t S)ij = Rij _ Sij ; W 3) (R ı S) i j D k (R i k ^ S k j ); V 4) (R  S) i k D j (R i j _ S jk ); V 5) (RS) i k D j (R i j S jk ); 6) (R1 × R2 ) i 1 i 2 j 1 j 2 = (R1 ) i 1 j 1 ^ (R2 ) i 2 j 2 . Non-Associative Products of Relations Definition 7 (Triangle products)  Subproduct G: x(R G S) z , xR Sz;  Superproduct F: x(R F S) z , xR  Sz. The matrix formulation of G and F products uses the Boolean connectives !, , ˚ on the set B2 = {0, 1} given by

Proposition 8 (Logic notation for G and F) V  (R C S) i k D j (R i j ! S jk ); V S jk ).  (R B S) i k D j (R i j Only the conventional ı -product is associative. The product is not associative [2]. Proposition 9 The following mixed pseudoassociativities hold for the triangle products, with Q 2 B (W Ý X) and the triple products in B (W Ý Z):  Q G (R F S) = (Q G R) F S;  Q G (R G S) = (Q ı R) G S;  Q F (R F S) = Q F (R ı S).

Definition 5 The Boolean connectives ^, _, $, on the set B2 = {0, 1} are given by:

Characterization of Special Properties of Relations Between Two Sets

For a pair (x1 , x2 ) of elements from B2 , we infix the operators: x1 ^ x2 , etc., while for a list (xk )k = 1, . . . , n or V (xk )k 2 K of elements from B2 , we write nkD1 x k or V V k2K x k or simply k x k . (Note that K can be denumerably infinite, or even greater, without spoiling the definition; no convergence problems are involved.)

Definition 10 (Special properties of a heterogeneous relation R 2 R (X Ý Y)):  R is covering if and only if (8x) 2 X (9y) 2 Y such that xRy.  R is onto if and only if (8y) 2 Y (9x) 2 X such that xRy.  R is univalent if and only if (8x) 2 X, if xRy and xRy0 then y = y0 .  R is separating if and only if (8y) 2 Y, if xRy and x0 Ry then x = x0 .

301

302

B

Boolean and Fuzzy Relations

Composed properties can be defined by combining these four basic properties. Well-known is the combination ‘covering’ and ‘univalent’ which defines functional. Other frequently used combination is ‘onto’ and ‘separating’. The self-inverse circle product is very useful in the characterization of special properties of relations between two distinct sets. Using the product, one can characterize these properties in purely relational way, without directly referring to individual elements of the relations involved. Proposition 11 (Special properties of a heterogeneous relation R 2 R (X Ý Y)):  R is covering if and only if EX v R ı R1 .  R is univalent if and only if R1 ı R v EY .  R is onto if and only if (for all) EY v R1 ı R.  R is separating if and only if R ı R1 v EX . Here EX and EY are the left and right identities, respectively. Relations on a Single Set: Special Properties The self-inverse products are a fertile source of relations on the single set X. There are certain well-known special properties which a relation may possess (or may lack), of which the most important are reflexivity, symmetry, antisymmetry, strict antisymmetry, and transitivity, together with their combinations, forming preorders (reflexive and transitive) (partial) orders (reflexive, antisymmetric and transitive), equivalences (reflexive, transitive and symmetric). Definition 12 (Special properties of binary relations from X to X)  Covering: every xi is related by R to something , 8i 2 I 9j 2 I such that Rij = 1.  Locally reflexive: if xi is related to anything, or if anything is related to xi , then xi is related to itself , 8i 2 I Rii = maxj (Rij , Rji ).  Reflexive: covering and locally reflexive , 8i 2 I Rii = 1.  Transitive 8i, j, k 2 I (xi Rxj and xj Rxk ) xi Rxk ) , R2 v R.  Symmetric: (xi Rxj ) xj Rxi ) , R| = R.  Antisymmetric: (xi Rxj and xj Rxi ) xi = xj ) , if i 6D j then min(Rij , Rji ) = 0.  Strictly antisymmetric: never both xi Rxj and xj Rxi , 8i, j 2 I min(Rij , Rji ) = 0.

Most of the properties listed above are common in the literature. Local reflexivity is worthwhile exception. It appeared in [1] and was generalized to fuzzy relations in [4], leading to new computational algorithms for both crisp and fuzzy relations [4,10]. Unfortunately, it is absent from the textbooks, yet it is extremely important in applications of relational methods to analysis of the real life data (see the notion of participant in the next two sections). Partitions IN and ON a Set A partition on a set X is a division of X into nonoverlapping (and nonempty) subsets called blocks. A partition in a set X is a partition on the subset of X [17,18] called the subset of participants. There is a one-to-one correspondence between partitions in X and local equivalences (i. e. locally reflexive, symmetric and transitive relations) in R(X Ý B). The partitions in X (so also the local equivalences in R(X Ý B)) form a lattice with ‘__is-finer-than__’ as its ordering relation. This whole subject is coextensive with classification or taxonomy, i. e., very extensive indeed. Furthermore, classification is the first step in abstraction, one of the fundamental processes in human thought. Tolerances and Overlapping Classes Some tests for tolerance and equivalence are as follows:  R ı R| is always symmetric and locally reflexive.  R ı R| is a tolerance if and only if R is covering.  R  R| is always a (local) tolerance.  R  R| v R if and only if R is reflexive.  E v R v R  R| if and only if R is an equivalence.  R  R| = R if and only if R is an equivalence.  R  R| v R ı R| if and only if R is covering. It is not always the case that one manages, or even attempts, to classify participants into nonoverlapping blocks. Local tolerance relations (i. e. locally reflexive and symmetric) lead to classes which may well overlap, where one participant may belong to more than one class. The classic case, giving its name to this kind of relation, is ‘__is-within-one milimeter-of__’. This is quite a different model from the severe partitions [80], and has been for a long time unduly neglected both in theory and applications, even when the data mutely favor it.

Boolean and Fuzzy Relations

Hierarchies in and on a Set: Local and Global Orders and Pre-orders An example of a hierarchy in a finite set X is displayed in Fig. 1. In such a hierarchy, there is a finite number of levels and there is no ambiguity in the assignment of a level to an element. The elements which appear eventually in the hierarchy are the participants; those which do not are nonparticipants; if all of X participates, then the hierarchy is on X.

B

Definition 14 Let F, R, G, S be heterogeneous relations between the sets A, B, C, D such that R 2 R(A Ý B). The conditions that (for all a 2 A, b 2 B, c 2 C, d 2 D) the expression (aFc ^ aRb ^ bGd) ! cSd we denote by FRG:S. We say that FRG:S is forward compatible, or, equivalently, that F, G are generalized morphisms. The following Bandler–Kohout compatibility theorem holds, [6]: Theorem 15 (Generalized morphisms)  FRG : S are forward compatible if and only if F | ı R ı G v S.  Formulas for computing the explicit compatibility criteria for F and G are: FRG : S are forward-compatible if and only if F v R G (G G S| ).

Every local order (i. e. locally reflexive, transitive and antisymmetric relation) from a finite set to itself establishes a hierarchy in that set, that is, can be used as the ‘precedes’ relation in the hierarchy. Conversely, given any hierarchy, its ‘__precedes__’ is a local order. The hierarchy is on X exactly when the local order is the global one. The picture of the hierarchy is called its Hasse diagram. It can always be obtained from the digraph of the local-order relation by the suppression of loops and of those arrows which directly connect nodes between which there is also a longer path. The formulas of Theorem 13 can be used for fast computational testing of the listed properties. Theorem 13 The following conditions universally characterize the transitivity, reflexivity and pre-order on R 2 R (X Ý X):  R is transitive if and only if R v R F R1 .  R is reflexive if and only if R F R1 v R.  R is a pre-order if and only if R = R F R1 . More complex relational structures are investigated by theories of homomorphisms, which can be further generalized [6].

The R’s of forward compatibility constitute a lower ideal. Similarly, the backward compatibility given by F ı S ı G| v R gives a generalized proteromorphism. It constitutes an upper ideal or filter: FRG : S are backward compatible if and only if F ı S ı G| v R if and only if S v F | G R F G. FRG : S are both-way compatible if they are both forward and backward compatible. The conventional homomorphism is a special case of both-way compatibility, where F and G are not general relations but just many-to-one mappings. The generalized morphisms of Bandler and Kohout [6] are relevant not only theoretically, but have also an important practical use in solving systems of inequalities and equations on systems of relations. For partial homomorphisms the situation becomes more complicated. In partial structures the conventional homomorphism splits into mutually related weak, strong and very strong kinds of homomorphism [5]. Fuzzy Relations Mathematical relations can contribute to investigation of properties of a large variety of structures in sciences and engineering. The power of relational analysis stems from the elegant algebraic structure of relational systems that is supplemented by the computational power of relational matrix notation. This power is further enhanced by many-valued logic based (fuzzy) extensions of the relational calculus.

303

304

B

Boolean and Fuzzy Relations

As often in mathematics, where terms are used inclusively, the crisp (nonfuzzy) sets and relations are merely special cases of fuzzy sets and relations, in which the actual degrees happen to be the extreme ones. On the theoretical side, fuzzy relations are extensions of standard nonfuzzy (crisp) relations. By replacing the usual Boolean algebra by many-valued logic algebras, one obtains extensions that contain the classical relational theory as a special case. Definitions A fuzzy set is one to which any element may belong to various degrees, rather than either not at all (degree 0) or utterly (degree 1). Similarly, a fuzzy relation is one which may hold between two elements to any degree between 0 and 1 inclusive. The sentence xi Ryj takes its value ı (xi R yj ) = Rij , from the interval [0, 1] of real numbers. In early papers on fuzzy relations R (xi , yj ) was usually written instead of Rij . The matrix notation used in the previous sections for nonfuzzy (crisp) relations is directly applicable to the fuzzy case. Thus, all the definitions of operations, compositions and products can be directly extended to the fuzzy case. Operations and Inclusion on RF (X Ý Y) Fuzzy Relations with Min, Max Connectives This has been the most common extension of relations to the fuzzy realm. Boolean ^ and _ are replaced by many-valued connectives min, max in all crisp definitions. In matrix terms, this yields the following intersection and union operations: (R u S) i j D min(R i j ; S i j ); (R t S) i j D max(R i j ; S i j ): (In older -notation, (R u S) (xi , xj ) = min(R (xi , yj ), S (xi , yj )), etc.) The negation of R is given by (: R)ij = 1  Rij . The converse of R is given by (R| )ij = Rji . Fuzzy Relations Based on Łukasiewicz Connectives When the bold (Łukasiewicz) connectives x _ y = min(1, x + y), x ^ y = max(0, x + y 1) are used to

define t, u operations, this is an instance of relations in MV-algebras. Fuzzy Relations With t-Norms and Co-Norms Fuzzy logics can be further generalized. ^ and _ are obtained by replacing min and max by a t-norm and a tconorm, respectively. A t-norm is an operation : [0, 1]2 ! [0, 1] which is commutative, associative, nondecreasing in both arguments and having 1 as the unit element and 0 as the zero element. Taking a continuous t-norm, by residuation we obtain a many-valued logic implication !. Using { ^, _, , ! } one can define families of deductive systems for fuzzy logics called BLlogics [31]. In relational systems using BL-logics, one can define again various t-norm based relational properties [53,83], BK-products and generalized morphisms of relations [47]. Definition 16 (Inclusion of relations) A relation R is ‘contained in’ or is a subrelation of a relation S, written R v S, if and only if (8i)(8j) Rij  Sij . This definition guarantees that R is a subrelation of R0 if and only if every R˛ is a subrelation of its corresponding R˛ . (This convenient meta-property is called cutworthiness, see Theorem 17 below.) Products: RF (X Ý Y) × RF (Y Ý Z) ! RF (X Ý Z) For fuzzy relations, there are two versions of products: harsh and mean [3,52]. Most conveniently, again, in matrix terms harsh products syntactically correspond to matrix formulas for the crisp relations. The fuzzy relational products are obtained by replacing the Boolean logic connectives AND, OR, both implications and the equivalence of crisp products by connectives of some many-valued logic chosen according to the properties of the products required. Thus the ı-product and -product are given exactly as in Proposition 6 above by formulas 3) and 5), respectively; for triangle products as given in Proposition 8 above. For the MVL implication operators most often used to define fuzzy triangle products, see  Checklist paradigm semantics for fuzzy logics, Table 1, or [8]. The details of choice of the appropriate many-valued connectives are discussed in [3,7,8,40,43,52]. Given the general formula (R@S)ik := # (Rij Sjk ) for a relational product, a mean product is obtained by

Boolean and Fuzzy Relations

B

Boolean and Fuzzy Relations, Table 1 Closures and an interior 1. The locally reflexive closure of R: locref clo R = R t ER . 2. The symmetric closure of R: sym clo R = R t R| . 3. symmetric interior of R: sym int R = R u R| . 4. The transitive closure of R: tra clo R = R u R2 u    = uk2Z C Rk . 5. The local tolerance closure of R: loctol clo R = locref clo (sym clo R). 6. The local pre-order closure of R: locpre clo R = locref clo (tra clo R) = tra clo (locref clo R). 7. The local equivalence closure of R: locequ clo R = tra clo (sym clo (locref clo R)). 8. The reflexive closure of R: ref clo R = R t EX . 9. The tolerance closure of R: tol clo R = ref clo (sym clo R). 10. The pre-order closure of R: pre clo R = ref clo (tra clo R). 11. The equivalence closure of R: equ clo R = tra clo (tol clo R).

P replacing the outer connective # by and normalizing the resulting product appropriately. In more concrete terms, in order to obtain the mean products, the outer V W connectives j in ı and j in , G, F are replaced by P 1/n) j [3].

Alpha-cuts of Fuzzy Relations It is often convenient to study fuzzy relations through their ˛-cuts; for any ˛ in the half-open interval [0, 1], the ˛-cut of a fuzzy relation R is the crisp relation R˛ given by (

N-ary Relations An n-ary relation R is an open sentence with n slots; when these are filled in order by the names of elements from sets X 1 , . . . , X n , there results a proposition that is either true or false if the relation is crisp, or is judged to hold to a certain degree if the relation is fuzzy. This ‘intensional’ definition is matched by the satisfaction set RS of R, which is a fuzzy subset the n-tuple of X 1 , . . . , X n , and can be used, if desired as its extensional definition. The matrix notation works equally well for n-ary relations and all the types of the BK-products are also defined. For details see [9].

(R˛ ) i j D

1 if R i j  ˛; 0 otherwise:

Compatibility of families of crisp relations with their fuzzy counterpart (the original relation on which the ˛-cuts have been performed) is guaranteed by the following theorem on cutworthy properties [10]: Theorem 17 It is true of each simple property P (given in Definition 12) and every compound property P (listed in Table 1), that every ˛-cut of a fuzzy relation R possesses P in the crisp sense, if and only if R itself possesses in the fuzzy sense. (Such properties are called cutworthy.)

Special Properties of Fuzzy Relations The special properties of crisp relations can be generalized to fuzzy relations exactly as they stand in Definition 12, using in each case the second of the two given definitions. It is perhaps worthwhile spelling out the requirements for transitivity in more detail: R2 v R , (8i; k) max(min R i j ; R jk ))  R i k : i

Useful references provide further pointers to the literature: general [43] on fuzzy partitions [14,69], fuzzy similarities [69], tolerances [34,75,85].

Fuzzy Partitions, Fuzzy Clusters and Fuzzy Hierarchies Via their ˛-cuts, fuzzy local and global equivalences provide precisely the nested families of partitions in and on a set which are required by the theory and for the applications in taxonomy envisaged in [17,18]. Fuzzy local and global tolerances similarly provide families of tolerance classes for the cluster type of classification which allows overlaps. Fuzzy local and global orders furnish nested families of hierarchies in and on a set, with their accompanying families of Hasse diagrams.

305

306

B

Boolean and Fuzzy Relations

The importance of fuzzy extensions cannot be overestimated. Thus, one may identify approximate similarities in data, approximate equivalences and orders. Such approximations are paramount in many applications, in situations when only incomplete, partial information about the domain of scientific or technological application is available. Closures and interiors of relations play an important role in design of fast fuzzy relational closure algorithms [4,9,10,11] for computing such approximations. Theorem 17 and other theorems on commuting of cuts with closures [11,42] guarantee their correctness. Closures and Interiors with Special Properties For certain properties P which a fuzzy relation may have or lack, there always exists a well-defined P-closure of R, namely the least inclusive relation V which contains R and has the property P. Also, for some properties P, the P-interior of R is the most inclusive relation Q contained in R and possessing P. Clearly, where the P-closure exists, R itself possesses P if and only if R is equal to P-clo(R), and the same for interiors. Certain closures use the local equality ER of R, given by (ER )ii = maxi (max(Rij , Rji )), (ER )ij = 0 if j 6D i. Others use the equality on X given by (EX )ii = 1, (EX )ij = 0 for j 6D i. Important closures and one important interior are given in Table 1. See [4,10] for further details. Applications of Relational Methods in Engineering, Medicine and Science Relational properties are important for obtaining knowledge about characteristics and interactions of various parts of a relational model used in real life applications. Identification of composite properties of mathematical relations, such as local or global pre-orders, orders, tolerances or equivalences, plays an important role in evaluation of empirical data, (e. g. medical data, commercial data etc. or data for technological forecasting) and building and evaluating relational models based on such data [48,49]. The local and global properties detect important semantic distinctions between various concepts captured by relational structures. For example the interactions between technological parts, processes etc., or relationships of cognitive constructs elicited experimen-

tally [37,39,41,55]. Capturing both, local and global properties is important for distinguishing participants from nonparticipants in a relational structure. This distinction is crucial for obtaining a nondistorted picture of reality. In the general terms, the abstract theoretical tools supporting identification and representation of relational properties are fuzzy closures and interiors [4,10]. Having such means for testing relational properties opens the avenue to linking the empirical structures that can be observed and captured by fuzzy relations with their abstract, symbolic representations that have well defined mathematical properties. This opens many possibilities for computer experimentation with empirically identified logical, say, predicate structures. These techniques found practical use in directing resolution based theorem prover strategy [56], relation-based inference in medical diagnosis [48,58] and at extracting predicate structures of ‘train of thought’ from questionnaires presented to people by means of Kelly’s repertory grids. BK-relational products and fast fuzzy relational algorithms based on fuzzy closures and interiors have been essential for computational progress of in this field and for optimization of computational performance. See the survey in [52] with a list of 50 selected references on the mathematical theory and applications of BK-products in various fields of science and engineering. Further extensions or modifications of BK-products have been suggested in [19,20,21,30]. Applications of relational theories, computations and modeling include the areas of medicine [48,59], psychology [49], cognitive studies [36,38], nuclear engineering [84], industrial engineering and management [25,46], architecture and urban studies [65,66] value analysis in business and manufacturing [60] information retrieval [51,54], computer security [45,50] databases, theoretical computer science [13,68,71], software engineering [78], automated reasoning [56], and logic [12,28,63]. Particularly important for software engineering is the contribution of C.A.R. Hoare and He Jifeng [33] who use the crisp triangle BK-superproduct for software specification, calling the crisp G products in fact ‘weak prespecifications’. Relational equations [22] play an important role in applications [70] in general, and also in AI and applications of causal reasoning [24]; fuzzy inequalities in

Boolean and Fuzzy Relations

mathematical programming [72]. Applications in game theory of crisp relations is well established [78,79]. Brief Review of Theoretical Development Binary (two place) relations were first perceived in their abstract mathematical form by Galen of Pergamon in the 2nd century AD [57]. After a long gap, first systematic development of the calculus of relations (concerned with the study of logical operations on binary relations) was initiated by A. DeMorgan, C.S. Pierce and E. Schröder [9,64]. Significant investigation into the logic of relations was the 1900 paper of B. Russell [76] and axiomatization of the relational calculus in 1941 by A. Tarski [64,81]. Extensibility of Tarski’s axioms to the fuzzy domain has been investigated by Kohout [44]. Later algebraic advances in relational calculus [9] stem jointly from the elegant work of J. Riguet (1948) [74], less widely known but important work of O. Bor˚uvka (1939) [15,16,17,18] and the stimulus of fuzzy set theory of L.A. Zadeh (1965) [35,85,86], and include a sharpened perception of special properties and the construction of new kinds of relational products [3], together with the extension of the theory from Boolean to multiple-valued logic based relations [2,9]. The triangle subproduct R G S, the triangle superproduct R F S, and square product R  S were introduced in their general form defined below by Bandler and Kohout in 1977, and are referred to as the BK-products in the literature [19,20,30]. The square product, however, stems from Riguet (1948) [74], needing only to be made explicit [1,9]. E. Sanchez independently defined an ˛-compostition [77] which is in fact G using Heyting–Gödel implication. The special instances of the triangle BK-products were more recently rediscovered and described in 1986 by J.P. Doignon, B. Monjardet, M. Roubens, and P. Vincke [23,26] calling these ‘traces of relations’. Hence, a ‘trace-of-relation’ is a BKtriangle superproduct in which ! is the residuum of a commutative ^. The crisp square product was also independently introduced in 1986 by R. Berghammer, G. Schmidt and H. Zierer [13] as a generalization of Riguet’s ‘noyau’ [74]. On the other hand, advances in abstract relational algebras stems from the work of Tarski [81] and his school [32,64,67,82]. Tarski’s axiomatization [81] of

B

homogeneous relational calculus takes relations and operations over relations as the primitives. It applies only to homogeneous relations as it has only one constant entity, the identity relation E. For heterogeneous relations, taking e. g. U XY as the universal relation we have a finite number of separate identity relations (constants) i. e. EXY , EYZ , . . . , etc. [4,10]. Therefore viewed syntactically through the logic axioms, the axiomatization of heterogeneous relations (containing a whole family of universal relations) would be a many-sorted theory [30], each universal relation belonging to a different sort. Tarski’s axioms of homogeneous relations R ı E = E ı R; (R ı S)| = S| ı R| ; (R| )| = R; (: R)| = : (R| ); (R t S)| = R| t S| ; (R ı S) ı T = R ı (S ı T); (R t S) ı T = (R ı T) t (S ı T); R ı (S t T) = (R ı S) t (R ı T); (R| ı :(R ı S)) t : S = : S.

Taking the axioms on their own opens the way to abstract relational algebras (RA) with new problems at hand. Tarski and his school have investigated the interrelationship of various generalizations of associative RAs in a purely abstract way. In some of these generalizations, the axiom of associativity for relational composition is dropped. This leads from representable (RRA) to semi-associative (SA), weakly associative (WA) and nonassociative (NA) relational algebras. In 1982 R.D. Maddux [62] gave the following result: RRA  RA  SA  WA  NA: All these generalizations deal only with one relational composition. The equations for pseudo-associativities given above (Proposition 9) and the nonassociativity of the square product (Proposition 4) show that there exist nonassociative representations of relational algebras (RA) in the relational calculus. Theorem 15 and Proposition 8 show that the interplay of several relational compositions is essentially involved in the computationally more powerful formulas of the relational calculus. The Tarskian RA axiomatizations, however, do not express fully the richness of the calculus of binary relations and the mutual interplay of associative ı, pseudo-associative F, G and nonassociative  prod-

307

308

B

Boolean and Fuzzy Relations

ucts. Considerable scope for further research into new axiomatizations still remains. Our results based on nonassociative BK-products of Bandler and Kohout that historically precede abstract nonassociative generalizations in relational algebras of Maddux show that the nonassociative products have representations and that these representations offer various computational advantages. There is also a link of RA with projective geometries [61].

2.

3.

Basic Books and Bibliographies The best general books on theory of crisp relations and applications are [78] and [80]. In fuzzy field, there is no general book available at present. There are, however, extant some more specialized monographs: on solving fuzzy relations equations [27], on preference modeling and multicriteria decision making [39], on representation of cognitive maps by relations [39] and on crisp and fuzzy BK-products of relations [53]. One can also find some specialized monographs on logic foundations and relational algebras: [32,82]. All these books also contain important list of references. The most important bibliography of selected references on the topic related to fuzzy sets and relations is contained in [43]. The early years of fuzzy sets (1965–1975) are covered very comprehensively in the critical survey and annotated bibliography [29]. Many-valued logic connectives form an important foundation for fuzzy sets and relations. The book of N. Rescher [73] still remains the best comprehensive survey that is also accessible to a nonlogician. It contains almost complete bibliography of many-valued logics from the end of the 19th century to 1968.

4.

5.

6. 7.

8.

9.

10.

11.

See also  Alternative Set Theory  Checklist Paradigm Semantics for Fuzzy Logics  Finite Complete Systems of Many-valued Logic Algebras  Inference of Monotone Boolean Functions  Optimization in Boolean Classification Problems  Optimization in Classifying Text Documents

12.

13. 14. 15.

References 1. Bandler W, Kohout LJ (1977) Mathematical relations, their products and generalized morphisms. Techn Report Man-

16.

Machine Systems Lab Dept Electrical Engin Univ Essex, Colchester, Essex, UK, EES-MMS-REL 77-3. Reprinted as Chap. 2 of Kohout LJ, Bandler W (eds) Survey of Fuzzy and Crisp Relations, Lect Notes in Fuzzy Mathematics and Computer Sci, Creighton Univ Omaha Bandler W, Kohout LJ (1980) Fuzzy relational products as a tool for analysis and synthesis of the behaviour of complex natural and artificial systems. In: Wang PP, Chang SK (eds) Fuzzy Sets: Theory and Appl. to Policy Analysis and Information Systems. Plenum, New York, pp 341–367 Bandler W, Kohout LJ (1980 1981) Semantics of implication operators and fuzzy relational products. Internat J Man-Machine Studies 12:89–116 Reprinted in: In: Mamdani EH, Gaines BR (eds) Fuzzy Reasoning and its Applications. Acad. Press, New York, 219–246 Bandler W, Kohout LJ (1982) Fast fuzzy relational algorithms. In: Ballester A, Cardús D, Trillas E (eds) Proc. Second Internat. Conf. Math. at the Service of Man (Las Palmas, Canary Islands, Spain, 28 June-3 July), Univ. Politechnica de Las Palmas, pp 123–131 Bandler W, Kohout LJ (1986) On new types of homomorphisms and congruences for partial algebraic structures and n-ary relations. Internat J General Syst 12:149–157 Bandler W, Kohout LJ (1986) On the general theory of relational morphisms. Internat J General Syst 13:47–66 Bandler W, Kohout LJ (1986) A survey of fuzzy relational products in their applicability to medicine and clinical psychology. In: Kohout LJ, Bandler W (eds) Knowledge Representation in Medicine and Clinical Behavioural Sci. Abacus Book. Gordon and Breach, New York, pp 107–118 Bandler W, Kohout LJ (1987) Fuzzy implication operators. In: Singh MG (ed) Systems and Control Encyclopedia. Pergamon, Oxford, pp 1806–1810 Bandler W, Kohout LJ (1987) Relations, mathematical. In: Singh MG (ed) Systems and Control Encyclopedia. Pergamon, Oxford, pp 4000–4008 Bandler W, Kohout LJ (1988) Special properties, closures and interiors of crisp and fuzzy relations. Fuzzy Sets and Systems 26(3) (June):317–332 Bandler W, Kohout LJ (1993) Cuts commute with closures. In: Lowen B, Roubens M (eds) Fuzzy Logic: State of the Art. Kluwer, Dordrecht, pp 161–167 Benthem J van (1994) General dynamic logic. In: Gabbay DM (ed) What is a Logical System. Oxford Univ. Press, Oxford pp 107–139 Berghammer R, Schmidt G (1989/90) Symmetric quotients and domain constructions. Inform Process Lett 33:163–168 Bezdek JC, Harris JD (1979) Convex decompositions of fuzzy partitions. J Math Anal Appl 67:490–512 Boruvka ˚ O (1939) Teorie grupoidu (Gruppoidtheorie, I. Teil). Publ Fac Sci Univ Masaryk, Brno, Czechoslovakia 275:1–17, In Czech, German summary Boruvka ˚ O (1941) Über Ketten von Faktoroiden. MATH-A 118:41–64

Boolean and Fuzzy Relations

17. Boruvka ˚ O (1945) Théorie des décompositions dans un ensemble. Publ Fac Sci Univ Masaryk, Brno, Czechoslovakia:278 1–37 (In Czech, French summary) 18. Boruvka ˚ O (1974) Foundations of the theory of groupoids and groups. VEB Deutsch. Verlag Wissenschaft., Berlin, Also published as Halsted Press book by Wiley, 1976 19. DeBaets B, Kerre E (1993) Fuzzy relational compositions. Fuzzy Sets and Systems 60(1):109–120 20. DeBaets B, Kerre E (1993) A revision of Bandler–Kohout composition of relations. Math Pannonica 4:59–78 21. DeBaets B, Kerre E (1994) The cutting of compositions. Fuzzy Sets and Systems 62(3):295–310 22. DiNola A, Pedrycz W, Sanchez E (1989) Fuzzy relation equations and their applications to knowledge engioneering. Kluwer, Dordrecht 23. Doignon JP, Monjardet B, Roubens M, Vincke P (1986) Biorders families, valued relations and preference modelling. J Math Psych 30:435–480 24. Dubois D, Prade H (1995) Fuzzy relation equations and causal reasoning. Fuzzy Sets and Systems 75(2):119–134 25. Dubrosky B, Kohout LJ, Walker RM, Kim E, Wang HP (1997) Use of fuzzy relations for advanced technological cost modeling and affordability decisions. In: 35th AIAA Aerospace Sci. Meeting and Exhibit (Reno, Nevada, January 6-9, 1997), Amer. Inst. Aeronautics and Astronautics, Reston, VA, 1–12, Paper AIAA 97-0079 26. Fodor JC (1992) Traces of fuzzy binary relations. Fuzzy Sets and Systems 50(3):331–341 27. Fodor J, Roubens M (1994) Fuzzy preference modelling and multicriteria decision support. Kluwer, Dordrecht 28. Gabbay DM (1994) What is a logical system? In: Gabbay DM (ed) What is a Logical System? Oxford Univ. Press, Oxford, pp 179–216 29. Gaines BR, Kohout LJ (1977) The fuzzy decade: A bibliography of fuzzy systems and closely related topics. Internat J Man-Machine Studies 9:1–68 (A critical survey with bibliography.) Reprinted in: Gupta MM, Saridis GN, Gaines BR (eds) (1988) Fuzzy Automata and Decision Processes. Elsevier/North-Holland, Amsterdam, pp 403–490 30. Hájek P (1996) A remark on Bandler–Kohout products of relations. Internat J General Syst 25(2):165–166 31. Hájek P (1998) Metamathematics of fuzzy logic. Kluwer, Dordrecht 32. Henkin L, Monk JD, Tarski A (1985) Cylindric algebras, vol II. North-Holland, Amsterdam 33. Hoare JAR, Jifeng He (1986) The weakest prespecification I-II. Fundam Inform 9:51–84; 217–251 34. Höhle U (1988) Quotients with respect to similarity relations. Fuzzy Sets and Systems 27(1):31–44 35. Höhle U, Klement EP (1995) Non-classical logics and their applications to fuzzy subsets: A handbook of mathematical foundations of fuzzy sets. Kluwer, Dordrecht 36. Juliano BA (1993) A fuzzy logic approach to cognitive diagnosis. PhD Thesis, Dept. Comput. Sci., Florida State Univ., Tallahassee, Fl

B

37. Juliano BA (1996) Towards a meaningful fuzzy analysis of urbanistic data. Inform Sci 94(1–4):191–212 38. Juliano BA, Bandler W (1989) A theoretical framework for modeling chains-of-thought: Automating fault detection and error diagnosis in scientific problem solving. In: Fishman MB (ed) Proc. Second Florida Artificial Intelligence Res. Symp., Florida AI Res. Soc., FLAIRS, pp 118–122 39. Juliano B, Bandler W (1996) Tracing chains-of-thought: Fuzzy methods in cognitive diagnosis. Physica Verlag, Heidelberg 40. Kandel A (1986) Fuzzy mathematical techniques with applications. Addison-Wesley, Reading, MA 41. Kim E, Kohout LJ, Dubrosky B, Bandler W (1996) Use of fuzzy relations for affordability decisions in high technology. In: Adey RA, Rzevski G, Sunol AK (eds) Applications of Artificial Intelligence in Engineering XI. Computational Mechanics Publ., Billerica, MA 42. Kitainik L (1992) For closeable and cutworthy properties, closures always commute with cuts. In: Proc. IEEE Internat. Conf. Fuzzy Systems, IEEE, New York, pp 703–704 43. Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic: Theory and applications. Prentice-Hall, Englewood Cliffs, NJ 44. Kohout LJ (2000) Extension of Tarski’s axioms of relations to t-norm fuzzy logics. In: Wang PP (ed) Proc. 5th Joint Conf. Information Sciences, I Assoc. Intelligent Machinery, Durham44–47 45. Kohout LJ (1990) A perspective on intelligent systems: A framework for analysis and design. Chapman and Hall and v. Nostrand, London–New York 46. Kohout LJ (1997) Fuzzy relations and their products. In: Wang P (ed) Proc. 3rd Joint Conf. Inform. Sci. JCIS’97, Duke Univ., March,), Keynote Speech VIII: Prof. W. Bandler Memorial Lecture; to appear in: Inform. Sci. 47. Kohout LJ (1998 1999) Generalized morphisms in BLlogics. In: Logic Colloquium: The 1998 ASL Europ. Summer Meeting, Prague, August 9-15 1998, Assoc. Symbolic Logic (Extended abstract presenting the main mathematical theorems). Reprinted in: Bull Symbolic Logic (1999) 5(1):116– 117 48. Kohout LJ, Anderson J, Bandler W, et al. (1992) Knowledgebased systems for multiple environments. Ashgate Publ. (Gower), Aldershot, Hampshire, UK 49. Kohout LJ, Bandler W (eds) (1986) Knowledge representation in medicine and clinical behavioural science. Abacus Book. Gordon and Breach, New York 50. Kohout LJ, Bandler W (1987) Computer security systems: Fuzzy logics. In: Singh MG (ed) Systems and Control Encyclopedia. Pergamon, Oxford 51. Kohout LJ, Bandler W (1987) The use of fuzzy information retrieval techniques in construction of multi-centre knowledge-based systems. In: Bouchon B, Yager RR (eds) Uncertainty in Knowledge-Based Systems. Lecture Notes Computer Sci. Springer, Berlin, pp 257–264 52. Kohout LJ, Bandler W (1992) Fuzzy relational products in knowledge engineering. In: Novák V, et al (eds) Fuzzy Ap-

309

310

B 53.

54.

55.

56.

57.

58.

59.

60.

61. 62. 63. 64.

65.

66.

Boolean and Fuzzy Relations

proach to Reasoning and Decision Making. Academia and Univ. Press, Prague, pp 51–66 Kohout LJ, Bandler W (1999) A survey of fuzzy and crisp relations. Lecture Notes Fuzzy Math and Computer Sci Creighton Univ., Omaha, NE Kohout LJ, Keravnou E, Bandler W (1984) Automatic documentary information retrieval by means of fuzzy relational products. In: Zadeh LA, Gaines BR, Zimmermann H-J (eds) Fuzzy Sets in Decision Analysis. North-Holland, Amsterdam, pp 383–404 Kohout LJ, Kim E (1997) The role of semiotic descriptors in relational representation of fuzzy granular structures. In: Albus J (ed) ISAS ’97 Intelligent Systems and Semiotics: A Learning Perspective. Special Publ. Nat. Inst. Standards and Techn., US Dept. Commerce, Washington, DC, pp 31– 36 Kohout LJ, Kim Yong-Gi (1993) Generating control strategies for resolution-based theorem provers by means of fuzzy relational products and relational closures. In: Lowen B, Roubens M (eds) Fuzzy Logic: State of the Art. Kluwer, Dordrecht, pp 181–192 Kohout LJ, Stabile I (1992) Logic of relations of Galen. In: Svoboda V (ed) Proc. Internat. Symp. Logica ’92, Inst. Philosophy Acad. Sci. Czech Republic, Prague, pp 144–158 Kohout LJ, Stabile I, Bandler W, Anderson J (1995) CLINAID: Medical knowledge-based system based on fuzzy relational structures. In: Cohen M, Hudson D (eds) Comparative Approaches in Medical Reasoning. World Sci., Singapore, pp 1–25 Kohout LJ, Stabile I, Kalantar H, San-Andres M, Anderson J (1995) Parallel interval-based reasoning in medical knowledge-based system Clinaid. Reliable Computing (A special issue on parallel systems) 1(2):109–140 Kohout LJ, Zenz G (1997) Activity structures and triangle BK-products of fuzzy relations – a useful modelling and computational tool in value analysis studies. In: Mesiar R, et al (eds) Proc. IFSA 1997 (The World Congress of Internat. Fuzzy Systems Assoc., Prague), IV, Academia, Prague, pp 211–216 Lyndon RC (1961) Relation algebras and projective geometries. Michigan Math J 8:21–28 Maddux RD (1982) Some varieties containing relation algebras. Trans Amer Math Soc 272(2):501–526 Maddux RD (1983) A sequent calculus for relation algebras. Ann Pure Appl Logic 25:73–101 Maddux RD (1991) The origin of relation algebras in the development and axiomatization of the calculus of relations. Studia Logica 50(3–4):421–455 Mancini V, Bandler W (1988) Congruence of structures in urban knowledge representation. In: Bouchon B, Saita L, Yager R (eds) Uncertainty and Intelligent Systems. Lecture Notes Computer Sci. Springer, Berlin, pp 219–225 Mancini V, Bandler W (1992) Design for designing: Fuzzy relational environmental design assistant (FREDA). In: Kan-

67. 68. 69. 70.

71. 72.

73. 74. 75.

76.

77.

78. 79.

80. 81. 82.

83. 84.

85. 86.

del A (ed) Fuzzy Expert Systems. Addison-Wesley, Reading, MA, pp 195–202 McKinsey JCC (1940) Postulates for the calculus of binary relations. J Symbolic Logic 5:85–97 Nemeti I (1991) Algebraization of quantifier logics, an introductory overview. Studia Logica 50(3–4):485–569 Ovchinikov S (1991) Similarity relations, fuzzy partitions, and fuzzy ordering. Fuzzy Sets and Systems 40(1):107–126 Pedrycz W (1991) Processing in relational structures: Fuzzy relational equations. Fuzzy Sets and Systems 40(1): 77–106 Pratt V (1991) Dynamic algebras: examples, constructions, applications. Studia Logica 50(3–4):571–605 Ramik J, Rommelfanger H (1996) Fuzzy mathematical programming based on some new inequality relations. Fuzzy Sets and Systems 81(1):77–87 Rescher N (1969) Many-valued logic. McGraw-Hill, New York Riguet J (1948) Relations binaires, fermetures, correspondences de Galois. Bull Soc Math France 76:114–155 Rundensteiner E, Bandler W, Kohout L, Hawkes LW (1987) An investigation of fuzzy nearness measure. In: Proc. Second IFSA Congress, Internat. Fuzzy Systems Assoc., pp 362–365 Russell B (1900/1) The logic of relations: with some applications to the theory of series. Rivisita di Mat 7:115–148, English translation (revised by Lord Russell) in: Marsh RC (ed) (1956) Logic and Knowledge – Essays 1901-1950. Allen– Unwin, London, 1–38 (in French) Sanchez E (1988) Solutions in composite fuzzy relation equations. In: Gupta MM, Saridis GN, Gaines BR (eds) Fuzzy Automata and Decision Processes. Elsevier and Univ. Press, Amsterdam, pp 221–234 Schmidt G, Ströhlein T (1993) Relations and graphs: Discrete mathematics for computer scientists. Springer, Berlin Schmidt G, Ströohlein T (1985) On kernels of graphs and solutions of games: A synopsis based on relations and fixpoints. SIAM J Alg Discrete Meth 6:54–65 Schreider JuA (1975) Equality, resemblance, and order. MIR, Moscow Tarski A (1941) Calculus of relations. J Symbolic Logic 6(3): 73–89 Tarski A, Givant S (1987) A formalization of set theory without variables. Colloq Publ, vol 41. Amer. Math. Soc., Providence, RI Valverde L (1985) On the structure of F-indistinguishability operators. Fuzzy Sets and Systems 17:313–328 Walle B Van der, DeBaets B, Kerre EE (1995) Fuzzy multicriteria analysis of cutting techniques in a nuclear reactor dismantling project. Fuzzy Sets and Systems 74(1):115– 126 Zadeh LA (1987) Fuzzy sets: Selected papers I. In: Yager R et al (eds) Wiley, New York Zadeh LA (1996) Fuzzy sets: Selected papers II. In: Klir G, Yuan B (eds) World Sci., Singapore

Bottleneck Steiner Tree Problems

Bottleneck Steiner Tree Problems BSTP ALEXANDER ZELIKOVSKY Georgia State University, Atlanta, USA MSC2000: 05C05, 05C85, 68Q25, 90B80 Article Outline Keywords See also References Keywords Bottleneck Steiner trees; Facility location; Geometric algorithms; Minmax multicenter; Approximation algorithms A bottleneck Steiner tree (or a min-max Steiner tree) is a Steiner tree (cf.  Steiner tree problems) in which the maximum edge weight is minimized. Several multifacility location and VLSI routing problems ask for bottleneck Steiner trees. Consider the problem of choosing locations for a number of hospitals serving homes where the goal is to minimize maximum weighted distance to any home from the hospital that serves it and between hospitals. The solution is a tree which spans all hospitals and connects each home to the closest hospital. This tree can be seen as a Steiner tree where the homes are terminals and hospitals are Steiner points (cf.  Steiner tree problems). Unlike the classical Steiner tree problem where the total length of Steiner tree is minimized, in this problem it is necessary to minimize maximum edge weight. The other instance of the bottleneck Steiner tree problem occurs in electronic physical design automation where nets are routed subject to delay minimization [2,3]. The terminals of a net are interconnected possibly through intermediate nodes (Steiner points) and for electrical reasons one would like to minimize maximum distance between each pair of interconnected points. The most popular versions of the bottleneck Steiner tree problem in the literature are geometric. Note that if

B

the number of Steiner points is not bounded, then any edge can be subdivided into infinitely small segments and the resulting maximum edge length becomes zero. Therefore, any meaningful formulation should bound the number of Steiner points. One such formulation is suggested in [9]. Problem 1 Given a set of n points in the plane (called terminals), find a bottleneck Steiner tree spanning all terminals such that degree of any Steiner point is at least 3. Instead of introducing constraints, one can minimize the number of Steiner points. The following formulation has been proved to be NP-hard [15] and approximation algorithms have been suggested in [11,14]. Problem 2 Given a set of n terminals in the plane and  > 0, find a Steiner tree spanning n terminals with the minimum number of Steiner points such that every edge is not longer than . Sometimes the bottleneck Steiner tree has predefined topology, i. e. the unweighted tree consisting of edges between terminals and Steiner points [4,5,10]. Then it is necessary to find the optimal positions of all Steiner points. Since the number of different topologies for a given set of terminals grows exponentially, fixing the topology greatly reduces the complexity of the bottleneck Steiner tree problem. Problem 3 Find a bottleneck Steiner tree with a given topology T which spans a set of n terminals in the plane. The first algorithms for the Euclidean case of Problem 3 are based on nonlinear optimization [7] and [13]. For a given  > 0, the algorithm from [15] finds whether a Steiner tree ST with the maximum edge weight  exists as follows. The topology T is first transformed into a forest by removing edges between terminals, if any such edge has length more than , then ST does not exist. Each connected component T is processed separately. The following regions are computed in bottom-up fashion: i) the region of the plane R(s) where a Steiner point s can be placed; and ii) the region R+ (s) where the Steiner point adjacent to s can be placed which is the area within distance at most  from R(s). If a Steiner point p is adjacent to nodes s1 , . . . , sk in T i , then R(s) = R+ (s1 ) \    \ R+ (sk ). The number a(s)

311

312

B

Bottleneck Steiner Tree Problems

of arcs bounding R(s) may be as high as the number of leaves in T i . In order to keep this number low, the tree K can be decomposed in O(log n) levels such that in total there will be only O(n) arcs in all regions. Thus the runtime of the algorithm is O(n log n) [15]. When the distance between points is rectilinear, several efficient algorithms are suggested for Problem 3 [4,9,10]. The algorithm above can be adjusted for the rectilinear plane: the regions R(s) are rectangles. The fastest known algorithm solves Problem 3 in time O(n2 ) [9]. Each bottleneck Steiner problems can be generalized to arbitrary weights on edges and formulated for weighted graphs [6]. Problem 4 Given a graph G = (V, E, w) with nonnegative weight w on edges, and a set of terminals S  V, find a Steiner tree spanning S with the smallest maximum edge weight. Problem 4 can be solved efficiently in the optimal time O(|E|) time [6]. Unfortunately, the above formulation does not bound the number of Steiner points. To bound the number of Steiner points it is necessary to take in account that unlike the classical Steiner tree problem in graphs (cf.  Steiner tree problems), an edge cannot be replaced with a shortest path without affecting the bottleneck objective. The following graph-theoretical generalization of Problem 1 considered in [1,9] has been proved to be NP-hard. Problem 5 Given a complete graph G = (V, E, w) with nonnegative weight w on edges, and a set of terminals S  V, find a Steiner tree spanning S with the smallest maximum edge weight such that each Steiner point has degree at least 3. Similarly to the classical Steiner tree problem, if no Steiner points are allowed, the minimum spanning tree (cf. also  Capacitated minimum spanning trees) is the optimal solution for Problems 1 and 5. Therefore, similarly to the Steiner ratio, it is valid to consider the bottleneck Steiner ratio B (n). The bottleneck Steiner ratio is defined as the supremum over all instances with n terminals of the ratio of the maximum edge weight of the minimum spanning tree over the maximum edge weight of the bottleneck Steiner tree. It has been proved that B (n) = 2 blog2 nc ı, where ı is either 0 or 1 de-

pending on whether mantissa of log2 n is greater than log2 3/2 [9]. The approximation complexity of the Problem 5 is higher than for the classical Steiner tree problem: even (2  )-approximation is NP-hard for any > 0 [1]. On the other hand, the best known approximation algorithm for Problem 5 has approximation ratio log2 n [1]. The algorithm looks for an approximate bottleneck Steiner tree in the collection C of edges between all pairs of terminals and minimum bottleneck Steiner trees for all triples of terminals. Using Lovasz’ algorithm [12] it is possible to find out whether such a collection contains a valid Steiner tree, i. e. a Steiner tree with all Steiner points of degree at least three. The algorithm finds the smallest  such that C still contains valid Steiner tree if all edges of weight more than  are removed. It has been shown that   M log2 n, where M is the maximum edge weight of the optimal bottleneck Steiner tree. See also  Capacitated Minimum Spanning Trees  Directed Tree Networks  Minimax Game Tree Searching  Shortest Path Tree Algorithms  Steiner Tree Problems References 1. Berman P, Zelikovsky A (2000) On the approximation of power-p and bottleneck Steiner trees. In: Adv. in Steiner Trees. Kluwer, Dordrecht, pp 117–135 2. Boese KD, Kahng AB, McCoy BA, Robins G (1995) Nearoptimal critical sink routing tree constructions. IEEE Trans Computer-Aided Design Integr Circuits and Syst 14:1417– 11436 3. Chiang C, Sarrafzadeh M, Wong CK (1990) Global routing based on Steiner min-max trees. IEEE Trans ComputerAided Design Integr Circuits and Syst 9:1318–1325 4. Dearing PM, Francis RL (1974) A network flow solution to a multifacility location problem involving rectilinear distances. Transport Sci 8:126–141 5. Drezner Z, Wesolowsky GO (1978) A new method for the multifacility minimax location problem. J Oper Res Soc 29:1095–1101 6. Duin CW, Volgenant A (1997) The partial sum criterion for Steiner trees in graphs and shortest paths. Europ J Oper Res 97:172–182 7. Elzinga J, Hearn D, Randolph WD (1976) Minimax multifacility location with Euclidean distances. Transport Sci 10:321–336

Boundary Condition Iteration BCI

8. Erkut E, Francis RL, Tamir A (1992) Distance-constrained multifacility minimax location problems on tree networks. Networks 22(1):37–54 9. Ganley JL, Salowe JS (1996) Optimal and approximate bottleneck Steiner trees. Oper Res Lett 19:217–224 10. Ichimori T (1996) A shortest pathe approach to a multifacility minimax location problem with rectilinear distances. J Res Soc Japan 19:217–224 11. Lin G-H, Hue G (1999) Steiner tree problem with minimum number of Steiner points and bounded edge-length. Inform Process Lett 69:53–57 12. Lovasz L, Plummer MD (1986) Matching theory. Elsevier, Amsterdam 13. Love RF, Weselowsky GO, Kraemer SA (1997) A multifacility minimax location method for Euclidean distances. Internat J Production Res 97:172–182 14. Mandoiu II, Zelikovsky AZ (2000) A note on the MST heuristic for bounded edge-length Steiner tress with minimum number of Steiner points. Inform Process Lett 75:165–167 15. Sarrafzadeh M, Wong CK (1992) Bottleneck Steiner trees in the plane. IEEE Trans Comput 41:370–374

Boundary Condition Iteration BCI REIN LUUS Department Chemical Engineering, University Toronto, Toronto, Canada MSC2000: 93-XX Article Outline Keywords Illustration of the Boundary Condition Iteration Procedure Sensitivity Information Without Evaluating the Transition Matrix See also References Keywords Optimal control; Boundary condition iteration; BCI; Control vector iteration; Pontryagin’s maximum principle; Iterative dynamic programming; IDP In solving optimal control problems involving nonlinear differential equations, some iterative procedure must be used to obtain the optimal control policy. From Pontryagin’s maximum principle it is known that the

B

minimum of the performance index corresponds to the minimum of the Hamiltonian. Obtaining the minimum value for the Hamiltonian usually involves some iterative procedure. Here we outline a procedure that uses the necessary condition for optimality, but the boundary conditions are relaxed. In essence we have the optimal control policy at each iteration to a wrong problem. Iterations are performed, so that in the limit the boundary conditions, as specified for the optimal control problem, are satisfied. Such a procedure is called approximation to the problem or boundary condition iteration method (BCI). Many papers have been written about the method. As was pointed out in [1], the method is fundamentally very simple and computationally attractive for some optimal control problems. In [3] some evaluations and comparisons of different approaches were carried out, but the conclusions were not very definitive [5]. Although for control vector iteration (CVI) many papers are written to describe and evaluate different approaches with widely different optimal control problems, see for example [14], for BCI such comparisons are much more limited and there is sometimes the feeling that the method works well only if the answer is already known. However, BCI is a useful procedure for determining the optimal control policy for many problems, and it is unwise to dispatch it prematurely. To illustrate the boundary condition iteration procedure, let us consider the optimal control problem, where the system is described by the differential equation kxk

dx D f(x; u); dt

with x(0) given;

(1)

where x is an n-dimensional state vector and u is an r-dimensional control vector. The optimal control problem is to determine the control u in the time interval 0  t < t f , so that the performance index Z tf ID (x; u) dt (2) 0

is minimized. We consider the case where the final time t f is given and there are no constraints on the control or the state variables. According to Pontryagin’s maximum principle, the minimum value of the performance index in (2) is obtained by minimizing the Hamiltonian HD

C z> f :

(3)

313

314

B

Boundary Condition Iteration BCI

The adjoint variable z is defined by dz @H D ; dt @x

with z(t f ) D 0;

(4)

Suppose at iteration j the use of x(j) (t f ) gives the initial state x(j) (0) which is different from the given initial state x(0). Then a new choice will be made at iteration (j + 1) through the use of

which may be written as @f> @ dz D z ; dt @x @x

x( jC1) (t f ) D x( j) (t f ) C ˚(0)(x( j) (0)  x(0)); with z(t f ) D 0:

The necessary condition for the minimum of the Hamiltonian is @H D 0: @u

(6)

Let us assume that (6) can be solved explicitly for the control vector u D g(x; z):

(7)

If we now substitute (7) into (1) and (5), and integrate these equations simultaneously backward from t = t f to t = 0 with some value assumed for x(t f ), we have the optimal control policy for a wrong problem, because there is no assurance that upon backward integration the given value of the initial state x(0) will be obtained. Therefore it is necessary to adjust the guessed value for the final state, until finally an appropriate value for x(t f ) is found. For this reason the method is called the boundary condition iteration method (BCI). In order to find how to adjust the final value of the state, based on the deviation obtained from the given initial state, we need to find the mathematical relationship to establish the effect of the change in the final state on the change in initial state. Many papers have been written in this area. The development of the necessary sensitivity equations is presented very nicely in [1]. In essence, the sensitivity information can be obtained by getting the transition matrix for the linearized state equation. Linearization of (1) gives dıx D dt



@f> @x

>

 ıx C

@f> @u

> ıu:

(8)

The transition matrix ˚ is thus obtained from solving d˚ D dt



@f> @x

> ˚;

with ˚(t f ) D I;

where I is the (n × n) identity matrix.

(10)

(5)

(9)

where a stabilizing parameter is introduced to avoid overstepping. A convenient way of measuring the deviation from the given initial state is to define the error as the Euclidean norm





e ( j) D x( j) (0)  x(0) :

(11)

Once the error is sufficiently small, say less than 106 , then the iteration procedure can be stopped. The algorithm for boundary condition iteration may thus be presented as follows:  Choose an initial value for the final state x(1) (t f ) and a value for ; set the iteration index j to 1.  Integrate (1), (2), (5) and (9) backwards from t = t f to t = 0, using for control (7). (2) is not needed for the algorithm, but it will give the performance index.  Evaluate the error in the initial state from (11), and if it is less than the specified value, end the iteration.  Increment the iteration index j by one. Choose a new value for the final state x(j) (t f ) from (10) and go to step 2. The procedure is therefore straightforward, since the equations are all integrated in the same direction. Furthermore, there is no need to store any variables over the trajectory. There is the added advantage that the control appears as a continuous variable, and therefore the accuracy of results will not depend on the size of the integration time step. Theoretically the results should be as good as can be obtained by the second variation method in control vector iteration. It is important to realize, however, that the Hamiltonian must be well behaved, so that (7) can be obtained analytically. The only drawback is the potential instability since the state equation and the sensitivity equation are integrated backwards, and problems may arise if the final time t f is too large. For many problems in chemical engineering the BCI method can be easily applied as is shown in the following example.

Boundary Condition Iteration BCI

Illustration of the Boundary Condition Iteration Procedure Let us consider the nonlinear continuous stirred tank reactor that has been used for optimal control studies in [4, pp. 308–318], and which was shown in [13] to exhibit multiplicity of solutions. The system is described by the two equations dx1 D 2(x1 C 0:25) dt   25x1  u(x1 C 0:25); C (x2 C 0:5) exp x1 C 2   dx2 25x1 D 0:5  x2  (x2 C 0:5) exp ; dt x1 C 2

(12)

(13)

with the initial state x1 (0) = 0.09 and x2 (0) = 0.09. The control u is a scalar quantity related to the valve opening of the coolant. The state variables x1 and x2 represent deviations from the steady state of dimensionless temperature and concentration, respectively. The performance index to be minimized is Z ID 0

tf

(x12 C x22 C 0:1u 2 ) dt;

(14)

The equations for the transition matrix are: d˚11 dt d˚12 dt d˚21 dt d˚22 dt

C z2 (0:5  x2  R) C x12 C x22 C 0:1u 2 ;

@ f1 @x1 @ f1 @x2 @ f2 @x1 @ f2 @x2

D 2 C

where R = (x2 + 0.5) exp (25 x1 /(x1 + 2)). The adjoint equations are dz1 (z2  z1 ) D (u C 2)z1  2x1 C 50R ; dt (x1 C 2)2

(16)

dz2 (z2  z1 ) D 2x2 C R C z2 : dt (x2 C 0:5)

(17)

The gradient of the Hamiltonian is @H D 0:2u  (x1 C 0:25)z1 ; @u

(18)

so the optimal control is given by u D 5(x1 C 0:25)z1 :

(19)

@ f1 ˚21 ; @x2 @ f1 ˚22 ; @x2 @ f2 ˚21 ; @x2 @ f2 ˚22 @x2

50R  u; (x1 C 2)2

R ; (x2 C 0:5) 50R D ; (x1 C 2)2 R : D 1 C (x2 C 0:5) D

The adjustment of the final state is carried out by the following two equations: ( jC1)

x1

( jC1)

(15)

@ f1 ˚11 C @x1 @ f1 D ˚12 C @x1 @ f2 D ˚11 C @x1 @ f2 D ˚12 C @x1 D

where

where the final time t f = 0.78. The Hamiltonian is H D z1 (2(x1 C 0:25) C R  u(x1 C 0:25))

B

x2

h ( j) ( j) (t f ) D x1 (t f ) C ˚11 (0)(x1 (0)  x1 (0)) i ( j) C˚12 (0)(x2 (0)  x2 (0)) ; (20) h ( j) ( j) (t f ) D x2 (t f ) C ˚21 (0)(x1 (0)  x1 (0)) i ( j) C˚22 (0)(x2 (0)  x2 (0)) : (21)

To illustrate the computational aspects of BCI, the above algorithm was used with a Pentium-120 personal computer using WATCOM Fortran compiler version 9.5. The calculations were done in double precision. When the performance index is included, there are 9 differential equations to be integrated backwards at each iteration. Standard fourth order Runge–Kutta method was used for integration with a stepsize of 0.01. For stability, it was found that had to be taken of the order of 0.1. For all the runs, therefore, this value of was used. As is shown in Table 1, to get the error less than 106 , a large number of iterations are required, but the computation time is quite reasonable. The optimal value of the performance index is very close to the value I = 0.133094 reported in [13] with the second variation

315

316

B

Boundary Condition Iteration BCI

Boundary Condition Iteration BCI, Table 1 Application of BCI to CSTR Initial choice x1 (tf ) D x2 (tf )

where

Performance index

Number of CPU time s iterations

0:045

0:133095

2657

13:9

0:00

0:133097

2858

14:9

0:045

0:133097

2931

15:3

0:01

0:133097

2805

14:7

 P D x( jnC1) (t f )

    x( j) (t f ) ;

(23)

    x( j) (0) :

(24)

and  Q D x( jnC1) (0)

The transformation matrix A D PQ1

method and is essentially equivalent to I = 0.133101 obtained in [6] by using 20 stages of piecewise linear control with iterative dynamic programming. By refining the error tolerance to e < 108 required no more than an additional thousand iterations with an extra expenditure of about 6 seconds of computation time in each case. Then the final value of the performance index for each of the four different initial starting points was I = 0.133096. Now that computers are very fast and their speed is rapidly being improved, and computation time is no longer prohibitively expensive, the large number of iterations required by BCI should not discourage one from using the method. Since the control policy is directly inside the integration routine, equivalent results to those obtained by second variation method can be obtained. The number of equations, however, to be integrated is quite high with a moderately high-dimensional system. If we consider a system with 10 state variables, there are 121 differential equations to be integrated simultaneously. Although computationally this does not represent a problem, the programming could be a challenge to derive and enter the equations without error. Therefore, BCI methods for which the (n × n) transition matrix is not used may find a more widespread application. One possible approach is now presented.

Sensitivity Information Without Evaluating the Transition Matrix Suppose at iteration j we have n sets of final states x(j  n + 1) (t f ), . . . , x(j) (t f ) with corresponding values for the initial state obtained by integration x(j  n + 1) (0), . . . , x(j) (0). Then we can write the transformation P D AQ;

(22)

(25)

and the next vector at t f is chosen as x( jC1) (t f ) D Ax(0):

(26)

(1) and (5) are integrated backward to obtain x(j + 1) (0), and the matrices P and Q are updated and the procedure continued. If the initial guesses are sufficiently close to the optimal, very rapid convergence is expected.

1

2 3 4

5

Pick n sets of values for x(t f ) and integrate (1) and (5) backward from t = t f to t = 0. using (7) for control, to give n sets of initial state vectors. From these two sets of vectors form the (n  n) matrices P and Q. Calculate A from (25), and calculate a new vector x( j+1) (t f ) from (26). With the vector from Step 3 as a starting condition, integrate (1) and (5) backward to give x( j+1) (0). Use the vectors in Steps 3 and 4 to replace x( jn+1) (t f ) and x( jn+1) (0) imn matrices P and Q and continue until the error as calculated from (11) is below some tolerance, such as 108 .

Boundary Condition Iteration BCI, Algorithm

For good starting conditions, one may use iterative dynamic programming (IDP) [9], and pick the final states obtained after each of the first n passes. F. Hartig and F.J. Keil [2] found that in the optimization of spherical reactors, IDP provided excellent values which were refined by the use of sequential quadratic programming. For convergence here we need good starting conditions. This is now illustrated with the above example.

Boundary Condition Iteration BCI

By using IDP, as described in [6,7,8] for piecewise linear continuous control, with 3 randomly chosen points and 10 iterations per pass for piecewise linear control with 15 time stages, the data for the first four passes in Table 2 give good starting conditions for BCI. By using as starting conditions the final states obtained in passes 1 and 2 as given in Table 2, the convergence is very fast with the above algorithm as is shown in Table 3. Only 9 iterations are required to yield I = 0.133096. As expected, if the initial set of starting points is better, then the convergence rate is also better as is seen in comparing Table 4 to Table 3. However, in each case the total computation time was only 0.05 seconds on a Pentium-120. Taking into account that it takes 0.77 seconds to generate the initial conditions with IDP, it is observed that the optimum is obtained in less than 1 second of computation time. Therefore, BCI is a very useful procedure if (6) can be solved explicitly for the control and the final time tf is not too large. Simple constraints on control can be readily handled by clipping technique, as shown in [12]. Further examples with this approach are given in [10]. Boundary Condition Iteration BCI, Table 2 Results of the first four passes of IDP Pass no. Perf. index x1 (tf )

x2 (tf )

CPU time s

1

0:1627

0:05359 0:13101 0:39

2

0:1415

0:01940 0:05314 0:77

3

0:1357

0:05014 0:09241 1:16

4

0:1334

0:05670 0:10084 1:54

Boundary Condition Iteration BCI, Table 3 Convergence with the above algorithm from the starting points obtained in passes 1 and 2 by IDP Iteration no. Perf. index Error " 1

0:014520

0:1215

2

0:031301

0:1031

3

0:129568

0:1852  102

4

0:136682

0:2414  102

5

0:135079

0:1350  102

6

0:133218

0:8293  104

7

0:133093

0:2189  105

8

0:133096

0:1373  106

9

0:133096

0:5209  108

B

Boundary Condition Iteration BCI, Table 4 Convergence with the above algorithm from the starting points obtained in passes 3 and 4 by IDP Iteration no. Perf. index Error " 1

0:121769

0:7353  102

2

0:135249

0:1415  102

3

0:133317

0:1531  103

4

0:133138

0:2861  104

5

0:133094

0:1703  105

6

0:133096

0:1190  107

7

0:133096

0:5364  1010

See also  Control Vector Iteration

References 1. Denn MM, Aris R (1965) Green’s functions and optimal systems – Necessary conditions and an iterative technique. Industr Eng Chem Fundam 4:7–16 2. Hartig F, Keil FJ (1993) Large scale spherical fixed bed reactors – modelling and optimization. Industr Eng Chem Res 32:57–70 3. Jaspan RK, Coull J (1972) Trajectory optimization techniques in chemical engineering. II. Comparison of the methods. AIChE J 18:867–869 4. Lapidus L, Luus R (1967) Optimal control of engineering processes. Blaisdell, Waltham 5. Luus R (1974) BCI vs. CVI. AIChE J 20:1039–1040 6. Luus R (1993) Application of iterative dynamic programming to very high dimensional systems. Hungarian J Industr Chem 21:243–250 7. Luus R (1993) Piecewise linear continuous optimal control by iterative dynamic programming. Industr Eng Chem Res 32:859–865 8. Luus R (1996) Numerical convergence properties of iterative dynamic programming when applied to high dimensional systems. Chem Eng Res Des 74:55–62 9. Luus R (1998) Iterative dynamic programming: from curiosity to a practical optimization procedure. Control and Intelligent Systems 26:1–8 10. Luus R (2000) Iterative dynamic programming. Chapman and Hall/CRC, London 11. Luus R (2000) A new approach to boundary condition iteration in optimal control. In: Proc. IASTED Internat. Conf. Control and Applications, Cancun, Mexico, May 24–27, 2000, pp 172–176 12. Luus R (2001) Further developments in the new approach to boundary condition iteration in optimal control. Canad J Chem Eng 79:968–976

317

318

B

Bounding Derivative Ranges

13. Luus R, Cormack DE (1972) Multiplicity of solutions resulting from the use of variational methods in optimal control problems. Canad J Chem Eng 50:309–311 14. Rao SN, Luus R (1972) Evaluation and improvement of control vector iteration procedures for optimal control. Canad J Chem Eng 50:777–784

Bounding Derivative Ranges GEORGE F. CORLISS1 , L. B. RALL2 1 Marquette University, Milwaukee, USA 2 University Wisconsin–Madison, Madison, USA MSC2000: 90C30, 90C26 Article Outline Keywords Evaluation of Functions Monotonicity Taylor Form Intersection and Subinterval Adaptation Software Availability See also References Keywords Interval arithmetic; Automatic differentiation; Taylor series Interval arithmetic can be used to bound the range of a real function over an interval. Here, we bound the ranges of its Taylor coefficients (and hence derivatives) by evaluating it in an interval Taylor arithmetic. In the context of classical numerical methods, truncation errors, Lipschitz constants, or other constants related to existence or convergence assertions are often phrased in terms of bounds for certain derivatives. Hence, interval inclusions of Taylor coefficients can be used to give guaranteed bounds for quantities of concern to classical methods. Evaluating the expression for a function using interval arithmetic often yields overly pessimistic bounds for its range. Our goal is to tighten bounds for the range of f and its derivatives by using a differentiation arithmetic for series generation. We apply monotonicity and

Taylor form tests to each intermediate result of the calculation, not just to f itself. The resulting inclusions for the range of derivative values are several orders of magnitude tighter than bounds obtained from differentiation arithmetic and interval calculations alone. Tighter derivative ranges allow validated applications such as optimization, nonlinear equations, quadrature, or differential equations to use larger steps, thus improving their computational efficiency. Consider the set of q times continuously differentiable functions on the real interval x D [x; x] denoted by f (x) 2 Cq [x]. We wish to compute a tight inclusion for n o R( f (p) ; x) :D f (p) (x) : x  x  x ; (1) where p  q. We assume that f is sufficiently smooth for all indicated computations, and that all necessary derivatives are computed using automatic differentiation (cf. [5],  Automatic differentiation: Point and interval Taylor operators). Computing an inclusion for the range of f (p) is a generalization of the problem of computing an inclusion for the range of f , R(f ; x). Moore’s natural interval extension [3] gives an inclusion which is often too gross an overestimation to be practical. H. Ratschek and J. Rokne [8] gives a number of improved techniques and many references. The approach of this paper follows from two papers of L.B. Rall [6,7] and from [1]. Taken together, Rall’s papers outline four approaches to computing tight inclusions of R(f ; x), which we apply to derivatives:  monotonicity,  mean value and Taylor forms,  intersection, and  subinterval adaptation. We apply the monotonicity tests and the Taylor form to each term of the Taylor polynomial of a function. Whenever we compute more than one enclosure for a quantity, either a derivative or an intermediate value, we compute intersections of all such enclosures. We apply these tests to each intermediate result of the calculation, not just to f itself. The bounds we compute for R(f (p) ; x) are often several orders of magnitude tighter than bounds computed from natural interval extensions. In one example, we improve the interval inclusion for R(f (10) ; x) from [ 3.8E10, 7.8E10] (width = 1.1E11) to [ 2.1E03, 9.6E03] (width = 1.1E04).

Bounding Derivative Ranges

This improvement by a factor of 107 allows a Gaussian quadrature using 5 points per panel or a 10th order ODE solver (applications for which bounds for R(f (10) ; x) might be needed) to increase their stepsizes, and hence their computational efficiency, by a factor of 107/10 5. We discuss the evaluation of a function from a code list representation (see also  Automatic differentiation: Point and interval Taylor operators). Then we discuss how monotonicity tests and Taylor form representations can be used to give tighter bounds for R(f (p) ; x). Evaluation of Functions Functions are expressed in most computer languages by arithmetic operations and a set ˚ of standard functions, for example, ˚ = {abs, arctan, cos, exp, ln, sin, sqr, sqrt}. A formula (or expression) can be converted into a code list or computational graph {t 1 , . . . , t n } (cf. [5],  Automatic differentiation: Point and interval Taylor operators). The value of each term t i is the result of a unary or binary operation or function applied to constants, values of variables, or one or two previous terms of the code list. For example, the function f (x) D

x 4  10x 2 C 9 x 3  4x  5

:= sqr(x); := sqr(t1 ); := 10  t1 ; := t2  t3 ; := t4 + 9;

serve for the computation of f (x) in real, complex, interval, or differentiation arithmetic. When x is an interval, one gets an interval inclusion f (x) of all real values f (x) for real x 2 x [3,4]. The process of automatic differentiation to obtain derivatives or Taylor coefficients of f (x) can be viewed as the evaluation of the code list for f (x) using a differentiation arithmetic in which the arithmetic operations and standard functions are defined on the basis of the well-known recurrence relations for Taylor coefficients (cf. also [3,4,5],  Automatic differentiation: Point and interval Taylor operators). Let (f )i := f (i) (xˇ)/i! be the value of the ith Taylor coefficient of f (x) = f (xˇ C h). Then we can express a Taylor series as f (x) D

1 X iD0

1

f (i) (xˇ )

X hi D ( f )i h i ; i! iD0

and the elements of Taylor series arithmetic are vectors f = ((f )0 , . . . , (f )p ). In Taylor arithmetic, constants c have the representation c = (c, 0, . . . , 0), and x = (x0 , 1, 0, . . . , 0) represents the independent variable x = x0 + h. For example, multiplication f (x) = u(x)  v(x) of Taylor variables is defined in terms of the Taylor coefficients of P u and v by (f )i = ijD0 (u)j  (v)i  j , i = 0, . . . , p. Monotonicity

can be converted into the code list t1 t2 t3 t4 t5

B

t6 t7 t8 t9 t10

:= x  t1 ; := 4  x; := t6  t7 ; := t8  5; := t5 /t9 :

Bounding Derivative Ranges, Figure 1 Code list

The final term t n of the code list (t 10 in this case) gives the value of f (x), if defined, for a given value of the variable x. The conversion of a formula into an equivalent code list can be carried out automatically by a computer subroutine. The code list serves equally well for various kinds of arithmetic, provided the necessary arithmetic operations and standard functions are defined for the type of elements considered. Thus, the code list in Fig. 1 can

We extend an idea of R.E. Moore for using monotonicity [4]: we check for the monotonicity of every derivative of f and of every intermediate function t i from the code list. If the ith derivative of f is known to be of one sign on x (R (f (i) ; x)  0 or  0), then f (i  1) is monotonic on the interval x, and its range is bounded by the real values f (i1) (x) and f (i1) (x). This is important because the bounds of R(f (i  1) ; x) by f (i1) (x) and f (i1) (x) may be tighter than the bounds computed by the naive interval evaluation of f (i  1) (x). Hence, in addition to the ranges R(f (i) ; x), we propagate enclosures of the values at the endpoints R( f (i) ; x) and R( f (i) ; x) so that those values are available. (We use R( f (i) ; x) and R( f (i) ; x) instead of R( f (i); x) and R( f (i) ; x) to denote that f (i) at the endpoints is evaluated in interval arithmetic.) Similarly, if R(f (i) ; x)  0 (or  0), then (i  2) is convex (resp. concave), and its maximum f value is max( f (i2) (x); f (i2) (x)) (resp. minimum is min( f (i2) (x); f (i2) (x))).

319

320

B

Bounding Derivative Ranges

Bounding Derivative Ranges, Figure 2 R (f (3) ; x)  0 implies f is monotonic and f 0 is convex

We apply the monotonicity test to each term of each intermediate result because an intermediate result may be monotonic when f is not. Further, by proceeding with tighter inclusions for the terms of the intermediate results, we reduce subsequent over-estimations and improve our chances for validating the monotonicity of higher derivatives. If f (i  1) is found to be monotonic, the tightened enclosure for R(f (i  1) ; x) may allow us to validate R(f (i  1) ; x)  0 (or  0), so we backtrack to lower terms of the series as long as we continue to find monotonicity. In the recurrence relations for divide and for all of the standard functions, the value of f (i) (x) depends on the value of f (i  1) (x). Hence, if the enclosure for R(f (i  1) ; x) is tightened, we recompute the enclosure for R(f (i) ; x) and all subsequent terms. Table 1 shows (some of) the results when the monotonicity test is applied to each of the intermediate results of f (x) D

x 4  10x 2 C 9 x 3  4x  5

on the interval x := [1, 2]. Each row shows enclosures for Taylor coefficients. The row ‘x’ has two entries for the function x evaluates on the interval x and its deriva-

tive. All higher-order derivatives are zero. Similarly, rows t 4 and t 5 have five nonzero derivatives. A few entries show where tightening occurs because of the monotonicity test. For example, at 1 , the 3rd derivative of t 4 is positive. Hence, t 4 is monotonic, but that knowledge yields no tightening. Also t 4 0 is convex, a fact which does allow us to tighten the upper bound from 12 to 8. Similarly at 2 , finding that t 8 is positive allows us to improve the upper bound for t 8 . In this example, the monotonicity tests allow us only two relatively modest tightenings, but those two tighter values propagate through the recurrences to reduce the width of the bound finally computed for t (5) 10 from about 2.3E6 to 300, an improvement of nearly a factor of 104 .

Taylor Form In [6], Rall proves that if xˇ 2 x, then

R( f ; x)  F p (x) :D

p1 X ( f ) i (x  xˇ ) i iD0

C F (p)

(x)(x  xˇ ) p ; (2) p!

Bounding Derivative Ranges

B

Bounding Derivative Ranges, Table 1 Numerical results of applying monotonicity tests

x

x(x) [1, 2] 1 t1 := x 2 t1 (x) [1, 4] [2, 4] t2 := t12 = x 4 t2 (x) [1, 16] [4, 32] t3 := 10 t1 = 10x 2 t3 (x) [10, 40] [20, 40] t4 := t2  t3 = x 4  10x 2 t4 (x) [39, 6] [36, 12] Tightened to: [24, -9] [36, -8] t4 (x) t5 := t4 + 9 = x 4  10x 2 + 9 [30, 15] [36, 12] t5 (x) Tightened as the result of tightening t4 : t5 (x) [15, 0] [36, -8] ::: t8 := t6  t7 = x 3  4x t8 (x) [7, 4] [1, 8] Tightened to: [7, 0] [1, 8] t8 (x) t9 := t8  5 = x 3  4x  5 t9 (x) [12, -1] [1, 8] Tightened as the result of tightening t8 : [12, -5] [1, 8] t9 (x) t10 := t5 /t9 = f t10 (x) [15, 30] [132, 276]

1 [6, 24]

[4, 8]

1

[4, 14]

[4, 8]1

1

[4, 14]

[4, 8]

1

[4, 14]

[4, 8]

1

[4, 14]

[4, 8]

1

[3, 6]2

1

[3, 6]

1

[3, 6]

1

[3, 6]

1

10

[10097, 20823] [87881, 181229] [1160, 2392] [764851, 1577270] Tightened as the result of tightening t5 and t9 : t10 (x) [9.73E0.6, 3.01] [9.68, 51.97] [0.41, 12.01] [21.84, 113.67] [5.21, 23.61] [47.58, 248.96]

and F (p) is an interval extension of f (p) . The F p given by (2) is called the (elementary) Taylor form of f of order p. We expand the Taylor series for the function f and all intermediate functions ti appearing in the code list at three points, x = a := x, x = c := midpoint (x), and x D b :D x. The series for f at x and x are already available since they were computed for the monotonicity test. The extra work required to generate the series at c is often justified because the midpoint form is much narrower than either of the endpoint forms. Let h :=

width (x). We compute the Taylor form (2) for f and each t i at the left endpoint, center, and right endpoint to all available orders and intersect. The remainder using R(f (i + 1) ; x) has the potential for tightening all previous terms:  h2 R( f ; x)  f (a) C f 0 (a)h[0; 1] C f 00 (a) [0; 1] 2! i h C    C f (i) (a) [0; 1] i!  iC1 h (iC1) CR( f

[0; 1] ; x) (i C 1)!

321

322

B

Bounding Derivative Ranges

Bounding Derivative Ranges, Table 2 Numerical results of applying Taylor from tests

x

x(x) [1,2] t1 = x 2 t1 ([a; b]) [1, 4] No tightening occurs. ::: t4 = t2  t3 = x 4  10x 2 t4 (x) [39, 6]

1 [2, 4]

[36, 12] [24, 12] [24, -2.5] [20, -7] [20, -8] tightened by tightened by tightened by

[29, -9] [27.438, -9] [24, -9] Tightened to : [24, -9] [20, -8] t4 (x) t5 = t4 + 9 = x 4  10x 2 + 9 t5 (x) [30, 15] [36, 12] Tightened as the result of tightening t4 : t5 (x) [15, 0] [20, -8] t8 = t6  t7 = x 3  4x t8 (x) [7, 4] [1, 8] Tightened to: [4, 0] [1, 8] t8 (x) t10 = t5 /t9 t10 (x) [15, 30] [132, 276]

1

[4, 14] [4, 8] 1 tightened by f (a) using Fab(2) tightened by f (c) using Fab(2) tightened by f (c) using Fab(3) tightened by f (c) using Fab(4) f (a) using Fab(1) f (c) using Fab(1) f (b) using Fab(1) [4, 14]

[4, 8]

1

[4, 14]

[4, 8]

1

[4, 14]

[4, 8]

1

[3, 6]

1

[3, 6]

1 [10097, 20823] [87881, 181229] [764851, 1577270]

[1160, 2392] Tightened as the result of tightening t5 and t9 : t10 (x) [9.73E06, 3.01] [8.57, 39.94] [0.55, 8.81] [19.27, 87.64] [4.57, 18.49] [42.02, 191.84] [6.08E06, 3.01] tightened by f(c) using Fab(1)

 \

[1; 1] h 2 [1; 1] C f 00 (c)

2 2! 4 i i [1; 1] h C    C f (i) (c)

i! 2i  iC1 [1; 1] iC1 h

CR( f (iC1) ; x) (i C 1)! 2 iC1 f (c) C f 0 (c)h

 \

f (b) C f 0 (b)h [1; 0] C f 00 (b)

h2

[0; 1] 2!

hi

[1; 0] i i!  h iC1

[1; 0] iC1 : CR( f (iC1) ; x) (i C 1)!

C    C f (i) (b)

Bounding Derivative Ranges

Bounding Derivative Ranges, Figure 3 Taylor polynomial enclosures for f , remainders from naive interval evaluation

Bounding Derivative Ranges, Figure 4 Taylor polynomial enclosures for f , remainders tightened by Taylor form

B

323

324

B

Bounding Derivative Ranges

For higher-order derivatives, R(f (i) ; x) is contained in similar Taylor forms involving f (i + n) (x), for n > 0. We apply the Taylor form to each intermediate result. Except for the operators +, , , and sqr, whenever one term is tightened, all following terms can be recomputed more tightly. This can result in an iterative process which is finite only by virtue of Moore’s theorem on interval iteration [4]. In practice then, we restrict the number of times subsequent terms are recomputed starting at a given order. Table 2 shows (some of) the results when the Taylor form is applied to each of the intermediate results of f (x) D

x 4  10x 2 C 9 x 3  4x  5

on the interval x := [1, 2]. ‘Tightened by f (a), f (c), or f (b)’ indicates whether the left endpoint, the midpoint, or the right endpoint expansion was used. ‘Using Fab(n)’ indicates that f (i) was tightened using f (i + n) . The pattern of Table 2 is typical: Most Taylor forms give no tightening; there are many small improvements; and the compound effect of many small improvements is significant. Here we have reduced the width of the enclosure for the 6th Taylor coefficient from about 2.3E7 to 2.3E2. Figures 3 and 4 compare the Taylor polynomial enclosures for f resulting from naive interval evaluation of the remainders with the enclosures tightened by the Taylor form computations shown in Table 2. For this example, the bounds achieved using the Taylor form are tighter than those achieved using the monotonicity test. For other examples, the monotonicity test performs better. Hence in practice, we apply both techniques. If the expression for f is rewritten in a mathematically equivalent form to yield tighter interval bounds for R(f ; x), the techniques of this paper can still be used profitably to tighten enclosures of higher derivatives.

Intersection and Subinterval Adaptation The third general technique described by Rall for tightening enclosures of R(f ; x) is to intersect all enclosures for each quantity, as we have done here. That is, whatever bounds for R(f (i) ; x) we compute using monotonicity or Taylor form of any degree, we intersect with the

tightest bound previously computed. Each new bound may improve our lower bound, our upper bound, both, or neither. Some improvements are large. Others are so small as to seem insignificant, but even the smallest improvements may be magnified by later operations. Rall’s fourth technique is the adaptive partitioning of the interval x. The over-estimation of R(f (i) ; x) by naive interval evaluation decreases linearly with width (x), while the over-estimation by the Taylor form decreases quadratically. Hence, partitioning x into smaller subintervals is very effective. However, we view subinterval adaptation as more effectively controlled by the application (e. g., optimization, quadrature, DE solution) than by the general-purpose interval Taylor arithmetic outlined here. Hence, we do not describe it further. Software Availability An implementation in Ada of interval Taylor arithmetic operators for +, , , /, and sqr is available at [9]. Similar implementations could be written in Fortran 90, C++, or any other language supporting operator overloading. See also  Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators  Global Optimization: Application to Phase Equilibrium Problems  Interval Analysis: Application to Chemical Engineering Design Problems  Interval Analysis: Differential Equations  Interval Analysis: Eigenvalue Bounds of Interval Matrices  Interval Analysis: Intermediate Terms  Interval Analysis: Nondifferentiable Problems  Interval Analysis: Parallel Methods for Global Optimization  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods  Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility

Bounds and Solution Vector Estimates for Parametric NLPS

 Interval Constraints  Interval Fixed Point Theory  Interval Global Optimization  Interval Linear Systems  Interval Newton Methods References 1. Corliss GF, Rall LB (1991) Computing the range of derivatives. In: Kaucher E, Markov SM, Mayer G (eds) Computer Arithmetic, Scientific Computation and Mathematical Modelling. IMACS Ann Computing Appl Math. Baltzer, Basel, pp 195–212 2. Gray JH, Rall LB (1975) INTE: A UNIVAC 1108/1110 program for numerical integration with rigorous error estimation. MRC Techn Summary Report Math Res Center, Univ Wisconsin–Madison 1428 3. Moore RE (1966) Interval analysis. Prentice-Hall, Englewood Cliffs, NJ 4. Moore RE (1979) Methods and applications of interval analysis. SIAM, Philadelphia 5. Rall LB (1981) Automatic differentiation: techniques and applications. Lecture Notes Computer Sci, vol 120. Springer, Berlin 6. Rall LB (1983) Mean value and Taylor forms in interval analysis. SIAM J Math Anal 2:223–238 7. Rall LB (1986) Improved interval bounds for ranges of functions. In: Nickel KLE (ed) Interval Mathematics (Freiburg, 1985). Lecture Notes Computer Sci, vol 212. Springer, Berlin, pp 143–154 8. Ratschek H, Rokne J (eds) (1984) Computer methods for the range of functions. Horwood, Westergate 9. Website: www.mscs.mu.edu/~georgec/Pubs/eoo_da.tar.gz

Bounds and Solution Vector Estimates for Parametric NLPS VIVEK DUA, EFSTRATIOS N. PISTIKOPOULOS Imperial College, London, UK MSC2000: 90C31

B

Keywords Sensitivity analysis; Linear approximation; Parametric upper and lower bounds In this article, we present some important theoretical results based upon which solution of parametric nonlinear programming problems can be approached. The need for these results arises from the fact that while stability, continuity and convexity properties of objective function value for linear programs are readily available [7], their counterparts in nonlinear programs are valid only for a special class of nonlinear programs. It is not surprising then that a large amount of research has been devoted towards establishing these conditions (see [1] and [3] for a comprehensive list of references). Further, due to the existence of strong duality results for linear models, parametric programming can be done by extending the simplex algorithm for linear models [6]. On the other hand, for nonlinear programs the parametric solution is given by an approximation of the optimal solution. This approximation or estimation of the optimal solution can be achieved by obtaining the optimal solution as a function of parameters. In order to derive these results we first state the following implicit function theorem: Theorem 1 (see for example [3,8]) Suppose that (x, ) is a (r × 1) vector function defined on En × Em , with x 2 En and  2 Em , and Dx (x, ) and D (x, ) indicate the (r × n) and (r × m) matrix of first derivatives with respect to x and  respectively. Suppose that : Em + n ! En . Let  (x, ) be continuously differentiable in x and  in an open set at (x0 ,  0 ) where (x0 ,  0 ) = 0. Suppose that Dx (x0 ,  0 ) has an inverse. Then there is a function x() defined in a neigh in that neighborhood borhood of  0 where for each b [x(b ); b ] D 0. Furthermore, x() is a continuously differentiable function in that neighborhood and

Article Outline

x 0 (0 )

Keywords Parametric Lower Bound Parametric Upper Bound See also References

D D x [x(0 ); 0 ]1 D [x(0 ); 0 ] D D x (x0 ; 0 )1 D (x0 ; 0 ); where x0 ( 0 ) denotes the derivative of x evaluated at  0 .

325

326

B

Bounds and Solution Vector Estimates for Parametric NLPS

Consider the parametric nonlinear programming problem of the following form: 8 z() D min f (x; ) ˆ ˆ x ˆ ˆ < s.t. g i (x; )  0; i D 1; : : : ; p; (1) ˆ ˆ h j (x; ) D 0; j D 1; : : : ; q; ˆ ˆ : x 2 X; where f , g and h are twice continuously differentiable in x and . The first order KKT conditions for (1) are given as follows: r f (x; ) 

p X

 i r g i (x; ) C

iD1

 i g i (x; ) D 0; h j (x; ) D 0;

q X

 j r h j (x; ) D 0;

jD1

j D 1; : : : ; q :

An application of the implicit function theorem 1 to the KKT conditions (2) results in the following basic sensitivity theorem: Theorem 2 ([2,3,8]) Let  0 be a vector of parameter values and (x0 , 0 , 0 ) a KKT triple corresponding to (2), where 0 is nonnegative and x0 is feasible in (1). Also assume that: i) strict complementary slackness holds; ii) the binding constraint gradients are linearly independent; iii) the second order sufficiency conditions hold. Then, in neighborhood of  0 , there exists a unique, once continuously differentiable function [x(), (), ()] satisfying (2) with [x( 0 ), ( 0 ), ( 0 )] = (x0 , 0 , 0 ), where x() is a unique isolated minimizer for (1), and 0 1

where

D (M0 )1 N0 ;

(3)

1 r 2 L r g1    r g p r h1    r h q C B r> g g1 1 C B 1 C B :: :: C B : : C B C B M0 D B p r > g p gp C C B > C B r h1 C B :: C B A @ : > r hr 0

N0 D (r 2 x L; 1 r > g1 ; : : : ;  p r > g p ; r > h1 ; : : : ; r > h q )> ; L(x; ; ; ) D f (x; ) C

p X

 i g i (x; ) C

q X

iD1

 j h j (x; ):

jD1

However, for a special case of (1) when the parameters are present on the right-hand side of the constraints, (1) can be rewritten in the following form: 8 ˆ z() ˆ
y> 2 Rn

purpose of evaluating derivatives in the first place. The interpretation of C as computational graph goes back to L.V. Kantorovich and requires a little more explanation. The Computational Graph With respect to the precedence relation ji

()

c i j 6 0

()

( j; i) 2 E ;

the indices i, j 2 V [1  n, . . . , l + m] form a directed graph with the edge set E. Since by assumption j  i implies j < i the graph is acyclic and the transitive closure of  defines a partial ordering between the corresponding variables vi and vj . The minimal and maximal elements with respect to that order are exactly the independent and dependent variables vj  n xj with j = 1, . . . , n and the vm+ i yi with i = 1, . . . , m, respectively. For the two stranded chain scenario with l = 3 one obtains a computational graph of the following form:

for y> 2 Rm ;

using just one multiplication and addition per cij 6 0. So if our goal is the iterative calculation of an approximate Newton-step using just a few matrix-vector products, we are well advised to just work with the collection of nonzero entries of C provided it can be kept in memory. If on the other hand we expect to take a large number of iterations or wish to compute a matrix factorization of the Jacobian we have to first accumulate all mn partial derivatives @yî / @xˆj from the elemental partials cij . It is well understood that a subsequent inplace triangular factorization of the Jacobian F 0 (x) yields an ideal representation if one needs to multiply itself as well as its inverse by several vectors and matrices from the left or right. Hence we have at least three possible ways in which a Jacobian can be represented and kept in storage:  unaccumulated: computational graph;  accumulated: rectangular array;  factorized: two triangular arrays. Here the arrays may be replaced by sparse matrix structures. For the time being we note that Jacobians and Hessians can be provided in various representation at various costs for various purposes. Which one is most appropriate depends strongly on the structure of the problem function F(x) at hand and the final numerical

Assuming that all elemental ' i are unary functions or binary operations we find |E|  2(l+m)  l. One may always annotate the graph vertices with the elemental functions ' i and the edges with the nonvanishing elemental partials cij . For most purposes the ' i do not really matter and we may represent the graph (V, E) simply by the sparse matrix C. Forward Mode Given some vector x˙ (˙v jn )j = 1, . . . , n 2 Rn , there exist derivatives ˇ ˇ d v i (x C ˛ x˙ )ˇˇ v˙i for 1  i  l C m : d˛ ˛D0 By the chain rule these v˙i satisfy the recurrence X v˙i c i j v˙ j for i D 1; : : : ; l C m :

(2)

j i

The resulting tangent vector y˙ (˙v l Ci )i = 1, . . . , m satisfies y˙ = F0 (x)x˙ and it is obtained at a cost propor-

C

Complexity of Gradients, Jacobians, and Hessians

tional to l. Instead of propagating derivatives with respect to just one direction vector x˙ one may amortize certain overheads by bundling p of them into a matrix X˙ 2 Rn × p and then computing simultaneously Y˙ = F 0 (x) X˙ 2 Rm × p . The cost of this vector forward mode of automatic differentiation is given by prog

forw ˙  pl  p OPSfx 7! yg : OPSfC 7! Yg

(3)

If the columns of X˙ are Cartesian basis vectors ej 2 Rn the corresponding columns of the resulting Y˙ are the jth columns of the Jacobian. Hence by setting X˙ = I with p = n we may compute the whole Jacobian at a temporal complexity proportional to nl. Fortunately, in many applications the whole Jacobian is either not needed at all or due to its sparsity pattern it may be reconstructed from its compression Y˙ = F 0 (x) X˙ for a suitable seed ma˙ As in the case of difference quotients this matrix trix X. may be chosen according to the Curtis–Powell–Reid [6] or the Newsam–Ramsdell [12] approach with p usually close to the maximal number of nonzeros in any row of the Jacobian. Bauer’s Formula Using the recurrence for the v˙i given above one may also obtain an explicit expression for each individual partial derivative @yi / @xj . Namely, it is given by the sum over the products of all arc values cˆ{ |ˆ along all paths connecting the minimal node vjn with the maximal node vl+i . This formula due to F.L. Bauer [1] implies in particular that the ijth Jacobian entry vanishes identically exactly when there is no path connecting nodes j  n and l + i in the computational graph. In general the number of distinct paths in the graph is very large and it represents exactly the lengths of the formulas obtained if one expresses each yi directly in terms of all xj that it depends on. Hence we may conclude bauer

formul

OPSfC 7! F 0 g  OPSfx 7! yg :

celebrated alternative is the reverse or backward mode of automatic differentiation. Reverse Mode Rather than propagating directional derivatives v˙i forward through the computational graph one may also propagate adjoint quantities v i backward. To define them properly one must perturb the original evaluation loop by rounding errors ı i so that now v i D ı i C ' i (v j ) j and

Y D e> m :

Then X D Y F 0 (x) is the last row of the arrowhead matrix F 0 (x) and the two columns of Y˙ = F 0 (x) X˙ contain all other nonzero entries. For pure row or column compression dense rows or columns always force p = n or q = m, respectively. Hence the combination of forward and reverse differentiation offers the potential for great savings. In either case projections and restrictions of the Jacobian to subspaces of the vector functions domain and range can be built into the differentiation process, which is part of the goal-orientation we alluded to before. Second Order Adjoints Rather than separately propagating some first derivatives forward, others reverse, and then combining the results to compute Jacobian matrices efficiently, one may compose these two fundamental modes to compute second derivatives like Hessians of Lagrangians. More specifically, we obtain by directional differentiation of the adjoint relation x D yF 0 (x) the second order adjoint x˙ D yF 00 (x)x˙ 2 Rn : Here we have assumed that the adjoint vector y is constant. We also have taken liberties with matrix vector notation by suggesting that the m × n × n derivative tensor F 00 (x) can be multiplied by the row vector y 2 Rm from the left and the column vector x˙ 2 Rn x 2 Rn from

the right yielding a row vector x˙ of dimension n. In an optimization context y should be thought of as a vector of Lagrange multipliers and x˙ as a feasible direction. By composing the complexity bounds for the reverse and the forward mode one obtains the estimates prog

forw

OPSfx 7! yg  OPSfx; x˙ 7! y˙g rev ad ˙ :  OPSfx; y 7! xg  OPSfx; x˙ ; y 7! xg

Here, ad represents reverse differentiation followed by forward differentiation or vise versa. The former interpretation is a little easier to implement and involves only one forward and one backward sweep through the computational graph. Operations Counts and Overheads From a practical point of view one would of course like to know the proportionality factors in the relations above. If one counts just multiplication operations then y˙ and x are at worst 3 times as expensive as y, and x˙ is at most 9 times as expensive. A nice intuitive example p p is the calculation of the determinant y of a n  n matrix whose entries form the variable vector x. Then we have m = 1 and OPSfx 7! yg D

1p 3 n C O(n) 3

multiplications if one uses an LU factorization. Then it can be seen that y D 1/y makes x the transpose of the p 3 inverse matrix and the resulting cost estimate of n C O(n) multiplications conforms exactly with that for the usual substitution procedure. However, these operations count ratios are no reliable indications of actual runtimes, which depend very strongly on the computing platform, the particular problem an hand, and the characteristics of the AD tool. Implementations of the vector forward mode like ADIFOR [3] that generate compilable source codes can easily compete with divided differences, i. e. compute p directional derivatives in the form Y˙ D F 0 (x) X˙ at the cost of about p function evaluations. For sizeable p 10 they are usually faster than divided differences, unless the roughly p-fold increase in storage results in too much paging onto disk. The reverse mode is an entirely different ball-game since most intermediate values vi and some control flow hints need to be first saved

Complexity of Gradients, Jacobians, and Hessians

and later retrieved, which can easily make the calculation of adjoints memory bound. This memory access overhead can be partially amortized in the vector reverse mode, which yields a bundle X D Y F 0 (x) of q gradient vectors. For example in multicriteria optimization one may well have q 10 objectives or soft constraints, whose gradients are needed simultaneously. Worst-Case Optimality Counting only multiplications we obtain for Jacobians F 0 2 Rm×n the complexity bound prog

ad

OPSfx 7! F 0 g  3 min(n; m) OPSfx 7! yg : Here, n and m can be reduced to the maximal number of nonzero entries in the rows and columns of the Jacobian, respectively. Similarly, we have for the one-sided projection of the Lagrangian Hessian 00

H(x; y) yF

m X

y i r 2 Fi 2 Rnn

iD1

˙ onto the space spanned by the columns of X: prog

ad ˙  9p OPSfx 7! yg : OPSfx 7! H(x; y) Xg

As we already discussed for indefinite integrals there are certainly functions whose derivatives can be evaluated much cheaper than they themselves for example using a computer algebra package. Note that here again we have neglected the preparation effort, which may be very substantial for symbolic differentiation. Nevertheless, the estimates given above for AD are optimal in the sense that there are vector functions F defined by evaluation procedures of the form (1), for which no differentiation process imaginable can produce the Jacobian and projected Hessian significantly cheaper than the given cost bound divided by a small constant. Here, producing these matrices is understood to mean calculating all its elements explicitly, which may or may not be actually required by the overall computation. Consider, for example, the cubic vector function F(x) D x C

b(a> x)3 2

with a; b 2 Rn :

C

Its Jacobian and projected Hessian are given by  2 F 0 (x) D I C b a> x a> 2 Rnn and H(x; y) X˙ D 2a(yb)(a> x)a> X˙ 2 Rnp : ˙ all entries of the matrices F 0 (x) For general a, b and X, and H(x; y) X˙ are distinct and depend nontrivially on x. Hence their explicit calculation by any method requires at least n2 or np arithmetic operations, respectively. Since the evaluation of F itself can be performed using just 3n multiplications and a few additions, the operations count ratios given above cannot be improved by more than a constant. There are other, more meaningful examples [9] with the same property, namely that their Jacobians and projected Hessians are orders of magnitude more expensive than the vector function itself. At least this is true if we insist on representing them as rectangular arrays of reals. This does not contradict our earlier observation that gradients are cheap, because the components of F(x) cannot be considered as independent scalar functions. Rather, their simultaneous evaluation may involve many common subexpressions, as is the case for our rank-one example. These appear to be less beneficial for the corresponding derivative evaluation, thus widening the gap between function and derivative complexities. Expensive  Redundant? The rank-one problem and similar examples for which explicit Jacobians or Hessians appear to be expensive have a property that one might call redundancy. Namely, as x varies over some open neighborhood in its domain, the Jacobian F 0 (x) stays in a lower-dimensional manifold of the linear space of all matrices with its format and sparsity pattern. In other words, the nonzero entries of the Jacobian are not truly independent of each other so that computing them all and storing them separately may be wasteful. In the rank-one example the Jacobian F 0 (x) is dense but belongs at all x to the onedimensional affine variety {I + b˛a| : ˛ 2 R}. Note that the vectors a, b 2 Rn are assumed to be dense and constant parameter vectors of the problem at hand. Their elements all play the role of elemental partials cij with the corresponding operation ' i being multiplications. Hence accumulating the extremely sparse trian-

433

434

C

Complexity of Gradients, Jacobians, and Hessians

gular matrix C, which involves only O(n) nonzero entries, to the dense n × n array F 0 (x) is almost certainly a bad idea, no matter what the ultimate purpose of the calculation. In particular, if one wishes to solve linear systems in the Jacobian, the inverse formula of Sherman–Morrison–Woodbury provides a way of computing the solution of rank-one perturbations to diagonal matrices with O(n) effort. This formula may be seen as a very special case of embedding linear systems in F 0 into a much larger and sparse linear system involving C as demonstrated in [11] and [5]. As of now, all our examples for which the array representation of Jacobians and Hessians are orders of magnitude more expensive to evaluate than the underlying vector function exhibit this redundancy property. In other words, we know of no convincing example where vectors that one may actually wish to calculate as end products are necessarily orders of magnitude more expensive than the functions themselves. Especially for large problems it seems hard to imagine that array representations of the Jacobians and Hessians themselves are really something anybody would wish to look at rather than just use as auxiliary quantities within the overall calculation. So evaluating complete derivative arrays is a bit like fitting a handle to a wooden crate that needs to be moved about frequently. If the crate is of small weight and size this job is easily performed using a few screws. If, on the other hand, the crate is large and heavy, fitting a handle is likely to require additional bracing and other reinforcements. Moreover, this effort is completely pointless since nobody can just pick up the crate by the handle anyhow and one might as well use a fork left in the first place. Preaccumulation and Combinatorics The temporal complexity for both the forward and the reverse (vector) mode are proportional to the number of edges in the linearized computational graph. Hence one may try to reduce the number of edges by certain algebraic manipulations that leave the corresponding Jacobian, i. e., the linear mapping between x˙ and y˙ D F 0 (x)x˙ and equivalently also that between y and x D yF 0 (x) unchanged. It can be easily checked that this is the case if given an index j one updates first c i k C D c i j c jk

either for fixed i j and all k  j, or for fixed k  j and all i j, and then sets cij = 0 or cjk = 0, respectively. In other words, either the edge (j, i) or the edge (k, j) is eliminated from the graph. This leads to fill-in by the creation of new arcs, unless all updated cik were already nonzero beforehand. Eliminating all edges (k, j) with k  j or all edges (j, i) with i j is equivalent and amounts to eliminating the vertex j completely from the graph. After all intermediate vertices 1  j  l are eliminated in some arbitrary order, the remaining edges cij directly connect independent variables with dependent variables and are therefore entries of the Jacobian F 0 (x). Hence, one refers to the accumulation of the Jacobian F 0 if all intermediate nodes are eliminated and to preaccumulation if some of them remain so that the Jacobian is represented by a simplified graph. As we have indicated in the section on goal oriented differentiation one would have to carefully look at the problem function and the overall computational task to decide how much preaccumulation should be performed. Moreover, there aree l! different orders in which a particular set ofe l  l intermediate nodes can be eliminated and even many more different ways of eliminating the corresponding set of edges. So far there have only been few studies of heuristic criteria for finding efficient elimination orderings down to an appropriate preaccumulation level [9]. Summary First and second derivative vectors of the form y˙ D F 0 (x)x˙, x D yF 0 (x) and x˙ D yF 00 (x)x˙ can be evaluated for a fixed small multiple of the temporal complexity of the underlying relation y = F(x). The calculation of the gradient x and the second order adjoint x˙ by the basic reverse method may require storage of order l #intermediates. This possibly unacceptable amount can be reduced to order log(l) at a slight increase in the operations count (see [8]). Jacobians and one-sided projected Hessians can be composed column by column or row by row from vec˙ For sparse derivative mators of the kind y˙, x and x. trices row and/or column compression using suitable seed matrices of type CPR or NR allow a substantial reduction of the computational effort. In some cases the nonzero entries of derivative matrices may be redundant, so that their calculation should be avoided, if

Complexity and Large-Scale Least Squares Problems

the overall computational goal can be reached in some other way. The attempt to evaluate derivative array with absolutely minimal effort leads to hard combinatorial problems. See also  Complexity Classes in Optimization  Complexity of Degeneracy  Complexity Theory  Complexity Theory: Quadratic Programming  Computational Complexity Theory  Fractional Combinatorial Optimization  Information-Based Complexity and Information-Based Optimization  Kolmogorov Complexity  Mixed Integer Nonlinear Programming  NP-Complete Problems and Proof Methodology  Parallel Computing: Complexity Classes References 1. Bauer FL (1974) Computational graphs and rounding error. SIAM J Numer Anal 11:87–96 2. Berz M, Bischof Ch, Corliss G, Griewank A (eds) (1996) Computational differentiation: Techniques, applications, and tools. SIAM, Philadelphia 3. Bischof Ch, Carle A, Corliss G, Griewank A, Hovland P (1992) ADIFOR: Generating derivative codes from Fortran programs. Scientif Program 1:1–29 4. Coleman TF, Morée JJ (1984) Estimation of sparse Jacobian matrices and graph coloring problems. SIAM J Numer Anal 20:187–209 5. Coleman TF, Verma A (1996) Structure and efficient Jacobian calculation. In: Berz M, Bischof Ch, Corliss G, Griewank A (eds) Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia, pp 149–159 6. Curtis AR, Powell MJD, Reid JK (1974) On the estimation of sparse Jacobian matrices. J Inst Math Appl 13:117–119 7. Griewank A (1991) The chain rule revisited in scientific computing, I–II. SIAM News 8. Griewank A (1992) Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optim Methods Softw 1:35–54 9. Griewank A (2000) Evaluating derivatives, principles and techniques of algorithmic differentiation. Frontiers in Appl Math, vol 19. SIAM, Philadelphia 10. Griewank A, Corliss GF (eds) (1991) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia

C

11. Griewank A, Reese S (1991) On the calculation of Jacobian matrices by the Markowitz rule. In: Griewank A and Corliss GF (eds) Automatic Differentiation of Algorithms: Theory, Implementation, and Application. SIAM, Philadelphia, pp 126–135 12. Newsam GN, Ramsdell JD (1983) Estimation of sparse Jacobian matrices. SIAM J Alg Discrete Meth 4:404–417

Complexity and Large-Scale Least Squares Problems JOSEF KALLRATH GVCS, BASF Aktiengesellschaft, Ludwigshafen, Germany MSC2000: 93E24, 34-xx, 34Bxx, 34Lxx Article Outline Introduction A Standard Formulation for Unconstrained Least Squares Problem Solution Methods Explicit Versus Implicit Models Practical Issues of Solving Least Squares Problems

Parameter Estimation in ODE Models The Initial Value Problem Approach The Boundary Value Problem Approach

Parameter Estimation in DAE Models Parameter Estimation in PDE Models Methodology

Least Squares Problems with Massive Data Sets The Matching Approach

Conclusions Acknowledgments References Introduction Least squares problems and solution techniques to solve them have a long history briefly addressed by Björck [4]. In this article we focus on two classes of complex least squares problems. The first one is established by models involving differential equations. The other class is made by least squares problems involving difficult models which need to be solved for many independent observational data sets. We call this least squares problems with massive data sets.

435

436

C

Complexity and Large-Scale Least Squares Problems

A Standard Formulation for Unconstrained Least Squares Problem The unconstrained least squares problem can be expressed by

  2 min l2 (p) ; l2 (p) :D r1 x(t1 ); : : : ; x(t k ); p 2 p

D

N X 

r1k (p)

2

;

r1 2 IR N :

(1)

kD1

The minimization of this functional, i. e., the minimization of the sum of weighted quadratic residuals, under the assumption that the statistical errors follow a Gaußian distribution with variances as in (4), provides a maximum likelihood estimator ([7] Chap. 7) for the unknown parameter vector p. This objective function dates back to Gauß [14] and in the mathematical literature the problem is synonymously called least squares or `2 approximation problem. The least squares structure (1) may arise either from a nonlinear over-determined system of equations r1k (p) D 0 ;

k D 1; : : : ; N ;

N >n;

(2)

or from a data fitting problem with N given data points ˜ p), (t k ; Y˜k ) and variances  , a model function F(t; and n adjustable parameters p: r1k :D r1k (p) D Yk  Fk (p) D

p

  ˜ k ; p) : w k Y˜k  F(t (3)

The weights w k are related to the variances  k by w k :D ˇ/ k2 :

(4)

Traditionally, the weights are scaled to a variance of unit weights. The factor ˇ is chosen so as to make the weights come out in a convenient range. In short vector notation we get  T r1 :D Y  F(p) D r11 (p); : : : ; r1N (p) ; F(p); Y 2 IR N : Our least squares problem requires us to provide the following input: 1. model, 2. data, 3. variances associated with the data,

4. measure of goodness of the fit, e. g., the Euclidean norm. In many practical applications, unfortunately, less attention is paid to the variances. It is also very important to point out that the use of the Euclidean norm requires pre-information related to the problem and statistical properties of the data. Solution Methods Standard methods for solving linear version of (1), i. e., F(p) D Ap, are reviewed by Björck [4]. Nonlinear methods for unconstrained least squares problems are covered in detail by Xu [35,36,37]. In addition, we mention a popular method to solve unconstrained least squares problems: the Levenberg–Marquardt algorithm proposed independently by Levenberg [21] and Marquardt [22] and sometimes also called “damped least squares”. It modifies the eigenvalues of the normal equation matrix and tries to reduce the influence of eigenvectors related to small eigenvalues (cf. [8]). Damped (step-size cutting) Gauß–Newton algorithms combined with orthogonalization methods control the damping by natural level functions [6,9,10] seem to be superior to Levenberg–Marquardt type schemes and can be more easily extended to nonlinear constrained least squares problems. Explicit Versus Implicit Models A common basic feature and limitation of least squares methods, but seldom explicitly noted, is that they require some explicit model to be fitted to the data. However, not all models are explicit. For example, some pharmaceutical applications for receptor-ligand binding studies are based on specifically coupled mass equilibrium models. They are used, for instance, for the radioimmunological determination of Fenoterol or related substances, and lead to least squares problems in systems of nonlinear equations [31], in which the model function F(p) is replaced by F(t; p; z) which, besides the parameter vector p and the time t, depends on a vector function z D z(t; p) implictly defined as the solution of the nonlinear equations F2 (t; p; z) D 0 ;

F2 (p) 2 IRn 2 :

(5)

This is a special case of an implicit model. There is a much broader class of implicit models. Most models

Complexity and Large-Scale Least Squares Problems

in science are based on physical, chemical and biological laws or include geometry properties, and very often lead to differential equations which may, however, not be solvable in a closed analytical form. Thus, such models do not lead to explicit functions or models we want to fit to data. We rather need to fit an implicit model (represented by a system of differential equations or another implicit model). The demand for and the applications of such techniques are widespread in science, especially in the rapidly increasing fields of nonlinear dynamics in physics and astronomy, nonlinear reaction kinetics in chemistry [5], nonlinear models in material sciences [16] and biology [2], and nonlinear systems describing ecosystems [28,29] in biology, or the environmental sciences. Therefore, it seems desirable to focus on least squares algorithms that use nonlinear equations and differential equations as constraints or side conditions to determine the solution implicitly. Practical Issues of Solving Least Squares Problems Solving least squares problems involves various difficulties among them to find an appropriate model, nonsmooth models with discontinuous derivatives, data quality and checking the assumption of the underlying error distribution, and dependence on initial parameter or related questions of global convergence. Models and Model Validation A model may be defined as an appropriate abstract representation of a real system. In the natural sciences (e. g., Physics, Astronomy, Chemistry and Biology) models are used to gain a deeper understanding of processes occurring in nature (an epistemological argument). The comparison of measurements and observations with the predictions of a model is used to determine the appropriateness and quality of the model. Sir Karl Popper [26] in his famous book Logic of Scientific Discovery uses the expressions falsification and verification to describe tasks that the models can be used to accomplish as an aid to scientific process. Models were used in early scientific work to explain the movements of planets. Then, later, aspects and questions of accepting and improving global and fundamental models (e. g., general relativity or quantum physics) formed part of the discussion of the philosophy of science. In science models are usually falsified,

C

and, eventually, replaced by modified or completely different ones. In industry, models have a rather local meaning. A special aspect of reality is to be mapped in detail. Pragmatic and commercial aspects are usually the motivation. The model maps most of the relevant features and neglect less important aspects. The purpose is to  provide insight into the problem,  allow numerical, virtual experimentation but avoid expensive and/or dangerous real experiments, or  tune a model for later usage, i. e., determine, for instance, the reaction coefficients of a chemical system – once these parameters are known the dynamics of the process can be computed. A (mathematical) model represents a real-world problem in the language of mathematics, i. e., by using mathematical symbols, variables (in this context: the adjustable least squares parameters), equations, inequalities, and other relations. How does one get a mathematical model for a real-world problem? To achieve that is neither easy nor unique. In some sense it is similar to solving exercises in school where problems are put in a verbal way [25]. The following points are useful to remember when trying to build a model:  there will be no precise recipe telling the user how to build a model,  experience and judgment are two important aspects of model building,  there is nothing like a correct model,  there is no concept of a unique model, as different models focusing on different aspects may be appropriate. Industrial models are eventually validated which means that they reached a sufficient level of consensus among the community working with these models. Statistics provide some means to discriminate models but this still is an art and does not replace the need for appropriate model validation. The basic notion is: with a sufficient number of parameters on can fit an elefant. This leads us to one important consequence: it seems to be necessary that one can interpret these model parameters. A reasonable model derived from the laws of science with interpretable parameters is a good candidate to become accepted. Even, if it may lead to a somewhat worse looking fits than a model with a larger number of formal parameters without interpretation.

437

438

C

Complexity and Large-Scale Least Squares Problems

Non-Smooth Models The algorithm reviewed by Xu [35,36,37] for solving least squares problems usually require the continuous first derivatives of the model function with respect to the parameters. We might, however, encounter models for which the first derivatives are discontinuous. Derive-free methods such as Nelder and Mead’s [23] downhill Simplex method, or direction set methods; cf. ([27], p. 406) have been successfully used to solve least squares problems. The Simplex method provides the benefit of exploring parameter space and good starting values for derivative based methods. Powell’s direction set method with appropriate conjugate directions preserve the derivative free nature of the method. Global Convergence Nonlinear least squares algorithms usually converge only if the initial parameters are close to the best fit parameters. Global convergence can be established for some algorithms, i. e., they converge for all initial parameters. An essential support tool accompanying the analysis of difficult least squares problem is to visualize the data and the fits. Inappropriate or premature fits can easily be excluded. Inappropriate fits are possible because all algorithms mentioned in Sect. “Introduction”, “Parameter Estimation in ODE Models”, and “Parameter Estimation in DAE Models” are local algorithm. Only if the least squares problem is convex, they yield the global least squares minimum. Sometimes, it is possible to identify false local minima from the residuals.

usual assumption that the distribution really follows a Gaussian normal distribution. With the Kolmogoroff–Smirnov test (see, e. g., [24]) it is possible to check as follows whether the residuals of a least-squares solution are normally distributed around the mean value 0. 1. let M :D (x1 , x2 ; : : :, x n ) be a set of observations for which a given hypothesis should be tested; 2. let G : x 2 M ! IR, x ! G(x), be the corresponding cumulative distribution function; 3. for each observation x 2 M define S n (x) :D k/n, where k is the number of observations less than or equal to x; 4. determine the maximum D :D max(G(x)  S n (x) j x 2 M); 5. Dcrit denotes the maximum deviation allowed for a given significance level and a set of n elements. Dcrit is tabulated in the literature, e. g., ([24], Appendix 2, p. 560); and 6. if D < Dcrit, the hypothesis is accepted. For the least squares problem formulated in Sect. “A Standard Formulation for Unconstrained Least Squares Problem” the hypothesis is “The residuals x :D r1 D YF(p) are normally distributed around the mean value 0”. Therefore, the cumulative distribution function G(x) takes the form Z x p 2 G(x) D g(z)dz Z1 Z x x 0 D g(z)dz C g(z)dz; 1

g(z) :D e Data and Data Quality Least squares analysis is concerned by fitting data to a model. The data are not exact but subject to unknown random errors "k . In ideal cases these errors follow a Gaussian normal distribution. One can test this assumption after the least squares fit by analyzing the distribution of the residuals as described in Sect. “Residual Distributions, Covariances and Parameter Uncertainties”. Another important issue is whether the data are appropriate to estimate all parameters. Experimental design is the discipline which addresses this issue. Residual Distributions, Covariances and Parameter Uncertainties Once the minimal least squares solution has been found one should at first check with the 2 -test or Kolmogoroff–Smirnov test whether the

x 0  12 z 2

:

The value x0 separates larger residuals; this is problem specific control parameter. The derivative based least squares methods usually also give the covariance matrix from which the uncertainties of the parameter are derived; cf. [7], Chap. 7. Least squares parameter estimations without quantifying the uncertainty of the parameters are very doubtful. Parameter Estimation in ODE Models Consider a differential equation with independent variable t for the state variable x0 (t) D

dx D f(t; x; p); dt

x 2 IRn d

;

p 2 IRn p (6)

with a right hand side depending on an unknown parameter vector p. Additional requirements on the solu-

Complexity and Large-Scale Least Squares Problems

tion of the ODE (1) like periodicity, initial or boundary conditions or range restrictions to the parameters can be formulated in vectors r2 and r3 of (component wise) equations and inequalities   r2 x(t1 ); : : : ; x(t k ); p D 0 or   (7) r3 x(t1 ); : : : ; x(t k ); p  0 : The multi-point boundary value problem is linked to experimental data via minimization of a least squares objective function

  2 (8) l2 (x; p) :D r1 x(t1 ); : : : ; x(t k ); p 2 : In a special case of (8) the components ` of the vector r1 2 IRL are “equations of condition” and have the form r1` D  i1 j [ i j  g i (x(t j ); p)] ; ` D 1; : : : ; L :D

Nj X

J i : (9)

iD1

This case leads us to the least squares function D

l2 (x; p) :D

Nj N X X

2  i2 j [ i j  g i (x(t j ); p)] :

(10)

jD1 iD1

Here, N D denotes the number of values of the independent variable (here called time) at which observed data are available, N j denotes the number of observables measured at time t j and  i j denotes the observed value which is compared with the value of observable i evaluated by the model where the functions g i (x(t j ); p) relate the state variables to x this observable  i j D g i (x(t j ); p) C " i j :

(11)

The numbers " i j are the measurement errors and  i2j are weights that have to be adequately chosen due to statistical considerations, e. g. as the variances. The unknown parameter vector p is determined from the measurements such that the model is optimally adjusted to the measured (observed) data. If the errors " i j are independent, normally distributed with the mean value zero and have variances  i2j (up to a common factor ˇ 2 ), then the solution of the least squares problem is a maximum likelihood estimate.

C

The Initial Value Problem Approach An obvious approach to estimate parameters in ODE which is also implemented in many commercial packages is the initial value problem approach. The idea is to guess parameters and initial values for the trajectories, compute a solution of an initial value problem (IVP) (6) and iterate the parameters and initial values in order to improve the fit. Characteristic features and disadvantages are discussed in, e. g., [6] or [18]. In the course of the iterative solution one has to solve a sequence of IVPs. The state variable x(t) is eliminated for the benefit of the unknown parameter p and the initial values. Note that no use is made of the measured data while solving the IVPs. They only enter in the performance criterion. Since initial guesses of the parameters may be poor, this can lead to IVPs which may be hard to solve or even have no solution at all and one can come into badly conditioned regions of the IVPs, which can lead to the loss of stability. The Boundary Value Problem Approach Alternatively to the IVP approach, in the “boundary value problem approach” invented by Bock [5], the inverse problem is interpreted as an over-determined, constrained, multiple-point boundary problem. This interpretation does not depend on whether the direct problem is an initial or boundary value problem. The algorithm used here consists of an adequate combination of a multiple shooting method for the discretization of the boundary value problem side condition in combination with a generalized Gauss-Newton method for the solution of the resulting structured nonlinear constrained least squares problem [5,6]. Depending on the vector of signs of the state and parameter dependent switching functions Q it is even possible to allow piecewise smooth right hand side functions f , i. e., differential equations with switching conditions x0 D f (t; x; p; si gn(Q(t; x; p))) ;

(12)

where the right side may change discontinuously if the vector of signs of the switching functions Q changes. Such discontinuities can occur, e. g. as a result of unsteady changes of physical values. The switching points are in general given by the roots of the state-dependent components of the switching functions Q i (t; x; p) D 0 :

(13)

439

440

C

Complexity and Large-Scale Least Squares Problems

Depending on the stability behavior of the ODE and the availability of information about the process (measured data, qualitative knowledge about the problem, etc.), a grid T_m of m multiple shooting nodes τ_j (defining m − 1 subintervals I_j),

T_m : τ_1 < τ_2 < … < τ_m , Δτ_j := τ_{j+1} − τ_j , 1 ≤ j ≤ m − 1 ,    (14)

is chosen. The grid is adapted to the problem and data and is defined such that it includes the measuring interval ([τ_1, τ_m] = [t_0, t_f]). Usually the grid points τ_j correspond to values of the independent variable t at which observations are available, but additional grid points may be chosen for strongly nonlinear models. At each node τ_j an IVP

x′(t) = f(t, x, p) , x(t = τ_j) = s_j ∈ ℝ^{n_d} ,    (15)

has to be integrated from τ_j to τ_{j+1}. The m − 1 vectors of (unknown) initial values s_j of the partial trajectories, the vector s_m representing the state at the end point, and the parameter vector p are summarized in the (unknown) vector z,

z^T := (s_1^T, …, s_m^T, p^T) .    (16)

For a given guess of z, the solutions x(t; s_j, p) of the m − 1 independent initial value problems in each subinterval I_j are computed. This leads to an (at first discontinuous) representation of x(t). In order to replace (6) equivalently by these m − 1 IVPs, matching conditions

h_j(s_j, s_{j+1}, p) := x(τ_{j+1}; s_j, p) − s_{j+1} = 0 , h_j : ℝ^{2n_d + n_p} → ℝ^{n_d} ,    (17)

are added to the problem; (17) ensures the continuity of the final trajectory x(t). Replacing x(t_i) and p in (10) by z, the least squares problem is reformulated as a nonlinear constrained optimization problem with the structure

min_z { ½ ‖F_1(z)‖_2² | F_2(z) = 0 ∈ ℝ^{n_2} , F_3(z) ≥ 0 ∈ ℝ^{n_3} } ,    (18)

wherein n_2 denotes the number of equality constraints and n_3 the number of inequality constraints. This usually large, constrained, structured nonlinear problem is solved by a damped generalized Gauss–Newton method [5]. If J_1(z_k) := ∂_z F_1(z_k), J_2(z_k) := ∂_z F_2(z_k), and J_3(z_k) := ∂_z F_3(z_k) denote the Jacobians of F_1, F_2, and F_3, respectively, then the iteration proceeds as

z_{k+1} = z_k + α_k Δz_k    (19)

with damping constant α_k, 0 < α_min ≤ α_k ≤ 1, and the increment Δz_k determined as the solution of the constrained linear problem

min_{Δz} { ½ ‖J_1(z_k) Δz_k + F_1(z_k)‖_2² | J_2(z_k) Δz_k + F_2(z_k) = 0 , J_3(z_k) Δz_k + F_3(z_k) ≥ 0 } .    (20)
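The following sketch illustrates the multiple shooting structure (14)–(20) on a deliberately tiny example: the node values s_j and the parameter p are stacked into z as in (16), and the matching conditions (17) are evaluated by integrating the subinterval IVPs (15). For brevity the matching conditions are appended to the data residuals and handled by an unconstrained damped Gauss–Newton step rather than by the constrained subproblem (20), and the Jacobian is approximated by finite differences instead of the sensitivity techniques described below; the toy model, data, and tolerances are assumptions made for illustration only.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x, p):
    return np.array([-p[0] * x[0]])          # toy dynamics x' = -p x

tau = np.linspace(0.0, 2.0, 5)               # shooting grid, cf. (14)
t_data = tau[:-1]                            # observations at the nodes
eta = np.exp(-1.3 * t_data)                  # synthetic data, p_true = 1.3

def shoot(s, p, j):
    """Integrate the IVP (15) on [tau_j, tau_{j+1}] from node value s."""
    sol = solve_ivp(f, (tau[j], tau[j + 1]), s, args=(p,), rtol=1e-8)
    return sol.y[:, -1]

def residual(z):
    """Stack data residuals F1 and matching conditions h_j, cf. (17)."""
    s = z[:-1].reshape(len(tau), 1)          # node values s_1 .. s_m
    p = z[-1:]
    F1 = s[:-1, 0] - eta                     # model minus data at the nodes
    F2 = [shoot(s[j], p, j) - s[j + 1] for j in range(len(tau) - 1)]
    return np.concatenate([F1, np.concatenate(F2)])

z = np.concatenate([np.ones(len(tau)), [0.5]])   # initial guess for (16)
for _ in range(20):
    F = residual(z)
    J = np.empty((F.size, z.size))
    for i in range(z.size):                  # finite-difference Jacobian
        dz = np.zeros(z.size); dz[i] = 1e-6
        J[:, i] = (residual(z + dz) - F) / 1e-6
    step = np.linalg.lstsq(J, -F, rcond=None)[0]
    alpha = 1.0                              # damping constant, cf. (19)
    while np.linalg.norm(residual(z + alpha * step)) > np.linalg.norm(F):
        alpha *= 0.5                         # halve on overstepping
        if alpha < 1e-4:
            break
    z += alpha * step
print("estimated p =", z[-1])
```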

Global convergence can be achieved if the damping strategy is properly chosen [6]. The inequality constraints that are active at a feasible point are defined by the index set

I(z_k) := { i | F_{3,i}(z_k) = 0 , i = 1, …, n_3 } .    (21)

The inequalities defined by the index set I(z_k), and their derivatives, are denoted by F̂_3 and Ĵ_3 in the following. In addition to (21) we define the stacked quantities

F_c := ( F_2 ; F̂_3 ) , J_c := ( J_2 ; Ĵ_3 ) .    (22)

In order to derive the necessary conditions that have to be fulfilled by the solution of problem (18), the Lagrangian

L(z, λ, μ) := ½ ‖F_1(z)‖_2² − λ^T F_2(z) − μ^T F_3(z)    (23)

and the reduced Lagrangian

L̂(z, λ_c) := ½ ‖F_1(z)‖_2² − λ_c^T F_c(z) , λ_c := ( λ ; μ̂ ) ,    (24)

are defined. The Kuhn–Tucker conditions, i.e. the necessary conditions of first order, are the feasibility conditions

F_2(z*) = 0 , F_3(z*) ≥ 0 ,    (25)

ensuring that z* is feasible, and the stationarity conditions stating that the adjoint variables λ*, μ* exist as the solution of the stationarity conditions

(∂L/∂z)(z*, λ*, μ*) = F_1^T(z*) J_1(z*) − λ*^T J_2(z*) − μ*^T J_3(z*) = 0    (26)

and

μ* ≥ 0 , i ∉ I(z*) ⇒ μ_i* = 0 .    (27)

If (z*, λ*, μ*) fulfills the conditions (25), (26), and (27), it is called a Kuhn–Tucker point and z* a stationary point. The necessary condition of second order means that for all directions

s ∈ T(z*) := { s ≠ 0 | J_2(z*) s = 0 , μ_i* J_{3,i}(z*) s = 0 , J_3(z*) s ≥ 0 }    (28)

the Hessian G(z*, λ*, μ*) of the Lagrangian is positive semi-definite:

s^T G(z*, λ*, μ*) s ≥ 0 , G(z*, λ*, μ*) := ∂²L(z*, λ*, μ*)/∂z² .    (29)

As μ_i* = 0 for i ∉ I(z*), it is sufficient to postulate the stationarity condition for the reduced Lagrangian (24). For the linear problem (20) it follows that (z*, λ*, μ*) is a Kuhn–Tucker point of the nonlinear problem (18) if and only if (0, λ*, μ*) is a Kuhn–Tucker point of the linear problem. The necessary conditions for the existence of a local minimum of problem (18) are:
1. (z*, λ*, μ*) is a Kuhn–Tucker point of the nonlinear problem;
2. the Hessian G(z*, λ*, μ*) of the Lagrangian is positive definite for all directions s ∈ T(z*), i.e. s^T G(z*, λ*, μ*) s > 0.
If the necessary conditions for the existence of the local minimum and the condition μ_i* ≠ 0 for i ∈ I(z*) are fulfilled, two perturbation theorems [6] can be formulated. If the sufficient conditions are fulfilled, it can be shown for a neighborhood of a Kuhn–Tucker point (z*, λ*, μ*) of the nonlinear problem (18) that the local convergence behavior of the inequality constrained problem corresponds to that of the equality constrained problem formed by the active inequalities and the equations. Under the assumption of the regularity of the Jacobians J_1 and J_c, i.e.

rank ( J_1(z_k) ; J_c(z_k) ) = n_d + n_p , rank( J_c(z_k) ) = n_c ,    (30)

a unique solution Δz_k of the linear problem (20) exists, and a unique linear mapping J_k^+ can be constructed which satisfies the relations

Δz_k = − J_k^+ F(z_k) , J_k^+ J_k J_k^+ = J_k^+ , J_k^T := ( J_1^T(z_k) , J_c^T(z_k) ) .    (31)

The solution Δz_k of the linear problem, or formally the generalized inverse J_k^+ [5] of J_k, results from the Kuhn–Tucker conditions. It should be noted, however, that for reasons of numerical efficiency Δz_k is not calculated from (31) but by a decomposition procedure using orthogonal transformations. By taking into consideration the special structure of the matrices J_i caused by the continuity conditions of the multiple shooting discretization, (18) can be reduced by a condensation algorithm described in [5,6] to a system of lower dimension,

min { ½ ‖A_1 Δx_k + a_1‖_2² | A_2 Δx_k + a_2 = 0 , A_3 Δx_k + a_3 ≥ 0 } ,    (32)

from which first Δx_k and finally Δz_k can be derived. This is achieved by performing a "backward recursion", the "solution of the condensed problem", and a "forward recursion" [6]. Kilian [20] has implemented an active set strategy following the description in [6] and [33], utilizing the special structure of J_2. The details of the parameter estimation algorithms incorporated in the efficient software package PARFIT (a software package of stable and efficient boundary value problem methods for the identification of parameters in systems of nonlinear differential equations) are found in [6]. The damping constant α_k in the k-th iteration is computed with the help of natural level functions which locally approximate the distance ‖z_k − z*‖ of the iterate from the Kuhn–Tucker point z*. The integrator METANB (for the basic discretization see, for instance, [3]) embedded in PARFIT is also suitable for the integration of stiff differential equation systems.


It allows the user to simultaneously compute the sensitivity matrices G,

G(t, t_0, x_0, p) := ∂x(t, t_0, x_0, p)/∂x_0 ∈ M(n_d, n_d) ,    (33)

and H,

H(t, t_0, x_0, p) := ∂x(t, t_0, x_0, p)/∂p ∈ M(n_d, n_p) ,    (34)

which are the most costly blocks of the Jacobians J_i, via the so-called internal numerical differentiation introduced by Bock [5]. This technique does not require the often cumbersome and error-prone formulation of the variational differential equations

G′ = f_x(t, x, p) · G , G(t_0, t_0, x_0, p) = I    (35)

and

H′ = f_x(t, x, p) · H + f_p(t, x, p) , H(t_0, t_0, x_0, p) = 0    (36)

by the user.
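As a small illustration of (35)–(36), the variational equations can simply be appended to the state equation and integrated together. The scalar model below is an assumed example; this is the naive alternative to the internal numerical differentiation implemented in PARFIT.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Variational equations (35)-(36) for the assumed model x' = -p*x,
# x(0) = x0: here f_x = -p and f_p = -x, so
#   G' = -p*G, G(0) = 1,   H' = -p*H - x, H(0) = 0.

def rhs(t, y, p):
    x, G, H = y
    return [-p * x,          # state equation
            -p * G,          # G' = f_x G          (35)
            -p * H - x]      # H' = f_x H + f_p    (36)

x0, p = 2.0, 1.3
sol = solve_ivp(rhs, (0.0, 1.0), [x0, 1.0, 0.0], args=(p,), rtol=1e-10)
x, G, H = sol.y[:, -1]
print("dx/dx0 =", G, " exact:", np.exp(-p))          # x = x0 e^{-pt}
print("dx/dp  =", H, " exact:", -x0 * np.exp(-p))    # dx/dp = -t x0 e^{-pt}
```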

Using the multiple shooting approach described above, differential equation systems with poor stability properties and even chaotic systems can be treated [18].

Parameter Estimation in DAE Models

Another, even more complex class of problems is parameter estimation in mechanical multibody systems, e.g., in planar slider-crank mechanisms, a simple model for a cylinder in an engine. These problems lead to boundary value problems for higher-index differential-algebraic systems [34]. Singular controls and state constraints in optimal control also lead to this structure. Inherent to such problems are invariants that arise from index reduction, but also additional physical invariants such as the total energy in conservative mechanical systems or the Hamiltonian in optimal control problems. A typical class of DAEs in mechanical multibody systems is given by the equations of motion

ẋ = v ,
M(t, x) v̇ = f(t, x) − ∇_x g(t, x) λ ,    (37)
0 = g(t, x) ,

where x = x(t) and v = v(t) are the coordinates and velocities, M is the mass matrix, f denotes the applied forces, g are the holonomic constraints, and λ are the generalized constraint forces. Usually, M is symmetric and positive definite. A more general DAE system might have the structure

ẋ = f(t, x, z, p) ,
0 = g(t, x, z, p) ,

where p denotes some parameters and z = z(t) is a set of algebraic variables, i.e., the differentials ż do not appear; in (37), λ is the algebraic variable. In addition we might have initial values x_0 and z_0. Obviously, some care is needed regarding the choice of z_0 because it needs to be consistent with the constraint. In some exceptional cases (in which Z := ∇_z g has full rank and can be inverted analytically) we might insert z = z(t, x, p) into the differential equation. DAE systems with a regular matrix Z are referred to as index-1 systems. Index-1 DAEs can be transformed into equivalent ordinary differential equations by differentiating the equations w.r.t. t. At first we get the implicit system of differential equations

g_t + X ẋ + Z ż = 0 , X := ∇_x g ,

which, according to the assumption of the regularity of Z, can be written as the explicit system

ż = − Z^{-1} ( g_t + X f ) .

Many practical DAEs have index 1, e.g., in some chemical engineering problems, where algebraic equations are introduced to describe, for instance, mass balances or the equation of state. However, multibody systems such as (37) have higher indices; (37) is of index 3. The reason is that the multiplier variables, i.e., the algebraic variables, do not occur in the algebraic constraints, and it is therefore not possible to extract them directly without further differentiation. If Z does not have full rank, the equations are differentiated successively until the algebraic variables can be eliminated. The smallest number of differentiations required to transform the original DAE system to an ODE system is called the index of the DAE. The approach developed and described by Schulz et al. [34] is capable of handling least squares problems without special assumptions on the index.
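A minimal sketch of the index-1 situation follows: when Z = ∇_z g is regular, the algebraic variable can be recovered from the constraint inside the right-hand side, which turns the DAE into an ODE. The concrete functions f and g below are illustrative assumptions, not taken from [34].

```python
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

# Index-1 DAE sketch: x' = f(x, z), 0 = g(x, z) with dg/dz regular.

def g(z, x):
    return z**3 + z - x            # dg/dz = 3z^2 + 1 > 0, hence index 1

def rhs(t, y):
    x = y[0]
    z = brentq(g, -10.0, 10.0, args=(x,))   # solve 0 = g(x, z) for z
    return [-x + z]                          # x' = f(x, z)

sol = solve_ivp(rhs, (0.0, 2.0), [1.0], rtol=1e-8)
print("x(2) =", sol.y[0, -1])
```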


An essential problem for the design, optimization, and control of chemical systems is the estimation of parameters from time series. These problems lead to nonlinear DAEs. The parameter estimation problem leads to a non-convex optimization problem for which several local minima exist. Esposito and Floudas [13] developed two global optimization approaches, based on branch-and-bound and convex underestimators, to solve this problem. In the first approach, the dynamical system is converted into an algebraic system using orthogonal collocation on finite elements. In the second approach, state profiles are computed by integration. In Esposito and Floudas [12] a similar approach is used to solve optimal control problems.

Parameter Estimation in PDE Models

A very complex class of least squares problems are data fitting problems in models based on partial differential equations. These include eigenvalue problems as well as initial and boundary value problems, and they cover problems in atomic physics, elasticity, electromagnetic fields, fluid flow, and heat transfer. Some recent problems arise, for instance, in models describing the water balance and solid transport used to analyze the distributions of nutrients and pesticides [1], in the determination of diffusion constants in water absorption processes in hygroscopic liquids discussed in [15], and in multispecies reactive flows through porous media [38]. Such nonlinear multispecies transport models can be used to describe the interaction between oxygen, nitrate, organic carbon, and bacteria in aquifers. They may include convective transport and diffusion/dispersion processes for the mobile parts (that is, the mobile pore water) of the species. The immobile biophase represents the part where reactions caused by microbial activity take place and which is coupled to transport through the mobile pore water. The microorganisms are assumed to be immobile. The model leads to the partial differential-algebraic equations

M ∂_t u − ∇·(D ∇u) + q ∇u = f_1(u, v, z, p) ,
∂_t v = f_2(u, v, z, p) ,    (38)
0 = g(u, v, z, p) ,

where D and q denote the hydraulic parameters of the model, p denotes a set of reaction parameters, u and v refer to the mobile and immobile species, and z is related to source and sink terms.

Methodology

Solving least squares problems based on PDE models requires sophisticated numerical techniques, but also great attention to the quality of the data and the identifiability of the parameters. To solve such problems we might use the following approaches:
1. Unstructured approach: The PDE model is, for fixed parameters p, integrated by any appropriate method, yielding estimates of the observations. The parameters are adjusted by a derivative-free optimization procedure, e.g., by the simplex method of Nelder and Mead [23]. This approach is relatively easy to implement, it solves a sequence of direct problems, and it is comparable to what in Sect. "Parameter Estimation in ODE Models" has been called the IVP approach. Arning [1] uses such an approach.
2. Structured approach (for initial value PDE problems): Within the PDE model, spatial coordinates and time are discretized separately. Especially for models with only one spatial coordinate, it is advantageous to apply finite difference or finite element discretizations to the spatial coordinate. The PDE system is transformed into a system of (usually stiff) ordinary differential equations. This approach is known as the method of lines (see, for example, [30]). It reduces parameter estimation problems subject to time-dependent partial differential equations to parameter identification problems in systems of ordinary differential equations to be integrated w.r.t. time. It is then possible to distinguish again between the IVP and BVP approach. Schittkowski [32] in his software package EASY-FIT applies the method of lines to PDEs with one spatial coordinate and uses several explicit and implicit integration methods to solve the ODE system. The integration results are used by an SQP optimization routine or a Gauß–Newton method to estimate the parameters. Zieße et al. [38] and Dieses et al. [11], instead, couple the method of lines (in one and two spatial coordinates) with Bock's [6] BVP approach, discretize time, for instance, by multiple shooting, and use an extended version of PARFIT.


The method of lines has become one of the standard approaches for solving time-dependent PDEs with only one spatial coordinate. It is based on a partial discretization, which means that only the spatial derivative is discretized but not the time derivative. This leads to a system of N coupled ordinary differential equations, where N is the number of discretization points. Let us demonstrate the method by applying it to the diffusion equation

∂c(t, z)/∂t = D ∂²c(t, z)/∂z² , 0 ≤ t …
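A minimal method-of-lines sketch for this diffusion equation follows; since the original example is truncated in the source at this point, the boundary and initial conditions below are assumed for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Method of lines for c_t = D c_zz: central differences in z produce N
# coupled (stiff) ODEs which are integrated in time with a BDF method.
# Boundary/initial conditions and all constants are assumptions.

D, N, L = 0.1, 51, 1.0
z = np.linspace(0.0, L, N)
dz = z[1] - z[0]

def rhs(t, c):
    dc = np.zeros_like(c)
    dc[1:-1] = D * (c[2:] - 2.0 * c[1:-1] + c[:-2]) / dz**2
    return dc                      # dc[0] = dc[-1] = 0: fixed boundaries

c0 = np.sin(np.pi * z)             # assumed initial profile
sol = solve_ivp(rhs, (0.0, 0.5), c0, method="BDF", rtol=1e-8)
print("c(z=0.5, t=0.5) =", sol.y[N // 2, -1])   # ~ exp(-D*pi^2*0.5)
```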

… the negation of y_{i,k} is coded in Σ_{j ∈ D_k \ {i}} y_{j,k}, and one can proceed as above. Just like in the discrete case, inclusive as well as exclusive 'or'-relations can be modeled with exact continuous variables.


To circumvent the introduction of nonconvexity into the model by binary multiplication in BM, [15] presents an alternative, convex reformulation approach on the basis of tailored big-M constraints, which can also be used in conjunction with exact continuous variables as defined in equation (17). A distinctive property of the binary-multiplication-based model formulation BM, however, is the treatment of inconsistent equality constraints.

The Case of Inconsistent Equalities

In many applications, the constraints (6)–(8) in BM lead to implicit restrictions on the exact continuous variables. In particular, (6) and (8) have to hold simultaneously for i ∈ D_k, k ∈ K. In process engineering applications, the underlying equations (i.e. h_{i,k}(x) = 0, i ∈ D_k, as well as b_k − γ_{i,k} = 0, i ∈ D_k) are often inconsistent for fixed k ∈ K, that is, they do not admit a common solution or, put geometrically, the sets described by these equations are disjoint. Note that this is an inherent property of a GDP problem with so-called disjoint disjunctions, whose terms have non-intersecting feasible regions [17]. This is particularly the case if, for fixed k ∈ K, the values γ_{i,k} are pairwise distinct for i ∈ D_k. It implies that at most one of the variables y_{i,k}, i ∈ D_k, is nonvanishing. In the case n_k = 2 with y = y_1 and z = y_2, this means that the equation y·z = 0 holds automatically. As a consequence, the only constraint needed for the description of A_1 (cf. Fig. 7) is y + z ≥ 1, that is, the set

A_2 = { (y, z) ∈ ℝ² | y + z ≥ 1 }

coincides with A_1 in the case of inconsistent equalities (cf. Fig. 8). Although the pair (y, z) does not vary over the complete two-dimensional set A_2 from Fig. 8, the restrictions do not code the same information twice. This can be expected to lead to better numerical performance when NLP solvers are applied.
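As a toy illustration of this behavior, the sketch below minimizes an assumed objective over the set described by y + z ≥ 1, bounds [0, 1], and the complementarity condition y·z = 0, under which any feasible point has (y, z) = (1, 0) or (0, 1); as noted in the conclusions below, a standard NLP solver can only be expected to return a local solution.

```python
from scipy.optimize import minimize

# Exact continuous variables behaving like binaries: the objective is an
# assumed example, not taken from [15].
res = minimize(
    lambda v: (v[0] - 0.3) ** 2 + (v[1] - 0.4) ** 2,   # assumed objective
    x0=[0.9, 0.1],
    bounds=[(0.0, 1.0), (0.0, 1.0)],
    constraints=[
        {"type": "ineq", "fun": lambda v: v[0] + v[1] - 1.0},  # y + z >= 1
        {"type": "eq",   "fun": lambda v: v[0] * v[1]},        # y * z = 0
    ],
)
print(res.x)   # a local solution, e.g. (1, 0) or (0, 1)
```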

Continuous Reformulations of Discrete-Continuous Optimization Problems, Figure 8: A two-dimensional feasible set for (y, z)

Conclusions

In [15] several example problems involving discrete and continuous decision variables from process engineering are treated numerically, with approximate as well as exact continuous variables representing the discrete decisions. It is shown that, using these reformulations, an efficient numerical treatment of disjunctive optimization problems is possible, but one can only expect to find local solutions when using standard NLP solvers. This is due to the fact that any continuous reformulation of a disjunctive optimization problem leads to a nonconvex optimization problem. Consequently, the reformulation approaches may be combined with global optimization algorithms whenever the problem size permits.

See also

- Disjunctive Programming
- Mixed Integer Programming/Constraint Programming Hybrid Methods
- Order Complementarity

References
1. Bazaraa M, Sherali H, Shetty C (1993) Nonlinear Programming. Wiley, Hoboken, New Jersey
2. Chen B, Chen X, Kanzow C (2000) A penalized Fischer–Burmeister NCP-function. Math Program 88:211–216
3. Floudas CA (1995) Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford University Press, New York
4. Giannesi F, Niccolucci F (1976) Connections between nonlinear and integer programming problems. In: Istituto Nazionale di Alta Matematica (ed) Symposia Mathematica, vol XIX. Acad Press, New York, pp 161–176
5. Grossmann IE, Hooker J (2000) Logic based approaches for mixed integer programming models and their application in process synthesis. In: Malone M, Trainham J, Carnahan B (eds) Foundations of Computer-Aided Process Design 323. AIChE Symp Series. CACHE Publications, Austin, Texas, pp 70–83
6. Jongen HT, Weber G-W (1991) Nonlinear optimization: characterization of structural stability. J Glob Optim 1:47–64
7. Leyffer S (2006) Complementarity constraints as nonlinear equations: theory and numerical experiences. In: Dempe S, Kalashnikov V (eds) Optimization and Multivalued Mappings. Springer, Dordrecht, pp 169–208
8. Luo Z, Pang J, Ralph D (1996) Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge
9. Pardalos PM (1994) The linear complementarity problem. In: Gomez S, Hennart JP (eds) Advances in Optimization and Numerical Analysis. Springer, New York, pp 39–49
10. Pardalos PM, Prokopyev OA, Busygin S (2006) Continuous approaches for solving discrete optimization problems. In: Appa G, Pitsoulis L, Williams HP (eds) Handbook on Modelling for Discrete Optimization. Springer, New York, pp 39–60
11. Raghunathan A, Biegler L (2003) Mathematical programs with equilibrium constraints (MPEC) in process engineering. Comput Chem Eng 27(10):1381–1392
12. Raman R, Grossmann IE (1994) Modelling and computational techniques for logic based integer programming. Comput Chem Eng 18(7):563–578
13. Robinson S (1976) Stability theory for systems of inequalities, part II: differentiable nonlinear systems. SIAM J Numer Anal 13:497–513
14. Scheel H, Scholtes S (2000) Mathematical programs with complementarity constraints: stationarity, optimality, and sensitivity. Math Oper Res 25:1–22
15. Stein O, Oldenburg J, Marquardt W (2004) Continuous reformulations of discrete-continuous optimization problems. Comput Chem Eng 28:1951–1966
16. Stubbs R, Mehrotra S (1999) A branch-and-cut method for 0-1 mixed convex programming. Math Program 86:515–532
17. Vecchietti A, Lee S, Grossmann IE (2003) Modeling of discrete/continuous optimization problems: characterization and formulation of disjunctions and their relaxations. Comput Chem Eng 27(3):433–448

Continuous Review Inventory Models: (Q, R) Policy

ISMAIL CAPAR¹, BURAK EKSIOGLU²
¹ Department of Engineering Technology and Industrial Distribution, Texas A&M University, College Station, USA
² Department of Industrial and Systems Engineering, Mississippi State University, Mississippi State, USA

MSC2000: 49-02, 90-02

Article Outline

Keywords
Introduction
Models
  Single-Echelon Models
  Multi-Echelon Models

Conclusions
References

Keywords

Continuous review inventory models; (Q, R) models

Introduction

Inventory control is an important issue in supply chain management. Today, many different approaches are used to solve complicated inventory control problems. While some of the approaches use a periodic review cycle, others use methods based on continuous review of inventory. In this survey, stochastic inventory theory based on continuous review is analyzed. One of the challenging tasks in continuous review inventory problems is finding the order quantity (Q) and the reorder point (R) such that the total cost is minimized and fill rate constraints are satisfied. The total cost includes ordering cost, backorder cost, and inventory holding cost. The fill rate is defined as the fraction of demand satisfied from inventory on hand. Under the continuous review inventory control methodology, when the inventory position (on-hand inventory plus outstanding orders minus backorders) drops down to or below a reorder point, R, an order of size Q is placed. Although they all refer to the same policy, there are many different representations of this inventory model, such as


(Q, r) (Boyaci and Gallego [5]), (Q, R) (Hing et al. [14]), and (R, Q) (Axsater [2,3] and Marklund [16]). In addition, for some of the problems it is assumed that the order quantity (nQ) is a multiple of a minimum batch size, Q. Here, n is the minimum integer required to increase the inventory position to above R (Chen and Zheng [8]). In this case, the problem is formulated as an (R, nQ) type model.

Models

When the literature on (Q, R) models is investigated, some similarities and differences among the publications can easily be identified. Thus, the publications can be classified according to those similarities and differences. Two of the most distinctive attributes of (Q, R) models are as follows:
1. Type of supply chain: While some articles only consider one entity that uses a (Q, R) policy [1,5,14], others consider a multi-echelon inventory system [2,3,8,16].
2. Exact evaluation or near-optimal evaluation: The (Q, R) inventory problems are not easy to solve; thus, many of the research papers give approximate solution approaches or try to find bounds on the solutions [1,2,4,5,20], and only a small number of articles give an exact evaluation of the (Q, R) inventory system [3,9,21].
In the next section, the literature is reviewed based on the type of supply chain considered and the evaluation methods used. First, heuristic methods are analyzed. Second, publications providing optimal methods are reviewed. In the last section, we give some concluding remarks.

Single-Echelon Models

Hing et al. [14] focus on approximating the average inventory level in a (Q, R) system with backorders. They compare different approaches proposed in the literature. Their numerical analysis shows that the approximation developed by Hadley and Whitin [13], Q/2 + safety stock, is more robust than the other approximations that have been proposed so far. The authors then propose a new methodology based on spreadsheet optimization. Using numerical examples they show that the spreadsheet-optimization-based approach is better than the methods proposed in the literature.
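A minimal simulation sketch of the (Q, R) mechanics described in the introduction is given below; demand is modeled as a unit-sized Poisson stream and all parameter values are illustrative assumptions. Such a simulation gives a crude empirical estimate of the fill rate against which the approximations discussed in this section can be checked.

```python
import numpy as np

# Continuous-review (Q, R) policy: an order of size Q is placed whenever
# the inventory position (on hand + on order - backorders) drops to R or
# below; unmet demand is backordered. All parameters are assumptions.

rng = np.random.default_rng(1)
Q, R, lead, rate, horizon = 50, 20, 2.0, 10.0, 10_000.0

net, on_order = R + Q, 0          # net stock (negative means backorders)
arrivals = []                     # arrival times of outstanding orders
t, served, demanded = 0.0, 0, 0
while t < horizon:
    t += rng.exponential(1.0 / rate)       # next unit-demand epoch
    while arrivals and arrivals[0] <= t:   # receive orders that are due
        arrivals.pop(0); net += Q; on_order -= Q
    demanded += 1; net -= 1
    if net >= 0:
        served += 1                        # filled from on-hand stock
    if net + on_order <= R:                # inventory position <= R
        arrivals.append(t + lead); on_order += Q
print("estimated fill rate:", served / demanded)
```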

Agrawal and Seshadri [1] provide upper and lower bounds for the optimal R and Q subject to fill rate constraints. Although the authors consider backorder costs, the algorithm developed to find the bounds can also be used when backorder costs are zero. Another important feature of the algorithm is that it can be applied when there are no service level constraints. Like Agrawal and Seshadri [1], Platt et al. [19] also consider fill rate constraints and propose two heuristics for (Q, R) policy models. While the first heuristic is suitable for deterministic lead-time demand models, the second one assumes that demand during the lead time follows a normal distribution. Both heuristics are used to find the R and Q values. The authors compare the proposed heuristics with others from the literature; their analysis shows that the proposed heuristics do not necessarily outperform the other heuristics in every problem instance.

Boyaci and Gallego [5] propose a new (Q, R) model that minimizes average holding and ordering costs subject to upper bounds on the expected and maximum waiting times of the backordered items. They provide optimality conditions and an exact algorithm for the problem, and conclude their study with a numerical analysis. Gallego [12] proposes heuristics to find distribution-free bounds on the optimal cost and the optimal batch size when a (Q, R) policy is used. He also shows that the heuristics work well when the demand distribution is Poisson or compound Poisson.

Bookbinder and Cakanyildirim [4] consider a (Q, R) policy where the lead time is not constant. They treat the lead time as a random variable and develop two probabilistic models. While in the first model the lead time is fixed, in the second model the lead time can be reduced by using an expediting factor. The order quantity, the reorder point, and the expediting factor are the three decision variables in the second model. The authors show that for both models the expected cost per unit time is jointly convex. They also perform a sensitivity analysis with respect to the cost parameters. Ryu and Lee [20] also consider the lead time as a decision variable; in their study, however, the demand is constant. Ryu and Lee [20] assume that there are two suppliers for the items to be procured and mainly consider two cases. In the first case the lead time cannot be decreased, but in the second case orders can be expedited. The authors also assume that the lead-time distributions are non-identical exponentials. For the first case, their objective is to determine a Q, an R, and an order-splitting proportion. In the second case, they find new values for the lead times using the order-splitting proportion. Their sensitivity analysis shows that the order-splitting proportion tends to be one half, and that it is biased by the coefficient of the expediting function.

Cakanyildirim et al. [6] develop a model that considers lead-time variability. The authors assume that the lead time is affected by both the lot size and the reserved capacity. They derive a closed-form solution for the situation where the lead time is proportional to the lot size, and present the effect of linear and concave lead times on the value of the cost function. In the model, in addition to the order quantity and the reorder point, the reserved capacity is also a decision variable. Finally, the authors consider a case in which a fixed proportion of capacity is allocated at the manufacturing facility.

Most of the articles in the literature consider the lead time as a constant and focus on the demand during the lead time. However, Wu and Ouyang [21] assume that the lead time is a decision variable and that the lead-time demand follows a normal distribution. They also assume that an arriving order may contain some defective parts and that those parts are kept in inventory until the next delivery. Moreover, they include in the model an inspection cost for defective parts. Their model is defined as a (Q, R, L) inventory model, where the order quantity (Q), the reorder point (R), and the lead time (L) are decision variables. The objective is to minimize the total cost, which includes ordering costs, inventory holding costs (defective and non-defective), lost sales costs, backorder costs, and inspection costs. The authors present an algorithm to find the optimal solutions for the given problem.

Duran et al. [9] present a (Q, R) policy where orders can be expedited. At the time of order release, if the inventory position is less than or equal to a critical value r_e, the order is expedited at an additional cost. If the inventory level is higher than r_e and lower than or equal to the reorder point R, then the order is not expedited. The aim is to find the order quantity (Q), the reorder point (R), and the expediting point r_e which minimize the average cost (which does not include backorder costs). The authors present an optimal algorithm to obtain the Q, R, and r_e values if they are restricted to be integers.

The model proposed by Kao and Hsu [15] differs from the other models reviewed in this article because the authors discuss the order quantity and reorder point with fuzzy demand. Kao and Hsu [15] use this fuzzy demand to construct the fuzzy total inventory cost. The authors derive five pairs of simultaneous nonlinear equations to find the optimal order quantity Q and reorder point R. They show that when the demand is a trapezoidal fuzzy number, the equations can be reduced to a set of closed-form equations, and they prove that the solution to these equations gives an optimal solution. Kao and Hsu [15] also present a numerical example to show that the solution methodology developed in the paper is easy to apply in practice.

Multi-Echelon Models

Moinzadeh and Lee [18] present a model to determine the batch size in a multi-echelon system with one central depot and M sites. In their problem, when the number of failed items at any site equals the order quantity Q, those items are sent to the depot. If the depot has sufficient inventory on hand, it delivers the items immediately; otherwise, the items are backlogged. While all sites use a (Q, R) policy, the depot uses an (S-1, S) policy; in other words, whenever the depot receives an order of size Q, it simultaneously places an order to replenish its stock. After determining the Q and R values for each site, the authors use an approximation to estimate the total system stock and the backorder levels. The numerical results show that the (Q, R) policy is better than the (S-1, S) policy for such systems.

Forsberg [10] deals with a multi-echelon inventory system with one warehouse and multiple non-identical retailers. The author assumes that the retailers face independent Poisson demands and that both the warehouse and the retailers use (Q, R) policies. Forsberg [10] evaluates inventory holding and shortage costs using an exact solution approach.

Chen and Zheng [8] study an (nQ, R) policy in a multi-stage serial inventory system where stage 1 orders from stage 2, stage 2 from stage 3, etc., and stage N places orders to an outside supplier with unlimited capacity. The demand seen by stage 1 is compound Poisson and excess demand is backlogged at every stage. The transportation lead times between stages are constant. By using a two-step approach, Chen and Zheng [8] provide a near-optimal solution. In the first step, they find lower and upper bounds on the cost function by changing the penalty cost of being short on inventory. In the second step, the authors minimize the bounds by using three different heuristic approaches. Chen and Zheng [8] also propose an optimal algorithm that requires additional computational effort.

Axsater [2] considers a two-stage inventory system with one warehouse and N non-identical retailers. He presents an exact method to evaluate inventory holding and shortage costs when there are only two retailers. He focuses on the timing of the warehouse orders for the sub-batches of Q: he identifies three possibilities and evaluates the cost for each case separately; the total cost is then calculated by summing the costs of the three cases. When there are more than two retailers, he extends his evaluation technique by combining the retailers into two groups and then uses the same approach he developed for the two-retailer case. The author also presents a model where the lead times are constant and all facilities use (Q, R) policies with different Q and R values. In this model, all stockouts are backordered, delayed orders are delivered on a first-come-first-served basis, and partial shipments are allowed. In order to simplify the problem, Axsater [2] assumes that all batch sizes are multiples of the smallest batch size. In the objective function, the author only considers the expected inventory holding and backorder costs.

Like Axsater [2], Marklund [16] also considers a two-stage supply chain with one central warehouse and an arbitrary number of non-identical retailers. Customer demands occur only at the retailers. The retailers use (Q, R) policies with different parameters, and they request products from the central warehouse whenever their inventory positions reach or fall below R. The author proposes a new policy (Q0, a0) that is motivated by relating the traditional echelon stock model to the installation stock (Q, R) model where the order quantity Q is a multiple of a minimum batch size. In the article, Marklund [16] gives a detailed derivation of the exact cost function when the retailers use (Q, R) policies and the warehouse uses the new (Q0, a0) policy. The performance of the new policy is compared to the traditional echelon stock policy and the (Q, R) policy through numerical examples. Although the results show that the proposed policy outperforms the other policies in all numerical examples, the author does not guarantee that the policy will always give the best result.

Fujiwara and Sedarage [11] apply a (Q, R) policy to a multi-part assembly system under stochastic lead times. The objective of the article is to simultaneously determine the order quantity and the assembly lot size so that the average total cost per unit time is minimized. The total cost includes setup costs, inventory holding costs of parts and assembled items, and shortage costs of assembled items. The authors try to find separate reorder points r_i for each part and a global order quantity Q which is used for all parts. Although the authors propose a global order quantity Q, they also mention that this kind of policy may not be optimal; they suggest that instead of a global Q, a common Q where all order quantities are multiples of Q might be more sensible.

Chen and Zheng [7] consider a distribution system with one warehouse and multiple retailers. The retailers' demands follow independent compound Poisson processes. It is assumed that the order quantity is a multiple of the smallest batch size. The order quantity and the reorder point are calculated by using a heuristic. The authors present an exact procedure for evaluating the performance (average cost) of the (nQ, R) policy when the demand is a Poisson process. Chen and Zheng [7] also give two approximation procedures for the case with compound Poisson processes; the approximations are based on the exact formulations for the case with Poisson processes.

Axsater [3] presents an exact analysis of a two-stage inventory system with one warehouse and multiple retailers. The demand at each retailer follows an independent compound Poisson process. The retailers replenish their stock from the warehouse, and the warehouse replenishes its stock from an outside supplier. The transportation times from the warehouse to the retailers and from the outside supplier to the warehouse are constant. In addition, if there is a shortage, an additional delay may occur, since shortages and stockouts are backordered. The author emphasizes that the approach developed is not directly applicable to items with large demand; instead, it is suitable mostly for slow-moving items such as spare parts.

Moinzadeh [17] also considers a supply chain with one warehouse and multiple identical retailers. The author assumes that demand at the retailers is random but stationary and that each retailer places its orders according to a (Q, R) policy. In addition, Moinzadeh [17] assumes that the warehouse receives online information about the demand. The author shows the effect of information sharing on the order replenishment decisions of the supplier. In the article, the author first proposes a possible replenishment policy for the supplier and then provides an exact analysis of the operating measures of such systems. The author concludes the article by giving information about when information sharing is most beneficial.

Conclusions

We provide a literature review on continuous review (Q, R) inventory policies. Although we review most of the well-known papers that deal with the (Q, R) policy, this is not an exhaustive review of the literature. Our aim is to present the importance of the (Q, R) policy and show possible extensions of the simple (Q, R) model.

References
1. Agrawal V, Seshadri S (2000) Distribution free bounds for service constrained (Q, r) inventory systems. Nav Res Logist 47:635–656
2. Axsater S (1998) Evaluation of installation stock based (R, Q)-policies for two-level inventory systems with Poisson demand. Oper Res 46(3):135–145
3. Axsater S (2000) Exact analysis of continuous review (R, Q) policies in two-echelon inventory systems with compound Poisson demand. Oper Res 48(5):686–696
4. Bookbinder JH, Cakanyildirim M (1999) Random lead times and expedited orders in (Q, r) inventory systems. Eur J Oper Res 115:300–313
5. Boyaci T, Gallego G (2002) Managing waiting times of backordered demands in single-stage (Q, r) inventory systems. Nav Res Logist 49:557–573
6. Cakanyildirim M, Bookbinder JH, Gerchak Y (2000) Continuous review inventory models where random lead time depends on lot size and reserved capacity. Int J Product Econ 68:217–228
7. Chen F, Zheng YS (1997) One warehouse multi-retailer system with centralized stock information. Oper Res 45(2):275–287
8. Chen F, Zheng YS (1998) Near-optimal echelon-stock (R, nQ) policies in multistage serial systems. Oper Res 46(4):592–602
9. Duran A, Gutierrez G, Zequeira RI (2004) A continuous review inventory model with order expediting. Int J Product Econ 87:157–169
10. Forsberg R (1997) Exact evaluation of (R, Q)-policies for two-level inventory systems with Poisson demand. Eur J Oper Res 96:130–138
11. Fujiwara O, Sedarage D (1997) An optimal (Q, r) policy for a multipart assembly system under stochastic part procurement lead times. Eur J Oper Res 100:550–556
12. Gallego G (1998) New bounds and heuristics for (Q, r) policies. Manag Sci 44(2):219–233
13. Hadley G, Whitin TM (1963) Analysis of Inventory Systems. Prentice-Hall, Englewood Cliffs, NJ
14. Hing A, Lau L, Lau HS (2002) A comparison of different methods for estimating the average inventory level in a (Q, R) system with backorders. Int J Product Econ 79:303–316
15. Kao C, Hsu WH (2002) Lot size-reorder point inventory model with fuzzy demands. Comput Math Appl 43:1291–1302
16. Marklund J (2002) Centralized inventory control in a two-level distribution system with Poisson demand. Nav Res Logist 49:798–822
17. Moinzadeh K (2002) A multi-echelon inventory system with information exchange. Manag Sci 48(3):414–426
18. Moinzadeh K, Lee HL (1986) Batch size and stocking levels in multi-echelon repairable systems. Manag Sci 32(12):1567–1581
19. Platt DE, Robinson LW, Freund RB (1997) Tractable (Q, R) heuristic models for constrained service levels. Manag Sci 43(7):951–965
20. Ryu SW, Lee KK (2003) A stochastic inventory model of dual sourced supply chain with lead-time reduction. Int J Product Econ 81–82:513–524
21. Wu KS, Ouyang LY (2001) (Q, r, L) inventory model with defective items. Comput Ind Eng 39:173–185

Contraction-Mapping

C. T. KELLEY
Department of Mathematics and Center for Research in Scientific Computation, North Carolina State University, Raleigh, USA

MSC2000: 65H10, 65J15

Article Outline

Keywords
Statement of the Result
Affine Problems
Nonlinear Problems
Integral Equations Example


See also
References

Keywords

Nonlinear equations; Linear equations; Integral equations; Iterative method; Contraction mapping

Statement of the Result

The method of successive substitution, Richardson iteration, or direct iteration seeks to find a fixed point of a map K, that is, a point u* such that

u* = K(u*) .

Given an initial iterate u_0, the iteration is

u_{k+1} = K(u_k) , for k ≥ 0 .    (1)

Let X be a Banach space and let D ⊂ X be closed. A map K : D → D is a contraction if

‖K(u) − K(v)‖ ≤ α ‖u − v‖    (2)

for some α ∈ (0, 1) and all u, v ∈ D. The contraction mapping theorem, [3,7,13,14], states that if K is a contraction on D then
- K has a unique fixed point u* in D, and
- for any u_0 ∈ D the sequence {u_k} given by (1) converges to u*.
The message of the contraction mapping theorem is that if one wishes to use direct iteration to solve a fixed point problem, then the fixed point map K must satisfy (2) for some D and relative to some choice of norm. The choice of norm need not be made explicitly; it is determined implicitly by K itself. However, if there is no norm for which (2) holds, then another, more robust method, such as Newton's method with a line search, must be used, or the problem must be reformulated. One may wonder why a Newton-like method is not always better than direct iteration. The answer is that the cost of a single iteration is very low for Richardson iteration. So, if the equation can be set up to make the contraction constant α in (2) small, successive substitution, while taking more iterations, can be more efficient than a Newton-like iteration, which has costs in linear algebra and derivative evaluation that are not incurred by successive substitution.

Affine Problems

An affine fixed point map has the form

K(u) = M u + b ,

where M is a linear operator on the space X. The fixed point equation is

(I − M) u = b ,    (3)

where I is the identity operator. The classical stationary iterative methods in numerical linear algebra, [8,13], are typically analyzed in terms of affine fixed point problems, where M is called the iteration matrix. Multigrid methods, [2,4,5,9], are also stationary iterative methods. We give an example of how multigrid methods are used later in this article. The contraction condition (2) holds if

‖M‖ ≤ α < 1 .    (4)

In (4) the norm is the operator norm on X. M may be a well-defined operator on more than one space, and (4) may not hold in all of them. Similarly, if X is finite dimensional and all norms are equivalent, (4) may hold in one norm and not in another. It is known, [10], that (4) holds for some norm if and only if the spectral radius of M is < 1. When (4) does not hold, it is sometimes possible to form an approximate inverse preconditioner P so that direct iteration can be applied to the equivalent problem

u = (I − P(I − M)) u + P b .    (5)

In order to apply the contraction mapping theorem and direct iteration to (5) we require that ‖I − P(I − M)‖ ≤ α < 1 in some norm. In this case we say that P is an approximate inverse for I − M. In the final section of this article we give an example of how approximate inverses can be built for discretizations of integral operators.

where f 2 X = C(˝) is given and a solution u 2 X is sought. In this example D = X. We will assume that the linear operator I  K is nonsingular on X. We consider a family of increasingly accurate quadrature rules, indexed with a level l, with weights Nl Nl and nodes {x li } iD1 that satisfy {w li } iD1

l !1

jD1

f (x lj )w lj D

Z f (x) dx ˝

is precompact in X. The direct consequences of the strong convergence and collective compactness are that I  Kl are nonsingular for l sufficiently large and (I  K l )1 ! (I  K)1

(7)

strongly in X. The Atkinson–Brakhage preconditioner is based on these results. For g 2 X one can compute v D (I  K l )1 g by solving the finite-dimensional linear system g(x il )

C

Nl X

k(x il ; x lj )v j w lj

(8)

jD1

We close this article with the Atkinson–Brakhage preconditioner for integral operators [2,4]. We will begin with the linear case, from which the nonlinear algorithm is a simple step. Let ˝ 2 RN be compact and let k(x, y) be a continuous function on ˝ × ˝. We consider the affine fixed point problem Z k(x; y)u(y) dy; u(x) D f (x) C (Ku)(x) D f (x) C

Nl X

[ l K l (B)

vi D

Integral Equations Example

lim

for all u 2 X. The family {Kl } is also collectively compact, [1]. This means that if B is a bounded subset of X, then

for the values v(x li ) = vi of v at the nodal points and then applying the Nyström interpolation v(x) D g(x) C

Nl X

k(x; x lj )v j w lj D g(x) C (K l v)(x)

jD1

to recover v(x) for all x 2 ˝. (8) can be solved at a cost of O(N 3l ) floating point operations if direct methods for linear equations are used and for much less if iterative methods such as GMRES [15] are used. In that case, only O(1) matrix-vector products are need to obtain a solution that is accurate to truncation error [6]. This is, up to a multiplicative factor, optimal. The Atkinson– Brakhage preconditioner can dramatically reduce this factor, however. The results in [1] imply that M l D I C (I  K l )1 K;


the Atkinson–Brakhage preconditioner, converges to (I − K)^{-1} in the operator norm. Hence, for l sufficiently large (coarse mesh sufficiently fine), Richardson iteration can be applied to the system

u = u − M_l ( (I − K_L) u − f ) ,

where L ≥ l. Applying this idea to a sequence of grids or levels leads to the optimal form of the Atkinson–Brakhage iteration [11]. The algorithm uses a coarse mesh, which we index with l = 0, to build the preconditioner, and then cycles through the grids sequentially until the solution on the desired fine (l = L) mesh is obtained. One example of this is a sequence of composite midpoint rule quadratures in which N_{l+1} = 2N_l. Then, [2,11], if the coarse mesh is sufficiently fine, only one Richardson iteration at each level will be needed. The cost at each level is two matrix-vector products at level l and a solve at level 0.

1) Solve u_0 − K_0 u_0 = f; set u = u_0.
2) For l = 1, …, L:
   a) Compute r = u − K_l u − f;
   b) u = u − M_0 r.
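The sketch below implements the two-level structure of this algorithm for an assumed kernel and right-hand side, using composite midpoint rules with N_{l+1} = 2N_l. For simplicity the grid transfers are done by piecewise-linear interpolation rather than Nyström interpolation, so it illustrates the structure of the iteration rather than its optimal-accuracy form.

```python
import numpy as np

# Nystrom discretization of u = f + Ku with the assumed kernel
# k(x, y) = 0.5*exp(-|x - y|) on [0, 1]; coarse solve at level 0, one
# Richardson step per level as in steps 1)-2) above.

def grid(N):
    x = (np.arange(N) + 0.5) / N                        # midpoint nodes
    K = 0.5 * np.exp(-np.abs(x[:, None] - x[None, :])) / N
    return x, K

f = lambda x: np.cos(x)                                 # assumed data
levels = [16, 32, 64, 128]

x0, K0 = grid(levels[0])
A0 = np.eye(levels[0]) - K0
u = np.linalg.solve(A0, f(x0))                          # 1) coarse solve
x_prev = x0
for N in levels[1:]:
    x, K = grid(N)
    u = np.interp(x, x_prev, u)                         # move u to level l
    r = u - K @ u - f(x)                                # 2a) residual
    # 2b) u <- u - M_0 r, with M_0 r = r + (I - K_0)^{-1} K r (coarse solve)
    r0 = np.interp(x0, x, K @ r)                        # restrict K r
    u = u - (r + np.interp(x, x0, np.linalg.solve(A0, r0)))
    x_prev = x
print("fine-grid residual:", np.linalg.norm(u - K @ u - f(x), np.inf))
```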

Nonlinear problems can be solved with exactly the same idea. We will consider the special case of Hammerstein equations

u(x) = K(u)(x) = ∫_Ω k(x, y, u(y)) dy .

If we use a sequence of quadrature rules as in the linear case we can define

K_l(u)(x) = Σ_{j=1}^{N_l} k(x, x_j^l, u(x_j^l)) w_j^l .

The nonlinear form of the Atkinson–Brakhage algorithm for Hammerstein equations simply uses the approximation

I + (I − K_0′(u_0))^{-1} K′(u) ≈ (I − K_l′(u))^{-1}

in a Newton-like iteration. One can see from the formal description below that little has changed from the linear case.

1) Solve u_0 − K_0(u_0) = 0; set u = u_0.
2) For l = 1, …, L:
   a) Compute r = u − K_l(u);
   b) u = u − ( I + (I − K_0′(u_0))^{-1} K′(u) ) r.

The Atkinson–Brakhage algorithm can, under some conditions, be further improved [12], with the number of fine mesh operator-function products per level reduced to one. There is also no need to explicitly represent the operator as an integral operator with a kernel.

See also

- Global Optimization Methods for Systems of Nonlinear Equations
- Interval Analysis: Systems of Nonlinear Equations
- Nonlinear Least Squares: Newton-Type Methods
- Nonlinear Systems of Equations: Application to the Enclosure of all Azeotropes

References
1. Anselone PM (1971) Collectively compact operator approximation theory. Prentice-Hall, Englewood Cliffs, NJ
2. Atkinson KE (1973) Iterative variants of the Nyström method for the numerical solution of integral equations. Numer Math 22:17–31
3. Banach S (1922) Sur les opérations dans les ensembles abstraits et leur applications aux équations intégrales. Fundam Math 3:133–181
4. Brakhage H (1960) Über die numerische Behandlung von Integralgleichungen nach der Quadraturformelmethode. Numer Math 2:183–196
5. Briggs W (1987) A multigrid tutorial. SIAM, Philadelphia
6. Campbell SL, Ipsen ICF, Kelley CT, Meyer CD, Xue ZQ (1996) Convergence estimates for solution of integral equations with GMRES. J Integral Eq Appl 8:19–34
7. Dennis JE, Schnabel RB (1996) Numerical methods for nonlinear equations and unconstrained optimization. Classics Appl Math, vol 16. SIAM, Philadelphia
8. Golub GH, Van Loan CG (1983) Matrix computations. Johns Hopkins Univ Press, Baltimore, MD
9. Hackbusch W (1985) Multi-grid methods and applications. Comput Math, vol 4. Springer, Berlin
10. Isaacson E, Keller HB (1966) Analysis of numerical methods. Wiley, New York
11. Kelley CT (1990) Operator prolongation methods for nonlinear equations. In: Allgower EL, Georg K (eds) Computational Solution of Nonlinear Systems of Equations. Lect Appl Math. Amer Math Soc, Providence, RI, pp 359–388
12. Kelley CT (1995) A fast multilevel algorithm for integral equations. SIAM J Numer Anal 32:501–513
13. Kelley CT (1995) Iterative methods for linear and nonlinear equations. Frontiers in Appl Math, vol 16. SIAM, Philadelphia
14. Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables. Acad Press, New York
15. Saad Y, Schultz MH (1986) GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Statist Comput 7:856–869

Control Vector Iteration CVI

REIN LUUS
Dept. Chemical Engineering, Univ. Toronto, Toronto, Canada

MSC2000: 93-XX

Article Outline

Keywords and Phrases
Optimal Control Problem
Second Variation Method
Determination of Stepping Parameter
Illustration of the First Variation Method
See also
References

Keywords and Phrases

Optimal control; Control vector iteration; Variation method; Pontryagin's maximum principle

In solving optimal control problems involving nonlinear differential equations, some iterative procedure must be used to obtain the optimal control policy. As is true with any iterative procedure, one is concerned about the convergence rate and also about the reliability of obtaining the optimal control policy. Although from Pontryagin's maximum principle it is known that the minimum of the performance index corresponds to the minimum of the Hamiltonian, obtaining the minimum value of the Hamiltonian is not always straightforward. Here we outline a procedure that changes the control policy from iteration to iteration, improving the value of the performance index at each iteration, until the improvement is less than a certain amount. Then the iteration procedure is stopped and the results are analyzed. Such a procedure is called the control vector iteration method (CVI), or iteration in the policy space.

Optimal Control Problem

To illustrate the procedure, let us consider the optimal control problem where the system is described by the differential equation

dx/dt = f(x, u) , with x(0) given ,    (1)

where x is an n-dimensional state vector and u is an r-dimensional control vector. The optimal control problem is to determine the control u in the time interval 0 ≤ t < t_f so that the performance index

I = ∫_0^{t_f} φ(x, u) dt    (2)


is minimized. We consider the case where the final time t_f is given. To carry out the minimization of the performance index in (2) subject to the constraints in (1), we consider the augmented performance index

J = ∫_0^{t_f} [ φ + z^T ( f − dx/dt ) ] dt ,    (3)

where the n-dimensional vector of Lagrange multipliers z is called the adjoint vector. The last term in (3) can be thought of as a penalty function to ensure that the state equation is satisfied throughout the given time interval. We introduce the Hamiltonian

H = φ + z^T f    (4)

and use integration by parts to simplify (3) to

J = ∫_0^{t_f} [ H + (dz^T/dt) x ] dt − z^T(t_f) x(t_f) + z^T(0) x(0) .    (5)

The optimal control problem now reduces to the minimization of J. To minimize J numerically, we assume that we have evaluated J at iteration j using the control policy denoted by u^{(j)}. The problem is then to determine the control policy u^{(j+1)} for the next iteration. Since the goal is to minimize J, we obviously want to make the change in J negative and numerically as large as possible. If we let δu = u^{(j+1)} − u^{(j)}, the corresponding change in J is obtained by a Taylor series expansion up to the quadratic terms:


δJ = ∫_0^{t_f} [ ( dz^T/dt + (∂H/∂x)^T ) δx + (∂H/∂u)^T δu ] dt
   + ½ ∫_0^{t_f} [ δx^T (∂²H/∂x²) δx + 2 δx^T (∂²H/∂x∂u) δu + δu^T (∂²H/∂u²) δu ] dt
   − z^T(t_f) δx(t_f) .    (6)

The necessary condition for a minimum of J is that the first integral in (6) be zero, i.e.,

dz/dt = − ∂H/∂x , with z(t_f) = 0 ,    (7)

and

∂H/∂u = 0 .    (8)

In control vector iteration we relax the necessary condition in (8) and choose δu to make δJ negative; in the limit, (8) is satisfied. One approach is to choose

δu = − ε ∂H/∂u ,    (9)

where ε is a positive parameter which may vary from iteration to iteration. This method is sometimes called the first variation method, since the driving force for the change in the control policy is based only on the first term of the Taylor series expansion. The negative sign in (9) is required to minimize the Hamiltonian, as required by Pontryagin's maximum principle. Numerous papers have been written on the determination of the stepping parameter ε [7].

Second Variation Method

Instead of arbitrarily determining the stepping parameter ε, one may solve the accessory minimization problem, where δu is chosen to minimize δJ given by (6) after the requirements for the adjoint are satisfied; i.e., it is required to find δu to minimize δJ given by

δJ = ∫_0^{t_f} [ (∂H/∂u)^T δu + ½ δx^T (∂²H/∂x²) δx + δx^T (∂²H/∂x∂u) δu + ½ δu^T (∂²H/∂u²) δu ] dt ,    (10)

subject to the differential equation

dδx/dt = (∂f^T/∂x)^T δx + (∂f^T/∂u)^T δu , with δx(0) = 0 .    (11)

The solution to this accessory minimization problem is straightforward, since (11) is linear and the performance index in (10) is almost quadratic; it can be obtained easily, as shown in ([1], pp. 259–266) and [7]. The resulting equations, to be integrated backwards from t = t_f to t = 0 with zero starting conditions, are

dJ/dt + (∂f^T/∂x)^T J + J (∂f^T/∂x) + ∂²H/∂x² − S^T ( ∂²H/∂u² )^{-1} S = 0 ,    (12)

where the (r × n)-matrix S = ∂²H/∂u∂x + (∂f^T/∂u) J, and

dg/dt − S^T ( ∂²H/∂u² )^{-1} [ ∂H/∂u + (∂f^T/∂u) g ] + (∂f^T/∂x) g = 0 .    (13)

The control policy is then updated through the equation

u^{(j+1)} = u^{(j)} − ( ∂²H/∂u² )^{-1} [ ∂H/∂u + (∂f^T/∂u) g ] − ( ∂²H/∂u² )^{-1} S ( x^{(j+1)} − x^{(j)} ) .    (14)

This method of updating the control policy is called the second variation method. In (12) the (n × n)-matrix J is symmetric, so the total number of differential equations to be integrated backwards is n(n + 1)/2 + 2n. However, the convergence is quadratic if the initial control policy is close to the optimum. To obtain good starting conditions, Luus and Lapidus [6] suggested using the first variation method for the first few iterations and then switching over to the second variation method. One additional feature of the second variation method is that the control policy given in (14) is a function of the present state, so the control policy is treated as being continuous and is not restricted to being piecewise constant over an integration time step, as is the case with the first variation method. As was shown in ([1], pp. 316–317) for the linear six-plate gas absorber example, when the system equation is linear and the performance index is quadratic, the second variation method yields the optimal control policy in a single step.


piecewise constant over an integration time step, as is the case with the first variation method. As was shown in ([1], pp. 316–317), for the linear six-plate gas absorber example, when the system equation is linear and the performance index is quadratic, the second variation method yields the optimal control policy in a single step.

Determination of Stepping Parameter

However, the large number and complexity of equations required for obtaining the control policy, and the instability of the method for very complex systems, led to investigating different means of obtaining faster convergence with the first variation method. The effort was directed toward the best means of obtaining the stepping parameter ε in (9). When ε is too large, overstepping occurs, and if ε is too small, the convergence rate is very slow. Numerous papers have been written on the determination of ε. Several methods were compared by Rao and Luus [8] in solving typical optimal control problems. Although they suggested a means of determining the 'best' method for performance indices that are almost quadratic, it is found that a very simple scheme is quite effective for a wide variety of optimal control problems. Instead of trying to get very fast convergence and risk instability, the emphasis is placed on robustness. The strategy is to obtain the initial value for ε from the magnitude of ∂H/∂u, and then to increase ε when the iteration has been successful, and to reduce its value if overstepping occurs. This type of approach was used in [2] in solving the optimal control of a pyrolysis problem. When the iteration was successful, the stepping parameter was increased by 10 percent, and when overstepping resulted, the stepping parameter was reduced to half its value. The algorithm for the first variation method may be presented as follows:

1. Choose an initial control policy u^{(0)} and a value for ε; set the iteration index j to 0.
2. Integrate (1) from t = 0 to t = t_f and evaluate the performance index in (2). Store the values of the state vector at the end of each integration time step.
3. Integrate the adjoint equation (7) from t = t_f to t = 0, using for x the stored values of the state vector in Step 2. At each integration time step evaluate the gradient ∂H/∂u.
4. Choose a new control policy

  u^{(j+1)} = u^{(j)} - \epsilon \, \frac{\partial H}{\partial u} .   (15)

5. Integrate (1) from t = 0 to t = t_f and evaluate the performance index in (2). Store the values of the state vector at the end of each integration time step. If the performance index is worse (i.e., overstepping has occurred), reduce ε to half its value and go to Step 4. If the performance index has been improved, increase ε by a small factor, such as 1.10, and go to Step 3; continue for a number of iterations, or terminate the iterations when the change in the performance index in an iteration is less than some criterion, and interpret the results.

Illustration of the First Variation Method

Let us consider the nonlinear continuous stirred tank reactor that has been used for optimal control studies in ([1], pp. 308–318) and [6], and which was shown in [4] to exhibit multiplicity of solutions. The system is described by the two equations

  \frac{dx_1}{dt} = -2 (x_1 + 0.25) + (x_2 + 0.5) \exp\!\left( \frac{25 x_1}{x_1 + 2} \right) - u (x_1 + 0.25) ,   (16)

  \frac{dx_2}{dt} = 0.5 - x_2 - (x_2 + 0.5) \exp\!\left( \frac{25 x_1}{x_1 + 2} \right) ,   (17)

with the initial state x_1(0) = 0.09 and x_2(0) = 0.09. The control u is a scalar quantity related to the valve opening of the coolant. The state variables x_1 and x_2 represent deviations from the steady state of dimensionless temperature and concentration, respectively. The performance index to be minimized is

  I = \int_0^{t_f} \left( x_1^2 + x_2^2 + 0.1 u^2 \right) dt ,   (18)

where the final time t_f = 0.78. The Hamiltonian is

  H = z_1 \left( -2 (x_1 + 0.25) + R - u (x_1 + 0.25) \right) + z_2 \left( 0.5 - x_2 - R \right) + x_1^2 + x_2^2 + 0.1 u^2 ,   (19)


where R = (x_2 + 0.5) exp(25x_1/(x_1 + 2)). The adjoint equations are

  \frac{dz_1}{dt} = (u + 2) z_1 - 2 x_1 + 50 R \, \frac{z_2 - z_1}{(x_1 + 2)^2} ,   (20)

  \frac{dz_2}{dt} = -2 x_2 + R \, \frac{z_2 - z_1}{x_2 + 0.5} + z_2 ,   (21)

and the gradient of the Hamiltonian is

  \frac{\partial H}{\partial u} = 0.2 u - (x_1 + 0.25) z_1 .   (22)
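The algorithm above is short enough to express directly in code. The following is a minimal Python sketch of the first variation method applied to the CSTR of (16)–(22); the forward-Euler integration, the 120-stage grid, the initial value of ε, and the tolerance are our own illustrative choices, not those reported below, so its numbers only approximate the published results.

    import numpy as np

    tf, P = 0.78, 120                    # final time and number of control stages
    dt = tf / P

    def f(x, u):                         # state equations (16)-(17)
        x1, x2 = x
        R = (x2 + 0.5) * np.exp(25.0 * x1 / (x1 + 2.0))
        return np.array([-2.0 * (x1 + 0.25) + R - u * (x1 + 0.25),
                         0.5 - x2 - R])

    def simulate(u):                     # forward Euler; stores states, accumulates (18)
        x, xs, I = np.array([0.09, 0.09]), [], 0.0
        for k in range(P):
            xs.append(x.copy())
            I += (x[0]**2 + x[1]**2 + 0.1 * u[k]**2) * dt
            x = x + dt * f(x, u[k])
        return xs, I

    def gradient(xs, u):                 # backward adjoint sweep, eqs. (20)-(22)
        z, g = np.zeros(2), np.zeros(P)  # z(tf) = 0
        for k in reversed(range(P)):
            x1, x2 = xs[k]
            R = (x2 + 0.5) * np.exp(25.0 * x1 / (x1 + 2.0))
            dz1 = (u[k] + 2.0) * z[0] - 2.0 * x1 + 50.0 * R * (z[1] - z[0]) / (x1 + 2.0)**2
            dz2 = -2.0 * x2 + R * (z[1] - z[0]) / (x2 + 0.5) + z[1]
            z = z - dt * np.array([dz1, dz2])
            g[k] = 0.2 * u[k] - (x1 + 0.25) * z[0]   # dH/du, eq. (22)
        return g

    u = np.full(P, 1.0)                  # initial control policy u^(0)
    xs, I = simulate(u)
    eps = 1.0 / np.abs(gradient(xs, u)).max()   # initial epsilon from |dH/du|
    for _ in range(500):
        u_trial = u - eps * gradient(xs, u)     # step 4, eq. (15)
        xs_trial, I_trial = simulate(u_trial)
        if I_trial < I:                  # success: accept and increase eps by 10%
            converged = I - I_trial < 1e-6
            u, xs, I, eps = u_trial, xs_trial, I_trial, 1.1 * eps
            if converged:
                break
        else:                            # overstepping: halve eps and retry
            eps *= 0.5
    print("I =", round(I, 6))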

To illustrate the computational aspects of CVI, the above algorithm was used on a Pentium-120 personal computer with the WATCOM Fortran compiler version 9.5. The calculations were done in double precision. As found in [4], convergence to the local optimum was obtained when small values for the initial control policy were used, and the global optimum was obtained when large values were used as the initial policy. As is seen in Table 1, when an integration time step of 0.0065 was used (allowing 120 piecewise constant steps), in spite of the large number of iterations, the optimal control policy can be obtained in less than 2 s of computer time. The iterations were stopped when the change in the performance index from iteration to iteration was less than 10^{-6}. The total computation time for making this run with 11 different initial control policies was 9.6 s on the Pentium-120 digital computer. When an integration time step of 0.00312 was used, the value of the performance index at the global optimum was improved to 0.133104. When a time step of 0.001 was used, giving 780 time steps, the optimal control policy yielded I = 0.133097. Even here the computation time for the 11 different initial conditions was only 31 s.

Control Vector Iteration CVI, Table 1: Application of First Variation Method to CSTR

Initial policy u^{(0)} | Performance index | Number of iterations | CPU time, s
1.0 | 0.244436 | 16  | 0.16
1.2 | 0.244436 | 17  | 0.17
1.4 | 0.244436 | 18  | 0.11
1.6 | 0.244436 | 18  | 0.16
1.8 | 0.244436 | 19  | 0.22
2.0 | 0.133128 | 143 | 1.49
2.2 | 0.133128 | 149 | 1.53
2.4 | 0.133128 | 149 | 1.54
2.6 | 0.133130 | 133 | 1.43
2.8 | 0.133129 | 142 | 1.37
3.0 | 0.133130 | 136 | 1.38

With the use of piecewise linear control and only 20 time stages, a performance index of I = 0.133101 was obtained in [3] with iterative dynamic programming (IDP). To obtain this result with IDP, by using 5 randomly chosen points and 10 passes, each consisting of 20 iterations, took 13.4 s on a Pentium-120. The use of 15 time stages yielded I = 0.133112 and required 7.8 s. Therefore, computationally CVI is faster than IDP for this problem, but the present formulation does not allow piecewise linear control to be used in CVI.

The effect of the number of time stages for piecewise constant control is shown in Table 2, where CVI results are compared to those obtained by IDP in [5]. As can be seen, the given algorithm gives results very close to those obtained by IDP, and the deviations decrease as the number of time stages increases, because the approximations introduced during the backward integration (when the stored values for the state vector are used) and in the calculation of the gradient of the Hamiltonian in CVI become negligible as the time steps become very small.

Control Vector Iteration CVI, Table 2: Effect of the number of time stages P on the optimal performance index

Number of time stages P | Optimal I by CVI | Optimal I by IDP
20  | 0.13429 | 0.13416
30  | 0.13363 | 0.13357
40  | 0.13339 | 0.13336
60  | 0.13323 | 0.13321
80  | 0.13317 | 0.13316
120 | 0.13313 | 0.13313
240 | 0.13310 | 0.13310

As is shown in Fig. 1, when the optimal value of the performance index is plotted against 1/P², the extrapolated value, as 1/P² approaches zero, gives the value obtained with the second variation method. The first variation method is easy to program and will continue to be a very useful method of determining the optimal control of nonlinear systems.
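The extrapolation just described is easy to reproduce from Table 2; a minimal sketch (the unweighted least-squares fit over all seven rows is our choice):

    import numpy as np

    P = np.array([20.0, 30.0, 40.0, 60.0, 80.0, 120.0, 240.0])   # time stages, Table 2
    I = np.array([0.13429, 0.13363, 0.13339, 0.13323, 0.13317, 0.13313, 0.13310])

    slope, intercept = np.polyfit(1.0 / P**2, I, 1)               # fit I against 1/P^2
    print("extrapolated I as 1/P^2 -> 0:", round(intercept, 5))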


Control Vector Iteration CVI, Figure 1: Linear variation of the optimal performance index with 1/P²; —— CVI, - -Δ- - IDP (figure omitted)

See also

• Boundary Condition Iteration BCI
• Duality in Optimal Control with First Order Differential Equations
• Dynamic Programming: Continuous-time Optimal Control
• Dynamic Programming and Newton's Method in Unconstrained Optimal Control
• Dynamic Programming: Optimal Control Applications
• Hamilton–Jacobi–Bellman Equation
• Infinite Horizon Control and Dynamic Games
• MINLP: Applications in the Interaction of Design and Control
• Multi-objective Optimization: Interaction of Design and Control
• Optimal Control of a Flexible Arm
• Optimization Strategies for Dynamic Systems
• Robust Control
• Robust Control: Schur Stability of Polytopes of Polynomials
• Semi-infinite Programming and Control Problems
• Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems
• Suboptimal Control

References
1. Lapidus L, Luus R (1967) Optimal control of engineering processes. Blaisdell, Waltham
2. Luus R (1978) On the optimization of oil shale pyrolysis. Chem Eng Sci 33:1403–1404
3. Luus R (1993) Application of iterative dynamic programming to very high dimensional systems. Hungarian J Industr Chem 21:243–250
4. Luus R, Cormack DE (1972) Multiplicity of solutions resulting from the use of variational methods in optimal control problems. Canad J Chem Eng 50:309–311
5. Luus R, Galli M (1991) Multiplicity of solutions in using dynamic programming for optimal control. Hungarian J Industr Chem 19:55–62
6. Luus R, Lapidus L (1967) The control of nonlinear systems. Part II: Convergence by combined first and second variations. AIChE J 13:108–113
7. Merriam CW (1964) Optimization theory and the design of feedback control systems. McGraw-Hill, New York, pp 259–261
8. Rao SN, Luus R (1972) Evaluation and improvement of control vector iteration procedures for optimal control. Canad J Chem Eng 50:777–784

Convex Discrete Optimization

SHMUEL ONN
Technion – Israel Institute of Technology, Haifa, Israel

MSC2000: 05A, 15A, 51M, 52A, 52B, 52C, 62H, 68Q, 68R, 68U, 68W, 90B, 90C

Article Outline

Abstract
Introduction
  Limitations
  Outline and Overview of Main Results and Applications
  Terminology and Complexity
Reducing Convex to Linear Discrete Optimization
  Edge-Directions and Zonotopes
  Strongly Polynomial Reduction of Convex to Linear Discrete Optimization
  Pseudo Polynomial Reduction when Edge-Directions Are not Available
Convex Combinatorial Optimization and More
  From Membership to Linear Optimization
  Linear and Convex Combinatorial Optimization in Strongly Polynomial Time
  Linear and Convex Discrete Optimization over any Set in Pseudo Polynomial Time
  Some Applications
Linear N-fold Integer Programming
  Oriented Augmentation and Linear Optimization
  Graver Bases and Linear Integer Programming
  Graver Bases of N-fold Matrices
  Linear N-fold Integer Programming in Polynomial Time
  Some Applications
Convex Integer Programming
  Convex Integer Programming over Totally Unimodular Systems
  Graver Bases and Convex Integer Programming
  Convex N-fold Integer Programming in Polynomial Time
  Some Applications
Multiway Transportation Problems and Privacy in Statistical Databases
  The Universality Theorem
  The Complexity of the Multiway Transportation Problem
  Privacy and Entry-Uniqueness
References

Abstract

We develop an algorithmic theory of convex optimization over discrete sets. Using a combination of algebraic and geometric tools we are able to provide polynomial time algorithms for solving broad classes of convex combinatorial optimization problems and convex integer programming problems in variable dimension. We discuss some of the many applications of this theory including to quadratic programming, matroids, bin packing and cutting-stock problems, vector partitioning and clustering, multiway transportation problems, and privacy and confidential statistical data disclosure. Highlights of our work include a strongly polynomial time algorithm for convex and linear combinatorial optimization over any family presented by a membership oracle when the underlying polytope has few edge-directions; a new theory of so-termed n-fold integer programming, yielding polynomial time solution of important and natural classes of convex and linear integer programming problems in variable dimension; and a complete complexity classification of high-dimensional transportation problems, with practical applications to fundamental problems in privacy and confidential statistical data disclosure.

Introduction

The general linear discrete optimization problem can be posed as follows.

LINEAR DISCRETE OPTIMIZATION. Given a set S ⊆ Zⁿ of integer points and an integer vector w ∈ Zⁿ, find an x ∈ S maximizing the standard inner product wx := Σ_{i=1}^n w_i x_i.

The algorithmic complexity of this problem, which includes integer programming and combinatorial optimization as special cases, depends on the presentation of the set S of feasible points. In integer programming, this set is presented as the set of integer points satisfying a given system of linear inequalities, which in standard form is given by

  S = { x ∈ ℕⁿ : Ax = b } ,

where ℕ stands for the nonnegative integers, A ∈ Z^{m×n} is an m × n integer matrix, and b ∈ Z^m is an integer vector. The input for the problem then consists of A, b, w. In combinatorial optimization, S ⊆ {0,1}ⁿ is a set of {0,1}-vectors, often interpreted as a family of subsets of a ground set N := {1, …, n}, where each x ∈ S is the indicator of its support supp(x) ⊆ N. The set S is presented implicitly and compactly, say as the set of indicators of subsets of edges in a graph G satisfying a given combinatorial property (such as being a matching, a forest, and so on), in which case the input is G, w. Alternatively, S is given by an oracle, such as a membership oracle which, queried on x ∈ {0,1}ⁿ, asserts whether or not x ∈ S, in which case the algorithmic complexity also includes a count of the number of oracle queries needed to solve the problem.

Here we study the following broad generalization of linear discrete optimization.

CONVEX DISCRETE OPTIMIZATION. Given a set S ⊆ Zⁿ, vectors w₁, …, w_d ∈ Zⁿ, and a convex functional c : R^d → R, find an x ∈ S maximizing c(w₁x, …, w_d x).

This problem can be interpreted as multi-objective linear discrete optimization: given d linear functionals w₁x, …, w_d x representing the values of points x ∈ S under d criteria, the goal is to maximize their "convex balancing" defined by c(w₁x, …, w_d x).

In fact, we have a hierarchy of problems of increasing generality and complexity, parameterized by the number d of linear functionals: at the bottom lies the linear discrete optimization problem, recovered as the special case of d = 1 and c the identity on R; and at the top lies the problem of maximizing an arbitrary convex functional over the feasible set S, arising with d = n and with w_i = 1_i the ith standard unit vector in Rⁿ for all i.

The algorithmic complexity of the convex discrete optimization problem depends on the presentation of the set S of feasible points as in the linear case, as well as on the presentation of the convex functional c. When S is presented as the set of integer points satisfying a given system of linear inequalities we also refer to the problem as convex integer programming, and when S ⊆ {0,1}ⁿ and is presented implicitly or by an oracle we also refer to the problem as convex combinatorial optimization. As for the convex functional c, we will assume throughout that it is presented by a comparison oracle that, queried on x, y ∈ R^d, asserts whether or not c(x) ≤ c(y). This is a very broad presentation that reveals little information on the function, making the problem, on the one hand, very expressive and applicable, but on the other hand, very hard to solve.

There is a massive body of knowledge on the complexity of linear discrete optimization – in particular (linear) integer programming [55] and (linear) combinatorial optimization [31]. The purpose of this monograph is to provide the first comprehensive unified treatment of the extended convex discrete optimization problem. The monograph follows the outline of five lectures given by the author in the Séminaire de Mathématiques Supérieures Series, Université de Montréal, during June 2006. Colorful slides of these lectures are available online at [46] and can be used as a visual supplement to this monograph. The monograph has been written under the support of the ISF – Israel Science Foundation. The theory developed here is based on and is a culmination of several recent papers including [5,12,13,14,15,16,17,25,39,47,48,49,50,51] written in collaboration with several colleagues – Eric Babson, Jesus De Loera, Komei Fukuda, Raymond Hemmecke, Frank Hwang, Vera Rosta, Uriel Rothblum, Leonard Schulman, Bernd Sturmfels, Rekha Thomas, and Robert Weismantel.

By developing and using a combination of geometric and algebraic tools, we are able to provide polynomial time algorithms for several broad classes of convex discrete optimization problems. We also discuss in detail some of the many applications of our theory, including to quadratic programming, matroids, bin packing and cutting-stock problems, vector partitioning and clustering, multiway transportation problems, and privacy and confidential statistical data disclosure. We hope that this monograph will, on the one hand, allow users of discrete optimization to enjoy the new powerful modelling and expressive capability of convex discrete optimization along with its broad polynomial time solvability, and on the other hand, stimulate more research on this new and fascinating class of problems, their complexity, and the study of various relaxations, bounds, and approximations for such problems.

Limitations

Convex discrete optimization is generally intractable even for small fixed d, since already for d = 1 it includes linear integer programming which is NP-hard. When d is a variable part of the input, even very simple special cases are NP-hard, such as the following problem, so-called positive semi-definite quadratic binary programming,

  max { (w₁x)² + ⋯ + (w_n x)² : x ∈ ℕⁿ , x_i ≤ 1 , i = 1, …, n } .

Therefore, throughout this monograph we will assume that d is fixed (but arbitrary).

As explained above, we also assume throughout that the convex functional c which constitutes part of the data for the convex discrete optimization problem is presented by a comparison oracle. Under such broad presentation, the problem is generally very hard. In particular, if the feasible set is S := { x ∈ ℕⁿ : Ax = b } and the underlying polyhedron P := { x ∈ R₊ⁿ : Ax = b } is unbounded, then the problem is inaccessible even in one variable with no equation constraints. Indeed, consider the following family of univariate convex integer programs with convex functions parameterized by 1 < u ≤ ∞,

  max { c_u(x) : x ∈ ℕ } ,    c_u(x) := −x if x < u ;  c_u(x) := x − 2u if x ≥ u .
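A two-line illustration of this family (the concrete query bound u = 100 is ours): c_u coincides with c_∞ : x ↦ −x at every point below u, which is what defeats any algorithm whose oracle queries stay below u, as explained next.

    # c_u agrees with c_inf(x) = -x wherever x < u, so oracle answers on any
    # finite set of queries below u are identical for the two problems.
    def c(x, u=float("inf")):
        return -x if x < u else x - 2 * u

    assert all(c(x) == c(x, u=100) for x in range(100))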


Consider any algorithm attempting to solve the problem and let u be the maximum value of x in all queries to the oracle of c. Then the algorithm cannot distinguish between the problem with c_u, whose objective function is unbounded, and the problem with c_∞, whose optimal objective value is 0. Thus, convex discrete optimization (with an oracle-presented functional) over an infinite set S ⊆ Zⁿ is quite hopeless. Therefore, an algorithm that solves the convex discrete optimization problem will either return an optimal solution, or assert that the problem is infeasible, or assert that the underlying polyhedron is unbounded. In fact, in most applications, such as in combinatorial optimization with S ⊆ {0,1}ⁿ or integer programming with S := { x ∈ Zⁿ : Ax = b, l ≤ x ≤ u } and l, u ∈ Zⁿ, the set S is finite and the problem of unboundedness does not arise.

Outline and Overview of Main Results and Applications

We now outline the structure of this monograph and provide a brief overview of what we consider to be our main results and main applications. The precise relevant definitions and statements of the theorems and corollaries mentioned here are provided in the relevant sections in the monograph body. As mentioned above, most of these results are adaptations or extensions of results from one of the papers [5,12,13,14,15,16,17,25,39,47,48,49,50,51]. The monograph gives many more applications and results that may turn out to be useful in future development of the theory of convex discrete optimization.

The rest of the monograph consists of five sections. While the results evolve from one section to the next, it is quite easy to read the sections independently of each other (while just browsing now and then for relevant definitions and results). Specifically, Sect. "Convex Combinatorial Optimization and More" uses definitions and the main result of Sect. "Reducing Convex to Linear Discrete Optimization"; Sect. "Convex Integer Programming" uses definitions and results from Sections "Reducing Convex to Linear Discrete Optimization" and "Linear N-fold Integer Programming"; and Sect. "Multiway Transportation Problems and Privacy in Statistical Databases" uses the main results of Sections "Linear N-fold Integer Programming" and "Convex Integer Programming".

In Sect. "Reducing Convex to Linear Discrete Optimization" we show how to reduce the convex discrete optimization problem over S ⊆ Zⁿ to strongly polynomially many linear discrete optimization counterparts over S, provided that the convex hull conv(S) satisfies a suitable geometric condition, as follows.

Theorem 1 For every fixed d, the convex discrete optimization problem over any finite S ⊆ Zⁿ presented by a linear discrete optimization oracle and endowed with a set covering all edge-directions of conv(S), can be solved in strongly polynomial time.

This result will be incorporated in the polynomial time algorithms for convex combinatorial optimization and convex integer programming to be developed in Sect. "Convex Combinatorial Optimization and More" and Sect. "Convex Integer Programming".

In Sect. "Convex Combinatorial Optimization and More" we discuss convex combinatorial optimization. The main result is that convex combinatorial optimization over a set S ⊆ {0,1}ⁿ presented by a membership oracle can be solved in strongly polynomial time provided it is endowed with a set covering all edge-directions of conv(S). In particular, the standard linear combinatorial optimization problem over S can be solved in strongly polynomial time as well.

Theorem 2 For every fixed d, the convex combinatorial optimization problem over any S ⊆ {0,1}ⁿ presented by a membership oracle and endowed with a set covering all edge-directions of the polytope conv(S), can be solved in strongly polynomial time.

An important application of Theorem 2 concerns convex matroid optimization.

Corollary 1 For every fixed d, convex combinatorial optimization over the family of bases of a matroid presented by membership oracle is strongly polynomial time solvable.

In Sect. "Linear N-fold Integer Programming" we develop the theory of linear n-fold integer programming. As a consequence of this theory we are able to solve a broad class of linear integer programming problems in variable dimension in polynomial time, in contrast with the general intractability of linear integer programming. The main theorem here may seem a bit technical at first glance, but is


really very natural and has many applications discussed in detail in Sect. "Linear N-fold Integer Programming", Sect. "Convex Integer Programming" and Sect. "Multiway Transportation Problems and Privacy in Statistical Databases". To state it we need a definition. Given an (r + s) × t matrix A, let A₁ be its r × t sub-matrix consisting of the first r rows and let A₂ be its s × t sub-matrix consisting of the last s rows. We refer to A explicitly as an (r + s) × t matrix, since the definition below depends also on r and s and not only on the entries of A. The n-fold matrix of an (r + s) × t matrix A is then defined to be the following (r + ns) × nt matrix,

  A^{(n)} := (1_n \otimes A_1) \oplus (I_n \otimes A_2) =
  \begin{pmatrix}
    A_1    & A_1    & A_1    & \cdots & A_1    \\
    A_2    & 0      & 0      & \cdots & 0      \\
    0      & A_2    & 0      & \cdots & 0      \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    0      & 0      & 0      & \cdots & A_2
  \end{pmatrix} .
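In code the construction is two Kronecker products; a minimal numpy sketch (the toy A₁, A₂ are ours):

    import numpy as np

    def nfold(A1, A2, n):
        # (1_n (x) A1) stacked over (I_n (x) A2): a band of n repeated A1
        # blocks on top of an n-fold block diagonal of A2 blocks.
        return np.vstack([np.kron(np.ones((1, n), dtype=int), A1),
                          np.kron(np.eye(n, dtype=int), A2)])

    A1 = np.array([[1, 1, 1]])          # r = 1, t = 3
    A2 = np.array([[1, 2, 3]])          # s = 1, t = 3
    print(nfold(A1, A2, 3))             # the (1 + 3) x 9 matrix A^(3)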

Given now any n ∈ ℕ, lower and upper bounds l, u ∈ Z_∞^{nt} with Z_∞ := Z ⊎ {±∞}, right-hand side b ∈ Z^{r+ns}, and linear functional wx with w ∈ Z^{nt}, the corresponding linear n-fold integer programming problem is the following program in variable dimension nt,

  max { wx : x ∈ Z^{nt} , A^{(n)} x = b , l ≤ x ≤ u } .

The main theorem of Sect. "Linear N-fold Integer Programming" asserts that such integer programs are polynomial time solvable.

Theorem 3 For every fixed (r + s) × t integer matrix A, the linear n-fold integer programming problem with any n, l, u, b, and w can be solved in polynomial time.

Theorem 3 has very important applications to high-dimensional transportation problems which are discussed in Sect. "Three-Way Line-Sum Transportation Problems" and in more detail in Sect. "Multiway Transportation Problems and Privacy in Statistical Databases". Another major application concerns bin packing problems, where items of several types are to be packed into bins so as to maximize packing utility subject to weight constraints. This includes as a special case the classical cutting-stock problem of [27]. These are


discussed in detail in Sect. "Packing Problems and Cutting-Stock".

Corollary 2 For every fixed number t of types and type weights v₁, …, v_t, the corresponding integer bin packing and cutting-stock problems are polynomial time solvable.

In Sect. "Convex Integer Programming" we discuss convex integer programming, where the feasible set S is presented as the set of integer points satisfying a given system of linear inequalities. In particular, we consider convex integer programming over n-fold systems for any fixed (but arbitrary) (r + s) × t matrix A, where, given n ∈ ℕ, vectors l, u ∈ Z_∞^{nt}, b ∈ Z^{r+ns} and w₁, …, w_d ∈ Z^{nt}, and convex functional c : R^d → R, the problem is

  max { c(w₁x, …, w_d x) : x ∈ Z^{nt} , A^{(n)} x = b , l ≤ x ≤ u } .

The main theorem of Sect. "Convex Integer Programming" is the following extension of Theorem 3, asserting that convex integer programming over n-fold systems is polynomial time solvable as well.

Theorem 4 For every fixed d and (r + s) × t integer matrix A, convex n-fold integer programming with any n, l, u, b, w₁, …, w_d, and c can be solved in polynomial time.

Theorem 4 broadly extends the class of objective functions that can be efficiently maximized over n-fold systems. Thus, all applications discussed in Sect. "Some Applications" automatically extend accordingly. These include convex high-dimensional transportation problems and convex bin packing and cutting-stock problems, which are discussed in detail in Sect. "Transportation Problems and Packing Problems" and Sect. "Multiway Transportation Problems and Privacy in Statistical Databases". Another important application of Theorem 4 concerns vector partitioning problems, which have applications in many areas including load balancing, circuit layout, ranking, cluster analysis, inventory, and reliability, see e.g. [7,9,25,39,50] and the references therein. The problem is to partition n items among p players so as to maximize social utility. With each item is associated a k-dimensional vector representing its utility under k criteria. The social utility of a partition is a convex function of the sums of the vectors of items that each


player receives. In the constrained version of the problem, there are also restrictions on the number of items each player can receive. We have the following consequence of Theorem 4; more details on this application are in Sect. "Vector Partitioning and Clustering".

Corollary 3 For every fixed number p of players and number k of criteria, the constrained and unconstrained vector partition problems with any item vectors, convex utility, and constraints on the number of items per player, are polynomial time solvable.

In the last Sect. "Multiway Transportation Problems and Privacy in Statistical Databases" we discuss multiway (high-dimensional) transportation problems and secure statistical data disclosure. Multiway transportation problems form a very important class of discrete optimization problems and have been used and studied extensively in the operations research and mathematical programming literature, as well as in the statistics literature in the context of secure statistical data disclosure and management by public agencies, see e.g. [4,6,11,18,19,42,43,53,60,62] and the references therein. The feasible points in a transportation problem are the multiway tables ("contingency tables" in statistics) such that the sums of entries over some of their lower dimensional sub-tables such as lines or planes ("margins" in statistics) are specified. We completely settle the algorithmic complexity of treating multiway tables and discuss the applications to transportation problems and secure statistical data disclosure, as follows.

In Sect. "The Universality Theorem" we show that "short" 3-way transportation problems, over r × c × 3 tables with variable number r of rows and variable number c of columns but fixed small number 3 of layers (hence "short"), are universal in that every integer programming problem is such a problem (see Sect. "The Universality Theorem" for the precise stronger statement and for more details).

Theorem 5 Every linear integer programming problem max { cy : y ∈ ℕⁿ , Ay = b } is polynomial time representable as a short 3-way line-sum transportation problem

  max { wx : x ∈ ℕ^{r×c×3} , Σ_i x_{i,j,k} = z_{j,k} , Σ_j x_{i,j,k} = v_{i,k} , Σ_k x_{i,j,k} = u_{i,j} } .
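The three families of line-sum constraints in Theorem 5 are just the sums of an r × c × 3 array along each of its three axes; a small numpy sketch (the toy table and names are ours):

    import numpy as np

    x = np.arange(2 * 4 * 3).reshape(2, 4, 3)   # a toy r=2, c=4, 3-layer table
    z = x.sum(axis=0)    # z[j, k] = sum_i x[i, j, k]   (c x 3 margins)
    v = x.sum(axis=1)    # v[i, k] = sum_j x[i, j, k]   (r x 3 margins)
    u = x.sum(axis=2)    # u[i, j] = sum_k x[i, j, k]   (r x c margins)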

In Sect. "The Complexity of the Multiway Transportation Problem" we discuss k-way transportation problems of any dimension k. We provide the first polynomial time algorithm for convex and linear "long" (k+1)-way transportation problems, over m₁ × ⋯ × m_k × n tables, with k and m₁, …, m_k fixed (but arbitrary), and variable number n of layers (hence "long"). This is best possible in view of Theorem 21. Our algorithm works for any hierarchical collection of margins: this captures common margin collections such as all line-sums, all plane-sums, and more generally all h-flat sums for any 0 ≤ h ≤ k (see Sect. "Tables and Margins" for more details). We point out that even for the very special case of linear integer transportation over 3 × 3 × n tables with specified line-sums, our polynomial time algorithm is the only one known. We prove the following statement.

Corollary 4 For every fixed d, k, m₁, …, m_k and family F of subsets of {1, …, k+1} specifying a hierarchical collection of margins, the convex (and in particular linear) long transportation problem over m₁ × ⋯ × m_k × n tables is polynomial time solvable.

In our last subsection, Sect. "Privacy and Entry-Uniqueness", we discuss an important application concerning privacy in statistical databases. It is a common practice in the disclosure of a multiway table containing sensitive data to release some table margins rather than the table itself. Once the margins are released, the security of any specific entry of the table is related to the set of possible values that can occur in that entry in any table having the same margins as those of the source table in the database. In particular, if this set consists of a unique value, that of the source table, then this entry can be exposed and security can be violated. We show that for multiway tables where one category is significantly richer than the others, that is, when each sample point can take many values in one category and only few values in the other categories, it is possible to check entry-uniqueness in polynomial time, allowing disclosing agencies to make learned decisions on secure disclosure.

Corollary 5 For every fixed k, m₁, …, m_k and family F of subsets of {1, …, k+1} specifying a hierarchical collection of margins to be disclosed, it can be decided in polynomial time whether any specified entry x_{i₁,…,i_{k+1}} is the same in all long m₁ × ⋯ × m_k × n tables with the disclosed margins, and hence at risk of exposure.


Terminology and Complexity

We use R for the reals, R₊ for the nonnegative reals, Z for the integers, and ℕ for the nonnegative integers. The sign of a real number r is denoted by sign(r) ∈ {0, +1, −1} and its absolute value is denoted by |r|. The ith standard unit vector in Rⁿ is denoted by 1_i. The support of x ∈ Rⁿ is the index set supp(x) := { i : x_i ≠ 0 } of nonzero entries of x. The indicator of a subset I ⊆ {1, …, n} is the vector 1_I := Σ_{i∈I} 1_i, so that supp(1_I) = I. When several vectors are indexed by subscripts, w₁, …, w_d ∈ Rⁿ, their entries are indicated by pairs of subscripts, w_i = (w_{i,1}, …, w_{i,n}). When vectors are indexed by superscripts, x¹, …, x^k ∈ Rⁿ, their entries are indicated by subscripts, x^i = (x^i_1, …, x^i_n). The integer lattice Zⁿ is naturally embedded in Rⁿ. The space Rⁿ is endowed with the standard inner product which, for w, x ∈ Rⁿ, is given by wx := Σ_{i=1}^n w_i x_i. Vectors w in Rⁿ will also be regarded as linear functionals on Rⁿ via the inner product wx. Thus, we refer to elements of Rⁿ as points, vectors, or linear functionals, as will be appropriate from the context. The convex hull of a set S ⊆ Rⁿ is denoted by conv(S) and the set of vertices of a polyhedron P ⊆ Rⁿ is denoted by vert(P). In linear discrete optimization over S ⊆ Zⁿ, the facets of conv(S) play an important role, see Chvátal [10] and the references therein for earlier work, and Grötschel, Lovász and Schrijver [31,45] for the later culmination in the equivalence of separation and linear optimization via the ellipsoid method of Yudin and Nemirovskii [63]. As will turn out in Sect. "Reducing Convex to Linear Discrete Optimization", in convex discrete optimization over S, the edges of conv(S) play an important role (most significantly in a way which is not related to the Hirsch conjecture discussed in [41]). We therefore use extensively convex polytopes, for which we follow the terminology of [32,65].

We often assume that the feasible set S ⊆ Zⁿ is finite. We then define its radius to be its l_∞ radius ρ(S) := max { ‖x‖_∞ : x ∈ S } where, as usual, ‖x‖_∞ := max_{i=1}^n |x_i|. In other words, ρ(S) is the smallest ρ ∈ ℕ such that S is contained in the cube [−ρ, ρ]ⁿ.

Our algorithms are applied to rational data only, and the time complexity is as in the standard Turing machine model, see e.g. [1,26,55]. The input typically consists of rational (usually integer) numbers, vectors, matrices, and finite sets of such objects.


The binary length of an integer number z ∈ Z is defined to be the number of bits in its binary representation, ⟨z⟩ := 1 + ⌈log₂(|z| + 1)⌉ (with the extra bit for the sign). The length of a rational number presented as a fraction r = p/q with p, q ∈ Z is ⟨r⟩ := ⟨p⟩ + ⟨q⟩. The length of an m × n matrix A (and in particular of a vector) is the sum ⟨A⟩ := Σ_{i,j} ⟨a_{i,j}⟩ of the lengths of its entries. Note that the length of A is no smaller than the number of entries, ⟨A⟩ ≥ mn. Therefore, when A is, say, part of an input to an algorithm, with m, n variable, the length ⟨A⟩ already incorporates mn, and so we will typically not account additionally for m, n directly. But sometimes, especially in results related to n-fold integer programming, we will also emphasize n as part of the input length. Similarly, the length of a finite set E of numbers, vectors or matrices is the sum of the lengths of its elements and hence, since ⟨E⟩ ≥ |E|, automatically accounts for its cardinality.

Some input numbers affect the running time of some algorithms through their unary presentation, resulting in so-called "pseudo polynomial" running time. The unary length of an integer number z ∈ Z is the number |z| + 1 of bits in its unary representation (again, an extra bit for the sign). The unary length of a rational number, vector, matrix, or finite set of such objects is defined again as the sum of the lengths of its numerical constituents, and is again no smaller than the number of such numerical constituents.

When studying convex and linear integer programming in Sect. "Linear N-fold Integer Programming" and Sect. "Convex Integer Programming" we sometimes have lower and upper bound vectors l, u with entries in Z_∞ := Z ⊎ {±∞}. Both binary and unary lengths of a ±∞ entry are constant, say 3 by encoding ±∞ := ±"00".

To make the input encoding precise, we introduce the following notation. In every algorithmic statement we describe explicitly the input encoding, by listing in square brackets all input objects affecting the running time. Unary encoded objects are listed directly whereas binary encoded objects are listed in terms of their length. For example, as is often the case, if the input of an algorithm consists of binary encoded vectors (linear functionals) w₁, …, w_d ∈ Zⁿ and a unary encoded integer ρ ∈ ℕ (bounding the radius ρ(S) of the feasible set) then we will indicate that the input is encoded as [ρ; ⟨w₁, …, w_d⟩].
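For instance, the binary length of an integer as just defined is a one-liner (a sketch; the sample values are easy to check by hand):

    import math

    def binary_length(z):
        # <z> = 1 + ceil(log2(|z| + 1)); the extra bit accounts for the sign.
        return 1 + math.ceil(math.log2(abs(z) + 1))

    assert [binary_length(z) for z in (0, 1, 7, 8)] == [1, 2, 4, 5]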


Some of our algorithms are strongly polynomial time in the sense of [59]. For this, part of the input is regarded as "special". An algorithm is then strongly polynomial time if it is polynomial time in the usual Turing sense with respect to all input, and in addition, the number of arithmetic operations (additions, subtractions, multiplications, divisions, and comparisons) it performs is polynomial in the special part of the input. To make this precise, we extend our input encoding notation above by splitting the square-bracketed expression indicating the input encoding into a "left" side and a "right" side, separated by a semicolon, where the entire input is described on the right and the special part of the input on the left. For example, Theorem 1, asserting that the algorithm underlying it is strongly polynomial with data encoded as [n, |E|; ⟨ρ(S), w₁, …, w_d, E⟩], where ρ(S) ∈ ℕ, w₁, …, w_d ∈ Zⁿ and E ⊆ Zⁿ, means that the running time is polynomial in the binary length of ρ(S), w₁, …, w_d, and E, and the number of arithmetic operations is polynomial in n and the cardinality |E|, which constitute the special part of the input.

Often, as in [31], part of the input is presented by oracles. Then the running time and the number of arithmetic operations count also the number of oracle queries. An oracle algorithm is polynomial time if its running time, including the number of oracle queries, and the manipulations of numbers, some of which are answers to oracle queries, is polynomial in the length of the input encoding. An oracle algorithm is strongly polynomial time (with specified input encoding as above), if it is polynomial time in the entire input (on the "right"), and in addition, the number of arithmetic operations it performs (including oracle queries) is polynomial in the special part of the input (on the "left").

Reducing Convex to Linear Discrete Optimization

In this section we show that when suitable auxiliary geometric information about the convex hull conv(S) of a finite set S ⊆ Zⁿ is available, the convex discrete optimization problem over S can be reduced to the solution of strongly polynomially many linear discrete optimization counterparts over S. This result will be incorporated into the polynomial time algorithms developed in Sect. "Convex Combinatorial Optimization and More" and Sect. "Convex Integer Programming" for convex combinatorial optimization and convex integer programming respectively. In Sect. "Edge-Directions and Zonotopes" we provide some preliminaries on edge-directions and zonotopes. In Sect. "Strongly Polynomial Reduction of Convex to Linear Discrete Optimization" we prove the reduction which is the main result of this section. In Sect. "Pseudo Polynomial Reduction when Edge-Directions are not Available" we prove a pseudo polynomial reduction for any finite set.

Edge-Directions and Zonotopes

We begin with some terminology and facts that play an important role in the sequel. A direction of an edge (1-dimensional face) e = [u, v] of a polytope P is any nonzero scalar multiple of u − v. A set of vectors E covers all edge-directions of P if it contains a direction of each edge of P. The normal cone of a polytope P ⊆ Rⁿ at its face F is the (relatively open) cone C_P^F of those linear functionals h ∈ Rⁿ which are maximized over P precisely at points of F. A polytope Z is a refinement of a polytope P if the normal cone of every vertex of Z is contained in the normal cone of some vertex of P. If Z refines P then, moreover, the closure of each normal cone of P is the union of closures of normal cones of Z. The zonotope generated by a set of vectors E = {e₁, …, e_m} in R^d is the following polytope, which is the projection by E of the cube [−1, 1]^m into R^d,

  Z := zone(E) := conv { Σ_{i=1}^m λ_i e_i : λ_i = ±1 } ⊆ R^d .
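The definition can be animated by brute force for small m (exponential in m and for illustration only; Lemma 2 below enumerates the vertices far more efficiently):

    import itertools
    import numpy as np

    def zonotope_points(E):
        # All signed sums sum(lam_i * e_i), lam_i = +-1; the vertices of
        # zone(E) are among these points.
        E = np.asarray(E)
        return sorted({tuple(np.array(lam) @ E)
                       for lam in itertools.product((-1, 1), repeat=len(E))})

    print(zonotope_points([[1, 0], [0, 1], [1, 1]]))
    # 8 sign vectors collapse to 7 distinct points, of which 6 are vertices,
    # matching the bound 2 * (C(2,0) + C(2,1)) = 6 of Lemma 2 for d = 2, m = 3.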

The following fact goes back to Minkowski, see [32].

Lemma 1 Let P be a polytope and let E be a finite set that covers all edge-directions of P. Then the zonotope Z := zone(E) generated by E is a refinement of P.

Proof Consider any vertex u of Z. Then u = Σ_{e∈E} λ_e e for suitable λ_e = ±1. Thus, the normal cone C_Z^u consists of those h satisfying h λ_e e > 0 for all e. Pick any ĥ ∈ C_Z^u and let v be a vertex of P at which ĥ is maximized over P. Consider any edge [v, w] of P. Then v − w = α_e e for some scalar α_e ≠ 0 and some e ∈ E, and 0 ≤ ĥ(v − w) = ĥ α_e e, implying α_e λ_e > 0.


It follows that every h ∈ C_Z^u satisfies h(v − w) > 0 for every edge of P containing v. Therefore h is maximized over P uniquely at v and hence is in the cone C_P^v of P at v. This shows C_Z^u ⊆ C_P^v. Since u was arbitrary, it follows that the normal cone of every vertex of Z is contained in the normal cone of some vertex of P. □

The next lemma provides bounds on the number of vertices of any zonotope and on the algorithmic complexity of constructing its vertices, each vertex along with a linear functional maximized over the zonotope uniquely at that vertex. The bound on the number of vertices has been rediscovered many times over the years. An early reference is [33], stated in the dual form of 2-partitions. A more general treatment is [64]. Recent extensions to p-partitions for any p are in [3,39], and to Minkowski sums of arbitrary polytopes are in [29]. Interestingly, already in [33], back in 1967, the question was raised about the algorithmic complexity of the problem; this is now settled in [20,21] (the latter reference correcting the former). We state the precise bounds on the number of vertices and arithmetic complexity, but will need later only that for any fixed d the bounds are polynomial in the number of generators. Therefore, below we only outline a proof that the bounds are polynomial. Complete details are in the above references.

Lemma 2 The number of vertices of any zonotope Z := zone(E) generated by a set E of m vectors in R^d is at most 2 \sum_{k=0}^{d-1} \binom{m-1}{k}. For every fixed d, there is a strongly polynomial time algorithm that, given E ⊆ Z^d, encoded as [m := |E|; ⟨E⟩], outputs every vertex v of Z := zone(E) along with a linear functional h_v ∈ Z^d maximized over Z uniquely at v, using O(m^{d−1}) arithmetic operations for d ≥ 3 and O(m^d) for d ≤ 2.

C

U U of E by E D H  H 0 H C , with H  :D fe 2 E : he < 0g, E 0 :D E \ H, and H C :D fe 2 E : he > 0g. The vertices of Z D zone(E) are in bijection with ordered 2-partitions of E induced by such hyperplanes U that avoid E. Indeed, if E D H  H C then the linear functional hv :D h defining H is maximized over Z P P uniquely at the vertex v :D fe : e 2 H C g  fe : e 2 H  g of Z. We now show how to enumerate all such 2-partitions  m  and hence vertices of Z. Let M be any of the d1 subsets of E of size d  1. Since E is generic, M is linearly independent and spans a unique linear hyperˆ D 0g be one plane lin(M). Let Hˆ D fx 2 Rd : hx of the two orientations of the hyperplane lin(M). Note that Hˆ 0 D M. Finally, let L be any of the 2d1 subsets of M. Since M is linearly independent, there is a g 2 Rd which linearly separates L from M n L, namely, satisfies gx < 0 for all x 2 L and gx > 0 for all x 2 M n L. Furthermore, there is a sufficiently small > 0 such that the oriented hyperplane H :D fx 2 Rd : hx D 0g defined by h :D hˆ C g avoids E and the 2-partition induced U U by H satisfies H  D Hˆ  L and H C D Hˆ C (M n L). P The corresponding vertex of Z is v :D fe : e 2 P H C g  fe : e 2 H  g and the corresponding linear functional which is maximized over Z uniquely at v is hv :D h D hˆ C g. We claim that any ordered 2-partition arises that way from some M, some orientation Hˆ of lin(M), and some L. Indeed, consider any oriented linear hyperplane H˜ avoiding E. It can be perturbed to a suitable oriented Hˆ that touches precisely d  1 points of E. Put ˆ coincides with one of the two oriM :D Hˆ 0 so that H entations of the hyperplane lin(M) spanned by M, and put L :D H˜  \ M. Let H be an oriented hyperplane obˆ and L by the above procedure. Then tained from M, H U the ordered 2-partition E D H  H C induced by H U coincides with the ordered 2-partition E D H˜  H˜ C ˜ induced by H. m many (d  1)-subsets M E, Since there are d1 ˆ of lin(M), and 2d1 subsets L two orientations H M, and d is fixed, the total number of 2-partitions and hence also the total of vertices of Z obey  number  d m the upper bound 2 d1 D O(m d1 ). Furthermore, for each choice of M, Hˆ and L, the linear functional hˆ ˆ as well as g; ; hv D h D hˆ C g, and the defining H, P P vertex v D fe : e 2 H C g  fe : e 2 H  g of Z at which hv is uniquely maximized over Z, can all be com-


computed using O(m) arithmetic operations. This shows the claimed bound O(m^d) on the arithmetic complexity. □

We conclude with a simple fact about edge-directions of projections of polytopes.

Lemma 3 If E covers all edge-directions of a polytope P, and Q := ω(P) is the image of P under a linear map ω : Rⁿ → R^d, then ω(E) covers all edge-directions of Q.

Proof Let f be a direction of an edge [x, y] of Q. Consider the face F := ω^{−1}([x, y]) of P. Let V be the set of vertices of F and let U = { u ∈ V : ω(u) = x }. Then for some u ∈ U and v ∈ V \ U, there must be an edge [u, v] of F, and hence of P. Then ω(v) ∈ (x, y], hence ω(v) = x + αf for some α ≠ 0. Therefore, with e := (1/α)(v − u), a direction of the edge [u, v] of P, we find that f = (1/α)(ω(v) − ω(u)) = ω(e) ∈ ω(E). □

Strongly Polynomial Reduction of Convex to Linear Discrete Optimization

A linear discrete optimization oracle for a set S ⊆ Zⁿ is one that, queried on w ∈ Zⁿ, either returns an optimal solution to the linear discrete optimization problem over S, that is, an x* ∈ S satisfying wx* = max { wx : x ∈ S }, or asserts that none exists, that is, either the problem is infeasible or the objective function is unbounded. We now show that a set E covering all edge-directions of the polytope conv(S) underlying a convex discrete optimization problem over a finite set S ⊆ Zⁿ allows to solve it by solving polynomially many linear discrete optimization counterparts over S. The following theorem extends and unifies the corresponding reductions in [49] and [12] for convex combinatorial optimization and convex integer programming respectively. Recall from Sect. "Terminology and Complexity" that the radius of a finite set S ⊆ Zⁿ is defined to be ρ(S) := max { |x_i| : x ∈ S, i = 1, …, n }.

Theorem 6 For every fixed d there is a strongly polynomial time algorithm that, given a finite set S ⊆ Zⁿ presented by a linear discrete optimization oracle, integer vectors w₁, …, w_d ∈ Zⁿ, a set E ⊆ Zⁿ covering all edge-directions of conv(S), and a convex functional c : R^d → R presented by a comparison oracle, encoded as [n, |E|; ⟨ρ(S), w₁, …, w_d, E⟩], solves the convex discrete optimization problem

  max { c(w₁x, …, w_d x) : x ∈ S } .

Proof First, query the linear discrete optimization oracle presenting S on the trivial linear functional w = 0. If the oracle asserts that there is no optimal solution then S is empty, so terminate the algorithm asserting that no optimal solution exists to the convex discrete optimization problem either. So assume the problem is feasible. Let P := conv(S) ⊆ Rⁿ and Q := { (w₁x, …, w_d x) : x ∈ P } ⊆ R^d. Then Q is a projection of P, and hence by Lemma 3 the projection D := { (w₁e, …, w_d e) : e ∈ E } of the set E is a set covering all edge-directions of Q. Let Z := zone(D) ⊆ R^d be the zonotope generated by D. Since d is fixed, by Lemma 2 we can produce in strongly polynomial time all vertices of Z, every vertex v along with a linear functional h_v ∈ Z^d maximized over Z uniquely at v. For each of these polynomially many h_v, repeat the following procedure. Define a vector g_v ∈ Zⁿ by g_{v,j} := Σ_{i=1}^d w_{i,j} h_{v,i} for j = 1, …, n. Now query the linear discrete optimization oracle presenting S on the linear functional w := g_v ∈ Zⁿ. Let x_v ∈ S be the optimal solution obtained from the oracle, and let z_v := (w₁x_v, …, w_d x_v) ∈ Q be its projection. Since P = conv(S), we have that x_v is also a maximizer of g_v over P. Since for every x ∈ P and its projection z := (w₁x, …, w_d x) ∈ Q we have h_v z = g_v x, we conclude that z_v is a maximizer of h_v over Q. Now we claim that each vertex u of Q equals some z_v. Indeed, since Z is a refinement of Q by Lemma 1, it follows that there is some vertex v of Z such that h_v is maximized over Q uniquely at u, and therefore u = z_v. Since c(w₁x, …, w_d x) is convex on Rⁿ and c is convex on R^d, we find that

  max_{x∈S} c(w₁x, …, w_d x) = max_{x∈P} c(w₁x, …, w_d x) = max_{z∈Q} c(z) = max { c(u) : u vertex of Q } = max { c(z_v) : v vertex of Z } .

Using the comparison oracle of c, find a vertex v of Z attaining maximum value c(z_v), and output x_v ∈ S, an optimal solution to the convex discrete optimization problem. □
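Schematically, the algorithm of the proof reads as follows. This sketch assumes a routine zonotope_vertices implementing Lemma 2 and caller-supplied oracles; all names are ours, not fixed by the text.

    import numpy as np

    def convex_maximize(linear_oracle, c, W, E, zonotope_vertices):
        # W: the d x n matrix with rows w_1, ..., w_d; E: edge-directions of conv(S).
        D = [W @ e for e in E]                  # edge-directions of the projection Q
        best = None
        for v, h in zonotope_vertices(D):       # vertices of zone(D) with functionals
            g = h @ W                           # g_j = sum_i w_{i,j} h_{v,i}
            x = linear_oracle(g)                # maximize g over S (None if none exists)
            if x is None:
                return None
            z = W @ x                           # the projection z_v of x_v
            if best is None or c(z) > c(best[1]):
                best = (x, z)
        return best[0]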

Pseudo Polynomial Reduction when Edge-Directions Are not Available

Theorem 6 reduces convex discrete optimization to polynomially many linear discrete optimization counterparts when a set covering all edge-directions of the underlying polytope is available. However, often such a set is not available (see e.g. [8] for the important case of bipartite matching). We now show how to reduce convex discrete optimization to many linear discrete optimization counterparts when a set covering all edge-directions is not offhand available. In the absence of such a set, the problem is much harder, and the algorithm below is polynomially bounded only in the unary length of the radius ρ(S) and of the linear functionals w₁, …, w_d, rather than in their binary length ⟨ρ(S), w₁, …, w_d⟩ as in the algorithm of Theorem 6. Moreover, an upper bound ρ ≥ ρ(S) on the radius of S is required to be given explicitly in advance as part of the input.

Theorem 7 For every fixed d there is a polynomial time algorithm that, given a finite set S ⊆ Zⁿ presented by a linear discrete optimization oracle, an integer ρ ≥ ρ(S), vectors w₁, …, w_d ∈ Zⁿ, and a convex functional c : R^d → R presented by a comparison oracle, encoded as [ρ, w₁, …, w_d], solves the convex discrete optimization problem

  max { c(w₁x, …, w_d x) : x ∈ S } .

Proof Let P := conv(S) ⊆ Rⁿ, let T := { (w₁x, …, w_d x) : x ∈ S } be the projection of S by w₁, …, w_d, and let Q := conv(T) ⊆ R^d be the corresponding projection of P. Let r := nρ max_{i=1}^d ‖w_i‖_∞ and let G := {−r, …, −1, 0, 1, …, r}^d. Then T ⊆ G and the number (2r + 1)^d of points of G is polynomially bounded in the input as encoded. Let D := { u − v : u, v ∈ G, u ≠ v } be the set of differences of pairs of distinct points of G. It covers all edge-directions of Q since vert(Q) ⊆ T ⊆ G. Moreover, the number of points of D is less than (2r + 1)^{2d} and hence polynomial in the input. Now invoke the algorithm of Theorem 6: while the algorithm requires a set E covering all edge-directions of P, it needs E only to compute a set D covering all edge-directions of the projection Q (see the proof of Theorem 6), which here is computed directly. □

Convex Combinatorial Optimization and More

In this section we discuss convex combinatorial optimization. The main result is that convex combinatorial optimization over a set S ⊆ {0,1}ⁿ presented by a membership oracle can be solved in strongly polynomial time provided it is endowed with a set covering all edge-directions of conv(S). In particular, the standard linear combinatorial optimization problem over S can be solved in strongly polynomial time as well. In Sect. "From Membership to Linear Optimization" we provide some preparatory statements involving various oracle presentations of the feasible set S. In Sect. "Linear and Convex Combinatorial Optimization in Strongly Polynomial Time" we combine these preparatory statements with Theorem 6 and prove the main result of this section. An extension to arbitrary finite sets S ⊆ Zⁿ endowed with edge-directions is established in Sect. "Linear and Convex Discrete Optimization over any Set in Pseudo Polynomial Time". We conclude with some applications in Sect. "Some Applications".

As noted in the introduction, when S is contained in {0,1}ⁿ we refer to discrete optimization over S also as combinatorial optimization over S, to emphasize that S typically represents a family F ⊆ 2^N of subsets of a ground set N := {1, …, n} possessing some combinatorial property of interest (for instance, the family of bases of a matroid over N, see Sect. "Matroids and Maximum Norm Spanning Trees"). The convex combinatorial optimization problem then also has the following interpretation (taken in [47,49]). We are given a weighting ω : N → Z^d of elements of the ground set by d-dimensional integer vectors. We interpret the weight vector ω(j) ∈ Z^d of element j as representing its value under d criteria (e.g., if N is the set of edges in a network then such criteria may include profit, reliability, flow velocity, etc.). The weight of a subset F ⊆ N is the sum ω(F) := Σ_{j∈F} ω(j) of the weights of its elements, representing the total value of F under the d criteria. Now, given a convex functional c : R^d → R, the objective function value of F ⊆ N is the "convex balancing" c(ω(F)) of the values of the weight vector of F. The convex combinatorial optimization problem is to find a family member F ∈ F maximizing c(ω(F)). The usual linear combinatorial optimization problem over F is the special case of d = 1 and c the identity on


R. To cast a problem of that form in our usual setup, just let S := { 1_F : F ∈ F } ⊆ {0,1}ⁿ be the set of indicators of members of F and define weight vectors w₁, …, w_d ∈ Zⁿ by w_{i,j} := ω(j)_i for i = 1, …, d and j = 1, …, n.

From Membership to Linear Optimization

A membership oracle for a set S ⊆ Zⁿ is one that, queried on x ∈ Zⁿ, asserts whether or not x ∈ S. An augmentation oracle for S is one that, queried on x ∈ S and w ∈ Zⁿ, either returns an x̂ ∈ S with wx̂ > wx, i.e., a better point of S, or asserts that none exists, i.e., x is optimal for the linear discrete optimization problem over S.

A membership oracle presentation of S is very broad and available in all reasonable applications, but reveals little information on S, making it hard to use. However, as we now show, the edge-directions of conv(S) allow to convert membership to augmentation.

Lemma 4 There is a strongly polynomial time algorithm that, given a set S ⊆ {0,1}ⁿ presented by a membership oracle, x ∈ S, w ∈ Zⁿ, and a set E ⊆ Zⁿ covering all edge-directions of the polytope conv(S), encoded as [n, |E|; ⟨x, w, E⟩], either returns a better point x̂ ∈ S, that is, one satisfying wx̂ > wx, or asserts that none exists.

Proof Each edge of P := conv(S) is the difference of two {0,1}-vectors. Therefore, each edge-direction of P is, up to scaling, a {−1, 0, 1}-vector. Thus, scaling e := e/‖e‖_∞ and e := −e if necessary, we may and will assume that e ∈ {−1, 0, 1}ⁿ and we ≥ 0 for all e ∈ E. Now, using the membership oracle, check if there is an e ∈ E such that x + e ∈ S and we > 0. If there is such an e then output x̂ := x + e, which is a better point, whereas if there is no such e then terminate asserting that no better point exists.

Clearly, if the algorithm outputs an x̂ then it is indeed a better point. Conversely, suppose x is not a maximizer of w over S. Since S ⊆ {0,1}ⁿ, the point x is a vertex of P. Since x is not a maximizer of w, there is an edge [x, x̂] of P with x̂ a vertex satisfying wx̂ > wx. But then e := x̂ − x is the one {−1, 0, 1} edge-direction of [x, x̂] with we ≥ 0, and hence e ∈ E. Thus, the algorithm will find and output x̂ = x + e as it should. □
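The proof's procedure is a direct loop; a sketch (the oracle interface, the assumed pre-normalization of E, and the toy example are all ours):

    def augment(member, x, w, E):
        # member: membership oracle for S in {0,1}^n.  E is assumed to be
        # normalized as in the proof: each e in {-1,0,1}^n with w.e >= 0.
        for e in E:
            we = sum(wj * ej for wj, ej in zip(w, e))
            if we > 0:
                y = tuple(xj + ej for xj, ej in zip(x, e))
                if member(y):
                    return y        # a better point: w.y = w.x + w.e > w.x
        return None                 # x maximizes w over S

    # Toy usage: S = indicators of single elements; edge-directions 1_i - 1_j.
    S = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
    E = [tuple(a - b for a, b in zip(p, q)) for p in S for q in S if p != q]
    w = (1, 2, 3)
    E = [e if sum(a * b for a, b in zip(w, e)) >= 0
         else tuple(-v for v in e) for e in E]
    print(augment(S.__contains__, (1, 0, 0), w, E))   # some better point of S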

An augmentation oracle presentation of a finite S makes it possible to solve the linear discrete optimization problem $\max\{wx : x \in S\}$ over S by starting from any feasible $x \in S$ and repeatedly augmenting it until an optimal solution $x^* \in S$ is reached. The next lemma bounds the running time needed to reach optimality using this procedure. While the running time is polynomial in the binary length of the linear functional w and the initial point x, it is more sensitive to the radius $\rho(S)$ of the feasible set S, and is polynomial only in its unary length. The lemma is an adaptation of a result of [30,57] (stated therein for $\{0,1\}$-sets), which makes use of bit-scaling ideas going back to [23].

Lemma 5 There is a polynomial time algorithm that, given a finite set $S \subset \mathbb{Z}^n$ presented by an augmentation oracle, $x \in S$, and $w \in \mathbb{Z}^n$, encoded as $[\rho(S); \langle x, w \rangle]$, provides an optimal solution $x^* \in S$ to the linear discrete optimization problem $\max\{wz : z \in S\}$.

Proof Let $k := \max_{j=1}^n \lceil \log_2(|w_j| + 1) \rceil$ and note that $k \leq \langle w \rangle$. For $i = 0,\dots,k$ define a linear functional $u_i = (u_{i,1},\dots,u_{i,n}) \in \mathbb{Z}^n$ by $u_{i,j} := \mathrm{sign}(w_j) \lfloor 2^{i-k} |w_j| \rfloor$ for $j = 1,\dots,n$. Then $u_0 = 0$, $u_k = w$, and $u_i - 2u_{i-1} \in \{-1,0,1\}^n$ for all $i = 1,\dots,k$.

We now describe how to construct a sequence of points $y_0, y_1,\dots,y_k \in S$ such that $y_i$ is an optimal solution to $\max\{u_i y : y \in S\}$ for all i. First note that all points of S are optimal for $u_0 = 0$ and hence we can take $y_0 := x$ to be the point of S given as part of the input. We now explain how to determine $y_i$ from $y_{i-1}$ for $i = 1,\dots,k$. Suppose $y_{i-1}$ has been determined. Set $\tilde{y} := y_{i-1}$. Query the augmentation oracle on $\tilde{y} \in S$ and $u_i$; if the oracle returns a better point $\hat{y}$ then set $\tilde{y} := \hat{y}$ and repeat, whereas if it asserts that there is no better point then the optimal solution for $u_i$ is read off to be $y_i := \tilde{y}$.

We now bound the number of calls to the oracle. Each time the oracle is queried on $\tilde{y}$ and $u_i$ and returns a better point $\hat{y}$, the improvement is by at least one, i. e. $u_i(\hat{y} - \tilde{y}) \geq 1$; this is so because $u_i$, $\tilde{y}$ and $\hat{y}$ are integer. Thus, the number of necessary augmentations from $y_{i-1}$ to $y_i$ is at most the total improvement, which we claim satisfies
$$u_i(y_i - y_{i-1}) = (u_i - 2u_{i-1})(y_i - y_{i-1}) + 2u_{i-1}(y_i - y_{i-1}) \leq 2\rho n + 0 = 2\rho n\,,$$
where $\rho := \rho(S)$. Indeed, $u_i - 2u_{i-1} \in \{-1,0,1\}^n$ and $y_i, y_{i-1} \in S \subseteq [-\rho,\rho]^n$ imply $(u_i - 2u_{i-1})(y_i - y_{i-1}) \leq 2\rho n$; and $y_{i-1}$ optimal for $u_{i-1}$ gives $u_{i-1}(y_i - y_{i-1}) \leq 0$.

Thus, after a total number of at most $2\rho nk$ calls to the oracle we obtain $y_k$ which is optimal for $u_k$. Since $w = u_k$ we can output $x^* := y_k$ as the desired optimal solution to the linear discrete optimization problem. Clearly the number $2\rho nk$ of calls to the oracle, as well as the number of arithmetic operations and binary length of numbers occurring during the algorithm, are polynomial in $\rho(S), \langle x, w \rangle$. This completes the proof. $\square$
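A compact sketch of this bit-scaling scheme, with the augmentation oracle passed in as a callback (function names are ours, not the source's):

```python
import math

def optimize(x, w, augment):
    """Linear optimization over finite S via an augmentation oracle (Lemma 5).

    x       -- initial feasible point of S
    w       -- integer weight vector
    augment -- oracle: augment(y, u) returns a point of S with larger
               u-value than y, or None if y maximizes u over S
    """
    k = max(math.ceil(math.log2(abs(wj) + 1)) for wj in w) if any(w) else 0
    y = x
    for i in range(1, k + 1):
        # Truncation u_i with u_{i,j} = sign(w_j) * floor(2^{i-k} |w_j|).
        u = tuple(int(math.copysign(abs(wj) >> (k - i), wj)) for wj in w)
        better = augment(y, u)
        while better is not None:      # augment to optimality for u_i
            y = better
            better = augment(y, u)
    return y                           # optimal for u_k = w
```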
We conclude this preparatory subsection by recording the following result of [24], which incorporates the heavy simultaneous Diophantine approximation of [44].

Proposition 1 There is a strongly polynomial time algorithm that, given $w \in \mathbb{Z}^n$, encoded as $[n; \langle w \rangle]$, produces $\hat{w} \in \mathbb{Z}^n$, whose binary length $\langle \hat{w} \rangle$ is polynomially bounded in n and independent of w, and with $\mathrm{sign}(\hat{w}z) = \mathrm{sign}(wz)$ for every $z \in \{-1,0,1\}^n$.

Linear and Convex Combinatorial Optimization in Strongly Polynomial Time

Combining the preparatory statements of Sect. "From Membership to Linear Optimization" with Theorem 6, we can now solve convex combinatorial optimization over a set $S \subseteq \{0,1\}^n$ presented by a membership oracle and endowed with a set covering all edge-directions of $\mathrm{conv}(S)$ in strongly polynomial time. We start with the special case of linear combinatorial optimization.

Theorem 8 There is a strongly polynomial time algorithm that, given a set $S \subseteq \{0,1\}^n$ presented by a membership oracle, $x \in S$, $w \in \mathbb{Z}^n$, and a set $E \subset \mathbb{Z}^n$ covering all edge-directions of the polytope $\mathrm{conv}(S)$, encoded as $[n, |E|; \langle x, w, E \rangle]$, provides an optimal solution $x^* \in S$ to the linear combinatorial optimization problem $\max\{wz : z \in S\}$.

Proof First, an augmentation oracle for S can be simulated using the membership oracle, in strongly polynomial time, by applying the algorithm of Lemma 4. Next, using the simulated augmentation oracle for S, we can do linear optimization over S in strongly polynomial time as follows. First, apply to w the algorithm of Proposition 1 and obtain $\hat{w} \in \mathbb{Z}^n$ whose binary length $\langle \hat{w} \rangle$ is polynomially bounded in n, which satisfies $\mathrm{sign}(\hat{w}z) = \mathrm{sign}(wz)$ for every $z \in \{-1,0,1\}^n$. Since $S \subseteq \{0,1\}^n$, it is finite and has radius $\rho(S) = 1$. Now apply the algorithm of Lemma 5 to S, x and $\hat{w}$, and obtain a maximizer $x^*$ of $\hat{w}$ over S. For every $y \in \{0,1\}^n$ we then have $x^* - y \in \{-1,0,1\}^n$ and hence $\mathrm{sign}(w(x^* - y)) = \mathrm{sign}(\hat{w}(x^* - y))$. So $x^*$ is also a maximizer of w over S and hence an optimal solution to the given linear combinatorial optimization problem. Now, $\langle \hat{w} \rangle$ is polynomial in n, and $x \in \{0,1\}^n$ and $\rho(S) = 1$, hence $\langle x \rangle$ is linear in n. Thus, the entire length of the input $[\rho(S); \langle x, \hat{w} \rangle]$ to the polynomial-time algorithm of Lemma 5 is polynomial in n, and so its running time is in fact strongly polynomial on that input. $\square$
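The proof of Theorem 8 is a composition of the previous three ingredients; schematically (reusing the hypothetical augment and optimize sketches above, with a trivial stand-in for Proposition 1):

```python
def frank_tardos_reduce(w):
    # Stand-in for Proposition 1: a genuine implementation would return an
    # equivalent weight vector of binary length poly(n) via simultaneous
    # Diophantine approximation [24,44]; for small inputs, w itself works.
    return w

def linear_combinatorial_opt(x, w, E, in_S):
    """Linear optimization over S in {0,1}^n, composed as in Theorem 8."""
    w_hat = frank_tardos_reduce(w)                  # Proposition 1
    oracle = lambda y, u: augment(y, u, E, in_S)    # Lemma 4 simulation
    return optimize(x, w_hat, oracle)               # Lemma 5 bit-scaling
```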
Combining Theorems 6 and 8 we recover at once the following result of [49].

Theorem 9 For every fixed d there is a strongly polynomial time algorithm that, given a set $S \subseteq \{0,1\}^n$ presented by a membership oracle, $x \in S$, vectors $w_1,\dots,w_d \in \mathbb{Z}^n$, a set $E \subset \mathbb{Z}^n$ covering all edge-directions of the polytope $\mathrm{conv}(S)$, and a convex functional $c : \mathbb{R}^d \rightarrow \mathbb{R}$ presented by a comparison oracle, encoded as $[n, |E|; \langle x, w_1,\dots,w_d, E \rangle]$, provides an optimal solution $x^* \in S$ to the convex combinatorial optimization problem
$$\max\,\{c(w_1 z,\dots,w_d z) : z \in S\}\,.$$
Proof Since S is nonempty, a linear discrete optimization oracle for S can be simulated in strongly polynomial time by the algorithm of Theorem 8. Using this simulated oracle, we can apply the algorithm of Theorem 6 and solve the given convex combinatorial optimization problem in strongly polynomial time. $\square$

Linear and Convex Discrete Optimization over any Set in Pseudo Polynomial Time

In Sect. "Linear and Convex Combinatorial Optimization in Strongly Polynomial Time" above we developed strongly polynomial time algorithms for linear and convex discrete optimization over $\{0,1\}$-sets. We now provide extensions of these algorithms to arbitrary finite sets $S \subset \mathbb{Z}^n$. As can be expected, the algorithms become slower.

We start by recording the following fundamental result of Khachiyan [40] asserting that linear programming is polynomial time solvable via the ellipsoid method [63].
This result will be used below as well as several more times later in this article.

Proposition 2 There is a polynomial time algorithm that, given $A \in \mathbb{Z}^{m \times n}$, $b \in \mathbb{Z}^m$, and $w \in \mathbb{Z}^n$, encoded as $[\langle A, b, w \rangle]$, either asserts that $P := \{x \in \mathbb{R}^n : Ax \leq b\}$ is empty, or asserts that the linear functional $wx$ is unbounded over P, or provides a vertex $v \in \mathrm{vert}(P)$ which is an optimal solution to the linear program $\max\{wx : x \in P\}$.

The following analog of Lemma 4 shows how to convert membership to augmentation in polynomial time, albeit no longer in strongly polynomial time. Here, both the given initial point x and the returned better point $\hat{x}$, if any, are vertices of $\mathrm{conv}(S)$.

Lemma 6 There is a polynomial time algorithm that, given a finite set $S \subset \mathbb{Z}^n$ presented by a membership oracle, a vertex x of the polytope $\mathrm{conv}(S)$, $w \in \mathbb{Z}^n$, and a set $E \subset \mathbb{Z}^n$ covering all edge-directions of $\mathrm{conv}(S)$, encoded as $[\rho(S); \langle x, w, E \rangle]$, either returns a better vertex $\hat{x}$ of $\mathrm{conv}(S)$, that is, one satisfying $w\hat{x} > wx$, or asserts that none exists.

Proof Dividing each vector $e \in E$ by the greatest common divisor of its entries and setting $e := -e$ if necessary, we can and will assume that each e is primitive, that is, its entries are relatively prime integers, and $we \geq 0$. Using the membership oracle, construct the subset $F \subseteq E$ of those $e \in E$ for which $x + re \in S$ for some $r \in \{1,\dots,2\rho(S)\}$. Let $G \subseteq F$ be the subset of those $f \in F$ for which $wf > 0$. If G is empty then terminate asserting that there is no better vertex. Otherwise, consider the convex cone $\mathrm{cone}(F)$ generated by F. It is clear that x is incident on an edge of $\mathrm{conv}(S)$ in direction f if and only if f is an extreme ray of $\mathrm{cone}(F)$. Moreover, since $G = \{f \in F : wf > 0\}$ is nonempty, there must be an extreme ray of $\mathrm{cone}(F)$ which lies in G. Now $f \in F$ is an extreme ray of $\mathrm{cone}(F)$ if and only if there do not exist nonnegative $\lambda_e$, $e \in F \setminus \{f\}$, such that $f = \sum_{e \neq f} \lambda_e e$; this can be checked in polynomial time using linear programming. Applying this procedure to each $f \in G$, identify an extreme ray $g \in G$. Now, using the membership oracle, determine the largest $r \in \{1,\dots,2\rho(S)\}$ for which $x + rg \in S$. Output $\hat{x} := x + rg$, which is a better vertex of $\mathrm{conv}(S)$. $\square$
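The extreme-ray test in the proof of Lemma 6 is a plain linear-programming feasibility question; a sketch using scipy (an illustrative choice of ours, not part of the source) follows.

```python
import numpy as np
from scipy.optimize import linprog

def is_extreme_ray(f, F):
    """Test whether f is an extreme ray of cone(F), as in the proof of Lemma 6.

    f is extreme iff f is NOT a nonnegative combination of the other
    vectors of F, i.e. iff {lambda >= 0 : sum_e lambda_e * e = f} is empty.
    """
    others = [e for e in F if not np.array_equal(e, f)]
    if not others:
        return True
    res = linprog(c=np.zeros(len(others)),          # pure feasibility check
                  A_eq=np.column_stack(others),     # columns are the generators
                  b_eq=np.asarray(f, dtype=float),
                  bounds=[(0, None)] * len(others))
    return not res.success                          # infeasible => extreme
```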
We now prove the extensions of Theorems 8 and 9 to arbitrary, not necessarily $\{0,1\}$-valued, finite sets. While the running time remains polynomial in the binary length of the weights $w_1,\dots,w_d$ and the set of edge-directions E, it is more sensitive to the radius $\rho(S)$ of the feasible set S, and is polynomial only in its unary length. Here, the initial feasible point and the optimal solution output by the algorithms are vertices of $\mathrm{conv}(S)$. Again, we start with the special case of linear combinatorial optimization.

Theorem 10 There is a polynomial time algorithm that, given a finite $S \subset \mathbb{Z}^n$ presented by a membership oracle, a vertex x of the polytope $\mathrm{conv}(S)$, $w \in \mathbb{Z}^n$, and a set $E \subset \mathbb{Z}^n$ covering all edge-directions of $\mathrm{conv}(S)$, encoded as $[\rho(S); \langle x, w, E \rangle]$, provides an optimal solution $x^* \in S$ to the linear discrete optimization problem $\max\{wz : z \in S\}$.

Proof Apply the algorithm of Lemma 5 to the given data. Consider any query $x' \in S$, $w' \in \mathbb{Z}^n$ made by that algorithm to an augmentation oracle for S. To answer it, apply the algorithm of Lemma 6 to $x'$ and $w'$. Since the first query made by the algorithm of Lemma 5 is on the given input vertex $x' := x$, and any consequent query is on a point $x' := \hat{x}$ which was the reply of the augmentation oracle to the previous query (see the proof of Lemma 5), we see that the algorithm of Lemma 6 will always be asked on a vertex of $\mathrm{conv}(S)$ and reply with another. Thus, the algorithm of Lemma 6 can answer all augmentation queries and enables the polynomial time solution of the given problem. $\square$

Theorem 11 For every fixed d there is a polynomial time algorithm that, given a finite set $S \subset \mathbb{Z}^n$ presented by a membership oracle, a vertex x of $\mathrm{conv}(S)$, vectors $w_1,\dots,w_d \in \mathbb{Z}^n$, a set $E \subset \mathbb{Z}^n$ covering all edge-directions of the polytope $\mathrm{conv}(S)$, and a convex functional $c : \mathbb{R}^d \rightarrow \mathbb{R}$ presented by a comparison oracle, encoded as $[\rho(S); \langle x, w_1,\dots,w_d, E \rangle]$, provides an optimal solution $x^* \in S$ to the convex discrete optimization problem
$$\max\,\{c(w_1 z,\dots,w_d z) : z \in S\}\,.$$
Proof Since S is nonempty, a linear discrete optimization oracle for S can be simulated in polynomial time by the algorithm of Theorem 10. Using this simulated oracle, we can apply the algorithm of Theorem 6 and solve the given problem in polynomial time. $\square$
Some Applications

Positive Semidefinite Quadratic Binary Programming

The quadratic binary programming problem is the following: given an $n \times n$ matrix M, find a vector $x \in \{0,1\}^n$ maximizing the quadratic form $x^T M x$ induced by M. We consider here the instance where M is positive semidefinite, in which case it can be assumed to be presented as $M = W^T W$ with W a given $d \times n$ matrix. Already this restricted version is very broad: if the rank d of W and M is variable then, as mentioned in the introduction, the problem is NP-hard. We now show that, for fixed d, Theorem 9 implies at once that the problem is strongly polynomial time solvable (see also [2]).

Corollary 6 For every fixed d there is a strongly polynomial time algorithm that, given $W \in \mathbb{Z}^{d \times n}$, encoded as $[n; \langle W \rangle]$, finds $x^* \in \{0,1\}^n$ maximizing the form $x^T W^T W x$.

Proof Let $S := \{0,1\}^n$ and let $E := \{\mathbf{1}_1,\dots,\mathbf{1}_n\}$ be the set of unit vectors in $\mathbb{R}^n$. Then $P := \mathrm{conv}(S)$ is just the n-cube $[0,1]^n$ and hence E covers all edge-directions of P. A membership oracle for S is easily and efficiently realizable and $x := 0 \in S$ is an initial point. Also, $|E|$ and $\langle E \rangle$ are polynomial in n, and E is easily and efficiently computable. Now, for $i = 1,\dots,d$ define $w_i \in \mathbb{Z}^n$ to be the ith row of the matrix W, that is, $w_{i,j} := W_{i,j}$ for all i, j. Finally, let $c : \mathbb{R}^d \rightarrow \mathbb{R}$ be the squared $l_2$ norm given by $c(y) := \|y\|_2^2 := \sum_{i=1}^d y_i^2$, and note that the comparison of $c(y)$ and $c(z)$ can be done for $y, z \in \mathbb{Z}^d$ in time polynomial in $\langle y, z \rangle$ using a constant number of arithmetic operations, providing a strongly polynomial time realization of a comparison oracle for c. This translates the given quadratic programming problem into a convex combinatorial optimization problem over S, which can be solved in strongly polynomial time by applying the algorithm of Theorem 9 to S, $x = 0$, $w_1,\dots,w_d$, E, and c. $\square$
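As an illustration of the reduction in Corollary 6, the following sketch sets up the data of the translation (S the cube via a trivial membership oracle, E the unit vectors, c the squared $l_2$ norm) and then, for a small instance, simply brute-forces the resulting convex combinatorial problem; all names are ours.

```python
import itertools
import numpy as np

def qp_via_convex_combinatorial(W):
    """Set up max x^T W^T W x over {0,1}^n as in Corollary 6 and solve it
    by brute force (Theorem 9 gives the strongly polynomial route)."""
    d, n = W.shape
    in_S = lambda y: all(v in (0, 1) for v in y)   # membership oracle for the cube
    # E covers all edge-directions of conv(S) = [0,1]^n and would be the
    # edge-direction set handed to the algorithm of Theorem 9.
    E = [tuple(int(i == j) for j in range(n)) for i in range(n)]
    c = lambda y: float(np.dot(y, y))              # c(Wx) = x^T W^T W x
    return max(itertools.product((0, 1), repeat=n),
               key=lambda x: c(W @ np.array(x)))

W = np.array([[1, -2, 3], [0, 1, 1]])
x = np.array(qp_via_convex_combinatorial(W))
print(x, x @ W.T @ W @ x)   # maximizer and its objective value
```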
Matroids and Maximum Norm Spanning Trees

Optimization problems over matroids form a fundamental class of combinatorial optimization problems. Here we discuss matroid bases, but everything works for independent sets as well. Recall that a family $\mathcal{B}$ of subsets of $\{1,\dots,n\}$ is the family of bases of a matroid if all members of $\mathcal{B}$ have the same cardinality, called the rank of the matroid, and for every $B, B' \in \mathcal{B}$ and $i \in B \setminus B'$ there is a $j \in B'$ such that $B \setminus \{i\} \cup \{j\} \in \mathcal{B}$. Useful models include the graphic matroid of a graph G with edge set $\{1,\dots,n\}$ and $\mathcal{B}$ the family of spanning forests of G, and the linear matroid of an $m \times n$ matrix A with $\mathcal{B}$ the family of sets of indices of maximal linearly independent subsets of columns of A.

It is well known that linear combinatorial optimization over matroids can be solved by the fast greedy algorithm [22]. We now show that, as a consequence of Theorem 9, convex combinatorial optimization over a matroid presented by a membership oracle can be solved in strongly polynomial time as well (see also [34,47]). We state the result for bases, but the analogous statement for independent sets holds as well. We say that $S \subseteq \{0,1\}^n$ is the set of bases of a matroid if it is the set of indicators of the family $\mathcal{B}$ of bases of some matroid, in which case we call $\mathrm{conv}(S)$ the matroid base polytope.

Corollary 7 For every fixed d there is a strongly polynomial time algorithm that, given a set $S \subseteq \{0,1\}^n$ of bases of a matroid presented by a membership oracle, $x \in S$, $w_1,\dots,w_d \in \mathbb{Z}^n$, and a convex functional $c : \mathbb{R}^d \rightarrow \mathbb{R}$ presented by a comparison oracle, encoded as $[n; \langle x, w_1,\dots,w_d \rangle]$, solves the convex matroid optimization problem
$$\max\,\{c(w_1 z,\dots,w_d z) : z \in S\}\,.$$
Proof Let $E := \{\mathbf{1}_i - \mathbf{1}_j : 1 \leq i < j \leq n\}$ be the set of differences of pairs of unit vectors in $\mathbb{R}^n$. We claim that E covers all edge-directions of the matroid base polytope $P := \mathrm{conv}(S)$. Consider any edge $e = [y, y']$ of P with $y, y' \in S$ and let $B := \mathrm{supp}(y)$ and $B' := \mathrm{supp}(y')$ be the corresponding bases. Let $h \in \mathbb{R}^n$ be a linear functional uniquely maximized over P at e. If $B \setminus B' = \{i\}$ is a singleton then $B' \setminus B = \{j\}$ is a singleton as well, in which case $y - y' = \mathbf{1}_i - \mathbf{1}_j$ and we are done. Suppose then, indirectly, that it is not, and pick an element i in the symmetric difference $B \Delta B' := (B \setminus B') \cup (B' \setminus B)$ of minimum value $h_i$. Without loss of generality assume $i \in B \setminus B'$. Then there is a $j \in B' \setminus B$ such that $B'' := B \setminus \{i\} \cup \{j\}$ is also a basis. Let $y'' \in S$ be the indicator of $B''$. Now $|B \Delta B'| > 2$ implies that $B''$ is neither B nor $B'$. By the choice of i we have $hy'' = hy - h_i + h_j \geq hy$. So $y''$ is also a maximizer of h over P and hence $y'' \in e$. But no $\{0,1\}$-vector is a convex combination of others, a contradiction.

Now, $|E| = \binom{n}{2}$ and $E \subset \{-1,0,1\}^n$ imply that $|E|$ and $\langle E \rangle$ are polynomial in n. Moreover, E can be easily computed in strongly polynomial time. Therefore, applying the algorithm of Theorem 9 to the given data and the set E, the convex discrete optimization problem over S can be solved in strongly polynomial time. $\square$

One important application of Corollary 7 is a polynomial time algorithm for computing the universal Gröbner basis of any system of polynomials having a finite set of common zeros in fixed (but arbitrary) number of variables, as well as the construction of the state polyhedron of any member of the Hilbert scheme, see [5,51]. Other important applications are in the field of algebraic statistics [52], in particular for optimal experimental design. These applications are beyond our scope here and will be discussed elsewhere. Here is another concrete example of a convex matroid optimization application.

Example 1 (MAXIMUM NORM SPANNING TREE). Fix any positive integer d. Let $\|\cdot\|_p : \mathbb{R}^d \rightarrow \mathbb{R}$ be the $l_p$ norm given by $\|x\|_p := (\sum_{i=1}^d |x_i|^p)^{1/p}$ for $1 \leq p < \infty$ and $\|x\|_\infty := \max_{i=1}^d |x_i|$. Let G be a connected graph with edge set $N := \{1,\dots,n\}$. For $j = 1,\dots,n$ let $u_j \in \mathbb{Z}^d$ be a weight vector representing the values of edge j under some d criteria. The weight of a subset $T \subseteq N$ is the sum $\sum_{j \in T} u_j$ representing the total values of T under the d criteria. The problem is to find a spanning tree T of G whose weight has maximum $l_p$ norm, that is, a spanning tree T maximizing $\|\sum_{j \in T} u_j\|_p$. Define $w_1,\dots,w_d \in \mathbb{Z}^n$ by $w_{i,j} := u_{j,i}$ for $i = 1,\dots,d$, $j = 1,\dots,n$. Let $S \subseteq \{0,1\}^n$ be the set of indicators of spanning trees of G. Then, in time polynomial in n, a membership oracle for S is realizable, and an initial $x \in S$ is obtainable as the indicator of any greedily constructible spanning tree T. Finally, define the convex functional $c := \|\cdot\|_p$. Then for the most common values $p = 1, 2, \infty$, and in fact for any $p \in \mathbb{N}$, the comparison of $c(y)$ and $c(z)$ can be done for $y, z \in \mathbb{Z}^d$ in time polynomial in $\langle y, z, p \rangle$ by computing and comparing the integer valued pth powers $\|y\|_p^p$ and $\|z\|_p^p$. Thus, by Corollary 7, this problem is solvable in time polynomial in $\langle u_1,\dots,u_n, p \rangle$.

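A brute-force illustration of Example 1 for a small graph follows (it enumerates all spanning trees, which is only feasible for tiny instances; Corollary 7 gives the polynomial time route):

```python
import itertools
import numpy as np

def max_norm_spanning_tree(n_vertices, edges, u, p):
    """Spanning tree T maximizing || sum_{j in T} u_j ||_p (Example 1).

    edges -- list of (a, b) vertex pairs; edge j carries weight vector u[j]
    """
    def is_spanning_tree(T):
        parent = list(range(n_vertices))
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for j in T:                      # union-find acyclicity check
            a, b = map(find, edges[j])
            if a == b:
                return False
            parent[a] = b
        return True

    m = len(edges)
    trees = [T for T in itertools.combinations(range(m), n_vertices - 1)
             if is_spanning_tree(T)]
    return max(trees, key=lambda T: np.linalg.norm(sum(u[j] for j in T), ord=p))

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
u = [np.array(v) for v in ([2, -1], [1, 3], [-2, 2], [4, 0])]
print(max_norm_spanning_tree(4, edges, u, p=2))
```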
Linear N-fold Integer Programming

In this section we develop a theory of linear n-fold integer programming, which leads to the polynomial time solution of broad classes of linear integer programming problems in variable dimension. This will be extended to convex n-fold integer programming in Sect. "Convex Integer Programming". In Sect. "Oriented Augmentation and Linear Optimization" we describe an adaptation of a result of [56] involving an oriented version of the augmentation oracle of Sect. "From Membership to Linear Optimization". In Sect. "Graver Bases and Linear Integer Programming" we discuss Graver bases and their application to linear integer programming. In Sect. "Graver Bases of N-fold Matrices" we show that Graver bases of n-fold matrices can be computed efficiently. In Sect. "Linear N-fold Integer Programming in Polynomial Time" we combine the preparatory statements from these three subsections and prove the main result of this section, asserting that linear n-fold integer programming is polynomial time solvable. We conclude with some applications in Sect. "Some Applications".

Here and in Sect. "Convex Integer Programming" we concentrate on discrete optimization problems over a set S presented as the set of integer points satisfying an explicitly given system of linear inequalities. Without loss of generality we may and will assume that S is given either in standard form
$$S := \{x \in \mathbb{N}^n : Ax = b\}$$
where $A \in \mathbb{Z}^{m \times n}$ and $b \in \mathbb{Z}^m$, or in the form
$$S := \{x \in \mathbb{Z}^n : Ax = b,\ l \leq x \leq u\}$$
where $l, u \in \mathbb{Z}_\infty^n$ and $\mathbb{Z}_\infty = \mathbb{Z} \uplus \{\pm\infty\}$, where some of the variables are bounded below or above and some are unbounded. Thus, S is no longer presented by an oracle, but by the explicit data A, b and possibly l, u. In this setup we refer to discrete optimization over S also as integer programming over S. As usual, an algorithm solving the problem must either provide an $x \in S$ maximizing $wx$ over S, or assert that none exists (either because S is empty or because the objective function is unbounded over the underlying polyhedron). We will sometimes assume that an initial point $x \in S$ is given,
in which case b will be computed as $b := Ax$ and not be part of the input.

Oriented Augmentation and Linear Optimization

We have seen in Sect. "From Membership to Linear Optimization" that an augmentation oracle presentation of a finite set $S \subset \mathbb{Z}^n$ enables us to solve the linear discrete optimization problem over S. However, the running time of the algorithm of Lemma 5, which demonstrated this, was polynomial in the unary length of the radius $\rho(S)$ of the feasible set rather than in its binary length. In this subsection we discuss a recent result of [56] and show that, when S is presented by a suitable stronger oriented version of the augmentation oracle, the linear optimization problem can be solved by a much faster algorithm, whose running time is in fact polynomial in the binary length $\langle \rho(S) \rangle$. The key idea behind this algorithm is that it gives preference to augmentations along interior points of $\mathrm{conv}(S)$ staying far off its boundary. It is inspired by and extends the combinatorial interior point algorithm of [61].

For any vector $g \in \mathbb{R}^n$, let $g^+, g^- \in \mathbb{R}_+^n$ denote its positive and negative parts, defined by $g_j^+ := \max\{g_j, 0\}$ and $g_j^- := -\min\{g_j, 0\}$ for $j = 1,\dots,n$. Note that both $g^+, g^-$ are nonnegative, $\mathrm{supp}(g) = \mathrm{supp}(g^+) \uplus \mathrm{supp}(g^-)$, and $g = g^+ - g^-$.

An oriented augmentation oracle for a set $S \subset \mathbb{Z}^n$ is one that, queried on $x \in S$ and $w^+, w^- \in \mathbb{Z}^n$, either returns an augmenting vector $g \in \mathbb{Z}^n$, defined to be one satisfying $x + g \in S$ and $w^+ g^+ - w^- g^- > 0$, or asserts that none exists. Note that this oracle involves two linear functionals $w^+, w^- \in \mathbb{Z}^n$ rather than one ($w^+, w^-$ are two distinct independent vectors and not the positive and negative parts of one vector). The conditions on an augmenting vector g indicate that it is a feasible direction and has positive value under the nonlinear objective function determined by $w^+, w^-$. Note that this oracle is indeed stronger than the augmentation oracle of Sect. "From Membership to Linear Optimization": to answer a query $x \in S$, $w \in \mathbb{Z}^n$ to the latter, set $w^+ := w^- := w$, thereby obtaining $w^+ g^+ - w^- g^- = wg$ for all g, and query the former on $x, w^+, w^-$; if it replies with an augmenting vector g then reply with the better point $\hat{x} := x + g$, whereas if it asserts that no g exists then assert that no better point exists.
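The reduction just described, from the plain augmentation oracle to the oriented one, is a one-liner; a sketch with hypothetical names:

```python
def plain_augment(x, w, oriented_augment):
    """Simulate the plain augmentation oracle with an oriented one.

    With w+ := w- := w we get w+ g+ - w- g- = w g for every g, so an
    augmenting vector g yields the better point x + g.
    """
    g = oriented_augment(x, w, w)     # query with w+ = w- = w
    return None if g is None else tuple(xi + gi for xi, gi in zip(x, g))
```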
The following lemma is an adaptation of the result of [56] concerning sets of the form $S := \{x \in \mathbb{Z}^n : Ax = b,\ 0 \leq x \leq u\}$ of nonnegative integer points satisfying equations and upper bounds. However, the pair A, b is neither explicitly needed nor does it affect the running time of the algorithm underlying the lemma; it suffices that S is of that form. Moreover, an arbitrary lower bound vector l rather than 0 can be included. So it suffices to assume that S coincides with the intersection of its affine hull and the set of integer points in a box, that is, $S = \mathrm{aff}(S) \cap \{x \in \mathbb{Z}^n : l \leq x \leq u\}$ where $l, u \in \mathbb{Z}^n$. We now describe and prove the algorithm of [56] adjusted to any lower and upper bounds l, u.

Lemma 7 There is a polynomial time algorithm that, given vectors $l, u \in \mathbb{Z}^n$, a set $S \subset \mathbb{Z}^n$ satisfying $S = \mathrm{aff}(S) \cap \{z \in \mathbb{Z}^n : l \leq z \leq u\}$ and presented by an oriented augmentation oracle, $x \in S$, and $w \in \mathbb{Z}^n$, encoded as $[\langle l, u, x, w \rangle]$, provides an optimal solution $x^* \in S$ to the linear discrete optimization problem $\max\{wz : z \in S\}$.

Proof We start with some strengthening adjustments to the oriented augmentation oracle. Let $\rho := \max\{\|l\|_\infty, \|u\|_\infty\}$ be an upper bound on the radius of S. Then any augmenting vector g obtained from the oriented augmentation oracle when queried on $y \in S$ and $w^+, w^- \in \mathbb{Z}^n$ can be made in polynomial time to be exhaustive, that is, to satisfy $y + 2g \notin S$ (which means that no longer augmenting step in direction g can be taken). Indeed, using binary search, find the largest $r \in \{1,\dots,2\rho\}$ for which $l \leq y + rg \leq u$; then $S = \mathrm{aff}(S) \cap \{z \in \mathbb{Z}^n : l \leq z \leq u\}$ implies $y + rg \in S$ and hence we can replace $g := rg$. So from here on we will assume that if there is an augmenting vector then the oracle returns an exhaustive one.

Second, let $\mathbb{R}_\infty := \mathbb{R} \uplus \{\pm\infty\}$ and for any vector $v \in \mathbb{R}^n$ let $v^{-1} \in \mathbb{R}_\infty^n$ denote its entry-wise reciprocal, defined by $v_i^{-1} := 1/v_i$ if $v_i \neq 0$ and $v_i^{-1} := \infty$ if $v_i = 0$. For any $y \in S$, the vectors $(y - l)^{-1}$ and $(u - y)^{-1}$ are the reciprocals of the "entry-wise distance" of y from the given lower and upper bounds. The algorithm will query the oracle on triples $y, w^+, w^-$ with $w^+ := w - \mu(u - y)^{-1}$ and $w^- := w + \mu(y - l)^{-1}$, where $\mu$ is a suitable positive scalar and w is the input linear functional. The fact that such $w^+, w^-$ may have infinite entries does not cause any problem: indeed, if g is an augmenting vector then $y + g \in S$ implies that $g_i^+ = 0$ whenever $y_i = u_i$
and $g_i^- = 0$ whenever $l_i = y_i$, so each infinite entry in $w^+$ or $w^-$ occurring in the expression $w^+ g^+ - w^- g^-$ is multiplied by 0 and hence zeroed out.

The algorithm proceeds in phases. Each phase i starts with a feasible point $y_{i-1} \in S$ and performs repeated augmentations using the oriented augmentation oracle, terminating with a new feasible point $y_i \in S$ when no further augmentations are possible. The queries to the oracle make use of a positive scalar parameter $\mu_i$ fixed throughout the phase. The first phase ($i = 1$) starts with the input point $y_0 := x$ and sets $\mu_1 := \rho\|w\|_\infty$. Each further phase $i \geq 2$ starts with the point $y_{i-1}$ obtained from the previous phase and sets the parameter value $\mu_i := \frac{1}{2}\mu_{i-1}$ to be half its value in the previous phase. The algorithm terminates at the end of the first phase i for which $\mu_i < \frac{1}{n}$, and outputs $x^* := y_i$. Thus, the number of phases is at most $\lceil \log_2(2n\rho\|w\|_\infty) \rceil$ and hence polynomial in $\langle l, u, w \rangle$.

We now describe the ith phase which determines $y_i$ from $y_{i-1}$. Set $\mu_i := \frac{1}{2}\mu_{i-1}$ and $\hat{y} := y_{i-1}$. Iterate the following: query the strengthened oriented augmentation oracle on $\hat{y}$, $w^+ := w - \mu_i(u - \hat{y})^{-1}$, and $w^- := w + \mu_i(\hat{y} - l)^{-1}$; if the oracle returns an exhaustive augmenting vector g then set $\hat{y} := \hat{y} + g$ and repeat, whereas if it asserts that there is no augmenting vector then set $y_i := \hat{y}$ and complete the phase. If $\mu_i \geq \frac{1}{n}$ then proceed to the $(i+1)$th phase, else output $x^* := y_i$ and terminate the algorithm.

It remains to show that the output of the algorithm is indeed an optimal solution and that the number of iterations (and hence calls to the oracle) in each phase is polynomial in the input. For this we need the following facts, the easy proofs of which are omitted:
1. For every feasible $y \in S$ and direction g with $y + g \in S$ also feasible, we have $(u - y)^{-1} g^+ + (y - l)^{-1} g^- \leq n$.
2. For every $y \in S$ and direction g with $y + g \in S$ but $y + 2g \notin S$, we have $(u - y)^{-1} g^+ + (y - l)^{-1} g^- > \frac{1}{2}$.
3. For every feasible $y \in S$, direction g with $y + g \in S$ also feasible, and $\mu > 0$, setting $w^+ := w - \mu(u - y)^{-1}$ and $w^- := w + \mu(y - l)^{-1}$ we have
$$w^+ g^+ - w^- g^- = wg - \mu\left((u - y)^{-1} g^+ + (y - l)^{-1} g^-\right).$$

Now, consider the last phase i with $\mu_i < \frac{1}{n}$, let $x^* := y_i := \hat{y}$ be the output of the algorithm at the end of this phase, and let $\hat{x} \in S$ be any optimal solution. Now, the phase is completed when the oracle, queried on the triple $\hat{y}$, $w^+ = w - \mu_i(u - \hat{y})^{-1}$, and $w^- = w + \mu_i(\hat{y} - l)^{-1}$, asserts that there is no augmenting vector. In particular, setting $g := \hat{x} - \hat{y}$, we find $w^+ g^+ - w^- g^- \leq 0$ and hence, by facts 1 and 3 above,
$$w\hat{x} - wx^* = wg \leq \mu_i\left((u - \hat{y})^{-1} g^+ + (\hat{y} - l)^{-1} g^-\right) < \frac{1}{n} \cdot n = 1\,.$$
Since $w\hat{x}$ and $wx^*$ are integer, this implies that in fact $w\hat{x} - wx^* \leq 0$ and hence the output $x^*$ of the algorithm is indeed an optimal solution to the given optimization problem.

Next we bound the number of iterations in each phase i starting from $y_{i-1} \in S$. Let again $\hat{x} \in S$ be any optimal solution. Consider any iteration in that phase, where the oracle is queried on $\hat{y}$, $w^+ = w - \mu_i(u - \hat{y})^{-1}$, and $w^- = w + \mu_i(\hat{y} - l)^{-1}$, and returns an exhaustive augmenting vector g. We will now show that
$$w(\hat{y} + g) - w\hat{y} \geq \frac{1}{4n}\left(w\hat{x} - wy_{i-1}\right)\,, \qquad (1)$$
that is, the increment in the objective value from $\hat{y}$ to the augmented point $\hat{y} + g$ is at least $\frac{1}{4n}$ times the difference between the optimal objective value $w\hat{x}$ and the objective value $wy_{i-1}$ of the point $y_{i-1}$ at the beginning of phase i. This shows that at most 4n such increments (and hence iterations) can occur in the phase before it is completed.

To establish (1), we show that $wg \geq \frac{1}{2}\mu_i$ and $w\hat{x} - wy_{i-1} \leq 2n\mu_i$. For the first inequality, note that g is an exhaustive augmenting vector and so $w^+ g^+ - w^- g^- > 0$ and $\hat{y} + 2g \notin S$, and hence, by facts 2 and 3, $wg > \mu_i((u - \hat{y})^{-1} g^+ + (\hat{y} - l)^{-1} g^-) > \frac{1}{2}\mu_i$. We proceed with the second inequality. If $i = 1$ (first phase) then this indeed holds since $w\hat{x} - wy_0 \leq 2n\rho\|w\|_\infty = 2n\mu_1$. If $i \geq 2$, let $\tilde{w}^+ := w - \mu_{i-1}(u - y_{i-1})^{-1}$ and $\tilde{w}^- := w + \mu_{i-1}(y_{i-1} - l)^{-1}$. The $(i-1)$th phase was completed when the oracle, queried on the triple $y_{i-1}$, $\tilde{w}^+$, and $\tilde{w}^-$, asserted that there is no augmenting vector. In particular, for $\tilde{g} := \hat{x} - y_{i-1}$, we find $\tilde{w}^+ \tilde{g}^+ - \tilde{w}^- \tilde{g}^- \leq 0$ and so, by facts 1 and 3,
$$w\hat{x} - wy_{i-1} = w\tilde{g} \leq \mu_{i-1}\left((u - y_{i-1})^{-1} \tilde{g}^+ + (y_{i-1} - l)^{-1} \tilde{g}^-\right) \leq \mu_{i-1} n = 2n\mu_i\,. \qquad \square$$
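The "exhaustive" strengthening step in the proof, finding the largest feasible multiple of an augmenting vector inside the box, is a standard binary search; a sketch:

```python
def make_exhaustive(y, g, l, u, rho):
    """Replace augmenting vector g by r*g with r in {1,...,2*rho} maximal
    such that l <= y + r*g <= u, as in the proof of Lemma 7."""
    def in_box(r):
        return all(lj <= yj + r * gj <= uj
                   for yj, gj, lj, uj in zip(y, g, l, u))
    lo, hi = 1, 2 * rho
    while lo < hi:                     # binary search for the largest feasible r
        mid = (lo + hi + 1) // 2
        if in_box(mid):
            lo = mid
        else:
            hi = mid - 1
    return tuple(lo * gj for gj in g)
```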
Graver Bases and Linear Integer Programming

We now come to the definition of a fundamental object introduced by Graver in [28]. The Graver basis of an integer matrix A is a canonical finite set $\mathcal{G}(A)$ that can be defined as follows. Define a partial order $\sqsubseteq$ on $\mathbb{Z}^n$ which extends the coordinate-wise order $\leq$ on $\mathbb{N}^n$ as follows: for two vectors $u, v \in \mathbb{Z}^n$ put $u \sqsubseteq v$ and say that u is conformal to v if $|u_i| \leq |v_i|$ and $u_i v_i \geq 0$ for $i = 1,\dots,n$, that is, u and v lie in the same orthant of $\mathbb{R}^n$ and each component of u is bounded by the corresponding component of v in absolute value. It is not hard to see that $\sqsubseteq$ is a well partial ordering (this is basically Dickson's lemma) and hence every subset of $\mathbb{Z}^n$ has finitely many $\sqsubseteq$-minimal elements. Let $\mathcal{L}(A) := \{x \in \mathbb{Z}^n : Ax = 0\}$ be the lattice of linear integer dependencies on A. The Graver basis of A is defined to be the set $\mathcal{G}(A)$ of all $\sqsubseteq$-minimal vectors in $\mathcal{L}(A) \setminus \{0\}$.

Note that if A is an $m \times n$ matrix then its Graver basis consists of vectors in $\mathbb{Z}^n$. We sometimes write $\mathcal{G}(A)$ as a suitable $|\mathcal{G}(A)| \times n$ matrix whose rows are the Graver basis elements. The Graver basis is centrally symmetric ($g \in \mathcal{G}(A)$ implies $-g \in \mathcal{G}(A)$); thus, when listing a Graver basis we will typically give one of each antipodal pair and prefix the set (or matrix) by $\pm$. Any element of the Graver basis is primitive (its entries are relatively prime integers). Every circuit of A (nonzero primitive minimal support element of $\mathcal{L}(A)$) is in $\mathcal{G}(A)$; in fact, if A is totally unimodular then $\mathcal{G}(A)$ coincides with the set of circuits (see Sect. "Convex Integer Programming over Totally Unimodular Systems" in the sequel for more details on this). However, in general $\mathcal{G}(A)$ is much larger. For more details on Graver bases and their connection to Gröbner bases see Sturmfels [58], and for the currently fastest procedure for computing them see [35,36].

Here is a quick simple example; we will see more structured and complex examples later on. Consider the $1 \times 3$ matrix $A := (1, 2, 1)$. Then its Graver basis can be shown to be the set $\mathcal{G}(A) = \pm\{(2,-1,0), (0,-1,2), (1,0,-1), (1,-1,1)\}$. The first three elements (and their antipodes) are the circuits of A; already in this small example non-circuits appear as well: the fourth element (and its antipode) is a primitive linear integer dependency whose support is not minimal.
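For intuition, a brute-force sketch that recovers the Graver basis of the small example above by enumerating kernel vectors in a box and keeping the $\sqsubseteq$-minimal ones (adequate only for tiny matrices; 4ti2 [36] is the practical tool):

```python
import itertools
import numpy as np

def graver_brute_force(A, bound):
    """All conformally minimal nonzero integer kernel vectors of A
    within the box [-bound, bound]^n."""
    A = np.asarray(A)
    n = A.shape[1]
    kernel = [np.array(v) for v in
              itertools.product(range(-bound, bound + 1), repeat=n)
              if any(v) and not np.any(A @ np.array(v))]
    def conformal_leq(u, v):      # u ⊑ v: same orthant, |u_i| <= |v_i|
        return all(ui * vi >= 0 and abs(ui) <= abs(vi) for ui, vi in zip(u, v))
    return [v for v in kernel
            if not any(conformal_leq(u, v) and not np.array_equal(u, v)
                       for u in kernel)]

print(graver_brute_force([[1, 2, 1]], bound=2))
# expected, up to sign: (2,-1,0), (0,-1,2), (1,0,-1), (1,-1,1)
```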
We now show that when we do have access to the Graver basis, it can be used to solve linear integer programming. We will extend this in Sect. "Convex Integer Programming", where we show that the Graver basis enables us to solve convex integer programming as well. In Sect. "Graver Bases of N-fold Matrices" we will show that there are important classes of matrices for which the Graver basis is indeed accessible.

First, we need a simple property of Graver bases. A finite sum $u := \sum_i v_i$ of vectors $v_i \in \mathbb{R}^n$ is conformal if each summand is conformal to the sum, that is, $v_i \sqsubseteq u$ for all i.

Lemma 8 Let A be any integer matrix. Then any $h \in \mathcal{L}(A) \setminus \{0\}$ can be written as a conformal sum $h := \sum g_i$ of (not necessarily distinct) Graver basis elements $g_i \in \mathcal{G}(A)$.

Proof By induction on the well partial order $\sqsubseteq$. Recall that $\mathcal{G}(A)$ is the set of $\sqsubseteq$-minimal elements in $\mathcal{L}(A) \setminus \{0\}$. Consider any $h \in \mathcal{L}(A) \setminus \{0\}$. If it is $\sqsubseteq$-minimal then $h \in \mathcal{G}(A)$ and we are done. Otherwise, there is an $h' \in \mathcal{G}(A)$ such that $h' \sqsubset h$. Set $h'' := h - h'$. Then $h'' \in \mathcal{L}(A) \setminus \{0\}$ and $h'' \sqsubset h$, so by induction there is a conformal sum $h'' = \sum_i g_i$ with $g_i \in \mathcal{G}(A)$ for all i. Now $h = h' + \sum_i g_i$ is the desired conformal sum of h. $\square$

The next lemma shows the usefulness of Graver bases for oriented augmentation.

Lemma 9 Let A be an $m \times n$ integer matrix with Graver basis $\mathcal{G}(A)$ and let $l, u \in \mathbb{Z}_\infty^n$, $w^+, w^- \in \mathbb{Z}^n$, and $b \in \mathbb{Z}^m$. Suppose $x \in T := \{y \in \mathbb{Z}^n : Ay = b,\ l \leq y \leq u\}$. Then for every $g \in \mathbb{Z}^n$ which satisfies $x + g \in T$ and $w^+ g^+ - w^- g^- > 0$ there exists an element $\hat{g} \in \mathcal{G}(A)$ with $\hat{g} \sqsubseteq g$ which also satisfies $x + \hat{g} \in T$ and $w^+ \hat{g}^+ - w^- \hat{g}^- > 0$.

Proof Suppose $g \in \mathbb{Z}^n$ satisfies the requirements. Then $Ag = A(x + g) - Ax = b - b = 0$ since $x, x + g \in T$.
Thus, $g \in \mathcal{L}(A) \setminus \{0\}$ and hence, by Lemma 8, there is a conformal sum $g = \sum_i h_i$ with $h_i \in \mathcal{G}(A)$ for all i. Now, $h_i \sqsubseteq g$ is equivalent to $h_i^+ \leq g^+$ and $h_i^- \leq g^-$, so the conformal sum $g = \sum_i h_i$ gives corresponding sums of the positive and negative parts $g^+ = \sum_i h_i^+$ and $g^- = \sum_i h_i^-$. Therefore we obtain
$$0 < w^+ g^+ - w^- g^- = w^+ \sum_i h_i^+ - w^- \sum_i h_i^- = \sum_i \left(w^+ h_i^+ - w^- h_i^-\right)$$
which implies that there is some $h_i$ in this sum with $w^+ h_i^+ - w^- h_i^- > 0$. Now, $h_i \in \mathcal{G}(A)$ implies $A(x + h_i) = Ax = b$. Also, $l \leq x, x + g \leq u$ and $h_i \sqsubseteq g$ imply that $l \leq x + h_i \leq u$. So $x + h_i \in T$. Therefore the vector $\hat{g} := h_i$ satisfies the claim. $\square$

We can now show that the Graver basis enables us to solve linear integer programming in polynomial time provided an initial feasible point is available.

Theorem 12 There is a polynomial time algorithm that, given $A \in \mathbb{Z}^{m \times n}$, its Graver basis $\mathcal{G}(A)$, $l, u \in \mathbb{Z}_\infty^n$, $x, w \in \mathbb{Z}^n$ with $l \leq x \leq u$, encoded as $[\langle A, \mathcal{G}(A), l, u, x, w \rangle]$, solves the linear integer program $\max\{wz : z \in \mathbb{Z}^n,\ Az = b,\ l \leq z \leq u\}$ with $b := Ax$.

Proof First, note that the objective function of the integer program is unbounded if and only if the objective function of its relaxation $\max\{wy : y \in \mathbb{R}^n,\ Ay = b,\ l \leq y \leq u\}$ is unbounded, which can be checked in polynomial time using linear programming. If it is unbounded then assert that there is no optimal solution and terminate the algorithm.

Assume then that the objective is bounded. Then, since the program is feasible, it has an optimal solution. Furthermore (as basically follows from Cramer's rule, see e. g. [55, Theorem 17.1]), it has an optimal $x^*$ satisfying $|x_j^*| \leq \rho$ for all j, where $\rho$ is an easily computable integer upper bound whose binary length $\langle \rho \rangle$ is polynomially bounded in $\langle A, l, u, x \rangle$. For instance, $\rho := (n+1)(n+1)!\, r^{n+1}$ will do, with r the maximum among $\max_i |\sum_j A_{i,j} x_j|$, $\max_{i,j} |A_{i,j}|$, $\max\{|l_j| : |l_j| < \infty\}$, and $\max\{|u_j| : |u_j| < \infty\}$.

Let $T := \{y \in \mathbb{Z}^n : Ay = b,\ l \leq y \leq u\}$ and $S := T \cap [-\rho, \rho]^n$. Then our linear integer programming problem now reduces to linear discrete optimization over S. Now, an oriented augmentation oracle for S can be simulated in polynomial time using the given Graver basis $\mathcal{G}(A)$ as follows: given a query $y \in S$ and $w^+, w^- \in \mathbb{Z}^n$, search for $g \in \mathcal{G}(A)$ which satisfies $w^+ g^+ - w^- g^- > 0$ and $y + g \in S$; if there is such a g then return it as an augmenting vector, whereas if there is no such g then assert that no augmenting vector exists. Clearly, if this simulated oracle returns a vector g then it is an augmenting vector. On the other hand, if there exists an augmenting vector g then $y + g \in S \subseteq T$ and $w^+ g^+ - w^- g^- > 0$ imply by Lemma 9 that there is also a $\hat{g} \in \mathcal{G}(A)$ with $\hat{g} \sqsubseteq g$ such that $w^+ \hat{g}^+ - w^- \hat{g}^- > 0$ and $y + \hat{g} \in T$. Since $y, y + g \in S$ and $\hat{g} \sqsubseteq g$, we find that $y + \hat{g} \in S$ as well. Therefore the Graver basis contains an augmenting vector and hence the simulated oracle will find and output one.

Define $\hat{l}, \hat{u} \in \mathbb{Z}^n$ by $\hat{l}_j := \max(l_j, -\rho)$, $\hat{u}_j := \min(u_j, \rho)$, $j = 1,\dots,n$. Then it is easy to see that $S = \mathrm{aff}(S) \cap \{y \in \mathbb{Z}^n : \hat{l} \leq y \leq \hat{u}\}$. Now apply the algorithm of Lemma 7 to $\hat{l}, \hat{u}$, S, x, and w, using the above simulated oriented augmentation oracle for S, and obtain in polynomial time a vector $x^* \in S$ which is optimal to the linear discrete optimization problem over S and hence to the given linear integer program. $\square$
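The simulated oriented augmentation oracle in the proof of Theorem 12 is a simple scan over the Graver basis; a sketch:

```python
def graver_oriented_augment(y, w_plus, w_minus, G, feasible):
    """Oriented augmentation oracle for S simulated from a Graver basis G
    (proof of Theorem 12). feasible(z) tests membership of z in S.

    Returns g in G with w+ g+ - w- g- > 0 and y + g in S, or None.
    """
    for g in G:
        gain = sum(wp * max(gj, 0) for wp, gj in zip(w_plus, g)) \
             - sum(wm * max(-gj, 0) for wm, gj in zip(w_minus, g))
        if gain > 0 and feasible(tuple(yj + gj for yj, gj in zip(y, g))):
            return g
    return None
```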
As a special case of Theorem 12 we recover the following result of [55] concerning linear integer programming in standard form when the Graver basis is available.

Theorem 13 There is a polynomial time algorithm that, given a matrix $A \in \mathbb{Z}^{m \times n}$, its Graver basis $\mathcal{G}(A)$, $x \in \mathbb{N}^n$, and $w \in \mathbb{Z}^n$, encoded as $[\langle A, \mathcal{G}(A), x, w \rangle]$, solves the linear integer programming problem $\max\{wz : z \in \mathbb{N}^n,\ Az = b\}$ where $b := Ax$.

Graver Bases of N-fold Matrices

As mentioned above, the Graver basis $\mathcal{G}(A)$ of an integer matrix A contains all circuits of A and typically many more elements. While the number of circuits is already typically exponential and can be as large as $\binom{n}{m+1}$, the number of Graver basis elements is usually even larger and depends also on the entries of A and not only on its dimensions m, n. So unfortunately it is typically very hard to compute $\mathcal{G}(A)$. However, we now show that for the important and useful broad class of n-fold matrices, the Graver basis is better behaved and can be computed in polynomial time. Recall the following definition from the introduction. Given an $(r+s) \times t$
matrix A, let $A_1$ be its $r \times t$ sub-matrix consisting of the first r rows and let $A_2$ be its $s \times t$ sub-matrix consisting of the last s rows. We refer to A explicitly as an $(r+s) \times t$ matrix, since the definition below depends also on r and s and not only on the entries of A. The n-fold matrix of an $(r+s) \times t$ matrix A is then defined to be the following $(r + ns) \times nt$ matrix,
$$A^{(n)} := (\mathbf{1}_n \otimes A_1) \oplus (I_n \otimes A_2) = \begin{pmatrix} A_1 & A_1 & A_1 & \cdots & A_1 \\ A_2 & 0 & 0 & \cdots & 0 \\ 0 & A_2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & A_2 \end{pmatrix}.$$
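A direct construction of $A^{(n)}$ with numpy's Kronecker product, mirroring the definition (our helper, for illustration):

```python
import numpy as np

def nfold(A1, A2, n):
    """The n-fold matrix (1_n ⊗ A1) ⊕ (I_n ⊗ A2) of shape (r + n*s, n*t)."""
    top = np.kron(np.ones((1, n), dtype=int), A1)   # A1 repeated in one block row
    bottom = np.kron(np.eye(n, dtype=int), A2)      # A2 on the block diagonal
    return np.vstack([top, bottom])

A1 = np.array([[1, 1, 1]])        # r = 1, t = 3
A2 = np.array([[1, 2, 1]])        # s = 1
print(nfold(A1, A2, 3))           # a (1 + 3) x 9 matrix
```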
We now discuss a recent result of [54], which originates in [4], and its extension in [38], on the stabilization of Graver bases of n-fold matrices. Consider vectors $x = (x^1,\dots,x^n)$ with $x^k \in \mathbb{Z}^t$ for $k = 1,\dots,n$. The type of x is the number $|\{k : x^k \neq 0\}|$ of nonzero components $x^k \in \mathbb{Z}^t$ of x. The Graver complexity of an $(r+s) \times t$ matrix A, denoted $c(A)$, is defined to be the smallest $c \in \mathbb{N} \uplus \{\infty\}$ such that for all n, the Graver basis of $A^{(n)}$ consists of vectors of type at most $c(A)$. We provide the proof of the following result of [38,54] stating that the Graver complexity is always finite.

Lemma 10 The Graver complexity $c(A)$ of any $(r+s) \times t$ integer matrix A is finite.

Proof Call an element $x = (x^1,\dots,x^n)$ in the Graver basis of some $A^{(n)}$ pure if $x^i \in \mathcal{G}(A_2)$ for all i. Note that the type of a pure $x \in \mathcal{G}(A^{(n)})$ is n. First, we claim that if there is an element of type m in some $\mathcal{G}(A^{(l)})$ then for some $n \geq m$ there is a pure element in $\mathcal{G}(A^{(n)})$, and so it will suffice to bound the type of pure elements. Suppose there is an element of type m in some $\mathcal{G}(A^{(l)})$. Then its restriction to its m nonzero components is an element $x = (x^1,\dots,x^m)$ in $\mathcal{G}(A^{(m)})$. Let $x^i = \sum_{j=1}^{k_i} g_{i,j}$ be a conformal decomposition of $x^i$ with $g_{i,j} \in \mathcal{G}(A_2)$ for all i, j, and let $n := k_1 + \cdots + k_m \geq m$. Then $g := (g_{1,1},\dots,g_{m,k_m})$ is in $\mathcal{G}(A^{(n)})$, else there would be $\hat{g} \sqsubset g$ in $\mathcal{G}(A^{(n)})$, in which case the nonzero $\hat{x}$ with $\hat{x}^i := \sum_{j=1}^{k_i} \hat{g}_{i,j}$ for all i would satisfy $\hat{x} \sqsubset x$ and $\hat{x} \in \mathcal{L}(A^{(m)})$, contradicting $x \in \mathcal{G}(A^{(m)})$. Thus g is a pure element of type $n \geq m$, proving the claim.

We proceed to bound the type of pure elements. Let $\mathcal{G}(A_2) = \{g_1,\dots,g_m\}$ be the Graver basis of $A_2$ and
let $G_2$ be the $t \times m$ matrix whose columns are the $g_i$. Suppose $x = (x^1,\dots,x^n) \in \mathcal{G}(A^{(n)})$ is pure for some n. Let $v \in \mathbb{N}^m$ be the vector with $v_i := |\{k : x^k = g_i\}|$ counting the number of $g_i$ components of x for each i. Then $\sum_{i=1}^m v_i$ is equal to the type n of x. Next, note that $A_1 G_2 v = A_1(\sum_{k=1}^n x^k) = 0$ and hence $v \in \mathcal{L}(A_1 G_2)$. We claim that, moreover, $v \in \mathcal{G}(A_1 G_2)$. Suppose indirectly not. Then there is $\hat{v} \in \mathcal{G}(A_1 G_2)$ with $\hat{v} \sqsubset v$, and it is easy to obtain a nonzero $\hat{x} \sqsubset x$ from x by zeroing out some components so that $\hat{v}_i = |\{k : \hat{x}^k = g_i\}|$ for all i. Then $A_1(\sum_{k=1}^n \hat{x}^k) = A_1 G_2 \hat{v} = 0$ and hence $\hat{x} \in \mathcal{L}(A^{(n)})$, contradicting $x \in \mathcal{G}(A^{(n)})$. So the type of any pure element, and hence the Graver complexity of A, is at most the largest value $\sum_{i=1}^m v_i$ of any nonnegative element v of the Graver basis $\mathcal{G}(A_1 G_2)$. $\square$

Using Lemma 10 we now show how to compute $\mathcal{G}(A^{(n)})$ in polynomial time.

Theorem 14 For every fixed $(r+s) \times t$ integer matrix A there is a strongly polynomial time algorithm that, given $n \in \mathbb{N}$, encoded as $[n; n]$, computes the Graver basis $\mathcal{G}(A^{(n)})$ of the n-fold matrix $A^{(n)}$. In particular, the cardinality $|\mathcal{G}(A^{(n)})|$ and binary length $\langle \mathcal{G}(A^{(n)}) \rangle$ of the Graver basis of the n-fold matrix are polynomially bounded in n.

Proof Let $c := c(A)$ be the Graver complexity of A and consider any $n \geq c$. We show that the Graver basis of $A^{(n)}$ is the union of $\binom{n}{c}$ suitably embedded copies of the Graver basis of $A^{(c)}$. For every c indices $1 \leq k_1 < \cdots < k_c \leq n$ define a map $\sigma_{k_1,\dots,k_c}$ from $\mathbb{Z}^{ct}$ to $\mathbb{Z}^{nt}$ sending $x = (x^1,\dots,x^c)$ to $y = (y^1,\dots,y^n)$ with $y^{k_i} := x^i$ for $i = 1,\dots,c$ and $y^k := 0$ for $k \notin \{k_1,\dots,k_c\}$. We claim that $\mathcal{G}(A^{(n)})$ is the union of the images of $\mathcal{G}(A^{(c)})$ under the $\binom{n}{c}$ maps $\sigma_{k_1,\dots,k_c}$ for all $1 \leq k_1 < \cdots < k_c \leq n$, that is,
$$\mathcal{G}(A^{(n)}) = \bigcup_{1 \leq k_1 < \cdots < k_c \leq n} \sigma_{k_1,\dots,k_c}\left(\mathcal{G}(A^{(c)})\right). \qquad (2)$$

Let $J^+ := \{j : a_{k,j} > 0\}$ and $J^- := \{j : a_{k,j} < 0\}$, and set $z_k := b_k + U \cdot \sum_{j \in J^-} |a_{k,j}|$. The last coordinate of z is set for consistency with u, v to be $z_h = z_{m+1} := rU - \sum_{k=1}^m z_k$. Now, with $\bar{y}_j := U - y_j$ the complement of variable $y_j$ as above, the kth equation can be rewritten as
$$\sum_{j \in J^+} a_{k,j} y_j + \sum_{j \in J^-} |a_{k,j}|\, \bar{y}_j = \sum_{j=1}^n a_{k,j} y_j + U \cdot \sum_{j \in J^-} |a_{k,j}| = b_k + U \cdot \sum_{j \in J^-} |a_{k,j}| = z_k\,.$$
To encode this equation, we simply "pull down" to the corresponding kth horizontal plane as many copies of each variable $y_j$ or $\bar{y}_j$ by suitably setting $k^+(s) := k$ or $k^-(s) := k$. By the choice of $r_j$ there are sufficiently many, possibly with a few redundant copies which are absorbed in the last hyperplane by setting $k^+(s) := m+1$ or $k^-(s) := m+1$. This completes the encoding and provides the desired representation.

Third, we show that any 3-way polytope with plane-sums fixed and entry bounds,
$$F := \left\{ y \in \mathbb{R}_+^{l \times m \times n} : \sum_{i,j} y_{i,j,k} = c_k,\ \sum_{i,k} y_{i,j,k} = b_j,\ \sum_{j,k} y_{i,j,k} = a_i,\ y_{i,j,k} \leq e_{i,j,k} \right\},$$
can be represented as a 3-way polytope with line-sums fixed (and no entry bounds),
$$T := \left\{ x \in \mathbb{R}_+^{r \times c \times 3} : \sum_I x_{I,J,K} = z_{J,K},\ \sum_J x_{I,J,K} = v_{I,K},\ \sum_K x_{I,J,K} = u_{I,J} \right\}.$$
In particular, this implies that any face F of a 3-way polytope with plane-sums fixed can be represented as a 3-way polytope T with line-sums fixed: forbidden entries are encoded by setting a "forbidding" upper-bound $e_{i,j,k} := 0$ on all forbidden entries $(i,j,k) \notin E$ and an "enabling" upper-bound $e_{i,j,k} := U$ on all enabled entries $(i,j,k) \in E$. We describe the presentation, but omit the proof that it is indeed valid; further details on this step can be found in [14,15,16]. We give explicit formulas for $u_{I,J}$, $v_{I,K}$, $z_{J,K}$ in terms of the $a_i$, $b_j$, $c_k$ and $e_{i,j,k}$ as follows. Put $r := l \cdot m$ and $c := n + l + m$. The first index I of each entry $x_{I,J,K}$ will be a pair $I = (i,j)$ in the r-set
$$\{(1,1),\dots,(1,m),(2,1),\dots,(2,m),\dots,(l,1),\dots,(l,m)\}\,.$$
The second index J of each entry $x_{I,J,K}$ will be a pair $J = (s,t)$ in the c-set
$$\{(1,1),\dots,(1,n),(2,1),\dots,(2,l),(3,1),\dots,(3,m)\}\,.$$
The last index K will simply range in the 3-set $\{1,2,3\}$. We represent F as T via the injection $\sigma$ given explicitly by $\sigma(i,j,k) := ((i,j),(1,k),1)$, embedding each variable $y_{i,j,k}$ as the entry $x_{(i,j),(1,k),1}$. Let U now denote the minimal between the two values $\max\{a_1,\dots,a_l\}$ and $\max\{b_1,\dots,b_m\}$. The line-sums (2-margins) are set to be
$$u_{(i,j),(1,t)} = e_{i,j,t}\,, \qquad u_{(i,j),(2,t)} = \begin{cases} U & \text{if } t = i, \\ 0 & \text{otherwise}, \end{cases} \qquad u_{(i,j),(3,t)} = \begin{cases} U & \text{if } t = j, \\ 0 & \text{otherwise}, \end{cases}$$
$$v_{(i,j),t} = \begin{cases} U & \text{if } t = 1, \\ e_{i,j,+} & \text{if } t = 2, \\ U & \text{if } t = 3, \end{cases}$$
$$z_{(i,j),1} = \begin{cases} c_j & \text{if } i = 1, \\ mU - a_j & \text{if } i = 2, \\ 0 & \text{if } i = 3, \end{cases} \qquad z_{(i,j),2} = \begin{cases} e_{+,+,j} - c_j & \text{if } i = 1, \\ 0 & \text{if } i = 2, \\ b_j & \text{if } i = 3, \end{cases} \qquad z_{(i,j),3} = \begin{cases} 0 & \text{if } i = 1, \\ a_j & \text{if } i = 2, \\ lU - b_j & \text{if } i = 3. \end{cases}$$
Applying the first step to the given rational polytope P, applying the second step to the resulting Q, and applying the third step to the resulting F, we get in polynomial time a 3-way $r \times c \times 3$ polytope T of all line-sums fixed representing P as claimed. $\square$
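The explicit margin formulas of the third step translate directly into code; the sketch below builds u, v, z from a, b, c, e and checks the basic consistency identities (that matching one-dimensional margins of u, v, z agree), under our reconstruction of the formulas above.

```python
import numpy as np

def line_sum_margins(a, b, c, e):
    """Margins u, v, z of the r x c x 3 line-sum polytope T representing
    the plane-sum polytope F with entry bounds e (third step of the proof)."""
    l, m, n = e.shape
    U = min(max(a), max(b))
    r_, c_ = l * m, n + l + m
    u = np.zeros((r_, c_)); v = np.zeros((r_, 3)); z = np.zeros((c_, 3))
    for i in range(l):
        for j in range(m):
            I = i * m + j
            for t in range(n):
                u[I, t] = e[i, j, t]              # group J = (1, t)
            u[I, n + i] = U                       # group J = (2, t): U iff t = i
            u[I, n + l + j] = U                   # group J = (3, t): U iff t = j
            v[I] = (U, e[i, j, :].sum(), U)
    for t in range(n):
        z[t] = (c[t], e[:, :, t].sum() - c[t], 0)
    for t in range(l):
        z[n + t] = (m * U - a[t], 0, a[t])
    for t in range(m):
        z[n + l + t] = (0, b[t], l * U - b[t])
    assert np.isclose(u.sum(1), v.sum(1)).all()   # margins of each row agree
    assert np.isclose(u.sum(0), z.sum(1)).all()   # margins of each column agree
    return u, v, z

e = np.ones((2, 2, 2)); a = b = np.array([2., 2.]); c = np.array([2., 2.])
u, v, z = line_sum_margins(a, b, c, e)
```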
The Complexity of the Multiway Transportation Problem

We are now finally in position to settle the complexity of the general multiway transportation problem. The data for the problem consists of: positive integers k (table dimension) and $m_1,\dots,m_k$ (table sides); a family $\mathcal{F}$ of subsets of $\{1,\dots,k\}$ (supporting the hierarchical collection of margins to be fixed); integer values $u_{i_1,\dots,i_k}$ for all margins supported on $\mathcal{F}$; and an integer "profit" $m_1 \times \cdots \times m_k$ array w. The transportation problem is to find an $m_1 \times \cdots \times m_k$ table having the given margins and attaining maximum profit, or assert that none exists. Equivalently, it is the linear integer programming problem of maximizing the linear functional defined by w over the transportation polytope $T_{\mathcal{F}}$,
$$\max\,\{wx : x \in \mathbb{N}^{m_1 \times \cdots \times m_k},\ x_{i_1,\dots,i_k} = u_{i_1,\dots,i_k},\ \mathrm{supp}(i_1,\dots,i_k) \in \mathcal{F}\}\,.$$
The following result of [15] is an immediate consequence of Theorem 21. It asserts that if two sides of the table are a variable part of the input then the transportation problem is intractable already for short 3-way tables with $\mathcal{F} = \{\{1,2\},\{1,3\},\{2,3\}\}$ supporting all 2-margins (line-sums). This result can be easily extended to k-way tables of any dimension $k \geq 3$ and $\mathcal{F}$ the collection of all h-subsets of $\{1,\dots,k\}$ for any $1 < h < k$ as long as two sides of the table are variable; we omit the proof of this extended result.

Corollary 13 It is NP-complete to decide, given r, c, and line-sums $u \in \mathbb{Z}^{r \times c}$, $v \in \mathbb{Z}^{r \times 3}$, and $z \in \mathbb{Z}^{c \times 3}$, encoded as $[\langle u, v, z \rangle]$, if the following set of tables is nonempty:
$$S := \left\{ x \in \mathbb{N}^{r \times c \times 3} : \sum_i x_{i,j,k} = z_{j,k},\ \sum_j x_{i,j,k} = v_{i,k},\ \sum_k x_{i,j,k} = u_{i,j} \right\}.$$
Proof The integer programming feasibility problem is to decide, given $A \in \mathbb{Z}^{m \times n}$ and $b \in \mathbb{Z}^m$, if $\{y \in \mathbb{N}^n : Ay = b\}$ is nonempty. Given such A and b, the polynomial time algorithm of Theorem 21 produces r, c and $u \in \mathbb{Z}^{r \times c}$, $v \in \mathbb{Z}^{r \times 3}$, and $z \in \mathbb{Z}^{c \times 3}$, such that $\{y \in \mathbb{N}^n : Ay = b\}$ is nonempty if and only if the set S above is nonempty. This reduces integer programming feasibility to short 3-way line-sum transportation feasibility. Since the former is NP-complete (see e. g. [55]), so turns out to be the latter. $\square$

We now show that in contrast, when all sides but one are fixed (but arbitrary), and one side n is variable, then the corresponding long k-way transportation problem for any hierarchical collection of margins is an n-fold integer programming problem and therefore, as a consequence of Theorem 16, can be solved in polynomial time. This extends Corollary 8 established in Sect. "Three-Way Line-Sum Transportation Problems" for 3-way line-sum transportation.

Corollary 14 For every fixed k, table sides $m_1,\dots,m_k$, and family $\mathcal{F}$ of subsets of $\{1,\dots,k+1\}$, there is a polynomial time algorithm that, given n, integer values $u = (u_{i_1,\dots,i_{k+1}})$ for all margins supported on $\mathcal{F}$, and an integer $m_1 \times \cdots \times m_k \times n$ array w, encoded as $[\langle u, w \rangle]$, solves the linear integer multiway transportation problem
$$\max\,\{wx : x \in \mathbb{N}^{m_1 \times \cdots \times m_k \times n},\ x_{i_1,\dots,i_{k+1}} = u_{i_1,\dots,i_{k+1}},\ \mathrm{supp}(i_1,\dots,i_{k+1}) \in \mathcal{F}\}\,.$$
Proof Re-index the arrays as $x = (x^1,\dots,x^n)$ with each $x^j = (x_{i_1,\dots,i_k,j})$ a suitably indexed $m_1 m_2 \cdots m_k$ vector representing the jth layer of x. Then the transportation problem can be encoded as an n-fold integer programming problem in standard form,
$$\max\,\{wx : x \in \mathbb{N}^{nt},\ A^{(n)} x = b\}\,,$$
with an $(r+s) \times t$ defining matrix A where $t := m_1 m_2 \cdots m_k$ and r, s, $A_1$ and $A_2$ are determined from $\mathcal{F}$, and with right-hand side $b := (b^0, b^1,\dots,b^n) \in \mathbb{Z}^{r+ns}$ determined from the margins $u = (u_{i_1,\dots,i_{k+1}})$, in such a way that the equations $A_1(\sum_{j=1}^n x^j) = b^0$ represent the constraints of all margins $x_{i_1,\dots,i_k,+}$ (where summation over layers occurs), whereas the equations $A_2 x^j = b^j$ for $j = 1,\dots,n$ represent the constraints of all margins $x_{i_1,\dots,i_k,j}$ with $j \neq +$ (where summations are within a single layer at a time). Using the algorithm of Theorem 16, this n-fold integer program, and hence the given multiway transportation problem, can be solved in polynomial time. $\square$
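For the special case of 3-way $l \times m \times n$ tables with all line-sums fixed, the blocks $A_1$, $A_2$ of the n-fold encoding in the proof of Corollary 14 can be written down concretely: $A_1 = I_{lm}$ fixes the margins $x_{i,j,+}$, while $A_2$ expresses the row and column sums of a single layer. A sketch of our construction under that reading:

```python
import numpy as np

def three_way_line_sum_blocks(l, m):
    """Blocks A1 (r x t) and A2 (s x t), t = l*m, encoding 3-way l x m x n
    line-sum transportation as an n-fold program (cf. Corollary 14)."""
    t = l * m
    A1 = np.eye(t, dtype=int)                  # sums over layers: x_{i,j,+}
    rows = np.kron(np.eye(l, dtype=int), np.ones((1, m), dtype=int))
    cols = np.kron(np.ones((1, l), dtype=int), np.eye(m, dtype=int))
    A2 = np.vstack([rows, cols])               # within-layer row/column sums
    return A1, A2

A1, A2 = three_way_line_sum_blocks(3, 2)       # layers are 3 x 2 tables
```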
for any hierarchical collection of margins, is an n-fold integer programming problem. Therefore, as a consequence of Theorem 20, we also have the following extension of Corollary 14 for the convex integer multiway transportation problem over long k-way tables. Corollary 15 For every fixed d, k, table sides m1 ; : : : ; m k , and family F of subsets of f1; : : : ; k C 1g, there is a polynomial time algorithm that, given n, integer values u D (u i 1 ;:::;i kC1 ) for all margins supported on F , integer m1    m k  n arrays w1 ; : : : ; w d , and convex functional c : Rd ! R presented by a comparison oracle, encoded as [hu; w1 ; : : : ; wd i], solves the convex integer multiway transportation problem max f c(w1 x; : : : ; wd x) : x 2 N m1 m k n ; x i 1 ;:::;i kC1 D u i 1 ;:::;i kC1 ; supp(i1 ; : : : ; ikC1 ) 2 F g :

Privacy and Entry-Uniqueness A common practice in the disclosure of a multiway table containing sensitive data is to release some of the table margins rather than the table itself, see e. g. [11,18,19] and the references therein. Once the margins are released, the security of any specific entry of the table is related to the set of possible values that can occur in that entry in any table having the same margins as those of the source table in the data base. In particular, if this set consists of a unique value, that of the source table, then this entry can be exposed and privacy can be violated. This raises the following fundamental entryuniqueness problem: given a consistent disclosed (hierarchical) collection of margin values, and a specific entry index, is the value that can occur in that entry in any table having these margins unique? We now describe the results of [48] that settle the complexity of this problem, and interpret the consequences for secure statistical data disclosure. First, we show that if two sides of the table are variable part of the input then the entry-uniqueness problem is intractable already for short 3-way tables with all 2-margins (line-sums) disclosed (corresponding to F D ff1; 2g; f1; 3g; f2; 3gg). This can be easily extended to k-way tables of any dimension k  3 and F the collection of all h-subsets of f1; : : : ; kg for any 1 < h < k as long as two sides of the table are variable; we omit the proof of this extended result. While this result indicates

that the disclosing agency may not be able to check for uniqueness, in this situation, some consolation is in that an adversary will be computationally unable to identify and retrieve a unique entry either. Corollary 16 It is coNP-complete to decide, given r; c, and line-sums u 2 Zrc , v 2 Zr3 , z 2 Zc3 , encoded as [hu; v; zi], if the entry x1;1;1 is the same in all tables in  X x 2 N rc3 : x i; j;k D z j;k ; i

X j

x i; j;k D v i;k ;

X

x i; j;k D u i; j

:

k

Proof The subset-sum problem, well known to be NP-complete, is the following: given positive integers a0 ; a1 ; : : : ; a m , decide if there is an I P f1; : : : ; mg with a0 D i2I a i . We reduce the complement of subset-sum to entry-uniqueness. Given a0 ; a1 ; : : : ; a m , consider the polytope in 2(m C 1) variables y0 ; y1 : : : ; y m ; z0 ; z1 ; : : : ; z m ,  P :D

: a0 y0  (y; z) 2 R2(mC1) C

m X

ai yi D 0 ;

iD1

y i C z i D 1 ; i D 0; 1 : : : ; m

:

First, note that it always has one integer point with y0 D 0, given by y i D 0 and z i D 1 for all i. Second, note that it has an integer point with y0 ¤ 0 if and only if there is P an I f1; : : : ; mg with a0 D i2I a i , given by y0 D 1, y i D 1 for i 2 I, y i D 0 for i 2 f1; : : : ; mg n I, and z i D 1  y i for all i. Lifting P to a suitable r  c  3 line-sum polytope T with the coordinate y0 embedded in the entry x1;1;1 using Theorem 21, we find that T has a table with x1;1;1 D 0, and this value is unique among the tables in T if and only if there is no solution to the  subset-sum problem with a0 ; a1 ; : : : ; a m . Next we show that, in contrast, when all table sides but one are fixed (but arbitrary), and one side n is variable, then, as a consequence of Corollary 14, the corresponding long k-way entry-uniqueness problem for any hierarchical collection of margins can be solved is polynomial time. In this situation, the algorithm of Corollary 17 below allows disclosing agencies to efficiently check possible collections of margins before disclosure: if an entry value is not unique then disclosure

547

548

C

Convex Discrete Optimization

may be assumed secure, whereas if the value is unique then disclosure may be risky and fewer margins should be released. Note that this situation, of long multiway tables, where one category is significantly richer than the others, that is, when each sample point can take many values in one category and only few values in the other categories, occurs often in practical applications, e. g., when one category is the individuals age and the other categories are binary (“yes-no”). In such situations, our polynomial time algorithm below allows disclosing agencies to check entry-uniqueness and make learned decisions on secure disclosure. Corollary 17 For every fixed k, table sides m1 ; : : : ; m k , and family F of subsets of f1; : : : ; k C 1g, there is a polynomial time algorithm that, given n, integer values u D (u j 1 ;:::; j kC1 ) for all margins supported on F , and entry index (i1 ; : : : ; i kC1 ), encoded as [n; hui], decides if the entry x i 1 ;:::;i kC1 is the same in all tables in the set fx 2 N m1 m k n : x j 1 ;:::; j kC1 D u j 1 ;:::; j kC1 ; supp(j1 ; : : : ; jkC1 ) 2 F g : Proof By Corollary 14 we can solve in polynomial time both transportation problems l :D min f x i 1 ;:::;i kC1 : x 2 N m1 m k n ; x 2 TF g ; u :D max f x i 1 ;:::;i kC1 : x 2 N m1 m k n ; x 2 TF g ; over the corresponding k-way transportation polytope m 1 m k n : x j 1 ;:::; j kC1 D u j 1 ;:::; j kC1 ; TF :D f x 2 RC

supp(j1 ; : : : ; jkC1 ) 2 F g : Clearly, entry x i 1 ;:::;i kC1 has the same value in all tables with the given (disclosed) margins if and only if l D u, completing the description of the algorithm and the proof.  References 1. Aho AV, Hopcroft JE, Ullman JD (1975) The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading 2. Allemand K, Fukuda K, Liebling TM, Steiner E (2001) A polynomial case of unconstrained zero-one quadratic optimization. Math Prog Ser A 91:49–52

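A toy version of the entry-uniqueness check for short 3-way tables, with brute-force enumeration standing in for the two optimizations of Corollary 17 (exponential, for illustration only):

```python
import itertools
import numpy as np

def entry_values(u, v, z, bound):
    """All values taken by entry (0,0,0) among tables in N^{r x c x 3}
    with line-sums u, v, z and entries at most bound."""
    r, c = u.shape
    values = set()
    for flat in itertools.product(range(bound + 1), repeat=r * c * 3):
        x = np.array(flat).reshape(r, c, 3)
        if ((x.sum(0) == z).all() and (x.sum(1) == v).all()
                and (x.sum(2) == u).all()):
            values.add(x[0, 0, 0])
    return values   # the entry is unique (a privacy risk) iff len(values) == 1
```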
3. Alon N, Onn S (1999) Separable partitions. Discret Appl Math 91:39–51 4. Aoki S, Takemura A (2003) Minimal basis for connected Markov chain over 3 × 3 × K contingency tables with fixed two-dimensional marginals. Austr New Zeal J Stat 45:229– 249 5. Babson E, Onn S, Thomas R (2003) The Hilbert zonotope and a polynomial time algorithm for universal Gröbner bases. Adv Appl Math 30:529–544 6. Balinski ML, Rispoli FJ (1993) Signature classes of transportation polytopes. Math Prog Ser A 60:127–144 7. Barnes ER, Hoffman AJ, Rothblum UG (1992) Optimal partitions having disjoint convex and conic hulls. Math Prog 54:69–86 8. Berstein Y, Onn S: Nonlinear bipartite matching. Disc Optim (to appear) 9. Boros E, Hammer PL (1989) On clustering problems with connected optima in Euclidean spaces. Discret Math 75: 81–88 10. Chvátal V (1973) Edmonds polytopes and a hierarchy of combinatorial problems. Discret Math 4:305–337 11. Cox LH (2003) On properties of multi-dimensional statistical tables. J Stat Plan Infer 117:251–273 12. De Loera J, Hemmecke R, Onn S, Rothblum UG, Weismantel R: Integer convex maximization via Graver bases. E-print: arXiv:math.CO/0609019. (submitted) 13. De Loera J, Hemmecke R, Onn S, Weismantel R: N-fold integer programming. Disc Optim (to appear) 14. De Loera J, Onn S (2004) All rational polytopes are transportation polytopes and all polytopal integer sets are contingency tables. In: Proc IPCO 10 – Symp on Integer Programming and Combinatoral Optimization, Columbia University, New York. Lec Not Comp Sci. Springer, 3064, pp 338–351 15. De Loera J, Onn S (2004) The complexity of three-way statistical tables. SIAM J Comput 33:819–836 16. De Loera J, Onn S (2006) All linear and integer programs are slim 3-way transportation programs. SIAM J Optim 17:806–821 17. De Loera J, Onn S (2006) Markov bases of three-way tables are arbitrarily complicated. J Symb Comput 41:173–181 18. Domingo-Ferrer J, Torra V (eds) (2004) Privacy in Statistical Databases. Proc. PSD 2004 – Int Symp Privacy in Statistical Databases, Barcelona, Spain. Lec Not Comp Sci. Springer, 3050 19. Doyle P, Lane J, Theeuwes J, Zayatz L (eds) (2001) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. North-Holland, Amsterdam 20. Edelsbrunner H, O’Rourke J, Seidel R (1986) Constructing arrangements of lines and hyperplanes with applications. SIAM J Comput 15:341–363 21. Edelsbrunner H, Seidel R, Sharir M (1991) On the zone theorem for hyperplane arrangements. In: New Results and

Trends in Computer Science. Lec Not Comp Sci. Springer, 555, pp 108–123 22. Edmonds J (1971) Matroids and the greedy algorithm. Math Prog 1:127–136 23. Edmonds J, Karp RM (1972) Theoretical improvements in algorithmic efficiency of network flow problems. J Ass Comput Mach 19:248–264 24. Frank A, Tardos E (1987) An application of simultaneous Diophantine approximation in combinatorial optimization. Combinatorica 7:49–65 25. Fukuda K, Onn S, Rosta V (2003) An adaptive algorithm for vector partitioning. J Global Optim 25:305–319 26. Garey MR, Johnson DS (1979) Computers and Intractability. Freeman, San Francisco 27. Gilmore PC, Gomory RE (1961) A linear programming approach to the cutting-stock problem. Oper Res 9:849–859 28. Graver JE (1975) On the foundation of linear and integer programming I. Math Prog 9:207–226 29. Gritzmann P, Sturmfels B (1993) Minkowski addition of polytopes: complexity and applications to Gröbner bases. SIAM J Discret Math 6:246–269 30. Grötschel M, Lovász L (1995) Combinatorial optimization. In: Handbook of Combinatorics. North-Holland, Amsterdam, pp 1541–1597 31. Grötschel M, Lovász L, Schrijver A (1993) Geometric Algorithms and Combinatorial Optimization, 2nd edn. Springer, Berlin 32. Grünbaum B (2003) Convex Polytopes, 2nd edn. Springer, New York 33. Harding EF (1967) The number of partitions of a set of n points in k dimensions induced by hyperplanes. Proc Edinburgh Math Soc 15:285–289 34. Hassin R, Tamir A (1989) Maximizing classes of two-parameter objectives over matroids. Math Oper Res 14:362–375 35. Hemmecke R (2003) On the positive sum property and the computation of Graver test sets. Math Prog 96:247–269 36. Hemmecke R, Hemmecke R, Malkin P (2005) 4ti2 Version 1.2 – Computation of Hilbert bases, Graver bases, toric Gröbner bases, and more. http://www.4ti2.de/. Accessed Sept 2005 37. Hoffman AJ, Kruskal JB (1956) Integral boundary points of convex polyhedra. In: Linear Inequalities and Related Systems, Ann Math Stud 38. Princeton University Press, Princeton, pp 223–246 38. Hoşten S, Sullivant S (2007) A finiteness theorem for Markov bases of hierarchical models. J Comb Theory Ser A 114:311–321 39. Hwang FK, Onn S, Rothblum UG (1999) A polynomial time algorithm for shaped partition problems. SIAM J Optim 10:70–81 40. Khachiyan LG (1979) A polynomial algorithm in linear programming. Sov Math Dok 20:191–194 41. Klee V, Kleinschmidt P (1987) The d-step conjecture and its relatives. Math Oper Res 12:718–755


42. Klee V, Witzgall C (1968) Facets and vertices of transportation polytopes. In: Mathematics of the Decision Sciences, Part I, Stanford, CA, 1967. AMS, Providence, pp 257–282 43. Kleinschmidt P, Lee CW, Schannath H (1987) Transportation problems which can be solved by the use of Hirschpaths for the dual problems. Math Prog 37:153–168 44. Lenstra AK, Lenstra HW Jr, Lovász L (1982) Factoring polynomials with rational coefficients. Math Ann 261:515–534 45. Lovàsz L (1986) An Algorithmic Theory of Numbers, Graphs, and Convexity. CBMS-NSF Ser App Math, SIAM 50:iv+91 46. Onn S (2006) Convex discrete optimization. Lecture Series, Le Séminaire de Mathématiques Supérieures, Combinatorial Optimization: Methods and Applications, Université de Montréal, Canada, June 2006. http://ie.technion.ac. il/~onn/Talks/Lecture_Series.pdf and at http://www.dms. umontreal.ca/sms/ONN_Lecture_Series.pdf. Accessed 19– 30 June 2006 47. Onn S (2003) Convex matroid optimization. SIAM J Discret Math 17:249–253 48. Onn S (2006) Entry uniqueness in margined tables. In: Proc. PSD 2006 – Symp. on Privacy in Statistical Databases, Rome, Italy. Lec Not Comp Sci. Springer, 4302, pp 94–101 49. Onn S, Rothblum UG (2004) Convex combinatorial optimization. Disc Comp Geom 32:549–566 50. Onn S, Schulman LJ (2001) The vector partition problem for convex objective functions. Math Oper Res 26:583–590 51. Onn S, Sturmfels B (1999) Cutting Corners. Adv Appl Math 23:29–48 52. Pistone G, Riccomagno EM, Wynn HP (2001) Algebraic Statistics. Chapman and Hall, London 53. Queyranne M, Spieksma FCR (1997) Approximation algorithms for multi-index transportation problems with decomposable costs. Disc Appl Math 76:239–253 54. Santos F, Sturmfels B (2003) Higher Lawrence configurations. J Comb Theory Ser A 103:151–164 55. Schrijver A (1986) Theory of Linear and Integer Programming. Wiley, New York 56. Schulz A, Weismantel R (2002) The complexity of generic primal algorithms for solving general integral programs. Math Oper Res 27:681–692 57. Schulz A, Weismantel R, Ziegler GM (1995) (0; 1)-integer programming: optimization and augmentation are equivalent. In: Proc 3rd Ann Euro Symp Alg. Lec Not Comp Sci. Springer, 979, pp 473–483 58. Sturmfels B (1996) Gröbner Bases and Convex Polytopes. Univ Lec Ser 8. AMS, Providence 59. Tardos E (1986) A strongly polynomial algorithm to solve combinatorial linear programs. Oper Res 34:250–256 60. Vlach M (1986) Conditions for the existence of solutions of the three-dimensional planar transportation problem. Discret Appl Math 13:61–78 61. Wallacher C, Zimmermann U (1992) A combinatorial interior point method for network flow problems. Math Prog 56:321–335





62. Yemelichev VA, Kovalev MM, Kravtsov MK (1984) Polytopes, Graphs and Optimisation. Cambridge University Press, Cambridge 63. Yudin DB, Nemirovskii AS (1977) Informational complexity and efficient methods for the solution of convex extremal problems. Matekon 13:25–45 64. Zaslavsky T (1975) Facing up to arrangements: face count formulas for partitions of space by hyperplanes. Memoirs Amer Math Soc 154:vii+102 65. Ziegler GM (1995) Lectures on Polytopes. Springer, New York

Convex Envelopes in Optimization Problems

YASUTOSHI YAJIMA
Tokyo Institute of Technology, Tokyo, Japan

MSC2000: 90C26

Article Outline

Keywords
See also
References

Keywords

Convex underestimator; Nonconvex optimization

Let $f : S \to \mathbb{R}$ be a lower semicontinuous function, where $S \subset \mathbb{R}^n$ is a nonempty convex subset. The convex envelope taken over S is a function $f_S : S \to \mathbb{R}$ such that
- $f_S$ is a convex function defined over the set S;
- $f_S(x) \le f(x)$ for all $x \in S$;
- if h is any other convex function such that $h(x) \le f(x)$ for all $x \in S$, then $h(x) \le f_S(x)$ for all $x \in S$.

In other words, $f_S$ is the pointwise supremum among all convex underestimators of f over S, and is uniquely determined. The following demonstrates the most fundamental properties shown by [3,6]. Suppose that the minimum of f over S exists. Then

$$\min \{ f(x) : x \in S \} = \min \{ f_S(x) : x \in S \}$$

and

$$\{ x^* : f(x^*) \le f(x),\ \forall x \in S \} \subseteq \{ x^* : f_S(x^*) \le f_S(x),\ \forall x \in S \}.$$

The properties indicate that an optimal solution of a nonconvex minimization problem could be obtained by minimizing the associated convex envelope. In general, however, finding the convex envelope is at least as difficult as solving the original problem. Several practical results have been proposed for special classes of objective functions and constraints.

Suppose that the function f is concave and S is a polytope with vertices $v_0, \dots, v_K$. Then the convex envelope $f_S$ over S can be expressed as

$$f_S(x) = \min \Big\{ \sum_{i=0}^{K} \alpha_i f(v_i) : \sum_{i=0}^{K} \alpha_i v_i = x,\ \sum_{i=0}^{K} \alpha_i = 1,\ \alpha_i \ge 0 \Big\}.$$

In particular, when S is an n-dimensional simplex ($K = n$), the convex envelope is the affine function $f_S(x) = a^\top x + b$, which is uniquely determined by solving the following linear system:

$$a^\top v_i + b = f(v_i), \quad i = 0, \dots, n.$$

The properties above have been used to solve concave minimization problems with linear constraints [4,6].
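For a fixed point x, the polytope formula above is a linear program in the weights $\alpha_i$, so the envelope of a concave function can be evaluated pointwise with any LP solver. A minimal sketch, assuming SciPy is available (the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def envelope_at(x, vertices, f):
    """Evaluate the convex envelope of a concave f over conv(vertices) at x:
        f_S(x) = min { sum_i a_i f(v_i) : sum_i a_i v_i = x,
                       sum_i a_i = 1, a >= 0 }."""
    V = np.asarray(vertices, dtype=float)        # K+1 vertices, as rows
    fv = np.array([f(v) for v in V])             # objective: sum a_i f(v_i)
    A_eq = np.vstack([V.T, np.ones(len(V))])     # sum a_i v_i = x, sum a_i = 1
    b_eq = np.append(np.asarray(x, float), 1.0)
    return linprog(fv, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun

# Concave f(x) = -||x||^2 over the unit square: envelope at the center.
sq = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(envelope_at([0.5, 0.5], sq, lambda v: -np.dot(v, v)))  # -1.0 < f = -0.5
```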


The following property, shown in [1,5], is frequently used in the literature. For each $i = 1, \dots, p$, let $f_i : S_i \to \mathbb{R}$ be a continuous function, where $S_i \subset \mathbb{R}^{n_i}$, and let $n = n_1 + \cdots + n_p$. If

$$f(x) = \sum_{i=1}^{p} f_i(x_i) \quad \text{and} \quad S = S_1 \times \cdots \times S_p,$$

where $x_i \in \mathbb{R}^{n_i}$, $i = 1, \dots, p$, and $x = (x_1, \dots, x_p) \in \mathbb{R}^n$, then the convex envelope $f_S(x)$ can be expressed as

$$f_S(x) = \sum_{i=1}^{p} f_{S_i}(x_i).$$

In particular, let $f(x) = \sum_{i=1}^{n} f_i(x_i)$ be a separable function, where $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, and let $f_i(x_i)$ be concave for each $i = 1, \dots, n$. Then the convex envelope of $f(x)$ over the rectangle $R = \{ x \in \mathbb{R}^n : a_i \le x_i \le b_i,\ i = 1, \dots, n \}$ is the affine function given by the sum of the linear functions below:

$$f_R(x) = \sum_{i=1}^{n} l_i(x_i),$$

where $l_i(x_i)$ meets $f_i(x_i)$ at both ends of the interval $a_i \le x_i \le b_i$ for each $i = 1, \dots, n$.

B. Kalantari and J.B. Rosen [7] give an algorithm for the global minimization of a concave quadratic function over a polytope. They exploit convex envelopes of separable functions over rectangles to generate lower bounds in a branch and bound scheme. Also, convex envelopes of bilinear functions over rectangles have been proposed in [2]. Consider the following rectangles:

$$\Omega_i = \{ (x_i, y_i) : l_i \le x_i \le L_i,\ m_i \le y_i \le M_i \}, \quad i = 1, \dots, n,$$

and let

$$f_i(x_i, y_i) = x_i y_i, \quad i = 1, \dots, n,$$

be bilinear functions of two variables. It has been shown that, for each $i = 1, \dots, n$, the convex envelope of $f_i(x_i, y_i)$ over $\Omega_i$ is

$$f_{\Omega_i}(x_i, y_i) = \max \{ m_i x_i + l_i y_i - l_i m_i,\ M_i x_i + L_i y_i - L_i M_i \}.$$

Moreover, it can be verified that $f_{\Omega_i}(x_i, y_i)$ agrees with $f_i(x_i, y_i)$ at the four extreme points of $\Omega_i$. Thus, the convex envelope of the general bilinear function

$$f(x, y) = x^\top y = \sum_{i=1}^{n} f_i(x_i, y_i),$$

where $x^\top = (x_1, \dots, x_n)$ and $y^\top = (y_1, \dots, y_n)$, over $\Omega = \Omega_1 \times \cdots \times \Omega_n$ can be expressed as

$$f_\Omega(x, y) = \sum_{i=1}^{n} f_{\Omega_i}(x_i, y_i).$$

Another characterization of convex envelopes of bilinear functions, over a special type of polytope which includes a rectangle as a special case, is derived in [8].
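The componentwise max-of-two-affine-functions formula above is cheap to evaluate; the following small sketch (ours, assuming NumPy) computes $f_\Omega$ and illustrates the underestimation at an interior point:

```python
import numpy as np

def bilinear_envelope(x, y, lo_x, hi_x, lo_y, hi_y):
    """Convex envelope of f(x, y) = x^T y over a product of rectangles
    l_i <= x_i <= L_i, m_i <= y_i <= M_i: the componentwise maximum of
    two affine functions, summed over i (the formula quoted above)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    l, L = np.asarray(lo_x, float), np.asarray(hi_x, float)
    m, M = np.asarray(lo_y, float), np.asarray(hi_y, float)
    return np.sum(np.maximum(m * x + l * y - l * m, M * x + L * y - L * M))

# One component on [0,1] x [0,1]: the envelope matches x*y at the four
# corners and underestimates it in the interior.
print(bilinear_envelope([0.5], [0.5], [0], [1], [0], [1]))   # 0.0 <= 0.25
```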


See also
- αBB Algorithm
- Global Optimization in Generalized Geometric Programming
- MINLP: Global Optimization with αBB

References 1. Al-Khayyal FA (1990) Jointly constrained bilinear programs and related problems: An overview. Comput Math Appl 19(11):53–62 2. Al-Khayyal FA, Falk JE (1983) Jointly constrained biconvex programming. Math Oper Res 8(2):273–286 3. Bazaraa MS, Sherali HD, Shetty CM (1993) Nonlinear programming: Theory and algorithms. Wiley, New York 4. Benson HP, Sayin S (1994) A finite concave minimization algorithm using branch and bound and neighbor generation. J Global Optim 5(1):1–14 5. Falk JE (1969) Lagrange multipliers and nonconvex programs. SIAM J Control 7:534–545 6. Falk JE, Hoffman K (1976) A successive underestimation method for concave minimization problems. Math Oper Res 1(3):251–259 7. Kalantari B, Rosen JB (1987) An algorithm for global minimization of linearly constrained concave quadratic functions. Math Oper Res 12(3):544–561 8. Sherali HD, Alameddine A (1990) An explicit characterization of the convex envelope of a bivariate bilinear function over special polytopes. Ann Oper Res 25(1–4):197–209

Convexifiable Functions, Characterization of¹

SANJO ZLOBEC
Department of Mathematics and Statistics, McGill University, Montreal, Canada

MSC2000: 90C25, 90C26, 90C30, 90C31, 25A15, 34A05

Article Outline

Introduction
Definitions
Characterizations of a Convexifiable Function
Canonical Form of Smooth Programs
Other Applications
Conclusions
References

¹ Research partly supported by NSERC of Canada.





Introduction

A twice continuously differentiable function in several variables, when considered on a compact convex set C, becomes convex if an appropriate convex quadratic is added to it, e.g. [2]. Equivalently, a twice continuously differentiable function is the difference of a convex function and a convex quadratic on C. This decomposition is valid also for smooth functions with Lipschitz derivatives [8]. Here we recall three conditions that are both necessary and sufficient for the decomposition [9,10]. We also list several implications of the convexification in optimization and applied mathematics [10,11]. A different notion of convexification is studied in, e.g., [6]; see also [3,5].

Definitions

Convexifiable Functions, Characterization of, Figure 1: The function f(t) = cos t and its convexification

Definition 1 ([7,10]) Given a continuous function $f : \mathbb{R}^n \to \mathbb{R}$ on a compact convex set C of the Euclidean space $\mathbb{R}^n$, consider $\varphi : \mathbb{R}^{n+1} \to \mathbb{R}$ defined by $\varphi(x, \alpha) = f(x) - \tfrac{1}{2}\alpha x^\top x$, where $x^\top$ is the transpose of x. If $\varphi(x, \alpha)$ is convex on C for some $\alpha = \alpha^*$, then $\varphi(x, \alpha^*)$ is said to be a convexification of f and $\alpha^*$ is its convexifier on C. The function f is convexifiable if it has a convexification.

Observation If $\alpha^*$ is a convexifier of f on a compact convex set C, then so is every $\alpha < \alpha^*$.

Illustration 1 Consider $f(t) = \cos t$ on, say, $-\pi \le t \le 2\pi$. This function is convexifiable; its convexifier is any $\alpha \le -1$. For, e.g., $\alpha^* = -2$, its convexification is $\varphi(t, -2) = \cos t + t^2$. Note that $f(t)$ is the difference of the (strictly) convex $\varphi(t, \alpha) = \cos t - \tfrac{1}{2}\alpha t^2$ and the (strictly) convex quadratic $-\tfrac{1}{2}\alpha t^2$ for every sufficiently small $\alpha$. The graphs of $f(t)$ and its convexification $\varphi(t, -2)$ are depicted in Fig. 1.
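Illustration 1 is easy to verify numerically: the second derivative of $\varphi(t, \alpha) = \cos t - \tfrac{1}{2}\alpha t^2$ is $-\cos t - \alpha$, which is nonnegative on C exactly when $\alpha \le -1$. A small check, assuming NumPy:

```python
import numpy as np

# Sanity check of Definition 1 for f(t) = cos t on C = [-pi, 2*pi]:
# phi(t, a) = cos t - 0.5*a*t**2 has second derivative -cos t - a, so any
# a <= -1 is a convexifier; a = -2 gives the convexification cos t + t**2.
t = np.linspace(-np.pi, 2 * np.pi, 2001)

def phi_second_derivative(a):
    return -np.cos(t) - a            # d^2/dt^2 [cos t - 0.5*a*t^2]

print(phi_second_derivative(-2.0).min() >= 0)   # True: convex on C
print(phi_second_derivative(-0.5).min() >= 0)   # False: -0.5 is too large
```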

Convexifiable Functions, Characterization of, Figure 2: The mid-point acceleration function of f(t) = cos t

Characterizations of a Convexifiable Function

One can characterize convexifiable functions using the fact that a continuous $f : \mathbb{R}^n \to \mathbb{R}$ is convex if, and only if, f is mid-point convex, i.e., $f((x+y)/2) \le \tfrac{1}{2}(f(x) + f(y))$, $x, y \in C$, e.g. [4]. Denote the norm of $u \in \mathbb{R}^n$ by $\|u\| = (u^\top u)^{1/2}$. With a continuous $f : \mathbb{R}^n \to \mathbb{R}$ one can associate $\Delta : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$:

Definition 2 ([10]) Consider a continuous $f : \mathbb{R}^n \to \mathbb{R}$ on a compact convex set C in $\mathbb{R}^n$. The mid-point acceleration function of f on C is the function

$$\Delta(x, y) = \frac{4}{\|x - y\|^2} \Big[ f(x) + f(y) - 2 f\Big(\frac{x+y}{2}\Big) \Big], \quad x, y \in C,\ x \ne y.$$

The function $\Delta$ describes a mid-point “displacement of the displacement” (i.e., the “acceleration”) of f at x between x and y along the direction $y - x$. The graph of $\Delta$ for the scalar function $f(t) = \cos t$ is depicted in Fig. 2. Using $\Delta$ one can characterize a convexifiable function:

Theorem 1 ([10]) Consider a continuous $f : \mathbb{R}^n \to \mathbb{R}$ on a compact convex set C in $\mathbb{R}^n$. The function f is convexifiable on C if, and only if, its mid-point acceleration function $\Delta$ is bounded from below on C.

For scalar functions one can also use a determinant:

Theorem 2 (Determinant characterization of scalar convexifiable functions, [9]) A continuous scalar function $f : \mathbb{R} \to \mathbb{R}$ is convexifiable on a compact convex interval I if, and only if, there exists a number $\alpha$ such that for every three points $s < t < \tau$ in I

$$\det \begin{pmatrix} 1 & 1 & 1 \\ s & t & \tau \\ f(s) & f(t) & f(\tau) \end{pmatrix} \ge \frac{1}{2}\, \alpha\, (s - t)(t - \tau)(\tau - s).$$

Illustration 2 The function $f(t) = -|t|^{3/2}$ on $C = [-1, 1]$ is continuously differentiable but it is not convexifiable. Indeed, for $s = 0$, $\Delta(0, t) = -2^{5/2}(1 - 2^{-1/2})/t^{1/2} \to -\infty$ as $t > 0$, $t \to 0$. Also, using Theorem 2 at $s = -\varepsilon$, $t = 0$, $\tau = \varepsilon > 0$, we find that there is no $\alpha$ such that $\alpha \le -2/\sqrt{\varepsilon}$ as $\varepsilon \to 0$. The function $g(t) = -|t|$ is not convexifiable around the origin.

Scalar convexifiable functions can be represented explicitly on a compact interval I:

Theorem 3 (Explicit representation of scalar convexifiable functions, [9]) A continuous scalar function $f : I \to \mathbb{R}$ is convexifiable if, and only if, there exists a number $\alpha$ such that

$$f(t) = f(c) + \frac{1}{2}\,\alpha\,(t^2 - c^2) + \int_c^t g(\tau, \alpha)\, d\tau.$$

Here $c, t \in I$, $c < t$, and $g = g(\cdot, \alpha) : I \to \mathbb{R}$ is a nondecreasing right-continuous function. An implication of this result is that every smooth function with a Lipschitz derivative, in particular every analytic function and every trajectory of an object governed by Newton's Second Law, is of this form.

Two important classes of functions are convexifiable, and a convexifier $\alpha$ can be given explicitly. First, if f is twice continuously differentiable, then the second derivative of f at x is represented by the Hessian matrix $H(x) = (\partial^2 f(x)/\partial x_i \partial x_j)$, $i, j = 1, \dots, n$. This is a symmetric matrix with real eigenvalues. Denote its smallest eigenvalue at x by $\lambda(x)$ and its “globally” smallest eigenvalue over a compact convex set C by

$$\lambda^* = \min_{x \in C} \lambda(x).$$


Corollary 1 ([7]) A twice continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convexifiable on a compact convex set C in $\mathbb{R}^n$, and $\alpha^* = \lambda^*$ is a convexifier.

Suppose that f is a continuously differentiable (smooth) function with the derivative satisfying the Lipschitz property, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for every $x, y \in C$ and some constant L. Here $\nabla f(u)$ is the (Fréchet) derivative of f at u. We represent the derivative of f at x by a column n-tuple gradient $\nabla f(x) = (\partial f(x)/\partial x_i)$.

Corollary 2 ([8]) A continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, with the derivative having the Lipschitz property with a constant L on a compact convex set C in $\mathbb{R}^n$, is convexifiable on C, and $\alpha^* = -L$ is a convexifier.

A Lipschitz function may not be convexifiable. For example, $f(t) = t^2 \sin(1/t)$ for $t \ne 0$ and $f(0) = 0$ is a Lipschitz function and it is also differentiable (but not continuously differentiable). Its derivative is uniformly bounded, but the function is not convexifiable, e.g. [7,11].
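The failure of convexifiability for $f(t) = t^2 \sin(1/t)$ can be seen numerically through the mid-point acceleration $\Delta$ of Definition 2, which is unbounded below near the origin. A small demonstration, assuming NumPy (the sampling points and step sizes are our choices):

```python
import numpy as np

# The midpoint acceleration Delta of f(t) = t**2 * sin(1/t) (f(0) = 0) is
# unbounded below near the origin, so by Theorem 1 f is not convexifiable.
# Sample pairs centered at t_k where sin(1/t_k) = 1, so that
# f''(t_k) = 2 - 1/t_k**2 -> -infinity.
f = lambda t: t**2 * np.sin(1.0 / t)

for k in [1, 10, 100, 1000]:
    t = 1.0 / ((2 * k + 0.5) * np.pi)            # sin(1/t) = 1 here
    h = 0.1 * t**2                               # stay inside one oscillation
    delta = 4 * (f(t - h) + f(t + h) - 2 * f(t)) / (2 * h) ** 2
    print(f"k={k:5d}  Delta ~ {delta:.3e}")      # increasingly negative
```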

Canonical Form of Smooth Programs

Every mathematical program

(NP)  Min $f(x)$,  s.t. $f_i(x) \le 0$, $i \in P = \{1, \dots, m\}$, $x \in C$,

where the functions $f, f_i : \mathbb{R}^n \to \mathbb{R}$, $i \in P$, are continuous and convexifiable on a compact convex set C, can be reduced to a canonical form. First one considers some convexifications of these functions: $\varphi(x, \alpha) = f(x) - \tfrac{1}{2}\alpha x^\top x$ and $\varphi_i(x, \alpha_i) = f_i(x) - \tfrac{1}{2}\alpha_i x^\top x$, where $\alpha, \alpha_i$ are, respectively, arbitrary convexifiers of $f, f_i$, $i \in P$. Then one associates with (NP) the following program with partly linear convexifications (LF, $\xi$, $\varepsilon$):

Min$_{(x,\xi)}$ $\varphi(x, \alpha) + \tfrac{1}{2}\alpha x^\top \xi$
s.t. $\varphi_i(x, \alpha_i) + \tfrac{1}{2}\alpha_i x^\top \xi \le 0$, $i \in P$,
$x \in C$, $\|x - \xi\| \le \varepsilon$.

Here $\varepsilon \ge 0$ is a scalar parameter. This parameter was fixed at zero value in [2]. For the sake of “numerical stability” it was extended to $\varepsilon \ge 0$ in [7]. If the norm is chosen to be the uniform norm, i.e., $\|u\|_\infty = \max_{i=1,\dots,n} |u_i|$,





then (LF, $\xi$, $\varepsilon$) is a convex program in x for every fixed $(\xi, \varepsilon)$ and linear in $(\xi, \varepsilon)$ for every x. Such programs are called partly linear-convex. The theory of optimality and stability for such programs and related models is well studied, e.g. [1,8].

Remark Since one can construct the program (LF, $\xi$, $\varepsilon$) for every (NP) with convexifiable functions, we refer to (LF, $\xi$, $\varepsilon$) as the parametric Liu–Floudas canonical form of (NP). Let us relate an optimal solution of (NP) to optimal solutions $x^0(\varepsilon)$, $\xi^0(\varepsilon)$ of (LF, $\xi$, $\varepsilon$).

Theorem 4 ([8,9]) Consider (NP) with a unique optimal solution, where all functions are assumed to be convexifiable, and its partly linear-convex program (LF, $\xi$, $\varepsilon$). Then a feasible $x^*$ is an optimal solution of (NP) if, and only if, $x^* = \lim_{\varepsilon \to 0} x^0(\varepsilon)$ and $\xi^* = \lim_{\varepsilon \to 0} \xi^0(\varepsilon)$, with $x^* = \xi^*$. Moreover, the feasible set mapping of (LF, $\xi$, $\varepsilon$) is lower semi-continuous at $\xi^*$ and $\varepsilon = 0$, relative to all feasible perturbations of $(\xi, \varepsilon)$.

Other Applications

There are many other areas of applications of convexifiable functions:
(i) Every convexifiable function f is the difference of a convex function $\varphi(x, \alpha)$ and a convex quadratic $-\tfrac{1}{2}\alpha x^\top x$ for every sufficiently small $\alpha$ on a compact convex set. Hence it follows that the results for convex functions can be applied to $\varphi(x, \alpha)$. With minor adjustments, pertaining to the quadratic term, such results can be extended to convexifiable (generally nonconvex) functions. Here is an illustration of how this works for the mean value. The result is well known for convex functions (the case $\alpha = 0$).

Theorem 5 ([9]) Consider a continuous scalar convexifiable function $f : \mathbb{R} \to \mathbb{R}$ on an open interval (a, b) with a convexifier $\alpha$. Then

$$\frac{1}{d - c} \int_c^d f(\tau)\, d\tau \le \frac{1}{2}\big[ f(c) + f(d) \big] - \frac{1}{12}\, \alpha\, (d - c)^2$$

for every $a < c < d < b$.
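Theorem 5 is easy to check numerically; with $\alpha = -1$ for $f = \cos$ (cf. Illustration 1) it is a Hermite–Hadamard-type bound on the mean value. A quick sketch, assuming NumPy:

```python
import numpy as np

# Check Theorem 5 for f(t) = cos t with convexifier a = -1 on (c, d):
# the average of f must not exceed (f(c)+f(d))/2 - a*(d-c)**2/12.
f = np.cos
a = -1.0
c, d = 0.3, 2.1
ts = np.linspace(c, d, 100001)
mean = np.trapz(f(ts), ts) / (d - c)              # (1/(d-c)) * integral
bound = 0.5 * (f(c) + f(d)) - a * (d - c) ** 2 / 12
print(mean <= bound, mean, bound)                 # True
```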

… $\{ i : x_i > 0 \}$ is independent). Therefore, the extreme points of the set defined by (4) are given by the intersection of the extreme rays of $C_G$ with the hyperplane $O_n \bullet X = 1$. Since the optimum value of a linear function over a convex set is attained at an extreme point, there is an optimum solution of the form

$$X^* = x^* {x^*}^\top, \quad x^* \in \mathbb{R}^n,\ x^* \ge 0,\ \|x^*\| = 1,$$

and where $x^*$ supports an independent set. Therefore, we can reformulate the program (3) as

$$\alpha(G) = \max_x \Big( \sum_{i=1}^n x_i \Big)^2, \quad \|x\| = 1,\ x \ge 0,\ x_i x_j > 0 \Rightarrow (i, j) \notin E.$$

Then, it is easy to see that the maximum is attained when x supports a maximum independent set and all $x_i > 0$ are equal to $1/\sqrt{\alpha(G)}$. This provides the optimum value to the program (3) equal to $\alpha(G)$. QED.

Since $X \in C_n^*$ is always nonnegative, we can reduce the set of constraints $x_{ij} = 0$, $(i, j) \in E$ in (4) to a single constraint $A \bullet X = 0$. Thus, the following copositive program is dual to (3), (4):

$$\alpha(G) = \min_{\lambda, y \in \mathbb{R}} \{ \lambda : \lambda I + yA - O_n = Q,\ Q \in C_n \}.$$

Therefore, the maximum independent set problem is reducible to copositive programming. See also [1,2] for the reduction of the standard quadratic optimization problem to copositive programming. Furthermore, it can be shown that checking if a given matrix is not copositive is NP-complete [5] and, hence, checking matrix copositivity is co-NP-complete.

Models Approximating $C_n$ with Linear Matrix Inequalities

While, in general, there is no polynomial-time verifiable certificate of copositivity, unless co-NP = NP, in many cases it is still possible to show by a short argument that a matrix is copositive. For instance, if the matrix M can be represented as the sum of a positive semidefinite matrix $S \in S_n^+$ and a nonnegative matrix $N \in N_n$, then it follows that $M \in C_n$. Hence, we can obtain a semidefinite relaxation of a copositive program over M by introducing the linear matrix constraints:

$$M = S + N, \quad N \ge 0,\ S \in S_n^+.$$

Parrilo showed in [6] that, using sufficiently large systems of linear matrix inequalities, one can approximate the copositive cone $C_n$ to any desired accuracy. Obviously, copositivity of the matrix M is equivalent to (global) nonnegativity of the fourth-degree form

$$P(x) = (x \circ x)^\top M (x \circ x) = \sum_{i=1}^n \sum_{j=1}^n M_{ij}\, x_i^2 x_j^2 \ge 0, \quad x \in \mathbb{R}^n, \qquad (5)$$

where “$\circ$” denotes the componentwise (Hadamard) product. It is shown in [6] that the mentioned decomposition into positive semidefinite and nonnegative matrices exists if and only if P(x) can be represented as a sum of squares. Higher-order sufficient conditions for copositivity proposed by Parrilo in [6] correspond to checking whether the polynomial

$$P^{(r)}(x) = \Big( \sum_{i=1}^n x_i^2 \Big)^r P(x) \qquad (6)$$

has a sum-of-squares decomposition (or, a weaker condition, whether $P^{(r)}(x)$ has only nonnegative coefficients). These conditions can be expressed via linear matrix inequalities over $n^r \times n^r$ symmetric matrices. In particular, for $r = 1$, Parrilo showed that the existence of a sum-of-squares decomposition of $P^{(1)}(x)$ is equivalent to feasibility of the following system (see also [3]):

$$\begin{cases} M - M^{(i)} \in S_n^+, & i = 1, \dots, n, \\ M^{(i)}_{ii} = 0, & i = 1, \dots, n, \\ M^{(i)}_{jj} + 2 M^{(j)}_{ij} = 0, & i \ne j, \\ M^{(i)}_{jk} + M^{(j)}_{ik} + M^{(k)}_{ij} \ge 0, & i < j < k, \end{cases}$$

where $M^{(i)}$ ($i = 1, \dots, n$) are symmetric matrices. With sufficiently large r, the convergence to the copositivity constraint on M is guaranteed by the famous theorem of Pólya [7]:

Theorem 3 (Pólya) Let f be a homogeneous polynomial which is positive on the simplex

$$\Delta_n = \Big\{ x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1,\ x \ge 0 \Big\}.$$

Then, for a sufficiently large N, all the coefficients of the polynomial

$$\Big( \sum_{i=1}^n x_i \Big)^N f(x)$$

are positive.
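The weaker of Parrilo's conditions, nonnegativity of all coefficients of $P^{(r)}$, is easy to test symbolically, and Pólya's theorem guarantees that it eventually succeeds for matrices with $x^\top M x > 0$ on the simplex. A sketch assuming SymPy (the function name is ours):

```python
import sympy as sp

def parrilo_condition(M, r):
    """Weaker Parrilo condition quoted above: do all coefficients of
    P^(r)(x) = (sum_i x_i**2)**r * (x o x)^T M (x o x) come out nonnegative?
    By Polya's theorem this eventually holds when x^T M x > 0 on the
    simplex, certifying copositivity of M."""
    n = len(M)
    x = sp.symbols(f"x0:{n}")
    P = sp.expand(sum(M[i][j] * x[i]**2 * x[j]**2
                      for i in range(n) for j in range(n)))
    Pr = sp.expand(sum(xi**2 for xi in x)**r * P)
    return all(coef >= 0 for coef in sp.Poly(Pr, *x).coeffs())

# M is positive definite (hence copositive); the certificate appears at r = 4.
M = [[1, -1], [-1, 2]]
print([r for r in range(6) if parrilo_condition(M, r)])    # [4, 5]
```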

References 1. Bomze IM, de Klerk E (2002) Solving standard quadratic optimization problems via linear, semidefinite and copositive programming. J Glob Optim 24:163–185 2. Bomze IM, Dür M, de Klerk E, Roos C, Quist AJ, Terlaky T (2000) On copositive programming and standard quadratic optimization problems. J Glob Optim 18:301–320 3. de Klerk E, Pasechnik D (2001) Approximation of the stability number of a graph via copositive programming. SIAM J Optim 12(4):875–892 4. Johansson M (1999) Piecewise linear control systems. PhD thesis, Lund Institute of Technology, Lund 5. Murty KG, Kabadi SN (1987) Some NP-complete problems in quadratic and nonlinear programming. Math Prog 39:117–129 6. Parrilo PA (2000) Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, Pasadena, CA, http://www.mit.edu/~parrilo/pubs 7. Pólya G (1928) Über positive Darstellung von Polynomen. Vierteljschr Naturforsch Ges Zürich 73:141–145 (Collected Papers, vol 2, MIT Press, Cambridge, MA, London, 1974, 309–313) 8. Quist AJ, de Klerk E, Roos C, Terlaky T (1998) Copositive relaxation for general quadratic programming. Optim Method Softw 9:185–208 9. Renegar J (2001) A Mathematical View of Interior-Point Methods in Convex Optimization. SIAM, Philadelphia

Cost Approximation Algorithms CA Algorithms MICHAEL PATRIKSSON Department Math., Chalmers University Technol., Göteborg, Sweden MSC2000: 90C30

C

Article Outline Keywords Instances of the CA Algorithm Linearization Methods Regularization, Splitting and Proximal Point Methods Perturbation Methods

Variational Inequality Problems Descent Properties Optimization Variational Inequality Problems

Steplength Rules Convergence Properties Decomposition CA Algorithms Sequential Decomposition Synchronized Parallel Decomposition Asynchronous Parallel Decomposition

See also
References

Keywords

Linearization methods; Gradient projection; Quasi-Newton; Frank–Wolfe; Sequential quadratic programming; Operator splitting; Proximal point; Levenberg–Marquardt; Auxiliary problem principle; Subgradient optimization; Variational inequality problem; Merit function; Cartesian product; Gauss–Seidel; Jacobi; Asynchronous computation

The notion of cost approximation (CA) was created in the thesis [39], to describe the construction of the subproblem of a class of iterative methods in mathematical programming. In order to explain the notion of CA, we will consider the following conceptual problem (the full generality of the algorithm is explained in detail in [45] and in [37,38,40,42,43,44,46]):

$$\min\ T(x) := f(x) + u(x), \quad \text{s.t. } x \in X, \qquad (1)$$

where $X \subseteq \mathbb{R}^n$ is nonempty, closed and convex, $u : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is lower semicontinuous (l.s.c.), proper and convex, and $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is continuously differentiable (for short, in $C^1$) on dom u ∩ X, where dom denotes ‘effective domain’. This problem is general enough to cover convex optimization (f = 0), unconstrained optimization ($f = 0$ and $X = \mathbb{R}^n$), and differentiable constrained optimization (u = 0). We note





that if int(dom u) ∩ X is nonempty, then any locally optimal solution $x^*$ satisfies the inclusion $-\nabla f(x^*) \in \partial u(x^*) + N_X(x^*)$

(2)

where $N_X$ is the normal cone operator for X and $\partial u$ is the subdifferential mapping for u. Equivalently, by the definitions of these two operators,

$$\nabla f(x^*)^\top (x - x^*) + u(x) - u(x^*) \ge 0, \quad x \in X.$$

The CA algorithm was devised in order to place a number of existing algorithms for (1) in a common framework, thereby facilitating comparisons, for example, between their convergence properties. In short, the method works iteratively as follows. Note that, from (2), we seek a zero of the mapping $[\nabla f + \partial u + N_X]$. Given an iterate, $x^t \in$ dom u ∩ X, this mapping is approximated by a monotone mapping, constructed so that a zero of it is easier to find. Such a point, $y^t$, is then utilized in the search for a new iterate, $x^{t+1}$, having the property that the value of some merit function for (1) is reduced sufficiently, for example through a line search in T along the direction of $d^t := y^t - x^t$.

Instances of the CA Algorithm

To obtain a monotone approximating mapping, we introduce a monotone mapping $\Phi^t :$ dom u ∩ X → $\mathbb{R}^n$ which replaces the (possibly nonmonotone) mapping $\nabla f$; subtracting off the error at $x^t$, $[\Phi^t - \nabla f](x^t)$, from $\Phi^t$, so that the resulting mapping becomes $[\Phi^t + \partial u + N_X] + [\nabla f - \Phi^t](x^t)$, the CA subproblem becomes the inclusion

$$[\Phi^t + \partial u + N_X](y^t) + [\nabla f - \Phi^t](x^t) \ni 0^n.$$

x 2 X:

The CA algorithm was devised in order to place a number of existing algorithms for (1) in a common framework, thereby facilitating comparisons, for example, between their convergence properties. In short, the method works iteratively as follows. Note that, from (2), we seek a zero of the mapping [rf + @u + N X ]. Given an iterate, xt 2 dom u \ X, this mapping is approximated by a monotone mapping, constructed so that a zero of which is easier to find. Such a point, yt , is then utilized in the search for a new iterate, xt + 1 , having the property that the value of some merit function for (1) is reduced sufficiently, for example through a line search in T along the direction of dt := yt  xt . Instances of the CA Algorithm To obtain a monotone approximating mapping, we introduce a monotone mapping ˚ t : dom u \ X ! Rn , which replaces the (possibly nonmonotone) mapping rf ; by subtracting off the error at xt , [˚ t  rf ](xt ), from ˚ t , so that the resulting mapping becomes [˚ t + @u + N X ] + [rf  ˚ t ](xt ), the CA subproblem becomes the inclusion [˚ t C @u C N X ](y t ) C [r f  ˚ t ](x t ) 3 0n :

(3)

We immediately reach an interesting fixed-point characterization of the solutions to (2): Theorem 1 (Fixed-point, [45]) The point xt solves (2) if and only if yt = xt solves (3). This result is a natural starting point for devising stopping criteria for an algorithm. Assume now that ˚ t r' t for a convex function t ' . We may then derive the inclusion equivalently as follows. At xt , we replace f with the function ' t , and subtract off the linearization of the error at xt ; the subproblem objective function then becomes t

t

t

t

>

T' t (y) :D ' (y) C u(y) C [r f (x )  r' (x )] y:

It is straightforward to establish that (3) is the optimality conditions for the convex problem of minimizing T ' t over X. Linearization Methods Our first example instances utilize Taylor expansions of f to construct the approximations. Let u = 0 and X = Rn . Let ˚ t (y) := (1/ t ) Qt y, where  t > 0 and Qt is a symmetric and positive definite mapping in Rn × n . The inclusion (3) reduces to r f (x t ) C

1 t t Q (y  x t ) D 0n ; t

that is, yt = xt   t (Qt )1 rf (xt ). The direction of yt  xt , dt :=   t (Qt )1 rf (xt ), is the search direction of the class of deflected gradient methods, which includes the steepest descent method (Qt := I n , the identity matrix) and quasi-Newton methods (Qt equals (an approximation of) r 2 f (xt ), if positive definite). (See further [5,35,47,50].) In the presence of constraints, this choice of ˚ t t t leads to yt = P QX [xt   t (Qt )1 rf (xt )], where P QX [] denotes the p projection onto X with respect to the norm kzkQ t :D z> Q t z. Among the algorithms in this class we find the gradient projection algorithm (Qt := I n ) and Newton’s method (Qt := r 2 f (xt ),  t := 1). (See [5,19,27,50].) A first order Taylor expansion of f is obtained from choosing ' t (y) := 0; this results in T ' t (y) = rf (xt )| y (if u = 0 is still assumed), which is the subproblem objective in the Frank–Wolfe algorithm ([5,17]; cf. also  Frank–Wolfe algorithm). We next provide the first example of the very useful fact that the result of the cost approximation (in the above examples a linearization), leads to different approximations of the original problem, and ultimately to different algorithms, depending on which representation of the problem to one applies the cost approximation. Consider the problem (

min

f (x)

s.t.

g i (x) D 0;

i D 1; : : : ; `;

(4)

where f and g i , i = 1, . . . , `, are functions in C2 . We may associate this problem with its first order optimal-

Cost Approximation Algorithms

ity conditions, which in this special case is    n rx L(x  ;  ) 0   F(x ;  ) :D D ` ; 0 r L(x  ;  )

(5)

where  2 R` is the vector of Lagrange multipliers for the constraints in (4), and L(x, ) := f (x)+ | g(x) is the associated Lagrangian function. We consider using Newton’s method for this system, and therefore introduce a (primal-dual) mapping ˚:R2(n+ `) ! Rn + ` of the form   y ˚((y; p); (x; )) :D rF(x; ) p  2   rx L(x; ) r g(x)> y D : r g(x) 0 p The resulting CA subproblem in (y, p) can be written as the following linear system: rx2 L(x; )(y  x) C r g(x)> p D 0n ; r g(x)(y  x) D g(x); this system constitutes the first order optimality conditions for (e. g., [4, Sec. 10.4]) 8 > ˆ ˆ rx2 L(x; )(y  x)

g(x) C r g(x)> (y  x) D 0` ;

where we have added some fixed terms in the objective function for clarity. This is the generic subproblem of sequential quadratic programming (SQP) methods for the solution of (4); see, for example, [5,16]. Regularization, Splitting and Proximal Point Methods We assume now that f := f 1 + f 2 , where f 1 is convex on dom u \ X, and rewrite the cost mapping as [r f C @u C N X ] D [r' t C r f 1 C @u C N X ]  [r' t  r f 2 ]: The CA subproblem is, as usual, derived by fixing the second term at xt ; the difference to the original setup is that we have here performed an operator splitting in the mapping rf to keep an additive part from being approximated. (Note that such a splitting can always be

C

found by first choosing f 1 as a convex function, and then define f 2 := f  f 1 . Note also that we can derive this subproblem from the original derivation by simply redefining ' t := ' t + f 1 .) We shall proceed to derive a few algorithms from the literature. Consider choosing ' t (y) = 1/(2 t ) k y  xt k2 ,  t > 0. If f 2 = 0, then we obtain the subproblem objective T ' t (y) = T(y)+ 1/(2  t ) k y  xt k2 , which is the subproblem in the proximal point algorithm (e. g., [32,33,34,51,52]). This is the most classical algorithm among the regularization methods. More general choices of strictly convex functions ' t are of course possible, leading for example to the class of regularization methods based on Bregman functions ([9,14,22]) and -divergence functions ([23,54]). If, on the other hand, f 1 = 0, then we obtain the gradient projection algorithm if also u = 0. We can also construct algorithms in between these two extremes, yielding a true operator splitting. If both f 1 and f 2 are nonzero, choosing ' t = 0 defines a partial linearization ([25]) of the original objective, wherein > | only f 2 is linearized. Letting x = (x> 1 , x2 ) , the choice t 2 t ' (y) = 1/(2 t ) k y1  x1 k leads to the partial proximal point algorithm ([7,20]); choosing ' t (y) = f (y1 , x2t ) leads to a linearization of f in the variables x2 . Several well-known methods can be derived either directly as CA algorithms, or as inexact proximal point algorithms. For example, the Levenberg–Marquardt algorithm ([5,49]), which is a Newton-like algorithm wherein a scaled diagonal matrix is added to the Hessian matrix in order to make the resulting matrix positive definite, is the result of solving the proximal point subproblem with one iteration of a Newton algorithm. Further, the extra-gradient algorithm of [24] is the result of instead applying one iteration of the gradient projection algorithm to the proximal point subproblem. The perhaps most well-known splitting algorithm is otherwise the class of matrix splitting methods in quadratic programming (e. g., [28,29,35,36]). In a quadratic programming problem, we have f (x) D

1 > x Ax C q> x; 2

where A 2 Rn × n . A splitting (A1t , A2t ) of this matrix is one for which A = A1t + A2t , and it is further termed regular if A1t  A2t is positive definite. Matrix splitting





methods correspond to choosing f 1 (x) D

1 > t x A1 x; 2

and results in the CA subproblem mapping y 7! A1t y + [A2t xt + q], which obviously is monotone whenever A1t was chosen positive semidefinite. Due to the fact that proximal point and splitting methods have dual interpretations as augmented Lagrangian algorithms ([51]), a large class of multiplier methods is included among the CA algorithms. See [45, Chapt. 3.2–3.3] for more details. Perturbation Methods All the above algorithms assume that i) the mappings @u and N X are left intact; and ii) the CA subproblem has the fixed-point property of Theorem 1. We here relax these assumptions, and are then able to derive subgradient algorithms as well as perturbed CA algorithms which include both regularization algorithms and exterior penalty algorithms. Let [˚ t + N X ] + [rf + @u + ˚ t ] represent the original mapping, having moved @u to the second term. Then by letting any element  u (xt ) 2 @u(xt ) represent this point-to-set mapping at xt , we reach the subproblem mapping of the auxiliary problem principle of [12]. Further letting ˚ t (y) = (1/ t )[y  xt ] yields the subproblem in the classical subgradient optimization scheme ([48,53]), where, assuming further that f = 0, yt := PX [xt   t  u (xt )]. (Typically, `t := 1 is taken.) Let again [˚ t + @u + N X ]+ [rf + ˚ t ] represent the original problem mapping, but further let u be replaced by an epiconvergent sequence {ut } of l.s.c., proper and convex functions. An example of an epiconvergent sequence of convex functions is provided by convex exterior penalty functions. In this way, we can construct CA algorithm that approximate the objective function and simultaneously replace some of the constraints of the problem with exterior penalties. See [3,13] for example methods of this type. One important class of regularization methods takes as the subproblem mapping [˚ t + rf + @u + N X ], where ˚ t is usually taken to be strongly monotone (cf. (12)). This subproblem mapping evidently does not have the fixed-point property, as it is not identical to the original one at xt unless ˚ t (xt ) = 0n holds. In order

to ensure convergence, we must therefore force the sequence {˚ t } of mappings to tend to zero; this is typically done by constructing the sequence as ˚ t := (1/ t )˚ for a fixed mapping ˚ and for a sequence of  t > 0 constructed so that { t } ! 1 holds. For this class of algorithms, F. Browder [10] has established convergence to a unique limit point x which satisfies  ˚(x ) 2 N X  (x ), where X  is the solution set of (2). The origin of this class of methods is the work of A.N. Tikhonov [55] for ill-posed problems, that is, problems with multiple solutions. The classical regularization mapping is the scaled identity mapping, ˚ t (y) := (1/ t )[y], which leads to least squares (least norm) solutions. See further [49,56]. Variational Inequality Problems Consider the following extension of (2):  F(x  ) 2 @u(x  ) C N X (x  );

(6)

where F: X ! Rn is a continuous mapping on X. When F = rf we have the situation in (2), and also in the case when F(x, y) = (r x ˘ (x, y)| ,  r y ˘ (x, y)| )| holds for some saddle function ˘ on some convex product set X × Y (cf. (5)), the variational inequality problem (6) has a direct interpretation as the necessary optimality conditions for an optimization problem. In other cases, however, a merit function (or, objective function), for the problem (6) is not immediately available. We will derive a class of suitable merit functions below. Given the convex function ': dom u \ X ! R in C1 on dom u \ X, we introduce the function (x) :D sup L(y; x);

x 2 dom u \ X;

(7)

y2X

where L(y; x) :D u(x)  u(y) C '(x)  '(y) C [F(x)  r'(x)]>(x  y):

(8)

We introduce the optimization problem min (x): x2X

(9)

Theorem 2 (Gap function, [45]) For any x 2 X, (x)  0 holds. Further, (x) = 0 if and only if x solves (6). Hence, the solution set of (6) (if nonempty) is identical

Cost Approximation Algorithms

to that of the optimization problem (9), and the optimal value is zero. The Theorem shows that the CA subproblem defines an auxiliary function which measures the violation of (6), and which can be used (directly or indirectly) as a merit function in an algorithm. To immediately illustrate the possible use of this result, let us consider the extension of Newton’s method to the solution of (6). Let x 2 dom u \ X, and consider the following cost approximating mapping: y 7! ˚(y, x) := rF(x)(y  x). The CA subproblem then is the linearized variational inequality problem of finding y 2 dom u \ X such that [F(x) C rF(x)> (y  x)]> (z  y) C u(z); u(y)  0; 8z 2 X:

(10)

Assuming that x is not a solution to (6), we are interested in utilizing the direction d := y  x in a line search based on a merit function. We will utilize the primal gap function ([2,62]) for this purpose, which corresponds to the choice ' := 0 in the definition of . We denote the primal gap function by p . Let w be an arbitrary solution to its inner problem, that is, p (x) = u(x)  u(w) + F(x)| (x  w). The steplength is chosen such that the value of p decreases sufficiently; to show that this is possible, we use Danskin’s theorem and the variational inequality (10) with z = w to obtain (the maximum is taken over all w defining p (x)) 0 p (x; d)

:D max w



n

F(x) C rF(x)> (x  w)

p (x) 

>

d C u 0 (x; d)

o

d > rF(x)> d;

which shows that d defines a direction of descent with respect to the merit function p at all points outside the solution set, whenever F is monotone and in C1 on dom u \ X. (See also [30] for convergence rate results.) So, if Newton’s method is supplied with a line search with respect to the primal gap function, it is globally convergent for the solution of variational inequality problems. The merit function and the optimization problem (9) cover several examples previously considered for the solution of (6). The primal gap function, as typically all other gap functions, is nonconvex, and further also nondifferentiable in general. In order to utilize methods from

C

differentiable optimization, we consider letting ' be strictly convex, whence the solution yt to the inner problem (7) is unique. Under the additional assumption that dom u \ X is bounded and that u is in C1 on this set, is in C1 on dom u \ X. Among the known differentiable gap functions that are covered by this class of merit functions we find those of [1,18,26,40], and [31,59,60,61]. Descent Properties Optimization Assume that xt is not a solution to (2). We are interested in the conditions under which the direction of dt := yt  xt provides a descent direction for the merit function T. Let d t :D y t  x t , where y t is a possibly inexact solution to (3). Then, if ˚ t = r' t , the requirement is that T' t (y t ) < T' t (x t );

(11)

that is, any improvement in the value of the subproblem objective over that at the current iterate is enough to provide a descent direction. To establish this result, one simply utilizes the convexity of ' t and u and the formula for the directional derivative of T in the direction of dt (see [45, Prop. 2.14.b]). We further note that (11) is possible to satisfy if and only if xt is not a solution to (2); this result is in fact a special case of Theorem 1. If ˚ t has stronger monotonicity properties, descent is also obtained when ˚ t is not necessarily a gradient mapping, and, further, if it is Lipschitz continuous then we can establish measures of the steepness of the search directions, extending the gradient relatedness conditions of unconstrained optimization. Let ˚ t be strongly monotone on dom u \ X, that is, for x, y 2 dom u \ X, [˚ t (x)  ˚ t (y)]> (x  y)  m˚ t kx  yk2 ;

(12)

for some m˚ t > 0. This can be used to establish that

2 T 0 (x t ; d t )  m˚ t d t : If yt is not an exact solution to (3), in the sense that for a vector y t , we satisfy a perturbation of (3) where its right-hand side 0n is replaced by rt 6D 0n , then d t :D y t  x t is a descent direction for T at xt if k rt k < m˚ t k dt k.





Variational Inequality Problems

Steplength Rules

The requirements for obtaining a descent direction in the problem (6) are necessarily much stronger than in the problem (2), the reason being the much more complex form that the merit functions for (6) takes. (For example, the directional derivative of T at x in any direction d depends only on those quantities, while the directional derivative of depends also on the argument y which defines its value at x.) Typically, monotonicity of the mapping F is required, as is evidenced in the above example of the Newton method. If further a differentiable merit function is used, the requirements are slightly strengthened, as the following example result shows.

In order to establish convergence of the algorithm, the steplength taken in the direction of dt must be such that the value of the merit function decreases sufficiently. An exact line search obviously works, but we will introduce simpler steplength rules that do not require a onedimensional minimization to be performed. The first is the Armijo rule. We assume temporarily that u = 0. Let ˛, ˇ 2 (0, 1), and ` :D ˇ { , where { is the smallest nonnegative integer i such that

Theorem 3 (Descent properties, [45,60]) Assume that X is bounded, u is finite on X and F is monotone and in C1 on X. Let ': X × X ! R be a continuously differentiable function on X × X of the form '(y, x), strictly convex in y for each x 2 X. Let ˛ > 0. Let x 2 X, y be the unique vector in X satisfying ˛ (x)

:D max L˛ (y; x); y2X

where

(13)

There exists a finite integer such that (13) is satisfied for t any search direction d :D y t x t satisfying (11), by the descent property and Taylor’s formula (see [45, Lemma 2.24.b]). In the case where u 6D 0, however, the situation becomes quite different, since T := f + u is nondifferentiable. Simply replacing rf (xt )| dt with T 0 (xt ;dt ) does not work. We can however use an overestimate of the predicted decrease T 0 (xt ;dt ). Let ˛, ˇ 2 (0, 1), and ` :D ˇ { , where { is the smallest nonnegative integer i such that T(x t C ˇ i d t )  T(x t )

L˛ (y; x) :D u(x)  u(y) 1 C ['(x; x)  '(y; x)] ˛  > 1 C F(x)  r y '(x; x) (x  y): ˛ Then, with d := y  x, either d satisfies 0 ˛ (x; d)

 

˛ (x);

 2 (0; 1);

or ˛ (x)

f (x t C ˇ i d t )  f (x t )  ˛ˇ i r f (x t )> d t :



1 ('(y; x) C rx '(y; x)> d): ˛(1   )

A descent algorithm is devised from this result as follows. For a given x 2 X and choice of ˛ > 0, the CA subproblem is solved with the scaled cost approximating, continuous and iteration-dependent function '. If the resulting direction does not have the descent property, then the value of ˛ is increased and the CA subproblem rescaled and resolved. Theorem 3 shows that a sufficient increase in the value of ˛ will produce a descent direction unless x solves (6).

 ˛ˇ i [r' t (x t )  r' t (y t )]> d t ; where now yt necessarily is an exact solution to (3), and ' t must further be strictly convex. We note that T 0 (xt ;dt )  [r' t (xt )  r' t (yt )]| dt indeed holds, with equality in the case where u = 0 and X = Rn (see [45, Remark 2.28]). To develop still simpler steplength rules, we further assume that rf is Lipschitz continuous, that is, that for x, y 2 dom u \ X, kr f (x)  r f (y)k  Mr f kx  yk ; for some M r f > 0. The Lipschitz continuity assumption implies that for every ` 2 [0, 1], T(x t C `d t )  T(x t )  >  ` r' t (x t )  r' t (y t ) d t

2 Mr f 2

` d t ; C 2 adding a strong convexity assumption on ' t yields that  

Mr f `

t t t

d t 2 : T(x C `d )  T(x )  ` m' t C 2

Cost Approximation Algorithms

This inequality can be used to validate the relaxation step, which takes   2m' t \ [0; 1]; (14) ` t 2 0; Mr f and the divergent series steplength rule, [0; 1]  f` t g ! 0;

1 X

` t D 1:

(15)

tD0

In the case of (14), descent is guaranteed in each step, while in the case of (15), descent is guaranteed after a finite number of iterations. Convergence Properties Convergence of the CA algorithm can be established under many combinations of i) the properties of the original problem mappings; ii) the choice of forms and convexity properties of the cost approximating mappings; iii) the choice of accuracy in the computations of the CA subproblem solutions; iv) the choice of merit function; and v) the choice of steplength rule. A subset of the possible results is found in [45, Chapt. 5–9]. Evident from these results is that convergence relies on reaching a critical mass in the properties of the problem and algorithm, and that, given that this critical mass is reached, there is a very large freedom-of-choice how this mass is distributed. So, for example, weaker properties in the monotonicity of the subproblem must be compensated both by stronger coercivity conditions on the merit function and by the use of more accurate subproblem solutions and steplength rules. Decomposition CA Algorithms Assume that dom u \ X is a Cartesian product set, that is, for some finite index set C and positive integers ni P with i2C n i D n, Y X i ; X i Rn i ; XD i2C

u(x) D

X

u i (x i );

u i : Rn i ! R [ fC1g:

i2C

Such problems arise in applications of equilibrium programming, for example in traffic ([41]) and Nash equi-

C

librium problems ([21]); of course, box constrained and unconstrained problems fit into this framework as well. The main advantage of this problem structure is that one can devise several decomposition versions of the CA algorithm, wherein components of the original problem are updated upon in parallel or sequentially, independently of each other. With the right computer environment at hand, this can mean a dramatic increase in computing efficiency. We will look at three computing models for decomposition CA algorithm, and compare their convergence characteristics. In all three cases, decomposition is achieved by choosing the cost approximating mapping separable with respect to the partition of Rn defined by C: ˚(x)> D [˚1 (x1 )> ; : : : ; ˚jCj (xjCj )> ]:

(16)

The individual subproblems, given x, then are to find yi , i 2 C, such that ˚ i (y i ) C @u i (y i ) C N X i (y i ) C Fi (x)  ˚ i (x i ) 3 0n i ; if ˚ i r' i for some convex function ' i : dom ui \ X i ! R in C1 on dom ui \ X i , then this is the optimality conditions for min T' i (y i )

y i 2X i

:D ' i (y i ) C u i (y i ) C [Fi (x)  r' i (x i )]> y i : Sequential Decomposition The sequential CA algorithm proceeds as follows. Given an iterate xt 2 dom u \ X at iteration t, choose an index it 2 C and a cost approximating mapping ˚ ti t , and solve the problem of finding y ti t 2 Rn i t such that (i = it ) ˚ it (y it ) C @u i (y it ) C N X i (y it ) C Fi (x t )  ˚ it (x it ) 3 0n i : Let y tj := x tj for all j 2 C \ {it } and dt := yt  xt . The next iterate, xt + 1 , is then defined by xt + 1 := xt + `t dt , that is, ( x tj C ` t (y tj  x tj ); j D i t ; :D x tC1 j x tj ; j ¤ it ; for some value of `t such that x ti t + `t (y ti t  x ti t ) 2 dom u i t \ X i t and the value of a merit function is reduced sufficiently.





Assume that F is the gradient of a function f : dom u \ X ! R. Let the sequence {it } be chosen according to the cyclic rule, that is, in iteration t, i t :D t

(mod jCj) C 1:

Choose the cost approximating mapping (i = it ) t ; y i ); y i 7! ˚ it (y i ) :D r i f (x¤i

y i 2 dom u i \ X i : Note that this mapping is monotone whenever f is convex in xi . Since ˚ ti (x ti ) = r i f (xt ), the CA subproblem is equivalent (under this convexity assumption) to finding t ; y i ) C u i (y i )g: y it 2 arg min f f (x¤i y i 2X i

An exact line search would produce `t := 1, since y ti minimizes f (x6D i , )+ ui over dom ui \ X i (the remain:= y ti . The ing components of x kept fixed), and so x tC1 i iteration described is that of the classic Gauss–Seidel algorithm ([35]) (also known as the relaxation algorithm, the coordinate descent method, and the method of successive displacements), originally proposed for the solution of unconstrained problems. The Gauss–Seidel algorithm is hence a special case of the sequential CA algorithm. In order to compare the three decomposition approaches, we last provide the steplength requirement in the relaxation steplength rule (cf. (14)). The following interval is valid under the assumptions that for each i 2 C, r i f is Lipschitz continuous on dom ui \ X i and each mapping ˚ ti is strongly monotone: ` i;t

  2m˚ it 2 0; \ [0; 1]: Mr i f

either case, the convergence analysis will be the same, with the exception that the value of |C| may change.) In the sequential decomposition CA algorithm, the steplengths are chosen individually for the different variable components, whereas the original CA algorithm uses a uniform steplength, `t . If the relative scaling of the variable components is poor, in the sense that F or u changes disproportionally to unit changes in the different variables xi , i 2 C, then this ill-conditioning may result in a poor performance of the parallel algorithm. Being forced to use the same steplength in all the components can also have an unwanted effect due to the fact that the values of some variable components are close to their optimal ones while others may be far from optimal, in which case one might for example wish to use longer steps for the latter components. These two factors lead us to introduce the possibility to scale the component directions in the synchronized parallel CA algorithm. We stress that such effects cannot in general be accommodated into the original algorithm through a scaling of the mappings ˚ ti . The scaling factors si, t introduced are assumed to satisfy 0 < s i  s i;t  1;

i 2 C:

Note that the upper bound of one is without any loss of generality. Assume that F is the gradient of a function f : dom u \ X ! R. In the parallel algorithm, choose the cost approximating mapping of the form (16), where for each i 2 C, t ; y i ); y i 7! ˚ it (y i ) :D r i f (x¤i

y i 2 dom u i \ X i : (17)

Synchronized Parallel Decomposition The synchronized parallel CA algorithm is identical to the original scheme, where the CA subproblems are constructed to match the separability structure in the constraints. We presume the existence of a multiprocessor powerful enough to solve the |C| CA subproblems in parallel. (If fewer than |C| processors are available, then either some of the subproblems are solved in sequence or, if possible, the number of components is decreased; in

This mapping is monotone on dom u \ X whenever f is convex in each component xi . Since ˚ ti (x ti ) = r i f (xt ), i 2 C, it follows that the CA subproblem is equivalent (under the above convexity assumption on f ) to finding t ; y i ) C u i (y i )g: y it 2 arg min f f (x¤i y i 2X i

Choosing `t := 1 and si, t := 1, i 2 C, yields xt + 1 := yt , and the resulting iteration is that of the Jacobi algorithm [8,35] (also known as the method of simultaneous displacements). The Jacobi algorithm, which was originally proposed for the solution of systems of equations,

Cost Approximation Algorithms

is therefore a parallel CA algorithm where the cost approximating mapping is (18) and unit steps are taken. The admissible step in component i is `si, t 2 [0, 1], where    2m˚ it : (18) ` 2 0; min i2C s i;t Mr f The maximal step is clearly smaller than in the sequential approach. To this conclusion contributes both the minimum operation and that M r i f  M rf ; both of these requirements are introduced here because the update is made over all variable components simultaneously. (An intuitive explanation is that the sequential algorithm utilizes more recent information when it constructs the subproblems.) One may therefore expect the sequential algorithm to converge to a solution with a given accuracy in less iterations, although the parallel algorithm may be more efficient in terms of solution time; the scaling introduced by si, t may also improve the performance of the parallel algorithm to some extent. Although the parallel version of the algorithm may speed-up the practical convergence rate compared to the sequential one, the need for synchronization in carrying out the updating step will generally deteriorate performance, since faster processors must wait for slower ones. In the next section, we therefore introduce an asynchronous version of the parallel algorithm, in which processors do not wait to receive the latest information available. Asynchronous Parallel Decomposition In the algorithms considered in this Section, the synchronization step among the processors is removed. Because the speed of computations and communications can vary among the processors, and communication delays can be substantial, processors will perform the calculations out of phase with each other. Thus, the advantage of reduced synchronization is paid for by an increase in interprocessor communications, the use of outdated information, and a more difficult convergence detection (see [8]). (Certainly, the convergence analysis also becomes more complicated.) Recent numerical experiments indicate, however, that the introduction of such asynchronous computations can substantially enhance the efficiency of parallel iterative methods (e. g., [6,11,15]).

C

The model of partial asynchronism that we use is as follows. For each processor (or, variable component) i 2 C, we introduce a) initial conditions, xi (t) := x0i 2 X i , for all t  0; b) a set T i of times at which xi is updated; and c) a variable  ij (t) for each j 2 C and t 2 T i , denoting the time at which the value of xj used by processor i at time t is generated by processor j, satisfying 0   ij (t)  t for all j 2 C and t  0. We note that the sequential CA algorithm and the synchronized parallel CA algorithm can both be expressed as asynchronous algorithms: the cyclic sequential algorithm model is obtained from the choices T i := [k  0 {|C| k + i  1} and  ij (t) := t, while the synchronous parallel model is obtained by choosing T i := {1, 2, . . . } and  ij (t) := t, for all i, j and t. The communication delay from processor j to processor i at time t is t  ij (t). The convergence of the partially asynchronous parallel decomposition CA algorithm is based on the assumption that this delay is upper bounded: there exists a positive integer P such that i) for every i 2 C and t  0, at least one element of {t, . . . , t + P  1} belongs to T i ; ii) 0  t   ij (t)  P  1 holds for all i, j 2 C and all t  0; and iii)  ii (t) = t holds for all i 2 C and all t  0. In short, parts i) and ii) of the assumption state that no processor waits for an arbitrarily long time to compute a subproblem solution or to receive a message from another processor. (Note that a synchronized model satisfies P = 1.) Part iii) of the assumption states that processor i always uses the most recent value of its own component xi of x, and is in [58] referred to as a computational nonredundancy condition. This condition holds in general when no variable component is updated simultaneously by more than one processor, as, for example, in message passing systems. For further discussions on the assumptions, we refer the reader to [8,57]; we only remark that they are easily enforced in practical implementations. The iterate x(t) is defined by the vector of xi (t), i 2 C. At a given time t, processor i has knowledge of a possibly outdated version of x(t); we let i h i (t))> x i (t)> :D x1 (1i (t))>; : : : ; xjCj (jCj


denote this vector. (Note that iii) above implies the relation $x_i^i(t) := x_i(\tau_{ii}(t)) = x_i(t)$.) To describe the (partially) asynchronous parallel CA algorithm, processor i updates $x_i(t)$ according to

$x_i(t+1) := x_i(t) + \ell s_i\,\bigl(y_i(t) - x_i^i(t)\bigr), \quad t \in T^i,$

where $y_i(t)$ solves the CA subproblem defined at $x^i(t)$, and $s_i \in (0, 1]$ is a scaling parameter. (We define $d_i(t) := y_i(t) - x_i^i(t)$ to be zero at each $t \notin T^i$.) The admissible steplength for $i \in C$ is $\ell s_i \in [0, 1]$, where

$\ell \in \left(0,\ \frac{2 \min_{i \in C}\{m_{\varphi_i}\, s_i\}}{M_{\nabla f}\,[1 + (|C| + 1)P]}\right).$   (19)

If further, for some $M \ge 0$ and every $i \in C$, all vectors x, y in $\mathrm{dom}\, u \cap X$ with $x_i = y_i$ satisfy

$\|\nabla_i f(x) - \nabla_i f(y)\| \le M \|x - y\|,$   (20)
then, in the above result, the steplength restrictions are adjusted to

$\ell \in \left(0,\ \frac{2 \min_{i \in C}\{m_{\varphi_i}\, s_i\}}{M_{\nabla f} + (|C| + 1)MP}\right).$

(We interpret the property (20) as a quantitative measure of the coupling between the variables.) Most important to note is that the upper bound on $\ell$ is (essentially) inversely proportional to the maximal allowed asynchronism P; this is very intuitive, since if processors take longer steps, then they should exchange information more often. Conversely, the more outdated the information is, the less reliable it is; hence the shorter the step. The relations among the steplengths in the three approaches (cf. (17), (18), and (19)) quantify the intuitive result that utilizing an increasing degree of parallelism and asynchronism results in a decreasing quality of the step directions, due to the usage of more outdated information; subsequently, smaller steplengths must be used. More detailed discussions of this topic are found in [45, Sect. 8.7.2].
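The partially asynchronous scheme can be mimicked in a toy simulation. The sketch below is not the authors' algorithm, just one way to model bounded delays: every component is updated at every time step ($T^i = \{0, 1, \ldots\}$), $\tau_{ii}(t) = t$, and the other components are read with a random delay of at most $P - 1$ from a stored history of iterates, as in assumption ii).

```python
import numpy as np

rng = np.random.default_rng(0)

def async_ca_simulation(x0, grad_f, project, step, P=3, T=100):
    """Toy model of partially asynchronous updates with bounded delay.

    Hypothetical simplifications: every component is updated at every
    time step and tau_ii(t) = t, while the other components are read
    with a random delay of at most P - 1 from a history of iterates.
    """
    n = len(x0)
    history = [np.array(x0, dtype=float)]
    for t in range(T):
        x_t = history[-1]
        x_next = x_t.copy()
        for i in range(n):
            # Processor i's possibly outdated view x^i(t).
            view = np.array([
                x_t[j] if j == i
                else history[max(0, t - int(rng.integers(0, P)))][j]
                for j in range(n)
            ])
            g = grad_f(view)
            y_i = project(i, view[i] - g[i])
            x_next[i] = x_t[i] + step * (y_i - x_t[i])
        history.append(x_next)
    return history[-1]
```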

See also
▸ Dynamic Traffic Networks

References
1. Auchmuty G (1989) Variational principles for variational inequalities. Numer Funct Anal Optim 10:863–874
2. Auslender A (1976) Optimisation: Méthodes numériques. Masson, Paris
3. Auslender A, Crouzeix JP, Fedit P (1987) Penalty-proximal methods in convex programming. J Optim Th Appl 55:1–21
4. Bazaraa MS, Sherali HD, Shetty CM (1993) Nonlinear programming: Theory and algorithms, 2nd edn. Wiley, New York
5. Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Sci., Belmont, MA
6. Bertsekas DP, Castanon DA (1991) Parallel synchronous and asynchronous implementations of the auction algorithm. Parallel Comput 17:707–732
7. Bertsekas DP, Tseng P (1994) Partial proximal minimization algorithms for convex programming. SIAM J Optim 4:551–572
8. Bertsekas DP, Tsitsiklis JN (1989) Parallel and distributed computation: Numerical methods. Prentice-Hall, Englewood Cliffs, NJ
9. Bregman LM (1966) A relaxation method of finding a common point of convex sets and its application to problems of optimization. Soviet Math Dokl 7:1578–1581
10. Browder FE (1966) Existence and approximation of solutions of nonlinear variational inequalities. Proc Nat Acad Sci USA 56:1080–1086
11. Chajakis ED, Zenios SA (1991) Synchronous and asynchronous implementations of relaxation algorithms for nonlinear network optimization. Parallel Comput 17:873–894
12. Cohen G (1978) Optimization by decomposition and coordination: A unified approach. IEEE Trans Autom Control AC-23:222–232
13. Cominetti R (1997) Coupling the proximal point algorithm with approximation methods. J Optim Th Appl 95:581–600
14. Eckstein J (1993) Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Math Oper Res 18:202–226
15. El Baz D (1989) A computational experience with distributed asynchronous iterative methods for convex network flow problems. Proc. 28th IEEE Conf. Decision and Control, pp 590–591
16. Fletcher R (1987) Practical methods of optimization, 2nd edn. Wiley, New York
17. Frank M, Wolfe P (1956) An algorithm for quadratic programming. Naval Res Logist Quart 3:95–110
18. Fukushima M (1992) Equivalent differentiable optimization problems and descent methods for asymmetric variational inequality problems. Math Program 53:99–110
19. Goldstein AA (1964) Convex programming in Hilbert space. Bull Amer Math Soc 70:709–710


20. Ha CD (1990) A generalization of the proximal point algorithm. SIAM J Control Optim 28:503–512
21. Harker PT, Pang J-S (1990) Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Math Program 48:161–220
22. Kiwiel KC (1998) Generalized Bregman projections in convex feasibility problems. J Optim Th Appl 96:139–157
23. Kiwiel KC (1998) Subgradient method with entropic projections for convex nondifferentiable minimization. J Optim Th Appl 96:159–173
24. Korpelevich GM (1977) The extragradient method for finding saddle points and other problems. Matekon 13:35–49
25. Larsson T, Migdalas A (1990) An algorithm for nonlinear programs over Cartesian product sets. Optim 21:535–542
26. Larsson T, Patriksson M (1994) A class of gap functions for variational inequalities. Math Program 64:53–79
27. Levitin ES, Polyak BT (1966) Constrained minimization methods. USSR Comput Math Math Phys 6:1–50
28. Luo Z-Q, Tseng P (1991) On the convergence of a matrix splitting algorithm for the symmetric monotone linear complementarity problem. SIAM J Control Optim 29:1037–1060
29. Mangasarian OL (1991) Convergence of iterates of an inexact matrix splitting algorithm for the symmetric monotone linear complementarity problem. SIAM J Optim 1:114–122
30. Marcotte P, Dussault J-P (1989) A sequential linear programming algorithm for solving monotone variational inequalities. SIAM J Control Optim 27:1260–1278
31. Marcotte P, Zhu D (1995) Global convergence of descent processes for solving non strictly monotone variational inequalities. Comput Optim Appl 4:127–138
32. Martinet B (1972) Détermination approchée d'un point fixe d'une application pseudo-contractante. CR Hebdom Séances de l'Acad Sci (Paris), Sér A 274:163–165
33. Minty GJ (1962) Monotone (nonlinear) operators in Hilbert space. Duke Math J 29:341–346
34. Moreau J-J (1965) Proximité et dualité dans un espace Hilbertien. Bull Soc Math France 93:273–299
35. Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables. Acad. Press, New York
36. Pang J-S (1982) On the convergence of a basic iterative method for the implicit complementarity problem. J Optim Th Appl 37:149–162
37. Patriksson M (1993) Partial linearization methods in nonlinear programming. J Optim Th Appl 78:227–246
38. Patriksson M (1993) A unified description of iterative algorithms for traffic equilibria. Europ J Oper Res 71:154–176
39. Patriksson M (1993) A unified framework of descent algorithms for nonlinear programs and variational inequalities. PhD Thesis, Dept. Math., Linköping Inst. Techn.
40. Patriksson M (1994) On the convergence of descent methods for monotone variational inequalities. Oper Res Lett 16:265–269


41. Patriksson M (1994) The traffic assignment problem – Models and methods. Topics in Transportation. VSP, Utrecht
42. Patriksson M (1997) Merit functions and descent algorithms for a class of variational inequality problems. Optim 41:37–55
43. Patriksson M (1998) Cost approximation: A unified framework of descent algorithms for nonlinear programs. SIAM J Optim 8:561–582
44. Patriksson M (1998) Decomposition methods for differentiable optimization problems over Cartesian product sets. Comput Optim Appl 9:5–42
45. Patriksson M (1998) Nonlinear programming and variational inequality problems: A unified approach. Applied Optim, vol 23. Kluwer, Dordrecht
46. Patriksson M (1999) Cost approximation algorithms with nonmonotone line searches for a general class of nonlinear programs. Optim 44:199–217
47. Polyak BT (1963) Gradient methods for the minimisation of functionals. USSR Comput Math Math Phys 3:864–878
48. Polyak BT (1967) A general method of solving extremum problems. Soviet Math Dokl 8:593–597
49. Polyak BT (1987) Introduction to optimization. Optim. Software, New York
50. Pshenichny BN, Danilin YuM (1978) Numerical methods in extremal problems. MIR, Moscow
51. Rockafellar RT (1976) Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math Oper Res 1:97–116
52. Rockafellar RT (1976) Monotone operators and the proximal point algorithm. SIAM J Control Optim 14:877–898
53. Shor NZ (1985) Minimization methods for non-differentiable functions. Springer, Berlin
54. Teboulle M (1997) Convergence of proximal-like algorithms. SIAM J Optim 7:1069–1083
55. Tikhonov AN (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Math Dokl 4:1035–1038
56. Tikhonov AN, Arsenin VYa (1977) Solutions of ill-posed problems. Wiley, New York (translated from Russian)
57. Tseng P, Bertsekas DP, Tsitsiklis JN (1990) Partially asynchronous, parallel algorithms for network flow and other problems. SIAM J Control Optim 28:678–710
58. Üresin A, Dubois M (1992) Asynchronous iterative algorithms: Models and convergence. In: Kronsjö L, Shumsheruddin D (eds) Advances in Parallel Algorithms. Blackwell, Oxford, pp 302–342
59. Wu JH, Florian M, Marcotte P (1993) A general descent framework for the monotone variational inequality problem. Math Program 61:281–300
60. Zhu DL, Marcotte P (1993) Modified descent methods for solving the monotone variational inequality problem. Oper Res Lett 14:111–120
61. Zhu DL, Marcotte P (1994) An extended descent framework for variational inequalities. J Optim Th Appl 80:349–366


62. Zuhovickiĭ SI, Polyak RA, Primak ME (1969) Two methods of search for equilibrium points of n-person concave games. Soviet Math Dokl 10:279–282

Credit Rating and Optimization Methods
CONSTANTIN ZOPOUNIDIS, MICHAEL DOUMPOS
Department of Production Engineering and Management, Financial Engineering Laboratory, Technical University of Crete, Chania, Greece
MSC2000: 91B28, 90C90, 90C05, 90C20, 90C30
Article Outline
Synonyms
Introduction/Background
Definitions
Formulation
Methods/Applications
Logistic Regression
Neural Networks
Support Vector Machines
Multicriteria Value Models and Linear Programming Techniques
Evolutionary Optimization

Conclusions
References

Synonyms
Credit scoring; Credit granting; Financial risk management; Optimization

Introduction/Background
Financial risk management has evolved over the past two decades in terms of both its theory and its practices. Economic uncertainties, changes in the business environment and the introduction of new complex financial products (e.g., financial derivatives) led financial institutions and regulatory authorities to the development of a new framework for financial risk management, focusing mainly on the capital adequacy of banks and credit institutions. Banks and other financial institutions are exposed to many different forms of financial risks. Usually these are categorized as [14]:

– Market risk, which arises from changes in the prices of financial securities and currencies.
– Credit risk, originating from the inability of firms and individuals to meet their debt obligations to their creditors.
– Liquidity risk, which arises when a transaction cannot be conducted at the existing market prices or when early liquidation is required in order to meet payment obligations.
– Operational risk, which originates from human and technical errors or accidents.
– Legal risk, which is due to legislative restrictions on financial transactions.

Among these types of risk, credit risk is considered the primary financial risk in the banking system and exists in virtually all income-producing activities [7]. How a bank selects and manages its credit risk is critically important to its performance over time. In this context, credit risk management defines the whole range of activities that are implemented in order to measure, monitor and minimize credit risk. Credit risk management has evolved dramatically over the last 20 years. Among others, some factors that have increased the importance of credit risk management include [2]: (i) the worldwide increase in the number of bankruptcies, (ii) the trend towards disintermediation by the highest quality and largest borrowers, (iii) the increased competition among credit institutions, (iv) the declining value of real assets and collateral in many markets, and (v) the growth of new financial instruments with inherent default risk exposure, such as credit derivatives. Early credit risk management was primarily based on empirical evaluation systems of the creditworthiness of a client. CAMEL has been the most widely used system in this context; it is based on the empirical combination of several factors related to capital, assets, management, earnings and liquidity. It was soon realized, however, that such empirical systems cannot provide a solid and objective basis for credit risk management. This led to an outgrowth of studies from academics and practitioners on the development of new credit risk assessment systems. These efforts were also motivated by the changing regulatory framework that now requires banks to implement specific methodologies for managing and monitoring their credit portfolios [4].


The existing practices are based on sophisticated statistical and optimization methods, which are used to develop a complete framework for measuring and monitoring credit risk. Credit rating models are at the core of this framework and are used to assess the creditworthiness of firms and individuals. The following sections describe the functionality of credit rating systems and the type of optimization methods that are used in some popular techniques for developing rating systems.

Definitions
As already noted, credit risk is defined as the likelihood that an obligor (firm or individual) will be unable or unwilling to fulfill debt obligations towards the creditors. In such a case, the creditors will suffer losses that have to be measured as accurately as possible. The expected loss $L_{it}$ over a period t from granting credit to a given obligor i can be measured as follows:

$L_{it} = \mathrm{PD}_{it} \cdot \mathrm{LGD}_i \cdot \mathrm{EAD}_i$

where $\mathrm{PD}_{it}$ is the probability of default for the obligor i in the time period t, $\mathrm{LGD}_i$ is the percentage of exposure the bank might lose in case the borrower defaults, and $\mathrm{EAD}_i$ is the amount outstanding in case the borrower defaults. The time period t is usually taken equal to one year. In the new regulatory framework, default is considered to have occurred with regard to a particular obligor when one or more of the following events has taken place [4,11]:
– it is determined that the obligor is unlikely to pay its debt obligations in full;
– a credit loss event associated with any obligation of the obligor has occurred;
– the obligor is past due more than 90 days on any credit obligation; or
– the obligor has filed for bankruptcy or similar protection from creditors.
The aim of credit rating models is to assess the probability of default for an obligor, whereas other models are used to estimate LGD and EAD. Rating systems measure credit risk and differentiate individual credits and groups of credits by the risk they pose. This allows bank management and examiners to monitor changes and trends in risk levels, thus promoting safety and soundness in the credit granting process. Credit rating systems are also used for credit approval and underwriting, loan pricing, relationship management and credit administration, allowance for loan and lease losses and capital adequacy, credit portfolio management and reporting [7].
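The expected-loss formula above is a plain product, as the toy computation below illustrates; the numbers (2% one-year PD, 45% LGD, an exposure of 1,000,000) are made up for the example.

```python
def expected_loss(pd_t, lgd, ead):
    """L_it = PD_it * LGD_i * EAD_i; inputs are fractions and currency."""
    return pd_t * lgd * ead

# Made-up figures: 2% one-year PD, 45% LGD, exposure of 1,000,000.
print(expected_loss(0.02, 0.45, 1_000_000))  # -> 9000.0
```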

Formulation
Generally, a credit rating model can be considered as a mapping function $f: \mathbb{R}^n \to G$ that estimates the probability of default of an obligor described by a vector $x \in \mathbb{R}^n$ of input features and maps the result to a set G of risk categories. The feature vector x represents all the relevant information that describes the obligor, including financial and nonfinancial data. The development of a rating model is based on the process of Fig. 1. The process begins with the collection of appropriate data regarding known cases in default and nondefault cases. These data can be taken from the historical database of a bank, or from external resources. At this data selection stage, some preprocessing of the data is necessary in order to transform the obtained data into useful features, to clean the data of possible outliers, and to select the appropriate set of features for the analysis. These steps lead to the final data $\{x_i, y_i\}_{i=1}^{m}$, where $x_i$ is the input feature vector for obligor i, $y_i$ is the known status of the obligor (e.g., $y_i = -1$ for cases in default and $y_i = 1$ for nondefault cases), and m is the number of observations in the data set. These data, which are used for model development, are usually referred to as training data.

Credit Rating and Optimization Methods, Figure 1 The process for developing credit rating models

The second stage involves the optimization process, which refers to the identification of the model's parameters that best fit the training data. In the simplest case, the model can be expressed as a linear function of the form

$f(x) = x\beta + \beta_0$

where $\beta \in \mathbb{R}^n$ is the vector with the coefficients of the selected features in the model and $\beta_0$ is a constant term. Other types of nonlinear models are also applicable. In the above linear case, the objective of the optimization process is to identify the optimal parameter vector $\alpha = (\beta, \beta_0)$ that best fits the training data. This can be expressed as an optimization problem of the following general form:

$\min_{\alpha \in S} L(\alpha; X)$   (1)

where S is a set of constraints that define the feasible (acceptable) values for the parameter vector $\alpha$, X is the training data set, and L is a loss function measuring the differences between the model's output and the given classification of the training observations. The results of the model optimization process are validated using another sample of obligors with known status. This is referred to as the validation sample. Typically it consists of cases different from those of the training sample and from a future time period. The optimal model is applied to these new observations and its predictive ability is measured. If this is acceptable, then the model's outputs are used to define a set of risk rating classes (usually 10 classes are used). Each rating class is associated with a probability of default and includes borrowers with similar credit risk levels. The defined rating also needs to be validated in terms of its stability over time, the distribution of the borrowers in the rating groups, and the consistency between the estimated probabilities of default in each group and the empirical ones taken from the population of rated borrowers.
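The article does not prescribe how model outputs are mapped to the (typically 10) rating classes; equal-frequency binning of the scores is one simple possibility, sketched below with a hypothetical helper.

```python
import numpy as np

def assign_rating_classes(scores, n_classes=10):
    """Equal-frequency binning of model scores into rating classes.

    Hypothetical helper: the text only says that model outputs are
    used to define (usually 10) risk classes; quantile binning is one
    simple way to do this, not necessarily the one used in practice.
    """
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_classes + 1))
    classes = np.searchsorted(edges, scores, side="right") - 1
    return np.clip(classes, 0, n_classes - 1)
```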

Methods/Applications
The optimization problem (1) is expressed in different forms depending on the method used to develop the rating model. The characteristics of some popular methods are outlined below.

Logistic Regression
Logistic regression is the most widely used method in financial decision-making problems, with numerous applications in credit risk rating. Logistic regression assumes that the log of the probability odds is a linear function:

$\log \frac{p}{1 - p} = \beta_0 + x\beta$

where $p = \Pr(1 \mid x)$ is the probability that an obligor x is a member of class 1, which is then expressed as

$p = \left[1 + \exp(-\beta_0 - x\beta)\right]^{-1}$

The parameters of the model (constant term $\beta_0$ and coefficient vector $\beta$) are estimated to maximize the conditional likelihood of the classification given the training data. This is expressed as

$\max_{\beta_0, \beta} \prod_{i=1}^{m} \Pr(y_i \mid x_i)$

which can be equivalently written as

$\max_{\beta_0, \beta} \sum_{i=1}^{m} \left[\frac{y_i + 1}{2} \ln(p_i) + \frac{1 - y_i}{2} \ln(1 - p_i)\right]$

where $y_i = 1$ if obligor i is in the nondefault group and $y_i = -1$ otherwise. Nonlinear optimization techniques such as the Newton algorithm are used to perform this optimization. Logistic regression has been widely applied in credit risk rating both by academics and by practitioners [1]. Its advantages are mainly related to its simplicity and transparency: it provides direct estimates of the probabilities of default as well as estimates of the significance of the predictor variables, and it is computationally feasible even for large data sets.
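A minimal Newton-type implementation of the likelihood maximization above might look as follows, using the article's ±1 coding of $y_i$; the iteration count is arbitrary and no line search or regularization safeguards are included, so this is a sketch rather than production code.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-type maximization of the logistic log-likelihood above.

    Uses the article's coding y_i = 1 (nondefault), y_i = -1 (default).
    A column of ones is appended, so beta0 is the last coefficient.
    No regularization, step control or convergence test.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    t = (y + 1) / 2                    # map {-1, 1} to {0, 1}
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))        # Pr(class 1 | x)
        grad = Xb.T @ (t - p)                      # log-likelihood gradient
        H = Xb.T @ (Xb * (p * (1 - p))[:, None])   # negated Hessian X^T W X
        w += np.linalg.solve(H, grad)              # Newton step
    return w[:-1], w[-1]               # (beta, beta0)
```

Each Newton step solves with the matrix $X^\top W X$, which is positive definite away from perfect separation; that failure mode is one reason safeguards are needed in practice.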


Neural Networks
Neural networks are a popular methodology for developing decision-making models in complex domains. A neural network is a network of parallel processing units (neurons) organized into layers. A typical structure of a neural network (Fig. 2) includes the following structural elements:
1. An input layer consisting of a set of nodes (processing units – neurons), one for each input to the network.
2. An output layer consisting of one or more nodes, depending on the form of the desired output of the network. In classification problems, the number of nodes of the output layer is determined in accordance with the number of groups.
3. A series of intermediate layers referred to as hidden layers. The nodes of each hidden layer are fully connected with the nodes of the subsequent and the preceding layer.

Credit Rating and Optimization Methods, Figure 2 A typical architecture of a neural network

Each connection between two nodes of the network is assigned a weight representing the strength of the connection. On the basis of the connections' weights, the input to each node is determined as the weighted average of the outputs from all the incoming connections. Thus, the input $\mathrm{in}_{ir}$ to node i of the hidden layer r is defined as follows:

$\mathrm{in}_{ir} = \sum_{j=0}^{r-1} \sum_{k=1}^{n_j} w_{ik}\, o_{kj} + \theta_{ir}$

where $n_j$ is the number of nodes at the hidden layer j, $w_{ik}$ is the weight of the connection between node i at layer r and node k at layer j, $o_{kj}$ is the output of node k at layer j, and $\theta_{ir}$ is a bias term. The output of each node is specified through a transformation function. The most common form of this function is the logistic function:

$o_{ir} = \left(1 + e^{-\mathrm{in}_{ir}}\right)^{-1}$

The determination of the optimal neural network model requires the estimation of the connection weights and the bias terms of the nodes. The most widely used network training methodology is the backpropagation approach [18]. Nonlinear optimization techniques are used for this purpose [10,13,16]. Neural networks have become increasingly popular in recent years for the development of credit rating models [3]. Their main advantages include their ability to model complex nonlinear relationships in credit data, but they have also been criticized for their lack of transparency, the difficulty of specifying a proper architecture, and the increased computational resources that are needed for large data sets.
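For illustration, a forward pass with logistic activations can be written in a few lines. The sketch simplifies the architecture so that each layer feeds only the next one, whereas the formula above allows connections from all preceding layers.

```python
import numpy as np

def forward(x, layers):
    """Forward pass with logistic activations.

    Simplified relative to the article's formula: each layer feeds
    only the next one. `layers` is a list of (W, theta) pairs with
    W of shape (n_out, n_in).
    """
    o = np.asarray(x, dtype=float)
    for W, theta in layers:
        net = W @ o + theta              # in_ir = sum_k w_ik o_k + theta_ir
        o = 1.0 / (1.0 + np.exp(-net))   # logistic transformation
    return o
```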

Support Vector Machines
Support vector machines (SVMs) have become an increasingly popular nonparametric methodology for developing classification models. In a dichotomous classification setting, SVMs can be used to develop a linear decision function $f(x) = \mathrm{sgn}(x\beta + \beta_0)$. The optimal decision function f should maximize the margin induced in the separation of the classes [24], which is defined as $2/\|\beta\|$. Thus, the estimation of the optimal model is expressed as a quadratic programming problem of the following form:

$\min\ \tfrac{1}{2}\beta^\top\beta + C e^\top d$
subject to $Y(X\beta + e\beta_0) + d \ge e$   (2)
$\beta \in \mathbb{R}^n,\ \beta_0 \in \mathbb{R},\ d \ge 0$


where X is an $m \times n$ matrix with the training data, Y is an $m \times m$ matrix such that $Y_{ii} = y_i$ and $Y_{ij} = 0$ for all $i \ne j$, d is an $m \times 1$ vector of nonnegative error (slack) variables defined such that $d_i > 0$ iff $y_i(x_i\beta + \beta_0) < 1$, e is an $m \times 1$ vector of ones, and $C > 0$ is a user-defined constant representing the trade-off between the two conflicting objectives (maximization of the separating margin and minimization of the training errors). SVMs can also be used to develop nonlinear models. This is achieved by mapping the problem data to a higher-dimensional space H (feature space) through a transformation of the form $x_i x_j^\top \to \phi(x_i)^\top \phi(x_j)$. The mapping function $\phi$ is implicitly defined through a symmetric positive definite kernel function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ [22]. The representation of the data using the kernel function enables the development of a linear model in the feature space H. For large training sets, several computational procedures have been proposed to enable the fast training of SVM models. Most of these procedures are based on a decomposition scheme: the optimization problem (2) is decomposed into smaller subproblems, taking advantage of the sparse nature of SVM models, since only a small part of the data (the support vectors) contributes to the final form of the model. A review of the algorithms for training SVMs can be found in [6]. SVMs seem to be a promising methodology for developing credit rating models. The algorithmic optimization advances enable their application to large credit data sets, and they provide a unified framework for developing both linear and nonlinear models. Recent applications of SVMs in credit rating can be found in [9,12,21].
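Problem (2) is equivalent to minimizing $\tfrac{1}{2}\|\beta\|^2 + C\sum_i \max(0,\ 1 - y_i(x_i\beta + \beta_0))$, since at the optimum $d_i$ equals the hinge term. The following subgradient-descent sketch exploits this; it is a didactic stand-in for the dedicated QP and decomposition solvers surveyed in [6], with an arbitrary learning rate and epoch count.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on the hinge-loss form of problem (2).

    Didactic stand-in for dedicated QP/decomposition solvers;
    learning rate and epoch count are arbitrary choices.
    """
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(epochs):
        margins = y * (X @ beta + beta0)
        active = margins < 1                     # points with d_i > 0
        g_beta = beta - C * (y[active] @ X[active])
        g_beta0 = -C * y[active].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0
```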

Multicriteria Value Models and Linear Programming Techniques
The aforementioned classification methods assume that the groups are defined in a nominal way (i.e., the grouping provides a simple description of the cases). However, in credit risk modeling the groups are defined in an ordinal way, in the sense that an obligor classified in a low risk group is preferred to an obligor classified in a high risk group (in terms of its probability of default). Multicriteria methods are well-suited to the study of ordinal classification problems [26]. A typical multicriteria method that is well-suited for the development of credit rating models is the UTADIS method. The method leads to the development of a multiattribute additive value function:

$V(x) = \sum_{j=1}^{n} w_j v_j(x_j)$

where $w_j$ is the weight of attribute j, and $v_j(x_j)$ is the corresponding marginal value function. Each marginal value function provides a monotone mapping of the performance of the obligors on the corresponding attribute to a scale between 0 (high risk) and 1 (low risk). According to [15], such an additive value function model is well-suited for credit scoring and is widely used by banks in their internal rating systems. Using a piecewise-linear modeling approach, the estimation of the value function is performed based on a set of training data using linear programming techniques. For a two-class problem, the general form of the linear programming formulation is as follows [8]:

$\min\ d_1 + d_2 + \cdots + d_m$
subject to:
$y_i\,[V(x_i) - \beta] + d_i \ge \delta, \quad i = 1, 2, \ldots, m$
$w_1 + w_2 + \cdots + w_n = 1$
$w_j,\ d_i,\ \beta \ge 0$

where $\beta$ is a value threshold that distinguishes the two classes, $\delta$ is a small positive user-defined constant, and $d_i = \max\{0,\ \delta - y_i[V(x_i) - \beta]\}$ denotes the classification error for obligor i. Extensions of this framework and alternative linear programming formulations with applications to credit risk rating have been presented in [5,17,19]. The main advantages of these methodologies involve their computational efficiency and the simplicity and transparency of the resulting models.
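The two-class linear program above can be fed directly to a generic LP solver. In the sketch below the marginal value functions are assumed to be already evaluated into a matrix V (the full UTADIS piecewise-linear modeling of the $v_j$ is omitted), and scipy's linprog is used with the decision variables ordered as [w, β, d].

```python
import numpy as np
from scipy.optimize import linprog

def fit_additive_model(V, y, delta=1e-3):
    """Solve the two-class LP above with a generic LP solver.

    Assumption: the marginal values v_j(x_ij) are already evaluated
    into a matrix V (m x n, scaled to [0, 1]); the UTADIS piecewise-
    linear modeling of the marginals themselves is omitted.
    Decision variables are ordered as [w (n), beta, d (m)].
    """
    m, n = V.shape
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])   # min sum_i d_i
    # y_i [V_i w - beta] + d_i >= delta, rewritten in <= form:
    A_ub = np.hstack([-y[:, None] * V, y[:, None], -np.eye(m)])
    b_ub = -delta * np.ones(m)
    A_eq = np.concatenate([np.ones(n), [0.0], np.zeros(m)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1 + m))
    assert res.success
    return res.x[:n], res.x[n], res.x[n + 1:]    # w, beta, d
```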


Evolutionary Optimization
Evolutionary algorithms (EAs) are stochastic search and optimization heuristics inspired by the theory of natural evolution. In an EA, different possible solutions of an optimization problem constitute the individuals of a population. The quality of each individual is assessed with a fitness (objective) function. Better solutions are assigned higher fitness values than worse performing solutions. The key idea of EAs is that the optimal solution can be found if an initial population is evolved using a set of stochastic genetic operators, similar to the "survival of the fittest" mechanism of natural evolution. The fitness values of the individuals in a population are used to define how they will be propagated to subsequent generations of populations. Most EAs include operators that select individuals for reproduction, produce new individuals based on those selected, and determine the composition of the population at the subsequent generation. Well-known EAs and similar metaheuristic techniques include, among others, genetic algorithms, genetic programming, tabu search, simulated annealing, ant colony optimization and particle swarm optimization. EAs have been used to facilitate the development of credit rating systems, addressing some important issues such as feature selection, rule extraction, neural network development, etc. Some recent applications can be found in the works of Varetto [25], Salcedo-Sanza et al. [20] and Tsakonas et al. [23].
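A bare-bones evolutionary loop with tournament selection, blend crossover and Gaussian mutation is sketched below; it is purely generic, and the cited credit-rating applications use considerably more elaborate operators and representations.

```python
import numpy as np

rng = np.random.default_rng(1)

def evolve(fitness, dim, pop_size=40, gens=100, sigma=0.1):
    """Bare-bones EA: tournament selection, blend crossover,
    Gaussian mutation. Purely generic illustration."""
    pop = rng.standard_normal((pop_size, dim))
    for _ in range(gens):
        fit = np.array([fitness(ind) for ind in pop])
        children = []
        for _ in range(pop_size):
            i, j = rng.integers(0, pop_size, 2)
            p1 = pop[i] if fit[i] >= fit[j] else pop[j]       # tournament 1
            k, l = rng.integers(0, pop_size, 2)
            p2 = pop[k] if fit[k] >= fit[l] else pop[l]       # tournament 2
            a = rng.random()
            child = a * p1 + (1.0 - a) * p2                   # crossover
            child = child + sigma * rng.standard_normal(dim)  # mutation
            children.append(child)
        pop = np.array(children)
    return max(pop, key=fitness)
```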

Conclusions
Credit rating systems are at the core of the new regulatory framework for the supervision of financial institutions. Such systems support the credit granting process and enable the measurement and monitoring of credit risk exposure. The increasing volume of credit data available for developing rating systems highlights the importance of implementing efficient optimization techniques for the construction of rating models. The existing optimization methods used in this field are mainly based on nonlinear optimization, linear programming and evolutionary algorithms. Future research is expected to take advantage of the advances in computer science, algorithmic developments regarding new forms of decision models, the analysis of the combination of different models, the comparative investigation of the performance of the existing methods, and the implementation into decision support systems that can be used by credit analysts in their daily practice.

References
1. Altman EI, Avery R, Eisenbeis R, Stinkey J (1981) Application of Classification Techniques in Business, Banking and Finance. JAI Press, Greenwich
2. Altman EI, Saunders A (1998) Credit risk measurement: Developments over the last 20 years. J Banking Finance 21:1721–1742
3. Atiya AF (2001) Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Trans Neural Netw 12:929–935


4. Basel Committee on Banking Supervision (2004) International Convergence of Capital Measurement and Capital Standards: A Revised Framework. Bank for International Settlements, Basel, Switzerland
5. Bugera V, Konno H, Uryasev S (2002) Credit cards scoring with quadratic utility functions. J Multi-Criteria Decis Anal 11:197–211
6. Campbell C (2002) Kernel methods: A survey of current techniques. Neurocomput 48:63–84
7. Comptroller of the Currency Administrator of National Banks (2001) Rating Credit Risk: Comptroller's Handbook. Comptroller of the Currency Administrator of National Banks, Washington, DC
8. Doumpos M, Zopounidis C (2002) Multicriteria Decision Aid Classification Methods. Kluwer, Dordrecht
9. Friedman C (2002) CreditModel technical white paper, Technical Report. Standard and Poor's, New York
10. Hagan MT, Menhaj M (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5:989–993
11. Hayden E (2003) Are credit scoring models sensitive with respect to default definitions? Evidence from the Austrian market. EFMA 2003 Helsinki Meetings (Available at SSRN: http://ssrn.com/abstract=407709)
12. Huang Z, Chen H, Hsu CJ, Chen WH, Wu S (2004) Credit rating analysis with support vector machines and neural networks: A market comparative study. Decis Support Syst 37:543–558
13. Hung MS, Denton JW (1993) Training neural networks with the GRG2 nonlinear optimizer. Eur J Oper Res 69:83–91
14. Jorion P (2000) Value at Risk: The New Benchmark for Managing Financial Risk, 2nd edn. McGraw-Hill, New York
15. Krahnen JP, Weber M (2001) Generally accepted rating principles: A primer. J Banking Finance 25:3–23
16. Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6:525–533
17. Mou TY, Zhou ZF, Shi Y (2006) Credit risk evaluation based on LINMAP. Lecture Notes Comput Sci 3994:452–459
18. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representation by error propagation. In: Rumelhart DE, Williams JL (eds) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, pp 318–362
19. Ryu YU, Yue WT (2005) Firm bankruptcy prediction: Experimental comparison of isotonic separation and other classification approaches. IEEE Trans Syst, Man Cybern – Part A 35:727–737
20. Salcedo-Sanza S, Fernández-Villacañasa JL, Segovia-Vargas MJ, Bousoño-Calzón C (2005) Genetic programming for the prediction of insolvency in non-life insurance companies. Comput Oper Res 32:749–765
21. Schebesch KB, Stecking R (2005) Support vector machines for classifying and describing credit applicants: detecting typical and critical regions. J Oper Res Soc 56:1082–1088


22. Schölkopf B, Smola A (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge
23. Tsakonas A, Dounias G, Doumpos M, Zopounidis C (2006) Bankruptcy prediction with neural logic networks by means of grammar-guided genetic programming. Expert Syst Appl 30:449–461
24. Vapnik VN (1998) Statistical Learning Theory. Wiley, New York
25. Varetto F (1998) Genetic algorithms applications in the analysis of insolvency risk. J Banking Finance 22:1421–1439
26. Zopounidis C, Doumpos M (2002) Multicriteria classification and sorting methods: A literature review. Eur J Oper Res 138:229–246

Criss-Cross Pivoting Rules
TAMÁS TERLAKY
Department Comput. & Software, McMaster University, Hamilton, Canada
MSC2000: 90C05, 90C33, 90C20, 05B35, 65K05
Article Outline
Keywords
Synonyms
Introduction
Zionts' Criss-Cross Method
The Least-Index Criss-Cross Method
Other Interpretations
Recursive Interpretation
Lexicographically Increasing List
Other Finite Criss-Cross Methods
First-in Last-out Rule (FILO)
Most Often Selected Variable Rule
Exponential and Average Behavior
Best-Case Analysis of Admissible Pivot Methods
Generalizations
Fractional Linear Optimization
Linear Complementarity Problems
Convex Quadratic Optimization
Oriented Matroids
See also
References

Keywords
Pivot rules; Criss-cross method; Cycling; Recursion; Linear optimization; Oriented matroids

Synonyms
Criss-cross

Introduction
From the early days of linear optimization (LO) (or linear programming), many people have been looking for a pivot algorithm that avoids the two-phase procedure needed in the simplex method when solving the general LO problem in standard primal form

$\min\{c^\top x : Ax = b,\ x \ge 0\},$

and its dual

$\max\{b^\top y : A^\top y \le c\}.$

Such a method was assumed to rely on the intrinsic symmetry behind the primal and dual problems (i.e., it was hoped to be selfdual), and it should be able to start from any basic solution. There were several attempts made to relax the feasibility requirement in the simplex method. It is important to mention Dantzig's [7] parametric selfdual simplex algorithm, which can be interpreted as Lemke's algorithm [22] for the corresponding linear complementarity problem (cf. ▸ Linear Complementarity Problem) [23]. In the 1960s people realized that pivot sequences through possibly infeasible basic solutions might result in significantly shorter paths to the optimum. Moreover, a selfdual one-phase procedure was expected to make linear programming more easily accessible to a broader public. Probably these advantages stimulated the introduction of the criss-cross method by S. Zionts [39,40].

Zionts' Criss-Cross Method
Assuming that the reader is familiar with both the primal and dual simplex methods, Zionts' criss-cross method can easily be explained.
– It can be initialized by any, possibly both primal and dual infeasible, basis. If the basis is optimal, we are done. If the basis is not optimal, then there are some primal or dual infeasible variables. One might choose any of these; it is advised to alternately choose a primal and then a dual infeasible variable, if possible.


– If the selected variable is dual infeasible, then it enters the basis, and the leaving variable is chosen among the primal feasible variables in such a way that primal feasibility of the currently primal feasible variables is preserved. If no such basis exchange is possible, another infeasible variable is selected.
– If the selected variable is primal infeasible, then it leaves the basis, and the entering variable is chosen among the dual feasible variables in such a way that dual feasibility of the currently dual feasible variables is preserved. If no such basis exchange is possible, another infeasible variable is selected.
If the current basis is infeasible, but none of the infeasible variables allows a pivot fulfilling the above requirements, then it is proved that the LO problem has no optimal solution. Once a primal or dual feasible solution is reached, Zionts' method reduces to the primal or dual simplex method, respectively. One attractive characteristic of Zionts' criss-cross method is its primal–dual symmetry (selfduality), and this alone differentiates it from the simplex method. However, it is not clear how one can design a finite version (i.e., a finite pivot rule) of this method: both lexicographic perturbation and minimal index resolution seem not to be sufficient to prove finiteness in the general case when the initial basis is both primal and dual infeasible. Nevertheless, although not finite, this is the first published criss-cross method in the literature. The other thread, which led to finite criss-cross methods, was the intellectual effort to find finite variants of the simplex method other than the lexicographic rule [4,8]. These efforts were also stimulated by studying the combinatorial structures behind linear programming. From the early 1970s, finitely convergent algorithms were published in several branches of optimization theory. In particular, A.W. Tucker [32] introduced the consistent labeling technique in the Ford–Fulkerson maximal flow algorithm, and pivot selection rules based on least-index ordering appeared, such as the Bard-type scheme for the P-matrix linear complementarity problem (K.G. Murty, [24]) and the celebrated least-index rule in linear and oriented matroid programming (R.G. Bland, [2]). A thorough survey of pivot algorithms can be found in [29].


It is remarkable that almost at the same time, in different parts of the world (China, Hungary, USA) essentially the same result was obtained independently by approaching the problem from quite different directions. Below we will refer to the standard simplex (basis) tableau. A tableau is called terminal if it gives a primal and dual optimal solution or evidence of primal or dual infeasibility/inconsistency of the problem. Terminal tableaus have the following sign structure.

Terminal tableaus

The pivot operations of all known pivot methods, including all variants of the primal and dual simplex method and Zionts' criss-cross method, have the following properties. When a primal infeasible variable is selected to leave the basis, the entering variable is selected so that after the pivot both variables involved in the pivot will be primal feasible. Analogously, when a dual infeasible variable is selected to enter the basis, the leaving variable is selected in such a way that after the pivot both variables involved in the pivot will be dual feasible. Such pivots are called admissible. The sign structure of tableaus at admissible pivots of 'type I' and 'type II' is demonstrated by the following figure.

Admissible pivot situations

Observe that, while dual (primal) simplex pivots preserve dual (primal) feasibility of the basic solution, admissible pivots do not in general. Admissible pivots extract the greedy nature of pivot selection, i.e., they 'repair primal/dual infeasibility' of the pivot variables.


The Least-Index Criss-Cross Method
The first finite criss-cross algorithm, which we call the least-index criss-cross method, was discovered independently by Y.Y. Chang [4], T. Terlaky [26,27,28] and Zh. Wang [34]; further, a strongly related general recursion was given by D. Jensen [18]. Chang presented the algorithm for positive semidefinite linear complementarity problems, Terlaky for linear optimization and for oriented matroids, and with coauthors for QP, LCP and for oriented matroid LCP [9,16,19], while Wang presented it primarily for the case of oriented matroids. The least-index criss-cross method is perhaps the simplest finite pivoting method for LO problems. This criss-cross method is a purely combinatorial pivoting method: it uses admissible pivots and traverses through different (possibly both primal and dual infeasible) bases until the associated basic solution is optimal, or evidence of primal or dual infeasibility is found. To ease the understanding, a figure is included that shows the scheme of the least-index criss-cross method.

Observe the simplicity of the algorithm:
– It can be initiated with any basis.
– No two phases are needed.
– No ratio test is used to preserve feasibility; only the signs of components in a basis tableau and a prefixed ordering of the variables determine the pivot selection.
Several finiteness proofs for the least-index criss-cross method can be found in the literature. The proofs are quite elementary; they are based on the orthogonality of the primal and dual spaces [14,26,28,29,34], on recursive argumentation [11,18,33], or on lexicographically increasing lists [11,14].

0 Let an ordering of the variables be fixed. Let T(B) be an arbitrary basis tableau (it may be neither primal nor dual feasible).
1 Let r be the minimal index i such that either $x_i$ is primal infeasible or $x_i$ has a negative reduced cost. IF there is no such r, THEN stop; the first terminal tableau is obtained, thus T(B) is optimal.
2 IF $x_r$ is primal infeasible, THEN let p := r and $q := \min\{\ell : t_{p\ell} < 0\}$. IF there is no such q, THEN stop; the second terminal tableau is obtained, thus the primal problem is infeasible. Go to Step 3.
  IF $x_r$ is dual infeasible, THEN let q := r and $p := \min\{\ell : t_{\ell q} > 0\}$. IF there is no such p, THEN stop; the third terminal tableau is obtained, thus the dual problem is infeasible. Go to Step 3.
3 Pivot on (p, q). Go to Step 1.

The least-index criss-cross rule
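The rule above translates almost line by line into code. The following dense implementation is purely didactic: it recomputes the tableau from scratch at every pivot, assumes a starting basis is supplied (it may be both primal and dual infeasible), and uses a crude tolerance in place of the exact arithmetic under which finiteness is proved.

```python
import numpy as np

def criss_cross(A, b, c, basis, tol=1e-9):
    """Least-index criss-cross method for min c^T x, Ax = b, x >= 0.

    Didactic sketch: `basis` is any starting basis (m column indices
    with A[:, basis] nonsingular). The tableau is recomputed at every
    pivot, and `tol` is a crude stand-in for exact arithmetic.
    """
    m, n = A.shape
    basis = list(basis)
    while True:
        T = np.linalg.solve(A[:, basis], A)          # current tableau
        xB = np.linalg.solve(A[:, basis], b)
        red = c - c[basis] @ T                       # reduced costs
        bad = sorted([basis[i] for i in range(m) if xB[i] < -tol] +
                     [j for j in range(n)
                      if j not in basis and red[j] < -tol])
        if not bad:                                  # first terminal tableau
            x = np.zeros(n)
            x[basis] = xB
            return x
        r = bad[0]                                   # least-index rule
        if r in basis:                               # x_r primal infeasible
            p = basis.index(r)
            cols = [j for j in range(n) if T[p, j] < -tol]
            if not cols:
                raise ValueError("primal infeasible")   # second terminal
            basis[p] = min(cols)                     # q = min{l : t_pl < 0}
        else:                                        # x_r dual infeasible
            rows = [i for i in range(m) if T[i, r] > tol]
            if not rows:
                raise ValueError("dual infeasible")     # third terminal
            p = min(rows, key=lambda i: basis[i])    # least-index leaving
            basis[p] = r
```

Note how the two exceptions correspond to the second and third terminal tableaus of Step 2, while the empty candidate list corresponds to the first (optimal) terminal tableau.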

Scheme of the least-index criss-cross method

One of the most important consequences of the finiteness of the least-index criss-cross method is the strong duality theorem of linear optimization. This gives probably the simplest algorithmic proof of this fundamental result:

Theorem 1 (Strong duality theorem) Exactly one of the following two cases occurs:
– At least one of the primal problem and the dual problem is infeasible.
– Both problems have an optimal solution and the optimal objective values are equal.


Other Interpretations
The least-index criss-cross method can be interpreted as a recursive algorithm. This recursive interpretation, and the finiteness proof based on it, can be derived from the results in [2,3,18] and can be found in [33].

Recursive Interpretation
When performing the least-index criss-cross method, at each pivot one can make a note of the larger of the two indices, r = max{p, q}, that entered or left the basis. In this list, an index must be followed by a larger one before the same index can occur anew. The recursive interpretation becomes apparent when one notes that the size of the solved subproblem (the subproblem for which a terminal tableau is obtained) is monotonically increasing. A third interpretation is based on the proof technique developed by J. Edmonds and K. Fukuda [9] and adapted by Fukuda and T. Matsui [11] to the case of the least-index criss-cross method.

Lexicographically Increasing List
Let u be a binary vector of appropriate dimension, set initially to the zero vector. In applying the algorithm, let r = max{p, q} be the larger of the two indices involved in the pivot. At each pivot, update u as follows: let $u_r = 1$ and $u_i = 0$ for all $i < r$. The remaining components of u stay unchanged. Then at each step of the least-index criss-cross method the vector u strictly increases lexicographically; thus the method terminates in a finite number of steps.
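The bookkeeping behind the lexicographic argument is tiny; the snippet below is only the u-update itself, not the pivot method.

```python
def update_u(u, p, q):
    """The u-update of the lexicographic finiteness argument:
    set u_r = 1 for r = max(p, q) and zero out all entries below r;
    u then strictly increases lexicographically at every pivot
    (read from the highest index down).
    """
    r = max(p, q)
    u[r] = 1
    for i in range(r):
        u[i] = 0
    return u
```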


Other Finite Criss-Cross Methods
Both the recursive interpretation and the hidden flexibility of pivot selection in the least-index criss-cross method make it possible to develop other finite variants. Such finite criss-cross methods, which do not rely on a fixed minimal index ordering, were developed on the basis of the finite simplex rules presented by S. Zhang [37]. These finite criss-cross rules [38] are as follows.

First-in Last-out Rule (FILO)
First, choose a primal or dual infeasible variable that has changed its basis–nonbasis status most recently. Then choose a variable in the selected row or column whose pivot entry fulfills the sign requirement of the admissible pivot selection and which has changed its basis–nonbasis status most recently. When more than one candidate occurs with the same pivot age, ties may be broken arbitrarily (e.g., randomly). This rule can easily be realized by assigning an 'age' vector u to the vector of the variables and using a pivot counter k. Initially we set k = 0 and u = 0. At each pivot, k is increased by one and the pivot coordinates of u are set equal to k. The pivot selections are then made by choosing the variable with the highest possible $u_i$ value satisfying the sign requirements.

Most Often Selected Variable Rule
First, choose a primal or dual infeasible variable that has changed its basis–nonbasis status most frequently. Then choose a variable in the selected row or column whose pivot entry fulfills the sign requirement of the admissible pivot selection and which has changed its basis–nonbasis status most frequently. When more than one candidate occurs with the same pivot age, ties may be broken arbitrarily (e.g., randomly). The most often selected rule can also be realized by assigning another 'age' vector u to the vector of the variables. Initially we set u = 0. At each pivot we increase the pivot-variable components of u by one. The pivot selections are then made by choosing the variable with the highest possible $u_i$ value satisfying the sign requirement.
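The 'age' vectors of the two rules differ only in how they are updated, as the following hypothetical bookkeeping helpers show: FILO stamps the pivot variables with a running counter, while the most-often rule increments their counters.

```python
def record_pivot_filo(u, k, p, q):
    """FILO bookkeeping: stamp both pivot variables with the pivot
    counter k; selection later prefers the admissible candidate
    with the largest stamp u_j (most recent status change)."""
    u[p] = u[q] = k
    return k + 1          # incremented pivot counter

def record_pivot_most_often(u, p, q):
    """Most-often-selected bookkeeping: increment both pivot
    variables' counters; selection prefers the admissible candidate
    with the largest count u_j."""
    u[p] += 1
    u[q] += 1
```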


Exponential and Average Behavior
The worst-case exponential behavior of the least-index criss-cross method was studied by C. Roos [25]. Roos' exponential example is a variant of the cube of V. Klee and G.J. Minty [21]. In this example the starting solution is the origin, defined by a feasible basis, and the variables are ordered so that the least-index criss-cross method follows a simplex path, i.e., without making any ratio test, feasibility of the starting basis is preserved. Another exponential example was presented by Fukuda and Namiki [12] for linear complementarity problems. Contrary to the clear result on the worst-case behavior, to date not much is known about the expected or average number of pivot steps required by finite criss-cross methods.

Best-Case Analysis of Admissible Pivot Methods
As was discussed above, and as is the case for many simplex algorithms, the least-index criss-cross method is not a polynomial-time algorithm. A question naturally arises: does there exist a polynomial criss-cross method? Unfortunately, no answer to this question is available at this moment. However, some weaker variants of this question can be answered positively. The problem is stated as follows: an arbitrary basis is given; what is the shortest admissible pivot path from this given basis to an optimal basis? For nondegenerate problems, [10] shows the existence of such an admissible pivot sequence of length at most m. The nondegeneracy assumption is removed in [15]. This result solves a relaxation of the d-step conjecture. Observe that no such result is known for feasibility-preserving, i.e., simplex algorithms. In fact, the maximum length of feasibility-preserving pivot sequences between two feasible bases is not known to be bounded by a polynomial in the size of the given LO problem.

Generalizations
Finite criss-cross methods were generalized to solve fractional linear optimization problems, large classes of linear complementarity problems (LCPs; cf. ▸ Linear Complementarity Problem) and oriented matroid programming problems (OMPs).

Fractional Linear Optimization
Fractional linear programming or, as it is frequently referred to, hyperbolic programming can be reformulated as a linear optimization problem. Thus it comes as no surprise that the least-index criss-cross method has been generalized to this class of optimization problems as well [17].

Linear Complementarity Problems
The largest solvable class of LCPs is the class of LCPs with a sufficient matrix [5,6]. The LCP least-index criss-cross method is a proper generalization of the LO criss-cross method: when the LCP arises from an LO problem, the LO criss-cross method is obtained. The least-index criss-cross method is extremely simple for the P-matrix LCP. Starting from an arbitrary complementary basis, the least-indexed infeasible variable leaves the basis and is replaced by its complementary pair. This algorithm was originally proposed in [24], and studied in [12]. The general case of sufficient LCPs was treated in [4,13,16].

Convex Quadratic Optimization
Convex quadratic optimization problems give an LCP with a bisymmetric coefficient matrix. Because a bisymmetric matrix is semidefinite and semidefinite matrices form a subclass of sufficient matrices, one obtains a finite criss-cross algorithm for convex quadratic optimization problems as well. Such criss-cross algorithms were published e.g. in [20].
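The P-matrix scheme described above (the least-indexed infeasible variable leaves and is replaced by its complementary pair) is short enough to sketch in full; the implementation below re-solves a linear system per pivot for clarity and is only guaranteed to terminate when M is a P-matrix.

```python
import numpy as np

def p_matrix_lcp(M, q, max_iter=10_000):
    """Least-index (Bard-type) scheme for the LCP
    w = Mz + q, w, z >= 0, w^T z = 0, with M a P-matrix.

    Sketch: z_basic[i] records whether z_i (rather than w_i) is
    basic; the least-indexed infeasible variable is exchanged for
    its complementary pair.
    """
    n = len(q)
    z_basic = np.zeros(n, dtype=bool)     # start with w basic: w = q
    for _ in range(max_iter):
        # Columns: e_i if w_i is basic, -M[:, i] if z_i is basic.
        A = np.where(z_basic[None, :], -M, np.eye(n))
        vals = np.linalg.solve(A, q)      # values of the basic variables
        infeas = np.where(vals < -1e-12)[0]
        if infeas.size == 0:
            return np.where(z_basic, vals, 0.0)   # the solution z
        z_basic[infeas[0]] ^= True        # complementary pivot
    raise RuntimeError("iteration limit reached")
```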


Oriented Matroids
The intense research in the 1970s on oriented matroids and oriented matroid programming [2,9] gave new insight into pivot algorithms. It became clear that although the simplex method has rich combinatorial structure, some essential results, such as the finiteness of Bland's least-index simplex rule [2], do not hold in the oriented matroid context. Edmonds and Fukuda [9] showed that it might cycle in the oriented matroid case due to the possibility of nondegenerate cycling, which is impossible in the linear case. The predecessors of finite criss-cross rules are Bland's recursive algorithm [2,3], the Edmonds–Fukuda algorithm [9], and its variants and generalizations [1,35,36,37]. All of these are variants of the simplex method in the linear case, i.e., they preserve the feasibility of the basis, but not in the oriented matroid case. In the case of oriented matroid programming only Todd's finite lexicographic method [30,31] preserves feasibility of the basis and therefore yields a finite simplex algorithm for oriented matroids. The least-index criss-cross method is a finite criss-cross method for oriented matroids [28,34]. A general recursive scheme of finite criss-cross methods is given in [18]. Finite criss-cross rules are also presented for oriented matroid quadratic programming and for oriented matroid linear complementarity problems [13,19].

See also
▸ Least-Index Anticycling Rules
▸ Lexicographic Pivoting Rules
▸ Linear Programming
▸ Pivoting Algorithms for Linear Programming Generating Two Paths
▸ Principal Pivoting Methods for Linear Complementarity Problems
▸ Probabilistic Analysis of Simplex Algorithms
▸ Simplicial Pivoting Algorithms for Integer Programming

References
1. Björner A, Las Vergnas M, Sturmfels B, White N, Ziegler G (1993) Oriented matroids. Cambridge Univ Press, Cambridge
2. Bland RG (1977) A combinatorial abstraction of linear programming. J Combin Th B 23:33–57
3. Bland RG (1977) New finite pivoting rules for the simplex method. Math Oper Res 2:103–107
4. Chang YY (1979) Least index resolution of degeneracy in linear complementarity problems. Techn Report, Dept Oper Res, Stanford Univ 14
5. Cottle R, Pang JS, Stone RE (1992) The linear complementarity problem. Acad Press, New York
6. Cottle RW, Pang J-S, Venkateswaran V (1987) Sufficient matrices and the linear complementarity problem. Linear Alg Appl 114/115:235–249
7. Dantzig GB (1963) Linear programming and extensions. Princeton Univ Press, Princeton
8. Dantzig GB, Orden A, Wolfe P (1955) Notes on linear programming: Part I – The generalized simplex method for minimizing a linear form under linear inequality restrictions. Pacific J Math 5(2):183–195
9. Fukuda K (1982) Oriented matroid programming. PhD Thesis, Waterloo Univ
10. Fukuda K, Luethi H-J, Namiki M (1997) The existence of a short sequence of admissible pivots to an optimal basis in LP and LCP. ITOR 4:273–284


11. Fukuda K, Matsui T (1991) On the finiteness of the criss-cross method. Europ J Oper Res 52:119–124
12. Fukuda K, Namiki M (1994) On extremal behaviors of Murty's least index method. Math Program 64:365–370
13. Fukuda K, Terlaky T (1992) Linear complementarity and oriented matroids. J Oper Res Soc Japan 35:45–61
14. Fukuda K, Terlaky T (1997) Criss-cross methods: A fresh view on pivot algorithms. Math Program (B) In: Lectures on Math Program, vol 79. ISMP97, Lausanne, pp 369–396
15. Fukuda K, Terlaky T (1999) On the existence of short admissible pivot sequences for feasibility and linear optimization problems. Techn Report, Swiss Federal Inst Technol
16. Hertog D den, Roos C, Terlaky T (1993) The linear complementarity problem, sufficient matrices and the criss-cross method. Linear Alg Appl 187:1–14
17. Illés T, Szirmai Á, Terlaky T (1999) A finite criss-cross method for hyperbolic programming. Europ J Oper Res 114:198–214
18. Jensen D (1985) Coloring and duality: Combinatorial augmentation methods. PhD Thesis, School OR and IE, Cornell Univ
19. Klafszky E, Terlaky T (1989) Some generalizations of the criss-cross method for the linear complementarity problem of oriented matroids. Combinatorica 9:189–198
20. Klafszky E, Terlaky T (1992) Some generalizations of the criss-cross method for quadratic programming. Math Oper Statist Ser Optim 24:127–139
21. Klee V, Minty GJ (1972) How good is the simplex algorithm? In: Shisha O (ed) Inequalities-III. Acad Press, New York, pp 159–175
22. Lemke CE (1968) On complementary pivot theory. In: Dantzig GB, Veinott AF (eds) Mathematics of the Decision Sci, Part I. Lect Appl Math 11. Amer Math Soc, Providence, RI, pp 95–114
23. Lustig I (1987) The equivalence of Dantzig's self-dual parametric algorithm for linear programs to Lemke's algorithm for linear complementarity problems applied to linear programming. SOL Techn Report, Dept Oper Res, Stanford Univ 87(4)
24. Murty KG (1974) A note on a Bard type scheme for solving the complementarity problem. Opsearch 11(2–3):123–130
25. Roos C (1990) An exponential example for Terlaky's pivoting rule for the criss-cross simplex method. Math Program 46:78–94
26. Terlaky T (1984) Egy új, véges criss-cross módszer lineáris programozási feladatok megoldására. Alkalmazott Mat Lapok 10:289–296. English title: A new, finite criss-cross method for solving linear programming problems. (In Hungarian)
27. Terlaky T (1985) A convergent criss-cross method. Math Oper Statist Ser Optim 16(5):683–690
28. Terlaky T (1987) A finite criss-cross method for oriented matroids. J Combin Th B 42(3):319–327


29. Terlaky T, Zhang S (1993) Pivot rules for linear programming: A survey on recent theoretical developments. Ann Oper Res 46:203–233
30. Todd MJ (1984) Complementarity in oriented matroids. SIAM J Alg Discrete Meth 5:467–485
31. Todd MJ (1985) Linear and quadratic programming in oriented matroids. J Combin Th B 39:105–133
32. Tucker A (1977) A note on convergence of the Ford–Fulkerson flow algorithm. Math Oper Res 2(2):143–144
33. Valiaho H (1992) A new proof of the finiteness of the criss-cross method. Math Oper Statist Ser Optim 25:391–400
34. Wang Zh (1985) A conformal elimination free algorithm for oriented matroid programming. Chinese Ann Math 8(B1)
35. Wang Zh (1991) A modified version of the Edmonds–Fukuda algorithm for LP in the general form. Asia–Pacific J Oper Res 8(1)
36. Wang Zh (1992) A general deterministic pivot method for oriented matroid programming. Chinese Ann Math B 13(2)
37. Zhang S (1991) On anti-cycling pivoting rules for the simplex method. Oper Res Lett 10:189–192
38. Zhang S (1999) New variants of finite criss-cross pivot algorithms for linear programming. Europ J Oper Res 116:607–614
39. Zionts S (1969) The criss-cross method for solving linear programming problems. Managem Sci 15(7):426–445
40. Zionts S (1972) Some empirical tests of the criss-cross method. Managem Sci 19:406–410

Cutting Plane Methods for Global Optimization
HOANG TUY
Institute Math., Hanoi, Vietnam
MSC2000: 90C26
Article Outline
Keywords
Outer Approximation
Inner Approximation
Concavity Cut
Nonlinear Cuts
References

Keywords
Cutting plane method; Outer approximation; Inner approximation; Polyhedral annexation; Concavity cut; Intersection cut; Convexity cut; Nonlinear cut; Polyblock approximation; Monotonic optimization

In solving global and combinatorial optimization problems cuts are used as a device to discard portions of the feasible set where it is known that no optimal solution can be found. Specifically, given the optimization problem min f f (x) : x 2 D  Rn g ;

(1)

if x0 is an unfit solution and there exists a function l(x) satisfying l(x0 ) > 0, while l(x)  0 for every optimal solution x, then by adding the inequality l(x)  0 to the constraint set we exclude x0 without excluding any optimal solution. The inequality l(x)  0 is called a valid cut, or briefly, a cut. Most often the function l(x) is affine: the cut is then said to be linear, and the hyperplane l(x) = 0 is called a cutting plane. However, nonlinear cuts have proved to be useful, too, for a wide class of problems. Cuts may be employed in different contexts: outer and inner approximation (conjunctive cuts), branch and bound (disjunctive cuts), or in combined form. Outer Approximation Let ˝  Rn be the set of optimal solutions of problem (2). Suppose there exists a family P of polytopes P  ˝ such that for each P 2 P a distinguished point z(P) 2 P (conceived of as some approximate solution) can be defined satisfying the following conditions: A1) z(P) always exists (unless ˝ = ;) and can be computed by an efficient procedure; A2) given any P 2 P and the associated distinguished point z = z(P), we can recognize when z 2 ˝ and if z 62 ˝, we can construct an affine function l(x) such that P0 = P \ {x: l(x)  0} 2 P , and l(z)> 0, while l(x)  0, 8 x 2 ˝, i. e. ˝  P0  P \ {z}. Under these conditions, one can attempt to solve problem (2) by the following outer approximation method (OA method) [8]: Prototype OA (outer approximation) procedure 0 Start with an initial polytope P1 2 P . Set k = 1. 1 Compute the distinguished point zk = z(Pk ) (by A1)). If z(Pk ) does not exist, terminate: the problem is infeasible. If z(Pk ) 2 ˝, terminate. Otherwise, continue. 2 Using A2), construct an affine function lk (x) such that Pk+ 1 = Pk \ {x:lk (x)  0{ 2 P and lk (x) strictly separates zk from ˝, i. e. satisfies lk (zk ) > 0; Set k

lk (x)  0 8x 2 ˝:

k + 1 and return to Step 1.

(2)

Cutting Plane Methods for Global Optimization

The algorithm is said to be convergent if it is either finite or generates an infinite sequence {zk } every cluster point of which is an optimal solution of problem (2). Usually the distinguished point zk is defined as a vertex of the polytope Pk satisfying some criterion (e. g., minimizing a given concave function). In these cases, the implementation of the above algorithm requires a procedure for computing, at iteration k, the vertex set V k of the current polytope Pk . At the beginning, V 1 is supposed to be known, while Pk+1 is obtained from Pk simply by adding one more linear constraint lk (x)  0. Using this information V k+1 can be derived from V k by an on-line vertex enumeration procedure [1]. Example 1 (Concave minimization.) Consider the problem (1) where f (x) is concave and D is a convex compact set with int D 6D ;. Assume that D is defined by a convex inequality g(x)  0 and let w 2 int D. Take P to be the collection of all polytopes containing D. For every P 2 P define z := z(P) to be a minimizer of f (x) over the vertex set V of P (hence, by concavity of f (x), a minimizer of f (x) over P). Clearly, if z 2 D, it solves the problem. Otherwise, the line segment joining z to w meets the boundary of D at a unique point y and the affine function l(x) = hp, x yi + g(y) with p 2 @ g(y) strictly separates D from z (indeed, l(z) = g(z)> 0 while l(x)  g(x) g(z)+ g(z)  0 for all x 2 D. Obviously P0 = P \ {x : l(x)  0} 2 P , so Assumptions A1) and A2) are fulfilled and the OA algorithm can be applied. The convergence of the algorithm is easy to establish. Example 2 (Reverse convex programming.) Consider the problem (1) where f (x) = hc, xi, while D = {x 2 Rn : h(x)  0  g(x)} with g(x), h(x) continuous convex functions. Assume that the problem is stable, i. e. that D = cl(int D), so a feasible solution x 2 D is optimal if and only if fx 2 D : hc; x  xi  0g  fx : g(x)  0g :

(3)

Also for simplicity assume a point w is available satisfying max{h(w), g(w)}< 0 and hc, wi < min{hc, xi:h(x)  0  g(x)} (the latter assumption amounts to assuming that the constraint g(x)  0 is essential). Let ˝ be the set of optimal solutions, P the collection of all polytopes containing ˝. For every P 2 P let z = z(P) be a maximizer of g(x) over the vertex set V

C

of the polyhedron P \ }x : hc, xi   }, where  is the value of the objective function at the best feasible solution currently available (set  = +1 if no feasible solution is known yet). By (3), if g(z)  0, then  is the optimal value (for  < +1), or the problem is infeasible (for  = +1). Otherwise, g(z)> 0, and we can construct an affine function l(x) strictly separating z from ˝ as follows. Since max{h(w), g(w)}< 0 while max{h(z), g(z)}> 0 the line segment joining z, w meets the surface max{h(x), g(x)} = 0 at a unique point y. 1) If g(y) = 0 (while h(y)  0), then y is a feasible solution and since y =  w+ (1 ) z for some  2 (0, 1) we must have hc, yi =  hc, wi + (1 )hc, zi <  , so the cut l(x) = hc, xyi  0 strictly separates z from ˝. 2) If h(y) = 0, then the cut l(x) = hp, xy + h(y)  0, where p 2 @ h(y), strictly separates z from ˝ (indeed, l(x)  h(x) h(y)+ h(y) = h(x)  0 for all x 2 ˝ while l(z)> 0 because l(w)< 0, l(y) = 0). Thus assumptions A1), A2) are satisfied, and again the OA algorithm can be applied. The convergence of the OA algorithm for this problem is established by a more elaborate argument than for the concave minimization problem (see [3,8]). Various variants of OA method have been developed for a wide class of optimization problems, since any optimization problem described by means of differences of convex functions can be reduced to a reverse convex program of the above form [3]. However, a difficulty with this method when solving large scale problems is that the size of the vertex set V k of Pk may grow exponentially with k, creating serious storage problems and making the computation of V k almost impracticable. Inner Approximation Consider the concave minimization problem under linear constraints, i. e. the problem (2) when f (x) is a concave function and D is a polytope in Rn . Without loss of generality we may assume that 0 is a vertex of D. For any real number   f (0), the set C = {x 2 Rn }{f (x)   } is convex and 0 2 D \ C . Of course, D  C if and only if f (x)   for all x 2 D. The idea of the inner approximation method (IA method), also called the polyhedral annexation method (or PA method)[3], is to construct a sequence of expanding polytopes P1  P2     together with a nonin-

591

592

C

Cutting Plane Methods for Global Optimization

creasing sequence of real numbers  1   2     , such that  k 2 f (D), Pk  C k , k = 1, 2, . . . , and eventually D  Ph for some h: then  h  f (x) for all x 2 D, i. e.  h will be the optimal value. For every set P  Rn let P° be the polar of P, i. e. P° = {y 2 Rn :hy, xi  1, 8 x 2 P. As is well known P° is a closed convex set containing 0 (in fact a polyhedron if P is a polyhedron), and P  Q only if P°  Q°; moreover, if C is a closed convex set containing 0, then (C°)° = C. Therefore, setting Sk = (Pk )°, the IA method amounts to constructing a sequence of nested polyhedra S1      Sh satisfying S°k  C k , k = 1, . . . , h and Sh  D°. The key point in this scheme is: Given  k 2 f (D) and a polyhedron Sk such that Sk °  C k , check whether Sk  D° and if there is yk 2 Sk \D°, then construct a cut lk (y)  1 to exclude yk and to form a smaller polyhedron Sk+ 1 such that Sk+1 °  C k+1 for some  k+1 2 f (D) satisfying  k+1   k . To deal with this point, define s(y) = max{hy, xi : x 2 D}. Since y 2 D° whenever s(y)  1 we will have Sk  D° whenever max fs(y) : y 2 S k g  1:

(4)

But clearly the function s(y) is convex as the pointwise maximum of a family of linear functions. Therefore, denoting the vertex set and the extreme direction set of Sk by V k , U k , respectively, we will have (4) (i. e. Sk  D°) whenever ( max fs(y) : y 2 Vk g  1; (5) max fs(y) : y 2 U k g  0: Thus, checking the inclusion Sk  D° amounts to checking (5), a condition that fails to hold in either of the following cases: s(y k ) > 1 for some y k 2 Vk

(6)

s(y k ) > 0 for some y k 2 U k :

(7)

In each case, it can be verified that if xk maximizes hyk , xi over D, and  k+1 = min{ k , f (xk )} while n o  k D sup  : f ( x k )   kC1 ; then Sk+1 = Sk \ {y : hxk , yi  1/ k } satisfies PkC1 :D S ıkC1 D conv(Pk [ f k x k g)  CkC1 :

In the case (6), Sk+1 no longer contains yk while in the case (7), yk is no longer an extreme direction of Sk+1 . In this sense, the cut hxk , y  1/ k excludes yk . We can thus state the following algorithm. IA Algorithm (for concave minimization) 0 By translating if necessary, make sure that 0 is a vertex of D. Let x 1 be the best basic feasible solution available,  1 = f (x1 ). Take a simplex P1 C 1 and let S1 = P1 °, V 1 = vertex set of S1 , U1 = extreme direction set of S1 . Set k = 1. 1 Compute s(y) for every new y 2 (V k [ Uk )\{0}. If (5) holds, then terminate: Sk D° so x k is a global optimal solution. 2 If (6) or (7) holds, then let ˚˝ ˛ xk 2 arg max yk ; x : x 2 D : Update the current best feasible solution by comparing xk and x k . Set kC1 D f (x kC1 ). 3 Compute k = max{  1:f ( xk )   k+1 } and let o n ˝ ˛ SkC1 D Sk \ y : xk ; y  1k : From V k and Uk derive the vertex set V k+1 and the extreme direction set Uk+1 of Sk+1 . Set k k+ 1 and go to Step 1.

It can be shown that the IA algorithm is finite [3]. Though this algorithm can be interpreted as dual to the OA algorithm, its advantage over the OA method is that it can be started at any vertex of D, so that each time the set V k has reached a certain critical size, it can be stopped and ‘restarted’ at a new vertex of D, using the last obtained best value of f (x) as the initial  1 . In that way the set V k can be kept within manageable size. Note that if D is contained in a cone M and P1 = {x 2 M:hv1 , xi  1}  C 1 , then it can be shown that (7) automatically holds, and only (6) must be checked [6]. Concavity Cut The cuts mentioned above are used to separate an unfit solution from some convex set containing at least one optimal solution. They were first introduced in convex programming [2,4]. Another type of cuts originally devised for concave minimization [7] is the following. Suppose that a feasible solution x has already been known with f (x) D  and we would like to check whether there exists a better feasible solution. One way to do that is to take a vertex x0 of D with f (x0 ) >  and to construct a cone M, as small as possible, vertexed at x0 , containing D and having exactly n edges. Since x0 is interior to the convex set C = {x : f (x)   }, each

Cutting Plane Methods for Global Optimization

ith edge of M, for i = 1, . . . , n, meets the boundary of C at a uniquely defined point yi (assuming that C is bounded). Through these n points y1 , . . . , yn (which are affinely independent) one can draw a unique hyperplane, of equation (x x0 ) = 1 such that (yi  x0 ) = 1 (i = 1, . . . , n), hence = e| U 1 , where U is the matrix of columns y1  x0 , . . . , yn  x0 and e denotes a vector of n ones. Since the linear inequality >

e U

1

0

(x  x )  1

(8)

excludes x0 without excluding any feasible solution x better than x, this inequality defines a valid cut. In particular, if it so happens that the whole polytope D is cut off, i. e. if ˚ D  x : e > U 1 (x  x 0 )  1 ;

C

Nonlinear Cuts In many problems, nonlinear cuts arise in a quite natural way. For example, consider the following problem of monotonic optimization [10]: ˚ n ; max f (x) : g(x)  1; h(x)  1; x 2 RC

(10)

n where f , g, h are continuous increasing functions on RC n (a function f (x) is said to be increasing on RC if 0  x  x0 ) f (x)  f (x0 ); the notation x  x0 means xi  x0i for all i while x < x0 means xi < x0i for all i). As argued in [10], a very broad class of optimization problems can be n g(x)  1}, H = cast in the form (10). Define G = {x 2 RC n : h(x)  1}, so that the problem is to maximize {x 2 RC f (x) over the feasible set G \ H. Clearly

(9)

0  x  x0 2 G

)

x 2 G;

(11)

then x is a global optimal solution. This cut is often referred to as a  -valid concavity cut for (f , D) at x0 [3]. Its construction requires the availability of a cone M  S vertexed at x0 and having exactly n edges. In particular, if the vertex x0 of D has exactly n neighboring vertices then M can be taken to be the cone generated by the n halflines from x0 through each of these neighbors of x0 . Note, however, that the definition of the concavity cut can be extended so that its construction is possible even when the cone M has more than n edges (as e. g., when x0 is a degenerated vertex of D). Condition (9), sufficient for optimality, suggests a cutting method for solving the linearly constrained concave minimization problem by using concavity cuts to iteratively reduce the feasible polyhedron. Unfortunately, experience has shown that concavity cuts, when applied repeatedly, tend to become shallower and shallower. Though these cuts can be significantly strengthened by exploiting additional structure of the problem (e. g., in concave quadratic minimization, bilinear programming [5] and also in low rank nonconvex problems [6]), pure cutting methods are often outperformed by branch and cut methods where cutting is combined with successive partition of the space [8]. Concavity cuts have also been used in combinatorial optimization (‘intersection cuts’, or in a slightly extended form, ‘convexity cuts’).

0  x  x0 … H

)

x … H:

(12)

Assume that g(0) < 1 and 0 < a  x  b for all x 2 G \ H (so 0 2 int G, b 2 H). From (11) it follows that if z n \ G and (z) is the last point of G on the halfline 2 RC n : x> from 0 through z, then the cone K (z)} = {x 2 RC

(z)} separates z from G, i. e. G \ K  (z) = ;, while z 2 K  (z). A set of the form P = [y 2 V {x: 0  y}, where V is n , is called a polyblock of vertex set V a finite subset of RC [9]. A vertex v is said to be improper if v  v0 for some v0 2 V \ {v}. Of course, improper vertices can be dropped without changing P. Also if P  G \ H then the polyblock of vertex set V 0 = V \ H still contains G \ H because v 62 H implies that [0, v] \ H = ;. With these properties in mind we can now describe the polyblock approximation procedure for solving (10). Start with the polyblock P1 = [0, b]  G \ H and its vertex set V 1 = {b}  H. At iteration k we have a polyblock Pk  G \ H with vertex set V k  H. Let yk 2 arg max{f (x) : x 2 V k }. Clearly yk maximizes f (x) over Pk , and yk 2 H, so if yk 2 G then yk is an optimal solution. If yk 62 G then the point xk = (yk ) determines a cone K x k such that the set Pk+1 = Pk \ K x k excludes yk but still contains G \ H. It turns out that Pk+1 is a polyblock whose vertex set V k+1 is obtained from V k by adding n points vk, 1 , . . . , vk, n (which are the n vertices of the hyperrectangle [xk , yk ] adjacent to yk ) and then dropping

593

594

C

Cutting-Stock Problem

all those which do not belong to H. With this polyblock Pk+1 , we pass to iteration k+1. In that way we generate a nested sequence of polyblocks P1  P2      G \ H. It can be proved that either yk is an optimal solution at some iteration k or f (yk ) &  := max{f (x) : x 2 G \ H}. A similar method can be developed for solving the problem ˚ n min f (x) : g(x)  1; h(x)  1; x 2 RC by interchanging the roles of g, h and a, b. In contrast with what happens in OA methods, the vertex set V k of the polyblock Pk in the polyblock approximation algorithm is extremely easy to determine. Furthermore this method admits restarts, which provide a way to prevent stall and overcome storage difficulties when solving large scale problems [10]. References 1. Chen P, Hansen P, Jaumard B (1991) On-line and offline vertex enumeration by adjacent lists. Oper Res Lett 10:403–409 2. Cheney EW, Goldstein AA (1959) Newton’s method for convex programming and Tchebycheff approximation. Numerische Math 1:253–268 3. Horst R, Tuy H (1996) Global optimization: deterministic approaches, 3rd edn. Springer, Berlin 4. Kelley JE (1960) The cutting plane method for solving convex programs. J SIAM 8:703–712 5. Konno H (1976) A cutting plane algorithm for solving bilinear programs. Math Program 11:14–27 6. Konno H, Thach PT, Tuy H (1997) Optimization on low rank nonconvex structures. Kluwer, Dordrecht 7. Tuy H (1964) Concave programming under linear constraints. Soviet Math 5:1437–1440 8. Tuy H (1998) Convex analysis and global optimization. Kluwer, Dordrecht 9. Tuy H (1999) Normal sets, polyblocks and monotonic optimization. Vietnam J Math 27(4):277–300 10. Tuy H (2000) Monotonic optimization: Problems and solution approaches. SIAM J Optim 11(2):464–494

Cutting-Stock Problem ANDRÁS PRÉKOPA1 , CSABA I. FÁBIÁN2 1 RUTCOR, Rutgers Center for Operations Research, Piscataway, USA 2 Eötvös Loránd University, Budapest, Hungary

MSC2000: 90B90, 90C59 Article Outline Keywords See also References Keywords Cutting-stock problem; Cutting pattern; Column generation; Knapsack problem A company that produces large rolls of paper, textile, steel, etc., usually faces the problem of how to cut the large rolls into smaller rolls, called finished rolls, in such a way that the demands for all finished rolls be satisfied. Any large roll is cut according to some cutting pattern and the problem is to find the cutting patterns to be used and to how many large rolls they should be applied. We assume, for the sake of simplicity, that each large roll has width W, an integer multiple of some unit and the finished roll widths are also specified by some integers w1 , . . . , wm . Let aij designate the number of rolls of width wi produced by the use of the jth pattern, i = 1, . . . , m, j = 1, . . . , n. Let further bi designate the demand for roll i, i = 1, . . . , m, and cj = 1, j = 1, . . . , n. If A = (aij ), b = (b1 , . . . , bm )| , c = (c1 , . . . , cn )| , then the problem is: 8 > ˆ ˆ

B . Now, if a = (a1 , . . . , am ) 2 ZC | satisfies the inequality w a  W, then, by definition, a represents a cutting pattern, a column of the matrix A. Since the cutting-stock problem is a minimization problem, the basis B is optimal if  | a  1 for any a that satisfies w| a  W. We can check it by solving the linear program: 8 ˆ ˆmin  > a < s.t. ˆ ˆ :

w> a  W

m : a 2 ZC

If the optimum value is greater than 1, then the optimal a vector may enter the basis, otherwise B is an optimal basis and xB is an optimal solution to the problem. The problem to find the vector a is a knapsack problem for which efficient solution methods exist. In practice, however, frequently more complicated cutting-stock problems come up, due to special customer requirements depending on quality and other characteristics. In addition, we frequently need to include set up costs, capacity constraints and costs due to delay in manufacturing. These lead to the development of special algorithms as described in [1,4,5,6,7]. Recently Cs.I. Fábián [2] formulated stochastic variants of the cutting-stock problem, for use in fiber manufacturing. See also  Integer Programming References 1. Dyckhoff H, Kruse HJ, Abel D, Gal T (1985) Trim loss and related problems. OMEGA Internat J Management Sci 13: 59–72 2. Fábián CsI (1998) Stochastic programming model for optical fiber manufacturing. RUTCOR Res Report 34–98 3. Gilmore PC, Gomory RE (1961) A linear programming approach to the cutting stock problem. Oper Res 9:849–859 4. Gilmore PC, Gomory RE (1963) A linear programming approach to the cutting stock problem, Part II. Oper Res 11:863–888 5. Gilmore PC, Gomory RE (1965) Multistage cutting stock problems of two and more dimensions. Oper Res 13:94–120 6. Johnson MP, Rennick C, Zak E (1998) Skiving addition to the cutting stock problem in the paper industry. SIAM Rev 39(3):472–483

C

7. Nickels W (1988) A knowledge-based system for integrated solving cutting stock problems and production control in the paper industry. In: Mitra G (ed) Mathematical Models for Decision Support. Springer, Berlin

Cyclic Coordinate Method CCM VASSILIOS S. VASSILIADIS, RAÚL CONEJEROS Chemical Engineering Department, University Cambridge, Cambridge, UK MSC2000: 90C30 Article Outline Keywords See also References Keywords Cyclic coordinate search; Line search methods; Pattern search; Aitken double sweep method; Gauss–Southwell method; Nondifferentiable optimization Often the solution of multivariable optimization problems it is desired to be done with a gradient-free algorithm. This may be the case when gradient evaluations are difficult, or in fact gradients of the underlying optimization method do not exist. Such a method that offers this feature is the method of the cyclic coordinate search and its variants. The minimization problem considered is: min f (x): x

The method in its basic form uses the coordinate axes as the search directions. In particular, the search directions d(1) , . . . , d(n) , where the d(i) are vectors of zeros, except for a 1 in the ith position. Therefore along each search direction d(i) the corresponding variable xi is changed only, with all remaining variables being kept constant to their previous values. It is assumed here that the minimization is carried out in order over all variables with indices 1, . . . , n at each iteration of the algorithm. However there are

595

596

C

Cyclic Coordinate Method

variants. The first of these is the Aitken double sweep method, which processes first the variables in the order mentioned above, and then in the second sweep returns in reverse order, that is n 1, . . . , 1. The second variant is termed the Gauss–Southwell method [2], according to which the component (variable) with largest partial derivative magnitude in the gradient vector is selected for line searching. The latter requires the availability of first derivatives of the objective function. The algorithm of the cyclic coordinate method can be summarized as follows: 1. Initialization Select a tolerance > 0, to be used in the termination criterion of the algorithm. Select an initial point x(0) and initialize by setting z(1) = x(0) . Set k = 0 and i = 1. 2. Main iteration Let ˛ i (scalar variable) be the optional solution to the line search problem of minimizing f (z(i) + ˛d i ). Set z(i+1) = z(i) + ˛ i d(i) . If j < n, then increase i to i + 1 and repeat step 2. Otherwise, if j = n, then go to step 3. 3. Termination check Set x k+1 = z(n) . If the termination criterion is satisfied, for example jjx(k+1)  x(k) jj  , then stop. Else, set z(1) = x(k+1) . Increase k to k + 1, set i = 1 and repeat step 2. The steps above outline the basic cyclic coordinate method, the Aitken and Gauss–Southwell variants can be easily included by modifying the main algorithm.

In terms of convergence rate comparisons, D.G. Luenberger [3] remarks that such comparisons are not easy. However, an interesting analysis presented there indicates that roughly n  1 coordinate searches can be as effective as a single gradient search. Unless the variables are practically uncoupled from one another then coordinate search seems to require approximately n line searches to bring about the same effect as one step of steepest descent. It can generally be proved that the cyclic coordinate method, when applied to a differentiable function, will converge to a stationary point [1,3]. However, when differentiability is not present then the method can stall at a suboptimal point. Interestingly there are ways to overcome such difficulties, such as by applying at every pth iteration (a heuristic number, user specified) the search direction x(k+ 1)  x(k) . This is even applied in practice for differentiable functions, as it is found to be helpful in accelerating convergence. These modifications are referred to as acceleration steps or pattern searches. See also  Powell Method  Rosenbrock Method  Sequential Simplex Method References 1. Bazaraa MS, Sherali HD, Shetty CM (1993) Nonlinear programming, theory and algorithms. Wiley, New York 2. Forsythe GE (1960) Finite difference methods for partial differential equations. Wiley, New York 3. Luenberger DG (1984) Linear and nonlinear programming, 2nd edn. Addison-Wesley, Reading, MA

Data Envelopment Analysis

D

D

Data Envelopment Analysis DEA R. DE LEONE Dip. Mat. e Fisica, University degli Studi di Camerino, Camerino, Italy MSC2000: 90B50, 90B30, 91B82, 90C05 Article Outline Keywords See also References Keywords DEA; Comparative efficiency assessment; Linear programming Data envelopment analysis (DEA) is a novel technique based on linear programming for evaluating the relative performance of similar units, referred to as decision making units (DMUs). The system under evaluation consists of n DMUs: each DMU consumes varying amount of m1 different inputs (resources) to produce m2 different outputs (products). Specifically, the jth DMU is characterized by the input vector xj > 0 and the output vector yj > 0. The aim of DEA is to discern, for each DMU, whether or not is operating in an efficient way, given its inputs and outputs, relative to all remaining DMUs under consideration. The measure of efficiency is the ratio of a weighted sum of the outputs to a weighted sum of the inputs. For each DMU, the weights are different and obtained by solving a linear programming problem with the objective of showing the DMU in the best possible light.

The ability to deal directly with incommensurable inputs and outputs, the possibility of each DMU of adopting a different set of weights and the focus on individual observation in contrast to averages are among the most appealing features of model based on DEA. A process is defined output-efficient if there is no other process that, using the same or smaller amount of inputs, produces higher level of outputs. A process is defined input-efficient if there is no other process that produces the same or higher level of outputs, using smaller amount of inputs. For each orientation there are four possible models: 1) the ‘constant returns’ model; 2) the ‘variable returns’ model; 3) the ‘increasing returns’ model; 4) the ‘decreasing returns’ model. Each model is defined by a specific set of economic assumptions regarding the relation between inputs and outputs [10,11]. Associated with each of the four DEA models, independent of the orientation, there is a production possibility set, that is, the set of all possible inputs and outputs for the entire system. This set consists of the n DMUs and of ‘virtual’ DMUs obtained as linear combination of the original data. The efficient frontier is a subset of the boundary points of this production set. The objective of DEA is to determine if the DMU under evaluation lies on the efficient frontier and to assign a score based on the distance from this frontier [6]. The production set for the ‘constant returns’ model is 8 9 P x   njD1 x j  j ; ˆ > ˆ > ˆ > ˆ > ˆ > P ˆ > n j < = y   jD1 y  j ; ; T1 D (x; y)  0 : Pn ˆ > ˆ > ˆ > 8 j  0; ˆ jD1  j D 1; > ˆ > ˆ > : ; >0

597

598

D

Data Envelopment Analysis

while for the ‘variable returns’ model we have 8 P x   njD1 x j  j ; ˆ ˆ < P y   njD1 y j  j ; T2 D (x; y)  0 : ˆ ˆ Pn : 8  0;  D1 j

jD1

j

9 > > = > > ;

For the output-oriented case the dual is: :

The production sets for the ‘increasing’ (resp. ‘decreasing’) returns models are similar to the set T 2 above P with the equality constraint njD1 j = 1 replaced by the P Pn inequality jD1 j  1 (resp. njD1 j  1). The ‘constant returns, input oriented’ envelopment LP is given next: 8 ˆ min  ˆ ˆ

; 0 ˆ ˆ ˆ n ˆ X ˆ  x j C  > y j C ˇ  0

j ˆ ˆ

x 1 ˆ ˆ :

 0;   0; where ˇ = 0, ˇ unrestricted, ˇ  0 and ˇ  0 for the constant, variable, increasing and decreasing return DEA models.

8 min ˆ ˆ ˆ ;;ˇ ˆ ˆ ˆ ˆ x j C ˇ

>x j  > y j C ˇ  0 j D 1; : : : ; n > y

j

 0;

(4)

1  0

with ˇ = 0, ˇ unrestricted ˇ  0 and ˇ  0 for the constant, variable, increasing and decreasing returns DEA models. For the ‘input-oriented, constant returns’ case, the reference DMU j is  inefficient if – the optimal value of problem (1) is different from 1, or – the optimal value of Problem (1) is equal to 1 but there exists an optimal solution with at least one slack variable strictly positive;  efficient in the remaining cases. Moreover the efficient DMU j can be  extreme-efficient if Problem (1) has the unique solution j = 1, j = 0, j = 1, . . . , n, j 6D j ;  nonextreme efficient when Problem (1) has alternate optimal solutions. The efficiency for the other models is defined in a similar manner. The conditions   0 and  0 can be introduced without loss of generality in (1) and (2) since only nonnegative values for these variables are possible given our assumption on the data. Since j  = 1, j = 0 for j 6D j ,   = 1, and j  = 1, j = 0 for j 6D j ,  = 1 are always feasible for (1) and (2), respectively, the optimal objective function value lies in the interval (0, 1] for the input orientation case and [1, 1) for the output orientation case. The linear programs (1) and (3) above can be interpreted in the following way. In the input-oriented case, we compare the reference DMU j with a ‘virtual’ DMU obtained as linear combination of the original DMUs. Each input and output of this virtual DMU is a linear combination of the corresponding component of the inputs and outputs of all the DMUs. The optimal value is, in this case, always less than or equal to 1. If the optimal value is strictly less than 1, then it is possible to construct a virtual DMU that produces at least the

Data Envelopment Analysis

same amount of outputs as the reference DMU using an amount of inputs that is strictly smaller than amount used by the j th DMU. When this happens we declare the DMU j inefficient. Instead, when the optimal value is equal to 1 there are three possible cases:  there exists an optimal solution with at least one slack variable strictly positive;  the optimal solution is unique;  there exists multiple optimal solutions. In the first case we declare the reference DMU inefficient. In the last two cases the j th DMU is efficient (extreme-efficient, respectively nonextreme efficient). For Problem (3), the optimal solution  and   represent the weights that are the most favorable for the reference DMU, i. e., the weights that produce the highest efficiency score under the hypothesis that, using the same weights for the other DMUs, the efficiency remains always below 1. Similar interpretations can be given for the outputoriented case for Problems (2) and (4). In Fig. 1 it is represented the production possibility set and the efficient frontier for the five DMUs ‘A’ to ‘E’. These DMUs are characterized by two different inputs and a single output value set to some fixed value. All the DMUs are efficient but the DMU ‘E’. The DMU ‘B’ is efficient but nonextreme. The virtual DMU ‘K’, obtained as convex combination of the DMUs ‘C’ and ‘D’, is more efficient than the DMU ‘E’. The optimal value   for the linear programming problem (1) for the DMU ‘E’ is exactly the ratio of the lengths of the segments OE and OK.

Data Envelopment Analysis, Figure 1 Two-input, single output DMUs

D

Data Envelopment Analysis, Figure 2 Two-output, single input DMUs

Figure 2 shows the case of DMUs characterized by two distinct outputs and a single input set to a fixed value. All the DMUs are efficient except the DMU ‘F’ that is dominated by the virtual DMU ‘K’. The optimal value  for the linear programming problem (2) for the DMU ‘F’ is the ratio of the lengths of the segments OE and OK. The original ‘constant returns’ model was proposed in [4]. In [2] the variable returns model was proposed with the objective of discriminating between technical efficiency and scale efficiency. The bibliography published in [7] (part of [3]) contains more than 500 references to published article in the period 1978–1992 and many more articles appeared since. In all the DEA models discussed above, all efficient DMUs receive an equal score of 1. An important modification proposed in [1] allows to rank efficient units. The main idea is to exclude the column being scored from the DEA envelopment LP technology matrix. The efficiency score is now a value between (0, +1] in both orientations. In [5] are discussed in detail the issues (infeasibility, relationship between modified and standard formulation, degeneracy, interpretation of the optimal solutions) related to these DEA models.

599

600

D

Data Mining

In [8] and [9] the properties of ‘unit invariance’ (independence of the units in which inputs and outputs are measured) and ‘translation invariance’ (independence of an affine translation of the inputs and the outputs) of an efficiency DEA measure are discussed. The translation invariance property is particularly important when data contain zero or negative values. Standard DEA models are not unit invariant and translation invariant. In [8] it is proposed a weighted additive DEA model that satisfies these properties: 8 m1 m2 X X ˆ ˆ C C ˆ min w s C wr s ˆ r i i ˆ C ;s  ˆ ;s ˆ rD1 iD1 ˆ ˆ n ˆ X ˆ j j ˆ ˆ s.t. x i  j C sC ˆ i D xi ˆ ˆ ˆ jD1 ˆ ˆ ˆ ˆ ˆ i D 1; : : : ; m1 ˆ ˆ ˆ n ˆ X ˆ j j ˆ < yr  j  s r D yr (5) jD1 ˆ ˆ ˆ ˆ r D 1; : : : ; m2 ˆ ˆ ˆ n ˆ X ˆ ˆ ˆ j D 1 ˆ ˆ ˆ ˆ jD1n ˆ ˆ ˆ ˆ ˆ  j  0; j D 1; : : : ; n; ˆ ˆ ˆ ˆ ˆ i D 1; : : : ; m1 ; sC ˆ i  0; ˆ ˆ :  s  0; r D 1; : : : ; m : r

2. Banker RD, Charnes A, Cooper WW (1984) Some models for estimating technological and scale inefficiencies in data envelopment analysis. mansci 30:1078–1092 3. Charnes A, Cooper W, Lewin AY, Seiford LM (1994) Data envelopment analysis: Theory, methodology and applicastions. Kluwer, Dordrecht 4. Charnes A, Cooper WW, Rhodes E (1978) Measuring the efficiency of decision making units. ejor 2:429–444 5. Dulá JH, Hickman BL (1997) Effects of excluding the column being scored from the DEA envelopment LP technology matrix. jors 48:1001–1012 6. Farrell MJ (1957) The measurement of productive efficiency. J Royal Statist Soc A120:253–281 7. Seiford LM (1994) A DEA bibliography (1978–1992). In: Charnes A, Cooper W, Lewin AY, Seiford LM (eds) Data Envelopment Analysis: Theory, Methodology and Applicastions. Kluwer, Dordrecht, pp 437–469 8. Pastor JT (1994) New DEA additive model for handling zero and negative data. Techn Report Dept Estadistica e Invest. Oper Univ Alicante 9. Pastor JT, Knox Lovell CA (1995) Units invariant and translation invariant DEA models. orl 18:147–151 10. Shepard RW (1953) Cost and production functions. Princeton Univ. Press, Princeton 11. Shepard RW (1970) The theory of cost and production functions. Princeton Univ Press, Princeton

Data Mining

2

 where wC i and wr are the sample standard deviation of the inputs and outputs variables respectively. Models based on data envelopment analysis have been widely used in order to evaluate efficiency in both public and private sectors. [3, Part II] contains 15 application of DEA showing the ‘range, power, elegance and insight obtainable via DEA analysis’. Banks, hospitals, and universities are among the most challenging sectors where models based on DEA have been able to assess efficiency and determine strength and weakness of the various units.

See also  Optimization and Decision Support Systems References 1. Andersen P, Petersen NC (1993) A procedure for ranking efficient units in data envelopment analysis. mansci 10:1261–1264

DAVID L. OLSON Department of Management, University of Nebraska, Lincoln, USA Article Outline Synonyms Introduction Definitions Example Applications Customer Relationship Management (CRM) Credit Scoring Bankruptcy Prediction Fraud Detection

Ethical Issues in Data Mining Conclusions References Synonyms Data mining; Large-scale data analysis; Pattern recognition

Data Mining

D

Introduction

Definitions

Data mining has proven valuable in almost every aspect of life involving large data sets. Data mining is made possible by the generation of masses of data from computer information systems. In engineering, satellites stream masses of data down to storage systems, yielding a mountain of data that needs some sort of data mining to enable humans to gain knowledge. Data mining has been applied in engineering applications such as quality [4], manufacturing and service [13], labor scheduling [17], and many other places. Medicine has been an extensive user of data mining, both in the technical area [21] and in health policy [6]. Pardalos [16] provide recent research in this area. Governmental operations have received support from data mining, primarily in the form of fraud detection [9]. In business, data mining has been instrumental in customer relationship management [5,8], financial analysis [3,12], credit card management [1], health service debt management [22], banking [19], insurance [18], and many other areas of business involving services. Kusiak [13] reviewed data mining applications to include service applications of operations. Recent reports of data mining applications in web service and technology include Tseng and Lin [20] and Hou and Yang [11]. In addition to Tseng and Lin, Lee et al. [14] discuss issues involving mobile technology and data mining. Data mining support is required to make sense of the masses of business data generated by computer technology. Understanding this information-generation system and tools available leading to analysis is fundamental for business in the 21st century. The major applications have been in customer segmentation (by banks and retail establishments wishing to focus on profitable customers) and in fraud and rare event detection (especially by insurance and government, as well as by banks for credit scoring). Data mining has been used by casinos in customer management, and by organizations evaluating personnel. We will discuss data mining functions, data mining process, data systems often used in conjunction with data mining, and provide a quick review of software tools. Four prototypical applications are given to demonstrate data mining use in business. Ethical issues will also be discussed.

There are a few basic functions that have been applied in business. Bose and Mahapatra [2] provided an extensive list of applications by area, technique, and problem type.  Classification uses a training data set to identify classes or clusters, which then are used to categorize data. Typical applications include categorizing risk and return characteristics of investments, and credit risk of loan applicants. The Adams [1] case, for example, involved classification of loan applications into groups of expected repayment and expected problems.  Prediction identifies key attributes from data to develop a formula for prediction of future cases, as in regression models. The Sung et al. [19] case predicted bankruptcy while the Drew et al. [5]) case and the customer retention part of the Smith et al. [18] case predicted churn.  Association identifies rules that determine the relationships among entities, such as in market basket analysis, or the association of symptoms with diseases. IF–THEN rules were shown in the Sung et al. [19] case.  Detection determines anomalies and irregularities, valuable in fraud detection. This was used in claims analysis by Smith et al. [18]. To provide analysis, data mining relies on some fundamental analytic approaches. Regression and neural network approaches are alternative ways to identify the best fit in a given set of data. Regression tends to have advantages with linear data, while neural network models do very well with irregular data. Software usually allows the user to apply variants of each, and lets the analyst select the model that fits best. Cluster analysis, discriminant analysis, and case-based reasoning seek to assign new cases to the closest cluster of past observations. Rule induction is the basis of decision tree methods of data mining. Genetic algorithms apply to special forms of data, and are often used to boost or improve the operation of other techniques. In order to conduct data mining analyzes, a data mining process is useful. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is widely used by industry members [15]. This model consists of six phases intended as a cyclical process:

601

602

D

Data Mining

 Business understanding: Business understanding includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a project plan.  Data understanding: Once business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and the verification of data quality. Data exploration such as viewing summary statistics (which includes the visual display of categorical variables) can occur at the end of this phase. Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the data.  Data preparation: Once the data resources available are identified, they need to be selected, cleaned, built into the form desired, and formatted. Data cleaning and data transformation in preparation for data modeling needs to occur in this phase. Data exploration at a greater depth can be applied during this phase, and additional models utilized, again providing the opportunity to see patterns based on business understanding.  Modeling: Data mining software tools such as visualization (plotting data and establishing relationships) and cluster analysis (to identify which variables go well together) are useful for initial analysis. Tools such as generalized rule induction can develop initial association rules. Once greater data understanding is gained (often through pattern recognition triggered by viewing model output), more detailed models appropriate to the data type can be applied. The division of data into training and test sets is also needed for modeling (sometimes even more sets are needed for model refinement).  Evaluation: Model results should be evaluated in the context of the business objectives established in the first phase (business understanding). This will lead to the identification of other needs (often through pattern recognition), frequently reverting to prior phases of CRISP-DM. Gaining business understanding is an iterative procedure in data mining, where the results of various visualization, statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of organizational operations.

 Deployment: Data mining can be used both to verify previously held hypotheses, and for knowledge discovery (identification of unexpected and useful relationships). Through the knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be obtained that may then be applied to business operations for many purposes, including prediction or identification of key situations. These models need to be monitored for changes in operating conditions, because what might be true today may not be true a year from now. If significant changes do occur, the model should be redone. It is also wise to record the results of data mining projects so documented evidence is available for future studies. This six-phase process is not a rigid, by-the-numbers procedure. There is usually a great deal of backtracking. Additionally, experienced analysts may not need to apply each phase for every study. But CRISPDM provides a useful framework for data mining. There are many database systems that provide content needed for data mining. Database software is available to support individuals, allowing them to record information that they consider personally important. They can extract information provided by repetitive organizational reports, such as sales by region within their area of responsibility, and regularly add external data such as industry-wide sales, as well as keep records of detailed information such as sales representative expense account expenditure.  Data warehousing is an orderly and accessible repository of known facts and related data that is used as a basis for making better management decisions. Data warehouses provide ready access to information about a company’s business, products, and customers. This data can be from both internal and external sources. Data warehouses are used to store massive quantities of data in a manner that can be easily updated and allow quick retrieval of specific types of data. Data warehouses often integrate information from a variety of sources. Data needs to be identified and obtained, cleaned, catalogued, and stored in a fashion that expedites organizational decision making. Three general data warehouse processes exist. (1) Warehouse generation is the process of designing the warehouse and loading data. (2) Data management is the process of storing the

Data Mining

data. (3) Information analysis is the process of using the data to support organizational decision making.  Data marts are sometimes used to extract specific items of information for data mining analysis. Terminology in this field is dynamic, and definitions have evolved as new products have entered the market. Originally, many data marts were marketed as preliminary data warehouses. Currently, many data marts are used in conjunction with data warehouses rather than as competitive products. But also many data marts are being used independently in order to take advantage of lower-priced software and hardware. Data marts are usually used as repositories of data gathered to serve a particular set of users, providing data extracted from data warehouses and/or other sources. Designing a data mart tends to begin with the analysis of user needs. The information that pertains to the issue at hand is relevant. This may involve a specific time-frame and specific products, people, and locations. Data marts are available for data miners to transform information to create new variables (such as ratios, or coded data suitable for a specific application). In addition, only that information expected to be pertinent to the specific data mining analysis is extracted. This vastly reduces the computer time required to process the data, as data marts are expected to contain small subsets of the data warehouse’s contents. Data marts are also expected to have ample space available to generate additional data by transformation.  Online analytical processing (OLAP) is a multidimensional spreadsheet approach to shared data storage designed to allow users to extract data and generate reports on the dimensions important to them. Data is segregated into different dimensions and organized in a hierarchical manner. Many variants and extensions are generated by the OLAP vendor industry. A typical procedure is for OLAP products to take data from relational databases and store them in multidimensional form, often called a hypercube, to reflect the OLAP ability to access data on these multiple dimensions. Data can be analyzed locally within this structure. One function of OLAP is standard report generation, including financial performance analysis on selected dimensions (such as by department, geographical region, product, salesperson, time, or other dimensions desired by the

D

analyst). Planning and forecasting are supported through spreadsheet analytic tools. Budgeting calculations can also be included through spreadsheet tools. Usually, pattern analysis tools are available. There are many statistical and analytic software tools marketed to provide data mining. Many good data mining software products are being used, including the well-established (and expensive) Enterprise Miner by SAS and Intelligent Miner by IBM, CLEMENTINE by SPSS (a little more accessible by students), PolyAnalyst by Megaputer, and many others in a growing and dynamic industry. For instance, SQL Server 2005 has recently been vastly improved by Microsoft, making a more usable system focused on the database perspective. These products use one or more of a number of analytic approaches, often as complementary tools that might involve initial cluster analysis to identify relationships and visual analysis to try to understand why data clustered as it did, followed by various prediction models. The major categories of methods applied are regression, decision trees, neural networks, cluster detection, and market basket analysis. The Web site www. KDnuggets.com gives information on many products, classified by function. In the category of overall data mining suites, they list 56 products in addition to 16 free or shareware products. Specialized software products were those using multiple approaches (15 commercial plus 3 free), decision tree (15 plus 10 free), rulebased (7 plus 4 free), neural network (12 plus 3 free), Bayesian (13 plus 11 free), support vector machines (3 plus 8 free), cluster analysis (8 plus 10 free), text mining (50 plus 4 free), and other software for functions such as statistical analysis, visualization, and Web usage analysis. Example Applications There are many applications of data mining. Here we present four short examples in the business world. Customer Relationship Management (CRM) The idea of customer relationship management is to target customers for special treatment based on their anticipated future value to the firm. This requires estimation of where in the customer life-cycle each subject is, as well as lifetime customer value, based on expected

603

604

D

Data Mining

tenure with the company, monthly transactions by that customer, and the cost of providing service. Lifetime value of a customer is the discounted expected stream of cash flow generated by the customer. Many companies applying CRM score each individual customer by their estimated lifetime value (LTV), stored in the firm’s customer database [5]. This concept has been widely used in catalog marketing, newspaper publishing, retailing, insurance, and credit cards. LTV has been the basis for many marketing programs offering special treatment such as favorable pricing, better customer service, and equipment upgrades. While CRM is very promising, it has often been found to be less effective than hoped [10]. CRM systems can cost up to $70 million to develop, with additional expenses incurred during implementation. Many of the problems in CRM expectations have been blamed on over-zealous sales pitches. CRM offers a lot of opportunities to operate more efficiently. However, they are not silver bullets, and benefits are not unlimited. As with any system, prior evaluation of benefits is very difficult, and investments in CRM systems need to be based on sound analysis and judgment. Credit Scoring Data mining can involve model building (extension of conventional statistical model building to very large data sets) and pattern recognition. Pattern recognition aims to identify groups of interesting observations. Interesting is defined as discovery of knowledge that is important and unexpected. Often experts are used to assist in pattern recognition. Adams et al. [1] compared data mining used for model building and pattern recognition on the behavior of customers over a one-year period. The data set involved bank accounts at a large British credit card company observed monthly. These accounts were revolving loans with credit limits. Borrowers were required to repay at least some minimum amount each month. Account holders who paid in full were charged no interest, and thus not attractive to the lender. We have seen that clustering and pattern search are typically the first activities in data analysis. Then appropriate models are built. Credit scoring is a means to use the results of data mining modeling for two purposes. Application scoring was applied in the Adams

et al. example to new cases, continuing an activity that had been done manually for half a century in this organization. Behavioral scoring monitors revolving credit accounts with the intent of gaining early warnings of accounts facing difficulties. Bankruptcy Prediction Corporate bankruptcy prediction is very important to management, stockholders, employees, customers, and other stakeholders. A number of data mining techniques have been applied to this problem, including multivariate discriminant analysis, logistical regression, probit, genetic algorithms, neural networks, and decision trees. Sung et al. [19] applied decision analysis and decision tree models to a bankruptcy prediction case. Decision tree models provide a series of IF–THEN rules to predict bankruptcy. Pruning (raising the proportion of accurate fit required to keep a specific IF–THEN relationship) significantly increased overall prediction accuracy in the crisis period, indicating that data collected in the crisis period was more influenced by noise than data from the period with normal conditions. Example rules obtained were as shown in Table 1, giving an idea of how decision tree rules work. For instance, in normal conditions, if the variable Productivity of capital (E6) was greater than 19.65, the model would predict firm survival with 86 percent confidence. Conversely, if Productivity of capital (E6) was less than or equal to 19.65, and if the Ratio of cash flow to total assets (C9) was less than or equal to 5.64, the model would predict bankruptcy with 84 percent confidence. These IF–THEN rules are stated in ways that are easy for management to see and use. Here the rules are quite simple, a desirable feature. With large data sets, it is common to generate hundreds of clauses in decision tree rules, making it difficult to implement (although gaining greater accuracy). The number of rules can be controlled through pruning rates within the software. Fraud Detection Data mining has successfully supported many aspects of the insurance business, to include fraud detection, underwriting, insolvency prediction, and customer segmentation. An insurance firm had a large data warehouse system recording details on every transac-

Data Mining

D

Data Mining, Table 1 Bankruptcy Prediction Rules Condition Rule

Prediction

Normal

E6>19.65

Nonbankrupt 0.86

Confidence level

Normal

C9>5.64

Nonbankrupt 0.95

Normal

C95.64 & E619.65

Bankrupt

Crisis

E6>20.61

Nonbankrupt 0.91

Crisis

C8>2.64

Nonbankrupt 0.85

Crisis

C3>87.23

Nonbankrupt 0.86

Crisis

C82.64, E620.61, & C387.23 Bankrupt

0.84

0.82

Where C3 = Ratio of fixed assets to equity & long-term liabilities. C8 = Ratio of cash flow to liabilities. C9 = Ratio of cash flow to total assets. E6 = Productivity of capital. Based on Sung et al. [19]

tion and claim [18]. An aim of the analysis was to accurately predict average claim costs and frequency, and to examine the impact of pricing on profitability. In evaluating claims, data analysis for hidden trends and patterns is needed. In this case, recent growth in the number of policy holders led to lower profitability for the company. Understanding the relationships between cause and effect is fundamental to understanding what business decisions would be appropriate. Policy rates are based on statistical analysis assuming various distributions for claims and claim size. In this case, clustering was used to better model the performance of specific groups of insured. Profitability in insurance is often expressed by the cost ratio, or sum of claim costs divided by sum of premiums. Claim frequency ratio is the number of claims divided by the number of policy units of risk (possible claims). Profitability would be improved by lowering

the frequency of claims, or the costs of claims relative to premiums. Data was extracted from the data warehouse for policies for which premiums were paid in the first quarter over a three-year period. This meant that policies were followed over the period, augmented by new policies, and diminished by terminations. Data on each policy holder was available as well as claim behavior over the preceding year. The key variables of cost ratio and claim frequency ratio were calculated for each observation. Sample sizes for each quarter were well above 100,000. Descriptive statistics found exceptional growth in policies over the past two years for young people (under 22), and with cars insured for over $40,000. Clustering analysis led to the conclusion that the claim cost of each individual policy holder would be pointless, as the vast majority of claims could not be predicted. Af-

Data Mining, Table 2 General Ability of Data Mining Techniques to Deal with Data Features

Data characteristic           Rule induction  Neural networks         Case-based reasoning  Genetic algorithms
Handle noisy data             Good            Very good               Good                  Very good
Handle missing data           Good            Good                    Very good             Good
Process large data sets       Very good       Poor                    Good                  Good
Process different data types  Good            Transform to numerical  Very good             Transformation needed
Predictive accuracy           High            Very high               High                  High
Explanation capability        Very good       Poor                    Very good             Good
Ease of integration           Good            Good                    Good                  Very good
Ease of operation             Easy            Difficult               Easy                  Difficult

Extracted from Bose and Mahapatra [2]

After experimentation, the study was based on 50 clusters. A basic k-means algorithm was used. This identified several clusters as having abnormal cost ratios or claim frequencies. By testing over a two-year gap, the stability of each group was determined. Table 2 compares data mining techniques.
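The cluster-level profiling described above can be sketched as follows. This is an illustrative mock-up only (the data, feature choices, and the flagging threshold are invented, not from the study [18]), using scikit-learn's k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Illustrative stand-in for policy-level data: one row per policy,
# with policy-holder attributes, premium paid, claim cost, and claim count.
n_policies = 1000
features = rng.normal(size=(n_policies, 5))
premiums = rng.uniform(300, 1500, n_policies)
claim_costs = rng.gamma(2.0, 100.0, n_policies) * (rng.random(n_policies) < 0.1)
claim_counts = (claim_costs > 0).astype(int)

# Cluster policy holders (the study settled on 50 clusters after experimentation).
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(features)

# Profile each cluster by the two key ratios defined in the text.
for k in range(50):
    mask = labels == k
    cost_ratio = claim_costs[mask].sum() / premiums[mask].sum()
    claim_freq = claim_counts[mask].sum() / mask.sum()
    if cost_ratio > 0.8:  # illustrative threshold for an "abnormal" cluster
        print(f"cluster {k}: cost ratio {cost_ratio:.2f}, "
              f"claim frequency {claim_freq:.2f}")
```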

Ethical Issues in Data Mining

Data mining is a potentially useful tool, capable of doing a lot of good, not only for business but also for the medical field and for government. It does, however, bring with it some dangers. So, how can we best protect ourselves, especially in the area of business data mining? A number of options exist. Strict control of data usage through governmental regulation was proposed by Garfinkel [7]. A number of large database projects that made a great deal of practical sense have ultimately been stopped. Those involving government agencies were stopped by public exposure, the negative outcry leading to cancellation of the National Data Center and the Social Security Administration projects. A system of closely held information by credit bureaus in the 1960s was only stopped after governmental intervention, which included the passage of new laws. Times have changed, with business adopting a more responsive attitude toward consumers. Innovative data mining efforts by Lotus/Equifax and by LexisNexis were quickly stopped by public pressure alone. Public pressure seems to be quite effective in providing some control over potential data mining abuses. If that fails, litigation is available (although slow in effect). It is necessary for us to realize what businesses can do with data. There will never be a perfect system to protect us, and we need to be vigilant. However, too much control can also be dangerous, inhibiting the ability of business to provide better products and services through data mining. Garfinkel prefers more governmental intervention, while we would prefer less governmental intervention and more reliance on publicity and, if necessary, the legal system. Control would be best accomplished if it were naturally encouraged by systemic relationships. The first systemic means of control is publicity. Should those adopting questionable practices persist, litigation is a slow, costly, but ultimately effective means of system correction. However, before taking drastic action, a good rule is that if the system works, it is best not to fix it. The best measure that electronic retailers can take is to not do anything that will cause customers to suspect that their rights are being violated.

Conclusions

Data mining has evolved into a useful analytic tool in all aspects of human study, including medicine, engineering, and science. It is a necessary means of coping with the masses of data that are produced in contemporary society. Within business, data mining has been especially useful in applications such as fraud detection, loan analysis, and customer segmentation. Such applications heavily impact the service industry. Data mining provides a way to quickly gain new understanding based upon large-scale data analysis. This paper reviewed some of the applications of data mining in services. It also briefly reviewed the data mining process, some of the analytic tools available, and some of the major software vendors of general data mining products. Specific tools for particular applications are appearing with astonishing rapidity.

References
1. Adams NM, Hand DJ, Till RJ (2001) Mining for classes and patterns in behavioural data. J Oper Res Soc 52(9):1017–1024
2. Bose I, Mahapatra RK (2001) Business data mining – a machine learning perspective. Inf Manage 39(3):211–225
3. Cowan AM (2002) Data mining in finance: Advances in relational and hybrid methods. Int J Forecasting 18(1):155–156
4. Da Cunha C, Agard B, Kusiak A (2006) Data mining for improvement of product quality. Int J Prod Res 44(18/19):4027–4041
5. Drew JH, Mani DR, Betz AL, Datta P (2001) Targeting customers with statistical and data-mining techniques. J Serv Res 3(3):205–219
6. Garfinkel MS, Sarewitz D, Porter AL (2006) A societal outcomes map for health research and policy. Am J Public Health 96(3):441–446
7. Garfinkel S (2000) Database Nation: The Death of Privacy in the 21st Century. O'Reilly & Associates, Sebastopol CA
8. Garver MS (2002) Using data mining for customer satisfaction research. Mark Res 14(1):8–12
9. Government Accounting Office (2006) Hurricanes Katrina and Rita: Unprecedented challenges exposed the individuals and households program to fraud and abuse: Actions needed to reduce such problems in future. GAO-06-1013, 9/27/2006, pp 1–110

10. Hart ML (2006) Customer relationship management: Are software applications aligned with business objectives? South African J Bus Manage 37(2):17–32
11. Hou J-L, Yang S-T (2006) Technology-mining model concerning operation characteristics of technology and service providers. Int J Prod Res 44(16):3345–3365
12. Hui W, Weigend AS (2004) Data mining for financial decision making. Decis Support Syst 37(4):457–460
13. Kusiak A (2006) Data mining: Manufacturing and service applications. Int J Prod Res 44(18/19):4175–4191
14. Lee S, Hwang C-S, Kitsuregawa M (2006) Efficient, energy conserving transaction processing in wireless data broadcast. IEEE Trans Knowl Data Eng 18(9):1225–1238
15. Olson DL, Shi Y (2007) Introduction to Business Data Mining. McGraw-Hill/Irwin, Englewood Cliffs, NJ
16. Pardalos PM, Boginski VL, Vazacopoulos A (eds) (2007) Data Mining in Biomedicine. Springer, Heidelberg
17. Qi X, Bard JF (2006) Generating labor requirements and rosters for mail handlers. Comput Oper Res 33(9):2645–2666
18. Smith KA, Willis RJ, Brooks M (2000) An analysis of customer retention and insurance claim patterns using data mining: A case study. J Oper Res Soc 51(5):532–541
19. Sung TK, Chang N, Lee G (1999) Dynamics of modeling in data mining: Interpretive approach to bankruptcy prediction. J Manage Inf Sys 16(1):63–85
20. Tseng VS, Lin KW (2006) Efficient mining and prediction of user behavior patterns in mobile web systems. Inf Softw Technol 48(6):357–369
21. Yamaguchi M, Kaseda C, Yamazaki K, Kobayashi M (2006) Prediction of blood glucose level of type 1 diabetics using response surface methodology and data mining. Med Biol Eng Comput 44(6):451–457
22. Zurada J, Lonial S (2005) Comparison of the performance of several data mining methods for bad debt recovery in the healthcare industry. J Appl Bus Res 21(2):37–54

D.C. Programming

HOANG TUY
Institute of Mathematics, VAST, Hanoi, Vietnam

MSC2000: 90C26

Article Outline

DC Structure in Optimization
Recognizing dc Functions
Global Optimality Criterion
Solution Methods
  An OA Method for (CDC)
  A BB Method for General DC Optimization
  DCA – A Local Optimization Approach to (DC)
Applications and Extensions
References

As optimization techniques become widely used in engineering, economics, and other sciences, an increasing number of nonconvex optimization problems are encountered that can be described in terms of dc functions (differences of convex functions). These problems are called dc optimization problems, and the theory dealing with them is referred to as dc programming, or dc optimization ([3,4,5,6,13]; see also [1,8]). Historically, the first dc optimization problem to be seriously studied was the concave minimization problem [11]. Subsequently, reverse convex programming and some other special dc optimization problems, such as quadratic and, more generally, polynomial programming problems, appeared before a unified theory was developed and the term dc optimization was introduced [12]. In fact, most global optimization problems of interest that have been studied so far can be identified as dc optimization problems, despite the diversity of the approaches used.

DC Structure in Optimization

Let $\Omega$ be a convex set in $\mathbb{R}^n$. A function $f : \Omega \to \mathbb{R}$ is said to be dc on $\Omega$ if it can be expressed as the difference of two convex functions on $\Omega$:

$$f(x) = p(x) - q(x),$$

where $p, q : \Omega \to \mathbb{R}$ are convex. Denote the set of dc functions on $\Omega$ by $DC(\Omega)$.

Proposition 1 $DC(\Omega)$ is a vector lattice with respect to the two operations of pointwise maximum and pointwise minimum. In other words, if $f_i \in DC(\Omega)$, $i = 1, \dots, m$, then:
1. $\sum_{i=1}^{m} \alpha_i f_i(x) \in DC(\Omega)$ for any real numbers $\alpha_i$;
2. $g(x) = \max\{f_1(x), \dots, f_m(x)\} \in DC(\Omega)$;
3. $h(x) = \min\{f_1(x), \dots, f_m(x)\} \in DC(\Omega)$.

From this property it follows in particular that if $f \in DC(\Omega)$, then $|f| \in DC(\Omega)$, and if $g, h \in DC(\Omega)$, then $gh \in DC(\Omega)$.
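For instance, the maximum of two dc functions $p_1 - q_1$ and $p_2 - q_2$ admits an explicit dc decomposition (a standard identity, written out here for concreteness):

$$\max\{p_1 - q_1,\ p_2 - q_2\} = \max\{p_1 + q_2,\ p_2 + q_1\} - (q_1 + q_2),$$

in which both terms on the right-hand side are convex whenever $p_1, p_2, q_1, q_2$ are.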


But for the purpose of optimization, the most important consequence is that

$$g_i(x) \le 0 \ \ \forall i = 1, \dots, m \iff g(x) := \max\{g_1(x), \dots, g_m(x)\} \le 0,$$

$$g_i(x) \le 0 \ \text{for at least one}\ i = 1, \dots, m \iff g(x) := \min\{g_1(x), \dots, g_m(x)\} \le 0.$$

Therefore, any finite system of dc inequalities, whether conjunctive or disjunctive, can be rewritten as a single dc inequality. By easy manipulations it is then possible to reduce any dc optimization problem to the following canonical form:

$$\text{minimize } f(x) \quad \text{subject to } g(x) \le 0 \le h(x), \qquad \text{(CDC)}$$

where all functions $f, g, h$ are convex. Thus dc functions allow a very compact description of a wide class of nonconvex optimization problems.

Recognizing dc Functions

To exploit the dc structure in optimization problems, it is essential to be able to recognize dc functions that are still in hidden form (i.e., not yet presented as differences of convex functions). The next proposition addresses this question.

Proposition 2 Every function $f \in C^2$ is dc on any compact convex set $\Omega$.

It follows that any polynomial function is dc, and hence, by the Weierstrass theorem, $DC(\Omega)$ is dense in the Banach space $C(\Omega)$ of continuous functions on $\Omega$ with the supnorm topology. In other words, any continuous function can be approximated as closely as desired by a dc function. More surprisingly, any closed set $S$ in $\mathbb{R}^n$ can be shown to be a dc set, i.e., a set that is the solution set of a dc inequality. Namely, given any closed set $S \subset \mathbb{R}^n$ and any strictly convex function $h : \mathbb{R}^n \to \mathbb{R}$, there exists a continuous convex function $g_S : \mathbb{R}^n \to \mathbb{R}$ such that $S = \{x \in \mathbb{R}^n : g_S(x) - h(x) \le 0\}$ [10].
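For Proposition 2 the decomposition can be made explicit by a standard construction (our working-out, not spelled out in the original text): if $\rho \ge \sup_{x \in \Omega} \|\nabla^2 f(x)\|$, then

$$f(x) = \Big(f(x) + \tfrac{\rho}{2}\|x\|^2\Big) - \tfrac{\rho}{2}\|x\|^2 =: p(x) - q(x),$$

where $q$ is obviously convex and $p$ is convex on $\Omega$ because $\nabla^2 p(x) = \nabla^2 f(x) + \rho I \succeq 0$.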

In many situations we not only need to recognize a dc function but also to know how to represent it effectively as a difference of two convex functions. While several classes of functions have been recognized as dc functions [2], there are still few results about effective dc representations of these functions. For composite functions a useful result about dc representation is the following [13].

Proposition 3 Let $h(x) = u(x) - v(x)$, where $u, v : \Omega \to \mathbb{R}_+$ are convex functions on a compact convex set $\Omega \subset \mathbb{R}^m$ such that $0 \le h(x) \le a$ for all $x \in \Omega$. If $q : [0, a] \to \mathbb{R}$ is a convex nondecreasing function such that $q'(a) < \infty$ ($q'(a)$ being the left derivative of $q(t)$ at $a$), then $q(h(x))$ is a dc function on $\Omega$:

$$q(h(x)) = g(x) - K[a + v(x) - u(x)],$$

where $g(x) = q(h(x)) + K[a + v(x) - u(x)]$ is a convex function and $K$ is any constant satisfying $K \ge q'(a)$.

For example, by writing $x^\alpha = e^{h(x)}$ with $h(x) = \sum_{i=1}^{n} \alpha_i \log x_i$ and applying the above proposition, it is easy to see that $x^\alpha = x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, with $\alpha \in \mathbb{R}^n_+$, is dc on any box $\Omega = [r, s] \subset \mathbb{R}^n_{++}$. Hence, any synomial function $f(x) = \sum_\alpha c_\alpha x^\alpha$, with $c_\alpha \in \mathbb{R}$, $\alpha \in \mathbb{R}^n_+$, is also dc on $\Omega$.
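Worked out concretely (our elaboration of the example, under the simplifying assumption $1 \le r_i$ so that $0 \le h(x) \le a := \sum_i \alpha_i \log s_i$ on the box): apply the proposition with $q(t) = e^t$, $u \equiv a$, $v = a - h$ (both convex, since $h$ is concave), and $K = e^a \ge q'(a)$, which gives

$$x^\alpha = \underbrace{\big[e^{h(x)} + K\,(a - h(x))\big]}_{g(x)} - K\,(a - h(x)).$$

Here $a - h(x)$ is convex, and $g$ is convex because $g(x) = \varphi(h(x)) + Ka$ with $\varphi(t) = e^t - Kt$ convex and nonincreasing on $[0, a]$, and composing such a $\varphi$ with the concave $h$ preserves convexity.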

Global Optimality Criterion

A key question in the theoretical as well as computational study of a global optimization problem is how to test a given feasible solution for global optimality. Consider a pair of problems in some sense mutually obverse:

$$\inf\{f(x) : x \in \Omega,\ h(x) \ge \alpha\}, \qquad (P_\alpha)$$

$$\sup\{h(x) : x \in \Omega,\ f(x) \le \gamma\}, \qquad (Q_\gamma)$$

where $\alpha, \gamma \in \mathbb{R}$, $\Omega$ is a closed set in $\mathbb{R}^n$, and $f, h : \mathbb{R}^n \to \mathbb{R}$ are two arbitrary functions. We say that problem $(P_\alpha)$ is regular if

$$\inf P_\alpha = \inf\{f(x) : x \in \Omega,\ h(x) > \alpha\}. \qquad (1)$$

Analogously, problem $(Q_\gamma)$ is regular if $\sup Q_\gamma = \sup\{h(x) : x \in \Omega,\ f(x) < \gamma\}$.

Proposition 4 Let $\bar{x}$ be a feasible solution of problem $(P_\alpha)$. If $\bar{x}$ is optimal to problem $(P_\alpha)$ and if problem $(Q_\gamma)$ is regular for $\gamma = f(\bar{x})$, then

$$\sup\{h(x) : x \in \Omega,\ f(x) \le \gamma\} = \alpha. \qquad (2)$$

Conversely, if (2) holds and if problem $(P_\alpha)$ is regular, then $\bar{x}$ is optimal to $(P_\alpha)$.


Turning now to the canonical dc optimization problem (CDC), let us set $\Omega = \{x : g(x) \le 0\}$ and, without losing generality, assume that the reverse convex constraint $h(x) \ge 0$ is essential, i.e.,

$$\inf\{f(x) : x \in \Omega\} < \inf\{f(x) : x \in \Omega,\ h(x) \ge 0\}. \qquad (3)$$

Since (CDC) is a problem $(P_\alpha)$ with $\alpha = 0$, if $\bar{x}$ is a feasible solution to (CDC), then condition (3) ensures the regularity of the associated problem $(Q_\gamma)$ for $\gamma = f(\bar{x})$. Define

$$C = \{x : h(x) \le 0\}, \qquad D(\gamma) = \{x \in \Omega : f(x) \le \gamma\}, \qquad (4)$$

and for any set $E$ denote its polar by $E^\circ$. As specialized for (CDC), Proposition 4 yields:

Proposition 5 In order that a feasible solution $\bar{x}$ of (CDC) may be a global minimizer, it is necessary that the following equivalent conditions hold for $\gamma = f(\bar{x})$:

$$D(\gamma) \subset C, \qquad (5)$$

$$0 = \max\{h(x) : x \in D(\gamma)\}, \qquad (6)$$

$$C^\circ \subset [D(\gamma)]^\circ. \qquad (7)$$

If the problem is regular, then any one of the above conditions is also sufficient.

An important special dc program is the following problem:

$$\text{minimize } g(x) - h(x) \quad \text{subject to } x \in \mathbb{R}^n, \qquad \text{(DC)}$$

where $g, h : \mathbb{R}^n \to \bar{\mathbb{R}}$ are convex functions ($\bar{\mathbb{R}}$ denotes the set of extended real numbers). Writing this problem as $\min\{g(x) - t : x \in D,\ h(x) \ge t\}$ with $D = \operatorname{dom} g \cap \operatorname{dom} h$ and using (7), one can derive the following:

Proposition 6 Let $g, h : \mathbb{R}^n \to \bar{\mathbb{R}}$ be two convex functions such that $h(x)$ is proper and lsc. Let $\bar{x}$ be a point where $g(\bar{x})$ and $h(\bar{x})$ are finite. In order for $\bar{x}$ to be a global minimizer of $g(x) - h(x)$ over $\mathbb{R}^n$, it is necessary and sufficient that

$$\partial_\varepsilon h(\bar{x}) \subset \partial_\varepsilon g(\bar{x}) \quad \forall \varepsilon > 0, \qquad (8)$$

where $\partial_\varepsilon f(a) = \{p \in \mathbb{R}^n : \langle p, x - a \rangle - \varepsilon \le f(x) - f(a)\ \forall x \in \mathbb{R}^n\}$ is the $\varepsilon$-subdifferential of $f(x)$ at point $a$.


Solution Methods

Numerous solution methods have been proposed for different classes of dc optimization problems. Each of them proceeds either by outer approximation (OA) of the feasible set or by branch and bound (BB), or is of a hybrid type combining OA with BB. Following are some typical dc algorithms.

An OA Method for (CDC)

Without losing generality, assume (3), i.e.,

$$\exists w \ \text{ s.t. } \ g(w) \le 0, \quad f(w) < \min\{f(x) : x \in \Omega,\ h(x) \ge 0\}, \qquad (9)$$

where, as was defined above, $\Omega = \{x : g(x) \le 0\}$. In most cases checking the regularity of a problem is not easy, while regularity is needed for the sufficiency of the optimality criteria in Proposition 5. Therefore the method to be presented below only makes use of the necessity part of this proposition and is independent of any regularity assumption. In practice, what we usually need is not an exact solution but just an approximate solution of the problem. Given tolerances $\varepsilon > 0$, $\eta > 0$, we are interested in $\varepsilon$-approximate solutions, i.e., solutions $x \in \Omega$ satisfying $h(x) \ge -\varepsilon$. An $\varepsilon$-approximate solution $x^*$ is then said to be $\eta$-optimal if $f(x^*) - \eta \le \min\{f(x) : x \in \Omega,\ h(x) \ge 0\}$. With $\bar{x}$ now being a given $\varepsilon$-approximate solution and $\gamma = f(\bar{x}) - \eta$, consider the subproblem

$$\max\{h(x) : x \in \Omega,\ f(x) \le \gamma\}. \qquad (Q_\gamma)$$

For simplicity assume that the set $D(\gamma) = \{x \in \Omega : f(x) \le \gamma\}$ is bounded. Then $(Q_\gamma)$ is a convex maximization problem over a compact convex set and can be solved by an OA algorithm (see [13] or [3]) generating a sequence $\{x^k, y^k\}$ such that

$$x^k \in \Omega, \quad f(x^k) \le \gamma, \quad h(x^k) \le \max(Q_\gamma) \le h(y^k), \qquad (10)$$

and, furthermore, $\|x^k - y^k\| \to 0$ as $k \to +\infty$. These relations imply that we must either have $h(y^k) < 0$ for some $k$ (which implies that $\max(Q_\gamma) < 0$), or else $h(x^k) \ge -\varepsilon$ for some $k$. In the former case,


this means there is no $x \in \Omega$ with $h(x) \ge 0$ and $f(x) \le f(\bar{x}) - \eta$; i.e., $\bar{x}$ is $\eta$-optimal to (CDC), and we are done. In the latter case, $x^k$ is an $\varepsilon$-approximate solution with $f(x^k) \le f(\bar{x}) - \eta$. Using then a local search (or any inexpensive way available) one can improve $x^k$ to $x' \in \Omega \cap \{x : h(x) = -\varepsilon\}$, and, after resetting $\gamma \leftarrow f(x') - \eta$ in $(Q_\gamma)$, one can repeat the procedure with the new $(Q_\gamma)$. And so on.

As is easily seen, the method consists essentially of a number of consecutive cycles, in each of which, say the $l$th cycle, a convex maximization subproblem $(Q_\gamma)$ is solved with $\gamma = f(x^l) - \eta$ for some $\varepsilon$-approximate solution $x^l$. This sequence of cycles can be organized into a unified procedure. For this, it suffices to start each new cycle from the result of the previous cycle: after resetting $\gamma' := f(x') - \eta$ in $(Q_\gamma)$, we have $D(\gamma') \subset D(\gamma)$, with the point $x' \notin D(\gamma')$, so the algorithm can be continued by using a hyperplane separating $x'$ from $D(\gamma')$ to form, from the current polytope outer approximating $D(\gamma)$, a smaller polytope outer approximating $D(\gamma')$. Since each cycle decreases the objective function value by at least a quantity $\eta > 0$, and the objective function is bounded from below, the whole procedure must terminate after finitely many cycles, yielding an $\varepsilon$-approximate solution that is $\eta$-optimal to (CDC).

It is also possible to use a BB algorithm for solving the subproblem $(Q_\gamma)$ in each cycle. The method then proceeds exactly as in the BB method for (GDC) to be presented next.

A BB Method for General DC Optimization

A general dc optimization problem can be formulated as

$$\min\{f(x) : g_i(x) \ge 0,\ i = 1, \dots, m,\ x \in \Omega\}, \qquad \text{(GDC)}$$

where $\Omega$ is a compact convex subset of $\mathbb{R}^n$, and $f, g_1, \dots, g_m$ are dc functions on $\Omega$. Although in principle (GDC) can be reduced to the canonical form and solved as a (CDC) problem, this may not be an efficient method, as it does not take account of specific features of (GDC). For instance, if the feasible set of (GDC) is highly nonconvex, computing a single feasible solution may be as hard as solving the problem itself. Under these conditions, a direct application of the OA or the BB strategies to (GDC) is fraught with pitfalls.

Without adequate precautions, such approaches may lead to grossly incorrect results or to an unstable solution that may change drastically upon a small change of the data or the tolerances [15,16]. A safer approach is to reduce (GDC) to a sequence of problems with a convex feasible set in the following way.

By simple manipulations it is always possible to arrange that the objective function $f(x)$ is convex. Let $g(x) = \min_{i=1,\dots,m} g_i(x)$, and for every $\gamma \in \mathbb{R} \cup \{+\infty\}$ consider the subproblem

$$\max\{g(x) : x \in \Omega,\ f(x) \le \gamma\}. \qquad (R_\gamma)$$

Assuming the set $D(\gamma) := \{x \in \Omega : f(x) \le \gamma\}$ to be bounded, we have in $(R_\gamma)$ a dc optimization problem over a compact convex set. Using a BB procedure to solve $(R_\gamma)$ we generate a nested sequence of partition sets $M_k$ (boxes, e.g., using a rectangular subdivision), together with a sequence $\alpha(M_k) \in \mathbb{R} \cup \{-\infty\}$, and $x^k \in \mathbb{R}^n$, $k = 1, 2, \dots$, such that

$$\operatorname{diam} M_k \to 0 \ \text{ as } \ k \to +\infty, \qquad (11)$$

$$\alpha(M_k) \searrow \max\{g(x) : x \in M_k \cap D(\gamma)\} \quad (k \to +\infty), \qquad (12)$$

$$\alpha(M_k) \ge \max(R_\gamma), \quad x^k \in M_k \cap D(\gamma), \qquad (13)$$

where $\max(P)$ denotes, as usual, the optimal value of problem $P$. Condition (11) means that the subdivision rule used must be exhaustive, while (12) indicates that $\alpha(M_k)$ is an upper bound over the feasible solutions in $M_k$, and (13) follows from the fact that $M_k$ is the partition set with the largest upper bound among all partition sets currently of interest. As before, we say that $x$ is an $\varepsilon$-approximate solution of (GDC) if $x \in \Omega$, $g(x) \ge -\varepsilon$, and $x^*$ is $\eta$-optimal if $f(x^*) - \eta \le \min\{f(x) : g(x) \ge 0,\ x \in \Omega\}$. From (11)–(13) it follows that $\alpha(M_k) - g(x^k) \to 0$ as $k \to +\infty$, and hence, for any given $\varepsilon > 0$, either $\alpha(M_k) < 0$ for some $k$ or $g(x^k) \ge -\varepsilon$ for some $k$. In the former case, $\max(R_\gamma) < 0$, hence $\min(\text{GDC}) > \gamma$; in the latter case, $x^k$ is an $\varepsilon$-approximate solution of (GDC) with $f(x^k) \le \gamma$. So, given any $\varepsilon$-approximate solution $\bar{x}$ with $\gamma = f(\bar{x}) - \eta$, a finite number of iterations of this BB


procedure will help to determine whether there is no feasible solution $x$ to (GDC) with $f(x) \le f(\bar{x}) - \eta$, i.e., $\bar{x}$ is $\eta$-optimal to (GDC), or else there exists an $\varepsilon$-approximate solution $x'$ to (GDC) with $f(x') \le f(\bar{x}) - \eta$. In the latter case, we can reset $\gamma \leftarrow f(x') - \eta$ and repeat the procedure with the new $\gamma$, and so on. In this way the whole solution process consists of a number of cycles, each involving a finite BB procedure and giving a decrease in the incumbent value of $f(x)$ by at least $\eta > 0$. By starting each cycle right from the result of the previous one, the sequence of cycles forms a unified procedure. Since $\eta$ is a positive constant, the number of cycles is finite and the procedure terminates with an $\varepsilon$-approximate solution that is $\eta$-optimal to (GDC).

The efficiency of such a BB procedure depends on two basic operations: branching and bounding. Usually, branching is performed by means of an exhaustive subdivision rule, so as to satisfy condition (11). For rectangular partitions, this condition can be achieved by the standard bisection rule: bisect the current box $M$ into two equal subboxes by means of a hyperplane perpendicular to a longest edge of $M$ at its midpoint. However, it has been observed that the convergence guaranteed by an exhaustive subdivision rule is rather slow, especially in high dimensions. To improve the situation, the idea is to use, instead of the standard bisection, an adaptive subdivision rule defined as follows. Let the upper bound $\alpha(M_k)$ in (12) be obtained as $\alpha(M_k) = \max\{\varphi(x) : x \in M_k \cap D(\gamma)\}$, where $\varphi(x)$ is some concave overestimator of $g(x)$ over $M_k$ that is tight at some point $y^k \in M_k$, i.e., satisfies $\varphi(y^k) = g(y^k)$. If $x^k \in \operatorname{argmax}\{\varphi(x) : x \in M_k \cap D(\gamma)\}$, then the subdivision rule is to bisect $M_k$ by means of the hyperplane $x_s = (x_s^k + y_s^k)/2$, where $s \in \operatorname{argmax}_{i=1,\dots,n} |y_i^k - x_i^k|$. As has been proved in [13], such an adaptive bisection rule ensures the existence of an infinite subsequence $\{k_\nu\}$ such that $\|y^{k_\nu} - x^{k_\nu}\| \to 0$ as $\nu \to +\infty$. The common limit $x^*$ of $x^{k_\nu}$ and $y^{k_\nu}$ then yields an optimal solution of the problem $(R_\gamma)$. Computational experience has effectively confirmed that convergence achieved with an adaptive subdivision rule is usually much faster than with the standard bisection. For such an adaptive subdivision to be possible, the constraint set $D(\gamma)$ in (12) must be convex, so that for each partition set $M_k$ two points $x^k \in M_k \cap D(\gamma)$ and $y^k \in M_k$ can be defined such that $\alpha(M_k) - g(y^k) = o(\|x^k - y^k\|)$.
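A minimal sketch contrasting the two subdivision rules just described (ours; the box representation and variable names are illustrative):

```python
import numpy as np

def standard_bisection(lo, hi):
    """Exhaustive rule: split the box [lo, hi] perpendicular to a
    longest edge at its midpoint, returning the two subboxes."""
    s = int(np.argmax(hi - lo))          # index of a longest edge
    cut = 0.5 * (lo[s] + hi[s])
    hi1, lo2 = hi.copy(), lo.copy()
    hi1[s], lo2[s] = cut, cut
    return (lo, hi1), (lo2, hi)

def adaptive_bisection(lo, hi, x_k, y_k):
    """Adaptive rule: cut at x_s = (x_s^k + y_s^k)/2 along the
    coordinate s where x^k and y^k differ the most."""
    s = int(np.argmax(np.abs(y_k - x_k)))
    cut = 0.5 * (x_k[s] + y_k[s])
    hi1, lo2 = hi.copy(), lo.copy()
    hi1[s], lo2[s] = cut, cut
    return (lo, hi1), (lo2, hi)

lo, hi = np.zeros(3), np.ones(3)
x_k, y_k = np.array([0.2, 0.9, 0.5]), np.array([0.25, 0.1, 0.5])
print(standard_bisection(lo, hi))            # cuts edge 0 at 0.5
print(adaptive_bisection(lo, hi, x_k, y_k))  # cuts edge 1 at 0.5
```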


DCA – A Local Optimization Approach to (DC)

By rewriting (DC) as a canonical dc optimization problem

$$\min\{t - h(x) : x \in \mathbb{R}^n,\ t \in \mathbb{R},\ g(x) - t \le 0\},$$

we see that (DC) can be solved by the same method as (CDC). Since, however, for some large-scale problems we are not so much interested in a global optimal solution as in a sufficiently good feasible solution, a local optimization approach to (DC) has been developed [9] that seems to perform quite satisfactorily in a number of applications. This method, referred to as DCA, is based on the well-known Toland equality:

$$\inf_{x \in \operatorname{dom} g} \{g(x) - h(x)\} = \inf_{y \in \operatorname{dom} h^*} \{h^*(y) - g^*(y)\}, \qquad (14)$$

where $g, h : \mathbb{R}^n \to \bar{\mathbb{R}}$ are lower semicontinuous proper convex functions, and the star denotes the conjugate, e.g., $g^*(y) = \sup\{\langle x, y \rangle - g(x) : x \in \operatorname{dom} g\}$. Taking account of this equality, DCA starts with $x^0 \in \operatorname{dom} g$ and for $k = 1, 2, \dots$ computes

$$y^k \in \partial h(x^k), \qquad x^{k+1} \in \partial g^*(y^k).$$

As has been proved in [9], the thus generated sequence $x^k, y^k$ satisfies the following conditions:
1. The sequences $g(x^k) - h(x^k)$ and $h^*(y^k) - g^*(y^k)$ are decreasing.
2. Every accumulation point $x^*$ (resp. $y^*$) of the sequence $\{x^k\}$ (resp. $\{y^k\}$) is a critical point of the function $g(x) - h(x)$ (resp. $h^*(y) - g^*(y)$).

Though global optimality cannot be guaranteed by this method, it has been observed that in many cases of interest it yields a local minimizer that is also global.
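As a concrete illustration (our sketch, not from [9]), consider minimizing $g(x) - h(x)$ with $g(x) = \tfrac12 x^\top A x$ ($A$ positive definite) and $h(x) = \lambda \|x\|_1$. Here $\lambda \operatorname{sign}(x) \in \partial h(x)$ and $\partial g^*(y) = \{A^{-1} y\}$, so each DCA step reduces to a linear solve:

```python
import numpy as np

def dca(A, lam, x0, iters=50):
    """DCA for minimizing 0.5*x'Ax - lam*||x||_1 (A positive definite).

    y^k is a subgradient of h at x^k; x^{k+1} = argmin_x g(x) - <y^k, x>,
    i.e., the unique solution of A x = y^k since g is quadratic.
    """
    x = x0.astype(float)
    for _ in range(iters):
        y = lam * np.sign(x)             # y^k in the subdifferential of h at x^k
        x_next = np.linalg.solve(A, y)   # x^{k+1} in the subdifferential of g* at y^k
        if np.allclose(x_next, x):
            break
        x = x_next
    return x

A = np.array([[2.0, 0.5], [0.5, 1.0]])
x_star = dca(A, lam=1.0, x0=np.array([1.0, -0.3]))
obj = 0.5 * x_star @ A @ x_star - np.linalg.norm(x_star, 1)
print(x_star, obj)   # a critical point of g - h; not necessarily global
```

Each iteration decreases the objective, in line with property 1 above, but the point reached depends on the starting point $x^0$.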

where g; h : Rn ! R are lower semicontinuous proper convex functions, and the star denotes the conjugate, e. g., g  (y) D supfhx; yi  g(x) : x 2 domgg. Taking account of this equality, DCA starts with x 0 2 domg and for k D 1; 2; : : : ; computes y k 2 @h(x k ); x kC1 2 @g  (y k ). As has been proved in [9], the thus generated sequence x k ; y k satisfies the following conditions: 1. The sequences g(x k )  h(x k ) and h  (x k )  g( (x k ) are decreasing. 2. Every accumulation point x  (resp. y ) of the sequence fx k g (resp. fy k g) is a critical point of the function g(x)  h(x) (resp. h  (y)  g  (y)). Though global optimality cannot be guaranteed by this method, it has been observed that in many cases of interest it yields a local minimizer that is also global. Applications and Extensions The above described dc methods are of a generalpurpose type. For many special dc problems more efficient algorithms are needed to take full advantage of additional structures. Along this line, dc methods have been adapted to solve problems with separated nonconvexity, bilinear programming, multilevel programming, multiobjective programming, optimization problems over efficient sets, polynomial and synomial programming, fractional programming, continuous location problems, clustering and datamining problems, etc. [4]. In particular, quite efficient methods have been developed for a class of dc optimization problems important for applications called multiplicative program-

611

612

D

Decision Support Systems with Multiple Criteria

ming [4,5]. Also, techniques for bounding, branching, and decomposition have been refined that have very much widened the range of applicability of dc methods. Most recently, monotonic optimization, also called DM optimization, has emerged as a new promising field of research dealing with a class of optimization problems important for applications whose structure, though different from the dc structure, shares many common features with the latter. To be specific, let C be a family of real valued functions on Rn such that (i) g1 ; g2 2 C ; ˛1 ; ˛2 2 RC ) ˛1 g1 C ˛2 g2 2 C ; (ii) g1 ; g2 2 C ) g(x) :D maxfg1 (x); g2 (x)g 2 C . Then the family D(C ) D C  C is a vector lattice with respect to the two operations of pointwise maximum and pointwise minimum. When C is the set of convex functions, D(C ) is nothing but the vector lattice of dc functions. When C is the set of increasing functions on Rn , i. e., the set of functions f : Rn ! R such that x 0  x ) f (x 0 )  f (x), the vector lattice D(C ) consists of DM functions, i. e., functions representable as the difference of two increasing functions. For the theory, methods, and algorihms of DM optimization, we refer the reader to [7,14,18]. References 1. Floudas CA (2000) Deterministic Global Optimization. Kluwer, Dordrecht 2. Hartman P (1959) On Functions Representable as a Difference of Convex Functions. Pacific J Math 9:707–713 3. Horst R, Tuy H (1996) Global Optimization (Deterministic Approaches), 3rd edn. Springer, Berlin 4. Horst R, Pardalos PM (eds) (1995) Handbook of Global Optimization. Kluwer, Dordrecht 5. Konno H, Thach PT, Tuy H (1997) Optimization on Low Rank Nonconvex Structures. Kluwer, Dordrecht 6. Pardalos PM, Rosen JB (1987) Constrained Global Optimization: Algorithms and Applications. Lecture Notes in Computer Sciences 268. Springer, Berlin 7. Rubinov A (1999) Abstract Convexity and Global Optimization. Kluwer, Dordrecht 8. Sherali HD, Adams WP (1999) A Reformulation-Linearization Technique for Solving Discrete and Continuous Nonconvex Problems. Kluwer, Dordrecht 9. Tao PD, An LTH (1997) Convex analysis approach to D.C. Programmng: Theory, algorithms and applications. Acta Mathematica Vietnamica 22:289–356 10. Thach PT (1993) D.c. sets, dc functions and nonlinear equations. Math Programm 58:415–428 11. Tuy H (1964) Concave programming under linear constraints. Soviet Math 5:1437–1440

12. Tuy H (1985) A general deterministic approach to global optimization via dc programming. In: Hiriart-Urruty JB (ed) Fermat Days 1985: Mathematics for Optimization. NorthHolland, Amsterdam, pp 137–162 13. Tuy H (1998) Convex Analysis and Global Optimization. Kluwer, Dordrecht 14. Tuy H (2000) Monotonic Optimization: Problems and Solution Approaches. SIAM J Optim 11(2):464–494 15. Tuy H (2005) Robust Solution of Nonconvex Global Optimization Problems. J Global Optim 32:307–323 16. Tuy H (2005) Polynomial Optimization: A Robust Approach. Pacific J Optim 1:357–373 17. Tuy H, Al-Khayyal FA, Thach PT (2005) Monotonic Optimization: Branch and Cuts Methods. In: Audet C, Hansen P, Savard G (eds) Essays and Surveys on Global Optimization. GERAD. Springer, Berlin, pp 39–78 18. Tuy H, Minoux M, NTH Phuong (2006) Discrete Monotonic Optimization with Application to A Discrete Location Problem. SIAM J Optim 1778–97

Decision Support Systems with Multiple Criteria

CONSTANTIN ZOPOUNIDIS, MICHAEL DOUMPOS
Department of Production Engineering and Management, Financial Engineering Laboratory, Technical University of Crete, Chania, Greece

MSC2000: 90C29

Article Outline

Keywords
Multicriteria Decision Aid
Multicriteria Decision Support Systems
Multicriteria Group Decision Support Systems
Intelligent Multicriteria Decision Support Systems
Conclusions
See also
References

Keywords

Decision support system; Multicriteria analysis; Multicriteria group decision support system; Intelligent multicriteria decision support systems

In practical real-world situations the available time for making decisions is often limited, while the cost of investigation increases with time. Therefore, it would

be desirable to exploit the increasing processing power provided by modern computer technology to save significant amounts of time and cost in decision making problems. Computationally intensive but routine tasks, such as data management and calculations, can be performed with remarkable speed by a common personal computer, compared to the time that a human would need to perform the same tasks. On the other hand, computers are unable to perform cognitive tasks, while their inference and reasoning capabilities are still very limited compared to the capabilities of the human brain. Thus, in decision making problems, computers can support decision makers by managing the data of the problem and performing computationally intensive calculations, based on a selected decision model, which could help in the analysis, while the decision makers themselves have to examine the obtained results of the models and reach the most appropriate decision. This merging of human judgment and intuition together with computer systems constitutes the underlying philosophy, methodological framework, and basic goal of decision support systems [17].

The term 'decision support system' (DSS) is already consolidated and is used to describe any computer system that provides information on a specific decision problem using analytical decision models and access to databases, in order to support a decision maker in making decisions effectively in complex and ill-structured problems where no straightforward algorithmic procedure can be employed [28]. The development of DSSs kept pace with the advances in computer and information technologies, and since the 1970s numerous DSSs have been designed by academic researchers and practitioners for the examination and analysis of several decision problems, including finance and accounting, production management, marketing, transportation, human resources management, agriculture, education, etc. [17,19]. Except for the specific decision problems that DSSs address, these systems are also characterized by the type of decision models and techniques that they incorporate (i.e., statistical analysis tools, mathematical programming and optimization techniques, multicriteria decision aid methods, etc.). Some of these methodologies (optimization, statistical analysis, etc.), which have already been implemented in several DSSs, are based on the classical monocriterion approach.

However, real-world decision problems can hardly be addressed through the examination of a single criterion, attribute, or point of view that will lead to the 'optimum' decision. In fact, such a monocriterion approach is merely an oversimplification of the actual nature of the problem at hand that can lead to unrealistic decisions. A more realistic and flexible approach would be the simultaneous consideration of all pertinent factors that may affect a decision. Through this appealing approach, however, a very essential issue emerges: how can several and often conflicting factors be aggregated to make rational decisions? This issue constitutes the focal point of interest for all the multicriteria decision aid methods.

The incorporation of multicriteria decision aid methods in DSSs provides the decision makers with a highly efficient tool to study complex real-world decision problems where multiple criteria of conflicting nature are involved. Therefore, the subsequent sections of this paper will concentrate on this specific category of DSSs (multicriteria DSSs, MCDSSs). The article is organized as follows. In section 2 some basic concepts, notions, and principles of multicriteria decision aid are discussed. Section 3 presents the main features and characteristics of MCDSSs, along with a review of the research that has been conducted in this field, while some extensions of the classical MCDSSs framework in group decision making and intelligent decision support are also discussed. Finally, section 4 concludes the paper and outlines some possible future research directions in the design, development, and implementation of MCDSSs.

Multicriteria Decision Aid

Multicriteria decision aid (MCDA, the European School) or multicriteria decision making (MCDM, the American School) [49,64] constitutes an advanced field of operations research which is devoted to the development and implementation of decision support methodologies to confront complex decision problems involving multiple criteria, goals, or objectives of conflicting nature. The foundations of MCDA can be traced back to the works of J. von Neumann and O. Morgenstern [43] and P.C. Fishburn [20] on utility theory, A. Charnes and W.W. Cooper [10] on goal programming, and B. Roy [47] on the concept of outranking relations and the foundations of the ELECTRE methods.

These pioneering works have affected the subsequent research in the field of MCDA, which can be divided into two major groups: discrete and continuous MCDA. The former is involved with decision problems where there is a finite set of alternatives which should be considered in order to select the most appropriate one, to rank them from the best to the worst, or to classify them into predefined homogeneous classes. By contrast, in continuous MCDA problems the alternatives are not defined a priori; instead, one seeks to construct an alternative that meets his/her goals or objectives (for instance, the construction of a portfolio of stocks).

There are different ways to address these two classes of problems in MCDA. Usually, a continuous MCDA problem is addressed through multi-objective or goal programming approaches. In the former case, the objectives of the decision maker are expressed as a set of linear or nonlinear functions which have to be 'optimized', whereas in the latter case the decision maker expresses his/her goals in the form of a reference or ideal point which should be approached as closely as possible. These two approaches extend the classical single-objective optimization framework through the simultaneous consideration of more than one objective or goal. Of course, in this new context it seems illusory to speak of optimality; instead, the aim is initially to determine the set of efficient solutions (solutions which are not dominated by any other solution) and then to identify interactively a specific solution which is consistent with the preference structure of the decision maker. The books [54,57] and [63] provide an excellent and extensive discussion of both multi-objective and goal programming.

On the other hand, discrete MCDA problems are usually addressed through multi-attribute utility theory (MAUT) [26], the outranking relations approach [48], or the preference disaggregation approach ([23,44]). These three approaches are mainly focused on the determination and modeling of the decision makers' preferences, in order to develop a global preference model which can be used in decision making. Their differences concern mainly the form of the global preference model that is developed, as well as the procedure that is used to estimate the parameters of the model.

The developed preference model in both MAUT and preference disaggregation is a utility or value function, either additive or multiplicative, whereas the outranking relations approach is based on pairwise comparisons of the form 'alternative a is at least as good as alternative b'. Concerning the procedure used to estimate the parameters of the global preference model, both in MAUT and in outranking relations there is a direct interrogation of the decision maker. More precisely, in MAUT the decision maker is asked to determine the trade-offs among the several attributes or criteria, while in outranking relations the decision maker has to determine several parameters, such as the weights of the evaluation criteria and the indifference, strict preference, and veto thresholds for each criterion. On the contrary, in preference disaggregation, an ordinal regression procedure is used to estimate the global preference model. Based on a reference set of alternatives, which may consist either of past decisions or of a small subset of the alternatives under consideration, the decision maker is asked to provide a ranking or a classification of the alternatives according to his/her decision policy (global preferences). Then, using an ordinal regression procedure, the global preference model is estimated so that the original ranking or classification (and consequently the global preference system of the decision maker) can be reproduced as consistently as possible.
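The following minimal sketch (illustrative only; the criteria, weights, and marginal values are invented for the example) shows the kind of additive value model such methods produce, together with the what-if weight analysis that MCDSSs typically offer:

```python
# Additive value model U(a) = sum_i w_i * u_i(a) for three alternatives
# evaluated on two criteria; all numbers are illustrative.
alternatives = {
    "a": {"return": 0.9, "risk": 0.3},   # marginal values u_i in [0, 1]
    "b": {"return": 0.6, "risk": 0.8},
    "c": {"return": 0.4, "risk": 0.9},
}

def rank(weights):
    """Rank alternatives by global additive value, best first."""
    scores = {name: sum(weights[c] * u for c, u in vals.items())
              for name, vals in alternatives.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank({"return": 0.7, "risk": 0.3}))  # return-oriented decision maker
print(rank({"return": 0.3, "risk": 0.7}))  # what-if: risk-oriented weights
```

Changing the criterion weights reverses the ranking in this toy example, which is exactly the kind of sensitivity a decision maker explores interactively in an MCDSS.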

Multicriteria Decision Support Systems

From the above brief discussion of the basic concepts and approaches of MCDA, it is clear that in any case the decision maker and his/her preferences constitute the focal point of the methodological framework of MCDA. This special characteristic of MCDA implies that a comprehensive model of a decision situation cannot be developed; instead, the model should be developed to meet the requirements of the decision maker [46]. The development of such a model can only be achieved through an iterative and interactive process, until the decision maker's preferences are consistently represented in the model. Both interactivity and iterative operation are two of the key characteristics of DSSs. Consequently, a DSS incorporating MCDA methods could provide essential support in structuring the decision problem, analyzing the preferences of the decision maker, and supporting the model building process.

The support provided by multicriteria DSSs (MCDSSs) is essential for the decision maker as well as for the decision analyst.
• The decision maker, through the use of MCDSSs, becomes familiar with sophisticated operations research techniques, is supported in structuring the decision problem considering all possible points of view, attributes, or criteria, and, furthermore, is able to analyze the conflicts between these points of view and consider the existing trade-offs. All these capabilities provided by MCDSSs serve the learning process of decision makers in resolving complex decision problems in a realistic context, and constitute a solid scientific basis for arguing upon the decisions taken.
• On the other hand, from the decision analyst's point of view, MCDSSs provide a supportive tool which is necessary throughout the decision making process, enabling the decision analyst, who usually acts as an intermediary between the system and the decision maker, to highlight the essential features of the problem to the decision maker, to introduce the preferences of the decision maker into the system, and to develop the corresponding model. Furthermore, through sensitivity and robustness analyses the decision analyst is able to examine several scenarios, concerning both the significance of the evaluation criteria as well as changes in the decision environment.

The supportive operation of MCDSSs in making decisions in ill-structured complex decision problems was the basic motivation for computer scientists, management scientists, and operations researchers in the development of such systems. Actually, MCDSSs have been one of the major areas of DSS research since the 1970s [19], and significant progress has been made from both the theoretical and the practical/implementation viewpoints. The first MCDSSs to be developed in the 1970s were mainly oriented towards the study of multiobjective mathematical programming problems ([16,61]). These early pioneering systems, mainly due to the limited capabilities of computer technology during that period, were primarily developed for academic purposes; they were implemented on mainframe computers, with no documentation available, and they had no visual representation capabilities [31]. Today, after more than twenty years of research and advances in MCDA, DSSs, and computer science, most MCDSSs provide many advanced capabilities to decision makers, including among others [46]:
1) Enhanced data management capabilities, including interactive addition, deletion, or modification of criteria.
2) Assessment and management of weights.
3) User-friendly interfaces based on visual representations of both alternatives and criteria to assist the interaction between the system and the decision maker.
4) Sensitivity analysis (what-if analysis) to determine how changes in the weights of the evaluation criteria can affect the actual decision.

These capabilities are in accordance with the general characteristics of DSSs, that is, interactivity, flexibility and adaptability to the changes of the decision environment, user-oriented design and development, and combination of database management with decision models. Although the aforementioned capabilities are common to most of the existing MCDSSs, one could distinguish MCDSSs according to the MCDA approaches that they employ:
• MCDSSs based on the multi-objective programming approach:
– the TOMMIX system [2],
– the TRIMAP system [11],
– the VIG system ([29,32]),
– the VIDMA system [30],
– the DIDAS system [36],
– the AIM system [37],
– the ADBASE system [58], and
– the STRANGE system [59].
• MCDSSs based on the MAUT approach:
– the MACBETH system [5],
– the VISA system [6], and
– the EXPERT CHOICE system [21].
• MCDSSs based on the outranking relations approach:
– the PROMCALC and GAIA systems [7],
– the ELECCALC system [27],
– the PRIAM system [34], and
– the ELECTRE TRI system [62].
• MCDSSs based on the preference disaggregation approach:
– the PREFCALC system [22],
– the MINORA system [51],

– the MIIDAS system [52], and
– the PREFDIS system [66].

Most of the existing MCDSSs are designed for the study of general multicriteria decision problems. Although they provide advanced capabilities for modeling the decision makers' preferences in order to make a specific decision regarding the choice of an alternative or the ranking or classification of the alternatives, such MCDSSs do not consider the specific characteristics, or the nature, of the decision that should be taken in the particular decision problem at hand. To address the unique nature of some significant decision problems, where, except for the application of MCDA methodology, some other types of analyses are necessary to consider the environment in which the decision is taken, several authors have proposed domain-specific MCDSSs. Some decision problems for which specific MCDSSs have been developed include the assessment of corporate performance and viability (the BANKADVISER system [39], the FINCLAS system [65], the FINEVA system [68], and the system proposed in [53]), bank evaluation (the BANKS system [40]), bank asset liability management [33], financial planning [18], portfolio selection [67], new product design (the MARKEX system [42]), urban planning (the system proposed in [1]), strategic planning [9], and computer system design [15].

Multicriteria Group Decision Support Systems

A common characteristic of all the aforementioned MCDSSs is that they refer to decisions that are taken by individual decision makers. However, in many cases the actual decision is not the responsibility of an individual; instead, there is a team of negotiating or cooperating participants who must reach a consensus decision. In this case, although the decision process, and consequently the required decision support, remains the same as far as each individual decision maker is concerned, the process that will lead the cooperating team or the negotiating parties to a consensus decision is completely different from the individual decision making process. Therefore, the type of support needed also differs.

Group DSSs (GDSSs) aim at supporting such decision processes, and since the tools provided by MCDA can be extended to the general group decision process, several attempts have been made to design and develop such multicriteria systems. Some examples of multicriteria GDSSs include the Co-oP system [8], the JUDGES system [12], the WINGDSS system [13], the MEDIATOR system [24], and the SCDAS system [35].

Intelligent Multicriteria Decision Support Systems

Except for the extension of the MCDSSs framework to supporting group decision making, researchers have recently also investigated the extension of MCDSSs through the exploitation of advances in the field of artificial intelligence. Scientific fields such as those of neural networks, expert systems, fuzzy sets, genetic algorithms, etc., provide promising features and new capabilities regarding the representation of expert knowledge, the development of intelligent and more friendly user interfaces, reasoning and explanation abilities, as well as the handling of incomplete, uncertain, and imprecise information. These appealing new capabilities provided by artificial intelligence techniques can be incorporated in the existing MCDSSs framework to provide expert advice on the problem under consideration, assistance in the use of the several modules of the system, explanations concerning the results of MCDA models, support in structuring the decision making process, as well as recommendations and further guidance for the future actions that the decision maker should take in order to implement his/her decisions successfully.

The terms 'intelligent multicriteria decision support systems' or 'knowledge-based multicriteria decision support systems' have been used by several authors to describe MCDSSs which take advantage of artificial intelligence techniques in combination with MCDA methods. Some representative examples of intelligent MCDSSs are the system proposed in [3] for multiobjective linear programming, the MARKEX system for new product design [42], the CREDEX system [45] and the CGX system [55] for credit granting problems, the MIIDAS system for estimating additive utility functions based on the preference disaggregation approach [52], the INVEX system for investment analysis [60] based on the PROMETHEE method, as well as the FINEVA system [68] for the assessment of corporate performance and viability.

All these systems incorporate in their structure one or more expert system components, either to derive estimations regarding the problem under consideration (the FINEVA, MARKEX, CREDEX, CGX, and INVEX systems) or to support the use of the MCDA models which are incorporated in the system and, more generally, to support and improve the communication between the user and the system (the MIIDAS and MARKEX systems). Furthermore, the INVEX system incorporates fuzzy sets to provide an initial distinction between good and bad investment projects, so that the number of alternatives to be considered later on in the multicriteria analysis module is reduced.

The ongoing research on the integration of artificial intelligence with MCDA, regarding the theoretical foundations of this integration and the related implementation issues ([4,25]), the construction of fuzzy outranking relations ([14,41,50]), and the applications of neural networks in preference modeling and utility assessment ([38,56]), constitutes a significant basis for the design and development of intelligent MCDSSs implementing the theoretical findings of this research.

Conclusions

This article investigated the potential provided by MCDSSs in the decision making process. During the last two decades MCDSSs have consolidated their position within the operations research, information systems, and management science communities as an efficient tool for supporting the whole decision making process, beginning from problem structuring and continuing until the implementation of the final decision, in complex ill-structured problems. The review presented in this paper reveals that recent advances in MCDSSs include systems for general use to solve both discrete and continuous MCDA problems, systems designed to study some specific real-world decisions, as well as systems designed to support negotiation and group decision making.

As computer science and technology progress rapidly, new areas of application of MCDSSs can be explored, including their operation over the Internet to provide computer support to the cooperative work of dispersed and asynchronous decision units. The incorporation of artificial intelligence techniques in the existing framework of MCDSSs also constitutes another significant area of future research.

Although, as has been illustrated in this paper, researchers have already tried to integrate these two approaches into integrated intelligent systems, there is still a lot of work to be done in order to make the most of the capabilities of neural networks, fuzzy sets, and expert systems to provide user-friendly support in decision problems where multiple criteria are involved.

See also

• Bi-objective Assignment Problem
• Estimating Data for Multicriteria Decision Making Problems: Optimization Techniques
• Financial Applications of Multicriteria Analysis
• Fuzzy Multi-objective Linear Programming
• Multicriteria Sorting Methods
• Multi-objective Combinatorial Optimization
• Multi-objective Integer Linear Programming
• Multi-objective Optimization and Decision Support Systems
• Multi-objective Optimization: Interaction of Design and Control
• Multi-objective Optimization: Interactive Methods for Preference Value Functions
• Multi-objective Optimization: Lagrange Duality
• Multi-objective Optimization: Pareto Optimal Solutions, Properties
• Multiple Objective Programming Support
• Outranking Methods
• Portfolio Selection and Multicriteria Analysis
• Preference Disaggregation
• Preference Disaggregation Approach: Basic Features, Examples From Financial Decision Making
• Preference Modeling

References 1. Anselin L, Arias EG (1983) A multi-criteria framework as decision support system for urban growth management applications: Central city redevelopment. Europ J Oper Res 13:300–309 2. Antunes CH, Alves MJ, Silva AL, Climaco J (1992) An integrated MOLP method base package-A guided tour of TOMMIX. Comput Oper Res 1(4):609–625 3. Antunes CH, Melo MP, Climaco JN (1992) On the integration of an interactive MOLP procedure base and expert system techniques. Europ J Oper Res 61:135–144


4. Balestra G, Tsoukiàs A (1990) Multicriteria analysis represented by artificial intelligence techniques. J Oper Res Soc 41(5):419–430 5. Bana e Costa CA, Vansnick JC (1994) MACBETH-An interactive path towards the construction of cardinal value functions. Internat Trans Oper Res 1:489–500 6. Belton V, Vickers SP (1989) V.I.S.A.-VIM for MCDA. In: Lockett AG, Islei G (eds) Improving Decision Making in Organizations. Springer, Berlin, pp 319–334 7. Brans JP, Mareschal B (1994) The PROMCALC and GAIA decision support system for multicriteria decision aid. Decision Support Systems 12:297–310 8. Bui T (1994) Software architectures for negotiation support: Co-oP and Negotiator. Computer-Assisted Negotiation and Mediation Symposium, Program of Negotiation (May 26-27 1994), Harvard Law School, Cambridge, MA, pp 216–227 9. Chandrasekaran G, Ramesh R (1987) Microcomputer based multiple criteria decision support system for strategic planning. Inform and Management 12:163–172 10. Charnes A, Cooper WW (1961) Managem. models and industrial applications of linear programming. Wiley, New York 11. Climaco J, Antunes CH (1989) Implementation of a user friendly software package-A guided tour of TRIMAP. Math Comput Modelling 12(10–11):1299–1309 12. Colson G, Mareschal B (1994) JUDGES: A descriptive group decision support system for the ranking of items. Decision Support Systems 12:391–404 13. Csaki P, Rapcsak T, Turchanyi P, Vermes M (1995) R and D for group decision aid in Hungary by WINGDSS, a Microsoft Windows based group decision support system. Decision Support Systems 14:205–217 14. Czyzak P, Slowinski R (1996) Possibilistic construction of fuzzy outranking relation for multiple-criteria ranking. Fuzzy Sets and Systems 81:123–131 15. Dutta A, Jain HK (1985) A DSS for distributed computer system design in the presence of multiple conflicting objectives. Decision Support Systems 1:233–246 16. Dyer J (1973) A time-sharing computer program for the solution of the multiple criteria problem. Managem Sci 19:1379–1383 17. Eom HB, Lee SM (1990) Decision support systems applications research: A bibliography 1971–1988. Europ J Oper Res 46:333–342 18. Eom HB, Lee SM, Snyder CA, Ford FN (1987/8) A multiple criteria decision support system for global financial planning. J Management Information Systems 4(3):94–113 19. Eom SB, Lee SM, Kim JK (1993) The intellectual structure of decision support systems: 1971–1989. Decision Support Systems 10:19–35 20. Fishburn PC (1965) Independence in utility theory with whole product sets. Oper Res 13:28–45 21. Forman EH, Selly MA (2000) Decisions by objectives: How to convince others that you are right. World Sci., Singapore

22. Jacquet-Lagrèze E (1990) Interactive assessment of preferences using holistic judgments: The PREFCALC system. In: Bana e Costa CA (ed) Readings in Multiple Criteria Decision Making. Springer, Berlin, pp 335–350 23. Jacquet-Lagrèze E, Siskos J (1982) Assessing a set of additive utility functions for multicriteria decision-making: The UTA method. Europ J Oper Res 10:151–164 24. Jarke M, Jelassi MT, Shakun MF (1987) MEDIATOR: Toward a negotiation support system. Europ J Oper Res 31:314– 334 25. Jelassi MT (1987) MCDM: From stand-alone methods to integrated and intelligent DSS. In: Sawaragi Y, Inoue K, Nakayama H (eds) Towards Interactive and Intelligent Decision Support Systems. Springer, Berlin, pp 575–584 26. Keeney RL, Raiffa H (1976) Decisions with multiple objectives: Preferences and value trade-offs. Wiley, New York 27. Kiss LN, Martel JM, Nadeau R (1994) ELECCALC-An interactive software for modelling the decision maker’s preferences. Decision Support Systems 12:311–326 28. Klein MR, Methlie LB (1995) Knowledge based decision support systems with application in business. Wiley, New York 29. Korhonen P (1987) VIG-A visual interactive support system for multiple criteria decision making. Belgian J Oper Res Statist Computer Sci 27:3–15 30. Korhonen P (1988) A visual reference direction approach to solving discrete multiple criteria problems. Europ J Oper Res 34:152–159 31. Korhonen P, Moskowitz H, Wallenius J (1992) Multiple criteria decision support-A review. Europ J Oper Res 63:361– 375 32. Korhonen P, Wallenius J (1988) A Pareto race. Naval Res Logist 35:615–623 33. Langen D (1989) An (interactive) decision support system for bank asset liability management. Decision Support Systems 5:389–401 34. Levine P, Pomerol JCh (1986) PRIAM, an interactive program for chosing among multiple attribute alternatives. Europ J Oper Res 25:272–280 35. Lewandowski A (1989) SCDAS-Decision support system for group decision making: Decision theoretic framework. Decision Support Systems 5:403–423 36. Lewandowski A, Kreglewski T, Rogowski T, Wierzbicki A (1989) Decision support systems of DIDAS family (Dynamic Interactive Decision Analysis & Support. In: Lewandowski A and Wierzbicki A (eds) Aspiration Based Decision Support Systems. Springer, Berlin, pp 21–27 37. Lofti V, Stewart TJ, Zionts S (1992) An aspiration-level interactive model for multiple criteria decision making. Comput Oper Res 19:677–681 38. Malakooti B, Zhou YQ (1994) Feedforward artificial neural networks for solving discrete multiple criteria decision making problems. Managem Sci 40(11):1542–1561 39. Mareschal B, Brans JP (1991) BANKADVISER: An industrial evaluation system. Europ J Oper Res 54:318–324


40. Mareschal B, Mertens D (1992) BANKS a multicriteria, PROMETHEE-based decision support system for the evaluation of the international banking sector. Revue des Systèmes de Décision 1(2):175–189 41. Martel JM, D’Avignon CR, Couillard J (1986) A fuzzy outranking relation in multicriteria decision making. Europ J Oper Res 25:258–271 42. Matsatsinis NF, Siskos Y (1999) MARKEX: An intelligent decision support system for product development decisions. Europ J Oper Res 113:336–354 43. Neumann J Von, Morgenstern O (1944) Theory of games and economic behavior. Princeton Univ. Press, Princeton 44. Pardalos PM, Siskos Y, Zopounidis C (1995) Advances in multicriteria analysis. Kluwer, Dordrecht 45. Pinson S (1992) A multi-expert architecture for credit risk assessment: The CREDEX system. In: O’Leary DE, Watkins PR (eds) Expert Systems in Finance. North-Holland, Amsterdam, pp 27–64 46. Pomerol JCh (1993) Multicriteria DSSs: State of the art and problems. Central Europ J Oper Res Econ 3(2):197–211 47. Roy B (1968) Classement et choix en présence de points de vue multiples: La méthode ELECTRE. RIRO 8:57–75 48. Roy B (1991) The outranking approach and the foundations of ELECTRE methods. Theory and Decision 31:49–73 49. Roy B, Vanderpooten D (1997) An overview on the European school of MCDA: Emergence, basic features and current works. Europ J Oper Res 99:26–27 50. Siskos J (1982) A way to deal with fuzzy preferences in multiple-criteria decision problems. Europ J Oper Res 10:614–324 51. Siskos Y, Spiridakos A, Yannacopoulos D (1993) MINORA: A multicriteria decision aiding system for discrete alternatives. J Inf Sci Techn 2:136–149 52. Siskos Y, Spiridakos A, Yannacopoulos D (1999) Using artificial intelligence and visual techniques into preference disaggregation analysis: The MIIDAS system. Europ J Oper Res 113:281–299 53. Siskos Y, Zopounidis C, Pouliezos A (1994) An integrated DSS for financing firms by an industrial development bank in Greece. Decision Support Systems 12:151–168 54. Spronk J (1981) Interactive multiple goal programming application to financial planning. Martinus Nijhoff, Boston, MA 55. Srinivasan V, Ruparel B (1990) CGX: An expert support system for credit granting. Europ J Oper Res 45:293–308 56. Stam A, Sun M, Haines M (1996) Artificial neural network representations for hierarchical preference structures. Comput Oper Res 23(12):1191–1201 57. Steuer RE (1986) Multiple criteria optimization: Theory, computation and application. Wiley, New York 58. Steuer RE (1992) Manual for the ADBASE multiple objective linear programming package. Dept Management Sci and Inform. Technol. Univ. Georgia, Athens, GA 59. Teghem J, Dufrane D, Thauvoye M, Kunsch P (1986) STRANGE: An interactive method for multi-objective linear

60.

61.

62.

63. 64.

65.

66.

67.

68.

D

programming under uncertainty. Europ J Oper Res 26:65– 82 Vranes S, Stanojevic M, Stevanovic V, Lucin M (1996) INVEX: Investment advisory expert system. Expert Systems 13, no 2:105–119 Wallenius J, Zionts S (1976) Some tests of an interactive programming method for multicriteria optimization and an attempt at implementation. In: Thiriez H, Zionts S (eds) Multiple Criteria Decision Making. Lecture Notes Economics and Math Systems. Springer, Berlin, pp 319–331 Yu W (1992) ELECTRE TRI: Aspects methodologiques et manuel d’utilisation. Document du Lamsade (Univ ParisDauphine) 74 Zeleny M (1982) Multiple criteria decision making. McGraw-Hill, New York Zopounidis C (1997) The European school of MCDA: Some recent trends. In: Climaco J (ed) Multicriteria Analysis. Springer, Berlin, pp 608–616 Zopounidis C, Doumpos M (1998) Developing a multicriteria decision support system for financial classification problems: The FINCLAS system. Optim Methods Softw 8(34) Zopounidis C, Doumpos M (2000) PREFDIS: A multicriteria decision support system for sorting decision problems. Comput Oper Res 27:779–797 Zopounidis C, Godefroid M, Hurson Ch (1995) Designing a multicriteria DSS for portfolio selection and management. In: Janssen J, Skiadas CH, Zopounidis C (eds) Advances in Stochastic Modeling and Data Analysis. Kluwer, Dordrecht, pp 261–292 Zopounidis C, Matsatsinis NF, Doumpos M (1996) Developing a multicriteria knowledge-based decision support system for the assessment of corporate performance and viability: The FINEVA system. Fuzzy Economic Rev 1(2):35–53

Decomposition Algorithms for the Solution of Multistage Mean-Variance Optimization Problems PANOS PARPAS, BERÇ RUSTEM Department of Computing, Imperial College, London, UK MSC2000: 90C15, 90C90 Article Outline Abstract Background Problem Statement

619

620

D

Decomposition Algorithms for the Solution of Multistage Mean-Variance Optimization Problems

Methods Nested Benders Decomposition (NBD) Augmented Lagrangian Decomposition (ALD) Numerical Experiments

References Abstract Stochastic multistage mean-variance optimization problems represent one of the most frequently used modeling tools for planning problems, especially financial. Decomposition algorithms represent a powerful tool for the solution of problems belonging to this class. The first aim of this article is to introduce multi-stage mean-variance models, explain their applications and structure. The second aim is the discussion of efficient solution methods of such problems using decomposition algorithms. Background Stochastic programming (SP) is becoming an increasingly popular tool for modeling decisions under uncertainty because of the flexible way uncertain events can be modeled, and real-world constraints can be imposed with relative ease. SP also injects robustness to the optimization process. Consider the following standard “deterministic” quadratic program: min x

s:t

1 0 x Hx C c 0 x 2 Ax D b

tempt to take advantage of the specific structure of SP models. We examine two decomposition algorithms that had encouraging results reported in linear SP; the first is based on the regularized version of Benders decomposition developed by [21], and the second on an augmented-lagrangian-based scheme developed by [4]. Others [9,24,27] formulated multistage SP as a problem in optimal control, where the current stage variables depend on the parent node variables, and used techniques from optimal control theory to solve the resulting problem. Another related method is the approximation algorithm by [11] where a sequence of scenario trees is generated whose solution produces lower and upper bounds on the solution of the true problem. Decomposition algorithms are not, however, the only approach to tackle the state explosion from which SPs suffer; approximation algorithms and stochastic methods are just two examples of other methods where research is very active [5]. In this study, we are concerned only with decomposition methods. Problem Statement We consider a quadratic multistage SP. In the linear case, SP was first proposed independently by [10] and [1]; for a more recent description see [7] and [13]. For two stages, the problem is: min x

(1)

x l  x  xu : It is not always possible to know the exact values of the problem data of (1) given by H, A, c, and b. Instead, we may have some estimations in the form of data gathered either empirically or known to be approximated well by a probability distribution. The SP framework allows us to solve problems where the data of the problem are represented as functions of the randomness, yielding results that are more robust to deviations. The power and flexibility of SP does, however, come at a cost. Realistic models include many possible events distributed across several periods, and the end result is a large-scale optimization problem with hundreds of thousands of variables and constraints. Models of this scale cannot be handled by general-purpose optimization algorithms, so special-purpose algorithms at-

s:t

1 0 x Hx C c 0 x C Q(x) 2 Ax D b

x l  x  xu :

(2a) (2b) (2c)

We use 0 to denote the transpose of a vector or a matrix. c and x u;l are known vectors in i

where E(i a ) is the rotamer–template energy for rotamer ia of amino acid i, E(i a ; j b ) is the rotamer–rotamer energy of rotamer ia and rotatmer jb of amino acids i and j, respectively, and N is the total number of positions. The original DEE pruning criterion is based on the concept that if the pairwise energy between rotamer ia and rotamer jb is higher than that between rotamer ic and rotamer jb for all rotamer jb in a certain rotamer set {B}, then rotamer ia cannot be in the global energy minimum conformation and thus can be eliminated. It was

643

644

D

De Novo Protein Design Using Rigid Templates

proposed in [9] and can be expressed in the following mathematical form: E(i a ) C

N X

E(i a ; j b ) > E(i c ) C

j¤i

N X

E(i c ; j b ) 8fBg:

This can be generalized to the use of a weighted average of C rotamers ic to eliminate ia [14]:

j¤i

Rotamer ia can be pruned if the above holds true. Bounds implied by (1) can be utilized to generate the following computationally more tractable inequality [9]: N X

min E(i a ; j b ) b

j¤i

> E(i c ) C

N X

max E(i c ; j b ) :

j¤i

b

(3)

N X

min "(i a ; j b ; k c )

k¤i; j

c

N X

> "(i a 0 ; j b 0 ) C

k¤i; j

max "(i a 0 ; j b 0 ; k c ) ; c

(4)

(8)

w c E(i c ; j b )] > 0 :

Lasters et al. [25] proposed that the most suitable weights wc can be determined by solving a linear programming problem. In addition to these criteria proposed by Goldstein [14], Pierce et al. [38] introduced the split DEE, which splits the conformational space into partitions and thus eliminated the dead-ending rotamers more efficiently:

C

N X

fmin [E(i a ; j a 0 )  E(i c ; j a 0 )]g 0

j; j¤k¤i

a

C [E(i a ; kb 0 )  E(i c ; kb 0 )] > 0 :

(9)

In general, n splitting positions can be assigned for more efficient but computationally expensive rotamer elimination:

C C

N X

fmin [E(i a ; j a 0 )  E(i c ; j a 0 )]g 0

X

a

[E(i a ; kb 0 )  E(i c ; kb 0 )] > 0 :

(10)

kDk 1 ;:::;kn

(5)

"(i a ; j b ; k c ) D E(i a ; k c ) C E( j b ; k c ) :

(6)

It determines a rotamer pair ia and jb which always contributes higher energies than rotamer pair i a 0 and j b 0 for all possible rotamer combinations. Goldstein [14] improved the original DEE criterion by stating that rotamer ia can be pruned if the energy contribution is always reduced by an alternative rotamer ic :

j¤i

b

cD1;:::;C

j; j¤k 1 ;:::;kn¤i

"(i a ; j b ) D E(i a ) C E( j b ) C E(i a ; j b ) ;

N X



j¤i

min[E(i a ; j b )

E(i a )  E(i c )

where " is the total energy of rotamer pairs:

E(i a )  E(i c ) C

X

N X

E(i a )  E(i c )

The above equations for eliminating rotamers at a single position (or singles) can be extended to eliminating rotamer pairs at two distinct positions (doubles), rotamer triplets at three distinct positions (triples), or above [9,37]. In the case of doubles, the equation becomes "(i a ; j b ) C

w c E(i c ) C

cD1;:::;C

(2)

E(i a ) C

X

E(i a ) 

min[E(i a ; j b ) E(i c ; j b )] > 0: b

(7)

Looger and Hellinga [27] also introduced the generalized DEE by ranking the energy of rotamer clusters instead of that of individual rotamers and increased the ability of the algorithm to deal with higher levels of combinatorial complexity. Further revisions and improvements on DEE were performed by Wernisch et al. [47] and Gordon et al. [15]. Being deterministic in nature, the different forms of DEE reviewed above all yield the same globally optimal solution upon convergence. Successes Using Dead-End Elimination: Based on operating the DEE algorithm on a fixed template, the Mayo group devised their optimization of rotamers

De Novo Protein Design Using Rigid Templates

by an iterative technique (ORBIT) program and applied it to numerous de novo protein designs. Examples are the full-sequence design of the ˇˇ˛ fold of a zinc finger domain [6], improvement of calmodulin binding affinity [45], full core design of the variable domains of the light and heavy chains of catalytic antibody 48G7 FAB, full core/boundary design, full surface design, and full-sequence design of the ˇ1 domain of protein G [15], as well as the redesign of the core of T4 lysozyme [32]. They also adjusted secondary structure parameters to build the “idealized backbone” and used it as a fixed template to design an ˛/ˇ-barrel protein [33]. The Hellinga group applied DEE with a fixed backbone structure to introduce iron and oxygen binding sites into thioredoxin [2,3], design receptor and sensor proteins with novel ligand-binding functions [28], and confer novel enzymatic properties onto ribosebinding protein [11]. The Self-Consistent Mean-Field Method The SCMF optimization method is an iterative procedure that predicts the values of the elements of a conformational matrix P(i, a) for the probability of a design position i adopting the conformation of rotamer a. Note that P(i, a) sums to unity over all rotamers a for each position i. Koehl and Delarue [19] were among those who introduced such a method for protein design. They started the iteration with an initial guess for the conformational matrix, which assigns equal probability to all rotamers: P(i; a) D

1 A

a D 1; 2; : : : ; A :

(11)

Most importantly, they applied the mean-field potential, E(i, a), which depends on the conformational matrix P(i, a): E(i; a) D U(x i a ) C U(x i a ; x0 ) C

N B X X

P( j; b)U(x i a ; x j b ) ;

(12)

jD1; j¤i bD1

where x0 corresponds to the coordinates of atoms in the fixed template, and x i a and x j b correspond to the coordinates of the atoms of position i assuming the conformation of rotamer a and those of position j assuming the conformation of rotamer b, respectively. The classical Lennard-Jones (12-6) potential can be used to de-

D

scribe potential energy U [19]. The conformational matrix can be subsequently updated using the mean-field potential and the Boltzmann law: e P1 (i; a) D P A

E(i;a) RT

aD1

e

E(i;a) RT

:

(13)

The update on P(i, a), namely, P1 (i; a), can then be used to repeat the calculation of the mean-field potential and another update until convergence is attained. Koehl and Delarue [19] set the convergence criterion to be 104 to define self-consistency. They also proposed the introduction of memory of the previous step to minimize oscillations during convergence: P(i; a) D P1 (i; a) C (1  )P(i; a) ;

(14)

with the optimal step size  to be 0.9 [19]. The Saven group [12,24,44,48] extended the SCMF theory and formulated de novo design as an optimization problem maximizing the sequence entropy subject to composition constraints and mean-field energy constraints. In addition to the site probabilities, their method also predicts the number of sequences for a combinatorial library of arbitrary size for the fixed template as a function of energy. It should be highlighted that though deterministic in nature, the SCMF method does not guarantee convergence to the global optimal solution [26]. Successes Using the Self-Consistent Mean-Field Method Koehl and Delarue [20] applied the SCMF approach to design protein loops. In their optimization procedure, they first selected the loop fragment from a database with the highest site probabilities. Then they placed side chains on the fixed loop backbone from a rotamer library. Kono and Doi [23] also used an energy minimization with an automata network, which bears some resemblance to the SCMF method, to design the cores of the globular proteins of cytochrome b562 , triosephosphate isomerase, and barnase. The SCMF method is related to the design of combinatorial libraries of new sequences with good folding properties, which was reviewed in several papers [17,34,35,43]. Stochastic Methods The fact that de novo design is nondeterministic polynomial-time hard [13,36] means that in the worst

645

646

D

De Novo Protein Design Using Rigid Templates

case the time required to solve the problem scales nonpolynomially with the number of design positions. As the problem complexity exceeds a certain level, deterministic methods may reach their limits and in such instances we may have to resort to stochastic methods, which perform searches for only locally optimal solutions. Monte Carlo methods and genetic algorithms are the two most commonly used types of stochastic methods for de novo protein design.

repressor, and sperm whale myoglobin using the conventional Monte Carlo method. The Baker group also utilized the classic Monte Carlo algorithm in their computational protein design program RosettaDesign. Examples of applications of the program include the redesign of nine globular proteins: the src SH3 domain,  repressor, U1A, protein L, tenascin, procarboxypeptidase, acylphosphatase, S6, and FKBP12 using fixed templates [7].

Monte Carlo Methods Different variants of the Monte Carlo methods have been applied for sequence design. In the classic Monte Carlo method, mutation is performed at a certain position in the sequence and energies of the sequence in the fixed template are calculated before and after the mutation. This usually involves the use of discrete rotamer libraries to simplify the consideration of possible side-chain conformations. The new sequence after mutation is accepted if the energy becomes lower. If the energy is higher, the Metropolis acceptance criterion [30] is used

Genetic Algorithms Originating in genetics and evolution, genetic algorithms generate a multitude of random amino acid sequences and exchange them for a fixed template. Sequences with low energies form hybrids with other sequences, while those with high energies are eliminated in an iterative process which only terminates when a converged solution is attained [46].

paccept D min(1; exp(ˇE))

ˇD

1 ; kT

(15)

and the sequence is updated if paccept is larger than a random number uniformly distributed between 0 and 1. In the configurational bias Monte Carlo method, at each step a local energy is used which does not include those positions where a mutation has not been attempted [49]. Cootes et al. [4] reported that the method was more efficient at finding good solutions than the conventional Monte Carlo method, especially for complex systems. Zoz and Savan [49] also devised the meanfield biased Monte Carlo method which biases the sequence search with predetermined site probabilities, which are in turn calculated using SCMF theory. They claimed their new method converges to low-energy sequences faster than classic Monte Carlo and configurational bias Monte Carlo methods. Successes of Monte Carlo Methods Imposing sequence specificity by keeping the amino acid composition fixed, which reduced significantly the complexity, Koehl and Levitt [21,22] designed new sequences for the fixed backbones of the ˇ1 domain of protein G, 

Successes of Genetic Algorithms With fixed backbones, Belda et al. [1] applied genetic algorithms to the design of ligands for prolyl oligopeptidase, p53, and DNA gyrase. In addition, with a cubic lattice and empiricial contact potentials Hohm et al. [18] and Miyazawa and Jernigan [31] also employed evolutionary methods to design short peptides that resemble the antibody epitopes of thrombin and blood coagulation factor VIII with high stability.

References 1. Belda I, Madurga S, Llorà X, Martinell M, Tarragó T, Piqueras MG, Nicolás E, Giralt E (2005) ENPDA: An evolutionary structure-based de novo peptide design algorithm. J Computer-Aided Mol Des 19:585–601 2. Benson D, Wisz M, Hellinga H (1998) The development of new biotechnologies using metalloprotein design. Curr Opin Biotechnol 9:370–376 3. Benson D, Wisz M, Hellinga H (2000) Rational design of nascent metalloenzymes. Proc Natl Acad Sci USA 97:6292– 6297 4. Cootes AP, Curmi PMG, Torda AE (2000) Biased monte carlo optimization of protein sequences. J Chem Phys 113:2489–2496 5. Dahiyat B, Mayo S (1996) Protein design automation. Protein Sci 5:895–903 6. Dahiyat B, Mayo S (1997) De novo protein design: Fully automated sequence selection. Science 278:82–87 7. Dantas G, Kuhlman B, Callender D, Wong M, Baker D (2003) A large scale test of computational protein design: Folding

De Novo Protein Design Using Rigid Templates

8.

9.

10. 11.

12.

13.

14.

15.

16. 17.

18.

19.

20.

21. 22. 23.

24.

and stability of nine completely redesigned globular proteins. J Mol Biol 332:449–460 Desjarlais JR, Clarke ND (1998) Computer search algorithms in protein modification and design. Curr Opin Struct Biol 8:471–475 Desmet J, Maeyer MD, Hazes B, Lasters I (1992) The deadend elimination theorem and its use in side-chain positioning. Nature 356:539–542 Dill K (1990) Dominant forces in protein folding. Biochemistry 29:7133–7155 Dwyer MA, Looger LL, Hellinga H (2004) Computational design of a biologically active enzyme. Science 304:1967– 1971 Fu X, Kono H, Saven J (2003) Probabilistic approach to the design of symmetric protein quaternary structures. Protein Eng 16:971–977 Fung HK, Rao S, Floudas CA, Prokopyev O, Pardalos PM, Rendl F (2005) Computational comparison studies of quadratic assignment like formulations for the in silico sequence selection problem in de novo protein design. J Comb Optim 10:41–60 Goldstein R (1994) Efficient rotamer elimination applied to protein side-chains and related spin glasses. Biophys J 66:1335–1340 Gordon B, Hom G, Mayo S, Pierce N (2003) Exact rotamer optimization for protein design. J Comput Chem 24:232–243 Handel T, Desjarlais J (1995) De novo design of the hydrophobic cores of proteins. Protein Sci 4:2006–2018 Hecht M, Das A, Go A, Bradley L, Wei Y (2004) De novo proteins from designed combinatorial libraries. Protein Sci 13:1711–1723 Hohm T, Limbourg P, Hoffmann D (2006) A multiobjective evolutionary method for the design of peptidic mimotopes. J Comput Biol 13:113–125 Koehl P, Delarue M (1994) Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J Mol Biol 239:249–275 Koehl P, Delarue M (1995) A self consistent mean field approach to simultaneouos gap closure and sidechain positioning in homology modeling. Nat Struct Biol 2:163–170 Koehl P, Levitt M (1999) De novo protein design. i. in search of stability and specificity. J Mol Biol 293:1161–1181 Koehl P, Levitt M (1999) De novo protein design. ii. plasticity in sequence space. J Mol Biol 293:1183–1193 Kono H, Doi J (1994) Energy minimization method using automata network for sequence and side-chain conformation prediction from given backbone geometry. Proteins 19:244–255 Kono H, Saven J (2001) Statistical theory of protein combinatorial libraries: Packing interactions, backbone flexibility, and the sequence variability of a main-chain structure. J Mol Biol 306:607–628

D

25. Lasters I, Maeyer MD, Desmet J (1995) Enhanced dead-end elimination in the search for the global minimum energy conformation of a collection of protein side chains. Protein Eng 8:815–822 26. Lee C (1994) Predicting protein mutant energetics by selfconsistent ensemble optimization. J Mol Biol 236:918– 939 27. Looger L, Hellinga H (2001) Generalized dead-end elimination algorithms make large-scale protein side-chain structure prediction tractable: Implications for protein design and structural genomics. J Mol Biol 307:429–445 28. Looger L, Dwyer M, Smith J, Hellinga H (2003) Computational design of receptor and sensor proteins with novel functions. Nature 423:185–190 29. Lovell SC, Word JM, Richardson JS, Richardson DC (2000) The penultimate rotamer library. Proteins 40:389–408 30. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:389–408 31. Miyazawa S, Jernigan RL (1996) Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading. J Mol Biol 256:623–644 32. Mooers B, Datta D, Baase W, Zollars E, Mayo S, Matthews B (2003) Repacking the core of t4 lysozyme by automated design. J Mol Biol 332:741–756 33. Offredi F, Dubail F, Kischel P, Sarinski K, Stern AS, de Weerdt CV, Hoch JC, Prosperi C, François JM, Mayo SL, Martial JA (2003) De novo backbone and sequence design of an idealized ˛/ˇ -barrel protein: Evidence of stable tertiary structure. J Mol Biol 325:163–174 34. Park S, Stowell XF, Wang W, Yang X, Saven J (2004) Computational protein design and discovery. Annu Rep Prog Chem Sect C 100:195–236 35. Park S, Yang X, Saven J (2004) Advances in computational protein design. Curr Opin Struct Biol 14:487–494 36. Pierce N, Winfree E (2002) Protein design is np-hard. Protein Eng 15:779–782 37. Pierce N, Spriet J, Desmet J, Mayo S (2000) Conformational splitting: A more powerful criterion for dead-end elimination. J Comput Chem 21:999–1009 38. Pierce N, Spriet J, Desmet J, Mayo S (2000) Conformational splitting: A more powerful criterion for dead-end elimination. J Comput Chem 21:999–1009 39. Ponder J, Richards F (1987) Tertiary templates for proteins. J Mol Biol 193:775–791 40. Dunbrack L Jr, Cohen FE (1997) Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci 6:1661–81 41. Richards F, Hellinga H (1994) Optimal sequence selection in proteins of known structure by simulated evolution. Proc Natl Acad Sci USA 91:5803–5807 42. Rosenberg M, Goldblum A (2006) Computational protein design: A novel path to future protein drugs. Curr Pharm Des 12:3973–3997

647

648

D

Derivative Free Method for Nonsmooth Optimization

43. Saven J (2002) Combinatorial protein design. Curr Opin Struct Biol 12:453–458 44. Saven J, Wolynes PG (1997) Statistical mechanics of the combinatorial synthesis and analysis of folding macromolecules. J Phys Chem B 101:8375–8389 45. Shifman J, Mayo S (2002) Modulating calmodulin binding specificity through computational protein design. J Mol Biol 323:417–423 46. Tuffery P, Etchebest C, Hazout S, Lavery R (1991) A new approach to the rapid determination of protein side chain conformations. J Biomol Struct Dyn 8:1267–1289 47. Wernisch L, Hery S, Wodak S (2000) Automatic protein design with all atom force-fields by exact and heuristic optimization. J Mol Biol 301:713–736 48. Zou J, Saven J (2000) Statistical theory of combinatorial libraries of folding proteins: Energetic discrimination of a target structure. J Mol Biol 296:281–294 49. Zou J, Saven J (2003) Using self-consistent fields to bias monte carlo methods with applications to designing and sampling protein sequences. J Chem Phys 118:3843– 3854

Derivative-Free Methods for Non-smooth Optimization ADIL BAGIROV Centre for Informatics and Applied Optimization, University of Ballarat, Ballarat, Australia MSC2000: 65K05, 90C56 Article Outline Introduction Definitions The Clarke Subdifferential Semismooth Functions Quasidifferentiable Functions

Methods Approximation of Subgradients Computation of Subgradients Computation of Subdifferentials and Discrete Gradients A Necessary Condition for a Minimum Computation of Descent Directions The Discrete Gradient Method

Applications Conclusions References

Introduction Consider the following unconstrained minimization problem: minimize f (x) subject tox 2 IRn ;

(1)

where the objective function f is assumed to be Lipschitz continuous. Nonsmooth unconstrained optimization problems appear in many applications and in particular in data mining. Over more than four decades different methods have been developed to solve problem (1). We mention among them the bundle method and its different variations (see, for example, [11,12,13,14,17,20]), algorithms based on smoothing techniques [18], and the gradient sampling algorithm [8]. In most of these algorithms at each iteration the computation of at least one subgradient or approximating gradient is required. However, there are many practical problems where the computation of even one subgradient is a difficult task. In such situations derivative-free methods seem to be a better choice since they do not use the explicit computation of subgradients. Among derivative-free methods, the generalized pattern search methods are well suited for nonsmooth optimization [1,19]. However their convergence are proved under quite restrictive differentiability assumptions. It was shown in [19] that when the objective function f is continuously differentiable in IRn , then the lower limit of the norm of the gradient of the sequence of points generated by the generalized pattern search algorithm goes to zero. The paper [1] provides convergence analysis under less restrictive differentiability assumptions. It was shown that if f is strictly differentiable near the limit of any refining subsequence, then the gradient at that point is zero. However, in many practically important problems this condition is not satisfied, because in such problems the objective functions are not differentiable at local minimizers. In the paper [15] a derivative-free algorithm for a linearly constrained finite minimax problem was proposed. The original problem was converted into a smooth one using a smoothing technique. This algorithm is globally convergent toward stationary points of the finite minimax problem. In this paper we describe a derivative-free method based on the notion of a discrete gradient for solving

D

Derivative Free Method for Nonsmooth Optimization

unconstrained nonsmooth optimization problems. Its convergence is proved for a broad class of nonsmooth functions.

f 0 (x; g) D lim ˛ 1 [ f (x C ˛g)  f (x)]:

Definitions

˛#0

We use the following notation: IR n is an n-dimensional space, where the scalar product will be denoted by hx; yi: hx; yi D

f 0 (x; g) D f 0 (x; g) for all x; g 2 IRn , where f 0 (x; g) is a derivative of function f at point x with respect to direction g:

n X

xi yi

iD1

and k  k will denote the associated norm. The gradient of a function f : IRn ! IR1 will be denoted by r f and the closed ı-ball at x 2 IRn by Sı (x) (by Sı if x D 0): Sı (x) D fy 2 IRn : kx  yk  ıg; ı > 0. The Clarke Subdifferential Let f be a function defined on IRn . Function f is called locally Lipschitz continuous if for any bounded subset X  IRn there exists an L > 0 such that j f (x)  f (y)j  Lkx  yk8x; y 2 X: We recall that a locally Lipschitz function f is differentiable almost everywhere and that we can define for it a Clarke subdifferential [9] by n @ f (x) D co v 2 IRn : 9(x k 2 D( f ); o x k ! x; k ! C1) : v D lim r f (x k ) ; k!C1

where D( f ) denotes the set where f is differentiable and co denotes the convex hull of a set. It is shown in [9] that the mapping @ f (x) is upper semicontinuous and bounded on bounded sets. The generalized directional derivative of f at x in the direction g is defined as f 0 (x; g) D lim sup ˛ 1 [ f (y C ˛g)  f (y)] : y!x;˛#0

If function f is locally Lipschitz continuous, then the generalized directional derivative exists and f 0 (x; g) D max fhv; gi : v 2 @ f (x)g : f is called a Clarke regular function on IR n if it is differentiable with respect to any direction g 2 IRn and

It is clear that the directional derivative f 0 (x; g) of the Clarke regular function f is upper semicontinuous with respect to x for all g 2 IRn . Let f be a locally Lipschitz continuous function defined on IRn . For point x to be a minimum point of function f on IRn , it is necessary that 0 2 @ f (x): Semismooth Functions The function f : IRn ! IR1 is called semismooth at x 2 IR n , if it is locally Lipschitz continuous at x and for every g 2 IRn , the limit lim

hv; gi

g 0 !g;˛#0;v2@ f (xC˛ g 0 )

exists. It should be noted that the class of semismooth functions is fairly wide and it contains convex, concave, max- and min-type functions [16]. The semismooth function f is directionally differentiable and f 0 (x; g) D

lim

g 0 !g;˛#0;v2@ f (xC˛ g 0 )

hv; gi:

Quasidifferentiable Functions A function f is called quasidifferentiable at a point x if it is locally Lipschitz continuous and directionally differentiable at this point and there exist convex, compact sets @ f (x) and @ f (x) such that f 0 (x; g) D max hu; gi C min hv; gi: u2@ f (x)

v2@ f (x)

The set @ f (x) is called a subdifferential, the set @ f (x) is called a superdifferential, and the pair of sets [@ f (x); @ f (x)] is called a quasidifferential of function f at a point x [10]. Methods Approximation of Subgradients We consider a locally Lipschitz continuous function f defined on IRn and assume that this function is quasidifferentiable. We also assume that both sets @ f (x) and

649

650

D

Derivative Free Method for Nonsmooth Optimization

@ f (x) at any x 2 IRn are polytopes, that is, at a point x 2 IRn there exist sets A D fa1 ; : : : ; a m g; a i 2 IRn ; i D 1; : : : ; m; m  1 and

Consider the following two sets: j



j

j



R(x; e (˛)) D v 2 A : hv; e i D maxhu; e i ; u2A  j j j R(x; e (˛)) D w 2 B : hw; e i D minhu; e i : u2B

1

p

j

n

B D fb ; : : : ; b g; b 2 IR ; j D 1; : : : ; p; p  1

Proposition 1 Assume that function f is quasidifferentiable and its subdifferential and superdifferential are polytopes at a point x. Then there exists ˛0 > 0 such that

such that @ f (x) D co A; @ f (x) D co B:

R(x; e j (˛))  R j ; R(x; e j (˛))  R j ; j D 1; : : : ; n

This assumption is true, for example, for functions represented as a maximum, minimum, or max-min of a finite number of smooth functions. We take a direction g 2 IR n such that g D (g1 ; : : : ; g n ); jg i j D 1; i D 1; : : : ; n and consider the sequence of n vectors e j D e j (˛); j D 1; : : : ; n with ˛ 2 (0; 1]: e1 e2 ::: en

D D D D

(˛g1 ; 0; : : : ; 0); (˛g1 ; ˛ 2 g2 ; 0; : : : ; 0); ::::::::: (˛g1 ; ˛ 2 g2 ; : : : ; ˛ n g n ):

f 0 (x; e j (˛)) D f 0 (x; e j1 (˛)) C v j ˛ j g j C w j ˛ j g j ; 8v 2 R j ; w 2 R j ; j D 1; : : : ; n

Proposition 2 Assume that function f is quasidifferentiable and its subdifferential and superdifferential are polytopes at a point x. Then the sets R n and R n are singletons.

R0 D A; R0 D B; n o R j D v 2 R j1 : v j g j D maxfw j g j : w 2 R j1 g ; ˚ R j D v 2 R j1 : v j g j D minfw j g j : w 2 R j1 g : j D 1; : : : ; n :

Remark 1 In the next subsection we propose an algorithm to approximate subgradients. This algorithm finds a subgradient that can be represented as a sum of elements of the sets R n and R n . Computation of Subgradients

It is clear that R j ¤ ;; 8 j 2 f0; : : : ; ng; R j R j1 ; 8 j 2 f1; : : : ; ng and

Let g 2 IRn ; jg i j D 1; i D 1; : : : ; n be a given vector and  > 0; ˛ > 0 be given numbers. We define the following points: x 0 D x; x j D x 0 C e j (˛); j D 1; : : : ; n:

R j ¤ ;; 8 j 2 f0; : : : ; ng; R j R j1 ; 8 j 2 f1; : : : ; ng: Moreover,

It is clear that x j D x j1 C(0; : : : ; 0; ˛ j g j ; 0; : : : ; 0); j D 1; : : : ; n:

(2)

and vr D wr 8v; w 2 R j ; r D 1; : : : ; j:

Corollary 1 Assume that function f is quasidifferentiable and its subdifferential and superdifferential are polytopes at a point x. Then there exists ˛0 > 0 such that

for all ˛ 2 (0; ˛0 ]

We introduce the following sets:

vr D wr 8v; w 2 R j ; r D 1; : : : ; j

for all ˛ 2 (0; ˛0 ).

(3)

Let v D v(˛; ) 2 IRn be a vector with the following coordinates:   v j D (˛ j g j )1 f (x j )  f (x j1 ) ; j D 1; : : : ; n: (4)

D

Derivative Free Method for Nonsmooth Optimization

For any fixed g 2 IRn ; jg i j D 1; i D 1; : : : ; n and ˛ > 0 we introduce the following set:  V(g; ˛) D w 2 IRn : 9( k ! C0; k ! C1); w D lim v(˛;  k ) : k!C1

Proposition 3 Assume that f is a quasidifferentiable function and its subdifferential and superdifferential are polytopes at x. Then there exists ˛0 > 0 such that V(g; ˛)  @ f (x) for all ˛ 2 (0; ˛0 ]. Remark 2 It follows from Proposition 3 that in order to approximate subgradients of quasidifferentiable functions one can choose a vector g 2 IRn such that jg i j D 1; i D 1; : : : ; n, sufficiently small ˛ > 0;  > 0, and apply (4) to compute a vector v(˛; ). This vector is an approximation to a certain subgradient.

Sect. “Approximation of Subgradients.” Then for given x 2 IR n and z 2 P we define a sequence of n C 1 points as follows: x0 D x C x1 D x0 C x2 D x0 C ::: D ::: x n D x0 C

g; z()e 1 (˛); z()e 2 (˛); ::: z()e n (˛):

Definition 1 The discrete gradient of function f at point x 2 IRn is the vector  i (x; g; e; z; ; ˛) D (1i ; : : : ; ni ) 2 IRn ; g 2 S1 with the following coordinates:    ji D [z()˛ j e j )]1 f (x j )  f (x j1 ) ; j D 1; : : : ; n; j ¤ i ; 3 n X  ii D (g i )1 4 f (x C g)  f (x)    ji g j 5 : 2

jD1; j¤i

Computation of Subdifferentials and Discrete Gradients In the previous subsection we demonstrated an algorithm for the computation of subgradients. In this subsection we consider an algorithm for the computation of subdifferentials. This algorithm is based on the notion of a discrete gradient. We start with the definition of the discrete gradient, which was introduced in [2] (for more details, see also [3,4]). Let f be a locally Lipschitz continuous function defined on IRn . Let S1 D fg 2 IRn : kgk D 1g; G D fe 2 IRn : e D (e1 ; : : : ; e n ); je j j D 1; j D 1; : : : ; ng; P D fz() : z() 2 IR1 ; z() > 0;  > 0; 1 z() ! 0;  ! 0g: Here S1 is the unit sphere, G is the set of vertices of the unit hypercube in IR n , and P is the set of univariate positive infinitesimal functions. We take any g 2 S1 and define jg i j D maxfjg k j; k D 1; : : : ; ng. We also take any e D (e1 ; : : : ; e n ) 2 G, a positive number ˛ 2 (0; 1], and define the sequence of n vectors e j (˛); j D 1; : : : ; n as in

It follows from the definition that f (x C g)  f (x) D h i (x; g; e; z; ; ˛); gi

(5)

for all g 2 S1 ; e 2 G; z 2 P;  > 0; ˛ > 0. Remark 3 One can see that the discrete gradient is defined with respect to a given direction g 2 S1 , and in order to compute the discrete gradient  i (x; g; e; z; ; ˛), first we define a sequence of points x 0 ; : : : ; x n and compute the values of function f at these points; that is, we compute n C 2 values of this function including point x. n  1 coordinates of the discrete gradient are defined similarly to those of the vector v(˛; ) from the Sect. “Approximation of Subgradients,” and the ith coordinate is defined so as to satisfy equality (5), which can be considered as as version of the mean value theorem. Proposition 4 Let f be a locally Lipschitz continuous function defined on IRn and L > 0 its Lipschitz constant. Then for any x 2 IRn ; g 2 S1 ; e 2 G;  > 0; z 2 P; ˛>0 k i k  C(n)L; C(n) D (n2 C 2n3/2  2n1/2 )1/2 :

651

652

D

Derivative Free Method for Nonsmooth Optimization

For a given ˛ > 0 we define the following set: B(x; ˛) Dfv 2 IRn : 9(g 2 S1 ; e 2 G; z k 2 P; z k ! C0;  k ! C0; k ! C1); v D lim  i (x; g; e; z k ;  k ; ˛)g: k!C1

(6)

Proposition 5 Assume that f is a semismooth, quasidifferentiable function and its subdifferential and superdifferential are polytopes at a point x. Then there exists ˛0 > 0 such that co B(x; ˛)  @ f (x)

 > 0. However, it is true at a given point. To get convergence results for a minimization algorithm based on discrete gradients, we need some relationship between the set D0 (x; ) and @ f (x) in some neighborhood of a given point x. We will consider functions satisfying the following assumption. Assumption 1 Let x 2 IRn be a given point. For any " > 0 there exist ı > 0 and 0 > 0 such that D0 (y; )  @ f (x C S¯" ) C S" for all y 2 Sı (x) and  2 (0; 0 ). Here

for all ˛ 2 (0; ˛0 ]. Remark 4 Proposition 5 implies that discrete gradients can be applied to approximate subdifferentials of a broad class of semismooth, quasidifferentiable functions. Remark 5 One can see that the discrete gradient contains three parameters:  > 0, z 2 P, and ˛ > 0. z 2 P is used to exploit the semismoothness of function f , and it can be chosen sufficiently small. If f is a semismooth quasidifferentiable function and its subdifferential and superdifferential are polytopes at any x 2 IRn , then for any ı > 0 there exists ˛0 > 0 such that ˛ 2 (0; ˛0 ] for all y 2 Sı (x). The most important parameter is  > 0. In the sequel we assume that z 2 P and ˛ > 0 are sufficiently small. Consider the following set: D0 (x; ) Dcl co fv 2 IRn : 9(g 2 S1 ; e 2 G; z 2 P) : v D  i (x; g; e; ; z; ˛)g: Proposition 4 implies that the set D0 (x; ) is compact and it is also convex for any x 2 IRn . Corollary 2 Let f be a quasidifferentiable semismooth function. Assume that in the equality f (x C g)  f (x) D  f 0 (x; g) C o(; g); g 2 S1 1 o(; g) ! 0 as  ! C0 uniformly with respect to g 2 S1 . Then for any " > 0 there exists 0 > 0 such that D0 (x; )  @ f (x) C S" for all  2 (0; 0 ). Corollary 2 shows that the set D0 (x; ) is an approximation to the subdifferential @ f (x) for sufficiently small

(7)

@ f (x C S¯" ) D

[

@ f (y); S¯" (x)

y2 S¯" (x)

D fy 2 IRn : kx  yk  "g:

A Necessary Condition for a Minimum Consider problem (1), where f : IRn ! IR1 is an arbitrary function. Proposition 6 Let x  2 IRn be a local minimizer of function f . Then there exists 0 > 0 such that 0 2 D0 (x; ) for all  2 (0; 0 ). Proposition 7 Let 0 62 D0 (x; ) for a given  > 0 and v 0 2 IRn be a solution to the following problem: minimizekvk2 subject to v 2 D0 (x; ): Then the direction g 0 D kv 0 k1 v 0 is a descent direction. Proposition 7 shows how the set D0 (x; ) can be used to compute descent directions. However, in many cases the computation of the set D0 (x; ) is not possible. In the next section we propose an algorithm for the computation of descent directions using a few discrete gradients from D0 (x; ). Computation of Descent Directions In this subsection we describe an algorithm for the computation of descent directions of the objective function f of Problem (1).

Derivative Free Method for Nonsmooth Optimization

D

Let z 2 P;  > 0; ˛ 2 (0; 1], the number c 2 (0; 1), and a tolerance ı > 0 be given.

condition (9) satisfies after m computations of the discrete gradients, where

Algorithm 1 An algorithm for the computation of the descent direction. Step 1. Choose any g 1 2 S1 ; e 2 G; compute i D argmax fjg j j; j D 1; : : : ; ng and a discrete gradient v 1 D  i (x; g 1 ; e; z; ; ˛). Set D1 (x) D fv 1 g and k D 1. Step 2. Compute the vector kw k k2 D minfkwk2 : w 2 D k (x)g. If

¯ log rC1); r D 1[(1c)(2C) ¯ 1 ı]2 ; m  2(log2 (ı/C)/ 2

kw k k  ı;

(8)

then stop. Otherwise go to Step 3. Step 3. Compute the search direction by g kC1 D kw k k1 w k . Step 4. If f (x C g kC1 )  f (x)  ckw k k;

(9)

then stop. Otherwise go to Step 5. j : j D 1; : : : ; ng Step 5. Compute i D argmax fjg kC1 j and a discrete gradient v kC1 D  i (x; g kC1 ; e; z; ; ˛); construct the set D kC1 (x) D co fD k (x) k D k C 1, and go to Step 2.

S kC1 fv gg, set

In what follows we provide some explanations of Algorithm 1. In Step 1 we compute the discrete gradient with respect to an initial direction g 1 2 IRn . The distance between the convex hull D k (x) of all computed discrete gradients and the origin is computed in Step 2. This problem is solved using the algorithm from [21]. If this distance is less than the tolerance ı > 0, then we accept point x as an approximate stationary point (Step 2); otherwise we compute another search direction in Step 3. In Step 4 we check whether this direction is a descent direction. If it is, we stop and the descent direction has been computed; otherwise we compute another discrete gradient with respect to this direction in Step 5 and update the set D k (x). At each iteration k we improve the approximation of the subdifferential of function f . The next proposition shows that Algorithm 1 is terminating. Proposition 8 Let f be a locally Lipschitz function de¯ either condition (8) or fined on IRn . Then, for ı 2 (0; C),

C¯ D C(n)L, and C(n) is a constant from Proposition 4. Remark 6 Proposition 4 and equality (5) are true for any  > 0 and for any locally Lipschitz continuous functions. This means that Algorithm 1 can compute descent directions for any  > 0 and for any locally Lipschitz continuous functions in a finite number of iterations. Sufficiently small values of  give an approximation to the subdifferential, and in this case Algorithm 1 computes local descent directions. However, larger values of  do not give an approximation to the subdifferential and in this case descent directions computed by Algorithm 1 can be considered global descent directions. The Discrete Gradient Method In this section we describe the discrete gradient method. Let sequences ı k > 0; z k 2 P;  k > 0; ı k ! C0; z k ! C0;  k ! C0; k ! C1, sufficiently small number ˛ > 0, and numbers c1 2 (0; 1); c2 2 (0; c1 ] be given. Algorithm 2 Discrete gradient method Step 1. Choose any starting point x 0 2 IRn and set k D 0. Step 2. Set s D 0 and x sk D x k . Step 3. Apply Algorithm 1 for the computation of the descent direction at x D x sk ; ı D ı k ; z D z k ;  D  k ; c D c1 . This algorithm terminates after a finite number of iterations l > 0. As a result we get the set D l (x sk ) and an element vsk such that kvsk k2 D minfkvk2 : v 2 D l (x sk )g: Furthermore, either kvsk k  ı k or for the search direction gsk D kvsk k1 vsk f (x sk C  k gsk )  f (x sk )  c1  k kvsk k:

(10)

Step 4. If kvsk k  ı k ;

(11)

then set x kC1 D x sk ; k D k C 1 and go to Step 2. Otherwise go to Step 5.

653

654

D x sk

Derivative Free Method for Nonsmooth Optimization

k Step 5. Construct the following iteration x sC1 D k C s gs , where s is defined as follows:

s D argmax f  0 : f (x sk C  gsk )  f (x sk ) o  c2 kvsk k : Step 6. Set s D s C 1 and go to Step 3. n 0 0 For ˚ thenpoint x 2 IR 0 we consider the set M(x ) D x 2 IR : f (x)  f (x ) :

Proposition 9 Assume that function f is semismooth quasidifferentiable, its subdifferential and superdifferential are polytopes at any x 2 IRn , Assumption 1 is satisfied, and the set M(x 0 ) is bounded for starting points x 0 2 IRn . Then every accumulation point of fx k g belongs to the set X 0 D fx 2 IRn : 0 2 @ f (x)g. Remark 7 Since Algorithm 1 can compute descent directions for any values of  > 0, we take 0 2 (0; 1), some ˇ 2 (0; 1), and update  k ; k  1 as follows:  k D ˇ k 0 ; k  1: Thus in the discrete gradient method we use approximations to subgradients only at the final stage of the method, which guarantees convergence. In most iterations we do not use explicit approximations of subgradients. Therefore it is a derivative-free method. Remark 8 It follows from (10) and c2  c1 that always s   k and therefore  k > 0 is a lower bound for s . This leads to the following rule for the computation of s . We define a sequence: m D m k ; m  1; and s is defined as the largest m satisfying the inequality in Step 5. Applications There are many problems from applications where the objective and/or constraint functions are not regular. We will consider one of them, the cluster analysis problem, which is an important application area in data mining. Clustering is also known as the unsupervised classification of patterns; it deals with problems of organizing a collection of patterns into clusters based on similarity. Clustering has many applications in information retrieval, medicine, etc.

In cluster analysis we assume that we have been given a finite set C of points in the n-dimensional space IRn , that is, C D fc 1 ; : : : ; c m g; where c i 2 IRn ; i D 1; : : : ; m: We consider here partition clustering, that is, the distribution of the points of set C into a given number q of disjoint subsets C i ; i D 1; : : : ; q with respect to predefined criteria such that: (1) C i ¤ ;; i D 1; : : : ; q; T (2) C i C j D ;; i; j D 1; : : : ; q; i ¤ j; q S (3) C D Ci . iD1

The sets C i ; i D 1; : : : ; q are called clusters. The strict application of these rules is called hard clustering, unlike fuzzy clustering, where the clusters are allowed to overlap. We assume that no constraints are imposed on the clusters C i ; i D 1; : : : ; q, that is, we consider the hard unconstrained clustering problem. We also assume that each cluster C i ; i D 1; : : : ; q can be identified by its center (or centroid). There are different formulations of clustering as an optimization problem. In [5,6,7] the cluster analysis problem is reduced to the following nonsmooth optimization problem: minimize f (x 1 ; : : : ; x q ) subject to (x 1 ; : : : ; x q ) 2 IRnq ;

(12)

where f (x 1 ; : : : ; x q ) D

m 1 X min kx s  c i k2 : m iD1 sD1;:::;q

(13)

Here k  k is the Euclidean norm and x s 2 IRn stands for the sth cluster center. If q > 1, then the objective function (13) in problem (12) is nonconvex and nonsmooth. Moreover, function f is a nonregular function, and the computation of even one subgradient of this function is quite a difficult task. This function can be represented as the difference of two convex functions as follows: f (x) D f1 (x)  f 2 (x); where f 1 (x) D

q m 1 XX s kx  c i k2 ; m iD1 sD1

Derivatives of Markov Processes and Their Simulation

f 2 (x) D

q m X 1 X max kx k  c i k2 : sD1;:::;q m iD1

kD1;k¤s

It is clear that function f is quasidifferentiable and its subdifferential and are polytopes at any point. Thus, the discrete gradient method can be applied to solve clustering problem.

Conclusions We have discussed a derivative-free discrete gradient method for solving unconstrained nonsmooth optimization problems. This algorithm can be applied to a broad class of optimization problems including problems with nonregular objective functions. It is globally convergent toward stationary points of semismooth, quasidifferentiable functions whose subdifferential and superdifferential are polytopes.

References 1. Audet C, Dennis JE Jr (2003) Analysis of generalized pattern searches. SIAM J Optim 13:889–903 2. Bagirov AM, Gasanov AA (1995) A method of approximating a quasidifferential. J Comput Math Math Phys 35(4):403–409 3. Bagirov AM (1999) Minimization methods for one class of nonsmooth functions and calculation of semi-equilibrium prices. In: Eberhard A et al (eds) Progress in Optimization: Contributions from Australasia. Kluwer, Dordrecht, pp 147–175 4. Bagirov AM (2003) Continuous subdifferential approximations and their applications. J Math Sci 115(5):2567–2609 5. Bagirov AM, Rubinov AM, Soukhoroukova AV, Yearwood J (2003) Supervised and unsupervised data classification via nonsmooth and global optimisation. TOP: Span Oper Res J 11(1):1–93 6. Bagirov AM, Ugon J (2005) An algorithm for minimizing clustering functions. Optim 54(4–5):351–368 7. Bagirov AM, Yearwood J (2006) A new nonsmooth optimisation algorithm for minimum sum-of-squares clustering problems. Eur J Oper Res 170(2):578–596 8. Burke JV, Lewis AS, Overton ML (2005) A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J Optim 15(3):751–779 9. Clarke FH (1983) Optimization and Nonsmooth Analysis. Wiley, New York 10. Demyanov VF, Rubinov AM (1995) Constructive Nonsmooth Analysis. Lang, Frankfurt am Main 11. Hiriart-Urruty JB, Lemarechal C (1993) Convex Analysis and Minimization Algorithms, vols 1 and 2. Springer, Berlin

D

12. Kiwiel KC (1985) Methods of Descent for Nondifferentiable Optimization. In: Lecture Notes in Mathematics. Springer, Berlin 13. Lemarechal C (1975) An extension of Davidon methods to nondifferentiable problems. In: Balinski ML, Wolfe P (eds) Nondifferentiable Optimization. Mathematical Programming Study, vol 3, North-Holland, Amsterdam, pp 95–109 14. Lemarechal C, Zowe J (1994) A condensed introduction to bundle methods in nonsmooth optimization. In: Spedicato E (ed) Algorithms for Continuous Optimization. Kluwer, Dordrecht, pp 357–482 15. Liuzzi G, Lucidi S, Sciandrone M (2006) A derivative free algorithm for linearly constrained finite minimax problems. SIAM J Optim 16(4):1054–1075 16. Mifflin R (1977) Semismooth and semiconvex functions in constrained optimization. SIAM J Control Optim 15(6):959–972 17. Mifflin R (1977) An algorithm for constrained optimization with semismooth functions. Math Oper Res 2:191–207 18. Polak E, Royset JO (2003) Algorithms for finite and semiinfinite min-max-min problems using adaptive smoothing techniques. J Optim Theory Appl 119(3):421–457 19. Torczon V (1997) On the convergence of pattern search algorithms. SIAM J Optim 7:1–25 20. Wolfe PH (1975) A method of conjugate subgradients of minimizing nondifferentiable convex functions. Math Program Stud 3:145–173 21. Wolfe PH (1976) Finding the nearest point in a polytope. Math Program 11(2):128–149

Derivatives of Markov Processes and Their Simulation GEORG PFLUG University Vienna, Vienna, Austria MSC2000: 90C15, 60J05 Article Outline Keywords Introduction Process Derivatives Distributional Derivatives

Regenerative Processes See also References Keywords Derivatives; Stochastic optimization

655

656

D

Derivatives of Markov Processes and Their Simulation

Introduction

Process Derivatives

The optimal design of stochastic systems like queueing or inventory systems is a specific stochastic optimization problem. Let Y x (t) be an ergodic Markov process with discrete time t = 1, 2, . . . and values in Rm , depending on a control parameter x 2 Rd . Let H(x, ) be some cost function. The problem is to find the control x which minimizes the expectedcosts of the system either under the transient or under the stationary regime: 1) Under the transient regime, the process is started at time 0 in a specific starting state y0 and observed until time T. The optimality problem reads 8 T ˆ X ˆ 0, where is the unique stationary probability measure pertaining to P. Suppose that A is a regenerative set for all transitions Px . The sequence of regenerative stopping times of Y x (t)is T1(A) D min ft : Yx (t) 2 Ag ; n o (A) D min t > Ti(A) : Yx (t) 2 A : TiC1

657

658

D

Derivatives of Probability and Integral Functions: General Theory and Examples

These stopping times cut the process into independent pieces. For a process Y x started in A, the following fundamental equation relates the finite time behavior to the stationary, i. e. long run behavior: E

hP

E[H(Yx (1))] D

T (A) tD1

H(Yx (t))

i :

E(T (A) )

(5)

The score method for derivative estimation gives 2 (A) 3 3 2 (A) T T X X rx E 4 H(Yx (t))5 D E 4 H(Yx (t))Wx (t)5 tD1

2 rx E[T (A) ] D E 4

3

(A)

T X

Wx (t)5

and — by the quotient rule — rx E[H(Yx (1))] )  rx E

D E

hP



MSC2000: 90C15 Article Outline Keywords Notations and Definitions Integral Over the Surface Formula Integral Over the Volume Formula General Formula See also References

tD1

E(T

S. URYASEV Department Industrial and Systems Engineering, University Florida, Gainesville, USA

tD1

and

(A)

Derivatives of Probability and Integral Functions: General Theory and Examples

hP

T (A) tD1

H(Yx (t))

i

[E(T (A) )]2

T (A) tD1

i H(Yx (t))  rx E(T (A) ) [E(T (A) )]2

(see [2]). For the estimation of r x E[H(Y x (1))], all expectations of the right-hand side have to be replaced by estimates. See also  Derivatives of Probability and Integral Functions: General Theory and Examples  Derivatives of Probability Measures  Discrete Stochastic Optimization  Optimization in Operation of Electric and Energy Power Systems References 1. Pflug GCh (1996) Optimization of stochastic models: the interface between simulation and optimization. Kluwer, Dordrecht 2. Rubinstein RY, Shapiro A (1993) Discrete event systems: Sensitivity and stochastic optimization by the score function method. Wiley, New York

Keywords Probability function; Derivative of an integral; Gradient of an integral; Derivative of a probability function; Gradient of a probability function Probability functions are commonly used for the analysis of models with uncertainties or variabilities in parameters. For instance, in risk and reliability analysis, performance functions, characterizing the operation of systems, are formulated as probabilities of successful or unsuccessful accomplishment of their missions (core damage probability of a nuclear power plant, probability of successful landing of an aircraft, probability of profitable transactions in a stock market, or percentiles of the risks in public risk assessments). Sensitivity analysis of such performance functions involves evaluating of their derivatives with respect to the parameters. Also, the derivatives of the probability function can be used to solve stochastic optimization problems [1]. A probability function can be formally presented as an expectation of a discontinuous indicator function of a set, or as an integral over a domain — depending upon parameters. Nevertheless, differentiability conditions of the probability function do not follow from similar conditions of the expectations of continuous (smooth or convex) functions.

Derivatives of Probability and Integral Functions: General Theory and Examples

The derivative of the probability function has many equivalent representations. It can be represented as an integral over the surface, an integral over the volume, or a sum of integrals over the volume and over the surface. Also, it can be calculated using weak derivatives of the probability measures or conditional expectations. The first general result on the differentiability of the probability function was obtained by E. Raik [8]. He represented the gradient of the probability function with one constraint in the form of the surface integral. S. Uryasev [10] extended Raik’s formula for probability functions with many constraints. A.I. Kibzun and G.L. Tretyakov [3] extended it to the piecewise smooth constraint and probability density function. Special cases of probability function with normal and gamma distributions were investigated by A. Prékopa [6]. G.Ch. Pflug [5] represented the gradient of probability function in the form of an expectation using weak probability measures. Uryasev [9] expressed the gradient of the probability function as a volume integral. Also, using a change of variables, K. Marti [4] derived the probability function gradient in the form of the volume integral. A general analytical formula for the derivative of probability functions with many constraints was obtained by Uryasev [10]; it calculates the gradient as an integral over the surface, an integral over the volume, or the sum of integrals over the surface and the volume. Special cases of this formula correspond to the Raik formula [8], the Uryasev formula[9], and the change-ofvariables approach [4]. The gradient of the quantile function was obtained in [2]. Notations and Definitions Let an integral over the volume Z p(x; y) d y F(x) D

D

ter x. For example, let F(x) D Pf f (x; (!))  0g

(2)

be a probability function, where  (!) is a random vector in Rm . The random vector  (!) is assumed to have a probability density p(x, y) that depends on a parameter x 2 Rn . The probability function can be represented as an expectation of an indicator function, which equals one on the integration set, and equals zero outside of it. For example, let   F(x) D E If f (x;)0g g(x; ) Z g(x; y)(x; y) d y D f (x;y)0 Z p(x; y) d y; D

(3)

f (x;y)0

where I {} is an indicator function, andthe random vector  in Rm has a probability density (x, y) that depends on a parameter x 2 R. Integral Over the Surface Formula The following formula calculates the gradient of an integral (1) over the set given by nonlinear inequalities as sum of integral over the volume plus integral over the surface of the integration set. We call this the integral over the surface formula because if the density p(x, y) does not depend upon x the gradient of the integral (1) equals an integral over the surface. This formula for the case of one inequality was obtained by Raik [8] and generalized for the case with many inequalities by Uryasev [10]. Let us denote by (x) the integration set (x) D fy 2 Rm : f (x; y)  0g

(1)

:D fy 2 Rm : f l (x; y)  0; 1  l  kg

f (x;y)0

be defined on the Euclidean space Rn , where f : Rn × Rm ! Rk and p: Rn × Rm ! R are some functions. The inequality f (x, y)  0 in the integral is a system of inequalities f i (x; y)  0;

i D 1; : : : ; k:

Both the kernel function p(x, y) and the function f (x, y) defining the integration set depend upon the parame-

and by @(x) the surface of this set (x). Also, let us denote by @i (x) a part of the surface which corresponds to the function f i (x, y), i. e., @ i (x) D (x) \ fy 2 Rm : f i (x; y) D 0g : If the constraint functions are differentiable and the following integral exists, then gradient of integral (1)

659

660

D

Derivatives of Probability and Integral Functions: General Theory and Examples

We use formula (4) to calculate the gradient r x F(x) as an integral over the surface. The function p(A) does not depend upon x and r x p(A) = 0. Formula (4) implies that r x F(x) equals

equals Z rx F(x) D

rx p(x; y) d y

(x)



k Z X iD1

@ i (x)

p(x; y)

r f (x; y) dS:

r y f i (x; y) x i

(4)



k Z X @ i (x)

iD1

A potential disadvantage of this formula is that in multidimensional case it is difficult to calculate the integral over the nonlinear surface. Most well known numerical techniques, such as Monte-Carlo algorithms, are applicable to volume integrals. Nevertheless, this formula can be quite useful in various special cases, such as the linear case. Example 1 (Linear case: Integral over the surface formula [10].) Let A(!), be a random l × n matrix with the joint density p(A). Suppose that x 2 Rn and xj 6D 0, j = 1, . . . , n. Let us define F(x) D PfA(!)x  b; A(!)  0g; l

b D (b1 ; : : : ; b l ) 2 R ;

n

x2R ;

(5)

i. e. F(x) is the probability that the linear constraints A(!)x  b, A(!)  0 are satisfied. The constraint, A(!)  0, means that all elements aij (!) of the matrix A(!) are nonnegative. Let us denote by Ai and Ai the ith row and column of the matrix A 0 1 A1  B :: C  1 A D @ : A D A ; : : : ; An ;

p(A) rx f i (x; A) dS: krA f i (x; A)k

Since r x f i (x, A) = 0 for i = l + 1, . . . , k, then r x F(x) equals 

l Z X iD1

D

@ i (x)

p(A) rx f i (x; A) dS krA f i (x; A)k

l Z X @ i (x)

iD1

1

D  kxk

l Z X

1 A1 x  b1 C B :: C 0 1 B : C B f1 (x; A) C B B :: C B A l x  b l C f (x; A) D @ : A D B C; B A1 C C B f k (x; A) :: C B A @ : An

Integral Over the Volume Formula This section presents gradient of the function (1) in the form of volume integral. Let us introduce the following shorthand notations 1 0 f 1 (x; y) B : C f 1l (x; y) D @ :: A ; f (x; y) D f 1k (x; y); f l (x; y) 0 @ f 1 (x;y) B r y f (x; y) D B @

@y 1

@ f 1 (x;y) @y m

 :: : 

@ f k (x;y) @y 1

@ f k (x;y) @y m

1 C C: A

1i

B div y H D B @

iD1 @y i

Pm

:: :

@h ni iD1 @y i

C C: A

Following [10], the derivative of the function (1) is represented as an integral over the volume Z

k D l C l  n:

f (x;A)0

p(A)A> i dS:

Divergence for the n × m matrix H consisting of the elements hji is denoted by 0 Pm @h 1

0

The function F(x) equals Z p(A) dA: F(x) D

Axb; A0 A i xDb i

iD1

Al then

p(A) > A dS kxk i

rx F(x) D

rx p(x; y) d y Z   C div y p(x; y)H(x; y) d y;

(x)

(6)

(x)

(7)

D

Derivatives of Probability and Integral Functions: General Theory and Examples

where a matrix function H: Rn × Rm ! Rn×m satisfies the equation H(x; y)r y f (x; y) C rx f (x; y) D 0:

(8)

The last system of equations may have many solutions, therefore formula (7) provides a number of equivalent expressions for the gradient. The following section gives analytical solutions of this system of equations. In some cases, this system does not have any solution, and formula (7) is not valid. The following section deals with such cases and provides a general formula where system of equations can be solved only for some of the functions defining the integration set. Example 2 (Linear case: Integral over the volume formula [10].) With formula (7), the gradient of the probability function (5) with linear constrains considered in Example 1 can be represented as the integral over the volume. It can be shown that equation (8) does not have a solution in this case. Nevertheless, we can slightly modify the constraints, such that integration set is not changed and equation (8) has a solution. In the vector function f (x, A) we multiply column Ai on xi if xi is positive or multiply it on xi if xi is negative. Therefore, we have the following constraint function 0 1 A1 x  b1 B C :: B C : B C B C B Al x  bl C (9) f (x; A) D B C; B (C)x1 A1 C B C :: B C @ A : n (C)x n A where (+) means that we take an appropriate sign. It can be directly checked that, the matrix H l (x, A)   H  (x; A) D h 1 (x; A1 ); : : : ; h l (x; A l ) ; 1 0 a i1 x11 0 C B :: h i (x; A i ) D  @ A : 0

a i n x n1

is a solution of system (8). As it will be shown in the next section, this analytical solution follows from the fact that change of the variables Y i = xi Ai , i = 1, . . . ,

n, eliminates variables xi , i = 1, . . . , n, from the constraints (9). Since r x p(A) = 0 and divA (p(A)H  (x, A)) equals  1 0 P x11 l p(A) C liD1 a i1 @a@i1 p(A) B C B C :: B C; : @  A P l @ x n1 l p(A) C iD1 a i n @a in p(A) formula (7) implies that @F(x)/ @xj } equals x 1 j

Z Axb A0

l p(A) C

l X iD1

@ ai j p(A) @a i j

! dA:

General Formula Further, we give a general formula [9,10] for the differentiation of integral (1). A gradient of the integral is represented as a sum of integrals taken over a volume and over a surface. This formula is useful when system of equations (8) does not have a solution. We split the set of constraints K := = {1, . . . , k} into two subsets K 1 and K 2 .Without loss of generality we suppose that K1 D f1; : : : ; lg;

K2 D fl C 1; : : : ; kg:

The derivative of integral (1) can be represented as the sum of the volume and surface integrals Z rx p(x; y) d y rx F(x) D

(x) Z   C div y p(x; y)H l (x; y) d y

(x)

Z k X



iDl C1 @ i (x)

p(x; y)



r y f i (x; y)

  rx f i (Cx; y) C H l (x; y)r y f i (x; y) dS; 

(10)

where the matrix H l : Rn × Rm ! Rn×m satisfies the equation H l (x; y)r y f 1l (x; y) C rx f 1l (x; y) D 0:

(11)

The last equation can have a lot of solutions and we can choose an arbitrary one, differentiable with respect to the variable y. The general formula contains as a special cases the integral over the surface formula (4) and integral over the volume formula (7). When the set K 1 is empty, the

661

662

D

Derivatives of Probability and Integral Functions: General Theory and Examples

matrix H l is absent and the general formula is reduced to the integral over the surface. Also, when the set K 2 is empty we have integral over the volume formula (7). Except these extreme cases, the general formula provides number of intermediate expressions for the gradient in the form of the sum of an integral over the surface and an integral over the volume. Thus, we have a number of equivalent representations of the gradient corresponding to the various sets K 1 and K 2 and solutions of equation (11). Equation (11) (and equation (8) which is a partial case of equation (11)) can be solved explicitly. Usually, this equation has many solutions. The matrix  rx f 1l (x; y) 



r y> f 1l (x;

y)ry f 1l (x; y)

1

r y> f 1l (x; y)

integral was considered Z F(x) D b(y)x; p(y) d y; where x 2 R1 , y 2 Rm , p: Rm ! R1 ,  > 0, b(y) = y˛i . In this case

y D  (x; z)

  ym and

Z

Z p(y) d y D

 1 (x;  (x; z)) D z:

p(y) d y:

(x)

Let us consider that l = 1, i. e. K 1 = {1} and K 2 = {2, . . . , m + 1}. The gradient r x F(x) equals Z    rx p(y) C div y p(y)H1 (x; y) d y

(x)



which eliminates vector x from the function f (x, y) defining integration set, i. e., function f (x,  (x, z)) does not depend upon the variable x. Denote by  1 (x, y) the inverse function, defined by the equation

iD1

1 b(y)  x B   y1 C C B f (x; y) D B C; :: A @ :

f (x;y)0

(12)

Pm

0

F(x) D

is a solution of equation (11). Also, in many cases there is another way to solve equation (11) using change of variables. Suppose that there is a change of variables

mC1 XZ iD2



@ i (x)

p(y)



r y f i (x; y)

  rx f i (x; y) C H1 (x; y)r y f i (x; y) dS;

(15)

Where the matrix H 1 (x, y) satisfies (11). In view of 0 ˛1 1 y1 B : C r y f 1 (x; y) D ˛ @ :: A ; rx f 1 (x; y) D 1: y˛1 m

Let us show that the following matrix H(x; y) D rx  (x; z)jzD 1 (x;y)

(14)

y i  ; iD1;:::;m

(13)

is a solution of (11). Indeed, the gradient of the function  (x, y(x, z)) with respect to x equals zero, therefore 0 D rx f 1l (x;  (x; z)) D rx  (x; z)r y f 1l (x; y)j yD (x;z) C rx f 1l (x; y)j yD (x;z) ; and function r x  (x, z)|z =  1 (x, y) is a solution of (11). Formula (7) with matrix (13) gives the derivative formulas which can be obtained with change of variables in the integration set [4]. Example 3 While investigating the operational strategies for inspected components (see [7]) the following

a solution H 1 (x, y) of (11) equals   H1 (x; y) D h(y) :D h1 (y1 ); : : : ; h m (y m )  1  1˛ y1 ; : : : ; y1˛ : D m ˛m Let us denote ( i jy) D (y1 ; : : : ; y i1 ; ; y iC1 ; : : : ; y m ); yi D (y1 ; : : : ; y i1 ; y iC1 ; : : : ; y m ); b( i jy) D  ˛ C

m X

y˛j :

jD1 j¤i

We denote by yi   the set of inequalities y j  ;

j D 1; : : : ; i  1;

i C 1; : : : ; m:

(16)

D

Derivatives of Probability and Integral Functions: General Theory and Examples

m Z  1˛ X p( i jy) d yi : C i jy)x; ˛m iD1 b( i

The sets @i (x), i = 2, . . . , m + 1, have a simple structure \ @ i (x) D (x) fy 2 Rm : y i D g ˚ i D y 2 Rm1 : b( i jy)  x; yi  0 :

y



The formula for r x F(x) is valid for an arbitrary sufficiently smooth function p(y).

For i = 2, . . . , m + 1, we have 

r y f i (y)

 j

D 0;

j D 1; : : : ; m;

  r y f i (y) i1 D 1;

j ¤ i  1; (17)



r y f i (y) D 1:

(18)

The function p(y) and the functions f i (y), i = 2, . . . , m + 1, do not depend on x, consequently rx p(y) D 0;

(19)

rx f i (y) D 0;

i D 2; : : : ; m C 1:

(20)

Equations (15)–(20) imply Z   div y p(y)h(y) d y rx F(x) D

(x) mC1 XZ



iD2

Z

@ i (x)

p(y)

h(y)r y f i (y) dS

r y f i (y)

  div y p(y)h(y) d y

D

(x)

C

Z h i1 ()

b(y)x; div y y i  ; iD1;:::;m m 1˛ Z X

C

iD1

p(y) dS @ i (x)

iD2

Z D

mC1 X

 ˛m



 p(y)h(y) d y

b( i jy)x; y i 

p( i jy) d yi :

Since   div y p(y)h(y) D h(y)r y p(y) C p(y) div y h(y) D

m m 1 X @p(y) 1˛ 1  ˛ X ˛ y i C p(y) y ; ˛m iD1 @y i ˛m iD1 i

we, finally, obtain that the gradient r x F(x) equals Z b(y)x; y i  ; iD1;:::;m

  m X y˛ @p(y) i yi C (1  ˛)p(y) d y ˛m @y i iD1

See also  Derivatives of Markov Processes and Their Simulation  Derivatives of Probability Measures  Discrete Stochastic Optimization  Optimization in Operation of Electric and Energy Power Systems References 1. Ermoliev Y, Wets RJ-B (eds) (1988) Numerical techniques for stochastic optimization. Ser Comput Math. Springer, Berlin 2. Kibzun AI, Malyshev VV, Chernov DE (1988) Two approaches to solutions of probabilistic optimization problems. Soviet J Automaton Inform Sci 20(3):20–25 3. Kibzun AI, Tretyakov GL (1996) Onprobability function differentiability. Theory and Control System 2:53–63. (In Russian) 4. Marti K (1996) Differentiation formulas for probability functions: The transformation method. Math Program B 75(2) 5. Pflug GCh (1996) Optimization of stochastic models: the interface between simulation and optimization. Kluwer, Dordrecht 6. Prékopa A (1970) On probabilistic constrained programming, Proc. Princeton Symp. Math. Program. Princeton University Press, Princeton, 113–138 7. Pulkkinen A, Uryasev S (1991) Optimal operational strategies for an inspected component: Solution techniques. Collaborative Paper Internat Inst Appl Systems Anal, Laxenburg, Austria CP-91-13 8. Raik E (1975) The differentiability in theparameter of the probability function and optimization of the probability function via the stochastic pseudogradient method. Eesti NSV Teaduste Akad Toimetised Füüs Mat. 24(1):3–6 (In Russian) 9. Uryasev S (1989) A differentiation formula for integrals over sets given by inclusion. Numer Funct Anal Optim 10(7–8):827–841 10. Uryasev S (1994) Derivatives of probability functions and integrals over sets given by inequalities. J Comput Appl Math 56:197–223 11. Uryasev S (1995) Derivatives of probability functions and some applications. Ann Oper Res 56:287–311

663

664

D

Derivatives of Probability Measures

Derivatives of Probability Measures GEORG PFLUG University Vienna, Vienna, Austria MSC2000: 90C15

The family of densities (g x (v)) is called weakly L1 ()differentiable if there is a vector of L1 () functions r x g x = (g 0x;1 , . . . , g 0x;d )| such that for every bounded measurable function H Z [g xCh (v)  g x (v)  h >  rx g x (v)]H(v) d(v) D o(khk)

Article Outline

(3)

Weak differentiability implies strong differentiability but not vice versa. There is also a notion of differentiability for families (x ), which do not possess densities (see [3]). If the densities (g x ) are differentiable and H(x, v) is boundedly differentiable in x and bounded R and continuous in v, then the gradient of F(x) = H(x, v)g x (v)d(v) is Z rx H(x; v)g x (v) d(v) Z C H(x; v)rx g x (v) d(v):

Keywords Direct Differentiability Inverse Differentiability Simulation of Derivatives Process Derivatives Distributional Derivatives

See also References Keywords Derivatives; Stochastic optimization For stochastic optimization problems of the form Z 8  rx g x (v)ˇ d(v) b D o(khk)

as khk # 0 :

as khk # 0:

(2)

The family (x ) is called process differentiable if there exists a family of random variables V x (!) — the process representation — defined on some probability space (˝, A, P), such that: a) V x () has distribution x for all x; and b) x 7! V x (!) is differentiable a.s. As an example, let x be exponential distributions with densities g x (v) = x exp(x  u). Then V x (!) = (1/x) U for U  Uniform [0, 1] is a process representation in the sense of a) and differentiable in the sense of b) with derivative r x V x (!) = (1/x2 )U. Process differentiability does not imply and isR not u implied by weak differentiability. If Gx (u) = 1 g x (v)dv is the distribution function, then process differentiability is equivalent to the differentiability of x 7! G1 x (u), whereas the weak differentiability is connected to the differentiablity of x 7! Gx (u). If V x () is a process representation of (x ), then the objective function Z F(x) D H(x; v) dx (v) D E[H(x; Vx )] has derivative rx F(x) D E[H x (x; Vx ) C Hv (x; Vx )  rx Vx ]: where H x (x, v) = r x H(x, v) and H v (x, v) = r v H(x, v).

Derivatives of Probability Measures

Simulation of Derivatives If the objective function F in (1) is easily calculated, then the stochastic optimization problem reduces to a standard nonlinear deterministic optimization problem. This is however the exception. In the majority of applications, the objective function value has to be approximated either by a numeric integration technique or a Monte-Carlo (MC) estimate. In the same manner, the gradient r x F(x) may be approximated either by numerical integration or by Monte-Carlo simulation. We discuss here the construction of MC estimates for the gradient r x F(x). For simplicity, we treat only the univariate case x 2 R1 . We begin with recalling the Monte-Carlo (MC) method for estimating F(x). If (V (i) x ) is a sequence of independent identically distributed random variables with distribution function Gx , then the MC estimate

If the family (x ) has differentiable process representation (V x ), then n 1 Xh H x (x; Vx(i) ) n iD1

CHv (x; Vx(i) )  rx Vx(i)

i

(4)

is a MC estimate of r x F(x). The method of using the process derivative (4) is also called perturbation analysis ([1,2]). Distributional Derivatives If the densities g x are differentiable, there are two possibilities to construct estimates. First, ı one may define the score function sx (v) = [r x g x (v)] g x (v) and construct the score function estimate [4]

1

n 1 Xh H x (x; Vx(i) ) n iD1

i CH(x; Vx(i) )s x (Vx(i) ) ;

which is unbiased.

(5)

where g˙x and g¨x are probability densities w.r.t. , and cx is a nonnegative constant. One possibility is to set g˙x resp. g¨x as the appropriately scaled positive, resp. negative, part of r x g x , but other representations are possible as well. Let now V˙x(i) , resp. V¨x(i) , be random variables with distributions g˙x d, resp. g¨x d. The difference estimate is

1

rx Fn (x) D

1 Xn H x (x; Vx(i) )  n iD1 n

o Cc x [H(x; V˙x(i) )  H(x; V¨x(i) ] ;

g x (y) D x  exp(x y):

Process Derivatives

rx Fn (x) D

rx g x (v) D c x [ g˙x (v)  g¨x (v)];

Example 2 Assume again that (x ) are exponential distributions with expectation x. The probability x has density

is an unbiased estimate of F(x).

1

Alternatively, one may write the function r x g x (v) in the form

which is unbiased (see [3]).

n

1X b H(x; Vx(i) ) F n (x) D n iD1

rx Fn (x) D

D

Let V x be distributed according to x . For simplicity, assume that the cost function H does not depend explicitly on x. We need estimates for r x E(H(V x )). The three methods are: 1) Score derivative: The score function is 1 rx g x (v) D v g x (v) x and the score function estimate is   1 rx F (1) D H Vx )(  Vx : x

b

2) Difference derivative: There are several representations in the sense of (5). One could use the decomposition of r x g x () into its positive and negative part (Jordan–Hahn decomposition) and get the estimate

b

rx F (2a) D

1 (H(V˙x )  H(V¨x )); x

where V˙x has density x e(1  xv)e xv  1v 1 x

665

666

D

Design Optimization in Computational Fluid Dynamics

and V¨x has density

 Derivatives of Probability and Integral Functions: General Theory and Examples  Discrete Stochastic Optimization  Optimization in Operation of Electric and Energy Power Systems

x e(xv  1)e xv  1v> 1 x

and both are independent. Another possibility is to set 1 V˙x D  log U1 ; x 1 V¨x D  (log U1 C log U2 ); x where U 1 , U 2 are independent Uniform [0, 1] variates. The final difference estimate is

b

rx F (2b) D

1 (H(V˙x )  H(V¨x )): x

3) Process derivative: A process representation of (x ) is 1 Vx D  log(1  U); x

U  Uniform[0; 1]:

A process derivative of H(V x ) is

b

1 rx F (3) x :D H x (Vx )( Vx ): x Notice that in methods 1) and 2) the function H need not to be differentiable and may be an indicator function – as is required in some applications. In method 3), the function H must be differentiable.

2 Whenever a MC estimate r F(x) has been defined, it x

can be used in a stochastic quasigradient method (SQG; cf. also  Stochastic quasigradient methods) for optimization

1

X sC1 D prS [X s  s rx Fn (X s )] where prS is the projection on the set S and (s ) are the stepsizes. The important feature of such algorithms is the fact that they work with stochastic estimates. In particular, the sample size n per step can be set to 1 and still convergence holds under regularity assumptions. To put it differently, the SQG allows to approach quickly a neighborhood of the solution even with much noise corrupted estimates. See also  Derivatives of Markov Processes and Their Simulation

References 1. Glasserman P (1991) Gradient estimation via perturbation analysis. Kluwer, Dordrecht 2. Ho YC, Cao X (1983) Perturbation analysis and optimization of queueing networks. J Optim Th Appl 20:559–589 3. Pflug GC (1996) Optimization of stochastic models. Kluwer, Dordrecht 4. Rubinstein RY, Shapiro A (1993) Discrete event systems: Sensitivity analysis and stochastic optimization by the score function method. Wiley, New York

Design Optimization in Computational Fluid Dynamics DOYLE KNIGHT Department Mechanical and Aerospace Engineering, Rutgers University, New Brunswick, USA MSC2000: 90C90 Article Outline Keywords Synonyms Focus Framework Levels of Simulation The Stages of Design Emergence of Automated Design Optimization Using CFD Problem Definition Algorithms for Optimization Gradient Optimizers Stochastic Optimizers

Examples Sequential Quadratic Programming Variational Sensitivity Response Surface Simulated Annealing Genetic Algorithms

Conclusion See also References

Design Optimization in Computational Fluid Dynamics

D

Keywords Search

Optimization; Computational fluid dynamics

Gradient Optimizer Synonyms

Stochastic Optimizer

Design Optimization in CFD Focus The article focuses on design optimization using computational fluid dynamics (CFD). Design implies the creation of an engineering prototype (e. g., a pump) or engineering process (e. g., particle separator). Optimization indicates the selection of a ‘best’ design. Computational fluid dynamics (CFD) represents a family of models of fluid motion implemented on a digital computer. In recent years, efforts have focused on merging elements of these three disciplines to improve design effectiveness and efficiency. Framework Consider the design of a prototype or process with n design variables {xi : i = 1, . . . , n} denoted by x. It is assumed that n is finite, although infinite-dimensional design spaces also exist (e. g., the shape of a civilian transport aircraft). The domain of x constitutes the design space. A scalar objective function f (x) is assumed to be defined for some (or possibly all) points in the design space. This is the simplest design optimization problem. Oftentimes, however, the optimization cannot be easily cast into this form, and other methods (e. g., Pareto optimality) are employed. The purpose of the design optimization is to find the design point x which minimizes f . Note that there is no loss of generality in assuming the objective is to minimize f , since the maximization of an objective function e f (x) is equivalent to the minimization of f D e f. The design optimization is typically an iterative process involving two principal elements. The first element is the simulation which evaluates the objective function by (in the case of computational fluid dynamics) a fluid flow code (flow solver). The second element is the search which determines the direction for traversing the design space. The search engine is the optimizer of which they are several different types as described later. The design optimization process is an iterative procedure involving repetitive simulation and search steps until

6 ? Simulate Generate grid Solve flowfield Compute objective function Design Optimization in Computational Fluid Dynamics, Figure 1 Elements of design optimization

a predefined convergence criteria is met. This is illustrated in Fig. 1.

Levels of Simulation There are five levels of complexity for CFD simulation Fig. 2. Empirical methods represent correlations of experimental data and possibly simple one-dimensional analytical models. An example is the NIDA code [15] employed for analysis of two-dimensional and axisymmetric inlets. The code is restricted to a limited family of geometries and flow conditions (e. g., no sideslip). Codes based on the linear potential equations (e. g., PANAIR [6]; see also [17]) and nonlinear potential equations (e. g., [8]; see also [7]) incorporate increased geometric flexibility while implementing a simplified model of the flow physics (i. e., it is assumed that the shock waves are weak and there is no significant flow separation). Codes employing the Euler equations (e. g., [22]) allow for strong shocks and vorticity although neglect viscous effects. Reynolds-averaged Navier–Stokes codes (RANS codes) (e. g., GASP [31]) employ a model for the effects of turbulence. The range of execution time between the lowest and highest levels is roughly three orders of magnitude, e. g., on a conventional workstation the NIDA code requires only a few seconds execution time while a 2-dimensional RANS simulation would typically require a few hours.

667

D

Design Optimization in Computational Fluid Dynamics

Reynolds-averaged Navier-Stokes (1990s)

Nonlinear potential equation (1970s)

Decreased CPU time

Euler equation (1980s) More acurate simulation

668

Linear potential equation (1960s)

Empirical correlations (< 1960)

Design Optimization in Computational Fluid Dynamics, Figure 2 Levels of CFD simulation

The Stages of Design There are typically three stages of design: conceptual, preliminary and detailed. As the names suggest, the design specification becomes more precise at successive design stages. Thus, for example, a conceptual design of a civilian transport aircraft may consider a (discrete) design space with the possibility of two, three or four engines, while the preliminary design space assumes a fixed number of engines and considers the details of the engine (e. g., nacelle shapes). It is important to note that the CFD algorithms employed in each of these three stages are likely to be different. Typically, the conceptual design stage employs empirical formulae, while the preliminary design stage may also include simplified CFD codes (e. g., linearized and nonlinear potential methods, and Euler codes), and the detailed design stage may utilize full Reynolds-averaged Navier–Stokes methods. Additionally, experiment is oftentimes essential to verify key features of the design.

an automated design optimization process. This opportunity has arisen for five reasons. First, the continued rapid improvements in computer performance (e. g., doubling of microprocessor performance every 18 to 24 months [3]) enable routine numerical simulations of increasing sophistication and complexity. Second, improve- ments in the accuracy, efficiency and robustness of CFD algorithms (see, for example, [18]) likewise contribute to the capability for simulation of more complex flows. Third, the development of more accurate turbulence models provides increased confidence in the quality of the flow simulations [16]. Fourth, the development of efficient and robust optimizers enable automated search of design spaces [33]. Finally, the development of sophisticated shell languages (e. g., Perl [43]) provide effective control of pathological events which may occur in an automated design cycle using CFD (e. g., square root of a negative number, failure to converge within a predetermined number of iterations, etc.). Problem Definition The general scalar nonlinear optimization problem (also known as the nonlinear programming problem) is [11,33,52] minimize f (x);

(1)

where f (x) is the scalar objective function and x is the vector of design variables. Typically there are limits on the allowable values of x: a  x  b;

(2)

Emergence of Automated Design Optimization Using CFD

and m additional linear and/or nonlinear constraints ( c i (x) D 0; i D 1; : : : ; m0 ; (3) c i (x)  0; i D m0 C 1; : : : ; m:

Although the first numerical simulation of viscous fluid flow was published in 1933 by A. Thom [51], CFD as a discipline emerged with the development of digital mainframe computers in the 1960s. With the principal exception of the work on inverse design methods for airfoils (see, for example, the review [30] and [48]), CFD has mainly been employed in design analysis as a cost-effective replacement for some types of experiments. However, CFD can now be employed as part of

If f and ci are linear functions, then the optimization problem is denoted the linear programming problem, while if f is quadratic and the ci are linear, then the optimization problem is denoted the quadratic programming problem. An example of a nonlinear optimization problem using CFD is the design of the shape of an inlet for a supersonic missile. The geometry model of an axisymmetric inlet [53] is shown in Fig. 3.

Design Optimization in Computational Fluid Dynamics

D

Design Optimization in Computational Fluid Dynamics, Figure 3 Geometry of high speed inlet

The eight design variables are listed below. Item Definition

1

initial cone angle

2

final cone angle

xd

x-coordinate of throat

rd

r-coordinate of throat

xe

x-coordinate of end of ‘constant’ cross section

3

internal cowl lip angle

Hej

height at end of ‘constant’ cross section

Hfk

height at beginning of ‘constant’ cross section

where g i = @f / @xi , jg i j is the norm of the vector g i , and H = H ij is the Hessian matrix 0 B HDB @

@2 f @x 12



@2 f @x 1 @x n



:: :

@2 f @x 1 @x n

:: : @2 f @x n @x n

1 C C: A

The matrix H is positive definite if all of the eigenvalues of H are positive. Algorithms for Optimization

There are no general methods for guaranteeing that the global minimum of an arbitrary objective function f (x) can be found in a finite number of steps [4,11]. Typically, methods focus on determining a local minimum with additional (often heuristic) techniques to avoid convergence to a local minimum which is not the global minimum. A point x is a (strong) local minimum [11] if there is a region surrounding x wherein the objective function is defined and f (x) > f(x ) for x 6D x . Provided f (x) is twice continuously differentiable (this is not always true; see, for example, [53]), necessary and sufficient conditions for the existence of a solution to (1) subject to (3) may be obtained [11]. In the one-dimensional case with no constraints the sufficient conditions for a minimum at x are g D 0 and H > 0 at x D x  ; where g = df /dx and H = d2 f / dx2 . For the multidimensional case with no constraints jg i j D 0

and

H is positive definite at x D x  ; (4)

The efficacy of an optimization algorithm depends strongly on the nature of the design space. In engineering problems, the design space can manifest pathological characteristics. The objective function f may pos sess multiple local optima [36] arising from physical and/or numerical reasons. Examples of the latter include noise introduced in the objective function by grid refinement between successive flow simulations, and incomplete convergence of the flow simulator. Also, the objective function f and/or its gradient g i may exhibit near discontinuities for physical reasons. For example, a small change in the the design state x of a mixed compression supersonic inlet operating at critical conditions can cause the terminal shock to be expelled, leading to a rapid decrease in total pressure recovery [44]. Moreover, the objective function f may not be evaluable at certain points. This may be due to constraints in the flow simulator such as a limited range of applicability for empirical data tables. A brief description of some different classes of general optimizers is presented. These methods are described for the unconstrained optimization problem for reasons of brevity. See [33] for an overview of opti-

669

670

D

Design Optimization in Computational Fluid Dynamics

mization algorithms and software packages, and [11] for a comprehensive discussion of the constrained optimization problem. Detailed mathematical exposition of optimization problems is presented in [19]. Gradient Optimizers If the objective function f can be approximated in the vicinity of a pointe x by a quadratic form, then 1 e i j (x j  e f e f Ce g i (x i  e x i ) C (x i  e x i )H x j ); (5) 2 e i j imply evaluation at e where e f,e g and H x and the Einstein summation convention is implied. In the relatively simple method of steepest descent [40], the quadratic term in (5) is ignored, and a line minimization is performed along the direction of  g i , i. e., a sequence of values of the design variable x =() ,  = 1, . . . , are formed according to x C ıx() x() D e where () g i je ıx () g i j1 i D  e

and () ,  = 1, . . . , are an increasing sequence of displacements. The estimated decrease in the objective function f is () je g i j. The objective function f is evaluated at each iteration  and the search is terminated when f begins to increase. At this location, the gradient g i is computed and the procedure is repeated. This method, albeit straightforward to implement, is inefficient for design spaces which are characterized by long, narrow ‘valleys’ [40]. The conjugate gradient methods [40] are more efficient than the method of steepest descent, since they perform a sequence of line minimizations along specific directions in the design space which are mutually orthogonal in the context of the objective function. Consider a line minimization of f along a direction u = {ui : i = 1, . . . , n}. At any point on the line, the gradient of f in g i by definition. At the minimum the direction of u is u ie pointe x in the line search, gi D 0 u ie by definition. Consider a second line minimization of f along a direction v. From (5) and noting that H ij is symei jv j. metric, the change in g i along the direction v is H

Thus, the condition that the second line minimization also remain a minimization along the first direction u is ei jv j D 0 ui H When this condition is satisfied, u and v are denoted conjugate pairs. Conjugate gradient methods (CGM) generate a sequence of directions u, v, . . . which are mutually conjugate. If f is exactly quadratic, then CGM yield an n-step sequence to the minimum. Sequential quadratic programming methods employ the Hessian H which may be computed directly when economical or may be approximated from the sequence of gradients g i generated during the line search (the quasi-Hessian [33]). Given the gradient and Hessian, the location xi of the minimum value of f may be found from (5) as e i j (x j  e x j ) D e gi: H For the general case where f is not precisely quadratic, a line minimization is typically performed in the direcx i ), and the process is repeated. tion (x i  e Variational sensitivity employs the concept of direct differentiation of the optimization function f and governing fluid dynamic equations (in continuous or discrete form) to obtain the gradient g i , and optimization using a gradient-based method. It is related to the theory of the control of systems governed by partial differential equations [29,39]. For example, the boundary shape (e. g., airfoil surface) is viewed as the (theoretically infinite-dimensional) design space which controls the objective function f . Several different formulations have been developed depending on the stage at which the numerical discretization is performed, and the use of direct or adjoint (costate) equations. Detailed descriptions are provided in [23] and [24]. Additional references include [2,5,20,21,37,38,50]. The following summary follows the presentation in [24] which employs the adjoint formulation. The objective function f is considered to be a function of the flowfield variables w and the physical shape S. The differential change in the objective function is therefore @f @f ıw C ıS: (6) @w @S The discretized governing equations of the fluid motion are represented by the vector of equations ıf D

R(w; S) D 0

Design Optimization in Computational Fluid Dynamics

and therefore @R @R ıR D ıw C ıS D 0; (7) @w @S where ıR is a vector. Assume a vector Lagrange multiplier and combining (6) and (7)   @f @R @R @f  > ıw C  > ıS; ıf D @w @w @S @S where | indicates vector transpose. If isfy the adjoint (costate) equation > @R

@w

D

@f ; @w

is chosen to sat-

(8)

then ı f D GıS; where @R @f  > : @S @S This yields a straightforward method for optimization using, for example, the method of steepest descent. The increment in the shape is GD

ıS D G; where  is a positive scalar. The variational sensitivity approach is particularly advantageous when the dimension n of the design space (which defines S) is large, since the gradient of S is obtained from a single flowfield solution (7) plus a single adjoint solution (8) which is comparable to the flowfield solution in cost. Constraints can be implemented by projecting the gradient onto an allowable subspace in which the constraints are satisfied. Response surface methods employ an approximate representation of the objective function using smooth functions which are typically quadratic polynomials [25]. For example, the objective function may be approximated by X X ˇi x i C i j x i x j f b f D˛C 1in

1i jn

where ˛, ˇ i and  ij are coefficients which are determined by fitting b f to a discrete set of data using the method of least squares. The minimum of b f can then be found by any of the gradient optimizers, with optional recalibration of the coefficients of b f as needed. There are many different implementations of the response surface method (see, for example, [12,34] and [46]).

D

Stochastic Optimizers Often the objective function is not well behaved in a portion or all of the design space as discussed above. In such situations, gradient methods can stop without achieving the global optimum (e. g., at an infeasible point, or a local minimum). Stochastic optimizers seek to avoid these problems by incorporating a measure of randomness in the optimization process, albeit oftentimes at a cost of a significant increase in the number of evaluations of the objective function f . Simulated annealing [26,27,32] mimics the process of crystalization of liquids or annealing of metals by minimizing a function E which is analogous to the energy of a thermodynamic system. Consider a current point (state) in the design spacee x and its associated ‘energy’ e E. A candidate for the next state x is selected by randomly perturbing typically one of the components e x and its energy E is evaluated (typix j , 1  j  n, of, e cally, each component of x is perturbed in sequence). If E thene x D x , i. e., the next state is x . If E  > e E E < e then the probability of selecting x as the next design state is   e E E ; p D exp  kT where k is the ‘Boltzman constant’ (by analogy to statistical mechanics) and T is the ‘temperature’ which is successively reduced during the optimization according to an assumed schedule [27]. (Of course, only the value of the product kT is important.) The stochastic nature can be implemented by simply calling a random number generator to obtain a value r between zero and one. Then the state x is selected if r < p. Therefore, during the sequence of design states, the algorithm permits the selection of a design state with E > e E, but the probability of selecting such a state decreases with increasing E e E. This feature tends to enable (but does not guarantee) the optimizer to ‘jump out’ of a local minimum. Genetic algorithms (GAs) mimic the process of biological evolution by means of random changes (mutations) in a set of designs denoted the population [14]. At each step, the ‘least fit’ member(s) of the population (i. e., those designs with the highest value of f ) are typically removed, and new members are generated by a recombination of some (or all) of the remaining members. There are numerous GA variants. In the approach of [41], an initial population P of designs is generated

671

672

D

Design Optimization in Computational Fluid Dynamics

Design Optimization in Computational Fluid Dynamics, Figure 4 P2 and P8 inlets

by randomly selecting points xi , i = 1, . . . , p, satisfying (2). The two best designs (i. e., with the lowest values of f ) are joined by a straight line in the design space. A random point x0 is chosen on the line connecting the two best designs. A mutation is performed by randomly selecting a point xp+1 within a specified distance of x0 . This new point is added to the population. A member of the population is then removed according to a heuristic criterion, e. g., among the k members with the highest f , remove the member closest to xp+1 , thus maintaining a constant number of designs in the population. The removal of the closest member tends to prevent clustering of the population (i. e., maintains diversity). The process is repeated until convergence.

Examples Examples of the above algorithms for optimization using CFD are presented. All of the examples are single discipline involving CFD only. It is emphasized that

multidisciplinary optimization (MDO) involving computational fluid dynamics, structural dynamics, electromagnetics, materials and other disciplines is a very active and growing field, and many of the optimization algorithms described herein are appropriate to MDO also. A recent review is presented in [49].

Sequential Quadratic Programming V. Shukla et al. applied a sequential quadratic programming algorithm CFSQP [28] to the optimal design of two hypersonic inlets (denoted P2 and P8) at Mach 7.4. The geometric model is shown in Fig. 4. The optimization criteria was the minimization of the strength of the shock wave which reflected from the centerbody (lower) surface. This is the same criteria as originally posed in the design of the P2 and P8 inlets [13]. The NPARC flow solver [47] was employed for the P2 optimization, and the GASP flow solver [31] for the P8 optimization.

Design Optimization in Computational Fluid Dynamics

Design Optimization in Computational Fluid Dynamics, Figure 5 Static pressure contours for optimal P8 inlet (the original centerbody contour is shown by the dotted line)

D

Design Optimization in Computational Fluid Dynamics, Figure 6 Initial shape of wing

The optimization criteria was met for both inlets. In Fig. 5, the static pressure contours for the optimized P8 inlet are shown. The strength of the reflected shock is negligible. Variational Sensitivity A. Jameson et al. [24] applied the methodology of variational sensitivity (control theory) to the optimization of a three-dimensional wing section for a subsonic widebody commercial transport. The design objective was to minimize the drag at a given lift coefficient CL = 0.55 at Mach 0.83 while maintaining a fixed planform. A two stage procedure was implemented. The first stage employed the Euler equations, while the second stage used the full Reynolds-averaged Navier–Stokes equations. In the second stage, the pressure distribution obtained from the Euler optimization is used as the target pressure distribution. The initial starboard wing shape is shown in Fig. 6 as a sequence of sections in the spanwise direction. The initial pressure distribution on the upper surface, shown as the pressure coefficient cp plotted with negative values upward, is presented in Fig. 7. A moderately strong shock wave is evident, as indicated by the sharp drop in cp at roughly the mid-chord line. After sixty design cycles of the first stage, the drag coefficient was reduced by 15 counts from 0.0196 to 0.0181, and the shock wave eliminated as indicated in the cp distribution in Fig. 8. A subsequent second stage optimization

Design Optimization in Computational Fluid Dynamics, Figure 7 Initial surface pressure distribution

using the Reynolds–averaged Navier–Stokes equations yielded only slight modifications. Response Surface R. Narducci et al. [35] applied a response surface method to the optimal design of a two-dimensional

673

674

D

Design Optimization in Computational Fluid Dynamics

Design Optimization in Computational Fluid Dynamics, Figure 8 Optimized surface pressure distribution

transonic airfoil. The design objective was to maximize the lift coefficient CL at Mach 0.75 and zero degrees angle of attack, while satisfying the constraints that the drag coefficient CD  0.01 and the thickness ratio 0.075  t  0.15 where t is the ratio of the maximum airfoil thickness to the airfoil chord. The airfoil surface was represented by a weighted sum of six different shapes which included four known airfoils (a different set of basis functions were employed in [9] for airfoil optimization using a conjugate gradient method). The objective function f was represented by a quadratic polynomial. An inviscid flow solver was employed. A successful optimization was achieved in five response surface cycles. The history of the convergence of CL and CD is shown in Fig. 9. A total of twenty three flow solutions were required for each response surface. Simulated Annealing S. Aly et al. [1] applied a modified simulated annealing algorithm to the optimal design of an axisymmetric forebody in supersonic flow. The design objective was to minimize the pressure drag on the forebody of a vehicle at Mach 2.4 and zero angle of attack, subject to constraints on the allowable range of the body radius as

Design Optimization in Computational Fluid Dynamics, Figure 9 Convergence history for transonic airfoil

a function of axial position. Two different variants of SA were employed, and compared to a gradient optimizer NPSOL [10] which is based on a sequential quadratic programming algorithm. All optimizers employed the same initial design which satisfied the constraints but was otherwise a clearly nonoptimal shape. Optimizations were performed for two different initial shapes. The flow solver was a hybrid finite volume implicit Euler marching method [45]. The first method, denoted simulated annealing with iterative improvement (SAWI), employed SA for the initial phase of the optimization, and then switched to a random search iterative improvement method when close to the optimum. This method achieved from 8% to 31% reduction in the pressure drag, compared to optimal solution obtained NPSOL alone, while requiring fewer number of flowfield simulations (which constitute the principal computational cost). The second method employed SA for the initial phase of the optimization, followed by NPSOL. This approach achieved from 31% to 39% reduction in the pressure drag, compared to the optimal solution obtained by NPSOL alone, while requiring comparable (or less) cputime. The forebody shapes obtained using SA, SA with NPSOL and NPSOL alone are shown in Fig. 10.

Design Optimization in Computational Fluid Dynamics

Design Optimization in Computational Fluid Dynamics, Figure 10 Forebody shapes obtained using SA, SA with NPSOL and NPSOL. Copyright 1996 AIAA - Reprinted with permission

D

Design Optimization in Computational Fluid Dynamics, Figure 11 Total pressure recovery coefficient versus axial location of throat

Genetic Algorithms G. Zha et al. applied a modified genetic algorithm (GADO [42]) to the optimal design of an axisymmetric supersonic mixed compression inlet at Mach 4 and 60 kft altitude cruise conditions (see above). The geometric model included eight degrees of freedom (see above), and the optimization criteria was maximization of the inlet total pressure recovery coefficient. The constraints included the requirement for the inlet to start at Mach 2.6, plus additional constraints on the inlet geometry including a minimum cowl thickness and leading edge angle. The constraints were incorporated into the GA using a penalty function. The flow solver was the empirical inlet analysis code NIDA [15]. This code is very efficient, requiring only a few seconds cputime on a workstation, but is limited to 2-dimensional or axisymmetric geometries. Moreover, the design space generated by NIDA (i. e., the total pressure recovery coefficient as a function of the eight degrees of freedom) is nonsmooth with numerous local minima and gaps attributable to the use of empirical data Fig. 11. The GA achieved a 32% improvement in total pressure recovery coefficient compared to a trial-and-error method [53]. A total of 50 hours on a DEC-2100 workstation was employed. A series of designs generated during the optimization were selected for evaluation by a full Reynolds-averaged Navier–Stokes code (GASP [31]). A close correlation was observed between the predictions of NIDA and GASP Fig. 12.

Design Optimization in Computational Fluid Dynamics, Figure 12 Total pressure recovery coefficient from NIDA and GASP for several different inlet designs

Conclusion Computational fluid dynamics has emerged as a vital tool in design optimization. The five levels of CFD analysis are utilized in various optimization methodologies. Complex design optimizations have become commonplace. A significant effort is focused on multidisciplinary optimization involving fluid dynamics, solid mechanics, materials and other disciplines.

675

676

D

Design Optimization in Computational Fluid Dynamics

See also  Bilevel Programming: Applications in Engineering  Interval Analysis: Application to Chemical Engineering Design Problems  Multidisciplinary Design Optimization  Multilevel Methods for Optimal Design  Optimal Design of Composite Structures  Optimal Design in Nonlinear Optics  Structural Optimization: History

References 1. Aly S, Ogot M, Pelz R (Sept.–Oct.1996) Stochastic approach to optimal aerodynamic shape design. J Aircraft 33(5):945– 961 2. Anderson W, Venkatakrishnan V (1997) Aerodynamic design optimization on unstructured grids with a continuous adjoint formulation. AIAA Paper 97–0643, (Amer. Inst. Aeronautics and Astronautics, Reston, VA) 3. Berkowitz B (1996) Information age intelligence. Foreign Policy, 103:35–50 4. Boender C, Romeijn H (1995) Stochastic methods. In: Handbook of Global Optimization. Kluwer, Dordrecht, pp 829– 869 5. Cabuk H, Modi V (1992) Optimal plane diffusers in laminar flow. J Fluid Mechanics 237:373–393 6. Carmichael R, Erickson L (1981) PAN AIR – A higher order panel method for predicting subsonic or supersonic linear potential flows about arbitrary configurations. AIAA Paper 81–1255, (Amer. Inst. Aeronautics and Astronautics, Reston, VA) 7. Caughey D (1982) The computation of transonic potential flows. In: Annual Rev. Fluid Mechanics, 14, pp 261–283 8. Caughey D, Jameson A (Feb. 1979) Numerical calculation of transonic potential flow about wing–body combinations. AIAA J 17(2):175–181 9. Eyi S, Hager J, Lee K (Dec. 1994) Airfoil design optimization using the Navier–Stokes equations. J Optim Th Appl 83(3):447–461 10. Gill P, Murray W, Saunders M, Wright M (1986) User’s guide for NPSOL: A FORTRAN package for nonlinear programming. SOL Techn Report Dept Oper Res Stanford Univ 86(2) 11. Gill P, Murray W, Wright M (1981) Practical optimization. Acad. Press, New York 12. Giunta A, Balabanov V, Haim D, Grossman B, Mason W, Watson L (1996) Wing design for a high speed civil transport using a design of experiments methodology. AIAA Paper 96–4001-CP, (Amer. Inst. Aeronautics and Astronautics, Reston, VA) 13. Gnos A, Watson E, Seebaugh W, Sanator R, DeCarlo J (Apr. 1973) Investigation of flow fields within large–scale hypersonic inlet models. Techn Note NASA D–7150

14. Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA 15. Haas M, Elmquist R, Sobel D (Apr. 1992) NAWC inlet design and analysis (NIDA) code. UTRC Report, R92–970037– 1, (United Technologies Res. Center) 16. Haase W, Chaput E, Elsholz E, Leschziner M, Müller U (eds) (1997) ECARP – European computational aerodynamics research project: Validation of CFD codes and assessment of turbulence models. Notes on Numerical Fluid Mechanics. Vieweg, Braunschweig/Wiesbaden 17. Hess J (1990) Panel methods in computational fluid dynamics. In: Annual Rev. Fluid Mechanics, 22, pp 255– 274 18. Hirsch C (1988) Numerical computation of internal and external flows, vol I–II. Wiley, New York 19. Horst R, Pardalos PM (eds) (1995) Handbook of global optimization. Kluwer, Dordrecht 20. Ibrahim A, Baysal O (1994) Design optimization using variational methods and CFD. AIAA Paper 94–0093, (Amer. Inst. Aeronautics and Astronautics, Reston, VA) 21. Iollo A, Salas M (1995) Contribution to the optimal shape design of two–dimensional internal flows with embedded shocks. ICASE Report 95–20, (NASA Langley Res. Center, Hampton, VA) 22. Jameson A (1982) Steady–state solution of the Euler equations for transonic flow. In: Transonic, Shock and Multidimensional Flows. Acad. Press, New York, pp 37–70 23. Jameson A (1988) Aerodynamic design via control theory. J Sci Comput 3:33–260 24. Jameson A, Pierce N, Martinelli L (1997) Optimum aerodynamic design using the Navier–Stokes equations. AIAA Paper 97–0101, (Amer. Inst. Aeronautics and Astronautics, Reston, VA) 25. Khuri A, Cornell J (1987) Response surfaces: Designs and analyses. M. Dekker, New York 26. Kirkpatrick S, Gelatt C, Vecchi M (1983) Optimization by simulated annealing. Science 220(4598):671–680 27. van Laarhoven P, Aarts E (1987) Simulated annealing: Theory and Acad. Pressplications. Reidel, London 28. Lawrence AHGC, Zhou J, Tits A (Nov. 1994) User’s guide for CFSQP version 2.3: A C code for solving (large scale) constrained nonlinear (minimax) optimization problems, generating iterates satisfying all inequality constraints. Techn Report Inst Systems Res Univ Maryland 94–16r1 29. Lions JL (1971) Optimal control of systems governed by partial differential equations. Springer, Berlin (translated from the French) 30. Lores M, Hinson B (1982) Transonic design using computational aerodynamics. In: Progress in Astronautics and Aeronautics, 81. Am Inst Aeronautics and Astronautics, Reston, VA, pp 377–402 31. McGrory W, Slack D, Pressplebaum M, Walters R (1993) GASP version 2.2: The general aerodynamic simulation program. Aerosoft, Blacksburg, VA

Design of Robust Model-Based Controllers via Parametric Programming

32. Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equations of state calculations by fast computing machines. J Chem Phys 21:1087–1092 33. Moré J, Wright S (1993) Optimization software guide. SIAM, Philadelphia 34. Myers R, Montgomery D (1995) Response surface methodology: Process and product optimization using design experiments. Wiley, New York 35. Narducci R, Grossman B, Valorani M, Dadone A, Haftka R (1995) Optimization methods for non–smooth or noisy objective functions in fluid dynamic design problems. AIAA Paper 95–1648–CP. Am Inst Aeronautics and Astronautics, Reston, VA 36. Obayashi S, Tsukahara T (1996) Comparison of optimization algorithms for aerodynamic shape design. AIAA Paper 96–2394–CP. Am Inst Aeronautics and Astronautics, Reston, VA 37. Pironneau O (1973) On optimal profiles in Stokes flow. J Fluid Mechanics 59(1):117–128 38. Pironneau O (1974) On optimal design in fluid mechanics. J Fluid Mechanics 64(1):97–110 39. Pironneau O (1984) Optimal shape design for elliptic systems. Springer, Berlin 40. Press W, Flannery B, Teukolsky S, Vetterling W (1986) Numerical recipes. Cambridge Univ. Press, Cambridge 41. Rasheed K, Gelsey A (1996) Adaption of genetic algorithms for continuous design space search. In: Fourth Internat. Conf. Artificial Intelligence in Design: Evolutionary Systems in Design Workshop, 42. Rasheed K, Hirsh H, Gelsey A (1997) A genetic algorithm for continuous design space search. Artif Intell in Eng 11(3):295–305 43. Schwartz R (1993) Learning Perl. O’Reilly, Sebastopol, CA 44. Seddon J, Goldsmith E (eds) (1985) Intake aerodynamics. AIAA Education Ser Amer. Inst. Aeronautics and Astronautics, Reston, VA 45. Siclari M, Del Guidice P (Jan 1990) Hybrid finite volume Approach to Euler solutions for supersonic flows. AIAA J 28(1):66–74 46. Simpson T, Peplinski J, Koch P, Allen J (1997) On the use of statistics in design and the implications for deterministic computer experiments. ASME Paper DETC 97/DTM–3881. Am Soc Mech Engin, New York 47. Sirbaugh J, Smith C, Towne C, Cooper G, Jones R, Power G (Nov. 1994) A users guide to NPARC version 2.0. NASA Lewis Res. Center and Arnold Engin. Developm. Center, Cleveland, OH/Arnold, TN 48. Sobieczky H, Seebass A (1984) Supercritical airfoil and wing design. In: Annual Rev. Fluid Mechanics, 16, pp 337–363 49. Sobieszczanski–Sobieski J, Haftka R (1996) Multidisciplinary aerospace design optimization: Survey of recent developments. AIAA Paper 96–0711. Am Inst Aeronautics and Astronautics, Reston, VA

D

50. Ta’asan S, Kuruvilla K, Salas M (1992) Aerodyamic design and optimization in one shot. AIAA Paper 92–0025. Am Inst Aeronautics and Astronautics, Reston, VA 51. Thom A (1933) The flow past circular cylinders at low speeds. Proc Royal Soc London A141:651–666 52. Vanderplaats G (1984) Numerical optimization techniques for engineering design: With Applications. McGraw-Hill, New York 53. Zha G, Smith D, Schwabacher M, Rasheed K, Gelsey A, Knight D (Nov.–Dec. 1997) High–performance supersonic missile inlet design using automated optimization. J Aircraft 34(6):697–705

Design of Robust Model-Based Controllers via Parametric Programming K.I. KOURAMAS1 , V. SAKIZLIS2 , EFSTRATIOS N. PISTIKOPOULOS1 1 Centre for Process Systems Engineering, Imperial College London, London, UK 2 Bechtel Co. Ltd., London, UK Article Outline Introduction/Background Definitions Formulation Open-Loop Robust Parametric Model Predictive Controller Closed-Loop Robust Parametric Model-Based Control

Methods/Applications Parametric Solution of the Inner Maximization Problem of the Open-Loop Robust pMPC Problem Solution of the Closed-Loop RpMPC Problem

Cases Robust Counterpart (RC) Problem Interval Robust Counterpart Problem

Conclusions References Introduction/Background Model predictive control (MPC) is very popular for its capacity to deal with multivariable, constraints-modelbased control problems for a variety of complex linear or non-linear processes [13]. MPC is based on the receding-time-horizon philosophy where an openloop, constrained optimal control problem is solved online at each sampling time to obtain the optimal control actions. The optimal control problem is solved repet-

677

678

D

Design of Robust Model-Based Controllers via Parametric Programming

itively at each time when a new measurement or estimate of the state is available, thus establishing an implicit feedback control method [14,15]. The main reasons for the popularity of MPC are its optimal performance, its capability to handle constraints and its inherent robustness due to feedback control properties. Despite the widely acknowledged capabilities of MPC, there are two main shortcomings that have been a major concern for the industrial and academic communities. The first shortcoming is that MPC implementation is limited to slowly varying processes due to the demanding online computational effort for solving the online optimal control problem. The second is that, despite its inherent robustness due to the implicit feedback, MPC cannot guarantee the satisfaction of constraints and optimal performance in the presence of uncertainties and input disturbances, since usually it relies on nominal models (uncertainty-free models) for the prediction of future states and control actions [14,20,22]. The first shortcoming of MPC can be overcome by employing the so-called parametric MPC (pMPC) or multiparametric MPC (mp-MPC) [4,16,20]. Parametric MPC controllers are based on the well-known parametric optimization techniques [9,18] for solving the open-loop optimal control problem offline and obtain the complete map of the optimal control actions as functions of the states. Thus, a feedback control law is obtained offline and the online computational effort is reduced to simple function evaluations of the feedback control. The inevitable presence of uncertainties and disturbances have been ignored by the pMPC community, and only recently has the research started focusing on control problems with uncertainty [2,20]. In traditional MPC the issue of robustness under uncertainty has been dealt with using various methods such as robust model predictive control [3,8], model predictive tubes [6,12] and min-max MPC [21,22]. However, this is still an unexplored area for pMPC, apart from the recent work presented in [2,20]. In this manuscript we discuss the challenges of robust parametric model predictive control (RpMPC) and we present a method for RpMPC for linear, discrete-time dynamic systems with exogenous disturbances (input uncertainty) and a method for RpMPC for systems with model uncertainty. In both cases the uncertainty is described by the realistic scenario where

no uncertainty model (stochastic or deterministic) is known but it is assumed that the uncertainty variables satisfy a set of inequalities. Definitions Consider the following linear, discrete-time system: x tC1 D Ax t C Bu t C W t

(1)

y t D Bx t C Du t C F t ;

where x 2 X  Rn , u 2 U  Rm , y 2 Y  Rq and  2  Rw are the state, input, output and disturbance (or uncertain) input vectors respectively and A, B, C, D, W and F are matrices of appropriate dimensions. The disturbance input  is assumed to be bounded in the set D f 2 Rw j iL   i   iU ; i D 1; : : : ; wg. This type of uncertainty is used to characterize a broad variety of input disturbances and modeling uncertainties including non-linearities or hidden dynamics [7,11]. This type of uncertainty in general may result in infeasibilities and performance degradation. Definition 1 The robust controller is defined as the controller that provides a single control sequence that steers the plant into the feasible operating region for a specific range of variations in the uncertain variables. The general robust parametric MPC (RpMPC) problem is defined as [20] ( (x tjt ) D min

u N 2V N

N1 h X

T Px tCNjt C x tCNjt

y TtCkjt Q y tCkjt C u TtCk Ru tCk

i

) (2)

kD0

s.t.

x tCkC1jt D Ax tCkjt C Bu tCk C W tCk ;

k0 (3)

y tCkjt D Cx tCkjt C Du tCk CF tCk ;

k  0 (4)

g(x tCkjt ; u tCk ) D C1 x tCkjt CC2 u tCk CC3  0; k D 0; 1; : : : ; N  1

(5)

h(x tCNjt ) D D1 x tCNjt C D2  0

(6)

u tCk D Kx tCkjt ;

(7)

x tjt D x 

kN

(8)

D

Design of Robust Model-Based Controllers via Parametric Programming

where g : X  U ! Rn g and h : X ! Rn h are the path and terminal constraints respectively, x  is the initial state, u N D fu t ; : : : ; u tCN1 g 2 U      U D U N are the predicted future inputs and  N D f t ; : : : ;  N1 g 2 N are the current and future values of the disturbance. Formulation The design of a robust control scheme is obtained by solving a receding horizon constrained optimal control problem where the objective is the deviations expected over the entire uncertainty set, or the nominal value of the output and input deviations. In order to ensure feasibility of (2)–(8) for every possible uncertainty scenario  tCk 2 , k D 0; : : : ; N  1, the set of constraints of (2)–(8) is usually augmented with an extra set of feasibility constraints. The type of these constraints, as will be described later, will determine if the RpMPC is an open-loop or closed-loop controller.

The constraints  0 ensure that, given a particular state realization x  , the single control action uN satisfies all the constraints for all possible bounded disturbance scenarios over the time horizon. However, this feasibility constraint represents an infinite set of constraints since the inequalities are defined for every possible value of  N 2 N . In order to overcome this problem one has to notice that (11) is equivalent to ˚ max max g¯(x  ; u N ;  N )j

N

j D 1; : : : ; J; u N 2 U N ; x  2 X ;  N 2 N  0 (12) Adding (12) into (2)–(8) and minimizing the expectation of the objective function (2) over all uncertain realizations  tCk one obtains the following robust model predictive control problem: ( (x tjt ) D min E N 2 N

Open-Loop Robust Parametric Model Predictive Controller

u N 2V N

N1 h X

To define the set of extra feasibility constraints the future state prediction x tCkjt

j

k1 X DA x C (A j Bu tCk1 j C A j W tCk1 j ) k 

T x tCNjt Px tCNjt C

y TtCkjt Q y tCkjt

C

u TtCk Ru tCk

i

) (13)

kD0

s.t.

(3)–(8)

and

(11)

(14)

jD0

(9) is substituted into the inequality constraints (5)–(6), which then become g¯ j (x  ; u N ;  N )  0 ; j D 1; : : : ; J , n X

 1 i; j x i

C

iD1

q N1 X X

 2 i;k; j u tCk;i

kD0 iD1

C

N1 X w X

 3 i;k; j  C t C k; i C  4 j  0

(10)

kD0 iD1

where  1,  2,  3 are coefficients that are explicit functions of the elements of matrices A, B, C, D, W, F, C1 , C2 , C3 , D1 , D2 , Q, R, P. The set of feasibility constraints is defined as  (x  ; u N )  0 , 8 N 2 N 8 j D 1; : : : ; J   g¯ j (x  ; u N ;  N )  0; u N 2 U N ; x  2 X : (11)

Problem (13)–(14) is a bilevel program that has as constraint a maximization problem, which, as will be shown later, can be solved parametrically and then replaced by a set of linear inequalities of u N ; x  . The solution to this problem corresponds to a robust control law as it is defined in Definition 1. Problem (13)–(14) is an open-loop robust control formulation in that it obtains the optimal control actions uN for the worstcase realization of the uncertainty only, as expressed by inequality (12), and does not take into account the information of the past uncertainty values in the future measurements, thus losing the benefit of the prediction property. This implies that the future control actions can be readjusted to compensate for any variation in the past uncertainty realizations, thereby obtaining more “realistic” and less conservative values for the optimal control actions. This problem can be overcome if we consider the following closed-loop formulation of the problem (2)–(8).

679

680

D

Design of Robust Model-Based Controllers via Parametric Programming

Closed-Loop Robust Parametric Model-Based Control

max min : : : max min max max

tC1 u tC2

To acquire a closed-loop formulation of the general RpMPC problem, a dynamic programming approach is used to formulate the worst-case closed-loop MPC problem, which requires the solution of a number of embedded optimization problems that in the case of a quadratic objective are non-linear and non-differentiable. Feasibility analysis is used to directly address the problem and a set of constraints is again incorporated in the optimization problem to preserve feasibility and performance for all uncertainty realizations. Future measurements of the state contain information about the past uncertainty values. This implies that the future control actions can be readjusted to compensate for the past disturbance realizations by deriving a closed-loop MPC problem as shown next. The main idea is to introduce constraints into the control optimization problem (2)–(8) that preserve feasibility and performance for all disturbance realizations. These constraints are given as

tC`

(x  ; [u tCk ] kD0;:::;` ) ,

tCN2 u tCN1 tCN1

j

g¯ j (x  ; u N ;  N )  0

(18)

max min max min : : : max min max max

t

u tC1 tC1 u tC2

tCN2 u tCN1 tCN1 

N

N

g¯ j (x ; u ;  )  0 u N 2 UN ;

x 2 X ;

 N 2 N :

j

(19) (20)

The difference between the above formulation and formulation (13)–(14) is that at every time instant t C k the future control actions fu tCkC1 ; : : : ; u tCN1 g are readily adjusted to offset the effect of the past uncertainty f t ; : : : ;  tCk g to satisfy the constraints. In contrast, in formulation (13)–(14) the control sequence has to ensure constraint satisfaction for all possible disturbance scenarios. The main issue for solving the above optimization problem is how to solve parametrically each of (17)–(19) and replace them with a set of inequalities of u N ; x  suitable to formulate a multiparametric programming problem. This is shown in the following section.

8 tC` 2 f9u tC`C1 2 Uf8 tC`C1 2 f9u tC`C2 2 U : : : f8 tCN2

Methods/Applications

2 f9u tCN1 2 Uf8 tCN1 2

 f8 j D 1; : : : ; J g¯ j (x  ; [u tCk ] kD0;:::;N1 ;  [ tCk ] kD0;:::;N1 )  0 ggg : : : ggg ;

Parametric Solution of the Inner Maximization Problem of the Open-Loop Robust pMPC Problem

u tCk 2 U ; k D 0; : : : ; ` ; x  2 X ;  tCk 2 ; k D 0; : : : ; `  1 ; ` D 0; : : : ; N  1 :

(15)

The constraints of (15) are incorporated into (2)– (8) and give rise to a semi-infinite dimensional program that can be posed as a min–max bilevel optimization problem: ( 

(x ) D min

u N 2V N

N1 h X

T x tCNjt Px tCNjt C

y TtCkjt Q y tCkjt

C

u TtCk Ru tCk

i

) (16)

kD0

s.t.

max g¯ j (x  ; u N ;  N )  0

tCN1; j

:: :

(17)

An algorithm for solving parametrically the maximization problem of (12), which forms the inner maximization problem of the open-loop RpMPC (13)–(14), comprises the following steps: Step 1. Solve G j (x  ; u N ) D max N f g¯ j (x  ; u N ;  N )j N;L   N   N;U g; j D 1; : : : ; J as a parametric program with respect to  N and by recasting the control elements and future states as parameters. The parametric solution can be obtained by following the method in [19], where the critical disturbance points for each maximization are identified as follows: @ g¯ j cr U D  3 i;k > 0 )  tCk;i D  tCk;i , j D 1. If @ tCk;i 1; : : : ; J, then k D 0; : : : ; N  1, i D 1; : : : ; w; @ g¯ j cr L D  3 i;k < 0 )  tCk;i D  tCk;i , j D 2. If @ tCk;i 1; : : : ; J, then k D 0; : : : ; N  1, i D 1; : : : ; w. cr in the constraints g¯  0 we obtain Substituting  tCk;i  N G j (x ; u ) D g¯ j (x  ; u N ;  N;cr ), where  N;cr is the sequence of the critical values of the uncertainty vector  tcr over the horizon N.

D

Design of Robust Model-Based Controllers via Parametric Programming

Step 2. Compare the parametric profiles G j (x  ; u N ) over the joint space of uN and x  and retain the upper bounds. A multiparametric linear program is formulated: (x  ; u N ) D max G j

Solution of the Closed-Loop RpMPC Problem

j

,



N

(x ; u ) D minf"j"  G j ; j D 1; : : : ; Jg ; "

N

N



u 2U ; x 2X;

(21)

which is equivalent to the comparison procedure of [1]. Step 3. Problem (21) is a multiparametric linear programming problem; hence the solution consists of a set of piece-wise linear expressions for i in terms of the parameters uN and x  and a set of regions  i , ˆ reg where these expressions are valid. This i D 1; : : : ; N statement was proven in [20], sect. 2.2, theorem 2.1, and in [10]. Note that no region s exists such that  N is cons  i , 8fx ; u g 2 s and 8i ¤ s since vex. Thus, inequality (11) can be replaced by the inequalities i (x  ; u N )  0. In this way problem (13)– (14) can be recast as a single-level stochastic program: (x  ) D min f˚(x  ; u N ;  N;n )j )  0; j D 1; : : : ; J;

ˆ reg g ; (x  ; u N )  0; i D 1; : : : ; N

In order to solve the problem (16)–(20), the inner max– min–max problem in (17)–(19) have to be solved parametrically and replaced by simpler linear inequalities, so the resulting problem is a simple multiparametric quadratic program. For simplicity, we only present an algorithm for solving the most difficult problem (19). The same thought process can be performed for the remaining constraints. The algorithm consists of the following steps: Step 1. Solve

G j tCN1 (x  ; u N ; [ tCk ] kD0;:::;N2 ) D max f g¯ j (x  ; u N ;  N );  N;L   N   N;U g ;

tCN1

j D 1; : : : ; J ;

(23)

as a multiparametric optimization problem by recasting x  and uN as parameters and by following again the method of [19] or [20], sect. 2.2.

u N 2U N N N;n

g¯ j (x  ; u ; 

The proof of the theorem is straightforward from (21) and [20] and is omitted for brevity’s sake. It shows that the solution to (22), and hence (13)–(14), can be obtained as an explicit multiparametric solution [9].

(22)

where ˚ is the quadratic objective (13) after substituting (9). The superscript n in  N;n denotes the nominal value of  N , which is usually zero. An approximate solution to the above stochastic problem can be obtained by discretizing the uncertainty space into a finite set of scenarios  N;i , i D 1; : : : ; ns with associated objective weights ([20]), thus leading to a multiperiod optimization problem where each period corresponds to a particular uncertainty scenario. By treating the control variables uN as the optimization variables and the current state x  as parameters, (22) is recast as multiparametric quadratic program. Theorem 1 The solution of (22) is a piece-wise linear control law u t (x  ) D Ac x  C b c and CR c x  C cr c , c D 1; : : : ; N c is the polyhedral critical region where this control law is valid and guarantees that (5) and (6) are feasible for all  tCk 2 , k D 0; : : : ; N  1.

Step 2. Compare the parametric profiles

G j tCN1 (x  ; u N ; [ tCk ] kD0;:::;N2 ) over the joint space of uN , [ tCk ] kD0;:::;N2 and x  to retain the upper bounds. For this comparison a multiparametric program is formulated and then solved by following the comparison procedure in [1]:

tCN1

(x  ; u N ; [ tCk ] kD0;:::;N2 ) D max G j tCN1 j

,

tCN1

(x  ; u N ; [ tCk ] kD0;:::;N2 )

D minf" j s.t. G j tCN1  "; j D 1; : : : ; Jg : "

(24)

The solution of the above optimization consists of a set

of linear expressions for i tCN1 in terms of the parameters x  , uN , [ tCk ] kD0;:::;N2 and a set of polyhedral

tCN1 ˆ reg , where these exregions  i tCN1 , i D 1; : : : ; N pressions are valid. Step 3. Set ` D N  1.

681

682

D

Design of Robust Model-Based Controllers via Parametric Programming

Step 4. Solve the following multiparametric optimization problem over u ` u tC`

u tC` 2U

˚ (x  ) D min E N 2 N ˚(x  ; u N ;  N;n ) u N 2U N  N

(x  ; u ` ;  ` )

D min f

ing stochastic multparametric program:

s.t. g¯ j (x ; u ;  N;n )  0

tC`  (x ; [u tCk ] kD0;:::;` ; [ tCk ] kD0;:::;`1 ); i

t  i (x ; u t )

tC` ˆ reg g: if  i tC`  0; i D 1; : : : ; N

tC1  (x ; [u Tt ; u TtC1 ]T ;  tn ); i

(25)

tCN2  n (x ; [u tCk ] kD0;:::;N2 ; [ tCk ] kD0;:::;N3 ); i

tCN2 ˆ reg i D 1; : : : ; N

tCN1  n (x ; [u tCk ] kD0;:::;N1 ; [ tCk ] kD0;:::;N2 ); i

ˆ regtCN1 i D 1; : : : ; N x tjt D x  ; j D 1; : : : ; J ;

Step 5. Set ` D `  1 and solve the following maximization problem over  `1 : 

(x ; [u tCk ] kD0;:::;` ; [ tCk ] kD0;:::;`1 )

D maxf

tC`

tC`  (x ; [u tCk ] kD0;:::;` ; [ tCk ] kD0;:::;`1 ); i u tC`C1

if  i

u tC`C1 ˆ reg  0; i D 1; : : : ; N g:

(26)

Since the function on the left-hand side of the above equality is a convex piecewise affine function, its maximization with respect to [ tCk ] kD0;:::;`1 reduces to the method of [19] followed by a comparison procedure as described in step 2. Step 6. If ` > 0, then go to step 4, else terminate the procedure and store the affine functions i t ,

t ˆ reg . i D 1; : : : ; N Step 7. The expressions i t (u t ; x  ) are the max–min– max constraint (19). Similarly, the remaining max– min–max constraints are replaced by the set of inequalities

tC1  (x ; [u Tt ; u TtC1 ]T ;  t ) i

0;

::: ;

tCN2  (x ; [u Tt ; u TtC1 ; : : : ; u TtCN2 ]T ; i T T ; : : : ;  tCN3 ]T ) [ tT ;  tC1

tCN1  (x ; [u Tt ; u TtC1 ; : : : ; u TtCN1 ]T ; i T T ; : : : ;  tCN2 ]T ) [ tT ;  tC1

0;

0:

Substituting the inequalities in step 7 into the max– min–max constraints of (16)–(20) we obtain the follow-

1 ˆ reg i D 1; : : : ; N

:: :

The above problem can be solved parametrically by following the procedure in [20], appendix A, or [17], chap. 3, sect. 3.2. The solution to (25) is a convex piecewise affine function of u tC` in terms of the parameters x  , uN , [ tCk ] kD0;:::;N2 that is defined over a set of u u tC` ˆ reg . polyhedral regions  i tC` , i D 1; : : : ; N

tC`

0 ˆ reg  0; i D 1; : : : ; N

(27) where ˚ is again the quadratic objective function in (16). By discretizing the expectation of the value function to a set of discrete uncertainty scenarios and by treating the current state x  as parameter and the control actions as optimization variables and the problem is recast as a parametric quadratic program. The solution is a complete map of the control variables in terms of the current state. The results for the closedloop RpMPC controller are summarized in the following theorem. Theorem 2 The solution of (27) is obtained as a linear piecewise control law u t (x  ) D Ac x  C b c and a set of polyhedral regions CRc D fx  2 X jCR c x  C cr c  0g in the state space for which system (1) satisfies constraints (5)–(6) for all  N 2 N . Cases A special case of the RpMPC problem (2)–(8) arises when the system matrices in the first equation in (1) are uncertain in that their entries are unknown but bounded within specific bounds. For simplicity we will consider the simpler case where W; F D 0 and the entries aij and bij of matrices A and B are not known but satisfy a i j D a¯ i j C ıa i j ; b i` D b¯ i` C ıb i` ıa i j 2 A i j D fıa i j 2 Rj  "j a¯ i j j  ıa i j  "j a¯ i j jg ; (28)

D

Design of Robust Model-Based Controllers via Parametric Programming

ıb i` 2 B i` D fıb i` 2 Rj  "jb¯ i` j  ıb i`  "jb¯ i`jg ;

T k  A x C s.t. C1i

k1 X jD0

(29)

T C C2i u tCk C C3i  0 ;

where a¯ i j , b¯ i` are the nominal values of the entries of A, B respectively and ıa i j , ıb i` denote the uncertainty in the matrix entries, which is assumed to be bounded as in (28)–(29). The general RpMPC formulation (2)– (8) must be redefined to include the introduced model uncertainty by adding the extra constraints a i j D a¯ i j C ıa i j ; b i` D b¯ i` C ıb i`

Definition 2 A feasible solution uN for problem (2)– (8) and (30), for a given initial state x  , is called a robust or reliable solution. Obviously, a robust solution for a given x  is a control sequence uN (future prediction vector) for which constraints (5)–(6) are satisfied for all admissible values of the uncertainty. Since it is difficult to solve this MPC formulation by the known parametric optimization methods, the problem must be reformulated in a multiparametric quadratic programming (mpQP) form. Our objective in this section is to obtain such a form by considering the worst-case values of the uncertainty, i. e. those values of the uncertain parameters for which the linear inequalities of (5)–(6) are critically satisfied. Usually, the objective function (2) is formulated to penalize the nominal system behavior; thus one must subP ¯j ¯ stitute x tCkjt D A¯k x  C k1 jD0 A Bu tCk1 j in (2). In this way the objective function is a quadratic function of uN and x  . Finally, the uncertain evolution of P j the system x tCkjt D Ak x  C k1 jD0 A Bu tCk1 j is replaced in the constraints (5)–(6) to formulate a set of linear inequalities. Thus the following formulation of the RpMPC is obtained: 

(x ) D min

u N 2U N

1 N T N (u ) Hu 2

1 Cx T Fu N C (x  )T Y x  2

T N  D1` A x C

N1 X

(32)

T j D1` A Bu tCk1 j C D2`  0 ;

jD0

` D 1; : : : ; n h ; 8ıa i j 2 A i j ;

(33)

8ıb i` 2 B i` ;

i; j D 1; : : : ; n ;

The new formulation of the RpMPC (2)–(8) and (30) gives rise to a semi-infinite dimensional problem with a rather high computational complexity.



k D 1; : : : ; N  1 ; i D 1; : : : ; n g ;

(30)

8ıa i j 2 A i j ; 8ıb i` 2 B i` ;

T j C1i A Bu tCk1 j

` D 1; : : : ; m :

(34)

It is evident that the new formulation of the RpMPC problem (31)–(34) is also a semi-infinite dimensional problem. This formulation can be further simplified if one considers that for any uncertain matrices A and B, the entries of the matrices Ak and Ak B for all k  0 are given respectively by [17] k k k k k k D a¯ i` C ıa i` ;  jıa i`;min j  ıa i`  jıa i`;max j; a i`

(35) k ¯ k C ıab k ; ab i` D ab i` i` k k k j  ıab i`  jıab i`;max j:  jıab i`;min

(36)

The analysis on (35)–(36) follows from [17], chap. 3, and is omitted for brevity’s sake. Robust Counterpart (RC) Problem Using the basic properties of matrix multiplication and (35)–(36), problem (31)–(34) reformulates into (x  ) D min

u N 2U N



1 N T N (u ) Hu 2

1 Cx T Fu N C (x  )T Y x  2 n X k1 X m X

;

(37)

j

C1i q ab q` u tCk1 j;`

jD1 qD1 `D1

C

;

(31)

X `

C2i` u tCk;` C

n X n X

k  C1i q a q` x` C C3i  0 ;

qD1 `D1

(38)

683

684

D

Design of Robust Model-Based Controllers via Parametric Programming

n X N1 m XX

k D 1; : : : ; N  1; i D 1; : : : ; n g N1 X n X m X

jD1 qD1 `D1

j

D1i q ab q` u tCk1 j;`

jD1 qD1 `D1

C C

n X n X

k  D1i q a q` x` C D2i  0

n X N1 m XX

k jD1i q jjıab q`;max jgju tCk1 j;` j

C i D 1; : : : ; n h 8ıa i j 2 A i j ;

u N 2U N

1 N T N (u ) Hu 2

Cx

T

1 Fu C (x  )T Y x  2 N

;

(41)

s.t.

n X k1 X m X

u N 2 UN ;

C D2i  0

x 2 X :

The interval robust counterpart (IRC) problem can then be formulated as follows:  1 N T N  (u ) Hu (x ) D min N N 2 u 2U 1  T  T N ; (45) Cx Fu C (x ) Y x 2 n X k1 X m X

¯ j u tCk1 j;` C1i q ab q`

jD1 qD1 `D1 n X k1 X m X

k maxfjC1i q jjıab q`;min j;

jD1 qD1 `D1 k jC1i q jjıa q`;max jgz tCk1 j;` C

X

C2i` u tCk;`

C

` k  C1i q a¯ q` x` C

qD1 `D1

n X n X

k D 1; : : : ; N  1 ;

k maxfjC1i q jjıa q`;min j;

qD1 `D1

k jC1i q jjıa q`;max jgjx` j

(44)

Interval Robust Counterpart Problem

k maxfjC1i q jjıab q`;min j;

k jgju tCk1 j;` j C jC1i q jjıa q`;max

C

qD1 `D1

In this way the initial semi-infinite dimensional problem (37)–(40) becomes the above multiparametric non-linear program (mp-NLP). However, the parametric solution of this mp-NLP problem is still very difficult.

C

jD1 qD1 `D1

n X n X

k maxfjD1i q jjıa q`;min j;

(43)

¯ j u tCk1 j;` C1i q ab q`

jD1 qD1 `D1

C

n X n X

i D 1; : : : ; n h ;

s.t.

n X k1 X m X

k  D1i q a¯ q` x` C

k jD1i q jjıa q`;max jgjx` j

(40)

This is a robust multiparametric QP problem (robust mp-QP) where the coefficients of the linear inequalities in the constraints are uncertain, the vector uN is the optimization variable and the initial states x  are the parameters. A similar robust LP problem was studied in [5] where the coefficients of the linear constraints are uncertain, similar to (35)–(36); however, no multiparametric programming problems were considered. In a similar fashion to the analysis in [5] we construct the robust counterpart of the robust mp-QP problem (37)–(40):

(x ) D min

n X n X qD1 `D1

8ıb i` 2 B i` ; i; j D 1; : : : ; n; ` D 1; : : : ; m :



k maxfjD1i q jjıab q`;min j;

jD1 qD1 `D1

(39)

qD1 `D1



¯ j u tCk1 j;` D1i q ab q`

C

n X n X qD1 `D1 n X n X

X

C2i` u tCk;`

` k  C1i q a¯q` x`

k maxfjC1i q jjıa q`;min j;

qD1 `D1 k jC1i q jjıa q`;max jgw` C C3i  0

C C3i  0

k D 1; : : : ; N  1 ;

i D 1; : : : ; n g ; (42)

i D 1; : : : ; n g ; (46)

Design of Robust Model-Based Controllers via Parametric Programming

D

Design of Robust Model-Based Controllers via Parametric Programming, Figure 1 Critical regions for the nominal parametric MPC and state trajectory

n X N1 m XX

¯ j u tCk1 j;` D1i q ab q`

jD1 qD1 `D1

C

n X N1 m XX

k maxfjD1i q jjıab q`;min j;

jD1 qD1 `D1 k jD1i q jjıab q`;max jgz tCk1 j;`

C

n n X X

k  D1i q a¯q` x` C

qD1 `D1

n n X X

k maxfjD1i q jjıa q`;min j;

qD1 `D1

ables now are the vectors u tCk1 j , z tCk1 j and w and the parameters are the states x  . The IRC problem can be solved with the known parametric optimization methods [4,9,16] since the objective function is strictly convex by assumption. The optimal control inputs uN , optimization variables z and w and hence the optimal control ut can then be obtained as explicit functions u N (x  ), z(x  ) and w(x  ) of the initial state x  . Furthermore, the control input ut is obtained as the explicit, optimal control

k jgw` C D2i  0 jD1i q jjıa q`;max

i D 1; : : : ; n h ; (47)  z tCk1 j;`  u tCk1 j;`  z tCk1 j;` ;

(48)

 w`  x`  w` ;

(49)

u N 2 UN ;

x 2 X ;

(50)

where the non-linear inequalities (42)–(43) have been replaced by four new linear inequalities. Two new variables have been introduced to replace the absolute values of the u tCk1 j;` and x` , thus leading to the relaxed IRC problem. The IRC is a mpQP problem with a quadratic index and linear inequalities, where the optimization vari-

Design of Robust Model-Based Controllers via Parametric Programming, Figure 2 Magnification of Fig. 1 around the state trajectory at the second time instant

685

686

D

Design of Robust Model-Based Controllers via Parametric Programming

Design of Robust Model-Based Controllers via Parametric Programming, Figure 3 Critical regions for the nominal parametric MPC and state trajectory

law [9] u t (x  ) D Ac x  C b c which is valid in the polyhedral region CRc D fx  2 X jCR c x  C cr  0g, c D 1; : : : ; N c , where N c is the number of critical regions obtained from the parametric programming algorithm. The general RpMPC problem obtained from the case where the dynamic system (1) pertains to model uncertainties have now been transformed into the IRC problem and can be solved as a mp-QP problem. It is obvious that a feasible solution for the IRC problem is also a feasible solution for the RC and hence the initial RpMPC problem (2)–(8) and (30). Hence:

and b¯1 D 0:0609. The state and control constraints are 3  [0 1:4142]T x  3 ; 2  u  2, and the terminal constraint is

Lemma 1 If uN is a feasible solution for the IRC problem, then it is also a feasible solution for the RC problem, and hence it is a robust solution for the initial RpMPC problem (2)–(8), (30).

(53)

Example 2 Consider a two-dimensional, discrete-time linear system (1) where W D F D 0 and 

0:7326 C ıa 0:0861 0:1722 0:0064   0:0609 C ıb BD ; 0:0064 AD

 ; (51)

where the entries a11 and b1 of the A and B matrices are uncertain, where ıa and ıb are bounded as in (28)–(29) with D 10% and the nominal values are a¯11 D 0:7326

2

0:070251 6 0:070251 6 4 0:21863 0:21863

3 2 1 0:02743 7 6 1 7 0:02743 x6 4 0:022154 1 5 1 0:022154

3 7 7: 5

(52)

Moreover,  QD

0 0

0 2



 R D 0:01; P D

1:8588 1:2899

1:2899 6:7864

 :

Initially, the MPC problem (2)–(8) is formulated and solved only for the nominal values of A and B, thus solving a multiparametric quadratic programming problem as described in [4,16]. Then the IRC problem is formulated as in (45)–(50) by using POP software [9]. The resulting regions for both cases are shown in Figs. 1 and 3 respectively. A simulation of the state trajectories of the nominal and the uncertain system are shown in Figs. 1 and 3 respectively. In these simulations the uncertain parameters ıa and ıb were simulated as a sequence of random numbers that take their values on the upper or lower bounds of ıa, ıb i. e. a time-varying uncertainty. It is clear from Fig. 1 (and Fig. 2, which displays the magnified area around the state trajectory at the second

Determining the Optimal Number of Clusters

time instant) that the nominal solution to problem (2)– (8) cannot guarantee robustness in the presence of the uncertainty and the nominal system trajectory results in constraint violation. On the other hand, the controller obtained with the method discussed here manages to retain the trajectory in the set of feasible initial states (obtained by the critical regions of the parametric solution) and drives the trajectory close to the origin. One should notice that the space of feasible initial states (Fig. 3) given by the critical regions of the parametric solution is smaller than the one given in the nominal system’s case (Fig. 1).

Conclusions In this chapter two robust parametric MPC problems were analyzed. In the first problem two methods for robust parametric MPC are discussed, an open-loop and a closed-loop method, for treating robustness issues arising from the presence of input disturbances/uncertainties. In the second problem, a robust parametric MPC procedure was discussed for the control of dynamic systems with uncertainty in the system matrices by employing robust parametric optimization methods.

References 1. Acevedo J, Pistikopoulos EN (1999) An algorithm for multiparametric mixed-integer linear programming problems. Oper Res Lett 24:139–148 2. Bemporad A, Borelli F, Morari M (2003) Min–max control of constrained uncertain discrete-time linear systems. IEEE Trans Autom Contr 48(9):1600–1606 3. Bemporad A, Morari M (1999) Robust model predictive control: a survey Robustness in indentification and control. Springer, Berlin, pp 207–226 4. Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator for constrained systems. Automatica 38:3–20 5. Ben-Tal A, Nemirovski A (2000) Robust solutions of linear programming problems contaminated with uncertain data. Math Program 88:411–424 6. Bertsekas DP, Rhodes IB (1971) On the minimax reachability of target sets and target tubes. Automatica 7:233–247 7. Camacho E, Bordons C (1999) Model Predictive Control. Springer, Berlin 8. Chisci L, Rossiter JA, Zappa G (2001) Systems with persistent disturbances: predictive control with restricted constraints. Automatica 37:1019–1028

D

9. Dua V, Bozinis NA, Pistikopoulos EN (2002) A multiparametric programming approach for mixed integer and quadratic engineering problems. Comput Chem Eng 26:715–733 10. Fiacco A (1983) Introduction to sensitivity and stability analysis in nonlinear programming. Academic, New York 11. Kothare MV, Balakrishnan V, Morari M (1996) Robust constrained model predictive control using linear matrix inequalities. Automatica 32(10):1361–1379 12. Langson W, Chryssochoos I, Rakovi´c SV, Mayne DQ (2004) Robust model predictive control using tubes. Automatica 40:125–133 13. Lee J, Cooley B (1997) Recent advances in model predictive control and other related areas. In: Carnahan B, Kantor J, Garcia C (eds) Proceedings of chemical process control – V: Assesment and new directions for research, vol 93 of AIChE Symposium Series No. 316, AIChE and CACHE, pp 201–216 14. Mayne D, Rawlings J, Rao C, Scokaert PO (2000) Constrained model predictive control: stability and optimality. Automatica 36:789–814 15. Morari M, Lee J (1999) Model predictive control: past, present and future. Comput Chem Eng 23:667–682 16. Pistikopoulos EN, Dua V, Bozinis NA, Bemporad A, Morari M (2002) On-line optimization via off-line parametric optimization tools. Automatica 26:175–185 17. Pistikopoulos EN, Georgiadis M, Dua V (eds) (2007) Multiparametric Model-Based Control. In: Process Systems Engineering, vol 2. Wiley-VCH, Weinheim 18. Pistikopoulos EN, Georgiadis M, Dua V (eds) (2007) Multiparametric Programming. In: Process Systems Engineering, vol 1. Wiley-VCH, Weinheim 19. Pistikopoulos EN, Grossmann I (1988) Optimal retrofit design for improving process flexibility in linear systems. Comput Chem Eng 12(7):719–731 20. Sakizlis V, Kakalis NMP, Dua V, Perkins JD, Pistikopoulos EN (2004) Design of robust model-based controllers via parametric programming. Automatica 40:189–201 21. Scokaert P, Mayne D (1998) Min–max feedback model predictive control for constrained linear systems. IEEE Trans Autom Contr 43(8):1136–1142 22. Wang YJ, Rawlings JB (2004) A new robust model predictive control method I: theory and computation. J Process Control 14:231–247

Determining the Optimal Number of Clusters MENG PIAO TAN, CHRISTODOULOS A. FLOUDAS Department of Chemical Engineering, Princeton University, Princeton, USA

687

688

D

Determining the Optimal Number of Clusters

MSC2000: 90C26, 91C20, 68T20, 68W10, 90C11, 92-08, 92C05, 92D10

number of clusters. Some of these measures are introduced in the following section.

Article Outline

Methods

Introduction Methods Dunn’s Validity Index Davies–Bouldin Validity Index Measure of Krzanowski and Lai Measure of Calinski and Harabasz

Applications A Novel Clustering Approach with Optimal Cluster Determination Extension for Biological Coherence Refinement

References Introduction Clustering is probably the most important unsupervised learning problem and involves finding coherent structures within a collection of unlabeled data. As such it gives rise to data groupings so that the patterns are similar within each group and remote between different groups. Besides having been extensively applied in areas such as image processing and pattern recognition, clustering also sees rich applications in biology, market research, social network analysis, and geology. For instance, in marketing and finance, cluster analysis is used to segment and determine target markets, position new products, and identify clients in a banking database having a heavy real estate asset base. In libraries, clustering is used to aid in book ordering and in insurance, clustering helps to identify groups of motor insurance policy holders with high average claim costs. Given its broad utility, it is unsurprising that a substantial number of clustering methods and approaches have been proposed. On the other hand, fewer solutions to systematically evaluate the quality or validity of clusters have been presented [1]. Indeed, the prediction of the optimal number of groupings for any clustering algorithm remains a fundamental problem in unsupervised classification. To address this issue, numerous cluster indices have been proposed to assess the quality and the results of cluster analysis. These criteria may then be used to compare the adequacy of clustering algorithms and different dissimilarity measures, or to choose the optimal

Dunn’s Validity Index This technique [2,5] is based on the idea of identifying the cluster sets that are compact and well separated. For any partition of clusters, where ci represent the ith cluster of such a partition, Dunn’s validation index, D, can be calculated as   d(c1 ; c j ) 0 D D min min d (c k ) : 1 jn 1in max1kn i¤ j

Here, d(ci ,cj ) is the distance between clusters ci , and cj (intercluster distance), d0 (ck ) is the intracluster distance of cluster ck , and n is the number of clusters. The goal of this measure is to maximize the intercluster distances and minimize the intracluster distances. Therefore, the number of cluster that maximizes D is taken as the optimal number of clusters to be used. Davies–Bouldin Validity Index This index [4] is a function of the ratio of the sum of within-cluster scatter to between-cluster separation:  n S n (Q i ) C S n (Q j ) 1X : max DB D n S(Q i ; Q j ) i¤ j iD1

In this expression, DB is the Davies–Bouldin index, n is the number of clusters, Sn is the average distance of all objects from the cluster to their cluster center, and S(Qi Qj ) is the distance between cluster centers. Hence, the ratio is small if the clusters are compact and far from each other. Consequently, the Davies–Bouldin index will have a small value for a good clustering. The silhouette validation technique [22] calculates the silhouette width for each sample, the average silhouette width for each cluster, and the overall average silhouette width for a total data set. With use of this approach each cluster can be represented by a so-called silhouette, which is based on the comparison of its tightness and separation. The average silhouette width can be applied for the evaluation of clustering validity and can also be used to decide how good are the number of selected clusters. To construct the silhouettes S(i) the

Determining the Optimal Number of Clusters

following formula is used: S(i) D

(b(i)  a(i)) : max fa(i); b(i)g

Here, a(i) is the average dissimilarity of the ith object to all other objects in the same cluster and b(i) is the minimum average dissimilarity of the ith object to all objects in the other clusters. It follows from the formula that s(i) lies between 1 and 1. If the silhouette value is close to 1, it means that sample is “well clustered” and has been assigned to a very appropriate cluster. If the silhouette value is close to 0, it means that that sample could be assigned to another “closest” cluster as well, and the sample lies equally far away from both clusters. If the silhouette value is close to 1, it means that sample is “misclassified” and is merely somewhere in-between the clusters. The overall average silhouette width for the entire plot is simply the average of the S(i) for all objects in the whole dataset and the largest overall average silhouette indicates the best clustering (number of clusters). Therefore, the number of clusters with the maximum overall average silhouette width is taken as the optimal number of the clusters. Measure of Krzanowski and Lai This index is based on the decrease of the within-cluster sum of squares (WSS) [15] and is given by ˇ ˇ ˇ DIFF(k) ˇ ˇ ; where ˇ KL(k) D ˇ DIFF(k C 1) ˇ 2

2

DIFF(k) D (k  1) p WSS(k  1)  k p WSS(k) : Assuming that g is the ideal cluster number for a given dataset, and k is a particular number of clusters, then WSS(k) is assumed to decrease rapidly for k  g and decreases only slightly for k > g. Thus, it is expected that KL(k) will be maximized for the optimal number of clusters. Measure of Calinski and Harabasz This method [3] assesses the quality of k clusters via the index CH(k) D

BSS(k  1)/(k  1) : WSS(k)/(n  k)

D

Here, WSS(k) and BSS(k) are the WSS and the betweencluster sums of squares, for a dataset of n members. The measure seeks to choose clusters that are well isolated from one another and coherent, but at the same time keep the number of clusters as small as possible, thus maximizing the criterion at the optimal cluster number. Incidentally, a separate study comparing 28 validation criteria [18] found this measure to perform the best. In addition, some other measures to determine the optimal number of clusters are (i) the C index [10], (ii) the Goodman–Kruskal index [8]), (iii) the isolation index [19], (iv) the Jaccard index [11], and (v) the Rand index [20]. Applications As can be seen, while it is relatively easy to propose indices of cluster validity, it is difficult to incorporate these measures into clustering algorithms and to appoint suitable thresholds on which to define key decision values [9,12]. Most clustering algorithms do not contain built-in screening functions to determine the optimal number of clusters. This implies that for a given clustering algorithm, the most typical means of determining the optimal cluster number is to repeat the clustering numerous times, each with a different number of groupings, and hope to catch a maximum or minimum turning point for the cluster validity index in play. Nonetheless, there have been attempts to incorporate measures of cluster validity into clustering algorithms. One such method [21] introduces a validity index: Validity D

Intra  Cluster : Inter  Cluster

Since it is desirable for the intracluster distance and the intercluster distance to be minimized and maximized, respectively, the above validity measure should be as small as possible. Using the K-means algorithm, Ray and Turi [21] proposed running the process for two up to a predetermined maximum number of clusters. At each stage, the cluster with the maximum variance is split into two and clustering is repeated with these updated centers, until the desired turning point for the validity measure is observed. Another approach [16] is based on simulated annealing, which was originally formulated to simulate a collection of atoms in equilibrium at a given temperature [14,17]. It assumes two

689

690

D

Determining the Optimal Number of Clusters

given parameters D, which is the cutoff cluster diameter, and P, a P-value statistic, as well as p(d), the distribution function of the Euclidean distances between the members in a dataset. Then, the upper boundary for the fraction of incorrect vector pairs is given by Z

Global center, zok D

1

p(x) dx :

f (D; K D 1) D D

On the other hand, it is possible to define a lower boundary for f(D,K) with a preassigned P-value cutoff. The clustering algorithm then sequentially increases the cluster number until the two indicators converge. A Novel Clustering Approach with Optimal Cluster Determination See also the article on “Gene Clustering: A Novel Optimization-Based Approach”. Recently, we proposed a novel clustering approach [23,24] that expeditiously contains a method to predict the optimal cluster number. The clustering seeks to minimize the Euclidean distances between the data and the assigned cluster centers as MIN

w i j ;z jk

c X s n X X

a certain number of clusters used for a particular clustering algorithm. Given n data points, each having k feature points, j clusters, and a binary decision variable for cluster membership wij , we introduce the following:

 2 w i j a i k  z jk :

iD1 jD1 kD1

To make the nonlinear problem tractable, we apply a variant of the generalized benders decomposition algorithm [6,7], the global optimum search. The global optimum search decomposes the problem into a primal problem and the master problem. The former solves the continuous variables while fixing the integer variables and provides an upper-bound solution, while the latter finds the integer variables and the associated Lagrange multipliers while fixing the continuous variables and provides a lower-bound solution. The two sequences are iteratively updated until they converge at an optimal solution in a finite number of steps. In determining the optimal cluster number, we note that the optimal number of clusters occurs when the intercluster distance is maximized and the intracluster distance is minimized. We adapt the novel work of Jung et al. [13] in defining a clustering balance, which has been shown to have a minimum value when intracluster similarity is maximized and intercluster similarity is minimized. This provides a measure of how optimal is

n P

1 n

ai k ;

8k ;

iD1

Intracluster error sum; n P c P s

2 P w i j a i k  z jk 2 ; D iD1 jD1 kD1

Intercluster error sum;  D

c P s

P

z jk  z o 2 : k 2

jD1 kD1

Jung et al. [13] next proposed a clustering balance parameter, which is the ˛-weighted sum of the two error sums: Clustering balance,

" D ˛ C (1  ˛)  :

We note here that the right ˛ ratio is 0.5. There are two ways to come to this conclusion. We note that the factor ˛ should balance the contributive weights of the two error sums to the clustering balance. At extreme cluster numbers, that is, the largest and smallest numbers possible, the sum of the intracluster and intercluster error sums at both cluster numbers should be balanced. In the minimal case, all the data points can be placed into a single cluster, in which case the intercluster error sum is zero and the intracluster error sum can be calculated with ease. In the maximal case, each data point forms its own cluster, in which case the intracluster error sum is zero and the intercluster error sum can be easily found. Obviously the intracluster error sum in the minimal case and the intercluster error sum in the maximal case are equal, suggesting that the most appropriate weighting factor to use is in fact 0.5. The second approach uses a clustering gain parameter proposed by Jung et al. [13]. This gain parameter is the difference between the decreased intercluster error sum  j compared with the value at the initial stage and the increased intracluster error sum  j compared with the value at the initial stage, and is given by  jk D

n X



2

2 w i j a i k  z ok 2  z jk  z ok 2 ;

iD1

8j,8k ; jk D

n X iD1



2 w i j a i k  z jk 2 ; 8j,8k ;

Determining the Optimal Number of Clusters n X

Gain;  jk D



2

2 w i j a i k  z ok 2  z jk  z ok 2

iD1 n X





2 w i j a i j  z jk 2 ; 8j,8k :

iD1

With the identities n P iD1 n P

w i j a i k D n j z jk ; 8 j; 8k ; wi j D n j ; 8 j ;

iD1

where nj denotes the number of data points in cluster j, the gain can be simplified to

2  

jk D n j  1 z ok  z jk 2 ; 8j; 8k ; c P s 

2 

P D n j  1 z ok  z jk 2 : jD1 kD1

Jung et al. [13] showed the clustering gain to have a maximum value at the optimal number of clusters, and demonstrated that the sum total of the clustering gain and balance parameters is a constant. As can be seen from the following derivation, this is only possible if the ˛ ratio is 0.5: Sum of clustering balance and clustering gain; ˝ D"C DC C 3 2 c X s n X X

2 w i j a i k  z jk 2 5 D4 iD1 jD1 kD1

2 3 c X



z jk  z o 2 5 C : : : C4 k 2 jD1

2P n P c

c P s

2 P

3

z jk  z o 2 w i j a i k  z ok 2  k 27 6 iD1 jD1 kD1 jD1 6 7 n P c P s

2 4 P 5  w i j a i k  z jk 2 iD1 jD1 kD1

D

c X s n X X



2 w i j a i k  z ok 2

iD1 jD1 kD1

D

s n X X



a i k  z o 2 ; k 2 iD1 kD1

which is a constant for any given dataset.

D

Extension for Biological Coherence Refinement Today, the advent of DNA microarray technology has made possible the large-scale monitoring of genomic behavior. In working with gene expression data, it is often useful to utilize external validation in evaluating clusters of gene expression data. Besides assessing the biological meaning of a cluster through the functional annotations of its constituent genes using gene ontology resources, other indications of strong biological coherence [25] are (i) the proportion of genes that reside in clusters with good P-value scores, (ii) cluster correlation, since closely related genes are expected to exhibit very similar patterns of expression, and (iii) cluster specificity, which is the proportion of genes within a cluster that annotates for the same function. A novel extension of the previously described work [25] allows not just for the determination of the optimal cluster number within the framework of a robust yet intuitive clustering method, but also for an iterative refinement of biological validation for the clusters. The algorithm is as follows. Gene Preclustering We precluster the original data by proximity studies to reduce the computational demands by (i) identifying genes with very similar responses and (ii) removing outliers deemed to be insignificant to the clustering process. To provide just adequate discriminatory characteristics, preclustering can be done by reducing the expression vectors into a set of representative variables {C; o; }, or by pregrouping genes that are close to one another by correlation or some other distance function. Iterative Clustering We let the initial clusters be defined by the genes preclustered previously, and find the distance between each of the remaining genes and these initial clusters and as a good initialization point place these genes into the nearest cluster. For each gene, we allow its suitability in a limited number of clusters on the basis of the proximity study. In the primal problem of the global optimum search algorithm, we solve for zjk . These, together with the Lagrange multipliers, are used in the master problem to solve for wij . The primal problem gives an upper-bound solution and the master problem gives a lower bound. The optimal solution is obtained when both bounds converge. Then, the worst-

691

692

D

Determining the Optimal Number of Clusters

Determining the Optimal Number of Clusters, Figure 1 Iterative clustering procedure. GOS global optimum search

Determining the Optimal Number of Clusters

placed gene is removed and used as a seed for a new cluster. This gene has already been subjected to a membership search, so there is no reason for it to belong to any of the older clusters. The primal and master problems are iterated and the number of clusters builds up gradually until the optimal number is attained. Iterative Extension Indication of strong biological coherence is characterized by good P values based on gene ontology resources and the proportion of genes that reside in such clusters. As an extension, we would like to mine for the maximal amount of relevant information from the gene expression data and sieve out the least relevant data. This is important because information such as biological function annotation drawn from the cluster content is often used in the further study of coregulated gene members, common reading frames, and gene regulatory networks. From the clustered genes, we impose a coherence floor, based on some or all of the possible performance factors such as functional annotation, cluster specificity, and correlation, to demarcate genes that have already been well clustered. We then iterate to offer the poorly placed genes an opportunity to either find relevant membership in one of the strongly coherent clusters, or regroup amongst themselves to form quality clusters. Through this process, a saturation point will be reached eventually whereby the optimal number of clusters becomes constant as the proportion of genes distributed within clusters of high biological coherence levels off. Figure 1 shows a schematic of the entire clustering algorithm. References 1. Azuaje F (2002) A Cluster Validity Framework for Genome Expression Data. Bioformatics 18:319–320 2. Bezdek JC, Pal NR (1998) Some New Indexed of Cluster Validity. IEEE Trans Syst Man Cybern 28:301–315 3. Calinski RB, Harabasz J (1974) A Dendrite Method for Cluster Analysis. Commun Stat 3:1–27 4. Davis DL, Bouldin DW (1979) A Cluster Separation Measure. IEEE Trans Pattern Anal Machine Intell 1(4):224–227 5. Dunn JC (1974) Well Separated Clusters and Optimal Fuzzy Partitions. J Cyber 4:95–104 6. Floudas CA (1995) Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford University Press, New York 7. Floudas CA, Aggarwal A, Ciric AR (1989) Global Optimum Search for Non Convex NLP and MINLP Problems. Comp Chem Eng 13(10):1117–1132

D

8. Goodman L, Kruskal W (1954) Measures of Associations for Cross-Validations. J Am Stat Assoc 49:732–764 9. Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster Validity Methods: Part 1. SIGMOD Record 31(2):40–45 10. Hubert L, Schultz J (1976) Quadratic Assignment as a General Data-Analysis Strategy. Brit J Math Stat Psych 29: 190–241 11. Jaccard P (1912) The Distribution of Flora in the Alpine Zone. New Phytol 11:37–50 12. Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series, Prentice-Hall, New Jersey 13. Jung Y, Park H, Du D, Drake BL (2003) A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering. J Global Optim 25:91–111 14. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by Simulated Annealing. Science 220(4598):671–680 15. Krzanowski WJ, Lai YT (1985) A Criterion for Determining the Number of Groups in a Data Set using Sum of Squares Clustering. Biometrics 44:23–44 16. Lukashin AV, Fuchs R (2001) Analysis of Temporal Gene Expression Profiles: Clustering by Simulated Annealing and Determining the Optimal Number of Clusters. Bioinformatics 17(5):405–414 17. Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller EJ (1953) Equations of State Calculations by Fast Computing Machines. J Chem Phys 21:1087 18. Milligan GW, Cooper MC (1985) An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika 50:159–179 19. Pauwels EJ, Fregerix G (1999) Finding Salient Regions in Images: Non-parametric Clustering for Image Segmentation and Grouping. Comput Vis Image Underst 75:73–85 20. Rand WM (1971) Objective Criteria for the Evaluation of Clustering Methods. J Am Stat Assoc 66(336):l846–850 21. Ray S, Turi R (1999) Determination of Number of Clusters in K-Means Clustering and Application in Color Image Segmentation. In: Proceed 4th Int Conf Advances in Pattern Recognition and Digital Techniques, 137–143 22. Rousseeuw PJ (1987) Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J Comp Appl Math 20:53–65 23. Tan MP, Broach JR, Floudas CA (2007) A Novel Clustering Approach and Prediction of Optimal Number of Clusters: Global Optimum Search with Enhanced Positioning. J Global Optim 39:323–346 24. Tan MP, Broach JR, Floudas CA (2008) Evaluation of Normalization and Pre-Clustering Issues in a Novel Clustering Approach: Global Optimum Search with Enhanced Positioning. J Bioinf Comput Biol 5(4):895–913 25. Tan MP, Broach JR, Floudas CA (2008) Microarray Data Mining: A Novel Optimization-Based Iterative Clustering Approach to Uncover Biologically Coherent Structures (submitted for publication)

693

694

D

Deterministic and Probabilistic Optimization Models for Data Classification

Deterministic and Probabilistic Optimization Models for Data Classification YA-JU FAN, W. ART CHAOVALITWONGSE Department of Industrial and Systems Engineering, Rutgers University, Piscataway, USA MSC2000: 65K05, 55R15, 55R35, 90C11 Article Outline Deterministic Optimization Models Support Vector Machines Robust LP for SVM Feature Selection with SVM Hybrid LP Discriminant Model MIP Discriminant Model Multi-hyperplane Classification Support Feature Machines

Probabilistic Optimization Models Bayesian-Based Mathematical Program Probabilistic Models for Classification MIP Formulation for Anderson’s Model LP Formulation for Anderson’s Model

References A classification problem is concerned with categorizing a data point (entity) into one of G (G  2) mutually exclusive groups based upon m (positive integer) specific measurable features of the entity. A classification rule is typically constructed from a sample of entities, where the group classifications are known or labeled (training or supervised learning). Then it can be used to classify new unlabeled entities. Many classification methods are based on distance measures. A common approach is to find a hyperplane to classify two groups S (G D G1 G2 ). The hyperplane can be represented in a form of A! D  , where A denotes an n  m input data matrix, n is the total number of input data points, and m is the total number of data features/attributes. The classification rule is then made by the weight vector ! to map data points onto a hyperplane, and the scalar  , which are best selected by solving a mathematical programming model. The goal is to have entities of Group 1 (G1 ) lie on one side of the hyperplane and entities of Group 2 (G2 ) lie on the other side. Support Vector Machines (SVM) is the most studied hy-

perplane construction method. The SVM concept is to construct a hyperplane that minimizes the upper bound on the out-of-sample error. The critical step of SVM is to transform (or map) data points on to a high dimensional space, known as kernel transformation, and classify data points by a separating plane [9]. Subsequently, the hybrid linear programming discriminant model is proposed by [12,13,20]. The hybrid model does not depend on data transformation, where the objective is to find a plane that minimizes violations and maximizes satisfactions of the classified groups. Glover [19] proposed a mixed integer programming (MIP) formulation for the hybrid model by adding binary variables for misclassified entities. Other MIP formulations that are subsequently developed include [1,15,16]. Recently, a new technique that use multiple hyperplanes for classification has been proposed by [17]. This technique constructs a piecewise-linear model that gives convex separating planes. Subsequently, Better, Glover and Samorani [6] proposed multi-hyperplane formulations that generate multiple linear hyperplanes simultaneously with the consequence of forming a binary decision tree. In classification, the selection of data’s features/ attributes is also very critical. Many mathematical programming methods have been proposed for selecting well represented features/attributes. Bennett and Mangasarian [5,23] gives a feature selection formulation such that the model not only separates entities into two groups, but also tries to suppress nonsignificant features. In a more recent study, Chaovalitwongse et al. (2006) proposed Support Feature Machine (SFM) formulations can be used to find a set of features that gives the highest classification performance [10]. Baysian decision method has also been widely studied in classification. However, there are only few studies incorporating the Baysian model with mathematical programming approaches. Among those studies, Asparouhov and Danchev [4] formulates a MIP model with binary variables, which are conformed with the Bayesian decision theory. In the case of multi-group classification, Anderson [2] developed a mathematical formulation that incorporates the population densities and prior probabilities of training data. This model yields classification rules for multi-groups with a reject option, (a set having entities that does not belong to any group) [22].

D

Deterministic and Probabilistic Optimization Models for Data Classification

Deterministic Optimization Models Support Vector Machines Support Vector Machines (SVM) is aimed at finding a hyperplane that separates the labeled input data into two groups, G1 and G2 . Then the optimal plane can be used for classifying new data point. The hyperplane can be mathematically expressed by A! D  , where ! 2 ; i 2 G2 ). Let yi and zi represent external and internal deviation variables referring to the point violations and satisfactions of the classification rule. More specifically, they are the magnitudes of the data points lying outside or inside their targeted half spaces. The objective is to minimize violations and maximize the satisfactions of the classified groups. Thus, in the objective function, variable hi ’s discourage external deviations and variable ki ’s encourage internal deviations. Then h i  k i for i D 0 and i 2 G, must be satisfied. The hybrid model is given by X X h i y i  k0 z 0  ki zi min h0 y0 C

There are several related mixed integer formulations in the literature [1,15,16]. In general, due to the computational requirements, these standard MIP formulations can only be applied to classification problems with a relatively small number of observations. Glover [19] proposed a compact mathematical program for discriminant model, which is a variant of the above-mentioned hybrid LP model. This objective of this model is to minimize the number of misclassified entities. The MIP discriminant model is given by X zi min

i2G

i2G

s.t. A i !  y0  y i C z0 C z i D ;

i 2 G1

A i ! C y0 C y i  z0  z i D ; X z0 C z i D 1; i 2 G

i 2 G2 (2)

i

y0 ; z0  0 y i ; z i  0; !; 

i2G

unrestricted:

We note that Eq. (2) is a normalization constraint that is necessary for avoiding a trivial solution where all ! j D 0 and  D 0. Glover [18] identifies more normalization methods to conquer the problem with null weighting.

i2G

s:t: A i x  Mz i C ˇ i D b;

i 2 G1

A i x C Mz i  ˇ i D b;

i 2 G2

ˇ i  0;

i2G

z i 2 f0; 1g; x; b

i2G

unrestricted;

where ˇ i are slack variables, and M is a large constant chosen so that when z i D 1, A i x  b C Mz i will be redundant for i 2 G1 and A i x  b  Mz i will be redundant for i 2 G2 . This model can incorporate a normalP P ization constraint, (n2 i2G1 A i C n1 i2G2 A i )x D 1, where n1 and n2 are the number of entities in G1 and G2 , respectively. Multi-hyperplane Classification Multi-hyperplane formulations, given by Better et al. [6], generate multiple linear hyperplanes simultaneously with the consequence of forming a decision tree. The hyperplanes are generated from an extension of the Discriminant Model proposed by Glover [18]. Instead of using kernel transformation that projects data into a high dimensional space to improve the performance of SVM, the multi-hyperplane approach approximates a nonlinear separation by constructing multiple hyperplanes. Let d D 0 when we are at a root node of a binary tree, where none of the classifications have been done. Let d D D when the tree has two leaf nodes corresponding to the final separation step. In order to explain the model, we define the following terms.  Successive Perfect Separation (SPS) is a procedure that forces all elements of Group 1 (G1 ) and Group 2 (G2 ) to lie on one side of the hyperplane at each node

Deterministic and Probabilistic Optimization Models for Data Classification

for any depth d 2 f0; : : : ; D 1g. SPS is a special use of a variant based on a proposal of Glover [18].  SPS decision tree is a tree that results from the twogroup classification iteratively applying the SPS procedure. The root node (d D 0) contains all the entities in the data set, and at d D D the two leaf nodes correspond to the final separation step. For a given maximum depth D, an initial multihyperplane model considers each possible SPS tree type of depth d, for d D 0; : : : ; D  1. A root node is viewed as a “problem” node where all data points from both groups need to be separated. A leaf node, on the other hand, is considered to be a “decision” node where data points are classified into two groups. Define slicing variables sl i for i 2 f1; : : : ; D  1g. There are total of D  1 slicing variables needed for a tree having maximum depth D. Specifically, at depth d D 1, sl1 D 0 if the “left” node constitutes a leaf node while the “right” node constitutes a root (or problem) node. Without loss of generality, we herein consider D D 3 for the initial multi-hyperplane model. The mathematical model for multi-hyperplane SVM can be formally defined as follows. Let M and " denote large and small positive constants, respectively, and G denote a set of the union of entities in G1 and G2 . Suppose there are n entities in the training data set. Define a binary variable zi D 0 if object i is correctly classified by the “tree”, otherwise zi D 1. Define a binary variable and zhi D 0 if object i is correctly classified by “hyperplane h”, otherwise zhi D 1. The multi-hyperplane SVM model also includes traditional hyperplane constraints for each depth d of the tree and the normalization constraint, which is similar to the mixed integer programming model in [18]. Then, " is added to prevent data points from lying on the hyperplane. Tree-type constraints are included to identify the optimal tree structure for the data set, which will be in part of the optimal classification rule. Binary variables yi are used for tree types (0,1) and (1,0) to activate or deactivate either-or constraints. The SPS decision tree formulation for the depth D D 3 is given by

min

n X

A i x d C Mzd i  ˇ i D b d C " i 2 G2 ; d D 1; 2; 3

 z1i C z2i C z3i  2

i 2 G1

(5)

i 2 G2

(6)

i 2 G1

(7)

i 2 G2

(8)

i 2 G1

(9)

M(sl1 C sl2 ) C Mzi  z1i C z2i C z3i M(2  sl1  sl2 ) C Mzi  z1i C z2i C z3i M(2  sl1  sl2 ) C zi  z1i C z2i C z3i  2 M(1 C sl1  sl2 ) C zi  z1i  My i

M(1 C sl1  sl2 ) C Mzi  z2i C z3i  M[1  y i ]

i 2 G1

(10)

i 2 G2

(11)

i 2 G2

(12)

i 2 G1

(13)

i 2 G1

(14)

i 2 G2

(15)

i 2 G2

(16)

M(1 C sl1  sl2 ) C zi  z1i

M(1 C sl1  sl2 ) C zi  z2i C z3i  1 M(1 C sl1  sl2 ) C zi  z1i

M(1 C sl1  sl2 ) C zi  z2i C z3i  1

M(1 C sl1  sl2 ) C zi  z1i  My i

M(1 C sl1  sl2 ) C Mzi

s:t:A i x d  Mzd i C ˇ i D b d  " i 2 G1 ; d D 1; 2; 3

(4)

M(sl1 C sl2 ) C zi

zi

iD1

D

(3)

 z2i C z3i  M[1  y i ]

697

698

D

Deterministic and Probabilistic Optimization Models for Data Classification

3 n X X

x jd D 1

(17)

jD1 dD1

m m X xj X  a i j x j C  M(1  y i ) 2 jD1

zi 2 f0; 1g; zd i 2 f0; 1g; y i 2 f0; 1g;

for i D 1; : : : ; n

i 2 G; d D 1; 2; 3

m X

sl k 2 f0; 1g k D 1; 2; x; b unrestricted,

Support Feature Machines Support Feature Machines (SFM) proposed in [10] is a mathematical programming technique used to identify a set of features that gives the highest performance in classification using the nearest neighbor rule. SFM can be formally defined as follows. Assume there are n data points, each with m features, we define the decision variables x j 2 f0; 1g ( j D 1; : : : ; m) indicating if feature j is selected by SFM and y i 2 f0; 1g (i D 1; : : : ; n) indicating if sample i can be correctly classified by SFM. There are two versions of SFM, voting and averaging. Each version uses different weight matrices, which are provided by user’s classification rule. The objective function of voting SFM is to maximize the total correct classification as in Eq. (18). There are two sets of constraints used to ensure that the training samples are classified based on the voting nearest neighbor rule as in Eqs. (19)-(20). There is a set of logical constraints in Eq. (21) used to ensure that at least one feature is used in the voting nearest neighbor rule. The mixed-integer program for voting SFM is given by: n X

yi

(18)

iD1

s.t.

m X jD1

ai j x j 

m X xj  My i 2 jD1

for i D 1; : : : ; n

(20)

xj  1

(21)

jD1

where the constraints in Eqs. (3)-(4) are the hyperplane constraints, Eqs. (5)-(6) are the constraints for tree type (0,0), Eqs. (7)-(8) are the constraints for tree type (1,1), Eqs. (9)-(12) are the constraints for tree type (0,1), Eqs. (13)-(16) are the constraints for tree type (1,0), and Eq. (17) is the normalization constraint. This small model with D D 3 performs well for small depths and has computational limitations. The reader should refer to [6] for a greater detail of an improved and generalized structure model for all types of SPS trees.

max

jD1

(19)

x 2 f0; 1gm ; y 2 f0; 1gn ; where a i j D 1 if the nearest neighbor rule correctly classified sample i at electrode j, 0 otherwise, n is total number of training samples, m is total number of features, M = m/2, and is a small positive number used to break a tie during the voting (0 < < 1/2). The objective function of averaging SFM is to maximize the total correct classification as in Eq. (22). There are two sets of constraints used to ensure that the training samples are classified based on the distance averaging nearest neighbor rule as in Eqs. (23)-(24). There is a set of logical constraints in Eq. (25) used to ensure that at least one feature is used in the distance averaging nearest neighbor rule. The mixed-integer program for averaging SFM is given by: max s.t.

n X

yi

(22)

iD1 m X

m X

jD1

jD1

d¯i j x j 

d i j x j  M1i y i

for i D 1; : : : ; n m X

di j x j 

jD1

m X

d¯i j x j  M2i (1  y i )

jD1

for i D 1; : : : ; n m X

(23)

xj  1

(24) (25)

jD1

x 2 f0; 1gm ; y 2 f0; 1gn ; where d i j is the average statistical distance between sample i and all other samples from the same class at feature j (intra-class distance), d¯i j is the average statistical distance between sample i and all other samples from different class at feature j (inter-class distance), P P M1i D mjD1 d i j , and M2i D mjD1 d i j .

Deterministic and Probabilistic Optimization Models for Data Classification

D

Probabilistic Optimization Models

Probabilistic Models for Classification

The deterministic classification models in the previous section make a strong assumption that the data are separable. In the case that the data may not be well separated, using the deterministic models may lead to a high misclassification rate. The classification models that incorporate probabilities may be a better option for such noisy data. When the population densities and prior probabilities are known, there are probabilistic models that consider constrained rules with a reject option [2] as well as a Baysian-based model [4].

An optimization model proposed by Anderson [2] incorporates population densities, prior probabilities from all groups, and misclassification probabilities. This method is aimed to find a partition fR0 ; R1 ; : : : ; RG g of Rem where m is the number of features. This method naturally forms a multi-group classification. The objective is to maximize the probability of correct allocation subject to constraints on the misclassification probabilities. The mathematical model can be formally defined as follows. Let f h , h D 1; : : : ; G, denote the group conditional density functions. Let g denote the prior probability that a randomly selected entity is from group g, g D 1; : : : ; G, and ˛ h g , h ¤ g, are constants between 0 and 1. The probabilistic classification model is then given by

Bayesian-Based Mathematical Program The Baysian-based mathematical program that are conformed with the Bayesian decision theoretic approach is proposed by Asparouhov and Danchev [4]. The model can be formally defined as follows. Denote c 2 < as a cut-off value, x 2 B m as a vector of m binary values, and ! 2 8i, incorporated. Experimental studies in [4] suggest this Baysian-based model can give better performance than other contemporary linear discriminant models. The Baysian-based classification formulation is given by X min (jn1s  n2s j zs C min(n1s ; n2s )) !;z s ;c

s:t:

s T xs ! xsT !

 Mzs  c if n1s  n2s C Mzs  c C " if n1s < n2s

n1s C n2s ¤ 0 zs 2 f0; 1g; ! 2 0, one considers the following two functions: 1 ; 1 C x2 1 F2 (x) D  1 C x2 ( 1 1 e (x˛)2 ı2 e ı2 ; Cm 0; F1 (x) D 

x 2 (˛  ı; ˛ C ı); x … (˛  ı; ˛ C ı):

Function F 1 has in x = 0 the unique local minimizer which is also the global minimizer, i. e. x = 0. Let m < 1, 0 < ı < |˛|; then function F 2 has several critical points including two local minimizers, one is x = 0 and the other is x = x2 2 (˛  ı, ˛ + ı). Moreover the global minimizer of F 2 is x = x = x2 . One notes that F 1 , F 2 are smooth functions and that they coincide, for every x 2 R \ (˛  ı, ˛ + ı) where ı > 0 is arbitrary. Let D = RN , let F: D ! R be a continuously differentiable function, let rF be the gradient of F, let x 2 D be such that (rF)(x) 6D 0 then the vector  (rF)(x) gives the direction of steepest descent for the function F at the point x. One can consider the following system of differential equations: dx (t) D (rF)(x(t)); dt x(0) D x0 :

t > 0;

(2) (3)

Under some hypotheses on F, the solution of problem (2), (3) is a trajectory in RN starting from x0 and

703

704

D

Differential Equations and Global Optimization

ending in the critical point xloc (x0 ) of F whose attraction region contains x0 . Using a numerical integration scheme for (2), (3) one can obtain a numerical optimization method, for example choosing the Euler integration scheme with variable stepsize from (2), (3) one obtains the so-called steepest descent algorithm. Let  k 2 RN be the approximation of x(t k ), k 2 N, where t 0 = 0, 0 < t k < t k+1 < +1, k = 1, 2, . . . , and t k ! +1 when k ! 1, obtained with a numerical optimization method coming from (2), (3). Suppose { k k 2 N} is a sufficiently good approximation of the solution x(t), t > 0 of (2), (3) one has limk ! 1  k = limt ! +1 x(t) = xloc (x0 ), thus the numerical optimization methods obtained from (2), (3) compute critical points that depend on the initial guess x0 . So that these critical points usually are not global minimizers of F. One can consider numerical optimization methods due to other differential equations instead of (2), that is differential equations taking in account higher order derivatives of F or of x(t). However the minimizers computed with these numerical optimization methods depend only on local properties of the function F, thus in general they will not be global minimizers of F. So that methods based on ordinary differential equations are inadequate to deal with problem (1). In this article it is described how to use stochastic differential equations to avoid this difficulty. In fact one wants to destabilize the trajectories generated by problem (2), (3) using a stochastic perturbation in order to be able to reach global minimizers. This must be an appropriate perturbation, that is the corresponding perturbed trajectories must be able to leave the attraction region of a local minimizer of F to go in an attraction region of another minimizer of F obtaining as t ! +1 the solution of problem (1). This is done by adding a stochastic term, i. e., a Brownian motion on the right-hand side of equation (2). Moreover this stochastic term takes into account the domain D, when D  RN . This is done introducing the solution of the Skorokhod reflection problem. In the second section one gives some mathematical background about stochastic differential equations that is necessary to state the results of the third and fourth sections. In the third section, the unconstrained version of problem (1) is treated, i. e., D = RN . In the fourth section, the constrained version of problem (1) is treated, i. e., D  RN . In both these sections one gives methods,

convergence analysis and discussion when possible of a relevant software library. In the last section one gives some information about new application areas of global optimization such as graph theory and game theory. Mathematical Background Let ˝ R, ˙ be a -field of subsets of ˝ and P be a probability measure on ˙. The triple (˝; ˙; P) is called a probability measure space, see [5] for a detailed introduction to probability theory. Let ˝ 0 R,  be a topology of subsets of ˝ 0 . Then X : ˝ ! ˝ 0 is a random variable if {X 2 A} 2 ˙ for every A 2  . The distribution function GX : R ! [0, 1] of X is defined by G X (x) D PfX  xg, x 2 R and one denotes with g X its density. The expected value or the mean value of X is defined as follows: Z Z xG X ( dx) D x g X (x) dx (4) m(X) D R

R

and the variance of X is given by: v(X) D m((X  m(X))2 ):

(5)

For example, a random variable X has discrete distribution, or is concentrated on x1 , . . . , xn , when g X (x) = Pn p ı(x  xi ), where pi > 0, xi 2 ˝ 0 , i = 1, . . . , n, PniD1 i iD1 pi = 1 and ı is the Dirac delta. Given m 2 R, v > 0 a random variable has normal distribution when (xm)2 1 e  2v ; g X (x) D p 2 v

one notes that m(X) = m and v(X) = v. A stochastic process is a family of random variables depending on a parameter t, that is, {X(t): ˝ ! ˝ 0 , t  0}. A Brownian motion is a stochastic process {w(t): t  0} having the following properties:  Pfw(0) D 0g D 1;  for every choice of t i , i = 1, . . . , k, 0  t i < t i+1 < +1, i = 1, . . . , k  1, the increments w(t i+1 )  w(t i ), i = 1, . . . , k  1, are independent and normally distributed random variables with mean value equal to zero and variance equal to t i+1  t i . An N-dimensional Brownian motion is a N-dimensional process fw(t) D (w1 (t); : : : ; w N (t)) : t  0g where its components {wi (t): ˝ ! ˝ 0 , t  0}, i = 1, . . . , N, are independent Brownian motions. The Brownian

D

Differential Equations and Global Optimization

motion is a good mathematical model to describe phenomena that are the superposition of a large number of chaotic elementary independent events. The most famous example of Brownian motion is the motion of pollen grains immersed in a fluid, the grains have a chaotic perpetual motion due to the collisions with the molecules of the fluid, see [15, p. 39]. Let ˘ = ˝ 0 ×    × ˝ 0  RN , where × denotes the Cartesian product of sets. Let  0 be a topology of subsets of ˘ . Let s, t, be such that 0  s  t, let x 2 ˘ , A 2  0 , then the transition distribution function of a Ndimensional stochastic process {X(t) : t  0} is defined as follows: T(s; x; t; A) D PfX(t) 2 A and X(s) D xg:

(6)

The Langevin equation expresses Newton principle for a particle subject to a random force field, see [15, p. 40]. Let divy be the divergence operator with respect to the variables y, y be the Laplace operator with respect to the variables y and Lˇ , ˛() = divy (˛)  (1/2)y (ˇ 2 ). Under regularity assumptions on ˛ and ˇ, the transition probability density p(s, x, t, y), 0  s < t, x, y 2 RN , associated to the solution {Z(t): t  0} of problem (9), (10) exists and satisfies the Fokker–Planck equation, (see 8, p. 149]) that is, given x 2 RN , s  0 one has: @p C Lˇ;˛ (p) D 0; @t

y 2 R N ; t > s;

lim p(s; x; t; y) D ı(x  y);

t!s;t>s

When T can be written as: Z p(s; x; t; y) dy T(s; x; t; A) D

(7)

A

for every 0  s  t, x 2 ˘ , A 2  0 then the function p is called the transition probability density of the process {X(t): t  0}. Finally, if there exists a density distribution function

that depends only on x 2 ˘ such that:

(x) D lim p(s; u; t; x); t!C1

(8)

then is called the steady-state distribution density of the process {X(t): t  0}. One considers the following stochastic differential equation: dZ(t) D ˛(Z(t); t) dt C ˇ(Z(t); t) dw(t); t > 0; Z(0) D x0 ;

(9) (10)

where w is the N-dimensional Brownian motion, ˛ is the drift coefficient and ˇ is the diffusion coefficient, see [8, p. 98] or [8, p. 196] for a detailed discussion. One notes that dw cannot be considered as a differential in the elementary sense and must be understood as a stochastic differential, see [8, p. 59]. Under regularity assumptions on ˛ and ˇ there exists a unique solution {Z(t): t > 0} of (9), (10), see [8, p. 98]. When ˛ is minus the gradient of a potential function equation (9) is called the SmoluchowskiKramers equation. The Smoluchowski-Kramers equation is a singular limit of the Langevin equation.

y 2 RN :

(11) (12)

For the treatment of the constrained global optimization, that is, problem (1) with D  RN , a stochastic process depending on the domain D must be considered. Let (x)  SN1 be the set-valued function that gives the outward unit normals of the boundary @D of D at the point x 2 @D. One notes that when x is a regular point of @D, (x) is a singleton. Let : [0, T] ! RN , with possibly [0, T] = R+ , let ||(t) be the total variation of  in the interval [0, t], where t < T. The Skorokhod problem is defined as follows: let , , : [0, T] ! RN , then the triple (, , ) satisfies the Skorokhod problem, on [0, T] with respect to D, if ||(T) < +1, (0) = (0) and for t 2 [0, T] the following relations hold: (t) D

(t) C (t);

(13)

(t) 2 D;

(14)

Z

t

fr2R :

j j (t) D Z

0

(t) D 

(r)2@Dg (s)

d j j (s);

(15)

t

(s) d j j (s);

(16)

0

where S is the characteristic function of the set S and (s) 2 ((s)), when s 2 [0, T] and (s) 2 @D and  (s) = O elsewhere. Viewing (t), t 2 [0, T], as the trajectory of a point A 2 RN , one has that at time zero A is inside D, since (0) 2 D. Moreover the trajectory of A is reflected from the boundary of D and the reflected trajectory can be viewed as (t), t 2 [0, T]. That is,  is equal to until A 2 D, when A goes out of D it is brought back on @D in the normal direction to @D. One

705

706

D

Differential Equations and Global Optimization

notes that the function gives the reflection rule with respect to the boundary of D of the function . In [16] it is proved that under suitable assumptions on D and F there exists a unique solution of the Skorokhod problem. One considers the following stochastic differential equation with reflection term, that is:

where: Z C D

t>0;

Z(0) D x0 ; where (Z; Z  ; )

(19)

is the solution of the Skorokhod problem. One notes that relations (14), (19) imply that the solution of (17), (18) verifies Z(t) 2 D, t > 0. In [16] it is proved that under some hypotheses there exists a unique solution {Z(t): t  0} of (17), (18), (19) for every x0 2 D. Global Unconstrained Optimization Given problem (1) with D = RN one considers the following stochastic differential equation: dZ(t) D (rF)(Z(t)) dt C (t)dw(t); t > 0;

(20)

Z(0) D x0 ;

(21)

where {w(t): t  0} is the N-dimensional Brownian motion and (t) is a suitable decreasing function that guarantees the convergence of the stochastic process {Z(t) : t  0} to a random variable with density concentrated on the global minimizers of F. Under some assumptions on F, the transition probability density p(0, x0 , t, x), x0 , x 2 RN , t > 0, of the process {Z(t) : t  0} exists and verifies equations (11), (12); moreover, when  , > 0, for the steady-state distribution density  (x), x 2 RN , the following equation holds: x 2 RN ;

(22)

one has: 2F(x) 2

;

x 2 RN ;

> 0;

(23)

;

> 0:

(24)

2F(x0 ) 2

1 X

n

n (x) n (x0 )e  t ; (25)

nD1

(17) (18)

1 dy

p(0; x0 ; t; x) D  (x) C e

C ˇ(Z(t); t) dw(t) C d (t);

 (x) D C e 

2F(y) 2

One assumes C  < +1 for > 0. Moreover, one has:

dZ(t) D ˛(Z(t); t) dt

L;r F (  ) D 0;

RN

e

n

is the eigenfunction of L, rF correspondwhere ing to the eigenvalue n , n = 1, 2, . . . , and 0 = 0 > 1 >    . One notes that the eigenfunctions n , n = 1, 2 . . . , are appropriately normalized and  is the eigenfunction of L, rF corresponding to the eigenvalue 0 = 0. Consider N = 1, the function F smooth and with three extrema in x , x0 , x+ 2 R such that x < x0 < x+ . Moreover, F increases in (x , x0 ) and in (x+ , +1) and decreases in (1, x ) and in (x0 , x+ ). Let: 8 d2 F  ˆ  ˆ c D (x ); ˆ ˆ ˆ dx 2 ˆ ˆ ˆ ˆ ˆ < d2 F 0 (26) c0 D (x ); ˆ dx 2 ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ d2 F C ˆ C : (x ): c D dx 2 One assumes c˙ , c0 to be nonzero. In [1] it is shown that when F(x ) < F(x+ ), one has  (x) ! ı(x  x ) as ! 0 while when F(x ) = F(x+ ) one has q  (x) !  ı(x   +  x ) + (1   )ı(x  x ), where  D [1 C ccC ]1 as ! 0, and the limits are taken in distribution sense. That is in [1] it is shown that the steady-state distribution density tends to Dirac deltas concentrated on the global minimizers of F when ! 0. In [12] it is shown that: p c C c 0  22 ı F 1 e    as ! 0; (27) 2 where ı F = max{F(x0 )  F(x ), F(x0 )  F(x+ )}. Formula (25) shows that p converges to  when t ! +1, but the rate of convergence becomes slow when is small. Replacing with (t) a slowly decreasing function such that (t) ! 0 when t ! +1, using elementary adiabatic perturbation theory one can expect that

Differential Equations and Global Optimization

the condition: Z C1  2 ıF e  2 (t) dt D C1

 kC1 D  k  h k (rF)( k ) (28)

0

guarantees that {Z(t) : t > 0} is a solution of (20), (21) when t ! +1 converges to a random variable concentrated on the global minimizers of F. In [6] the following result is proved: Theorem 1 (convergence theorem) Let F : RN ! R be a twice continuously differentiable function satisfying the following properties: min F(x) D 0;

(29)

x2R N

lim

F(x) D

lim

k(rF)(x)k2  ( F)(x) > 1;

kxk!C1

kxk!C1

lim

kxk!C1

k(rF)(x)k D C1;

(30) (31)

p let (t) D (c)/(log t) for t ! +1, where c > cF > 0 and cF is a constant depending on the function F. Then the transition probability density p of the process {Z(t) : t  0}, solution of (20), (21), converges weakly to a stationary distribution , that is: p(0; x; t; ) !

when t ! C1:

(32)

Moreover the distribution is the weak limit of  , given by (23), as ! 0. One notes that (20), (21) is obtained perturbing the trajectories given by the steepest descent equation for F with the Brownian motion and  is a factor that controls the amplitude of this perturbation. The fact that (t) ! 0 when t ! +1 makes possible the stabilization of the perturbed trajectories at the minimizers of F. With the assumptions of the convergence theorem it is possible to conclude that is concentrated on the global minimizers of F, so that the random variable Z(t) = (Z1 (t), . . . , ZN (t)) ‘converges’ to x , solution of problem (1), as t ! +1. That is, when x is the unique global minimizer of F, then PfZ i (t) D x i g ! 1 when t ! +1 for i = 1, . . . , N. The stochastic differential equation (20) can be integrated numerically to obtain an algorithm for the soP lution of problem (1). Let t 0 = 0, t k = lk1 D0 hl , where hl > 0, l = 0, 1, . . . , are such that t k ! +1 when k ! 1 then using the Euler method one has:  0 D x0 ;

D

(33)

C (t k )(w(t k C h k )  w(t k )); (34) where k = 0, 1, . . . and  k 2 RN is the approximation of Z(t k ), k = 1, 2, . . . , see [2,3]. In (34) due to the presence of the stochastic term, one can substitute the gradient of F with a kind of ‘stochastic gradient’ of F in order to save computational work, see [2,3] for details. One notes that the sequence { k : k 2 N} depends on the particular realization of the Brownian motion {w(t k ) : k = 0, 1, . . . }. That is, solving several times problem (20), (21), by means of (33), (34), the solutions obtained are not necessarily the same. However, the convergence theorem states that ‘all’ the solutions { k : k 2 N} obtained by (33), (34) tend to x as k ! +1. So that in the numerical algorithm derived from (20), (21) using (33), (34) one can approximate by means of nT independent realizations (i. e., trajectories) of the stochastic process {Z(t) : t  0}, solution of (20), (21). A possible strategy for a numerical algorithm is the following: after an ‘observation period’ the various trajectories are compared, one of them is discarded and is not considered any more, another one is branched. The new set of trajectories are computed throughout the next observation period. The following stopping conditions are used:  uniform stop: the final values of the function F at the end of the various trajectories are numerically equal;  maximum trial duration: a maximum number of observation periods has been reached. One notes that the algorithms based on the discretization of the stochastic differential equations have sound mathematical basis, that is for a wide class of functions F some convergence results such as the convergence theorem given above are available. These algorithms usually have a slow convergence rate, this can be seen from the kind of function  which is required in the convergence theorem. This implies that the algorithms based on stochastic differential equations have an high computational cost, so that their use is usually restricted to low-dimensional problems. However these algorithms can be parallelized with a significant computational advantage, for example in the algorithm described above each trajectory can be computed independently from the others until the end of an observation period. One notes that the algorithms derived

707

708

D

Differential Equations and Global Optimization

from (20), (21) are in some sense similar to the simulated annealing algorithm (cf. also  Simulated annealing methods in protein folding) introduced in combinatorial optimization in [11]. Global Constrained Optimization Given problem (1) with D  RN the following stochastic differential equation with reflection term is considered:

 D  RN is a bounded convex domain such that exists p satisfying (37), (38), (39) and exists the steadystate distribution density of the process solution of (35), (36);  let  be the steady-state distribution density of the process solution of (35), (36) when  , > 0, that is:

 (x) D C e  Z

dZ(t) D (rF)(Z(t)) dt C D

C (t) dw(t) C d (t);

x 2 D;

(40)

1 dy

(41)

(35)

0

Z(0) D x ;

(36)

where x0 2 D, {w(t) : t  0} is the N-dimensional Brownian motion, (t) is a suitable decreasing function that guarantees the convergence of the stochastic process {Z(t) : t > 0} to a random variable with density concentrated on the global minimizers of F on D when t ! +1 and (t) is a suitable function to assure Z(t) 2 D, t > 0, that is, (Z, Z  , ) is the solution of the Skorokhod problem in R+ respect to D. Let int(D) be the set of the interior points of D. One assumes that D is the closure of int(D). Let p(0, x0 , t, x), x0 , x 2 int(D), t > 0, be the transition probability density of the process {Z(t): t > 0}, solution of (35), (36), when  , > 0. Then p satisfies the Fokker–Planck equation: x 2 int(D);

lim p(0; x0 ; t; x) D ı(x  x0 );

t!0C



2F(y) 2

;

D

t > 0;

@p C L;r F (p) D 0; @t

e



2F(x) 2

(37)

x 2 int(D); (38)

 2 rx p C prF; n(x) D 0; 2 x 2 @D;

t > 0; (39)

where L, rF is defined in (11) (12) and n(x) 2 (x) is the outward unit normal to @D in x 2 @D. One notes that boundary condition (39) assures that PfZ(t) 2 Dg D 1 for every t > 0. This boundary condition follows from the requirement that (Z, Z  , ) is the solution of the Skorokhod problem. One assumes the following properties of F and D:  F : D ! R is twice continuously differentiable;

and is the weak limit of  as ! 0. In analogy with the unconstrained case one can conjecture that when D  RN and F : D rarr; R psatisfy the properties listed above and when (t) D (c)/(log t) for t ! +1, where c > cF > 0 and cF is a constant depending on F, then the transition probability density p(0, x0 , t, y), x0 , x 2 D, t > 0 of the process {Z(t): t  0}, solution of (35), (36) converges to a steady-state distribution density when t ! +1 and is the distribution density obtained as weak limit of  when ! 0. That is, the process {Z(t): t  0} converges in law to a random variable concentrated at the points x 2 D that solve problem (1). A numerical algorithm to solve problem (1), with D  RN , can be obtained using a numerical method to integrate problem (35), (36). This is done integrating numerically problem (20), (21) and ‘adding’ the constraints given by D. In the numerical algorithm the trajectories can be computed using formulas (33), (34) when the trajectories are in D, when a trajectory violates the constraints, it is brought back on @D putting to zero its normal component with respect to the violated constraints. Finally the stopping conditions are the same ones considered in the previous section. Analogously to the unconstrained problem, the algorithms based on the stochastic differential equations for the constrained case have slow convergence rate. However these algorithms have a high rate of parallelism. Miscellaneous Results In this section are shown two mathematical problems that are somewhat unusual as optimization problems.

Differential Equations and Global Optimization

Clique Problem Let I = {1, . . . , N}  N be a finite set, let I I be the set of unordered pairs of elements of I. Let E I I. Then a graph G is a pair G = (I, E), where I is the set of the nodes of G and E is the set of the edges of G, i. e. {i, j} 2 E implies that G has an edge joining nodes i, j 2 I. A graph G = (I, E) is said to be complete or to be a clique when E = I I. A graph G0 = (I 0 , E0 ) is a subgraph of G = (I, E) when I 0 I and E0 E \ (I 0 I 0 ). The maximum clique problem can be defined as follows: Given G = (I, E), find the largest subgraph G0 of G which is complete. Let k(G) be the number of nodes of the graph G0 . Several algorithms exist to obtain a numerical solution of the maximum clique problem see, for example, [14] where the branch and bound algorithm is described. One considers here the maximum clique problem as a continuous optimization problem. The adjacency matrix A of the graph G = (I, E) is a square matrix of order equal to the number of nodes of G and its generic entry Ai, j , at row i and at column j, is defined equal to 1 if {i, j} 2 E and is equal to 0 otherwise. Then in [13] it is shown that: 1 (42) D max x t Ax; 1 x2S k(G) where

D

sivariational inequality problem, is defined as follows: Find a vector x 2 ˝(x ) such that:    F(x ); y  x  0;

8y 2 ˝(x );

(43)

see [4] for a detailed introduction to quasivariational inequalities. This problem can be reduced to the search of a fixed-point of a function defined implicitly by a variational inequality. The quasivariational inequalities have many applications such as for example the study of the generalized Nash equilibrium points of an N-player noncooperative game. See [10] for a detailed discussion on N-player noncooperative games. See also  ˛BB Algorithm  Continuous Global Optimization: Applications  Continuous Global Optimization: Models, Algorithms and Software  DIRECT Global Optimization Algorithm  Global Optimization Based on Statistical Models  Global Optimization in Binary Star Astronomy  Global Optimization Methods for Systems of Nonlinear Equations  Global Optimization Using Space Filling  Topology of Global Optimization

( References

S D x D (x1 ; : : : ; x N ) t 2 R N : N X

) x i D 1; x i  0; i D 1; : : : ; N :

iD1

One notes that many maximizers of (42) can exist, however there exists always a maximizer x = (x1 , . . . , xN )t of problem (42) such that for i = 1, . . . , N one has xi = 1/k(G) if i 2 G0 and xi = 0 if i 62 G0 . That is the maximum clique problem is reduced to a continuous global optimization problem that can be treated with the algorithms described above. Several other problems in graph theory can be reformulated as continuous optimization problems. Quasivariational Inequalities Let X  RN be a nonempty set, let ˝(x)  X, x 2 X, be a set-valued function and let F : RN ! RN . The qua-

1. Aluffi-Pentini F, Parisi V, Zirilli F (1985) Global optimization and stochastic differential equations. J Optim Th Appl, 47:1–17 2. Aluffi-Pentini F, Parisi V, Zirilli F (1988) A global optimization algorithm using stochastic differential equations. ACM Trans Math Software, 14:345–365 3. Aluffi-Pentini F, Parisi V, Zirilli F (1988) SIGMA - A stochastic integration global minimization algorithm. ACM Trans Math Softw 14:366–380 4. Baiocchi C, Capelo A (1984) Variational and quasi-variational inequalities: Application to Free-boundary problems. Wiley, New York 5. Billingsley P (1995) Probability and measure. Wiley, New York 6. Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in Rn. SIAM J Control Optim 25:737–753 7. Dantzing GB (1963) Linear programming and extensions. Princeton Univ Press, Princeton 8. Friedman A (1975) Stochastic differential equations and applications, vol 1. Acad Press, New York

709

710

D

Dini and Hadamard Derivatives in Optimization

9. Gill PE, Murray W, Wright MH (1981) Practical optimization. Acad Press, New York 10. Harker P, Pang J (1990) Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Math Program 48:161–220 11. Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680 12. Matkowsky BJ, Schuss Z (1981) Eigenvalues of the FokkerPlanck operator and the equilibrium for the diffusions in potential fields. SIAM J Appl Math 40:242–254 13. Motzkin TS, Straus EG (1964) Maxima for graphs and a new proof of a theorem of Turán. Notices Amer Math Soc 11:533–540 14. Pardalos PM, Rodgers GP (1990) Computational aspects of a branch and bound algorithm for quadratic zero-one programming. Computing 45:131–144 15. Schuss Z (1980) Theory and applications of stochastic differential equations. Wiley, New York 16. Tanaka H (1979) Stochastic differential equations with reflecting boundary conditions in convex regions. Hiroshima Math J 9:163–177

Dini and Hadamard Derivatives in Optimization VLADIMIR F. DEMYANOV St. Petersburg State University, St. Petersburg, Russia MSC2000: 90Cxx, 65K05 Article Outline Keywords Directional Derivatives First Order Necessary and Sufficient Conditions for an Unconstrained Optimum Conditions for a Constrained Optimum See also References Keywords Dini directional derivatives; Hadamard directional derivatives; Necessary and sufficient optimality conditions; Bouligand cone; Composition theorem; Lipschitz function; Quasidifferentiable function; Concave function; Convex function; First order approximation of a function; Nondifferentiable optimization; Nonsmooth analysis; Numerical

methods; Maximizer; Maximum function; Minimum function; Minimizer; Steepest ascent direction; Steepest descent direction; Unconstrained optimum Directional Derivatives Let f be a function defined on some open set X  Rn and taking its values in R D R [ f1; C1g. The set dom f = {x 2 X : |f (x)| < + 1} is called the effective set (or domain) of the function f . Take x 2 dom f , g 2 Rn . Put  1 f (x C ˛g)  f (x) ; ˛ ˛#0   1 # f (x C ˛g)  f (x) : fD (x; g) :D lim inf ˛#0 ˛

fD" (x; g) :D lim sup

(1) (2)

Here ˛ # 0 means that ˛ ! +0. The quantity fD" (x; g) (respectively, fD# (x; g)) is called the Dini upper (respectively, lower) derivative of the function f at the point x in the direction g. The limit f 0 (x; g) D f D0 (x; g) :D lim ˛#0

 1 f (x C ˛g)  f (x) ; (3) ˛

is called the Dini derivative of f at the point x in the direction g. If the limit in (3) exists, then fD" (x; g) D fD# (x; g) D f 0 (x; g). The quantity "

f H (x; g) :D

 1 f (x C ˛g 0 )  f (x) (4) [˛;g 0 ]![C0;g] ˛ lim sup

(respectively, fH# (x; g) :D

 1 f (x C ˛g 0 )  f (x) ); [˛;g ]![C0;g] ˛ (5) lim inf 0

is called the Hadamard upper (respectively, lower) derivative of the function f at the point x in the direction g. The limit f H0 (x; g) :D

lim

[˛;g 0 ]![C0;g]

 1 f (x C ˛g 0 )  f (x) (6) ˛

is called the Hadamard derivative of f at x in the direction g. If the limit in (6) exists, then fH" (x; g) D f H# (x; g).

Dini and Hadamard Derivatives in Optimization

Note that the limits in (1), (2), (4) and (5) always exist but are not necessarily finite. Remark 1 In the one-dimensional case (Rn = R) the Hadamard directional derivatives coincide with the corresponding Dini directional derivatives:

Proposition 4 Let functions f 1 and f 2 be Dini (Hadamard) directionally differentiable at a point x 2 X. Then their sum, difference, product and quotient (if f 2 (x) 6D 0) are also Dini (Hadamard) d.d. at this point and the following formulas hold: 0 0 (x; g) ˙ f 2Q (x; g); ( f 1 ˙ f 2 )0Q (x; g) D f 1Q

f H" (x; g) D f D" (x; g); f H# (x; g) D f D# (x; g); f H0 (x; g) D f D0 (x; g):

(8)

0 0 ( f1 f 2 )0Q (x; g) D f 1 (x) f 2Q (x; g) C f2 (x) f 1Q (x; g); (9)



If the limit in (3) exists and is finite, then the function f is called differentiable (or Dini differentiable) at x in the direction g. The function f is called Dini directionally differentiable (Dini d.d.) at the point x if it is Dini differentiable at x for every g 2 Rn . Analogously, if the limit in (6) exists and is finite, the function f is called Hadamard differentiable at x in the direction g. The function f is called Hadamard directionally differentiable (Hadamard d.d.) at the point x if it is Hadamard differentiable at x for every g 2 Rn . If the limit in (6) exists and is finite, then the limit in (3) also exists and f 0H (x, g) = f 0 (x, g). The converse is not necessarily true. All these derivatives are positively homogeneous (of degree one) functions of direction: f Q (x; g) D  f Q (x; g);

D

8  0:

f1 f2

0 (x; g) D  Q

 1 0 f 1 (x) f 2Q (x; g) 2 ( f 2 (x))  0  f 2 (x) f 1Q (x; g) :

(10)

Here Q is either D, or H. These formulas follow from the classical theorems of differential calculus. Proposition 5 Let '(x) D max f i (x); i21:N

(11)

where the functions f i are defined and continuous on an open set X  Rn and Dini (Hadamard) d.d. at a point x 2 X in a direction g. Then the function ' is also Dini (Hadamard) d.d. at x and

(7)

(Here is either ", or #, and Q is either D, or H.) A function f defined on an open set X is called Dini uniformly directionally differentiable at a point x 2 X if it is directionally differentiable at x and for every " > 0 there exists a real number ˛ 0 > 0 such that  1 f (x C ˛g)  f (x)  ˛ f 0 (x; g) < "; ˛ 8˛ 2 (0; ˛0 ); 8g 2 S; where S = {g 2 Rn : kgk = 1} is the unit sphere. Proposition 2 (see [2, Thm. I.3.2]) A function f is Hadamard d.d. at a point x 2 X if and only if it is Dini uniformly differentiably at x and its directional derivative f 0 (x, g) is continuous as a function of direction. Remark 3 If f is locally Lipschitz and Dini directionally differentiable at x 2 X, then it is Hadamard d.d. at x, too. For Dini and Hadamard derivatives (see (3) and (6)) there exists a calculus:

'Q0 (x; g) D max f i0 (x; g); i2R(x)

(12)

where R(x) = {i 2 1 : N : f i (x) = '(x)} (see [2, Cor. I.3.2]). If ' is defined by '(x) D max f i (x; y); y2Y

where Y is some set, then under some additional conditions a formula, analogous to (12), also holds (see [2, Chap. I, Sec. 3]). A theorem on the differentiability of a composition can also be stated. Unfortunately, formulas similar to (8)–(10) and (12) are not valid for Dini (Hadamard) upper and lower derivatives. The Dini and Hadamard upper and lower directional derivatives are widely used in nonsmooth analysis and nondifferentiable optimization. For example, the following mean value theorem holds.

711

712

D

Dini and Hadamard Derivatives in Optimization

Proposition 6 (see [2, Thm. I.3.1]) Let f be defined and continuous on the interval {y: y = x + ˛g, ˛ 2 [0, ˛ 0 ], ˛ 0 > 0}. Put mD

inf

˛2[0;˛0 ]

f D# (x C ˛g; g); "

M D sup f D (x C ˛g; g):

First Order Necessary and Sufficient Conditions for an Unconstrained Optimum Let a function f be defined on an open set X  Rn , ˝ be a subset of X. A point x 2 ˝ is called a local minimum point (local minimizer) of the function f on the set ˝ if there exists ı > 0 such that

˛2[0;˛0 ]

f (x)  f (x  );

Then [1] m˛0  f (x C ˛0 g)  f (x)  M˛0 : The following first order approximations may be constructed via the Dini and Hadamard derivatives. Proposition 7 Let f be defined on an open set X  Rn , and Dini d.d. at a point x 2 X. Then f (x C ) D f (x) C fD0 (x; ) C oD (x; ):

(13)

If f is Hadamard d.d. at x, then f (x C ) D f (x) C fH0 (x; ) C oH (x; ):

(14)

Let f be defined on an open set X  Rn and finite at x 2 X. Then

where Bı (x ) = {x 2 Rn : k x  x k  ı}. If ı = +1, then the point x is called a global minimum point (global minimizer) of f on ˝. A point x 2 ˝ is called a strict local minimum point (strict local minimizer) of f on ˝ if there exists ı > 0 such that f (x) > f (x  );

x ¤ x:

Proposition 8 Let a function f be Dini (Hadamard) directionally differentiable on X. For a point x 2 dom f to be a local or global minimizer of f on X it is necessary that

(15)

f (x C ) D f (x) C fD# (x; ) C oD (x; );

(16)

f (x C ) D f (x) C fH" (x; ) C oH (x; );

(17)

f D0 (x  ; g)  0;

f (x C ) D f (x) C fH# (x; ) C oH (x; );

(18)



oD (x; ˛ ) ˛#0 ! 0; 8 2 Rn ; ˛ oH (x; ˛ ) kk!0 ! 0; k k oD (x; ˛ ) D 0; 8 2 Rn ; lim sup ˛ ˛#0 o (x; ˛ ) D 0; 8 2 Rn ; lim inf D ˛#0 ˛ oH (x; ˛ 0 )  0; 8 2 Rn ; lim sup ˛ [˛;0 ]![C0;] oH (x; ˛ 0 )  0; 8 2 Rn : lim inf 0 [˛; ]![C0;] ˛

8x 2 ˝ \ Bı (x  );

Analogously one can define local, global and strict local maximum points (maximizers) of f on ˝. It may happen that the set of local (global, strict local) minimizers (maximizers) is empty. If ˝ = X then the problem of finding a minimum or a maximum of f on X is called an unconstrained optimization problem.

f (x C ) D f (x) C fD" (x; ) C oD (x; );

where

8x 2 ˝ \ Bı (x  );

fH0 (x  ; g)  0

8g 2 Rn ;

(25)

 8g 2 Rn :

(26)

If f is Hadamard d.d. at x and (19)

f H0 (x  ; g) > 0;

8g 2 Rn ;

g ¤ 0n ;

(27)

(20)

then x is a strict local minimizer of f .

(21)

Here 0n = (0, . . . , 0) is the zero element of Rn .

(22)

Proposition 9 Let f be Dini (Hadamard) d.d. on X. For a point x  2 dom f to be a local or global maximizer of f on X it is necessary that

(23) (24)

f D0 (x  ; g)  0; ( f H0 (x  ; g)  0;

8g 2 Rn ; 8g 2 Rn ):

(28) (29)

Dini and Hadamard Derivatives in Optimization

If f is Hadamard d.d. at x  and f H0 (x  ; g) < 0;

8g 2 Rn ;

g ¤ 0n ;

(30)

then x is a strict local maximizer of f . Note that (26) implies (25), and (29) implies (28). In the smooth case f 0H (x, g) = (f 0 (x), g) (f 0 (x) being the gradient of f at x) and the conditions (27) and (30) are impossible. It means that the sufficient conditions (27) and (30) are essentially nonsmooth. Proposition 10 Let f be defined on an open set on X  Rn . For a point x 2 dom f (i. e., |f (x)| < +1) to be a local or global minimizer of f on X it is necessary that #

8g 2 Rn ;

(31)

#

8g 2 Rn :

(32)

f D (x  ; g)  0; f H (x  ; g)  0;

#

f H (x  ; g) > 0;

8g 2 Rn ;

g ¤ 0n ;

(33)

then x is a strict local minimizer of f . Note that (32) implies (31) but (31) does not necessarily imply (32). Proposition Let f be defined on an open set on X  Rn . For a point x  2 dom f to be a local or global maximizer of f on X it is necessary that f D" (x  ; g)  0;

8g 2 Rn

(34)

and g)  0;

n

8g 2 R :

(35)

If "

point. A point x satisfying the conditions (28) or (34) is called a Dini sup-stationary point of f , while a point x satisfying (28) or (35) is called an Hadamard supstationary point. Remark 13 Note that the function f is not assumed to be continuous or even finite-valued. Let x0 2 dom f and assume that the condition (31) does not hold, i. e. x0 is not a Dini inf-stationary point. If g 0 2 Rn , kg 0 k = 1, #

#

f D (x0 ; g0 ) D inf f D (x0 ; g); k g kD1 then g 0 is called a Dini steepest descent direction of f at x0 (kgk is the Euclidean norm). If (32) does not hold and if g 0 2 Rn , kg 0 k = 1, #

#

f H (x0 ; g0 ) D inf f H (x0 ; g); k g kD1

If

" fH (x  ;

D

f H (x  ; g) < 0;

8g 2 Rn ;

g ¤ 0n ;

(36)

then x is a strict local maximizer of f . The condition (35) implies (34) but (34) does not necessarily imply (35). Remark 12 Observe that the conditions for a minimum are different from the conditions for a maximum. A point x satisfying the conditions (25) or (31) is called a Dini inf-stationary point of f , while a point x satisfying (26) or (32) is called an Hadamard inf-stationary

then g 0 is called an Hadamard steepest descent direction of f at x0 . Analogously if x0 is not a Dini sup-stationary point and if g 0 2 Rn , kg 0 k = 1, f D" (x0 ; g 0 ) D sup f D" (x0 ; g); k g kD1 then g 0 is called a Dini steepest ascent direction of f at x0 . If x0 is not an Hadamard sup-stationary point of f (i. e. (35) does not hold) and if g 0 2 Rn , k g 0 k = 1, f H" (x0 ; g 0 ) D sup f H" (x0 ; g); k g kD1 then g 0 is called an Hadamard steepest ascent direction of f at x0 . Of course it is possible that there exist many steepest descent or/and steepest ascent directions of f at x0 . It may also happen that some direction is a direction of steepest ascent and, at the same time, a direction of steepest ascent as well (which is impossible in the smooth case). Example 14 Let X = R, ( f (x) D

jxj C 12 x sin x1 ;

x ¤ 0;

0;

x D 0:

713

714

D

Dini and Hadamard Derivatives in Optimization

Dini and Hadamard Derivatives in Optimization, Figure 1 Dini and Hadamard Derivatives in Optimization, Figure 2

Take x0 = 0. It is clear that (see Fig. 1): 1 3 jgj D jgj ; 2 2 1 1 # f D (x0 ; g) D jgj  jgj D jgj : 2 2

f D" (x0 ; g) D jgj C

As X = R, the Hadamard derivatives coincide with the Dini ones (see Remark 1). fD# (x0 ; g) > 0;

8g ¤ 0;

we may conclude (see (32)) that x0 is a strict local minimizer (in fact it is a global minimizer but our theory does not allow us to claim this). Note that f D" and f D# are positively homogeneous (see (7)), therefore it is sufficient to consider (in R) only two directions: g 1 = 1 and g 2 = 1. Example 15 Let X = R, x0 = 0, ( x sin x1 ; x > 0; f (x) D 0; x  0: It is clear that (see Fig. 2) that ( jgj ; g > 0; " f D (x0 ; g) D 0; g  0; (  jgj ; g > 0; # fD (x0 ; g) D 0; g  0: Neither the condition (25) nor the condition (31) holds, therefore we conclude that x0 is neither a local

minimizer nor a local maximizer. Since max f D" (x0 ; g) k g kD1 D maxf f D" (x0 ; C1); f D" (x0 ; 1)g D maxf1; 0g D fD" (x0 ; C1) D C1; then g 1 = +1 is a steepest ascent direction. Since min f D# (x0 ; g) k g kD1 #

#

D minf f D (x0 ; C1); f D (x0 ; 1)g D minf1; 0g D f D# (x0 ; C1) D 1; then g 1 = +1 is a steepest descent direction as well. Conditions for a Constrained Optimum Let a function f be defined on an open set X  Rn , ˝ be a subset of X. Let x 2 ˝, |f (x)| < +1, g 2 Rn . The limit "

f D (x; g; ˝) D lim sup

˛#0 xC˛ g2˝

f (x C ˛g)  f (x) ˛

(37)

is called the Dini conditional upper derivative of the function f at the point x in the direction g with respect to ˝. If no sequence {˛ k } exists such that ˛ k # 0, x + ˛ k g 2 ˝ for all k, then, by definition, we set f D" (x; g; ˝) D 1.

D

Dini and Hadamard Derivatives in Optimization

The limit f (x C ˛g)  f (x) f D# (x; g; ˝) D lim inf ˛#0 ˛

(38)

xC˛ g2˝

g; ˝) D

lim sup [˛;g 0 ]![C0;g] xC˛ g 0 2˝

f (x C ˛g 0 )  f (x) (39) ˛

is called the Hadamard conditional upper derivative of the function f at the point x in the direction g with respect to ˝. If no sequences {˛ k }, {g k } exist such that [˛ k , g k ] ! [+0, g], x + ˛ k g k 2 ˝ for all k, then, by definition, we set fH" (x; g; ˝) D 1. The limit f H# (x;

g; ˝) D

lim inf

[˛;g 0 ]![C0;g] xC˛ g 0 2˝

f (x C ˛g 0 )  f (x) (40) ˛

Proposition 16 (see [1]) For a point x 2 ˝ and such that |f (x )| < 1 to be a local or global minimizer of f on ˝ it is necessary that #

f H# (x  ;

g; ˝)  0;

8g 2 Rn ;

(41)

n

8g 2 R :

(42)

Furthermore, if fH# (x  ; g; ˝) > 0;

8g 2 Rn ;

g ¤ 0n ;

(43)

then x is a strict local minimizer of f on ˝. A point x 2 ˝ satisfying (41) ((42)) is called a Dini (Hadamard) inf-stationary point of f on ˝. 

Proposition 17 For a point x 2 ˝ and such that |f (x )| < 1 to be a local or global minimizer of f on ˝ it is necessary that fD" (x  ;

g; ˝)  0;

"

8g 2 Rn ;

n

8g 2 R ;

(44)

(45)

If g ¤ 0;

(46)

then x is a strict local maximizer of f on ˝. A point x 2 ˝ satisfying (44) ((45)) is called a Dini (Hadamard) sup-stationary point of f on ˝. The condition (41) is equivalent to #

f D (x  ; g; ˝)  0;

8g 2 K(x  ; ˝);

(47)

where 9 ˛ k # 0; = K(x  ; ˝) D g 2 Rn : 9˛ k : x  C ˛ k g 2 ˝; : ; : 8k 8
b i . Two arcs A i and A j intersect iff sp(A i ) \ sp(A j ) ¤ ;. Note that one can view the arc coloring problem as a path coloring problem on a cycle. If a number is contained in the span of more than k arcs, then the arcs can surely not be colored with k colors and the answer to this instance of arc coloring is no. Otherwise, one can assume that every number i, 1  i  2n is contained in the span of exactly k arcs; if this were not the case, one could simply add arcs of the form (i; i C 1) until the condition holds, without changing the answer of the coloring problem. Now consider a chain of n vertices v1 , v2 ; : : : ; v n . Imagine the chain drawn from left to right, with v1 the start vertex at its left end. The directed edges from left to right followed by the directed edges from right to left make up a cycle of length 2n. The given circular arcs can be translated into directed paths on this cycle such that two paths share a directed edge iff the corre-

Directed Tree Networks

Directed Tree Networks, Figure 3 Reduction from arc coloring

D

of the arcs can be turned into a k-coloring of the paths by assigning all paths corresponding to an arc the same color as the arc and by coloring the blockers with the remaining k  1 colors. This shows that the decision version of the path coloring problem is NP-complete already for binary trees. Approximation Algorithms

sponding arcs intersect, but these paths do not yet constitute a valid path coloring problem because some of the paths are not simple: an arc (1; 2n) would correspond to a path running from v1 to v n and back to v1 , for example. Nevertheless, it is possible to obtain a valid instance of the path coloring problem by splitting paths that are not simple into two or three simple paths and by using blockers to make sure that the paths derived from one non-simple path must receive the same color in any valid k-coloring. For this purpose, extend the chain by adding k vertices on both ends, resulting in a chain of length n C 2k. Connect each of the newly added vertices to a distinct subtree consisting of a new vertex with two leaf children. The resulting network is a binary tree T. If a path arrives at vertex v n coming from the left (i. e., from v n1 ) and “turns around” to revisit v n1 , divide the path into two: one coming from the left, passing through v n and ending at the left leaf of one of the subtrees added on the right side of the chain; the other one starting at the right leaf of that subtree, passing through v n and continuing left. In addition, add k  1 blockers in that subtree, i. e., paths from the right leaf to the left leaf. Observe that there are no more than k paths containing v n as an inner vertex, and a different subtree can be chosen for each of these paths. A symmetric splitting procedure is applied to the paths that contain v1 as an inner vertex, i. e., the paths that arrive at v1 coming from the right (i. e., from v2 ) and “turn around” to revisit v2 . This way, all non-simple paths are split into two or three simple paths, and a number of blockers are added. The resulting set of paths in T can be colored with k colors if and only if the original arc coloring instance is a yes-instance. The blockers ensure that all paths corresponding to the same arc receive the same color in any k-coloring. Hence, a k-coloring of the paths can be used to obtain a k-coloring of the arcs by assigning each arc the color of its corresponding paths. Also, a k-coloring

Since the path coloring problem in directed tree networks is NP-hard, one is interested in polynomial-time approximation algorithms with provable performance guarantee. All such approximation algorithms that have been developed so far belong to the class of greedy algorithms. A greedy algorithm picks a start vertex s in the tree T and assigns colors to the paths touching (starting at, ending at, or passing through) s first. Then it visits the remaining vertices of the tree in some order that ensures that the current vertex is adjacent to a previously visited vertex; for example, a depth-first search can be used to obtain such an order. When the algorithm processes vertex v, it assigns colors to all paths touching v without changing the color of paths that have been colored at a previous vertex. Each such step is referred to as coloring extension. Furthermore, the only information about the paths touching the current vertex that the algorithm considers is which edges incident to the current vertex they use. To emphasize this latter property, greedy algorithms are sometimes referred to as local greedy algorithms. Whereas all greedy algorithms follow this general strategy, individual variants differ with respect to the solution to the coloring extension substep. The best known algorithm was presented by T. Erlebach, K. Jansen, C. Kaklamanis, and P. Persiano in [11,16] (see also [10]). It colors a set of paths with maximum load L in a directed tree network of arbitrary degree with at most d5L/3e colors. In the next section this will be shown to be best possible in the class of greedy algorithms. For the sake of clarity, assume that the load on all edges is exactly L and that L is divisible by 3. The algorithm maintains two invariants: (a) the number of colors used is at most 5L/3, and (b) for each pair of directed edges with opposite directions the number of colors used to color paths going through either of these edges is at most 4L/3. First, the algorithm picks a leaf s


of T as the start vertex and colors all paths starting or ending at s using at most L colors. Therefore, the invariants are satisfied initially. It remains to show that they still hold after a coloring extension step if they were satisfied at the beginning of this step.

Reduction to Constrained Bipartite Edge Coloring

The coloring extension problem at a current vertex v is reduced to a constrained edge coloring problem in a bipartite graph G_v with left vertex set V_1 and right vertex set V_2. This reduction was introduced by M. Mihail, C. Kaklamanis and S. Rao in [19]. Let n_0, n_1, …, n_k be the neighbors of v in T, and let n_0 be the unique neighbor that was processed before v. For every neighbor n_i of v the graph G_v contains four vertices: vertices w_i and z_i in V_1, and vertices x_i and y_i in V_2. Vertex w_i is said to be opposite x_i, and z_i is opposite y_i. A pair of opposite vertices is called a line of G_v. A line sees a color if it appears on an edge incident to a vertex of that line. For every path touching v there is one edge in G_v: an edge (w_i, x_j) for each path coming from n_i, passing through v and going to n_j; an edge (w_i, y_i) for each path coming from n_i and ending at v; and an edge (z_i, x_i) for each path starting at v and going to n_i. It is easy to see that coloring the paths touching v is equivalent to coloring the edges of G_v. Note that the vertices w_i and x_i have degree L in G_v, while the other vertices may have smaller degree. If this is the case, the algorithm adds dummy edges (shown dashed in Fig. 4) in order to make the graph L-regular. As the paths that contain the edges (n_0, v) or (v, n_0) have been colored at a previous vertex, the edges incident to w_0 and x_0 are already colored with at most 4L/3 colors by invariant (b). These edges are called color-forced edges.

Directed Tree Networks, Figure 4 Construction of the bipartite graph

A color that appears on exactly one color-forced edge is a single color. A color that appears on two color-forced edges is a double color. Since there are at most 4L/3 colors on 2L color-forced edges, there must be at least 2L/3 double colors. Furthermore, one can assume that there are exactly 2L/3 double colors and 2L/3 single colors, because if there are too many double colors then it is possible to split an appropriate number of double colors into two single colors for the duration of the current coloring extension step. In order to maintain invariant (a), the algorithm must color the uncolored edges of G_v using at most L/3 new colors (colors not used on the color-forced edges). Invariant (b) is satisfied by ensuring that no line of G_v sees more than 4L/3 colors.

Partition Into Matchings

G_v is an L-regular bipartite graph and its edges can thus be partitioned into L perfect matchings efficiently. Each matching is classified according to the colors on its two color-forced edges: SS-matchings contain two single colors, ST-matchings contain one single color and one double color, PP-matchings contain the same (preserved) double color on both color-forced edges, and TT-matchings contain two different double colors. Next, the L matchings are grouped into chains and cycles: a chain of length ℓ ≥ 2 is a sequence of ℓ matchings M_1, …, M_ℓ such that M_1 and M_ℓ are ST-matchings, M_2, …, M_{ℓ−1} are TT-matchings, and two consecutive matchings share a double color; a cycle of length ℓ ≥ 2 is a sequence of ℓ TT-matchings such that consecutive matchings as well as the first and the last matching share a double color. Obviously, the set of L matchings is in this way entirely partitioned into SS-matchings, chains, cycles, and PP-matchings. In addition, if a chain or cycle contains parallel color-forced edges, then the algorithm exchanges these edges in the respective matchings, thus dividing the original chain or cycle into a shorter sequence of the same type and an extra cycle. Now the algorithm chooses triplets, i.e., groups of three matchings, and colors the uncolored edges of each triplet using at most one new color and at most four active colors. The active colors are selected among the colors on color-forced edges of that triplet, and a color is active in at most one triplet. The algorithm ensures


that a line that sees the new color does not see one of the active colors of that triplet. This implies that no line of G_v sees more than 4L/3 colors altogether, as required to maintain invariant (b).

Coloring of Triplets

The rules for choosing triplets ensure that each triplet contains two color-forced edges with single colors and four color-forced edges with double colors. Furthermore, most triplets are chosen such that one double color appears twice, and this double color as well as the two single colors can be reused without considering conflicts outside the triplet. V. Kumar and E.J. Schwabe proved in [18] that such triplets can be colored as required using three active colors and one new color. This coloring procedure can be sketched as follows. Partition the edges of the triplet into a matching on all vertices except w_0 and x_0 and a gadget, i.e., a subgraph in which w_0 and x_0 have degree 3 while all other vertices have degree 2. A gadget consists of a number of cycles of even length not containing w_0 or x_0 and either three disjoint paths from w_0 to x_0 or one path from w_0 to x_0, one path from w_0 to w_0, and one path from x_0 to x_0. A careful case analysis shows that the triplet can be colored by reusing the single colors and the double color to color the gadget and using a new color for the matching. If a partitioning into gadget and matching does not exist, the triplet contains a PP-matching and can be colored using the double color of the PP-matching for the uncolored edges of the PP-matching and a single color and a new color for the uncolored edges of the cycle cover consisting of the other two matchings. In the following, the terms even sequence and odd sequence refer to sequences of TT-matchings of even or odd length, respectively, such that consecutive matchings share a double color. Note that an even sequence can be grouped into triplets by combining two consecutive matchings of the sequence with an SS-matching as long as SS-matchings are available and combining each remaining TT-matching with a chain of length 2. There are always enough SS-matchings or chains of length 2 because the ratio between color-forced edges with double colors and color-forced edges with single colors is 2 : 1 in G_v initially and remains the same after extracting triplets. Similarly, an odd sequence can be grouped into triplets if there is at least one chain of length 2,


which can be used to form a triplet with the first matching of the sequence, leaving an even sequence behind.

Selection of Triplets

Now the rules for selecting triplets are as follows. From chains of odd length, combine the first two matchings and the last matching to form a triplet. The remainder of the chain (if non-empty) is an even sequence and can be handled as described above. Cycles of even length are even sequences and can be handled the same way. As long as there is a chain of length 2 left, chains of even length ≥ 4 and odd cycles can be handled, too. Pairs of PP-matchings can be combined with an SS-matching; single PP-matchings can be combined with chains of length 2. If there are two chains of even length ≥ 4, combine the first two matchings of one chain with the last matching of the other and the last two matchings of the first chain with the first matching of the other, leaving two even sequences behind. So far, all triplets contained a double color twice and could be colored as outlined above. What remains is a number of cycles of odd length, at most one chain of even length, at most one PP-matching, and some SS-matchings. To deal with these, it is necessary to form some triplets that contain four distinct double colors. However, it is possible to ensure that the set of color-forced edges of G_v (inside and outside the triplet) colored with one of these double colors does not contain parallel edges; T. Erlebach, K. Jansen, C. Kaklamanis and P. Persiano showed in [11] that such a triplet can be colored as required using its single colors, two of its double colors, and one new color. In the end, the entire graph G_v has been partitioned into triplets, and each triplet has been colored using at most one new color and such that a line that sees a new color in a triplet does not see one of the active colors of that triplet. Hence, invariants (a) and (b) hold at the end of the coloring extension step, and once the coloring extension step has been performed for all vertices of T, all paths have received one of ⌈5L/3⌉ colors. Since the number OPT of colors necessary in an optimal coloring is at least L, this implies that the algorithm uses at most ⌈5OPT/3⌉ colors to color the paths. From the lower bound in the next section it will be clear that the algorithm (and any other greedy algorithm) is not better than 5OPT/3 in the worst case.
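The classification of matchings that underlies these grouping rules is mechanical. As a small illustration, the sketch below labels each matching from the colors on its two color-forced edges; the input format (one color pair per matching) and the function name are assumptions of this sketch, not part of the published algorithm.

from collections import Counter

def classify_matchings(forced_colors):
    """Label each matching as SS, ST, PP, or TT.

    forced_colors: one (c1, c2) pair per matching, giving the colors of
    its two color-forced edges (the edges incident to w_0 and x_0).
    A color on exactly one forced edge overall is 'single'; a color on
    two forced edges is 'double'.
    """
    occurrences = Counter(c for pair in forced_colors for c in pair)
    labels = []
    for c1, c2 in forced_colors:
        singles = sum(1 for c in (c1, c2) if occurrences[c] == 1)
        if singles == 2:
            labels.append("SS")        # two single colors
        elif singles == 1:
            labels.append("ST")        # one single, one double
        elif c1 == c2:
            labels.append("PP")        # the same double color twice
        else:
            labels.append("TT")        # two different double colors
    return labels

Grouping the ST- and TT-matchings into chains and cycles then amounts to walking the shared double colors; that bookkeeping is omitted here.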


Note that greedy algorithms are well suited for practical distributed implementation in optical networks: one node of the network initiates the wavelength assignment by assigning wavelengths to all connections going through that node; then it transfers control to its neighbors, which can extend the assignment independently and in parallel, transferring control to their neighbors in turn once they are done. It should be mentioned that simpler variants of greedy algorithms are known that are restricted to binary trees and color a given set of paths with load L using ⌈5L/3⌉ colors. These algorithms do not make use of the reduction to constrained bipartite edge coloring [6,15].
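The greedy framework itself is short enough to sketch in code. The skeleton below visits the tree so that each vertex is adjacent to an earlier one and extends the coloring at each vertex; for brevity it uses plain first-fit in the coloring-extension substep rather than the triplet machinery of the ⌈5L/3⌉ algorithm, and the data layout is an assumption of this sketch.

def greedy_path_coloring(tree_adj, paths, start):
    """Skeleton of a local greedy (coloring-extension) algorithm.

    tree_adj: dict mapping each vertex to its list of neighbors in T.
    paths: dict mapping a path id to the set of directed edges (u, v)
    the path uses. Two paths conflict if they share a directed edge.
    """
    color = {}
    stack, visited = [start], {start}
    while stack:
        v = stack.pop()                      # depth-first visiting order
        touching = [p for p, edges in paths.items()
                    if any(v in edge for edge in edges)]
        for p in touching:
            if p in color:
                continue                     # colored at an earlier vertex
            # first-fit: smallest color not used by any conflicting path
            used = {color[q] for q in color if paths[p] & paths[q]}
            c = 0
            while c in used:
                c += 1
            color[p] = c
        for w in tree_adj[v]:
            if w not in visited:
                visited.add(w)
                stack.append(w)
    return color

First-fit already yields a proper coloring, but only the careful extension step described above achieves the ⌈5L/3⌉ guarantee.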

Directed Tree Networks, Figure 5 Lower bound for greedy algorithms

Lower Bounds

Two kinds of lower bounds have been investigated for path coloring in directed tree networks. First, one wants to determine the best worst-case performance guarantee achievable by any greedy algorithm. Second, it is interesting to know how many colors are required even in an optimal coloring for a given set of paths with load L in the worst case.

Lower Bound for Greedy Algorithms

For a given local greedy algorithm A and positive integer L, an adversary can construct an instance of path coloring in a directed binary tree network such that A uses at least ⌊5L/3⌋ colors while an optimal solution uses only L colors [15]. The construction proceeds inductively. As A considers only the edges incident to a vertex v when it colors the paths touching v, the adversary can determine how these paths should continue and introduce new paths not touching v depending on the coloring A produces at vertex v. Assume that there are α_i L/2 paths going through each of the directed edges between vertex v and its parent, and that these paths have been colored with α_i L different colors. Initially, this assumption can be satisfied for α_0 = 1 by introducing L paths in either direction on the link between the start vertex picked by algorithm A and one of its neighbors and letting appropriately chosen L/2 of these paths start or end, respectively, at that neighbor. Denote the set of paths coming down from the parent by P_d and let them continue to (pass through) the left child v_1 of v. Denote the set of paths

going up to the parent by P_u and let them pass through the right child v_2 of v. Introduce a set P_ℓ of (1 − α_i/2)L paths coming from v_2 and going left to v_1, and a set P_r of L paths coming from v_1 and going right to v_2. Algorithm A must use (1 − α_i/2)L new colors to color the paths in P_ℓ. No matter which colors it chooses for the paths in P_r, it will use at least (1 + α_i/4)L different colors on the connection between v and v_1 or on the connection between v and v_2. The best it can do with respect to minimizing the number of colors appearing between v and v_1 and between v and v_2 is to color (1 − α_i/2)L paths of P_r with colors used for P_ℓ, α_i L/4 paths of P_r with colors used for P_d, and α_i L/4 paths of P_r with colors used for P_u. In that case, it uses (1 + α_i/4)L colors on each of the downward connections of v. Any other assignment uses more colors on one of the downward connections. If the algorithm uses at least (1 + α_i/4)L different colors for paths on, say, the connection between v and v_1, let (1 + α_i/4)L/2 of the downward paths and equally many of the upward paths extend to the left child of v_1, such that all of these paths use different colors, and let the remaining paths terminate or begin at v_1. Now the inductive assumption holds for the left child of v_1 with α_{i+1} = 1 + α_i/4. Hence, the number of colors on a pair of directed edges can be increased as long as α_i < 4/3. When α_i = 4/3, 4L/3 colors are used for the paths touching v and its parent, and algorithm A must use L/3 new colors to color the paths in P_ℓ, using 5L/3 colors altogether.
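The recurrence α_{i+1} = 1 + α_i/4 drives the whole construction: its fixed point is 4/3, and the gap to the fixed point shrinks by a factor of 4 per level. A few lines of code make the convergence concrete (this numeric check is our illustration, not part of [15]).

def adversary_alpha(tol=1e-12):
    """Iterate alpha_{i+1} = 1 + alpha_i / 4 from alpha_0 = 1.

    At the fixed point 4/3, the adversary has forced 4L/3 colors on a
    parent edge pair, and the final set of new paths forces L/3 further
    colors, i.e., 5L/3 in total.
    """
    alpha, steps = 1.0, 0
    while abs(alpha - 4.0 / 3.0) > tol:
        alpha = 1.0 + alpha / 4.0
        steps += 1
    return alpha, steps

print(adversary_alpha())   # converges to 4/3 in about 20 steps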


The previous calculations have assumed that all occurring terms like (1 + α_i/4)L/2 are integers. If one takes the possibility of non-integral values into account and carries out the respective calculations for all cases, one can show that, for every L, every greedy algorithm can be forced to use ⌊5L/3⌋ colors on a set of paths with maximum load L [15]. Furthermore, it is not difficult to show that the paths resulting from this worst-case construction for greedy algorithms can be colored optimally using only L colors. Hence, this also yields a lower bound of ⌊5OPT/3⌋ colors for any greedy algorithm.

Lower Bounds for Optimal Colorings

The instance of path coloring depicted in Fig. 1 consists of 5 paths in a binary tree with maximum load L = 2 such that even an optimal coloring requires 3 colors. Consider the instances of path coloring obtained from this instance by replacing each path by ℓ identical copies. Such an instance consists of 5ℓ paths with maximum load L = 2ℓ, and an optimal coloring requires at least ⌈5ℓ/2⌉ = ⌈5L/4⌉ colors because no more than two of the given paths can be assigned the same color. Furthermore, ⌈5ℓ/2⌉ colors are also sufficient to color these instances: for example, if ℓ is even, use colors 1, …, ℓ for paths from a to e, colors ℓ + 1, …, 2ℓ for paths from f to e, colors 1, …, ℓ/2 and 2ℓ + 1, …, 5ℓ/2 for paths from f to c, colors ℓ/2 + 1, …, 3ℓ/2 for paths from d to b, and colors 3ℓ/2 + 1, …, 5ℓ/2 for paths from a to b. Hence, for every even L there is a set of paths in a binary tree with load L such that an optimal coloring requires ⌈5L/4⌉ colors [4,18]. While the path coloring instance with L = 2 and OPT = 3 could be specified easily, K. Jansen used a more involved construction to obtain an instance with L = 3 and OPT = 5 [15]. It makes use of three components as building blocks. Each component consists of a vertex v with its parent and two children and a specification of the usage of edges incident to v by paths touching v. The root component ensures that at least 3 colors are used either on the left downward connection (extending below v_1) or on the right downward connection (extending below v_2). Each child of the root component is connected to a type A component, i.e., the child is identified with the parent vertex of a type A component

Directed Tree Networks, Figure 6 Root component

Directed Tree Networks, Figure 7 Type A component

and the corresponding paths are identified as well. Type A components have the property that, if the paths touching v and its parent are colored with 3 colors, at least 4 colors must be used either for the paths touching v and v_1 or for those touching v and v_2. (If the paths touching v and its parent are colored with 4 colors, the remaining paths of the type A component even require 5 colors.) Hence, there is at least one child in one of the two type A components below the root component such that the paths touching this child and its parent are colored with four colors.


Directed Tree Networks, Figure 8 Type B component

The final component used is of type B. It has the property that, if the paths touching v and its parent are colored with 4 colors, at least 4 colors must be used either for the paths touching v and v_1 or for those touching v and v_2. For certain arrangements of colors on the paths touching v and its parent, 5 colors are necessary. It is possible to arrange a number of type B components in a binary tree such that, for any combination of four colors on paths entering the tree of type B components at its root, 5 colors are necessary to complete the coloring. Hence, if one attaches a copy of this tree of type B components to each of the children of a type A component, it is ensured that at least one of the trees will be entered by paths with four colors and consequently 5 colors are necessary to color all paths. Since the load on every directed edge is at most 3, this gives a worst-case example for path coloring in binary trees with L = 3 and OPT = 5.

Randomized Algorithms

In [1,2], V. Auletta, I. Caragiannis, C. Kaklamanis and G. Persiano presented a class of randomized algorithms for path coloring in directed tree networks. They gave a randomized algorithm that, with high probability, uses at most 7L/5 + o(L) colors for coloring any set of paths of maximum load L on binary trees of height o(L^{1/3}). The analysis of the algorithm uses tail inequalities for hypergeometric probability distributions such as Azuma's inequality. Moreover, they proved that no randomized greedy algorithm can achieve, with high probability, a performance ratio better than 3/2 for trees of height Ω(L) and better than 1.293 − o(1) for trees of constant height. These results have been improved in [5] by I. Caragiannis, A. Ferreira, C. Kaklamanis, S. Pérennes, and H. Rivano, who gave a randomized approximation algorithm for bounded-degree trees that has approximation ratio 1.61 + o(1). The algorithm first computes in polynomial time an optimal solution for the fractional path coloring problem and then applies randomized rounding to obtain an integral solution.

Related Topics

A number of further results related to the path coloring problem in directed tree networks or in networks with different topology are known. The number of colors required for sets of paths that have a special form has been investigated, e.g., one-to-all instances, all-to-all instances, permutations, and k-relations. A survey of many of these results can be found in [4]. The undirected version of the path coloring problem has been studied by P. Raghavan and E. Upfal in [20]; here, the network is represented by an undirected graph and paths must receive different colors if they share an undirected edge. Approximation results for directed and undirected path coloring problems in ring networks, mesh networks, and arbitrary networks (all of these are NP-hard no matter whether the paths are fixed or can be chosen by the algorithm [7]) have been derived. An on-line variant of path coloring was studied by Y. Bartal and S. Leonardi in [3]. Here, the algorithm is given connection requests one by one and must determine a path connecting the corresponding vertices and a color for this path without any knowledge of future requests. The worst-case ratio between the number of colors used by the on-line algorithm and that used by an optimal off-line algorithm with complete advance knowledge is the competitive ratio. In [3], on-line algorithms with competitive ratio O(log n) are presented for trees, trees of rings, and meshes with n vertices.

References
1. Auletta V, Caragiannis I, Kaklamanis C, Persiano P (2000) Randomized Path Coloring on Binary Trees. In: Jansen K, Khuller S (eds) Approximation Algorithms for Combinatorial Optimization (APPROX 2000), LNCS, vol 1913. Springer, Berlin, pp 60–71
2. Auletta V, Caragiannis I, Kaklamanis C, Persiano P (2002) Randomized Path Coloring on Binary Trees. Theoret Comput Sci 289:355–399
3. Bartal Y, Leonardi S (1997) On-line routing in all-optical networks. Proceedings of the 24th International Colloquium on Automata, Languages and Programming ICALP 97, LNCS, vol 1256. Springer, Berlin, pp 516–526
4. Beauquier B, Bermond J-C, Gargano L, Hell P, Perennes S, Vaccaro U (1997) Graph problems arising from wavelength-routing in all-optical networks. Proceedings of IPPS 97, Second Workshop on Optics and Computer Science (WOCS), Geneva
5. Caragiannis I, Ferreira A, Kaklamanis C, Pérennes S, Rivano H (2001) Fractional Path Coloring with Applications to WDM Networks. Proceedings of the 28th International Colloquium on Automata, Languages and Programming ICALP 01, LNCS, vol 2076. Springer, Berlin, pp 732–743
6. Caragiannis I, Kaklamanis C, Persiano P (1997) Bounds on optical bandwidth allocation on directed fiber tree topologies. Proceedings of IPPS 97, Second Workshop on Optics and Computer Science (WOCS), Geneva
7. Erlebach T, Jansen K (1997) Call scheduling in trees, rings and meshes. Proceedings of the 30th Hawaii International Conference on System Sciences HICSS-30, vol 1. IEEE Computer Society Press, Maui, pp 221–222
8. Erlebach T, Jansen K (1997) Scheduling of virtual connections in fast networks. Proceedings of the 4th Parallel Systems and Algorithms Workshop PASA 96. World Scientific Publishing, Jülich, pp 13–32
9. Erlebach T, Jansen K (2001) The complexity of path coloring and call scheduling. Theoret Comput Sci 255(1–2):33–50
10. Erlebach T, Jansen K, Kaklamanis C, Mihail M, Persiano P (1999) Optimal Wavelength Routing on Directed Fiber Trees. Theoret Comput Sci 221(1–2):119–137
11. Erlebach T, Jansen K, Kaklamanis C, Persiano P (1998) An optimal greedy algorithm for wavelength allocation in directed tree networks. Proceedings of the DIMACS Workshop on Network Design: Connectivity and Facilities Location. DIMACS Series Disc Math Theoret Comput Sci AMS 40:117–129
12. Garey MR, Johnson DS, Miller GL, Papadimitriou CH (1980) The complexity of coloring circular arcs and chords. SIAM J Algebraic Discrete Methods 1(2):216–227
13. Green PE (1991) The future of fiber-optic computer networks. IEEE Comput 24(9):78–87
14. Holyer I (1981) The NP-completeness of edge-coloring. SIAM J Comput 10(4):718–720
15. Jansen K (1997) Approximation results for wavelength routing in directed trees. Proceedings of IPPS 97, Second Workshop on Optics and Computer Science (WOCS), Geneva
16. Kaklamanis C, Persiano P, Erlebach T, Jansen K (1997) Constrained bipartite edge coloring with applications to wavelength routing. Proceedings of the 24th International Colloquium on Automata, Languages and Programming ICALP 97, LNCS, vol 1256, Bologna. Springer, Berlin, pp 493–504
17. Kumar SR, Panigrahy R, Russel A, Sundaram R (1997) A note on optical routing on trees. Inf Process Lett 62:295–300
18. Kumar V, Schwabe EJ (1997) Improved access to optical bandwidth in trees. Proceedings of the 8th Annual ACM–SIAM Symposium on Discrete Algorithms SODA 97, New Orleans, pp 437–444
19. Mihail M, Kaklamanis C, Rao S (1995) Efficient access to optical bandwidth. Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pp 548–557
20. Raghavan P, Upfal E (1994) Efficient routing in all-optical networks. Proceedings of the 26th Annual ACM Symposium on Theory of Computing STOC 94, ACM SIGACT, Montreal. ACM Press, New York, pp 134–143

Direct Global Optimization Algorithm

DONALD R. JONES
General Motors Corp., Warren, USA

MSC2000: 65K05

Article Outline

Keywords
See also
References

Keywords

Global optimization; Black-box optimization; Nonsmooth optimization; Constraints; Deterministic optimization

For a black-box global optimization algorithm to be truly global, some effort must be allocated to global search, that is, search done primarily to ensure that potentially good parts of the space are not overlooked. On the other hand, to be efficient, some effort must also be placed on local search near the current best solution. Most algorithms either move progressively from global to local search (e.g., simulated annealing) or combine a fundamentally global method with a fundamentally local method (e.g., multistart, tunneling).


DIRECT introduces a new approach: in each iteration several search points are computed using all possible weights on local versus global search (how this is done will be made clear shortly). This approach eliminates the need for 'tuning parameters' that set the balance between local and global search, resulting in an algorithm that is robust and easy to use. DIRECT is especially valuable for engineering optimization problems. In these problems, the objective and constraint functions are often computed using time-consuming computer simulations, so there is a need to be efficient in the use of function evaluations. The problems may contain both continuous and integer variables, and the functions may be nonlinear, nonsmooth, and multimodal. While many algorithms address these problem features individually, DIRECT is one of the few that addresses them collectively. However, the versatility of DIRECT comes at a cost: the algorithm suffers from a curse of dimensionality that limits it to low-dimensional problems (say, no more than 20 variables). The general problem solved by DIRECT can be written as follows:

\[
\begin{aligned}
\min\quad & f(x_1, \ldots, x_n) \\
\text{s.t.}\quad & g_1(x_1, \ldots, x_n) \le 0, \\
& \quad\vdots \\
& g_m(x_1, \ldots, x_n) \le 0, \\
& \ell_i \le x_i \le u_i, \\
& x_i \ \text{integer for } i \in I.
\end{aligned}
\]

To prove convergence, we must assume that the objective and constraint functions are continuous in the neighborhood of the optimum, but the functions can otherwise be nonlinear, nondifferentiable, nonconvex, and multimodal. While DIRECT does not explicitly handle equality constraints, problems with equalities can often be rewritten as problems with inequality constraints (either by replacing the equality with an inequality that becomes binding in the solution, or by using the equalities to eliminate variables). The set I in the above problem is the set of variables that are restricted to integer values. DIRECT works best when the integer variables describe an ordered quantity, such as the number of teeth on a gear. It is less effective when the integer variables are categorical.
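For the illustrative sketches that follow, it helps to fix a concrete representation of this problem; the container below and the toy instance are assumptions of ours, not part of DIRECT.

from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Set

@dataclass
class Problem:
    """General problem data: minimize f subject to g_j(x) <= 0,
    bounds lower <= x <= upper, and integrality on the indices in
    `integers`."""
    f: Callable[[Sequence[float]], float]
    g: List[Callable[[Sequence[float]], float]] = field(default_factory=list)
    lower: Sequence[float] = ()
    upper: Sequence[float] = ()
    integers: Set[int] = field(default_factory=set)

# A toy instance: min (x1 - 1)^2 + (x2 - 2)^2  s.t.  x1 + x2 - 2 <= 0
toy = Problem(
    f=lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2,
    g=[lambda x: x[0] + x[1] - 2.0],
    lower=(0.0, 0.0),
    upper=(4.0, 4.0),
)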

Direct Global Optimization Algorithm, Figure 1

In what follows, we begin by describing how DIRECT works when there are no inequality and integer constraints. This basic version corresponds, with minor differences, to the originally published algorithm [2]. After describing the basic version, we then introduce extensions to handle inequality and integer constraints (this article is the first publication to document these extensions). We conclude with a step-by-step description of the algorithm. The bounds on the variables limit the search to an n-dimensional hyper-rectangle. DIRECT proceeds by partitioning this rectangle into smaller rectangles, each of which has a 'sampled point' at its center, that is, a point where the functions have been evaluated. An example of such a partition for n = 2 is shown in Fig. 1. We have drawn the rectangle as a square because later, whenever we measure distances or lengths, we will weight each dimension so that the original range (u_i − ℓ_i) has a weighted distance of one. Drawing the hyper-rectangle as a hyper-cube allows us to visualize relative lengths as they will be used in the algorithm. Figure 2 shows the first three iterations of DIRECT on a hypothetical two-variable problem. At the start of each iteration, the space is partitioned into rectangles. DIRECT then selects one or more of these rectangles for further search using a technique described later. Finally, each selected rectangle is trisected along one of its long sides, after which the center points of the outer thirds are sampled. In this way, we sample two new points in the rectangle and maintain the property that every sampled point is at the center of a rectangle (this property would not be preserved if the rectangle were bisected). At the beginning of iteration 1, there is only one rectangle (the entire space). The process of selecting


Direct Global Optimization Algorithm, Figure 2

Direct Global Optimization Algorithm, Figure 3

rectangles is therefore trivial, and this rectangle is trisected as shown. At the start of iteration 2, the selection process is no longer trivial because there are three rectangles. In the example, we select just one rectangle, which is then trisected and sampled. At the start of iteration 3, there are 5 rectangles; in this example, two of them are selected and trisected. The key step in the algorithm is the selection of rectangles, since this determines how search effort is allocated across the space. The trisection process and other details are less important, and we will defer discussion of them until later. To motivate how DIRECT selects rectangles, let us begin by considering the extremes of pure global search and pure local search. A pure global search strategy would select one of the biggest rectangles in each iteration. If this were done, all the rectangles would become small at about the same rate. In fact, if we always trisected one of the biggest rectangles, then after 3^{kn} function evaluations every rectangle would be a cube with side length 3^{−k}, and the sampled points would form a uniform grid. By looking everywhere, this pure global strategy avoids overlooking good parts of the space. A pure local strategy, on the other hand, would sample the rectangle whose center point has the best objective function value. This strategy is likely to find good

solutions quickly, but it could overlook the rectangle that contains the global optimum (this would happen if the rectangle containing the global optimum had a poor objective function value at the center). To select just one 'best' rectangle, we would have to introduce a tuning parameter that controlled the local/global balance. Unfortunately, the algorithm would then be extremely sensitive to this parameter, since the proper setting would depend on the (unknown) difficulty of the problem at hand. DIRECT avoids tuning parameters by rejecting the idea of selecting just one rectangle. Instead, several rectangles are selected using all possible relative weightings of local versus global search. The idea of using all possible weightings may seem impractical, but with the help of a simple diagram this idea can actually be made quite intuitive. For this diagram, we will need a way to measure the size of a rectangle. We will measure size using the distance between the center point and the vertices, as shown in Fig. 3. With this measure of rectangle size, we can now turn our attention to Fig. 4, which shows how rectangles are selected. In the figure, each rectangle in the partition is represented by a dot. The horizontal coordinate of a dot is the size of the rectangle, measured by the center-vertex distance. The vertical coordinate is the function value at the midpoint of the rectangle. The dot labeled A represents the rectangle with the lowest function value, and so this would be the rectangle selected by a pure local strategy. Similarly, the dot labeled B represents one of the biggest rectangles, and so it would be selected by a pure global strategy. DIRECT selects not only these two extremes but also all the rectangles on the lower-right convex hull of the cloud of dots (the dots connected by the line). These rectangles represent 'efficient trade-offs' between local and global search, in the sense that each of them is best for some relative weighting of midpoint function value and center-vertex distance. (We will explain the other lines in Fig. 4 shortly.)


Direct Global Optimization Algorithm, Figure 4

One might think that the idea illustrated in Fig. 4 would extend naturally to the constrained case; that is, we would simply select any rectangle that was best for some weighting of objective value, center-vertex distance, and constraint values. Unfortunately, this does not work because it leads to excessive sampling in the infeasible region. However, as we explain next, there is an alternative way of thinking about the lower-right convex hull that does extend to the constrained case. For the sake of the exposition, let us suppose for the moment that we know the optimal function value f*. For the function to reach f* within rectangle r, it would have to undergo a rate of change of at least (f_r − f*)/d_r, where f_r is the function value at the midpoint of rectangle r and d_r is the center-vertex distance. This follows because the function value at the center is f_r and the maximum distance over which the function can fall to f* is the center-vertex distance d_r. Intuitively, it seems 'more reasonable' to assume that the function will undergo a gradual change than to assume it will make a steep descent to f*. Therefore, if only we knew the value f*, a reasonable criterion for selecting a rectangle would be to choose the one that minimizes (f_r − f*)/d_r. Figure 4 shows a graphical way to find the rectangle that minimizes (f_r − f*)/d_r. Along the vertical axis we show the current best function value, f_min, as well as the supposed global minimum f*. Now suppose we anchor a line at the point (0, f*) and slowly swing it upwards. When we first encounter a dot, the slope of the line will be precisely the ratio (f_r − f*)/d_r, where r is the index

Direct Global Optimization Algorithm, Figure 5

of the rectangle corresponding to the encountered dot. Moreover, since this is the first dot touched by the line, rectangle r must be the rectangle that minimizes (f_r − f*)/d_r. Of course, in general we will not know the value of f*. But we do know that, whatever f* is, it satisfies f* ≤ f_min. So imagine that we repeat the line-sweep exercise in Fig. 4 for all values of f* ranging from f_min to −∞. How many rectangles could be selected? Well, with a little thought, it should be clear that the set of dots that can be selected via these line sweeps is precisely the lower-right convex hull of the dots. This alternative approach to deriving the lower-right convex hull suggests a small but important modification to the selection rule. In particular, to prevent DIRECT from wasting function evaluations in pursuit of very small improvements, we will insist that the value of f* satisfy f* ≤ f_min − ε. That is, we are only interested in selecting rectangles where it is reasonable that we can find a 'significantly better' solution. A natural value of ε would be the desired accuracy of the solution. In our implementation, we have set ε = max(10^{−4}|f_min|, 10^{−8}). As shown in Fig. 5, the implication of this modification is that some of the smaller rectangles on the lower-right convex hull may be skipped. In fact, the smallest rectangle that will be selected is the one chosen when f* = f_min − ε.
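In the unconstrained case, the selection rule can be coded directly from the dot diagram. The gift-wrapping pass below is a minimal sketch, assuming each rectangle is summarized by its center-vertex distance d_r and midpoint value f_r; it is written for clarity, not efficiency.

def select_rectangles(rects, fmin, eps):
    """Select the dots on the lower-right convex hull reachable with
    f* <= fmin - eps.

    rects: list of (d_r, f_r) pairs. The first pick minimizes the slope
    from the anchor (0, fmin - eps); subsequent picks pivot to the
    minimum-slope dot with strictly larger d, ending at one of the
    biggest rectangles.
    """
    f_star = fmin - eps
    current = min(range(len(rects)),
                  key=lambda r: (rects[r][1] - f_star) / rects[r][0])
    selected = [current]
    while True:
        d0, f0 = rects[current]
        best, best_slope = None, None
        for r, (d, f) in enumerate(rects):
            if d <= d0:
                continue
            slope = (f - f0) / (d - d0)
            if best is None or slope < best_slope:
                best, best_slope = r, slope
        if best is None:
            break                      # reached the largest rectangles
        selected.append(best)
        current = best
    return selected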


The only difference is that, in the original version, a selected rectangle was trisected not just on a single long side, but rather on all long sides. This approach eliminated the need to arbitrarily select a single long side when there were more than one and, as a result, it added an element of robustness to the algorithm. Experience has since shown, however, that the robustness benefit is small and that trisecting on a single long side (as here) accelerates convergence in higher dimensions. Let us now consider how the rectangle selection procedure can be extended to handle inequality constraints. The key to handling constraints in DIRECT is to work with an auxiliary function that combines information on the objective and constraint functions in a special manner. To express this auxiliary function, we will need some additional notation. Let g_{rj} denote the value of constraint j at the midpoint of rectangle r. In addition, let c_1, …, c_m be positive weighting coefficients for the inequality constraints (we will discuss how these coefficients are computed later). Finally, for the sake of the exposition, let us again suppose that we know the optimal function value f*. The auxiliary function, evaluated at the center of rectangle r, is then as follows:

\[ \max(f_r - f^*, 0) + \sum_{j=1}^{m} c_j \max(g_{rj}, 0). \]

The first term of the auxiliary function exacts a penalty for any deviation of the function value f_r above the global minimum value f*. Note that, in a constrained problem, it is possible for f_r to be less than f* by violating the constraints; due to the maximum operator, the auxiliary function gives no credit for values of f_r below f*. The second term in the auxiliary function is a sum of weighted constraint violations. Clearly, the lowest possible value of the auxiliary function is zero and occurs only at the global minimum. At any other point, the auxiliary function is positive, either due to suboptimality or to infeasibility. This auxiliary function is not a penalty function in the standard sense. A standard penalty function would be a weighted sum of the objective function and constraint violations; it would not include the value f*, since this value is generally unknown. Moreover, in the standard approach, it is critical that the penalty coefficients be sufficiently large to prevent the penalty function from being minimized in the infeasible region.
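Transcribed into code, the auxiliary function, and the rate-of-change function h_r(f*) built from it in the next paragraphs, are one-liners; the argument layout is an assumption of this sketch.

def auxiliary(f_r, g_r, c, f_star):
    """max(f_r - f*, 0) + sum_j c_j * max(g_rj, 0): zero only at the
    global minimum, positive under suboptimality or infeasibility."""
    return max(f_r - f_star, 0.0) + sum(
        cj * max(gj, 0.0) for cj, gj in zip(c, g_r))

def rate_of_change(f_r, g_r, c, f_star, d_r):
    """h_r(f*): the minimum rate of change the auxiliary function must
    undergo to reach zero within rectangle r."""
    return auxiliary(f_r, g_r, c, f_star) / d_r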


This is not true for our auxiliary function: as long as f* is the optimal function value, the auxiliary function is minimized at the global optimum for any positive constraint coefficients. For the global minimum to occur in rectangle r, the auxiliary function must fall to zero starting from its (positive) value at the center point. Moreover, the maximum distance over which this change can occur is the center-vertex distance d_r. Thus, to reach the global minimum in rectangle r, the auxiliary function must undergo a minimum rate of change, denoted h_r(f*), given by

\[ h_r(f^*) = \frac{\max(f_r - f^*, 0) + \sum_{j=1}^{m} c_j \max(g_{rj}, 0)}{d_r}. \]

Since it is more reasonable to expect gradual changes than abrupt ones, a reasonable way to select a rectangle would be to select the rectangle that minimizes the rate of change h_r(f*). Of course, this is impractical because we generally will not know the value f*. Nevertheless, it is possible to select the set of rectangles that minimize h_r(f*) for some f* ≤ f_min − ε. This is how we select rectangles with constraints, assuming a feasible point has been found so that f_min is well-defined (we will show how this is implemented shortly). If no feasible point has been found, we simply select the rectangle that minimizes

\[ \frac{\sum_{j=1}^{m} c_j \max(g_{rj}, 0)}{d_r}. \]

That is, we select the rectangle where the weighted constraint violations can be brought to zero with the least rate of change. To implement this selection rule, it is again helpful to draw a diagram. This new selection diagram is based on plotting the rate-of-change function h_r(f*) as a function of f*. Figure 6 illustrates this function. For values of f* ≥ f_r, the first term in the numerator of h_r(f*) is zero, and so h_r(f*) is constant. As f* falls below f_r, however, h_r(f*) increases, because we now exact a penalty for f_r being above the supposed global minimum f*. The slope of the h_r(f*) function to the left of f_r is −1/d_r. Figure 7 superimposes, in one diagram, the rate-of-change functions for a hypothetical set of seven rectangles. For a particular value of f*, we can visually find the rectangle that minimizes h_r(f*) by starting at the point


Direct Global Optimization Algorithm, Figure 6

Direct Global Optimization Algorithm, Figure 7

(f*, 0) along the horizontal axis and moving vertically until we first encounter a curve. What we want, however, is the set of all rectangles that can be selected in this way using any f* ≤ f_min − ε. This set can be found as follows (see Fig. 7). We start with f* = f_min − ε and move upwards until we first encounter a curve for some rectangle. We note this rectangle and follow its curve to the left until it intersects the curve for another rectangle (these intersections are circled in Figure 7). When this happens, we note this other rectangle and follow its curve to the left. We continue in this way until we find a curve that is never intersected by another one. This procedure will identify all the h_r(f*) functions that participate in the lower envelope of the curves to the left of f_min − ε. The set of rectangles found in this way is the set selected by DIRECT. Along the horizontal axis in Fig. 7, we identify ranges of f* values for which different rectangles have the lowest value of h_r(f*). As we scan from f_min − ε to

the left, the rectangles that participate in the lower envelope are 1, 2, 5, 2, and 7. This example illustrates that it is possible to encounter a curve more than once (here rectangle 2), and care must be taken not to double count such rectangles. It is also possible for some curves to coincide along the lower envelope, and so be 'tied' for the least rate of change (this does not happen in Fig. 7). In such cases, we select all the tied rectangles. Tracing the lower envelope in Fig. 7 is not computationally intense. To see this, note that each selected rectangle corresponds to a curve on the lower envelope, and for each such curve the work we must do is to find the intersection with the next curve along the lower envelope. Finding this next intersection requires computing the intersection of the current curve with all the other curves. It follows that the work required for each selected rectangle (and hence for every two sampled points) is only on the order of the total number of rectangles in the partition. The tracing of the lower envelope can also be accelerated by some pre-processing. In particular, it is possible to quickly identify rectangles whose curves lie completely above other curves. For example, in Fig. 7, curve 3 lies above curve 1, and curve 4 lies above curve 2. These curves cannot possibly participate in the lower envelope, and so they can be deleted from consideration before the lower envelope is traced. It remains to explain how the constraint coefficients c_1, …, c_m are computed, as well as a few other details about trisection and the handling of integer variables. We will cover these details in turn, and then bring everything together into a step-by-step description of the algorithm. To understand how we compute the constraint coefficient c_j, suppose for the moment that we knew the average rate of change of the objective function, denoted a_0, and the average rate of change of constraint j, denoted a_j. Furthermore, suppose that at the center of a rectangle we have g_j > 0. At the average rate of change of constraint j, we would have to move a distance equal to g_j/a_j to get rid of the constraint violation. If during this motion the objective function got worse at its average rate of change, it would get worse by a_0 times the distance, or a_0(g_j/a_j) = (a_0/a_j)g_j. Thus we see that the ratio a_0/a_j provides a way of converting units of constraint violation into potential increases in the objective function. For this reason, we will set c_j = a_0/a_j.
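Before turning to those details, the envelope-based selection itself can be illustrated with a simplified numerical stand-in: instead of tracing curve intersections exactly, the sketch below scans a grid of f* values and collects the minimizing rectangles. The grid size and scan range are arbitrary assumptions; a fine enough grid over a wide enough range recovers the same set as the exact tracing.

def select_constrained(rect_data, c, fmin, eps, n_grid=10000, span=1000.0):
    """Approximate lower-envelope selection by scanning f* values.

    rect_data: list of (f_r, g_r, d_r) triples, where g_r holds the
    constraint values at the rectangle midpoint; c: coefficients c_j.
    """
    selected = set()
    for i in range(n_grid):
        f_star = fmin - eps - i * (span / n_grid)
        rates = []
        for f_r, g_r, d_r in rect_data:
            viol = sum(cj * max(gj, 0.0) for cj, gj in zip(c, g_r))
            rates.append((max(f_r - f_star, 0.0) + viol) / d_r)
        selected.add(min(range(len(rates)), key=rates.__getitem__))
    return selected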


The average rates of change are estimated in a very straightforward manner. We maintain a variable s_0 for the sum of observed rates of change of the objective function. Similarly, we maintain variables s_1, …, s_m for the sums of observed rates of change for each of the m constraints. All of these variables are initialized to zero at the start of the algorithm and updated each time a rectangle is trisected. Let x^{mid} denote the midpoint of the parent rectangle and let x^{left} and x^{right} denote the midpoints of the left and right child rectangles after trisection. The variables are updated as follows:

\[ s_0 = s_0 + \sum_{\text{child}=\text{left}}^{\text{right}} \frac{\left| f(x^{\text{child}}) - f(x^{\text{mid}}) \right|}{\left\| x^{\text{child}} - x^{\text{mid}} \right\|}, \]

\[ s_j = s_j + \sum_{\text{child}=\text{left}}^{\text{right}} \frac{\left| g_j(x^{\text{child}}) - g_j(x^{\text{mid}}) \right|}{\left\| x^{\text{child}} - x^{\text{mid}} \right\|}. \]

Now the average rates of change are a_0 = s_0/N and a_j = s_j/N, where N is the number of rates of change accumulated into the sums. It follows that

\[ \frac{a_0}{a_j} = \frac{s_0/N}{s_j/N} = \frac{s_0}{s_j}. \]

We may therefore compute c_j using

\[ c_j = \frac{s_0}{\max(s_j, 10^{-30})}, \]

where we use the maximum operator in the denominator to prevent division by zero. So far we have said that we will always trisect a rectangle along one of its long sides. However, as shown in Fig. 2, several sides may be tied for longest, and so we need some way to break these ties. Our tie-breaking mechanism is as follows. We maintain counters t_i (i = 1, …, n) for how many times we have split along dimension i over the course of the entire search. These counters are initialized to zero at the beginning of the algorithm, and counter t_i is incremented each time a rectangle is trisected along dimension i. If we select a rectangle that has several sides tied for being longest, we break the tie in favor of the side with the lowest t_i value. If several long sides are also tied for the lowest t_i value, we break the tie arbitrarily in favor of the lowest-indexed dimension. This tie-breaking strategy has the effect of equalizing the number of times we split on the different dimensions. (This is not obvious, but can be easily verified.)
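The bookkeeping from the last few paragraphs (the sums s_j, the coefficients c_j, and the tie-breaking counters t_i) fits in a few lines. The sketch below assumes coordinates have already been scaled by the weighted metric; it is an illustration, not the reference implementation.

import math

def update_rate_sums(s, f, g, x_mid, children):
    """Accumulate observed rates of change after a trisection.

    s: list [s_0, s_1, ..., s_m]; f and g: objective and constraint
    callables; children: midpoints of the child rectangles (there may
    be only one when trisecting a two-value integer range).
    """
    for x_child in children:
        dist = math.dist(x_child, x_mid)
        s[0] += abs(f(x_child) - f(x_mid)) / dist
        for j, gj in enumerate(g, start=1):
            s[j] += abs(gj(x_child) - gj(x_mid)) / dist

def constraint_coefficients(s, m):
    """c_j = s_0 / max(s_j, 1e-30), guarding against division by zero."""
    return [s[0] / max(s[j], 1e-30) for j in range(1, m + 1)]

def splitting_dimension(long_sides, t):
    """Among the long sides, pick the least-split dimension, breaking
    remaining ties by lowest index; update its counter."""
    i = min(long_sides, key=lambda d: (t[d], d))
    t[i] += 1
    return i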

Let us now turn to the calculation of the center-vertex distance. Recall that we measure distance using a weighted metric that assigns a length of one to the initial range of each variable (u_i − ℓ_i). Each time a rectangle is split, the length of that side is then reduced by a factor of 1/3. Now consider a rectangle that has been trisected T times. Let j = mod(T, n), so that we may write T = kn + j where k = (T − j)/n. After the first kn trisections, all of the n sides will have been trisected k times and will therefore have length 3^{−k}. The remaining j trisections will make j of the sides have length 3^{−(k+1)}, leaving n − j sides with length 3^{−k}. Simple algebra then shows that the distance d from the center to the vertices is given by

\[ d = \frac{3^{-k}}{2} \left( \frac{j}{9} + n - j \right)^{0.5}. \]
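The closed form is easy to check against a direct computation from the side lengths; both functions below are our illustration.

def center_vertex_distance(T, n):
    """Closed-form distance after T trisections of an n-dimensional
    rectangle (continuous case)."""
    j = T % n
    k = (T - j) // n
    return (3.0 ** (-k) / 2.0) * (j / 9.0 + n - j) ** 0.5

def center_vertex_distance_direct(T, n):
    """Half the diagonal: j sides of length 3^-(k+1) and n - j sides
    of length 3^-k."""
    j = T % n
    k = (T - j) // n
    sides = [3.0 ** -(k + 1)] * j + [3.0 ** (-k)] * (n - j)
    return 0.5 * sum(s * s for s in sides) ** 0.5

assert abs(center_vertex_distance(7, 3)
           - center_vertex_distance_direct(7, 3)) < 1e-12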


Direct Global Optimization Algorithm, Figure 8

The handling of integer variables is amazingly simple, involving only minor changes to the trisection routine and to the way the midpoint of a rectangle is defined. For example, consider an integer variable with range [1, 8]. We could not define the midpoint to be 4.5 because this is not an integer. Instead, we will use the following procedure. Suppose the range of a rectangle along an integer dimension is [a, b], with both a and b being integers. We will define the 'midpoint' as ⌊(a + b)/2⌋, that is, the floor of the algebraic average (the floor of z, denoted ⌊z⌋, is the greatest integer less than or equal to z). To trisect along the integer dimension, we first compute δ = ⌊(b − a + 1)/3⌋. If δ ≥ 1, then after trisection the left child will have the range [a, a + δ − 1], the center child will have the range [a + δ, b − δ], and the right child will have the range [b − δ + 1, b]. If δ = 0, then the integer side must have a range of two (i.e., b = a + 1). In this case, the center child will have the range [a, a], the right child will have the range [b, b], and there will be no left child. This procedure maintains the property that the midpoint of the parent rectangle always becomes the midpoint of the center child. As an example, Fig. 8 shows how a rectangle would be trisected when there are two integer dimensions. In the figure, the circles represent possible integer combinations, and the filled circles represent the midpoints. Integer variables introduce three other complications. The first, which may be seen in Fig. 8, is that the sampled point may not be in the true geometric center of the rectangle. As a result, the center-vertex distance will not be unique but will vary from vertex to vertex. We ignore this detail and simply use the formula given above for the continuous case, which only depends upon the number of times a rectangle has been trisected. The second complication concerns how we define a 'long' side. In the continuous case, the length of a side is directly related to the number of times it has been trisected along that dimension. Specifically, if a rectangle has been split k times along some side, then the side length will be 3^{−k} (recall that we measure distance relative to the original range of each variable). In the continuous case, therefore, the set of long sides is the same as the set of sides that have been split upon the least. When there are integers, however, the side lengths will no longer be multiples of 1/3. To keep things simple, however, we ignore this and continue to define a 'long' side as one that has been split upon the least. However, if an integer side has been split so many times that its side length is zero (i.e., the range contains a single integer), then this side will not be considered long. The third and final complication is that, if all the variables are integer, then it is possible for a rectangle to be reduced to a single point. If this happens, the rectangle would be fathomed; hence, it should be ignored in the rectangle selection process in all subsequent iterations.
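A sketch of the integer 'midpoint' and trisection rules just described; the symbol name delta and the return conventions are our notation.

def integer_midpoint(a, b):
    """'Midpoint' of an integer range [a, b]: floor of the average."""
    return (a + b) // 2

def trisect_integer(a, b):
    """Child ranges of an integer side [a, b], with
    delta = floor((b - a + 1) / 3).

    Returns (left, center, right); left is None when delta = 0, i.e.,
    when the range holds only two values. The parent midpoint always
    becomes the midpoint of the center child.
    """
    delta = (b - a + 1) // 3
    if delta >= 1:
        return (a, a + delta - 1), (a + delta, b - delta), (b - delta + 1, b)
    return None, (a, a), (b, b)

# Example from the text: range [1, 8] has midpoint 4 (not 4.5) and
# trisects into (1, 2), (3, 6), (7, 8); 4 is the center child's midpoint.
print(integer_midpoint(1, 8), trisect_integer(1, 8))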

DIRECT stops when it reaches a user-defined limit on function evaluations. It would be preferable, of course, to stop when we have achieved some desired accuracy in the solution. However, for black-box problems where we only assume continuity, better stopping rules are hard to develop. As for convergence, it is easy to show that, as f* moves to −∞, DIRECT will select one of the largest rectangles. Because we always select one of the largest rectangles, and because we always subdivide on a long side, every rectangle will eventually become very small and the sampled points will be dense in the space. Since we also assume the functions are continuous in the neighborhood of the optimum, this ensures that we will get within any positive tolerance of the optimum after a sufficiently large number of iterations. Although we have now described all the elements of DIRECT, our discussion has covered several pages, and so it will be helpful to bring everything together in a step-by-step description of the algorithm.
1) Initialization. Sample the center point of the entire space. If the center is feasible, set x_min equal to the center point and f_min equal to the objective function value at this point. Set s_j = 0 for j = 0, …, m; t_i = 0 for i = 1, …, n; and neval = 1 (function evaluation counter). Set maxeval equal to the limit on the number of function evaluations (stopping criterion).
2) Select rectangles. Compute the c_j values using the current values of s_0 and s_j, j = 1, …, m. If a feasible point has not been found, select the rectangle that minimizes the rate of change required to bring the weighted constraint violations to zero. On the other hand, if a feasible point has been found, identify the set of rectangles that participate in the lower envelope of the h_r(f*) functions for some f* ≤ f_min − ε. A good value for ε is ε = max(10^{−4}|f_min|, 10^{−8}). Let S be the set of selected rectangles.
3) Choose any rectangle r ∈ S.
4) Trisect and sample rectangle r. Choose a splitting dimension by identifying the set of long sides of rectangle r and then choosing the long side with the smallest t_i value. If more than one side is tied for the lowest t_i value, choose the one with the lowest-dimensional index. Let i be the resulting splitting dimension. Note that a 'long side'

Direct Global Optimization Algorithm

is defined as a side that has been split upon the least and, if integer, has a positive range. Trisect rectangle r along dimension i and increment t i by one. Sample the midpoint of the left third, increment neval by one, and update xmin and f min . If neval = maxeval, go to Step 7. Otherwise, sample the midpoint of the right third, increment neval by one, and update xmin and f min (note that there might not be a right child when trisecting on an integer variable). Update the sj j = 0, . . . , m. If all n variables are integer, check whether a child rectangle has been reduced to a single point and, if so, delete it from further consideration. Go to Step 5. 5) Update S. Set S = S  {r}. If S is not empty, go to Step 3. Otherwise go to Step 6. 6) Iterate. Report the results of this iteration, and then go to Step 2. 7) Terminate. The search is complete. Report xmin and f min and stop. The results of DIRECT are slightly sensitive to the order in which the selected rectangles are trisected and sampled because this order affects the t i values and, hence, the choice of splitting dimensions for other selected rectangles. In our current implementation, we select the rectangles in Step 3 in the same order that they are found as we scan the lower envelope in Fig. 7 from f  = f min  towards f  =  1. On the first iteration, all the sj will be zero in Step 2 and, hence, all the cj will be zero when computed using cj = s0 /max(sj , 1030 ). Thus, in the beginning the constants cj will not be very meaningful. This is not important, however, because on the first iteration there is only one rectangle eligible for selection (the entire space), and so the selection process is trivial. As the iterations proceed, the sj will be based on more observations, leading to more meaningful cj constants and better rectangle selections. When there are no inequality constraints, the above step-by-step procedure reduces to the basic version of DIRECT described earlier. To see this, note that, when there are no constraints, every point is feasible and so f r  f min  f  for all rectangles r. This fact, combined with the lack of any constraint violations, means that the hr (f  ) function given earlier reduces to (f r  f  )/dr ,

D


which is precisely the rate-of-change function we minimized in the unconstrained version. Thus, in the unconstrained case, tracing the lower envelope in Fig. 7 identifies the same rectangles as tracing the lower-right convex hull in Fig. 5.

Direct Global Optimization Algorithm, Figure 9

We will illustrate DIRECT on the following two-dimensional test function:

The control variables are constrained by

\[ 0 \le u_1(k) \le 4, \qquad 0 \le u_2(k) \le 4, \qquad 0 \le u_3(k) \le 0.5. \]

The performance index to be minimized is

\[ I = x_1^2(P) + x_2^2(P) + x_3^2(P) + \left[ \sum_{k=1}^{P} \left( x_1^2(k-1) + x_2^2(k-1) + 2u_3^2(k-1) \right) \right] \times \left[ \sum_{k=1}^{P} \left( x_3^2(k-1) + 2u_1^2(k-1) + 2u_2^2(k-1) \right) \right]^{1/2}, \]

where P is the number of stages. When P is taken as 100, we have a 300-variable optimization problem, because at each stage there are three control variables


to be determined. Without the use of a reliable way of determining the region sizes over which to take the control variables, the problem is very difficult, but with the method suggested in [20] it was solved quite readily by the LJ optimization procedure, using 100 random points per iteration and 60 passes, each consisting of 201 iterations, to yield I = 258.3393. Although the computational requirements appear enormous, the actual computation time was less than 20 minutes on a Pentium-120 personal computer [20], which corresponds to less than one minute on a Pentium 4/2.4 GHz personal computer. This value of the performance index is very close to the value I = 258.3392 obtained by use of iterative dynamic programming [18]. To solve this problem, IDP is much more efficient in spite of the nonseparability of the problem, because in IDP the problem is solved as a 3-variable problem over 100 stages rather than as a 300-variable optimization problem. Therefore, the LJ procedure is useful for checking the optimal control policy obtained by some other method. Here, the control policies obtained by IDP and the LJ optimization procedure are almost identical, with a sudden change occurring at around stage 70 in the control variables u_1 and u_2. The LJ optimization procedure is thus ideally suited for checking results obtained by other methods, especially when the optimal control policy differs from what is expected, as is the case with this particular example. Recently it was shown that the convergence of the LJ optimization procedure in the vicinity of the optimum can be improved substantially by incorporating a simple line search to choose the best center point for a subsequent pass [24]. For a typical model reduction problem, the computation time to reach the global optimum was reduced by a factor of four when the line search was incorporated. Due to its simplicity, the LJ optimization procedure can be programmed very easily. Computational experience with numerous optimization problems has shown that the method has a high reliability of obtaining the global optimum, so the LJ optimization procedure provides a very good means of obtaining the optimum for very complex problems.

See also

Interval Analysis: Unconstrained and Constrained Optimization

References 1. Bennett HW, Luus R (1971) Application of numerical hill-climbing in control of systems via Liapunov’s direct method. Canad J Chem Eng 49:685–690 2. Bojkov B, Hansel R, Luus R (1993) Application of direct search optimization to optimal control problems. Hungarian J Industr Chem 21:177–185 3. Dinkoff B, Levine M, Luus R (1979) Optimum linear tapering in the design of columns. Trans ASME 46:956–958 4. Hartig F, Keil FJ, Luus R (1995) Comparison of optimization methods for a fed-batch reactor. Hungarian J Industr Chem 23:141–148 5. Jaakola THI, Luus R (1974) A note on the application of nonlinear programming to chemical-process optimization. Oper Res 22:415–417 6. Kalogerakis N, Luus R (1982) Increasing the size of region of convergence for parameter estimation. Proc. 1982 Amer. Control Conf., 358–362 7. Luus R (1974) Optimal control by direct search on feedback gain matrix. Chem Eng Sci 29:1013–1017 8. Luus R (1974) A practical approach to time-optimal control of nonlinear systems. Industr Eng Chem Process Des Developm 13:405–408 9. Luus R (1974) Time-optimal control of linear systems. Canad J Chem Eng 52:98–102 10. Luus R (1974) Two-pass method for handling difficult equality constraints in optimization. AIChE J 20:608–610 11. Luus R (1975) Optimization of multistage recycle systems by direct search. Canad J Chem Eng 53:217–220 12. Luus R (1975) Optimization of system reliability by a new nonlinear integer programming procedure. IEEE Trans Reliabil 24:14–16 13. Luus R (1975) Solution of output feedback stabilization and related problems by stochastic optimization. IEEE Trans Autom Control 20:820–821 14. Luus R (1976) A discussion on optimization of an alkylation process. Internat J Numer Methods in Eng 10:1187– 1190 15. Luus R (1980) Optimization in model reduction. Internat J Control 32:741–747 16. Luus R (1990) Optimal control by dynamic programming using systematic reduction in grid size. Internat J Control 19:995–1013 17. Luus R (1993) Optimization of heat exchanger networks. Industr Eng Chem Res 32:2633–2635 18. Luus R (1996) Application of iterative dynamic programming to optimal control of nonseparable problems. Hungarian J Industr Chem 25:293–297 19. Luus R (1996) Handling difficult equality constraints in direct search optimization. Hungarian J Industr Chem 24: 285–290 20. Luus R (1998) Determination of the region sizes for LJ optimization procedure. Hungarian J Industr Chem 26:281– 286

Discontinuous Optimization

21. Luus R (1999) Effective solution procedure for systems of nonlinear algebraic equations. Hungarian J Industr Chem 27:307–310 22. Luus R (2000) Handling difficult equality constraints in direct search optimization. Part 2. Hungarian J Industr Chem 28:211–215 23. Luus R (2000) Iterative dynamic programming. Chapman and Hall/CRC, London, pp 44–66 24. Luus R (2007) Use of line search in the Luus–Jaakola optmization procedure. Proc IASTED Internat Conf on Computational Intelligence. Banff, Alberta, Canada, July 2-4, 2007, pp 128–135 25. Luus R, Brenek P (1989) Incorporation of gradient into random search optimization. Chem Eng Techn 12:309–318 26. Luus R, Dittrich J, Keil FJ (1992) Multiplicity of solutions in the optimization of a bifunctional catalyst blend in a tubular reactor. Canad J Chem Eng 70:780–785 27. Luus R, Hartig F, Keil FJ (1995) Optimal drug scheduling of cancer chemotherapy by direct search optimization. Hungarian J Industr Chem 23:55–58 28. Luus R, Hennessy D (1999) Optimization of fed-batch reactors by the Luus–Jaakola optimization procedure. Industr Eng Chem Res 38 29. Luus R, Jaakola THI (1973) Optimization by direct search and systematic reduction in the size of search region. AIChE J 19:760–766 30. Luus R, Mutharasan R (1974) Stabilization of linear system behaviour by pole shifting. Internat J Control 20:395– 405 31. Luus R, Storey C (1997) Optimal control of final state constrained systems. Proc. IASTED Intl. Conf. Modelling, Simulation and Optimization, pp 245–249 32. Luus R, Tassone V (1992) Optimal control of nonseparable problems by iterative dynamic programming. Proc. 42nd Canad. Chemical Engin. Conf., Toronto, Canada, pp 81–82 33. Luus R, Wyrwicz R (1996) Use of penalty functions in direct search optimization. Hungarian J Industr Chem 24:273– 278 34. Papangelakis VG, Luus R (1993) Reactor optimization in the pressure oxidation process. Proc. Internat. Symp. Modelling, Simulation and Control of Metall. Processes, 159– 171 35. Rosenbrock HH (1960) An automatic way of finding the greatest or least value of a function. Computer J 3:175–184 36. Spaans R, Luus R (1992) Importance of search-domain reduction in random optimization. JOTA 75:635–638 37. Wang BC, Luus R (1977) Optimization of non-unimodal systems. Internat J Numer Methods in Eng 11:1235–1250 38. Wang BC, Luus R (1978) Reliability of optimization procedures for obtaining global optimum. AIChE J 19:619–626 39. Yang SM, Luus R (1983) A note on model reduction of digital control systems. Internat J Control 37:437–439 40. Yang SM, Luus R (1983) Optimization in linear system reduction. Electronics Lett 19:635–637

D

Discontinuous Optimization MARCEL MONGEAU Laboratoire MIP, University Paul Sabatier, Toulouse, France MSC2000: 90Cxx Article Outline Keywords See also References Keywords Discontinuous optimization; Nondifferentiable optimization; Piecewise linear programming; Active set methods; Exact penalty method Continuous optimization refers to optimization involving objective functions whose domain of definition is a continuum, as opposed to a set of discrete points in combinatorial (or discrete) optimization. Discontinuous optimization is the special case of continuous optimization in which the objective function, although defined over a continuum (let us suppose over Rn ), is not necessarily a continuous function. We define the discontinuous optimization problem as: 8 ˆ inf e f (x) ˆ ˆ < s.t. ˆ ˆ ˆ :

f i (x) D 0;

i 2 E;

f i (x)  0;

i 2 I;

(1)

where the index sets E and I are finite and disjoint and e f and f i , i 2 E [ I are a collection of (possibly discontinuous) piecewise differentiable functions that map Rn to R. A piecewise differentiable function f : Rn ! R is a function whose derivative is defined everywhere except over a subset of a finite number of sets, called ridges, of the form {x 2 Rn : r(x) = 0}, where r is a differentiable function, and these ridges partition the domain into subdomains over each of which f is differentiable. By abuse of language, we shall call r(x) a ridge of f .

739

740

D

Discontinuous Optimization

Without loss of generality, we can restrict our attention to the unconstrained optimization problem: infx f (x), where f is a (possibly discontinuous) piecewise differentiable function. Indeed, in order to solve problem (1), one can consider the unconstrained l1 exact penalty function f (x) C f (x) :D e 

X

X

j f i (x)j

i2E

min[0; f i (x)]

i2I

for a succession of decreasing positive values of the penalty parameter  (f  is clearly a piecewise differentiable function). Notice however that using the l1 penalty function (and dealing with the decrease of a penalty parameter) is only one approach to handling the constrained problem and may not be the best way. Given a (possibly discontinuous) piecewise differentiable function f defined over Rn and the finite set fr i (x)g i2R of its ridges, we define a cell of f to be a nonempty set C Rn such that for all x, y 2 C we have sign(ri (x)) = sign(ri (y)) 6D 0, for all i 2 R, where the function sign is either 1, 1 or 0, according to whether its argument is positive, negative or zero. Thus, f is differentiable over a cell. Considering the optimization of functions which are nonsmooth and even discontinuous is motivated by applications in VLSI and floor-planning problems, plant layout, batch production, switching regression, discharge allocation for hydro-electric generating stations, fixed-charge problems, for example (see [4, Introd.] for references). Note that most of these problems can alternatively be modeled within the context of mixed integer programming, a field straddling combinatorial optimization and continuous optimization. The inescapable nonconvexity nature of discontinuous functions gives rise to the existence of several local optima in discontinuous optimization problems. We do not address here the difficult issue of global optimization. We are concerned with finding a local infimum of the above optimization problem. An algorithm looking for local optima can however be used as an adjunct to some heuristic or global optimization method for discontinuous optimization problems but

the inherent combinatorial nature of such an approach is often ultimately dominant. More importantly, it provides a framework allowing the optimizer to deal directly with the nonsmoothnesses and discontinuities involved, and thereby, improve solutions found by heuristic methods, when this is possible. Leaving aside the heuristic methods (which many people facing practical discontinuous optimization problems rely upon in order to solve mixed integer programming formulation of discontinuous optimization problems), previous work on discontinuous optimization includes smoothing algorithms. The smoothing algorithms express discontinuities by means of a step function, and then they approximate the step function by a function which is not only continuous but moreover smooth, so that the resulting problem can be solved by a gradient technique (cf. also  Conjugate– gradient methods). Both I.I. Imo and D.J. Leech [7] and I. Zang [9] developed methods in which the objective function is replaced only in the neighborhood of the discontinuities. Two drawbacks of these methods are the potential numerical instability when we want this neighborhood to be small, and the cost of evaluating the smoothed functions. In many instances the discontinuities of the first derivative are exactly the regions of interest and smoothing has the effect of making such regions less discernible. Another approach, which deals explicitly with the discontinuities within the framework of continuous optimization, is the following active set method (introduced in [4]). Recall the following definitions relevant to active set methods: the null space of M, denoted by N(M), is defined by n

o

E : N (M) x 2 Rn : Mx D 0 We say that a ridge r is active at b x if r(b x) D 0. Let A(b x) R be the (finite) index set of the ridges that are active at the current point b x, and let A(b x) be the matrix of activities, having as columns the gradients of the ridges that are active at b x. In the case of linear ridges, | x)) is said to ri (x) := ai x  bi , a direction d 2 N (A> (b preserve each activity i 2 A(b x) since for each i 2 A(b x) x C ˛d) D r i (b x) D 0. we have r i (b x) ¤ ;, then r f (b x) is not necessarily defined. If A(b This is because we cannot talk about the gradient of the

Discontinuous Optimization

function at b x since there is no vector g 2 Rn such that | g d is the first order change of f in direction d, for any d 2 Rn . Thus, we cannot use, as in the smooth situation, the negative gradient direction as a descent direction. We term any (n × 1)-vector g xˆ such that

1 BEGIN 2

f 0 (b x; d) D g x> ˆ d; for all d 2 N (A> ); a restricted gradient of f at b x, because it is the gradient of the restriction of f to the space N (A> (b x)). Let us first consider the continuous piecewise linear case. We assume that the ridges of f are given, and also we assume that the restriction of f to any cell is known. Hence, we are assuming that more information on the structure of the objective function is available than, for example, in a bundle method [8], which assumes that only one element of the subdifferential is known at any point. It is shown in [4] that, under some nondegeneracy assumptions (e. g. the gradients of the ridges which are active at x are linearly independent), any continuous piecewise linear function f can be decomposed in a neighborhood of b x into a smooth function and a sum of continuous functions having a single ridge as follows:

3

4

5

6

x) f (x) D f (b x) C g x> ˆ (x  b X xiˆ min(0; a> x)) ; C i (x  b ˆ i2A( x) n for some scalars fxiˆ g i2A(x) ˆ , and some vector g xˆ 2 R . x. Note that if We term g xˆ the restricted gradient of f at b m ridges of f are active at b x, it means that there are 2m cells in any small neighborhood of b x. The vector g xˆ and , together with the m gradients the m scalars fxiˆ g i2A(b x) , thus completely characterof the activities, fa i g i2A(b x) ize the behavior of f over the 2m cells in the neighborhood of b x! With such a decomposition at any point of Rn , an algorithm for finding a local minimum of a continuous piecewise linear function f is readily obtained, as long as we assume no degeneracy at any iterate and at any breakpoint encountered in the line search (we shall discuss later the degenerate situation):

7 END

D

Choose any x 1 2 Rn and set k 1. REPEAT Identify the activities, A(x k ), and compute d k P(g x k ), the projection of the restricted gradient onto the space orthogonal to the gradients of the activities. !  IF d k = 0 (x k is a dead point; compute a single-dropping descent direction or establish optimality), THEN Compute fu i g i2A(x k ) , the coefficients of fa i g i2A(x k ) in the linear combination of g x k in terms of the columns of A(x k ). IF u i < 0 or u i > xi k , for some i 2 A(x k ) (violated optimality condition), THEN (Drop activity i) Redefine d k = Pi (a i ), if the violated inequality found corresponds to u i  0; otherwise d k = Pi (a i ) if it is u i  xi k , where Pi is the orthogonal projector onto the space orthogonal to the gradients of all the activities but activity i. ELSE stop: x k is a local minimum of f . ENDIF ENDIF (Line search) Determine the step size ˛ k by solving min˛>0 f (x k + ˛d k ). This line search can be done from x k , moving from one break-point of f to the next, in the direction d k , until either we establish unboundedness of the objective function or the value of f starts increasing. Update x k+1 = x k + ˛ k d k ; k k + 1. REPEAT

Continuous piecewise linear minimization algorithm

Remark that in step 6, the directional derivative of the objective function in the direction dk can easily be updated from one breakpoint to the other in terms of the scalar xi , where i is the index of the ridge crossed at breakpoint x. Let us now consider the case where f is still piecewise linear but with possibly discontinuities across x) desome ridges. We term such ridges: faults, and F (b notes the faults that are active at b x.

741

742

D

Discontinuous Optimization

Note first that a (local) minimum does not always exist in the discontinuous case. Consider for example the following univariate function, having x = 0 as a fault: ( x C 1 if x  0; f (x) D x otherwise: Hence, we rather look for a local infimum. In order to find such a local infimum of a function f having some faults, we shall simply generalize the algorithm for the continuous problem by implicitly considering any discontinuity or jump across a fault i in f as the limiting case of a continuous situation. Since we are looking for a local infimum, without loss of generality we shall henceforth only consider functions f such that f (x) D lim inf f (x); x!x

in other words, we consider the lower semicontinuous envelope of f . The algorithm for the discontinuous case is essentially the same as in the continuous case except that we consider dropping an active fault from a dead point, x, only if we do so along a direction d such that lim f (x C ıd) D f (x)

ı!0C

(i. e. as ı > 0 is small, the value of f does not jump up from x to x + ıd). Thus, virtually only step 4 must be adapted from the continuous problem algorithm in order to solve the discontinuous case. To make more carefully the intuitive concept of directions jumping up or down, we define the set of soaring directions from a point b x to be: S(b x) :D

8 < :

d 2 Rn :

9 9 > 0; ı > 0 : = : 80 < i ) C C > i S (b x) :D i 2 A(b x) : and a i d > 0 ˆ C : then d i 2 S(b x) 8  if d i 2 N (A> < i )  > i S (b x) :D i 2 A(b x) : and a i d < 0 :  x) then d i 2 S(b

then the set of soaring single-dropping directions from b x are simply the directions dropping an activity i 2 SC (b x) positively and the directions dropping an i 2 S (b x) negatively (we say that activity i is dropped positively (negatively) if all current activities, except for the ith, are preserved and if, moreover, a> i d is positive (negative)). A fault can now be defined more rigorously: a positive (negative) fault of f at a point b x is a ridge i 2 R such that for any neighborhood, B(b x), of b x, there exists 0 x) with i 2 S+ (x0 ) (with a nondegenerate point x 2 B(b  0 x i 2 S (x )). The set of all positive (negative) faults at b x) F  (b x)). The set of faults of f at is denoted by F C (b a point b x is denoted by F (b x) :D F C (b x) [ F  (b x):

We modify the continuous problem algorithm in such a way that, at a nondegenerate dead point, xk , we do not need to verify the optimality conditions corresponding to soaring single-dropping directions (ui  0, i 2 S+ (xk ) and ui    i , i 2 S (xk )), so that we never consider such single-dropping directions in order to establish whether xk is optimal. This is reasonable since we are looking for a local minimum. The line-search step (step 6) is modified similarly: when we encounter a breakpoint x on a fault along a direction d 2 S(x) (jump up), we stop; while if d is such that d 2 S(x), (jump down), we carry on to the next breakpoint, and update properly the directional derivative along d. Note that one has to be careful at a ‘contact’ point xc 2 R (defined below). At xc , contrary to at other points of a fault, we can drop activity i both positively and negatively. The function f : R2 ! R, given by

f (x) D 9 > = > ; 9 = ;

;

;

8 ˆ 2x ˆ < 2 ˆ ˆ : x2

if x1 > 0 or (x1 D 0 and x2  0);

(2)

otherwise;

illustrates well the situation. Figure 1 shows the graph of f in a neighborhood of xc := (0, 0)| (the dotted lines are simply lines that could be seen if the hatched surface were transparent). The point xc is a contact point with respect to the fault x1 = 0.

Discontinuous Optimization

D

An algorithm similar to the one introduced in the continuous case, but which does not consider soaring single-dropping directions, will encounter no difficulty with the discontinuity in f at any noncontact point (e. g. for (2), at any point other than xc ). Let us assume, without loss of generality, that at the kth iterate, xk , F(xk ) = F (xk ). The only step of the continuous algorithm which need to be modified is (assuming moreover that all points encountered in the algorithm are noncontact points):

4 Formally, we define xc 2 Rn to be a contact point of f with respect to i 2 A(xc ), when i 2 F(xc ) such that either 1) i 2 F+ (xc ) \ F (xc ), or  2) there exist  + ,   2 3|R| such that  C i = 1,  i =  1 and lim

x!x c ; (x)D C

f (x) D

lim

x!x c ; (x)D 

f (x)

(continuity when crossing ridge i, which is a fault, at xc ), where (x) is the vector whose kth component is sign(rk (x)). Note that the fault x1 = 0 and the point xc = (0, 0)| satisfy both conditions 1) and 2) in the above definition of a contact point for the function f defined by (2). They however satisfy only condition 1) for the function f : R2 ! R, defined by 8 ˆ 1 if x1  0 and x2  0; ˆ ˆ ˆ ˆ ˆ ˆ ; ˆ ˆ ˆ ˆ 3 if x1 > 0 and x2 < 0; ˆ ˆ ˆ :4 otherwise; with faults x1 = 0 and x2 = 0. For the function f : R2 ! R given by ( f (x) D

x2

if x1 > 0 and x2 < 0;

0

otherwise;

with fault x1 = 0, we satisfy only condition 2) (F (xc ) is empty).

IF u i < 0 for some i 2 A(x k ), or u i > xi k for some i 2 A(x k )nF (x k ) (violated optimality condition), THEN

The paper [4] describes techniques (including perturbation) to cope with problems that occur in certain cases where the hypothesis of nondegeneracy is not satisfied at points encountered in the course of the algorithm. One cannot however extend this algorithm to deal with dead-point iterates (i. e. not encountered as breakpoint along the line search) without considering carefully the combinatorial nature of the problem of degeneracy. Nevertheless, no difficulties were encountered in the computational experiments reported in [4], although serious problems can still arise at certain singular points (contact points and dead-point iterates, at which the objective function is not decomposable). Indeed, in the discontinuous case, there is no straightforward extension of this approach to the cases where the algorithm encounters a contact point. In the continuous case, the behavior of f over two juxtaposed cells are linked. At contact points however, there is coincidence of the values of restrictions of f to subdomains not otherwise linked to each other. Let us now discuss the extension to the nonlinear case. An advantage of the active set approach for the continuous piecewise linear optimization problem, over, for example, the simplex-format algorithm of R. Fourer [6], is that it generalizes it not only to the discontinuous situation but also to the nonseparable and certain (decomposable) nonconvex cases. Above all, the active set approach is readily extendable to the nonlinear case, by adapting conventional techniques for nonlinear programming, as was done above with the projected gradient method for the (possibly discontinuous)

743

744

D

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

piecewise linear case. The definition of decomposition must first be generalized so that it expresses the first order behavior of a piecewise differentiable function in the neighborhood of a point. The piecewise linear algorithm described above used descent directions attempting to decrease the smooth part of the function while maintaining the value of its nonsmooth part (when preserving the current activities). A first order algorithm for the nonlinear case could obtain these two objectives up to first order changes, as in the approach of A.R. Conn and T. Pietrzykowski to nonlinear optimization, via an l1 exact penalty function [5]. In order to develop a second order algorithm, assuming now that f is (possibly discontinuous) piecewise twice-differentiable (i. e. twice differentiable everywhere except over a finite number of ridges), one must first extend the definition of first order decomposition to that of second order decomposition. One could then consider extending the strategies used by T.F. Coleman and Conn [2] on the exact penalty function approach to nonlinear programming (although the l1 exact penalty function involves only first order types of nondifferentiabilities – ridges). The main idea is to attempt to find a direction which minimizes the change in f (up to second order terms) subject to preserving the activities (up to second order terms). Specifically, second order conditions must be derived (which are the first order conditions plus a condition on the ‘definiteness’ of the reduced Hessian of the twice-differentiable part of f (in the second order decomposition of f )). An analog of the Newton step (or of a modification of the Newton method; cf. also  Gauss-Newton method: Least squares, relation to Newton’s method) using a nonorthogonal projection [3] is then taken (or a single-dropping direction is used). An algorithm following these lines would be expected to possess global convergence properties (regardless of starting point) and a fast (2-step superlinear) asymptotic convergence rate as in [1]. See also  Nondifferentiable Optimization References 1. Coleman TF, Conn AR (1982) Nonlinear programming via an exact penalty function: Asymptotic analysis. Math Program 24:123–136

2. Coleman TF, Conn AR (1982) Nonlinear programming via an exact penalty function: Global analysis. Math Program 24:137–161 3. Conn AR (1976) Projection matrices – A fundamental concept in optimization. In: Vogt WG, Mickle MH (eds) 7th Annual Conf. in Modelling and Simulation, Ed. April 26–27, pp 599–605 4. Conn AR, Mongeau M (1998) Discontinuous piecewise linear optimization. Math Program 80(3):315–380 5. Conn AR, Pietrzykowski T (Apr. 1977) A penalty function method converging directly to a constrained optimum. SIAM J Numer Anal 14(2):348–375 6. Fourer R (1985) A simplex algorithm for piecewise-linear programming I: Derivation and proof. Math Program, 33:204–233 7. Imo II, Leech DJ (1984) Discontinuous optimization in batch production using SUMT. Internat J Production Res 22(2):313–321 8. Lemaréchal C (1978) Bundle methods in nonsmooth optimization. In: Lemaréechal C, Mifflin R (eds) Proc. IIASA Workshop, Nonsmooth Optimization, March 28–April 8, 1977, vol 3, Pergamon, Oxford, pp 79–102 9. Zang I (1981) Discontinuous optimization by smoothing. Math Oper Res 6(1):140–152

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points KURT MARTI Federal Armed Forces, University Munich, Neubiberg, Germany MSC2000: 90C15, 90C29 Article Outline Keywords Introduction Discretely Distributed Stochastic Programs

Example: Scenario Analysis A System of Linear Relations for the Construction of Descent Directions Efficient Solutions of (1), (2) Comparison of Definitions 7 and 3 Further Characterization of ED, J Necessary Optimality Conditions Without Using (Sub)Gradients

Parametric Representation of ED, J See also References

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

Keywords Stochastic program; Stochastic optimization; Mean value function; Uncertainty; Efficient point; Efficient solution; Admissible solution; Pareto optimal solution; Partially monotonous; Partial monotonicity; Robustness; Stochastic matrix; Markov kernel; Descent direction; Necessary optimality condition without using (sub)gradients parametric representations; Scenario analysis

= xu of (1), (2) for each u 2 U. An important class U = J Cm of loss functions u is the set of partially monotonous increasing convex loss functions on Rm defined asfollows: Definition 1 Let J be a given subset of {1, . . . , m}. For ; = C m , where C m is the set of all convex J = ; we put Cm J denotes the set of functions u on Rm . If J 6D ;, then Cm m all convex functions u : R ! R having the following property: z I  w I ; z II  w II H) u(z)  u(w):

Introduction Many problems in stochastic optimization, as for instance optimal stochastic structural design problems, stochastic control problems, problems of scenario analysis, etc., can be described [3,5] by mean value minimization problems of the type ( min F(x) (1) s.t. x 2 D; where the objective function F = F u is the mean value function, defined by

Remark 2 In many cases one has loss functions u 2 J Cm with one of the following additional strict partial monotonicity property: 8 ˆ ˆ 1 is an integer, ˛ i > 0, i = 1, . . . , r, riD1 ˛ i = 1, and (Ai ;b i ) denotes the one-point measure in the given m × (n + 1) matrix (Ai , bi ), i = 1, . . . , r.

scenario-independent decisions are obviously the efficient solutions of r X  0  ˛ i c i x C u(T i x  h i ) ; (11) min x2D

Example: Scenario Analysis Given a certain planning problem, in scenario analysis [1,2,6,7,8] the future evolution or development of the system to be considered is anticipated or explored by means of a (usually small) number r (e. g., r = 3, 4, 5, 6) of so-called scenarios s1 , . . . , sr . Scenarios si , i = 1, . . . , r, are plausible alternative models of the future development given by ‘extreme points’ of a certain set of basic or key variables. An individual scenario or a certain mixture of the scenarios s1 , . . . , sr is assumed then to be revealed in the considered future time period. We assume now that the planningproblem can be described mathematically by the optimization problem (

min

c0 x

s.t.

Tx D () h;

x 2 D:

(9)

Here, D is a given convex subset of Rn , and the data (c, T, h) are given by (c, T, h) = (ci , T i , hi ) for scenario si , i = 1, . . . , r, where ci is an n-vector, T i an m × n matrix and hi an m-vector. Having written here the scenarios s1 , . . . , sr by means of (9) and the data (ci , T i , hi ), i = 1, . . . , r, and facing therefore the subproblems (

min

c i0 x

s.t.

T i x D () h i ;

x 2 D;

iD1

which is a discretely distributed stochastic optimization problem of the type (1), (2). A System of Linear Relations for the Construction of Descent Directions Fundamental for the computation of the set ED, U of efficient solutions of (1), (2) is the following construction method for descent directions of the objective function F of (1), (2), cf. [3,4]. We suppose that the true, but unknown loss function u in (1) is, see Definition 1, an elJ for some known index set J  {1, . . . , m}. ement of Cm We recall that for any vector z 2 Rm the subvectors zI , zII are defined by zI = (zi )i 2 J , zII = (zi )i 62 J ;see (3). Of course, if J = ;, then z = zII and zI does not exist. For any m × (n + 1) matrix (A, b), let (AI , bI ), (AII , bII ), resp., denote the submatrices of (A, b) having the rows (Ai , bi ) with i 2 J, i 2 {1, . . . , m}\ J, respectively. Given an n-vector x (e. g., the tth iteration point of an algorithm for solving (1), (2)), we consider, in extension of [3, system (3.1)–(3.4b)], the following system of linear relationsfor the unknowns (y, ˘ ), where y 2 Rn and ˘ = ( ij ) is an auxiliary r × r matrix: r X

(10)

i j D 1;

i; j D 1; : : : ; r;

(12)

jD1

˛j D for i = 1, . . . , r, the decision maker has then to select an appropriate decision x 2 D. Since one is unable in general to predict with certainty which scenario si will occur, scenario analysts are looking for decisions x0 which are ‘robust’ with respect to the different scenarios or ‘scenario-independent’, cf. [6,7,8]. Obviously, this robustness concept is closely related to the idea of detecting ‘similarities’ within the family of optimal solutions x (si ), i = 1, . . . , r, of the individual subproblems (10)(i) , P i = 1, . . . , r. Let ˛ 1 , . . . , ˛ r with ˛ i > 0, i = 1, . . . , r, riD1 ˛ i = 1, be (subjective) probabilities for the occurrence of s1 , . . . , sr , or weights reflecting the relative importance of s1 , . . . , sr . Considering loss functions u 2 CmJ for evaluating the violations zi = T i x  h i of the constraint T i x = hi , T i x  hi , resp., in (10)(i) , a class of robust or

i j  0;

r X

˛i i j ;

j D 1; : : : ; r;

(13)

iD1 j

j

AI y  bI 

r X ˛i i j

˛j

iD1

(AiI x  b Ii ); j D 1; : : : ; r;

j

j

A II y  b II D

r X ˛i i j iD1

˛j

(14)

(AiII x  b IIi ); j D 1; : : : ; r:

(15)

The transition probability measure j

K D

r X iD1

ˇ i j z i ;

ˇi j D

˛i i j ; ˛j

z i D Ai xb i ; (16)

is not a one-point measure for at least one j, 1  j  r.

D

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

There exists at least one j, 1  j  r, such that for all i = 1, . . . , r K j is not a one-point measure and i j > 0:

(17)

f) If J = ;, then (14) vanishes and (15) reads Aj y  b j D

r X ˛i i j iD1

˛j

(Ai x  b i ); j D 1; : : : ; r:

At least one inequality in (14) holds with < : (18)

(23)

In the special case The constraint x 2 D in (1) can be handled by adding the condition y 2 D:

(19)

Remark 4 a) By z we denote the one-point measure in a point z 2 Rm . b) According to (12), ˘ is a stochastic matrix. System (12)–(15) has always the trivial solution (y, ˘ ) = (x, Id), where Id is the r × r identity matrix. c) If (y, ˘ ) solves (12)–(15), then AI y  B I x;

AII y D B II x;

(20)

where AI D EA I (!), AII D EA II (!). d) If PAII ()ybII () denotes the probability distribution of the random (m  |J|)-vector AII (!)x bII (!), then (12), (13) and (15) mean that the distributions PAII ()y bII() and PAII ()x bII () corresponding to y, x, resp., are related by PA II ()xb II () D KPA II ()yb II () Z D K(w; )PA II ()yb II () (dw);

(21)

where K(w, ) is the Markov kernel defined by K(w j ; ) :D K j D

r X ˛i i j iD1

˛j

z i ;

(22)

j j j i i i Rwith w = A y  b , z = A x  b , i, j = 1, . . . , r. Since zK(w, dz) = w, the Markov kernel K is also calleda dilatation. e) If n-vectors x, y are related by (21), (22), then for every convex subset B  Rm  |J| we have that

PA II ()xb II () (B) D 1 H)

PA II ()xb II () (B) D 1;

hence, the distribution of AII () y bII () is concentrated to the convex hull of the support of PAII ()x bII ().

j

j

(A I ; b I ) D (AI ; b I ) for all j D 1; : : : ; r;

(24)

i. e., if (AI (!), bI (!)) is constant with probability one, then (14) is reduced, cf. (20), to AI y  AI x:

(25)

The meaning of (12)–(15) and the additional conditions (16)–(18) for the basic mean value minimization problem (1), (2) with objective function F is summarized in the next result. Theorem 5 Let J be any fixed subset of {1, . . . , m}. a) If (y, II) is a solution of (12)–(15), then F(y)  F(x) J . For J = ; also the converse holds: If for every u 2 Cm there is a vector y such that F(y)  F(x) for all u 2 Cm (C;m ), then there exists an r × r matrix II such that (y, II) satisfies (12), (13) and (23). b) If (y, II) is a solution of (12)–(15) and (16), then J which is strictly convex F(y)< F(x) for every u 2 Cm i on conv{z : 1  i  r}. c) If (y, II) is a solution of (12)–(15) and (17), then F(y) J which is not affine-linear on < F(x) for every u 2 Cm i conv{z : 1  i  r}. d) If (y, II) fulfills (12)–(15) and (18), then F(y) < F(x) J satisfying (4). for every u 2 Cm Proof If x and (y, II) are related by (12)–(15), then F(y) P P J .  rjD1 ˛ j u( riD1 ˇ ij zi ) for every u 2 Cm If x, (y, II) are related by (12)–(15) and (18), then P P J fulfilling F(y) < rjD1 ˛ j u( riD1 ˇ ij zi ) for every u 2 Cm (4). The rest can then be shown as in [3, Thm. 2.2]. A simple, but important consequence of the above theorem is stated next: Corollary 6 For given x 2 Rn or x 2 D let (y, II) be any solution of (12)–(15) such that y 6D x, y 2 D { x}, respectively. a) Then h = y  x is a descent direction, a feasible deJ such scent direction, resp., of F at x for every u 2 Cm that F is not constant on the line segment xy joining x and y.

747

748

D

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

b) If (y, II) fulfills also (16), (17), (18), resp., then h = y  x is a (feasible) descent direction of F at x for every J which is strictly convex on conv {z i :1  i  r}, u 2 Cm is not affine-linear on conv { zi : 1  i  r}, fulfills (4), respectively. Efficient Solutions of (1), (2) In the following we suppose that the unknown loss J , where J is a given subfunction u is an element of Cm set of {1, . . . , m}. For a given point x 2 D, the descent direction-finding procedure described in the previous section only can fail completely if for each solution (y, II) of (12)–(15) with x 2 D we have that Aj y D Aj x

for each j D 1; : : : ; r:

(26)

Indeed, in this case we either have y = x, or, for arbitrary loss functions u, the objective function F of (1), (2) is constant on the whole line through the points x, y. This observation suggests the following basic efficiency concept. J )-efficient Definition 7 A point x 2 D is called a (Cm J point or a (Cm )-efficient solution of (1), (2) if and only if for each solution (y, II) of (12)–(15) with y 2 D we have that Aj y = Aj x for each j = 1, . . . , r, i. e., A(!)x = A(!)y with probability 1. Let ED, J denote the set of all efficient points of (1), (2).

For deriving parametric representations of ED, J , we need the following definitions and lemmas. For a given n-vector x and zi = Ai x bi , i = 1, . . . , r, let S = Sx  { 1, . . . , r} with |S| = s be an index set such that {zi : 1  i  r } = {zi : i 2 S}, where zi 6D zj for i, j 2 S, i 6D j. Defining for i 2 S, j = 1, . . . , r, the quantities ˛ei :D

X

1 X ˛t t j ;  i j :D ˛ei t i z Dz

˛ei  i j f ˇ ; i j :D ˛j

(27)

we find that relations (12)–(15) can also be represented by  i j D 1;

jD1

˛j D

X i2S

 i j  0;

e ˛ i i j ;

j

X

e ˇ i j z Ii ; j D 1; : : : ; r; i2S X j j e ˇ i j z IIi ; j D 1; : : : ; r: A II y  b II D

(30) (31)

i2S

For the next lemma we still need the s × r matrix T 0 = ( 0i j ) defined by (  i0j

D

0

if z i ¤ z j ;

˛j e ˛i

if z j D z i ;

for i 2 S; j D 1; : : : ; r: (32)

Lemma 8 Let (y, II) be a solution of (12)–(15), and let T = T(II) = ( ij ) be the s × r matrix having the elements  ij given by (27). If (26) holds, then T(II) = T 0 and (14) holds with ‘ = ’. Lemma 8 implies the following important property of efficient solutions: Corollary 9 Let x 2 D be an efficient solution of (1), (2). If (y, II) is any solution of (21)–(22) with y 2 D, then T(II) = T 0 and (14) holds with ‘ = ’. For J = ; we obtain the set ED := ED, ; of all Cm -efficient solutions of (1), (2). This set is studied in [3]. An important relationship between ED and ED , J for any J  {1, . . . , m} is given next: Lemma 10 ED, J  ED for every J  {1, . . . , m}. Comparison of Definitions 7 and 3 Comparing the efficient solutions according to Definition 7 and the nondominated solutions according to Definition 3, first for J = ;, i. e., U = Cm , we find thefollowing correspondence: Theorem 11 ED, ; = E(0) D;C m . The next corollary follows immediately from the above theorem and Lemma 10.

˛t ;

z t Dz i

r X

j

AI y  bI 

j D 1; : : : ; r;

j D 1; : : : ; r;

i 2 S; (28) (29)

Corollary 12 ED, J  ED, ; = E(0) D;C m for J  { 1, . . . , m}. J we have this inclusion: Considering now U = C m J (0) Theorem 13 ED, J  ED , C m for J  {1, . . . , m}.

The following inclusion follows from Corollary 12 and Theorem 13. Corollary 14 E(0)

J D;C m

 ED, J  E(0) D;C m for J  {1, . . . , m}.

A converse statement to Theorem 13 can be obtained for (24):

D

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

Theorem 15 If (A I (!); b I (!)) D (AI ; b I ) with probability 1, then ED, J = E(0) J for each J  {1, . . . , m}. D;C m

Further Characterization of ED, J J The C m -efficiency of a point x 2 D can also be described

in the following way. J )-efficient if and only Theorem 16 A point x 2 D is (C m if for every solution (y, II) of (12)-(15) we have that Aj y = Aj x for all j = 1, . . . , r, or h = y  x is not a feasible direction for D at x.

Necessary Optimality Conditions Without Using (Sub)Gradients If x 2 D is efficient, then, cf. Theorem 16, the descent direction-finding method described in in the previous Section fails at x. Since especially in any optimal solution x of (1), (2) no feasible descent direction may exist, efficient points are candidates for optimal solutions: Theorem 17 Suppose that for every x 2 D and every solution (y, II) of (12)-(15) with y 2 D the objective funcJ is constant tion F of (1), (2) with a loss function u 2 C m j j on the line segment xy if and only if A y = A x for every j = 1, . . . , r. If x is an optimal solution of (1), (2), then x 2 ED, J . Remark 18 The assumption in Theorem 17 concerning J is strictly convex on the F is fulfilled, e. g., if u 2 C m j j convex hull conv {(A y b )(Aj x bj ): x, y 2 D, 1  j  r} generated by the line segments (Aj y bj )(Aj x  bj ) joining (Aj y  bj ) and (Aj x  bj ).

(12)–(15) and (16)–(18), resp., we may use, see Theorem 5 and Corollary 6, the quadratic program, cf. [3,4], 8 ˆ ˆ ˆ min ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ s.t. ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ < ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ :

0 (AI y  AI x) C

r X

˛j

jD1 r X

 i j D 1;

X

e ˇ 2i j

i2S

 i j  0;

jD1

j D 1; : : : ; r; i 2 S; X (33) e ˛ i  i j ; j D 1; : : : ; r; ˛j D i2S X j j e AI y  bI  ˇ i j z Ii ; j D 1; : : : ; r; i2S X j j e A II y  b II D ˇ i j z IIi ; j D 1; : : : ; r; i2S

y 2 D;

where  = (l ) is a |J|-vector having fixed positive components l , l 2 J. Efficient solutions of (1), (2) can be characterized as follows: Lemma 20 A vector x 2 D is an efficient solution of (1), (2) if and only if (33) has an optimal solution (y , T  ) such that Aj y = Aj x for all j = 1, . . . , r. Remark 21 According to Lemma 8 we have then also that T  = T 0 and (14) holds with ‘ = ’. We suppose now that the feasible domain D of (1), (2) is given by D D fx 2 Rn : g k (x)  0; k D 1; : : : ; g :

(34)

If the assumption in Theorem 17 concerning F does not hold, then it may happen that F is constant on a certain line segment xy though Aj y 6D Aj x for at least one index j, 1  j  r. Hence, Theorem 17 can not be applied then directly. However, in this case the following modification of Theorem 17 holds true.

Here, g 1 , . . . , g  are differentiable, convex functions. Moreover, we suppose that (33) has a feasible solution (y, T) such that for each nonaffine linear function g k

Theorem 19 Let u be an arbitrary loss function from J for some J  {1, . . . , m}. If D is a compact convex Cm subset of Rn , then there exists at least one optimal solution x of (1), (2) lying in the closure E D;J of the set ED, J of efficient solutions of (1), (2).

No constraint qualifications are needed in the important special case D = {x 2 Rn : Gx  g}, where (G, g) is a given × (n + 1) matrix. By means of the Kuhn-Tucker conditions of (33), the following parametric representation of ED, J can be derived [3,4]:

Parametric Representation of ED, J

Theorem 22 Let D be given by (34), and assume that the constraint qualification (35) holds for every x 2 D. An n-vector x is an efficient solution of (1), (2) if and

J Suppose that u 2 C m for some index set J  { 1, . . . , m}. For solving the descent direction-generating relations

g k (y) < 0:

(35)

749

750

D

Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

only if x satisfies the linear relations   j  i 

  j  i 

b j b i  ˛j ˛i

b j b i  ˛j ˛i

0

0

zi D 2

zi 



1 1  ˛i ˛j

 ;

if z i D z j ;

(36)

if z i ¤ z j ;

(37)

2 ; ˛i

where 1 , . . . , r are arbitrary real parameters, and the parameter m-vectors  1 , . . . ,  r and further parameter vectors  2 R , y 2 Rn are selected such that r X

0

Aj b j C

jD1

 jI  0;

 X

 k r g k (y) D 0;

(38)

kD1

j D 1; : : : ; r;

g k (x)  0;

k D 1; : : : ; ;

g k (y)  0;

 k g k (y) D 0;

(39) (40)  k  0;

(41)

k D 1; : : : ; ; A j y D A j x;

j D 1; : : : ; r;

j D and the vectors b  j are defined by b . . . , r.

(42) ˛ j C j l   jII

, j = 1,

See also  Approximation of Extremum Problems with Probability Functionals  Approximation of Multivariate Probability Integrals  Extremum Problems with Probability Functions: Kernel Type Solution Methods  General Moment Optimization Problems  Logconcave Measures, Logconvexity  Logconcavity of Discrete Distributions  L-shaped Method for Two-Stage Stochastic Programs with Recourse  Multistage Stochastic Programming: Barycentric Approximation  Preprocessing in Stochastic Programming  Probabilistic Constrained Linear Programming: Duality Theory  Probabilistic Constrained Problems: Convexity Theory  Simple Recourse Problem: Dual Method

 Simple Recourse Problem: Primal Method  Stabilization of Cutting Plane Algorithms for Stochastic Linear Programming Problems  Static Stochastic Programming Models  Static Stochastic Programming Models: Conditional Expectations  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Stochastic Linear Programming: Decomposition and Cutting Planes  Stochastic Linear Programs with Recourse and Arbitrary Multivariate Distributions  Stochastic Network Problems: Massively Parallel Solution  Stochastic Programming: Minimax Approach  Stochastic Programming Models: Random Objective  Stochastic Programming: Nonanticipativity and Lagrange Multipliers  Stochastic Programming with Simple Integer Recourse  Stochastic Programs with Recourse: Upper Bounds  Stochastic Quasigradient Methods in Minimax Problems  Stochastic Vehicle Routing Problems  Two-Stage Stochastic Programming: Quasigradient Method  Two-Stage Stochastic Programs with Recourse References 1. Carroll JM (ed) (1995) Scenario-based design. WileyNew York , New York 2. Chandler J, Cockle P (1982) Techniques of scenario planning. McGraw-Hill, New York 3. Marti K (1988) Descent directions and efficient solutions in discretely distributed stochastic programs. Lecture Notes Economics and Math Systems, vol 299. Springer, Berlin 4. Marti K (1992) Computation of efficient solutions of discretely distributed stochastic optimization problems. ZOR 36:259–294 5. Marti K (1996) Stochastic optimization methodsin engineering. In: Dolezal J, Fidler J (eds) System Modelling and Optimization. Chapman and Hall, London, pp 75–87 6. Reibnitz Uvon (1988) Scenario techniques. McGraw-Hill, New York 7. Steinsiek E, Knauer P (1981) Szenarien als Instrument der Umweltplanung. Angewandte Systemanalyse 2:10–19 8. Zentner RD (1975) Scenarios: A new tool for corporate planners. Chem and Eng News, Internat Ed 53:22–34

Discrete Stochastic Optimization

Discrete Stochastic Optimization GEORG PFLUG University Vienna, Vienna, Austria MSC2000: 90C15, 90C27 Article Outline Keywords Stochastic Simulated Annealing Stochastic Branch and Bound See also References Keywords Derivatives; Stochastic optimization A stochastic combinatorial optimization problem is of the form Z 8 b

751

752

D

Discrete Stochastic Optimization

is typically much smaller for such a choice than with samples taken independently for each xj (see [6]). If the estimates b F are difficult to get (e. g. they need real observation or expensive simulation) allocation rules decide, which estimate or which set of estimates has to be taken next. These rules try to exclude quickly subsets of the feasible set, which – with high statistical evidence – do not contain optimal solutions. The effort is then concentrated on the (shrinking) set of not yet excluded points.Allocation rules may be based on subset selection (see [3]) or ordinal optimization (see [5]). There is also a connection to experimental design, in particular to sequential experimental design: In experimental design one has to choose the next point(s) for sampling, which – based on the information gathered so far – will give the best additional information which we need to solve the underlying estimation or optimization problem (for experimental design literature see [1] and the references therein). For large sets S, which have graph-neighborhood or partition structures, ‘stochastic’ variants of neighbor search or branch and bound methods may be used. In particular, stochastic simulated annealing and stochastic branch and bound have been studied in literature. Stochastic Simulated Annealing This is a variant of ordinary simulated annealing (cf.  Simulated annealing): The Metropolis rule for the acceptance probability is calculated on the basis of the current stochastic estimates of the objective function, i. e. the new state xj is preferred to the current state xi with probability ! ! b F n (x j )  b F n (x i ) ;1 min exp  kB T where kB is the Boltzmann constant and T is the temperature. The estimates b F are improved in each step by taking additional observations, i. e. increasing the sample size n. For an analysis of this algorithm see [4]. Stochastic Branch and Bound For the implementation of a stochastic branch and bound method (cf. also  Integer programming: Branch and bound methods), an estimate of a lower bound function is needed. Recall that a function F, defined on the subsets of S, is called a lower bound function

if inf fF(x) : x 2 Tg  F(T) for all T S. In stochastic branch and bound an estimate b F of F can be found for instance by sampling b F n (x i ) for each F n (x i )) D F(x i ) and setting xi in T with E(b n o b F(T) D inf b F n (x i ) : x i 2 T : The bound-step of the branch and bound method is replaced by a statistical test, whether the lower bound estimate of a branch is significantly larger than the estimate of an intermediate solution. After each step, all estimates are improved by taking additional observations. For details see [2] and [7]. In all these algorithms, common random numbers may decrease the variance. See also  Derivatives of Markov Processes and Their Simulation  Derivatives of Probability and Integral Functions: General Theory and Examples  Derivatives of Probability Measures  Optimization in Operation of Electric and Energy Power Systems References 1. Chernoff H (1989) Sequential analysis and optimal design. SIAM, Philadelphia ´ 2. Ermoliev YM, Norkin VI, Ruszczynnski A (1998) On optimal allocation of indivisibles under uncertainty. Oper Res 46:381–395 3. Futschik A, Pflug GCh (1997) Optimal allocation of simulation experiments in discrete stochastic optimization and approximative algorithms. Europ J Oper Res 101:245–260 4. Gutjahr W, Pflug G (1996) Simulated annealing for noisy cost functions. J Global Optim 8:1–13 5. Ho YC, Sreenivas RS, Vakili P (1992) Ordinal optimization of DEDS. J Discret Event Dynamical Systems 2:61–88 6. Kleywegt AJ, Shapiro A (1999) The sample average approximation method for stochastic discrete optimization. Georgia Inst Technol, Atlanta, GA ´ 7. Norkin VI, Pflug GCh, Ruszczynski A (1998) A branch and bound method for stochastic global optimization. Math Program 83:425–450

Disease Diagnosis: Optimization-Based Methods

Disease Diagnosis: Optimization-Based Methods EVA K. LEE, TSUNG-LIN WU Center for Operations Research in Medicine and HealthCare, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA MSC2000: 90C05, 90C06, 90C10, 90C11, 90C20, 90C30, 90C90, 90-08, 65K05 Article Outline Abstract Introduction Pattern Recognition, Discriminant Analysis, and Statistical Pattern Classification Supervised Learning, Training, and Cross-Validation Bayesian Inference and Classification Discriminant Functions

Mathematical Programming Approaches Linear Programming Classification Models Mixed Integer Programming Classification Models Nonlinear Programming Classification Models Support Vector Machine

Mixed Integer Programming Based Multigroup Classification Models and Applications to Medicine and Biology Discrete Support Vector Machine Predictive Models Classification Results for Real-World Biological and Medical Applications Further Advances

Progress and Challenges Other Methods Summary and Conclusion References

D

ing predictive rules for large heterogeneous biological and medical data sets. Our predictive model simultaneously incorporates (1) the ability to classify any number of distinct groups; (2) the ability to incorporate heterogeneous types of attributes as input; (3) a highdimensional data transformation that eliminates noise and errors in biological data; (4) the ability to incorporate constraints to limit the rate of misclassification, and a reserved-judgment region that provides a safeguard against overtraining (which tends to lead to high misclassification rates from the resulting predictive rule); and (5) successive multistage classification capability to handle data points placed in the reservedjudgment region. To illustrate the power and flexibility of the classification model and solution engine, and its multigroup prediction capability, application of the predictive model to a broad class of biological and medical problems is described. Applications include the differential diagnosis of the type of erythemato-squamous diseases; predicting presence/absence of heart disease; genomic analysis and prediction of aberrant CpG island methylation in human cancer; discriminant analysis of motility and morphology data in human lung carcinoma; prediction of ultrasonic cell disruption for drug delivery; identification of tumor shape and volume in treatment of sarcoma; multistage discriminant analysis of biomarkers for prediction of early atherosclerois; fingerprinting of native and angiogenic microvascular networks for early diagnosis of diabetes, aging, macular degeneracy, and tumor metastasis; prediction of protein localization sites; and pattern recognition of satellite images in classification of soil types. In all these applications, the predictive model yields correct classification rates ranging from 80 to 100%. This provides motivation for pursuing its use as a medical diagnostic, monitoring and decision-making tool.

Abstract

Introduction

In this chapter, we present classification models based on mathematical programming approaches. We first provide an overview of various mathematical programming approaches, including linear programming, mixed integer programming, nonlinear programming, and support vector machines. Next, we present our effort of novel optimization-based classification models that are general purpose and suitable for develop-

Classification is a fundamental machine learning task whereby rules are developed for the allocation of independent observations to groups. Classic examples of applications include medical diagnosis – the allocation of patients to disease classes on the basis of symptoms and laboratory tests – and credit screening – the acceptance or rejection of credit applications on the basis of applicant data. Data are collected concerning observa-

753

754

D

Disease Diagnosis: Optimization-Based Methods

tions with known group membership. These training data are used to develop rules for the classification of future observations with unknown group membership. In this introduction, we briefly describe some terminologies related to classification, and provide a brief description of the organization of this chapter. Pattern Recognition, Discriminant Analysis, and Statistical Pattern Classification Cognitive science is the science of learning, knowing, and reasoning. Pattern recognition is a broad field within cognitive science, which is concerned with the process of recognizing, identifying, and categorizing input information. These areas intersect with computer science, particularly in the closely related areas of artificial intelligence, machine learning, and statistical pattern recognition. Artificial intelligence is associated with constructing machines and systems that reflect human abilities in cognition. Machine learning refers to how these machines and systems replicate the learning process, which is often achieved by seeking and discovering patterns in data, or statistical pattern recognition. Discriminant analysis is the process of discriminating between categories or populations. Associated with discriminant analysis as a statistical tool are the tasks of determining the features that best discriminate between populations, and the process of classifying new objects on the basis of these features. The former is often called feature selection and the latter is referred to as statistical pattern classification. This work will be largely concerned with the development of a viable statistical pattern classifier. As with many computationally intensive tasks, recent advances in computing power have led to a sharp increase in the interest and application of discriminant analysis techniques. The reader is referred to Duda et al. [25] for an introduction to various techniques for pattern classification, and to Zopounidis and Doumpos [121] for examples of applications of pattern classification. Supervised Learning, Training, and Cross-Validation An entity or observation is essentially a data point as commonly understood in statistics. In the framework of statistical pattern classification, an entity is a set

of quantitative measurements (or qualitative measurements expressed quantitatively) of attributes for a particular object. As an example, in medical diagnosis an entity could be the various blood chemistry levels of a patient. With each entity is associated one or more groups (or populations, classes, categories) to which it belongs. Continuing with the medical diagnosis example, the groups could be the various classes of heart disease. Statistical classification seeks to determine rules for associating entities with the groups to which they belong. Ideally, these associations align with the associations that human reasoning would produce on the basis of information gathered on objects and their apparent categories.

Supervised learning is the process of developing classification rules based on entities for which the classification is already known; the process implies that the populations are already well defined. Unsupervised learning is the process of discovering patterns from unlabeled entities and thereby discovering and describing the underlying populations. Models derived using supervised learning can be used for both functions of discriminant analysis – feature selection and classification. The model that we consider is a method for supervised learning, so we assume that populations are previously defined.

The set of entities with known classification that is used to develop classification rules is the training set. The training set may be partitioned so that some entities are withheld during the model-development process, also known as the training of the model. The withheld entities form a test set that is used to determine the validity of the model, a process known as cross-validation. Entities from the test set are subjected to the rules of classification to measure the performance of the rules on entities with unknown group membership.

Validation of classification models is often performed using m-fold cross-validation, where the data with known classification are partitioned into m folds (subsets) of approximately equal size. The classification model is trained m times, with a different fold withheld during each run for testing. The performance of the model is evaluated by the classification accuracy on the m test folds, and can be represented using a classification matrix or confusion matrix. The classification matrix is a square matrix with the number of rows and columns equal to the number of groups. The ijth entry of the classification matrix contains the number or proportion of test entities from group i that were classified by the model as belonging to group j. Therefore, the number or proportion of correctly classified entities is contained in the diagonal elements of the classification matrix, and the number or proportion of misclassified entities is in the off-diagonal entries.

Bayesian Inference and Classification

The popularity of Bayesian inference has risen drastically over the past several decades, perhaps in part owing to its suitability for statistical learning. The reader is referred to O'Hagan [92] for a thorough treatment of Bayesian inference. Bayesian inference is usually contrasted with classical inference, though in practice they often imply the same methodology. The Bayesian method relies on a subjective view of probability, as opposed to the frequentist view upon which classical inference is based [92]. A subjective probability describes a degree of belief in a proposition held by the investigator based on some information. A frequency probability describes the likelihood of an event given an infinite number of trials.

In Bayesian statistics, inferences are based on the posterior distribution. The posterior distribution is the product of the prior probability and the likelihood function. The prior probability distribution represents the initial degree of belief in a proposition, often before empirical data are considered. The likelihood function describes the likelihood that the behavior is exhibited, given that the proposition is true. The posterior distribution describes the likelihood that the proposition is true, given the observed behavior. Suppose we have a proposition or random variable θ about which we would like to make inferences, and data x. Application of Bayes's theorem gives

  dF(θ|x) = dF(θ) dF(x|θ) / dF(x).

Here, F denotes the (cumulative) distribution function. For ease of conceptualization, assume that F is differentiable; then dF = f, and the above equality can be rewritten as

  f(θ|x) = f(θ) f(x|θ) / f(x).

For classification, a prior probability function π(g) describes the likelihood that an entity is allocated to group g regardless of its exhibited feature values x. A group density function f(x|g) describes the likelihood that an entity exhibits certain measurable attribute values, given that it belongs to population g. The posterior distribution for a group, P(g|x), is given by the product of the prior probability and the group density function, normalized over the groups to obtain a unit probability over all groups. The observation x is allocated to group h if

  h = arg max_{g∈G} P(g|x) = arg max_{g∈G} [ π(g) f(x|g) / Σ_{j∈G} π(j) f(x|j) ],

where G denotes the set of groups.
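For illustration, the following minimal Python sketch performs m-fold cross-validation and prints the resulting classification (confusion) matrix; the use of scikit-learn, the iris data, and m = 5 are assumptions made for this example, not part of the methods surveyed here.

    # m-fold cross-validation with a confusion matrix (illustrative sketch).
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import KFold, cross_val_predict
    from sklearn.metrics import confusion_matrix

    X, y = load_iris(return_X_y=True)       # entities with known group membership
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    # The model is trained m times; a different fold is withheld for testing each run.
    y_hat = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=folds)
    # Entry (i, j) counts group-i test entities classified into group j;
    # diagonal entries are correct classifications, off-diagonal entries are errors.
    print(confusion_matrix(y, y_hat))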

Discriminant Functions

Most classification methods can be described in terms of discriminant functions. A discriminant function takes as input an observation and returns information about the classification of the observation. For data from a set of groups G, an observation x is assigned to group h if

  h = arg max_{g∈G} l_g(x),

where the functions l_g are the discriminant functions. Classification methods restrict the form of the discriminant functions, and training data are used to determine the values of the parameters that define the functions.

The optimal classifier in the Bayesian framework can be described in terms of discriminant functions. Let π_g = π(g) be the prior probability that an observation is allocated to group g, and let f_g(x) = f(x|g) be the likelihood that data x are drawn from population g. If we wish to minimize the probability of misclassification given x, then the optimal allocation for an entity is to the group

  h = arg max_{g∈G} P(g|x) = arg max_{g∈G} [ π_g f_g(x) / Σ_{j∈G} π_j f_j(x) ].

Under the Bayesian framework,

  P(g|x) = π_g f(x|g) / f(x) = π_g f(x|g) / Σ_{j∈G} π_j f(x|j).

The discriminant functions can be l_g(x) = P(g|x) for g ∈ G. The same classification rule is given by l_g(x) = π_g f(x|g) and by l_g(x) = log f(x|g) + log π_g. The problem then becomes finding the form of the prior functions and likelihood functions that match the data.
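As a concrete sketch of this rule, the following Python fragment allocates an observation via l_g(x) = log f(x|g) + log π_g, assuming Gaussian group densities with a shared covariance; all numerical values are illustrative, not taken from this article.

    # Bayesian allocation through discriminant functions (illustrative sketch).
    import numpy as np
    from scipy.stats import multivariate_normal

    def allocate(x, means, cov, priors):
        # l_g(x) = log f(x|g) + log pi_g; allocate to the maximizing group.
        scores = [multivariate_normal.logpdf(x, mean=m, cov=cov) + np.log(p)
                  for m, p in zip(means, priors)]
        return int(np.argmax(scores))

    means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    cov = np.eye(2)                          # shared covariance matrix
    priors = [0.7, 0.3]                      # estimated prior probabilities
    print(allocate(np.array([1.2, 0.4]), means, cov, priors))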


If the data are multivariate normal with equal covariance matrices (f(x|g) ~ N(μ_g, Σ)), then a linear discriminant function (LDF) is optimal:

  l_g(x) = log f(x|g) + log π_g
         = −(1/2)(x − μ_g)^T Σ^{-1} (x − μ_g) − (1/2) log |Σ| − (d/2) log 2π + log π_g
         = w_g^T x + w_g0,

where d is the number of attributes, w_g = Σ^{-1} μ_g, and w_g0 = −(1/2) μ_g^T Σ^{-1} μ_g + log π_g − (1/2) x^T Σ^{-1} x − (d/2) log 2π. Note that the last two terms of w_g0 are constant for all g and need not be calculated.

When there are two groups (G = {1, 2}) and the priors are equal (π_1 = π_2), the discriminant rule is equivalent to Fisher's linear discriminant rule [30]. Fisher's rule can also be derived, as it was by Fisher, by choosing w so that (w^T μ_1 − w^T μ_2)² / (w^T Σ w) is maximized. These LDFs and quadratic discriminant functions (QDFs) are often applied to data sets that are not multivariate normal or continuous (see pp. 234–235 in [98]) by using approximations for the means and covariances. Regardless, these models are parametric in that they incorporate assumptions about the distribution of the data. Fisher's LDF is nonparametric because no assumptions are made about the underlying distribution of the data. Thus, for a special case, a parametric and a nonparametric model coincide to produce the same discriminant rule. The LDF derived above is also called the homoscedastic model, and the QDF is called the heteroscedastic model. The exact form of discriminant functions in the Bayesian framework can be derived for other distributions [25].

Some classification methods are essentially methods for finding coefficients for LDFs. In other words, they seek coefficients w_g and constants w_g0 such that l_g(x) = w_g^T x + w_g0, g ∈ G, is an optimal set of discriminant functions. The criteria for optimality differ between methods. LDFs project the data onto a linear subspace and then discriminate between entities in that subspace. For example, Fisher's LDF projects two-group data onto an optimal line, and discriminates on that line. A good linear subspace may not exist for data with overlapping distributions between groups, and such data will not be classified accurately by these methods. The hyperplanes defined by the discriminant functions form boundaries between the group regions. A large portion of the literature concerning the use of mathematical programming models for classification describes methods for finding coefficients of LDFs [121].

Other classification methods seek to determine parameters to establish QDFs. The general form of a QDF is l_g(x) = x^T W_g x + w_g^T x + w_g0. The boundaries defining the group regions can assume any hyperquadric form, as can the Bayes decision rules for arbitrary multivariate normal distributions [25].

In this paper, we survey the development and advances of classification models via mathematical programming techniques, and summarize our experience with classification models applied to prediction in biological and medical applications. The rest of this chapter is organized as follows. Section "Mathematical Programming Approaches" first provides a detailed overview of the development and advances of mathematical programming based classification models, including linear programming (LP), mixed integer programming (MIP), nonlinear programming, and support vector machine (SVM) approaches. In Sect. "Mixed Integer Programming Based Multigroup Classification Models and Applications to Medicine and Biology", we describe our effort in developing optimization-based multigroup multistage discriminant analysis predictive models for classification. The use of the predictive models for various biological and medical problems is presented. Section "Progress and Challenges" provides several tables to summarize the progress of mathematical programming based classification models and their characteristics. This is followed by a brief description of other classification methods in Sect. "Other Methods", and by a summary and concluding remarks in Sect. "Summary and Conclusion".
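Before turning to the mathematical programming formulations, here is a small sketch, under the same Gaussian assumptions as above, of computing the homoscedastic LDF coefficients w_g = Σ^{-1} μ_g and the group-dependent part of w_g0; the numerical data are illustrative.

    # Homoscedastic LDF coefficients (illustrative sketch).
    import numpy as np

    def ldf_coefficients(mu, sigma, prior):
        w = np.linalg.solve(sigma, mu)           # w_g = Sigma^{-1} mu_g
        w0 = -0.5 * mu @ w + np.log(prior)       # terms constant across g are dropped
        return w, w0

    sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # shared covariance
    groups = [(np.array([0.0, 0.0]), 0.5), (np.array([2.0, 1.0]), 0.5)]
    x = np.array([1.0, 0.5])
    scores = [w @ x + w0
              for w, w0 in (ldf_coefficients(m, sigma, p) for m, p in groups)]
    print(int(np.argmax(scores)))                # index of the allocated group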

Mathematical Programming Approaches

Mathematical programming methods for statistical pattern classification emerged in the 1960s, gained popularity in the 1980s, and have grown drastically since. Most of the mathematical programming approaches are nonparametric, which has been cited as an advantage when analyzing contaminated data sets over methods that require assumptions about the distribution of the data [107]. Most of the literature about mathematical programming methods is concerned either with using mathematical programming to determine the coefficients of LDFs or with support vector machines (SVMs).

The following notation will be used. The subscripts i, j, and k are used for the observation, attribute, and group, respectively. Let x_ij be the value of attribute j of observation i. Let m be the number of attributes, K be the number of groups, G_k represent the set of data from group k, M be a big positive number, and ε be a small positive number. The abbreviation "urs" is used in reference to a variable to denote "unrestricted in sign."

Linear Programming Classification Models

The use of linear programs to determine the coefficients of LDFs has been widely studied [31,46,50,74]. The methods determine the coefficients for different objectives, including minimizing the sum of the distances to the separating hyperplane, minimizing the maximum distance of an observation to the hyperplane, and minimizing other measures of badness of fit or maximizing measures of goodness of fit.

Two-Group Classification

One of the earliest LP classification models was proposed by Mangasarian [74] to construct a hyperplane to separate two groups of data. Separation by a nonlinear surface using LP was also proposed when the surface parameters appear linearly. Two sets of points may be inseparable by one hyperplane or surface through a single-step LP approach, but they can be strictly separated by more planes or surfaces via a multistep LP approach [75]. In [75] real problems with up to 117 data points, ten attributes, and three groups were solved. The three-group separation was achieved by separating group 1 from groups 2 and 3, and then group 2 from group 3.

Studies of LP models for the discriminant problem in the early 1980s were carried out by Hand [47], Freed and Glover [31,32], and Bajgier and Hill [5]. Three LP models for the two-group classification problem were proposed: minimizing the sum of deviations (MSD), minimizing the maximum deviation (MMD), and minimizing the sum of interior distances (MSID). Freed and Glover [33] provided computational studies of these models where the test conditions involved normal and nonnormal populations.

MSD:
  Minimize    Σ_i d_i
  subject to  w_0 + Σ_j x_ij w_j − d_i ≤ 0   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d_i ≥ 0   ∀i ∈ G_2,
              w_j urs ∀j,
              d_i ≥ 0 ∀i.

MMD:
  Minimize    d
  subject to  w_0 + Σ_j x_ij w_j − d ≤ 0   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d ≥ 0   ∀i ∈ G_2,
              w_j urs ∀j,
              d ≥ 0.

MSID:
  Minimize    pd − Σ_i e_i
  subject to  w_0 + Σ_j x_ij w_j − d + e_i ≤ 0   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d − e_i ≥ 0   ∀i ∈ G_2,
              w_j urs ∀j,
              d ≥ 0, e_i ≥ 0 ∀i,

where p is a weight constant. The objective function of the MSD model is the L1-norm distance, while the objective function of MMD is the L∞-norm distance. They are special cases of Lp-norm classification [50,108].

In some models the constant term of the hyperplane is a fixed number instead of a decision variable. The model minimizing the sum of deviations with a constant cutoff score, MSD0, shown below is an example where the cutoff score b replaces w_0 in the formulation. The same replacement could be used in the other formulations.

MSD0:
  Minimize    Σ_i d_i
  subject to  Σ_j x_ij w_j − d_i ≤ b   ∀i ∈ G_1,
              Σ_j x_ij w_j + d_i ≥ b   ∀i ∈ G_2,
              w_j urs ∀j,
              d_i ≥ 0 ∀i.
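For illustration, the MSD model, with the unit gap discussed in the next paragraph, can be solved with an off-the-shelf LP solver. The following Python sketch uses scipy.optimize.linprog; the two toy groups and the gap value are assumptions made for this example.

    # MSD with a unit gap, solved as a linear program (illustrative sketch).
    import numpy as np
    from scipy.optimize import linprog

    X1 = np.array([[0.0, 0.0], [0.2, 0.5]])      # group G1 observations
    X2 = np.array([[2.0, 1.5], [1.8, 2.2]])      # group G2 observations
    n1, n2, m = len(X1), len(X2), X1.shape[1]

    # Variables: [w0, w_1..w_m, d_1..d_{n1+n2}]; minimize the sum of deviations.
    c = np.concatenate([np.zeros(1 + m), np.ones(n1 + n2)])
    rows = []
    for i, x in enumerate(X1):                   # w0 + x.w - d_i <= -1
        d = np.zeros(n1 + n2); d[i] = -1.0
        rows.append(np.concatenate([[1.0], x, d]))
    for i, x in enumerate(X2):                   # -(w0 + x.w) - d_i <= -1
        d = np.zeros(n1 + n2); d[n1 + i] = -1.0
        rows.append(np.concatenate([[-1.0], -x, d]))
    A_ub, b_ub = np.array(rows), -np.ones(n1 + n2)
    bounds = [(None, None)] * (1 + m) + [(0, None)] * (n1 + n2)  # w urs, d >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(res.x[: 1 + m])                        # w0 and the LDF coefficients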

A gap can be introduced between the two regions determined by the separating hyperplane to prevent degenerate solutions. Take MSD as an example; the separation constraints become

  w_0 + Σ_j x_ij w_j − d_i ≤ −ε   ∀i ∈ G_1,
  w_0 + Σ_j x_ij w_j + d_i ≥ ε    ∀i ∈ G_2.

The small number ε can be normalized to 1. Besides introducing a gap, another normalization approach is to include constraints such as Σ_{j=0}^m w_j = 1 or Σ_{j=1}^m w_j = 1 in the LP models to avoid unbounded or trivial solutions. Specifically, Glover et al. [45] gave the hybrid model, as follows.

Hybrid model:
  Minimize    pd + Σ_i p_i d_i − qe − Σ_i q_i e_i
  subject to  w_0 + Σ_j x_ij w_j − d − d_i + e + e_i = 0   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d + d_i − e − e_i = 0   ∀i ∈ G_2,
              w_j urs ∀j,
              d, e ≥ 0,
              d_i, e_i ≥ 0 ∀i,

where p, p_i, q, q_i are the costs for the different deviations. Including different combinations of deviation terms in the objective function then leads to variant models.

Joachimsthaler and Stam [50] reviewed and summarized LP formulations applied to two-group classification problems in discriminant analysis, including the MSD, MMD, MSID, and MIP models, and the hybrid model. They summarized the performance of the LP methods together with traditional classification methods such as Fisher's LDF [30], Smith's QDF [106], and a logistic discriminant method. In their review, MSD sometimes, but not uniformly, improves classification accuracy compared with traditional methods. On the other hand, MMD is found to be inferior to MSD. Erenguc and Koehler [27] presented a unified survey of LP models and their experimental results, in which the LP models include several versions of the MSD, MMD, MSID, and hybrid models. Rubin [99] provided experimental results comparing these LP models with Fisher's LDF and Smith's QDF. He concluded that QDF performs best when the data follow normal distributions, and that QDF could be the benchmark when seeking situations where LP methods are advantageous. In summary, the above-mentioned review papers [27,50,99] describe previous work on LP classification models and their comparison with traditional methods. However, it is difficult to make definitive statements about the conditions under which one LP model is superior to others, as stated in [107].

Stam and Ungar [110] introduced the software package RAGNU, a utility program used in conjunction with the LINDO optimization software, for solving two-group classification problems using LP-based methods. LP formulations such as the MSD, MMD, MSID, and hybrid models and their variants are contained in the package.

There are some difficulties with LP-based formulations, in that some models can result in unbounded, trivial, or unacceptable solutions [34,87], but possible remedies have been proposed. Koehler [51,52,53] and Xiao [114,115] characterized the conditions for unacceptable solutions in two-group LP discriminant models, including MSD, MMD, MSID, the hybrid model, and their variants. Glover [44] proposed the normalization constraint Σ_{j=1}^m (|G_2| Σ_{i∈G_1} x_ij + |G_1| Σ_{i∈G_2} x_ij) w_j = 1, which is more effective and reliable. Rubin [100] examined the separation failure for two-group models and suggested applying the models twice, reversing the group designations the second time. Xiao and Feng [116] proposed a regularization method to avoid multiple solutions in LP discriminant analysis by adding the term Σ_{j=1}^m w_j² to the objective functions.


Bennett and Mangasarian [9] proposed the following model, which minimizes the average of the deviations and is called the robust LP (RLP):

  Minimize    (1/|G_1|) Σ_{i∈G_1} d_i + (1/|G_2|) Σ_{i∈G_2} d_i
  subject to  w_0 + Σ_j x_ij w_j − d_i ≤ −1   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d_i ≥ 1    ∀i ∈ G_2,
              w_j urs ∀j,
              d_i ≥ 0 ∀i.

It is shown that this model gives the null solution w_1 = ... = w_m = 0 if and only if (1/|G_1|) Σ_{i∈G_1} x_ij = (1/|G_2|) Σ_{i∈G_2} x_ij for all j, in which case the null solution is guaranteed to be not unique. Data on different diseases have been tested by the proposed classification methods, as in most of Mangasarian's papers. Mangasarian et al. [86] described two applications of LP models in the field of breast cancer research, one in diagnosis and the other in prognosis. The first application is to discriminate benign from malignant breast lumps, while the second is to predict when breast cancer is likely to recur. Both of them work successfully in clinical practice. The RLP model [9], together with the multisurface method tree algorithm [8], is used in the diagnostic system.

Duarte Silva and Stam [104] included the second-order (i.e., quadratic and cross-product) terms of the attribute values in LP-based models such as the MSD and hybrid models, and compared them with the linear models, Fisher's LDF, and Smith's QDF. The results of the simulation experiments show that the methods which include second-order terms perform much better than the first-order methods, given that the data substantially violate the multivariate normality assumption. Wanarat and Pavur [113] investigated the effect of the inclusion of the second-order terms in the MSD, MIP, and hybrid models when the sample size is small to moderate. However, the simulation study shows that second-order terms may not always improve the performance of a first-order LP model, even with data configurations that are more appropriately classified by Smith's QDF. Another result of the simulation study is


that inclusion of the cross-product terms may hurt the model's accuracy, while omission of these terms causes the model to be not invariant with respect to a nonsingular transformation of the data.

Pavur [94] studied the effect of the position of the contaminated normal data in the two-group classification problem. The methods compared in that study included MSD, minimizing the number of misclassifications (MM; described in the "Mixed Integer Programming Classification Models" section), Fisher's LDF, Smith's QDF, and nearest-neighbor models. Nontraditional methods such as LP models have potential for outperforming the standard parametric procedures when nonnormality is present, but this study shows that no one model is consistently superior in all cases.

Asparoukhov and Stam [3] proposed LP and MIP models to solve the two-group classification problem where the attributes are binary. In this case the training data can be partitioned into multinomial cells, allowing for a substantial reduction in the number of variables and constraints. The proposed models not only have the usual geometric interpretation, but also possess a strong probabilistic foundation. Let s be the index of the cells, let n_1s, n_2s be the numbers of data points in cell s from groups 1 and 2, respectively, and let (b_s1, ..., b_sm) be the binary digits representing cell s. The model shown below is the LP model minimizing the sum of deviations for two-group classification with binary attributes.

Cell conventional MSD:
  Minimize    Σ_{s: n_1s+n_2s>0} (n_1s d_1s + n_2s d_2s)
  subject to  w_0 + Σ_j b_sj w_j − d_1s ≤ 0   ∀s: n_1s > 0,
              w_0 + Σ_j b_sj w_j + d_2s > 0   ∀s: n_2s > 0,
              w_j urs ∀j,
              d_1s, d_2s ≥ 0 ∀s.
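The cell construction can be sketched in a few lines of Python: binary attribute vectors index the multinomial cells, and each cell carries the per-group counts n_1s and n_2s. The tiny data set below is illustrative.

    # Partitioning binary training data into multinomial cells (sketch).
    from collections import Counter

    data = [((1, 0, 1), 1), ((1, 0, 1), 2), ((0, 1, 1), 1)]   # (binary x, group)
    counts = Counter()
    for x, g in data:
        counts[(x, g)] += 1
    for s in {x for x, _ in data}:
        n1s, n2s = counts[(s, 1)], counts[(s, 2)]
        print(s, n1s, n2s)           # cell digits b_s with counts n_1s, n_2s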

Binary attributes are usually found in medical diagnosis data. In this study three real data sets of disease discrimination were tested: developing postoperative pulmonary embolism or not, having dissecting aneurysm or other diseases, and suffering from post-traumatic epilepsy or not. In these data sets the MIP model for binary attributes (BMIP), which will be described later, performs better than the other LP models and the traditional methods.

Multigroup Classification

Freed and Glover [32] extended the LP classification models from two-group to multigroup problems. One formulation, which uses a single discriminant function, is given below:

  Minimize    Σ_{k=1}^{K−1} c_k α_k
  subject to  Σ_j x_ij w_j ≤ U_k   ∀i ∈ G_k, ∀k,
              Σ_j x_ij w_j ≥ L_k   ∀i ∈ G_k, ∀k,
              U_k + ε ≤ L_{k+1} + α_k   ∀k = 1, ..., K−1,
              w_j urs ∀j,
              U_k, L_k urs ∀k,
              α_k urs ∀k = 1, ..., K−1,

where the number ε could be normalized to 1, and c_k is the misclassification cost. However, single-function classification is not as flexible and general as multiple-function classification. Another extension from the two-group case to the multigroup case in [32] is to solve two-group LP models for all pairs of groups and determine classification rules based on these solutions. However, in some cases the group assignment is not clear, and the resulting classification scheme may be suboptimal [107].

For the multigroup discrimination problem, Bennett and Mangasarian [10] defined the piecewise-linear separability of data from K groups as follows: the data from K groups are piecewise-linear separable if and only if there exist (w_0^k, w_1^k, ..., w_m^k) ∈ R^{m+1}, k = 1, ..., K, such that w_0^h + Σ_j x_ij w_j^h ≥ w_0^k + Σ_j x_ij w_j^k + 1, ∀i ∈ G_h, ∀h, k ≠ h. The following LP will generate a piecewise-linear separation for the K groups if one exists; otherwise it will generate an error-minimizing separation:

  Minimize    Σ_h Σ_{k≠h} (1/|G_h|) Σ_{i∈G_h} d_i^{hk}
  subject to  d_i^{hk} ≥ (w_0^k + Σ_j x_ij w_j^k) − (w_0^h + Σ_j x_ij w_j^h) + 1   ∀i ∈ G_h, ∀h, k ≠ h,
              w_j^k urs ∀j, k,
              d_i^{hk} ≥ 0   ∀i ∈ G_h, ∀h, k ≠ h.

The method was tested on three data sets. It performs quite well on two of the data sets, which are totally (or almost totally) piecewise-linear separable. The classification result is not good on the third data set, which is inherently more difficult; however, combining the multisurface method tree algorithm [8] results in an improvement in performance.

Gochet et al. [46] introduced an LP model for the general multigroup classification problem. The method separates the data with several hyperplanes by sequentially solving LPs. The vectors w^k, k = 1, ..., K, are estimated for the classification decision rule. The rule is to classify an observation i into group s, where s = arg max_k {w_0^k + Σ_j x_ij w_j^k}. Suppose observation i is from group h. Denote the goodness of fit for observation i with respect to group k as

  G_i^{hk}(w^h, w^k) = [(w_0^h + Σ_j x_ij w_j^h) − (w_0^k + Σ_j x_ij w_j^k)]^+,

where [a]^+ = max{0, a}. Likewise, denote the badness of fit for observation i with respect to group k as

  B_i^{hk}(w^h, w^k) = [(w_0^h + Σ_j x_ij w_j^h) − (w_0^k + Σ_j x_ij w_j^k)]^−,

where [a]^− = −min{0, a}. The total goodness of fit and total badness of fit are then defined as

  G(w) = G(w^1, ..., w^K) = Σ_h Σ_{k≠h} Σ_{i∈G_h} G_i^{hk}(w^h, w^k),
  B(w) = B(w^1, ..., w^K) = Σ_h Σ_{k≠h} Σ_{i∈G_h} B_i^{hk}(w^h, w^k).

The LP is to minimize the total badness of fit, subject to a normalization equation, in which q > 0:

  Minimize    B(w)
  subject to  G(w) − B(w) = q,
              w urs.

Expanding G(w) and B(w) and substituting G_i^{hk}(w^h, w^k) and B_i^{hk}(w^h, w^k) by γ_i^{hk} and β_i^{hk}, respectively, the LP becomes

  Minimize    Σ_h Σ_{k≠h} Σ_{i∈G_h} β_i^{hk}
  subject to  (w_0^h + Σ_j x_ij w_j^h) − (w_0^k + Σ_j x_ij w_j^k) = γ_i^{hk} − β_i^{hk}   ∀i ∈ G_h, ∀h, k ≠ h,
              Σ_h Σ_{k≠h} Σ_{i∈G_h} (γ_i^{hk} − β_i^{hk}) = q,
              w_j^k urs ∀j, k,
              γ_i^{hk}, β_i^{hk} ≥ 0   ∀i ∈ G_h, ∀h, k ≠ h.

The classification results for two real data sets show that this model can compete with Fisher's LDF and the nonparametric k-nearest-neighbor method.

The LP-based models for classification problems highlighted above are all nonparametric models. In Sect. "Mixed Integer Programming Based Multigroup Classification Models and Applications to Medicine and Biology", we describe LP-based and MIP-based classification models that utilize a parametric multigroup discriminant analysis approach [39,40,60,63]. These latter models have been employed successfully in various multigroup disease diagnosis and biological/medical prediction problems [16,28,29,56,57,59,60,64,65].
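All of the multigroup LP models above share the same decision rule once the score vectors have been fitted: classify an observation into the group attaining the maximum score. A minimal Python sketch, with hypothetical fitted values:

    # Multigroup allocation rule s = arg max_k { w0_k + x . w_k } (sketch).
    import numpy as np

    W0 = np.array([0.1, -0.3, 0.0])                       # w0^k for K = 3 groups
    W = np.array([[1.0, 0.0], [0.2, 0.9], [-0.5, 0.4]])   # row k holds w^k

    def classify(x):
        return int(np.argmax(W0 + W @ x))

    print(classify(np.array([0.4, 1.1])))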

Mixed Integer Programming Classification Models

While LP offers a polynomial-time computational guarantee, MIP allows more flexibility in (among other things) modeling misclassified observations and/or misclassification costs.

Two-Group Classification

In the two-group classification problem, binary variables can be used in the formulation to track and minimize the exact number of misclassifications. Such an objective function is also considered the L0-norm criterion [107].

MM:
  Minimize    Σ_i z_i
  subject to  w_0 + Σ_j x_ij w_j ≤ M z_i    ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j ≥ −M z_i   ∀i ∈ G_2,
              w_j urs ∀j,
              z_i ∈ {0, 1} ∀i.

The vector w is required to be a nonzero vector to prevent the trivial solution. In the MIP formulation the objective function could include the deviation terms, such as those in the hybrid models, as well as the number of misclassifications [5]; or it could represent the expected cost of misclassification [1,6,101,105]. In particular, there are some variant versions of the basic model. Stam and Joachimsthaler [109] studied the classification performance of MM and compared it with that of MSD, Fisher's LDF, and Smith's QDF. In some cases the MM model performs better, but in others it does not. MIP formulations appear in the review studies of Joachimsthaler and Stam [50] and Erenguc and Koehler [27], and are contained in the software developed by Stam and Ungar [110]. Computational experiments show that the MIP model performs better when the group overlap is higher [50,109], although it is still not easy to reach general conclusions [107].

Since the MIP model is NP-hard, exact algorithms and heuristics have been proposed to solve it efficiently. Koehler and Erenguc [54] developed a procedure to solve MM in which the condition of nonzero w is replaced by the requirement of at least one violation of the constraints w_0 + Σ_j x_ij w_j ≤ 0 for i ∈ G_1 or w_0 + Σ_j x_ij w_j ≥ 0 for i ∈ G_2. Banks and Abad [6] solved the MIP of minimizing the expected cost of misclassification by an LP-based algorithm. Abad and Banks [1] developed three heuristic procedures for the problem of minimizing the expected cost of misclassification.

They also included the interaction terms of the attributes in the data and applied the heuristics [7]. Duarte Silva and Stam [105] introduced the divide-and-conquer algorithm for the classification problem of minimizing the misclassification cost by solving MIP and LP subproblems. Rubin [101] solved the same problem by using a decomposition approach, and tested this procedure on some data sets, including two breast cancer data sets. Yanev and Balev [119] proposed exact and heuristic algorithms for solving MM, which are based on some specific properties of the vertices of a polyhedral set neatly connected with the model.

For the two-group classification problem where the attributes are binary, Asparoukhov and Stam [3] proposed LP and MIP models which partition the data into multinomial cells and result in fewer variables and constraints. Let s be the index of the cells, let n_1s, n_2s be the numbers of data points in cell s from groups 1 and 2, respectively, and let (b_s1, ..., b_sm) be the binary digits representing cell s. Below is the BMIP, which performs best in the three real data sets in [3]:

BMIP:
  Minimize    Σ_{s: n_1s+n_2s>0} { |n_1s − n_2s| z_s + min(n_1s, n_2s) }
  subject to  w_0 + Σ_j b_sj w_j ≤ M z_s    ∀s: n_1s ≥ n_2s, n_1s > 0,
              w_0 + Σ_j b_sj w_j > −M z_s   ∀s: n_1s < n_2s,
              w_j urs ∀j,
              z_s ∈ {0, 1}   ∀s: n_1s + n_2s > 0.

Pavur et al. [96] included different secondary goals in model MM and compared their misclassification rates. A new secondary goal was proposed, which maximizes the difference between the means of the discriminant scores of the two groups. In this model the term −δ is added to the minimization objective function, as a secondary goal, with a constant multiplier, while the constraint Σ_j x̄_j^(2) w_j − Σ_j x̄_j^(1) w_j ≥ δ is included, where x̄_j^(k) = (1/|G_k|) Σ_{i∈G_k} x_ij ∀j, for k = 1, 2. The results of the simulation study show that an MIP model with the proposed secondary goal has better performance than the other models studied.

Glen [42] proposed integer programming (IP) techniques for normalization in the two-group discriminant analysis models. One technique is to add the constraint Σ_{j=1}^m |w_j| = 1. In the proposed model, w_j for j = 1, ..., m is represented by w_j = w_j^+ − w_j^−, where w_j^+, w_j^− ≥ 0, and binary variables δ_j and η_j are defined such that δ_j = 1 ⇔ w_j^+ ≥ ε and η_j = 1 ⇔ w_j^− ≥ ε. The IP normalization technique is applied to MSD and MMD, and the MSD version is presented below.

MSD – with IP normalization:
  Minimize    Σ_i d_i
  subject to  w_0 + Σ_{j=1}^m x_ij (w_j^+ − w_j^−) − d_i ≤ 0   ∀i ∈ G_1,
              w_0 + Σ_{j=1}^m x_ij (w_j^+ − w_j^−) + d_i ≥ 0   ∀i ∈ G_2,
              Σ_{j=1}^m (w_j^+ + w_j^−) = 1,
              w_j^+ − ε δ_j ≥ 0   ∀j = 1, ..., m,
              w_j^+ − δ_j ≤ 0     ∀j = 1, ..., m,
              w_j^− − ε η_j ≥ 0   ∀j = 1, ..., m,
              w_j^− − η_j ≤ 0     ∀j = 1, ..., m,
              δ_j + η_j ≤ 1       ∀j = 1, ..., m,
              w_0 urs,
              w_j^+, w_j^− ≥ 0    ∀j = 1, ..., m,
              d_i ≥ 0 ∀i,
              δ_j, η_j ∈ {0, 1}   ∀j = 1, ..., m.

The variable coefficients of the discriminant function generated by the models are invariant under origin shifts. The proposed models were validated using two data sets from [45,87]. The models were also extended for attribute selection by adding the constraint Σ_{j=1}^m (δ_j + η_j) = p, which allows only a constant number, p, of attributes to be used for classification.
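For illustration, the MM model above can be stated almost verbatim in a modeling language. The following Python/PuLP sketch uses the unit-gap variant (to rule out the trivial solution w = 0, as discussed for the LP models); the data and the big-M value are assumptions made for this example.

    # MM: minimize the number of misclassifications via big-M indicators (sketch).
    import pulp

    X1 = [(0.0, 0.0), (0.2, 0.5)]            # group G1 observations
    X2 = [(2.0, 1.5), (1.8, 2.2)]            # group G2 observations
    m, M = 2, 100.0
    prob = pulp.LpProblem("MM", pulp.LpMinimize)
    w0 = pulp.LpVariable("w0")               # unrestricted in sign
    w = [pulp.LpVariable(f"w{j}") for j in range(m)]
    z = [pulp.LpVariable(f"z{i}", cat="Binary") for i in range(len(X1) + len(X2))]
    prob += pulp.lpSum(z)                    # number of misclassified entities
    for i, x in enumerate(X1):               # w0 + x.w <= -1 + M z_i
        prob += w0 + pulp.lpSum(x[j] * w[j] for j in range(m)) <= -1 + M * z[i]
    for i, x in enumerate(X2):               # w0 + x.w >= 1 - M z_{|G1|+i}
        prob += w0 + pulp.lpSum(x[j] * w[j] for j in range(m)) >= 1 - M * z[len(X1) + i]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.value(prob.objective), [pulp.value(v) for v in [w0] + w])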

Glen [43] developed MIP models which determine the thresholds for forming dichotomous variables as well as the discriminant function coefficients w_j. For each continuous attribute to be formed into a dichotomous attribute, the model finds the threshold among the possible thresholds while determining the separating hyperplane and optimizing an objective function such as minimizing the sum of deviations or minimizing the number of misclassifications. Computational results on a real data set and some simulated data sets show that the MSD model with dichotomous categorical variable formation can improve classification performance. The reason for the potential of this technique is that the LDF generated is a nonlinear function of the original variables.

Multigroup Classification

Gehrlein [41] proposed MIP formulations of minimizing the total number of misclassifications in the multigroup classification problem. He gave both a single-function classification scheme and a multiple-function classification scheme, as follows.

General single-function classification (GSFC) – minimizing the number of misclassifications:

  Minimize    Σ_i z_i
  subject to  w_0 + Σ_j x_ij w_j − M z_i ≤ U_k   ∀i ∈ G_k,
              w_0 + Σ_j x_ij w_j + M z_i ≥ L_k   ∀i ∈ G_k,
              U_k − L_k ≥ δ_0   ∀k,
              L_g − U_k + M y_gk ≥ δ   ∀g, k, g ≠ k,
              L_k − U_g + M y_kg ≥ δ   ∀g, k, g ≠ k,
              y_gk + y_kg = 1   ∀g, k, g ≠ k,
              w_j urs ∀j,
              U_k, L_k urs ∀k,
              z_i ∈ {0, 1} ∀i,
              y_gk ∈ {0, 1} ∀g, k, g ≠ k,

where U_k, L_k denote the upper and lower endpoints of the interval assigned to group k, and y_gk = 1 if the interval associated with group g precedes that associated with group k, and y_gk = 0 otherwise. The constant δ_0 is the minimum width of an interval of a group, and the constant δ is the minimum gap between adjacent intervals.

General multiple-function classification (GMFC) – minimizing the number of misclassifications:

  Minimize    Σ_i z_i
  subject to  w_0^h + Σ_j x_ij w_j^h − w_0^k − Σ_j x_ij w_j^k + M z_i ≥ ε   ∀i ∈ G_h, ∀h, k ≠ h,
              w_j^k urs ∀j, k,
              z_i ∈ {0, 1} ∀i.

Both models work successfully on the iris data set provided by Fisher [30].

Pavur [93] solved the multigroup classification problem by sequentially solving the GSFC in one dimension each time. LDFs were generated by successively solving the GSFC with the added constraints that all linear discriminants are uncorrelated with each other over the total data set. This procedure can be repeated for as many dimensions as is believed to be enough. According to the simulation results, this procedure substantially improves the GSFC model and sometimes outperforms GMFC, Fisher's LDF, or Smith's QDF.

To solve the three-group classification problem more efficiently, Loucopoulos and Pavur [71] made a slight modification to GSFC and proposed the model MIP3G, which also minimizes the number of misclassifications. Compared with GSFC, MIP3G is also a single-function classification model, but it reduces the possible group orderings from six to three in the formulation and thus becomes more efficient. Loucopoulos and Pavur [72] reported the results of a simulation experiment on the performance of GMFC, MIP3G, Fisher's LDF, and Smith's QDF for a three-group classification problem with small training samples. Second-order terms were also considered in the experiment. Simulation results show that GMFC and MIP3G can outperform the parametric procedures in some nonnormal data sets, and that the inclusion of second-order terms can improve the performance


of MIP3G in some data sets. Pavur and Loucopoulos [95] investigated the effect of the gap size in the MIP3G model for the three-group classification problem. A simulation study illustrates that for fairly separable data, or data with small sample sizes, a non-zero-gap model can improve the performance. A possible reason for this result is that the zero-gap model may be overfitting the data. Gallagher et al. [39,40,63] and Lee [59,60] proposed MIP models, both heuristic and exact, as a computational approach to solving the constrained discriminant method described by Anderson [2]. These models are described in detail in Sect. “Mixed Integer Programming Based Multigroup Classification Models and Applications to Medicine and Biology”.

Nonlinear Programming Classification Models

Nonlinear programming approaches are natural extensions of some of the LP-based models. Thus far, nonlinear programming approaches have been developed for two-group classification.

Stam and Joachimsthaler [108] proposed a class of nonlinear programming methods to solve the two-group classification problem under the Lp-norm objective criterion. This is an extension of MSD and MMD, for which the objectives are the L1-norm and L∞-norm, respectively. Minimize the general Lp-norm distance:

  Minimize    (Σ_i d_i^p)^{1/p}
  subject to  w_0 + Σ_j x_ij w_j − d_i ≤ b   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d_i ≥ b   ∀i ∈ G_2,
              w_j urs ∀j,
              d_i ≥ 0 ∀i.

The simulation results show that, in addition to the L1-norm and the L∞-norm, it is worth the effort to compute other Lp-norm objectives. Restricting the analysis to 1 ≤ p ≤ 3, plus p = ∞, is recommended. This method was reviewed by Joachimsthaler and Stam [50] and by Erenguc and Koehler [27].

Mangasarian et al. [85] proposed a nonconvex model for the two-group classification problem:

  Minimize    d_1 + d_2
  subject to  Σ_j x_ij w_j − d_1 ≤ 0   ∀i ∈ G_1,
              Σ_j x_ij w_j + d_2 ≥ 0   ∀i ∈ G_2,
              max_{j=1,...,m} |w_j| = 1,
              w_j urs ∀j,
              d_1, d_2 urs.

This model can be solved in polynomial time by solving 2m linear programs, which generate a sequence of parallel planes, resulting in a piecewise-linear nonconvex discriminant function. The model works successfully in clinical practice for the diagnosis of breast cancer.

Further, Mangasarian [76] also formulated the problem of minimizing the number of misclassifications as a linear program with equilibrium constraints (LPEC), instead of the MIP model MM described previously:

  Minimize    Σ_{i∈G_1∪G_2} z_i
  subject to  w_0 + Σ_j x_ij w_j − d_i ≤ −1   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d_i ≥ 1    ∀i ∈ G_2,
              z_i (w_0 + Σ_j x_ij w_j − d_i + 1) = 0   ∀i ∈ G_1,
              z_i (w_0 + Σ_j x_ij w_j + d_i − 1) = 0   ∀i ∈ G_2,
              d_i (1 − z_i) = 0   ∀i ∈ G_1 ∪ G_2,
              0 ≤ z_i ≤ 1   ∀i ∈ G_1 ∪ G_2,
              d_i ≥ 0   ∀i ∈ G_1 ∪ G_2,
              w_j urs ∀j.

The general LPEC can be converted to an exact penalty problem with a quadratic objective and linear constraints.

A stepless Frank–Wolfe-type algorithm is proposed for the penalty problem, terminating at a stationary point or a global solution. This method is called the parametric misclassification minimization (PMM) procedure, and numerical testing is included in [77].

To illustrate the next model, we first define the step function s: R → {0, 1} as

  s(u) = 1 if u > 0,
  s(u) = 0 if u ≤ 0.

The problem of minimizing the number of misclassifications is equivalent to

  Minimize    Σ_{i∈G_1∪G_2} s(d_i)
  subject to  w_0 + Σ_j x_ij w_j − d_i ≤ −1   ∀i ∈ G_1,
              w_0 + Σ_j x_ij w_j + d_i ≥ 1    ∀i ∈ G_2,
              d_i ≥ 0   ∀i ∈ G_1 ∪ G_2,
              w_j urs ∀j.

Mangasarian [77] proposed a simple concave approximation of the step function for nonnegative variables: t(u, α) = 1 − e^{−αu}, where α > 0, u ≥ 0. Let α > 0 and approximate s(d_i) by t(d_i, α). The problem then reduces to minimizing a smooth concave function bounded below on a nonempty polyhedron, which has a minimum at a vertex of the feasible region. A finite successive linearization algorithm (SLA) was proposed, terminating at a stationary point or a global solution. Numerical tests of the SLA were carried out and compared with the PMM procedure described above. The results show that the much simpler SLA obtains a separation that is almost as good as that of PMM in considerably less computing time.

Chen and Mangasarian [21] proposed an algorithm for a hybrid misclassification minimization problem, which is more computationally tractable than the NP-hard misclassification minimization problem. The basic idea of the hybrid approach is to obtain iteratively w_0 and (w_1, ..., w_m) of the separating hyperplane:

1. For a fixed w_0, solve RLP [9] to determine (w_1, ..., w_m).
2. For this (w_1, ..., w_m), solve the one-dimensional misclassification minimization problem to determine w_0.

Comparison of the hybrid method was made with the RLP method and the PMM procedure. The hybrid method performs better on the testing sets of the tenfold cross-validation and is much faster than PMM.

Mangasarian [78] proposed the model of minimizing the sum of arbitrary-norm distances of misclassified points to the separating hyperplane. For a general norm ||·|| on R^m, the dual norm ||·||′ on R^m is defined as ||x||′ = max_{||y||=1} x^T y. Define [a]^+ = max{0, a} and let w = (w_1, ..., w_m). The formulation can then be written as

  Minimize    Σ_{i∈G_1} [ w_0 + Σ_j x_ij w_j ]^+ + Σ_{i∈G_2} [ −w_0 − Σ_j x_ij w_j ]^+
  subject to  ||w|| = 1,
              w_0, w urs.

The problem is to minimize a convex function on a unit sphere. A decision problem related to this minimization problem is shown to be NP-complete, except for p = 1. For a general p-norm, the minimization problem can be transformed, via an exact penalty formulation, into minimizing the sum of a convex function and a bilinear function on a convex set.

Support Vector Machine

A support vector machine (SVM) is a type of mathematical programming approach [112]. It has been widely studied, and has become popular in many application fields in recent years. The introductory description of SVMs given here is summarized from the tutorial by Burges [20]. In order to maintain consistency with SVM studies in the published literature, the notation used below differs slightly from the notation used to describe the mathematical programming methods in the earlier sections.

In the two-group separable case, the objective is to maximize the margin of a separating hyperplane, 2/||w||, which is equivalent to minimizing w^T w:

  Minimize    w^T w
  subject to  x_i^T w + b ≥ +1   for y_i = +1,
              x_i^T w + b ≤ −1   for y_i = −1,
              w, b urs,

where x_i ∈ R^m represents the values of the attributes of observation i, and y_i ∈ {−1, 1} represents the group of observation i. This problem can be solved by solving its Wolfe dual problem:

  Maximize    Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
  subject to  Σ_i α_i y_i = 0,
              α_i ≥ 0 ∀i.

Here, α_i is the Lagrange multiplier for training point i, and the points with α_i > 0 are called the support vectors (analogous to the support of a hyperplane, and thus the introduction of the name "support vector"). The primal solution w is given by w = Σ_i α_i y_i x_i, and b can be computed by solving y_i(w^T x_i + b) − 1 = 0 for any i with α_i > 0.

For the nonseparable case, slack variables ξ_i are introduced to handle the errors. Let C be the penalty for the errors. The problem becomes

  Minimize    (1/2) w^T w + C (Σ_i ξ_i)^k
  subject to  x_i^T w + b ≥ +1 − ξ_i   for y_i = +1,
              x_i^T w + b ≤ −1 + ξ_i   for y_i = −1,
              w, b urs,
              ξ_i ≥ 0 ∀i.

When k is chosen to be 1, neither the ξ_i nor their Lagrange multipliers appear in the Wolfe dual problem:

  Maximize    Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
  subject to  Σ_i α_i y_i = 0,
              0 ≤ α_i ≤ C ∀i.
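A brief scikit-learn sketch of the soft-margin SVM just formulated; C is the error penalty above, and replacing kernel="linear" with another kernel gives the nonlinear separation discussed next. The data and parameter values are illustrative.

    # Soft-margin SVM (sketch); the dual is solved internally by the library.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.5], [2.0, 1.5], [1.8, 2.2]])
    y = np.array([-1, -1, 1, 1])                   # group labels y_i in {-1, +1}
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print(clf.support_)                            # indices of the support vectors
    print(clf.decision_function([[1.0, 1.0]]))     # its sign determines the group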

The data points can be separated nonlinearly by mapping the data into some higher-dimensional space and applying the linear SVM to the mapped data. Instead of knowing the mapping Φ explicitly, the SVM needs only the dot products of two transformed data points, Φ(x_i)·Φ(x_j). The kernel function K is introduced such that K(x_i, x_j) = Φ(x_i)·Φ(x_j). Replacing x_i^T x_j by K(x_i, x_j) in the above problem, the separation becomes nonlinear, while the problem to be solved remains a quadratic program. In testing a new data point x after training, the sign of the function f(x) is computed to determine the group of x:

  f(x) = Σ_{i=1}^{N_s} α_i y_i Φ(s_i)·Φ(x) + b = Σ_{i=1}^{N_s} α_i y_i K(s_i, x) + b,

where the s_i are the support vectors and N_s is the number of support vectors. Again the explicit form of Φ(x) is avoided.

Mangasarian provided a general mathematical programming framework for SVMs, called the generalized SVM or GSVM [79,83]. Special cases can be derived from GSVM, including the standard SVM. Many SVM-type methods have been developed by Mangasarian and others to solve huge classification problems more efficiently. These methods include successive overrelaxation for SVM [82], proximal SVM [36,38], smooth SVM [68], reduced SVM [67], Lagrangian SVM [84], incremental SVMs [37], and other methods [13,81]. Mangasarian [80] summarized some of these developments. Examples of applications of SVMs include breast cancer studies [69,70] and genome research [73].

Hsu and Lin [49] compared different methods for multigroup classification using SVMs. Three of the methods studied were based on several binary classifiers: one against one, one against all, and directed acyclic graph (DAG) SVM. The other two methods studied are methods with decomposition implementations. The experimental results show that the one-against-one and DAG methods are more suitable for practical use than the other methods. Lee et al. [66] proposed a generic approach to multigroup problems with some theoretical properties, and the proposed method was applied successfully to microarray data for cancer classification and to satellite radiance profiles for cloud classification.

Gallagher et al. [39,40,63] offered the first discrete SVM for multigroup classification with reserved judgment. The approach has been successfully applied to


a diverse variety of biological and medical applications (see Sect. "Mixed Integer Programming Based Multigroup Classification Models and Applications to Medicine and Biology").

Mixed Integer Programming Based Multigroup Classification Models and Applications to Medicine and Biology

Commonly used methods for classification, such as LDFs, decision trees, mathematical programming approaches, SVMs, and artificial neural networks, can be viewed as attempts at approximating a Bayes optimal rule for classification; that is, a rule that maximizes (minimizes) the total probability of correct classification (misclassification). Even if a Bayes optimal rule is known, intergroup misclassification rates may be higher than desired. For example, in a population that is mostly healthy, a Bayes optimal rule for medical diagnosis might misdiagnose sick patients as healthy in order to maximize the total probability of correct diagnosis. As a remedy, a constrained discriminant rule that limits the misclassification rate is appealing.

Assuming that the group density functions and prior probabilities are known, Anderson [2] showed that an optimal rule for the problem of maximizing the probability of correct classification subject to constraints on the misclassification probabilities must be of a specific form when discriminating among multiple groups with a simplified model. The formulae in Anderson's result depend on a set of parameters satisfying a complex relationship between the density functions, the prior probabilities, and the bounds on the misclassification probabilities. Establishing a viable mathematical model to describe Anderson's result, and finding values for these parameters that yield an optimal rule, are challenging tasks. The first computational models utilizing Anderson's formulae were proposed in [39,40].

Discrete Support Vector Machine Predictive Models

As part of the work carried out at Georgia Institute of Technology's Center for Operations Research in Medicine, we have developed a general-purpose discriminant analysis modeling framework and computational engine that are applicable to a wide variety of applications, including biological, biomedical, and logistics problems. Utilizing the technology of large-scale discrete optimization and SVMs, we have developed novel classification models that simultaneously include the following features: (1) the ability to classify any number of distinct groups; (2) the ability to incorporate heterogeneous types of attributes as input; (3) a high-dimensional data transformation that eliminates noise and errors in biological data; (4) constraints to limit the rate of misclassification, and a reserved-judgment region that provides a safeguard against overtraining (which tends to lead to high misclassification rates from the resulting predictive rule); and (5) successive multistage classification capability to handle data points placed in the reserved-judgment region. Studies involving tumor volume identification, ultrasonic cell disruption in drug delivery, lung tumor cell motility analysis, CpG island aberrant methylation in human cancer, predicting early atherosclerosis using biomarkers, and fingerprinting native and angiogenic microvascular networks using functional perfusion data indicate that our approach is adaptable and can produce effective and reliable predictive rules for various biomedical and biobehavior phenomena [16,28,29,56,57,59,60,64,65]. Based on the description in [39,40,59,60,63], we summarize below some of the classification models we have developed.

Modeling of Reserved-Judgment Region for General Groups

When the population densities and prior probabilities are known, the constrained rules with a reject option (reserved judgment), based on Anderson's results, call for finding a partition {R_0, ..., R_G} of R^k that maximizes the probability of correct allocation subject to constraints on the misclassification probabilities; i.e.,

  Maximize    Σ_{g=1}^G π_g ∫_{R_g} f_g(w) dw                          (1)
  subject to  ∫_{R_g} f_h(w) dw ≤ α_hg,   h, g = 1, ..., G, h ≠ g,     (2)

where f_h, h ∈ {1, ..., G}, are the group conditional density functions, π_g denotes the prior probability that a randomly selected entity is from group g, g ∈ {1, ..., G}, and α_hg, h ≠ g, are constants between 0 and 1. Under quite general assumptions, it was shown

(2)

where f h ; h 2 f1; : : : ; Gg; are the group conditional density functions, g denotes the prior probability that a randomly selected entity is from group g; g 2 f1; : : : ; Gg, and ˛ h g ; h ¤ g, are constants between 0 and 1. Under quite general assumptions, it was shown

767

768

D

Disease Diagnosis: Optimization-Based Methods

that there exist unique (up to a set of measure zero) nonnegative constants  i h ; i; h 2 f1; : : : ; Gg; i ¤ h; such that the optimal rule is given by R g D fx 2 R k : L g (x) D

max

h2f0;1; ::: ;Gg

L h (x)g ; (3)

g D 0; : : : ; G ; where L0 (x) D 0 ;

(4)

L h (x) D h f h (x) 

G X

 i h f i (x); h D 1; : : : ; G :

iD1;i¤h

(5) For G D 2 the optimal solution can be modeled rather straightforwardly. However, finding optimal ih ’s for the general case, G  3, is a difficult problem, with the difficulty increasing as G increases. Our model offers an avenue for modeling and finding the optimal solution in the general case. It is the first such model to be computationally viable [39,40]. Before proceeding, we note that Rg can be written as R g D fx 2 R k : L g (x)  L h (x) for all h D 0; : : : ; Gg. P So, since L g (x)  L h (x) if, and only if, (1/ GtD1 f t (x)) PG L g (x)  (1/ tD1 f t (x))L h (x), the functions L h ; h D 1; : : : ; G; can be redefined as L h (x) D h p h (x) 

G X

 i h p i (x); h D 1; : : : ; G ;

iD1;i¤h

(6) PG

where p i (x) D f i (x)/ tD1 f t (x). We assume that Lh is defined as in (6) in our model. Mixed Integer Programming Formulations Assume that we are given a training sample of N entities whose group classifications are known; say, ng entities P are in group g, where GgD1 n g D N. Let the k-dimensional vectors xgj , g D 1; : : : ; G; j D 1; : : : ; n g ; contain the measurements on k available characteristics of the entities. Our procedure for deriving a discriminant rule proceeds in two stages. The first stage is to use the training sample to compute estimates, fˆh , either parametrically or nonparametrically, of the density functions f h [89] and estimates, ˆ h , of the prior probabilities h ; h D 1; : : : ; G. The second stage is to determine

the optimal ih ’s given these estimates. This stage requires being able to estimate the probabilities of correct classification and misclassification for any candidate set of ih ’s. One could, in theory, substitute the estimated densities and prior probabilities into (5), and directly use the resulting regions Rg in the integral expressions given in (1) and (2). This would involve, even in simple cases such as normally distributed groups, the numerical evaluation of k-dimensional integrals at each step of a search for the optimal ih ’s. Therefore, we have designed an alternative approach. After substituting the fˆh ’s and ˆ h ’s into (5), we simply calculate the proportion of training sample points which fall in each of the regions R1 ; : : : ; RG : The MIP models discussed below attempt to maximize the proportion of training sample points correctly classified while satisfying constraints on the proportions of training sample points misclassified. This approach has two advantages. First, it avoids having to evaluate the potentially difficult integrals in (1) and (2). Second, it is nonparametric in controlling the training sample misclassification probabilities. That is, even if the densities are poorly estimated (by assuming, for example, normal densities for nonnormal data), the constraints are still satisfied for the training sample. Better estimates of the densities may allow a higher correct classification rate to be achieved, but the constraints will be satisfied even if poor estimates are used. Unlike most SVM models that minimize the sum of errors, our objective is driven by the number of correct classifications, and will not be biased by the distance of the entities from the supporting hyperplane. A word of caution is in order. In traditional unconstrained discriminant analysis, the true probability of correct classification of a given discriminant rule tends to be smaller than the rate of correct classification for the training sample from which it was derived. One would expect to observe such an effect for the method described herein as well. In addition, one would expect to observe an analogous effect with regard to constraints on misclassification probabilities – the true probabilities are likely to be greater than any limits imposed on the proportions of training sample misclassifications. Hence, the ˛ hg parameters should be carefully chosen for the application in hand. Our first model is a nonlinear 0/1 MIP model with the nonlinearity appearing in the constraints. Model 1

Disease Diagnosis: Optimization-Based Methods

maximizes the number of correct classifications of the given N training entities. Similarly, the constraints on the misclassification probabilities are modeled by ensuring that the number of group g training entities in region R_h is less than or equal to a prespecified percentage, α_hg (0 < α_hg < 1), of the total number, n_g, of group g entities, h, g ∈ {1, ..., G}, h ≠ g. For notational convenience, let G = {1, ..., G} and N_g = {1, ..., n_g}, for g ∈ G. Also, analogous to the definition of p_i, define p̂_i by p̂_i = f̂_i(x)/Σ_{t=1}^G f̂_t(x). In our model, we use binary indicator variables to denote the group classification of entities. Mathematically, let u_hgj be a binary variable indicating whether or not x_gj lies in region R_h, i.e., whether or not the jth entity from group g is allocated to group h. Then model 1 can be written as follows.

Discriminant analysis MIP (DAMIP):

  Maximize    Σ_{g∈G} Σ_{j∈N_g} u_ggj
  subject to  L_hgj = π̂_h p̂_h(x_gj) − Σ_{i∈G\{h}} λ_ih p̂_i(x_gj),   h, g ∈ G, j ∈ N_g,   (7)
              y_gj = max{0, L_hgj : h = 1, ..., G},   g ∈ G, j ∈ N_g,   (8)
              y_gj − L_ggj ≤ M(1 − u_ggj),   g ∈ G, j ∈ N_g,   (9)
              y_gj − L_hgj ≥ ε(1 − u_hgj),   h, g ∈ G, j ∈ N_g, h ≠ g,   (10)
              Σ_{j∈N_g} u_hgj ≤ ⌊α_hg n_g⌋,   h, g ∈ G, h ≠ g,   (11)
              −∞ < L_hgj < ∞, y_gj ≥ 0, λ_ih ≥ 0, u_hgj ∈ {0, 1}.

Constraint (7) defines the variable L_hgj as the value of the function L_h evaluated at x_gj. Therefore, the continuous variable y_gj, defined in constraint (8), represents max{L_h(x_gj) : h = 0, ..., G}; and consequently, x_gj lies in region R_h if, and only if, y_gj = L_hgj. The binary variable u_hgj is used to indicate whether or not x_gj lies in region R_h, i.e., whether or not the jth entity from group g is allocated to group h. In particular, constraint


(9), together with the objective, forces u_ggj to be 1 if, and only if, the jth entity from group g is correctly allocated to group g; and constraints (10) and (11) ensure that at most ⌊α_hg n_g⌋ (i.e., the greatest integer less than or equal to α_hg n_g) group g entities are allocated to group h, h ≠ g. One caveat regarding the indicator variables u_hgj is that although the condition u_hgj = 0, h ≠ g, implies (by constraint (10)) that x_gj ∉ R_h, the converse need not hold. As a consequence, the number of misclassifications may be overcounted. However, in our preliminary numerical study we found that the actual amount of overcounting is minimal. One could force the converse (thus, u_hgj = 1 if and only if x_gj ∈ R_h) by adding constraints y_gj − L_hgj ≤ M(1 − u_hgj), for example. Finally, we note that the parameters M and ε are extraneous to the discriminant analysis problem itself, but are needed in the model to control the indicator variables u_hgj. The intention is for M and ε to be, respectively, large and small positive constants.

Model Variations

We explore different variations in the model to grasp the quality of the solution and the associated computational effort. A first variation involves transforming model 1 into an equivalent linear mixed integer model. In particular, model 2 replaces the N constraints defined in (8) with the following system of 3GN + 2N constraints:

  y_gj ≥ L_hgj,   h, g ∈ G, j ∈ N_g,   (12)
  ỹ_hgj ≥ L_hgj − M(1 − v_hgj),   h, g ∈ G, j ∈ N_g,   (13)
  ỹ_hgj ≤ π̂_h p̂_h(x_gj) v_hgj,   h, g ∈ G, j ∈ N_g,   (14)
  Σ_{h∈G} v_hgj ≤ 1,   g ∈ G, j ∈ N_g,   (15)
  Σ_{h∈G} ỹ_hgj = y_gj,   g ∈ G, j ∈ N_g,   (16)

where ỹ_hgj ≥ 0 and v_hgj ∈ {0, 1}, h, g ∈ G, j ∈ N_g. These constraints, together with the nonnegativity of y_gj, force y_gj = max{0, L_hgj : h = 1, ..., G}.

The second variation involves transforming model 1 into a heuristic linear MIP model. This is done by replacing the nonlinear constraint (8) with y_gj ≥ L_hgj, h, g ∈ G, j ∈ N_g, and including penalty terms in the objective function. In particular, model 3

770

D

Disease Diagnosis: Optimization-Based Methods

has the objective Maximize

L g g j C w g j  0;

XX

ˇu g g j 

g2G j2N g

XX

 yg j ;

 L h g j C y g j  0;

X

u h g j  b˛Nc ;

(17)

g2G h2Gnfgg j2N g

where ˛ is a constant between 0 and 1. We will refer to models 1, 2, and 3 modified in this way as models 1T, 2T, and 3T, respectively. Of course, other modifications are also possible. For instance, one could place restrictions on the total number of type g points misclassified for each g 2 G. Thus, in place of the constraints specified in (17), one would include the conP P straints h2Gnfgg j2N g u h g j  b˛ g Nc; g 2 G, where 0 < ˛ g < 1. We also explore a heuristic linear model of model 1. In particular, consider the linear program (DALP): Maximize

XX

(c1 w g j C c2 y g j )

g2G j2N g

subject to

L h g j D h pˆ h (x g j ) 

X

(18)  i h pˆ i (x g j ) ;

i2Gnh

h; g 2 G; j 2 N g ; (19) L g g j L h g j Cw g j  0;

h; g 2 G; j 2 N g ;

(21) (22)

g2G j2N g

where ˇ and  are positive constants. This model is heuristic in that there is nothing to force y g j D maxf0; L h g j : h D 1; : : : ; Gg. However, since in addition to trying to force as many uggj ’s to 1 as possible, the objective in model 3 also tries to make the ygj ’s as small as possible, and the optimizer tends to drive ygj towards maxf0; L h g j : h D 1; : : : ; Gg. We remark that ˇ and  could be stratified by group (i. e., introduce possibly distinct ˇ g ;  g ; g 2 G) to model the relative importance of certain groups to be correctly classified. A reasonable modification to models 1, 2, and 3 involves relaxing the constraints specified by (11). Rather than placing restrictions on the number of type g training entities classified into group h, for all h; g 2 G; h ¤ g, one could simply place an upper bound on the total number of misclassified training entities. In this case, the G(G  1) constraints specified by (11) would be replaced by the single constraint X X

g 2 G; j 2 N g ;

h; g 2 G; h ¤ g; j 2 N g ; (20)

1 < L h g j < 1; w g j ; y g j ;  i h  0 : Constraint (19) defines the variable Lhgj as the value of the function Lh evaluated at xgj . As the optimization solver searches through the set of feasible solutions, the ih variables will vary, causing the Lhgj variables to assume different values. Constraints (20), (21), and (22) link the objective-function variables with the Lhgj variables in such a way that correct classification of training entities and allocation of training entities into the reserved-judgment region are captured by the objective-function variables. In particular, if the optimization solver drives wgj to zero for some g,j pair, then constraints (20) and (21) imply that L g g j D maxf0; L h g j : h 2 Gg. Hence, the jth entity from group g is correctly classified. If, on the other hand, the optimal solution yields y g j D 0 for some g,j pair, then constraint (22) implies that maxf0; L h g j : h 2 Gg D 0. Thus, the jth entity from group g is placed in the reserved-judgment region. (Of course, it is possible for both wgj and ygj to be zero. One should decide prior to solving the linear program how to interpret the classification in such cases.) If both wgj and ygj are positive, the jth entity from group g is misclassified. The optimal solution yields a set of ih ’s that best allocates the training entities (i. e., “best” in terms of minimizing the penalty objective function). The optimal ih ’s can then be used to define the functions Lh , h 2 G, which in turn can be used to classify a new entity with feature vector x 2 R k by simply computing the index at which maxfL h (x) : h 2 f0; 1; : : : ; Ggg is achieved. Note that model DALP places no a priori bound on the number of misclassified training entities. However, since the objective is to minimize a weighted combination of the variables wgj and ygj , the optimizer will attempt to drive these variables to zero. Thus, the optimizer is, in essence, attempting either to correctly classify training entities (w g j D 0), or to place them in the reserved-judgment region (y g j D 0). By varying the weights c1 and c2 , one has a means of controlling the optimizer’s emphasis for correctly classifying training entities versus placing them in the reserved-

If $c_2/c_1 < 1$, the optimizer will tend to place a greater emphasis on driving the $w_{gj}$ variables to zero than driving the $y_{gj}$ variables to zero (conversely, if $c_2/c_1 > 1$). Hence, when $c_2/c_1 < 1$, one should expect to get relatively more entities correctly classified, fewer placed in the reserved-judgment region, and more misclassified, than when $c_2/c_1 > 1$. An extreme case is when $c_2 = 0$. In this case, there is no emphasis on driving $y_{gj}$ to zero (the reserved-judgment region is thus ignored), and the full emphasis of the optimizer is to drive $w_{gj}$ to zero.

Table 1 summarizes the number of constraints, the total number of variables, and the number of 0/1 variables in each of the discrete SVM models, and in the heuristic LP model (DALP).

Disease Diagnosis: Optimization-Based Methods, Table 1 Model size

Model  Type            Constraints           Total variables       0/1 variables
1      Nonlinear MIP   2GN + N + G(G-1)      2GN + N + G(G-1)      GN
2      Linear MIP      5GN + 2N + G(G-1)     4GN + N + G(G-1)      2GN
3      Linear MIP      3GN + G(G-1)          2GN + N + G(G-1)      GN
1T     Nonlinear MIP   2GN + N + 1           2GN + N + G(G-1)      GN
2T     Linear MIP      5GN + 2N + 1          4GN + N + G(G-1)      2GN
3T     Linear MIP      3GN + 1               2GN + N + G(G-1)      GN
DALP   Linear program  3GN                   NG + N + G(G-1)       0

Clearly, even for moderately sized discriminant analysis problems, the MIP instances are relatively large. Also, note that model 2 is larger than model 3, in terms of both the number of constraints and the number of variables. However, it is important to keep in mind that the difficulty of solving an MIP problem cannot, in general, be predicted solely by its size; problem structure has a direct and substantial bearing on the effort required to find optimal solutions. The LP relaxation of these MIP models poses computational challenges, as commercial LP solvers return (optimal) LP solutions that are infeasible, owing to the equality constraints and the use of big $M$ and small $\varepsilon$ in the formulation.

It is interesting to note that the set of feasible solutions for model 2 is "tighter" than that for model 3. In particular, if $F_i$ denotes the set of feasible solutions of model $i$, then

$F_1 = \{(L, \lambda, u, y) : \text{there exists } (\tilde{y}, v) \text{ such that } (L, \lambda, u, y, \tilde{y}, v) \in F_2\} \subsetneq F_3$.  (23)
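As a concrete illustration of how compact the DALP formulation is, the following hedged Python/PuLP sketch builds (18)–(22) for a toy instance. The priors pi_hat, the density values p_hat, and the weights c1, c2 are fabricated placeholders, and $L_{hgj}$ is substituted by its defining expression (19) rather than carried as a separate free variable; this is a sketch of the formulation, not the authors' code.

```python
# Hedged sketch of DALP (18)-(22); all data below are fabricated placeholders.
import random
from pulp import LpProblem, LpVariable, LpMinimize, lpSum

random.seed(0)
G_SET = [1, 2, 3]                     # toy: three groups
N = {g: range(4) for g in G_SET}      # toy: four training entities per group
pi_hat = {h: 1.0 / len(G_SET) for h in G_SET}                 # assumed priors
p_hat = {(h, g, j): random.random()                           # stand-in for the
         for h in G_SET for g in G_SET for j in N[g]}         # density p_h(x_gj)
c1, c2 = 1.0, 1.0                                             # weights in (18)

prob = LpProblem("DALP_sketch", LpMinimize)
lam = {(i, h): LpVariable(f"lam_{i}_{h}", lowBound=0)
       for h in G_SET for i in G_SET if i != h}
w = {(g, j): LpVariable(f"w_{g}_{j}", lowBound=0) for g in G_SET for j in N[g]}
y = {(g, j): LpVariable(f"y_{g}_{j}", lowBound=0) for g in G_SET for j in N[g]}

def L_expr(h, g, j):
    # Constraint (19), substituted in place:
    # L_hgj = pi_h p_h(x_gj) - sum_{i != h} lambda_ih p_i(x_gj)
    return pi_hat[h] * p_hat[h, g, j] - lpSum(
        lam[i, h] * p_hat[i, g, j] for i in G_SET if i != h)

prob += lpSum(c1 * w[g, j] + c2 * y[g, j]
              for g in G_SET for j in N[g])                   # objective (18)
for g in G_SET:
    for j in N[g]:
        prob += L_expr(g, g, j) + w[g, j] >= 0                # (21)
        for h in G_SET:
            if h != g:
                prob += (L_expr(g, g, j) - L_expr(h, g, j)
                         + w[g, j] >= 0)                      # (20)
            prob += -L_expr(h, g, j) + y[g, j] >= 0           # (22)
prob.solve()  # default CBC solver; any LP solver supported by PuLP works

# A new entity x with density values p_h(x) would then be assigned to the
# group attaining max{0, L_h(x)}, with value 0 meaning reserved judgment.
```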

The novelties of the classification models developed herein include the following: (1) they are suitable for discriminant analysis given any number of groups; (2) they accept heterogeneous types of attributes as input; (3) they use a parametric approach to reduce high-dimensional attribute spaces; and (4) they allow constraints on the number of misclassifications, and utilize a reserved judgment to facilitate the reduction of misclassifications. The lattermost point opens the possibility of performing multistage analysis. Clearly, the advantage of an LP model over an MIP model is that the associated problem instances are computationally much easier to solve. However, the most important criterion in judging a method for obtaining discriminant rules is how the rules perform in correctly classifying new unseen entities. Once the rule has been developed, applying it to a new entity to determine its group is trivial. Extensive computational experiments have been performed to gauge the qualities of solutions of different models [17,19,40,59,60,63].

Validation of Model and Computational Effort

We performed tenfold cross-validation, and designed simulation and comparison studies on our models. The results reported in [40,63] demonstrate that our approach works well when applied to both simulated data and data sets from the machine learning database repository [91]. In particular, our methods compare favorably with, and at times are superior to, other mathematical programming methods, including the GSFC model by Gehrlein [41] and the LP model by Gochet et al. [46], as well as Fisher's LDF, artificial neural networks, quadratic discriminant analysis, tree classification, and other SVMs, on real biological and medical data.


Classification Results for Real-World Biological and Medical Applications

The main objective in discriminant analysis is to derive rules that can be used to classify entities into groups. Computationally, the challenge lies in the effort expended to develop such a rule. Feasible solutions obtained from our classification models correspond to predictive rules. Empirical results [40,63] indicate that the resulting classification model instances are computationally very challenging, and even intractable by competitive commercial MIP solvers. However, the resulting predictive rules prove to be very promising, offering correct classification rates on new unknown data ranging from 80 to 100% for various types of biological/medical problems. Our results indicate that the general-purpose classification framework that we have designed has the potential to be a very powerful predictive method for clinical settings.

The choice of MIP as the underlying modeling and optimization technology for our SVM classification model is guided by the desire to simultaneously incorporate a variety of important and desirable properties of predictive models within a general framework. MIP itself allows for incorporation of continuous and discrete variables, and linear and nonlinear constraints, providing a flexible and powerful modeling environment. Our mathematical modeling and computational algorithm design shows great promise, as the resulting predictive rules are able to produce higher rates of correct classification for new biological data (with unknown group status) compared with existing classification methods. This is partly due to the transformation of raw data via the set of constraints in (7). While most mathematical programming approaches directly determine the hyperplanes of separation using raw data, our approach transforms the raw data via a probabilistic model before the determination of the supporting hyperplanes. Further, the separation is driven by maximizing the sum of binary variables (representing correct classification or not of entities), instead of maximizing the margins between groups, or minimizing a sum of errors (representing distances of entities from hyperplanes), as in other SVMs. The combination of these two strategies offers better classification capability. Noise in the transformed data is not as profound as in raw data, and the magnitudes of the errors do not skew the determination of the separating hyperplanes, as all entities have equal importance when correct classification is being counted.

To highlight the broad applicability of our approach, below we briefly summarize the application of our predictive models and solution algorithms to ten different biological problems. Each of the projects was carried out in close partnership with experimental biologists and/or clinicians. Applications to finance and other industries are described elsewhere [17,40,63].

Determining the Type of Erythemato-Squamous Disease

The differential diagnosis of erythemato-squamous diseases is an important problem in dermatology [60]. They all share the clinical features of erythema and scaling, with very little difference. The six groups are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease at the beginning stage and may have the characteristic features at the following stages [91]. The six groups consisted of 366 subjects (112, 61, 72, 49, 52, and 20, respectively) with 34 clinical attributes. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features were determined by an analysis of the samples under a microscope. The 34 attributes include (1) clinical attributes (erythema, scaling, definite borders, itching, koebner phenomenon, polygonal papules, follicular papules, oral mucosal involvement, knee and elbow involvement, scalp involvement, family history, age) and (2) histopathological attributes (melanin incontinence, eosinophils in the infiltrate, polymorphonuclear leukocyte infiltrate, fibrosis of the papillary dermis, exocytosis, acanthosis, hyperkeratosis, parakeratosis, clubbing of the rete ridges, elongation of the rete ridges, thinning of the suprapapillary epidermis, spongiform pustule, Munro microabscess, focal hypergranulosis, disappearance of the granular layer,
vacuolization and damage of basal layer, spongiosis, sawtooth appearance of retes, follicular horn plug, perifollicular parakeratosis, inflammatory mononuclear infiltrate, band-like infiltrate). Our multigroup classification model selected 27 discriminatory attributes, and successfully classified the patients into six groups, each with an unbiased correct classification of greater than 93% (with a 100% correct rate for groups 1, 3, 5, and 6) and an average overall accuracy of 98%. Using 250 subjects to develop the rule, and testing the remaining 116 patients, we obtained a prediction accuracy of 91%.

Predicting Presence/Absence of Heart Disease

The four databases concerning heart disease diagnosis were collected by Dr. Andras Janosi of the Hungarian Institute of Cardiology, Budapest; Dr. William Steinbrunn of University Hospital, Zurich; Dr. Matthias Pfisterer of University Hospital, Basel; and Dr. Robert Detrano of the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation [60]. Each database contains the same 76 attributes. The "goal" field refers to the presence of heart disease in the patient. The classification attempts to distinguish presence (values 1, 2, 3, 4, involving a total of 509 subjects) from absence (value 0, involving 411 subjects) [91]. The attributes include demographics, physio-cardiovascular conditions, traditional risk factors, family history, personal lifestyle, and cardiovascular exercise measurements. This data set has posed some challenges to past analysis via various classification approaches, resulting in less than 80% correct classification. Applying our classification model without reserved judgment, we obtained 79 and 85% correct classification for each group, respectively. To determine the usefulness of multistage analysis, we applied two-stage classification. In the first stage, 14 attributes were selected as discriminatory. One hundred and thirty-five group-absence subjects were placed into the reserved-judgment region, with 85% of the remaining being classified as group absence correctly; while 286 group-presence subjects were placed into the reserved-judgment region, and 91% of the remaining were classified correctly into the group presence. In the second stage, 11 attributes were selected, with 100 and 229 classified into group absence and presence, respectively. Combining the two stages, we obtained a correct classification of 82 and 85%, respectively, for diagnosis of absence or presence of heart disease. Figure 1 illustrates the two-stage classification.

Disease Diagnosis: Optimization-Based Methods, Figure 1 A tree diagram for two-stage classification and prediction of heart disease
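The two-stage logic of Figure 1 amounts to a simple cascade: entities that the first-stage rule places in the reserved-judgment region are passed to a second-stage rule developed for exactly those deferred cases. A minimal sketch, with hypothetical stage-rule functions, might look as follows.

```python
# Sketch of a two-stage reserved-judgment cascade; the stage rules here are
# hypothetical stand-ins for rules produced by the classification models.
def two_stage_classify(x, stage1_rule, stage2_rule):
    """Return a group label; stage 1 may abstain (None = reserved judgment)."""
    label = stage1_rule(x)      # stage 1: classify or defer
    if label is not None:
        return label
    return stage2_rule(x)       # stage 2: applied only to deferred entities

# Toy usage with fabricated score thresholds:
s1 = lambda x: "absence" if x < 0.3 else "presence" if x > 0.7 else None
s2 = lambda x: "absence" if x < 0.5 else "presence"
print(two_stage_classify(0.4, s1, s2))  # deferred to stage 2 -> "absence"
```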

Predicting Aberrant CpG Island Methylation in Human Cancer

More details of this work can be found in [28,29]. Epigenetic silencing associated with aberrant methylation of promoter-region CpG islands is one mechanism leading to loss of tumor suppressor function in human cancer. Profiling of CpG island methylation indicates that some genes are more frequently methylated than others, and that each tumor type is associated with a unique set of methylated genes. However, little is known about why certain genes succumb to this aberrant event. To address this question, we used restriction landmark genome scanning (RLGS) to analyze the susceptibility of 1749 unselected CpG islands to de novo methylation driven by overexpression of DNMT1. We found that whereas the overall incidence of CpG island methylation was increased in cells overexpressing DNMT1, not all loci were equally affected. The majority of CpG islands (69.9%) were resistant to de novo methylation, regardless of DNMT1 overexpression. In contrast, we identified a subset of methylation-prone CpG islands (3.8%) that were consistently hypermethylated in multiple DNMT1-overexpressing clones. Methylation-prone and methylation-resistant CpG islands were not significantly different with respect to size, C+G content, CpG frequency, chromosomal location, or gene association or promoter association.


To discriminate methylation-prone from methylation-resistant CpG islands, we developed a novel DNA pattern recognition model and algorithm [61], and coupled our predictive model described herein with the patterns found. We were able to derive a classification function, based on the frequency of seven novel sequence patterns, that was capable of discriminating methylation-prone from methylation-resistant CpG islands with 90% correctness upon cross-validation, and 85% accuracy when tested against blind CpG islands whose methylation status was unknown to us. The data indicate that CpG islands differ in their intrinsic susceptibility to de novo methylation, and suggest that the propensity for a CpG island to become aberrantly methylated can be predicted on the basis of its sequence context.

The significance of this research is twofold. First, the identification of sequence patterns/attributes that distinguish methylation-prone CpG islands will lead to a better understanding of the basic mechanisms underlying aberrant CpG island methylation. Because genes that are silenced by methylation are otherwise structurally sound, the potential for reactivating these genes by blocking or reversing the methylation process represents an exciting new molecular target for chemotherapeutic intervention. A better understanding of the factors that contribute to aberrant methylation, including the identification of sequence elements that may act to target aberrant methylation, will be an important step in achieving this long-term goal. Secondly, the classification of the more than 29,000 known (but as yet unclassified) CpG islands in human chromosomes will provide an important resource for the identification of novel gene targets for further study as potential molecular markers that could impact both cancer prevention and treatment. Extensive RLGS fingerprint information (and thus potential training sets of methylated CpG islands) already exists for a number of human tumor types, including breast, brain, lung, leukemias, hepatocellular carcinomas, and primitive neuroectodermal tumor [23,24,35,102]. Thus, the methods and tools developed are directly applicable to CpG island methylation data derived from human tumors. Moreover, new microarray-based techniques capable of "profiling" more than 7000 CpG islands have been developed and applied to human breast cancers [15,117,118]. We are uniquely poised to take advantage of the tumor CpG island methylation profile information that will likely be generated using these techniques over the next several years. Thus, our general predictive modeling framework has the potential to lead to improved diagnosis, prognosis, and treatment planning for cancer patients.

Discriminant Analysis of Cell Motility and Morphology Data in Human Lung Carcinoma

Refer to [16] for more details of this work. This study focuses on the differential effects of extracellular matrix proteins on the motility and morphology of human lung epidermoid carcinoma cells. The behavior of carcinoma cells is contrasted with that of normal L-132 cells, resulting in a method for the prediction of metastatic potential. Data collected from time-lapsed videomicroscopy were used to simultaneously produce quantitative measures of motility and morphology. The data were subsequently analyzed using our discriminant analysis model and algorithm to discover relationships between motility, morphology, and substratum. Our discriminant analysis tools enabled the consideration of many more cell attributes than is customary in cell motility studies. The observations correlate with behaviors seen in vivo and suggest specific roles for the extracellular matrix proteins and their integrin receptors in metastasis. Cell translocation in vitro has been associated with malignancy, as has an elongated phenotype [120] and a rounded phenotype [97]. Our study suggests that extracellular matrix proteins contribute in different ways to the malignancy of cancer cells, and that multiple malignant phenotypes exist.

Ultrasound-Assisted Cell Disruption for Drug Delivery

Reference [57] discusses this in detail. Although biological effects of ultrasound must be avoided for safe diagnostic applications, ultrasound's ability to disrupt cell membranes has attracted interest as a method to facilitate drug and gene delivery. This preliminary study seeks to develop rules for predicting the degree of cell membrane disruption based on specified ultrasound parameters and measured acoustic signals. Too much ultrasound destroys cells, while cell membranes will not open up for absorption of macromolecules when too little ultrasound is applied. The key is to increase cell permeability to allow absorption of macromolecules, and to apply ultrasound transiently to disrupt viable cells so


as to enable exogenous material to enter without cell damage. Thus, our task is to uncover a "predictive rule" of ultrasound-mediated disruption of red blood cells using acoustic spectra and measurements of cell permeability recorded in experiments. Our predictive model and solver for generating prediction rules were applied to data obtained from a sequence of experiments on bovine red blood cells. For each experiment, the attributes consisted of four ultrasound parameters, acoustic measurements at 400 frequencies, and a measure of cell membrane disruption. To avoid overtraining, various feature combinations of the 404 predictor variables were selected when developing the classification rule. The results indicate that the variable combination consisting of ultrasound exposure time and acoustic signals measured at the driving frequency and its higher harmonics yields the best rule, and our method compares favorably with classification tree and other ad hoc approaches, with a correct classification rate of 80% upon cross-validation and 85% when classifying new unknown entities. Our methods used for deriving the prediction rules are broadly applicable, and could be used to develop prediction rules in other scenarios involving different cell types or tissues. These rules, and the methods used to derive them, could be used for real-time feedback about ultrasound's biological effects. For example, they could assist clinicians during a drug delivery process, or could be imported into an implantable device inside the body for automatic drug delivery and monitoring.

Identification of Tumor Shape and Volume in Treatment of Sarcoma

Reference [56] includes the detailed analysis. This project involves the determination of tumor shape for adjuvant brachytherapy treatment of sarcoma, based on catheter images taken after surgery. In this application, the entities are overlapping consecutive triplets of catheter markings, each of which is used for determining the shape of the tumor contour. The triplets are to be classified into one of two groups: group 1 (triplets for which the middle catheter marking should be bypassed) and group 2 (triplets for which the middle marking should not be bypassed). To develop and validate a classification rule, we used clinical data collected from 15 soft-tissue sarcoma patients. Cumulatively, this comprised 620 triplets of catheter markings. By careful (and tedious) clinical analysis of


the geometry of these triplets, 65 were determined to belong to group 1, the "bypass" group, and 555 were determined to belong to group 2, the "do-not-bypass" group. A set of measurements associated with each triplet was then determined. The choice of what attributes to measure to best distinguish triplets as belonging to group 1 or group 2 is nontrivial. The attributes involved the distance between each pair of markings, angles, and the curvature formed by the three triplet markings. On the basis of the attributes selected, our predictive model was used to develop a classification rule. The resulting rule provides 98% correct classification on cross-validation, and was capable of correctly determining/predicting 95% of the shape of the tumor with new patients' data. We remark that the current clinical procedure requires manual outlining based on markers in films of the tumor volume. This study was the first to use automatic construction of tumor shape for sarcoma adjuvant brachytherapy [56,62].

Discriminant Analysis of Biomarkers for Prediction of Early Atherosclerosis

More detail on this work can be found in [65]. Oxidative stress is an important etiologic factor in the pathogenesis of vascular disease. Oxidative stress results from an imbalance between injurious oxidant and protective antioxidant events, of which the former predominate [88,103]. This results in the modification of proteins and DNA, alteration in gene expression, promotion of inflammation, and deterioration in endothelial function in the vessel wall, all processes that ultimately trigger or exacerbate the atherosclerotic process [22,111]. It was hypothesized that novel biomarkers of oxidative stress would predict early atherosclerosis in a relatively healthy nonsmoking population free from cardiovascular disease. One hundred and twenty-seven healthy nonsmokers without known clinical atherosclerosis had carotid intima media thickness (IMT) measured using ultrasound. Plasma oxidative stress was estimated by measuring plasma lipid hydroperoxides using the determination of reactive oxygen metabolites (d-ROMs) test. Clinical measurements include traditional risk factors, including age, sex, low-density lipoprotein (LDL), high-density lipoprotein (HDL), triglycerides, cholesterol, body-mass index (BMI), hypertension, diabetes mellitus, smoking history, family history of coronary artery


disease, Framingham risk score, and high-sensitivity C-reactive protein. For this prediction, the patients were first clustered into two groups (group 1, IMT $\ge$ 0.68; group 2, IMT < 0.68). On the basis of this separator, 30 patients belonged to group 1, and 97 belonged to group 2. Through each iteration, the classification method trains and learns from the input training set and returns the most discriminatory patterns among the 14 clinical measurements, ultimately resulting in the development of a prediction rule based on observed values of these discriminatory patterns among the patient data. Using all 127 patients as a training set, the predictive model identified age, sex, BMI, HDL cholesterol, family history of coronary artery disease under 60, high-sensitivity C-reactive protein, and d-ROM as discriminatory attributes that together provide unbiased correct classification of 90 and 93%, respectively, for group 1 (IMT $\ge$ 0.68) and group 2 (IMT < 0.68) patients. To further test the power of the classification method for correctly predicting the IMT status of new/unseen patients, we randomly selected a smaller patient training set of size 90. The predictive rule from this training set yielded 80 and 89% correct rates for predicting the remaining 37 patients as group 1 and group 2 patients, respectively. The importance of d-ROM as a discriminatory predictor for IMT status was confirmed during the machine learning process: this biomarker was selected in every iteration as the "machine" learned and was trained to develop a predictive rule to correctly classify patients in the training set. We also performed predictive analysis using Framingham risk score and d-ROM; in this case the unbiased correct classification rates (for the 127 individuals) for groups 1 and 2 were 77 and 84%, respectively. This is the first study to illustrate that this measure of oxidative stress can be effectively used along with traditional risk factors to generate a predictive rule that can potentially serve as an inexpensive clinical diagnostic tool for prediction of early atherosclerosis.

Fingerprinting Native and Angiogenic Microvascular Networks Through Pattern Recognition and Discriminant Analysis of Functional Perfusion Data

The analysis and findings are described in [64]. The cardiovascular system provides oxygen and nutrients to the entire body. Pathological conditions that impair

normal microvascular perfusion can result in tissue ischemia, with potentially serious clinical effects. Conversely, development of new vascular structures fuels the progression of cancer, macular degeneration, and atherosclerosis. Fluorescence microangiography offers superb imaging of the functional perfusion of new and existent microvasculature, but quantitative analysis of the complex capillary patterns is challenging. We developed an automated pattern-recognition algorithm to systematically analyze the microvascular networks, and then applied our classification model described herein to generate a predictive rule. The pattern-recognition algorithm identifies the complex vascular branching patterns, and the predictive rule demonstrates, respectively, 100 and 91% correct classification for perturbed (diseased) and normal tissue perfusion. We confirmed that transplantation of normal bone marrow to mice in which genetic deficiency resulted in impaired angiogenesis eliminated predicted differences and restored normal-tissue perfusion patterns (with 100% correctness). The pattern-recognition and classification method offers an elegant solution for the automated fingerprinting of microvascular networks that could contribute to better understanding of angiogenic mechanisms and be utilized to diagnose and monitor microvascular deficiencies. Such information would be valuable for early detection and monitoring of functional abnormalities before they produce obvious and lasting effects, which may include improper perfusion of tissue or support of tumor development. The algorithm can be used to discriminate the angiogenic response in a native healthy specimen from that in groups with impairment due to age, or to chemical or other genetic deficiency. Similarly, it can be applied to analyze angiogenic responses as a result of various treatments. This will serve two important goals. First, the identification of discriminatory patterns/attributes that distinguish angiogenesis status will lead to a better understanding of the basic mechanisms underlying this process. Because therapeutic control of angiogenesis could influence physiological and pathological processes such as wound and tissue repair, cancer progression and metastasis, or macular degeneration, the ability to understand it under different conditions will offer new insight into developing novel therapeutic interventions, monitoring, and treatment, especially in aging and heart disease. Thus, our study


and the results form the foundation of a valuable diagnostic tool for changes in the functionality of the microvasculature and for discovery of drugs that alter the angiogenic response. The methods can be applied to tumor diagnosis, monitoring, and prognosis. In particular, it will be possible to derive microangiographic fingerprints to acquire specific microvascular patterns associated with early stages of tumor development. Such "angioprinting" could become an extremely helpful early diagnostic modality, especially for easily accessible tumors such as skin cancer.

Prediction of Protein Localization Sites

The protein localization database consists of eight groups with a total of 336 instances (143, 77, 52, 35, 20, 5, 2, and 2, respectively) with seven attributes [91]. The eight groups are eight localization sites of protein: cytoplasm (cp), inner membrane without signal sequence (im), periplasm (pp), inner membrane with uncleavable signal sequence (imU), outer membrane (om), outer membrane lipoprotein (omL), inner membrane lipoprotein (imL), and inner membrane with cleavable signal sequence (imS). However, the last four groups were taken out of our classification experiment since the population sizes are too small to ensure significance. The seven attributes include McGeoch's method for signal sequence recognition (mcg), von Heijne's method for signal sequence recognition (gvh), von Heijne's signal peptidase II consensus sequence score (lip), presence of charge on N-terminus of predicted lipoproteins (chg), score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins (aac), score of the ALOM membrane-spanning region prediction program (alm1), and score of the ALOM program after excluding putative cleavable signal regions from the sequence (alm2). In the classification we use four groups, 307 instances, with seven attributes. Our classification model selected the discriminatory patterns mcg, gvh, alm1, and alm2 to form the predictive rule, with unbiased correct classification rates of 89%, compared with 81% by other classification models [48].

Pattern Recognition in Satellite Images for Determining Types of Soil

The satellite database consists of the multispectral values of pixels in 3 × 3 neighbor-


hoods in a satellite image, and the classification associated with the central pixel in each neighborhood. The aim is to predict this classification, given the multispectral values. In the sample database, the class of a pixel is coded as a number. There are six groups, with 4435 samples in the training data set and 2000 samples in the testing data set; each sample entity has 36 attributes describing the spectral bands of the image [91]. The original Landsat Multi-Spectral Scanner (MSS) image data for this database were generated from data purchased from NASA by the Australian Centre for Remote Sensing. The Landsat satellite data are one of the many sources of information available for a scene. The interpretation of a scene by integrating spatial data of diverse types and resolutions, including multispectral and radar data, and maps indicating topography, land use, etc., is expected to assume significant importance with the onset of an era characterized by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to the green and red regions of the visible spectrum) and two are in the (near) infrared. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80 m × 80 m. Each image contains 2340 × 3380 such pixels. The database is a (tiny) subarea of a scene, consisting of 82 × 100 pixels. Each line of data corresponds to a 3 × 3 square neighborhood of pixels completely contained within the 82 × 100 subarea. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the nine pixels in the 3 × 3 neighborhood, and a number indicating the classification label of the central pixel. The number is a code for the following six groups: red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, and very damp gray soil. Running our classification model, we selected 17 discriminatory attributes to form the classification rule, producing an unbiased prediction with 85% accuracy.

Further Advances

Brooks and Lee [17,18] devised other variations of the basic DAMIP model. They also showed that


DAMIP is strongly universally consistent (in some sense), with very good rates of convergence from Vapnik-Chervonenkis theory. A polynomial-time algorithm for discriminating between two populations with the DAMIP model was developed, and DAMIP was shown to be NP-complete for a general number of groups. The proof demonstrating NP-completeness employs results used in generating edges of the conflict graph [4,11,12,55]. Exploiting the necessary and sufficient conditions that identify edges in the conflict graph is the central contribution to the improvement in solution performance over industry-standard software. The conflict graph is the basis for various valid inequalities, a branching scheme, and conditions under which integer variables are fixed for all solutions. Additional solution methods are identified, including a heuristic for finding solutions at nodes in the branch-and-bound tree, upper bounds for model parameters, and necessary conditions for edges in the conflict hypergraph [26,58]. Further, we have concluded that DAMIP is a computationally feasible, consistent, stable, robust, and accurate classifier.

Progress and Challenges

We summarize in Table 2 the mathematical programming techniques used in classification problems as reviewed in this chapter. As noted by current research efforts, multigroup classification remains NP-complete, and much work is needed to design effective models as well as to derive novel and efficient computational algorithms to solve these multigroup instances.

Disease Diagnosis: Optimization-Based Methods, Table 2 Progress in mathematical programming-based classification models

Mathematical programming methods                              References
Linear programming
  Two-group classification
    Separate data by hyperplanes                              [74,75]
    Minimizing the sum of deviations, minimizing the
      maximum deviation, and minimizing the sum of
      interior distances                                      [5,31,32,33,47,99]
    Hybrid model                                              [45,99]
    Review                                                    [27,50,107]
    Software                                                  [110]
    Issues about normalization                                [34,44,51,52,53,87,100,114,115,116]
    Robust linear programming                                 [9,86]
    Inclusion of second-order terms                           [104,113]
    Effect of the position of outliers                        [94]
    Binary attributes                                         [3]
  Multigroup classification
    Single function classification                            [32]
    Multiple function classification                          [10,46]
    Classification with reserved-judgment region
      using linear programming                                [39,40,60,63]
Nonlinear programming
  Two-group classification
    Lp-norm criterion                                         [108]
    Review                                                    [27,50,107]
    Piecewise-linear nonconvex discriminant function          [85]
    Minimizing the number of misclassifications               [21,76,77]
    Minimizing the sum of arbitrary-norm distances            [78]
Mixed integer programming
  Two-group classification
    Minimizing the number of misclassifications               [1,5,6,7,54,101,105,109,119]
    Review                                                    [27,50,107]
    Software                                                  [110]
    Secondary goals                                           [96]
    Binary attributes                                         [3]
    Normalization and attribute selection                     [42]
    Dichotomous categorical variable formation                [43]
  Multigroup classification
    Multigroup classification                                 [41,93]
    Three-group classification                                [71,72,95]
    Classification with reserved-judgment region
      using mixed integer programming                         [17,39,40,59,60]
Support vector machine
  Introduction and tutorial                                   [20,112]
  Generalized support vector machine                          [79,83]
  Methods for huge-size problems                              [13,36,37,38,67,68,80,81,82,84]
  Multigroup support vector machine                           [17,38,39,40,49,59,60,63,66]

Other Methods

While most classification methods can be described in terms of discriminant functions, some methods are not trained in the paradigm of determining coefficients or parameters for functions of a predefined form. These methods include classification and regression trees, nearest-neighbor methods, and neural networks.

Classification and regression trees [14] are nonparametric approaches to prediction. Classification trees seek to develop classification rules based on successive binary partitions of observations based on attribute values. Regression trees also employ rules consisting of binary partitions, but are used to predict continuous responses. The rules generated by classification trees are easily viewable by plotting them in a treelike structure, from which the name arises. A test entity may be classified using rules in a tree plot by first comparing the entity's data with the root node of the tree. If the root node condition is satisfied by the data for a particular entity, the left branch is followed to another node; otherwise, the right branch is followed to another node. The data from the observation are compared with conditions at subsequent nodes until a leaf node is reached.

Nearest-neighbor methods begin by establishing a set of labeled prototype observations. The nearest-neighbor classification rule assigns test entities to groups according to the group membership of the nearest prototype. Different measures of distance may be used. The k-nearest-neighbor rule assigns entities to groups according to the group membership of the k nearest prototypes.

Neural networks are classification models that can also be interpreted in terms of discriminant functions, though they are used in a way that does not require finding an analytic form for the functions [25]. Neural networks are trained by considering one observation at a time, modifying the classification procedure slightly with each iteration.
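For comparison with the optimization-based models above, the tree and nearest-neighbor rules just described are available off the shelf. The following sketch uses scikit-learn (an assumption; any comparable library would do), with the iris data standing in for a generic labeled training sample.

```python
# Illustrative tree and k-nearest-neighbor classifiers (scikit-learn);
# the iris data set is only a stand-in for a generic labeled sample.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Classification tree: successive binary partitions on attribute values.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# k-NN rule: assign by the group membership of the 5 nearest prototypes.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(tree.predict(X[:2]), knn.predict(X[:2]))  # classify two test entities
```

Note that neither method provides a reserved-judgment region by default; abstention would have to be layered on top, for example by thresholding predicted class probabilities.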

Summary and Conclusion

In this chapter, we presented an overview of mathematical programming-based classification models, and analyzed their development and advances in recent years. Many mathematical programming methods are geared toward two-group analysis only, and their performance is often compared with Fisher's LDF or Smith's QDF. It has been noted that these methods can be used for multiple group analysis by finding G(G-1)/2 discriminants for each pair of groups ("one against one") or by finding G discriminants for each group versus the remaining data ("one against all"), but these approaches can lead to ambiguous classification rules [25]. Mathematical programming methods developed for multiple group analysis have been described [10,32,39,40,41,46,59,60,63,93]. Multiple group formulations for SVMs have been proposed and tested [17,36,40,49,59,60,66], but are still considered computationally intensive [49]. The "one-against-one" and "one-against-all" methods with SVMs have been successfully applied [49,90].

We also discussed a class of multigroup general-purpose predictive models that we have developed based on the technology of large-scale optimization and SVMs [17,19,39,40,59,60,63]. Our models seek to maximize the correct classification rate while constraining the number of misclassifications in each group. The models incorporate the following features: (1) the ability to classify any number of distinct groups; (2) the ability to incorporate heterogeneous types of attributes as input; (3) a high-dimensional data transformation that eliminates noise and errors in biological data; (4) constraints on the misclassifications in each group, together with a reserved-judgment region that provides a safeguard against overtraining (which tends to lead to high misclassification rates from the resulting predictive rule); and (5) successive multistage classification capability to handle data points placed in the reserved-judgment region. The performance and predictive power of the classification models are validated through a broad class of biological and medical applications.

Classification models are critical to medical advances as they can be used in genomic, cell, molecular, and system-level analyses to assist in early prediction, diagnosis, and detection of disease, as well as for intervention and monitoring. As shown in the CpG


island study for human cancer, such prediction and diagnosis open up novel therapeutic sites for early intervention. The ultrasound application illustrates its use in a novel drug delivery mechanism, assisting clinicians during a drug delivery process, or in devising devices that can be implanted into the body for automated drug delivery and monitoring. The lung cancer cell motility study offers an understanding of how cancer cells behave in different protein media, thus assisting in the identification of potential gene therapy and target treatment. Prediction of the shape of a cancer tumor bed provides a personalized treatment design, replacing manual estimates by sophisticated computer predictive models. Prediction of early atherosclerosis through inexpensive biomarker measurements and traditional risk factors can serve as a potential clinical diagnostic tool for routine physical and health maintenance, alerting physicians and patients to the need for early intervention to prevent serious vascular disease. Fingerprinting of microvascular networks opens up the possibility for early diagnosis of perturbed systems in the body that may trigger disease (e.g., genetic deficiency, diabetes, aging, obesity, macular degeneration, tumor formation), identification of target sites for treatment, and monitoring prognosis and success of treatment. Determining the type of erythemato-squamous disease and the presence/absence of heart disease helps clinicians to correctly diagnose and effectively treat patients. Thus, classification models serve as a basis for predictive medicine, where the desire is to diagnose early and provide personalized target intervention. This has the potential to reduce healthcare costs, improve success of treatment, and improve quality of life of patients.

References

1. Abad PL, Banks WJ (1993) New LP based heuristics for the classification problem. Eur J Oper Res 67:88–100
2. Anderson JA (1969) Constrained discrimination between k populations. J Roy Statist Soc Ser B (Methodological) 31(1):123–139
3. Asparoukhov OK, Stam A (1997) Mathematical programming formulations for two-group classification with binary variables. Ann Oper Res 74:89–112
4. Atamturk A (1998) Conflict graphs and flow models for mixed-integer linear optimization problems. PhD thesis, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta

5. Bajgier SM, Hill AV (1982) An experimental comparison of statistical and linear programming approaches to the discriminant problem. Decis Sci 13:604–618
6. Banks WJ, Abad PL (1991) An efficient optimal solution algorithm for the classification problem. Decis Sci 22:1008–1023
7. Banks WJ, Abad PL (1994) On the performance of linear programming heuristics applied on a quadratic transformation in the classification problem. Eur J Oper Res 74:23–28
8. Bennett KP (1992) Decision tree construction via linear programming. In: Evans M (ed) Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, pp 97–101
9. Bennett KP, Mangasarian OL (1992) Robust linear programming discrimination of two linearly inseparable sets. Optim Methods Softw 1:23–34
10. Bennett KP, Mangasarian OL (1994) Multicategory discrimination via linear programming. Optim Methods Softw 3:27–39
11. Bixby RE, Lee EK (1998) Solving a truck dispatching scheduling problem using branch-and-cut. Oper Res 46:355–367
12. Borndörfer R (1997) Aspects of set packing, partitioning and covering. PhD thesis, Technischen Universität Berlin, Berlin
13. Bradley PS, Mangasarian OL (2000) Massive data discrimination via linear support vector machines. Optim Methods Softw 13(1):1–10
14. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove
15. Brock GJ, Huang TH, Chen CM, Johnson KJ (2001) A novel technique for the identification of CpG islands exhibiting altered methylation patterns (ICEAMP). Nucleic Acids Res 29:e123
16. Brooks JP, Wright A, Zhu C, Lee EK (2007) Discriminant analysis of motility and morphology data from human lung carcinoma cells placed on purified extracellular matrix proteins. Ann Biomed Eng, Submitted
17. Brooks JP, Lee EK (2006) Solving a mixed-integer programming formulation of a multi-category constrained discrimination model. In: Proceedings of the 2006 INFORMS Workshop on Artificial Intelligence and Data Mining, Pittsburgh
18. Brooks JP, Lee EK (2007) Analysis of the consistency of a mixed integer programming-based multi-category constrained discriminant model. Submitted
19. Brooks JP, Lee EK (2007) Mixed integer programming constrained discrimination model for credit screening. In: Proceedings of the 2007 Spring Simulation Multiconference, Business and Industry Symposium, Norfolk, VA, pp 1–6. ACM Digital Library


20. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Mining Knowl Discov 2:121–167
21. Chen C, Mangasarian OL (1996) Hybrid misclassification minimization. Adv Comput Math 5:127–136
22. Chevion M, Berenshtein E, Stadtman ER (2000) Human studies related to protein oxidation: protein carbonyl content as a marker of damage. Free Radical Res 33(Suppl):S99–S108
23. Costello JF, Fruhwald MC, Smiraglia DJ, Rush LJ, Robertson GP, Gao X, Wright FA, Feramisco JD, Peltomaki P, Lang JC, Schuller DE, Yu L, Bloomfield CD, Caligiuri MA, Yates A, Nishikawa R, Su HH, Petrelli NJ, Zhang X, O'Dorisio MS, Held WA, Cavenee WK, Plass C (2000) Aberrant CpG-island methylation has non-random and tumour-type-specific patterns. Nat Genet 24:132–138
24. Costello JF, Plass C, Cavenee WK (2000) Aberrant methylation of genes in low-grade astrocytomas. Brain Tumor Pathol 17:49–56
25. Duda RO, Hart PE, Stork DG (2001) Pattern Classification. Wiley, New York
26. Easton T, Hooker K, Lee EK (2003) Facets of the independent set polytope. Math Program Ser B 98:177–199
27. Erenguc SS, Koehler GJ (1990) Survey of mathematical programming models and experimental results for linear discriminant analysis. Managerial Decis Econ 11:215–225
28. Feltus FA, Lee EK, Costello JF, Plass C, Vertino PM (2003) Predicting aberrant CpG island methylation. Proc Natl Acad Sci USA 100:12253–12258
29. Feltus FA, Lee EK, Costello JF, Plass C, Vertino PM (2006) DNA signatures associated with CpG island methylation states. Genomics 87:572–579
30. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
31. Freed N, Glover F (1981) A linear programming approach to the discriminant problem. Decis Sci 12:68–74
32. Freed N, Glover F (1981) Simple but powerful goal programming models for discriminant problems. Eur J Oper Res 7:44–60
33. Freed N, Glover F (1986) Evaluating alternative linear programming models to solve the two-group discriminant problem. Decis Sci 17:151–162
34. Freed N, Glover F (1986) Resolving certain difficulties and improving the classification power of LP discriminant analysis formulations. Decis Sci 17:589–595
35. Fruhwald MC, O'Dorisio MS, Rush LJ, Reiter JL, Smiraglia DJ, Wenger G, Costello JF, White PS, Krahe R, Brodeur GM, Plass C (2000) Gene amplification in NETs/medulloblastomas: mapping of a novel amplified gene within the MYCN amplicon. J Med Genet 37:501–509
36. Fung GM, Mangasarian OL (2001) Proximal support vector machine classifiers. In: Proceedings KDD-2001, San Francisco
37. Fung GM, Mangasarian OL (2002) Incremental support vector machine classification. In: Grossman R, Mannila H, Motwani R (eds) Proceedings of the Second SIAM International Conference on Data Mining. SIAM, Philadelphia, pp 247–260

38. 39.

40.

41.

42.

43.

44. 45. 46.

47. 48.

49.

50.

51. 52. 53.

54.

D

Motwani R (eds) Proceedings of the Second SIAM International Conference on Data Mining. SIAM, Philadelphia, pp 247–260 Fung GM, Mangasarian OL (2005) Multicategory proximal support vector machine classifiers. Mach Learn 59:77–97 Gallagher RJ, Lee EK, Patterson DA (1996) An optimization model for constrained discriminant analysis and numerical experiments with iris, thyroid, and heart disease datasets. In: Proceedings of the 1996 American Medical Informatics Association Gallagher RJ, Lee EK, Patterson DA (1997) Constrained discriminant analysis via 0/1 mixed integer programming. Ann Oper Res 74:65–88 Gehrlein WV (1986) General mathematical programming formulations for the statistical classification problem. Oper Res Lett 5(6):299–304 Glen JJ (1999) Integer programming methods for normalisation and variable selection in mathematical programming discriminant analysis models. J Oper Res Soc 50:1043–1053 Glen JJ (2004) Dichotomous categorical variable formation in mathematical programming discriminant analysis models. Naval Res Logist 51:575–596 Glover F (1990) Improved linear programming models for discriminant analysis. Decis Sci 21:771–785 Glover F, Keene S, Duea B (1988) A new class of models for the discriminant problem. Decis Sci 19:269–280 Gochet W, Stam A, Srinivasan V, Chen S (1997) Multigroup discriminant analysis using linear programming. Oper Res 45(2):213–225 Hand DJ (1981) Discrimination and classification. Wiley, New York Horton P, Nakai K (1996) A probablistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, St. Louis, USA, pp 109–115 Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Networks 13(2):415–425 Joachimsthaler EA, Stam A (1990) Mathematical programming approaches for the classification problem in two-group discriminant analysis. Multivariate Behavior Res 25(4):427–454 Koehler GJ (1989) Characterization of unacceptable solutions in LP discriminant analysis. Decis Sci 20:239–257 Koehler GJ (1989) Unacceptable solutions and the hybrid discriminant model. Decis Sci 20:844–848 Koehler GJ (1994) A response to Xiao’s “necessary and sufficient conditions of unacceptable solutions in LP discriminant analysls”: Something is amiss. Decis Sci 25: 331–333 Koehler GJ, Erenguc SS (1990) Minimizing misclassifications in linear discriminant analysis. Decis Sci 21: 63–85

781

782

D

Disease Diagnosis: Optimization-Based Methods

55. Lee EK (1993) Solving a truck dispatching scheduling problem using branch-and-cut. PhD thesis, Computational and Applied Mathematics, Rice University, Houston 56. Lee EK, Fung AYC, Brooks JP, Zaider M (2002) Automated planning volume definition in soft-tissue sarcoma adjuvant brachytherapy. Biol Phys Med 47:1891–1910 57. Lee EK, Gallagher RJ, Campbell AM, Prausnitz MR (2004) Prediction of ultrasound-mediated disruption of cell membranes using machine learning techniques and statistial analysis of acoustic spectra. IEEE Trans Biomed Eng 51:1–9 58. Lee EK, Maheshwary S (2006) Conflict hypergraphs in integer programming. Technical report, Georgia Institute of Technology, submitted 59. Lee EK (2007) Optimization-based predictive models in medicine and biology. Optimization in Medicine. Springer Netherlands. Springer Series in Optimization and Its Application 12:127–151 60. Lee EK (2007) Large-scale optimization-based classification models in medicine and biology. Ann Biomed Eng Syst Biol Bioinformat 35(6):1095–1109 61. Lee EK, Easton T, Gupta K (2006) Novel evolutionary models and applications to sequence alignment problems. Ann Oper Res Oper Res Medic – Comput Optim Medic Life Sci 148:167–187 62. Lee EK, Fung AYC, Zaider M (2001) Automated planning volume contouring in soft-tissue sarcoma adjuvant brachytherapy treatment. Int J Radiat Oncol Biol Phys 51:391 63. Lee EK, Gallagher RJ, Patterson DA (2003) A linear programming approach to discriminant analysis with a reserved-judgment region. INFORMS J Comput 15(1):23–41 64. Lee EK, Jagannathan S, Johnson C, Galis ZS (2006) Fingerprinting native and angiogenic microvascular networks through pattern recognition and discriminant analysis of functional perfusion data. submitted 65. Lee EK, Wu TL, Ashfaq S, Jones DP, Rhodes SD, Weintrau WS, Hopper CH, Vaccarino V, Harrison DG, Quyyumi AA (2007) Prediction of early atherosclerosis in healthy adults via novel markers of oxidative stress and d-ROMs. Working paper 66. Lee Y, Lin Y, Wahba G (2004) Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. J Am Stat Assoc 99:67–81 67. Lee YJ, Mangasarian OL (2001) RSVM: Reduced support vector machines. In: Proceedings of the SIAM International Conference on Data Mining, Chicago, April 5–7 68. Lee YJ, Mangasarian OL (2001) SSVM: A smooth support vector machine for classification. Comput Optim Appl 20(1):5–22 69. Lee YJ, Mangasarian OL, Wolberg WH (2000) Breast cancer survival and chemotherapy: A support vector machine analysis. In: DIMACS Series in Discrete Mathemati-

70.

71.

72.

73.

74. 75. 76. 77.

78. 79.

80.

81.

82.

83.

84. 85.

cal and Theoretical Computer Science, vol 55. American Mathematical Society, Providence, pp 1–10 Lee YJ, Mangasarian OL, Wolberg WH (2003) Survivaltime classification of breast cancer patients. Comput Optim Appl 25:151–166 Loucopoulos C, Pavur R (1997) Computational characteristics of a new mathematical programming model for the three-group discriminant problem. Comput Oper Res 24(2):179–191 Loucopoulos C, Pavur R (1997) Experimental evaluation of the classificatory performance of mathematical programming approaches to the three-group discriminant problem: The case of small samples. Ann Oper Res 74:191–209 Luedi PP, Hartemink AJ, Jirtle RL (2005) Genome-wide prediction of imprinted murine genes. Genome Res 15:875–884 Mangasarian OL (1965) Linear and nonlinear separation of patterns by linear programming. Oper Res 13:444–452 Mangasarian OL (1968) Multi-surface method of pattern separation. IEEE Trans Inform Theory 14(6):801–807 Mangasarian OL (1994) Misclassification minimization. J Global Optim 5:309–323 Mangasarian OL (1996) Machine learning via polyhedral concave minimization. In: Fischer H, Riedmueller B, Schaeffler S (eds) Applied Mathematics and Parallel computing – Festschrift for Klaus Ritter. Physica-Verlag, Heidelberg, pp 175–188 Mangasarian OL (1999) Arbitrary-norm separating plane. Oper Res Lett 24:15–23 Mangasarian OL (2000) Generalized support vector machines. In: Smola AJ, Bartlett P, Schökopf B, Schuurmans D (eds) Advances in Large Margin Classifiers. MIT Press, Cambridge, pp 135–146 Mangasarian OL (2003) Data mining via support vector machines. In: Sachs EW, Tichatschke R (eds) System Modeling and Optimization XX. Kluwer, Boston, pp 91–112 Mangasarian OL (2005) Support vector machine classification via parameterless robust linear programming. Optim Methods Softw 20:115–125 Mangasarian OL, Musicant DR (1999) Successive overrelaxation for support vector machines. IEEE Trans Neural Networks 10:1032–1037 Mangasarian OL, Musicant DR (2001) Data discrimination via nonlinear generalized support vector machines. In: Ferris MC, Mangasarian OL, Pang JS (eds) Complementarity: Applications, Algorithms and Extensions. Kluwer, Boston, pp 233–251 Mangasarian OL, Musicant DR (2001) Lagrangian support vector machines. J Mach Learn Res 1:161–177 Mangasarian OL, Setiono R, Wolberg WH (1990) Pattern recognition via linear programming: Theory and application to medical diagnosis. In: Coleman TF, Li Y (eds) Large-Scale Numerical Optimization. SIAM, Philadelphia, pp 22–31


86. Mangasarian OL, Street WN, Wolberg WH (1995) Breast cancer diagnosis and prognosis via linear programming. Oper Res 43(4):570–577 87. Markowski EP, Markowski CA (1985) Some difficulties and improvements in applying linear programming formulations to the discriminant problem. Decis Sci 16:237–247 88. McCord JM (2000) The evolution of free radicals and oxidative stress. Am J Med 108:652–659 89. McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition. Wiley, New York 90. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B (2001) An introduction to kernel-based learning algorithms. IEEE Trans Neural Networks 12(2):181–201 91. Murphy PM, Aha DW (1994) UCI Repository of machine learning databases http://www.ics.uci.edu/~mlearn/ MLRepository.html. Department of Information and Computer Science, University of California, Irvine 92. O’Hagan A (1994) Kendall’s Advanced Theory of Statistics: Bayesian Inference, vol 2B. Halsted Press, New York 93. Pavur R (1997) Dimensionality representation of linear discriminant function space for the multiple-group problem: An MIP approach. Ann Oper Res 74:37–50 94. Pavur R (2002) A comparative study of the effect of the position of outliers on classical and nontraditional approaches to the two-group classification problem. Eur J Oper Res 136:603–615 95. Pavur R, Loucopoulos C (2001) Evaluating the effect of gap size in a single function mathematical programming model for the three-group classification problem. J Oper Res Soc 52:896–904 96. Pavur R, Wanarat P, Loucopoulos C (1997) Examination of the classificatory performance of MIP models with secondary goals for the two-group discriminant problem. Ann Oper Res 74:173–189 97. Raz A, Ben-Zéev A (1987) Cell contact and architecture of malignant cells and their relationship to metastasis. Cancer Metastasis Rev 6:3–21 98. Rencher AC (1998) Multivariate Statistical Inference and Application. Wiley, New York 99. Rubin PA (1990) A comparison of linear programming and parametric approaches to the two-group discriminant problem. Decis Sci 21:373–386 100. Rubin PA (1991) Separation failure in linear programming discriminant models. Decis Sci 22:519–535 101. Rubin PA (1997) Solving mixed integer classification problems by decomposition. Ann Oper Res 74:51–64 102. Rush LJ, Dai Z, Smiraglia DJ, Gao X, Wright FA, Fruhwald M, Costello JF, Held WA, Yu L, Krahe R, Kolitz JE, Bloomfield CD, Caligiuri MA, Plass C (2001) Novel methylation targets in de novo acute myeloid leukemia with prevalence of chromosome 11 loci. Blood 97:3226– 3233 103. Sies H (1985) Oxidative stress: introductory comments. In: Sies H (ed) Oxidative Stress. Academic Press, London, pp 1–8


104. Duarte Silva AP, Stam A (1994) Second order mathematical programming formulations for discriminant analysis. Eur J Oper Res 72:4–22 105. Duarte Silva AP, Stam A (1997) A mixed integer programming algorithm for minimizing the training sample misclassification cost in two-group classification. Ann Oper Res 74:129–157 106. Smith CAB (1947) Some examples of discrimination. Ann Eugenics 13:272–282 107. Stam A (1997) Nontraditional approaches to statistical classification: Some perspectives on lp -norm methods. Ann Oper Res 74:1–36 108. Stam A, Joachimsthaler EA (1989) Solving the classification problem in discriminant analysis via linear and nonlinear programming methods. Decis Sci 20:285–293 109. Stam A, Joachimsthaler EA (1990) A comparison of a robust mixed-integer approach to existing methods for establishing classification rules for the discriminant problem. Eur J Oper Res 46:113–122 110. Stam A, Ungar DR (1995) RAGNU: A microcomputer package for two-group mathematical programming-based nonparametric classification. Eur J Oper Res 86:374–388 111. Tahara S, Matsuo M, Kaneko T (2001) Age-related changes in oxidative damage to lipids and DNA in rat skin. Mechan Ageing Develop 122:415–426 112. Vapnik V (1995) The Nature of Statistical Learning Theory. Springer, New York 113. Wanarat P, Pavur R (1996) Examining the effect of second-order terms in mathematical programming approaches to the classification problem. Eur J Oper Res 93: 582–601 114. Xiao B (1993) Necessary and sufficient conditions of unacceptable solutions in LP discriminant analysis. Decis Sci 24:699–712 115. Xiao B (1994) Decision power and solutions of LP discriminant models: Rejoinder. Decis Sci 25:335–336 116. Xiao B, Feng Y (1997) Alternative discriminant vectors in LP models and a regularization method. Ann Oper Res 74:113–127 117. Yan PS, Chen CM, Shi H, Rahmatpanah F, Wei SH, Caldwell CW, Huang TH (2001) Dissecting complex epigenetic alterations in breast cancer using CpG island microarrays. Cancer Res 61:8375–8380 118. Yan PS, Perry MR, Laux DE, Asare AL, Caldwell CW, Huang TH (2000) CpG island arrays: an application toward deciphering epigenetic signatures of breast cancer. Clin Cancer Res 6:1432–1438 119. Yanev N, Balev S (1999) A combinatorial approach to the classification problem. Eur J Oper Res 115:339–350 120. Zimmermann A, Keller HU (1987) Locomotion of tumor cells as an element of invasion and metastasis. Biomed Pharmacotherapy 41:337–344 121. Zopounidis C, Doumpos M (2002) Multicriteria classification and sorting methods: A literature review. Eur J Oper Res 138:229–246


Disjunctive Programming

DP

HANIF D. SHERALI
Virginia Polytechnic Institute and State University, Blacksburg, USA
MSC2000: 90C09, 90C10, 90C11

Article Outline
Keywords
See also
References

Keywords
Disjunctive programming; Polyhedral annexation; Facial disjunctive program; Cutting planes; Nondominated cuts; Facet; Valid inequalities; Reformulation-linearization technique; Lift-and-project; Tight relaxations; Model reformulation; Convex hull; Mixed integer 0–1 programs; Polynomial programs; Nonconvex programs

Disjunctive programming (DP) problems can be stated in the form

(DP)  Minimize $\{ f(x) : x \in X, \; x \in \bigcup_{h \in H} S_h \}$,

where $f : \mathbb{R}^n \to \mathbb{R}$ is a lower semicontinuous function, X is a closed convex subset of the nonnegative orthant of $\mathbb{R}^n$, and H is an index set for the collection of nonempty polyhedra

$S_h = \{ x : A^h x \ge b^h, \; x \ge 0 \}, \quad h \in H. \qquad (1)$

The name for this class of problems arises from the feature that the constraints in (1) include the disjunction that at least one of the (linear) sets of constraints defining $S_h$, for $h \in H$, must be satisfied. Problems including other logical conditions such as conjunctions, negations, and implications can be cast in the framework of this problem. Problem (DP) subsumes the classes of 0–1 mixed integer problems, the generalized lattice point problem, the cardinality constrained linear program, the extreme point optimization problem, the linear complementarity problem, among numerous others, and finds application in several related problems such as orthogonal production scheduling, scheduling on identical machines, multistage assignment, location-allocation problems, load balancing problems, the segregated storage problem, the fixed-charge problem, project/portfolio selection problems, goal programming problems, and many other game theory and decision theory problems (see [35] for a detailed discussion of such problems and applications).

The theory and algorithms for disjunctive programming problems are mainly supported by the fundamental disjunctive cut principle. The forward part of this result, due to E. Balas [4,5], states that for any nonnegative surrogate multiplier vectors $\lambda^h$, $h \in H$, the inequality

$\sup_{h \in H} \{\lambda^h A^h\}\, x \ge \inf_{h \in H} \{\lambda^h b^h\} \qquad (2)$

is valid for (or is implied by) the disjunction $x \in \bigcup_{h \in H} S_h$, where the sup and inf in (2) are taken componentwise. More importantly, the converse part of this result, due to R.G. Jeroslow [16], states that for any given valid inequality $\pi x \ge \pi_0$ for the disjunction $x \in \bigcup_{h \in H} S_h$, there exist nonnegative surrogate multipliers $\lambda^h$, $h \in H$, such that the disjunctive cut (2) implies this given valid inequality, or uniformly dominates it, over the nonnegative orthant. This disjunctive cut principle also arises from the setting of convexity cuts and polyhedral annexation methods as propounded by F. Glover [11,12], and it subsumes, as well as can improve upon, many types of classical cutting planes such as Gomory's mixed integer cuts, intersection cuts, and reverse outer polar cuts for 0–1 programs (see [4,5,11,12,35]). H.P. Williams [39] provides some additional insights into disjunctive formulations.

The generation of particular types of 'deep cuts' to delete a given solution (say, the origin, without loss of generality), based on the criteria of maximizing the Euclidean distance or the rectilinear distance between the origin and the nonnegative region feasible to the cutting plane, or of maximizing the surplus in the cut with respect to the origin subject to suitable normalization constraints, has also been explored in [34,37]. The intent behind such cutting plane methods is to generate nondominated valid inequalities that are supports (and hopefully, facets) of the closure convex hull of solutions feasible to the disjunction. H.D. Sherali and C.M. Shetty [35,37] discuss how different alternate formulations of the disjunctive statement can influence the strength of the cut derived therefrom, and demonstrate how a sequence of augmented formulations can be used to sequentially tighten a given valid inequality. This process turns out to be precisely the Glover polyhedral annexation scheme in [12]. In contrast with this sequence-dependent 'lifting' procedure, Sherali and Shetty [37] propose a 'simultaneous lifting' variant of this approach. Other types of disjunctive cutting planes for special problems include the cuts of [4,5,10,11,12,20] and [32] for linear knapsack, multiple choice and combinatorial disjunctions, [22] for linear complementarity problems, and the facet cuts of [25] based on the convex hull of certain types of disjunctions.

Balas [3] also provides an algebraic characterization for the closure convex hull of a union of polyhedra. This characterization is particularly useful in the study of the important class of facial disjunctive programs, which subsumes mixed integer 0–1 problems and linear complementarity problems, for example. A facial disjunctive program (FDP) can be stated as follows.

(FDP)  Minimize $\{ cx : x \in X \cap Y \}$,

where X is a nonempty polytope in $\mathbb{R}^n$, and where Y is a conjunction of some $\hat h$ disjunctions given in the so-called conjunctive normal form (conjunction of disjunctions)

$Y = \bigcap_{h \in H} \left[ \bigcup_{i \in Q_h} \{ x : a^{hi} x \ge b^{hi} \} \right]. \qquad (3)$

Here, $H = \{1, \ldots, \hat h\}$, and for each $h \in H$ we have specified a disjunction that requires at least one of the inequalities $a^{hi} x \ge b^{hi}$, for $i \in Q_h$, to be satisfied. The terminology 'facial' conveys the feature that $X \cap \{x : a^{hi} x \ge b^{hi}\}$ defines a face of X for each $i \in Q_h$, $h \in H$. For example, in the context of 0–1 mixed integer problems, the set X represents the linear programming relaxation of the problem, and for each binary variable $x_h$, $h \in H$, the corresponding disjunction in (3) states that $x_h \le 0$ or $x_h \ge 1$ should hold true (where $0 \le x_h \le 1$ is included within X). Balas [3] shows that for facial disjunctive programs, the convex hull of feasible solutions can be constructed inductively by starting with $K_0 = X$ and then determining

$K_h = \mathrm{conv}\left( \bigcup_{i \in Q_h} \left( K_{h-1} \cap \{ x : a^{hi} x \ge b^{hi} \} \right) \right) \quad \text{for } h = 1, \ldots, \hat h, \qquad (4)$

where $K_{\hat h}$ produces $\mathrm{conv}(X \cap Y)$. Based on this, a hierarchy of relaxations $K_0, \ldots, K_{\hat h}$ is generated for (FDP) that spans the spectrum from the linear programming to the convex hull representation [6]. Each member in this hierarchy can also be viewed as being obtained by representing the feasible region of the original problem as the intersection of the union of certain polyhedra, and then taking a hull-relaxation of this representation. Here, for a set $D = \bigcap_j D_j$, where each $D_j$ is the union of certain polyhedra, the hull-relaxation of D [3] is defined as $\mathrm{h\text{-}rel}(D) = \bigcap_j \mathrm{conv}(D_j) \supseteq \mathrm{conv}(D)$.

In the context of 0–1 mixed integer problems (MIP), Sherali and W.P. Adams [27,28] develop a reformulation-linearization technique (RLT) for generating a hierarchy of such relaxations, introducing the notion of multiplying constraints using factors composed of $x_h$ and $(1 - x_h)$, $h \in H$, to reformulate the problem, followed by a variable substitution to linearize the resulting problem. Approaches based on such constraint product and linearization strategies were used by these authors earlier in the context of several special applications [1,2,26]. Later, L. Lovász and A. Schrijver [17] independently used more general constraint factors to generate a similar hierarchy for 0–1 problems. The foregoing RLT construct can be specialized to derive $K_h$ defined by (4) for 0–1 MIPs, where in this case,

$K_h = \mathrm{conv}\left( (K_{h-1} \cap \{x : x_h \le 0\}) \cup (K_{h-1} \cap \{x : x_h \ge 1\}) \right)$

can be obtained by multiplying the (implicitly defined) constraints of $K_{h-1}$ by $x_h$ and $(1 - x_h)$ and then linearizing the resulting problem. This RLT approach is used in [8] in the 'lift-and-project' hierarchy of relaxations. However, the RLT process of [27,28] generates tighter relaxations at each level, which can be viewed as hull relaxations produced by the intersection of the convex hull of the union of certain specially constructed polyhedra. No direct realization of (4) can produce these relaxations. For a survey on RLT approaches and for further enhancements, see [29,30].
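The forward part of the cut principle (2) is directly computable. The following is a minimal sketch, not from the original entry: given systems $A^h x \ge b^h$, $x \ge 0$, and arbitrarily chosen nonnegative multipliers $\lambda^h$ (the data and multiplier choices below are illustrative assumptions), it assembles a valid inequality $\pi x \ge \pi_0$ by taking the componentwise sup on the left-hand side and the inf on the right-hand side.

```python
import numpy as np

def disjunctive_cut(systems, multipliers):
    """Assemble the disjunctive cut (2) for x in the union of polyhedra
    S_h = {x : A_h x >= b_h, x >= 0}.
    systems:     list of (A_h, b_h) pairs
    multipliers: list of nonnegative vectors lambda_h, one per polyhedron
    Returns (pi, pi0) such that pi x >= pi0 is valid for the disjunction."""
    lhs = np.array([lam @ A for (A, _), lam in zip(systems, multipliers)])
    rhs = np.array([lam @ b for (_, b), lam in zip(systems, multipliers)])
    return lhs.max(axis=0), rhs.min()   # componentwise sup, and inf over h

# Example: the disjunction x1 <= 0 or x1 >= 1 in R^2,
# written in the ">=" form as -x1 >= 0 or x1 >= 1
S1 = (np.array([[-1.0, 0.0]]), np.array([0.0]))
S2 = (np.array([[ 1.0, 0.0]]), np.array([1.0]))
pi, pi0 = disjunctive_cut([S1, S2], [np.array([1.0]), np.array([1.0])])
print(pi, pi0)   # different multiplier choices yield different valid cuts
```

Different choices of the multipliers $\lambda^h$ trace out the family of cuts implied by (2); the 'deep cut' criteria mentioned above amount to optimizing over this family.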


In the context of general facial disjunctive programs, Jeroslow [15] presented a cutting plane algorithm that generates suitable facetial inequalities at each stage of the procedure such that overall finite convergence is guaranteed via (4). This is accomplished by showing that, in the worst case, the hierarchy $K_0, \ldots, K_{\hat h}$ would be generated. The lift-and-project algorithm of [8] employs this cutting plane procedure based on the foregoing hierarchy of relaxations. Balas [7] also addresses an enhanced procedure that considers two variables at a time to define the disjunctions. The RLT process is used to construct partial convex hulls, and the resulting relaxations are embedded in a branch and cut algorithm. Furthermore, for general facial disjunctive programs, Sherali and Shetty [36] present another finitely convergent cutting plane algorithm. At each step, this procedure searches for extreme faces of X relative to the cuts generated thus far (these are faces that do not contain any feasible points lying in a lower-dimensional face of X, see [18]), and based on the dimension of this extreme face and its feasibility to Y, either a disjunctive face cut or a disjunctive intersection cut is generated. This procedure was specialized for bilinear programming problems in [33] to derive a first nonenumerative finitely convergent algorithm for this class of problems. Other disjunctive cutting plane algorithms include the Sherali–Sen procedures [31] for solving the general class of extreme point mathematical programs, and the Baptiste–LePape procedures [9] and the Pinto–Grossmann procedures [21] for solving certain scheduling problems having disjunctive logic constraints. S. Sen and Sherali [24] also discuss issues related to designing convergent cutting plane algorithms, and present examples to show nonconvergence of certain iterative disjunctive cutting plane methods. Sensitivity and stability issues related to feasible and optimal sets of disjunctive programs have been addressed in [14]; [13] deals with the problem of solving algebraic systems of disjunctive equations. For other applications of disjunctive methods to process systems engineering, and to logic programming, see [19,23,38].

See also
- MINLP: Branch and Bound Global Optimization Algorithm

- MINLP: Branch and Bound Methods
- MINLP: Global Optimization with αBB
- MINLP: Logic-Based Methods
- Reformulation-linearization Methods for Global Optimization

References
1. Adams WP, Sherali HD (1986) A tight linearization and an algorithm for zero-one quadratic programming problems. Managem Sci 32(10):1274–1290
2. Adams WP, Sherali HD (1990) Linearization strategies for a class of zero-one mixed integer programming problems. Oper Res 38(2):217–226
3. Balas E (1974) Disjunctive programming: Properties of the convex hull of feasible points. Managem Sci Res Report GSIA Carnegie-Mellon Univ 348, July
4. Balas E (1974) Intersection cuts from disjunctive constraints. Managem Sci Res Report Carnegie-Mellon Univ 330, February
5. Balas E (1975) Disjunctive programming: Cutting planes from logical conditions. In: Mangasarian OL, Meyer RR, Robinson SM (eds) Nonlinear Programming. Acad. Press, New York
6. Balas E (1985) Disjunctive programming and a hierarchy of relaxations for discrete optimization problems. SIAM J Alg Discrete Meth 6:466–485
7. Balas E (1997) A modified lift-and-project procedure. Math Program 79(1–3):19–32
8. Balas E, Ceria S, Cornuejols G (1993) A lift-and-project cutting plane algorithm for mixed 0-1 programs. Math Program 58:295–324
9. Baptiste P, LePape C (1996) Disjunctive constraints for manufacturing scheduling: Principles and extensions. Internat J Comput Integrated Manufacturing 9(4):306–310
10. Glover F (1973) Convexity cuts and cut search. Oper Res 21:123–134
11. Glover F (1974) Polyhedral convexity cuts and negative edge extensions. Z Oper Res 18:181–186
12. Glover F (1975) Polyhedral annexation in mixed integer and combinatorial programming. Math Program 8:161–188 (see also MSRS Report 73-9, Univ. Colorado, August 1973)
13. Grossmann IE, Turkay M (1996) Solution of algebraic systems of disjunctive equations. Comput Chem Eng 20, Suppl:S339–S344
14. Helbig S (1994) Stability in disjunctive optimization II. Continuity of the feasible and optimal set. Optim 31(1):63–93
15. Jeroslow RG (1977) A cutting plane game and its algorithms. Discussion Paper, Center Oper Res and Econometrics, Univ Catholique de Louvain 7724, June
16. Jeroslow RG (1977) Cutting plane theory: Disjunctive methods. Ann Discret Math 1:293–330


17. Lovász L, Schrijver A (1991) Cones of matrices and set functions and 0-1 optimization. SIAM J Optim 1:166–190
18. Majthay A, Whinston A (1974) Quasi-concave minimization subject to linear constraints. Discret Math 9:35–59
19. McAloon K, Tretkoff C (1997) Logic, modeling, and programming. Ann Oper Res 71:335–372
20. Owen G (1973) Cutting planes for programs with disjunctive constraints. Optim Theory Appl 11:49–55
21. Pinto JM, Grossmann IE (1997) A logic based approach to scheduling problems with resource constraints. Comput Chem Eng 21(8):801–818
22. Ramarao B, Shetty CM (1984) Application of disjunctive programming to the linear complementarity problem. Naval Res Logist Quart 31:589–600
23. Sakama C, Seki H (1997) Partial deduction in disjunctive logic programming. J Logic Programming 32(3):229–245
24. Sen S, Sherali HD (1985) On the convergence of cutting plane algorithms for a class of nonconvex mathematical programs. Math Program 31(1):42–56
25. Sen S, Sherali HD (1986) Facet inequalities from simple disjunctions in cutting plane theory. Math Program 34(1):72–83
26. Sherali HD, Adams WP (1984) A decomposition algorithm for a discrete location-allocation problem. Oper Res 32:878–900
27. Sherali HD, Adams WP (1990) A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM J Discret Math 3(3):411–430
28. Sherali HD, Adams WP (1994) A hierarchy of relaxations and convex hull characterizations for mixed-integer zero-one programming problems. Discrete Appl Math 52:83–106 (earlier manuscript, Virginia Polytechnic Inst. State Univ., 1989)
29. Sherali HD, Adams WP (1996) Computational advances using the reformulation-linearization technique (RLT) to solve discrete and continuous nonconvex problems. OPTIMA 49:1–6
30. Sherali HD, Adams WP, Driscoll P (1998) Exploiting special structures in constructing a hierarchy of relaxations for 0-1 mixed integer problems. Oper Res 46(3):396–405
31. Sherali HD, Sen S (1985) A disjunctive cutting plane algorithm for the extreme point mathematical programming problem. Opsearch (Theory) 22(2):83–94
32. Sherali HD, Sen S (1985) On generating cutting planes from combinatorial disjunctions. Oper Res 33(4):928–933
33. Sherali HD, Shetty CM (1980) A finitely convergent algorithm for bilinear programming problems using polar cuts and disjunctive face cuts. Math Program 19:14–31
34. Sherali HD, Shetty CM (1980) On the generation of deep disjunctive cutting planes. Naval Res Logist Quart 27(3):453–475
35. Sherali HD, Shetty CM (1980) Optimization with disjunctive constraints. Lecture Notes Economics and Math Systems, vol 181. Springer, Berlin


36. Sherali HD, Shetty CM (1982) A finitely convergent procedure for facial disjunctive programs. Discrete Appl Math 4:135–148
37. Sherali HD, Shetty CM (1983) Nondominated cuts for disjunctive programs and polyhedral annexation methods. Opsearch (Theory) 20(3):129–144
38. Vecchietti A, Grossmann IE (1997) LOGMIP: A disjunctive 0-1 nonlinear optimizer for process systems models. Comput Chem Eng 21, Suppl:S427–S432
39. Williams HP (1994) An alternative explanation of disjunctive formulations. Eur J Oper Res 72(1):200–203

Distance Dependent Protein Force Field via Linear Optimization

R. RAJGARIA, S. R. MCALLISTER, CHRISTODOULOS A. FLOUDAS
Department of Chemical Engineering, Princeton University, Princeton, USA

Article Outline
Abstract
Keywords
Introduction
Theory and Modeling
Physical Constraints
Database Selection and Decoy Generation
Training and Test Set
Results and Discussion
Conclusions
References

Abstract

Protein force fields play an important role in protein structure prediction. Knowledge-based force fields use database information to derive the interaction energy between different residues or atoms of a protein. These simplified force fields require less computational effort and are relatively easy to use. A Cα–Cα distance dependent high resolution force field has been developed using a set of high quality (low rmsd) decoys. A linear programming based formulation was used in which non-native "decoy" conformers are forced to take a higher energy than the corresponding native structure. This force field was tested on an independent test set and was found to excel on all the metrics that are widely used to measure the effectiveness of a force field.


Keywords
Force field; Potential model; High resolution decoys; Protein structure prediction; Linear optimization; Protein design potential

Introduction

Predicting the structure of a protein from its amino acid sequence is one of the biggest and yet most fundamental problems in computational structural biology. Anfinsen's hypothesis [1], which underlies one of the main approaches to solving this problem, states that for a given physiological set of conditions the native structure of a protein corresponds to the global Gibbs free energy minimum. Thus, one needs a force field to calculate the energy of different conformers and pick the one with the lowest energy. Physics-based force fields consider various types of interactions (for example, van der Waals interactions, hydrogen bonding, electrostatic interactions, etc.) occurring at the atomic level of a protein to calculate the energy of a conformer. CHARMM [19], AMBER [5], ECEPP [20], ECEPP/3 [21] and GROMOS [24] are a few examples of physics-based force fields. On the other hand, knowledge-based force fields use information from databases. Researchers have used the Boltzmann distribution [4,7,26], optimization-based techniques [17,27] and many other approaches [6,12,13,14,15,16,18,23,25] to calculate these parameters. A recent review on such potentials can be found in Floudas et al. [8].

This work presents a novel Cα–Cα distance dependent high resolution force field that has been generated using a linear-optimization-based framework [22]. The emphasis is on high resolution, which enables one to differentiate between native and non-native structures that are very similar to each other (rmsd < 2 Å). The force field is called high resolution because it has been trained on a large set of high resolution decoys (small rmsd with respect to the native) and it is intended to effectively distinguish high resolution decoy structures from the native structure. The basic framework used in this work is similar to the one developed by Loose et al. [17]. However, it has been improved and applied to a diverse and enhanced (both in terms of quantity and quality) set of high resolution decoys. The newly proposed model has

resulted in remarkable improvements over the LKF potential. These high resolution decoys were generated using torsion angle dynamics in combination with restricted variations of the hydrophobic core within the native structure. This decoy set greatly improves the quality of training and testing. The force field developed in this paper was tested by comparing the energy of the native fold to the energies of decoy structures for proteins separate from those used to train the model. Other leading force fields were also tested on this high quality decoy set, and the results were compared with those of our high resolution potential. The comparison is presented in the Results and Discussion section.

Theory and Modeling

In this model, each amino acid is represented by the location of its Cα atom on the amino acid backbone. The conformation of a protein is represented by a coordinate vector, X, which includes the location of the Cα atom of each amino acid. The native conformation is denoted as $X^n$, while the index $i = 1, \ldots, N$ is used to denote the decoy conformations $X^i$. Non-native decoys are generated for each of $p = 1, \ldots, P$ proteins, and the energy of the native fold for each protein is forced to be lower than those of the decoy conformations (Anfinsen's hypothesis). This constraint is shown in the following equation:

$E(X^{p,i}) - E(X^{p,n}) > \varepsilon, \quad p = 1, \ldots, P, \; i = 1, \ldots, N \qquad (1)$

Equation (1) requires the native conformer to always be lower in energy than its decoys. A small positive parameter ε is used to avoid the trivial solution in which all energies are set to zero. An additional constraint (Eq. 2) is used to produce a nontrivial solution by constraining the sum of the differences in energies between decoy and native folds to be greater than a positive constant δ [28]. For the model presented in this paper, the values of ε and δ were set to 0.01 and 1000, respectively.

$\sum_{p} \sum_{i} [E(X^{p,i}) - E(X^{p,n})] > \delta \qquad (2)$

The energy of each conformation is taken as the arithmetic sum of pairwise interactions corresponding to each amino acid combination at a particular "contact" distance. A contact exists when the Cα carbons of two amino acids are within 9 Å of each other. The energy of each interaction is a function of the Cα–Cα distance and the identity of the interacting amino acids. To formulate the model, the energy of an interaction between a pair of amino acids IC within a distance bin ID was defined as $\gamma_{IC,ID}$. The eight distance bins defined for the formulation are shown in Table 1. The energy for any fold X of decoy i for a protein p is given by Eq. (3):

$E(X^{p,i}) = \sum_{(IC,ID)} N_{p,i,IC,ID} \, \gamma_{IC,ID} \qquad (3)$

Distance Dependent Protein Force Field via Linear Optimization, Table 1
Distance dependent bin definition [17]

Bin ID   Cα Distance [Å]
1        3–4
2        4–5
3        5–5.5
4        5.5–6
5        6–6.5
6        6.5–7
7        7–8
8        8–9

In this equation, $N_{p,i,IC,ID}$ is the number of interactions between an amino acid pair IC at a Cα–Cα distance ID. The set IC ranges from 1 to 210 to account for the 210 unique combinations of the 20 naturally occurring amino acids. These bin definitions yield a total of 1680 interaction parameters to be determined by this model. To determine these parameters, a linear programming formulation is used in which the energy of a native protein is compared with a large number of its decoys. The violations, in which a non-native fold has a lower energy than the natural conformation, are minimized by optimizing with respect to these interaction parameters. Equation (1) can be rewritten in terms of $N_{p,i,IC,ID}$ as Eq. (4), where the slack parameters $S_p$ are positive variables (Eq. 5) that represent the difference between the energies of the decoys and the native conformation of a given protein.

$\sum_{IC} \sum_{ID} [N_{p,i,IC,ID} - N_{p,n,IC,ID}] \, \gamma_{IC,ID} + S_p \ge \varepsilon, \quad p = 1, \ldots, P, \; i = 1, \ldots, N \qquad (4)$

$S_p \ge 0, \quad p = 1, \ldots, P \qquad (5)$

The objective function for this formulation is to minimize the sum of the slack variables $S_p$, written in the form of Eq. (6). The relative magnitude of $\gamma_{IC,ID}$ is meaningless, because if all $\gamma_{IC,ID}$ parameters are multiplied by a common factor then Eqs. (4) and (5) remain valid. In this formulation, the $\gamma_{IC,ID}$ values were bounded between −25 and 25.

$\min \sum_{p} S_p \qquad (6)$
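The training formulation (4)–(6) is small enough to prototype directly. The following is a minimal sketch, not the authors' GAMS/CPLEX implementation: it uses scipy, toy problem sizes, and random stand-ins for the interaction-count differences $N_{p,i,IC,ID} - N_{p,n,IC,ID}$, and it omits the normalization constraint (2) and the physical constraints for brevity.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
P, N, K = 3, 10, 8       # toy sizes: proteins, decoys per protein, gamma parameters
EPS = 0.01               # the epsilon of Eq. (4)

# Random stand-ins for N_{p,i,IC,ID} - N_{p,n,IC,ID} (decoy minus native counts)
dN = rng.integers(-3, 4, size=(P, N, K)).astype(float)

# Variable vector x = [gamma_1..gamma_K, S_1..S_P]; objective: min sum_p S_p (Eq. 6)
c = np.concatenate([np.zeros(K), np.ones(P)])

# Eq. (4): sum_k dN[p,i,k] * gamma_k + S_p >= EPS, written as -(...) <= -EPS
A_ub, b_ub = [], []
for p in range(P):
    for i in range(N):
        row = np.zeros(K + P)
        row[:K] = -dN[p, i]
        row[K + p] = -1.0
        A_ub.append(row)
        b_ub.append(-EPS)

# Bounds: -25 <= gamma_k <= 25, and S_p >= 0 (Eq. 5)
bounds = [(-25, 25)] * K + [(0, None)] * P

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
print(res.x[:K])         # recovered gamma parameters for the toy data
```

With real data, K would be 1680 and the constraint set would contain one row per (protein, decoy) pair, which is exactly where the memory pressure addressed by the Rank and Drop scheme described later arises.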

Physical Constraints

The above-mentioned equations constitute the basic constraints needed to solve this model. However, this set does not guarantee a physically realistic solution. It is possible to come up with a set of parameters that satisfies Eqs. (2)–(6) but does not reflect the actual interactions occurring between amino acids in a real system. To prohibit these unrealistic cases, another set of constraints based on the physical properties of the amino acids was imposed. Statistical results presented in Bahar and Jernigan [2] were also incorporated through the introduction of hydrophilic and hydrophobic constraints. The details of these physical constraints are given elsewhere [22].

Database Selection and Decoy Generation

The protein database selection is critical to force field training. This set should adequately represent the PDB set [3]. At the same time, it should not be too large, as the training becomes difficult with an increase in the size of the training set. Zhang and Skolnick [29] developed a set of 1,489 nonhomologous single domain proteins. High resolution decoys were generated for these proteins and used for training and testing purposes. High quality decoy generation was based on the hypothesis that high quality decoy structures should preserve information about the distances within the hydrophobic core of the native structure of each protein. For each of the proteins in the database, a number of distance constraints are introduced based on the hydrophobic–hydrophobic distances within the native structure, as sketched below. Using a set of proximity parameters, a large number of decoy structures are generated using DYANA [9]. The rmsd distribution of the decoy structures can be found elsewhere [22].
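The hydrophobic-core idea can be made concrete with a short sketch. This is an illustration under stated assumptions, not the actual DYANA protocol: the hydrophobic residue set and the ±slack window are placeholders, and real decoy generation applies torsion angle dynamics under such restraints.

```python
import numpy as np

HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "CYS"}  # illustrative set

def core_distance_restraints(residues, ca_coords, slack=0.5):
    """Collect native CA-CA distances between hydrophobic residue pairs and
    turn each into a (lower, upper) restraint of +/- slack angstroms.
    residues: list of 3-letter codes; ca_coords: (n, 3) array of CA positions."""
    restraints = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if residues[i] in HYDROPHOBIC and residues[j] in HYDROPHOBIC:
                d = float(np.linalg.norm(ca_coords[i] - ca_coords[j]))
                restraints.append((i, j, max(d - slack, 0.0), d + slack))
    return restraints

# Tiny fake "native structure" along a line, 3.8 A between consecutive CAs
res = ["MET", "GLY", "LEU", "SER", "PHE"]
xyz = np.array([[0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0], [15.2, 0, 0]], float)
print(core_distance_restraints(res, xyz))
```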


Training and Test Set

Of the 1400 proteins used for decoy generation, 1250 were randomly selected for training and the rest were used for testing purposes. For every protein in the set, 500–1600 decoys were generated, depending on the fraction of secondary structure present in the native structure of the protein. These decoys were sorted based on their Cα rmsd to the native structure, and then 500 decoys were randomly selected to represent the whole rmsd range. This creates a training set of 500 × 1250 = 625,000 decoys. However, because of computer memory limitations, it is not possible to include all of these decoys at the same time for training. An iterative scheme, "Rank and Drop" (sketched at the end of this section), was employed to overcome the memory problem while effectively using all the high quality structures. In this scheme, a subset of decoys is used to generate a force field. This force field is then used to rank all the decoys, and a set of the most challenging decoys (based on their energy value) is selected for the next round of force field generation. This process of force field generation and decoy ranking is repeated until there is no improvement in the ranking of the decoys [22]. This force field model was solved using the GAMS modeling language coupled with the CPLEX linear programming package [11].

It is equally important to test a force field on a difficult and rigorous testing set to confirm its effectiveness. The test set was comprised of 150 randomly selected proteins (41–200 amino acids in length). For each of the 150 test proteins, 500 high resolution decoys were generated using the same technique that was used to generate the training decoys. The minimum Cα-based rmsds for these non-native structures were in the range of 0–2 Å. This HR force field was also tested on another set of medium resolution decoys [17]. This set has 200 decoys for each of 151 proteins, and the minimum rmsd of its decoys ranged from 3–16 Å. This set, along with the high resolution decoy set, spans the practical range of possible protein structures that one might encounter during protein structure prediction.
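The Rank and Drop loop reduces to a simple control structure. The sketch below is a paraphrase, not the published code; train_force_field (the LP solve of Eqs. (4)–(6)), energy (the scoring of Eq. (3)), the decoy objects with a protein field, and the natives container are all hypothetical stand-ins.

```python
def rank_and_drop(natives, decoy_pool, subset_size, train_force_field, energy,
                  max_rounds=20):
    """Train on a memory-sized subset of decoys, then refill the subset with the
    decoys the current force field finds hardest, until rankings stop improving."""
    subset = decoy_pool[:subset_size]
    best_violations = None
    for _ in range(max_rounds):
        gamma = train_force_field(natives, subset)      # solve the LP (4)-(6)
        # A decoy is a violation when it scores at or below its native structure
        violations = sum(1 for d in decoy_pool
                         if energy(gamma, d) <= energy(gamma, natives[d.protein]))
        if best_violations is not None and violations >= best_violations:
            break                                       # rankings stopped improving
        best_violations = violations
        # Keep the most challenging decoys: smallest energy gap to the native
        gap = lambda d: energy(gamma, d) - energy(gamma, natives[d.protein])
        subset = sorted(decoy_pool, key=gap)[:subset_size]
    return gamma
```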

Results and Discussion

A linear optimization problem was solved using information from the 625,000 decoy structures, and the values of all the energy parameters were obtained. The ability to distinguish between the native structure and native-like conformers is the most significant test for any force field.

Distance Dependent Protein Force Field via Linear Optimization, Table 2
Testing force fields on 150 proteins of the high resolution decoy set. The TE13 force field was only tested on 148 cases

FF-Name   Average Rank   No of Firsts   Average rmsd [Å]
HR        1.87           113 (75.33%)   0.451
LKF       39.45          17 (11.33%)    1.721
TE13      19.94          92 (62.16%)    0.813
HL        44.93          70 (46.67%)    1.092

The HR force field was tested on the 500 decoys of each of the 150 test proteins. In this testing, the relative position, or rank, of the native conformation among its decoys was calculated. An ideal force field should be able to assign rank 1 to the native structures of all the test proteins. Other force fields like LKF [17], TE13 [27], and HL [10] were also tested on this set of high resolution decoys. All these force fields are fundamentally different from each other in their methods of energy estimation. Comparing the results obtained with these force fields aims to assess the fundamental utility of the HR force field. The comparison of the energy rankings obtained using the different force fields is presented in Table 2. From this table it is evident that the HR force field is the most effective in identifying the native structures by rank. The HR force field correctly identified the native folds of 113 proteins out of a set of 150 proteins, which compares favorably to a maximum of 92 (out of 148) by the TE13 force field.

Another analysis was carried out to evaluate the discrimination ability of these potentials. In this evaluation, all the decoys of the test set were ranked using these potentials. For each test protein, the Cα rmsd of the rank 1 conformer was calculated with respect to the native structure of that protein. The Cα rmsd would be zero for the cases in which a force field selects the native structure as rank 1. However, it will not be zero for all other cases, in which a non-native conformer is assigned the top rank. The average of these rmsds represents the spatial separation of the decoys with respect to the native structure. The average rmsd value obtained for each of the force fields is shown in Table 2. It can be seen that the average Cα rmsd value is lowest for the HR force field: 0.451 Å, which is much less than the 1.721 Å of the LKF and the 0.813 Å of the TE13 force field. This means


that the structures predicted by the HR force field have the least spatial deviation from their corresponding native structures. The HR force field was also tested on the test set published by Loose et al. [17] and was found to do better than the other force fields; the comparison results for this test can be found elsewhere [22]. The effectiveness of the HR force field is further reinforced by its success on the medium resolution decoy test set. On the test set of 110 medium resolution decoys, it was capable of correctly identifying 78.2% of the native structures, significantly more than the other force fields.
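Both evaluation metrics reduce to a few lines. A minimal sketch, assuming per-protein arrays in which the native conformer is listed first (index 0) with rmsd 0.0 (the toy numbers below are illustrative, not data from the study):

```python
import numpy as np

def native_rank(energies):
    """energies[0] is the native; rank 1 means no decoy scores lower."""
    return 1 + int(np.sum(np.asarray(energies)[1:] < energies[0]))

def rank1_rmsd(energies, rmsds):
    """Calpha rmsd (to the native) of the lowest-energy conformer;
    rmsds[0] = 0.0 by convention for the native itself."""
    return float(np.asarray(rmsds)[np.argmin(energies)])

# Toy protein: native energy -10.0 followed by three decoys
e = [-10.0, -7.5, -9.9, -12.1]
r = [0.0, 1.4, 0.8, 1.9]
print(native_rank(e))    # 2: one decoy scores below the native
print(rank1_rmsd(e, r))  # 1.9: the rank-1 conformer is a decoy at 1.9 angstroms
```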

Conclusions

The HR force field was developed using an optimization-based linear programming formulation, in which the model is trained using a diverse set of high quality decoys. Physically observed interactions between certain amino acids were written in the form of mathematical constraints and included in the formulation. The decoys were generated based on the premise that high quality decoy structures should preserve information about the distances within the hydrophobic core of the native structure of each protein. The set of interaction energy parameters obtained after solving the model was found to be of very good discriminatory capacity. This force field performed well on a set of independent, non-homologous high resolution decoys. This force field can become a powerful tool for fold recognition and de novo protein design.

References
1. Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230
2. Bahar I, Jernigan RL (1997) Inter-residue potential in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J Molec Biol 266:195–214
3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242
4. Bryant SH, Lawrence CE (1991) The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential, a statistical model for nonbonded interactions. Proteins: Structure, Function, Bioinformatics 9:108–119
5. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz Jr KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197
6. DeBolt S, Skolnick J (1996) Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: atomic burial position and pairwise non-bonded interactions. Protein Eng 9:637–655
7. Finkelstein AV, Badretdinov AY, Gutin AM (1995) Why do protein architectures have Boltzmann-like statistics? Proteins: Structure, Function, Bioinformatics 23:142–150
8. Floudas CA, Fung HK, McAllister SR, Mönnigmann M, Rajgaria R (2006) Advances in protein structure prediction and de novo protein design: A review. Chem Eng Sci 61:966–988
9. Güntert P, Mumenthaler C, Wüthrich K (1997) Torsion angle dynamics for NMR structure calculation with the new program DYANA. J Molec Biol 273:283–298
10. Hinds DA, Levitt M (1994) Exploring conformational space with a simple lattice model for protein structure. J Molec Biol 243:668–682
11. ILOG CPLEX (2003) User's Manual 9.0
12. Jernigan RL, Bahar I (1996) Structure-derived potentials and protein simulations. Curr Opin Struct Biol 6:195–209
13. Liwo A, Kazmierkiewicz R, Czaplewski C, Groth M, Oldziej S, Wawak RJ, Rackovsky S, Pincus MR, Scheraga HA (1998) A united-residue force field for off-lattice protein structure simulations. III. Origin of backbone hydrogen bonding cooperativity in united-residue potentials. J Comput Chem 19:259–276
14. Liwo A, Odziej S, Czaplewski C, Kozlowska U, Scheraga HA (2004) Parametrization of backbone-electrostatic and multibody contributions to the UNRES force field for protein-structure prediction from ab initio energy surfaces of model systems. J Phys Chem B 108:9421–9438
15. Liwo A, Oldziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA (1997) A united-residue force field for off-lattice protein structure simulations. I. Functional forms and parameters of long-range side-chain interaction potentials from protein crystal data. J Comput Chem 18:849–873
16. Liwo A, Pincus MR, Wawak RJ, Rackovsky S, Oldziej S, Scheraga HA (1997) A united-residue force field for off-lattice protein structure simulations. II. Parameterization of short-range interactions and determination of weights of energy terms by z-score optimization. J Comput Chem 18:874–887
17. Loose C, Klepeis JL, Floudas CA (2004) A new pairwise folding potential based on improved decoy generation and side-chain packing. Proteins: Structure, Function, Bioinformatics 54:303–314
18. Lu H, Skolnick J (2001) A distance-dependent knowledge-based potential for improved protein structure selection. Proteins: Structure, Function, Bioinformatics 44:223–232
19. MacKerell Jr AD, Bashford D, Bellott M, Dunbrack Jr RL, Evanseck JD, Field MJ, Fischer S, Gao J, Guo H, Ha S, Joseph-McCarthy D, Kuchnir L, Kuczera K, Lau FTK, Mattos C, Michnick S, Ngo T, Nguyen DT, Prodhom B, Reiher III WE, Roux B, Schlenkrich M, Smith JC, Stote R, Straub J, Watanabe M, Wiórkiewicz-Kuczera J, Yin D, Karplus M (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102:3586–3616
20. Momany FA, McGuire RF, Burgess AW, Scheraga HA (1975) Energy parameters in polypeptides. VII. Geometric parameters, partial atomic charges, nonbonded interactions, hydrogen bond interactions, and intrinsic torsional potentials for the naturally occurring amino acids. J Phys Chem 79:2361–2381
21. Némethy G, Gibson KD, Palmer KA, Yoon CN, Paterlini G, Zagari A, Rumsey S, Scheraga HA (1992) Energy parameters in polypeptides. 10. Improved geometrical parameters and nonbonded interactions for use in the ECEPP/3 algorithm, with application to proline-containing peptides. J Phys Chem 96:6472–6484
22. Rajgaria R, McAllister SR, Floudas CA (2006) Development of a novel high resolution Cα–Cα distance dependent force field using a high quality decoy set. Proteins: Structure, Function, Bioinformatics 65:726–741
23. Samudrala R, Moult J (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Molec Biol 275:895–916
24. Scott WRP, Hunenberger PH, Trioni IG, Mark AE, Billeter SR, Fennen J, Torda AE, Huber T, Kruger P, VanGunsteren WF (1997) The GROMOS biomolecular simulation program package. J Phys Chem A 103:3596–3607
25. Subramaniam S, Tcheng DK, Fenton J (1996) A knowledge-based method for protein structure refinement and prediction. In: States D, Agarwal P, Gaasterland T, Hunter L, Smith R (eds) Proceedings of the 4th International Conference on Intelligent Systems in Molecular Biology. AAAI Press, Boston, pp 218–229
26. Tanaka S, Scheraga HA (1976) Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9:945–950
27. Tobi D, Elber R (2000) Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins: Structure, Function, Bioinformatics 41:40–46
28. Tobi D, Shafran G, Linial N, Elber R (2000) On the design and analysis of protein folding potentials. Proteins: Structure, Function, Bioinformatics 40:71–85
29. Zhang Y, Skolnick J (2004) Automated structure prediction of weakly homologous proteins on a genomic scale. Proc Natl Acad Sci USA 101:7594–7599

Domination Analysis in Combinatorial Optimization

GREGORY GUTIN
Department of Computer Science, Royal Holloway, University of London, Egham, UK
MSC2000: 90C27, 90C59, 68Q25, 68W40, 68R10

Article Outline
Keywords and Phrases
Introduction
Definitions
Methods
Greedy-Type Algorithms
Better-Than-Average Heuristics
Cases
Traveling Salesman Problem
Upper Bounds for Domination Numbers of ATSP Heuristics
Multidimensional Assignment Problem (MAP)
Minimum Partition and Multiprocessor Scheduling Problems
Max Cut Problem
Constraint Satisfaction Problems
Vertex Cover, Independent Set and Clique Problems
Quadratic Assignment Problem
See also
References

Keywords and Phrases
Domination analysis; Domination number; Domination ratio

Introduction

Exact algorithms allow one to find optimal solutions to NP-hard combinatorial optimization (CO) problems. Many research papers report on solving large instances of some NP-hard problems (see, e.g., [25,27]). The running time of exact algorithms is often very high for large instances (many hours or even days), and very large instances remain beyond the capabilities of exact algorithms. Even for instances of moderate size, if we wish to remain within seconds or minutes rather than hours or days of running time, only heuristics can be used. Certainly, with heuristics, we are not guaranteed to find an optimum, but good heuristics normally produce near-optimal solutions. This is enough in most applications, since very often the data and/or mathematical model are not exact anyway. Research on CO heuristics has produced a large variety of heuristics, especially for well-known CO problems. Thus, we need to choose the best ones among them. In most of the literature, heuristics are compared in computational experiments. While experi-


mental analysis is of definite importance, it cannot cover all possible families of instances of the CO problem at hand and, in particular, it practically never covers the hardest instances.

Approximation Analysis [3] is a frequently used tool for the theoretical evaluation of CO heuristics. Let H be a heuristic for a combinatorial minimization problem P and let $\mathcal{I}_n$ be the set of instances of P of size n. In approximation analysis, we use the performance ratio $r_H(n) = \max\{ f(I)/f^*(I) : I \in \mathcal{I}_n \}$, where $f(I)$ ($f^*(I)$) is the value of the heuristic (optimal) solution of I. Unfortunately, for many CO problems, estimates for $r_H(n)$ are not constants and provide only a vague picture of the quality of heuristics. Moreover, even a constant performance ratio does not guarantee that the heuristic often outputs good-quality solutions; see, e.g., the discussion of the DMST heuristic below.

Domination Analysis (DA) (for surveys, see [22,24]) provides an alternative and a complement to approximation analysis. In DA, we are interested in the domination number or domination ratio of heuristics. The domination number (ratio) of a heuristic H for a combinatorial optimization problem P is the maximum number (fraction) of all solutions that are not better than the solution found by H for any instance of P of size n. In many cases, DA is very useful. For example, we will see later that the greedy algorithm has domination number 1 for many CO problems. In other words, the greedy algorithm, in the worst case, produces the unique worst possible solution. This is in line with the latest computational experiments with the greedy algorithm; see, e.g., [25], where the authors came to the conclusion that the greedy algorithm 'might be said to self-destruct' and that it should not be used even as 'a general-purpose starting tour generator'.

The Asymmetric Traveling Salesman Problem (ATSP) is the problem of computing a minimum weight tour (Hamilton cycle) passing through every vertex in a weighted complete digraph $K_n$ on n vertices. The Symmetric TSP (STSP) is the same problem, but on a complete undirected graph. When a certain fact holds for both ATSP and STSP, we will simply speak of TSP. Sometimes the maximizing version of TSP is of interest; we denote it by Max TSP. APX is the class of CO problems that admit polynomial time approximation algorithms with a constant performance ratio [3]. It is well known that while Max


TSP belongs to APX, TSP does not. This is at odds with the simple fact that a 'good' approximation algorithm for Max TSP can be easily transformed into an algorithm for TSP. Thus, it seems that both Max TSP and TSP should be in the same class of CO problems. This asymmetry was viewed as a drawback of the performance ratio already in the 1970s; see, e.g., [11,28,33]. Notice that from the DA point of view, Max TSP and TSP are equivalent problems. Zemel [33] was the first to characterize measures of quality of approximate solutions (of binary integer programming problems) that satisfy a few basic and natural properties: the measure becomes smaller for better solutions, it equals 0 for optimal solutions, and it is the same for corresponding solutions of equivalent instances. While the performance ratio and even the relative error (see [3]) do not satisfy the last property, the parameter $1 - r$, where r is the domination ratio, does satisfy all of the properties.

Local Search (LS) is one of the most successful approaches in constructing heuristics for CO problems. Recently, several researchers investigated LS with Very Large Scale Neighborhoods (see, e.g., [1,12,24]). The hypothesis behind this approach is that the larger the neighborhood, the better the quality of solutions expected to be found [1]. However, some computational experiments do not support this hypothesis; sometimes an LS with small neighborhoods proves to be superior to one with large neighborhoods. This means that some other parameters are responsible for the relative power of neighborhoods. Theoretical and experimental results on TSP indicate that one such parameter may well be the domination number of the corresponding LS.

In our view, it is advantageous to have bounds for both the performance ratio and the domination number (or domination ratio) of a heuristic whenever possible. Roughly speaking, this enables us to see a 2D rather than a 1D picture. For example, consider the double minimum spanning tree heuristic (DMST) for the Metric STSP (i.e., STSP with triangle inequality). DMST starts from constructing a minimum weight spanning tree T in the complete graph of the STSP, doubles every edge in T, finds a closed Euler trail E in the 'double' T, and cancels any repetition of vertices in E to obtain a TSP tour H. It is well known and easy to prove that the weight of H is at most twice the weight


of the optimal tour. Thus, the performance ratio for DMST is bounded by 2. However, Punnen, Margot and Kabadi [29] proved that the domination number of DMST is 1. Interestingly, in practice DMST often performs much worse than the well-known 2-Opt LS heuristic. For 2-Opt LS we cannot give any constant approximation guarantee, but the heuristic is of very large domination number [29].

The above example indicates that it makes sense to use DA to rank heuristics for the CO problem under consideration. If the domination number of a heuristic H is larger than the domination number of a heuristic H′ (for all or 'almost all' sizes n), we may say that H is better than H′ in the worst case (from the DA point of view). Berend, Skiena and Twitto [10] used DA to rank some well-known heuristics for the Vertex Cover problem (and, thus, the Independent Set and Clique problems). The three problems and the heuristics will be defined in the corresponding subsection of the Cases section. Ben-Arieh et al. [7] studied three heuristics for the Generalized TSP: the vertices of the complete digraph are partitioned into subsets, and the goal is to find a minimum weight cycle containing exactly one vertex from each subset. In the computational experiment in [7], one of the heuristics was clearly inferior to the other two. The best two behaved very similarly. Nevertheless, the authors of [7] managed to 'separate' the two heuristics by showing that one of them was of much larger domination number.

One might wonder whether a heuristic A, which is significantly better than another heuristic B from the DA point of view, is better than B in computational experiments. In particular, whether the ATSP greedy algorithm, which is of domination number 1, is worse, in computational experiments, than any ATSP heuristic of domination number at least $(n-2)!$. Generally speaking, the answer to this natural question is negative. This is because computational experiments and DA indicate different aspects of the quality of heuristics. Nevertheless, it seems that many heuristics of very small domination number, such as the ATSP greedy algorithm, perform poorly also in computational experiments and, thus, cannot be recommended for wide use in computational practice.

The rest of the entry is organized as follows. We give additional terminology and notation in the section Definitions. In the section Methods, we describe two pow-

erful methods in DA. In the section Cases, we consider DA results for some well-known CO problems.

Definitions

Let P be a CO problem and let H be a heuristic for P. The domination number domn(H, I) of H for an instance I of P is the number of solutions of I that are not better than the solution s produced by H, including s itself. For example, consider an instance T of the STSP on 5 vertices. Suppose that the weights of the tours in T are 2, 5, 5, 6, 6, 9, 9, 11, 11, 12, 12, 15 (every instance of the STSP on 5 vertices has 12 tours), and suppose that the greedy algorithm computes a tour of weight 6. Then domn(greedy, T) = 9. In general, if domn(H, I) equals the number of solutions in I, then H finds an optimal solution for I. If domn(H, I) = 1, then the solution found by H for I is the unique worst possible one. The domination number domn(H, n) of H is the minimum of domn(H, I) over all instances I of size n.

Since the ATSP on n vertices has $(n-1)!$ tours, an algorithm for the ATSP with domination number $(n-1)!$ is exact. The domination number of an exact algorithm for the STSP is $(n-1)!/2$. If an ATSP heuristic A has domination number equal to 1, then there is an assignment of weights to the arcs of each complete digraph $K_n$, $n \ge 2$, such that A finds the unique worst possible tour in $K_n$.

While studying TSP we normally consider only feasible solutions (tours); for several other problems, some authors take into consideration also infeasible solutions [10]. One example is the Maximum Independent Set problem, where, given a graph G, the aim is to find an independent set in G of maximum cardinality. Every non-empty set of vertices is considered to be a solution by Berend, Skiena and Twitto [10]. To avoid dealing with infeasible solutions (and, thus, reserving the term 'solution' only for feasible solutions), we also use the notion of the blackball number introduced in [10]. The blackball number bbn(H, I) of H for an instance I of P is the number of solutions of I that are better than the solution produced by H. The blackball number bbn(H, n) of H is the maximum of bbn(H, I) over all instances I of size n.

When the number of solutions depends not only on the size of the instance of the CO problem at hand (for


example, the number of independent sets of vertices in a graph G on n vertices depends on the structure of G), the domination ratio of an algorithm A is of interest: the domination ratio of A, domr(A, n), is the minimum of domn(A, I)/sol(I), where sol(I) is the number of solutions of I, taken over all instances I of size n. Clearly, the domination ratio belongs to the interval (0, 1], and exact algorithms are of domination ratio 1.

Methods

Currently, there are two powerful methods in DA. One is used to prove that the heuristic under consideration is of domination number 1. For this method to be useful, the heuristic has to be a greedy-type algorithm for a CO problem on independence systems. We describe the method and its applications in the subsection Greedy-Type Algorithms. The other method is used to prove that the heuristic under consideration is of very large domination number. For many problems this follows from the fact that the heuristic always finds a solution that is not worse than the average solution. This method is described in the subsection Better-Than-Average Heuristics.

Greedy-Type Algorithms

The main practical message of this subsection is that one should be careful while using the classical greedy algorithm and its variations in combinatorial optimization (CO): there are many instances of CO problems for which such algorithms will produce the unique worst possible solution. Moreover, this is true for several well-known optimization problems, and the corresponding instances are not exotic, in a sense. This means that the paradigm of greedy optimization does not always provide any meaningful optimization at all.

An independence system is a pair consisting of a finite set E and a family F of subsets (called independent sets) of E such that (I1) and (I2) are satisfied:
(I1) the empty set is in F;
(I2) if X ∈ F and Y is a subset of X, then Y ∈ F.
All maximal sets of F are called bases. An independence system is uniform if all its bases are of the same cardinality. Many combinatorial optimization problems can be formulated as follows. We are given an independence


We are given an independence system (E, F), a set W ⊆ Z₊ and a weight function w that assigns a weight w(e) ∈ W to every element of E (Z₊ is the set of non-negative integers). The weight w(S) of S ∈ F is defined as the sum of the weights of the elements of S. It is required to find a base B ∈ F of minimum weight. We consider only such problems and call them the (E, F, W)-optimization problems.

If S ∈ F, then let I(S) = {x : S ∪ {x} ∈ F} − S. This means that I(S) consists of those elements of E − S which can be added to S in order to obtain an independent set of size |S| + 1. Note that by (I2), I(S) ≠ ∅ for every independent set S which is not a base.

The greedy algorithm tries to construct a minimum weight base as follows: it starts from the empty set X, and at every step it takes the current set X and adds to it a minimum weight element e ∈ I(X); the algorithm stops when a base has been built. We assume that the greedy algorithm may choose any element among equally weighted elements of I(X). Thus, when we say that the greedy algorithm may construct a base B, we mean that B is built provided the appropriate choices between elements of the same weight are made.
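As an illustration of this generic scheme, the following sketch runs the greedy algorithm on an independence system given by an independence oracle; the graphic-matroid example and all names are illustrative assumptions, not part of the theory above:

# A sketch of the generic greedy algorithm on an independence system,
# given by a ground set and an independence oracle. Ties are broken
# arbitrarily, as in the text.
def greedy_min_base(E, w, is_independent):
    """Grow a base by repeatedly adding a cheapest admissible element."""
    X = set()
    while True:
        candidates = [e for e in E if e not in X and is_independent(X | {e})]
        if not candidates:             # X is a base: nothing can be added
            return X
        X.add(min(candidates, key=w))  # a minimum-weight element of I(X)

# Example: independent sets = acyclic edge sets of a triangle graph.
edges = [("a", (0, 1)), ("b", (1, 2)), ("c", (0, 2))]
weights = {"a": 1, "b": 2, "c": 2}

def acyclic(S):
    parent = {v: v for _, e in edges for v in e}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    for name, (u, v) in edges:
        if name in S:
            ru, rv = find(u), find(v)
            if ru == rv:
                return False
            parent[ru] = rv
    return True

print(greedy_min_base([n for n, _ in edges], lambda e: weights[e], acyclic))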

An ordered partitioning of an ordered set Z = {z₁, z₂, …, z_k} is a collection of subsets A₁, A₂, …, A_q of Z such that if z_r ∈ A_i and z_s ∈ A_j, where 1 ≤ i < j ≤ q, then r < s. Some of the sets A_i may be empty, and A₁ ∪ A₂ ∪ ⋯ ∪ A_q = Z.

The following theorem by Bang-Jensen, Gutin and Yeo [6] characterizes all uniform independence systems (E, F) for which there is an assignment of weights to the elements of E such that the greedy algorithm solving the (E, F, {1, 2, …, r})-optimization problem may construct the unique worst possible solution.

Theorem 1. Let (E, F) be a uniform independence system and let r ≥ 2 be a natural number. There exists a weight assignment w: E → {1, 2, …, r} such that the greedy algorithm may produce the unique worst possible base if and only if F contains some base B with the property that for some ordering x₁, …, x_k of the elements of B and some ordered partitioning A₁, A₂, …, A_r of x₁, …, x_k the following holds for every base B′ ≠ B of F:

  ∑_{j=0}^{r−1} |I(A_{0,j}) ∩ B′| …
… of blackball number approximately 1.839ⁿ). Clearly, the maximal matching heuristic is the best among the three heuristics from the DA point of view.

Quadratic Assignment Problem

The Quadratic Assignment Problem (QAP) can be formulated as follows. We are given two n × n matrices A = [a_ij] and B = [b_ij] of integers. Our aim is to find a permutation π of {1, 2, …, n} that minimizes the sum

  ∑_{i=1}^{n} ∑_{j=1}^{n} a_ij b_{π(i)π(j)}.

Gutin and Yeo [23] described a better-than-average heuristic for the QAP and proved that the heuristic is of domination number at least n!/βⁿ for each β > 1. Moreover, the domination number of the heuristic is at least (n − 2)! for every prime power n. These results were obtained using a group-theoretical approach.
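The better-than-average idea behind such results can be checked by brute force on tiny instances: any solution no worse than the mean over all permutations dominates many solutions. The sketch below uses arbitrary toy data and a trivial best-of-sample heuristic, neither of which is taken from [23]:

# Brute-force illustration of the better-than-average idea for the QAP.
from itertools import permutations
import random

def qap_value(p, A, B):
    n = len(p)
    return sum(A[i][j] * B[p[i]][p[j]] for i in range(n) for j in range(n))

n = 6
random.seed(1)
A = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
B = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]

values = [qap_value(p, A, B) for p in permutations(range(n))]
avg = sum(values) / len(values)

# A trivial "heuristic": the best of a few random permutations.
cand = min((random.sample(range(n), n) for _ in range(20)),
           key=lambda p: qap_value(p, A, B))
v = qap_value(cand, A, B)
dominated = sum(1 for x in values if x >= v)
print(f"value={v}, average={avg:.1f}, dominates {dominated}/{len(values)}")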

See also
Traveling Salesman Problem

References
1. Ahuja RK, Ergun Ö, Orlin JB, Punnen AP (2002) A survey of very large-scale neighborhood search techniques. Discret Appl Math 123:75–102
2. Alon N, Gutin G, Krivelevich M (2004) Algorithms with large domination ratio. J Algorithms 50:118–131
3. Ausiello G, Crescenzi P, Gambosi G, Kann V, Marchetti-Spaccamela A, Protasi M (1999) Complexity and Approximation. Springer, Berlin
4. Balas E, Saltzman MJ (1991) An algorithm for the three-index assignment problem. Oper Res 39:150–161
5. Bang-Jensen J, Gutin G (2000) Digraphs: Theory, Algorithms and Applications. Springer, London
6. Bang-Jensen J, Gutin G, Yeo A (2004) When the greedy algorithm fails. Discret Optim 1:121–127
7. Ben-Arieh D, Gutin G, Penn M, Yeo A, Zverovitch A (2003) Transformations of Generalized ATSP into ATSP: experimental and theoretical study. Oper Res Lett 31:357–365
8. Bendall G, Margot F (2006) Greedy type resistance of combinatorial problems. Discret Optim 3:288–298
9. Berge C (1958) The Theory of Graphs. Methuen, London
10. Berend D, Skiena SS, Twitto Y (to appear) Combinatorial dominance guarantees for heuristic algorithms. ACM Trans Algorithms
11. Cornuejols G, Fisher ML, Nemhauser GL (1977) Location of bank accounts to optimize float; an analytic study of exact and approximate algorithms. Manag Sci 23:789–810
12. Deineko VG, Woeginger GJ (2000) A study of exponential neighbourhoods for the traveling salesman problem and the quadratic assignment problem. Math Prog Ser A 87:519–542
13. Ghosh D, Goldengorin B, Gutin G, Jäger G (2007) Tolerance-based greedy algorithms for the traveling salesman problem. Commun DQM 41:521–538
14. Glover F, Gutin G, Yeo A, Zverovich A (2001) Construction heuristics for the asymmetric TSP. Eur J Oper Res 129:555–568
15. Gutin G, Goldengorin B, Huang J (2006) Worst case analysis of Max-Regret, Greedy and other heuristics for multidimensional assignment and traveling salesman problems. Lect Notes Comput Sci 4368:214–225
16. Gutin G, Jensen T, Yeo A (2006) Domination analysis for minimum multiprocessor scheduling. Discret Appl Math 154:2613–2619
17. Gutin G, Koller A, Yeo A (2006) Note on upper bounds for TSP domination number. Algorithm Oper Res 1:52–54
18. Gutin G, Vainshtein A, Yeo A (2002) When greedy-type algorithms fail. Unpublished manuscript
19. Gutin G, Vainshtein A, Yeo A (2003) Domination analysis of combinatorial optimization problems. Discret Appl Math 129:513–520
20. Gutin G, Yeo A (2002) Polynomial approximation algorithms for the TSP and the QAP with a factorial domination number. Discret Appl Math 119:107–116
21. Gutin G, Yeo A (2002) Anti-matroids. Oper Res Lett 30:97–99
22. Gutin G, Yeo A (2005) Domination analysis of combinatorial optimization algorithms and problems. In: Golumbic M, Hartman I (eds) Graph Theory, Combinatorics and Algorithms: Interdisciplinary Applications. Springer, New York, pp 152–176
23. Gutin G, Yeo A, Zverovitch A (2002) Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP. Discret Appl Math 117:81–86
24. Gutin G, Yeo A, Zverovitch A (2002) Exponential neighborhoods and domination analysis for the TSP. In: Gutin G, Punnen AP (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 223–256
25. Johnson DS, Gutin G, McGeoch LA, Yeo A, Zhang X, Zverovitch A (2002) Experimental analysis of heuristics for the ATSP. In: Gutin G, Punnen AP (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 445–487
26. Johnson DS, McGeoch LA (1997) The traveling salesman problem: a case study in local optimization. In: Aarts EHL, Lenstra JK (eds) Local Search in Combinatorial Optimization. Wiley, Chichester, pp 251–310
27. Johnson DS, McGeoch LA (2002) Experimental analysis of heuristics for the STSP. In: Gutin G, Punnen AP (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 369–443
28. Kise H, Ibaraki T, Mine H (1979) Performance analysis of six approximation algorithms for the one-machine maximum lateness scheduling problem with ready times. J Oper Res Soc Japan 22:205–223
29. Punnen AP, Margot F, Kabadi SN (2003) TSP heuristics: domination analysis and complexity. Algorithmica 35:111–127
30. Robertson AJ (2001) A set of greedy randomized adaptive local search procedure implementations for the multidimensional assignment problem. Comput Optim Appl 19:145–164
31. Rublineckii VI (1973) Estimates of the accuracy of procedures in the Traveling Salesman Problem. Numer Math Comput Tech 4:18–23 (in Russian)
32. Sarvanov VI (1976) The mean value of the functional of the assignment problem. Vestsi Akad Navuk BSSR, Ser Fiz-Mat Navuk 2:111–114 (in Russian)
33. Zemel E (1981) Measuring the quality of approximate solutions to zero-one programming problems. Math Oper Res 6:319–332

Duality Gaps in Nonconvex Optimization
PANOS PARPAS, BERÇ RUSTEM
Department of Computing, Imperial College, London, UK

MSC2000: 90B50, 78M50

Article Outline
Abstract
Background
Game Theory Interpretation
Methods
  Randomization
  Functional Lagrange Multipliers
Conclusions
References

Abstract

Duality gaps in optimization problems arise because of the nonconvexities involved in the objective function or constraints. The Lagrangian dual of a nonconvex optimization problem can also be viewed as a two-person zero-sum game. From this viewpoint, the occurrence of duality gaps originates in the order in which the two players select their strategies, so duality theory can be analyzed as a zero-sum game in which the order of play generates an asymmetry. One can conjecture that this asymmetry can be eliminated by allowing one of the players to select strategies from a larger space than the finite-dimensional Euclidean space. Once the asymmetry is removed, there is zero duality gap. The aim of this article is to review two methods by which this process can be carried out. The first is based on randomization of the primal problem; the second extends the space from which the dual variables can be selected. Duality gaps are important in mathematical programming, and some of the results reviewed here are more than 50 years old, but only recently have methods been discovered to take advantage of them. The theory is elegant and helps one appreciate the game-theoretic origins of the dual problem and the role of Lagrange multipliers.

Background

We discuss how duality gaps arise, and how they can be eliminated, in nonconvex optimization problems. A standard optimization problem is stated as follows:

  min f(x)
  s.t. g(x) ≤ 0, x ∈ X,   (1)

where f: Rⁿ → R and g: Rⁿ → Rᵐ are assumed to be smooth and nonconvex. The feasible region of (1) is denoted by F and is assumed to be nonempty and compact; X is some compact convex set.

In order to understand the origins of duality in mathematical programming, consider devising a strategy to determine whether a point, say y, is the globally optimal solution of (1). Such a strategy can be concocted as follows: if f(y) is the global solution of (1), then the following system of inequalities

  f(x) < f(y), g(x) ≤ 0, x ∈ X   (2)

will not have a solution. We can reformulate (2) in a slightly more convenient framework. Indeed, suppose that there exist m positive scalars λᵢ, i = 1, …, m, such that

  L(x, λ) = f(x) + ∑_{i=1}^m λᵢ gᵢ(x) < f(y)   (3)

has no solution. Then (2) does not have a solution either. The left-hand side of (3) is called the Lagrangian function associated with (1). It is clear from the discussion above that the Lagrangian can be used to answer questions about the optimal solutions of (1). The usefulness of the dual function emanates from the following duality observation: let f* be the optimal objective function value of (1), and let L: Rᵐ → R be defined by

  L(λ) = inf_{x∈X} L(x, λ).

Then it is easy to prove that

  sup_{λ≥0} L(λ) ≤ f*.   (4)

This result is known as the weak duality theorem, and it is valid under quite general assumptions. The strong duality theorem asserts that if f and g are convex, f* > −∞, and the interior of F is not empty, then

  sup_{λ≥0} L(λ) = f*.

Proofs of the weak and strong duality theorems can be found in [1,10].

Game Theory Interpretation

There is an interesting relationship between (1) and the following optimization problem:

  sup_{λ≥0} inf_{x∈X} f(x) + ∑_{i=1}^m λᵢ gᵢ(x).   (5)

We refer to (1) as the primal problem, while (5) is referred to as the Lagrangian dual. The λ's that appear in (5) are called the Lagrange multipliers (or dual variables). It is interesting to note that (1) can equivalently be restated as follows:

  inf_{x∈X} sup_{λ≥0} f(x) + ∑_{i=1}^m λᵢ gᵢ(x).   (6)

The relationship between (6) (or (1)) and (5) can be analyzed as a two-person zero-sum game. In this game player A chooses the x variables and player B chooses the λ variables. If player A chooses x₀ and player B chooses λ₀, then player A pays L(x₀, λ₀) to player B. Naturally, player A wishes to minimize this quantity, while player B attempts to maximize it.

In game theory, equilibria play an important role. An equilibrium, in the present context, means a point from which no player will gain by a unilateral change of strategy. For the game outlined above, an equilibrium point (x*, λ*) must satisfy

  L(x*, λ) ≤ L(x*, λ*) ≤ L(x, λ*)  ∀x ∈ X, ∀λ ∈ Rᵐ₊.   (7)

A point satisfying the preceding inequalities is also known as a saddle point of L. To see that (7) defines an equilibrium point we argue as follows: given that player A wishes to minimize the amount paid to player B, it is obvious that if player B chooses λ* and player A selects anything other than x*, then player A will be worse off. Similarly, if player A chooses x* and player B chooses anything other than λ*, then player B will be worse off. By the strong duality theorem, we know that the game has an equilibrium point under convexity assumptions. For the general case, insight can be obtained by interpreting (5) and (6) as two different games; a saddle point will exist if the optimal values of the two games are equal.

Our next task is to interpret (5) and (6) as games. Indeed, consider the following situation: player A chooses a strategy first, and then player B chooses a strategy. Thus player B already knows the strategy that player A has chosen, and as a result player B has an advantage. Player A will argue as follows: "If I choose x, then player B will choose a λ achieving sup_{λ≥0} L(x, λ); therefore I had better choose the strategy that minimizes my losses." In other words, player A will choose the optimal strategy given by solving (6). Now consider the same game, but with the order of play reversed, i.e., player B chooses first and player A second. Then, applying the same rules of rational behavior, we see that player B will select the λ that solves (5). Consequently, duality gaps originate in the order in which the two players select their strategies. In the next section we see how this asymmetry can be eliminated by allowing one of the players to select strategies from a larger space than the finite-dimensional Euclidean space. Once the asymmetry is eliminated, there is zero duality gap.
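A small numerical illustration of such a gap, on a hypothetical instance of (1) with X = [0, 1], f(x) = −x² and g(x) = x − 1/2 (the grids are a crude approximation used only for this sketch):

# A toy numerical illustration of a duality gap for problem (1).
# The primal optimum is f* = -1/4 (at x = 1/2), while the dual value
# sup_l inf_x L(x, l) works out to -1/2, so the gap is 1/4.
import numpy as np

X = np.linspace(0.0, 1.0, 2001)
f = -X**2
g = X - 0.5

primal = f[g <= 0].min()                         # min f(x) s.t. g(x) <= 0
lams = np.linspace(0.0, 5.0, 2001)
dual = max((f + lam * g).min() for lam in lams)  # sup_l inf_x L(x, l)

print(f"primal f* = {primal:.4f}, dual = {dual:.4f}, gap = {primal - dual:.4f}")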

Methods

As argued above, the player that chooses first is disadvantaged, since the other player can adjust. In this section we discuss two methods by which this asymmetry in the order of play can be eliminated. Both methods were proposed early in the history of mathematical programming. The first method proceeds by randomization (increasing the powers of player A). It is difficult to say who suggested this strategy first; since the idea originates in mixed strategies in game theory, one could argue that it was first suggested by Borel in the 1920s [14]. A modern proof can be found in [2]. The second method allows player B to select the dual variables from a larger space. This idea seems to have been suggested by Everett [3], and then by Gould [6]; a review can be found in [11]. Algorithms that attempt to reduce the duality gap appeared in [4,5,7,8,9,12].

Randomization

Assume that player A chooses first; then the game can be described by

  P* = inf_{x∈X} sup_{λ≥0} L(x, λ),

and in general

  P* ≥ D* = sup_{λ≥0} inf_{x∈X} L(x, λ).

Player A has a handicap, since player B will choose a strategy knowing what player A will do. In order to avoid a duality gap, we consider giving more flexibility to player A. We thus allow player A to choose strategies from M(X), the space of probability measures on the σ-field generated by X. Player A will therefore choose a strategy by solving

  P_μ = inf_{μ∈M(X)} ∫_X f(x) dμ(x)
  s.t. ∫_X g(x) dμ(x) ≤ 0,
       ∫_X dμ(x) = 1.   (8)

Equivalently,

  P_μ = inf_{μ∈M(X)} sup_{λ≥0} ∫_X f(x) dμ(x) + ∑_{i=1}^m λᵢ ∫_X gᵢ(x) dμ(x) + λ₀ (∫_X dμ(x) − 1).

The dual of (8) is given by

  D_μ = sup_{λ≥0} inf_{μ∈M(X)} ∫_X f(x) dμ(x) + ∑_{i=1}^m λᵢ ∫_X gᵢ(x) dμ(x) + λ₀ (∫_X dμ(x) − 1).

Then it can be shown that P_μ = D_μ. The proof is beyond the scope of this article; it can be found in [2]. A discretized illustration of (8) follows.
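For the same toy instance as above, problem (8) can be approximated by an LP that places probability atoms on grid points of X; its value matches the dual value −1/2 rather than f* = −1/4, illustrating how randomization removes the gap. The discretization below is an illustrative device, not the method of [2]:

# Discretizing the randomized primal (8): put an atom mu_i on each grid
# point x_i and solve min sum mu_i f(x_i) s.t. sum mu_i g(x_i) <= 0,
# sum mu_i = 1, mu >= 0.
import numpy as np
from scipy.optimize import linprog

X = np.linspace(0.0, 1.0, 2001)
f = -X**2
g = X - 0.5

res = linprog(c=f,
              A_ub=g[None, :], b_ub=[0.0],            # E_mu[g] <= 0
              A_eq=np.ones((1, X.size)), b_eq=[1.0],  # probability measure
              bounds=(0, None), method="highs")
print(f"randomized primal value P_mu = {res.fun:.4f}")
print("support of optimal measure:", X[res.x > 1e-9])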

Functional Lagrange Multipliers

We now consider the case where player B chooses first. From the previous section, it follows that player B will choose a strategy according to

  D* = sup_{λ≥0} inf_{x∈X} L(x, λ).   (9)

We have already pointed out that

  D* ≤ inf_{x∈X} sup_{λ≥0} L(x, λ).

In order for the preceding relation to hold as an equality without any convexity assumptions, we consider enlarging the space of available strategies of player B. This was suggested in [3,6]; the exposition here is based on [11]. Let H denote all the feasible right-hand sides for (1):

  H = {b ∈ Rᵐ | ∃x ∈ X : g(x) ≤ b}.

Let D denote the following set of functions:

  D = {z : Rᵐ → R | z(d₁) ≥ z(d₂) if d₁ ≤ d₂, ∀d₁, d₂ ∈ H}.

The following dual can be defined using the concepts above:

  D_F = sup_z z(0)
  s.t. z(g(x)) ≤ f(x) ∀x ∈ X, z ∈ D.   (10)

The dual in (10) is different from the type of duals we have been discussing in this article. If, however, we assume that c + D ⊆ D (i.e., D is closed under the addition of constants), then it was shown in [11] that (10) is equivalent to the following dual problem:

  D_F = sup_{z∈D} inf_{x∈X} f(x) + z(g(x)).

A proof that the duality gap between (10) and (1) is zero can be found in [11].

Conclusions

We have discussed Lagrangian duality, and the existence of duality gaps, from a game-theoretic viewpoint. We have described two ways in which duality gaps can be eliminated: the first is randomization, and the second is the use of functional Lagrange multipliers. Unfortunately, neither method is immediately applicable to real-world problems. However, for certain classes of problems the functional Lagrange multiplier approach can be useful. It was shown in [13] that if the original problem involves the optimization of polynomial functions, and if the Lagrange multipliers are allowed to be polynomials themselves, then there is no duality gap. Unlike the general case discussed in this article, polynomial Lagrange multipliers can be manipulated numerically. This approach can potentially help develop efficient algorithms for a large class of problems.

References
1. Bertsekas DP (1999) Nonlinear Programming, 2nd edn. Athena Scientific, Belmont
2. Ermoliev Y, Gaivoronski A, Nedeva C (1985) Stochastic optimization problems with incomplete information on distribution functions. SIAM J Control Optim 23(5):697–716
3. Everett H (1963) Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Oper Res 11:399–417
4. Floudas CA, Visweswaran V (1990) A global optimization algorithm (GOP) for certain classes of nonconvex NLPs: I. Theory. Comput Chem Eng 14(12):697–716
5. Floudas CA, Visweswaran V (1993) A primal-relaxed dual global optimization approach. J Optim Theory Appl 78(2):187–225
6. Gould FJ (1972) Nonlinear duality theorems. Cahiers Centre Études Recherche Opér 14:196–212
7. Liu WB, Floudas CA (1993) A remark on the GOP algorithm for global optimization. J Global Optim 3(4):519–521
8. Liu WB, Floudas CA (1996) A generalized primal-relaxed dual approach for global optimization. J Optim Theory Appl 90(2):417–434
9. Rajasekaran S, Pardalos PM, Reif JH, Rolim J (eds) (2001) Handbook of Randomized Computing, vols I–II. Kluwer, Dordrecht
10. Rockafellar RT (1997) Convex Analysis. Princeton Landmarks in Mathematics. Princeton Univ Press, Princeton
11. Tind J, Wolsey LA (1981) An elementary survey of general duality theory in mathematical programming. Math Program 21(3):241–261
12. Visweswaran V, Floudas CA (1990) A global optimization algorithm (GOP) for certain classes of nonconvex NLPs: II. Application of theory and test problems. Comput Chem Eng 14(12):1417–1434
13. Waki H, Kim S, Kojima M, Muramatsu M (2006) Sums of squares and semidefinite program relaxations for polynomial optimization problems with structured sparsity. SIAM J Optim 17(1):218–242
14. Weintraub ER (ed) (1992) Toward a History of Game Theory. Duke Univ Press, Durham (annual supplement to History of Political Economy, vol 24)

Duality in Optimal Control with First Order Differential Equations
SABINE PICKENHAIN
Brandenburg Technical University Cottbus, Cottbus, Germany

MSC2000: 49K05, 49K10, 49K15, 49K20

Article Outline

Keywords
Construction of a Dual Problem
  Fenchel–Rockafellar Duality
  Duality in the Sense of Klötzler
  Bidual Problems, Generalized Flows, Relaxed Controls
Strong Duality Results
  Case A
  Case B
Sufficient Optimality Conditions
Duality and Maximum Principle
  Case A
  Case B
See also
References

Keywords
Control problems; First order partial differential equations; Duality theory; Necessary and sufficient optimality conditions


Consider the following optimal control problem with first order ordinary or partial differential equations:

  (P)  min J(x, u) = ∫_Ω r(t, x(t), u(t)) dt

subject to functions (x, u) ∈ W^{1,n}_p(Ω) × L_p(Ω) fulfilling
• the state equations x^i_{t_α}(t) = g^i_α(t, x(t), u(t)) a.e. on Ω (α = 1, …, m; i = 1, …, n);
• the control restrictions u(t) ∈ U a.e. on Ω;
• the state constraints x(t) ∈ G(t) ⊆ Rⁿ on Ω̄;
• the boundary conditions x(s) = φ(s) on ∂Ω.

The data of problem (P) satisfy the following hypotheses:
H1) For m = 1 we have 1 ≤ p ≤ ∞; for m ≥ 2 we have m < p < ∞.
H2) The sets Ω and X := {(t, ξ) ∈ Rᵐ × Rⁿ : t ∈ Ω, ξ ∈ G(t)} are strongly Lipschitz domains in the sense of C.B. Morrey and S. Hildebrandt [6]; the set U is closed.
H3) The functions r, r_ξ, g^i_α, (g^i_α)_ξ, φ are continuous with respect to all arguments.
H4) The set of all admissible pairs (x, u), denoted by Z, is nonempty.

The characterization of optimal solutions of special variational problems by dual or complementary problems has been well known in physics for a long time, e.g.:
• in elasticity theory, the principle of the minimum of potential energy (Dirichlet's principle) and the principle of minimum tension (Castigliano's principle) are dual or complementary to each other;
• in the theory of electrostatic fields, the principle of the minimum of potential energy and the Thomson (Lord Kelvin) principle are dual problems.

A first systematic approach to duality for special problems in the calculus of variations was given by K. Friedrichs ([4], 1928). In the 1950s and 1960s, this concept was extended by W. Fenchel [3], J.-J. Moreau, R.T. Rockafellar [19,20], and I. Ekeland and R. Temam [2] to larger classes of variational and control problems. Based on the Legendre transformation (or Fenchel conjugation), it proved to be a suitable tool for handling convex problems. Nonconvex problems (P) require an extended concept of duality. The construction of R. Klötzler, given in 1977 [7], can be regarded as a further development of Hamilton–Jacobi field theory.

Construction of a Dual Problem

In a very general setting, a problem (D) of maximization of an (extended real-valued) functional L over an arbitrary set S ≠ ∅ is said to be a dual problem to (P) if the weak duality relation sup(D) ≤ inf(P) is satisfied. The different notions of duality given in the introduction can be embedded into the following construction scheme:
1) The set of admissible pairs (x, u) = z ∈ Z is represented by the intersection of two suitable nonempty sets Z₀ and Z₁.
2) For an (extended real-valued) functional Φ : Z₀ × S₀ → R̄ the equivalence relation

     inf_{z∈Z} J(z) = inf_{z∈Z₀} sup_{S∈S₀} Φ(z, S)

   holds.
3) Assuming L₀(S) := inf_{z∈Z₀} Φ(z, S), each problem

     (D)  max L(S) s.t. S ∈ S₁ ⊆ S₀

   is a (weak) dual problem to (P) if L(S) ≤ L₀(S) for all S ∈ S₁.

The proof of the weak duality relation results from the well-known inequality

  inf_{z∈Z₀} sup_{S∈S₀} Φ(z, S) ≥ sup_{S∈S₀} inf_{z∈Z₀} Φ(z, S).

Fenchel–Rockafellar Duality

In accordance with [2], we transform (P) into a general variational problem: …

  …   (1)–(5)
  y_ik = 0 or 1, ∀i, k.   (6)

It is natural to assume that p_ik ≥ 0, f_ik ≥ 0 and S_ik > 0 for all i, k, and D_j > 0 for all j. The constraints (1) ensure that all the demand is met for each customer, while (2) ensure that, for each location, the amount shipped is also produced. Constraint sets (3) and (4) ensure that the level of production corresponds to the correct step of the staircase cost function at each facility. One might note that from constraints (3) and (4) it follows that y_{i,k+1} ≤ y_ik.

This is a linear mixed integer programming problem with mn + ∑_{i=1}^m q_i continuous variables and ∑_{i=1}^m q_i integer variables. The proportion of integer variables is higher than in the ordinary facility location problem. Because of this, and because of the structure of the problem, solving the problem with a general code for mixed integer programming problems is probably not very efficient for large (real life) instances.

One aspect of the structure of the problem is that if y is kept fixed (i.e. the sizes of the facilities are given), the remaining problem is simply a standard network flow problem, and hence x and t will attain integer values. Another important aspect of the structure of the problem is the potential separability.

Facility Location with Staircase Costs, Figure 1


There are several possibilities of making the model separable by relaxing different sets of constraints. It is also possible to use a problem formulation with accumulated coefficients f̄ and S̄ instead of f and S. This yields constraints of SOS1 type (one must ensure that only one of the possible sizes is used at a facility) and a somewhat smaller problem (fewer constraints). The LP relaxation is quicker to solve, and the optimal objective function value is the same as that of the model above (i.e. the duality gaps of the two formulations are the same). However, when the model is solved with general mixed integer codes, the alternative model seems to produce larger branch and bound trees. With respect to the methods discussed below, the two models in most cases behave identically.

Solution Methods

Methods for models with staircase cost functions, or for models capable of modeling such functions, can be found in, for example, [2,11,14,15] and [12]. We describe some possibilities below.

If the exact solution is to be found (and verified), the only reasonable way seems to be to resort to branch and bound in some sense. This matter is extensively discussed in the literature, and although there might be some considerations for the staircase cost case that differ from the single fixed cost case when it comes to branching and search strategies, we will not dwell on it here. Assuming a standard branch and bound framework, the main question is how to solve the subproblems, i.e. how to obtain the bounds, especially the lower bounds. This is discussed further below. An alternative is to move the branch and bound procedure into a Benders master problem, i.e. to use a Benders decomposition framework in order to obtain the exact solution. This is also briefly described below.

We start with procedures for obtaining upper and lower bounds on the optimal objective function value. The upper bounds correspond to feasible solutions obtained, while the lower bounds are used to estimate the quality of the feasible solutions. If all cost coefficients are integral, we note that a lower bound that is within one unit of the upper bound proves that the upper bound is optimal.


Primal Heuristics

There is a well-known ADD heuristic for the capacitated plant location problem [13], which can be modified to suit the staircase cost facility location problem; see [12]. This heuristic can be improved by combining it with certain priority rules [6]. If it is decided for each plant up to which level of production it may be used (i.e. the y-variables are fixed), the resulting problem is an unbalanced transportation problem. Let L_i denote the level (size) of plant i, and initiate the heuristic by setting L_i = 0 for all i. Let I = {i : L_i < q_i}. The ADD heuristic consists of the repeated use of the following step: increase the size (set L_i = L_i + 1) of the location site i ∈ I that provides the largest reduction of the total cost. Terminate the procedure when no further reduction is possible.

In order to prevent ADD from increasing the level of production in the order of 'decreasing' capacity until a feasible solution is found, we apply a generalization of one of the priority rules discussed in [6]. These priority rules provide a better phase-1 solution than the ADD heuristic itself. Two examples of priority rules, PR1 and PR2, for choosing the location site i ∈ I where the size is to be increased (L_i = L_i + 1), are given below. (They correspond to P1 and P3 in the notation of [6].)

PR1) Choose site i ∈ I in the order of decreasing quotients S_{i,L_i+1}/f_{i,L_i+1}, until the location sites are able to serve the entire demand.

PR2) Choose site i ∈ I in the order of increasing values of

  (1/⌊n/3⌋) ∑_{j=1}^{⌊n/3⌋} c̄_ij + f_{i,L_i+1}/S_{i,L_i+1},

until the location sites are able to serve the entire demand. (c̄ is c sorted according to increasing values.)

In [13] the ADD heuristic is outperformed by the heuristic DROP, but [6] show that ADD with priority rules produces solutions with objective function values as good as those of DROP, in less computational time. A sketch of the ADD scheme with a PR1-style phase 1 is given below.
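In the following skeletal sketch, the data layout, the LP-based transportation solver and all helper names are illustrative assumptions, not the implementation of [12]:

# A skeletal sketch of the ADD heuristic with a PR1-style phase 1.
import numpy as np
from scipy.optimize import linprog

def transport(c, cap, D):
    """min sum c_ij x_ij s.t. column sums = D_j, row sums <= cap_i."""
    m, n = c.shape
    A_eq = np.zeros((n, m * n)); A_ub = np.zeros((m, m * n))
    for i in range(m):
        for j in range(n):
            A_eq[j, i * n + j] = 1.0
            A_ub[i, i * n + j] = 1.0
    return linprog(c.ravel(), A_ub=A_ub, b_ub=cap, A_eq=A_eq, b_eq=D,
                   bounds=(0, None), method="highs")

def staircase_cost(t, caps, fix, unit):
    """Cost of producing t when levels of size caps[k] open in order."""
    cost, k = 0.0, 0
    while t > 1e-9:
        amount = min(t, caps[k])
        cost += fix[k] + unit[k] * amount
        t -= amount; k += 1
    return cost

def total_cost(L, c, caps, fix, unit, D):
    cap = np.array([caps[i][:L[i]].sum() for i in range(len(L))])
    res = transport(c, cap, D)
    if res.status != 0:
        return np.inf                     # not enough capacity opened
    t = res.x.reshape(c.shape).sum(axis=1)
    return res.fun + sum(staircase_cost(t[i], caps[i], fix[i], unit[i])
                         for i in range(len(L)))

def add_heuristic(c, caps, fix, unit, D):
    m, q = caps.shape
    L = [0] * m
    # Phase 1 (PR1-style): open levels in order of decreasing
    # capacity/fixed-cost ratio until total demand can be served.
    while sum(caps[i][:L[i]].sum() for i in range(m)) < D.sum():
        j = max((i for i in range(m) if L[i] < q),
                key=lambda i: caps[i][L[i]] / fix[i][L[i]])
        L[j] += 1
    best = total_cost(L, c, caps, fix, unit, D)
    # Phase 2 (ADD step): enlarge the site giving the largest reduction.
    while True:
        trials = [(total_cost([L[i] + (i == j) for i in range(m)],
                              c, caps, fix, unit, D), j)
                  for j in range(m) if L[j] < q]
        if not trials or min(trials)[0] >= best:
            return L, best
        best, j = min(trials)
        L[j] += 1

rng = np.random.default_rng(0)
m, n, q = 3, 4, 2
c = rng.integers(1, 9, (m, n)).astype(float)
caps = np.full((m, q), 10.0)
fix = rng.integers(5, 15, (m, q)).astype(float)
unit = rng.integers(1, 4, (m, q)).astype(float)
D = rng.integers(3, 8, n).astype(float)
print(add_heuristic(c, caps, fix, unit, D))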

Linearization

A widely used way of obtaining a lower bound is direct LP relaxation. The integer requirements (6) are replaced by the constraints 0 ≤ y_ik ≤ 1 for all i, k. We also include the redundant constraints t_ik ≤ S_ik for all i, k, and possibly y_ik ≤ y_{i,k−1} for all i, k > 1. The optimal objective function value of the LP relaxation is denoted by v_LP, and v_LP ≤ v*. The duality gap, the difference between v* and v_LP, is in most cases larger than zero. The LP problem is large but sparse, and can be solved with a standard LP code.

Convex Piecewise Linearization

Since the binary variables y_ik are only included to give the correct cost of production, they can be eliminated if we use an approximation of the costs. If the staircase cost function is underestimated by a piecewise linear and convex function, we get a problem that is much easier to solve and that gives a lower bound on v*, denoted by v_CPL; see [14] and [11]. For explicit expressions of how to construct the convex piecewise linearization, see [11]; a generic construction is sketched after this paragraph. The resulting problem is a linear minimal cost network flow problem with parallel arcs, which is quite easily solvable by a standard network code. The x- and t-parts of the solution are feasible in the original problem, so we can generate an upper bound by evaluating this solution in the correct cost function, which is done by finding the correct values of y. In [10] it is proved that the convex piecewise linearization and the LP relaxation are equivalent, in the sense that v_CPL = v_LP and an x-solution that is optimal in one of the problems is also optimal in the other. Utilizing the network structure, we thus get a quicker way of solving the LP relaxation.
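One generic way to obtain such an underestimate is to take the lower convex hull of the cumulative (capacity, cost) breakpoints of the staircase. The sketch below does this; it is not claimed to be the exact construction of [11]:

# Piecewise linear convex underestimate of a staircase cost via the
# lower convex hull of its cumulative (capacity, cost) breakpoints.
def lower_convex_envelope(points):
    """Lower hull (Andrew's monotone chain) of points sorted by x."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop while the middle point lies on or above the new segment
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Staircase: level k adds capacity cap[k] at fixed cost fix[k] plus
# unit cost unit[k] per produced unit (arbitrary toy numbers).
cap, fix, unit = [10, 10, 10], [5, 20, 40], [1.0, 2.0, 3.0]
pts, x, y = [(0.0, 0.0)], 0.0, 0.0
for k in range(3):
    y += fix[k]; pts.append((x, y))              # jump when level k opens
    x += cap[k]; y += unit[k] * cap[k]; pts.append((x, y))
print(lower_convex_envelope(pts))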

Benders Decomposition

In [11] a Benders decomposition approach is used and combined with the convex piecewise linearization described above. The Benders subproblem is obtained simply by fixing the integer variables, i.e. fixing the sizes of the facilities. The resulting problem is a minimal cost network flow problem, similar to a transportation problem, but with certain intervals (given by the facility sizes) for the supplies. However, the Benders master problem obtained by a standard application of the Benders approach is much too hard to solve. The number of integer variables


is much larger than in an ordinary location problem with the same numbers of facilities and customers. One way around this is to combine the Benders approach with the convex piecewise linearization. An improved piecewise linearization is obtained by branching at certain production levels. A staircase cost function is divided into two parts by the branching, and a binary variable is introduced, indicating which of the parts is to be used. In each of the two parts, convex piecewise linearization is used. In this manner, one could design a branch and bound method for solving the problem, similar to what is described in [14]. Considering the model after a number of branchings, we have an approximation (a relaxation) of the original problem with a much smaller number of integer variables. To this problem we then apply Benders decomposition.

In principle one could let each subproblem in the branch and bound method be solved exactly with Benders decomposition, thereby obtaining basically a branch and bound method that employs Benders decomposition to solve the branch and bound subproblems. This is, however, very inefficient. The other extreme is the standard application of Benders decomposition to the original problem, in which case the Benders approach employs branch and bound to solve the Benders subproblems. This is also quite inefficient in practice. A more efficient method combines the two approaches, Benders decomposition and branch and bound, on a more equal level. This can be done in the following way:
1) Solve the initial convex piecewise linearization (with a network code).
2) Do one or more branchings where the error of the approximation at the obtained solution is largest.
3) Solve the obtained problem with Benders decomposition (to a certain accuracy).
4) Repeat 2) and 3) until optimality.

There are two very important comments on the above algorithm.
A) When one returns to step 3) after having done branchings, one can recalculate and reuse all the Benders cuts obtained before the branchings. (This is described in detail in [11].)
B) The stopping criterion for the Benders method, i.e. the required accuracy in step 3), is a very important


control parameter. One should require a low accuracy in the initial iterations and gradually, as the method approaches the optimal solution, require higher and higher accuracy. The effect of combining comments A) and B) is that only a few Benders iterations need to be done in each main iteration, since the number of Benders cuts increases automatically as the old cuts are recalculated and kept. The main conclusion of the computational tests done in [11] is that only a small part of all the integer variables (on average 4%) needs to be included by the improving piecewise linearization technique when solving a problem to reasonable accuracy. In other words, only a small subset of the possible sizes needs to be investigated.

Lagrangian Relaxation and Subgradient Optimization

Now we will describe a Lagrangian heuristic, found in [12], in more detail. Lagrangian relaxation and subgradient optimization are used to obtain a near-optimal dual solution and act as a base for an efficient primal heuristic. Based on the solution of the Lagrangian relaxation one can construct a transportation problem which yields primal feasible solutions and can be used during the subgradient process. An important aspect of the Lagrangian approach is that a method yielding good feasible primal solutions can be based on dual techniques.

Lagrangian relaxation [7], in combination with subgradient optimization [9], is a commonly used technique for generating lower bounds on the optimal objective function value of mixed integer programming problems. Here we apply Lagrangian relaxation to constraint set (2) and denote the Lagrangian multipliers by u_i. For fixed values of u, the subproblem separates into several smaller problems. For each customer j,

  φ_{1j}(u) = min ∑_{i=1}^m (c_ij + u_i) x_ij
  s.t. ∑_{i=1}^m x_ij = D_j, x_ij ≥ 0,

and for each facility i,

  φ_{2i}(u) = min ∑_{k=1}^{q_i} ((p_ik − u_i) t_ik + f_ik y_ik)
  s.t. constraints (3)–(5), t_ik ≥ 0, y_ik = 0 or 1.

The first set of subproblems consists of n continuous knapsack problems, which are trivially solvable. The second set of subproblems consists of m one-dimensional staircase cost problems. The solution can be found by calculating the best level k̄_i for each i, as

  φ_{2i}(u) = min_{k̄_i = 0, …, q_i} ∑_{k=1}^{k̄_i} ((p_ik − u_i) S_ik + f_ik).

The resulting solution is y′_ik = 1 for all k ≤ k̄_i and y′_ik = 0 for all k > k̄_i, and t′_ik = S_ik for all k ≤ k̄_i and t′_ik = 0 for all k > k̄_i. Note that the subproblem has the integrality property [7], so max_u φ(u) = v_LP. The Lagrangian dual,

  max_u φ(u) = ∑_{j=1}^n φ_{1j}(u) + ∑_{i=1}^m φ_{2i}(u),

can be solved by standard subgradient optimization [9] in order to get the best lower bound. One can use enhancements such as modified directions [5], d_r = ξ_r + α d_{r−1}, where ξ_r is the subgradient generated in iteration r and d_r is the direction used in iteration r. A steplength shortening is obtained by halving the steplength parameter when there has not been any improvement of the bound for N₁ consecutive iterations. When there has not been any improvement for N₂ iterations, the subgradient optimization procedure is terminated. The subgradient is given by ξ^r_i = ∑_{j=1}^n x′_ij − ∑_{k=1}^{q_i} t′_ik for all i, where x′_ij and t′_ik are the optimal solutions of the subproblems. Reasonable choices for the parameters are N₁ = 6, N₂ = 25 and α = 0.7. One can use a heuristic based on the solution of the Lagrangian relaxation to try to obtain a feasible solution; a compact sketch of the subgradient scheme follows.
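In the following minimal sketch, both subproblems are solved in closed form; the steplength rule is simplified and the toy data and names are illustrative assumptions:

# A minimal sketch of the subgradient scheme for the Lagrangian dual,
# with modified directions d_r = xi_r + alpha * d_{r-1}.
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 3, 5, 4
c = rng.uniform(1, 9, (m, n)); D = rng.uniform(2, 6, n)
p = rng.uniform(1, 3, (m, q)); f = rng.uniform(4, 10, (m, q))
S = rng.uniform(2, 5, (m, q))          # incremental level capacities

def phi1(u):
    """n continuous knapsacks: ship each D_j from the cheapest source."""
    i_star = np.argmin(c + u[:, None], axis=0)
    x = np.zeros((m, n)); x[i_star, np.arange(n)] = D
    return ((c + u[:, None]) * x).sum(), x

def phi2(u):
    """m one-dimensional staircase problems, solved by the best level."""
    t = np.zeros((m, q)); val = 0.0
    for i in range(m):
        inc = (p[i] - u[i]) * S[i] + f[i]        # cost of opening level k
        run = np.concatenate(([0.0], np.cumsum(inc)))
        k = int(np.argmin(run))                  # best k_bar in {0,...,q}
        t[i, :k] = S[i, :k]; val += run[k]
    return val, t

u, d = np.zeros(m), np.zeros(m)
step, alpha = 1.0, 0.7
for r in range(100):
    v1, x = phi1(u); v2, t = phi2(u)
    xi = x.sum(axis=1) - t.sum(axis=1)           # subgradient of phi at u
    d = xi + alpha * d
    u = u + step * d / (np.linalg.norm(d) + 1e-12)
    step *= 0.95                                 # simplified steplength rule
print("lower bound phi(u) =", v1 + v2)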

The obtained values y′_ik are used to calculate the supply at each location, and a transportation problem is solved. The solution of the transportation problem is feasible in the original problem if constraint sets (3) and (4) are satisfied, which can easily be achieved. The values of the flow variables x_ij are taken directly from the solution of the transportation problem. The total production t_i is then calculated as t_i = ∑_{j=1}^n x_ij. One can then easily find t_ik as the part of t_i that lies within level k, and the y_ik-solution is simply y_ik = 1 if t_ik > 0 and y_ik = 0 otherwise. Finally, all unnecessary production capacity at each location i is removed.

The complete Lagrangian heuristic, LH, also includes the following. The convex piecewise linearization, CPL, is solved with an efficient network code. The Lagrangian multipliers are initiated with a convex combination of the appropriate node prices obtained by solving CPL and min_j c_ij, with the largest weight on the former. The primal procedure to generate feasible solutions is used every third iteration of the subgradient procedure. Note that CPL yields v_CPL = v_LP, so the subgradient procedure cannot improve the lower bound, which is quite unusual in methods of this kind. The motivation behind using the subgradient procedure is not to get lower bounds, but to get primal solutions (upper bounds).

Computational Results

In [12] the heuristic procedures are tested by solving a number of randomly generated test problems with up to 50 locations, 100 destinations and 20 sizes of each location (yielding 6000 continuous variables and 1000 integer variables). The conclusions of the tests are the following. A standard mixed integer programming code (in this case LAMPS) needs extremely long solution times for finding the exact optimum. The ADD heuristics produce solutions with relative errors in the range 1%–20% (on average 11%), but also require quite long solution times (although not as long as the MIP code). The convex piecewise linearization, CPL, combined with exact integer evaluation of the solutions obtained, yields solutions that are all better than those obtained by the ADD heuristics, with relative errors between 0.8% and 10% (on average 4%), in a much shorter time


(on average 1000 times quicker than the ADD heuristics). CPL thus dominates the ADD heuristics completely, both with respect to solution time and solution quality. The Lagrangian heuristic, LH, produces solutions with relative errors between 0.4% and 3.2% (on average 1.5%), with solution times on average 20 times shorter than those of the ADD heuristics, but of course significantly longer than those of CPL. Comparison to other tests is difficult, since other computers and codes are used. The Benders approach in [11] seems to be slower than the Lagrangian approach. However, on modern computers and with modern MIP codes, its performance may well improve.

Conclusion

The capacitated facility location problem with staircase costs has many important applications. Computational results indicate that it is possible to find near-optimal solutions to such problems of reasonable size in reasonable time, i.e. that this richer model can be used instead of, for example, the ordinary capacitated facility location problem in appropriate situations.

See also
Combinatorial Optimization Algorithms in Resource Allocation Problems
Competitive Facility Location
Facility Location with Externalities
Facility Location Problems with Spatial Interaction
Global Optimization in Weber's Problem with Attraction and Repulsion
MINLP: Application in Facility Location-allocation
Multifacility and Restricted Location Problems
Network Location: Covering Problems
Optimizing Facility Location with Rectilinear Distances
Production-distribution System Design Problem
Resource Allocation for Epidemic Control
Single Facility Location: Circle Covering Problem
Single Facility Location: Multi-objective Euclidean Distance Location
Single Facility Location: Multi-objective Rectilinear Distance Location
Stochastic Transportation and Location Problems
Voronoi Diagrams in Facility Location
Warehouse Location Problem


References
1. Beasley JE (1993) Lagrangean heuristics for location problems. Europ J Oper Res 65:383–399. Test problems available at http://mscmga.ms.ic.ac.uk
2. Bornstein CT, Rust R (1988) Minimizing a sum of staircase functions under linear constraints. Optim 19:181–190
3. Christofides N, Beasley JE (1983) Extensions to a Lagrangean relaxation approach for the capacitated warehouse location problem. Europ J Oper Res 12:19–28
4. Cornuejols G, Sridharan R, Thizy JM (1991) A comparison of heuristics and relaxations for the capacitated plant location problem. Europ J Oper Res 50:280–297
5. Crowder H (1976) Computational improvements for subgradient optimization. Symp Math, vol XIX. Acad Press, pp 357–372
6. Domschke W, Drexl A (1985) ADD-heuristics' starting procedures for capacitated plant location models. Europ J Oper Res 21:47–53
7. Geoffrion AM (1974) Lagrangean relaxation for integer programming. Math Program Stud 2:82–114
8. Geoffrion A, McBride R (1978) Lagrangean relaxation applied to capacitated facility location problems. AIIE Trans 10:40–47
9. Held M, Wolfe P, Crowder HP (1974) Validation of subgradient optimization. Math Program 6:62–88
10. Holmberg K (1991) Linearizations of the staircase cost facility location problem. Res Report LiTH-MAT-R-1991-19, Dept Math, Linköping Inst Techn
11. Holmberg K (1994) Solving the staircase cost facility location problem with decomposition and piecewise linearization. Europ J Oper Res 75:41–61
12. Holmberg K, Ling J (1997) A Lagrangean heuristic for the facility location problem with staircase costs. Europ J Oper Res 97:63–74
13. Jacobsen SK (1983) Heuristics for the capacitated plant location model. Europ J Oper Res 12:253–261
14. Rech P, Barton LG (1970) A non-convex transportation algorithm. In: Beale EM (ed) Applications of Mathematical Programming Techniques, pp 250–260
15. Sridharan R (1991) A Lagrangean heuristic for the capacitated plant location problem with side constraints. J Oper Res Soc 42:579–585

Farkas Lemma
KEES ROOS
Department ITS/TWI/SSOR, Delft University of Technology, Delft, The Netherlands

MSC2000: 15A39, 90C05

Article Outline
Keywords
See also
References

Keywords
Inequality systems; Certificate; Theorem of the alternative; Skew-symmetric matrix; Orthogonal matrix

Farkas' lemma is the most well-known theorem of the alternative or transposition theorem (cf. Linear optimization: Theorems of the alternative). Given an m × n matrix A and an m-vector b, it states that either the set

  S := {y : yᵀA ≥ 0, yᵀb < 0}

or the set

  T := {x : Ax = b, x ≥ 0}

is empty, but never both. This result has a long history, and it has had a tremendous impact on the development of the duality theory of linear and nonlinear optimization.

J. Farkas (1847–1930) was professor of Theoretical Physics at the University of Kolozsvár in Hungary. His interest in the subject is explained in the first two sentences of his paper [5]: "The natural and systematic treatment of analytic mechanics has to have as its background the inequality principle of virtual displacements first formulated by Fourier and later by Gauss. The possibility of such a treatment requires, however, some knowledge of homogeneous linear inequalities that may be said to have been entirely missing up to now."

J.B.J. Fourier [7] seems to have been the first to establish that a mechanical system has a stable equilibrium state if and only if some homogeneous system of inequalities, like the one defining the set S above, has no solution. This observation became known as the mechanical principle of Fourier. By Farkas' lemma this happens if and only if the set T is nonempty. It is almost obvious that if the set T is not empty, then the set S will be empty and we have equilibrium. This follows easily by noting that the sets S and T cannot both be nonempty: if y ∈ S and x ∈ T, then the contradiction 0 > yᵀb = yᵀ(Ax) = (yᵀA)x ≥ 0 follows, because yᵀA ≥ 0 and x ≥ 0. This shows that the condition 'T is not empty' is certainly a sufficient condition for equilibrium. The hard part is to prove that it is also a necessary condition. The proof has a long history. The condition was first given without proof, for special cases by A. Cournot in 1827 and for the general case by M. Ostrogradsky in 1834. Farkas published his condition first in 1894 and 1895, but the proof contains a gap; a second attempt, in 1896, is also incomplete. The first complete proof was published in Hungarian in 1898 [3] and in German in 1899 [4]. This proof is included in Farkas' best known paper [5]. For more details and references, see the historical overviews [9] and [10].
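Computationally, the lemma yields a certificate either way: a feasibility LP either finds x ∈ T or, failing that, a second LP finds y ∈ S. A minimal sketch follows; the box bound on y is only a normalization device for this illustration:

# For given A and b, exactly one of T = {x : Ax = b, x >= 0} and
# S = {y : A'y >= 0, b'y < 0} is nonempty.
import numpy as np
from scipy.optimize import linprog

def farkas(A, b):
    m, n = A.shape
    # Try T: feasibility LP with a zero objective.
    res = linprog(np.zeros(n), A_eq=A, b_eq=b, bounds=(0, None),
                  method="highs")
    if res.status == 0:
        return "T nonempty", res.x
    # Otherwise look for y in S: minimize b'y subject to -A'y <= 0.
    res = linprog(b, A_ub=-A.T, b_ub=np.zeros(n), bounds=(-1, 1),
                  method="highs")
    assert res.fun < -1e-9, "one alternative must hold"
    return "S nonempty", res.x

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(farkas(A, np.array([1.0, 1.0])))   # Ax = b has no x >= 0 here
print(farkas(A, np.array([3.0, 7.0])))   # x = (1, 1) certifies T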

Nowadays (1998) many different proofs of Farkas' lemma are known; for quite recent proofs see, e.g., [1,2,8]. An interesting derivation has been given by A.W. Tucker [11], based on a result that will be referred to as Tucker's theorem (see Tucker homogeneous systems of linear relations). The theorem states that for any skew-symmetric matrix K (i.e., K = −Kᵀ) there exists a vector x such that

  Kx ≥ 0, x ≥ 0, x + Kx > 0.

By taking

  K = (  0    0    A   −b
         0    0   −A    b
        −Aᵀ   Aᵀ   0    0
         bᵀ  −bᵀ   0    0 ),

Tucker's theorem implies the existence of nonnegative vectors z₁, z₂ and x and a nonnegative scalar t such that

  Ax − tb ≥ 0,   (1)
  −Ax + tb ≥ 0,   (2)
  −Aᵀz₁ + Aᵀz₂ ≥ 0,  bᵀz₁ − bᵀz₂ ≥ 0,   (3)

and

  z₁ + Ax − tb > 0,
  z₂ − Ax + tb > 0,
  x − Aᵀz₁ + Aᵀz₂ > 0,
  t + bᵀz₁ − bᵀz₂ > 0.   (4)

If t = 0 then, putting y = z₂ − z₁, (3) and (4) yield a vector in the set S. If t > 0 then, since the above inequalities are all homogeneous, one may take t = 1, and (1) and (2) give a vector in the set T. This shows that at least one of the two sets S and T is nonempty, proving the hard part of Farkas' lemma.

It is worth mentioning a result of C.G. Broyden [1], who showed that Tucker's theorem, and hence also Farkas' lemma, follows from a simple property of orthogonal matrices. The result states that for any orthogonal matrix Q (so QQᵀ = QᵀQ = I) there exist a unique sign matrix D and a positive vector x such that Qx = Dx; a sign matrix is a diagonal matrix whose diagonal elements are equal to either plus one or minus one. The key observation here is that if K is a skew-symmetric matrix, then

  Q = (I + K)⁻¹(I − K)

is an orthogonal matrix, where I denotes the identity matrix; Q is known as the Cayley transform of K [6]. The proof of this fact is straightforward. First, for each vector x one has xᵀ(I + K)x = xᵀx, whence I + K is an invertible matrix. Furthermore, using Kᵀ = −K, one may write

  QᵀQ = (I + K)(I − K)⁻¹(I + K)⁻¹(I − K) = (I + K)(I − K²)⁻¹(I − K).

Multiplying both sides from the left with (I − K) one gets

  (I − K)QᵀQ = (I − K²)(I − K²)⁻¹(I − K) = I − K,

and multiplying both sides with (I − K)⁻¹ one finds QᵀQ = I, showing that Q is indeed orthogonal.

Therefore, by Broyden's theorem, there exist a sign matrix D and a positive vector z such that

  (I + K)⁻¹(I − K)z = Dz.

This can be rewritten as

  (I − K)z = (I + K)Dz,

whence z − Kz = Dz + KDz, or z − Dz = K(z + Dz). Defining x = z + Dz, one has x ≥ 0, Kx = z − Dz ≥ 0 and x + Kx = 2z > 0, proving Tucker's theorem.

See also
Farkas Lemma: Generalizations
Linear Optimization: Theorems of the Alternative
Linear Programming
Motzkin Transposition Theorem
Theorems of the Alternative and Optimization
Tucker Homogeneous Systems of Linear Relations

References
1. Broyden CG (1998) A simple algebraic proof of Farkas' lemma and related theorems. Optim Methods Softw 8:185–199
2. Dax A (1997) An elementary proof of Farkas' lemma. SIAM Rev 39:503–507
3. Farkas Gy (1898) A Fourier-féle mechanikai elv alkalmazásának algebrai alapja. Math Természettudományi Értesitö 16:361–364
4. Farkas J (1899) Die algebraischen Grundlage der Anwendungen des mechanischen Princips von Fourier. Math Naturwissenschaftl Bericht Ungarn 16:154–157
5. Farkas J (1902) Theorie der Einfachen Ungleichungen. J Reine Angew Math 124:1–27
6. Fekete A (1985) Real linear algebra. M. Dekker, New York
7. Fourier JBJ (1826) Solution d'une question particulière du calcul des inégalités. Nouveau Bull Sci Soc Philomath Paris, pp 99–100
8. Klafsky E, Terlaky T (1987) Remarks on the feasibility problem of oriented matroids. Ann Univ Sci Budapest R Eötvös 7:155–157
9. Prékopa A (1980) On the development of optimization theory. Amer Math Monthly 87:527–542

10. Schrijver A (1986) Theory of linear and integer programming. Wiley, New York
11. Tucker AW (1956) Dual systems of homogeneous linear relations. In: Kuhn HW, Tucker AW (eds) Linear Inequalities and Related Systems. Ann Math Stud, vol 38. Princeton Univ Press, Princeton, pp 3–18

Farkas Lemma: Generalizations
V. JEYAKUMAR
School of Mathematics, University of New South Wales, Sydney, Australia

MSC2000: 46A20, 90C30, 52A01

Article Outline
Keywords
Infinite-Dimensional Optimization
Nonsmooth Optimization
Global Nonlinear Optimization
Nonconvex Optimization
Semidefinite Programming
See also
References

Keywords
Inequality systems; ε-subdifferential; D.c. function; Global optimization; Convex inequality systems; Convex-like systems; Nonsmooth optimization; Semidefinite programming; Alternative theorem

The key to identifying optimal solutions of constrained nonlinear optimization problems is the Lagrange multiplier conditions. One of the main approaches to establishing such multiplier conditions for inequality constrained problems is based on dual solvability characterizations of systems of inequalities. J. Farkas [7] initially established such a dual characterization for linear inequalities, which was used in [23] to derive necessary conditions for optimality for nonlinear programming problems. This dual characterization is popularly known as Farkas' lemma, which states that, given any vectors a₁, …, a_m and c in Rⁿ, the linear inequality cᵀx ≥ 0 is a consequence of the linear system aᵢᵀx ≥ 0, i = 1, …, m, if and only if there exist multipliers λᵢ ≥ 0 such that c = ∑_{i=1}^m λᵢaᵢ. This result can also be expressed as a so-called alternative theorem: exactly one of the following alternatives is true:
i) ∃x ∈ Rⁿ: aᵢᵀx ≥ 0 for i = 1, …, m, cᵀx < 0;
ii) ∃λᵢ ≥ 0: c = ∑_{i=1}^m λᵢaᵢ.

This lemma is the key result underpinning linear programming duality, and it has played a central role in the development of nonlinear optimization theory. A large variety of proofs of the lemma can be found in the literature (see [5,25,26]). The proof [3,5] that relies on separation theorems has led to various extensions. These extensions cover a wide range of systems, including systems involving infinite-dimensional linear inequalities, convex inequalities and matrix inequalities. Applications range from classical nonlinear programming to modern areas of optimization such as nonsmooth optimization and semidefinite programming. Let us now describe certain main generalizations of Farkas' lemma and their applications to problems in various areas of optimization.

Infinite-Dimensional Optimization

The Farkas lemma for a finite system of linear inequalities has been generalized to systems involving arbitrary convex cones and continuous linear mappings between spaces of arbitrary dimensions. In this case the lemma holds under a crucial closure condition. In symbolic terms, the main version of such an extension to arbitrary dual pairs of vector spaces states that the following equivalence holds [6]:

  (A(x) ∈ S ⟹ c(x) ≥ 0) ⟺ c ∈ Aᵀ(S*),   (1)

provided the cone Aᵀ(S*) is closed in some appropriate topology. Here A is a continuous linear mapping between two Banach spaces, and S is a closed convex cone with dual cone S* [5]. The closure condition holds when S is a polyhedral cone in a finite-dimensional space. For simple examples of nonpolyhedral convex cones in finite dimensions for which the closure condition does not hold, see [1,5]. However, the following asymptotic version of Farkas' lemma holds without any closure condition:

  (A(x) ∈ S ⟹ c(x) ≥ 0) ⟺ c ∈ cl(Aᵀ(S*)),   (2)

where cl(Aᵀ(S*)) is the closure of Aᵀ(S*) in the appropriate topology. These extensions resulted in the development of asymptotic and nonasymptotic first order necessary optimality conditions for infinite-dimensional smooth constrained optimization problems involving convex cones, and in duality theory for infinite-dimensional linear programming problems (see e.g. [12]). (Smooth optimization refers to the optimization of differentiable functions.) A nonasymptotic form of an extension of Farkas' lemma that differs from the one in (1) is given in [24] without the usual closure condition; for related results see [4]. An approach to the study of semi-infinite programming based on a generalized Farkas lemma for infinite linear inequalities is given in [12].
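The failure of closedness for nonpolyhedral cones can already be seen in R³: the sketch below maps the second-order cone through a linear map whose image is not closed. This generic example is an illustration only and is not taken from [1,5]:

# Map the second-order cone S = {(x,y,z) : z >= sqrt(x^2+y^2)} through
# the linear map M(x,y,z) = (x, z - y). The image is {(a,b) : b > 0}
# together with the origin, which is not closed: (1, 0) is a limit of
# images but is not itself an image.
import numpy as np

def image_point(x, y, z):
    assert z >= np.hypot(x, y) - 1e-12      # (x, y, z) lies in the cone
    return np.array([x, z - y])

for t in [1.0, 10.0, 100.0, 1000.0]:
    # cone points (1, t, sqrt(1+t^2)) have images approaching (1, 0)
    print(image_point(1.0, t, np.hypot(1.0, t)))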

Nonsmooth Optimization

The success of linear programming duality and the practical nature of the Lagrange multiplier conditions for smooth optimization have led to extensions of Farkas' lemma to systems involving nonlinear functions. Convex analysis made it possible to obtain extensions in terms of subdifferentials, replacing the linear systems by sublinear (convex and positively homogeneous) systems [8,31]. A simple form of such an extension states that the following statements are equivalent:

  −g(x) ∈ S ⟹ f(x) ≥ 0;   (3)

  0 ∈ cl[∂f(0) + ⋃_{λ∈S*} ∂(λg)(0)],   (4)

where the real-valued function f is sublinear and lower semicontinuous, and the vector function g is sublinear with respect to the cone S and λg is lower semicontinuous for each λ ∈ S*. When f is continuous, the statement (4) collapses to the condition

  0 ∈ ∂f(0) + cl[⋃_{λ∈S*} ∂(λg)(0)].   (5)

This extension was used to obtain optimality conditions for convex optimization problems and for quasidifferentiable problems in the sense of B.N. Pshenichnyi [27]. A review of results of Farkas type for systems involving sublinear functions is given in [13,14].

Differences of sublinear (DSL) functions, which arise frequently in nonsmooth optimization, provide useful approximations for many classes of nonconvex nonsmooth functions. This has led to the investigation of results of Farkas type for systems involving DSL functions. A mapping g: X → Y is said to be difference sublinear (DSL) (with respect to S) if, for each v ∈ S*, there are (weak*) compact convex sets, here denoted ∂(vg)(0) and ∂̄(vg)(0), such that, for each x ∈ X,

  (vg)(x) = max_{u∈∂(vg)(0)} u(x) − max_{w∈∂̄(vg)(0)} w(x),

where X and Y are Banach spaces. If Y = R and S = R₊, then this definition coincides with the usual notion of a difference sublinear real-valued function. Thus a mapping g is DSL if and only if vg is a DSL function for each v ∈ S*. The sets ∂(vg)(0) and ∂̄(vg)(0) are the subdifferential and superdifferential of vg, respectively. For a DSL mapping g: X → Y we shall often require a selection from the class of sets {∂̄(vg)(0) : v ∈ S*}. This is a set, denoted (w_v), in which a single element w_v ∈ ∂̄(vg)(0) is selected for each v ∈ S*. An extension of the Farkas lemma for DSL systems states that the following statements are equivalent [10,20]:
i) −g(x) ∈ S ⟹ f(x) ≥ 0;
ii) for each selection (w_v) with w_v ∈ ∂̄(vg)(0), v ∈ S*, one has ∂̄f(0) ⊆ ∂f(0) + B, where B = cl cone co ⋃_{v∈S*} (∂(vg)(0) − w_v).
A unified approach to generalizing the Farkas lemma for sublinear systems, using multivalued functions and convex processes, is given in [2,17,18].

Global Nonlinear Optimization

Given that the optimality of a constrained global optimization problem can be viewed as the solvability of appropriate inequality systems, it is easy to see that an extension of Farkas' lemma again provides a mechanism for characterizing global optimality of a range of nonlinear optimization problems. The ε-subdifferential analysis here made it possible to obtain a new version of the Farkas lemma, replacing the linear inequality c(x) ≥ 0 by a reverse convex inequality h(x) ≥ 0, where h is a convex function with h(0) = 0. This extension for systems involving DSL functions states that the following conditions are equivalent:
i) −g(x) ∈ S ⟹ h(x) ≥ 0;
ii) for each selection (w_v) with w_v ∈ ∂̄(vg)(0), v ∈ S*, and for each ε ≥ 0,

  ∂_ε h(0) ⊆ cl cone co [⋃_{v∈S*} (∂(vg)(0) − w_v)].

Such an extension has led to the development of conditions which characterize optimal solutions of various classes of global optimization problems, such as convex maximization problems and fractional programming problems (see [19,20]). However, simple examples show that the asymptotic forms of the above results of Farkas type do not hold if the DSL (or sublinear) system is replaced by a convex system. Ch.-W. Ha [15] established a version of the Farkas lemma for convex systems in terms of epigraphs of conjugate functions. A simple form of such a result [29] states that the following statements are equivalent:
i) (∀i ∈ I) gᵢ(x) ≤ 0 ⟹ h(x) ≥ 0;
ii) epi h* ⊆ cl cone co [⋃_{i∈I} epi gᵢ*],
provided the system gᵢ(x) ≤ 0, i ∈ I, has a solution. Here h and, for each i ∈ I, gᵢ are continuous convex functions, I is an arbitrary index set, and h* and gᵢ* are the conjugate functions of h and gᵢ, respectively. This result has also been employed to study infinite-dimensional nonsmooth nonconvex problems [30]. A basic general form of the Farkas lemma for convex systems, with application to multiobjective convex optimization problems, is given in [11]. Extensions to systems involving differences of convex functions are given in [21,29]. A more general result involving H-convex functions, with application to global nonlinear optimization, is given in [29].

Nonconvex Optimization

The convexity requirement on the functions involved in the extended Farkas lemma above can be relaxed to obtain a form of Farkas' lemma for convex-like systems. Let F: X × Y → R and f: X → R, where X and Y are arbitrary nonempty sets. The pair (f, F) is convex-like on X if

  (∃α ∈ (0, 1)) (∀x₁, x₂ ∈ X) (∃x₃ ∈ X): f(x₃) ≤ αf(x₁) + (1 − α)f(x₂)

and

  (∀y ∈ Y): F(x₃, y) ≤ αF(x₁, y) + (1 − α)F(x₂, y).

If the pair (f, F) is convex-like on X, there is x₀ ∈ X with (∀y ∈ Y) F(x₀, y) < 0, and a regularity condition holds, then the following statements are equivalent [21]:
i) (∀y ∈ Y) F(x, y) ≤ 0 ⟹ f(x) ≥ 0;
ii) (∀δ < 0) (∃λ ∈ Λ) (∀x ∈ X): f(x) + ∑_{y∈Y} λ_y F(x, y) > δ,
where Λ is the dual cone of the convex cone of all nonnegative functions on Y. An asymptotic version of the above result holds if the regularity hypothesis is not fulfilled. This extension has been applied to develop Lagrange multiplier type results for minimax problems and for constrained optimization problems involving convex-like functions. For related results see [16].

Semidefinite Programming

A useful corollary of the Farkas lemma, which is often used to characterize the feasibility problem for linear inequalities, states that exactly one of the following alternatives is true:
i) ∃x ∈ Rⁿ: aᵢᵀx ≤ bᵢ, i = 1, …, m;
ii) ∃λᵢ ≥ 0: ∑_{i=1}^m λᵢaᵢ = 0, ∑_{i=1}^m λᵢbᵢ = −1.
This form of the Farkas lemma has also attracted various extensions to nonlinear systems, including sublinear and DSL systems [20], with the view of characterizing the feasibility of such systems. The feasibility problem, which has been of great interest in semidefinite programming, is the problem of determining whether there exists an x ∈ Rⁿ such that Q(x) ⪰ 0, for real symmetric matrices Qᵢ, i = 0, …, m, where ⪰ denotes the partial order, i.e. B ⪯ A if and only if A − B is positive semidefinite, and Q(x) = Q₀ − ∑_{i=1}^m xᵢQᵢ. However, simple examples show that a direct analog of the alternative does not hold for the semidefinite inequality system Q(x) ⪰ 0 without additional hypotheses on Q. Modified dual conditions which characterize the solvability of the system Q(x) ⪰ 0 are given in [28].

See also
Farkas Lemma

i 2 I;

g i (x)  0

has a solution. Here h and, for each i 2 I, g i are continuous convex functions, I is an arbitrary index set, and h and g i are conjugate functions of h and g i respectively. This result has also been employed to study infinite-dimensional nonsmooth nonconvex problems [30]. A basic general form of the Farkas lemma for convex system with application to multiobjective convex optimization problems is given in [11]. Extensions to systems involving the difference of convex functions are given in [21,29]. A more general result involving H-convex functions [29] with application to global nonlinear optimization is given in [29]. Nonconvex Optimization The convexity requirement of the functions involved in the extended Farkas lemma above can be relaxed to obtain a form of Farkas’ lemma for convex-like system. Let F: X × Y ! R and let f : X ! R, where X and Y are arbitrary nonempty sets. The pair (f , F) is convex-like on X if (9˛ 2 (0; 1))(8x1; x2 2 X)(9x3 2 X); f (x3 )  ˛ f (x1 ) C (1  ˛) f (x2 )

A useful corollary of the Farkas lemma, which is often used to characterize the feasibility problem for linear inequalities, states that exactly one of the following alternatives is true: x  bi , i = 1, . . . , m, i) 9x 2 Rn a> Pi Pm ii) 9i  0 m iD1 i ai = 0, iD1 bi i =  1. This form of the Farkas lemma has also attracted various extensions to nonlinear systems, including sublinear and DSL systems [20] with the view to characterize the feasibility of such systems. The feasibility problem, which has been of great interest in semidefinite programming, is the problem of determining whether there exists an x 2 Rn such that Q(x)  0, for real symmetric matrices Qi , i = 0, . . . , m, where  denotes the partial order, i. e. B  A if and only if A  B is positive semidefP inite, and Q(x) = Q0  m iD1 xi Qi . However, simple examples show that a direct analog of the alternative does not hold for the semidefinite inequality systems Q(x)  0 without additional hypothesis on Q. A modified dual conditions which characterize solvability of the system Q(x)  0 is given in [28]. See also  Farkas Lemma

Feasible Sequential Quadratic Programming

References 1. Ben-Israel A (1969) Linear inequalities and inequalities on finite dimensional real or complex vector spaces: A unified theory. J Math Anal Appl 27:367–389 2. Borwein JM (1983) Adjoint process duality. Math Oper Res 8:403–437 3. Borwein JM (1983) A note on the Farkas lemma. Utilitas Math 24:235–241 4. Borwein JM, Wolkowicz H (1982) Characterizations of optimality without constraint qualification for the abstract convex program. Math Program Stud 19:77–100 5. Craven BD (1978) Mathematical programming and control theory. Chapman and Hall, London 6. Craven BD, Koliha JJ (1977) Generalizations of Farkas’ theorem. SIAM J Math Anal 8:983–997 7. Farkas J (1901) Theorie der einfachen Ungleichungen. J Reine Angew Math 124:1–27 8. Glover BM (1982) A generalized Farkas lemma with applications to quasidifferentiable programming. Z Oper Res 26:125–141 9. Glover BM, Ishizuka Y, Jeyakumar V, Tuan HD (1996) Complete characterization of global optimality for problems involving the pointwise minimum of sublinear functions. SIAM J Optim 6:362–372 10. Glover BM, Jeyakumar V, Oettli W (1994) Farkas lemma for difference sublinear systems and quasidifferentiable programming. Math Program 63:333–349 11. Glover BM, Jeyakumar V, Rubinov AM (1999) Dual conditions characterizing optimality for convex multi-objective programs. Math Program 84:201–217 12. Goberna MA, Lopez MA, Pastor J (1981) Farkas-Minkowski systems in semi-infinite programming. Appl Math Optim 7:295–308 13. Gwinner J (1987) Corrigendum and addendum to ‘Results of Farkas type’. Numer Funct Anal Optim 10:415–418 14. Gwinner J (1987) Results of Farkas type. Numer Funct Anal Optim 9:471–520 15. Ha Ch-W (1979) On systems of convex inequalities. J Math Anal Appl 68:25–34 16. Ills T, Kassay G (1994) Farkas type theorems for generalized convexities. Pure Math Appl 5:225–239 17. Jeyakumar V (1987) A general Farkas lemma and characterization of optimality for a nonsmooth program involving convex processes. J Optim Th Appl 55:449–461 18. Jeyakumar V (1990) Duality and infinite dimensional optimization. Nonlinear Anal Th Methods Appl 15:1111–1122 19. Jeyakumar V, Glover BM (1993) A new version of Farkas’ lemma and global convex maximization. Appl Math Lett 6(5):39–43 20. Jeyakumar V, Glover BM (1995) Nonlinear extensions of Farkas’ lemma with applications to global optimization and least squares. Math Oper Res 20:818–837 21. Jeyakumar V, Gwinner J (1991) Inequality systems and optimization. J Math Anal Appl 159:51–71

F

22. Jeyakumar V, Rubinov AM, Glover BM, Ishizuka Y (1996) Inequality systems and global optimization. J Math Anal Appl 202:900–919 23. Kuhn HW, Tucker AW Nonlinear programming, Proc. Second Berkeley Symp. Math. Statist. and Probab., Univ. Calif. Press, Berkeley, CA, pp 481–492 24. Lasserre JB (1997) A Farkas lemma without a standard closure condition. SIAM J Control Optim 35:265–272 25. Mangasarian OL (1969) Nonlinear programming. McGrawHill, New York 26. Prékopa A (1980) On the development of optimization theory. Amer Math Monthly 87:527–542 27. Pshenichnyi BN (1971) Necessary conditions for an extremum. M. Dekker, New York 28. Ramana MV (1977) An exact duality theory for semidefinite programming and its complexity implications. Math Program 77:129–162 29. Rubinov AM, Glover BM, Jeyakumar V (1995) A general approach to dual characterizations of solvability of inequality systems with applications. J Convex Anal 2(2):309–344 30. Schirotzek W (1985) On a theorem of Ky Fan and its application to nondifferentiable optimization. Optim 16:353–366 31. Zalinescu C (1978) A generalization of the Farkas lemma applications to convex programming. J Math Anal Appl 66:651–678

Feasible Sequential Quadratic Programming FSQP ANDRÉ L. TITS University Maryland, College Park, USA MSC2000: 65K05, 65K10, 90C06, 90C30, 90C34 Article Outline Keywords Main Ideas Algorithms Applications See also References Keywords Nonlinear programming; Sequential quadratic programming; Successive quadratic programming; Feasible iterates

1001

1002

F

Feasible Sequential Quadratic Programming

Feasible sequential quadratic programming (FSQP) refers to a class of sequential quadratic programming (SQP) methods that have the additional property that all iterates they construct satisfy the inequality constraints. Thus, for the problem 8 ˆ minn f (x) ˆ 0. Step 0. Initialization: Set k = 0. Step 1. Computation of a search arc. Compute d 0k . If d 0k = 0, stop. Compute d 1k and  k and set d k = (1   k )d 0k +  k d 1k . Compute correction d˜k . Step 2. Arc search. Compute t k , the first number t in the sequence f1; ˇ; ˇ 2 ; : : :g satisfying f (x k + td k + t 2 d˜k )  f (x k ) + ˛thr f (x k ); d k i, g j (x k + td k + t 2 d˜k )  0; j = 1; : : : ; m i . Step 3. Updates. Compute H k+1 = H > k+1 > 0. Set x k+1 = x k + t k d k + t 2k d˜k . Set k = k + 1. Go back to Step 1. Algorithm: Simple FSQP

Here, d1k is a feasible direction and a direction of first order descent for f ,  2 (0, 1] goes to zero fast enough d k is a correc(like k d0k k2 ) when d0k goes to zero, and e tion that aims at insuring that the full step of one will be accepted when xk is close enough to a solution; compu-

1003

1004

F

Feasible Sequential Quadratic Programming

tation of e d k involves constraint values at xk + dk . Under standard assumptions this algorithm is known to generate sequences whose limit points are Karush–Kuhn– Tucker points. Under strengthened assumptions, including the assumption that H k is updated in such a way that it approximates well, in a certain sense, the Hessian of the Lagrangian as a solution is approached, convergence can be shown to be Q-superlinear or 2-step superlinear. See [10] for details. A refined version of the algorithm of [10] is implemented in the CFSQP/FFSQP software (see [15]). Refinements include the capability to handle equality constraints [6], minimax and constrained minimax problems and to efficiently handle problems with large numbers of inequality constraints and minimax problems with large numbers of objective functions [8]. Also note that an FSQP method with drastically reduced amount of work per iteration has been recently proposed [7]. Applications Applications abound where FSQP-type algorithms are of special interest. In particular, as stressed above, such algorithms are particularly appropriate for problems where the number of variables is not too large but functions evaluations are expensive, and feasibility of iterates is desirable (or imperative). Furthermore, problems with a large number of inequality constraints (or minimax problems with large numbers of objective functions), such as finely discretized semi-infinite optimization problems, can be handled effectively, making FSQP especially well-suited for problems involving, e. g., time or frequency responses of dynamical systems. Pointers to a large number of applications can be found on the web, at the URL listed above. Application areas include all branches of engineering, medicine, physics, astronomy, economics and finances, to mention but a few. See also  Optimization with Equilibrium Constraints: A Piecewise SQP Approach  Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems  Successive Quadratic Programming  Successive Quadratic Programming: Applications in Distillation Systems

 Successive Quadratic Programming: Applications in the Process Industry  Successive Quadratic Programming: Decomposition Methods  Successive Quadratic Programming: Full Space Methods  Successive Quadratic Programming: Solution by Active Sets and Interior Point Methods

References 1. Birge J, Qi L, Wei Z (2000) A variant of the Topkis–Veinott method for solving inequality constrained optimization problems. J Appl Math Optim 41:309–330 2. Bonnans JF, Panier ER, Tits AL, Zhou JL (Aug. 1992) Avoiding the Maratos effect by means of a nonmonotone line search II. Inequality constrained problems – feasible iterates. SIAM J Numer Anal 29(4):1187–1202 3. El-Bakry AS, Tapia RA, Tsuchiya T, Zhang Y (1996) On the formulation and theory of the Newton interior-point method for nonlinear programming. J Optim Th Appl 89:507–541 4. Fiacco AV, McCormick GP (1968) Nonlinear programming: Sequential unconstrained minimization techniques. Wiley New York, New York 5. Herskovits JN, Carvalho LAV (1986) A successive quadratic programming based feasible directions algorithm. In: Bensoussan A, Lions JL (eds) Proc. Seventh Internat. Conf. Analysis and Optimization of Systems – Antibes, June 25-27, 1986, Lecture Notes Control Inform Sci. Springer, Berlin, pp 93–101 6. Lawrence CT, Tits AL (1996) Nonlinear equality constraints in feasible sequential quadratic programming. Optim Methods Softw 6:265–282 7. Lawrence CT, Tits AL (1998) A computationally efficient feasible sequential quadratic programming algorithm. Techn Report Inst Systems Res, Univ Maryland, no. TR 9846 8. Lawrence CT, Tits AL (1998) Feasible sequential quadratic programming for finely discretized problems from SIP. In: Reemtsen R, Rückmann J-J (eds) Semi-infinite programming. Nonconvex Optim Appl. Kluwer, Dordrecht, pp 159– 193 9. Panier ER, Tits AL (1987) A superlinearly convergent feasible method for the solution of inequality constrained optimization problems. SIAM J Control Optim 25(4):934–950 10. Panier ER, Tits AL (1993) On combining feasibility, descent and superlinear convergence in inequality constrained optimization. Math Program 59:261–276 11. Panier ER, Tits AL, Herskovits JN (July 1988) A QP-free, globally convergent, locally superlinearly convergent algorithm for inequality constrained optimization. SIAM J Control Optim 26(4):788–811

Feedback Set Problems

12. Polak E (1971) Computational methods in optimization. Acad. Press, New York 13. Qi L, Wei Z (2000) On the constant positive linear independence condition and its application to SQP methods. SIAM J Optim 10:963–981 14. Urban T, Tits AL, Lawrence CT (1998) A primal-dual interiorpoint method for nonconvex optimization with multiple logarithmic barrier parameters and with strong convergence properties. Techn Report Inst Systems Res Univ Maryland no. TR 98-27 15. WEB: ‘www.isr.umd.edu/Labs/CACSE/FSQP/fsqp.html’. 16. Zoutendijk G (1960) Methods of feasible directions. Elsevier, Amsterdam

Feedback Set Problems FSP PAOLA FESTA1 , PANOS M. PARDALOS2 , MAURICIO G.C. RESENDE3 1 Dip. Mat. e Inform., University Salerno, Salerno, Italy 2 Center for Applied Optim. Department Industrial and Systems Engineering, University Florida, Gainesville, USA 3 Information Sci. Res., AT&T Labs Res., Florham Park, USA MSC2000: 90C35 Article Outline Keywords Notation and Graph Representation The Feedback Vertex Set Problem Mathematical Model of the Feedback Vertex Set Problem Polynomially Solvable Cases Approximation Algorithms and Provable Bounds on Undirected Graphs Approximation Algorithms and Provable Bounds on Directed Graphs Exact Algorithms

The Feedback Arc Set Problem Mathematical Model of the Feedback Arc Set Problem State of the Art of Feedback Arc Set Problems

A GRASP for Feedback Set Problems Future Research Conclusions See also References

F

Keywords Combinatorial optimization; Feedback set problem; Graph bipartization; Local search; GRASP; FORTRAN subroutines In recent years (1990) feedback set problems have been the subject of growing interest. They have found applications in many fields, including deadlock prevention [90], program verification [79], and Bayesian inference [2]. Therefore, it is natural that in the past few years there have been intensive efforts on exact and approximation algorithms for these kinds of problems. Exact algorithms have been proposed for solving the problems restricted to special classes of graphs as well as several approximation algorithms with provable bounds for the cases that are not known to be polynomially solvable. The most general feedback set problem consists in finding a minimum-weight (or minimum cardinality) set of vertices (arcs) that meets all cycles in a collection C of cycles in a graph (G, w), where w is a nonnegative function defined on the set of vertices V(G) (on the set of edges E(G)). This kind of problem is also known as the hitting cycle problem, since one must hit every cycle in C. It generalizes a number of problems, including the minimum feedback vertex (arc) set problem in both directed and undirected graphs, the subset minimum feedback vertex (arc) set problem and the graph bipartization problem, in which one must remove a minimum-weight set of vertices so that the remaining graph is bipartite. In fact, if C is the set of all cycles in G, then the hitting cycle problem is equivalent to the problem of finding the minimum feedback vertex (arc) set in a graph. If we are given a set of special vertices and C is the set of all cycles of an undirected graph G that contains some special vertex, then we have the subset feedback vertex (arc) set problem and, finally, if C contains all odd cycles of G, then we have the graph bipartization problem. All these problems are also special cases of vertex (arc) deletion problems, where one seeks a minimum-weight (or minimum cardinality) set of vertices (arcs) whose deletion gives a graph satisfying a given property. There are different versions of feedback set problems, depending on whether the graph is directed or undirected and/or the vertices (arcs) are weighted or unweighted. See [30] for a complete survey, and [91] for a general NP-hardness proof for almost all vertex and arc deletion problems restricted to

1005

1006

F

Feedback Set Problems

planar graphs. These results apply to the planar bipartization problem, the planar (directed, undirected, or subset) feedback vertex set problems, already proved to be NP-hard [33,46]. Furthermore, it is NP-complete for planar graphs with no indegree or outdegree exceeding three [46], general graphs with no indegree or outdegree exceeding two [46], and edge-directed graphs [46]. The scope of this article is to give a complete stateof-art survey of exact and approximation algorithms and to analyze a new practical heuristic method called GRASP for solving both feedback vertex and feedback arc set problems. Notation and Graph Representation Throughout this paper, we use the following notation and definitions. A graph G = (V, E) consists of a finite set of vertices V(G), and a set of arcs E(G) V(G) × V(G). An arc (or edge) e = (v1 , v2 ) of a directed graph (digraph) G = (V, E) is an incoming arc to v2 and an outgoing arc from v1 and it is incident to both v1 and v2 . If G is undirected, then e is said to be only incident to v1 and v2 . For each vertex i 2 V(G), let in(i) and out(i) denote the set of incoming and outgoing edges of i, respectively. They are defined only in case of a digraph G. If G is undirected, we will take into account only the degree G (i) of i as the number of edges that are incident to i in G. (G) denotes the maximum degree among all vertices of a graph G and it is called the graph degree. A vertex v 2 G is called an endpoint if it has degree one, a linkpoint if it has degree two, while a vertex having degree higher than two is called a branchpoint. A path P in G connecting vertex u to vertex v is a sequence of arcs e1 , . . . , er in E(G), such that ei = (vi , vi + 1 ), i = 1, . . . , r, with v1 = u and vr + 1 = v. A cycle C in G is a path C = (v1 , . . . , vr ), with v1 = vr . A subgraph G0 = (V 0 , E0 ) of G = (V, E) induced by 0 V is a graph such that E0 = E \ (V 0 × V 0 ). A graph G is said to be a singleton, if |V(G)| = 1. Any graph G can be partitioned into isolated connected components G1 , . . . , Gk and the partition is unique. Similarly, every feedback vertex set V 0 of G can be partitioned into feedback vertex sets F 1 , . . . , F k such that F i is a feedback vertex set of Gi . Therefore, following the additive property and de-

noting by (G, w) the weight of a minimum feedback vertex (arc) set for (G, w), we have: (G; w) D

k X

(G i ; w):

iD1

The Feedback Vertex Set Problem Formally, the feedback vertex set problem can be described as follows. Let G = (V, E) be a graph and let w: V(G) ! R+ be a weight function defined on the vertices of G. A feedback vertex set of G is a subset of vertices V 0 V(G) such that each cycle in G contains at least one vertex in V 0 . In other words, a feedback vertex set V 0 is a set of vertices of G such that by removing V 0 from G along with all the edges incident to V 0 , results in a forest. The weight of a feedback vertex set is the sum of the weights of its vertices, and a minimum feedback vertex set of a weighted graph (G, w) is a feedback vertex set of G of minimum weight. The weight of a minimum feedback vertex set will be denoted by (G, w). The minimum weighted feedback vertex set problem (MWFVS) is to find a minimum feedback vertex set of a given weighted graph (G, w). The special case of identical weights is called the unweighted feedback vertex set problem (UFVS). Mathematical Model of the Feedback Vertex Set Problem As a covering-type problem, the feedback vertex set problem admits an integer zero-one programming formulation. Given a feedback vertex set V 0 for a graph (G, w), G = (V, E), and a set of weights w = {w(v)}v 2 V(G) , let x = {xv }v 2 V(G) be a binary vector such that xv = 1 if v 2 V 0 , and xv = 0 otherwise. Let C be the set of cycles in (G, w). The problem of finding the minimum feedback vertex set of G can be formulated as an integer programming problem as follows: 8 X ˆ min w(v)xv ˆ ˆ ˆ ˆ v2V (G) < X s.t. xv  1; 8 2 C; ˆ ˆ ˆ v2V ( ) ˆ ˆ : 0  xv  1 integer; v 2 V (G): If one denotes by Cv the set of cycles passing through vertex v 2 V(G), then the dual of the corresponding lin-

Feedback Set Problems

ear programming relaxation is a packing problem: 8 X ˆ max y ˆ ˆ ˆ ˆ X 2C < s.t. y  w(v); 8v 2 V (G); ˆ ˆ  2C ˆ v ˆ ˆ : y  0; 8 2 C: Polynomially Solvable Cases Given the NP-completeness of the feedback vertex set problem, a recent line of research has focused on identifying the largest class of specially structured graphs on which such problems remain polynomially solvable. A pioneering work is due to A. Shamir [79], who proposed a linear time algorithm to find a feedback vertex set for a reducible flow graph. C. Wang, E. Lloyd, and M. Soffa [90] developed an O(|E(G)||V(G)|2) algorithm for finding a feedback vertex set in the class of graphs known as cyclically reducible graphs, which is shown to be unrelated to the class of quasireducible graphs. Although the exact algorithm proposed by G.W. Smith and R.B. Walford [83] has exponential running time in general, it returns an optimal solution in polynomial time for certain types of graphs. A variant of the algorithm, called the Smith–Walford-one algorithm, selects only candidate sets F of size one and runs in O(|E(G)||V(G)|2) time. The class of graphs for which it finds a feedback vertex set is called Smith–Walford onereducible. In the study of feedback vertex set problems a set of operations called contraction operations has had significant impact. They contract the graph G(V, E), while preserving all the important properties relevant to the minimum feedback vertex set. See [56] for a detailed analysis of these reduction procedures which are important for the following two reasons. First, a class of graphs of increasing size is computed, where the feedback vertex set of each graph can be found exactly. Second, most proposed heuristics and approximation algorithms use the reduction schemes in order to reduce the size of the problem. Another line of research on polynomially solvable cases focuses on other special classes, including chordal and interval graphs, permutation graphs, convex bipartite graphs, cocomparability graphs and on meshes and toroidal meshes, butterflies, and toroidal butterflies. The feedback vertex set on chordal and interval graphs can be viewed as a special instance of the generalized clique cover problem, which

F

is solved in polynomial time on chordal graphs [20,93] and interval graphs [65]. For permutation graphs, an algorithm due to A. Brandstädt and D. Kratsch [8] was improved by Brandstädt [7] to run in O(|V(G)|6) time. More recently (1994), Y.D. Liang [58] presented an O(|V(G)||E(G)|) algorithm for permutation graphs that can be easily extended to trapezoid graphs while keeping the same time complexity. On interval graphs, C.L. Lu and C.Y. Tang [61] developed a linear-time algorithm to solve the minimum weighted feedback vertex set problem using dynamic programming. S.R. Coorg and C.P. Rangan [19] present an O(|V(G)|4) time and O(|V(G)|4) space exact algorithm for cocomparability graphs, which are a superclass of permutation graphs. More recently, Liang and M.S. Chang [13] developed a polynomial time algorithm, that by exploring the structural properties of a cocomparability graph uses dynamic programming to get a minimum feedback vertex set in O(|V(G)2 | |E(G)|) time. A recent (1998) line of research [63] on polynomially solvable cases focuses on special undirected graphs having bounded degree and that are widely used as connection networks, namely mesh, butterfly and k-dimensional cube connected cycle (CCCk ). Approximation Algorithms and Provable Bounds on Undirected Graphs A 2 log2 |V(G)|-approximation algorithm for the unweighted minimum feedback vertex set problem on undirected graphs is contained in a lemma due to P. Erdös and L. Posa [25]. This result wasp improved in [66] to obtain a performance ratio of O( log jV (G)j). R. Bar-Yeruda, D. Geiger, J. Naor, and R.M. Roth [2] gave an approximation algorithm for the unweighted undirected case having ratio less than or equal to 4 and two approximation algorithms for the weighted undirected case having ratios 4 log2 |V(G)| and 22 (G), respectively. To speedup the algorithm, they show how to preprocess the input valid graph by applying the corresponding undirected versions of the Levy–Lowe reduction transformations. For the feedback vertex set problem in general undirected graphs, two slightly different 2-approximation algorithms are described in [3] and [1]. These algorithms improve the approximation algorithms of [2]. They also can find a loop cutset which, under specific conditions, is guaranteed in the

1007

1008

F

Feedback Set Problems

worst case to contain less than four times the number of variables contained in a minimum loop cutset. Subsequently, A. Becker and Geiger [4] applied the same reduction procedure from the loop cutset problem to the minimum weighted feedback vertex set problem of [2], but their result is independent of any condition and is guaranteed in the worst case to contain less than twice the number of variables contained in a minimum loop cutset. They [4] propose two greedy approximation algorithms for finding the minimum feedback vertex set V 0 in a vertex-weighted undirected graph (G, w), one of them having performance ratio bounded by the constant 2 and complexity O(m+n log n), where m = |E(G)| and n = |V(G)|. In [17], F.A. Chudak, M.X. Goemans, D. Hochbaum, and D.P. Williamson showed how the algorithms due to Becker and Geiger [3] and V. Bafna, P. Berman, and T. Fujito [1] can be explained in terms of the primal-dual method for approximation algorithms that are used to obtain approximation algorithms for network design problems. The primaldual method starts with an integer programming formulation of the problem under consideration. It then simultaneously builds a feasible integral solution and a feasible solution to the dual of the linear programming relaxation. If it can be shown that the value of these two solutions is within a factor of ˛, then an ˛approximation algorithm is found. The integrality gap of an integer program is the worst-case ratio between the optimum value of the integer program and the optimum value of its linear relaxation. Therefore, by applying the primal-dual method it is possible to proof that the integrality gap of the integer program under consideration is bounded. In fact, Chudak et al., after giving a new integer programming formulation of the feedback vertex set problem, provided a proof that its integrality gap is at most 2. They also gave the proofs of some key inequalities needed to prove the correctness of their new integer programming formulation. Theorem 1 Let V 0 denote any feedback vertex set of a graph G = (V, E), E 6D ;, let  denote the cardinality of the smallest feedback vertex set for G, and let E(S) denote the subset of edges that have both endpoints in S V(G), b(S) = |E(S)|  |S|+1. Then X v2V 0

[ G (v)  1]  b(V(G));

(1)

X

G (v)  b(V(G)) C :

(2)

v2V 0

If every vertex in G has degree at least two, and V 0 M is any minimal feedback vertex set (i. e. 8 v 2 V 0 M , V 0 M \ {v} is not a feedback vertex set), then X G (v)  2(b(V(G)) C )  2: (3) v2V 0 M

G. Even, Naor, B. Schieber, and L. Zosin [28] showed that the integrality gap of that integer program for the standard cycle formulation of the feedback vertex set problem is ˝(log n). The new integer programming formulation given in [17] is as follows: X 8 ˆ min w(v)xv ˆ ˆ ˆ ˆ v2V (G) ˆ X ˆ 0, and assume the claim has been proven for t  1. We select a splitting field S for P(z) over C. Then we have a decomposition P(z) D (z  a1 )    (z  a n ) in S[z]: For an arbitrary real number c we form the expressions bij (c) = ai aj + c(ai + aj ) and the polynomial Q(z) Q = 1  i < j  n (z  bij (c)). The coefficients of Q(z) are symmetric polynomials in a1 , . . . , an over R and thus real. The degree of Q(z) is n(n  1)/2 = 2t1 u(2t u  1) = 2t1 v for an odd number v. By the induction hypothesis Q(z) has at least one zero in C. Thus bij (c) is in C for a pair of subscripts (i, j) that may depend on c. If this construction is carried out for all natural numbers c with 1  c  1 + n(n  1)/2 one finds c and c0 belonging to the same pair of subscripts, i. e. there is a pair (i, j) with bij (c) 2 C and bij (c0 ) 2 C. If one solves the system of equations b i j (c) D a i a j C c(a i C a j ); b i j (c 0 ) D a i a j C c 0 (a i C a j ) p one obtains a i D a/2 ˙ a2  4b 2 /2 2 C. Thus P(z) has a complex zero. Let now P(z) 2 C[z] be irreducible and t a zero of P(z) in a splitting field of P(z) over C. Then P(z) is the irreducible polynomial of t over C. Since t is algebraic over C and C is algebraic over R, t is algebraic over R. We denote the irreducible polynomial of t over R by U(z). Then P(z)/U(z) in C[z]. U(z) has at least one zero in C. Since C is normal over R, U(z) splits into linear factors in C[z]. Thus P(z) is linear and t 2 C. See also  Gröbner Bases for Polynomial Equations References 1. Körner O (1990) Algebra. Aula-Verlag, Wiesbaden 2. Massey WS (1967) Algebraic topology: An introduction. Springer, Berlin 3. Titchmarsh EC (1939) The theory of functions. Oxford Univ. Press, Oxford

Fuzzy Multi-objective Linear Programming

Fuzzy Multi-objective Linear Programming FMOLP ROMAN SLOWINSKI Institute Computing Sci., Pozna´n University Technol., Pozna´n, Poland MSC2000: 90C70, 90C29 Article Outline Keywords Flexible Programming MOLP with Fuzzy Coefficients Flexible MOLP with Fuzzy Coefficients Conclusions See also References Keywords Multi-objective linear programming under uncertainty; Fuzzy sets; Uncertainty modeling; Multicriteria decision making; Interactive procedures Fuzzy multi-objective linear programming extends the linear programming model (LP) in two important aspects:  multiple objective functions representing different points of view (criteria) used for evaluation of feasible solutions,  uncertainty inherent to information used in the modeling and solving stage. A general model of the FMOLP problem can be presented as the following system:

e

c k x] ! min [e c1 x; : : : ;e

(1)

such that e a i xe e bi ; x  0;

i D 1; : : : ; m;

(2) (3)

wheree c1 D [e c l 1 ; : : : ;e c l n (l = 1, . . . , k), x = [ x1 , . . . , xn ]| , a i1 ; : : : ;e a i n (i = 1, . . . , m). The coefficients with e a i D [e the sign of wave are, in general, fuzzy numbers, i. e. convex continuous fuzzy subsets of the real line. The wave

F

over min and relation  ‘fuzzifies’ their meaning. Conditions (2) and (3) define a set of feasible solutions (decisions) X. An additional information completing (1) is a set of fuzzy aspiration levels on particular objectives, gk. thought of as goals, denoted by e g 1 ; : : : ;e There are three important special cases of the above problem that gave birth to the following classes of problems:  flexible programming;  multi-objective linear programming (MOLP) with fuzzy coefficients;  flexible MOLP with fuzzy coefficients. In flexible programming, coefficients are crisp but there is a fuzzified relation e  between objective functions and goals, and between left- and right-hand sides of the constraints. This means that the goals and constraints are fuzzy (‘soft’) and the key question is the degree of satisfaction. In MOLP with fuzzy coefficients all the coefficientsare, in general, fuzzy numbers and the key question is a representation of relation  between fuzzy left- and right-hand sides of the constraints. Flexible MOLP with fuzzy coefficients concerns the most general form (1)–(3) and combines the two key questions of the previous problems. The two first classes of FMOLP problems use different semantics of fuzzy sets while the third class combines the two semantics. In flexible programming, fuzzy sets are used to express preferences concerning satisfaction of flexible constraints and/or attainment of goals. This semantics is especially important for exploiting information in decision making. The gradedness introduced by fuzzy sets refines the simple binary distinction made by ordinary constraints. It also refines the crisp specification of goals and ‘all-or-nothing’ decisions. Constraint satisfaction algorithms, optimization techniques and multicriteria decision analysis are typically involving flexible requirements which can be represented by fuzzy relations. In MOLP with fuzzy coefficients, the semantics of fuzzy sets is related to the representation of incomplete or vague states of information under the form of possibility distributions. This view of fuzzy sets enables representation of imprecise or uncertain information in mathematical models of decision problems considered in operations research. In models formulated in terms of mathematical programming, the imprecision and uncertainty of information (data) is taken into ac-

1103

1104

F

Fuzzy Multi-objective Linear Programming

count through the use of fuzzy numbers or fuzzy intervals instead of crisp coefficients. It involves fuzzy arithmetic and other mathematical operations on fuzzy numbers that are defined with respect to the famous Zadeh’s extension principle. In flexible MOLP with fuzzy coefficients, the uncertainty and the preference semantics are encountered together. This is typical for decision analysis and operations research where, in order to deal with both uncertain data and flexible requirements, one can use a fuzzy set representation. Below, we make a tutorial characterization of the three classes of problems and solution methods. For more detailed surveys see, e. g., [16,18,20,27, 30,32,36,37]. Flexible Programming Flexible programming has been considered for the first time in [41] with respect to single-objective linear programming. It is based on a general Bellman–Zadeh principle [2] defining the concept of fuzzy decision as an intersection of fuzzy goals and fuzzy constraints. A fuzzy goal corresponding to objective cl x is defined as a fuzzy set in X; its membership function l : X ! [0, 1] characterizes the decision maker’s aspiration of making cl x ‘essentially smaller or equal to g l ’. A fuzzy constraint b i is also defined as a fuzzy set corresponding to a i xe in X; its membership function i ! [0, 1] characterizes the degree of satisfaction of the ith constraint. In order to define the membership function i (x) for the ith fuzzy constraint, one has to know the tolerance margin di  0 for the right-hand side bi (i = 1, . . . , m); 8 ˆ 1 for a i x  b i ; ˆ ˆ ˆ 0 ;

z>0

(6)

Generalizations of Interior Point Methods for the Linear Complementarity Problem

Definition 7 If M is a positive semidefinite matrix, then there exists a vector z such that Mz  0 ;

z >0:

(7)

Definition 8 A potential function is T

P(x; ˝) D n log(c x)

n X

log x j ;

x; y  0 ;

(10)

xT y D 0 ;

(11)

which can be regarded as a quadratic programming problem Minimize x T y

(12)

subject to y D Mx C q

(13)

x; y  0 :

(14)

x 2 int( ˝ ) (8)

jD1

where int( ˝ ) indicates the interior of the set ˝ which is the set of all feasible solutions of the dual.

Formulation The aim of this section is to describe two modern implementations with interior point methods. In the first subsection an interior reduction algorithm to solve the LCP is presented, with particular matrix classes, [4], while in the following subsection an interior point potential algorithm to solve the general LCP is presented.

An Interior Point Reduction Algorithm to Solve the LCP There exist many interior point algorithms to solve LCPs. A particularly interesting approach is an interior point potential reduction algorithm for the LCP [4]. The complementarity problem is viewed as a minimization problem, where the objective function is the product of the solution vector x and the slack vector of the inequalities y. The objective of the algorithm formulated is to find an -complementarity solution in time bounded by a polynomial in the input size. This algorithm is formulated to solve LCP(q,M) which will have a solution, such as when the matrix M is a P-matrix. It is then extended to matrices M which are only positive semidefinite and to skew-symmetric matrices. Consider a LCP, that is, given a rational matrix M 2 R nn and a rational vector q 2 R n , find vectors x; y 2 R n such that y D Mx C q ;

G

(9)

Given the problem Eqs. (12)–(14) the aim is to find a point with x T y < for a given > 0. The algorithm proceeds by iteratively reducing the potential function: X ln(x j y j ) : (15) f (x; y) D  ln(x T y)  j

Apply a linear scaling transformation to make the coordinates of the current point all equal to 1 and then take a gradient step in the transformed space using the gradient of the transformed potential function. The step size can be determined either by the algorithm or by line search to minimize the value of the potential function. Finally transform the solution point back to the original space. Consider the potential function Eq. (15) under scaling of x and y, given any feasible interior point (x0 ,y0 ) if the matrices X and Y are diagonal matrices with the elements on the diagonal given by the values of (x0 ,y0 ). Define a linear transformation of the space by x¯ D X 1 x;

y¯ D Y 1 y :

(16)

and let W D XY, w j D (x 0j )T (y0j ) so that (w D w1 ; w2 ;    ; w n ) and M D Y 1 MX: Consider the transformed problem as follows: Minimize x¯ T W y¯

(17)

¯ x¯ C q¯ subject to y¯ D M

(18)

x¯ ; y¯  0 :

(19)

Feasible solutions of the original problem are mapped into feasible solutions of the transformed problem: ¯ x¯ C q¯ : y¯ D Y 1 (Mx C q) D M

(20)

1147

1148

G

Generalizations of Interior Point Methods for the Linear Complementarity Problem

Assume that the current point is indeed (e,e) and the potential function has the form T

f (x; y) D  ln(x W y) 

n X

ln(x j w j y j ) :

(21)

point, (x; y) 2 F, and  > 0, which may be represented so: n X ln(x j y j ):  (x; y) D nC (x; y) D (nC) ln(x T y) jD1

jD1

(28) The gradient of f is given by  rx f (x; y) D T Wy  X 1 e ; x Wy

(22)

 Wx  Y 1 e ; x T Wy

(23)

r y f (x; y) D

and indicate by g the gradient vector evaluated at the current point (e,e). Denote by (x; y) the projection of r f (e; e) on the linear space ˝ defined by y D Mx. Thus we define the following problem: Minimize kx  gk2 C ky  gk2

(24)

subject to y D Mx :

(25)

It follows that [4] x D (I C M T M)1 (I C M T )g ;

(26)

Suppose the iterations have started from an interior feasible point (x0 ,y0 ), with  (x0 ; y0 ) D  0 a sequence of interior feasible points can be generated fx k ; y k g, (k D 0; 1; : : :) terminating at a point such that (x k )T (y k )  . Such a point is found when  (x k ; y k )   ln( ) C n ln(n)

since by the arithmetic–geometric inequality P n ln((x k )T (y k ))  njD1 ln(x j y j )  n ln(n)  0. The fact that  (x T y)   0 implies that x T y  0  / and therefore the boundedness of f(x; y) 2 F j x T y   0 /g guarantees the boundedness of f(x; y) 2 int(F) j x T y   0 g, where int() indicates the relative interior of its argument. To obtain a reduction in the potential function the scaled gradient projection method may be used. The gradient vectors of the potential function with respect to x and y are nC )y  X 1 e ; xT y nC r y D ( T )x  Y 1 e : x y rx D (

y D M(I C M T M)1 (I C M T )g :

(27)

It is possible determine the reduction  f in the value of f in moving from x D y D e to a point of the form x˜ D e  tx, y˜ D e  ty, where t > 0. It is desired to choose t so as to achieve a reduction of at least n– k for some k > 0, at every iteration. Since this is shown to be possible, [4], the result follows if the matrix is positive definite, positive semidefinite or skew-symmetric. An Interior Point Potential Algorithm to Solve General LCPs In this subsection a “condition-based” iteration complexity will be formulated regarding the solution of various LCPs. This parameter will characterize the degree of difficulty of the problem when a potential reduction algorithm is used. The condition number derived will of course depend on the data of the problem (M,q). Consider the primal–dual potential function of a LCP as stated in Eqs: (9)–(11), for any interior feasible

(29)

(30) (31)

At the kth iteration the following linear program is solved, subject to an ellipsoid constraint: Minimize Z D r T x k d x C r T  y k d y

(32)

subject to d y D Md x

(33)

1 > ˛ 2  k(X k )1 d x k2 C k(X k )1 d x k2 :

(34)

Denote by (d xT ; d Ty )T the minimal solution of Eqs. (32)– (34) and let ! ! nC X k (y k C M T )  e p kx (x k ) T (y k ) k D p D (35) nC p ky Y k (x k  )  e (x k ) T (y k ) 1    Y k  MX k

D (Y k )2 C M(X k )2 M T   k T k   (x ) (y ) e (36)  X k yk  nC

Generalizations of Interior Point Methods for the Linear Complementarity Problem

then there results  k 1  pk (X ) d x : D ˛ (Y k )1 d y kp k k

those that can be solved in polynomial time may be indicated. (37)

By the concavity of the log function and certain elementary results it can be shown [17] that k

G

k

k

k

 (x C d x ; y C d y )   (x ; y )    1 ˛2  ˛kp k k C nCC : 2 (1  ˛)

(38)

Corollary 1 An instance of a LCP(q,M) is solvable in polynomial time if  (M; q; ) > 0 and 1/ (M; q; ) is bounded above by a polynomial in ln(1/ ) and n. This corollary is slightly different to corollary 1 in [16]. Further the following definitions are important: C X (M; q) D f j x T y  q T < 0; x  > 0;

y C M T > 0 for some (x; y) 2 int(F)g (45)

Letting 

jjp k jj 1 ˛ D min ; nCC2 nCC2



1 2

(39)

results in

Definition 9 Let G be a set of LCP(q,M) such that the following conditions are satisfied: G D f(M; q) j int(F) 6D ;;

 (x k C d x ; y k C d y )   (x k ; y k )   1 jjp k jj2 : ;  min (2n C  C 2) 2(n C  C 2)

C X

(M; q) D ;g :

(46)

(40)

PC (M; q) be empty for a LCP(q,M). Lemma 1 Let p Then for   n C 2n,  (M; q; )  1.

The expression for kp k k is indicated by (35) and can be considered the potential reduction at the kth iteration of the objective function. For any x,y let

Lemma 2 Let f j x T y  q T > 0; x  > 0; y C M T > 0 for some (x; y) 2 int(F)g be empty for p a LCP(q,M). Then for 0 <   n  (2n), there results  (M; q; )  1.

g(x; y) D

nC Xy  e xT y

(41)

H(x; y) D 2I(XM T Y)(Y 2 CMX 2 M T )1 (MX Y) (42) which is a positive semidefinite matrix. Thus kp k k D g T (x k ; y k )H(x k ; y k )g(x k ; y k )

(43)

which may also be indicated as kg(x; y)k2H D g T (x; y) H(x; y)g(x; y). Define a condition number for the LCP(q,M) as  (M; q; ) D inffjjg(x; y)jj2H j x T y 0

> ;  (x; y)   ; (x; y) 2 int(F)g :

(44)

The condition number  (M,q, ) represents the degree of difficulty for the potential reduction algorithm in solving the LCP(q,M). The larger the condition number that results, the easier can the problem be solved. The condition number for LCPs provides a criterion to subdivide given instances of LCP(q,M) into classes and

With these properties it can be shown that for many classes of matrices  (M; q; ) > 0 or that the conditions indicated in the lemmas are satisfied, so the LCP is solvable in polynomial time. Further, the potential reduction algorithm will solve, under general conditions, the LCP(q,M) when M is a P -matrix and when M is a row-sufficient matrix. Thus, Theorem 1 Let  (x 0 ; y)  O(n ln(n)) and M be a P -matrix. Then the potential reduction algorithm terminates at x T y < in O(n2 maxfj  j/(n); 1g ln(1/ )) iterations and each iteration uses at most O(n3 ) arithmetic operations. The bound indicates that the algorithm is a polynomialtime algorithm if j  j/(n) is bounded above by a polynomial in ln(1/ ) and n. Theorem 2 Let  > 0 and be fixed. For a row-sufficient matrix M and f(x; y) 2 F j  (x; y)   0 g bounded, then  (M; q; ) > 0. Since for the LCP(q,M) defined by this class of matrices the condition number is bounded away from zero,

1149

1150

G

Generalizations of Interior Point Methods for the Linear Complementarity Problem

the potential reduction algorithm will solve this class of problems. Methods and Applications Depending on the algorithm proposed, any penalty function algorithm or any linear programming algorithm will ensure, given the conditions imposed on the problem, a polynomial-time solution is achieved. Often computationally, the most efficient method is the Newton method with a penalty or a barrier parameter. However, the actual method of solution is left to the interested reader, who can refer to the original contributions, since too many problem -dependent factors are involved.

xT y D 0

(54)

which without loss of generality will be indicated as Mx C q  y D 0

(55)

x; y  0

(56)

xT y D 0 :

(57)

Assume that there exists an approximate interior point solution, as is usual with interior point methods, with variables 0 < x i ; y i  ; < n2 8i D 1; 2; : : : ; n and consider the following barrier function for the optimization problem for the LCP (55)–(57). Minimize

(x; y; ) D x T y  

ln(x i y i ) (58)

iD1

Models The aim of this section is to treat the methods described in “Formulation” under some more general conditions. An Interior Point Newton Method for the General LCP This algorithm finds a Karush–Kuhn-Tucker point for a nonmonotone LCP with a primal interior point method using Newton’s method with a convex barrier function, under some mild assumptions. Consider a bounded LCP:

subject to Mx  y C q D 0 1 x; y < e 2 x; y > 0

(47)

u; v  0

(48)

(rx

u vD0

(49)

and suppose that the LCP solution set S D fu; vjMu C q  v D 0; u; v  0; u T v D 0g is bounded above by a vector (m1T ; m2T )T 2 R2n . Define two diagonal positive matrices D1 > Diag(2m1 )

(50)

D2 > Diag(2m2 )

(51)

to obtain the following LCP y D D21 v D D21 (Mu C q) D (D21 MD1 )x C D21 q

(52) (53)

(60) (61)

x i y2i  (ˇ  )y i ; xi yi C ˇ x 2 y i  (ˇ  )x i : (x; y)) i D i xi yi C ˇ

Mu C q  v D 0

T

(59)

where e 2 R n is the vector of unit elements and ˇ > 0 is an arbitrary small parameter. To convert the optimization problem (58)–(61) into a convex programming problem, consider as a barrier parameter, which is successively reduced, then the gradient of this function is: (rx (x; y)) i D

1 e  x; y  0 2

n X

(62) (63)

It is easy to show that if the barrier parameter at any iteration k will satisfy the following inequality >

(x i y i C ˇ)2 ; y2i C ˇ

(64)

then the Hessian matrix of the function (58) is positive denite for the conditions imposed. Thus the optimization problem (58)–(61) is a convex programming problem and it may be solved by one of the methods above, which is also suitable to a further generalization [1]. Here it will be solved as a convex quadratic programming [12]. Rewrite the optimization problem (58)–(61) as: Min

(x; y; ) D x T y  

n X iD1

ln(x i y i C ˇ) ; (65)

Generalizations of Interior Point Methods for the Linear Complementarity Problem

0 B B B subject to B B @

M I 0 I 0

I 0 I 0 I

1 C  C C x Cb 0: C C y A

Generalization of an Interior Point Reduction Algorithm to Solve General LCPs (66)

Where b T D (q T ; 0; 0; 12 e T ; 12 e T ). Indicate the constraint matrix as the matrix A of dimension 5n  2n. Also, idicate with z T D (x T ; y T ) 2 R2n . The algorithm considered is a primal method with a log barrier function. It will follow a central path and will take small steps [12] and it can be shown that from an approximate global minimum, an exact global minimum can be simply derived [12]. Let ˘ denote the feasible region of Eq. (66) and denote the interior of this feasible region by int(˘ ), i. e., Az > b by relaxing as is usual in the Interior point algorithms, the equality constraints. Make the following assumptions:  rank(A) D 2n,  ˘ is compact,  int(˘ ) 6D ;.  x i y i > " 8i D 1; 2; : : : ; n. Define the potential function

h(z; ) D

(x; y; )  

m X

ln(a Ti z  b i ) :

G

(67)

iD1

The following lemmas are straight forward adaptations of the original results. Lemma 3 For any fixed choice of  > 0, that meets the condition (64), the function (67) is strictly convex on int(˘ ). Lemma 4 For any fixed choice of  > 0, that meets the condition (64), the function (67) has a unique minimum. Let () be the minimum of h(z; ) for a fixed . As  ! 0 there must be an accumulation point by compactness. This point must be an approximate global minimum. Lemma 5 Let zˆ be an accumulation point of (). As  ! 0 then zˆ is an approximate global minimum for problem (65)–(66).

The condition number for LCPs provides a criterion to subdivide given instances of LCP(q,M) into classes. These results will now be extended. Consider a LCP(q,M) Eqs. (9)–(11) with a nonsingular coefficient matrix M, for which, moreover, (I– M) is nonsingular and the solution set of LCP(q,M) is bounded from above. This LCP can be indicated so: Mu C q  v D 0 ;

(68)

u; v  0 ;

(69)

uT v D 0 ;

(70)

where u; v; q 2 R n . Suppose that the LCP solution set S D fu; vj Mu C q  v D 0; u; v  0; u T v D 0g is bounded above by a vector (m1T ; m2T )T 2 R2n . Apply the transformation defined by Eqs. (50) and (51), so that there results y D D21 v D D21 (Mu C q) D (D21 MD1 )x C D21 q ;

(71)

1 e  x; y  0 ; 2

(72)

xT y D 0 ;

(73)

which will be indicated as Mx C q  y D 0 ;

(74)

x; y  0 ;

(75)

xT y D 0 :

(76)

For the potential reduction algorithm to solve general LCPs, it is required that x > 0 and y > 0. Lemma 6 For a nonsingular M the matrices ˆ (I  XY M) ˆ and (Y C MX) ˆ ˆ D D1 MD2 , (I  M), M are all nonsingular. Corollary 2 Under the conditions of Lemma 3 (Y C MX) is nonsingular. The following additional lemma is also required.

1151

1152

G

Generalizations of Interior Point Methods for the Linear Complementarity Problem

Lemma 7 For all LCP(q,M) with nonsingular matrices M and (I–M) transformed to the form given by Eqs. (71)–(73) so that for any feasible solution (x; y) 2 int(F) so that 0 < X < I; 0 < Y < I, there reXy  e 6D 0. sults g(x; y) D nC xT y

Generalizations of Interior Point Methods for the Linear Complementarity Problem, Table 1 Results for 140 linear complementarity problems (LCPs) of different matrix classes and sizes Type PSD Size

Theorem 3 For all LCP(q,M) with nonsingular matrices M and (I–M) transformed to the form given by Eqs. (71)–(73) so that for any feasible solution (x; y) 2 int(F) there results 0 < X < I; 0 < Y < I, the condition number for the LCP  (M; q; ) > 0 for some  > 0. For notational simplicity assume that the transformed ˆ is indicated by M without loss of genermatrix M ality.  (M; q; ) D 0 if kg(x; y)k2H D 0. Assume that kg(x; y)k2H D 0 and expand it in terms of its factors. 2g(x; y)T g(x; y)  g(x; y)T [(XM T  Y)(Y 2 C MX 2 M T )1 (MX  Y)]  g(x; y) D 0

(77)

Cases Algorithms should be tested extensively for their computational efficiency on a wide series of cases, so that suitable comparisons can be made. One hundred and forty random instances of LCPs were solved for four different sizes (30, 50, 100, 250), with three types of matrices: positive semidefinite, negative semidefinite and indefinite. In Table 1 the number of problems solved for each type of matrix with the parametric LCP algorithm [11] and with an interior point algorithm with the Newton method are indicated. The instances with positive (semi)definite matrices are easy to solve in fact. The instances with negative (semi)definite and indefinite classes are considered hard to solve, but both algorithms have no trouble with these classes, except that the first seems to be more hap-

PLCP IPNM PLCP

INDF IPNM PLCP

IPNM

30

6

6

12

12

28

28

50

3

3

3

3

26(3)

29

100

6

6

6

6

16

16

250

5

5

7(4)

11

15

15

Total 20

20

28(32) 32

85(88) 88

PSD positive semidefinite matrix, NSD negative semidefinite matrix, INDF indefinite matrix, PLCP parametric LCP algorithm, IPMN interior point algorithm with the Newton method. Generalizations of Interior Point Methods for the Linear Complementarity Problem, Table 2 Timing results for 140 LCPs of different matrix classes and sizes (seconds) Type PSD Size

It is easy to show that this will never happen under the conditions of the theorem. Hence, for any matrix that satisfies the assumed conditions the condition number is strictly positive and so a solution to the LCP may be obtained straightforwardly by this method. This provides a partial characterization and extension of the matrix class G defined in [16].

NSD

PLCP

NSD IPNM PLCP

INDF IPNM PLCP

IPNM

30

0.06

0.04

0.08

0.06

0.07

50

0.28

0.18

0.38

0.32

0.33

0.32

100

3.47

1.42

7.00

3.37

5.18

2.78

250

0.07

109.37 22.56 121.51 95.12 111.99 87.45

hazard, rather than being subject to numerical difficulties. Both routines seem to be only slightly affected by the type of matrix, but the interior point algorithm with the Newton method is more efficient, as confirmed in Table 2, where the average time for solving the instances is given in seconds.

Conclusions Interior point methods to solve the LCP are now well established and allow polynomial solutions to be obtained for such problems with suitable matrix classes. Moreover these routines can be used as a subroutine in general iterative optimization problems. Evidently research is being actively conducted to generalize the applicable matrix classes for which solutions can be obtained in polynomial time and space.

G

Generalized Assignment Problem

See also  Complementarity Algorithms in Pattern Recognition  Mathematical Programming Methods in Supply Chain Management  Simultaneous Estimation and Optimization of Nonlinear Problems

O. ERHUN KUNDAKCIOGLU, SAED ALIZAMIR Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA

References

Article Outline

1. Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press, Cambridge 2. Cottle RW, Dantzig G (1968) Complementarity pivot theory of mathematical programming. Lin Algebra Appl 1:103– 125 3. Cottle RW, Pang J-S, Stone RE (1992) The Linear Complementarity Problem. Academic Press, Inc., San Diego 4. Kojima M, Megiddo N, Ye Y (1992) An Interior point potential reduction algorithm for the linear complementarity problem. Math Programm 54:267–279 5. Lemke CE (1965) Bimatrix Equilibrium Points and Mathematical Programming. Manag Sci 11:123–128 6. Lemke CE, Howson JT (1964) Equilibrium points of bimatrix games. SIAM J Appl Math 12:413–423 7. Mangasarian OL (1979) Simplified characterizations of linear complementarity problems solvable as linear programs. Math Programm 10(2):268–273 8. Mangasarian OL (1976) Linear complementarity problems solvable by a single linear program. Math Programm 10:263–270 9. Mangasarian OL (1978) Characterization of linear co complementarity problems as linear program. Math Programm 7:74–87 10. Ferris MC, Sinapiromsaran K (2000) Formulating and Solving Nonlinear Programs as Mixed Complementarity Problems. In: Nguyen VH, Striodot JJ, Tossing P (eds) Optimization. Springer, Berlin, pp 132–148 11. Patrizi G (1991) The Equivalence of an LCP to a Parametric Linear program with a Scalar Parameter. Eur J Oper Res 51:367–386 12. Vavasis S (1991) Nonlinear Optimization: Complexity Issues. Oxford University Press, Oxford 13. Ye Y (1991) An O(n3 L) Potential Reduction Algorithm for linear Programming. Math Programm 50:239–258 14. Ye Y (1992) On affine scaling algorithms for nonconvex quadratic programming. Math Programm 56:285–300 15. Ye Y (1993) A fully polynomial-time approximation algorithm for computing a stationary point of the general linear complementarity problem. Math Oper Res 18:334–345 16. Ye Y, Pardalos PM (1991) A Class of Linear Complementarity Problems Solvable in Polynomial Time. Lin Algebra Appl 152:3–17 17. Ye Y (1997) Interior Point Algorithms: Theory and Analysis. Wiley, New York

Introduction Extensions

Generalized Assignment Problem

MSC2000: 90-00

Multiple-Resource Generalized Assignment Problem Multilevel Generalized Assignment Problem Dynamic Generalized Assignment Problem Bottleneck Generalized Assignment Problem Generalized Assignment Problem with Special Ordered Set Stochastic Generalized Assignment Problem Bi-Objective Generalized Assignment Problem Generalized Multi-Assignment Problem

Methods Exact Algorithms Heuristics

Conclusions References Introduction The generalized assignment problem (GAP) seeks the minimum cost assignment of n tasks to m agents such that each task is assigned to precisely one agent subject to capacity restrictions on the agents. The formulation of the problem is: min

n m X X

ci j xi j

(1)

iD1 jD1

subject to

n X

ai j xi j  bi

i D 1; : : : ; m

(2)

jD1 m X

xi j D 1

j D 1; : : : ; n

(3)

iD1

x i j 2 f0; 1g j D 1; : : : ; n

i D 1; : : : ; m;

(4)

where c i j is the cost of assigning task j to agent i, a i j is the capacity used when task j is assigned to agent i, and b i is the available capacity of agent i. Binary variable x i j equals 1 if task j is assigned to agent i, and 0

1153

1154

G

Generalized Assignment Problem

otherwise. Constraints 3 are usually referred to as the semi-assignment constraints. The formulation above was first studied by Srinivasan and Thompson [80] to solve a transportation problem. The term generalized assignment problem for this setting was introduced by Ross and Soland [74]. This model is a generalization of previously proposed model by DeMaio and Roveda [17] where the capacity absorption is agent independent (i. e., a i j D a j ; 8i). The classical assignment problem, which provides a one to one pairing of agents and tasks, can be solved in polynomial time [47]. However, in GAP, an agent may be assigned to multiple tasks ensuring each task is performed exactly once, and the problem is N P hard [28]. Even the GAP with agent-independent requirements is an N P -hard problem [23,53]. The GAP has a wide spectrum of application areas ranging from scheduling (see [19,84]) and computer networking (see [5]) to lot sizing (see [31]) and facility location (see [7,30,74,75]). Nowakovski et al. [64] study the ROSAT space telescope scheduling where the problem is formulated as a GAP and heuristic methods are proposed. Multiperiod single-source problem (MPSSP) is reformulated as a GAP by Freling et al. [25]. Janak et al. [38] reformulate the NSF panel-assignment problem as a multiresource preference-constrained GAP. Other applications of GAP include lump sum capital rationing, loading in flexible manufacturing systems (see [45]), p-median location (see [7,75]), maximal covering location (see [42]), cell formation in group technology (see [79]), refueling nuclear reactors (see [31]), R & D planning (see [92]), and routing (see [22]). A summary of applications and assignment model components can be found in [76]. Extensions Multiple-Resource Generalized Assignment Problem Proposed by Gavish and Pirkul [29], multi-resource generalized assignment problem (MRGAP) is a special case of the multi-resource weighted assignment model that is previously studied by Ross and Zoltners [76]. In MRGAP a set of tasks has to be assigned to a set of agents in a way that permits assignment of multiple tasks to an agent subject to a set of resource constraints. This problem differs from the GAP in that, an agent consumes a variety of resources in perform-

ing the tasks assigned to it. Although most of the problems can be modeled as GAP, multiple resource constraints are frequently required in the effective modeling of real life problems. MRGAP may be encountered in large models dealing with processor and database location in distributed computer systems, trucking industry, telecommunication network design, cargo loading on ships, warehouse design and work load planning in job shops. Gavish and Pirkul [29] introduce and compare various Lagrangian relaxations of the problem and suggest heuristic solution procedures. They design an exact algorithm by incorporating one of these heuristics along with a branch-and-bound procedure. Mazzola and Wilcox [58] modify Gavish and Pirkul heuristic and develop a hybrid heuristic for MRGAP. Their algorithm defines a three phase heuristic which first constructs a feasible solution and then systematically tries to improve the solution. As an enhanced version of MRGAP, Janak et al. [38] study the NSF panel-assignment problem. In this setting, each task (i. e., proposal) has a specific number of agents (i. e., reviewers) assigned to it and each agent has a lower and upper bound on the number of tasks that can be done. The objective is to optimize the sum of a set of preference criteria for each agent on each task while ensuring that each agent is assigned to approximately the same number of tasks. Multilevel Generalized Assignment Problem The Multilevel Generalized Assignment Problem (MGAP) is first introduced by Glover et al. [31] to provide a model for the allocation of tasks in a manufacturing environment. MGAP differs from the classical GAP in that, agents can perform tasks at different efficiency levels, implying both different costs and different resource requirements. Each task must be assigned to one and only one agent at a level and each agent has limited amount of single resource. Important manufacturing problems, such as lot sizing, can be formulated as MGAP. Laguna et al. [46] use a neighborhood structure for defining moves based on ejection chains and develop a Tabu Search (TS) algorithm for this problem. French and Wilson [26] develop two heuristic solution methods for MGAP from the solution methods

Generalized Assignment Problem

for GAP. Procedures for deriving an upper bound on the solution of the problem are also described. Ceselli and Righini [11] present a branch-and-price algorithm based on decomposition of the MGAP into a master problem and a pricing sub-problem, where the former is a set-partitioning problem and the latter is a multiple-choice knapsack problem. This algorithm is the first exact method proposed in the literature for the MGAP. To provide a flexible assignment tool to the decision maker, Hajri-Gabouj [37] develops a fuzzy genetic multi-objective optimization algorithm to solve a nonlinear MGAP.

G

to an agent is minimized. Min-sum objective functions are commonly used in private sector applications, while min-max objective function can be applied to the public sector. BGAP has several important applications in scheduling and allocation problems. Mazzola and Neebe [57] propose two min-max formulations for the GAP: the Task BGAP and the Agent BGAP. Martello and Toth [56] present an exact branch-andbound algorithm as well as approximate algorithms for BGAP. They introduce relaxations and produce, as sub-problems, min-max versions of the multiplechoice knapsack problem which can be solved in polynomial time.

Dynamic Generalized Assignment Problem In The Gap Model, the sequence in which the agent performs the tasks is not considered. This sequence is essential when each task is performed to meet a demand and earliness or tardiness incurs additional cost. Dynamic generalized assignment problem (DGAP) is suggested to track customer demand while assigning tasks to agents. Kogan et al. [44], for the first time, add the impact of time to the GAP model assuming that each task has a due date. They formulate the continuoustime optimal control model of the problem and derive analytical properties of the optimal behavior of such a dynamic system. Based on those properties, an efficient time-decomposition procedure is developed. Kogan et al. [43] extend the DGAP to cope with stochastic environment and multiple agent-task relationships. They prove that this stochastic, continuoustime generalized assignment problem is strongly N P -hard and reduce the model to a number of classical deterministic assignment problems stated at discrete time points. A pseudo-polynomial time combinatorial algorithm is developed to approximate the solution. The well-known application of such a generalization is found in the stochastic environment of the flow shop scheduling of parallel workstations and flexible manufacturing cells as well as dynamic inventory management. Bottleneck Generalized Assignment Problem Bottleneck generalized assignment problem (BGAP), is the min-max version of the well-known (min-sum) generalized assignment problem. In the BGAP, the maximum penalty incurred by assigning each task

Generalized Assignment Problem with Special Ordered Set GAP is further generalized to include cases where items may be shared by a pair of adjacent knapsacks. This problem is called the generalized assignment problem with special ordered sets of type 2 (GAPS2). In other words, GAPS2 is the problem of allocating tasks to time-periods, where each task must be assigned to a time-period, or shared between two consecutive timeperiods. Farias et al. [15] introduce this problem which can also be applied to production scheduling. They study the polyhedral structure of the convex hull of the feasible space, develop three families of facet-defining valid inequalities, and show that these inequalities cut off all infeasible vertices of the LP relaxation. A branchand-cut procedure is described and facet-defining valid inequalities are used as cuts. Wilson [86] modifies and extends a heuristic algorithm developed previously for the GAP problem to solve GAPS2. He argues that, any feasible solution to GAP is a feasible solution to GAPS2, hence a heuristic algorithm for GAP can also be used as a heuristic algorithm to GAPS2. A solution produced by a GAP heuristic will be close to GAPS2 optimality if it is close to the LP relaxation bound of GAP. The heuristic uses a series of moves starting from an infeasible, but in some senses optimal solution and then attempts to restore feasibility with minimal degradation to the objective function value. An existing upper bound for GAP is also generalized to be used for GAPS2. French and Wilson [27] develop an LP-based heuristic procedure to solve GAPS2. They modify a heuristic for GAP to be used for GAPS2 and show

1155

1156

G

Generalized Assignment Problem

that, while Wilson [86] heuristic is straightforward for large instances of the problem, and Farias et al. [15] solve smaller instances of the problem by an exact method, their heuristic solves fairly large instances of the problem rapidly and with a consistently high degree of solution quality. Stochastic Generalized Assignment Problem In GAP, stochasticity may arise because the actual amount of resource needed to process the tasks by the different agents may not be known in advance or the presence or absence of individual tasks may be uncertain. In such cases, there is a set of potential tasks in which, each task may or may not require to be processed. Dyer and Frieze [20], analyze the generalized assignment problem under the assumption that all coefficients are drawn uniformly and independently from [0; 1] interval. Romeijn and Piersma [72] analyze a probabilistic version of GAP as the number of tasks goes to infinity while the number of machines remains fixed. Their model is different from Dyer and Frieze [20] since it doesn’t have the additional assumptions that the cost and resource requirement parameters are independent of each other and among machines. They first derive a tight condition on the probabilistic model of the parameters under which, the corresponding instances of the GAP are feasible with probability one. Next, under an additional sufficient condition, the optimal solution value of the GAP is characterized through a limiting value. It is shown that the optimal solution value, normalized by dividing by the number of tasks, converges with probability one to this limiting value. Toktas et al. [82], consider the uncertain capacities situation and derive two alternative approaches to utilize deterministic solution strategies while addressing capacity uncertainty. Albareda-Sambola et al. [1] assume that a random subset of the tasks would require to be actually processed. Tasks are interpreted as customers that may or may not require a service. They construct a convex approximation of the objective function and present three versions of an exact algorithm to solve this problem based on branch-and-bound techniques, optimality cuts, and a special purpose lower bound. An assignment of tasks can be modified once the actual demands are known. Different penalties are paid for reassigning

tasks and for leaving unprocessed tasks with positive demand. Bi-Objective Generalized Assignment Problem Zhang and Ong [91] consider the GAP from a multiobjective point of view, and propose an LP-based heuristic to solve the bi-objective generalized assignment problem (BiGAP). In BiGAP, each assignment has two attributes that are to be considered. For example, in production planning, these attributes may be the cost and the time caused by assigning jobs to machines. Generalized Multi-Assignment Problem Proposed by Park Et Al. [66], the generalized multiassignment problem (GMAP) consists of tasks that may be required to be duplicated at several agents. In other words, each task is assigned to r j agents instead of one. Park et al. [66] develop a Lagrangian dual ascent algorithm for the GMAP that is combined with the subgradient search and used as a lower bounding scheme for the branch-and-bound procedure. Methods Determining whether an instance of a GAP has a feasible solution is an N P -complete problem. Hence, unless P D N P , GAP admits no polynomial-time approximation algorithm with fixed worst-case performance ratio. Nevertheless there are numerous approximation algorithms for GAP in the literature which actually address a different setting where the available agent capacities are not fixed and the weighted sum of cost and available agent capacities is minimized. For some of these algorithms, a feasible solution is required as an input. For details, see [14,24,65,78]. Excluding this setting for GAP, the solution approaches proposed in the literature are either exact algorithms or heuristics. For expository surveys on the algorithms, see [10,54,60]. Exact Algorithms The optimal solution to the GAP is obtained using an implicit enumerative procedure either via branchand-bound scheme or branch-and-price scheme in the literature. Branch-and-bound method consists of an upper bounding procedure, a lower bounding procedure, a branching strategy, and a searching strategy. It

Generalized Assignment Problem

is known that good bounding procedures are crucial steps in branch-and-bound method. Branch-and-price proceeds similar to branch-and-bound but obtains the bounds by solving the LP-relaxations of the subproblems by column generation. For more details on the valid inequalities and facets for the GAP that are used in the solution procedures, see [16,32,33,40,55,67]. The first branch-and-bound algorithm for the GAP is proposed by Ross and Soland [74]. Considering a minimization problem, they obtain the lower bounds by relaxing the capacity constraints. Martello and Toth [53] propose removing the semi-assignment constraints where the problem decomposes into a series of knapsack problems. Due to the quality of the bounds obtained, this algorithm is frequently used in the literature for benchmarking purposes. Chalmet and Gelders [12] introduce the Lagrangian relaxation of the semi-assignment constraints. Fisher et al. [23] use this technique with multipliers set by a heuristic adjustment method to obtain the lower bounds in the branch-andbound procedure. Tighter bounds resulted from this method, significantly reduce the solution time. Guignard and Rosenwein [34] design a branch-and-bound algorithm with an enhanced Lagrangian dual ascent procedure that solves a Lagrangian dual at each enumeration node and adds a surrogate constraint to the Lagrangian relaxed model. This algorithm effectively solves generalized assignment problems with up to 500 variables. Drexl [19] presents a hybrid branch-andbound/dynamic programming algorithm where the upper bounds are obtained via an efficient Monte Carlo type heuristic. Numerous lower bounds are proposed and their benchmark results are presented. Nauss [62] proposes a branch-and-bound algorithm where linear programming cuts, Lagrangian relaxation, and subgradient optimization are used to derive good lower bounds; feasible-solution generators with the heuristic proposed by Ronen [73] are used to derive good upper bounds. Nauss [63] uses similar branch-and-bound techniques to solve the elastic generalized assignment problem (EGAP) as well. The first branch-and-price algorithm for the generalized assignment problem is proposed by Savelsbergh [77]. A combination of the algorithms proposed by Martello and Toth [53] and Jörnsten and Nasberg [39] is used to calculate the upper bound and the pricing problem is proved to be a knapsack problem.

G

Barnhart et al. [6] reformulate the GAP by applying Dantzig-Wolfe decomposition to obtain a tighter LP relaxation. In order to solve the LP relaxation of the reformulated problem, pricing is done by solving a series of knapsack problems. Pigatti et al. [67] propose a branch-and-cut-and-price algorithm with a stabilization mechanism to speed up the pricing convergence. Ceselli and Righini [11] present a branch-and-price algorithm for multilevel generalized assignment problem that is based on decomposition and a pricing subproblem that is a multiple-choice knapsack problem. Heuristics Large instances of the GAP are computationally intractable due to the N P -hardness of the problem. This calls for heuristic approaches whose benefits are twofold; they can be used as stand-alone algorithms to obtain good solutions within reasonable time and they can be used to obtain the upper bounds in exact solution methods such as the branch-and-bound procedure. Although the variety among the heuristics is high, they mostly fall into one of the following two categories: greedy heuristics and meta-heuristics. Klastorin [41] proposes a two phase heuristic algorithm for solving the GAP. In phase one, the algorithm employs a modified subgradient algorithm to search for the optimal dual solution and in phase two, a branchand-bound approach is used to search the neighborhood of the solution obtained in phase one. Cattrysse et al. [9] use column generation techniques to obtain upper and lower bounds. In their method, a column represents a feasible assignment of a subset of tasks to a single agent. The master problem is formulated as a set partitioning problem. New columns are added to the master problem by solving a knapsack problem for each agent. LP relaxation of the set partitioning problem is solved by a dual ascent procedure. Martello and Toth [54] present a greedy heuristic that assigns the jobs to machines based on a desirability factor. This factor is defined as the difference between the largest and second largest weight factors. The algorithm iteratively considers, among the unassigned jobs, the one having the highest desirability factor (or regret factor) and assigns it to its maximum profit agent. This iterative process establishes an initial solution which would be improved in the next step of the algorithm

1157

1158

G

Generalized Assignment Problem

by simple interchange arguments. This heuristic can be used in a problem size reduction procedure by fixing variables to one or to zero. Relaxation heuristics are developed by Lorena and Narciso [49] for maximization version of GAP. Feasible solutions are obtained by a subgradient search in a Lagrangian or surrogate relaxation. Six different heuristics are derived particularizing relaxation, the step size in the subgradient search and the method used to obtain the feasible solution. In a Lagrangian heuristic for GAP, Haddadi [35] introduces a substitution variable in the model which is defined as the multiplication of the original variables by their corresponding constraint coefficients. The constraints defining these new variables are then dualized in the Lagrangian relaxation of the problem and the resulted relaxation is decomposed into two subproblems: the knapsack problem and the transportation problem. Narciso and Lorena [61] use relaxation multipliers with efficient constructive heuristics to find good feasible solutions. A breadth-first branch-and-bound algorithm is described by Haddadi and Ouzia [36] in which a standard subgradient approach is used in each node of the decision tree to solve the Lagrangian dual and to obtain an upper bound. The main contribution in this study is a new heuristic that is applied to exploit the solution of the relaxed problem by solving a GAP of smaller size. Romeijn and Romero Morales [70] study the optimal value function from a probabilistic point of view and develop a class of greedy algorithms. A family of weight functions is designed to measure desirability of assigning each job to a machine which is used by the greedy algorithms. They derive conditions under which their algorithm is asymptotically optimal in a probabilistic sense. Meta-heuristics are widely used to solve GAP in the literature. They are either adapted by themselves for GAP or are used in combination with other heuristics and meta-heuristics. Variable depth search heuristic (VDSH) is a generalization of local search in which the size of the neighborhood adaptively changes to traverse a larger search space. VDSH is a two phase algorithm. In the first phase, an initial solution is developed and a lower bound is obtained. In the second phase, a nested iterative refinement process is applied to improve the quality of the solution. VDSH is introduced by Amini and

Racer [2] to solve the GAP. In their method, the improvement phase consists of a two level nested loop. The major iteration creates an action set corresponding to each neighborhood structure alternative. Possible neighborhood structures for GAP are: reassign (shift) a task from one agent to another, swap the assignment of two tasks, and permute the assignment of a subset of the tasks. Then, a subsequence of operations that achieves the highest saving is obtained through performing some minor iterations. A new solution is established based on that and another major operation starts. Amini and Racer [3] develop a hybrid heuristic (HH) around the two well known heuristics: VDSH (see [2,69]) and Heuristic GAP (HGAP) (see [54]). Previous studies show that HGAP dominates VDSH in terms of solution time, while VDSH obtains solutions of better quality within reasonable time. A computational comparison is conducted with the leading alternative heuristic approaches. Another hybrid approach is by Lourenço and Serra [52] where a MAX-MIN Ant System (MMAS) (see [81]) is applied with GRASP for the GAP. Yagiura et al. [90] propose a variable depth search (VDS) method for GAP. Their method alternates between shift and swap moves to explore the solution space. The main aspect of their method is that, infeasible solutions are allowed to be considered. However in some of the problem instances, the feasible space is small or contains many small separate regions and the efficiency of the algorithm is affected. In another study, Yagiura et al. [89] improve VDS by incorporating branching search processes to construct the neighborhoods. They show that appropriate choices of branching strategies can improve the performance of VDS. Lin et al. [48] make further observations on the VDSH method through a series of computational experiments. They consider six greedy strategies for generating the initial feasible solution and designed several simplified strategies for the improvement phase of the method. Osman [68] develops a hybrid heuristic which combines simulated annealing and tabu search. This algorithm takes advantage of the non-monotonic oscillation strategy of tabu search as well as the simulated annealing philosophy. Yagiura et al. [87] propose a tabu search algorithm for GAP which utilizes an ejection chain approach. An

Generalized Assignment Problem

ejection chain is an embedded neighborhood construction that compounds simple moves to create more complex and powerful moves. The chain considered in their study is a sequence of shift moves in which every two successive moves share a common agent. Searching into the infeasible region is allowed incurring a penalty proportional to the degree of infeasibility. An adaptive adjustment mechanism is incorporated for determining appropriate values of the parameters to control their influence on the problem. Yagiura et al. [88] improve their previous method by adding a path relinking approach which is a mechanism for generating new solutions by combining two or more reference solutions. The main difference of this method with the previous one is the way it generates starting solutions for ejection chains. It is shown that, by this simple change in the algorithm, the improvement in its performance is drastic. Asahiro et al. [4] develop two parallel heuristic algorithms based on the ejection chain local search (EC) presented by Yagiura et al. [87]. One is a simple parallelization called multi-start parallel EC (MPEC) and the other one is cooperative parallel EC (CPEC). In MPEC, each search process independently explores search space while in CPEC search processes share partial information to cooperate with each other. They show that their proposed algorithms outperform EC by Yagiura [87]. Diaz and Fernandez [18], devise a flexible tabu search algorithm for GAP. Allowing the search to explore infeasible region and adaptively modification of the objective function are the sources of flexibility. The modification of the objective function is caused by the dynamic adjustment of the weight of the penalty incurred for violating feasibility. The main difference of this method with the tabu search method of Yagiura et al. [87,88] in exploring the infeasible region is that, in this method, no solution is qualitatively preferred to others in terms of its structure. Chu and Beasley [13] develop a genetic algorithm for GAP that incorporates a fitness-unfitness pair evaluation function as a representation scheme. This algorithm uses a heuristic to improve the cost and feasibility. Feltl and Raidl [21] add new features to this algorithm including two alternative initialization heuristics, a modified selection and replacement scheme for handling infeasible solutions more appropriately and a heuristic mutation operator.

G

Wilson [85] proposes another algorithm for GAP which is operating in a dual sense. Instead of genetically improving a set of feasible solutions as in a regular GA, this algorithm tries to genetically restore feasibility to a set of near optimal ones. The method starts with potentially optimal but infeasible solutions and then improves feasibility while keeping optimality. When the feasible solution is obtained, the algorithm uses local search procedures to improve the solution. Lorena et al. [50] propose a constructive genetic algorithm (CGA) for GAP. In CGA, unlike classical GA, problems are modeled as bi-objective optimization problems, which consider the evaluation of two fitness functions. The evolution process is conducted to attain the two objectives conserving schemata that survive to an adaptive threshold test. The CGA algorithm has some new features compared to GA including population formation by schemata, recombination among schemata, dynamic population, mutation in structure and the possibility of using heuristics in schemata and/or structure representation. Lourenço and Serra [51] present two metaheuristic algorithms for GAP. One is a MIN-MAX ant system which is combined with local search and tabu search heuristics. The other one is a greedy randomized adaptive search heuristic (GRASP) studied with several neighborhoods. Both of these algorithms consist of three main steps: (i) constructing a solution by either a greedy randomized or an ant system approach, (ii) improving these initial solutions by applying local search and a tabu search, (iii) updating the parameters. These three steps are repeated until a stopping criterion is verified. Monfared and Etemadi [59] use a neural network based approach for solving the GAP. They investigate four different methods to structure the energy function of the neural network: exterior penalty function, augmented Lagrangian, dual Lagrangian and interior penalty function. They show that augmented Lagrangian can produce superior results with respect to feasibility and integrality while maintaining feasibility and stability measures. Problem generators and benchmark instances play an important role in comparing/developing new methods. Romeijn and Romero Morales [71] propose a new stochastic model for the GAP which can be used to analyze the random generators in the literature. They com-

1159

1160

G

Generalized Assignment Problem

pare the random generators by Ross and Soland [74], Martello and Toth [53], Trick [83], Chalmet and Gelders [12], Racer and Amini [69] and conclude these random generators are not adequate because they tend to generate easier problem instances when the number of machines increases. Cario et al. [8] compare GAP instances generated under two correlation-induction strategies. Using two exact and four heuristic algorithms from the literature, they show how solutions are affected by the correlation between costs and the resource requirements. Conclusions This review presents the applications, extensions, and solution methods for the generalized assignment problem. As the GAP receives more attention, it will be more likely to see large sets of classical benchmark instances and comparative results on solution approaches. References 1. Albareda-Sambola M, van der Vlerk MH, Fernandez E (2006) Exact solutions to a class of stochastic generalized assignment problems. Eur J Oper Res 173:465–487 2. Amini MM, Racer M (1994) A rigorous computational comparison of alternative solution methods for the generalized assignment problem. Manag Sci 40(7):868–890 3. Amini MM, Racer M (1995) A hybrid heuristic for the generalized assignment problem. Eur J Oper Res 87(2):343–348 4. Asahiro Y, Ishibashi M, Yamashita M (2003) Independent and cooperative parallel search methods for the generalized assignment problem. Optim Method Softw 18:129– 141 5. Balachandran V (1976) An integer generalized transportation model for optimal job assignment in computer networks. Oper Res 24(4):742–759 6. Barnhart C, Johnson EL, Nemhauser GL, Savelsbergh MWP, Vance PH (1998) Branch-and-price: column generation for solving huge integer programs. Oper Res 46(3):316–329 7. Beasley JE (1993) Lagrangean heuristics for location problems. Eur J Oper Res 65:383–399 8. Cario MC, Clifford JJ, Hill RR, Yang J, Yang K, Reilly CH (2002) An investigation of the relationship between problem characteristics and algorithm performance: a case study of the gap. IIE Trans 34:297–313 9. Cattrysse DG, Salomon M, Van LN Wassenhove (1994) A set partitioning heuristic for the generalized assignment problem. Eur J Oper Res 72:167–174 10. Cattrysse DG, Van LN Wassenhove (1992) A survey of algorithms for the generalized assignment problem. Eur J Oper Res 60:260–272

11. Ceselli A, Righini G (2006) A branch-and-price algorithm for the multilevel generalized assignment problem. Oper Res 54:1172–1184 12. Chalmet L, Gelders L (1976) Lagrangean relaxation for a generalized assignment type problem. In: Advances in OR. EURO, North Holland, Amsterdam, pp 103–109 13. Chu EC, Beasley JE (1997) A genetic algorithm for the generalized assignment problem. Comput Oper Res 24:17–23 14. Cohen R, Katzir L, Raz D (2006) An efficient approximation for the generalized assignment problem. Inf Process Lett 100:162–166 15. de Farias Jr, Johnson EL, Nemhauser GL (2000) A generalized assignment problem with special ordered sets: a polyhedral approach. Math Program, Ser A 89:187–203 16. de Farias Jr, Nemhauser GL (2001) A family of inequalities for the generalized assignment polytope. Oper Res Lett 29:49–55 17. DeMaio A, Roveda C (1971) An all zero-one algorithm for a class of transportation problems. Oper Res 19:1406–1418 18. Diaz JA, Fernandez E (2001) A tabu search heuristic for the generalized assignment problem. Eur J Oper Res 132:22–38 19. Drexl A (1991) Scheduling of project networks by job assignment. Manag Sci 37:1590–1602 20. Dyer M, Frieze A (1992) Probabilistic analysis of the generalised assignment problem. Math Program 55:169–181 21. Feltl H, Raidl GR (2004) An improved hybrid genetic algorithm for the generalized assignment problem. In: SAC ’04; Proceedings of the 2004 ACM symposium on Applied computing. ACM Press, New York, pp 990–995 22. Fisher ML, Jaikumar R (1981) A generalized assignment heuristic for vehicle routing. Netw 11:109–124 23. Fisher ML, Jaikumar R, van Wassenhove LN (1986) A multiplier adjustment method for the generalized assignment problem. Manag Sci 32:1095–1103 24. Fleischer L, Goemans MX, Mirrokni VS, Sviridenko M (2006) Tight approximation algorithms for maximum general assignment problems. In SODA ’06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm. ACM Press, New York, pp 611–620 25. Freling R, Romeijn HE, Morales DR, Wagelmans APM (2003) A branch-and-price algorithm for the multiperiod singlesourcing problem. Oper Res 51(6):922–939 26. French AP, Wilson JM (2002) Heuristic solution methods for the multilevel generalized assignment problem. J Heuristics 8:143–153 27. French AP, Wilson JM (2007) An lp-based heuristic procedure for the generalized assignment problem with special ordered sets. Comput Oper Res 34:2359–2369 28. Garey MR, Johnson DS (1990) Computers and Intractability; A Guide to the Theory of NP-Completeness. Freeman, New York 29. Gavish B, Pirkul H (1991) Algorithms for the multi-resource generalized assignment problem. Manag Sci 37:695–713

Generalized Assignment Problem

30. Geoffrion AM, Graves GW (1974) Multicommodity distribution system design by benders decomposition. Manag Sci 20(5):822–844 31. Glover F, Hultz J, Klingman D (1979) Improved computer based planning techniques, part ii. Interfaces 4:17–24 32. Gottlieb ES, Rao MR (1990) (1; k)-configuration facets for the generalized assignment problem. Math Program 46(1):53–60 33. Gottlieb ES, Rao MR (1990) The generalized assignment problem: Valid inequalities and facets. Math Stat 46:31–52 34. Guignard M, Rosenwein MB (1989) An improved dual based algorithm for the generalized assignment problem. Oper Res 37(4):658–663 35. Haddadi S (1999) Lagrangian decomposition based heuristic for the generalized assignment problem. Inf Syst Oper Res 37:392–402 36. Haddadi S, Ouzia H (2004) Effective algorithm and heuristic for the generalized assignment problem. Eur J Oper Res 153:184–190 37. Hajri-Gabouj S (2003) A fuzzy genetic multiobjective optimization algorithm for a multilevel generalized assignment problem. IEEE Trans Syst 33:214–224 38. Janak SL, Taylor MS, Floudas CA, Burka M, Mountziaris TJ (2006) Novel and effective integer optimization approach for the nsf panel-assignment problem: a multiresource and preference-constrained generalized assignment problem. Ind Eng Chem Res 45:258–265 39. Jörnsten K, Nasberg M (1986) A new lagrangian relaxation approach to the generalized assignment problem. Eur J Oper Res 27:313–323 40. Jörnsten KO, Varbrand P (1990) Relaxation techniques and valid inequalities applied to the generalized assignment problem. Asia-P J Oper Res 7(2):172–189 41. Klastorin TD (1979) An effective subgradient algorithm for the generalized assignment problem. Comp Oper Res 6:155–164 42. Klastorin TD (1979) On the maximal covering location problem and the generalized assignment problem. Manag Sci 25(1):107–112 43. Kogan K, Khmelnitsky E, Ibaraki T (2005) Dynamic generalized assignment problems with stochastic demands and multiple agent task relationships. J Glob Optim 31:17–43 44. Kogan K, Shtub A, Levit VE (1997) Dgap – the dynamic generalized assignment problem. Ann Oper Res 69:227–239 45. Kuhn H (1995) A heuristic algorithm for the loading problem in flexible manufacturing systems. Int J Flex Manuf Syst 7:229–254 46. Laguna M, Kelly JP, Gonzfilez-Velarde JL, Glover F (1995) Tabu search for the multilevel generalized assignment problem. Eur J Oper Res 82:176–189 47. Lawler E (1976) Combinatorial Optimization: Networks and Matroids. Holt, Rinehart, Winston, New York 48. Lin BMT, Huang YS, Yu HK (2001) On the variable-depthsearch heuristic for the linear-cost generalized assignment problem. Int J Comput Math 77:535–544

G

49. Lorena LAN, Narciso MG (1996) Relaxation heuristics for a generalized assignment problem. Eur J Oper Res 91:600– 610 50. Lorena LAN, Narciso MG, Beasley JE (2003) A constructive genetic algorithm for the generalized assignment problem. J Evol Optim 51. Lourenço HR, Serra D (1998) Adaptive approach heuristics for the generalized assignment problem. Technical Report 288, Department of Economics and Business, Universitat Pompeu Fabra, Barcelona 52. Lourenço HR, Serra D (2002) Adaptive search heuristics for the generalized assignment problem. Mathw Soft Comput 9(2–3):209–234 53. Martello S, Toth P (1981) An algorithm for the generalized assignment problem. In: Brans JP (ed) Operational Research ’81, 9th IFORS Conference, North-Holland, Amsterdam, pp 589–603 54. Martello S, Toth P (1990) Knapsack Problems: Algorithms and Computer Implementations. Wiley, New York 55. Martello S, Toth P (1992) Generalized assignment problems. Lect Notes Comput Sci 650:351–369 56. Martello S, Toth P (1995) The bottleneck generalized assignment problem. Eur J Oper Res 83:621–638 57. Mazzola JB, Neebe AW (1988) Bottleneck generalized assignment problems. Eng Costs Prod Econ 14(1):61–65 58. Mazzola JB, Wilcox SP (2001) Heuristics for the multiresource generalized assignment problem. Nav Res Logist 48(6):468–483 59. Monfared MAS, Etemadi M (2006) The impact of energy function structure on solving generalized assignment problem using hopfield neural network. Eur J Oper Res 168:645–654 60. Morales DR, Romeijn HE (2005) Handbook of Combinatorial Optimization, supplement vol B. In: Du D-Z, Pardalos PM (eds) The Generalized Assignment Problem and extensions. Springer, New York, pp 259–311 61. Narciso MG, Lorena LAN (1999) Lagrangean/surrogate relaxation for generalized assignment problems. Eur J Oper Res 114:165–177 62. Nauss RM (2003) Solving the generalized assignment problem: an optimizing and heuristic approach. INFORMS J Comput 15(3):249–266 63. Nauss RM (2005) The elastic generalized assignment problem. J Oper Res Soc 55:1333–1341 64. Nowakovski J, Schwarzler W, Triesch E (1999) Using the generalized assignment problem in scheduling the rosat space telescope. Eur J Oper Res 112:531–541 65. Nutov Z, Beniaminy I, Yuster R (2006) A (1  1/e)-approximation algorithm for the generalized assignment problem. Oper Res Lett 34:283–288 66. Park JS, Lim BH, Lee Y (1998) A lagrangian dual-based branch-and-bound algorithm for the generalized multiassignment problem. Manag Sci 44(12S):271–275 67. Pigatti A, de Aragao MP, Uchoa E (2005) Stabilized branchand-cut-and-price for the generalized assignment prob-

1161

1162

G 68.

69.

70.

71.

72.

73. 74.

75.

76. 77.

78.

79.

80.

81.

82.

83. 84. 85.

Generalized Benders Decomposition

lem. In: Electronic Notes in Discrete Mathematics, vol 19 of 2nd Brazilian Symposium on Graphs, Algorithms and Combinatorics, pp 385–395, Osman IH (1995) Heuristics for the generalized assignment problem: simulated annealing and tabu search approaches. OR-Spektrum 17:211–225 Racer M, Amini MM (1994) A robust heuristic for the generalized assignment problem. Ann Oper Res 50(1):487– 503 Romeijn HE, Morales DR (2000) A class of greedy algorithms for the generalized assignment problem. Discret Appl Math 103:209–235 Romeijn HE, Morales DR (2001) Generating experimental data for the generalized assignment problem. Oper Res 49(6):866–878 Romeijn HE, Piersma N (2000) A probabilistic feasibility and value analysis of the generalized assignment problem. J Comb Optim 4:325–355 Ronen D (1992) Allocation of trips to trucks operating from a single terminal. Comput Oper Res 19(5):445–451 Ross GT, Soland RM (1975) A branch and bound algorithm for the generalized assignment problem. Math Program 8:91–103 Ross GT, Soland RM (1977) Modeling facility location problems as generalized assignment problems. Manag Sci 24:345–357 Ross GT, Zoltners AA (1979) Weighted assignment models and their application. Manag Sci 25(7):683–696 Savelsbergh M (1997) A branch-and-price algorithm for the generalized assignment problem. Oper Res 45:831– 841 Shmoys DB, Tardos E (1993) An approximation algorithm for the generalized assignment problem. Math Program 62:461–474 Shtub A (1989) Modelling group technology cell formation as a generalized assignment problem. Int J Prod Res 27:775–782 Srinivasan V, Thompson GL (1973) An algorithm for assigning uses to sources in a special class of transportation problems. Oper Res 21(1):284–295 Stützle T, Hoos H (1999) The Max-Min Ant System and Local Search for Combinatorial Optimization Problems. In: Voss S, Martello S, Osman IH, Roucairol C (eds) Meta-heuristics; Advances and trends in local search paradigms for optimization. Kluwer, Boston, pp 313–329 Toktas B, Yen JW, Zabinsky ZB (2006) Addressing capacity uncertainty in resource-constrained assignment problems. Comput Oper Res 33:724–745 Trick M (1992) A linear relaxation heuristic for the generalized assignment problem. Nav Res Logist 39:137–151 Trick MA (1994) Scheduling multiple variable-speed machines. Oper Res 42(2):234–248 Wilson JM (1997) A genetic algorithm for the generalised assignment problem. J Oper Res Soc 48:804–809

86. Wilson JM (2005) An algorithm for the generalized assignment problem with special ordered sets. J Heuristics 11:337–350 87. Yagiura M, Ibaraki T, Glover F (2004) An ejection chain approach for the generalized assignment problem. INFORMS J Comput 16:133–151 88. Yagiura M, Ibaraki T, Glover F (2006) A path relinking approach with ejection chains for the generalized assignment problem. Eur J Oper Res 169:548–569 89. Yagiura M, Yamaguchi T, Ibaraki T (1998) A variable depth search algorithm with branching search for the generalized assignment problem. Optim Method Softw 10:419– 441 90. Yagiura M, Yamaguchi T, Ibaraki T (1999) A variable depth search algorithm for the generalized assignment problem. In: Voss S, Martello S, Osman IH, Roucairol C (eds) Metaheuristics; Advances and Trends in Local Search paradigms for Optimization, Kluwer, Boston, pp 459–471 91. Zhang CW, Ong HL (2007) An efficient solution to biobjective generalized assignment problem. Adv Eng Softw 38:50–58 92. Zimokha VA, Rubinshtein MI (1988) R & d planning and the generalized assignment problem. Autom Remote Control 49:484–492

Generalized Benders Decomposition GBD CHRISTODOULOS A. FLOUDAS Department Chemical Engineering, Princeton University, Princeton, USA MSC2000: 49M29, 90C11 Article Outline Keywords Formulation Theoretical Development The Primal Problem The Master Problem Projection Onto the y-Space Dual of V Dual Representation of N(y) Algorithmic Development How to Solve the Master Problem General Algorithmic Statement of GBD Finite Convergence of GBD

Variants of GBD Variant 1 of GBD: V1-GBD V1-GBD Under Separability

Generalized Benders Decomposition

Variant 2 of GBD: V2-GBD Variant 3 of GBD: V3-GBD V3-GBD Under Separability V3-GBD Without Separability

GBD in Continuous and Discrete-Continuous Optimization See also References Keywords Decomposition; Duality; Global optimization The generalized Benders decomposition, GBD, [7] is a powerful theoretical and algorithmic approach for addressing mixed integer nonlinear optimization problems, as well as problems that require exploitation of their inherent mathematical structure via decomposition principles. A comprehensive analysis of the Generalized Benders Decomposition approach along with a variety of other approaches for mixed integer nonlinear optimization problems and their applications are presented in [3]. Formulation [7] generalized the approach proposed by [1], for exploiting the structure of mathematical programming problems stated as: 8 min f (x; y) ˆ ˆ x;y ˆ ˆ ˆ ˆ ˆ h(x; y) D 0 h(x; y) C  k> g(x; y):  Infeasible primal. If the primal is detected by the NLP solver to be infeasible, then we consider its constraints h(x; y k ) D 0; g(x; y k )  0; x 2 X Rn ; where the set X, for instance, consists of lower and upper bounds on the x variables. To identify a feasible point we can minimize an l1 or l1 sum of constraint violations. An l1 -minimization problem can be formulated as: 8 p X ˆ ˆ ˆ ˆ min ˛i ˆ ˆ x2X ˆ < iD1 ˆ ˆ ˆ ˆ ˆ ˆ ˆ :

s.t.

h(x; y k ) D 0

g i (x; y k )  ˛ i ; ˛ i  0;

i D 1; : : : ; p;

i D 1; : : : ; p;

Pp Note that if iD1 ˛ i = 0, then a feasible point has been determined.

Generalized Benders Decomposition

Also note that by defining as ˛ C D max (0; ˛) and   k k gC (x; y ) D max 0; g (x; y ) ; i i the l1 -minimization problem is stated as: 8 P ˆ X ˆ g(x; y):

It should be noted that two different types of Lagrange functions are defined depending on whether the primal problem is feasible or infeasible. Also, the upper bound is obtained only from the feasible primal problem.

h(x; y k ) D 0 g i (x; y k )  0;

i 2 I;

where I is the set of feasible constraints and I0 is the set of infeasible constraints. Other methods seek feasibility of the constraints one at a time while maintaining feasibility for inequalities indexed by i 2 I. This feasibility problem is formulated as: 8 X k ˆ wi gC min ˆ i (x; y ) ˆ < x2X i2I0

s.t. ˆ ˆ ˆ :

G

h(x; y k ) D 0 g i (x; y k )  0;

i 2 I;

and it is solved at any one time. To include all mentioned possibilities [2] formulated a general feasibility problem (FP) defined as: 8 X k ˆ min wi gC ˆ i (x; y ) ˆ < x2X i2I0 (FP) s.t. h(x; y k ) D 0 ˆ ˆ ˆ : g i (x; y k )  0; i 2 I:

The Master Problem The derivation of the master problem in the GBD makes use of nonlinear duality theory, and is characterized by the following three key ideas: i) projection onto the y-space; ii) dual representation of V; and iii) dual representation of the projection of the original problem on the y-space. In the sequel, the theoretical analysis involved in these three key ideas is presented. Projection Onto the y-Space The original problem can be written as: 8 min inf ˆ ˆ ˆ y x ˆ ˆ ˆ ˆs.t. < ˆ ˆ ˆ ˆ ˆ ˆ ˆ :

f (x; y) h(x; y) D 0 g(x; y)  0 x2X y 2 Y D f0; 1gq ;

(1)

1165

1166

G

Generalized Benders Decomposition

where the min operator has been written separately for y and x. Note that it is infimum with respect to x since for given y the inner problem may be unbounded. Let us define (y) as: 8 inf ˆ ˆ x ˆ ˆ < s.t. (y) D ˆ ˆ ˆ ˆ :

f (x; y) h(x; y) D 0

(2)

g(x; y)  0 x 2 X:

Note that (y) is parametric in the y variables and therefore, from its definition corresponds to the optimal value of the original problem for fixed y (i. e., the primal problem P(yk ) for y = yk ). Let us also define the set V as:

To overcome the aforementioned difficulty we have to introduce the dual representation of V and (y). Dual of V The dual representation of V will be invoked in terms of the intersection of a collection of regions that contain it, and it is described in the following theorem, due to [7]. Theorem 2 (Dual of V) Assuming conditions C1) and C2), a point y 2 Y belongs also to the set V if and only if it satisfies the (finite) system: 0  inf L(x; y; ; ); (

8;  2 ;

 D  2 Rm ;  2 R p :   0;

p X

) i : D 1

(5)

iD1

8 < V D y: :

9 =

h(x; y) D 0; : g(x; y)  0 ; for some x 2 X

(3)

Then, problem (1) can be written as: 8 h(x; y) C > g(x; y):

The equality of (y) and its dual is due to having the strong duality theorem satisfied because of conditions C1), C2) and C3). Substituting (7) for (y) and (5) for y 2 Y \ V into problem (4), (which is equivalent to (1)), we obtain: 8 ˆ h(x; y) C > g(x; y); >

L(x; y; ; ) D  h(x; y) C > g(x; y); which is called the master problem. If we assume that the optimum solution of (y) in (2) is bounded for all y 2 Y \ V, then we can replace the infimum with a minimum. Subsequently, the mas-

min L(x; y; ; ); x2X

8; 8  0;   8 ;  2 ;

are functions of y and can be interpreted as support functions of (y). ((y) is a support function of (y) at point yo if and only if (y) = (y) and (y) (y), 8y 6D yo .) If the support functions are linear in y, then the master problem approximates (y) by tangent hyperplanes and we can conclude that (y) is convex in y. Note that (y) can be convex in y even though the original problem is nonconvex in the joint x-y space (see [5]). In the sequel, we will define the aforementioned minimization problems in terms of the notion of support functions, that is: (y; ; ) D min L(x; y; ; ); x2X

8;

8  0;

(y; ; ) D min L(x; y; ; ); x2X   8 ;  2 :

1167

1168

G

Generalized Benders Decomposition

Algorithmic Development In the previous Section we discussed the primal and master problem for the GBD. We have the primal problem being a (linear or) nonlinear programming, NLP, problem that can be solved via available local NLP solvers (e. g., MINOS 5.3). The master problem, however, consists of outer and inner optimization problems, and approaches towards attaining its solution are discussed in the following. How to Solve the Master Problem The master problem has as constraints the two inner optimization problems (i. e., for the case of feasible primal and infeasible primal problems) which however need to be considered for all  and all   0 (i.e feasible primal) and all (; ) 2  (i. e., infeasible). This implies that the master problem has a very large number of constraints. The most natural approach for solving the master problem is relaxation [7]. The basic idea in the relaxation approach consists of the following: i) ignore all but a few of the constraints that correspond to the inner optimization problems (e. g., consider the inner optimization problems for spe1 cific or fixed multipliers (1 , 1 ) or ( ; 1 )); ii) solve the relaxed master problem and check whether the resulting solution satisfies all of the ignored constraints. If not, then generate and add to the relaxed master problem one or more of the violated constraints and solve the new relaxed master problem again; iii) continue until a relaxed master problem satisfies all of the ignored constraints, which implies that an optimal solution at the master problem has been obtained or until a termination criterion indicates that a solution of acceptable accuracy has been found. General Algorithmic Statement of GBD Assuming that the problem has a finite optimal value, [7] stated the general algorithm for GBD listed below. Note that a feasible initial primal is needed in Step 1. However, this does not restrict the GBD since it is possible to start with an infeasible primal problem. In this case, after detecting that the primal is infeasible, Step 3b is applied in which a support function  is employed.

Note that Step 1 could be altered, that is instead of solving the primal problem we could solve a continuous relaxation of the original problem in which the y variables are treated as continuous bounded by zero and one: 8 min f (x; y) ˆ ˆ x;y ˆ ˆ ˆ ˆ ˆ s.t. h(x; y) D 0 < g(x; y)  0

ˆ ˆ ˆ ˆ ˆ ˆ ˆ :

(8)

x2X 0  y  1:

If the solution of (8) is integral, then we terminate. If there exist fractional values of the y variables, then these can be rounded to the closest integer values and subsequently these can be used as the starting y1 vector with the possibility of the resulting primal problem being feasible or infeasible. Note also that in Step 1, Step 3a and Step 3b a rather important assumption is made, that is we can find the support functions  and  for the given values of the multiplier vectors (, ) and (; ). The determination of these support functions can not be achieved in general since these are parametric functions of y and result from the solution of the inner optimization problems. Their determination in the general case requires a global optimization approach as the one proposed by [5,6]. There exist however, a number of special cases for which the support functions can be obtained explicitly as functions of the y variables. We will discuss these special cases in the next Section. If however, it is not possible to obtain explicitly expressions of the support functions in terms of the y variables, then assumptions need to be introduced for their calculation. These assumptions, as well as the resulting variants of GBD will be discussed in the next Section. The point to note here is that the validity of lower bounds with these variants of GBD will be limited by the imposed assumptions. Note that the relaxed master problem (see Step 2) in the first iteration will have as a constraint one support function that corresponds to feasible primal and will be of the form: 8 < min

B

:s.t.

B  (y; 1 ; 1 ):

y2Y; B

(9)

Generalized Benders Decomposition

1

2

Let an initial point y1 2 Y \ V (i.e., by fixing y = y1 , we have a feasible primal). Solve the resulting primal problem P(y1 ) and obtain an optimal primal solution x1 and optimal multipliers; vectors 1 ; 1 . Assume that you can find, somehow, the support function (y; 1 ; 1 ) for the obtained multipliers 1 ; 1 . Set the counters k = 1 for feasible and l = 1 for infeasible and the current upper bound UBD = v(y1 ). Select the convergence tolerance   0. Solve the relaxed master problem: 8 ˆ min B ˆ ˆ y2Y; B ˆ ˆ ˆ ˆ B  (y;  k ;  k ); < s.t. (RM) k = 1; : : : ; K; ˆ ˆ l ˆ ˆ 0  (y :  ;  l ); ˆ ˆ ˆ : l = 1; : : : ; :

3

3a

3b

ˆ B ) be an optimal solution of the Let (ˆy;  ˆ B is a lower above relaxed master problem.  bound on the original problem, that is the ˆ B . If UBD  current lower bound is LBD =  LBD  , then terminate. Solve the primal problem for y = yˆ , that is the problem P(ˆy). Then we distinguish two cases: feasible and infeasible primal: Feasible Primal P(ˆy). The primal has v(ˆy) finite with an optimal solution xˆ and optimal multiplier vecˆ . ˆ Update the upper bound UBD = tors ; minfUBD; v(ˆy)g. If UBD  LBD  , then ˆ terminate. Otherwise, set k = k + 1,  k = , k ˆ Return to Step 2, assuming we and  = . can somehow determine the support function (y;  k+1 ;  k+1 ). Infeasible Primal P(ˆy). The primal does not have a feasible solution for y = yˆ. Solve a feasibility problem (e.g., then l1 -minimization) to determine the mulˆ  ˆ of the feasibility problem. tiplier vectors ; l ˆ and  l = . ˆ Return to Set l = l + 1;  = , Step 2, assuming we can somehow determine the support function (y; 

l +1

;  l +1 ).

In the second iteration, if the primal is feasible and (2 , 2 ) are its optimal multiplier vectors, then the re-

G

laxed master problem will feature two constraints and will be of the form: 8 ˆ min B ˆ 0 and yC j D 0, then y j is negative. To prohibit  from yielding positive values for yC j and y j simultaneously, we have the following remark.

(i)  yj  y j  M j  yj ; C (ii) M( j  1) C yC j  yj  yj :

M is a sufficiently large positive number and  j 2 f0; 1g: By means of changing variables, the GGP problem with free variables can be equivalently solved with another one having non-negative variables. The next is to deal with discrete variables containing zero, consider the following propositions: ˚Proposition 1 [21] For positive discrete variables y j 2 d j1 ; d j2 ;    ; d jm j where d j;iC1 > d ji > 0 for i D 1; 2;    ; m j  1, a product term y1˛1 y2˛2    y˛mm where ˛1 ; ˛2 ;    ; ˛m are real constants can be transformed into a function e ˛1 z 1 CC˛m z m where z j D ln d j1 C Pmj 1 Pmj 1 iD1 u ji (ln d j;iC1  ln d j1 ); iD1 u ji  1 for u ji 2 f0; 1g. Proof Let y j D e z j and z j D ln y j , expressing y j as Pmj 1 Pmj 1 y j D d j1 C iD1 u ji (d j;iC1  d j1 ); iD1 u ji  1; where u ji 2 f0; 1g.

1187

1188

G

Generalized Geometric Programming: Mixed Continuous and Discrete Free Variables

We then have y1˛1 y2˛2    y˛mm D e ˛1 z 1 CC˛m z m and Pm j 1 Pm j 1 z j D ln d j1 C iD1 u ji (ln d j;iC1  ln d j1 ); iD1 u ji  1, for u ji 2 f0; 1g.  Because some variables y j in Proposition 1 may have zero value, Proposition 1 needs to be modified as the following proposition: ˚Proposition 2 For positive discrete variables y j 2 d j1 ; d j2 ;    ; d jm j where d j;iC1 > d ji > 0 for i D 1; 2;    ; m j  1, 1  ˚j  q, and non-negative discrete variables y j 2 0; d j1 ; d j2 ;    ; d jm j where d j;iC1 > d ji > 0 for i D 1; 2;    ; m j  1; q C 1  j  ˛ q ˛ qC1 m, a product term s D y1˛1 y2˛2    y q y qC1    y˛mm can be expressed as ! mj X u ji ; for q C 1  j  m; (i) 0  s  s¯ 0 (ii) s¯ @

iD1 m X

1

mj

X

u ji (m  q)A C e ˛1 z 1 CC˛m z m  s

jDqC1 iD1

0

 s¯ @(m  q) 

mj m X X

1 u jiA C L(e ˛1 z 1 CC˛m z m );

jDqC1 iD1

Pmj 1 where y j D d j1 C iD1 u ji (d j;iC1 d j1 ); z j D ln d j1 C Pmj 1 Pmj 1  1; u ji 2 ji iD1 u ji (ln d j;iC1  ln d j1 ); iD1 u P mj f0; 1g, for 1  j  q, and y j D iD1 u ji d ji ; z j D Pm j Pm j u (ln d ); u  1; u 2 f0; 1g for q C 1  ji ji ji ji iD1 iD1 j  m; L(e ˛1 z 1 CC˛m z m ) is a piecewisely linearized expression of e ˛1 z 1 CC˛m z m , and s¯ is the upper bound of s. Proof If there is y j D 0 for some j (q C 1  j  m), Pm j then iD1 u ji D 0 and s D 0 by (i). Pm j If y j > 0 for all j D q C 1;    ; m, then iD1 u ji D 1 for j D q C 1;    ; m. Therefore we have Pm j Pm iD1 u ji  (m  q) D 0 if all variables in jDqC1 the signomial term are not zero, and this implies s D  e ˛1 z 1 CC˛m z m according to (ii). Remark 2 For a non-negative discrete variable y, y 2 fd1 ; d2 ;    ; d m g ; 0  d1 < d2 <    < d m , the exponential term y˛ where ˛ is a real constant can be represented as y˛ D d1˛ C

m1 X iD1

u i (d ˛iC1 d1˛ )

where

m1 X

u i  1;

iD1

u i 2 f0; 1g:

According to the above discussions, free discrete variables in GGP can be converted into positive discrete variables. In addition, Li and Tsai method [18] can deal with the free continuous variables. Consequently, the GGP program with continuous and discrete free variables can be transformed into a GGP program with only positive variables. In order to obtain a global optimum of the transformed GGP program, it is required to be converted into a convex mixed-integer problem which is solvable by the conventional convex mixed-integer techniques to derive a globally optimal solution. Convexification Strategies. Convexification strategies for signomial terms are important techniques for global optimization problems. Sun et al. [25] proposed a convexification method for a class of global optimization problems with monotone functions under some restrictive conditions. Wu et al. [29] developed a more general convexification and concavification transformation for solving a general global optimization problem with certain monotone properties. With different convexification approaches, an MINLP problem can be reformulated into another convex mixed-integer program solvable to obtain an approximately global optimum. Björk et al. [4] proposed a global optimization technique based on convexifying signomial terms. They discussed that the right choice of transformation for convexifying nonconvex signomial terms has a clear impact on the efficiency of the optimization approach. Tsai et al. [26] also suggested convexification techniques for the signomial terms with three variables. This study presents generalized convexification techniques and rules to transform a nonconvex GGP program with continuous and discrete variables into a convex mixed-integer program. Consider the following propositions: Lemma 1 For a twice-differentiable function f (X) D n Q x ˛i i ; X D (x1 ; x2 ;    ; x n ) ; c; x i ; ˛ i 2 0; ˛; ˇ;  < 0, then cx1˛ x2 x3 is already a convex term by Proposition 3.

iD1

j2J i

(ii) u 

j2J i

 0 for i D 1; 2;    ; n, when c < 0; x i ; ˛ i  0, and P 1  niD1 ˛ i  0. Since det H i (x)  0 for all i, H i (X) is positive semi-definite and f (X) is convex.  For a given signomial term s, if s can be converted into a set of convex terms satisfying Proposition 3 and 4, then the whole solution process is more computationally efficient. Under this condition, s does not necessitate the exponential transformation. For instance, s D x11 x22 x31 with x1 ; x2 ; x3  0 is a convex term requirbreaking no transformation by Proposition 3, and s D x10:2 x20:7 with x1 ; x2  0 is also a convex term by Proposition 4. Remark 5 A product term z D u f (x) is equivalent to the following linear inequalities: (i) M(u  1) C f (x)  z  M(1  u) C f (x); (ii)  Mu  z  Mu;

Rule 2 If c > 0; ˛; ˇ < 0, and  > 0, then let cx1˛ x2ˇ x3 D cx1˛ x2ˇ z1 where z1 D x31 . The term cx1˛ x2ˇ z1 is convex by Rule 1. Rule 3 If c > 0; ˛ < 0, and ˇ;  > 0, then let cx1˛ x2ˇ x3 D cx1˛ z1ˇ z2 where z1 D x21 ; z2 D x31 . The term cx1˛ z1ˇ z2 is convex by Rule 1. Rule 4 If c > 0 and ˛; ˇ;  > 0, then let cx1˛ x2ˇ x3 D ce ˛ ln x 1 Cˇ ln x 2 C ln x 3 . Rule 5 If c < 0, ˛; ˇ;   0, and ˛ C ˇ C   1, then ˇ  cx1˛ x2 x3 is already a convex term by Proposition 4. Rule 6 If c < 0; ˛; ˇ > 0; ˛ C ˇ < 1, then let cx1˛ x2ˇ x3 D cx1˛ x2ˇ z11˛ˇ where z1 D x3 /(1˛ˇ ) . The ˇ 1˛ˇ is convex by Rule 5. term cx1˛ x2 z1 ˇ 

Rule 7 If c < 0; 0 < ˛ < 1, then let cx1˛ x2 x3 D 2ˇ /(1˛) and z2 D cx1˛ z1(1˛)/2 z2(1˛)/2 where z1 D x2 2 /(1˛) ˛ (1˛)/2 (1˛)/2 x3 . The term cx1 z1 z2 is convex by Rule 5. Rule 8 If c < 0 and “˛; ˇ;  < 0 or ˛; ˇ;   1”, then 1

where u 2 f0; 1g, z is an unrestricted in sign variable, and M D max f (x) is a large constant.

1

1

let cx1˛ x2ˇ x3 D cz13 z23 z33 where z1 D x13˛ ; z2 D x23ˇ , 1

1

1

and z3 D x33 . The term cz13 z23 z33 is convex by Rule 5.

1189

1190

G

Generalized Geometric Programming: Mixed Continuous and Discrete Free Variables

Rule 9 If ˛; ˇ > 0; x1 2 Z; x3 D 1 and ˛ C ˇ >  1, P 1 1 ˇ ˛ ˛ ˛ then let cx1˛ x2ˇ D c d11 C m iD1 u 1i (d1;iC1  d11 ) x2 for i 2 f1; 2;    ; m1  1g . By Remark 5, the product term u1i x2ˇ can be transformed into linear inequalities.

By applying the proposed rules, we can determine certain classes of signomial terms are convex and do not necessitate any transformation. Besides, we can transform a nonconvex signomial term into a convex term accordance with the proposed rules by replacing some variables, thereby decreasing the number of concave functions requiring to be estimated and making the resulting problem a computationally efficient model. In order to be a valid transformation in the global optimization procedure, the transformation should be selected such that the signomial terms are not only convexified but also underestimated [4,21,27]). If the transformations are appropriately selected, the corresponding approximate signomial term will underestimate the original convexified signomial term by applying piecewise linear approximations to the inverse transformation functions. We examine the proposed rules can satisfy the underestimating condition as follows: In Rule 2, let zˆ1 be the approximate transformation variable obtained from piecewise linear function of z1 D x31 . The inverse transformation z1 D x31 (x3 > 0) is convex and z1 will be overestimated (ˆz1 > z1 ). When inserting the approximate variable in the signomial term, we find the underestimating property cx1˛ x2ˇ zˆ1  cx1˛ x2ˇ z1 is fulfilled since c > 0 and z1 has a negative power in the convexified term. Similarly, Rules 3 and 4 meet the underestimating condition. In Rule 6, let zˆ1 be the approximate transformation variable obtained from piecewise linear function of z1 D x3 /(1˛ˇ ) . The inverse transforma /(1˛ˇ )   (x3 > 0; 1˛ˇ > 1 or 1˛ˇ  tion z1 D x3 0) is convex and z1 will be overestimated (ˆz1 > z1 ). When inserting the approximate variable in the signomial term, we find the underestimating property cx1˛ x2ˇ zˆ11˛ˇ  cx1˛ x2ˇ z11˛ˇ is fulfilled since c < 0 and z1 has a positive power in the convexified term. Similarly, Rules 7 and 8 satisfy the underestimating property.

From above discussions, we observe the proposed rules not only convexity but underestimate the convexified signomial term. Consequently, utilizing the transformations in the global optimization of a GGP problems, the feasible region of the convexified problem overestimates the feasible region of the original nonconvex problem. Case Studies Case1 Minimize x13 x21:5 x33 C x25:5 x3 C x15 subject to 3x1 C 2x2  x3  7;  5  x1  2; 0  x2  4; 5  x3  1; x1 ; x2 2 Z ;

x3 2 k C I)ıx D A k W b ;

1

1

1

U k D I  W 2 Dk P k Dk W 2 ; 1

b (k) D W 2 1

Then  A k WA> k D k WA> k

 [r (k) C D k P k (Ve (k) C D k Wr (k) )]:  Ak W Dk D Jk J> k V C Dk W Dk

and the Gauss–Newton matrix is at least positive semidefinite, often positive definite, (ıx(k) , ı (k) ) is a descent direction of f (x, ) at (x(k) ),  (k) ). A line searchalong the direction determines a steplength ˛ k satisfying some descent conditions and the new iteration point is x (kC1) D x (k) C ˛ k ıx (k) ;  (kC1) D  (k) C ˛ k ı (kC1) : P.T. Boggs, R.H. Byrd and R.B. Schnabel [1] use trust region technique in their modification of Gauss– Newton method for generalized total least squares problems. The modification is a generalization of the Levenberg–Marquardt method, in which the trust region subproblem (



(k) 2 min q k (ız) D J > k ız C h s.t.

kızk  k

is solved, where k is the trust region radius, !   1 2 r (k) W x : zD ; h (k) D 1  V 2 e (k) The solution, denoted by ız(), of the subproblem satisfies the system of equations     ıx A k Wr (k) D ; Bk Ve (k) C D k Wr (k) ı kız()k D k ; > 0, unless k ız(0) k  k , where Bk denotes the matrix   A k WA> Ak W Dk k C I : D k WA> V C D k W D k C I k Let P k D V C D k W D k C I. From the buttom part of the system, we get 1

ı D P k [Ve (k) C D k Wr (k) C D k WA> k ıx]:

Since this system is the normal equation of the linear least squares problem

" 1 # " 1 #

1

 2 (k)

U k2 W 2 A>

U b k k ıx C min

; 1



0 2 I the solution ıx(k) can be obtained by performing a QR 1

1

factorization to the matrix U k2 W 2 A> k , a sequence of plane rotations to eliminate 1/2 I and back substitutions. For a given value (`) , ıx((`) ) is obtained from the solution of the system and then ı ((`) ) from substitution. If ˇ ˇ ˇ

ˇ ˇ ˇ ˇ

ˇ ˇ((`) )ˇ D ˇ ız((`) )  k ˇ   k is satisfied, ız((`) ) is accepted as an approximate solution of the trust region subproblem where  2 (0, 1)is a preset tolerance. Otherwise, (`) is updated to give a new value (`+1) and a solution ız((`+1) ) is recomputed from the system. Moré’s updating formula [4]



((`) ) ız((`) )

(`C1) (`) D   k r((`) ) can be used to generate (`+1) , where r((`) ) is evaluated from difference approximation r((`) ) D

((`) )  ((`1) ) : (`)  (`1)

For generalized total least squares problems, the parameter vector x and the variable vector  can be treated separately. The first order necessary condition for a point to be a solution of the problem can be used to eliminate the  dependence in the function f (x, ). Consider the system of equations r f D Ve C DWr D 0: These contain m nonlinear equations with m unknowns, each of which only contains one unknown  j

G

Generalized Total Least Squares

for fixed value of x v j ( j  t j ) C w j ((x;  j )  y j )

@(x;  j ) D 0; @ j

j D 1; : : : ; m: When these equations can be algebraically solved to give an explicit solution expression (x), substitution it into the function f (x, ) allows the parameter vector x to be determined by directly using any conventional method to minimize the function f (x, (x)) which now is a function of the parameter vector x. However, in most cases, it is impossible or difficult to get an explicit form of the solution (x) and each equation mustbe solved numerically for each given value of x by minimizing the functions 1 [w j ((x;  j )  y j )2 C v j ( j  t j )2 ]; 2 j D 1; : : : ; m; (x;  j ) D

to get an approximate solution, (x) say, to the solution (x) so that the values of function f (x, (x)) and its derivatives with respect to x can be evaluated from the values x and (x). Assume that r 2 f (x ,   ) is positive definite, then it follows from the implicit function theorem [3] that there exist open neighborhoods N(x ), N(  ) of x ,   such that for any x 2 N(x ), a unique  satisfying the system exists in N(  ), this being the vector (x). Furthermore, (x) is continuously differentiable and r 2 f (x, (x)) is positive definite for all x 2 N(x ). Substituting (x) into the function f (x, ) we get a separable minimization problem min f (x; (x)); which is defined only in terms of x and reduces the problem dimension from m + n to n. The separation is particularly efficient since in most cases, m is very large. Using the chain rule, the differentiability of (x) and the fact that r  f = 0 we get derivatives of the function f (x, (x)) g(x) Drx f C rx r f D rx f ; G(x)

Drx2x Drx2x

f f

2 C rx f rx  2 2  rx f [r

f ]1 r2x f :

Since the positive definiteness of the matrix G(x) is implied by that of the matrix r 2 f , if r 2 f is positive defi-

nite at the solution (x ,   ), the matrix G(x ) is positive definite, too. The separated Newton method minimizes the function f (x, (x)) using Newton iteration G k ıx (k) D g (k) ;

x (kC1) D x (k) C ıx (k)

to generate a sequence {x(k) }, where Gk and g (k) are evaluated at x(k) and (x(k) ). (x(k) ) is an approximate solution of the system r  f = 0 obtained using Newton iteration 0

 j(sC1)

D

 j(s)



s D 1; 2; : : : ;

(x (k) ;  j(s) )

00 (x (k) ;  (s) ) j

;

j D 1; : : : ; m:

When ˇ ˇ ˇ (sC1) ˇ   j(s) ˇ  ; ˇ j is accepted as  j (x(k) ) where > 0 isa preset small  (sC1) j constant. The values t j and  j (x(k1) ), j = 1, . . . , m, can be used as starting values of these iterations for k = 1 and k  2, respectively. A careful observation shows that the difference between the Powell–Macdonald method and the separated Newton method is that for given value x(k) , the former carries out only one Newton iteration for the system r  f = 0 while the later one solves the system quite exactly by repeated doing the iteration. The separated Newton method still requires the evaluation of secondorder derivatives. Ignoring second order terms in all derivatives r 2x x f , r 2x f , r 2 x f and r 2 f , we get an approximation to G 1

1

M k D A k W 2 U k W 2 A> k; U k D (I C V 1 D k W D k )1 : Then the iteration M k ıx (k) D g (k) ;

x (kC1) D x (k) C ıx (k)

is the separated Gauss–Newton method [8]. The property that the convergence of Gauss–Newton method for ordinary least squares depends on the closeness of the Gauss–Newton matrix to true Hessian matrix is applicable to the separated Gauss–Newton method. If M(x ) = G(x ), the method is locally convergent and

1237

1238

G

Generalized Variational Inequalities: A Brief Review

rate of convergence is quadratic. If M(x ) 6D G(x ), the method may not converge and if it converges, the rate is at best linear. In order to force global convergence, line search or trust region techniques can be incorporated. For large residual problems, the Gauss–Newton matrix M is not a good approximation to G and quasiNewton updates can be used to generate better approximations. When quasi-Newton updates, for example BFGS update, are used, the separated problem is regarded as a general minimization problem, the special structure of the problem function is not exploited and approximations are not directly obtained from the first order derivatives. The vectors ı (k) and  (k) used in updating formulas can be defined by

effective in solving generalized total least squares problems. See also  ABS Algorithms for Linear Equations and Linear Least Squares  ABS Algorithms for Optimization  Gauss–Newton Method: Least Squares, Relation to Newton’s Method  Least Squares Orthogonal Polynomials  Least Squares Problems  Nonlinear Least Squares: Newton-type Methods  Nonlinear Least Squares Problems  Nonlinear Least Squares: Trust Region Methods

ı (k) D ıx (k) D x (kC1)  x (k) ;  (k) D g(x (kC1) ; (x (kC1) ))  g(x (k) ; (x (k) )) Alternative definitions for  (k) can be derived by using thespecial structure of the derivatives. Two common used definitions for  (k) are (k) C A kC1 W D kC1 ı (k)  (k) D A kC1 WA> kC1 ıx

C (A kC1  A k )r (kC1) ;  (k) D A kC1 W(r (kC1)  r (k) ) C (A kC1  A k )Wr (kC1) ; where ı (k) = (x(k+1)) (x(k) ). Numerical experiments favors the last definition of  (k) [9]. Based on the separated Gauss–Newton method and the separated BFGS method, separated hybrid method is a simple generalization of the hybrid method for ordinary nonlinear least squares problems, where a test [9] is derived to determine what step should be chosen at each iteration. When the test chooses the Gauss– Newton step, the approximation Bk to Gk is set to the Gauss– Newton matrix M k and when the test chooses the BFGS step, the matrix Bk is obtained from Bk1 using BFGS updating formula. When separated methods are used to solve generalized total least squares problems, computational savings can be obtained if we initially ignore errors in t j , j = 1, . . . , m, and just solve an ordinary nonlinear least squares problem. Whenreasonable reduction in the objective function has been achieved, errors in all variables are then considered and separated methods are applied. This modification of any separated method is

References 1. Boggs PT, Byrd RH, Schnabel RB (1987) A stable and efficient algorithm for nonlinear orthogonal distance regression. SIAM J Sci Statist Comput 8:1052–1078 2. Demming WE (1943) Statistics adjustment of data. Wiley, New York 3. Hestens MR (1966) Calculus of variations and optimal control problems. New York Wiley, New York 4. Moré JJ (1977) The Levenberg-Marquardt algorithm: Implementation and theory. In: Watson GA (ed) Numerical Analysis, Dundee. Lecture Notes Math. Springer, Berlin, pp 105– 116 5. O’Neill M, Sinclair IG, Smith J (1969) Polynomial curve fitting when abscisses and ordinates are both subject to error. Comput J 12:52–56 6. Powell DR, Macdonld JR (1972) A rapidly convergent iterative method for the solution of the generalizednonlinear least squares problem. Comput J 15:148–155 7. Southwell WH (1975) Fitting data to nonlinear functions with uncertainties in all measurement variables. Comput J 19:67–73 8. Watson GA (1985) The solution of generalized least squares problems. Internat Ser Numer Math 75:388–400 9. Xu CX (1987) Hybrid methods for nonlinear least squares and related problems. PhD Thesis Univ. Dundee

Generalized Variational Inequalities: A Brief Review BARBARA PANICUCCI Department of Applied Mathematics, University of Pisa, Pisa, Italy MSC2000: 49J53, 90C30

Generalized Variational Inequalities: A Brief Review

G

Article Outline

Problem Formulation and Framework

Keywords and Phrases Introduction Problem Formulation and Framework Existence and Uniqueness

In its general form, the GVI problem can be stated as follows: find x  2 X and u  2 F(x  ) such that

Existence of Solutions: Bounded Domain Existence of Solutions: Unbounded Domain

GVI and Related Problems GVI and Fixed-Point Problems GVI and Optimization Problems GVI and Complementarity Problems

References

Keywords and Phrases

hu  ; y  x  i  0

8y2X;

where  h; i denotes the usual inner product in Rn ,  X Rn is a nonempty closed and convex set,  Rn  Rn is a set-valued map, i. e., an operator that associates with each x 2 Rn a set F(x) Rn . If F is a single valued function, then the GVI problem reduces to the classical VI, which is to find x  2 X such that

Generalized variational inequality; Optimization problem; Gap function

hF(x  ); y  x  i  0

8y2X:

In connection with the set-valued map F : Rn  R a few definitions need to be recalled. First, F is characterized by its graph: n

Introduction The theory as well the applications of variational inequalities (VIs) and the nonlinear complementarity problem (NCP) have proved to be a very powerful tool for studying a wide range of problems arising in mechanics, physics, optimization, and applied sciences. A survey on the developments of VI and NCP is in [7]. In recent years, considerable interest has been shown in developing various extensions and generalizations of the VI problem. An important class of such generalizations, introduced in [2], is the so-called generalized variational inequality (GVI). This class has many important and significant applications in various fields such as mathematical physics and control theory, economics, and transportation equilibrium (see, e. g., [1,11]). For example, it is known that the traffic equilibrium problem can be formulated as a VI when the travel cost between any two given nodes for a given flow is fixed [4]. However, the traffic conditions may vary and the travel cost between two given nodes may not be fixed, but within a cost interval. In this case the corresponding problem can be formulated as a GVI. Moreover, GVI provides a unifying framework for many general problems such us fixed-point, optimization, and complementarity problems. In what follows we give an overview of recent developments concerning the issue of existence of a solution and equivalent reformulations.

graph (F) D f(x; u) 2 Rn  Rn : u 2 F(x)g : The image of X under F is [ F(x) ; F(X) D x2X

the inverse of F is defined by F 1 (u) D fx : u 2 F(x)g; and the domain of F is the set dom (F) D fx 2 Rn : F(x) ¤ ;g : Throughout we assume that dom (F)  X. Over the past two decades, most effort has been concentrated on the question of the existence of solutions to GVI problems. The study of the existence of solutions of GVI involves several continuity properties of set-valued maps. We recall these conditions in the sequel.  A set-valued map F : Rn  Rn is said to be upper semicontinuous (u.s.c.) at x 2 Rn if for each open set V  F(x) there exists a neighborhood U of x such that F(U) V ; F is u.s.c. on a set X Rn if it is u.s.c. at every point in X.  A set-valued map F : Rn  Rn is upper hemicontinuous on X Rn ; if its restriction to line segments of X is upper semicontinuous.

1239

1240

G

Generalized Variational Inequalities: A Brief Review

maximal monotone

The study of the existence of solutions of GVI involves also some monotonicity-type properties for setvalued maps. In what follows we recall the definitions. (M1) F is quasimonotone on X if, for every pair of distinct points x; y 2 X and every u 2 F(x), v 2 F(y), we have:

strongly monotone

strictly monotone

monotone

hv; x  yi > 0 H) hu; x  yi  0 : pseudomonotone

(M2) F is properly quasimonotone on X if, for any x 1 ; : : : ; x n 2 X and any 1 ; : : : ; n > 0 with Pn 2 f1; : : : ; ng such that iD1  i D 1, there exists jP j j for all u 2 F(x ) and x D niD1  i x i , we have:

properly quasimonotone

hu j ; x  x j i  0 : (M3) F is pseudomonotone on X if, for every pair of distinct points x; y 2 X and every u 2 F(x); v 2 F(y), we have: hv; x  yi  0 H) hu; x  yi  0 : (M4) F is monotone on X if, for every pair of distinct points x; y 2 X and every u 2 F(x); v 2 F(y), we have: hu  v; x  yi  0 : (M5) F is strictly monotone on X if, for every pair of distinct points x; y 2 X and every u 2 F(x); v 2 F(y), we have: hu  v; x  yi > 0 : (M6) F is strongly monotone on X with constant ˇ > 0 if, for every pair of distinct points x; y 2 X and every u 2 F(x); v 2 F(y), we have: hu  v; x  yi  ˇkx  yk2 ; where k  k denotes the classical euclidean norm. (M7) F is maximal monotone on X if it is monotone on X and its graph is not properly contained in the graph of any other monotone operator on X. The relationships among these kinds of monotonicity are represented in Fig. 1.

quasimonotone Generalized Variational Inequalities: A Brief Review, Figure 1 Relationships among generalized monotonicity conditions

Existence and Uniqueness In recent years the existence of solutions to GVIs has been investigated extensively. In what follows we provide some of the most fundamental results. The basic result on the existence of a solution to the GVI problem requires the set X to be compact and convex and the map F to be u.s.c. From this basic result many others can be derived by replacing the compactness of X with additional coercivity conditions on F. Existence of Solutions: Bounded Domain This section presents some existence results for solutions of GVI in the case of a compact domain. The following existence theorem exploits the formulation of GVI as a fixed-point problem. Theorem 1 ([8]) If X is compact and F is u.s.c. on X with compact and convex values, then GVI has a solution. Theorem 2 ([12]) If X is compact and F is upper hemicontinuous and properly quasimonotone on X with compact and convex values, then GVI has a solution.

Generalized Variational Inequalities: A Brief Review

Existence of Solutions: Unbounded Domain The existence of solutions of GVI on unbounded domains is guaranteed by the same conditions as for bounded domains, together with a coercivity condition. In the literature various coercivity conditions have been considered. In particular (see [5]): (C1) 9 R > 0; 8x 2 XŸX R ; 9 y 2 XR :

8u 2 F(x) ;

hu; y  xi < 0 ;

(C2) 9 R > 0; 8 x 2 XŸX R ; 8 u 2 F(x) :

9 y 2 XR ;

hu; y  xi < 0 ;

G

 If F has convex values and it is upper hemicontinuous and pseudomonotone on X, then (C1), (C2), (C3), and (C4) are equivalent. The coercivity conditions allow us to exhibit a sufficiently large ball intersecting with X such that no point outside this ball is a solution of the GVI; then one can establish the existence of a solution stated below. Theorem 4 ([5]) If F is upper hemicontinuous and pseudomonotone on X with compact and convex values, then the following statements are equivalent:  GVI has a nonempty and compact solution set.  (C1) holds;  (C2) holds.  (C3) holds.  (C4) holds.

(C3) 9 R > 0; 8 x 2 XŸX R ; 9 v 2 F(y) :

9 y 2 XR ;

hv; y  xi < 0 ;

(C4) X1 \ (F(X)) D f0g ; where X R D fx 2 X : kxk  Rg and (F(X)) D fd 2 Rn : hu; di  0; 8u 2 F(X)g is the polar cone of F(X). Further, the recession cone X1 , for X closed and convex, is defined by X1 D fd 2 Rn : x C t d 2 X; 8 t  0; x 2 Xg :

In what follows we state an existence theorem for which we require neither the upper semicontinuity of F, nor the compactness, nor the convexity of F(x), but we need the maximal monotonicity of F. Theorem 5 ([15]) Assume that F is maximal monotone on Rn . Then the solution set of GVI is nonempty and compact if and only if (C4) holds. In general, GVI can have more than one solution. The following theorem gives conditions under which GVI can have at most one solution. Theorem 6  If F is strictly monotone on X, then GVI has at most one solution.  If F is u.s.c., strongly monotone on X, and has nonempty convex and compact values, then GVI has a unique solution.

Some basic relationships among these coercivity conditions are summarized in the following result.

GVI and Related Problems

Theorem 3 ([5])  (C2) H) (C1).  If F has convex values, then (C2) and (C1) are equivalent.  If F is pseudomonotone on X, then (C3) H) (C2).  (C4) H) (C3).  If F is upper hemicontinuous and pseudomonotone on X, then (C2), (C3) and (C4) are equivalent.

As stated, the theory of GVI is a powerful unifying methodology that contains as special cases several wellknown problems such as fixed-point, optimization, and complementarity problems. In what follows we describe these equivalent formulations of the GVI problem. Such formulations can be very beneficial for both analytical and computational purposes. Indeed we can apply classic results of these problems to treat the GVI.

1241

1242

G

Generalized Variational Inequalities: A Brief Review

GVI and Fixed-Point Problems In what follows we exploit the formulation of GVI as a fixed-point problem. We recall that x  is a fixed point of the set-valued map F : X  Rn if x 2 X

and

f (x) > f (y) H) hr f (x); y  xi < 0 ;

x  2 F(x  ) :

The fixed-point reformulation is very relevant for the GVI problem. Indeed we can apply Kakutani’s fixedpoint theorem, which is instrumental for proving the existence result on a bounded domain. We define the following set-valued map:

: X  conv (F(X))  X  conv (F(X)) (x; u) 7!  (u)  F(x) ; where  (u) D arg minx2X hu; xi is the set of constrained minimizers of the map hu; xi on X and conv (F(X)) denotes the convex hull of F(X). Assuming that X is compact,  (u) results in being nonempty. It easy to see that the problem of finding a fixed point (x  , u  ) of , i. e., x  2 K;

u  2 F(x  );

It is well known that if f is continuously differentiable, then the classical VI with F D r f is a necessary optimality condition for (1). The VI gives also a sufficient condition if f is pseudoconvex on X, i. e.,

for all x; y 2 X. Therefore, if f is continuously differentiable and pseudoconvex on X, the VI with F D r f is equivalent to the optimization problem (1). In what follows we extend these results in terms of GVI when f : ˝ ! R is a locally Lipschitz continuous function, that is, for each point x 2 ˝ there exists a neighborhood U of x such that f is Lipschitz continuous on U. To this end we recall some basic facts about Clarke calculus for a locally Lipschitz continuous function, see [3]. The Clarke’s generalized derivative of f at x in the direction v, denoted by f 0 (x;v), is given by f 0 (x; v) D lim sup y!x

f (y C t v)  f (y) : t

t#0

The generalized gradient of f at x, denoted by @ f (x), is defined as follows:

x  2 arg minhu  ; xi ; x2K

is equivalent to GVI. It is worth noting that the GVI problem can also be formulated as an inclusion as follows: find x  2 K such that 0 2 F(x  ) C N K (x  ) ;

@ f (x) D f 2 Rn : h; vi  f 0 (x; v)

8 v 2 Rn g :

A generalized derivative can be obtained from the generalized gradient: f 0 (x; v) D maxfh; vi :  2 @ f (x)g :

i. e., finding a zero of the set-valued map F C N K in the domain X, where the normal cone N X (x) to the set X at point x 2 X is given by: N X (x) D fd 2 Rn : hd; y  xi  0 8 y 2 Xg :

h; y  xi  0

GVI and Optimization Problems Let us consider the constrained optimization problem: (

min f (x)

We can extend the definition of pseudoconvexity for a locally Lipschitz continuous function f : ˝ ! R, [16]: f is pseudoconvex on ˝ if, for all x; y 2 ˝, there exists  2 @ f (x) such that H)

f (x)  f (y) :

Let us now consider the GVI with Clarke gradient operator F D @ f . We can state the following result.

(1)

Theorem 7 ([3]) A GVI with F D @ f provides necessary optimality conditions for problem (1).

where  X is a closed and convex subset of Rn ,  The objective function f is defined on an open neighborhood of X, denoted ˝.

In general, a GVI does not give sufficient optimality conditions. However, as shown in [16], when f is pseudoconvex on ˝, the GVI gives sufficient optimality conditions too. Consequently, as for the single-valued

x 2 X;

G

Generalized Variational Inequalities: A Brief Review

case, if f is pseudoconvex on ˝, a GVI with F D @ f is equivalent to the optimization problem (1). The above discussion focused on the GVI with gradient operator; however, an arbitrary set-valued map, in general, is not a gradient map. A powerful tool in dealing with the GVI problem by way of its equivalent optimization reformulation is given by the so-called gap functions. Specifically, we say that a function ' : Rn  Rn ! R [ fC1g is a gap function for GVI if  '(x; u)  0 for all (x; u) 2 graph (F),  x  is a solution of GVI if and only if x  2 X and there exists u  2 F(x  ) such that '(x  ; u  ) D 0. Hence, the GVI problem can be rewritten as the following constrained optimization problem: ( min '(x; u) (x; u) 2 graph (F) : An example of a gap function, proposed in [6], is: '(x; u) D suphu; x  yi;

(x; u) 2 Rn  Rn : (2)

by using a regularized gap function. Let us consider   1 'G (x; u) D max hu; x  yi  kx  yk2G ; y2X 2 where (x; u) 2 Rn  Rn ; G is a symmetric positive n definite matrix, p and k  kG is the norm in R defined by kxkG D hx; G xi. This function, introduced in [6] for generalized quasivariational inequalities, i. e., GVIs where set X depends on solution x, is a gap function for GVI and is called a regularized gap function. Since G (x; u;

1 y) D hu; x  yi  kx  yk2G 2

is strongly concave with respect to y, there is a unique maximizer over X denoted by y(x; u). If we denote the projection operator onto set X with respect to the norm k  kG by ˘ X;G (); it is easy to check that this maximizer is y(x; u) D ˘ X;G (x  G 1 u) :

y2X

The function '(x; ) is convex and closed for every fixed x 2 Rn and '(; u) is affine for every fixed u 2 Rn (see [6]). It is worth noting that  represents a duality gap in the Mosco duality scheme [14] for GVI. Let us consider this more general GVI problem: find x  2 Rn and u  2 F(x  ) such that hu  ; x  x  i  (x  )  (x)

8 x 2 Rn ;

(3)

where  : Rn ! R [ fC1g is a proper, lower semicontinuous convex function. The dual problem of (3) is defined as: find v  2 Rn and y 2 F 1 (v  ) such that hy ; v  v  i    (v  )    (y)

8 v 2 Rn ;

where   (v) D sup x2Rn fhv; xi  (x)g is the Fenchel conjugate of '.

Therefore, the regularized gap function 'G (x; u) D hu; x  y(x; u)i 

1 kx  y(x; u)k2G 2

is finite valued everywhere. Moreover, the regularized gap function is continuously differentiable, and its gradient is given by rx 'G (x; u) D u C G [y(x; u)  x] ; ru 'G (x; u) D x  y(x; u) : Therefore, using the regularized gap function we obtain an equivalent differentiable optimization reformulation of the GVI problem. Gap functions can be used in the design of numerical algorithms for solving the GVI. GVI and Complementarity Problems

Theorem 8 ([15]) The gap function (2) measures the duality gap of Mosco’s duality scheme: ( '(x; u) if x 2 X  (x) C  (u) C hu; xi D C1 otherwise:

It is well known that, when X is a closed convex cone and F : X ! Rn , the VI problem is equivalent to the NCP problem, which consists in finding x  2 X such that

The gap function  is not differentiable in general. Moreover, when graph (F) is unbounded, it is in general not finite valued. These drawbacks can be avoided

where

F(x  ) 2 X 

and

hF(x  ); x  i D 0 ;

X  D fd 2 Rn : hu; di  0; 8u 2 Xg

1243

1244

G

General Moment Optimization Problems

is the negative polar cone of X. Such a relationship is preserved in the GVI problems. First, let us consider an extension of the NCP problem, see [17], that can be defined as follows. Let X be a closed convex cone of Rn and F a setvalued map. The generalized complementarity problem (GCP) is to find x  2 X such that there exists u  2 F(x  ) satisfying the following properties: u 2 X 

and

hu  ; x  i D 0 :

As in the single-valued case, both problems GVI and GCP have the same solution set if the underlying set X is a closed convex cone.

15. Panicucci B, Pappalardo M, Passacantando M (2006) On finite-dimensional generalized variational inequalities. J Indust Manag Optim 2:43–53 16. Penot J-P, Quang PH (1997) Generalized convexity of functions and generalized monotonicity of set-valued maps. J Optim Theory Appl 92:343–356 17. Saigal R (1976) Extension of the generalized complementarity problem. Math Oper Res 1:260–266

General Moment Optimization Problems GEORGE A. ANASTASSIOU Department Math. Sci., The University Memphis, Memphis, USA

References 1. Aubin JP (1984) L’Analyse Non Linéaire et Ses Motivations Economiques. Masson, Paris 2. Browder FE (1965) Multivalued Monotone Nonlinear Mappings and Duality Mappings in Banach Spaces. Trans Am Math Soc 71:780–785 3. Clarke FH (1990) Optimization and nonsmooth analysis, vol 5 of Classics in Applied Mathematics. SIAM, Philadelphia 4. Dafermos S (1980) Traffic Equilibrium and Variational Inequalities. Transp Sci 14:42–54 5. Daniilidis A, Hadjisavvas N (1999) Coercivity conditions and variational inequalities. Math Programm 86:433–438 6. Dietrich H (1999) A smooth dual gap function solution to a class of quasivariational inequalities. J Math Anal Appl 235:380–393 7. Facchinei F, Pang JS (2003) Finite-dimensional variational inequalities and complementarity problems, vol I, II. Springer, New York 8. Fang SC, Peterson EL (1982) Generalized variational inequalities. J Optim Theory Appl 38:363–383 9. Giannessi F, Maugeri A, Pardalos P (2001) Equilibrium Problems: Nonsmooth Optimization and Variational Inequality Models. Kluwer, Dordrecht 10. Giannessi F, Pardalos P, Rapcsák T (2001) Optimization theory. Recent developments. Kluwer, Dordrecht 11. Harker PT, Pang S (1990) Finite-Dimensional Variational Inequality and Nonlinear Complementarity Problems: a survey of theory, algorithms and applications. Math Programm 48:161–220 12. John R (2001) A note on Minty variational inequalities and generalized monotonicity. In: Generalized convexity and generalized monotonicity. Springer, Berlin, pp 240–246 13. Konnov I (2001) Combined relaxation methods for variational inequalities. Springer, Berlin 14. Mosco U (1972) Dual Variational Inequalities. J Math Anal Appl 202–206

MSC2000: 28-XX, 49-XX, 60-XX Article Outline Keywords The Standard Moment Problem The Method of Optimal Distance The Method of Optimal Ratio

The Convex Moment Problem Description of the Problem Solving the Convex Moment Problem

Infinite Many Conditions Moment Problem Applications and Discussion Final Conclusion

See also References Keywords Geometric moment theory; Probability; Integral constraint; Optimal integral bounds subject to moment conditions; Finite moment problem; Convex moment problem; Convexity; Infinite moment problem In this article we describe the main moment problems and their solution methods from theoretical to applied. In particular we present the standard moment problem, the convex moment problem, and the infinite many conditions moment problem. Optimization moment theory has a lot of important applications in many sciences and subjects, for a detailed list please see the final section.

General Moment Optimization Problems

The Standard Moment Problem Let g 1 , . . . , g n and h be given real-valued Borel measurable functions on a fixed measurable space X := (X, A). We would like to find the best upper and lower bound on Z h(t)(dt); (h) :D X

given that  is a probability measure on X with prescribed moments Z g i (t) (dt) D y i ; i D 1; : : : ; n: Here we assume  such that Z jg i j (dt) < C1;

i D 1; : : : ; n;

X

and Z

G

The next result comes from [22,23,25]. Theorem 1 Let f 1 , . . . , f N be given real-valued Borel measurable functions on a measurable space ˝ (such as g 1 , . . . , g n and h on X). Let  be a probability measure on ˝ such that each f i is integrable with respect to . Then there exists a probability measure 0 of finite support on ˝ (i. e., having nonzero mass only at a finite number of points) satisfying Z Z f i (t) (dt) D f i (t) 0 (dt); ˝

˝

all i = 1, . . . , N. One can even achieve that the support of 0 has at most N+ 1 points. So from now on we can talk only about finitely supported probability measures. Call V :D conv g(X)

jhj (dt) < C1: X

For each y := (y1 , . . . , yn ) 2 Rn , consider the optimal quantities L(y) :D L(yjh) :D inf (h);

U(y) :D U(yjh) :D sup (h);

where  is a probability measure as above with (g i ) D y i ;

i D 1; : : : ; n:

If there is no such probability measure  we set L(y) := + 1, U(y) :=  1. If h := S the characteristic function of a given measurable set S of X, then we agree to write L(yj S ) :D L S (y);

U(yj S ) :D U S (y):

Hence, LS (y)  (S)  U S (y). Consider g: X ! Rn such that g(t) := (g 1 (t), . . . , g n (t)). Set also g 0 (t) := 1, all t 2 X. Here we basically present J.H.B. Kemperman’s (1968) geometric methods for solving the above main moment problems [13] which were related to and motivated by [18,20,24]. The advantage of the geometric method is that many times is simple and immediate giving us the optimal quantities L, U in a closed-numerical form, on the top of this is very elegant. Here the -field A contains all subsets of X.

(conv stands for convex hull), where g(X) := {z 2 Rn : z = g(t) for some t 2 X} is a curve in Rn (if X = [a, b]  R or if X = [a, b] × [c, d]  R2 ). Let S  X, and let M + (S) denote the set of all probability measures on X whose support is finite and contained in S. The next results come from [13]. Lemma 2 Given y 2 Rn , then y 2 V if and only if 9 2 M + (X) such that (g) D y (i. e. (g i ) :=

R

X

g i (t) (dt) = yi , i = 1, . . . , n).

Hence L(y|h) < + 1 if and only if y 2 V (note that by Theorem 1, ˚ L(yjh) D inf (h) :  2 M C (X); (g) D y and ˚ U(yjh) D sup (h) :  2 M C (X); (g) D y ): Easily one can see that L(y) :D L(yjh) is a convex function on V, i. e. L(y0 C (1  )y00 )  L(y0 ) C (1  )L(y00 );

1245

1246

G

General Moment Optimization Problems

whenever 0    1 and y0 , y00 2 V. Also U(y) := U(y| h) =  L(y|  h) is a concave function on V. One can also prove that the following three properties are equivalent: i) int(V) := interior of V 6D ; ii) g(X) is not a subset of any hyperplane in Rn ; iii) 1, g 1 , . . . , g n are linearly independent on X. From now on we assume that 1, g 1 , . . . , g n are linearly independent, i. e. int(V) 6D . Let D denote the set of all (n + 1)-tuples of real numbers d := (d0 , . . . , dn ) satisfying h(t)  d0 C

n X

with g(t j ) 2 B(d  ); and m X

p j  0;

all

t 2 X:

(5)

Then L(yjh) D

d i g i (t);

p j D 1:

jD1

m X

p j h(t j ) D d0 C

n X

jD1

(1)

di yi :

(6)

iD1

iD1

Theorem 3 For each y 2 int (V) we have that (2)

L(yjh) ( D sup d0 C

n X

) 

d i y i : d D (d0 ; : : : ; d n ) 2 D



:

iD1

Given that L(y| h) >  1, the supremum in (2) is even assumed by some d 2 D . If L(y|h) is finite in int(V), then for almost all y 2 int(V) the supremum in (2) is assumed by a unique d 2 D . Thus L(y| h) < + 1 in int(V) if and only if D 6D ;. Note that y := (y1 , . . . , yn ) 2 int(V)  Rn if and only if P d0 + ni= 1 di yi > 0 for each choice of the real constants P di not all zero such that d0 + niD1 di g i (t)  0, all t 2 X. (The last statement comes from [8 p. 5] and [12 p. 573].) If h is bounded then D 6D ;, trivially. 

Theorem 5 Let y 2 int(V) be fixed. Then the following are equivalent: i) 9 2 M + (X) such that (g) = y and (h) = L(y|h), i. e. infimum is attained. ii) 9d 2 D satisfying (4). Furthermore for almost all y 2 int(V) there exists at most one d 2 D satisfying (4). In many situations the above infimum is not attained so that Theorem 4 is not applicable. The next theorem has more applications. For that, set (z) :D lim inf inf fh(t) : t 2 X; jg(t)  zj < ıg : (7) ı!0

t

If "  0 and d 2 D , define C" (d  ) ( :D

z 2 g(T) : 0  (z) 



) d i z i  " ; (8)

iD0

Theorem 4 Let d 2 D be fixed and set and

B(d  ) ( :D z D g(t) : d0 C

n X

) d i g i (t) D h(t); t 2 X

(3)

Then for each point y 2 conv B(d  ) the quantity L(y|h) is found as follows. Set m X jD1

p j g(t j )

G(d  ) :D

1 \

convC 1 (d  ): N

(9)

ND1

iD1

yD

n X

(4)

It is easily proved that C" (d ) and G(d ) are closed; furthermore B(d )  C0 (d )  C" (d ), where B(d ) is defined by (3). Theorem 6 Let y 2 int(V) be fixed. i) Let d 2 D be such that y 2 G(d ). Then L(yjh) D d0 C d1 y1 C    C d n y n :

(10)

General Moment Optimization Problems

ii) Assume that g is bounded. Then there exists d 2 D satisfying 

G

The Method of Optimal Ratio We would like to find



y 2 conv C0 (d )  G(d )

L S (y) :D inf (S)

and

and

L(yjh) D d0 C d1 y1 C    C d n y n :

(11)

iii) We further obtain, whether or not g is bounded, that for almost all y 2 int(V) there exists at most one d 2 D satisfying y 2 G(d ). The above results suggest the following practical simple geometric methods for finding L(y|h) and U(y|h), see [13]. The Method of Optimal Distance Call M :D conv t2X (g1 (t); : : : ; g n (t); h(t)): Then L(y|h) is equal to the smallest distance between (y1 , . . . , yn , 0) and (y1 ; : : : ; y n ; z) 2 M. Also U(y|h) is equal to the largest distance between (y1 , . . . , yn , 0) and (y1 ; : : : ; y n ; z) 2 M. Here, M stands for the closure of M. In particular we see that L(y|h) = inf{yn + 1 : (y1 , . . . , yn , yn + 1 ) 2 M} and U(yjh) D sup fy nC1 : (y1 ; : : : ; y n ; y nC1 ) 2 Mg :

(12)

Example 7 Let  denote probability measures on [0, a], a > 0. Fix 0 < d < a. Find Z t 2 (dt) L :D inf

and

[0;a]

Z

t 2 (dt)

U :D sup

[0;a]

subject to Z t (dt) D d: [0;a]

So consider the graph G := {(t, t 2 ): 0  t  a}. Call M :D conv G D conv G. A direct application of the optimal distance method here gives us L = d2 (an optimal measure  is supported at d with mass 1), and U = da (an optimal measure  here is supported at 0 and a with masses (1  d/a and d/a, respectively).

U S (y) :D sup (S); over all probability measures  such that (g i ) D y i ;

i D 1; : : : ; n:

Set S0 := X  S. Call WS :D convg(S), WS 0 :D convg(S 0 ) and W :D convg(X), where g := (g 1 , . . . , g n ). Finding LS (y). 1) Pick a boundary point z of W and ‘draw’ through z a hyperplane H of support to W. 2) Determine the hyperplane H 0 parallel to H which supports W S0 as well as possible, and on the same side as H supports W. 3) Denote A d :D W \ H D WS \ H and Bd :D WS 0 \ H 0 : Given that H 0 6D H, set Gd :D conv(A d [ Bd ). Then we have that L S (y) D

(y) ;

(13)

for each y 2 int(V) such that y 2 Gd . Here, (y) is the distance from y to H 0 and  is the distance between the distinct parallel hyperplanes H, H 0 . Finding U S (y). (Note that U S (y) = 1  LS0 (y).) 1) Pick a boundary point z of W S and ‘draw’ through z a hyperplane H of support to W S . Set Ad := W S \ H. 2) Determine the hyperplane H 0 parallel to H which supports g(X) and hence W as well as possible, and on the same side as H supports W S . We are interested only in H 0 6D H in which case H is between H 0 and W S . 3) Set Bd := W \ H 0 = W S0 \ H 0 . Let Gd as above. Then U S (y) D

(y) ;

(14)

for each y 2 int(V), where y 2 Gd , assuming that H and H 0 are distinct. Here, (y) and  are defined as above.

1247

1248

G

General Moment Optimization Problems

Examples here of calculating LS (y) and U S (y) tend to be more involved and complicated, however the applications are many.

let  := T denote the probability measure on R given by Z P(y; A) (d y): (A) :D (T)(A) :D R

The Convex Moment Problem

T is called a Markov transformation. Definition 8 Let s  1 be a fixed natural number and let x0 2 R be fixed. By ms (x0 ) we denote the set of probability measures  on R such that the associated cumulative distribution function F possesses an (s  1)th derivative F (s1) (x) over (x0 , +1) and furthermore (1)s F (s1) (x) is convex in (x0 , +1). Description of the Problem Let g i , i = 1, . . . , n; h are Borel measurable functions from R into itself. These are assumed to be locally integrable on [x0 , +1) relative to Lebesgue measure. Consider  2 ms (x0 ), s  1 such that Z (jg i j) :D

R

(15)

and Z (jhj) :D

jh(t)j (dt) < C1: R

(16)

Let c := (c1 , . . . , cn ) 2 Rn be such that (g i ) D c i ;

i D 1; : : : ; n;

 2 m s (x0 ):

(17)

We would like to find L(c) := inf  (h) and U(c) :D sup (h);

(18)



where  is as above described. Here, the method will be to transform the above convex moment problem into an ordinary one handled by the first section, see [14]. Definition 9 Consider here another copy of (R, B); B is the Borel -field, and further a given function P(y, A) on R × B. Assume that for each fixed y 2 R, P(y, ) is a probability measure on R, and for each fixed A 2 B, P(, A) is a Borel-measurable real-valued function on R. We call P a Markov kernel. For each probability measure  on R,

(19)

R Notice K s (u, x)  0 and R K s (u, x) = dx = 1, all u > x0 . Let ı u be the unit (Dirac) measure at u. Define 8 x0 : A

Then Z (T)(A) :D

jg i (t)j (dt) < C1; i D 1; : : : ; n

In particular: Define the kernel ( s(ux) s1 if x0 < x < u; (ux 0 ) s K s (u; x) :D 0 elsewhere:

R

Ps (u; A)( du)

(21)

is a Markov transformation. Theorem 10 Let x0 2 R and natural number s  1 be fixed. Then the Markov transformation (21)  = T defines a 1-1 correspondence between the set m of all probability measures  on R and the set ms (x0 ) of all probability measures  on R as in Definition 8. In fact T is a homeomorphism given that m and ms (x0 ) are endowed with the weak -topology. Let : R ! R be a bounded and continuous function. Introducing Z   (u) :D (T)(u) :D (x)  Ps (u; dx); (22) R

then Z

Z d D

  d:

(23)

Here   is a bounded and continuous function from R into itself. We obtain that 8 ˆ (u) if u  x0 ; ˆ ˆ x0 :

General Moment Optimization Problems

That is, g i , h are -integrable. Finally

In particular 1 (u  x0 )s   (u) s! Z u 1 D (u  x)s1 (x) dx: (s  1)! x 0

(25)

L(c) D inf (h  )

(28)



and

Especially, if r >  1 we get for (u) := (u  x0 )r that  1 rCs  (u  x0 )r , for all u > x0 . Here r ! :=  (u) D s 1 2    r and   (r C 1)    (r C s) rCs : :D s s! Solving the Convex Moment Problem Let T be the Markov transformation (21) as described above. For each  2 ms (x0 ) corresponds exactly one  2 m such that  = T. Call g i := Tg i , i = 1, . . . , n and h := Th. We have Z Z  g i d D g i d R

G

R

U(c) D sup (h  );

(29)



where  2 m (probability measure on R) such that (26) and (27) are true. Thus the convex moment problem is solved as a standard moment problem (see the first section). Remark 11 Here we restrict our probability measures on [0, + 1) and we consider the case x0 = 0. That is  2 ms (0), s  1, i. e. ( 1)s F (s  1) (x) is convex for all x > 0 but  ({0}) =  ({0}) can be positive,  2 m . We have Z u   (u) D su s  (u  x)s1  (x)  dx; (30) 0 u > 0: Further   (0) = (0), (  = T). Especially,

and Z R

h  d D

Z R

if

h d:

Notice that we get Z (g i ) :D g i d D c i ; R

(x) D x r

then i D 1; : : : ; n:

(26)

rCs  (u) D s

(31)

(32)

is also expressed as

and Z

 T jhj d < C1:

(27)

Since T is a positive linear operator we obtain |Tg i |  T|g i |, i = 1, . . . , n, and |Th|  T|h|, i. e. Z ˇ ˇ ˇ g ˇ d < C1; i D 1; : : : ; n; i R

R

 ur ;

0

R

and Z

1

(r  0): Hence the moment Z C1 x r ( dx) ˛r :D

From (15), (16) we get that Z T jg i j d < C1; i D 1; : : : ; n;

R





jh  j d < C1:

˛r D

rCs s

1  ˇr ;

(33)

where Z

C1

u r (du):

ˇr :D

(34)

0

Recall that T = , where  can be any probability measure on [0, + 1). Here we restrict our probability measures on [0, b], b > 0 and again we consider the case x0 = 0. Let  2

1249

1250

G

General Moment Optimization Problems

ms (0) and Z x r (dx) :D ˛r ;

Infinite Many Conditions Moment Problem (35)

[0;b]

where s  1, r > 0 are fixed. Also let  be a probability measure on [0, b] unre  rCs  ˛r , where stricted, i. e.  2 m . Then ˇr D s Z u r (du): (36) ˇr :D [0;b]

Let h: [0, b] ! R+ be an integrable function with respect to Lebesgue measure. Consider  2 ms (0) such that Z h d < C1: (37) [0;b]

i. e.

See also [16]. Definition 13 A finite nonnegative measure  on a compact and Hausdorff space S is said to be inner regular when (B) D sup f(K) : K B; K compactg

(40)

holds for each Borel subset B of S. Theorem 14 See [16]. Let S be a compact Hausdorff topological space and ai : S ! R(i 2 I) continuous functions (I is an index set of arbitrary cardinality), also let ˛ i (i 2 I) be an associated set of real constants. Call M 0 (S) the set of finite nonnegative inner regular measures  on S which satisfy the moment conditions Z (41) (a i ) D a i (s) (ds)  ˛ i ; all i 2 I: S

Z

h  d < C1;

 2 m :

(38)

[0;b]

Here h = Th,  = T and Z Z h d D h  d: [0;b]

[0;b]

i2I

Letting ˛ r be free, we have that the set of all possible (˛ r , (h)) = ((xr ), (h)) coincides with the set of all !  1 rCs  ˇr ; (h ) s !  1 rCs r   (u ); (h ) ; D s where  as in (37) and  as in (38), both probability measures on [0, b]. Hence, the set of all possible pairs (ˇ r , (h)) = (ˇ r , (h )) is precisely the convex hull of the curve  :D f(u r ; h  (u)) : 0  u  bg :

(39)

In order one to determine L(˛ r ) the infimum of all (h), where  is as in (35) and (37), one must determine the lowest point in this convex hull which is on the vertical through (ˇ r , 0). For U(˛ r ) the supremum of all (h),  as above, one must determine the highest point of above convex hull which is on the vertical through (ˇ r , 0). For more on the above see again §1.

Also consider the function b: S ! R which is continuous and assume that there exist numbers di  0 (i 2 I), all but finitely many equal to zero, and further a number q  0 such that X d i a i (s)  qb(s); all s 2 S: (42) 1 Finally assume that M 0 (S) 6D ; and call U0 (b) D sup f(b) :  2 M0 (S)g : R ((b) := S b(s) (ds)). Then U0 (b) D inf

(

X i2I

(43)

) c i  0; P ; c i ˛i : b(s)  i2I c i a i (s) all s 2 S (44)

here all but finitely many ci , i 2 I, are equal to zero. Moreover, U 0 (b) is finite and the above supremum is assumed. Remark 15 In general we have: let S be a fixed measurable space such that each 1-point set {s} is measurable. Further let M 0 (S) denote a fixed nonempty set of finite nonnegative measures on S. For f : S ! R a measurable function we denote L0 ( f ) :D L( f ; M0 (S)) Z :D inf f (s) (ds) :  2 M0 (S) : S

(45)

General Moment Optimization Problems

Then we have L0 ( f ) D U0 ( f ):

(46)

Now one can apply Theorem 14 in its setting to find L0 (f ). Applications and Discussion The above described moment theory optimization methods have a lot of applications in many sciences. To mention a few of them: physics, chemistry, statistics, stochastic processes and probability, functional analysis in mathematics, medicine, material science, etc. Optimization moment theory could be also considered the theoretical part of linear finite or semi-infinite programming (here we consider discretized finite nonnegative measures). The above described methods have in particular important applications: in the marginal moment problems and the related transportation problems, also in the quadratic moment problem, see [17]. Other important applications are in tomography, crystallography, queueing theory, rounding problem in political science, and martingale inequalities in probability. At last, but not least, optimization moment theory has important applications in estimating the speeds: of the convergence of a sequence of positive linear operators to the unit operator, and of the weak convergence of nonnegative finite measures to the unit-Dirac measure at a real number, for that and the solutions of many other important optimal moment problems please see [2]. Final Conclusion Optimization moment theory is a very active area of mathematical probability theory with a lot of applications in other subjects, and with a lot of researchers from around the world in it contributing new useful results, continuously during all of the 20th century. See also  Approximation of Extremum Problems with Probability Functionals  Approximation of Multivariate Probability Integrals  Discretely Distributed Stochastic Programs: Descent Directions and Efficient Points

G

 Extremum Problems with Probability Functions: Kernel Type Solution Methods  Logconcave Measures, Logconvexity  Logconcavity of Discrete Distributions  L-shaped Method for Two-stage Stochastic Programs with Recourse  Multistage Stochastic Programming: Barycentric Approximation  Preprocessing in Stochastic Programming  Probabilistic Constrained Linear Programming: Duality Theory  Probabilistic Constrained Problems: Convexity Theory  Simple Recourse Problem: Dual Method  Simple Recourse Problem: Primal Method  Stabilization of Cutting Plane Algorithms for Stochastic Linear Programming Problems  Static Stochastic Programming Models  Static Stochastic Programming Models: Conditional Expectations  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Stochastic Linear Programming: Decomposition and Cutting Planes  Stochastic Linear Programs with Recourse and Arbitrary Multivariate Distributions  Stochastic Network Problems: Massively Parallel Solution  Stochastic Programming: Minimax Approach  Stochastic Programming Models: Random Objective  Stochastic Programming: Nonanticipativity and Lagrange Multipliers  Stochastic Programming with Simple Integer Recourse  Stochastic Programs with Recourse: Upper Bounds  Stochastic Quasigradient Methods in Minimax Problems  Stochastic Vehicle Routing Problems  Two-stage Stochastic Programming: Quasigradient Method  Two-stage Stochastic Programs with Recourse

References 1. Akhiezer NI (1965) The classical moment problem. Hafner, New York

1251

1252

G

General Routing Problem

2. Anastassiou GA (1993) Moments in probability and approximation theory. Res Notes Math, vol 287. Pitman, Boston, MA 3. Anastassiou GA, Rachev ST (1992) How precise is the approximation of a random queue by means of deterministic queueing models. Comput Math Appl 24(8-9):229–246 4. Anastassiou GA, Rachev ST (1992) Moment problems and their applications to characterization of stochastic processes, queueing theory, and rounding problems. In: Anastassiou G (ed) Proc. 6th S.E.A. Meeting, Approximation Theory. 1–77 M. Dekker, New York 5. Benes V, Stepan J (eds) (1997) Distributions with given marginals and moment problems. Kluwer, Dordrecht 6. Isii K (1960) The extreme of probability determined by generalized moments (I): bounded random variables. Ann Inst Math Statist 12:119–133 7. Johnson NL, Rogers CA (1951) The moment problems for unimodal distributions. Ann Math Stat 22:433–439 8. Karlin S, Shapley LS (1953) Geometry of moment spaces. Memoirs, vol 12. Amer Math. Soc., Providence, RI 9. Karlin S, Studden WJ (1966) Tchebycheff systems: with applications in analysis and statistics. Interscience, New York 10. Kellerer HG (1964) Verteilungsfunktionen mit gegebenen Marginalverteilungen. Z Wahrscheinlichkeitsth Verw Gebiete 3:247–270 11. Kellerer HG (1984) Duality theorems for marginal problems. Z Wahrscheinlichkeitsth Verw Gebiete 67:399–432 12. Kemperman JHB (1965) On the sharpness of Tchebycheff type inequalities. Indagationes Mathematicae 27:554–601 13. Kemperman JHB (1968) The general moment problem, a geometric approach. Ann MathStat 39:93–122 14. Kemperman JHB (1971) Moment problems with convexity conditions. In: Rustagi JS (ed) Optimizing Methods in Statistics. Acad. Press, New York, pp 115–178 15. Kemperman JHB (1972) On a class of moment problems. Proc. Sixth Berkeley Symp. Math. Stat. Prob. 2, pp 101–126 16. Kemperman JHB (1983) On the role of duality in the theory of moments. In: Fiacco AV, Kortanek KO (eds) SemiInfinite Programming and Applications. of Lecture Notes Economics and Math Systems. Springer, Berlin, pp 63–92 17. Kemperman JHB (1987) Geometry of the moment problem. Moments in Math., of In: Short Course Ser, San Antonio, Texas, 1986, vol 34. Amer. Math. Soc., Providence, RI), pp 20–22 18. Krein MG (1959) The ideas of P.L. Cebysev and A.A. Markov in the theory of limiting values of integrals and their further development. AmerMathSoc Transl 2(12):1–121. ((1951) Uspekhi Mat Nauk 6:3–130) 19. Krein MG, Nudel’man AA (1977) The Markov moment problem and extremal problems. Amer. Math. Soc., Providence, RI 20. Markov A (1884) On certain applications of algebraic continued fractions. Thesis Univ St Petersburg 21. Mises R von (1939) The limits of a distribution function if two expected values are given. Ann Math Stat 10:99–104

22. Mulholland HP, Rogers CA (1958) Representation theorems for distribution functions. Proc London Math Soc 8:177–223 23. Richter H (1957) Parameterfreie Abschätzung und Realisierung von Erwartungswerten. Blätter Deutschen Gesellschaft Versicherungsmath 3:147–161 24. Riesz F (1911) Sur certaines systèmes singuliers d’équations intégrales. Ann Sci Ecole Norm Sup 28:33–62 25. Rogosinsky WW (1958) Moments of non-negative mass. Proc Royal Soc London Ser A 245:1–27 26. Rogosinsky WW (1962) Non-negative linear functionals, moment problems, and extremum problems in polynomial spaces. Stud. Math. Anal. and Related Topics. Stanford Univ Press, Palo Alto, CA, pp 316–324 27. Selberg HL (1940) Zwei Ungleichungen zur Ergänzung des Tchebycheffschen Lemmas. Skand Aktuarietidskrift 23:121–125 28. Shohat JA, Tamarkin JD (1983) The problem of moments. Math Surveys, vol 1. Amer. Math. Soc., Providence, RI 29. Shortt RM (1983) Strassen’s marginal problem in two or more dimensions. Z Wahrscheinlichkeitsth Verw Gebiete 64:313–325

General Routing Problem GRP RICHARD EGLESE, ADAM LETCHFORD Lancaster University, Lancaster, UK MSC2000: 90B20 Article Outline Keywords See also References Keywords Routing The general routing problem (GRP) is a routing problem defined on a graph or network where a minimum cost tour is to be found and where the route must include visiting certain required vertices and traversing certain required edges. More formally, given a connected, undirected graph G with vertex set V and (undirected) edge set E, a cost ce for traversing each edge e

General Routing Problem

2 E, a set V R V of required vertices and a set ER E of required edges, the GRP is the problem of finding a minimum cost vehicle route, starting and finishing at the same vertex, passing through each v 2 V R and each e 2 ER at least once ([13]). The GRP contains a number of other routing problems as special cases. When ER = ;, the GRP reduces to the Steiner graphical traveling salesman problem (SGTSP) ([4]), also called the road traveling salesman problem in [7]. On the other hand, when V R = ;, the GRP reduces to the rural postman problem (RPP) ([13]). When V R = V, the SGTSP in turn reduces to the graphical traveling salesman problem or GTSP ([4]). Similarly, when ER = E, the RPP reduces to the Chinese postman problem or CPP ([5,8]). The CPP can be solved optimally in polynomial time by reduction to a matching problem ([6]), but the RPP, GTSP, SGTSP and GRP are all NP-hard. This means that the computational effort to solve such a problem increases exponentially with the size of the problem. Therefore exact algorithms are only practical for a GRP if it is not too large, otherwise a heuristic algorithm is appropriate. The GRP was proved to be NPhard in [10]. In [3], an integer programming formulation of the GRP is given, along with several classes of valid inequalities which induce facets of the associated polyhedra under mild conditions. Another class of valid inequalities for the GRP is introduced in [11] and in [12] it is shown how to convert facets of the GTSP polyhedron into valid inequalities for the GRP polyhedron. These valid inequalities form the basis for a promising branch and cut style of algorithm described in [2] which can solve GRPs of moderate size to optimality. In [9], a heuristic algorithm for the GRP is described. The author adapts Christofides’ heuristic for the TSP to show that when the triangle inequality holds in the graph, the heuristic has a worst-case ratio of heuristic solution value to optimum value of 1.5. There are many vehicle routing applications of the GRP. In these cases, the edges of the graph are used to represent streets or roads and the vertices represent road junctions or particular locations on a map. In any practical application there are likely to be many additional constraints which must also be taken into account such as the capacity of the vehicles, time-window constraints for when the service may be carried out,

G

the existence of one-way streets and prohibited turns etc. Many applications are for the special cases when either ER = ; or V R = ;. However, there are some types of vehicle routing applications where the problem is most naturally modeled as a GRP with both required edges and required vertices. For example, in designing routes for solid waste collection services, collecting waste from all houses along a street could be modeled as a required edge and collecting waste from the foot of a multistory apartment block could be modeled as a required vertex. Other examples include postal delivery services where some customers with heavy demand might be modeled as required vertices, while other customers with homes in the same street might be modeled together as a required edge. School bus services are other examples of GRPs where a pick-up in a remote village could be modeled as a required vertex, but if the school bus must pick-up at some point along a street (and is not allowed to perform a U-turn in the street) then that may best be modeled as a required edge. Further details about solution methods and applications for various network routing problems can be found in [1].

See also  Stochastic Vehicle Routing Problems  Vehicle Routing  Vehicle Scheduling

References 1. Ball MO, Magnanti TL, Monma CL, Nemhauser GL (eds) (1995) Network routing. vol 8, Handbook Oper. Res. and Management Sci. North-Holland, Amsterdam 2. Corberáan A, Letchford AN, Sanchis JM (1998) A cuttingplane algorithm for the general routing problem. Working Paper 3. Corberáan A, Sanchis JM (1998) The general routing problem polyhedron: Facets from the RPP and GTSP polyhedra. Europ J Oper Res 108:538–550 4. Cornué;jols G, Fonlupt J, Naddef D (1985) The travelling salesman problem on a graph and some related integer polyhedra. Math Program 33:1–27 5. Edmonds J (1963) The Chinese postman problem. Oper Res 13:B73–B77 6. Edmonds J, Johnson EL (1973) Matchings, Euler tours and the Chinese postman. Math Program 5:88–124

1253

1254

G

Genetic Algorithms

7. Fleischmann B (1985) A cutting-plane procedure for the travelling salesman problem on a road network. Europ J Oper Res 21:307–317 8. Guan M (1962) Graphic programming using odd or even points. Chinese Math 1:237–277 9. Jansen K (1992) An approximation algorithm for the general routing problem. Inform Process Lett 41:333–339 10. Lenstra JK, Rinnooy Kan AHG (1976) On general routing problems. Networks 6:273–280 11. Letchford AN (1997) New inequalities for the general routing problem. Europ J Oper Res 96:317–322 12. Letchford AN (1999) The general routing polyhedron: A unifying framework. Europ J Oper Res 112:122–133 13. Orloff CS (1974) A fundamental problem in vehicle routing. Networks 4:35–64

Genetic Algorithms GA RICHARD S. JUDSON Genaissance Pharmaceuticals, New Haven, USA MSC2000: 92B05 Article Outline Keywords See also References Keywords Optimization; Genetic algorithms; Evolution; Stochastic global optimization; Population; Fitness; Crossover; Mutation; Binary encoding; Individual; Chromosome; Generation; Elitism; Premature convergence; Gray code; Random walk search; Roulette wheel procedure; Population size; Schema theorem; Schema; Local minimum; Selection; Evolution strategy Genetic algorithms (GAs) comprise a class of stochasticglobal optimization methods based on several strategies from biological evolution. The basic genetic algorithm was developed by J.H. Holland and his students ([5,6,7,8]), and was based on the observation that selection (either natural or artificial) can produce highly op-

timized individuals in a relatively short number of generations. This is true despite the fact that the space of all gene mutations through which a population must sort is astronomical. For instancethe genome of the yeast Saccharomyces cerevisiae, which is the simplest eukaryote, contains just over 6000 genes, each of which can occur in several mutant forms. Despite this, S. cerevisiae can reoptimize itself to survive and flourish in many new environments in a relatively short number of generations. This is equivalent to having a computer search for a near-optimal solution to a 6000dimensional problem where each of the 6000 variables can take on any one of a large number of values. The most important notion from natural systems that the GA employs is the use of a population of individuals which go through a selection step to produce offspring and pass on their genetic material.Optimality or fitness is measured by how many offspring an individual produces. A second notion is the use of crossover in which individuals share genetic information and pass the shared information onto their offspring. A third borrowing from nature is the idea of mutation, the consequence of which is that the transfer of genetic informationis prone to random errors. This helps maintain the level of genetic diversity in a population. The implementation of a simple GA (SGA) which uses these ideas is straightforward. The description that follows uses a binary encoding, but all of the ideas follow identically for integer or even real number encodings. The most important idea is that one works with a population of individuals which will interact through genetic operators to carry out an optimization process. An individual is specified by a chromosome C which is a bit string of length N c that can be decoded to give a set of N parameters xi which are the natural parameters for the optimization application. Each parameter xi is enP coded by ni bits so that Ni ni = N c . In what follows, chromosome and bit string are synonymous. A fitness function f (x1 , . . . , xN ), which is the function to be optimized, is used to rank the individual chromosomes. An initial population of N pop individuals is formed by choosing N pop bit strings at random, and evaluating each individual’s fitness. (Decode C ! (x1 , . . . , xN ), calculate f (x1 , . . . , xN ).)Subsequent generations are formed as follows. All parents (members of the current generation) are ranked by fitness and the highest fitness individual is placed directly into the next generation with

Genetic Algorithms

no change. (This step of keeping the most-fit individual intact is termed elitism and is a purely heuristic addition. It insures that good solutions to the problem at hand are not lost until better ones are found.) Next, pairs of parents are selected and their chromosomes are crossed over to form chromosomes of the remaining individuals in the next generation. A parent’s probability of being selected increases with its fitness. So for a minimization application, the parent with the current lowest value of f (x1 , . . . , xN ) has the highest chance of being selected for mating. Crossover consists of taking some subset of the bits from parent 1 and the complementary set of bits from parent 2 and combining them to form the chromosome of child 1. A childis simply a member of the next generation. The remaining bits from the two parents are combined to form the chromosome of child 2. Additionally, during replication there is a small probability of a bit flip or mutation in a chromosome. This serves primarily to maintain diversity and prevent premature convergence. Convergence occurs when the population becomes largely homogeneous – most individuals have almost the same values for all of their parameters. Premature convergence occurs when the population converges early in a run, before significant amount of searching has been performed. The most common cause is a poor choice of the scaling of the fitness function. It should be noted that ‘premature’ and ‘early’ are loosely defined. To bound the magnitude of the effect of mutations, the binary chromosomes are usually Gray coded. An integer that is represented as a Gray coded binary number has the property that most single bit flips change the value of the decimal integer represented by the chromosome by ˙1. In sum, the algorithm consists of successively transforming one generation of individuals into the next using the operations of selection, crossover and mutation. Since the selection process is biased towards individuals with higher fitness, individuals are produced that come ever closer to being optimal solutions to the function of interest. It is important to emphasize that crossover is the key feature that distinguishes the GA from other stochastic global search methods. If crossover is ineffective, GA degenerates into a random walk search being executed separately by each individual in the population. The random walk is generated by the mutation operator.

G

The GA is presented below as pseudocode:

PROCEDURE genetic algorithm() Initialize population; FOR (g = 1 to Ngen generations) DO FOR (i = 1 to Npop individuals) DO Evaluate fitness of individual i: f i (g): END FOR; Save best individual to population g + 1; FOR (i = 2 to Npop ) DO Select 2 individuals; Crossover: create 2 new individuals; Mutate the new individuals; Move new individuals to population g+1; END FOR; END FOR; END genetic algorithm; Pseudocode for the Simple Genetic Algorithm

Selection commonly uses a roulette wheel procedure. Each individual is assigned a slice of the unit circle proportional to its fitness (f (x1 , . . . , xN )).One then chooses pairs of random numbers to select the next two individuals to be mated. A typical crossover operator takes the chromosomes from apair of individuals and chooses a common cut point along them. One child gets the portion of the first parent’s chromosome to the left of thecut point, and the portion of the second parent’s chromosome to the right of the cut point. The chromosome of the second child is comprised of the remaining fragments of the two parent chromosomes. In the most common mutation operator each bit in the binary chromosome has an equal and low probability being flipped from 1 to 0 or vice versa.Many variants on these operators have been used. The important variables in the GA method are the population size, N pop , the total number of generations allowed, N gen , the number of bits used to represent a real variable, and the mutation rate. The total CPU time used in an optimization run is proportional to N pop × N gen × T(f ), where T(f ) is the time required to evaluate the fitness function f (x1 , . . . , xN ). This leads to a trade-off between having large, diverse populations that explore parameter space widely, and having smaller populations that explore longer. In practice, the choice is problem dependent.

1255

1256

G

Genetic Algorithms

The simple GA and a large number of variants have been successfullyused to find near-optimal solutions to many engineering and scientific applications. ([2,3,4,6,9,10,11]) Although much effort has gone into formally analyzing the GA to understand why it is so robust, the most important formal result is the Schema theorem ([6,7,8]). Schemata are strings made up of the characters 1, 0 and which is the ‘don’t care’ character. These schemata are building blocks out of which the strings representing individuals’ chromosomes can be constructed. For instance the string 11100 contains schema such as 111, 1100 and 1 10. The schema theorem provides a powerful statement about the behavior of schemata in a chromosome. Mathematically, it states m(H; g C 1)  m(H; g)

f (H) f

  ı(H) o(H) 1  pc  pm ; l 1 pm

(1)

where m(H, g) is the number of examples of a schema H that exist in the population at generation g; f (H) is the average fitness of chromosomes containing H; f is the average fitness of all chromosomes; pc is the probability that crossover will occur at a particular mating; pm is the probability that a particular bit will be mutated; l is the length of the chromosome; ı(H) is the length of the schema in bits; and o(H) is the order of the schema, defined to be the number of fixed (as opposed to don’t care) positions in the schema. The factors outside the brackets in (1) indicate that a particular schema will increase its representation in the population at a rate proportional to its fitness relative to the average fitness. Good schemata will increase their representation exponentially and bad schemata will decrease their representation likewise. The terms inside the bracket serve to decrease this exponential convergence by disrupting the selection-based pressure. Both crossover and mutation can disrupt good schemata. The longer a schema is, the more likely it is to be disrupted by crossover, and disappear from the population. In the same fashion, schemata with many fixed positions are more likely to be disrupted by mutations. The competition between selection which drives the population towards convergence on a good solution and crossover and mutation which drive the population towards more diverse states are the keys to the GA. Crossover is especially important for keeping the

method from being trapped in local minima. One consequence of the parameter shuffling brought about by the crossover operator is that the GA is most efficient at optimizing functions that are at least partially separable. One individual can find a state where half of the parameters of the fitness function are optimized and a second individual can find a state where the other half are optimized. If these individuals crossover at the correct point, one of theirchildren will have the parameter values that globally optimize the function. As with most other heuristic global optimization methods, no definitive statements can be made about the global optimality of GA-generated solutions. A family of algorithms that are very similar to the GA, called evolution strategies were developed independently and virtually simultaneously in Germany by I. Rechenberg ([1,12]). See also  Adaptive Simulated Annealing and its Application to Protein Folding  Genetic Algorithms for Protein Structure Prediction  Global Optimization in Lennard–Jones and Morse Clusters  Global Optimization in Protein Folding  Molecular Structure Determination: Convex Global Underestimation  Monte-Carlo Simulated Annealing in Protein Folding  Multiple Minima Problem in Protein Folding: ˛BB Global Optimization Approach  Packet Annealing  Phase Problem in X-ray Crystallography: Shake and Bake Approach  Protein Folding: Generalized-ensemble Algorithms  Simulated Annealing  Simulated Annealing Methods in Protein Folding References 1. Bäck T, Schwefel H-P (1993) An overview of evolutionalgorithms for parameter optimization. Evolutionary Computation 1(1) 2. Belew RK, Booker LB (eds) (1991) Proc. fourth Internat. Conf. Genetic Algorithms. Morgan Kaufmann, San Mateo 3. Davis L (ed) (1987) Genetic algorithms and simulated annealing. Pitman, Boston

Genetic Algorithms for Protein Structure Prediction

4. Davis L (1991) Handbook of genetic algorithms. v. Nostrand Reinhold, Princeton, NJ 5. DeJong K (1976) An analysis of the behavior of a class of genetic adaptive systems. PhD Thesis Univ. Michigan 6. Goldberg D (1989) Genetic algorithms in search, optimization and learning. Addison-Wesley, Reading 7. Holland JH (1992) Adaptation in natural and artificial systems. MIT, Cambridge 8. Holland JH (1992) Genetic algorithms. Scientif Amer 267:66 9. Judson RS (1997) Genetic algorithms and their use in chemistry. In: Lipkowitz KB, Boyd DB (eds) Rev. Computational Chemistry, vol 10. Wiley-VCH, Weinheim, pp 1–73 10. Koza J (1992) Genetic programming. MIT, Cambridge 11. Rawlins GJE (1991) Foundations of genetic algoritms. Morgan Kaufmann, San Mateo 12. Rechenberg I (1973) Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart-Bad Cannstatt

Genetic Algorithms for Protein Structure Prediction RICHARD S. JUDSON Genaissance Pharmaceuticals, New Haven, USA MSC2000: 92B05 Article Outline Keywords See also References Keywords Optimization; Evolution; Protein structure; Amino acid; Active site; Free energy; Conformation; Configuration; Primary structure; Tertiary structure; Cartesian coordinates; Internal coordinates; Bond distance; Bond angle; Dihedral angle; Rotamer library; Rotamer; Empirical potential; Nonbonded distance; Secondary structure Genetic algorithms (GAs; cf. also  Genetic algorithms) have been used for a large number of modeling applications in chemical and biological fields [5,9]. At least three factors contribute to this. First, GAs provide an easy-to-use global search and optimization approach. Second, they can easily handle noncontinuous functions. Finally, they are relatively robust even

G

for moderately high-dimensional problems. All of these have contributed to the use of the GA for the important but computationally demanding field of protein structure prediction. Proteins carry out a wide variety of functions in living cells, almost all of which require that the protein molecules assume precise 3-dimensional shapes [2,3]. Enzymes are typical examples. They generally consist of a large structure of 100–300 amino acids stabilizing a small active site which is designed to carry out a specific chemical reaction such as cleaving a bond in a target molecule. Even slight changes in the structure of the active site can destroy the protein’s ability to function. Many drugs act by fitting snugly into enzymes’ active sites, causing them to shut down. Therefore, a detailed understanding of the 3-dimensional structure of a protein can enhance our understanding of its function. This can in turn help understand related disease processes and can finally lead to disease cures. Unfortunately the experimental determination of protein structures, using x-ray crystallography or solution NMR is very difficult. Currently the structures of only a few thousand of the estimated 100,000 proteins that are used by the human body have been determined this way. The alternative is to predict the structures computationally. The basic computational approach is simple to state, although many details have yet to be worked out. It relies on the experimental fact that a protein in solution (as well as any other molecule) will tend to find a state of low free energy. Free energy accounts for the internal energy (potential plus kinetic) of single molecules as well as the entropy of the ensemble of molecules of the same type. At absolute zero, the entropy contribution to the energy, as well as the kinetic energy, go to zero, leaving only the potential energy. Therefore, the most likely shape or state of a protein at absolute zero is the one of lowest potential energy. The simplest computational model then needs a method to search the space of conformations and an energy function (approximating the physical potential energy) which is minimized during the search. (A protein’s conformation is the description of the 3-dimensional positions of all of the atoms for a fixed set of atoms and atom-atom connections. The configuration describes the atom-atom connectivity and only changes through chemical bond forming or breaking.) The conformation which yields the lowest

1257

1258

G

Genetic Algorithms for Protein Structure Prediction

value of the energy function is a best estimate of conformation of the natural protein. It is possible to extend this simple model to include the effects of finite temperature, but these extensions are beyond the scope of this article. In-depth discussions of molecular modeling, including energy functions for proteins and other molecules can be found in [6,8,10], and [1]. Because proteins possess many degrees of freedom, and the energy functions have many local minima, global optimization methods that search efficiently and are not prone to being caught in local minima are required. The GA is often used because it fits both of these criteria. Proteins [2] are long linear polymers composed of well-conserved sequences of the 20 amino acids. Each amino acid is in turn made up of a backbone R j  (NH  C˛

 CO)



where R stands for one of the 20 side groups that make the amino acids unique. These range from a single hydrogen atom to chains having many degrees of freedom. The primary structure of the protein is simply the sequence of amino acids. For many naturally occurring proteins, this sequence carries sufficient information to determine the final 3-dimensional or tertiary structure of the protein. Experimentally, proteins that have been denatured (caused to unfold by heating the solution or changing its chemical composition) will spontaneously refold to their active, or native conformation, when the solution is returned to its original state. There are two sets of coordinates often used for specifying the conformation of a protein. The first are the standard Cartesian coordinates for each atom. For N atoms, this requires 3N  6 numbers. The alternative is to use internal coordinates which are the bond distances (distances between atoms bound together), the bond angles (angles formed by a given atom and two atoms bound to it), and the dihedral angles (the angle of rotation about a center bond for a set of 4 atoms bound as A  B  C  D). To a good first approximation, the bond distances and bond angles are fixed at values that are independent of the particular amino acid or protein. Therefore, the conformation of a protein is determined largely by the values of its dihedral angles. There are on average about 15 atoms and about 3 dihedrals per

amino acid, requiring about N/5 degrees of freedom to describe the conformation of an N-atom protein. The dimension of conformation space for a moderate-size protein of 100 amino acids ( 1500 atoms) is 4500 when using Cartesian coordinates vs. 300 when using internal coordinates with fixed bond distances and angles. In many protein structure prediction applications, the simple GA approach is used. For each generation, one calculates the fitness (energy) of each individual in the population, selects pairs of individuals based on their energy, performs crossover and mutation. The GA chromosome directly codes for the values of the dihedral angles. Both binary encoded and real number encoded chromosomes have been used with equal success. For binary encoded dihedrals, one must decide on the resolution of the GA search. The maximum one would use is 10 bits per angle which gives a resolution of about 1/3 degree. Often as few as 5 or 6 bits will be sufficient, especially if the GA-generated conformations will be subjected to local gradient minimization. For each GA individual, the chromosome is decoded to give the values of the dihedrals which are passed to the energy function. This in turn returns an energy which is used as the fitness for the subsequent selection process. Another encoding scheme that is often used is based on the idea of a rotamer library. It is known from studying the set of experimentally known structures that the dihedral angles in many amino acid side chains take on restricted sets of values. Also, the values of several neighboring dihedrals are often correlated. It has then been possible to develop libraries of preferred sidechain conformations (called rotamers) for each amino acid. This can be incorporated into the GA by having each word in the chromosome simply determine which of a set of rotamers to use for each amino acid in the sequence. The use of rotamer libraries in the GA framework is illustrated in references [7,12,13,14], and [11]. The other major ingredient needed for a protein structure prediction method is an energy function to be minimized. This is a huge area of research which is beyond the scope of this article, but two major approaches will be summarized. The first scheme uses physicsbased empirical potentials. These are functions of the bond distances, bond angles, dihedral angles, and nonbonded distances (distances between atoms not directly

Genetic Algorithms for Protein Structure Prediction

bound together). The functional forms are derived from the results of accurate but computationally expensive quantum mechanical calculations that are performed on small molecular fragments such as individual amino acids. The results are fitted to simple functions with several free parameters. The parameter values are either taken from the original quantum calculations or from independent spectroscopic experiments. Various methods are used to approximate the effect of the water and salt environment around the protein. The advantage of these potentials is that they are continuous and very general. They can be constructed for any protein and give reasonable energies for any conformation requested. The disadvantage is that they are not yet sufficiently accurate to give reliable structure predictions. For many if not all of the proteins whose structure is known, there are conformations that have much lower calculated energy than that of the experimental conformation. The second approach is to use potentials based on observations of known protein structures. Basically, more probable conformations (ones that look more like real proteins) will have lower energy values. For instance certain sequences of amino acids almost always assume a particular secondary structure. The secondary structure of a protein describes the presence of multiamino acid helices, sheets and turns but not the exact placement of the atoms in the secondary structure elements or the spatial orientation of these elements. These potentials have the advantage that they build on our observations of proteins as entire molecules and incorporate long-range order. As with the empirical potentials, though, they suffer from accuracy problems. However, except for very small proteins (less than 20 amino acids) the structure-based potentials show the most promise. A common feature of GA-based protein structure prediction methods is the use of hybrid approaches combining standard GA with a local search method. The GA is then used primarily to perform an efficient global search which is biased towards regions of conformation space with low energy. This is a pragmatic approach driven by the large number of degrees of freedom even when internal coordinates are used. A simple and often used approach [5] is to subject GA-generated conformations to gradient minimization. Another approach is to use a population of individuals which carry

G

out independent Monte-Carlo or simulated annealing walks (cf. also  Simulated annealing methods in protein folding;  Monte-Carlo simulated annealing in protein folding) for a number of steps and then undergo selection, crossover and mutation [4,15,16]. See also  Adaptive Simulated Annealing and its Application to Protein Folding  Bayesian Global Optimization  Genetic Algorithms  Global Optimization Based on Statistical Models  Monte-Carlo Simulated Annealing in Protein Folding  Packet Annealing  Random Search Methods  Simulated Annealing Methods in Protein Folding  Stochastic Global Optimization: Stopping Rules  Stochastic Global Optimization: Two-phase Methods References 1. Allen MP, Tildesley DJ (1996) Computer simulation of liquids. Oxford Sci. Publ., Oxford 2. Branden C, Tooze J (1991) Introduction to protein structure. Garland Publ., Oxford 3. Creighton TE (1993) Proteins: structure and molecular properties. Freeman, New York 4. Friesner JR, Gunn A, Monge RA, Marshall GH (1994) Hierarchical algorithms for computer modeling of protein tertiary structure: folding of myoglobin to 6.2Å resolution. J Phys Chem 98:702 5. Judson RS (1997) Genetic algorithms and their use in chemistry. In: Lipkowitz KB, Boyd DB (eds) Rev. Computational Chemistry, vol 10. Wiley-VCH, Weinheim, pp 1–73 6. Karplus CL, Brooks M, Pettitt BM (1988) Proteins: A theoretical perspective of dynamics, structure and thermodynamics. Wiley/Interscience, New York 7. LeGrand S, Merz K (1993) The application of the genetic algorithm to the minimization of potential energy functions. J Global Optim 3:49 8. McCammon JA, Harvey S (1987) Dynamics of proteins and nucleic acids. Cambridge Univ. Press, Cambridge 9. Pedersen J, Moult J (1996) Genetic algorithms for protein structure prediction. Curr Opin Struct Biol 227–231 10. Rapaport DC (1995) The art of molecular dynamics simulation. Cambridge Univ. Press, Cambridge 11. Ring CS, Cohen FE (1994) Conformational sampling of loop structures using genetic algorithms. Israel J Chem 34:245

1259

1260

G

Geometric Programming

12. Sun S (1993) Reduced representation model of, protein structure prediction: statistical potential and genetic algorithms. Protein Sci 2:762 13. Tuffery P, Etchebest C, Hazout S, Lavery R (1991) A new approach to the rapid determiniation of protein side chain conformations. J Biomol Struct Dynam 8:1267 14. Tuffery P, Etchebest C, Hazout S, Lavery R (1993) A critical comparison of search algorithms applied to the optimization of protein side chain conformations. J Comput Chem 14:790 15. Unger R, Moult J (1993) Effects of mutations on the performance of genetic algorithms suitable for protein folding simulations. In: Tanaka M, Doyoma M, Kihara J, Yamamoto R (eds) Computer-Aided Innovation In New Materials. Elsevier, Amsterdam, pp 1283–1286 16. Unger R, Moult J (1993) Genetic algorithms for protein folding simulations. J Mol Biol 231:638

Geometric Programming YANJUN WANG Department of Applied Mathematics, Shanghai University of Finance and Economics, Shanghai, China MSC2000: 90C28, 90C30 Article Outline Keywords and Phrases Introduction Formulation Methods and Applications Transformation Linear Relaxation Programming Branch-and-Bound Algorithm Algorithm Statement

Applications References Keywords and Phrases Generalized geometric programming; Global optimization; Linear relaxation programming; Branch and bound Introduction Geometric programming is an important class of nonlinear optimization problems. Their source dates back to the 1960s when Zener began to study a special type

of minimization cost problem for design in engineering, now known as geometric programming. The term geometric programming is adopted because of the crucial role that the arithmetic-geometric mean inequality plays in its initial development. Actually, the early work in geometric programming was, for the most part, concerned with minimizing posynomial functions subject to inequality constraints on such functions, which was called posynomial geometric programming. In the past decade, because a number of models abstracted from application fields were not posynomial geometric programming, the theory had to be generalized to a much broader class of optimization problems called generalized geometric programming, which has spawned a wide variety of applications since its initial development. Its great impact has been in the areas of (1) engineering design [1,4,10,11]; (2) economics and statistics [2,3,6,9]; (3) manufacturing [8,17]; (4) chemical equilibrium [13,16]. Reference [19] focuses on solutions for generalized geometric programming. Formulation [19] provides a global optimization algorithm for the generalized geometric programming (GGP) problem stated as: 8 ˆ min G0 (x) ˆ ˆ ˆ 0; i D 1; : : : ; N, then the function Gm (x) is called a posynomial. Note that if we set ımt D C1 for all m D 0; 1; : : : ; M; t D 1; : : : ; Tm and ım D C1 for all m D 1; : : : ; M, then the GGP formulation reduces to the classical posynomial geometric programming (PGP) formulation that laid the foundation for the theory of the GGP problem.

Geometric Programming

Local optimization approaches for solving the GGP problem include three kinds of methods in general. First, successive approximation by posynomials, called “condensation,” is the most popular [14]. Second, Passy and Wilde [15] developed a weaker type of duality, called “pseudo-duality,” to accommodate this class of nonlinear optimization. Third, some nonlinear programming methods are adopted to solve the GGP problem based on exploiting the characteristics of the GGP problem [12]. Though local optimization methods for solving the GGP problem are ubiquitous, global optimization algorithms based on the characteristics of the GGP problem are scarce. Maranas and Floudas [13] proposed such a global optimization algorithm based on the exponential variable transformation of GGP, the convex relaxation, and branch and bound on some hyperrectangle region. Reference [19] proposes a branch-andbound optimization algorithm that solves a sequence of linear relaxations over partitioned subsets in order to find a global solution, and to generate the linear relaxation of each subproblem and to ensure convergence to a global solution, special strategies have been applied. (1) The equivalent reverse convex programming (RCP) formulation is considered. (2) A linear relaxation method for the RCP problem is proposed based on the arithmetic-geometric mean inequality and the linear upper bound of the reverse convex constraints; this method is more convenient with respect to computation than the convex relaxation method [13]. (3) A bound tightening method is developed that will enhance the solution procedure, and, based on this method, a branch-and-bound algorithm is proposed. Methods and Applications Transformation In [5], Duffin and Peterson show that any GGP problem can be transformed into the following reverse posynomial geometric programming (RPGP): 8 ˆ min ˆ ˆ ˆ ˆ ˆ ˆ 1, or for some m 2 fp C 1; : : : ; qg, g m (z) < 1, then the node indices q(s):w will be eliminated. If ˝ q(s):w (w D 1; 2) are all eliminated, then go to step 5.

step 3: (Updating upper bound) For undeleted subhyperrectangle update L U ; Ymt : A mt ; B mt ; Ymt

Solve LRP(˝ q(s):w ), where w D 1 or w D 2 or w D 1; 2, and denote the solutions and optimal values (ˆz(˝ q(s):w ); LB q(s):w ). Then if zˆ(˝ q(s):w ) is feasible for RCP, U  D minfU  ; LB q(s):w g. step 4: (Deleting step) If LB q(s):w > U  C ı, then delete the corresponding node; step 5: (Fathoming step) Fathom any nonimproving nodes by setting QsC1 D Qs fq 2 Qs : LB q  U  ıg. If QsC1 D ;, then stop, and exp(U  ) is the optimal value, z () (where  2 0 ) are the global solutions, where 0 D f : z0 () D U  g. Otherwise, s D s C 1; step 6: (Node-selection step) Set LB(s) D minfLB q : q 2 Qs g, then select an active node q(s) 2 arg minfLB(s)g for further considering;

Geometric Programming

step 7: (Bound tightening step) If in this node q(s), zˆ(˝ q(s) ) is feasible in all convex constraints of RCP, then return to step 1, else the BTM technique will be adopted, and then return to step 1. Theorem 1 (convergence result) The above algorithm either terminates finitely with the incumbent solution being optimal to RCP or it generates an infinite sequence of iterations such that along any infinite branch of the branch-and-bound tree, any accumulation point of the sequence LB(s) will be the global minimum of the RCP problem. Proof A sufficient condition for a global optimization to be convergent to the global minimum, stated in Horst and Tuy [7], requires that the bounding operation be consistent and the selection operation bound improving. A bounding operation is called consistent if at every step any unfathomed partition can be further refined and if any infinitely decreasing sequence of successively refined partition elements satisfies: lim (U   LB(s)) D 0 ;

s!C1

(3)

where LB(s) is a lower bound inside some subhyperrectangle in stage s and U * is the best upper bound at iteration s, not necessarily occurring inside the above same subhyperrectangle. In the following we will demonstrate that (3) holds. Since the employed subdivision process is the bisection, the process is exhaustive. Consequently, from the discussion in [13] (3) holds, and this means that the employed bounding operation is consistent. A selection operation is called bound improving if at least one partition element where the actual lower bound is attained is selected for further partition after a finite number of refinements. Clearly, the employed selection operation is bound improving because the partition element where the actual lower bound is attained is selected for further partition in the immediately following iteration. In summary, it is shown that the bounding operation is consistent and that the selection operation is bound improving; therefore, according to Theorem IV.3. in Horst and Tuy [7], the employed global optimization algorithm is convergent to the global minimum. 

G

Applications Reference [19] reports the numerical experiment for the deterministic global optimization algorithm described above to demonstrate its potential and feasibility. The experiment is carried out with the C programming language. The simplex method is applied to solve the linear relaxation programming problems. To illustrate how the proposed algorithm works, first [19] gives a simple example to show the solving procedure of the proposed algorithm. Example 1: 8 ˆ min x12 C x22 ˆ ˆ ˆ 0), in such a manner that the algorithm will find the best solution from set b S almost for sure when generating solutions with temperature parameter K . However, there is no need to provide a separate cooling schedule for each problem solved. Simple scaling of the cost function ( f 0 (x) D C  f (x), C > 0) can make one temperature schedule suitable for a wide range of problems from the same class. The choice of scaling factor can be made, for example, in the initial stage of the algorithm, when  D 0. Additionally, if we multiply the denominator and numerator of (4) by exp( f (x  )), where x  is the best solution from b S, then the convergence to the b best solution from S is less dependent on the absolute values of solution costs. The general scheme of the GES method is presented in Fig. 1. There are some elements that are included in the scheme, but that were not discussed above: elite solutions set, prohibition of certain solutions and restarting the search. These elements are not necessary for success of the GES method and can be easily excluded. However, for some classes of problems they can provide a significant performance improvement. The main cycle (lines 2–36) is repeated until some stopping criterion is satisfied. The algorithm execution can be terminated when the best known record for the given problem is improved, or when the running time exceeds some limiting value. If the set of known solutions S˜ is empty, then the initialization of the data set is performed in lines 3–7. The cycle in lines 9–28 is executed until there is no improvement in nfail consecutive cycles. The main element of the GES method is the temperature cycle (lines 11–23). The probabilities that guide the search are estimated using expression (4) at the beginning of each temperature stage (line 12). For each probability vector, ngen solutions are generated (lines 13–22). These solutions are used as initial solutions for the local search procedure (line 15). The subset of encountered solutions R is used to update set b S (line 16). Some set of the solutions can be stored in memory, in order to provide a fast initialization of the algorithm’s memory structures (lines 27 and 34). Such a set is referred to as an elite set in the algorithm pseudocode. Certain solutions can be excluded from this set to avoid searching the same areas multiple times. In lines 29 and 30, the solutions for which the Hamming distance to

G

xbest is less than parameter dp are excluded from the elite set. A number of successful applications of the GES method have been reported in recent years [6]. The application of the GES method for the multidimensional knapsack problem is described in [8]. The GES based method was presented in [5] for solving job-shop scheduling problems. To date, suitable exact solution methods are not able to find highquality solutions with reasonable computational effort for the problems involving more than ten jobs and ten machines. The computational testing of the GES algorithm provided a set of new upper bounds for a wide set of challenging benchmark problems [2]. The comparison with existing techniques for job-shop scheduling asserts that the GES method has a great potential for solving scheduling problems. The application of GES for the unconstrained quadratic programming problem was discussed in [4], where GES was used in a combination with a tabu algorithm. Such an ensemble proved to be an extremely efficient tool for large-scale problems, outperforming some of the best available solution techniques. In conclusion, the universality of the GES method together with its flexibility make it an optimization tool worth considering. References 1. Glover F, Laguna M (1993) Tabu search in Modern Heuristic Techniques for Combinatorial Problems. In: Reeves C (ed). Blackwell, Oxford, pp 70–141 2. Job Shop Scheduling webpage, http://plaza.ufl.edu/shylo/ jobshopinfo.html. Accessed 14 Oct 2007 3. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by Simulated Annealing. Science 220(4598):671–680 4. Pardalos PM, Prokopyev OA, Shylo OV, Shylo VP (2007) Global equilibrium search applied to the unconstrained binary quadratic optimization problem. Optim Meth Softw. doi:10.1080/10556780701550083 5. Pardalos PM, Shylo OV (2006) An algortihm for the Job Shop Scheduling based on Global Equilibrium Search Techniques. Comput Manag Sci 3(4):331–348 6. Sergienko IV, Shylo VP (2006) Problems of discrete optimization: Challenges and main approaches to solve them. Cybernet Syst Anal 42:465–482 7. Shylo VP (1999) A global equilibrium search method. Kybernetika i Systemnuiy Analys 1:74–80 (in Russian) 8. Shylo VP (2000) Solution of multidimensional knapsack problems by global equilibrium search, Theory of Optimal Solutions. Glushkov VM Inst Cybern, NAS Ukraine Kiev, p 10

1271

1272

G

Globally Convergent Homotopy Methods

Globally Convergent Homotopy Methods LAYNE T. WATSON Virginia Polytechnic Institute and State University, Virginia, USA MSC2000: 65F10, 65F50, 65H10, 65K10 Article Outline Keywords Probability-One Globally Convergent Homotopies Optimization Homotopies Software See also References Keywords Continuation; Globally convergent; Homotopy; Nonlinear equations; Probability-one homotopy Probability-one homotopy methods are a class of algorithms for solving nonlinear systems of equations that are accurate, robust, and converge from an arbitrary starting point almost surely. These new globally convergent homotopy techniques have been successfully applied to solve Brouwer fixed point problems, polynomial systems of equations, constrained and unconstrained optimization problems, discretizations of nonlinear two-point boundary value problems based on shooting, finite differences, collocation, and finite elements, and finite difference, collocation, and Galerkin approximations to nonlinear partial differential equations. Probability-One Globally Convergent Homotopies A homotopy is a continuous map from the interval [0, 1] into a function space, where the continuity is with respect to the topology of the function space. Intuitively, a homotopy () continuously deforms the function (0) = g into the function (1) = f as  goes from 0 to 1. In this case, f and g are said to be homotopic. Homotopy maps are fundamental tools in topology, and provide

a powerful mechanism for defining equivalence classes of functions. Homotopies provide a mathematical formalism for describing an old procedure in numerical analysis, variously known as continuation, incremental loading, and embedding. The continuation procedure for solving a nonlinear system of equations f (x) = 0 starts with a (generally simpler) problem g(x) = 0 whose solution x0 is known. The continuation procedure is to track the set of zeros of (; x) D  f (x) C (1  )g(x)

(1)

as  is increased monotonically from 0 to 1, starting at the known initial point (0, x0 ) satisfying (0, x0 ) = 0. Each step of this tracking process is done by starting at a point (e ;e x) on the zero set of , fixing some  > 0, and then solving (e  C ; x) D 0 for x using a locally convergent iterative procedure, which requires an in C ; x). The process vertible Jacobian matrix D x (e stops at  = 1, since f (x) D (1; x) D 0 gives a zero x of f (x). Note that continuation assumes that the zeros of  connect the zero x0 of g to a zero x of f , and that the Jacobian matrix Dx (, x) is invertible along the zero set of ; these are strong assumptions, which are frequently not satisfied in practice. Continuation can fail because the curve  of zeros of (, x) emanating from (0, x0 ) may: 1) have turning points, 2) bifurcate, 3) fail to exist at some  values, or 4) wander off to infinity without reaching  = 1. Turning points and bifurcation correspond to singular Dx (, x). Generalizations of continuation known as homotopy methods attempt to deal with cases 1) and 2) and allow tracking of  to continue through singularities. In particular, continuation monotonically increases , whereas homotopy methods permit  to both increase and decrease along  . Homotopy methods can also fail via cases 3) or 4). The map (, x) connects the functions g(x) and f (x), hence the use of the word ‘homotopy’. In general the homotopy map (, x) need not be a simple convex combination of g and f as in (1), and can involve  nonlinearly. Sometimes  is a physical parameter in the original problem f (x; ) = 0, where  = 1 is the (nondimensionalized) value of interest, although ‘artificial parameter’ homotopies are generally more computation-

Globally Convergent Homotopy Methods

G

ally efficient than ‘natural parameter’ homotopies (, x) = f (x; ). An example of an artificial parameter homotopy map is (; x) D  f (x; ) C (1  )(x  a);

(2)

which satisfies (0, a) = 0. The name ‘artificial’ reflects the fact that solutions to (, x) = 0 have no physical interpretation for  < 1. Note that (, x) in (2) has a unique zero x = a at  = 0, regardless of the structure of f (x; ). All four shortcomings of continuation and homotopy methods have been overcome by probability-one homotopies, proposed in 1976 by S.N. Chow, J. MalletParet, and J.A. Yorke [2]. The supporting theory, based on differential geometry, will be reformulated in less technical jargon here. Definition 1 Let U  Rm and V  Rp be open sets, and let : U×[0, 1)×V ! Rp be a C2 map.  is said to be transversal to zero if the p×(m+1+p) Jacobian matrix D has full rank on 1 (0). The C2 requirement is technical, and part of the definition of transversality. The basis for the probability-one homotopy theory is the parametrized Sard’s theorem, [2]: Theorem 2 Let : U × [0, 1) ×V ! Rp be a C2 map. If  is transversal to zero, then for almost all a 2 U the map  a (; x) D (a; ; x) is also transversal to zero. To discuss the importance of this theorem, take U = Rm , V = Rp , and suppose that the C2 map : Rm × [0, 1) × Rp ! Rp is transversal to zero. A straightforward application of the implicit function theorem yields that for almost all a 2 Rm , the zero set of a consists of smooth, nonintersecting curves which either: 1) are closed loops lying entirely in (0, 1) × Rp , 2) have both endpoints in {0} × Rp , 3) have both endpoints in {1} × Rp , 4) are unbounded with one endpoint in either {0} × Rp or in {1} × Rp , or 5) have one endpoint in {0} × Rp and the other in {1} × Rp . Furthermore, for almost all a 2 Rm , the Jacobian matrix Da has full rank at every point in 1 a (0). The goal is to

Globally Convergent Homotopy Methods, Figure 1 Zero set for a (, x) satisfying properties 1)–4)

construct a map a whose zero set has an endpoint in {0} × Rp , and which rules out 2) and 4). Then 5) obtains, and a zero curve starting at (0, x0 ) is guaranteed to reach a point (1; x). All of this holds for almost all a 2 Rm , and hence with probability one [2]. Furthermore, since a 2 Rm can be almost any point (and, indirectly, so can the starting point x0 ), an algorithm based on tracking the zero curve in 5) is legitimately called globally convergent. This discussion is summarized in the following theorem (and illustrated in Fig. 1). Theorem 3 Let f : Rp ! Rp be a C2 map, : Rm ×[0, 1)× Rp ! Rp a C2 map, and a (, x) = (a, , x). Suppose that 1)  is transversal to zero. Suppose also that for each fixed a 2 Rm , 2) a (0, x) = 0 has a unique nonsingular solution x0 , 3) a (1, x) = f (x) (x 2 Rp ). Then, for almost all a 2 Rm , there exists a zero curve  of a emanating from (0, x0 ), along which the Jacobian matrix Da has full rank. If, in addition, 4) 1 a (0) is bounded, then  reaches a point (1; x) such that f (x) D 0). Furthermore, if D f (x) is invertible, then  has finite arc length. Any algorithm for tracking  from (0, x0 ) to (1; x), based on a homotopy map satisfying the hypotheses of this theorem, is called a globally convergent probability-one homotopy algorithm. Of course, the practical numerical details of tracking  are nontriv-

1273

1274

G

Globally Convergent Homotopy Methods

ial, and have been the subject of twenty years of research in numerical analysis. Production quality software called HOMPACK90 [6] exists for tracking  . The distinctions between continuation, homotopy methods, and probability-one homotopy methods are subtle but worth noting. Only the latter are provably globally convergent and (by construction) expressly avoid dealing with singularities numerically, unlike continuation and homotopy methods which must explicitly handle singularities numerically. Assumptions 2) and 3) in Theorem 3 are usually achieved by the construction of  (such as (2)), and are straightforward to verify. Although assumption 1) is trivial to verify for some maps, if  and a are involved nonlinearly in  the verification is nontrivial. Assumption 4) is typically very hard to verify, and often is a deep result, since 1)–4) holding implies the existence of a solution to f (x) = 0. Note that 1)–4) are sufficient, but not necessary, for the existence of a solution to f (x) = 0, which is why homotopy maps not satisfying the hypotheses of the theorem can still be very successful on practical problems. If 1)–3) hold and a solution does not exist, then 4) must fail, and nonexistence is manifested by  going off to infinity. Properties 1)–3) are important because they guarantee good numerical properties along the zero curve  , which, if bounded, results in a globally convergent algorithm. If  is unbounded, then either the homotopy approach (with this particular ) has failed or f (x) = 0 has no solution. A few remarks about the applicability and limitations of probability-one homotopy methods are in order. They are designed to solve a single nonlinear system of equations, not to track the solutions of a parameterized family of nonlinear systems as that parameter is varied. Thus drastic changes in the solution behavior with respect to that (natural problem) parameter have no effect on the efficacy of the homotopy algorithm, which is solving the problem for a fixed value of the natural parameter. In fact, it is precisely for this case of rapidly varying solutions that the probability-one homotopy approach is superior to classical continuation (which would be trying to track the rapidly varying solutions with respect to the problem parameter). Since the homotopy methods described here are not for general solution curve tracking, they are not (directly) applicable to bifurcation problems.

Homotopy methods also require the nonlinear system to be C2 (twice continuously differentiable), and this limitation cannot be relaxed. However, requiring a finite-dimensional discretization to be smooth does not mean the solution to the infinite-dimensional problem must also be smooth. For example, a Galerkin formulation may produce a smooth nonlinear system in the basis function coefficients even though the basis functions themselves are discontinuous. Homotopy methods for optimization problems may converge to a local minimum or stationary point, and in this regard are no better or worse than other optimization algorithms. In special cases homotopy methods can find all the solutions if there is more than one, but in general the homotopy algorithms are only guaranteed to find one solution. Optimization Homotopies A few typical convergence theorems for optimization are given next (see the survey in [5] for more examples and references). Consider first the unconstrained optimization problem min f (x): x

(3)

Theorem 4 Let f : Rn ! R be a C3 convex map with a minimum at e x, ke xk2  M. Then for almost all a, kak2 < M, there exists a zero curve  of the homotopy map  a (; x) D r f (x) C (1  )(x  a); along which the Jacobian matrix Da (, x) has full rank, emanating from (0, a) and reaching a point (1;e x), where e x solves (3). A function is called uniformly convex if it is convex and its Hessian’s smallest eigenvalue is bounded away from zero. Consider next the constrained optimization problem min f (x): x0

(4)

This is more general than it might appear because the general convex quadratic program reduces to a problem of the form (4).

Theorem 5 Let $f: \mathbb{R}^n \to \mathbb{R}$ be a $C^3$ uniformly convex map. Then there exists $\delta > 0$ such that for almost all $a \ge 0$ with $\|a\|_2 < \delta$ there exists a zero curve $\gamma$ of the homotopy map

$$\rho_a(\lambda, x) = \lambda K(x) + (1 - \lambda)(x - a),$$

where

$$K_i(x) = -\left|\frac{\partial f(x)}{\partial x_i} - x_i\right|^3 + \left(\frac{\partial f(x)}{\partial x_i}\right)^3 + x_i^3,$$

along which the Jacobian matrix $D\rho_a(\lambda, x)$ has full rank, connecting $(0, a)$ to a point $(1, \bar{x})$, where $\bar{x}$ solves the constrained optimization problem (4).

Given $F: \mathbb{R}^n \to \mathbb{R}^n$, the nonlinear complementarity problem is to find a vector $x \in \mathbb{R}^n$ such that

$$x \ge 0, \quad F(x) \ge 0, \quad x^\top F(x) = 0. \tag{5}$$

It is interesting that homotopy methods can be adapted to deal with nonlinear inequality constraints and combinatorial conditions as in (5). Define $G: \mathbb{R}^n \to \mathbb{R}^n$ by

$$G_i(z) = -|F_i(z) - z_i|^3 + (F_i(z))^3 + z_i^3, \quad i = 1, \ldots, n,$$

and let

$$\rho_a(\lambda, z) = \lambda G(z) + (1 - \lambda)(z - a).$$

Theorem 6 Let $F: \mathbb{R}^n \to \mathbb{R}^n$ be a $C^2$ map, and let the Jacobian matrix $DG(z)$ be nonsingular at every zero of $G(z)$. Suppose there exists $r > 0$ such that $z > 0$ and $z_k = \|z\|_\infty \ge r$ imply $F_k(z) > 0$. Then for almost all $a > 0$ there exists a zero curve $\gamma$ of $\rho_a(\lambda, z)$, along which the Jacobian matrix $D\rho_a(\lambda, z)$ has full rank, having finite arc length and connecting $(0, a)$ to $(1, \bar{z})$, where $\bar{z}$ solves (5).

Theorem 7 Let $F: \mathbb{R}^n \to \mathbb{R}^n$ be a $C^2$ map, and let the Jacobian matrix $DG(z)$ be nonsingular at every zero of $G(z)$. Suppose there exists $r > 0$ such that $z \ge 0$ and $\|z\|_\infty \ge r$ imply $z_k F_k(z) > 0$ for some index $k$. Then there exists $\delta > 0$ such that for almost all $a \ge 0$ with $\|a\|_\infty < \delta$ there exists a zero curve $\gamma$ of $\rho_a(\lambda, z)$, along which the Jacobian matrix $D\rho_a(\lambda, z)$ has full rank, having finite arc length and connecting $(0, a)$ to $(1, \bar{z})$, where $\bar{z}$ solves (5).

Homotopy algorithms for convex unconstrained optimization are generally not computationally competitive with other approaches. For constrained optimization the homotopy approach offers some advantages, and, especially for the nonlinear complementarity problem, is competitive with and often superior to other algorithms.

Consider next the general nonlinear programming problem

$$\min \theta(x) \quad \text{subject to} \quad g(x) \le 0, \; h(x) = 0. \tag{6}$$

The optimality conditions for (6) are

$$\begin{cases} \nabla\theta(x) + \beta^\top \nabla h(x) + \mu^\top \nabla g(x) = 0, \\ h(x) = 0, \\ \mu \ge 0, \; g(x) \le 0, \; \mu^\top g(x) = 0, \end{cases} \tag{7}$$

where $\beta \in \mathbb{R}^p$ and $\mu \in \mathbb{R}^m$. The complementarity conditions $\mu \ge 0$, $g(x) \le 0$, $\mu^\top g(x) = 0$ are replaced by the equivalent nonlinear system of equations

$$W(x, \mu) = 0, \tag{8}$$

where

$$W_i(x, \mu) = -|\mu_i + g_i(x)|^3 + \mu_i^3 - (g_i(x))^3, \quad i = 1, \ldots, m. \tag{9}$$

Thus the optimality conditions (7) take the form

$$F(x, \beta, \mu) = \begin{pmatrix} [\nabla\theta(x) + \beta^\top \nabla h(x) + \mu^\top \nabla g(x)]^\top \\ h(x) \\ W(x, \mu) \end{pmatrix} = 0. \tag{10}$$

With $z = (x, \beta, \mu)$, the proposed homotopy map is

$$\rho_a(\lambda, z) = \lambda F(z) + (1 - \lambda)(z - a), \tag{11}$$

where $a \in \mathbb{R}^{n+p+m}$. Simple conditions on $\theta$, $g$, and $h$ guaranteeing that the above homotopy map $\rho_a(\lambda, z)$ will work are unknown, although this map has worked very well on some difficult realistic engineering problems.
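The cube-based construction in (9) can be sanity-checked numerically. The snippet below (illustrative Python; the random sampling and tolerances are arbitrary choices) verifies that $W(u, v) = -|u + v|^3 + u^3 - v^3$, with $u$ standing in for a multiplier $\mu_i$ and $v$ for a constraint value $g_i(x)$, vanishes exactly on the complementarity set $u \ge 0$, $v \le 0$, $uv = 0$ and is nonzero away from it.

import numpy as np

def W(u, v):
    # u plays the role of mu_i, v the role of g_i(x), as in (9).
    return -abs(u + v)**3 + u**3 - v**3

rng = np.random.default_rng(0)
# On the complementarity set (u >= 0, v <= 0, u*v = 0), W vanishes ...
for t in rng.uniform(0.0, 3.0, 1000):
    assert abs(W(t, 0.0)) < 1e-12 and abs(W(0.0, -t)) < 1e-12
# ... while at points clearly violating one of the conditions it does not.
for u, v in rng.uniform(-2.0, 2.0, (1000, 2)):
    if u < 0.0 or v > 0.0 or abs(u * v) > 1e-8:
        assert abs(W(u, v)) > 1e-13

Unlike squaring-based reformulations, the cubic form leaves $W$ twice continuously differentiable, which is exactly the smoothness the probability-one theory requires.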



Frequently in practice the functions $\theta$, $g$, and $h$ involve a parameter vector $c$, and a solution to (6) is known for some $c = c^{(0)}$. Suppose that the problem under consideration has parameter vector $c = c^{(1)}$. Then

$$c = (1 - \lambda)c^{(0)} + \lambda c^{(1)} \tag{12}$$

parametrizes $c$ by $\lambda$, and $\theta = \theta(x; c) = \theta(x; c(\lambda))$, $g = g(x; c(\lambda))$, $h = h(x; c(\lambda))$. The optimality conditions in (10) become functions of $\lambda$ as well, $F(\lambda, x, \beta, \mu) = 0$, and

$$\rho_a(\lambda, z) = \lambda F(\lambda, z) + (1 - \lambda)(z - a) \tag{13}$$

is a highly implicit nonlinear function of $\lambda$. If $F(0, z^{(0)}) = 0$, a good choice for $a$ in practice has been found to be $a = z^{(0)}$. A natural choice for a homotopy would be simply

$$F(\lambda, z) = 0, \tag{14}$$

since the solution $z^{(0)}$ to $F(0, z) = 0$ (the problem corresponding to $c = c^{(0)}$) is known. However, for various technical reasons, (13) is much better than (14).

Software

There are several software packages implementing both continuous and simplicial homotopy methods; see [1] and [6] for a discussion of some of these packages. A production quality software package written in Fortran 90 is described here. HOMPACK90 [7] is a Fortran 90 collection of codes for finding zeros or fixed points of nonlinear systems using globally convergent probability-one homotopy algorithms. Three qualitatively different algorithms (ordinary differential equation based, normal flow, quasi-Newton augmented Jacobian matrix) are provided for tracking homotopy zero curves, as well as separate routines for dense and sparse Jacobian matrices. A high level driver for the special case of polynomial systems is also provided.


HOMPACK90 features elegant interfaces, use of modules, support for several sparse matrix data structures, and modern iterative algorithms for large sparse Jacobian matrices. HOMPACK90 is logically organized in two different ways: by algorithm/problem type and by subroutine level. There are three levels of subroutines. The top level consists of drivers, one for each problem type and algorithm type. The second subroutine level implements the major components of the algorithms, such as stepping along the homotopy zero curve, computing tangents, and the end game for the solution at $\lambda = 1$. The third subroutine level handles high level numerical linear algebra such as QR factorization, and includes some LAPACK and BLAS routines. The organization of HOMPACK90 by algorithm/problem type is shown in Table 1, which lists the driver name for each algorithm and problem type.

Globally Convergent Homotopy Methods, Table 1 Taxonomy of homotopy subroutines

algorithm                       | x = f(x)        | F(x) = 0        | ρ(a, λ, x) = 0
                                | dense   sparse  | dense   sparse  | dense   sparse
ordinary differential equation  | FIXPDF  FIXPDS  | FIXPDF  FIXPDS  | FIXPDF  FIXPDS
normal flow                     | FIXPNF  FIXPNS  | FIXPNF  FIXPNS  | FIXPNF  FIXPNS
augmented Jacobian matrix       | FIXPQF  FIXPQS  | FIXPQF  FIXPQS  | FIXPQF  FIXPQS

The naming convention is FIXP{D, N, Q}{F, S}, where D = ordinary differential equation algorithm, N = normal flow algorithm, Q = quasi-Newton augmented Jacobian matrix algorithm, F = dense Jacobian matrix, and S = sparse Jacobian matrix. Depending on the problem type and the driver chosen, the user must write exactly two subroutines, whose interfaces are specified in the module HOMOTOPY, defining the problem ($f$, $F$, or $\rho$). The module REAL_PRECISION specifies the real numeric model with SELECTED_REAL_KIND(13), which will result in 64-bit real arithmetic on a Cray, DEC VAX, and IEEE 754 Standard compliant hardware.


The special purpose polynomial system solver POLSYS1H can find all solutions in complex projective space of a polynomial system of equations. Since a polynomial programming problem (where the objective function, inequality constraints, and equality constraints are all in terms of polynomials) can be formulated as a polynomial system of equations, POLSYS1H can effectively find the global optimum of a polynomial program. However, polynomial systems can have a huge number of solutions, so this approach is only practical for small polynomial programs (e. g., surface intersection problems that arise in CAD/CAM modeling). The organization of the Fortran 90 code into modules gives an object oriented flavor to the package. For instance, all of the drivers are encapsulated in a single MODULE HOMPACK90. The user's calling program would then simply contain a statement like

USE HOMPACK90, ONLY : FIXPNF

Many scientific programmers prefer the reverse call paradigm, whereby a subroutine returns to the calling program whenever the subroutine needs certain information (e. g., a function value) or a certain operation performed (e. g., a matrix-vector multiply). Two reverse call subroutines (STEPNX, ROOTNX) are provided for 'expert' users. STEPNX is an expert reverse call stepping routine for tracking a homotopy zero curve $\gamma$ that returns to the caller for all linear algebra and all function and derivative values, and can deal gracefully with situations such as the function being undefined at the requested steplength. ROOTNX provides an expert reverse call end game routine that finds a point on the zero curve where $g(\lambda, x) = 0$, as opposed to just the point where $\lambda = 1$. Thus ROOTNX can find turning points, bifurcation points, and other 'special' points along the zero curve. The combination of STEPNX and ROOTNX provides considerable flexibility for an expert user.

See also
▸ Parametric Optimization: Embeddings, Path Following and Singularities
▸ Topology of Global Optimization

References
1. Allgower EL, Georg K (1990) Numerical continuation methods. Springer, Berlin
2. Chow SN, Mallet-Paret J, Yorke JA (1978) Finding zeros of maps: homotopy methods that are constructive with probability one. Math Comput 32:887–899
3. Forster W (1980) Numerical solution of highly nonlinear problems. North-Holland, Amsterdam
4. Watson LT (1986) Numerical linear algebra aspects of globally convergent homotopy methods. SIAM Rev 28:529–545
5. Watson LT (1990) Globally convergent homotopy algorithms for nonlinear systems of equations. Nonlinear Dynamics 1:143–191
6. Watson LT, Haftka RT (1989) Modern homotopy methods in optimization. Comput Methods Appl Mech Engrg 74:289–305
7. Watson LT, Sosonkina M, Melville RC, Morgan AP, Walker HF (1997) Algorithm 777: HOMPACK90: A suite of Fortran 90 codes for globally convergent homotopy algorithms. ACM Trans Math Softw 23:514–549




Global Optimization Algorithms for Financial Planning Problems

PANOS PARPAS, BERÇ RUSTEM
Department of Computing, Imperial College, London, UK

MSC2000: 90B50, 78M50, 91B28

Article Outline

Abstract
Background
Models
  Scenario Generation
  Portfolio Selection
Methods
  A Stochastic Optimization Algorithm
  Other Methods
References


Abstract

It is becoming apparent that convex financial planning models are at times a poor approximation of the real world. More realistic, and more relevant, models need to dispense with normality assumptions and concavity of the utility functions to be optimized. Moreover, the problems are large scale but structured; consequently specialized algorithms have been proposed for their solution. The aim of this article is to discuss a non-convex portfolio-selection problem and describe algorithms that can be used for its solution.


Background

Modern portfolio theory started in the 1950s with H. Markowitz's work [16,17]. Since then a lot of research has been done on improving the basic models and dispensing with the limiting assumptions of the field. The aim of this article is to introduce the problem of optimization of higher-order moments of a portfolio. This model is an extension of the celebrated mean–variance model of Markowitz [16,17]. The inclusion of higher-order moments has been proposed as one possible augmentation of the model in order to make it more applicable. The applicability of the model can be broadened by relaxing one of its major assumptions, i. e. that the rates of return are normal. In order to solve the portfolio-selection problem, we first need to address the problem of scenario generation, i. e. the description of the uncertainties used in the portfolio-selection problem. Both problems are non-convex, large-scale, and highly relevant in financial optimization. We focus on a single-period model where the decision maker (DM) provides as input preferences with respect to mean, variance, skewness and possibly kurtosis of the portfolio. Using these four parameters we then formulate the multicriterion optimization problem as a standard non-linear programming problem. This version of the decision model is a non-convex linearly constrained problem. Before we can solve the portfolio-selection problem we need to describe the uncertainties regarding the returns of the risky assets. In particular we need to specify: (1) the possible states of the world and (2) the probability of each state. A common approach to this modelling problem is the method of matching moments (see e. g. [5,9,20]). The first step in this approach is to use the historical data to estimate the moments (here we consider the first four central moments, i. e. mean, variance, skewness and kurtosis). The second step is to compute a discrete distribution with the same statistical properties as those calculated in the previous step. Given that our interest is in real-world applications, we recognize that there may not always be a distribution that matches the calculated statistical properties. For this reason we formulate the problem as a least-squares

problem [5,9]. The rationale behind this formulation is that we try to calculate a description of the uncertainty that matches our beliefs as well as possible. The scenario-generation problem also has a non-convex objective function and is linearly constrained. For the two problems described above we apply a new stochastic global optimization algorithm that has been developed specifically for this class of problems. The algorithm is described in [19]. It is an extension, to the constrained case, of the so-called diffusion algorithm [1,4,6,7]. The method follows the trajectory of an appropriately defined stochastic differential equation (SDE). Feasibility of the trajectory is achieved by projecting its dynamics onto the set defined by the linear equality constraints. A barrier term is used to force the trajectory to stay within any bound constraints (e. g. positivity of the probabilities, or bounds on how much of each asset to own). A review of applications of global optimization to portfolio selection problems appeared in [13]. A deterministic global optimization algorithm for a multiperiod model appeared in [15]. This article complements the work mentioned above in the sense that we describe a complete framework for the solution of a realistic financial model. The type of models we consider, due to the large number of variables, cannot be solved by deterministic algorithms. Consequently, practitioners are left with two options: solve a simpler, but less relevant, model or use a heuristic algorithm (e. g. tabu search or evolutionary algorithms). The approach proposed here lies somewhere in the middle. The proposed algorithm belongs to the simulated-annealing family of algorithms, and it has been shown in [19] that it converges to the global optimum (in a probabilistic sense). Moreover, the computational experience reported in [19] seems to indicate that the method is robust (in terms of finding the global optimum) and reliable. We believe that such an approach will be useful in many practical applications.

Models

Scenario Generation

From its inception stochastic programming (SP) has found several diverse applications as an effective paradigm for modelling decisions under uncertainty. The focus of initial research was on developing


effective algorithms for models of realistic size. An area that has only recently received attention is methods to represent the uncertainties of the decision problem. A review of available methods to generate meaningful descriptions of the uncertainties from data can be found in [5]. We will use a least-squares formulation (see e. g. [5,9]). It is motivated by the practical concern that the moments, given as input, may be inconsistent. Consequently, the best one can do is to find a distribution that fits the available data as well as possible. It is further assumed that the distribution is discrete. Under these assumptions the problem can be written as

$$\min_{\omega, p} \; \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{k} p_j m_i(\omega_j) - \tau_i \Bigr)^2 \quad \text{s.t.} \quad \sum_{j=1}^{k} p_j = 1, \; p_j \ge 0, \; j = 1, \ldots, k,$$

where $\tau_i$ represents the statistical properties of interest and $m_i(\cdot)$ is the associated 'moment' function. For example, if $\tau_i$ is the target mean for the $i$th asset, then $m_i(\omega_j) = \omega_j^i$, i. e. the $j$th realization of the $i$th asset. Numerical experiments using this approach for a multistage model were reported in [9] (without arbitrage considerations). Other methods such as maximum entropy [18] and semidefinite programming [2] enjoy strong theoretical properties but cannot be used when the data of the problem are inconsistent. A disadvantage of the least-squares model is that it is highly nonconvex, which makes it very difficult to handle numerically. These considerations lead to the development of the algorithm described in Sect. "A Stochastic Optimization Algorithm" (see also [19]) that can efficiently compute global optima for problems in this class. When using scenario trees for financial planning problems it becomes necessary to address the issue of arbitrage opportunities [9,12]. An arbitrage opportunity is a self-financing trading strategy that generates a strictly positive cash flow in at least one state and whose payoffs are non-negative in all other states. In other words, it is possible to get something for nothing. In our implementation we eliminate arbitrage opportunities by computing a sufficient set of states so that the resulting scenario tree has the arbitrage-free property. This is achieved by a simple two-step process. In the first step we generate random rates of return; these

are sampled from a uniform distribution. We then test for arbitrage by solving the system

$$x_0^i = e^{-r} \sum_{j=1}^{m} x_j^i \pi_j, \quad i = 1, \ldots, n, \qquad \sum_{j=1}^{m} \pi_j = 1, \quad \pi_j \ge 0, \quad j = 1, \ldots, m, \tag{1}$$

where $x_0^i$ represents the current (known) state of the world for the $i$th asset and $x_j^i$ represents the $j$th realization of the $i$th asset in the next time period (these are generated by the simulations mentioned above); $r$ is the riskless rate of return. The $\pi_j$ are called the risk-neutral probabilities. According to a fundamental result of Harrison and Kreps [10], the existence of the risk-neutral probabilities is enough to guarantee that the scenario tree has the desired property.
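Since (1) involves no objective function, checking for the existence of risk-neutral probabilities is a linear feasibility problem. The sketch below (illustrative Python/SciPy; the price data and the helper name are made up) poses it as a linear program with a zero objective.

import numpy as np
from scipy.optimize import linprog

def has_risk_neutral_probabilities(x0, X, r):
    # x0: current prices (n,); X[i, j]: j-th sampled state of asset i; r: riskless rate.
    n, m = X.shape
    A_eq = np.vstack([np.exp(-r) * X,        # discounted pricing equations of (1)
                      np.ones((1, m))])      # probabilities sum to one
    b_eq = np.concatenate([x0, [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * m)  # pi_j >= 0
    return res.status == 0                   # feasible => arbitrage-free states

x0 = np.array([1.0, 1.0])
X = np.array([[1.2, 0.9, 1.0],               # two assets, three sampled states
              [0.8, 1.1, 1.05]])
print(has_risk_neutral_probabilities(x0, X, r=0.01))   # True for this data

In the two-step process described above, a check of this kind can be used to screen the randomly generated states.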

In the second step, we solve the least-squares problem with some of the states fixed to the states calculated in the first step. In other words, we solve the following problem:

$$\min_{\omega, p} \; \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{k} p_j m_i(\omega_j) + \sum_{l=1}^{m} p_{k+l} m_i(\hat{\omega}_l) - \tau_i \Bigr)^2 \quad \text{s.t.} \quad \sum_{j=1}^{k+m} p_j = 1, \; p_j \ge 0, \; j = 1, \ldots, k + m. \tag{2}$$

In the problem above, the $\hat{\omega}_l$ are fixed. Solving the preceding problem guarantees a scenario tree that is arbitrage free.
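A small instance of this moment-matching problem can be prototyped with an off-the-shelf local solver (the global algorithm of the next section targets the same formulation at realistic scale). In the sketch below (illustrative Python/SciPy) a single asset with k free states is fitted to an assumed mean and variance; using central moments only, and omitting the fixed arbitrage states, are simplifications.

import numpy as np
from scipy.optimize import minimize

k = 5
targets = {"mean": 0.08, "var": 0.04}           # assumed statistics from data

def objective(y):
    # Decision vector y stacks the realizations w and the probabilities p.
    w, p = y[:k], y[k:]
    mean = p @ w
    var = p @ (w - mean)**2
    return (mean - targets["mean"])**2 + (var - targets["var"])**2

cons = [{"type": "eq", "fun": lambda y: np.sum(y[k:]) - 1.0}]
bounds = [(None, None)] * k + [(0.0, 1.0)] * k  # probabilities p_j >= 0
y0 = np.concatenate([np.linspace(-0.2, 0.3, k), np.full(k, 1.0 / k)])
res = minimize(objective, y0, method="SLSQP", bounds=bounds, constraints=cons)
w, p = res.x[:k], res.x[k:]
print(p @ w, p @ (w - p @ w)**2)                # close to 0.08 and 0.04

Because the objective is non-convex in the pair (realizations, probabilities), a local solver like this can stall at a poor fit from an unlucky starting point, which is precisely the motivation for the stochastic global algorithm described below.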

Portfolio Selection

In this section we describe the portfolio-selection problem when higher-order terms are taken into account. The classical mean–variance approach to portfolio analysis seeks to balance risk (measured by variance) and reward (measured by expected value). There are many ways to specify the single-period problem. We will be using the following basic model:

$$\min_{w} \; -\alpha E[w] + \beta V[w] \quad \text{s.t.} \quad \sum_{i=1}^{n} w_i = 1, \quad l_i \le w_i \le u_i, \quad i = 1, \ldots, n, \tag{3}$$

where $E[\cdot]$ and $V[\cdot]$ represent the mean rate of return and its variance, respectively. The single constraint is known as the budget constraint, and it specifies the initial wealth (without loss of generality we have assumed


that this is one). The $\alpha$ and $\beta$ are positive scalars chosen so that $\alpha + \beta = 1$. They specify the DM's preferences, i. e. $\alpha = 1$ means that the DM is risk seeking, while $\beta = 1$ implies that the DM is risk averse. Any other selection of the parameters will produce a point on the efficient frontier. The decision variable ($w$) represents the commitment of the DM to a particular asset. Note that this problem is a convex quadratic programming problem for which very efficient algorithms exist. The interested reader is referred to the review in [23] for more information regarding the Markowitz model. We propose an extension of the mean–variance model using higher-order moments. The vector-optimization problem can be formulated as a standard non-convex optimization problem using two additional scalars to act as weights. These weights are used to enforce the DM's preferences. The problem is then formulated as follows:

$$\min_{w} \; -\alpha E[w] + \beta V[w] - \gamma S[w] + \delta K[w] \quad \text{s.t.} \quad \sum_{i=1}^{n} w_i = 1, \quad l_i \le w_i \le u_i, \quad i = 1, \ldots, n, \tag{4}$$

where $S[\cdot]$ and $K[\cdot]$ represent the skewness and kurtosis of the rate of return, respectively; $\gamma$ and $\delta$ are positive scalars. The four scalar parameters are chosen so that they sum to one. Positive skewness is desirable (since it corresponds to higher returns, albeit with low probability), while kurtosis is undesirable since it implies that the DM is exposed to more risk. The model in (4) can be extended to multiple periods while maintaining the same structure (non-convex objective and linear constraints). The numerical solution of (2) and (4) will be discussed in the next section.
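For concreteness, the sketch below (illustrative Python; the scenario matrix, probabilities, and weights are made-up inputs, and S[·] and K[·] are computed here as central rather than standardized moments) evaluates the objective of (4) on a discrete scenario set of the kind produced by the scenario-generation step.

import numpy as np

def moment_objective(w, R, p, alpha=0.4, beta=0.3, gamma=0.2, delta=0.1):
    r = R @ w                                 # portfolio return in each scenario
    mean = p @ r
    d = r - mean
    return (-alpha * mean + beta * (p @ d**2)
            - gamma * (p @ d**3) + delta * (p @ d**4))

rng = np.random.default_rng(1)
R = rng.normal(0.05, 0.1, size=(50, 3))       # 50 scenarios, 3 assets (synthetic)
p = np.full(50, 1.0 / 50)
w = np.array([0.5, 0.3, 0.2])                 # feasible: weights sum to one
print(moment_objective(w, R, p))

A function of this form, together with the budget and bound constraints, is what is handed to the stochastic algorithm of the next section.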

Methods

A Stochastic Optimization Algorithm

The models described in the previous section can be written as

$$\min_x f(x) \quad \text{s.t.} \quad Ax = b, \quad x \ge 0.$$

A well-known method for obtaining a solution to an unconstrained optimization problem is to consider the following ordinary differential equation (ODE):

$$dX(t) = -\nabla f(X(t))\, dt. \tag{5}$$

By studying the behaviour of $X(t)$ for large $t$, it can be shown that $X(t)$ will eventually converge to a stationary point of the unconstrained problem. A review of so-called continuous-path methods can be found in [25]. A deficiency of using (5) to solve optimization problems is that it will get trapped in local minima. To allow the trajectory to escape from local minima, it has been proposed by various authors (e. g. [1,4,6,7]) to add a stochastic term that would allow the trajectory to 'climb' hills. One possible augmentation to (5) that would enable us to escape from local minima is to add noise. One then considers the diffusion process:

$$dX(t) = -\nabla f(X(t))\, dt + \sqrt{2T(t)}\, dB(t), \tag{6}$$

where $B(t)$ is the standard Brownian motion in $\mathbb{R}^n$. It has been shown in [4,6,7], under appropriate conditions on $f$ and $T(t)$, that as $t \to \infty$ the transition probability of $X(t)$ converges to a probability measure $\Pi$. The latter has its support on the set of global minimizers. For the sake of argument, suppose we did not have any linear constraints but only positivity constraints. We could then consider enforcing the feasibility of the iterates by using a barrier function. According to the algorithmic framework sketched out above, we could obtain a solution to our (simplified) problem by following the trajectory of the following SDE:

$$dX(t) = -\nabla f(X(t))\, dt + \mu X(t)^{-1}\, dt + \sqrt{2T(t)}\, dB(t), \tag{7}$$

where $\mu > 0$ is the barrier parameter. By $X^{-1}$ we will denote an $n$-dimensional vector whose $i$th component is given by $1/X_i$. Having used a barrier function to deal with the positivity constraints, we can now introduce the linear constraints into our SDE. This process has been carried out in [19] using the projected SDE:

$$dX(t) = P[-\nabla f(X(t)) + \mu X(t)^{-1}]\, dt + \sqrt{2T(t)}\, P\, dB(t), \tag{8}$$

where $P = I - A^\top(AA^\top)^{-1}A$. The proposed algorithm works in a similar manner to gradient-projection algorithms. The key difference is the addition of a barrier parameter for the positivity of the iterates and


a stochastic term that helps the algorithm escape from local minima. The global optimization problem can be solved by fixing $\mu$ and following the trajectory of (8) for a suitably defined function $T(t)$. After sufficient time passes, we reduce $\mu$ and repeat the process. The proof that following the trajectory of (8) will eventually lead us to the global minimum appears in [19]. Note that the projection matrix for the type of constraints we need to impose for our models is particularly simple. For a constraint of the type $\sum_{i=1}^{n} x_i = 1$ the projection matrix is given by

$$P_{ij} = \begin{cases} -\dfrac{1}{n} & \text{if } i \ne j, \\[4pt] \dfrac{n-1}{n} & \text{otherwise.} \end{cases}$$

Other Methods

In this article we have focused on the numerical solution of a financial planning problem using a stochastic algorithm. We end this article by briefly discussing other possible approaches. Only stochastic methods will be discussed; for deterministic methods we refer the interested reader to [13].

Two-phase methods: Methods belonging to this class, as the name suggests, have two phases: a local and a global phase. In the global phase, the feasible region is uniformly sampled. From each feasible point a local optimization algorithm is started. The latter process is the local phase. This basic algorithmic framework has been modified by various authors to improve its performance. Improving this type of method requires careful selection of the sample points from which to start the local optimizations. Inevitably there is some compromise between computational efficiency and theoretical convergence. For a review of two-phase methods we refer the reader to [21] and references therein.

Simulated annealing (SA): This family of algorithms was inspired by the physical behaviour of atoms in a liquid. The method was independently proposed by Černý [3] and Kirkpatrick et al. [11]. It is inspired by a fundamental question of statistical mechanics concerning the behaviour of the system at low temperatures. For example, will the atoms remain fluid or will they solidify? If they solidify, do they form a crystalline solid or a glass? It turns out [11] that if the temperature is decreased slowly, then they form


a pure crystal; this state corresponds to the minimum energy of the system. If the temperature is decreased too quickly, then they form a crystal with many defects. SA algorithms generate a point from some distribution. Whether to accept the new point or not is decided by an acceptance function. The latter function is 'temperature' dependent. At high temperatures the function is likely to accept the new point, while at low temperatures only points close to the global optimum value are supposed to be accepted. As can be anticipated, the performance of the algorithm depends on the annealing schedule, i. e. how fast the temperature is reduced. Performance also depends on how points are sampled, the acceptance function and, of course, the stopping conditions. An excellent review article for SA is [14].
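A minimal sketch of such an acceptance rule and schedule follows (illustrative Python; the exponential Metropolis form and the geometric cooling are textbook choices, and the one-dimensional multimodal objective is made up).

import numpy as np

def accept(f_old, f_new, T, rng):
    if f_new <= f_old:
        return True                               # always accept improvements
    return rng.random() < np.exp(-(f_new - f_old) / T)  # occasional uphill moves

f = lambda x: x**2 + 10.0 * np.sin(3.0 * x)       # multimodal toy objective
rng = np.random.default_rng(0)
x, T = 4.0, 5.0
for _ in range(20000):
    y = x + rng.normal(0.0, 0.5)                  # candidate from a proposal pdf
    if accept(f(x), f(y), T, rng):
        x = y
    T = max(1.0e-3, 0.9995 * T)                   # slow geometric cooling
print(x, f(x))                                     # typically near the global minimum

At high temperature the rule accepts almost anything; as T falls, uphill moves become rare and the chain settles into a deep minimum, mirroring the slow-cooling behaviour described above.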

1281

1282

G

Global Optimization in the Analysis and Management of Environmental Systems

Stochastic adaptive search methods: These types of algorithms have strong theoretical properties but present challenging implementation issues. A typical algorithm from this class is the pure adaptive search method. This method works like a pure random search method but with the additional assumption of the ability to sample from a distribution that gives realizations that are strictly better than the incumbent. There exist many variants and combinations of this type of method, and an excellent review of them is given in [24].

Genetic algorithms: This class of algorithms has been inspired by concepts from evolutionary biology and from aspects of natural selection. There are two phases in these algorithms: generation of the population and updating. During the generation phase, candidate points (offspring) are generated by sampling a p.d.f. This p.d.f. is usually specified from the original or the previous generation (the parents). In the second phase the population is updated. This update is performed by applying a selection mechanism and performing mutation operations on the population. There are very few theoretical results concerning the convergence properties of genetic algorithms. However, if their success in applications is anything to go by, then more attention needs to be devoted to the convergence aspects of the method. An excellent review of genetic algorithms is given in [22].

Tabu search: This is another heuristic algorithm that has been successfully used for global optimization (especially for combinatorial problems) but lacks theoretical backing. This class of algorithms was proposed by Glover, and a review of the method appeared in [8]. The algorithm has three phases: preliminary search, intensification, and diversification. In the first phase, the algorithm takes the current configuration, examines neighbouring solutions, and selects the one with the best objective function value. This process is continued until no improving state can be identified. At this stage the possibility of returning to this point is ruled out by placing it into a list. This list is called the tabu list. In the second phase (intensification), the tabu list is cleared and the algorithm returns to the first phase. In the final stage (diversification), the most frequent moves that were placed into the tabu list during the first phase are placed into the list from the start. The algorithm then starts from a random initial point. In this phase the algorithm is not allowed to make any moves that are in the tabu list.

References
1. Aluffi-Pentini F, Parisi V, Zirilli F (1985) Global optimization and stochastic differential equations. J Optim Theory Appl 47(1):1–16
2. Bertsimas D, Sethuraman J (2000) Moment problems and semidefinite optimization. In: Handbook of semidefinite programming. Int Ser Oper Res Manage Sci, vol 27. Kluwer, Boston, pp 469–509
3. Černý V (1985) Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J Optim Theory Appl 45(1):41–51
4. Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in R^n. SIAM J Control Optim 25(3):737–753
5. Dupačová J, Consigli G, Wallace SW (2000) Scenarios for multistage stochastic programs. Ann Oper Res 100:25–53
6. Geman S, Hwang CR (1986) Diffusions for global optimization. SIAM J Control Optim 24(5):1031–1043
7. Gidas B (1986) The Langevin equation as a global minimization algorithm. In: Disordered systems and biological organization (Les Houches, 1985). NATO Adv Sci Inst Ser F Comput Syst Sci, vol 20. Springer, Berlin, pp 321–326
8. Glover F, Laguna M (1998) Tabu search. In: Handbook of combinatorial optimization, vol 3. Kluwer, Boston, pp 621–757
9. Gülpınar N, Rustem B, Settergren R (2004) Simulation and optimization approaches to scenario tree generation. J Econ Dyn Control 28(7):1291–1315
10. Harrison JM, Kreps DM (1979) Martingales and arbitrage in multiperiod securities markets. J Econom Theory 20:381–408
11. Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
12. Klaassen P (1997) Discretized reality and spurious profits in stochastic programming models for asset/liability management. Eur J Oper Res 101(2):374–392
13. Konno H (2005) Applications of global optimization to portfolio analysis. In: Audet C, Hansen P, Savard G (eds) Essays and Surveys in Global Optimization. Springer, Berlin, pp 195–210
14. Locatelli M (2002) Simulated annealing algorithms for continuous global optimization. In: Handbook of global optimization, vol 2. Nonconvex Optim Appl, vol 62. Kluwer, Dordrecht, pp 179–229
15. Maranas CD, Androulakis IP, Floudas CA, Berger AJ, Mulvey JM (1997) Solving long-term financial planning problems via global optimization. J Econ Dynam Control 21(8–9):1405–1425
16. Markowitz HM (1952) Portfolio selection. J Finance 7:77–91
17. Markowitz HM (1952) The utility of wealth. J Polit Econ 60:151–158
18. Parpas P (2006) Algorithms in Stochastic Optimization. PhD Thesis, Imperial College London
19. Parpas P, Rustem B, Pistikopoulos EN (2006) Linearly constrained global optimization and stochastic differential equations. J Global Optim 36(2):191–217
20. Prékopa A (1995) Stochastic programming. Math Appl, vol 324. Kluwer, Dordrecht
21. Schoen F (2002) Two-phase methods for global optimization. In: Handbook of global optimization, vol 2. Nonconvex Optim Appl, vol 62. Kluwer, Dordrecht, pp 151–177
22. Smith JE (2002) Genetic algorithms. In: Handbook of global optimization, vol 2. Nonconvex Optim Appl, vol 62. Kluwer, Dordrecht, pp 275–362
23. Steinbach MC (2001) Markowitz revisited: mean-variance models in financial portfolio analysis. SIAM Rev 43(1):31–85
24. Wood GR, Zabinsky ZB (2002) Stochastic adaptive search. In: Handbook of global optimization, vol 2. Nonconvex Optim Appl, vol 62. Kluwer, Dordrecht, pp 231–249
25. Zirilli F (1982) The use of ordinary differential equations in the solution of nonlinear systems of equations. In: Nonlinear optimization 1981 (Cambridge, 1981). NATO Conf Ser II: Syst Sci. Academic, London, pp 39–46

Global Optimization in the Analysis and Management of Environmental Systems

JÁNOS D. PINTÉR
Pintér Consulting Services, Inc., and Dalhousie University, Halifax, Canada

MSC2000: 90C05


Article Outline

Keywords
Environmental Systems Analysis and Optimization
Model Calibration
'Black Box' Optimization (in Environmental Systems)
See also
References

Keywords

Nonlinear decision models; Multi-extremality; Continuous global optimization; Applications in environmental systems modeling and management

Environmental Systems Analysis and Optimization

The harmonized consideration of technical, economic and environmental objectives in strategic planning and operational decision making is of paramount importance, on a worldwide scale. Environmental quality issues are of serious concern even in the most developed countries, although direct pollution control expenditures are typically in the 2–3 percent range of their gross domestic product. The 'optimized' or at least 'acceptable' solution of environmental quality problems requires the combination of knowledge from a multitude of areas, and requires an interdisciplinary effort. In the past decades, mathematical programming (MP) models have been applied also to the analysis and management of environmental systems. The annotated bibliography [9] reviews over 350 works, including some thirty books. Note further that the engineering, economic and environmental science literature contains a very large amount of work that can serve as a basis and therefore is closely related to such modeling efforts. For instance, the classic textbook [28] reviews the basic quantitative models applied in describing physical, chemical and biological phenomena of relevance. A more recent exposition (with a somewhat broader scope) is presented in, for instance, [11]. The chapters in the latter edited volume discuss the following issues:
• environmental crisis, as a multidisciplinary challenge;


• soil pollution;
• air pollution;
• water pollution;
• water resources management;
• pesticides;
• gene technology;
• landscape planning;
• environmental economics;
• ecological aspects;
• environmental impact assessment;
• environmental management models.

Environmental management models are discussed – in the broader context of governmental planning and operations – already in [8]. In addition to the items listed above, the (relevant) topics covered include also
• solid waste management;
• urban development;
• policy analysis.

Numerous further books can be mentioned, with varying emphasis on environmental science, engineering, economics or systems analysis. Consult, e. g., [1,2,3,4,6,10,13,15,16,17,18,19,23,24,25,29,31,32,33]. Most of these works also provide extensive lists of additional references. In the framework of this short article there is no room to go into any detailed discussion of environmental models. Therefore we shall only emphasize one important methodological aspect reflected by the title: namely, the relevance of global optimization in this context. The predominant majority of MP models presented, e. g., in the books listed or in [9] belong to (continuous or possibly mixed integer) linear programming, or to convex nonlinear programming, with additional – usually rather simplified – considerations regarding system stochasticity. At the same time, more detailed or more realistic models of natural systems and their governing processes often possess a high degree of (explicit or hidden) nonlinearity. For instance, one may think of power laws, periodic or chaotic processes, and (semi)random fluctuations, reflected by many natural objects on various scales: mountains, waters, plants, animals, and so on. For related far-reaching discussions, consult, for example, [5,7,20,21], or [30]. Since many natural objects and processes are inherently nonlinear, management models that optimize the behavior of environmental systems frequently lead to multi-extremal


decision problems. Continuous global optimization (GO) is aimed at finding the 'absolutely best' solution of such models, in the possible presence of many other (locally optimal) solutions of various quality. See ▸ Continuous Global Optimization: Models, Algorithms and Software and ▸ Continuous Global Optimization: Applications for a number of textbooks and WWW sites related to the subject of GO. Therefore, here we mention only the handbook [14] and the WWW site [22]. We shall illustrate the relevance of GO by two very general examples, adapted from [26]. The latter book also presents a number of other case studies related to environmental modeling and management, with numerous additional references pertinent to this subject.

Model Calibration

The incomplete or poor understanding of environmental – as well as many other complex – systems calls for descriptive model development as an essential tool of the related research. The following main phases of quantitative systems modeling can be distinguished:
• identification: formulation of principal modeling objectives, determination (selection) of suitable model structure;
• calibration: (inverse) model fitting to available data and background information;
• validation and application in analysis, forecasting, control, management.
Consequently, the 'adequate' or 'best' parameterization of descriptive models is an important stage in the process of understanding environmental systems. Interesting, practically motivated discussions of the model calibration problem are presented also in [1,3,12,32]. A fairly simple and commonly applied instance of the model calibration problem can be stated as follows. Given
• a descriptive system model (e. g. of a lake, river, groundwater or atmospheric system) that depends on certain unknown (physical, chemical) parameters; their vector is denoted by $x$;
• the set of a priori feasible parameterizations $D$;
• the model output values $y_t^{(m)} = y_t^{(m)}(x)$ at time moments $t = 1, \ldots, T$;
• a set of corresponding observations $y_t$ at $t = 1, \ldots, T$;
• a discrepancy measure denoted by $f$ which expresses the distance between $y_t^{(m)}$ and $y_t$.

Then the optimized model calibration problem can be formulated as

$$\min f(x) := f\{y_t^{(m)}(x),\, y_t\} \quad \text{s.t.} \quad x \in D. \tag{1}$$

Frequently, $D$ is a finite $n$-interval (a 'box'); furthermore, $f$ is a continuous or somewhat more special (smooth, Lipschitz, etc.) function. Additional structural assumptions regarding $f$ may be difficult to postulate, due to the following reason. For each fixed parameter vector $x$, the model output sequence $\{y_t^{(m)}(x)\}$ may be produced by some implicit formulas, or by a computationally demanding numerical procedure (such as, e. g., the solution of a system of partial differential equations). Consequently, although model (1) most typically belongs to the general class of continuous GO problems, a more specific classification may be difficult to provide. Therefore one needs to apply a GO procedure that enables the solution of the calibration problem under the very general conditions outlined above. To conclude the brief discussion of this example, note that in [26] several variants of the calibration problem statement are studied in detail. Namely, the model development and solver system LGO is applied to solve model calibration problems related to water quality analysis in rivers and lakes, river flow hydraulics, and aquifer modeling. (More recent implementations of LGO are described elsewhere: consult, e. g., [27].)
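As a toy instance of (1), the sketch below (illustrative Python/SciPy; the one-parameter exponential decay model and the synthetic observations are assumptions) fits a single rate constant over the box D = [0, 2]. A bounded local search suffices for a problem this simple; the point of the GO machinery is that realistic calibration objectives of the form (1) are frequently multi-extremal in x.

import numpy as np
from scipy.optimize import minimize_scalar

t = np.arange(1, 11)                        # observation moments t = 1, ..., T
rng = np.random.default_rng(0)
y_obs = np.exp(-0.7 * t) + 0.01 * rng.standard_normal(t.size)

def f(x):
    # Least-squares discrepancy between model output and observations, as in (1).
    return np.sum((np.exp(-x * t) - y_obs)**2)

res = minimize_scalar(f, bounds=(0.0, 2.0), method="bounded")
print(res.x)                                # recovers a rate close to 0.7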


'Black Box' Optimization (in Environmental Systems)

As outlined above, the more realistic – as opposed to strongly simplified – analysis of environmental processes frequently requires the development of sophisticated systems of (sub)models: these are then connected to a suitable optimization modeling framework. For examples of various complexity, consult [1,2,10,19,32]. We shall illustrate this point by briefly discussing a modeling framework for river water quality management; for additional details, see [26] and references therein. Assume that the ambient water quality in a river at time $t$ is characterized by a certain vector $s(t)$. The components in $s(t)$ can include, for instance, the following: suspended solids concentration, dissolved oxygen concentration, biological oxygen demand, chemical oxygen demand, concentrations of micro-pollutants and heavy metals, and so on. Naturally, the resulting water quality is influenced by a number of factors. These include the often stochastically fluctuating (discharge or nonpoint source) pollution load, as well as the regional hydro-meteorological conditions (streamflow rate, water temperature, etc.). Some of these factors can be directly observed, while some others may not be completely known. In a typical model development process, submodels are constructed to describe all physical, chemical, biological, and ecological processes of relevance. (As an example, one can refer to the classical Streeter–Phelps differential equations that approximate the longitudinal evolution of biological oxygen demand in a river; consult [25,28].) In order to combine such system descriptions with management models, one has to be able to evaluate all decisions considered. Each given decision $x$ can be related, inter alia, to the location and sizing of industrial and municipal wastewater treatment plants, the control of nonpoint source (agricultural) pollution, the design of a wastewater sewage collection network, the daily operation of these facilities, and so on. The analysis frequently involves the computationally intensive evaluation of environmental quality – e. g., by solving a system of (partial) differential equations – for each decision option considered. The (quite possibly) more realistic stochastic extensions of such models may also require the execution of Monte-Carlo simulation cycles. Under such or similar circumstances, environmental management models can be (very) complex, consisting of a number of 'black box' submodels. Consequently, the following general conceptual modeling framework may, and often will, lead to multi-extremal model instances requiring the application of suitable GO techniques:

$$\min \; TCEM(x) \quad \text{s.t.} \quad EQ_{\min} \le EQ(x) \le EQ_{\max}, \quad TF_{\min} \le TF(x) \le TF_{\max}, \tag{2}$$

in which
• $TCEM(x)$ is the total (discounted, expected) cost of environmental management;
• $EQ(x)$ is the resulting environmental quality (vector);
• $EQ_{\min}$ and $EQ_{\max}$ are vector bounds on 'acceptable' environmental quality indicators;


• $TF(x)$ are the resulting technical system characteristics (vector);
• $TF_{\min}$ and $TF_{\max}$ are vector bounds on 'acceptable' technical characteristics.

Numerous other examples could be cited: similarly to the case considered above, they may involve the solution of systems of (algebraic, ordinary or partial differential) equations, and/or the statistical analysis of the environmental (model) system studied. For further examples – including data analysis, combination of expert opinions, environmental model calibration, industrial wastewater management, regional pollution management in rivers and lakes, risk assessment and control of accidental pollution – in the context of global optimization consult, e. g., [26], and references therein.

See also
▸ Continuous Global Optimization: Applications
▸ Continuous Global Optimization: Models, Algorithms and Software
▸ Interval Global Optimization
▸ Mixed Integer Nonlinear Programming
▸ Optimization in Water Resources

References
1. Beck MB (1985) Water quality management: A review of the development and application of mathematical models. Springer, Berlin
2. Beck MB (ed) (1987) Systems analysis in water quality management. Pergamon, Oxford
3. Beck MB, van Straten G (eds) (1983) Uncertainty and forecasting in water quality. Springer, Berlin
4. Bower BT (ed) (1977) Regional residuals environmental quality management. Johns Hopkins Univ. Press, Baltimore, MD
5. Casti JL (1990) Searching for certainty. Morrow, New York
6. Dorfman R, Jacoby HD, Thomas HA (eds) (1974) Models for managing regional water quality. Harvard Univ. Press, Cambridge, MA
7. Eigen M, Winkler R (1975) Das Spiel. Piper, Munich
8. Gass SI, Sisson RI (eds) (1974) A guide to models in governmental planning and operations. Environmental Protection Agency, Washington, DC
9. Greenberg HJ (1995) Mathematical programming models for environmental quality control. Oper Res 43:578–622
10. Haith DA (1982) Environmental systems optimization. Wiley, New York


11. Hansen PE, Jørgensen SE (eds) (1991) Introduction to environmental management. Elsevier, Amsterdam
12. Hendrix EMT (1998) Global optimization at work. PhD Thesis, LU Wageningen
13. Holling CS (ed) (1978) Adaptive environmental assessment and management. IIASA & Wiley, New York
14. Horst R, Pardalos PM (eds) (1995) Handbook of global optimization. Kluwer, Dordrecht
15. Jørgensen SE (ed) (1983) Applications of ecological modelling in environmental management. Elsevier, Amsterdam
16. Kleindorfer PR, Kunreuther HC (eds) (1987) Insuring and managing hazardous risks: From Seveso to Bhopal. Springer, Berlin
17. Kneese AV, Ayres RU, d'Arge RC (1970) Economics and the environment: A materials balance approach. Johns Hopkins Univ. Press, Baltimore, MD
18. Kneese AV, Bower BT (1968) Managing water quality: Economics, technology, institutions. Johns Hopkins Univ. Press, Baltimore, MD
19. Loucks DP, Stedinger JR, Haith DA (1981) Water resources systems planning and analysis. Prentice-Hall, Englewood Cliffs, NJ
20. Mandelbrot BB (1983) The fractal geometry of nature. Freeman, New York
21. Murray JD (1983) Mathematical biology. Springer, Berlin
22. Neumaier A (1999) Global optimization. http://solon.cma.univie.ac.at/~neum/glopt.html
23. Nijkamp P (1980) Environmental policy analysis: Operational methods and models. Wiley, New York
24. Novotny W, Chesters G (1982) Handbook of nonpoint pollution. v. Nostrand, Princeton, NJ
25. Orlob GT (1983) Mathematical modeling of water quality: Streams, lakes and reservoirs. Wiley, New York
26. Pintér JD (1996) Global optimization in action. Kluwer, Dordrecht
27. Pintér JD (1998) A model development system for global optimization. In: De Leone R, Murli A, Pardalos PM, Toraldo G (eds) High Performance Software for Nonlinear Optimization: Status and Perspectives. Kluwer, Dordrecht, pp 301–314
28. Rich LG (1972) Environmental systems engineering. McGraw-Hill, New York
29. Richardson ML (ed) (1988) Risk assessment of chemicals in the environment. The Royal Soc. Chemistry, London
30. Schroeder M (1991) Fractals, chaos, power laws. Freeman, New York
31. Seneca JJ, Taussig MK (1974) Environmental economics. Prentice-Hall, Englewood Cliffs, NJ
32. Somlyódy L, van Straten G (eds) (1983) Modeling and managing shallow lake eutrophication. Springer, Berlin
33. United States Environmental Protection Agency (1988) Waste minimization opportunity assessment manual. Techn. Report, EPA, Cincinnati

Global Optimization: Application to Phase Equilibrium Problems

MARK A. STADTHERR
Department of Chemical Engineering, University of Notre Dame, Notre Dame, USA

MSC2000: 80A10, 80A22, 90C90, 65H20

Article Outline

Keywords
Background
Phase Stability Analysis
Interval Analysis
Conclusion
See also
References

Keywords

Interval analysis; Global optimization; Phase equilibrium; Phase stability; Interval Newton

The reliable calculation of phase equilibrium for multicomponent mixtures is a critical aspect in the simulation, optimization and design of a wide variety of industrial processes, especially those involving separation operations such as distillation and extraction. It is also important in the simulation of enhanced oil recovery processes such as miscible or immiscible gas flooding. Unfortunately, however, even when accurate models of the necessary thermodynamic properties are available, it is often very difficult to actually solve the phase equilibrium problem reliably.

Background

The computation of phase equilibrium is often considered in two stages, as outlined by M.L. Michelsen [12,13]. The first involves the phase stability problem, that is, to determine whether or not a given mixture will split into multiple phases. The second involves the phase split problem, that is, to determine the amounts and compositions of the phases assumed to be present. After a phase split problem is solved it may be necessary to do phase stability analysis on the results to determine whether the postulated number of phases was


correct, and if not repeat the phase split problem. Both the phase stability and phase split problems can be formulated as minimization problems, or as equivalent nonlinear equation solving problems. For determining phase equilibrium at constant temperature and pressure, the most commonly considered case, a model of the Gibbs free energy of the system is required. This is usually based on an excess Gibbs energy model (activity coefficient model) or an equation of state model. At equilibrium the total Gibbs energy of the system is minimized. Phase stability analysis may be interpreted as a global optimality test that determines whether the phase being tested corresponds to a global optimum in the total Gibbs energy of the system. If it is determined that a phase will split, then a phase split problem is solved, which can be interpreted as finding a local minimum in the total Gibbs energy of the system. This local minimum can then be tested for global optimality using phase stability analysis. If necessary the phase split calculation must then be repeated, perhaps changing the number of phases assumed to be present, until a solution is found that meets the global optimality test. Clearly the correct solution of the phase stability problem, itself a global optimization problem, is the key in this two-stage global optimization procedure for phase equilibrium. As emphasized in [10], while it is possible to apply rigorous global optimization techniques directly to the phase equilibrium problem, it is computationally more efficient to use a two-stage approach such as outlined above, since the dimensionality of the global optimization problem that must be solved (phase stability problem) is less than that of the full phase equilibrium problem. In solving the phase stability problem, the conventional solution methods are initialization dependent, and may fail by converging to trivial or nonphysical solutions or to a point that is a local but not a global minimum. Thus there is no guarantee that the phase equilibrium problem has been correctly solved. Because of the difficulties that may arise in solving phase equilibrium problems by standard methods (e. g., [12,13]), there has been significant interest in the development of more reliable methods. For example, the methods of A.C. Sun and W.D. Seider [16], who use a homotopy continuation approach, and of S.K. Wasylkiewicz, L.N. Sridhar, M.F. Malone and M.F. Doherty [18], who use an approach based on topological considerations, can

offer significant improvements in reliability. C.M. McDonald and C.A. Floudas [7,8,9,10] show that, for certain activity coefficient models, the phase stability and equilibrium problems can be made amenable to solution by powerful global optimization techniques, which provide a mathematical guarantee of reliability. An alternative approach for solving the phase stability problem, based on interval analysis, that provides both mathematical and computational guarantees of global optimality, was originally suggested by M.A. Stadtherr, C.A. Schnepper and J.F. Brennecke [15], who applied it in connection with activity coefficient models, as later done also in [11]. This technique, in particular the use of an interval Newton and generalized bisection algorithm, is initialization independent and can solve the phase stability problem with mathematical certainty, and, since it deals automatically with rounding error, with computational certainty as well. J.Z. Hua, Brennecke and Stadtherr [3,4,5,6] extended this method to problems modeled with cubic equation of state models, in particular the Van der Waals, Peng–Robinson, and Soave–Redlich–Kwong models. Though interval analysis provides a general purpose and model independent approach for guaranteed solution of the phase stability problem, the discussion below will focus on the use of cubic equation of state models.

Phase Stability Analysis

The determination of phase stability is often done using tangent plane analysis [1,12]. A phase at specified temperature $T$, pressure $P$, and feed mole fraction vector $z$ is unstable and can split (in this context, 'unstable' refers to both the thermodynamically metastable and classically unstable cases) if the molar Gibbs energy of mixing surface $m(x, v)$ ever falls below a plane tangent to the surface at $z$. That is, if the tangent plane distance

$$D(x, v) = m(x, v) - m_0 - \sum_{i=1}^{n} \left(\frac{\partial m}{\partial x_i}\right)_0 (x_i - z_i)$$

is negative for any composition (mole fraction) vector $x$, the phase is unstable. The subscript zero indicates evaluation at $x = z$, $n$ is the number of components, and $v$ is the molar volume of the mixture.
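To make the tangent plane test concrete, the toy computation below (illustrative Python; the EOS-based m(x, v) of this article is replaced by a composition-only regular-solution model, and the interaction parameter A = 2.5 and feed z1 = 0.2 are assumed values) evaluates D over the binary composition range; a negative minimum flags a feed that can split.

import numpy as np

A = 2.5          # assumed interaction parameter; A > 2 gives a miscibility gap
z1 = 0.2         # assumed feed composition

def m(x1):
    # Simplified molar Gibbs energy of mixing for a binary regular solution.
    x2 = 1.0 - x1
    return x1 * np.log(x1) + x2 * np.log(x2) + A * x1 * x2

def dm(x1, h=1.0e-7):
    return (m(x1 + h) - m(x1 - h)) / (2.0 * h)

def D(x1):
    # Tangent plane distance in the reduced composition space (x2 = 1 - x1).
    return m(x1) - m(z1) - dm(z1) * (x1 - z1)

x = np.linspace(1.0e-4, 1.0 - 1.0e-4, 100001)
print(x[np.argmin(D(x))], D(x).min())     # minimum is negative: feed unstable

For this simple one-dimensional model an exhaustive grid settles the question; the difficulty addressed in this article is that with an equation of state the search is over (x, v), the surface is multivalued in v, and gridding is no longer trustworthy or affordable.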


A common approach for determining if $D$ is ever negative is to minimize $D$ subject to the mole fractions summing to one,

$$1 - \sum_{i=1}^{n} x_i = 0, \tag{1}$$

and subject to the equation of state relating $x$ and $v$:

$$P - \frac{RT}{v - b} + \frac{a}{v^2 + ubv + wb^2} = 0. \tag{2}$$

Here $a$ and $b$ are functions of $x$ determined by specified mixing rules. The 'standard' mixing rules are $b = \sum_{i=1}^{n} x_i b_i$ and $a = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j a_{ij}$, with $a_{ij} = (1 - k_{ij})\sqrt{a_i a_j}$. The $a_i(T)$ and $b_i$ are pure component properties determined from the system temperature $T$, the critical temperatures $T_{ci}$, the critical pressures $P_{ci}$ and acentric factors $\omega_i$. The binary interaction parameter $k_{ij}$ is generally determined experimentally by fitting binary vapor–liquid equilibrium data. Equation (2) is a generalized cubic equation of state model. With the appropriate choice of $u$ and $w$, common models such as Peng–Robinson ($u = 2$, $w = -1$), Soave–Redlich–Kwong ($u = 1$, $w = 0$), and Van der Waals ($u = 0$, $w = 0$) may be obtained. It is readily shown that the stationary points in this optimization problem must satisfy

$$s_i(x, v) - s_i(z, v_0) = 0, \quad i = 1, \ldots, n - 1, \tag{3}$$

where

$$s_i = \frac{\partial m}{\partial x_i} - \frac{\partial m}{\partial x_n}.$$



The $(n + 1) \times (n + 1)$ system given by equations (1), (2) and (3) above can be used to solve for the stationary points in the optimization problem. The equation system for the stationary points has a trivial root at $(x, v) = (z, v_0)$ and frequently has multiple nontrivial roots as well. Thus conventional equation solving techniques may fail by converging to the trivial root or give an incorrect answer to the phase stability problem by converging to a stationary point that is not the global minimum of $D$. This is aptly demonstrated by the experiments of K.A. Green, S. Zhou and K.D. Luks [2], who show that the pattern of convergence from different initial guesses demonstrates a complex fractal-like behavior for even very simple models like Van der Waals. The problem is further complicated by the fact that the cubic equation of state (2) may have multiple real volume roots $v$.

As an example of a system that causes numerical difficulties, consider the binary mixture of hydrogen sulfide (component 1) and methane (component 2) at a temperature of 190 K and pressure of 40.53 bar (40 atm), modeled using the Soave–Redlich–Kwong equation of state, and with an overall feed composition of $z_1 = 0.0187$. Figure 1 shows a plot of the reduced Gibbs energy of mixing $m$ vs. $x_1$ for this system (in the reduced composition space where $x_2 = 1 - x_1$), and also shows the tangent at the feed composition. The corresponding tangent plane distance function is shown in Fig. 2 and Fig. 3. Note that this system has a region, around $x_1$ of 0.03 to 0.05, where multiple real volume roots occur and thus multiple values of $m$ and $D$ exist; only the lowest values are physically significant. This system has five stationary points, four minima and one maximum. Conventional locally convergent methods are typically used with multiple initial guesses, generally at or near

Global Optimization: Application to Phase Equilibrium Problems, Figure 1 Reduced Gibbs energy of mixing $m$ versus $x_1$ for the system hydrogen sulfide and methane, showing tangent at a feed composition of 0.0187

Global Optimization: Application to Phase Equilibrium Problems, Figure 2 Tangent plane distance $D$ versus $x_1$ for the example system of Fig. 1. See Fig. 3 for enlargement of area near the origin


Global Optimization: Application to Phase Equilibrium Problems, Figure 3 Enlargement of part of Fig. 2, showing area near the origin

When this is done, convergence will likely be to the local minimum at the feed composition (0.0187) and to the local minimum around 0.88. The global minimum with D < 0 is missed, leading to the incorrect conclusion that the mixture is stable.

Interval Analysis

Interval analysis makes possible the mathematically and computationally guaranteed solution of the phase stability problem. Since the mole fraction variables xᵢ are known to lie between zero and one, and it is easy to put physical upper and lower bounds on the molar volume v as well, a feasible interval for all variables is readily identified. By applying an interval Newton/generalized bisection approach to the entire feasible interval, enclosures of all the stationary points of the tangent plane distance D can be found by solving the nonlinear equation system (1)–(3), and the global minimum of D thus identified. This approach requires no initial guess, and is applicable to any model for the Gibbs energy, not just those derived from equations of state. For the binary system used as an example above, all five stationary points are easily found, and the global minimum at x₁ = 0.0767, v = 64.06 cm³/mol, and D = −0.004 thus identified [3,6]. The efficiency of the interval approach can depend significantly on how tightly one can compute interval extensions for the functions involved. The interval extension of a function over a given interval is an enclosure for the range of the function over that interval. When the natural interval extension, that is, the function range computed using interval arithmetic, is used, it may tightly bound the actual function range.


However, it is not uncommon for the natural interval extension to provide a significant overestimation of the true function range, especially for functions of the complexity encountered in the phase stability and equilibrium problems. Some tightening of bounds can be achieved by taking advantage of information about function monotonicity. Another simple and effective way to alleviate this difficulty in this context is to focus on tightening the enclosure when computing interval extensions of mole fraction weighted averages, such as $r = \sum_{i=1}^{n} x_i r_i$, where the $r_i$ are constants. Due to the mixing rules for determining a and b, such expressions occur frequently, both in the equation of state (2) itself and in the derived model m(x, v) for the Gibbs energy of mixing, and thus in equation (3). The natural interval extension of r will yield the true range (within roundout) of the expression in the space in which all the mole fraction variables xᵢ are independent. However, the range can be tightened by considering the constraint that the mole fractions must sum to one. One approach for doing this is simply to eliminate one of the mole fraction variables, say xₙ. Then an enclosure for the range of r in the constrained space can be determined by computing the natural interval extension of $r_n + \sum_{i=1}^{n-1} (r_i - r_n) x_i$. However, this may not yield the sharpest possible bounds on r in the constrained space. For constructing the exact (within roundout) bounds on r in the constrained space, S.R. Tessier [17] and Hua, Brennecke and Stadtherr [5] have presented a very simple method, based on the observation that at the extrema of r in the constrained space, at least n − 1 of the mole fraction variables must be at their upper or lower bound. This observation can be derived by viewing the problem of bounding the range of r in the constrained space as a linear programming problem. As shown in [5], when the constrained-space interval extensions for mole fraction weighted averages are used, together with information about function monotonicity, significant improvements in computational efficiency, nearly an order of magnitude even for small (binary and ternary) problems, can be achieved in using the interval approach for solving the phase stability problem. For small problems, it is usually efficient to globally minimize D by finding all of its stationary points, since this does not require repeated evaluation of the range of D.
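The exact constrained-space bounds on r can also be computed directly: minimizing (or maximizing) $\sum_i x_i r_i$ subject to $\sum_i x_i = 1$ with interval bounds on each $x_i$ is a one-constraint linear program whose solution can be obtained greedily, consistent with the observation that at an extremum at least n − 1 variables sit at a bound. A minimal sketch (not the authors' code):

```python
def weighted_average_bounds(r, lo, hi):
    """Exact range of r(x) = sum_i r_i x_i subject to sum_i x_i = 1
    and lo_i <= x_i <= hi_i (assumes sum(lo) <= 1 <= sum(hi))."""
    def extremum(maximize):
        x = list(lo)                      # start everyone at the lower bound
        slack = 1.0 - sum(lo)             # mass still to distribute
        order = sorted(range(len(r)), key=lambda i: r[i], reverse=maximize)
        for i in order:                   # greedily give mass to best r_i first
            give = min(hi[i] - lo[i], slack)
            x[i] += give
            slack -= give
        return sum(ri * xi for ri, xi in zip(r, x))
    return extremum(False), extremum(True)

# example: three mole fractions each within [0.0, 0.6]
print(weighted_average_bounds([2.0, 5.0, 3.0], [0.0, 0.0, 0.0], [0.6, 0.6, 0.6]))
# -> (2.4, 4.2): much tighter than the unconstrained enclosure [0.0, 6.0]
```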


However, in general, for determining phase stability or instability, finding all the stationary points is not really necessary, nor, for larger problems, desirable. For example, if an interval is encountered over which the interval extension of D has a negative upper bound, this guarantees that there is a point at which D < 0, and one can immediately conclude that the mixture is unstable without determining all the stationary points. It is also possible to exploit the structure of the underlying global minimization problem. Since the objective function D has a known value of zero at the mixture feed composition (the tangent point), any interval over which the interval extension of D has a lower bound greater than zero cannot contain the global minimum and can be discarded, even though it may contain a stationary point (at which D would be positive and thus not of interest). Thus, one can essentially combine the interval-Newton technique with an interval branch and bound procedure in which lower bounds are generated using interval techniques. Also, it should be noted that the global interval approach described here can easily be combined with existing local methods for determining phase stability and equilibrium. First, some (fast) local method is used. If it indicates instability, then this is the correct answer, as it means a point at which D < 0 has been found. If the local method indicates stability, however, this may not be the correct answer, since the local method may have missed the global minimum in D. Applying interval analysis as described here can then confirm that the mixture is stable if that is the case, or correctly determine that it is really unstable.
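The two tests described above translate directly into code. With any interval type, a box is discarded when the lower bound of the interval extension of D is positive, and instability is concluded as soon as some box gives a negative upper bound. A minimal sketch, where D_ext is a hypothetical interval extension supplied by the caller:

```python
def classify_box(D_ext, box):
    """One branch-and-bound test on a box (list of (lo, hi) pairs).

    D_ext maps a box to an enclosure (Dlo, Dhi) of the tangent plane
    distance D over that box.
    """
    Dlo, Dhi = D_ext(box)
    if Dlo > 0.0:
        return "discard"    # cannot contain the global minimum (D = 0 at feed)
    if Dhi < 0.0:
        return "unstable"   # guarantees a point with D < 0 exists
    return "subdivide"      # inconclusive: bisect and test the halves
```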

Conclusion

As demonstrated in [3,4,5,6,11,15], interval analysis can be used to solve phase stability and equilibrium problems efficiently and with complete reliability, providing a method that can guarantee with mathematical and computational certainty that the correct result is found, and thus eliminating computational problems that are encountered with conventional techniques. The method is initialization independent; it is also model independent, straightforward to use, and can be applied in connection with any equation of state or activity coefficient model for the Gibbs free energy of a mixture. There are many other problems in the analysis of phase behavior, and in chemical process analysis in general [14], that likewise are amenable to solution using this powerful approach.

See also

 Automatic Differentiation: Point and Interval
 Automatic Differentiation: Point and Interval Taylor Operators
 Bounding Derivative Ranges
 Global Optimization in Phase and Chemical Reaction Equilibrium
 Interval Analysis: Application to Chemical Engineering Design Problems
 Interval Analysis: Differential Equations
 Interval Analysis: Eigenvalue Bounds of Interval Matrices
 Interval Analysis: Intermediate Terms
 Interval Analysis: Nondifferentiable Problems
 Interval Analysis: Parallel Methods for Global Optimization
 Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
 Interval Analysis: Systems of Nonlinear Equations
 Interval Analysis: Unconstrained and Constrained Optimization
 Interval Analysis: Verifying Feasibility
 Interval Constraints
 Interval Fixed Point Theory
 Interval Global Optimization
 Interval Linear Systems
 Interval Newton Methods
 Optimality Criteria for Multiphase Chemical Equilibrium

References

1. Baker LE, Pierce AC, Luks KD (1982) Gibbs energy analysis of phase equilibria. Soc Petrol Eng J 22:731–742
2. Green KA, Zhou S, Luks KD (1993) The fractal response of robust solution techniques to the stationary point problem. Fluid Phase Equilib 84:49–78
3. Hua JZ, Brennecke JF, Stadtherr MA (1996) Reliable phase stability analysis for cubic equation of state models. Comput Chem Eng 20:S395–S400
4. Hua JZ, Brennecke JF, Stadtherr MA (1996) Reliable prediction of phase stability using an interval-Newton method. Fluid Phase Equilib 116:52–59
5. Hua JZ, Brennecke JF, Stadtherr MA (1998) Enhanced interval analysis for phase stability: Cubic equation of state models. Industr Eng Chem Res 37:1519–1527

6. Hua JZ, Brennecke JF, Stadtherr MA (1998) Reliable computation of phase stability using interval analysis: Cubic equation of state models. Comput Chem Eng 22:1207–1214
7. McDonald CM, Floudas CA (1995) Global optimization and analysis for the Gibbs free energy function using the UNIFAC, Wilson, and ASOG equations. Industr Eng Chem Res 34:1674–1687
8. McDonald CM, Floudas CA (1995) Global optimization for the phase and chemical equilibrium problem: Application to the NRTL equation. Comput Chem Eng 19:1111–1139
9. McDonald CM, Floudas CA (1995) Global optimization for the phase stability problem. AIChE J 41:1798–1814
10. McDonald CM, Floudas CA (1997) GLOPEQ: A new computational tool for the phase and chemical equilibrium problem. Comput Chem Eng 21:1–23
11. McKinnon KIM, Millar CG, Mongeau M (1996) Global optimization for the chemical and phase equilibrium problem using interval analysis. In: Floudas CA, Pardalos PM (eds) State of the Art in Global Optimization: Computational Methods and Applications. Kluwer, Dordrecht, pp 365–382
12. Michelsen ML (1982) The isothermal flash problem. Part I: Stability. Fluid Phase Equilib 9:1–19
13. Michelsen ML (1982) The isothermal flash problem. Part II: Phase-split calculation. Fluid Phase Equilib 9:21–40
14. Schnepper CA, Stadtherr MA (1996) Robust process simulation using interval methods. Comput Chem Eng 20:187–199
15. Stadtherr MA, Schnepper CA, Brennecke JF (1995) Robust phase stability analysis using interval methods. AIChE Symp Ser 91(304):356–359
16. Sun AC, Seider WD (1995) Homotopy-continuation method for stability analysis in the global minimization of the Gibbs free energy. Fluid Phase Equilib 103:213–249
17. Tessier SR (1997) Enhanced interval analysis for phase stability: Excess Gibbs energy models. MSc Thesis, Dept Chemical Engin, Univ Notre Dame
18. Wasylkiewicz SK, Sridhar LN, Malone MF, Doherty MF (1996) Global stability analysis and calculation of liquid–liquid equilibrium in multicomponent mixtures. Industr Eng Chem Res 35:1395–1408

Global Optimization Based on Statistical Models

ANTANAS ŽILINSKAS
Institute of Mathematics and Informatics, Vytautas Magnus University, Vilnius, Lithuania

MSC2000: 90C30


Article Outline

Keywords
See also
References

Keywords

Global optimization; Statistical models; Multimodal functions; Rational choice

Many practically significant problems require optimization in a 'black box' situation, where the objective function is given by a code but its structure is not known. Some algorithms developed for this case implement various heuristic ideas. A disadvantage of heuristic algorithms is that the results depend on many parameters whose choice is difficult because the meaning of these parameters is rather vague. To develop a theory of global optimization, the 'black box' should be replaced by a 'grey box' corresponding to some model of the predictability/uncertainty of the values of an objective function. A model of the objective function is an important counterpart of any optimization theory (e.g., quadratic models are widely used to construct algorithms for local nonlinear optimization). The uncertainty about the values of a multimodal function at arbitrary points of the feasible region is more essential than the uncertainty about the value of the objective function to be calculated at the current iteration of a local descent. Therefore, the global optimization models that describe the objective function with respect to information obtained during the previous iterations differ from the polynomial models used in local optimization. Different models may be used: e.g., a deterministic model, defining guaranteed intervals for unknown function values, or a statistical model, modeling the uncertainty about a function value by means of a random variable. The choice of a model is crucial because it defines the methodology for constructing the corresponding algorithms. A Lipschitzian-type model enables the construction of global optimization algorithms with guaranteed (worst case) accuracy. However, the number of function evaluations in the worst case grows drastically with the dimensionality of the problem and the prescribed accuracy. In spite of this pessimistic theoretical result, many practical, rather complicated problems have been solved heuristically.


Because heuristics is a methodology based on human experience, oriented towards average (typical, normal) conditions, it seems reasonable to develop a theory formalizing the principle of rational behavior with respect to average conditions in global optimization. Average rationality is well justified for playing a 'game against nature', which models optimization conditions better than an antagonistic game, where the principle of minimax (guaranteed result) is well justified. The methodology of average rationality was applied to develop the general theory of rational choice under statistically interpreted uncertainty [4]. This general theory was further specified to develop the theory of global optimization based on statistical models of multimodal functions [11]. To construct a statistical model of a multimodal function $f(x)$, $x \in A \subset R^n$, the axiomatic approach is applied: the rationality of comparisons of the likelihood of different values of f(·) is postulated by simple, intuitively acceptable axioms, and it is proved that the interpretation of an unknown value f(x) as a Gaussian random variable $\xi_x$ is compatible with the axioms. The parameters of $\xi_x$ (the mean value $m(x\,|\,(x_i, y_i))$ and the variance $\sigma^2(x\,|\,(x_i, y_i))$, where $y_i = f(x_i)$ are the known function values obtained during the search) are introduced by the axiomatic theory of extrapolation under uncertainty. In the one-dimensional case both functions are very simple: $m(x\,|\,(x_i, y_i))$ is piecewise linear (connecting the neighboring trial points) and $\sigma^2(x\,|\,(x_i, y_i))$ is piecewise quadratic. By means of further (more restrictive) assumptions, statistical models corresponding to stochastic functions may be specified. The one-dimensional model corresponding to the Wiener process was introduced in [3]. However, the specification of a model as a stochastic function is not very reasonable: this normally involves serious additional implementation difficulties and does not help to choose the model according to the a priori information on the problem. Using a statistical model, the algorithm is constructed to maximize the probability of finding better points than those found during the previous search. Such a strategy is also justified by natural axioms of rationality of search. In the one-dimensional case the algorithm is easy to implement. In the multidimensional case, an auxiliary optimization problem must be solved [8].
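For the one-dimensional Wiener-process model the resulting algorithm is easy to state: between consecutive trial points the conditional mean is linear and the conditional variance is quadratic, and the next evaluation is placed in the subinterval where the probability of improving on the best value found so far is largest. The following is one standard way to implement that selection, sketched for illustration (it is not the code of [9]):

```python
import math

def next_point(xs, ys, eps=0.1):
    """One iteration of a Wiener-model based algorithm on [0, 1].

    xs, ys: sorted trial points and their function values.
    Picks the midpoint of the interval maximizing the probability that
    f drops below min(ys) - eps under the Wiener-process model.
    """
    y_target = min(ys) - eps
    best_i, best_crit = None, -math.inf
    for i in range(len(xs) - 1):
        dx = xs[i + 1] - xs[i]
        if dx <= 0.0:
            continue
        # at the interval midpoint: conditional mean m = (y_i + y_{i+1})/2,
        # conditional variance s^2 = dx/4 for a standard Wiener process
        m = 0.5 * (ys[i] + ys[i + 1])
        s = math.sqrt(dx) / 2.0
        crit = (y_target - m) / s   # larger = higher improvement probability
        if crit > best_crit:
            best_crit, best_i = crit, i
    return 0.5 * (xs[best_i] + xs[best_i + 1])

xs, ys = [0.0, 0.4, 1.0], [1.0, 0.3, 0.7]
print(next_point(xs, ys))
```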

Although the algorithm is based on a statistical model, it is described without the use of randomization. Therefore it may be investigated by the usual deterministic methods; e.g., the convergence of the algorithm is proved under weak assumptions on the underlying statistical model (continuity of m(x|·) and σ²(x|·), and weak dependence of both characteristics at a point x on (xᵢ, yᵢ) for relatively remote points xᵢ [8]). The models and algorithms of this approach are well grounded theoretically because they are derived from natural assumptions on the rational behavior of an optimizer. As a topic for further research, the theory of average complexity seems very promising. It would be important to evaluate the complexity of practically efficient algorithms constructed by this approach, as well as to obtain general bounds and compare them with those obtained for Lipschitzian algorithms. The first results in this direction are interesting even for the one-dimensional case: the limit distribution of the error of passive random search in the case of the Wiener model exists or does not exist depending on a subtle interpretation of the model [2]. Other important theoretical topics are: developing dual (global–local) models for the multidimensional case, and justification of multidimensional statistical models oriented towards algorithms of the branch and bound type (cf. also  Integer Programming: Branch and Bound Methods), whose auxiliary computations would be essentially less time consuming than maximization of the probability over the whole feasible region at each iteration. Many algorithms have been constructed using different statistical models and more or less theoretically justified ideas. For example, a Bayesian algorithm (cf. also  Bayesian Global Optimization) is defined by minimizing the average error with respect to the stochastic function chosen for a model [5]. By interpolation, the next calculation of a value of the objective function is performed at the minimum point of m(·|(xᵢ, yᵢ)) [1,6]. For the information-statistical method, an ad hoc one-dimensional model is constructed [1,7]. The algorithms may be generalized to the case of 'noisy' functions; see for example the algorithm in [8,10]. The known results from the theory of stochastic functions, as well as the axiomatic construction of statistical models, do not give numerically tractable models which are completely adequate to describe the local and global properties of a typical global optimization problem [1].


But in the framework of statistical models the adequacy, e.g., to local properties of the objective function, might be tested as a statistical hypothesis. If the statistical model is locally inadequate in a subset of the feasible region, then the objective function is assumed unimodal in this subset and a local minimum of f(x) may be found by a local technique. An example of the combination of global and local search with a stopping rule corresponding to a high probability of finding the global minimum is presented in [9]. In the case of one-dimensional global optimization there are many competing algorithms, including algorithms based on statistical models [8]. The algorithms representing different approaches may be compared with sufficient reliability by means of experimental testing. Since the codes in the one-dimensional case are very precise realizations of the theoretical algorithms, the influence of implementation specifics is insignificant (at least with respect to multidimensional cases), and the comparison results may be generalized from codes to the corresponding approaches. The results in [8] show that the algorithm from [9] and its modification [8] outperform algorithms based on Lipschitzian-type models even if a good estimate of the Lipschitz constant is available. The comparison of multidimensional algorithms is methodologically more difficult, partly because of very different stopping conditions. But generally speaking, the algorithms based on statistical models are efficient with respect to the number of evaluations of the objective function for multimodal functions of up to 10–15 variables [8]. The auxiliary computations require much computing time and computer memory. Therefore, such algorithms are best suited to problems whose objective function is expensive to evaluate. If an objective function is cheap to evaluate, the gain obtained from a low number of function evaluations may be less than the loss caused by the auxiliary computations. A detailed review of the subject is presented in [8]; further references may be found in [1].

See also

 Adaptive Global Search
 Adaptive Simulated Annealing and its Application to Protein Folding
 αBB Algorithm
 Bayesian Global Optimization


 Continuous Global Optimization: Applications
 Continuous Global Optimization: Models, Algorithms and Software
 Differential Equations and Global Optimization
 DIRECT Global Optimization Algorithm
 Genetic Algorithms for Protein Structure Prediction
 Global Optimization in Binary Star Astronomy
 Global Optimization Methods for Systems of Nonlinear Equations
 Global Optimization Using Space Filling
 Monte-Carlo Simulated Annealing in Protein Folding
 Packet Annealing
 Random Search Methods
 Simulated Annealing
 Simulated Annealing Methods in Protein Folding
 Stochastic Global Optimization: Stopping Rules
 Stochastic Global Optimization: Two-phase Methods
 Topology of Global Optimization

References

1. Boender G, Romeijn E (1995) Stochastic methods. In: Horst R, Pardalos PM (eds) Handbook Global Optim. Kluwer, Dordrecht, pp 829–869
2. Calvin J, Glynn P (1997) Average case behavior of random search for the maximum. J Appl Probab 34:631–642
3. Kushner H (1962) A versatile stochastic model of a function of unknown and time-varying form. J Math Anal Appl 5:150–167
4. Luce D, Suppes P (1965) Preference, utility and subjective probability. In: Luce D, Bush R, Galanter E (eds) Handbook Math Psychology. Wiley, New York, pp 249–410
5. Mockus J (1989) Bayesian approach to global optimization. Kluwer, Dordrecht
6. Shagen I (1980) Stochastic interpolation applied to the optimization of expensive objective functions. In: COMPSTAT 1980. Physica Verlag, Heidelberg, pp 302–307
7. Strongin R (1978) Numerical methods in multiextremal optimization. Nauka, Moscow
8. Törn A, Žilinskas A (1989) Global optimization. Springer, Berlin
9. Žilinskas A (1978) Optimization of one-dimensional multimodal functions, Algorithm AS 133. Applied Statist 23:367–385
10. Žilinskas A (1980) MIMUN: optimization of one-dimensional multimodal functions in the presence of noise, Algoritmus 44. Aplikace Mat 25:392–402
11. Žilinskas A (1985) Axiomatic characterisation of a global optimization algorithm and investigation of its search strategy. Oper Res Lett 4:35–39


Global Optimization in Batch Design Under Uncertainty

S. T. HARDING, CHRISTODOULOS A. FLOUDAS
Department of Chemical Engineering, Princeton University, Princeton, USA

MSC2000: 90C26

Article Outline

Keywords
Conceptual Framework
  Constraints on Batch Size
  Minimum Cycle Time
  Constraints on Production Time
  Demand Constraints
  Economic Objective Function

Sources of Uncertainty
  Uncertainty in Process Parameters
  Uncertainty in Product Demand

Global Optimization Approaches
  The GOP Approach
  αBB Approach

Other Types of Batch Plants
  Mixed-Product Campaign
  Multipurpose Batch Plant-Single Equipment Sequence
  Multipurpose Batch Plant-Multiple Equipment Sequence

See also
References

Keywords

Batch plant design; Multiproduct; Multipurpose; Uncertainty

Batch processes are a popular method for manufacturing low-volume products or products that require several complicated steps in the synthesis procedure. The growth in the market for specialty chemicals has contributed to the demand for efficient batch plants. Batch processes are especially attractive due to their inherent flexibility: they can accommodate a wide range of production requirements, batch equipment can be reconfigured to produce more than one product, and certain pieces of equipment in batch processes can be used for more than one task.

An important area of concern in the design of batch processes is their ability to accommodate changes in production requirements and processing parameters. The key issue is: given some degree of uncertainty in a) the future demand for the products and b) the parameters that describe the chemical and physical steps involved in the process, what is the appropriate amount of flexibility the process should possess so as to maintain feasible operation while maximizing profits? Many methods have been proposed for the design of batch plants under known market conditions and nominal operating conditions. Two major classes of batch plant designs are multiproduct plants and multipurpose plants. In the multiproduct plant, all products follow the same sequence of processing steps. Typically, one product is produced at a time in what is termed a single-product campaign (SPC). Multipurpose batch plants allow products to be processed using different sequences of equipment, and in some cases products can be produced simultaneously. While significant progress has been made in the design and scheduling of batch plants, until recently the issues of flexibility and design under uncertainty have received little attention. Among the first to address the problem of batch plant design under uncertainty in a novel way were [10] and [8]. They divided the variables in the design problem into five categories: structural, design, state, operating, and uncertain. Structural variables describe the interconnections of the equipment in the plant. Design variables describe the size of the process equipment and are fixed once the plant is constructed. State variables are dependent variables and are determined once the design and operating variables are specified. Operating variables are those whose values can be changed in response to variations in the uncertain variables. Finally, the uncertain parameters are the quantities that can take random values described by a probability distribution. Usually the uncertain parameters have normal distributions and are considered to be independent of each other. [8] also introduced the distinction between variations which have short-term effects and those with long-term effects. [18] extended this idea, suggesting a distinction between 'hard' and 'soft' constraints, in which the former must be satisfied for feasible plant operation, but the latter may be violated, subject to a penalty in the objective function. They considered the time required to produce a product as uncertain and developed a problem formulation.


In [12] and [13] the authors addressed the problem of multiproduct batch plant design with uncertainties in both the demand for the products and in technical parameters such as processing times and size factors. They restricted their designs to one piece of equipment per stage. [3] presented several variations on the problem of design with uncertain demands. They used interval methods to develop different solution procedures, including a two-stage approach and a penalty function approach. Another type of batch plant is the multipurpose plant. [14] proposed a scenario-based approach for the design of multipurpose batch plants with uncertain production requirements. The multipurpose approach resulted in a large scale MILP model for which efficient techniques for obtaining good upper and lower bounds were proposed. [15] developed a model for the multiproduct batch design problem which takes into account uncertainties in the product demands and in equipment availability. They considered the problem of design feasibility separately from the maximization of profits and presented an approach for achieving both criteria. [16] addressed the problem of uncertain demands, and used a scenario-based approach with discrete probability distributions for the demands. In addition, they considered the scheduling problem as a second stage, following the design problem. [6] and [7] considered the multiproduct batch plant design problem based on a stochastic programming formulation. They developed a relaxation of the production feasibility requirement and added a penalty term to the objective function to account for partial feasibility. Through this analysis, the problem can be reformulated as a single large scale nonconvex optimization problem. [2] extended this work to the design of multipurpose batch plants and implemented an efficient Gaussian quadrature technique to improve the estimation of the expected profit. [5] identified special structures in the nonconvex constraints for multiproduct and multipurpose batch design formulations. These properties can be exploited to obtain tight bounds on the global solution. This allows very large scale design problems to be solved in reasonable CPU time using the αBB method of [1].

Conceptual Framework

Most batch design problems are variations on the same basic model of a batch plant.


The plant consists of M processing stages, where each stage j contains $N_j$ identical pieces of equipment. The volume of each unit, $V_j$, is a design variable, and the number of units per stage, $N_j$, may be a variable or a fixed parameter. In the batch plant, NP products are to be made, and the amount of each produced is $Q_i$. Each product is produced in a number of batches of identical size, $B_i$. Using these definitions, a number of constraints on the design of the plant can be imposed. These constraints are: 1) an upper limit on the batch size, 2) a lower limit on the amount of time between batches, 3) an upper limit on the total processing time allowed, and 4) a constraint on the production related to the demand for each product. The basic form of these constraints is shown below, for a multiproduct batch plant with single-product campaigns.

Constraints on Batch Size

The batch size for each product i cannot be larger than the size of the pieces of equipment in each stage j. This can be written

$$B_i \le \frac{V_j}{S_{ij}} , \quad i = 1, \ldots, NP , \; j = 1, \ldots, M .$$

The size factor, $S_{ij}$, is the capacity required in stage j to process one unit of product i.

Minimum Cycle Time

In order to make sure that each batch is processed separately in a given stage, one batch cannot begin processing until the previous batch has been processed for a certain amount of time. This is called the cycle time:

$$T_{Li} \ge \frac{t_{ij}}{N_j} , \quad i = 1, \ldots, NP , \; j = 1, \ldots, M .$$

The time factor, $t_{ij}$, is the amount of time to process one batch of product i in stage j.


Constraints on Production Time

The amount of time needed to produce all of the batches must be less than the total time available, H:

$$\sum_{i=1}^{NP} \frac{Q_i}{B_i} T_{Li} \le H .$$

Demand Constraints

The production for each product must meet the demand:

$$Q_i = D_i .$$

Economic Objective Function

The objective is to maximize profits. The profit is calculated by subtracting the annualized capital costs from the revenues:

$$\mathrm{Profit} = \sum_{i=1}^{NP} Q_i p_i - \sum_{j=1}^{M} \alpha_j N_j V_j^{\beta_j} ,$$

where $p_i$ is the price of product i, and $\alpha_j$ is the annualization factor for the cost of the units in stage j. In the case where the number of units per stage, $N_j$, is variable and/or the unit sizes, $V_j$, take only discrete values, this problem is a mixed integer nonlinear optimization problem (MINLP). If $N_j$ is fixed and the unit sizes are continuous, the problem is a nonlinear program (NLP). In either case, the problem is nonconvex, so conventional mixed integer and nonlinear solvers cannot be used robustly. Instead, global optimization techniques must be employed to guarantee that the optimal solution is located.

Sources of Uncertainty

Within the mathematical framework for a multiproduct batch plant there are a number of possible sources of uncertainty. The most commonly studied are uncertainty in the process parameters, such as the size factors, $S_{ij}$, and the time factors, $t_{ij}$, and uncertainty in the product demand, $D_i$. In addition to these, [3] considered uncertainty in the time horizon, H, and in the product prices, $p_i$. Uncertainty in the process parameters is model-inherent uncertainty, as classified by [11]. That is, uncertainty in the process parameters affects the feasible operation of the batch plant. Conversely, uncertainty in the product demand is an external source of uncertainty; it only affects the objective function, not the feasibility of the plant design.

Uncertainty in Process Parameters

The size factors and processing times affect the feasible design and operation of the batch plant. The goal is to design a plant that can operate feasibly even if there is some uncertainty in the values of these parameters. The approach commonly followed is to consider a number of different scenarios, where each scenario corresponds to a set of parameter realizations. For example, if the size factors, $S_{ij}$, have some nominal value, then one scenario is that all of the size factors are at their nominal value. Similarly, if we have some knowledge about the amount of uncertainty in the size factors, we can construct a lower extreme scenario, where each size factor is at its lower bound, $S_{ij}^L$, and an upper extreme scenario, $S_{ij}^U$. The new set of size factors, reflecting the different scenarios, is represented by the parameter $S_{ij}^p$. The scenarios can be weighted using the factor $w^p$. The set of constraints for the batch design problem must be modified so that the design is feasible over the whole set of scenarios, P:

$$B_i \le \frac{V_j}{S_{ij}^p} , \qquad T_{Li}^p \ge \frac{t_{ij}^p}{N_j} , \qquad \sum_{i=1}^{NP} \frac{Q_i^p}{B_i} T_{Li}^p \le H , \qquad \forall p \in P .$$
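For a fixed design, checking horizon feasibility and evaluating the profit reduces to direct evaluation of the constraints above. A small illustrative sketch with hypothetical data (two products, two stages, nominal parameters only):

```python
def evaluate_design(V, N, S, t, Q, p, alpha, beta, H):
    """Feasibility and profit of a multiproduct batch plant design.

    V[j], N[j]: unit size and number of units per stage j;
    S[i][j], t[i][j]: size and time factors; Q[i]: production targets;
    p[i]: prices; alpha[j], beta[j]: cost coefficients; H: horizon.
    """
    M, NP = len(V), len(Q)
    B = [min(V[j] / S[i][j] for j in range(M)) for i in range(NP)]   # batch sizes
    TL = [max(t[i][j] / N[j] for j in range(M)) for i in range(NP)]  # cycle times
    time_used = sum(Q[i] / B[i] * TL[i] for i in range(NP))
    profit = sum(Q[i] * p[i] for i in range(NP)) \
        - sum(alpha[j] * N[j] * V[j] ** beta[j] for j in range(M))
    return time_used <= H, profit

# hypothetical two-product, two-stage example
ok, profit = evaluate_design(
    V=[3000.0, 2500.0], N=[1, 2],
    S=[[2.0, 3.0], [4.0, 2.5]], t=[[8.0, 6.0], [10.0, 9.0]],
    Q=[400000.0, 300000.0], p=[0.5, 0.4],
    alpha=[250.0, 250.0], beta=[0.6, 0.6], H=8000.0)
print(ok, round(profit, 1))
```

Under parameter uncertainty, the same check is simply repeated for every scenario p with $S_{ij}^p$ and $t_{ij}^p$ in place of the nominal values.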

Uncertainty in Product Demand

Uncertainty in the demand for the products affects the profitability of the plant. In this case, the product demand is given by a probability distribution function $J(\theta_i)$, where $\theta_i$ represents the uncertain demand for product i. The calculation of the expected revenues requires integration over an optimization problem:

$$E\left[ \max_{Q_i} \sum_{i=1}^{NP} p_i Q_i \right] = \int_{\theta} \left\{ \max_{Q_i \in R(V_j, N_j)} \sum_{i=1}^{NP} p_i Q_i \right\} J(\theta) \, d\theta . \qquad (1)$$


The integration should be performed over the feasible region of the plant, which is unknown at the design stage. See [6] for a Gaussian quadrature approach to discretize the integration. The range of uncertain demands is covered by a grid, where each point on the grid represents a set of demand realizations and is assigned a weight corresponding to its probability, $\omega^q J^q$. The set of quadrature points is represented by Q. The expected revenues are now calculated as a multiple summation:

$$E\left[ \max_{Q_i} \sum_{i=1}^{NP} p_i Q_i \right] = \frac{1}{w^p} \sum_{p=1}^{P} \sum_{q=1}^{Q} \omega^q J^q \sum_{i=1}^{NP} p_i Q_i^{qp} .$$

In addition, the time horizon constraint must be modified:

$$\sum_{i=1}^{NP} \frac{Q_i^{qp}}{B_i} T_{Li}^p \le H , \qquad \forall p \in P , \; \forall q \in Q .$$
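A sketch of the quadrature idea for a single product: discretize the uncertain demand with Gauss–Legendre nodes over a finite range, weight each node by the (normalized) probability density, and sum. The inner maximization is collapsed here to a simple capacity rule for illustration, and all numbers are hypothetical:

```python
import numpy as np

def expected_revenue(price, dist_pdf, d_lo, d_hi, capacity, nq=10):
    """E[ price * min(demand, capacity) ] for one product by
    Gauss-Legendre quadrature of the demand density on [d_lo, d_hi]."""
    nodes, weights = np.polynomial.legendre.leggauss(nq)
    theta = 0.5 * (d_hi + d_lo) + 0.5 * (d_hi - d_lo) * nodes  # rescale nodes
    w = 0.5 * (d_hi - d_lo) * weights * dist_pdf(theta)
    w = w / w.sum()                      # normalize the truncated density
    revenue = price * np.minimum(theta, capacity)  # sell what can be produced
    return float(w @ revenue)

# illustrative: roughly normal demand, mean 1000, std 100
pdf = lambda t: np.exp(-0.5 * ((t - 1000.0) / 100.0) ** 2)
print(expected_revenue(price=0.5, dist_pdf=pdf, d_lo=700.0, d_hi=1300.0,
                       capacity=1050.0))
```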

Global Optimization Approaches

The set of constraints for the design of a multiproduct batch plant under uncertainty forms a nonconvex optimization problem. Global optimization techniques must be used in order to ensure that the true optimal design is located. Following the analysis of [9], an exponential transformation can be applied, reducing the number of nonlinear terms in the model:

$$V_j = \exp(v_j) , \; \forall j \in M ; \qquad B_i = \exp(b_i) , \; \forall i \in NP ; \qquad T_{Li}^p = \exp(t_{Li}^p) , \; \forall i \in NP .$$

In [5] and [6] global optimization methods were developed to solve this problem, where the number of units in each stage, $N_j$, is fixed. In this case, the cycle time becomes a parameter, determined by

$$t_{Li}^p = \max_{j} \ln\left( \frac{t_{ij}^p}{N_j} \right) .$$

The nonlinear optimization problem to be solved is written as a minimization:

$$\begin{aligned}
\min_{b_i, v_j, Q_i^{qp}} \quad & \delta \sum_{j=1}^{M} \alpha_j N_j \exp(\beta_j v_j)
 - \frac{1}{w^p} \sum_{p=1}^{P} \sum_{q=1}^{Q} \omega^q J^q \sum_{i=1}^{NP} p_i Q_i^{qp}
 + \frac{1}{w^p} \sum_{p=1}^{P} \sum_{q=1}^{Q} \omega^q J^q \sum_{i=1}^{NP} \gamma p_i \left( \theta_i^q - Q_i^{qp} \right) \\
\text{s.t.} \quad & v_j \ge \ln(S_{ij}^p) + b_i \\
& \sum_{i=1}^{NP} Q_i^{qp} \exp(t_{Li}^p - b_i) \le H \\
& \theta_i^L \le Q_i^{qp} \le \theta_i^q \\
& \ln(V_j^L) \le v_j \le \ln(V_j^U) \\
& \min_{j,p} \ln\left( \frac{V_j^L}{S_{ij}^p} \right) \le b_i \le \min_{j,p} \ln\left( \frac{V_j^U}{S_{ij}^p} \right) .
\end{aligned} \qquad (2)$$

Note that the time horizon constraint is the only nonconvex constraint remaining in the problem formulation. A penalty term is added to the objective function to account for unsatisfied demand; the penalty parameter is γ.

The GOP Approach

In [7] and [2] the GOP algorithm of [4,17] has been applied to solve design formulations for both multipurpose and multiproduct batch plants. GOP converges to the global optimum solution by solving a primal problem and a number of relaxed dual problems in each iteration. In [7] it is observed that if the variables in the batch design problem are partitioned so that y = {$v_j$, $b_i$} and x = {$Q_i^{qp}$}, then the problem is convex in y for every fixed x, and linear in x for every fixed y. This satisfies Condition A) of the GOP algorithm. A property was developed in [7] that allows the number of relaxed duals per iteration to be reduced from $2^{NP \cdot Q}$ to $2^{NP}$, making the problem computationally tractable.

αBB Approach

The αBB approach of [1] was applied in [5] to solve both multiproduct and multipurpose design formulations.


αBB is a branch and bound approach that converges to the global solution by solving a sequence of upper and lower bounding problems. The lower bounding problem is formulated by subtracting a quadratic term, multiplied by the constant α, from each of the nonconvex terms, thus convexifying the problem. Often, the size of the α term must be estimated, resulting in poor lower bounds in the first few levels of the branch and bound tree. However, the nonconvex terms in the batch plant design formulation allow the exact value of α to be calculated, resulting in a tight lower bound on the global solution. This technique has been used to find the optimal design for a multiproduct batch plant with 5 products in 6 stages. This corresponds to a nonconvex NLP with 15,636 variables, 3155 constraints, and 15,625 nonconvex terms.
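The αBB lower bounding idea in one picture: subtract $\alpha \sum_i (x_i^U - x_i)(x_i - x_i^L)$ from the nonconvex function, with α large enough (at least half the magnitude of the most negative Hessian eigenvalue over the box) to make the result convex. A generic sketch of the underestimator, not the batch-design-specific exact-α computation:

```python
import numpy as np

def alpha_underestimator(f, alpha, lo, hi):
    """Return the alphaBB convex underestimator of f on the box [lo, hi]."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    def L(x):
        x = np.asarray(x)
        # the quadratic perturbation vanishes at the box corners and is
        # nonnegative inside, so L <= f everywhere on the box
        return f(x) - alpha * np.dot(hi - x, x - lo)
    return L

# example: a nonconvex function on [0, 2] x [0, 2]
f = lambda x: np.sin(3.0 * x[0]) * np.cos(2.0 * x[1])
L = alpha_underestimator(f, alpha=7.5, lo=[0.0, 0.0], hi=[2.0, 2.0])
print(f([1.0, 1.0]), L([1.0, 1.0]))   # L lies below f on the box
```

Minimizing L over the box gives a valid lower bound for the branch and bound tree; the tighter α is, the tighter the bound.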

Other Types of Batch Plants

In addition to the multiproduct batch plant with single-product campaign illustrated in the preceding sections, there are many other batch plant design formulations that can be adapted to consider the issue of uncertainty in design.

Mixed-Product Campaign

This is another example of a multiproduct batch plant. In this case, storage of the intermediate products is allowed between processing steps. In addition, batches of different products can be alternated. This allows a reduction in the total production time. Rather than being limited by the largest cycle time for all stages, this method calculates the total production time for each stage:

$$T_j^{qp,\mathrm{tot}} = \sum_{i=1}^{NP} \left( \frac{Q_i^{qp}}{B_i} \right) t_{ij}^p .$$

The total time for each stage must be less than the total time allowed:

$$H \ge T_j^{qp,\mathrm{tot}} = \sum_{i=1}^{NP} \left( \frac{Q_i^{qp}}{B_i} \right) t_{ij}^p .$$

This can be written

$$\sum_{i=1}^{NP} \left( \frac{Q_i^{qp}}{B_i} \right) t_{ij}^p \le H .$$

Note that this constraint has the same form as the time horizon constraint for the single-product campaign formulation.

Multipurpose Batch Plant-Single Equipment Sequence

In a multipurpose batch plant, the equipment can be used for more than one function, therefore each product may have a different route through the plant. In the single equipment sequence case, there is one distinct route for each product. Production is carried out in a sequence of campaigns L, and there may be more than one product produced simultaneously in a campaign, h. The time needed for each campaign, $C_h$, is based on the maximum cycle time for all products in the campaign,

$$\sum_{h=1}^{L} \alpha_{hi} C_h^{qp} \ge \left( \frac{Q_i^{qp}}{B_i} \right) T_{Li}^p ,$$

where $\alpha_{hi} = 1$ if product i is produced in campaign h, and $\alpha_{hi} = 0$ otherwise.

Suggested pseudocode after [16]

Let’s first remind that the guess generator proposed in [16] is based on a thermally disturbed simplex [13]. When the temperature approaches 0, the generator reduces to the Nelder–Mead algorithm and a local convergence can be expected. W.H. Press et al. announce a local convergence whereas V. Torczon [17] showed such a convergence cannot be guaranteed with the Nelder– Mead algorithm. The major drawback of this algorithm is that the simplex can degenerate (a vertex becomes a linear combination of strictly less than the other n ones). If that happens, only a subspace of the complete working space can be visited and the risk of missing the minimum raises. To decide whether or not to reinitialize the simplex can be based on the mean of the values at the n+1 vertices. The mean is compared with the mean at the previous temperature. If the relative change is not important enough or the generator stops at a local minimum, a new simplex is generated. The best point ever met is chosen as one of the vertices. A natural way to initialize a simplex is to choose the n remaining vertices such that each edge issued from the (n+1)th point is parallel to a different axis of coordinates. A refined version of that approach is adopted. Instead of randomly choosing the value of the component in the interval of accepted values for that component, some ‘taboo’ restrictions are added. The overall working space is divided in regions. When a new simplex is generated, each cells containing a vertex are marked as taboo. The random selection of the value of a component is repeated until the resulting cell (C) does not lie in a taboo region (TL).

DO
  use a simplex to get a new solution;
  IF initialization required THEN
    adopt the best solution as the (n+1)th vertex;
    for the first n vertices (Vi) DO
      Vi = Vn+1;
      DO
        change the ith component of Vi;
        identify C;
      WHILE (C in TL);
      add C to TL;
    OD;
  FI;
  decrease the temperature;
WHILE (temperature > Tmin);

Adopted pseudocode
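A compact rendering of the adopted reinitialization step in Python; the grid resolution and bounds are arbitrary illustrative choices, and no safeguard against a fully taboo axis is included:

```python
import random

def reinit_simplex(best, bounds, taboo, divisions=10):
    """Rebuild a simplex around `best`, avoiding taboo grid cells.

    bounds: list of (lo, hi) per coordinate; taboo: set of cell tuples.
    Vertex i differs from `best` only in coordinate i, so the new
    simplex cannot be degenerate.
    """
    def cell(x):   # integer grid cell used for the taboo test
        return tuple(int(divisions * (xi - lo) / (hi - lo + 1e-12))
                     for xi, (lo, hi) in zip(x, bounds))
    n = len(best)
    vertices = [list(best)]
    taboo.add(cell(best))
    for i in range(n):
        v = list(best)
        while True:
            v[i] = random.uniform(*bounds[i])   # redraw the ith component
            if cell(v) not in taboo:
                break
        taboo.add(cell(v))
        vertices.append(v)
    return vertices

print(reinit_simplex([0.5, 0.5], [(0.0, 1.0), (0.0, 1.0)], taboo=set()))
```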

Ingber’s algorithm ([6,7]) is used for the annealing schedule. The initial temperature is set to 10hl o g 10 (D)i where hlog10 (D)i stands for the mean of the logarithm of the objective function over the first generated simplex. Element a(00 ) i(ı ) !(ı ) ˝(ı ) e P(yr) T (Besselian yr) V0 (km/s) !(00 ) mass A(Mˇ ) mass B(Mˇ )

Value 0:072 68 352 262:0 0:38 1:7255 1979:332 9:78 0:038 0:349 1:5 0:8

Std. dev. 0:0010 1:3 2:2 0:53 0:016 0:00098 0:0099 0:13 0:0012 0:0096 0:18 0:12

Orbital parameters of HIP111170 and their standard derivations


Example 1 (HIP 111170) The double star HIP 111170 (= HR 8851 = HD 213429) is a good example to illustrate how appropriate a simultaneous adjustment is where a disjoint one would fail. The visual observations ([9,10]) are too few to allow a visual orbit determination: at least 3.5 observations (two quantities each) would be needed to adjust the 7 parameters.


Fortunately, the spectroscopic data are more numerous and the two radial velocity curves are well covered. From a mathematical point of view, two visual observations are the minimum if the spectroscopic observations [3] are well spread over the two curves. The table above gives the orbital parameters used for the figures. The obtained parallax is in quite good agreement with the 0.03918'' ± 0.00183'' derived after the Hipparcos mission [4].

Global Optimization in Binary Star Astronomy, Figure 1 Adjusted visual orbit of HIP 111170. The cross represents component A

Global Optimization in Binary Star Astronomy, Figure 2 Adjusted spectroscopic orbits of HIP 111170

Conclusion

Even when the observations seem very precise, the objective function describing the residual between the observed and computed data has many local minima. Astronomers should be aware of that fact, as they should be aware of techniques to tackle such situations efficiently.

See also

 αBB Algorithm
 Continuous Global Optimization: Applications
 Continuous Global Optimization: Models, Algorithms and Software
 Differential Equations and Global Optimization
 DIRECT Global Optimization Algorithm
 Global Optimization Based on Statistical Models
 Global Optimization Methods for Systems of Nonlinear Equations
 Global Optimization Using Space Filling
 Topology of Global Optimization

References


1. Carette E, de Greve JP, van Rensebergen W, Lampens P (1995)  Circini: A young visual binary with pre-main-sequence component(s)? Astronomy and Astrophysics 296:139
2. Docobo JA, Elipe A, McAlister H (eds) (1997) Visual Double Stars: Formation, Dynamics and Evolutionary Tracks. Kluwer, Dordrecht
3. Duquennoy A, Mayor M, Griffin RF, Beavers WI, Eitter JJ (1988) Duplicity in the solar neighbourhood; V. Spectroscopic orbit of the nearby double-lined star HR 8581. Astronomy and Astrophysics Suppl Ser 75:167
4. ESA (1997) The Hipparcos and Tycho catalogues. ESA SP-1200
5. Hummel CA, Armstrong JT, Buscher DF, Mozurkewich D, Quirrenbach A, Vivekanand M (1995) Orbits of small angular scale binaries resolved with the Mark III interferometer. Astronomical J 110:376

6. Ingber L (1993) Adaptive simulated annealing (ASA). Techn Report, Caltech
7. Ingber L (1993) Simulated annealing: Practice versus theory. Res Note, Caltech
8. Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671
9. McAlister HA, Hartkopf WI, Franz OG (1990) ICCD speckle observations of binary stars. V: Measurements during 1988–1989 from the Kitt Peak and the Cerro Tololo 4 m telescopes. Astronomical J 99:965
10. McAlister HA, Hartkopf WI, Hutter DJ, Shara MM, Franz OG (1987) ICCD speckle observations of binary stars. I: A survey of duplicity among the bright stars. Astronomical J 93:183
11. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087
12. Morbey CL (1975) A synthesis of the solutions of spectroscopic and visual binary orbits. Publ Astronomical Soc Pacific 87:689
13. Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308
14. Pourbaix D (1994) A trial-and-error approach to the determination of the orbital parameters of visual binaries. Astronomy and Astrophysics 290:682
15. Pourbaix D, Lampens P (1997) A new method used to revisit the visual orbit of the spectroscopic triple system eta Orionis A. In: Docobo JA, Elipe A, McAlister H (eds) Visual Double Stars: Formation, Dynamics and Evolutionary Tracks. Kluwer, Dordrecht, p 383
16. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical Recipes in C, 2nd edn. Cambridge Univ. Press, Cambridge
17. Torczon V (1991) On the convergence of the multidirectional search algorithm. SIAM J Optim 1(1):123
18. Torres G (1995) A visual-spectroscopic orbit for the binary ˙248. Publ Astronomical Soc Pacific 107:524

Global Optimization: Cutting Angle Method

ADIL BAGIROV¹, GLEB BELIAKOV²
¹ Centre for Informatics and Applied Optimization, School of Information Technology and Mathematical Sciences, University of Ballarat, Victoria, Australia
² School of Engineering and Information Technology, Deakin University, Victoria, Australia

MSC2000: 90C26, 65K05, 90C56, 65K10

Article Outline

Introduction
Definitions
  Notation
  Abstract Convex Functions
  IPH Functions
  Lipschitz Functions

Methods
  Generalized Cutting Plane Method
  Global Minimization of IPH Functions over Unit Simplex
  Global Minimization of Lipschitz Functions
  The Auxiliary Problem
  Solution of the Auxiliary Problem

Conclusions
References

Introduction

The cutting angle method (CAM) is a deterministic method for solving different classes of global optimization problems. It is a version of the generalized cutting plane method, and it works by building a sequence of tight underestimates of the objective function. The sequence of global minima of the underestimates converges to the global minimum of the objective function. It can also be seen from the perspective of branch-and-bound type methods, which iterate the steps of branching (partitioning the domain), bounding the objective function on the elements of the partition, and fathoming (eliminating those elements of the partition which cannot contain the global minimum). The key element of CAM is the construction of tight underestimates of the objective function and their efficient minimization in a structured optimization problem. CAM is based on the theory of abstract convexity [23], which provides the necessary tools for building accurate underestimates of various classes of functions. Such underestimates arise from a generalization of the following classical result: each convex function is the upper envelope of its affine minorants [21]. In abstract convex analysis, the requirement of linearity of the minorants is dropped, and abstract convex functions are represented as the upper envelopes of some simple minorants, or support functions, which are not necessarily affine. Depending on the choice of the support functions, one obtains different flavours of abstract convex analysis.


By using a subset of support functions, one obtains an approximation of an abstract convex function from below. Such a one-sided approximation, or underestimate, is very useful in optimization, as the global minimum of the underestimate provides a lower bound on the global minimum of the objective function. One can find the global minimum of the objective function as the limiting point of the sequence of global minima of the underestimates. This is the principle of the cutting angle method of global optimization [1,2,23]. The cutting angle method was first introduced for global minimization of increasing positive homogeneous (IPH) functions over the unit simplex [1,2,23]. Then it was extended to a broader class of Lipschitz programming problems [9,25]. In this Chapter, after providing the necessary theoretical background, we will describe versions of CAM for global minimization of IPH and Lipschitz functions over a polytope (in particular the unit simplex), and provide details of its algorithmic implementation.

Definitions

Notation

- n is the dimension of the optimization problem;
- $I = \{1, \ldots, n\}$;
- $x_i$ is the ith coordinate of a vector $x \in R^n$;
- $x^k \in R^n$ denotes the kth vector of some sequence $\{x^k\}_{k=1}^{K}$;
- $[l, x] = \sum_{i \in I} l_i x_i$ is the inner product of the vectors l and x;
- if $x, y \in R^n$ then $x \ge y \Leftrightarrow x_i \ge y_i$ for all $i \in I$;
- if $x, y \in R^n$ then $x \gg y \Leftrightarrow x_i > y_i$ for all $i \in I$;
- $R^n_+ := \{x = (x_1, \ldots, x_n) \in R^n : x_i \ge 0 \text{ for all } i \in I\}$ (nonnegative orthant);
- $R_{+\infty}$ denotes $(-\infty, +\infty]$;
- $e^m = (0, \ldots, 0, 1, 0, \ldots, 0)$ denotes the mth unit orth of the space $R^n$;
- $S = \{x \in R^n_+ : \sum_{i \in I} x_i = 1\}$ (unit simplex).

Abstract Convex Functions

Let $X \subseteq R^n$ be some set, and let H be a nonempty set of functions $h : X \to V \subseteq [-\infty, +\infty]$. We have the following definitions [23].

Definition 1 A function f is abstract convex with respect to the set of functions H (or H-convex) if there exists $U \subseteq H$ such that

$$f(x) = \sup\{h(x) : h \in U\} , \quad \forall x \in X .$$

Definition 2 The set U of H-minorants of f is called the support set of f with respect to the set of functions H:

$$\mathrm{supp}(f, H) = \{h \in H : h(x) \le f(x) \;\; \forall x \in X\} .$$

Definition 3 An H-subgradient of f at x is a function $h \in H$ such that

$$f(y) \ge h(y) - (h(x) - f(x)) , \quad \forall y \in X .$$

The set of all H-subgradients of f at x is called the H-subdifferential:

$$\partial_H f(x) = \{h \in H : f(y) \ge h(y) - (h(x) - f(x)) \;\; \forall y \in X\} .$$

Definition 4 The set $\hat{\partial}_H f(x)$ at x is defined as

$$\hat{\partial}_H f(x) = \{h \in \mathrm{supp}(f, H) : h(x) = f(x)\} .$$

Proposition 1 [23], p. 10. If the set H is closed under vertical shifts, i.e., $h \in H$ and $c \in R$ imply $h - c \in H$, then $\hat{\partial}_H f(x) = \partial_H f(x)$.

When the set of support functions H consists of all affine functions, we obtain classical convexity. Next we examine two other examples of sets of support functions H.

IPH Functions

Recall that a function f defined on $R^n_+$ is increasing if $x \ge y$ implies $f(x) \ge f(y)$.

Definition 5 A function $f : R^n_+ \to R$ is called IPH (increasing positively homogeneous of degree one) if

$$\forall x, y \in R^n_+ : \; x \ge y \Rightarrow f(x) \ge f(y) ; \qquad \forall x \in R^n_+ , \; \forall \lambda > 0 : \; f(\lambda x) = \lambda f(x) .$$

Let the set $H_1$ be the set of min-type functions:

$$H_1 = \{h : h(x) = \min_{i \in I} a_i x_i , \; a \in R^n_+ , \; x \in R^n_+\} .$$

Proposition 2 [23] A function $f : R^n_+ \to R_{+\infty}$ is abstract convex with respect to $H_1$ if and only if f is IPH.


Example 1 The following functions are IPH:
1) $f(x) = \sum_{i \in I} a_i x_i$ with $a_i \ge 0$;
2) $p_k(x) = \left( \sum_{i \in I} x_i^k \right)^{1/k}$ $(k > 0)$;
3) $f(x) = \sqrt{[Ax, x]}$, where A is a matrix with nonnegative entries;
4) $f(x) = \prod_{j \in J} x_j^{t_j}$, where $J \subseteq I$, $t_j > 0$, $\sum_{j \in J} t_j = 1$.

It is easy to check that
- the sum of two IPH functions is also an IPH function;
- if f is IPH, then the function $\lambda f$ is IPH for all $\lambda > 0$;
- if T is an arbitrary index set and $(f_t)_{t \in T}$ is a family of IPH functions, then the function $f_{\inf}(x) = \inf_{t \in T} f_t(x)$ is IPH;
- if $(f_t)_{t \in T}$ is the same family and there exists a point $y \gg 0$ such that $\sup_{t \in T} f_t(y) < +\infty$, then the function $f_{\sup}(x) = \sup_{t \in T} f_t(x)$ is finite and IPH.

These properties allow us to give two more examples of IPH functions.

Example 2 The following maxmin functions are IPH:
1) $f(x) = \max_{k \in K} \min_{j \in J} \sum_{i \in I} a_i^{jk} x_i$, where $a_i^{jk} \ge 0$, $k \in K$, $j \in J$, $i \in I$. Here J and K are finite sets of indices;
2) $f(x) = \max_{k \in K} \min_{j \in J_k} \sum_{i \in I} a_i^{j} x_i$,  (1)
where $a_i^{j} \ge 0$, $j \in J_k$, $k \in K$. Here $J_k$ and K are finite sets of indices.

Note that an arbitrary piecewise linear function f generated by a collection of linear functions $f_1, \ldots, f_m$ can be represented in the form (1) (see [5]); hence an arbitrary piecewise linear function generated by nonnegative vectors is IPH.

Let $l \in R^n_+$, $l \ne 0$, and $I(l) = \{i \in I : l_i > 0\}$. We consider the function $x \mapsto \langle l, x \rangle$ defined by the formula $l(x) = \langle l, x \rangle$, where the coupling function $\langle \cdot, \cdot \rangle$ is defined as

$$\langle l, x \rangle = \min_{i \in I(l)} l_i x_i . \qquad (2)$$

Here $I(l) = \{i \in \{1, \ldots, n\} : l_i > 0\}$. This function is called a min-type function generated by the vector l. We shall denote this function by the same symbol l(x). Clearly a min-type function is IPH. It follows from Proposition 2 that:
- A finite function f defined on $R^n_+$ is IPH if and only if

$$f(x) = \max\{\langle l, x \rangle : l \in H_1 , \; l \le f\} . \qquad (3)$$

- Let $x^0 \in R^n_+$ be a vector such that $f(x^0) > 0$, and let $l = f(x^0)/x^0$. Then $\langle l, x \rangle \le f(x)$ for all $x \in R^n_+$ and $\langle l, x^0 \rangle = f(x^0)$. The vector $f(x^0)/x^0$ is called the support vector of the function f at the point $x^0$.

l. We shall denote this function by the same symbol l(x). Clearly a min-type function is IPH. It follows from Proposition 2 that: n is IPH if and only  A finite function f defined on RC if

(2)

Here I(l) D fi 2 f1; : : : ; ng j l i > 0g. This function is called a min-type function generated by the vector

H2 D fh : h(x) D a  Cjjx  bjj; x; b 2 Rn ; a 2 R; C 2 RC g : Proposition 3 [23] A function f : Rn ! RC1 is H 2 convex if and only if f is a lower semicontinuous function. The H 2 -subdifferential of f is not empty if f is Lipschitz. There is an interesting relation between IPH functions and Lipschitz functions, which allows one to formulate the problem of minimization of Lipschitz function over the unit simplex as the problem of minimization of IPH functions restricted to the unit simplex. Theorem 1 (see [23,25]). Let f : S ! R be a Lipschitz function and let MD

sup x;y2S;x¤y

j f (x)  f (y)j kx  yk1

(4)

1 The norm jj  jj can be replaced by any metric, or, more generally, any distance function based on Minkowski gauge. For example, a polyhedral distance d P (x; y) D maxf[(x  y); h i ] j1  i  mg, where h i 2 Rn ; i D 1; : : : ; m is the set of vectors that T define a finite polyhedron P D m iD1 fx j [x; h i ]  1g.

Global Optimization: Cutting Angle Method

be the least Lipschitz constant of f in k  k1 -norm, where P kxk1 D i2I jx i j. Assume that

Step 0. (Initialisation) 0.1 Set K = 1. 0.2

min f (x)  2M : x2S

G

Choose an arbitrary initial point x 1 2 D.

Step 1. (Calculate H-subdifferential)

Then there exists an IPH function g : g(x) D f (x) for all x 2 S.

n RC

! R such that

k=1;:::;K

Methods We consider the problem of global minimization of an H-convex function f on a compact convex set D  X, minimize f (x) subject to x 2 D :

(5)

We will deal with the two mentioned cases of f being H 1 -convex (IPH) and H 2 -convex (Lipschitz). Generalized Cutting Plane Method A consequence of Propositions 2 and 3 is that we can approximate H-convex functions from below using a finite subset of functions from supp( f ; H). Suppose we know a number of values of the function f at the points x k ; k D 1; : : : ; K. Then the pointwise maximum of the support functions h k 2 @H f (x K ), K

k

H (x) D max h (x) kD1;:::;K

1.1 Calculate h K 2 @H f (x K ). 1.2 Define H K (x) := max h k (x), for all x 2 D. Step 2. (Minimize H K ) 2.1 Solve the Problem Minimize H K (x) subject to Let x  be its solution. 2.2 Set K := K + 1; x K := x  .

x 2 D:

Step 3. (Stopping criterion) 3.1 If K < K max and f best  H K (x  ) >  go to Step 1. Global Optimization: Cutting Angle Method, Algorithm 1 Generalized Cutting Plane Algorithm

unit simplex S, that is we shall study the following optimization problem: minimize f (x) subject to x 2 S

(7)

(6)

is a lower approximation, or underestimate of f . We have the following generalization of the classical cutting plane method by Kelley [16]. K max is the limit on the number of iterations of the algorithm. The problem at Step 2.1 is called the auxiliary, or relaxed, problem. Its efficient solution is the key to numerical performance of the algorithm. For convex objective functions, H K is piecewise affine, and the solution to the relaxed problem is done by linear programming. However, when we consider other abstract convex functions, like IPH or Lipschitz, the relaxed problem is not linear, but it also has a special structure that leads to its efficient solution. Global Minimization of IPH Functions over Unit Simplex In this section we present an algorithm for the search for a global minimizer of an IPH function f over the

n where f is an IPH function defined on RC . Note that n an IPH function is nonnegative on RC , since f (x)  f (0) D 0. We assume that f (x) > 0 for all x 2 S. It follows from positiveness of f that I(l) D I(x) for all x 2 S and l(x) D f (x)/x. Since I(e m ) D fmg, then the vector l D f (e m )/e m can be represented in the form l D f (e m )e m and

h f (e m )e m ; xi D f (e m )x m : Remark 1 Note that H K (x) :D max min l ik x i kD1;:::;K i2I(l k )  max H K1 (x); min l iK x i , which simplifies solution i2I(l K )

to the auxiliary problem at Step 2.1. This Algorithm reduces the problem of global minimization (7) to the sequence of auxiliary problems. It provides lower and upper estimates of the global minimum f * for the problem (7). Indeed, let K D min x2S H K (x) be the value of the auxiliary problem. It

1307

1308

G

Global Optimization: Cutting Angle Method

follows from (3) that

change of variables. Solution to the constrained auxiliary problem in Step 2.1 of the algorithm was investigated in [8].

hl k ; xi min l ik x i  f (x) for all x 2 S; i2I(l k )

k D 1; : : : ; K : K

Hence H (x)  f (x) for all x 2 S and K min x2S H K (x)  minx2S f (x): Thus K is a lower estimate of the global minimum f * . Consider the number K D min kD1;:::;K f (x k ) D : f best . Clearly K  f  , so K is an upper estimate of f * . It is shown in [23] that K is an increasing sequence and K  K ! 0 as K ! C1. Thus we have a stopping criterion, which enables us to obtain an approximate solution with an arbitrary given tolerance. Global Minimization of Lipschitz Functions Method Based on IPH Functions By using Theorem 1, global minimization of Lipschitz function over the simplex S can be reduced to the global minimization of a certain IPH function over S. Let f : S ! R be a Lipschitz function and let c  2M  min f (x) ;

(8)

x2S

where M is defined by (4). Let f 1 (x) D f (x) C c. It follows from Theorem 1 that the function f 1 can be extended to an IPH function g. The problem minimize g(x)

subject to

x2S

(9)

is clearly equivalent to the problem minimize f1 (x) subject to

x2S:

(10)

Thus we apply the cutting angle method to solve problem (10). Clearly functions f and f 1 have the same minimizers on the simplex S. If the constant c in (8) is known, CAM is applied for the minimization of a Lipschitz function f over S with no modification. If c is unknown, we can assume that c is a sufficiently large number, however numerical experiments show that CAM is rather sensitive to the choice of c, in particular, when c is very large, the method converges very slowly. In order to estimate c we need to know an upper bound on the least Lipschitz constant M and a lower estimate of the global minimum of f . If the feasible domain is not the unit simplex S but a polytope, it can be embedded into S with a simple

Direct Method Consider H 2 -convex functions, which, by Proposition 3 include all Lipschitz functions. Let dP be a polyhedral distance function. As a consequence of H 2 -convexity, we can approximate Lipschitz functions from below using underestimates of the form H K (x) D max h k (x) kD1;:::;K

D max ( f (x k )  Cd P (x; x k )) ;

(11)

kD1;:::;K

where C  M, and M is the Lipschitz constant of f with respect to the distance dP . Then we apply the Algorithm 1 to function f in the feasible domain D. The auxiliary problem as Step 2.1 becomes minimize max ( f (x k )  Cd P (x; x k )) kD1;:::;K

subject to x 2 D : The same considerations about the convergence of the algorithm as those for Algorithm 2 are applied. Note Step 0. (Initialisation) 0.1 Take points x m = e m , m = 1; : : : ; n. Set K = n. 0.2

Calculate l k = f (x k )/x k , k = 1; : : : ; K:

Step 1. (Calculate H-subdifferential) 1.1 Define H K (x) := max min l ik x i , for all x 2 S.

k=1;:::;K i2I(l k )

Step 2. (Minimize H K ) 2.1 Solve the Problem Minimize H K (x) Let x  be its solution. 2.2 2.3

subject to

x 2 S:

Set K := K + 1; x K := x  . Compute l K = f (x K )/x K

Step 3. (Stopping criterion) 3.1 If K < K max and f best  H K (x  ) >  go to Step 1. Global Optimization: Cutting Angle Method, Algorithm 2 Cutting Angle Algorithm for IPH functions

G

Global Optimization: Cutting Angle Method

that in the univariate case the underestimate H K in (11) is exactly the same as the saw-tooth underestimate in Piyavski-Shubert method [20,26] if dP is symmetric. For minimization of Lipschitz functions, an estimate of the Lipschitz constant is required in both cases, when transforming f to an IPH function, or using Algorithm 1 directly. The crucial part in both methods is the efficient solution to the auxiliary problem in Step 2.1. The next section presents a very fast combinatorial algorithm for enumeration of all local minimizers of functions H K .

Proposition 5 (see, for example, [13]). Let x  0. Then (h k )0 (x; u) D min l ik u i ; i2Q k (x)

(H K )0 (x; u) D max (h k )0 (x; u) D max min l ik u i : k2R(x)

k2R(x) i2Q k (x)

Let x 2 S. The cone K(x; S) D fu 2 Rn : 9˛0 > 0 such that x C ˛u 2 S 8˛ 2 (0; ˛0 )g

The Auxiliary Problem K

The Step 2.1 (find the global minimum of H (x)) is the most difficult part of the cutting angle method. This problem is stated in the following form: minimize H K (x)

subject to x 2 S

(12)

where H K (x) D max min l ik x i D max h k (x) ; kK i2I(l k )

kK

(13)

K  n, l k D f (x k )/x k are given vectors, k D 1; : : : ; K. Note that x k D e k ; k D 1; : : : ; n: Proposition 4 [2,3] Let K > n, l k D l kk e k ; k D 1; : : : ; n; l k > 0; jI(l k )j  2; k D n C 1; : : : ; K. Then each local minimizer of the function H K (x) defined by (13) over the simplex S is a strictly positive vector. Corollary 1 Let {xk } be a sequence generated by Algorithm 2. Then x k  0 for all k > n. Let ri(S) D fx 2 S : x i > 0 for all i 2 Ig be the relative interior of the simplex S. It follows from Proposition 4 and Corollary 1 that we can solve the problem (12) by sorting the local minima of the function H K over the set ri (S). We now describe some properties of local minima of H K on ri (S), which will allow us to identify these minima explicitly. It is well known that functions hk and H K are directionally differentiable. Let f 0 (x; u) denote directional derivative of the function f at the point x in the direction u. Also let R(x) D fk : h k (x) D H K (x)g ; Q k (x) D fi 2 I(l k ) : l ik x i D h k (x)g :

(14)

is called the tangent cone at the point x with respect to the simplex S. The following necessary conditions for a local minimum hold (see, for example, [13]). Suppose P x 2 ri(S). Then K(x; S) D fu : i2I u i D 0g: Proposition 6 Let x 2 S be a local minimizer of the function H K over the set S. Then (H K )0 (x; u)  0 for all u 2 K(x; S). Applying Propositions 5 and 6 we obtain the following result. Proposition 7 [2,3] Let x  0 be a local minimizer of the function H K over the set ri (S), such that H K (x) > 0. Then there exists an ordered subset fl k 1 ; l k 2 ; : : : ; l k n g of the set fl 1 ; : : : ; l K g such that 1) ! d 1 d ;:::; k where d D P xD 1 ; (15) k1 n l1 ln i2I k i li

2) max min

kK i2I(l k )

l ik l ik i

D 1;

(16)

3) Either k i D fig for all i 2 I or there exists m 2 I such that k m  n C 1; if k m  n then k m D m; 4) if k m  n C 1 and l ik m ¤ 0 then l ik m > l ik i for all i 2 I; i ¤ m : Solution of the Auxiliary Problem It follows from Propositions 4 and 7 that we can find a global minimizer of the function H K defined by (13) over the unit simplex using the following procedure:

1309

1310

G

Global Optimization: Cutting Angle Method

 sort all subsets fl k 1 ; : : : ; l k n g of the given set l 1 ; : : : ; l K vectors, such that (16) holds and l ik m > l ik i ; i ¤ m if k m  n C 1; i 2 I(l k m ) and k m D m if k m  n;  for each such subset, find the vector x defined by (15);  choose the vector with the least value of the function H K among all the vectors described above. Thus, the search for a global minimizer is reduced to sorting some subsets, containing n elements of the given set fl 1 ; : : : l K g with K > n. Fortunately, Proposition 7 allows one to substantially diminish the number of sorted subsets. The subsets L D fl k 1 ; : : : ; l k n g can be visualized with the help of an n  n matrix whose rows are given by the participating support vectors 1 0 k l1 1 l2k 1 : : : l nk 1 C B k2 l2k 2 : : : l nk 2 C B l1 C B (17) LDB : :: : : :: C : : : A : @ :: l1k n

l2k n

:::

l nk n

The conditions 2) and 4) of Proposition 7 are then easily interpreted as follows. Condition 4) implies that the diagonal elements of matrix L are smaller than elements in their respective columns, and condition 2) implies that the diagonal of L is not dominated by any other support vector l k 62 L (zero entries of matrix L are excluded from compaisons). Thus we obtain a combinatorial problem of enumerating all combinations L that satisfy conditions 2) and 4). However it is impractical to enumerate all such combinations directly for large K. Fortunately there is no need to do so. It was shown in [6,7,8] that the required combinations can be put into a tree structure. The leaves of the tree correspond to the local minimizers of H K , whereas the intermediate nodes correspond to the minimizers of H n ; H nC1 ; : : : ; H K1 .The incremental algorithm based on the tree structure makes computations very efficient numerically (as processing of queries using trees requires logarithmic time of the number of nodes). It is possible to enumerate several billions of local minimizers of H K (e. g., when n D 5 and K D 100; 000) in a matter of seconds on a standard Pentium IV based workstation. The direct method of minimization of Lipschitz functions involves solution to a different auxiliary prob-

lem, that of minimizing H K given in (11), with dP being a simplicial distance function. It turns out that a very similar method of enumeration of local minimizers of H K , by putting them in a tree structure, also works [9]. There is a counterpart of Proposition 7, with the difference that the support vectors are defined by l ik D

f (x k )  x ik ; C

(18)

and the local minima and minimizers of H K are identified through d D H K (x  ) D x i D

C(Trace(L) C 1) ; n

d  l ik i ; i D 1; : : : ; n; C

(19)

where constant C is chosen greater or equal to the Lipschitz constant M of f in the simplicial distance dP . Thus both versions of CAM, for IPH and for Lipschitz functions, share the same algorithm, but with different definitions of support vectors. The actual algorithms for enumeration of local minima of H K and maintaining the tree structure, as well as treatment of linear constraints, are presented in [7,8,9]. The algorithms involve a crucial fathoming step, and can be seen as branch-and-bound type algorithms [9,12,23]. Conclusions Cutting angle methods are versions of the generalized cutting plane method for IPH, Lipschitz and other classes of abstract convex functions. The main idea of this deterministic method is to replace the original problem of minimizing f with a sequence of relaxed problems with special structure. The objective functions in the relaxed problems provides tight lower estimates of f , and the sequence of their solutions converge to the global minimum of f . Efficient solution to the relaxed problem makes CAM very fast on a class of global optimization problems. Optimization is not the only field such underestimates are applied. Versions of CAM are also used for non-uniform random variate generation [10] and multivariate data interpolation [11]. Both versions of CAM described here have been successfully applied to a number of real life problems,

Global Optimization: Envelope Representation

including very difficult molecular geometry prediction and protein folding problems [12,17]. A software library GANSO for global and non-smooth optimization, which includes the cutting angle method, is available from http://www.ganso.com.au.

References 1. Andramonov MY, Rubinov AM, Glover BM (1999) Cutting angle method in global optimization. Appl Math Lett 12:95–100 2. Bagirov AM, Rubinov AM (2000) Global minimization of increasing positively homogeneous functions over the unit simplex. Ann Oper Res 98:171–187 3. Bagirov AM, Rubinov AM (2001) Modified versions of the cutting angle method. In: Hadjisavvas N, Pardalos PM (eds) Advances in Convex Analysis and Global Optimization. Kluwer, Dordrecht, pp 245–268 4. Bagirov AM, Rubinov AM (2003) The cutting angle method and a local search. J Global Optim 27:193–213 5. Bartels SG, Kuntz L, Sholtes S (1995) Continuous selections of linear functions and nonsmooth critical point theory. Nonlinear Anal TMA 24:385–407 6. Batten LM, Beliakov G (2002) Fast algorithm for the cutting angle method of global optimization. J Global Optim 24:149–161 7. Beliakov G (2003) Geometry and combinatorics of the cutting angle method. Optimization 52:379–394 8. Beliakov G (2004) The Cutting Angle Method – a tool for constrained global optimization. Optim Methods Softw 19:137–151 9. Beliakov G (2005) A review of applications of the Cutting Angle methods. In: Rubinov A, Jeyakumar V (eds) Continuous Optimization. Springer, New York, pp 209–248 10. Beliakov G (2005) Universal nonuniform random vector generator based on acceptance-rejection. ACM Trans Modelling Comp Simulation 15:205–232 11. Beliakov G (2006) Interpolation of Lipschitz functions. J Comp Appl Math 196:20–44 12. Beliakov G, Lim KF (2007) Challenges of continuous global optimization in molecular structure prediciton. Eur J Oper Res 181(3):1198–1213 13. Demyanov VF, Rubinov AM (1995) Constructive Nonsmooth Analysis. Peter Lang, Frankfurt am Main 14. Horst R, Pardalos PM, Thoai NV (1995) Introduction to Global Optimization. Kluwer, Dordrecht 15. Horst R, Tuy H (1996) Global Optimization: Deterministic Approaches, 3rd edn. Springer, Berlin 16. Kelley JE (1960) The cutting-plane method for solving convex programs. J SIAM 8:703–712 17. Lim KF, Beliakov G, Batten LM (2003) Predicting molecular structures: Application of the cutting angle method. Phys Chem Chem Phys 5:3884–3890

G

18. Pallaschke D, Rolewicz S (1997) Foundations of Mathematical Optimization (Convex Analysis without Linearity). Kluwer, Dordrecht 19. Pinter JD (1996) Global Optimization in Action. Continuous and Lipschitz Optimization: Algorithms, Implementation and Applications. Kluwer, Dordrecht 20. Piyavskii SA (1972) An algorithm for finding the absolute extremum of a function. USSR Comp Math Math Phys 12:57–67 21. Rockafellar RT (1970) Convex Analysis. Princeton University Press, Princeton 22. Rolewicz S (1999) Convex analysis without linearity. Control Cybernetics 23:247–256 23. Rubinov AM (2000) Abstract Convexity and Global Optimization. Kluwer, Dordrecht 24. Rubinov AM, Andramonov MY (1999) Minimizing increasing star-shaped functions based on abstract convexity. J Global Optim 15:19–39 25. Rubinov AM, Andramonov MY (1999) Lipschitz programming via increasing convex-along-rays fuinctions. Optim Methods Softw 10:763–781 26. Shubert BO (1972) A sequential method seeking the global maximum of a function. SIAM J Numerical Anal 9:379–388 27. Singer I (1997) Abstract Convex Analysis. Wiley-Interscience Publication, New York

Global Optimization: Envelope Representation A. M. RUBINOV School Inform. Techn. and Math. Sci., University Ballarat, Ballarat, Australia MSC2000: 90C26 Article Outline Keywords See also References Keywords Abstract convexity; Envelope; -subdifferential; Support set; Supremal generator; Min-type function; Cutting angle method; Global optimization; Lipschitz programming Some classical methods of finite-dimensional convex minimization can be extended for quite broad classes of multi-extremal optimization problems. One successful

1311

1312

G

Global Optimization: Envelope Representation

generalization is based on the so-called envelope representation of the objective function. We begin with the simplest case of a convex differentiable function f in order to introduce this approach. For such a function the tangent hyperplane T = {xr f (y)(x  y)+ f (y) = 0} is simultaneously a support hyperplane. That is, the inequality f (x)  f (y)+ r f (y)(x  y) holds for each x. This inequality can be expressed also in the following form: the affine function h y (x) D r f (y)(x  y) C f (y)

(1)

is a support function for the function f . Thus the function f can be represented as the pointwise maximum of the functions of the form hy : f (x) D max h y (x): y

One of the main results of convex analysis asserts that an arbitrary lower semicontinuous convex function f (perhaps admitting the value +1) is the upper envelope (UE) of the set of all its affine minorants:  h is an affine function; f (x) D sup h(x) : : h f (The inequality h  f stands for h(x)  f (x) for all x.) The supremum above is attained if and only if the subdifferential of f at the point x is nonempty. Since affine functions are defined by means of linear functions, one can say that convexity is‘linearity + envelope representation’. As it turns out the contribution of‘envelope representation’ to the convexity is fairly large. This observation stimulated the development of the rich theory of‘convexity without linearity’. (See [12,14,19] and references therein.) In particular, functions which can be represented as UE of subsets of a set of sufficiently simple functions are studied in this theory. We need the following definition. Let H be a set of functions. A function f is called abstract convex (AC) with respect to H (or H-convex) if f is the UE of a subset from H, that is f (x) D sup fh(x) : h 2 H; h  f g :

(2)

The set H is called the set of elementary functions. For applications we need sufficiently simple elementary functions.

Many results from convex analysis related to various kinds of convex duality can be extended to abstract convex analysis Abstract convexity sheds some new lights to the classical Fenchel–Moreau duality and the so-called level sets conjugation (see [19]). The set s(f , H) = {h 2 H:h  f }, presented in (2), is called the support set of f . The mapping f 7! s(f , H) is called the Minkowski duality ([9]). The support set accumulates a global information of a function f in terms of the set of elementary functions H and it can be useful in the study of global optimization problems involving the function f. One of the main notions of convex analysis, which plays the key role for applications to optimization, is the subdifferential. There are two equivalent definitions of the subdifferential of a convex function. The first of them is based on the global behavior of the function. A linear function l is called a subgradient (i. e. a member of the subdifferential) of the function f at a point y if the affine function h(x) = l(x) (l(y) f (y)) is a support function with respect to f , that is h(x)  f (x) for all x. The second definition has a local nature and is connected with local approximation of the function: the subdifferential is a closed convex set of linear functions such that the directional derivative u 7! f 0x (u) at the point x is presented as the UE of this set. For a differentiable convex function these two definitions reflect respectively support and tangent sides of the gradient. The various generalizations of the second definition have led to development of the rich theory of nonsmooth analysis. The natural field for generalizations of the first definition is AC. A function h 2 H is called the subgradient (or Hsubgradient) of an H-convex function f at a point y if f (x)  h(x) (h(y) f (y)) for all x. The set @H f (y) of all subgradients of f at y is referred to as the subdifferential of the function f at the point y. Let H 0 be the closure of the set H under vertical shifts, that is  H 0 D h0 :

h 0 (x) D h(x)  c; h 2 H; c 2 R

:

Clearly h 2 @H 0 f (y) if and only if f (y) = max{h0 (y):h0  f , h0 2 H 0 }. Thus if H is already closed under shifts then @H f (y) D fh 2 s( f ; H) : h(y) D f (y)g :

(3)

Global Optimization: Envelope Representation

Thus the subdifferential is not empty if and only if the supremum in (2) is attained. Sometimes (3) is used for the definition of the subdifferential for an arbitrary set of elementary functions H (not necessary closed under shifts). Many methods of convex minimization are based on the local properties of the convex subdifferential (more precisely, on the directional derivative). However there are some methods which exploit only the support property of the subdifferential. The conceptual schemes of these methods can be easily extended for AC functions. One of these methods is presented below. Consider the following problem f (x) ! min;

x 2 X;

(4)

where X is a compact set. Assume that f is AC with respect to a set of elementary functions H. We consider the following algorithm based on the generalized cutting plane idea, which is a nonlinear generalization of the classical cutting plane method.

0 1 2

Let k := 0: Choose an arbitrary initial point x0 2 X; Calculate a subgradient in the form (3) that is an element h k 2 s( f ; H) such that h k (x k ) = f (x k ); Find a global optimum y of the problem max h i (x) ! min; x 2 X.

0ik

3

(5)

Let x k+1 = y ; k := k + 1. Go to step 1:

Conceptual scheme (generalized cutting plane method)

Convergence of the sequence constructed by this procedure to a global minimizer has been proved under very mild assumptions by D. Pallaschke and S. Rolewicz [12]. Upper and lower estimates of the optimal value of the problem (4) can be computed, which lead to an efficient stopping criterion (compare with [2]). There are two major difficulties in the numerical implementation of the Algorithm. The first is the calculation of a subgradient. In general it is very difficult to find it numerically, however it is possible in several important particular cases. The second difficulty is the solution of the auxiliary problem (5). This is a linear

G

programming problem in the case of the set H of affine functions, but for sets of more complicated functions the problem (5) is essentially of a combinatorial nature or a problem of convex maximization. The simplest example of this approach is Lipschitz programming. If f is a Lipschitz function we can, for example, take as H the set of functions h of the form h(x) =  a kx  xo k  c, where a is a positive and c is a real number, xo 2 X. In order to find an H-subgradient we should take a > L where L is the Lipschitz constant of the function f ; thus we need to know an upper estimate of this constant; this is a special piece of global information about this function. With such H the problem (4) can be reduced to a sequence of special problems of concave minimization. Some known algorithms of Lipschitz programming fall within the described approach [11,21]. For fairly large classes of functions defined on the n of all n-vectors with nonnegative coordinates cone RC it is possible to take as H a set of functions which includes as its main part a min-type function of the form l(x) D min l i x i ; i2T (l )

n x 2 RC ;

(6)

with T (l) D fi : l i > 0g : We define the infimum over empty set to be zero. If l is a strictly positive vector and c a positive number then the set {x: mini li xi  c} is a complement to a ‘right angle’. Exploiting min-type functions instead of linear functions allows us to separate a point from the (not necessary convex) set by the complements of ‘right angles’. Various classes of elementary functions arise, based n . on the set L of all functions of the form (6) with l 2 RC In particular, L itself and sets H1 D fh : h(x) D l(x)  c; l 2 L; c 2 Rg ; H2 D fh : h(x) D min(l(x); c); l 2 L; c 2 Rg are convenient for applications. The classes of AC with respect to H 1 and H 2 functions are quite large [14]. The first of them consists of all increasing (with respect to the usual order relation) functions f such that the function of a real variable t ! f (tx), t 2 [0, +1), is conn . This class contains all homogeneous vex for all x 2 RC functions of degree ı  1, their sums and UE of sets of

1313

1314

G

Global Optimization: Envelope Representation

such functions. In particular it contains all polynomials with nonnegative coefficients. The second class consists of all increasing functions f such that f (tx)  tf (x) for n and t 2 [0, 1]. Concave increasing functions all x 2 RC f with f (0)  0 and UE of sets of such functions belong to this class. Also, positively homogeneous functions of degree ı  1, their sums and UE of sets of such functions belong to it. For minimizing AC functions with respect to H i (i = 1, 2) we need again to calculate the H i -subgradients in the form (5) and then to reduce the problem (4) to a sequence of auxiliary problems. A version of the generalized cutting plane method in such a case is called ‘cutting angle method’ ([2,14]). A.M. Rubinov et al. ([1,14,16,17]) have demonstrated that for AC functions generated by various classes of min-type functions it is possible to find subgradients very easily. In particular, only the number f (x) (resp. f 0 (x, x)) is required for the calculation of an element of @H 2 f (x) (resp. @H 1 f (x)), without any additional information about a global behavior of the function f . Thus the main problem with implementation of the cutting angle method is to solve the auxiliary subproblem, which is a problem of the mixed integer programming of a special kind in this case. n . It Let L be the set of all functions (6) with l 2 RC n can be shown ([14,16]) that a function f defined on RC is L-convex if and only if f is IPH (increasing and positively homogeneous of degree one).IPH functions can serve for the miminization of a Lipschitz function over P n : i xi = 1}. First ([14,15]), the unit simplex Sn = {x 2 RC for each Lipschitz function g defined on Sn there exists a constant c>0 such that the function e g(x) D g(x) C c n . Seccan be extended to an IPH function defined on RC ond, the auxiliary problem (5) for problem (4) with an IPH function f and X = Sn , has a special structure and can be efficiently solved for fairly large n (see [14, Chap. 9] and references therein). Thus, the minimization of a Lipschitz function over the unit simplex can be efficiently accomplished by the cutting angle method. Numerical experiments demonstrate that a combination of the cutting angle method with a local search is very efficient, since the cutting angle method allows one to leave a local minimizer fairly quickly. Envelope representation is useful also in the study of some theoretical problems arising in optimization. Many interesting examples of such applications can be

found in the books [12,14,19]. In particular, a general scheme of penalty and augmented Lagrangian based on the notion of the subdifferential is presented in [12]. I. Singer [19] demonstrated that Fenchel–Moreau duality leads to a unified theory of duality results for very general optimization problems. It can be shown [18] that AC forms the natural framework for the study of solvability theorems (generalizations of Farkas’ lemma; cf.  Farkas lemma;  Farkas lemma: Generalizations). In contrast with numerical methods based on applications of subdifferentials, the study of solvability theorems is based on application of support sets. AC serves also for the study of some problems of quasiconvex minimization (see for example [10,13,20]). A subset H of a set X of functions is called the supremal generator ([9]) of X if each function from X is AC with respect to H. There exist very small supremal generators of very large classes of functions. The following two examples of such supremal generators are useful for nonsmooth optimization. 1) Recall that a function f is called positively homogeneous (PH) of degree k if p(x) = k p(x) for  > 0. It can be shown ([14]) that the set of all functions of the form h(x) D a

n X

! 12 x 2i

C

iD1

n X

li xi ;

(7)

iD1

where a  0, l1 , . . . , ln are real numbers is a supremal generator of the set PH1 of all lower semicontinuous PH functions of degree one defined on ndimensional space Rn . Since each function (7) is concave it follows that the set of all concave PH functions of degree one is a supremal generator of PH1 . 2) It can be shown ([3,4,9,14]) that the set H of all quadratic functions h of the form h(x) D a

n X iD1

x 2i C

n X

l i x i C c;

(8)

iD1

where a  0, l1 , . . . , ln , c are real numbers is a supremal generator of the set of all lower semicontinuous functions f :Rn ! R [ {+1} minored by H in the following sense: there exists h 2 H such that f  h. Supremal generators are a convenient tool in the study of nonsmooth optimization problems. A local approximation of the first (resp. second) order of a nonsmooth

Global Optimization: Envelope Representation

function is fulfilled very often by various kinds of generalized derivatives of the first (resp. second) order, which are PH functions of the first (resp. second) degree. Practical applications of these derivatives to optimization are based on their representation in terms of linear (resp. quadratic) functions. Linearization of lower semicontinuous PH functions of the first degree can be accomplished by supremal generators of the space PH1 , consisting of concave functions. Each finite g 2 PH1 can be n concave function o presented as min l(x) : l 2 @g(0) where @g(0) is the superdifferential (in the sense of convex analysis) of this function g at the origin. Hence each function g 2 PH1 can be linearized by the operation sup min. The second order approximation of a nonsmooth function f at a point x can be accomplished by the subjet, that is the set @2; f (x) 8 < D (r g(x); r 2 g(x)) : :

9 f  g has a = local minimum x : ; with g 2 C 2 (Rn )

(Here r g(x) (resp. r 2 g(x)) stands for the gradient (resp. Hessian) of a function g at a point x.) Let H be the set of all functions of the form (8). It can be shown (see [5,6]) that the subjet @2,  f (x) is nonempty if and only if the H-subdifferential @H f (x) is not empty. AC with respect to H can also serve for supremal representation of the second order generalized derivatives of nonsmooth functions in terms of quadratic functions (see[5]).

See also  Dini and Hadamard Derivatives in Optimization  Nondifferentiable Optimization  Nondifferentiable Optimization: Cutting Plane Methods  Nondifferentiable Optimization: Minimax Problems  Nondifferentiable Optimization: Newton Method  Nondifferentiable Optimization: Parametric Programming  Nondifferentiable Optimization: Relaxation Methods  Nondifferentiable Optimization: Subgradient Optimization Methods

G

References 1. Abasov TM, Rubinov AM (1994) On the class of H-convex functions. Russian Acad Sci Dokl Math 48:95–97 2. Andramonov MYu, Rubinov AM, Glover BM (1999) Cutting angle methods in global optimization. Applied Math Lett 12:95–100 3. Balder EJ (1977) An extension of duality-stability relations to nonconvex optimization problems. SIAM J Control Optim 15:329–343 4. Dolecki S, Kurcyusz S (1978) On ˚-convexity in extremal problems. SIAM J Control Optim 16:277–300 5. Eberhard A, Nyblom M (1998) Jets, generalized convexity, proximal normality and differences of functions. Nonlinear Anal (TMA) 34:319–360 6. Eberhard A, Nyblom N, Ralph D (1998) Applying generalized convexity notions to jets. In: Croizeix J-P, MartinezLegaz J-E, Volle M (eds) Generalized Convexity, Generalized Monotonicity. Kluwer, Dordrecht, pp 111–158 7. Horst R, Pardalos PM (eds) (1996) Handbook of global optimization. Kluwer, Dordrecht 8. Kelley J (1960) The cutting plane method for solving convex programs. SIAM J 8:703–712 9. Kutateladze SS, Rubinov AM (1972) Minkowski duality and its applications. Russian Math Surveys 27:137–191 10. Martinez-Legaz J-E (1988) Quasiconvex duality theory by generalized conjugation methods. Optim 19:603–652 11. Mladineo RH (1986) An algorithm for finding the global maximum of a multimodal, multivariate function. Math Program 34:188–200 12. Pallaschke D, Rolewicz S (1997) Foundations of mathematical optimization (convex analysis without linearity). Kluwer, Dordrecht 13. Penot JP, Volle M (1990) On quasiconvex duality. Math Oper Res 15:597–625 14. Rubinov AM (2000) Abstract convexity and global optimization. Kluwer, Dordrecht 15. Rubinov AM, Andramonov MYu (1999) Lipschitz programming via increasing convex-along-rays functions. Optim Methods Softw 10:763–781 16. Rubinov AM, Andramonov MYu (1999) Minimizing increasing star-shaped functions based on abstract convexity. J Global Optim 15:19–39 17. Rubinov AM, Glover BM (1999) Increasing convex-alongrays functions with applications to global optimization. J Optim Th Appl 102(3) 18. Rubinov AM, Glover BM, Jeyakumar V (1995) A general approach to dual characterization of solvability of inequality systems with applications. J Convex Anal 2:309–344 19. Singer I (1997) Abstract convex analysis. Wiley/Interscience, New York 20. Volle M (1985) Conjugasion par tranches. Ann Mat Pura Appl 139:279–312 21. Wood GR (1992) The bisection method in higher dimensions. Math Program 55:319–337

1315

1316

G

Global Optimization: Filled Function Methods

Global Optimization: Filled Function Methods HONG-XUAN HUANG Department of Industrial Engineering, Tsinghua University, Beijing, People’s Republic of China MSC2000: 90C26, 90C30, 90C59, 65K05 Article Outline Keywords and Phrases Introduction Definitions Methods Two-Parameter Filled Functions Single-Parameter Filled Functions Nonsmooth Filled Functions Discrete Filled Functions

Summary References Keywords and Phrases Basin; Filled function; Filled function method

In addition, the basin B2 at a minimizer x2 is lower (or higher) than the basin B1 at another minimizer x1 if the following inequality holds: f (x2 ) < f (x1 ) (or f (x2 )  f (x1 )) : Definitions The first kind of filled function method was proposed in [5] for the unconstrained optimization problem min f (x) :

x2< n

The corresponding filled function involved two parameters, and was defined by   kx  x1 k2 1 exp  ; (1) P(x; x1 ; r; ) D r C f (x) 2 where x1 is a minimizer of the objective function f (x), and r and  are parameters such that rC f (x1 ) > 0;  > 0. In order to demonstrate the principle of the filled function method, people usually assume that the function f (x) is twice continuously differentiable and coercive, i. e., its Hessian is continuous and the following condition holds: lim

Introduction The filled function methods describe a class of global optimization methods for attacking the problem of finding a global minimizer of a function f : X ! < over a certain subset X  0 and a suitable parameter h such that 0 < h < f (x1 )  f (x  ) ; where x  is a global minimizer of f (x), x1 is not a global but is a local minimizer of f (x), and (t) and '(t) are continuously differentiable univariate functions satisfying the following conditions [8]: (i) (0) D 0, 0 (t)  ˛ > 0; 8t  0. (ii) (0) D 0, (t) is monotonically increasing for all t 2 < (or for t 2 (t1 ; C1), where t1 > 0). (iii)  0 (t) > 0; 8t 2 < (or  0 (t) > 0; 8t 2 (t1 ; C1), where t1 > 0). (iv) When t ! C1,  0 (t) is monotonically decreasing to 0 at least as fast as 1/t. Note that choices for these two functions can be t, tan(t), e t  1; : : : for (t) and arctan t; tanh(t); 1  e t ; : : : for (t). Single-Parameter Filled Functions In order to reduce the difficulty in coordination between r and  in a two-parameter filled function, several single-parameter filled functions were proposed in [7]:   Q(x; x1 ; A) D [ f (x)  f (x1 )] exp Akx  x1 k2 ;   ˜ Q(x; x  ; A) D [ f (x)  f (x  )] exp Akx  x  k ; 1

1

1

rE(x; x1 ; A) D r f (x)  2A[ f (x)  f (x1 )](x  x1 ); ˜ x1 ; A) D r f (x)  A[ f (x)  f (x1 )] r E(x;

x  x1 : kx  x1 k

1319

1320

G

Global Optimization: Filled Function Methods

More and more single-parameter filled functions appeared afterwards. For example, H(x; x1 ; a) D

1  akx  x1 k2 ln[1 C f (x)  f (x1 )]

was proposed in [11], which is defined only for the region where f (x)  f (x1 )  1. The L function L(x; x1 ; a) D kx  x1 k2  [ f (x)  f (x1 )]1/m and the mitigated L2 function   1  [ f (x) f (x1 )]1/m ML2 (x; x1 ; a) D  kx  x1 k p were proposed in [12] and [13], respectively, where m > 1 is a prefixed natural number,  is a positive parameter, and ' is a mitigator. A function y : < ! < is said to be a mitigator if it is a twice continuously differentiable function in its domain and has the following properties: (i) y(0) D 0, y0 (t) > 0, and y00 (t) < 0 for all t > 0. (ii) lim y(t) exists. t!C1

Note that the ML2 function can reduce the negative definite effect in the Hessian of a single-parameter filled function such as the L function significantly. The numerical results and generalizations can be found in [12,13,14,15]. A more general form for the single-parameter filled functions can be expressed by Q(x; A) D ( f (x) 

f (x k ))exp(Aw(kx



x k kˇ ))

;

where ˇ  1, A > 0, and the functions '(t) and w(t) have the following properties [20]: (i) '(t) is continuously differentiable for t  0. (ii) (0) D 0,  0 (t) > 0; 8t  0. (iii)  0 (t)/(t) is monotonically decreasing for t 2 (0; C1). (iv) w(0) D 0 and for any t 2 (0; C1), w(t) > 0; w 0 (t)  c > 0. Note that the choices for these functions can be t, a t  1(a > 1), sinh(t); : : : for '(t) and t; sinh(t); e t  1; : : : for w(t). In order to avoid the influence of the exponential term, a general single-parameter filled function can be set by U(x; A) D ( f (x)  f (x k ))  Aw(kx  x k kˇ ) ;

where the function (t) is continuous on [0; C1) and is differentiable in (0; C1). Furthermore, the functions (t) and w(t) have the following properties [20]: (i) (0) D 0; (ii) 0 (t) > 0 is monotonically decreasing for t 2 (0; C1) and lim t!0C 0 (t) D C1; (iii) w(0) D 0 and for any t 2 (0; C1), w(t) > 0; w 0 (t)  c > 0. Nonsmooth Filled Functions It is well known that the constrained optimization problem can be formulated as a nonsmooth optimization problem by using the exact penalty function; see [3] or (3)–(5). With use of the methods of nonsmooth analysis, a nonsmooth unconstrained optimization problem was studied in [10], which involved a modified filled function as follows PF (x; x1 ; r; )     kx  x1 k2 1 exp  ; (7) D ln 1 C r C F(x) 2 where F(x) is a weak semismooth objective function and x1 is a local minimizer of F(x). For a composite function F(x) in the form F(x) D f (x) C h(c(x)) ; where f (x) and c(x) D (c1 (x); : : : ; c m (x))T are smooth functions and h : R m ! R is convex but nonsmooth [2], a kind of two-parameter filled function P(x; r; A) D

(r C f (x))exp(Akx  x k k2 )

was considered in [20], where the function (t) has properties such as: (i) (t) > 0 for t  0. (ii) (t) is monotonically decreasing for t  0. (iii) (t1 )  (t2 )  c2 (t2  t1 ) for t2 > t1  0, where c2 > 0 is a constant. In addition, for the single-parameter filled functions, we can also consider some general forms as follows: U(x; A) D ( f (x)  f (x k ))exp(Akx  x k k2 ) ; or ˜ U(x; A) D ( f (x)  f (x k ))  Akx  x k k2 ; where A > 0 is a parameter, and the function '(t) is required to satisfy certain conditions [20]:

G

Global Optimization: Filled Function Methods

(i) (0) D 0, (t) is monotonically increasing for t  0; for (ii) c1 (t2  t1 )  (t2 )  (t1 )  c2 (t2  t1 ) t2 > t1  0, where 0 < c1  c2 are constants. Note that even for a continuously differential unconstrained optimization problem, there may exist a nonsmooth filled function. For example, a two-parameter nonsmooth filled function P(x; x1 ; ; ) D f (x1 )  min[ f (x1 ); f (x)]  kx  x1 k2 C fmax[0; f (x) 

(8) f (x1 )]g2

was introduced in [21], where f (x) is coercive and Lipschitz continuous with a constant L in 0, and ˛ = (ˇ + 1)/ˇ. Then, the function  C

˛ U T



is convex. Furthermore, if q is a positive variable, and S is a convex subset in R2C , the convex optimization problem in (2) can be used to compute a rigorous lower bound for the solution of the problem in (1), i. e., the problem in (2) is a valid convex relaxation of the problem in (1): 8  q ˇ ˆ ˆ 0. Then the following inequalities are convex:  T 

 dth  dtc h i ; log dd tthc

 dth  (T1  T2 ) i ; h T  th log (T1dT ) 2   (T1  T2 )  dtc i h T  2) log (T1dT tc 

Property 3 ([19]) Let dt h , dt c and T, be continuous positive variables. Also, let T 1 and T 2 , be positive constants such that T 1  T 2 > 0. Then the following in-

Global Optimization of Heat Exchanger Networks

G

equalities, which are based on the Paterson approximation [13] for the LMTD, are convex:   2p 1 (dth C dtc ) C T  dth dtc ; 3 2 3   1 (dth C T1  T2 ) 2p T  dth (T1  T2 ); C 3 2 3   2p 1 (T1  T2 C dtc ) C T  (T1  T2 )dtc : 3 2 3 Property 4 ([19]) Let dt h , dt c and T, be continuous positive variables. Also, let T 1 and T 2 , be positive constants such that T 1  T 2 > 0. Then the following inequalities, which are based on the Chen approximation [2] for the LMTD, are convex:

Global Optimization of Heat Exchanger Networks, Figure 2 Heat exchanger network for the illustrative problem

1 (dth )(dtc )(dth C dtc ) 3 ; T  2 1  (dth )(T1  T2 )(dth C T1  T2 ) 3 T  ; 2  1 (T1  T2 )(dtc )(T1  T2 C dtc ) 3 T  : 2 

Property 5 ([19]) Let dt h , dt c be continuous positive variables, and let T be the logarithmic mean temperature difference, T = [dt h  dt c /log[dt h /dt c ]. Also, assume that r is a constant determined by the ratio of two particular values of dt h and dt c . Then, the following bounding inequality is valid, and holds as an equality along the line determined by the ratio r = dt h /dt c : T  P(r)dth C Q(r)dtc ; where ( P(r) D

0:5

if r D 1;

1/r1Clog(r) [log(r)]2

if r ¤ 1;

( Q(r) D

0:5

if r D 1;

r1log(r) [log(r)]2

if r ¤ 1:

Several other useful properties and their application in the development of convex relaxations for HENs problems can be found in [1,6,14,19], and [20,21,22] As an illustrative example of the use of the above properties, and the application of global optimization techniques in heat exchanger networks, consider the

Global Optimization of Heat Exchanger Networks, Figure 3 Global optimum HEN design of the illustrative problem

determination of the global optimal design of the HEN shown in Fig. 2 [14]; stream data and cost information are included in Table 1. This problem was originally solved in [14] and [21] using the arithmetic mean temperature difference driving force (AMTD), and assuming isothermal mixing of process streams (t 5 = t 6 ). Figure 3 shows the global optimum solution of the nonconvex model (P) associated with the illustrative problem. A design with a total network cost of $36,199 is determined. Note that model (P) does not assume isothermal mixing, utilizes the approximation by Chen [2], and enforces a minimum approach temperature of 5 degrees. The global optimization of model (P) was performed with the branch and contract algorithm proposed in [21,23]; the convex model (R) was used in the computation of rigorous lower bounds of the total network cost. The solution process required 7 branch and bound nodes, and approximately 37 cpu seconds of a Pentium I processor running at 133Mhz. Alternative suboptimal solutions for the illustrative problem based

1339

G

1340

Global Optimization of Heat Exchanger Networks

Global Optimization of Heat Exchanger Networks, Table 1 Problem data for illustrative example

Stream H1 H2 C1 C2 C3

Tin (K) 575 718 300 365 358

Tout (K) 395 398 400  

F (kW K1 ) 5.555 3.125 10 4.545 3.571

1

2

Cost of Heat Exchanger 1 ($yr ) = 270[A1 (m )] Cost of Heat Exchanger 2 ($yr1 ) = 720[A2 (m2 )] Cost of Heat Exchanger 3 ($yr1 ) = 240[A3 (m2 )] Cost of Heat Exchanger 4 ($yr1 ) = 900[A4 (m2 )] U1 =U1 = 0:1 kW m2 K1 U3 =U4 = 1:0 kW m2 K1

Model Constraints q1 D 5:555(t1  395); q1 D f 1 (t5  300); q2 D 3:125(t2  398); q2 D f 2 (t6  300); q3 D 4:545(t3  365); q3 D 5:555(575  t1 ); q4 D 3:571(t4  358); q4 D 3:125(718  t2 ); q1 C q2 D 1000; q1 C q3 D 999:9; q2 C q4 D 1000; f 1 C f 2 D 10;

on the rigorous LMTD include network designs with total costs of $38,513, $39,809, $41,836, and $47,681.

dt1h D t1  t5 ; dt1c D 95; dt2h D t2  t6 ;

Nonconvex Model (P) Indices 1, 2, 3, 4

dt2c D 98; dt3h D 575  t3 ;

= index for heat exchangers

1h, 2h, 3h, 4h = hot side of heat exchangers

dt3c D t1  365;

1c, 2c, 3c, 4c

dt4h D 718  t4 ;

= cold side of heat exchangers

Parameters U1 , U2 , U3 , U4 = overall heat transfer coefficients

Positive Variables t

= stream temperature

dt

= temperature difference at end of heat exchanger

T = approximation of the logarithmic mean temperature difference q

= heat transfer rate

f

= heat capacity flowrate

Objective Function q2 q1 C 720 U1 T1 U2 T2 q3 q4 C 240 C 900 : U3 T3 U4 T4

min 270

dt4c D t2  358; 1  (dt1h )(dt1c )(dt1h C dt1c ) 3 T1 D 2 1  (dt2h )(dt2c )(dt2h C dt2c ) 3 T2 D 2  1 (dt3h )(dt3c )(dt3h C dt3c ) 3 T3 D 2 1  (dt4h )(dt4c )(dt4h C dt4c ) 3 T4 D 2

; ; ; ;

f 1 t5 C f 2 t6 D 4000; 0  q Li  q i  q Ui ; 0

t Lj

 tj 

dt k  5;

t Uj ;

i D 1; 2; 3; 4; j D 1; 2; 3; 4; 5; 6;

k D 1h; 1c; 2h; 2c; 3h; 3c; 4h; 4c

0  f 1L  f 1  f 1U ;

0  f 2L  f 2  f 2U :

Global Optimization of Heat Exchanger Networks

Convex Model (R)

z11 D t5  300;

Objective Function

z22 D t6  300;

y15  t5U f 1 C f 1U t5  f 1U t5U ; y15  t5L f 1 C f 1U t5  f 1U t5L ; y15  t5U f 1 C f 1L t5  f 1L t5U ; y26  t6L f 2 C f 2L t6  f 2L t6L ;

Model Constraints 1

1

y15 C y26 D 4000;

y15  t5L f 1 C f 1L t5  f 1L t5L ;

[1 ]2 [2 ]2 C 720 U1 T1 U2 T2 [3 ]2 [4 ]2 C 240 C 900 : U3 T3 U4 T4

min 270

 i  (q Li ) 2 C

G

1

(q Ui ) 2  (q Li ) 2 (q i  q Li ); q Ui  q Li

i D 1; 2; 3; 4;

y26  t6U f 2 C f 2U t6  f 2U t6U ; y26  t6L f 2 C f 2U t6  f 2U t6L ; y26  t6U f 2 C f 2L t6  f 2L t6U ; 0 12 q L U q C q q 1 B 1 1 1 C q A ; z11  @ q f1 q1L C q1U

q1 D 5:555(t1  395); q1 D y15  300 f 1 ; q2 D 3:125(t2  398); q2 D y26  300 f 2 ;

0

q3 D 4:545(t3  365); z22 

q3 D 5:555(575  t1 ); q4 D 3:571(t4  358); q4 D 3:125(718  t2 );

z11 

q1 C q2 D 1000; q1 C q3 D 999:9;

z11 

q2 C q4 D 1000; f 1 C f 2 D 10;

z22 

dt1h D t1  t5 ; dt1c D 95;

z22 

dt2h D t2  t6 ; z11 

dt2c D 98; dt3h D 575  t3 ;

z11 

dt3c D t1  365; dt4h D 718  t4 ; dt4c D t2  358; 1  (dt1h )(dt1c )(dt1h C dt1c ) 3 T1  2 1  (dt2h )(dt2c )(dt2h C dt2c ) 3 T2  2 1  (dt3h )(dt3c )(dt3h C dt3c ) 3 T3  2 1  (dt4h )(dt4c )(dt4h C dt4c ) 3 T4  2

z22  ; ; ;

z22 

q

q2L q2U

1 B q2 C C q A ; @q f2 U L q2 C q2   q1 1 1 U C q1  L ; f1 f 1L f1   1 q1 1 L ; C q  1 f1 f 1U f 1U   1 q2 1 U ; C q  2 f2 f 2L f 2L   q2 1 1 L C q2  U ; f2 f 2U f2   1 f 1U q1  q1L f 1 C q1L f 1L ; L U f1 f1  1  L f 1 q1  q1U f 1 C q1U f 1U ; L U f1 f1 1 f 2L f 2U

( f 2U q2  q2L f 2 C q2L f 2L );

 1  L f q2  q2U f 2 C q2U f 2U ; f 2L f 2U 2

0  q Li  q i  q Ui ; 0  t Lj  t j  t Uj ; dt k  5;

i D 1; 2; 3; 4; j D 1; 2; 3; 4; 5; 6;

k D 1h; 1c; 2h; 2c; 3h; 3c; 4h; 4c;

0  f1L  f 1  f 1U ; ;

12

y15 ; y26 ; z11 ; z22  0:

0  f 2L  f 2  f 2U ;

1341

1342

G

Global Optimization: Hit and Run Methods

See also  MINLP: Global Optimization with ˛BB  MINLP: Heat Exchanger Network Synthesis  MINLP: Mass and Heat Exchanger Networks  Mixed Integer Linear Programming: Heat Exchanger Network Synthesis  Mixed Integer Linear Programming: Mass and Heat Exchanger Networks

References 1. Adjiman CS, Androulakis IP, Floudas CA (1997) Global optimization of MINLP problems in process synthesis and design. Comput Chem Eng 21:S445–S450 2. Chen JJJ (1987) Letter to the Editors: Comments on improvement on a replacement for the logarithmic mean. Chem Eng Sci 42:2488–2489 3. Ciric AR, Floudas CA (1991) Heat exchanger network synthesis without decomposition. Comput Chem Eng 15:385– 396 4. Floudas CA, Pardalos PM (1990) A collection of test problems for constrained global optimization algorithms. no. 455 of Lecture Notes Computer Sci Springer, Berlin 5. Gundersen T, Naess L (1988) The synthesis of cost optimal heat exchanger network synthesis, An industrial review of the state of the art. Comput Chem Eng 12:503–530 6. Hashemi-Ahmady A, Zamora JM, Gundersen T (1999) A sequential framework for optimal synthesis of industrial size heat exchanger networks. In: Proc. 2nd Conf. Process Integration, Modeling and Optimization for Energy Saving and Pollution Reduction (PRESS’99), Hungarian Chemical Soc. 7. Horst R, Pardalos PM (eds) (1995) Handbook of global optimization. Kluwer, Dordrecht 8. Horst R, Tuy H (1993) Global optimization: Deterministic approaches, 2nd edn. Springer, Berlin 9. Iyer RR, Grossmann IE (1996) Global optimization of heat exchanger networks with fixed configuration for multiperiod design. In: Grossmann IE (ed) Global Optimization in Engineering Design. Kluwer, Dordrecht 10. Jezowski J (1994) Heat exchanger network grassroot and retrofit design: The review of the state of the art Part I: Heat exchanger network targeting and insight based methods of synthesis. Hungarian J Industr Chem 22:279–294 11. Jezowski J (1994) Heat exchanger network grassroot and retrofit design: The review of the state of the art - Part II: Heat exchanger network synthesis by mathematical methods and approaches for retrofit design. Hungarian J Industr Chem 22:295–308 12. Papoulias SA, Grossmann IE (1983) A structural optimization approach in process synthesis II. Heat recovery networks. Comput Chem Eng 7:707–721

13. Paterson WR (1984) A replacement for the logarithmic mean. Chem Eng Sci 39:1635–1636 14. Quesada I, Grossmann IE (1993) Global optimization algorithm for heat exchanger networks. Industr Eng Chem Res 32:487–499 15. Ryoo HS, Sahinidis NV (1995) Global optimization of nonconvex NLPs and MINLPs with applications in process design. Comput Chem Eng 19:551–566 16. Visweswaran V, Floudas CA (1996) Computational results for an efficient implementation of the GOP algorithm and its variants. In: Grossmann IE (ed) Global Optimization in Engineering Design. Kluwer, Dordrecht 17. Westerberg AW, Shah JV (1978) Assuring a global optimum by the use of an upper bound on the lower (dual) bound. Comput Chem Eng 2:83–92 18. Yee TF, Grossmann IE (1990) Simultaneous optimization models for heat integration-II. Heat exchanger network synthesis. Comput Chem Eng 14:1165–1184 19. Zamora JM (1997) Global optimization of nonconvex NLP and MINLP models. PhD Thesis, Dept. Chemical Engin. Carnegie-Mellon Univ. 20. Zamora JM, Grossmann IE (1997) A comprehensive global optimization approach for the synthesis of heat exchanger networks with no stream splits. Comput Chem Eng 21:S65– S70 21. Zamora JM, Grossmann IE (1998) Continuous global optimization of structured process systems models. Comput Chem Eng 22:1749–1770 22. Zamora JM, Grossmann IE (1998) A global MINLP optimization algorithm for the synthesis of heat exchanger networks with no stream splits. Comput Chem Eng 22:367– 384 23. Zamora JM, Grossmann IE (1999) A branch and contract algorithm for problems with concave univariate, bilinear and linear fractional terms. J Global Optim 14:217–249

Global Optimization: Hit and Run Methods ZELDA B. ZABINSKY Industrial Engineering University Washington, Seattle, USA MSC2000: 90C26, 90C90 Article Outline Keywords Hit and Run Based Algorithms See also References

Global Optimization: Hit and Run Methods

G

Keywords

Hit and Run Based Algorithms

Global optimization; Stochastic methods; Random search algorithms; Adaptive search; Simulated annealing; Improving hit and run; Hit and run methods; Mixed discrete-continuous global optimization; Pure random search; Pure adaptive search

The underlying concept of hit and run based algorithms is that, if hit and run could generate a uniformly distributed point in an improving level set, then PAS predicts that we need only a linear number of such points. The point generated by just one iteration of hit and run is far from uniform and may not be in the improving set, so the number of function evaluations is not expected to be linear in dimension, but in [16] it was shown that the expected number of function evaluations for IHR on the class of elliptical programs (e. g. positive definite quadratic programs) is polynomial in dimension, O(n5/2 ). The number of function evaluations includes those points that are rejected because they do not fall into the improving level set. This theoretical performance result motivates the use of hit and run for optimization. Numerical experience indicates that IHR has been especially useful in high-dimensional global optimization problems when there are many local minima embedded within a broad convex structure. The general framework for a hit and run based optimization algorithm for solving a global optimization problem, ( min f (x)

The hit and run algorithms fall into the category of sequential random search methods (cf. also  Random search methods), or stochastic methods. These methods can be applied to a broad class of global optimization problems. They seem especially useful for problems with black-box functions which have no known structure. These problems often involve a very large number of variables, and may include both continuous and discrete variables. The concept of hit and run is to iteratively generate a sequence of points by taking steps of random length in randomly generated directions. R.L. Smith, in 1984 [12], showed that this method can be used to generate points within a set S that are asymptotically uniformly distributed. The hit and run method was originally applied to identifying nonredundant constraints in linear programs [1,3], and in stochastic programming [2]. Hit and run was first applied to optimization in [16], and the name improving hit and run (IHR) was adopted. The term ‘improving’ was intended to indicate that the sequence of points were improving with regard to their objective function values. The IHR algorithm couples the idea of pure adaptive search [8,15] with the hit and run generator to produce an easily implemented sequential random search algorithm. Pure adaptive search (PAS, see also  Random search methods) predicts that points uniformly generated in improving level sets has, on the average, a linear number of iterations in terms of dimension. One way to approximate PAS, would be to use hit and run to generate approximately uniform points, and then select those that land in improving level sets. This is the idea behind improving hit and run. In addition to IHR, a family of methods have been developed that are based on hit and run. Other variations include: adding an acceptance probability with a cooling schedule, varying the choice of direction, varying the length of step, and modifying the sampling method to include a mixture of continuous and discrete variables.

s.t.

x 2 S;

where f is a real-valued function on S, is stated below.

PROCEDURE hit and run optimization method() InputInstance(); Generate an initial solution X0 ; Set Y0 = f (X0 ); Set k = 0; DO until stopping criterion is met; Generate a random direction D k ; Generate a random steplength  k ; Evaluate candidate point Wk = X k +  k D k ; Update the ( new point, Wk if candidate point accepted X k+1 = X k if rejected Set Yk+1 = min(Yk , f (X k+1)); OD; RETURN(Best solution found, Yk+1 ); END hit and run optimization method; Pseudocode for a hit and run based optimisation algorithm

1343

1344

G

Global Optimization: Hit and Run Methods

Improving hit and run uses the most basic hit and run generator, which is to generate a direction vector Dk that is uniformly distributed on a hypersphere, and then generate a steplength λk uniformly on the intersection of the line through Xk in direction Dk with the feasible set S. In many applications, S may be an n-dimensional polytope described by linear constraints, in which case the intersection of a direction with S is easily computed using a slight modification of a minimum ratio test (see [16] for details). This is the most basic hit and run generator, but several variations have been developed. One variation is to add an acceptance probability with a cooling schedule to the hit and run generator, as in simulated annealing (cf. also Simulated Annealing). This was developed in [10] and called the hide-and-seek algorithm. Just as IHR was motivated by pure adaptive search, hide-and-seek was motivated by adaptive search [9] (see also Random Search Methods). Adaptive search generates a series of points according to a sequence of Boltzmann distributions, with parameter T changing on each iteration. The theory predicts that adaptive search with decreasing temperature parameter T will converge with probability one to the global optimum, and the number of improving points has the same linear bound as PAS. Hide-and-seek uses the basic hit and run generator, but accepts the candidate point with the Metropolis criterion and parameter T. It is interesting to consider the two extremes of the acceptance probability: if the temperature is fixed at infinity, then all candidate points are accepted, and the hit and run generator approximates pure random search with a uniform distribution; at the other extreme, if the temperature is fixed at zero, then only improving points are accepted, and we have improving hit and run. H.E. Romeijn and Smith derived a cooling schedule which essentially starts with hit and run and approaches IHR. They proved that hide-and-seek will eventually converge to the global optimum, even though it may experience deteriorations in objective function values. They also present computational results on several test functions, which compare favorably with other algorithms in the literature.

A second variation to the basic hit and run generator is to modify the direction distribution. Thus far, we have only described choosing a direction according to a uniform distribution on an n-dimensional hypersphere, which has also been termed hyperspherical direction (HD) choice. In [16] and [10], the direction distribution is defined more generally; the direction may be generated from a multivariate normal distribution with mean 0 and covariance matrix H. If the H matrix is the identity matrix, then the direction distribution is essentially the uniform distribution on a hypersphere. In [4] a nonuniform direction distribution is derived that optimizes the rate of convergence of the algorithm. Although exact implementation of the optimal direction distribution may be very difficult, it motivates an adaptive direction choice rule called artificial centering hit and run. Another choice for the direction distribution is the coordinate direction (CD) method, in which the direction is chosen uniformly from the n coordinate vectors (spanning Rn). Both HD and CD versions of direction choice were presented and applied to identifying nonredundant linear constraints in [1]. They were also tested in the context of global optimization in [14]. Computationally, CD can outperform HD on specific problems where the optimum is properly aligned; however, HD is guaranteed to converge with probability one, while it is easy to construct problems where CD will never converge to the global optimum. A simple example is given in [5] where local minima are lined up on the coordinate directions, and it is impossible for the CD algorithm to leave the local minimum unless it accepts a nonimproving point. For such an example, it is shown in [5] that the CD algorithm coupled with a nonzero acceptance probability for nonimproving points will converge with probability one. Experimental results were also reported. A third variation to the basic hit and run generator modifies it to be applicable to discrete domains [7,11]. Hit and run as described so far has been defined on a continuous domain. An extension to a discrete domain was accomplished by superimposing the discrete domain onto a continuous real number system. It was motivated by design variables such as fiber angles in a composite laminate, or diameters in a 10-bar truss, where the discrete variables have a natural continuous analog. Two slightly different modifications have been introduced. In [11] the candidate points were generated using hit and run on the expanded continuous domain, where the objective function of a nondiscrete point is equal to the objective function evaluated at its nearest discrete value.

Global Optimization: Hit and Run Methods, Figure 1 Two schemes to modify hit and run to discrete domains

In this way, the modified algorithm operates on a continuous domain where the objective function is a multidimensional step function, with plateaus surrounding the discrete points. This modification still converges with probability 1 to the global optimum, as proven in [11]. The diagram in Fig. 1 illustrates this method. Starting from point X1, hit and run on the continuous domain generates a candidate point such as A. The objective function at A is set equal to that of its nearest discrete point B, forcing f(A) = f(B). If the candidate point is accepted, then X2 = A, and another candidate point (shown as C) is generated. A second scheme to modify hit and run to operate on discrete domains is to similarly generate a point on a continuous domain, and then round the generated point to its nearest discrete point in the domain on each iteration [6,7,13]. Again starting from point X1 in Fig. 1, suppose A is generated. In this version, the candidate point is taken as the nearest discrete neighbor, in this example B. The objective function is evaluated at B, f(B), and if the point is accepted, then X2 = B. The difference in this variation is illustrated by noting that the next candidate point is generated from B instead of from A; see point D in Fig. 1. Also note that only discrete points are maintained. In [6,7] it is shown that this
second scheme dominates the first scheme in terms of average performance for the special class of spherical programs, and numerical results have been promising.

Another modification to the basic hit and run generator is in the way the steplength is generated. Instead of generating the point uniformly on the whole line segment, the line segment can be restricted to a fixed length, or adaptively modified. S. Neogi [6] refers to this as full-line length, restricted line length, or adaptive stepsize. In [6] the adaptive stepsize is coupled with an acceptance probability to maintain a fixed probability of generating an improving point. See [6] for a more detailed discussion of this variation of a simulated annealing algorithm based on the hit and run generator.

The many variations of hit and run have been numerically tested on many test functions and applied to real-world applications. All of the papers referenced in this article include numerical results, but the details are left to the individual papers. Overall, the theoretical motivations and numerical experience lead us to believe that hit and run is a promising approach to global optimization.

See also
Random Search Methods
Stochastic Global Optimization: Stopping Rules
Stochastic Global Optimization: Two-phase Methods

References
1. Berbee HCP, Boender CGE, Rinnooy Kan AHG, Scheffer CL, Smith RL, Telgen J (1987) Hit-and-run algorithms for the identification of nonredundant linear inequalities. Math Program 37:184–207
2. Birge JR, Smith RL (1984) Random procedures for nonredundant constraint identification in stochastic linear programs. Amer J Math Management Sci 4:41–70
3. Boneh A, Golan A (1979) Constraints' redundancy and feasible region boundedness by random feasible point generator. Third European Congress Oper. Res., EURO III, Amsterdam, 9-11 April 1979
4. Kaufman DE, Smith RL (1998) Direction choice for accelerated convergence in hit-and-run sampling. Oper Res 46(1):84–95
5. Kristinsdottir BP (1997) Analysis and development of random search algorithms. PhD Thesis, Univ. Washington
6. Neogi S (1997) Design of large composite structures using global optimization and finite element analysis. PhD Thesis, Univ. Washington

7. Neogi S, Zabinsky ZB, Tuttle ME (1994) Optimal design of composites using mixed discrete and continuous variables. Proc. ASME Winter Annual Meeting, Symp. Processing, Design and Performance of Composite Materials, vol 52. Dekker, New York, pp 91–107
8. Patel NR, Smith RL, Zabinsky ZB (1988) Pure adaptive search in Monte Carlo optimization. Math Program 4:317–328
9. Romeijn HE, Smith RL (1994) Simulated annealing and adaptive search in global optimization. Probab Eng Inform Sci 8:571–590
10. Romeijn HE, Smith RL (1994) Simulated annealing for constrained global optimization. J Global Optim 5:101–126
11. Romeijn HE, Zabinsky ZB, Graesser DL, Neogi S (1999) Simulated annealing for mixed integer/continuous global optimization. J Optim Th Appl 101(1)
12. Smith RL (1984) Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Oper Res 32:1296–1308
13. Zabinsky ZB (1998) Stochastic methods for practical global optimization. J Global Optim 13:433–444
14. Zabinsky ZB, Graesser DL, Tuttle ME, Kim GI (1992) Global optimization of composite laminate using improving hit and run. In: Floudas CA, Pardalos PM (eds) Recent Advances in Global Optimization. Princeton Univ. Press, Princeton, 343–365
15. Zabinsky ZB, Smith RL (1992) Pure adaptive search in global optimization. Math Program 53:323–338
16. Zabinsky ZB, Smith RL, McDonald JF, Romeijn HE, Kaufman DE (1993) Improving hit and run for global optimization. J Global Optim 3:171–192

Global Optimization: Interval Analysis and Balanced Interval Arithmetic

JULIUS ŽILINSKAS¹, IAN DAVID LOCKHART BOGLE²
¹ Institute of Mathematics and Informatics, Vilnius, Lithuania
² Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, UK

MSC2000: 65K05, 90C30, 90C57, 65G30, 65G40

Article Outline

Keywords and Phrases
Introduction
Methods / Applications
Interval Analysis in Global Optimization

Underestimating Interval Arithmetic
Random Interval Arithmetic
Balanced Interval Arithmetic

See also
References

Keywords and Phrases

Global optimization; Interval arithmetic; Interval computations

Introduction

Mathematically the global optimization problem is formulated as

$$ f^* = \min_{X \in D} f(X), $$

where a nonlinear function f(X), f : R^n → R, of continuous variables X is the objective function, D ⊆ R^n is the feasible region, and n is the number of variables. A global minimum f* and one or all global minimizers X*: f(X*) = f* should be found. No assumptions on unimodality are included in the formulation of the problem. Most often the objective function is defined by an analytical formula or an algorithm which evaluates the value of the objective function from the values of the variables using arithmetic operations (cf. also Continuous Global Optimization: Models, Algorithms and Software). One class of methods for global optimization is based on interval arithmetic. Interval arithmetic [10] provides bounds for the function values over hyper-rectangular regions defined by intervals of the variables. The bounds may be used in global optimization to detect sub-regions of the feasible region which cannot contain a global minimizer; such sub-regions may be discarded from the subsequent search for a minimum. Interval arithmetic provides guaranteed bounds, but sometimes they are too pessimistic. Interval arithmetic is used in global optimization to provide guaranteed solutions, but there are problems for which the time for optimization is too long. A disadvantage of interval arithmetic is the dependency problem [5]: when a given variable occurs more than once in an interval computation, it is treated as a different variable in each occurrence.
This causes widening of the computed intervals and overestimation of the range of function values. Analysis of both overestimating and underestimating intervals is useful for estimating how much the interval bounds overestimate the range of function values. Moreover, inner interval arithmetic operations may be used instead of standard interval arithmetic operations in some cases when the dependency of the operands is known or the operands are known to be monotonic. Although monotonicity cannot easily be determined in advance, inner and standard interval arithmetic operations may be chosen randomly, building random interval arithmetic, in which the range of real function values is estimated from a sample of random intervals.

Methods / Applications

Interval Analysis in Global Optimization

Interval arithmetic was proposed in [10]. Interval arithmetic operates with real intervals $\mathbf{x} = [\underline{x}, \overline{x}] = \{x \in \mathbb{R} \mid \underline{x} \le x \le \overline{x}\}$, defined by two real numbers $\underline{x} \in \mathbb{R}$ and $\overline{x} \in \mathbb{R}$, $\underline{x} \le \overline{x}$. For any real arithmetic operation $x \circ y$ the corresponding interval arithmetic operation $\mathbf{x} \circ \mathbf{y}$ is defined as an operation whose result is an interval containing every possible number produced by the real operation with real numbers from each interval. The interval arithmetic operations are defined as:

$$\mathbf{x} + \mathbf{y} = [\underline{x} + \underline{y},\ \overline{x} + \overline{y}],$$
$$\mathbf{x} - \mathbf{y} = [\underline{x} - \overline{y},\ \overline{x} - \underline{y}],$$
$$\mathbf{x} \times \mathbf{y} = \begin{cases} [\underline{x}\,\underline{y},\ \overline{x}\,\overline{y}], & \underline{x} > 0,\ \underline{y} > 0,\\ [\overline{x}\,\underline{y},\ \overline{x}\,\overline{y}], & \underline{x} > 0,\ 0 \in \mathbf{y},\\ [\overline{x}\,\underline{y},\ \underline{x}\,\overline{y}], & \underline{x} > 0,\ \overline{y} < 0,\\ [\underline{x}\,\overline{y},\ \overline{x}\,\overline{y}], & 0 \in \mathbf{x},\ \underline{y} > 0,\\ [\min(\underline{x}\,\overline{y},\ \overline{x}\,\underline{y}),\ \max(\underline{x}\,\underline{y},\ \overline{x}\,\overline{y})], & 0 \in \mathbf{x},\ 0 \in \mathbf{y},\\ [\overline{x}\,\underline{y},\ \underline{x}\,\underline{y}], & 0 \in \mathbf{x},\ \overline{y} < 0,\\ [\underline{x}\,\overline{y},\ \overline{x}\,\underline{y}], & \overline{x} < 0,\ \underline{y} > 0,\\ [\underline{x}\,\overline{y},\ \underline{x}\,\underline{y}], & \overline{x} < 0,\ 0 \in \mathbf{y},\\ [\overline{x}\,\overline{y},\ \underline{x}\,\underline{y}], & \overline{x} < 0,\ \overline{y} < 0, \end{cases}$$
$$\mathbf{x} / \mathbf{y} = \begin{cases} [\underline{x}/\overline{y},\ \overline{x}/\underline{y}], & \underline{x} > 0,\ \underline{y} > 0,\\ [\overline{x}/\overline{y},\ \underline{x}/\underline{y}], & \underline{x} > 0,\ \overline{y} < 0,\\ [\underline{x}/\underline{y},\ \overline{x}/\underline{y}], & 0 \in \mathbf{x},\ \underline{y} > 0,\\ [\overline{x}/\overline{y},\ \underline{x}/\overline{y}], & 0 \in \mathbf{x},\ \overline{y} < 0,\\ [\underline{x}/\underline{y},\ \overline{x}/\overline{y}], & \overline{x} < 0,\ \underline{y} > 0,\\ [\overline{x}/\underline{y},\ \underline{x}/\overline{y}], & \overline{x} < 0,\ \overline{y} < 0, \end{cases} \qquad 0 \notin \mathbf{y}.$$
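The case analysis above translates directly into code. The following is a minimal Python sketch (my illustration, not the library of [10] nor the authors' implementation; it ignores outward rounding) that reproduces the operations via endpoint minima and maxima, and exhibits the dependency problem mentioned in the Introduction:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # min/max over all endpoint products reproduces the nine-case definition
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def __truediv__(self, other):
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("0 in divisor interval")
        return self * Interval(1 / other.hi, 1 / other.lo)

# Natural interval extension of f(x) = x*(1 - x) on [0, 1]:
x = Interval(0.0, 1.0)
print(x * (Interval(1.0, 1.0) - x))  # Interval(lo=0.0, hi=1.0), overestimates [0, 0.25]
```

Because x occurs twice in x*(1 − x), the computed enclosure [0, 1] is wider than the true range [0, 0.25]; this is the dependency effect that the underestimating and random arithmetics below try to quantify.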

An interval function can be constructed by replacing the usual arithmetic operations with interval arithmetic operations in the formula or the algorithm for calculating values of the function. An interval value of the function can be evaluated using the interval function with interval arguments. The resulting interval always encloses the range of real function values in the hyper-rectangular region defined by the vector of interval arguments:

$$\{\, f(X) \mid X \in \mathbf{X},\ X \in \mathbb{R}^n \,\} \subseteq \mathbf{f}(\mathbf{X}),$$

where f : R^n → R and **f** : [R, R]^n → [R, R]. Because of this property the interval value of the function can be used as lower and upper bounds for the function in the region, which may be used in global optimization. The first version of an interval global optimization algorithm was oriented to minimization of a rational function by bisection of sub-domains [12]. Interval methods for global optimization were further developed in [3,4,11], where the interval Newton method and the test of strict monotonicity were introduced. A thorough description including theoretical as well as practical aspects can be found in [5], where a very efficient interval global optimization method involving monotonicity and non-convexity tests and the special interval Newton method is described (cf. also Interval Global Optimization). A branch and bound technique is usually used to construct interval global optimization algorithms. An iteration of a classical branch and bound algorithm processes a yet unexplored sub-region of the feasible region. Iterations have three main components: selection of the sub-region from a candidate list to process, bound calculation, and branching. In interval global optimization algorithms bounds are calculated using interval arithmetic.
All interval global optimization branch and bound algorithms use hyper-rectangular partitions, and branching is usually performed by bisecting a hyper-rectangle into two. Variants of interval branch-and-bound algorithms for global optimization where the bisection is replaced by subdivision of sub-regions into many sub-regions in a single iteration step have been investigated in [2]; the convergence properties have been investigated in detail, and an extensive numerical study is presented in [8] (cf. also Bisection Global Optimization Methods; Interval Analysis: Subdivision Directions in Interval Branch and Bound Techniques). The tightness of the bounds is a very important factor for the efficiency of branch and bound based global optimization algorithms. An experimental model of interval arithmetic with controllable tightness of bounds, used to investigate the impact of bound tightening in interval global optimization, was proposed in [14]. Experimental results on the efficiency of tightening bounds were presented for several test and practical problems. Experiments have shown that the relative tightness of the bounds strongly influences the efficiency of global optimization algorithms based on the branch and bound approach combined with interval arithmetic.
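The branch and bound scheme just described can be sketched compactly. The following is a minimal one-dimensional illustration under simplifying assumptions (bisection branching, a user-supplied inclusion function F; names are mine, not from [2,8,14]):

```python
def interval_branch_and_bound(F, lo, hi, tol=1e-6):
    """Sketch of interval branch and bound for min f(x) on [lo, hi].

    F(a, b) must return (lb, ub): guaranteed bounds on f over [a, b],
    e.g. computed by a natural interval extension of f.
    """
    best_ub = F(lo, hi)[1]          # incumbent upper bound on the global minimum
    work = [(lo, hi)]
    while work:
        a, b = work.pop()
        lb, ub = F(a, b)
        if lb > best_ub:            # box cannot contain a global minimizer: discard
            continue
        best_ub = min(best_ub, ub)
        if b - a > tol:             # branch: bisect the box
            m = 0.5 * (a + b)
            work += [(a, m), (m, b)]
    return best_ub
```

Here F could, for instance, wrap the Interval class sketched above; the pruning step `lb > best_ub` is exactly the discarding of sub-regions that cannot contain a global minimizer.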

Underestimating Interval Arithmetic

Kaucher arithmetic [6,7], which defines underestimating (inner) interval operations, is useful for estimating how much the interval bounds overestimate the range of function values. Writing $[a \vee b]$ for the proper interval $[\min(a,b), \max(a,b)]$, the inner addition and subtraction are

$$\mathbf{x} +_u \mathbf{y} = [\underline{x} + \overline{y} \,\vee\, \overline{x} + \underline{y}], \qquad \mathbf{x} -_u \mathbf{y} = [\underline{x} - \underline{y} \,\vee\, \overline{x} - \overline{y}],$$

and inner multiplication is given by a sign-based case analysis dual to that of the standard operation:

$$\mathbf{x} \times_u \mathbf{y} = \begin{cases} [\underline{x}\,\overline{y} \vee \overline{x}\,\underline{y}], & \underline{x} > 0,\ \underline{y} > 0 \ \text{or}\ \overline{x} < 0,\ \overline{y} < 0,\\ [\underline{x}\,\underline{y},\ \underline{x}\,\overline{y}], & \underline{x} > 0,\ 0 \in \mathbf{y},\\ [\overline{x}\,\overline{y} \vee \underline{x}\,\underline{y}], & \underline{x} > 0,\ \overline{y} < 0 \ \text{or}\ \overline{x} < 0,\ \underline{y} > 0,\\ [\underline{x}\,\underline{y},\ \overline{x}\,\underline{y}], & 0 \in \mathbf{x},\ \underline{y} > 0,\\ [0, 0], & 0 \in \mathbf{x},\ 0 \in \mathbf{y},\\ [\overline{x}\,\overline{y},\ \underline{x}\,\overline{y}], & 0 \in \mathbf{x},\ \overline{y} < 0,\\ [\overline{x}\,\overline{y},\ \overline{x}\,\underline{y}], & \overline{x} < 0,\ 0 \in \mathbf{y}. \end{cases}$$

Inner division can be expressed through the exact reciprocal, $\mathbf{x} /_u \mathbf{y} = \mathbf{x} \times_u [1/\overline{y},\ 1/\underline{y}]$ for $0 \notin \mathbf{y}$.

Random Interval Arithmetic

In random interval arithmetic, each operation of an interval computation is chosen randomly to be either the standard (outer) or the inner operation, and the range of real function values is then estimated from a sample of randomly computed intervals [6,7].
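To make the contrast concrete, here is a toy sketch (my illustration, not the authors' code) of outer versus inner subtraction and a random mixture of the two:

```python
import random

def outer_sub(x, y):              # standard interval subtraction
    return (x[0] - y[1], x[1] - y[0])

def inner_sub(x, y):              # underestimating (inner) subtraction
    a, b = x[0] - y[0], x[1] - y[1]
    return (min(a, b), max(a, b))

def random_sub(x, y, p=0.5, rng=random.Random(0)):
    # Random interval arithmetic: pick outer or inner per operation.
    return outer_sub(x, y) if rng.random() < p else inner_sub(x, y)

x = (1.0, 2.0)
print(outer_sub(x, x))  # (-1.0, 1.0): overestimates the true range {0} of x - x
print(inner_sub(x, x))  # (0.0, 0.0): the inner result is exact for this dependency
```

Comparing samples of such randomly computed intervals against the guaranteed outer bounds gives an empirical estimate of how pessimistic the standard enclosure is.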

Global Optimization in Location Problems

Consider m users located at given points a^1, ..., a^m ∈ R^2_+. Certain users, henceforth called the "attraction points," are interested in having the facility located as close to them as possible. Others, called the "repulsion points," would like the facility to be located as far away from them as possible. Let J_1, J_2 denote the index sets of attraction and repulsion
points, respectively. For each user j = 1, ..., m a function q_j(t) is known that measures the cost of traveling a distance t away from a^j; also, h_j(x) is a function of the distance from user j to the point x ∈ R^2. It is assumed that the function q_j(t) is concave increasing with q_j(t) → +∞ as t → ∞, while h_j(x) is a convex function such that h_j(x) → +∞ as ‖x‖ → +∞. So if x is the unknown location of the facility, then to take account of the interest of the attraction points, one should try to minimize the sum Σ_{j∈J_1} q_j(h_j(x)), whereas from the point of view of the repulsion points one should try to maximize the sum Σ_{j∈J_2} q_j(h_j(x)). Under these conditions, a reasonable objective of the decision maker may be to locate the facility so as to minimize the quantity

$$\sum_{j \in J_1} q_j(h_j(x)) - \sum_{j \in J_2} q_j(h_j(x))$$

over R^n_+. Denoting the right derivative of q_j(t) at 0 by q_j^+(0) and assuming q_j^+(0) < +∞ for all j, it can easily be seen that each function g_j(x) := K_j h_j(x) + q_j[h_j(x)] is convex for K_j ≥ q_j^+(0), and so we come up with the dc optimization problem

$$\min \{ G(x) - H(x) \mid x \in \mathbb{R}^n_+ \}, \qquad (1)$$

where G(x), H(x) are convex functions defined by

$$G(x) = \sum_{j \in J_1} g_j(x) + \sum_{j \in J_2} K_j h_j(x), \qquad H(x) = \sum_{j \in J_1} K_j h_j(x) + \sum_{j \in J_2} g_j(x).$$
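The decomposition can be checked numerically. A small sketch with hypothetical data, taking q_j(t) = log(1 + t) (concave increasing with right derivative 1 at 0, so K_j = 1 suffices):

```python
import numpy as np

# Hypothetical data: two attraction points, one repulsion point in the plane.
A = [np.array([0.0, 1.0]), np.array([2.0, 0.0])]   # indices in J1
R = [np.array([1.0, 1.0])]                          # indices in J2

q = lambda t: np.log1p(t)                  # concave increasing, q'(0+) = 1
h = lambda a, x: np.linalg.norm(x - a)     # convex distance function
g = lambda a, x: h(a, x) + q(h(a, x))      # g_j = K_j h_j + q_j(h_j) with K_j = 1

def objective(x):   # attractions minus repulsions
    return sum(q(h(a, x)) for a in A) - sum(q(h(a, x)) for a in R)

def G(x): return sum(g(a, x) for a in A) + sum(h(a, x) for a in R)
def H(x): return sum(h(a, x) for a in A) + sum(g(a, x) for a in R)

x = np.array([0.5, 0.5])
assert abs(objective(x) - (G(x) - H(x))) < 1e-12   # dc decomposition agrees
```

The identity holds at every x by construction; the point of the decomposition is that G and H are convex, so dc programming techniques apply to (1).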

Problems with the above objective function are called minisum problems. In other circumstances, instead of minimizing the cost, one may seek to maximize the total attraction

$$\sum_{j \in J_1} q_j[h_j(x)] - \sum_{j \in J_2} q_j[h_j(x)],$$

where each q_j is a convex decreasing function. Assuming q_j^+(0) > −∞, the problem is then

$$\max \{ \tilde{G}(x) - \tilde{H}(x) \mid x \in \mathbb{R}^n_+ \}, \qquad (2)$$

where G̃(x), H̃(x) are now the convex functions

$$\tilde{G}(x) = \sum_{j \in J_1} g_j(x) + \sum_{j \in J_2} K_j h_j(x), \qquad \tilde{H}(x) = \sum_{j \in J_1} K_j h_j(x) + \sum_{j \in J_2} g_j(x).$$

Obviously, any maxisum problem can be converted into a minisum one and vice versa. Most problems studied in the literature are minisum, under much more restricted assumptions than in the above setting (see [16] and references therein). Weber's classical formulation corresponds to the case J_2 = ∅ (no repulsion points) and h_j(x) = ‖x − a^j‖, q_j(t) = w_j t, w_j ≥ 0, for all j. The cases J_2 ≠ ∅ with q_j(t) nonlinear have begun to be investigated only recently, motivated by growing concerns about the environment.

Maximin and Minimax

When siting emergency services, like a fire station, one does not want to maximize the overall attraction but rather to guarantee for every user a minimal attraction as large as possible. The problem, often referred to as the p-center problem, can be formulated as

$$\max \Big\{ \min_{j=1,\dots,m} q_j[h_j(x)] \ \Big|\ x \in \mathbb{R}^n_+ \Big\}, \qquad (3)$$

where the q_j(t) are convex decreasing functions (maximin problem). Assuming |q_j^+(0)| < ∞ for all j as previously, we have the dc representation q_j[h_j(x)] = g_j(x) − K_j h_j(x); hence

$$\min_{j=1,\dots,m} q_j[h_j(x)] = \sum_{j=1}^{m} g_j(x) - \max_{j=1,\dots,m} \Big[ K_j h_j(x) + \sum_{i \ne j} g_i(x) \Big],$$

and so (3) is again a dc optimization problem. By contrast, when siting an obnoxious facility, one wants to minimize the maximal attraction to a user, so the optimization problem to be solved is

$$\min \Big\{ \max_{j=1,\dots,m} q_j[h_j(x)] \ \Big|\ x \in \mathbb{R}^n_+ \Big\}, \qquad (4)$$

where the q_j(t) are concave increasing functions (minimax problem). Again, assuming |q_j^+(0)| < ∞ for all j, we have the dc representation q_j[h_j(x)] = K_j h_j(x) − g_j(x), and so

$$\max_{j=1,\dots,m} q_j[h_j(x)] = \max_{j=1,\dots,m} \Big[ K_j h_j(x) + \sum_{i \ne j} g_i(x) \Big] - \sum_{j=1}^{m} g_j(x),$$

i. e., the minmax location problem (4) is again a dc optimization problem.

A special maximin location problem worth mentioning is the design centering problem encountered in engineering design. Given a compact convex set B ⊂ R^n containing 0 in its interior and m compact convex sets D_j, j = 1, ..., m, contained in a compact convex set C ⊂ R^n, find x ∈ C so as to maximize

$$r(x) = \min_{j=0,1,\dots,m} r_j(x),$$

where r_j(x) = min{ p(y − x) : y ∈ D_j }, p : R^n → R_+ is the gauge of B, and D_0 = R^n \ C. It can be shown [17] that the function r_0(x) is concave while r_1(x), ..., r_m(x) are convex, so this can be viewed as a maximin problem in which each D_j is a user and r_j(x) is the distance from point x to user j.

Constrained Location

In the real world many factors may set restrictions on the facility sites. Therefore, practical location problems are often constrained.

Location on Union of Convex Sets

The simplest type of restriction is that the facility can be located only in one of several given convex regions C_1, ..., C_k [8]. If C_i = {x : c_i(x) ≤ 0}, with the c_i(x) being convex functions, then the constraint x ∈ ∪_{i=1}^k C_i can be expressed as

$$\min_{i=1,\dots,k} c_i(x) \le 0,$$

which is a dc constraint.

Location on Area with Forbidden Regions

In other circumstances, the facility can be located only outside some forbidden regions that are, for instance, open convex sets C_i^o = {x : c_i(x) < 0}, with the c_i(x) being convex functions (see, e.g., [2]). Since the constraint x ∉ ∪_{i=1}^k C_i^o is equivalent to min_{i=1,...,k} c_i(x) ≥ 0, this is again a dc constraint.

General Constrained Location Problem

The most general situation occurs when the constraint set is a compact, not necessarily convex, set. However, a striking result of dc analysis shows that even in this general case the constraint can be expressed as a dc inequality [12,22]. Of course the corresponding dc optimization problem is very hard. Although a method (the relief indicator method [18]) exists for dealing with general nonconvex constraints, so far it only works in low dimension.

Multiple Source

When more than one facility is to be located, the objective function depends upon whether these facilities provide the same service or different services to the users. If there are r ≥ 2 facilities providing the same service, these facilities are called sources. Each user is then served by the closest source. So if x^i is the unknown location of the ith facility and X = (x^1, ..., x^r) ∈ (R^2)^r, then the overall attraction is

$$\sum_{j \in J_1} q_j[\tilde{h}_j(X)] - \sum_{j \in J_2} q_j[\tilde{h}_j(X)], \qquad (5)$$

where h̃_j(X) = min{ h_j(x^i) : i = 1, ..., r } and q_j, h_j are as previously. Since h̃_j(X) = Σ_{i=1}^r h_j(x^i) − max_{l=1,...,r} Σ_{i≠l} h_j(x^i), the first term in (5) is the dc function

$$\sum_{j \in J_1} g_j(X) - \sum_{j \in J_1} K_j \Big[ \sum_{i=1}^{r} h_j(x^i) + \max_{l=1,\dots,r} \sum_{i \ne l} h_j(x^i) \Big],$$

where K_j ≥ |q_j^+(0)| and

$$g_j(X) = q_j[\tilde{h}_j(X)] + K_j \Big( \sum_{i=1}^{r} h_j(x^i) + \max_{l=1,\dots,r} \sum_{i \ne l} h_j(x^i) \Big)$$

is a convex function. Similarly for the second term in (5). Hence the objective function in the r-source problem is a dc function on (R^2)^r. The multisource problem is usually referred to as the generalized Weber problem, or also the r-median problem when J_2 = ∅. Traditionally it is often viewed as a location-allocation problem and formulated as a mixed 0-1 integer programming problem (see, e.g., [16]).

Clustering

In many practical situations we have a set of objects of a certain kind that we want to classify into r ≥ 2 groups (clusters), each including elements close to each other in some well-defined sense. In the simplest case, this gives rise to the following problem: for a given finite set of points a^1, ..., a^m ∈ R^n, find r cluster centers x^i ∈ R^n, i = 1, ..., r, such that the sum of the minima over i ∈ {1, ..., r} of the "distance" between each point a^j and the cluster centers x^i, i = 1, ..., r, is minimized. If d(a, x) denotes the distance from a to x, then the problem is

$$\min \Big\{ \sum_{j=1}^{m} \min_{i=1,\dots,r} d(a^j, x^i) : x^i \in [0, b] \Big\}. \qquad (6)$$

Formally, this is nothing but the r-median problem, i.e., the generalized Weber problem with J_2 = ∅. If d(a, x) = Σ_{i=1}^n |a_i − x_i|, then, using the equality |a_i − x_i| = min{ y_i : −y_i ≤ a_i − x_i ≤ y_i }, problem (6) can be written as

$$\min \sum_{j=1}^{m} \min_{l=1,\dots,r} \Big( \sum_{i=1}^{n} y_i^{jl} \Big)$$

subject to −y_i^{jl} ≤ a_i^j − x_i^l ≤ y_i^{jl}. Furthermore, d(a, x) = u(a, x) − Σ_{i=1}^n x_i, and since u(a, x) = d(a, x) + Σ_{i=1}^n x_i and Σ_{i=1}^n x_i are both increasing functions, it follows that d(a, x) is a dm (difference of monotonic) function, and, hence, (6) is a monotonic optimization problem.

Multiple Facility

When the r ≥ 2 facilities to be located provide different services, aside from the costs due to interactions between facilities and users, one should also consider the costs due to pairwise interactions between facilities. The latter costs can be expressed by functions of the form φ_il[h_il(x^i, x^l)], where again the h_il(x^i, x^l) are convex nonnegative valued functions and the φ_il(t) are concave increasing functions on [0, +∞) with finite right derivatives at 0. The total cost one would like to minimize is then

$$\sum_{i=1}^{r} F_i(x^i) + \sum_{i < l} \varphi_{il}[h_{il}(x^i, x^l)], \qquad (7)$$
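Objective (6) itself is cheap to evaluate. A minimal sketch (my illustration) for the ℓ1 distance used in the linear-programming rewrite above:

```python
import numpy as np

def r_median_objective(centers, points):
    """Objective (6): sum over points of the l1-distance to the nearest center."""
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)  # shape (m, r)
    return d.min(axis=1).sum()

points = np.array([[0.0, 0.0], [1.0, 0.2], [5.0, 5.0]])
centers = np.array([[0.5, 0.1], [5.0, 5.0]])
print(r_median_objective(centers, points))   # 0.6 + 0.6 + 0.0 = 1.2
```

The inner min over centers is what makes (6) nonconvex in the x^i, which is why the dm (difference of monotonic) structure noted above is exploited instead.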

Global Optimization Methods for Harmonic Retrieval

≥ g̃. Terminate if ω(Z) < ε, ∀Z in L. Denote the first pair of the list by (Y, y). Compute m := mid Y and g̃ = min(g̃, ub G(m)). RETURN END interval method

The interval method (IM) of Hansen [3,7] is the particular interval method that will be used for locating the LS estimates of the HR problem; it is outlined in Table 1. In [7] it was proven that convergence to the global minimum is achieved if w(G(X)) → 0 as w(X) → 0.

Interval Method for Solving HR

To apply the IM to solving the HR problem, the objective function (3) must be placed in its inclusion form:

$$J(\Theta) = \sum_{t=1}^{N} \Big[ y(t) - \sum_{k=1}^{K} A_k \sin(2\pi F_k t + \Phi_k) \Big]^2, \qquad (5)$$

where Θ = [A_1, ..., A_K, F_1, ..., F_K, Φ_1, ..., Φ_K] and A_k, F_k, and Φ_k are the interval counterparts of a_k, f_k, and φ_k, respectively. Throughout this paper capital letters represent interval variables that correspond to their real variable equivalents. The initial interval, Θ_0, is chosen such that it encompasses the global minimum. This is accomplished by choosing an interval that is determined from a priori information or from other high-resolution HR methods [4]. The IM of Hansen, described in the previous section, is used to determine the global minimum, Θ*, of (5). The objective function (5) for a single frequency, with phase and amplitude held constant, is plotted in Fig. 1.

Global Optimization Methods for Harmonic Retrieval, Figure 1 Objective function of a single sinusoid

It can easily be seen that this represents a very difficult but practical problem for global optimization.

Simulations

In this section, a numerical experiment will be demonstrated to show the performance of the IM for solving the HR problem (P). The experiments consist of estimating the sinusoid parameters for the following data,

$$y(t) = 1.0 \sin(2\pi (0.2) t + 0.0) + n(t), \quad t = 1, \dots, 35, \qquad (6)$$
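For concreteness, a short sketch (my illustration, with an assumed noise level) that generates data of the form (6) and evaluates the least-squares objective:

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, f, phi = 35, 1.0, 0.2, 0.0
t = np.arange(1, N + 1)
y = a * np.sin(2 * np.pi * f * t + phi) + 0.1 * rng.standard_normal(N)  # data (6)

def J(theta):
    """Least-squares HR objective; theta is a list of (a_k, f_k, phi_k) tuples."""
    resid = y.astype(float).copy()
    for ak, fk, pk in theta:
        resid -= ak * np.sin(2 * np.pi * fk * t + pk)
    return float(np.sum(resid ** 2))

print(J([(1.0, 0.2, 0.0)]))   # near the noise floor at the true parameters
```

Evaluating J on a grid of frequencies reproduces the highly multimodal landscape shown in Fig. 1, which is why a global method such as the IM is needed.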

where n(t) is white Gaussian noise. We choose the initial box for the IM algorithm to be Θ_0 = [A, F, Φ]^T = [[0.7, 1.2], [0.1, 0.3], [0, 0.4]]^T. The signal-to-noise ratio (SNR) is defined as

$$10 \log \Big[ \sum_{k=1}^{K} \frac{0.5\,(a_k)^2}{\sigma_n^2} \Big],$$

where σ_n² is the variance of the noise. The results of this simulation, shown in Table 2, are described in terms of sample mean and standard deviation based on 50 Monte-Carlo (MC) runs.

Global Optimization Methods for Harmonic Retrieval, Table 2 IM estimates

IM: N = 35 (50 MC runs)
SNR          10           5            0
a* = 1.0     1.0155       1.0447       1.0633
             ±0.0518      ±0.0913      ±0.1365
f* = 0.20    0.1995       0.1993       0.1989
             ±6.591×10⁻⁴  ±0.0012      ±0.0017
φ* = 0       0.0654       0.0839       0.1193
             ±0.0564      ±0.0975      ±0.1319

Global Optimization Methods for Harmonic Retrieval, Table 3 IQML estimates

IQML: N = 35 (50 MC runs)
SNR          10           5            0
a* = 1.0     1.0080       0.9862       0.8919
             ±0.0533      ±0.1479      ±0.3230
f* = 0.20    0.1998       0.1970       0.1728
             ±7.623×10⁻⁴  ±0.0202      ±0.0839
φ* = 0       0.0141       0.0126       0.5288
             ±0.0949      ±0.2852      ±1.1429

These results are based on the midpoints of Θ̂*. Note that the final estimates Θ̂* are very close to the true values Θ* with a small standard deviation. In comparison with IQML (see Table 3), the IM fares considerably better in both mean and standard deviation. This is particularly notable when comparing the frequency component, which represents the most important feature of harmonic retrieval. The convergence rate of the IM is sensitive to the order, K, of the HR problem. In fact, the dimensionality of the parameter space increases at a rate of 3K. Thus, as K increases, the convergence rate becomes prohibitively slow. The curse of dimensionality can be mitigated by decomposing and parallelizing the problem using the EM algorithm, as described in the next section.

EMIM

The detailed development of the EM algorithm [2] is well known and will be outlined here as part of the development of the EMIM algorithm for solving the HR problem. To determine the LS estimates of the sinusoidal parameters, the EM algorithm first decomposes the observed data y(t) into its signal components (E step) and then estimates the parameters of each signal component separately (M step). The algorithm iterates back and forth between the E step and the M step, using the current estimate to decompose the observed data better and thus improve the next parameter estimate. For the HR problem, the incomplete data is the observed data y(1), ..., y(N). The complete data is modeled as the following K data records:

$$y_k(t) = a_k \sin(2\pi f_k t + \varphi_k) + n_k(t), \quad k = 1, \dots, K,$$

where n_k(t) = β_k [ y(t) − Σ_{k=1}^K a_k sin(2π f_k t + φ_k) ]. The β_k's are arbitrary real-valued scalars satisfying Σ_{k=1}^K β_k = 1 and β_k ≥ 0. Thus Σ_{k=1}^K n_k(t) = n(t), for t = 1, ..., N. The EM algorithm, beginning with n = 0, is represented by the following two steps:

E) For k = 1, ..., K, compute

$$\hat{y}_k^{(n)}(t) = \hat{a}_k^{(n)} \sin(2\pi \hat{f}_k^{(n)} t + \hat{\varphi}_k^{(n)}) + \beta_k \Big[ y(t) - \sum_{l=1}^{K} \hat{a}_l^{(n)} \sin(2\pi \hat{f}_l^{(n)} t + \hat{\varphi}_l^{(n)}) \Big]. \qquad (7)$$

M) For k = 1, ..., K,

$$\hat{\theta}_k^{(n+1)} = \arg \min_{a_k, f_k, \varphi_k} J_k^{(n)}, \qquad (8)$$

where

$$J_k^{(n)} = \sum_{t=1}^{N} \big( \hat{y}_k^{(n)}(t) - a_k \sin(2\pi f_k t + \varphi_k) \big)^2. \qquad (9)$$

The parameter vector θ̂_k^(n) = [â_k^(n), f̂_k^(n), φ̂_k^(n)]^T is the estimate for θ_k = [a_k, f_k, φ_k]^T after n iterations. In the original HR problem, we have to search the (3 × K)-dimensional parameter space to find the minimum value of the least squares objective function. But after the EM algorithm decomposes the HR problem into K smaller subproblems, we only have to solve K subproblems, each of which requires the search of a 3-dimensional parameter space to find the global optimal point(s). This results in a significant reduction in computational complexity.
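The E/M alternation is easy to express in code. Below is a minimal sketch (my illustration; a simple grid search stands in for the interval method used in the article, and β_k is taken uniform):

```python
import itertools
import numpy as np

def emim_sketch(y, t, K, theta0, grids, n_iter=20):
    """EM decomposition for harmonic retrieval: E step as in (7), M step as in (8)."""
    beta = np.full(K, 1.0 / K)            # beta_k >= 0, sum(beta) = 1
    sig = lambda a, f, p: a * np.sin(2 * np.pi * f * t + p)
    theta = list(theta0)                  # K tuples (a_k, f_k, phi_k); keep them distinct
    for _ in range(n_iter):
        total = sum(sig(*th) for th in theta)
        for k in range(K):
            yk = sig(*theta[k]) + beta[k] * (y - total)       # E step, eq. (7)
            theta[k] = min(itertools.product(*grids),          # M step, eq. (8)
                           key=lambda th: float(np.sum((yk - sig(*th)) ** 2)))
    return theta
```

Per the caveat discussed below, with uniform β_k the initial guesses in `theta0` must differ from component to component, otherwise the decomposed records coincide and the estimates collapse onto each other.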

To solve the minimization problem in the M step, we resort to using the IM to find the final interval containing the point that minimizes the objective function. Since the IM has been proven to converge to the global optimum for continuous objective functions [3], the algorithm will not be trapped in a local extremum. Needed in the IM algorithm is the inclusion function of the objective function (9), which is constructed by forming the natural interval extension [3,7] of J_k:

$$\mathbf{J}_k^{(n)} = \sum_{t=1}^{N} \big( \hat{y}_k^{(n)}(t) - \mathbf{A}_k \sin(2\pi \mathbf{F}_k t + \boldsymbol{\Phi}_k) \big)^2, \qquad (10)$$

where A_k, F_k, and Φ_k are the interval counterparts of a_k, f_k, and φ_k, respectively. The initial values θ̂_k^(0) = [â_k^(0), f̂_k^(0), φ̂_k^(0)]^T are arbitrarily guessed or can come from other high-resolution estimation methods. The initial interval Θ_{k,0} = [A_{k,0}, F_{k,0}, Φ_{k,0}]^T for the M) step is the region over which the minimization is carried out. This initial interval Θ_{k,0} is used at the beginning of each M) step of the EMIM algorithm. At the (n+1)st iteration of EMIM, the IM partitions Θ_{k,0} iteratively to find the final interval estimate Θ̂_k^(n+1). The midpoint m(Θ̂_k^(n+1)) = θ̂_k^(n+1) will be used as the parameter estimate to compute ŷ_k^(n+1)(t) for the next iteration of the EMIM algorithm. The process is repeated until Σ_{k=1}^K |θ̂_k^(n+1) − θ̂_k^(n)| ≤ ε, where ε is chosen by the user. Consider the case where θ̂_i^(0) = θ̂_j^(0) and β_i = β_j.

  k  , where  is chosen by the kD1  k user. Consider the case where b  (0) D b  (0) i j and ˇ i = ˇ j .

G

determine the frequencies. We choose the initial box for the EMIM algorithm to be: [ 1;0 ; 2;0 ]> D [A1;0 ; F1;0 ; ˚1;0 ; A2;0 ; F2;0 ; ˚2;0 ]> D [[0:7 1:2]; [0:1 0:3]; [0 0:4]; [0:7 1:2]; [0:1 0:3]; [0 0:4]]> and ˇ 1 = 0.09, ˇ 2 = 0.91. The signal-to-noise-ratio (SNR) is defined as # " K X (ak )2 0:5 2 ; 10 log n kD1

where  2n is the variance of the noise. If no a priori information about the possible values of the sinusoid parameters is available, the full range of possible values for the frequency, the phase, and the amplitude must be used as the initial intervals. Utilizing the full range will impose no difficulty when very fast computing engines are used. However, other high resolution techniques can be used to yield a smaller and more cogent initial interval. Using 50 MC runs, we computed the sample means and standard deviations for the EMIM and the IQML algorithms. (See Table 4 and Table 5, respectively). As for the EMIM, the mid-points of the final interval estimates are considered as the final estimates, thus the

It is straightforward to see that b  (n)  (n) i (t) D b j (t) and Ji = Jj in the E)-step and M)-step, respectively. Thus, b (nC1) for all n which means that the final b (nC1) D



Global Optimization Methods for Harmonic Retrieval, Table 4 EMIM estimates

estimates for  i and  j will be the same. In order to avoid this problem, ˇ i must not equal ˇ j or b  (0) i must (0) b not equal  j in order to fully exploit the capability of the EMIM algorithm.

EMIM: N = 35;  = 106 (50MC runs) SNR 10 5 0  a1 = 1:0 1:0305 1:0235 1:0263 ˙0:0992 ˙0:1389 ˙0:1622 f 1 = :20 0:1993 0:1993 0:1969 ˙2:209  ˙4:119  ˙0:0110 104 104  1 = 0 0:0631 0:0851 0:1369 ˙0:0764 ˙0:1152 ˙0:1609 a2 = 1:0 1:0284 1:0501 1:0995 ˙0:0744 ˙0:1036 ˙0:1054 f 2 = :22 0:2192 0:2194 0:2182 ˙0:0012 ˙0:0016 ˙0:0051 2 = 0 0:0746 0:0757 0:1314 ˙0:1177 ˙0:1224 ˙0:1662

i

j

Simulations Our experiments consist of estimating the sinusoidal parameters for the following data, y(t) D 1:0 sin(2 (0:2)t C 0:0) C 1:0 sin(2 (0:22)t C 0:0) C n(t); t D 1; : : : ; 35; where n(t) is white Gaussian noise. Since |0.2  0.22| < 1/35 = 0.02857, the periodogram cannot be used to

1363

1364

G

Global Optimization Methods for Systems of Nonlinear Equations

Global Optimization Methods for Harmonic Retrieval, Table 5 IQML estimates

See also  Signal Processing with Higher Order Statistics

IQML: N = 35 SNR 10  a1 = 1:0 0:9549 ˙0:3283 f 1 = :20 0:1963 ˙0:0137 1 = 0 0:3332 ˙0:6117 a2 = 1:0 0:9013 ˙0:3732 f 2 = :22 0:2428 ˙0:0685 2 = 0 0:0886 ˙0:4852

(50 MC runs) 5 0 0:6615 0:7404 ˙0:2908 ˙0:2778 0:1707 0:1404 ˙0:0476 ˙0:0836 0:9323 0:5472 ˙1:0843 ˙1:0659 0:7567 0:8582 ˙0:2683 ˙0:2788 0:2559 0:2721 ˙0:0867 ˙0:0985 0:0079 0:3123 ˙0:7185 ˙0:8358

sample mean and variance can be calculated accordingly. Note that the EMIM generates estimates which have mean values very close to the true parameter values and relatively very small variances. As for the IQML, its variance for each value of SNR is significantly larger than the corresponding EMIM. Clearly, EMIM outperforms IQML by providing estimates that are less biased with smaller variances.

Conclusion In comparison between the two types of IM algorithms with the IQML method, it was shown that both the IM and EMIM algorithms represent a powerful tool for solving the HR problem. Furthermore, it has been noted that by decomposing the problem by the EMIM algorithm does not degrade the performance of using the IM. We have shown experimentally that the IM and EMIM algorithms are robust for very short data records and low SNR. Nevertheless, if the dimensionality is low or convergence to the ML estimates is desired, then the IM algorithm can be used. For either EMIM or IM, convergence time can be improved by generating initial interval of smaller widths by using other high resolution HR methods. Furthermore, using a multi-processor computer to implement the decomposed sub-problems in parallel can also reduce the execution time.
Global Optimization Methods for Systems of Nonlinear Equations GO for SNE NGUYEN V. THOAI University Trier, Trier, Germany MSC2000: 65H10, 90C26, 90C30 Article Outline Keywords See also References Keywords Systems of nonlinear equations; Global optimization The problem of finding a solution of a system of equations and/or system of inequalities is one of the main re-

Global Optimization Methods for Systems of Nonlinear Equations

search subjects in numerical analysis and optimization. The source of systems of equations and/or inequalities contains many ‘real-world’ problems ([2,7]), the nonlinear complementarity problem (cf. also  Generalized nonlinear complementarity problem), the variational inequality problem (cf. also  Variational inequalities) over a convex set, Karush-Kuhn-Tucker systems, the feasibility problem, the problem of computing a Brouwer’s fixed point ([10,15]). In general, a system of nonlinear equations and/or inequalities is given by

(SN E)

8 ˆ ˆ < h i (x) D 0; ˆ ˆ :

g j (x)  0;

i 2 I; j 2 J; x 2 X;

where I, J are finite index sets, X Rn is a convex set, and hi (i 2 I), g j (j 2 J) are nonlinear functions defined on a suitable set containing X. Solution methods for (SNE), which are based on convex and nonsmooth optimization techniques, and fixed point algorithms can be found in [2,3,4,5,14,15], and references given therein. In order to apply global optimization methods for solving (SNE), one defines a vector function h: Rn ! R|I| having components hi (x)(i 2 I), a function ˚ f (x) D maxfkh(x)k ; g j (x) : j 2 J) ; where k  k is any vector norm on R|I| , and considers the following global optimization problem (GOP) f  D min f f (x) : x 2 Xg : In particular, the function f in (GOP) can be defined by ˚ ˚ f (x) D max fjh i (x)j : i 2 Ig ; g j (x) : j 2 J) : In general, a vector x 2 Rn is a solution of (SNE) if and only if it is a global optimal solution of (GOP) and f  = f (x ) = 0. Thus, finding a solution of (SNE) can be replaced by computing a global optimal solution of (GOP). In the case that I = ;, i. e., (SNE) is a system of inequalities, global optimization algorithms to (GOP) will terminate whenever a feasible point x 2 X is found satisfying f (x)  0. While applying a global optimization algorithm to (GOP), if it is pointed out that f  > 0

G

(e. g., a lower bound  of f  can be computed such that  > 0), then obviously (SNE) has no solution. There are three main classes of (SNE), which can be solved by implementable methods in global optimization: i) The functions hi (i 2 I) and g j (j 2 J) are all d.c. (a function is called d.c. if it can be expressed as the difference of two convex functions, see  D.C. programming). ii) The functions hi (i 2 I) and g j (j 2 J) are all Lipschitzian with Lipschitz constants Li (i 2 I) and M j (j 2 J), respectively. iii) The corresponding problem (GOP) can be replaced by a convex relaxation problem. For class i), the function f in (GOP) is d.c., and one can find an explicit form of f as the difference of two convex functions, so that d.c. programming techniques can be applied ([9,11,12,18,19]). For class ii), if in the definition of f , `p -norms are used, i. e. 8 ! 1p ˆ X ˆ ˆ p < ; 1p (c> 1 x C c10 )(c2 x C c20 )

s.t.

x 2 D;

ˆ ˆ :s.t.

p Y

f i (x)

iD1

(2)

x 2 D;

ˆ ˆ :s.t.

p X

f 2i1 (x) f 2i (x) C g(x)

iD1

iD1

where D, the f i s and g are the same as in (3). As long as p is a small number, all of these nonconvex programs can be solved in a practical amount of time even if n exceeds a few hundreds. Linear Multiplicative Program Problem (1), though simple looking, is NP-hard (cf. also  Complexity theory;  Complexity classes in optimization) as shown in [11]. There are two major methods, each of which is based on a variant of parametric simplex algorithms for linear programming [12]. The first method introduces a parameter   0 and transforms (1) into an equivalent problem: 8 ˆ ˆ 0 for any x 2 D; the convex multiplicative program 8 ˆ ˆ 0; iii) Two distinct functions in F have at most one intersection point over  > 0. Suppose [ s ,  t ]   is an interval containing   . Since f 1 and f 2 are convex, F(, ) is also a convex function for any  > 0; and hence ( s ) and ( t ) can be computed by convex programming. For ( s , ( s )) and ( t , ( t )), let us construct a function in F according to i): u(; s ;  t )



 2 arg min f f 1 (x()) f 2(x()) :  2 (0; 1)g q   gives   D (1  ) ; and x( ) is an optimal solution to (1). Under some probabilistic assumptions, the average number of simplex pivots needed to solve a linear program with a single parameter is known to be polynomial in the problem input length [12]. Hence, (1) can also be solved in polynomial time on the average, which contrasts sharply with the result of the worst-case analysis.

ˇ ; 

where ˛, ˇ 2 R. The function defined by (8) is a pointwise minimum of some functions in F such that ˛ = f 1 (x) and ˇ = f 2 (x) for x 2 D. The family F possesses the following properties: i) Any two points ( s , s ), ( t , t ) 2 R2 , with 0 <  s <  t , uniquely determine

(8)

over  > 0. Since the right-hand side of (8) is a linear program, we can locate   using the parametric objective simplex algorithm. In fact, noting that  = /( + 1/) maps  = {:  > 0} to a unit interval {: 0 <  < 1}, we solve ˚ > min c> 1 x C (1  )c2 x : x 2 D

are nonlinear functions. One effective approach in this case is branch and bound on the set of parameter values  = {:  > 0} [7] (cf. also  Integer programming: Branch and bound methods). Let F denote the family of functions of the form:

D

(s )s  ( t ) t C s2   t2

(s )/s  ( t )/ t /: 1/s2  1/ t2

From iii) we have u(; s ;  t ) 

();

8 2 [s ;  t ]:

Let  m 2 arg min {u(;  s ,  t ):  2 [ s ,  t ]} and ( u(; s ; m ) if 0 <   m ; u2 () D u(; m ;  t ) if   m : over [ s ,  t ] and is better Then u2 underestimates than u1 = u(;  s ,  t ) in the sense: u1 ()  u2 () 

();

8 2 [s ;  t ]:

In this way, as improving the underestimator of successively, we can generate the sequence of minimum points of uk s convergent to   .

Global Optimization in Multiplicative Programming

The parametrization (7) can further be extended to (2) with p  2 [8] as follows: 8 p X ˆ ˆ ˆ min F(x; )  i f i (x) ˆ ˆ ˆ ˆ < iD1 (10) s.t. x 2 D; ˆ ˆ p ˆ Y ˆ ˆ ˆ  i  1;   0: ˆ : iD1

Karush–Kuhn–Tucker conditions with respect to  imply the equivalence between (2) and (10). Let () D min fF(x; ) : x 2 Dg : Then (10) reduces to a problem with p variables: 8 ˆ () ˆ  ˛˜ k > 

x k1



x k  x k1



xk



 xk

C x k1  2x 0

˛ k ; : : : ; ˛ j ; : : : ; ˛ N1 g be the sequence of ˛ values determining qk (x). Let q˜ k (x) be the function defined by the sequence of ˛ values f˛ 1 ; : : : ; ˛ k ; : : : ; ˛˜ j ; : : : ; ˛ N1 g where ˛˜ j < 0. A stationary point of q˜ k (x) does not exist on the interval [x k1 ; x k ] if either of the following bounds on ˛˜ j hold:



  x k C x k1  2x N

if   0 (14) if  > 0

where    D ˇk x N  x0    C ˛ k x k  x k1 x k1 C x k  x 0  x N : Lemma 3 Consider two intervals [x j1 ; x j ] and [x k ; x k1 ] where j > k. Let ˛ k < 0, and f˛ 1 ; : : : ;

    x N  x 0 ˛ k x k  x k1 C ˇ k   ˛˜ >  j x  x j1 x j C x j1  2x N    ˛ j x j  x j1 x j1 C x j  2x N   ; C  j x  x j1 x j C x j1  2x N      N x  x 0 ˛ k x k  x k1  ˇ k j   ˛˜ <   j x  x j1 x j C x j1  2x N    ˛ j x j  x j1 x j1 C x j  2x N   : C  j x  x j1 x j C x j1  2x N j

When q(x) is concave on a set of intervals and is guaranteed to have no stationary point on the remainder of the intervals, q(x) is monotonically nondecreasing between x0 and a global maximum x  and monotonically nonincreasing between x  and xN . Under the aforementioned conditions, the perturbation function q(x) is always non-negative and, thus, (x) is a valid underestimator of f (x) [11]. Illustrative Example As an illustration, we present here an example from Meyer and Floudas [11]. It involves the well-known Lennard–Jones potential energy function: f (x) D

2 1  6 12 x x

in the interval [x; x] D [0:85; 2:00]. The first term of this function is a convex function and dominates when x is small, while the second term is a concave function which dominates when x is large. The minimum eigenvalue of this function in an interval [x; x] can be calculated explicitly as follows: 8 156 84 ˆ ˆ if x  1:21707 ˆ 14  8 ˆ ˆ x  e G > DG uD u; (25) eD en G> n where e, u are the deformation and displacement vectors respectively. The linear material constitutive law for the structure reads: s D K0 (e  e0 );

(26)

where K 0 is the natural and stiffness flexibility matrix and e0 is the initial deformation vector. The nonlinear material law is considered in the form: s n 2 @CL n (e n ):

(27)

Here  n (), is a general nonconvex superpotential and summation over all nonlinear elements gives the total strain energy contribution of them as: ˚n (e n ) D

q X

n(i) (e n ):

(28)

iD1

Finally classical support boundary conditions complete the description of the problem. The discretized form of the virtual work equation reads:  >  s> (e   e) C s> n (e n  e n ) D p (u  u);

8 e  ; u  ; e n :

(29)

Entering the elasticity law (26) into the virtual work equation (29), and using (25) we get: u > GK0> G > (u   u)  (p C GK0 e0 )> (u   u)   C s> n (e n  e n ) D 0; 8u 2 Vad ;

(30)



H

Hemivariational Inequalities: Applications in Mechanics

where K = G> K > 0 G denotes the stiffness matrix of the structure, p D p C GK0 e0 denotes the nodal equivalent loading vector and V ad includes all support boundary conditions of the structure. Further one considers the nonlinear elements (27) in the inequality form:  o  s> n (e n  e n )  ˚ n (e n  e n );

8e n ;

(31)

where ˚ on (en  en ) is the directional derivative of the potential ˚ n . Thus the following discretized hemivariational inequality is obtained: 8 ˆ Find ˆ ˆ ˆ ˆ ˆ ˆ < s.t. ˆ ˆ ˆ ˆ ˆ ˆ ˆ :

kinematically admissible displacements u 2 Vad u > K(u   u)  p> (u   u) C˚no (u n  8u  2 Vad :

(32)

u n )  0;

Equivalently a substationarity problem for the total potential energy can be written: (

Find

u 2 Vad

s.t.

˘ (u) D statv2Vad f˘ (v)g :

(33)

Here the potential energy reads ˘ (v) D 12 v > Kv  p> v C ˚n (v), where the first two terms (quadratic potential) are well-known in the structural analysis community. Other Applications in Mechanics Hemivariational inequalities have been used for the modeling and solution of delamination effects in composite and multilayered plates, in composite structures, for nonmonotone friction and skin effects and for nonlinear mechanics applications (for instance, in the analysis of semi-rigid joints in steel structures). Details can be found in [9,11,12,17,18] and in the citations given there. Another area of applications are nonconvex problems arising in elastoplasticity (cf. [4,5,6]). Some nonconvex problems in elastoplasticity have been treated by hemivariational inequality techniques in [17,18]. Mathematical results which are useful for the study of hemivariational inequalities can also be found in [2,3,13,14].

Numerical Algorithms A number of algorithms based on nonsmooth and nonconvex optimization concepts, on engineering methods or heuristics and on combination of these two approaches have been tested till now for the numerical solution of hemivariational inequality problems. Both finite elements and boundary elements have been used, the latter for boundary only nonlinear problems; see  Nonconvex energy functions: Hemivariational inequalities and [1,8,9,17]. See also  Generalized Monotonicity: Applications to Variational Inequalities and Equilibrium Problems  Hemivariational Inequalities: Eigenvalue Problems  Hemivariational Inequalities: Static Problems  Nonconvex Energy Functions: Hemivariational Inequalities  Nonconvex-nonsmooth Calculus of Variations  Quasidifferentiable Optimization  Quasidifferentiable Optimization: Algorithms for Hypodifferentiable Functions  Quasidifferentiable Optimization: Algorithms for QD Functions  Quasidifferentiable Optimization: Applications  Quasidifferentiable Optimization: Applications to Thermoelasticity  Quasidifferentiable Optimization: Calculus of Quasidifferentials  Quasidifferentiable Optimization: Codifferentiable Functions  Quasidifferentiable Optimization: Dini Derivatives, Clarke Derivatives  Quasidifferentiable Optimization: Exact Penalty Methods  Quasidifferentiable Optimization: Optimality Conditions  Quasidifferentiable Optimization: Stability of Dynamic Systems  Quasidifferentiable Optimization: Variational Formulations  Quasivariational Inequalities  Sensitivity Analysis of Variational Inequality Problems  Solving Hemivariational Inequalities by Nonsmooth Optimization Methods

Hemivariational Inequalities: Eigenvalue Problems

 Variational Inequalities  Variational Inequalities: F. E. Approach  Variational Inequalities: Geometric Interpretation, Existence and Uniqueness  Variational Inequalities: Projected Dynamical System  Variational Principles

References 1. Demyanov VF, Stavroulakis GE, Polyakova LN, Panagiotopoulos PD (1996) Quasidifferentiability and nonsmooth modelling in mechanics, engineering and economics. Kluwer, Dordrecht 2. Goeleven D (1996) Noncoercive variational problems and related results. Addison-Wesley and Longman 3. Haslinger J, Miettinen M, Panagiotopoulos PD (1999) Finite element method for hemivariational inequalities. Kluwer, Dordrecht 4. Kim SJ, Oden JT (1984) Generalized potentials in finite elastoplasticity. Part I. Internat J Eng Sci 22:1235–1257 5. Kim SJ, Oden JT (1985) Generalized potentials in finite elastoplasticity. Part II. Internat J Eng Sci 23:515–530 6. Kuczma MS, Stein E (1994) On nonconvex problems in the theory of plasticity. Arch Mechanicky 46(4):603–627 7. Maier G, Novati G (1990) Extremum theorems for finitestep backward-difference analysis of elastic-plastic nonlinearly hardening solids. Internat J Plasticity 6:1–10 8. Miettinen M, Mäkelä MM, Haslinger J (1995) On numerical solution of hemivariational inequalities by nonsmooth optimization methods. J Global Optim 8(4):401–425 9. Mistakidis ES, Stavroulakis GE (1998) Nonconvex optimization in mechanics. Algorithms, heuristics and engineering applications by the F.E.M. Kluwer, Dordrecht 10. Moreau JJ (1968) La notion de sur-potentiel et les liaisons unilatérales enélastostatique. CR 267A:954–957 11. Moreau JJ, Panagiotopoulos PD (eds) (1988) Nonsmooth mechanics and applications. CISM, vol 302. Springer, Berlin 12. Moreau JJ, Panagiotopoulos PD, Strang G (eds) (1988) Topics in nonsmooth mechanics. Birkhäuser, Basel 13. Motreanu D, Panagiotopoulos PD (1999) Minimax theorems and qualitative properties of the solutions of hemivariational inequalities. Kluwer, Dordrecht 14. Naniewicz Z, Panagiotopoulos PD (1995) Mathematical theory of hemivariational inequalities and applications. M. Dekker, New York 15. Oden JT, Kikuchi N (1988) Contact problems in elasticity: A study of variational inequalities and finite element methods. SIAM, Philadelphia 16. Panagiotopoulos PD (1983) Nonconvex energy functions. Hemivariational inequalities and substationary principles. Acta Mechanics 42:160–183

H

17. Panagiotopoulos PD (1985) Inequality problems in mechanics and applications. Convex and nonconvex energy functions. Birkhäuser, Basel 18. Panagiotopoulos PD (1993) Hemivariational inequalities. Applications in mechanics and engineering. Springer, Berlin 19. Washizu K (1968) Variational methods in elasticity and plasticity. Pergamon, Oxford

Hemivariational Inequalities: Eigenvalue Problems DANIEL GOELEVEN1 , DUMITRU MOTREANU2 1 I.R.E.M.I.A., University de la Réunion, Saint-Denis, France 2 Department Mat., University Al.I.Cuza, Iasi, Romania MSC2000: 49J52 Article Outline Keywords See also References Keywords Eigenvalue problem; Hemivariational inequalities; Critical point theory; Unilateral mechanics The theory of hemivariational inequalities has been created by P.D. Panagiotopoulos et al. (see [3,5,6,7]) for studying nonconvex and nonsmooth energy functions under nonmonotone multivalued laws. In this setting many relevant models lead to nonsmooth eigenvalue problems. A typical example is provided by the analysis of hysteresis phenomena. To illustrate it we present here the loading and unloading problems with hysteresis modes. Consider a plane linear elastic body ˝ with the boundary  whose mechanical behavior is described by the virtual displacement variable u and the scalar parameter  which determines the magnitude of the external loading on the system. The variable u must satisfy certain boundary or support conditions. For the sake of simplicity we assume that u = 0 on  , so the space of kinematically admissible displacements u is the Sobolev



H

Hemivariational Inequalities: Eigenvalue Problems

space H 10 (˝), that is the closure of C1 0 (˝) with respect to the L2 -norm of the gradient. Let us suppose that there exist a fundamental (pre-bifurcation) solution  7! u0 () and another solution  7! u() = u0 () + z() that coincide for  < 0 . Then one has lim ! 0 z() = 0 and the hysteresis bifurcation mode has the expression u1 (0 ) :D lim kz()k1 z():

Using the principle of virtual works together with physically realistic assumptions on the data  and S (see e. g. [7]), we obtain the relation

8v 2 H01 (˝): (2)

˝

It is justified to accept that a generalized nonmonotone reaction-displacement ( S, u) holds in ˝ expressed by the next law Z jo (u1 (0 ); v) dx  hS(u1 (0 )); vi ; ˝

Z a(u; v) C

jo (u; v) dx  

˝

Z uvdx; ˝

8v 2 V:

(1)

! 0

a(u1 (0 ); v) C hS(u1 (0 )); vi Z  0 u1 (0 )vdx D 0;

j: R ! R with an appropriate growth condition for its generalized gradient, find u 2 V and  2 R such that

Note that this last mathematical model can also be used to formulate various other problems in Mechanics like unilateral bending problems in elasticity. A general approach for studying the abstract eigenvalue problem (5) is the nonsmooth critical point theory as developed by K.-C. Chang [1]. In that paper the minimax principles in the critical point theory are extended from the smooth functionals (see [8]) to the case of locally Lipschitz functionals. In this respect we associate to Problem (5), for each , the locally Lipschitz functional I : V ! R,

I (u) D

8v 2 H01 (˝); (3)

1 a(u; u) C 2

Z j(u) dx  ˝

 2

Z

u 2 dx;

˝

8u 2 V: where j: R ! R stands for a locally Lipschitz function with the generalized gradient @j and the generalized directional derivative

(5)

(6)

Note that a critical point u of I , i. e. 0 2 @I (u), is a solution of (5) because

jo (x; y) D max fhz; yi : z 2 @ j(x)g (see [2]). Relations (2) and (3) yield the following eigenvalue problem in hemivariational inequality form: Find (u = u(), ) 2 H 10 (˝) × R such that Z a(u; v) C ˝

jo (u; v) dx  

@I (u)  a(u; )  (u; )L 2 Z Z C @ j(u) dx  a(u; )  (u; )L 2 C @ j(u) dx ˝

˝

Z uv dx; ˝

8v 2 H01 (˝): (4) Additional information concerning problems of type (4) can be found in [3,5,6,7]. Relation (4), as well as other models, motivates the study of abstract eigenvalue problems for hemivariational inequalities. The specific case of Problem (4) can be reformulated as follows: given a Banach space V embedded in L2 (˝), i. e. the space of square-integrable functions on ˝  RN , a continuous symmetric bilinear form a: V × V ! R and a locally Lipschitz function

(see [2]). Thus, to solve (5), it suffices to establish the existence of nontrivial critical points of the functional I introduced in (6). To this end we proceed along the lines in [4] by arguing in an abstract framework. Given a Banach space V and a bounded domain ˝ in Rm , m  1, let T: V ! Ls (˝;RN ) be a compact linear operator, where Ls (˝;RN ) stands for the Banach space of all Lebesgue measurable functions f : ˝ ! RN for which |f |s is integrable with 1 < s < 1. Let F: V ! R be a locally Lipschitz function and let G: ˝ ×RN ! R be a (Carathéodory) function such that G(x, y) is measurable in x 2 ˝, locally Lipschitz in y 2 RN and G(x, 0) = F(0) = 0, x 2 ˝. The hypotheses below are imposed

H

Hemivariational Inequalities: Eigenvalue Problems

H1) |w|  c(1 + | y | s  1 ), 8w 2 @y G(x, y), x 2 ˝, y 2 RN , with a constant c > 0; H2) i) F(v)  r hz, vV i  ˛ k vV  ˛ 0 , 8v 2 V, z 2 @F(v); ii) G(x, y)  r hw, y i   b |y |0  b0 , for a.e. x 2 ˝, y 2 RN , w 2 @y G(x, y), with positive constants r, ˛, ˛ 0 , b, b0 , ,  0 , where 1   0 < min { , r1 , s }; H3) any bounded sequence { vn }  V for which there is zn 2 @F(vn ) converging in V  contains a convergent subsequence in V; p H4) i) lim infv ! 0 F(v kvkV > 0; ii) p

lim inf F(v) kvkV

The foregoing locally Lipschitz functional I satisfies the Palais–Smale condition in the sense of Chang [1]. Indeed, let (vn ) be a sequence in V with I(vn )  M and for which there exists a sequence J n 2 @I(vn ) with J n ! 0 in V  . Then from H2) and taking into account that Jn D zn C T  wn ; z n 2 @F(v n ); w n (x) 2 @ y G(x; (Tv n )(x)) a.e. x 2 ˝; we infer that M C r kv n kV  F(v n )  r hz n ; v n iV Z C (G(x; (Tv n )(x))  r hw n (x); (Tv n )(x)i) dx ˝

v!0

(sp)/p

C j˝j

 ˛ kv n kV C C1 kv n kV0 C C2 ;

p

p

kTk lim inf G(x; y) jyj y!0

>0 uniformly with respect to x, 1  p < s; H5) lim inf F(tv0 )t 1/r t!C1 Z <  lim inf t 1/r G(x; tTv0 ) dx t!C1

˝

for some v0 2 V. The following statement is our main result in studying the abstract eigenvalue problem (5). Theorem 1 Assume that the hypotheses H1)–H5) hold. Then there exists a nontrivial critical point u 2 V of I: V ! R defined by Z I(v) D F(v) C G(x; (Tv)(x)) dx ; v 2 V : ˝

Moreover, there exists z 2 @F(u) and w 2 Ls(s  1) (˝;RN ) such that w(x) 2 @ y G(x; (Tu)(x)) a.e. x 2 ˝; Z hz; viV C hw(x); (Tv)(x)i dx D 0 ; v 2 V: ˝

Conversely, if u 2 V verifies the relations above, corresponding to some z and w, and the function G(x, ) is regular at (Tu) (x) (in the sense of F.H. Clarke [2]) for each x 2 ˝, then u is a critical point of I.

with real constants C1 , C2 , provided that n is large enough. It is clear that the estimate above implies that the sequence (vn ) is bounded in V. Then a standard argument based on the assumption H3) allows to conclude that (vn ) possesses a strongly convergent subsequence. Namely, the boundedness of (vn ) implies that (Tvn ) is bounded in Ls (˝;RN ). Thus (wn ) is bounded in Ls/s  1) (˝;RN ) due essentially to the assumption H1). Since T  is a compact operator and J n ! 0 we derive that (zn ) has a convergent subsequence in V  . This fact combined with the boundedness of (vn ) allows to use the hypothesis H3). The claim that the locally Lipschitz functional I verifies the Palais –Smale condition is proved. Assumption H4) insures the existence of some constants ı > 0, A > 0 and B > 0, with A  B j˝j(sp)/p kTk p > 0; such that p

F(v)  A kvkV ;

kvkV  ı;

(7)

and G(x; y)  B jyj p ;

8x 2 ˝;

jyj  ı:

Combining the inequality above with H1) one obtains that Z p G(x; (Tv)(x)) dx  (A  ) kvkV ; ˝

kvkV  ;

(8)

1485

1486

H

Hemivariational Inequalities: Eigenvalue Problems

for some  > 0 and 0 <   ı. Indeed, assumption H1) and Lebourg ’s mean value theorem imply that G fulfills the following growth condition jG(x; y)j  a1 C a2 jyjs ; 8x 2 ˝;

y 2 RN ;

with constants a1 , a2  0. The two estimates above for G(x, y) show that p

G(x; y)  B jyj  (a1 ı

s

for all t > 1,  > 1, with new positive constants C, C 0 . In view of H5) and since  0 < 1/r, we can find  sufficiently large such that C kv0 kV0  0 1/r C C 0  1/r Z 1/r C G(x; (Tv0 )(x)) dx <  lim inf F(v0 ) 1/r :

˝

!C1

s

C a2 ) jyj ; 8x 2 ˝;

y 2 RN :

Then one deduces from the continuity of T that one has Z G(x; (Tv)(x))dx

With such fixed number , we see that there exists arbitrarily large t satisfying F(tv0 )(t)1/r C C kv0 kV0  0 1/r C C 0  1/r Z C  1/r G(x; (Tv0 )(x)) dx < 0:

˝

  B j˝j(sp)/p kTk p

˝

sp

 (a1 ı s C a2 ) kTk s kvkV



We deduce that p

I(t n v0 )  0

kvkV ; 8v 2 V:

Since s > p we see that the numbers  > 0 and  > 0 can be chosen so small that relation (8) be verified. By (7) and (8) we arrive at the conclusion that there exist positive numbers ,  such that kvkV D :

I(v)  ;

(9)

The formula @ t (t 1/r G(x; t y)) ˝ ˛ 1 D t 11/r [r @ y G(x; t y); t y  G(x; t y)]; r the absolute continuity property and H2ii) show that t 1/r G(x; t y)  G(x; y) Zt D

@ ( 1/r (G(x;  y)d  C jyj0 C C0

1

for a.e. x 2 ˝, y 2 RN , t > 1, where C, C0 are positive constants. Then one obtains I(tv0 )  (t)1/r 2  4F(tv0 )(t)1/r C C kv0 kV0  0 1/r

CC 0  1/r C  1/r

Z ˝

3 G(x; (Tv0 )(x)) dx 5

(10)

for a subsequence t n ! 1. The properties (9) and (10) permit to apply the mountain pass theorem in the nonsmooth version of Chang [1]. This yields the desired critical point u of I. The other assertions of the first part of Theorem are direct consequences of the last statement. The converse part of Theorem follows from the next formula Z Z @ G(x; u(x)) dx D @ y G(x; u(x)) dx; ˝

˝

8u 2 Ls (˝; R N );

which is valid under the growth condition in H1) and the regularity assumption for G (see [2]). The proof of Theorem is thus complete. In the case of problem (4) we choose V = H 10 (˝), the compact linear operator T: H 10 (˝) ! Ls (˝) equal to the embedding H 10 (˝)  Ls (˝) with 2 < s < 2m(m  2)1 if m  3, Z 1 (jrvj2  v 2 ) dx; 8v 2 H01 (˝); F(v) D 2 ˝

R where for simplicity we take a(u, v) = ˝ r u  r v dx, and G(x, t) = j(t). A significant possible choice for j is the following one jtjs C j(t) D  s

Zt ˇ() d;

t 2 R;

(11)

0

where ˇ 2 L1 l oc (R) verifies t ˇ(t)  0 for t near 0, | ˇ(t) |  c(1 + |t | ), t 2 R, with constants c > 0, 0   < 1.

Hemivariational Inequalities: Eigenvalue Problems

Corollary 2 Let j: R ! R be given by (11). If 1 denotes the first eigenvalue of   on H 10 (˝), then for every  < 1 the problem (5) with a as above, has a nontrivial eigenfunction u 2 H 10 (˝) which solves in addition the nonsmooth Dirichlet problem containing both superlinear and sublinear terms u C u C jujs2 u 2 [ˇ(u(x)); ˇ(u(x))] a.e. x 2 ˝; u D 0 on @˝; where the notations in [1] are used. The argument consists in verifying the assumptions H1)–H5) for the functional I = I , for  < 1 , with I described in (6). To this end it is sufficient to take r 2 (1/s, 1/2), p =  = 2,  0 =  + 1 and v0 2 H 10 (˝) { 0 }. Applying Theorem one finds the stated result. Other related results and applications for eigenvalue problems in the form of hemivariational inequalities are given in [3,4,5,6,7] and the references therein. See also  ˛BB Algorithm  Eigenvalue Enclosures for Ordinary Differential Equations  Generalized Monotonicity: Applications to Variational Inequalities and Equilibrium Problems  Hemivariational Inequalities: Applications in Mechanics  Hemivariational Inequalities: Static Problems  Interval Analysis: Eigenvalue Bounds of Interval Matrices  Nonconvex Energy Functions: Hemivariational Inequalities  Nonconvex-nonsmooth Calculus of Variations  Quasidifferentiable Optimization  Quasidifferentiable Optimization: Algorithms for Hypodifferentiable Functions  Quasidifferentiable Optimization: Algorithms for QD Functions  Quasidifferentiable Optimization: Applications  Quasidifferentiable Optimization: Applications to Thermoelasticity  Quasidifferentiable Optimization: Calculus of Quasidifferentials  Quasidifferentiable Optimization: Codifferentiable Functions

H

 Quasidifferentiable Optimization: Dini Derivatives, Clarke Derivatives  Quasidifferentiable Optimization: Exact Penalty Methods  Quasidifferentiable Optimization: Optimality Conditions  Quasidifferentiable Optimization: Stability of Dynamic Systems  Quasidifferentiable Optimization: Variational Formulations  Quasivariational Inequalities  Semidefinite Programming and Determinant Maximization  Sensitivity Analysis of Variational Inequality Problems  Solving Hemivariational Inequalities by Nonsmooth Optimization Methods  Variational Inequalities  Variational Inequalities: F. E. Approach  Variational Inequalities: Geometric Interpretation, Existence and Uniqueness  Variational Inequalities: Projected Dynamical System  Variational Principles

References 1. Chang K-C (1981) Variational methods for non-differentiable functionals and their applications to partial differential equations. J Math Anal Appl 80:102–129 2. Clarke FH (1984) Nonsmooth analysis and optimization. Wiley, New York 3. Goeleven D, Motreanu D, Panagiotopoulos PD (1997) Multiple solutions for a class of eigenvalue problems in hemivariational inequalities. Nonlinear Anal Th Methods Appl 29:9–26 4. Motreanu D (1995) Existence of critical points in a general setting. Set-Valued Anal, 3:295–305 5. Motreanu D, Panagiotopoulos PD (1999) Minimax theorems and qualitative properties of the solutions of hemivariational inequalities. Kluwer, Dordrecht 6. Naniewicz Z, Panagiotopoulos PD (1995) The mathematical theory of hemivariational inequalities and applications. M. Dekker, New York 7. Panagiotopoulos PD (1993) Hemivariational inequalities. applications in mechanics and engineering. Springer, Berlin 8. Rabinowitz PH (1986) Minimax methods in critical point theory with applications to differential equations, vol 65. CBMS Reg. Conf. Ser. Math., Amer. Math. Soc., Providence

1487

1488

H

Hemivariational Inequalities: Static Problems

Hemivariational Inequalities: Static Problems HVI ZDZISŁAW NANIEWICZ1,2 1 Institute Appl. Math. Mech., Warsaw University, Warsaw, Poland 2 Institute Math. Comp. Science, Techn. University Czestochowa, Czestochowa, Poland

Moreover, we assume that V is endowed with a direct b C V0 , where V 0 is a finitesum decomposition V D V dimensional linear subspace, with respect to which A is b and  2 V 0 semicoercive, i. e. 8u 2 V there exist b u2V such that u D b u C  and



u V ) b u V ; (1) hAu; uiV  c( b where c: R+ ! R stands for a coercivity function with c(r) ! 1 as r ! 1. Further, let j: RN ! R be a locally Lipschitz function fulfilling the unilateral growth conditions ([16,21]):

MSC2000: 49J40, 47J20, 49J40, 35A15 j0 (;   )  ˛(r)(1 C jj ); 8;  2 R N ; jj  r; r  0; (2)

Article Outline Keywords References

and j0 (; )  k jj ;

Keywords Semicoercive hemivariational inequality; Unilateral growth condition; Pseudomonotone mapping; Recession functional 1

N

Let V = H (˝;R ), N  1, be a vector valued Sobolev space of functions square integrable together with their first partial distributional derivatives in ˝, ˝ being a bounded domain in Rm , m > 2, with sufficiently smooth boundary  . Assume that V is compactly imbedded into Lp (˝;RN ) (1 < p < 2m/m  2), [12]). We write k  kV and k  kL p (˝;R N ) for the norms in V and Lp (˝;RN ), respectively. For the pairing over V  × V the symbol h ,  iV will be used, V  being the dual of V. Let A: V ! V  be a bounded, pseudomonotone operator. This means that A maps bounded sets into bounded sets and that the following conditions hold [3,5]: i) The effective domain of A coincides with the whole V; ii) If un ! u weakly in V and lim supn ! 1 h Aun , un  uV i  0, then lim infn ! 1 hAun , un  vV i  h Au, u  v iV for any v 2 V. Note that i) and ii) imply that A is demicontinuous, i. e. iii) If un ! u strongly in V, then Aun ! Au weakly in V.

8 2 R N ;

(3)

where 1   < p, k is a nonnegative constant and ˛ :R+ ! R+ is assumed to be a nondecreasing function from R+ into R+ . Here, j0 (;) stands for the directional Clarke derivative j0 (; ) D lim sup h!0 !0C

j( C h C )  j( C h) ; 

(4)

by means of which the Clarke generalized gradient of j is defined by [6] ˚ @ j() :D  2 R N : j0 (; )    ; 8 2 R N ; ;  2 R N : Remark 1 The unilateral growth condition (2) is the generalization of the well known sign condition used for the study of nonlinear partial differential equations in the case of scalar-valued function spaces (cf. [27,28]). Consider the problem of finding u 2 V such as to satisfy the hemivariational inequality Z hAu  g; v  uiV C ˝

j0 (u; v  u) d˝  0; 8v 2 V:

(5)

It will be assumed that g 2 V  fulfills the compatibility condition Z j1 () d˝; 8 2 V0 n f0g; (6) hg; iV < ˝

H

Hemivariational Inequalities: Static Problems

where j1 : RN ! R [ { + 1 } stands for the recession functional given by (cf. [2,4,10]) 1

0

j () D lim inf[ j (t; )]; ! t!C1

N

 2R :

(7)

Because of (1), the problem to be considered here will be referred to as a semicoercive hemivariational inequality. The notion of hemivariational inequality has been first introduced by P.D. Panagiotopoulos in [22,23] for the description of important problems in physics and engineering, where nonmonotone, multivalued boundary or interface conditions occur, or where some nonmonotone, multivalued relations between stress and strain, or reaction and displacement have to be taken into account. The theory of hemivariational inequalities (as the generalization of variational inequalities, cf. [7]) has been proved to be very useful in understanding of many problems of mechanics involving nonconvex, nonsmooth energy functionals. For the general study of hemivariational inequalities and their applications, see [13,14,15,17,18,19,20,21,24,26] and the references quoted there. Some results in the area of static, semicoercive inequality problems can be found in [9,10,25]. To prove the existence of solutions to (5), the Galerkin method combined with the pseudomonotone regularization of the nonlinearities will be applied. Let us start with the following preliminary results. The regularization e j0R (; ), R > 0, of the Clarke direc0 tional derivative j (;) will be defined as follows: for any ,  2 RN , set

where e ˛ : RC ! RC is a nondecreasing function independent of R. Proof To establish (9) and (10) it suffices to consider the case |  |  R and to invoke the estimates    e 0 0 j R (;   ) D j R ;    jj     0  j R ; R jj jj     jj  R 0 j R ; R C R jj jj jj  R kR  ˛(jj)(1 C R ) C R  ˛(r)(1 C jj ) C k jj ; 8;  2 R N ; jj  r; r  0; and    e j0R (; ) D j0 R ;  jj     jj 0 jj j R ; R kR D k jj ;   R R jj jj respectively. The proof is complete. For any R > 0, the following regularization of the primal problem can be formulated: (PR )Find (u R ; R ) 2 V  L q (˝; R N ); 1/p +1/q = 1, such that hAu R  g; v  u R iV Z C R  (v  u R ) d˝ D 0;

8v 2 V;

(11)

˝

8 < j0 (; ) if jj  R; e j0R (; ) D 0    : j R jj ;  if jj > R:

(8)

Lemma 2 Suppose that (2) and (3) are fulfilled. Then for R > 0,

where R (u R ) :D

8 < : Z

e j0R (;   )  e ˛ (r)(1 C jj );



8 2 R N ;

8 2 R N ; jj  r; r  0: (9) e j0R (; )  k jj ;

N

8 2 R ;

(12)

R 2 R (u R );

(10)

˝

2 L q (˝; R N ) :

Z  v d˝ ˝

9 =

e j0R (u R ; v)d˝; 8 v 2 L p (˝; R N ) : ;

In order to show that (PR ) has solutions, the following auxiliary result is to be applied.

1489

1490

H

Hemivariational Inequalities: Static Problems

Lemma 3 Suppose that (1)-(3) and (6) hold. Then there exists R0 > 0 such that for any R > R0 the set of all u 2 V with the property that Z hAu  g; uiV 

e j0R (u; u) d˝  0

(13)

˝

is bounded in V, i. e. there exists M > 0 (possibly depending on R > R0 ), such that (13) implies kukV  M:

(14)

Proof Suppose on the contrary that this claim is not true, i. e. there exists a sequence { un }1 nD1  V with the property that Z hAu n  g; u n iV 

e j0R (u n ; u n ) d˝  0;

(15)

˝

where k un kV ! 1 as n ! 1. By the hypothesis, each element un can be represented as un D b u n C e n n ;

(16)

b en  0,  n 2 V 0 , k  n kV = 1, and where b u n 2 V,



u n V ) b u n V ). Taking into account hAu n ; u n iV  c( b (3) it follows that Z 0  hAu n  g; u n iV  e j0R (u n ; u n ) d˝

results, which in view of



c( b u n V )  kgkV   k1 ! C1

as n ! 1

implies the assertion (18). The obtained results give rise to the following representation of un :   1 b un D en u n C n ; en where b u n /e n ! 0 strongly in V and  n !  in V 0 as n ! 1 for some  2 V 0 with k  kV = 1 (recall that V 0 has been assumed to be finite dimensional). Moreover, the compact imbedding V  Lp (˝;RN ) permits one to suppose that b u n /e n ! 0 and  n !  a.e. in ˝. Further, (15), together with the fact that A is semicoercive, leads to Z j0R (u n ; u n )d˝ 0  hAu n  g; u n iV  e

C en     Z 1 1 b  e j0R e n u n C n ;  b u n  n d˝: en en

˝

V

˝

(17)

where k1 = const. The obtained estimates imply that { en } is unbounded. Indeed, if it would not be so, then due to the behavior of c() at infinity, fb u n g had to be bounded. In such a case the contradiction with k un kV ! 1 as n ! 1 results. Therefore one can suppose without loss of generality that en ! + 1 as n ! 1. The next claim is that strongly in V :

nD1

˝





u n V  kgkV  ( b u n V C e n )  c( b u n V ) b



u n  e n k1 kn k ;  k1 b

1 b un ! 0 en

Thus, the boundedness of the sequence (

)1 

 b u n V



c( b u n V )  kgkV   k1 en





 c( b u n V  e n hg; n iV u n V )  kgkV  b

˝





u n V  kgkV  b  c( b u n V ) b u n V Z  e n hg; n iV  k ju n j d˝

V



Indeed, if f b u n V g is bounded, then (18)

follows immediately. If b u n V ! 1 then c( b u n V ) ! C1. From (17) one has



 b 

u n V k1 C kgkV   c( b : u n V )  kgkV   k1 en

(18)

Hence  1



b u n V )  kgkV  u n V hg; n iV  c( b en     Z b b un un C e j0R e n C n ;   n d˝: en en ˝

(19) Now observe that either 

 1

b c( b u n V )  kgkV  u n V ! 0 en

as n ! 1;

H

Hemivariational Inequalities: Static Problems



if f b u n V g is bounded, or  1



b u n V )  kgkV  u n V  0 c( b en



for sufficiently large n, if b u n V ! 1 as n ! 1. Therefore, for any case 

 1

b u n V )  kgkV  u n V  0: lim inf c( b n!1 en Moreover, by (10) the estimate follows: 





(20)

n

This allows the application of Fatou ’s lemma in (19), from which one is led to hg; iV  lim inf n!1     Z  b b un un e j0R e n d˝ C n ;   n en en ˝ Z  lim inf n!1

     b b un un 0 e  jR en d˝: C n ;   n en en (21)

Taking into account (8) and upper semicontinuity 0 of j (, ), one can easily verify that    b b un un j0R e n ( C n );   n lim inf  e n!1 en en      j0 R ;  ; jj which leads to   Z  hg; iV   j0 R ;  d˝: jj

˝

(22)

˝

The upper semicontinuity of j0 (;) allows us to conclude the existence of R > 0 and " > 0 such that   Z Z R 0 ı 0 0 j j1 () d˝   ;  d˝  2 j 0 j

Since j1 () is lower semicontinuous and V 0 is finite dimensional, from (6) it follows that a ı > 0 can be found such that for any  2 V 0 with k  V k = 1, Z j1 () d˝: (23) hg; iV C ı
R and  0 2 V 0 with k    0 kV < " . As the sphere {v 2 V 0 : kvkV = 1 } is compact in V 0 , there exists R0 > 0 such that   Z Z R ı  j0 j1 () d˝  ; ;  d˝  2 jj ˝

˝

for any  2 V 0 with k  kV = 1, R > R0 . This combined with (23) contradicts (22). Accordingly, the existence of a constant M > 0 has been established such that (13) implies (14), whenever R > R0 . The proof of Lemma 3 is complete. Proposition 4 Let us assume all the hypotheses stated above. Then for any R > R0 the problem (PR ) possesses at least one solution. Moreover, if (uR , R ) is a solution of (PR ), then ku R kV  M

(24)

for some constant M not depending on R > R0 . Proof Let  be the family of all finite-dimensional subspaces F of V, ordered by inclusion. Denote by iF : F ! V the inclusion mapping of F into V and by iF : V  ! F  the dual projection mapping of V  into F  , F  being the dual of F. The pairing over F  × F will be denoted by h ,  iF . Set AF := iF ° A ° iF and g F := iF g. Fix R > R0 . For any F 2  consider a finitedimensional regularization of (PR ): (PF )

˝

˝

˝



b b un un e j0R e n C n ;   n en en ˇ ˇ ˇ ˇb un  k ˇˇ C n ˇˇ : e

˝

With the help of Fatou ’s lemma (permitted by (20)) we arrive at   Z Z R 0 j1 () d˝: ;  d˝  lim inf  j R!1 jj

Find (u F ; F ) 2 F  L q (˝; R N ) 2 F

such that Z hAu F  g; viV C

F  v d˝ D 0; 8v 2 F;

(25)

˝

F 2 R (u F ):

(26)

1491

1492

H

Hemivariational Inequalities: Static Problems

The first task is to show that for each F 2 , (PF ) has solutions. Notice that  R () has nonempty, convex and closed values and if 2  R (v), v 2 Lp (˝;RN ), then k k L q (˝;RN )  K R ;

(27)

for some K R > 0 depending on the Lipschitz constant of j in the ball {  2 RN : |  |  R }. Moreover, from the upper semicontinuity of e j0R (; ) and Fatou ’s lemma it follows immediately that  R is upper semicontinuous from Lp (˝;RN ) to Lq (˝;RN ), Lq (˝;RN ) being endowed with the weak topology. Further, let  F : Lq (˝;RN ) ! F  be the operator that to any 2 Lq (˝;RN ) assigns  F 2 F  defined by Z  v d˝ for any v 2 F: (28) hF ; viF :D ˝

Note that  F is a linear and continuous operator from the weak topology of Lq (˝;RN ) to the (unique) topol ogy on F  . Therefore GF : F ! 2F , given by the formula G F (v F ) :D F R (v F )

for v F 2 F;

(29)

is upper semicontinuous. By the pseudomonotonicity of A it follows that AF :  F ! F  is continuous. Thus, AF + GF  g F : F ! 2F is an upper semicontinuous multivalued mapping with nonempty, bounded, closed and convex values. Moreover, for any vF 2 F and F 2 GF (vF ) one has hA F v F C

F

 gF ; vF iF

Z

 hAv F  g; v F iV 

e j0R (v F ; v F ) d˝:

(30)

 g F ; v F iF  0:

(31)

Accordingly, one can invoke [1, Corol. 3, p. 337] to deduce the existence of uF 2 F with ku F kV  M C 1

weakcl(WF )  BV (O; M C 1);

8F 2 ;

where BV (O, M + 1) := {v 2 V kvkV  M + 1 }. Thus, the family { weakcl(WF ): F 2  } is contained in the weakly compact set BV (O, M + 1) of V. Further, for any F 1 , . . . , F k 2 , k = 1, 2, . . . , the inclusion WF1 \    \ WFk  WF results, with F = F 1 +    + F k . Therefore, the family {weakcl(WF ): F 2  } has the finite intersection property. This implies that \F 2  weakcl (WF ) is not empty. From now on, let uR 2 BV (0, M + 1) belong to this intersection. Fix v 2 V arbitrarily and choose F 2  such that uR , v 2 F. Thus, there exists a sequence {uFn }  WF with uFn ! uR weakly in V. Let Fn 2  R (uFn ) denote the corresponding sequence for which (uFn , Fn ) is a solution of (P Fn ) (for simplicity of notation, the symbols {un } and { n } will be used instead of uFn and Fn , respectively). Therefore hAu n  g; w  u n iV C

n  (w  u n ) d˝ D 0; ˝

Hence, in view of Lemma 3, for R > R0 there exists M > 0 not depending on F 2  such that the condition k vF kV = M + 1 implies F

The symbol weakcl (WF ) will be used to denote the closure of WF in the weak topology of V. From (32) one gets

Z

˝

hA F v F C

In the next step it will be shown that (PR ), R > R0 , has solutions. For F 2 , let 8 9 (u F 0 ; F 0 ) ˆ > ˆ > < = [ satisfies (PF 0 ) uF0 2 V : : WF :D ˆ > for some > : ; F 0 2; ˆ F 0 2 L q (˝; R N ) F 0 F

(32)

such that 0 2 AF uF + GF (uF )  g F . This implies that for some F 2  R (uF ) it follows that F =  F ( F ) and (uF , F ) is a solution of (PF ).

8 w 2 Fn :

(33)

Since k n kL q (˝;RN )  K R and Lq (˝;RN ) is reflexive, it can also be supposed that for some R 2 Lq (˝;RN ), n ! R weakly in Lq (˝;RN ). By the hypothesis, the imbedding V  Lp (˝;RN ) is compact, so un ! uR strongly in Lp (˝;RN ). Consequently, by the upper semicontinuity of  R from Lp (˝;RN ) to Lq (˝;RN ) (L;q (˝;RN ) being endowed with the weak topology) it follows immediately that R 2  R (uR ), i. e. (12) holds. R Moreover, ˝ n (uR  un )d ˝ ! 0 as n ! 1 and (33) with w = uR lead to lim hAun , un  uR iV = 0. Accordingly, the pseudomonotonicity of A allows the conclusion that hAun , un iV ! hAuR , uR iV and Aun ! AuR

H

Hemivariational Inequalities: Static Problems

weakly in V  . Finally, substituting w = v in (33) and letting n ! 1 give in conclusion (11) with v 2 V chosen arbitrarily. Thus the existence of solutions of (PR ) has been established. Let us proceed to the boundedness of solutions {uR }R>R0 of (PR ). Suppose on the contrary that this claim is not true. Then according to (11) and (12) there would exist a sequence Rn ! 1 such that k uR n kV ! 1 as n ! 1, and Z j0R n (u R n ; u R n ) d˝  0: (34) hAu R n  g; u R n iV  f ˝

From now on, for simplicity of notations, instead of the subscript ‘Rn ’ we write ‘n’. Eq. (34) allows us to follow the lines of the proof of Lemma 3. First, analogously one arrives at the representation   1 b u n C n ; un D en en with b u n /e n ! 0 strongly in V and  n !  0 in V 0 as n ! 1 for some  0 2 V 0 with k  0 kV = 1. Secondly, the counterpart of (21) can be obtained in the form Z hg; iV  lim inf n!1

˝

(35)      1 1 0 f b  jRn en d˝: u n C n ;  b u n  n en en But

    1 1 0 b b jf e ;  C    u u n n n n n Rn en en     1 1 0 b D j en u n C n ;  b u n  n ; en en ˇ ˇ if ˇb u n C e n n ˇ  R n and     1 1 0 b b jf e ;  C    u u n n n n n Rn en en 0 1   1 1 R n ˇ b u C n ;  b u n  n A ; D j0 @ ˇ ˇ1 ˇ en n e n b ˇ u n C n ˇ en

ˇ ˇ u n C e n n ˇ > R n :. Therefore we easily conclude, usif ˇb ing (7), that     1 1 0 b b e ;  C    u u lim inf  jf n n n n n Rn n!1 en en  j1 (0 ):

Consequently, by Fatou ’s lemma, Z hg; 0 iV 

j1 (0 ) d˝;

˝

contrary to (6). Thus, the boundedness of {uR }R>R0 follows and the proof of Proposition 4 is complete. The next result is related to the compactness property of { R : R > R0 } in L1 (˝;RN ). Proposition 5 Let a pair (uR , R ) 2 V × Lq (˝;RN ) be a solution of (PR ). Then the set 8 ˆ ˆ ˆ ˆ ˆ < ˆ ˆ ˆ ˆ ˆ :

R 2 L q (˝; R N ) :

(u R ; R ) is a solution of (PR ) for some u R 2 V ; R > R0

9 > > > > > = > > > > > ;

is weakly precompact in L1 (˝;RN ). Proof According to the Dunford–Pettis theorem [8] it is sufficient to show that for each " > 0 a ı > 0 can be determined such that for any !  ˝ with meas ! < ı, Z j R j d˝ < ";

R > R0 :

(36)

!

Fix r > 0 and let  2 RN be such that |  |  r. Then, by j0R (u R ;   u R ) it results that (9), from R  (  u R )  e R    R  u R C e ˛ (r)(1 C ju R j )

(37)

a.e. in ˝. Let us set r  p (sgn R1 ; : : : ; sgn R N ); N where R i , i = 1, . . . , N, are the components of R and where sgny = 1 if y > 0, sgn y = 0 if y = 0, and sgny =  1 if y < 0. It is not difficult to verify that |  |  r for almost all x 2 ˝ and that r R    p j R j : N Therefore, by (37) the estimate follows r p j R j  R  u R C e ˛ (r)(1 C ju R j ): N

1493

1494

H

Hemivariational Inequalities: Static Problems

Integrating this inequality over !  ˝ yields p p Z Z N N e ˛ (r) meas ! R u R d˝C j R j d˝  r r ! ! p N e ˛ (r)(meas !)(p)/p ku R kL p (˝) d˝: (38) C r Thus, from (24) one obtains Z j R j d˝ !

p Z N N e ˛ (r) meas ! R  u R d˝ C  r r ! p N e ˛ (r)(meas !)(p)/p   ku R kV d˝ C r p p Z N N e ˛ (r) meas ! R  u R d˝ C  r r ! p N e ˛ (r)(meas !)(p)/p   M  d˝ C (39) r (k  kL p (˝;R N )   k  kV ). Further, it will be shown that Z R  u R d˝  C (40) p

and consequently, (40) easily follows. Further, from (39) and (40), for r > 0, p N N CC e ˛ (r) meas ! r r

p

Z j R j d˝  !

p C

N e ˛ (r)(meas !)(p)/p   M  d˝: r

(41)

This estimate is crucial for obtaining (36). Namely, let " > 0. Fix r > 0 with p " N C< (42) r 2 and determine ı > 0 small enough to fulfill p N e ˛ (r) meas ! r p " N e ˛ (r)(meas !)(p)/p   M   ; C r 2 provided that meas ! < ı. Thus, from (41) it follows that for any !  ˝ with meas ! < ı, Z (43) j R j d˝  "; R > R0 : !

!

for some positive constant C not depending on !  ˝ and R > R0 . Indeed, from (10) one can easily deduce that

Now the main result will be formulated.

R  u R C k ju R j  0 a.e. in ˝:

Theorem 6 Let A: V ! V  be a pseudomonotone, bounded operator, j: RN ! R a locally Lipschitz function. Suppose that (1)-(3) and (6) hold. Then there exist u 2 V and 2 L1 (˝;RN ) such that

Thus it follows that Z ( R  u R C k ju R j) d˝ !

Z

Z

( R  u R C k ju R j) d˝;



Finally, { R }R>R0 is equi-integrable and its precompactness in L1 (˝;RN ) has been proved [8].

hAu  g; v  uiV C

˝

 (v  u) d˝ D 0; ˝

and consequently Z Z R  u R d˝  R  u R d˝ C 2k1 ku R kV : !

˝

8v 2 V \ L1 (˝; R N ); (44) (

2 @ j(u)

a.e. in ˝;

(45)

1

But A maps bounded sets into bounded sets. Therefore, by means of (11) and (24), Z R  u R d˝ D  hAu R  g; u R iV ˝

 u 2 L (˝): Moreover, the hemivariational inequality holds: Z hAu  g; v  uiV C

j0 (u; v  u)d˝  0;

˝

 kAu R  gkV  ku R kV  C0 ;

C0 D const;

8v 2 V ; (46)

Hemivariational Inequalities: Static Problems

where the integral above is assumed to take + 1 as its value if j0 (u;v  u) 62 L1 (˝). Proof The proof is divided into a sequence of steps. Step 1. From Propositions 4 and 5 it follows that from the set { uR , R }R>R0 of solutions of (PR ) a sequence { uR n , R n } can be extracted with Rn ! 1 as n ! 1 (for simplicity of notations it will be denoted by (un , n )), such that hAu n  g; v  u n iV Z C n  (v  u n ) d˝ D 0; 8v 2 V;

(47)

˝

and 8 ˆ ˆ < n 2 R n (u n ); un ! u weakly in V; ˆ ˆ : ! weakly in L1 (˝; R N ) n

weakly in V 

(48)

(50)

is valid for any v 2 V \ L1 (˝;RN ). Step 2. Now it will be proved that 2 @j(u) a.e in ˝, i. e. the first condition in (45) is fulfilled. Since V is compactly imbedded into Lp (˝;RN ), due to (48) one may suppose that strongly in L p (˝; R N ):

D

˝n!

j0 (u n ; v) d˝;

(for large n)

˝n!

(un remains pointwise uniformly bounded in ˝ \ ! and Rn ! 1 as n ! 1) combined with the weak convergence in L1 (˝;RN ) of n to , (51) and with the upper semicontinuity of Z 1 N L (˝ n !; R ) 3 u n 7! j0 (u n ; v) d˝;

it follows that Z Z  v d˝ 

j0 (u; v) d˝;

˝n!

8v 2 L1 (˝ n !; R N ):

But the last inequality allows us to state that 2 @j(u) a.e. in ˝ \ !. Since meas ! < " and " was chosen arbitrarily, 2 @ j(u)

˝

un ! u

Z

a.e. in ˝;

(52)

(49)

Z  v d˝ D 0

˝n!

˝n!

(by passing to a subsequence, if necessary). Thus, (47) implies that the equality hB  g; viV C

function. From the estimate Z Z 0 jf n  v d˝  R n (u n ; v) d˝

˝n!

for some u 2 V and 2 L1 (˝;RN ). The boundedness of {Aun } in V  (recall that A has been assumed to be bounded and that k un kV  M) allows the conclusion that for some B 2 V  , Au n ! B

H

(51)

This implies that for a subsequence of {un } (again denoted by the same symbol) one gets un ! u a.e. in ˝. Thus, from Egoroff ’s theorem it follows that for any " > 0 a subset !  ˝ with meas ! < " can be determined such that un ! u uniformly in ˝ \ ! with u 2 L1 (˝ \ !;RN ). Let v 2 L1 (˝ \ !;RN ) be an arbitrary

as claimed. Step 3. Now it will be shown that  u 2 L1 (˝), i. e. the second condition in (45) holds. For this purpose we shall need the following truncation result for vector-valued Sobolev spaces. Theorem 7 ([20]) For each v 2 H 1 (˝;RN ) there exists a sequence of functions { "n }  L1 (˝) with 0  "n  1 such that f(1  "n )vg  H 1 (˝; R N ) \ L1 (˝; R N ) (1  "n )v ! v

strongly in H 1 (˝; R N ):

(53)

Remark 8 For the truncation procedure of the form (53) in the case of a scalar-valued Sobolev space W p, m (˝) the reader is referred to [11]. According to the aforementioned theorem, for u 2 V one can find a sequence { "k } 2 L1 (˝) with 0  "k  1 such that e u k :D (1  " k )u 2 V \ L1 (˝; R N ) and e u k ! u in V as k ! 1. Without loss of generality it can be assumed thate u k ! u a.e. in ˝. Since it is already

1495

1496

H

Hemivariational Inequalities: Static Problems

known that 2 @j(u), one can apply (3) to obtain  ( u)  j0 (u; u)  k |u |. Hence e u k D (1  " k )  u  k juj :

(54)

This implies that the sequence f e u k g is bounded from below and  e u k !  u a.e. in ˝. On the other hand, due to (50) one gets Z u k d˝ C  hB C g;e u k iV D  e ˝

for a positive constant C. Thus, by Fatou ’s lemma  u 2 L1 (˝), as required. Step 4. Now the inequality Z Z (55) lim inf n  u n d˝   u d˝

By arbitrariness of " > 0 and (56) one obtains Z n  u n d˝

lim inf n!1

˝

Z

 v d˝ 

 ˝

˝

8v 2 V \ L1 (˝; R N ): (57) By substituting v D e u k :D (1  " k )u (with e u k as described in the truncation argument of Theorem 7) into the right-hand side of (57) one gets Z lim inf n  u n d˝ n!1

˝

Z

 lim inf k!1

will be established. It can be supposed that un ! u a.e. in ˝, because un ! u strongly in Lp (˝;RN ). Fix v 2 L1 (˝;RN ) arbitrarily. Since n 2  R n (un ), Eq. (9) implies 0 n  (v  u n )  jf R n (u n ; v  u n )

e ˛ (kvk L 1 (˝;RN ) )(1 C ju n j ): (56) From Egoroff ’s theorem it follows that for any " > 0 a subset !  ˝ with meas ! < " can be determined such that un ! u uniformly in ˝ \ !. R One can also suppose ˛ (kvk L 1 (˝;RN ) )(1 C that ! is small enough to fulfill !Re  ju n j ) d˝  ", n = 1, 2, . . . , and ! ˛(kvkL 1 (˝;RN ) )(1 + k u ) d ˝  ". Hence Z 0 jf R n (u n ; v  u n ) d˝ 0 jf R n (u n ; v  u n ) d˝ C "



e u k d˝

˝

Z

 lim sup k!1

j0 (u;e u k  u) d˝:

(58)

˝

Taking into account that e u k ! u a.e. in ˝, j0 (u;e u k  u) D " k j0 (u; u)  " k k juj  k juj and j  uj   e u k D (1  " k )  u  k juj, Fatou ’s lemma and the dominated convergence can be used to deduce Z u k  u) d˝  0; lim sup j0 (u;e k!1

˝

and Z

˝

Z

j0 (u; v  u) d˝;

˝

n!1

˝

Z

lim

k!1

e u k d˝ D

˝

Z  u d˝: ˝

˝n!

Z

j0 (u n ; v  u n ) d˝ C " (for large n);

D ˝n!

which by Fatou ’s lemma and upper semicontinuity of j0 (; ) yields Z 0 lim inf  jf R n (u n ; v  u n ) d˝ n!1

Z  ˝

˝

 j0 (u; v  u) d˝  2":

Finally, combining the last two inequalities with (58) yields (55), as required. Step 5. The next claim is that Z (59) hB  g; uiV C  u d˝ D 0: ˝

Indeed, (50) implies Z u k d˝ D 0; u k iV C  e hB  g;e ˝

(60)

H

Hemivariational Inequalities: Static Problems

with fe u k g as in Step 3. Since  u 2 L1 (˝) and k juj  e u k D (1  " k )  u  j  uj ; by the dominated convergence, Z Z e u k d˝ !  u d˝: ˝

˝

It means that (59) has to hold by passing to the limit as k ! 1 in (60). Step 6. In this step it will be shown that the pseudomonotonicity of A and (47) imply (44). Indeed, (47) with v 2 V \ L1 (˝;RN ) and (49) allows to state that lim sup hAu n ; u n  uiV  hB  g; v  uiV n!1 Z Z C  v d˝  lim inf n  u n d˝: n!1

˝

˝

Substituting v D e u k with e u k as in Step 3 and taking into account (55) one arrives at lim supn ! 1 hAun , un  u iV  0 (by the application of the limit procedure as k ! 1). Therefore the use of pseudomonotonicity of A is allowed and yields h Aun , un iV ! hAu, u iV and Aun ! B = Au weakly in V  as n ! 1. Finally, (47) implies (44), as claimed. Step 7. In the final step of the proof it will be shown that (44) and (45) imply (46). For this purpose, choose v 2 V \ L1 (˝;RN ) arbitrarily. From (2) one has  (v  u)  j0 (u;v  u)  ˛(kvkL 1 (˝;RN ) )(1 + |u| ) with  (v  u) 2 L1 (˝) and ˛ (k vL 1 (˝;RN ) )(1 + |u | ) 2 L1 (˝). Hence j0 (u;v  u) is finite integrable and consequently, (46) follows immediately from (44). Now consider the case j0 (u;v  u) 2 L1 (˝) with v 62 V \ L1 (˝;RN ). According to Theorem 7 there exists a sequence e v k D (1  " k )v such that fe vk g  v k ! v strongly in V. Since V \ L1 (˝; R N ) ande Z v k  u) d˝  0; v k  uiV C j0 (u;e hAu  g;e

Thus the application of Fatou ’s lemma gives the assertion. Finally, the proof of Theorem 6 is complete. Remark 9 The analogous result to that of Theorem 6 can be formulated for the hemivariational inequality R () d ˝ is replaced by the boundary (46) in which ˝ R integral  () d  , provided the imbedding H 1 (˝)  Lp ( ) is compact (1 < p < (2m  2)/(m  2), [12]). Example 10 Let us consider a linear elastic body which in its undeformed state occupies an open, bounded, connected subset ˝ of R3 . ˝ is referred to a fixed Cartesian coordinate system 0x1 x2 x3 and its boundary  is assumed to be Lipschitz regular; n = (ni ) denotes the outward unit normal vector to  . We decompose  into two disjointed parts  F and  S such that  D  F [  S . As usual, the symbols u: ˝ ! R3 and  : ˝ ! S3 are used to denote the displacement field and the stress tensor field, respectively. Here S3 stands for the space of all real-valued 3 × 3 symmetric matrices. Consider the boundary value problem: i) The equilibrium equations:  i j; j C b i D 0

in ˝:

(61)

ii) The displacement—strain relation: 1 " i j (u) D (u i; j C u j;i ) 2

in ˝:

(62)

iii) Hook’s law:  i j D C i jk l " k l (u)

in ˝:

(63)

iv) The surface traction conditions i j n j D Fi

on F :

(64)

v) The nonmonotone subdifferential boundary conditions

˝

so in order to establish (46) it remains to show that Z Z 0 lim sup j (u;e v k  u) d˝  j0 (u; v  u) d˝: k!1

˝

˝

For this purpose let us observe thate v k u D (1" k )(v u)C" k (u) which combined with the convexity of j0 (u; ) yields the estimate j0 (u;e v k  u)  (1  " k ) j0 (u; v  u) C " k j0 (u; u) ˇ ˇ  ˇ j0 (u; v  u)ˇ C k juj :

 S 2 @ j(u)

on S :

(65)

Here, S = (Si ) = ( ij nj ) is the stress vector, and @j() is the generalized gradient of Clarke of a locally Lipschitz function j: R3 ! R; the summation convention over repeated indices holds and the elasticity tensor C = (Cijkl ) is assumed to satisfy the classical conditions of ellipticity and symmetry [24]. Let V = H 1 (˝;R3 ). By making use of the standard technique (cf. [24]), Eqs. (61)-(65) lead to the problem

1497

1498

H

Hemivariational Inequalities: Static Problems

of finding u 2 V such as to satisfy the hemivariational inequality Z Z C i jk l " i j (u)" k l (v  u) d˝  b i (v i  u i ) d˝ Z ˝ Z ˝  Fi (v i  u i ) d C j0 (u; v  u) d  0; F

S

8v 2 V:

Define A: V ! V  by Z hAu; viV D C i jk l " i j (u)" k l (v) d˝;

(66)

u; v 2 V;

˝

and let V 0 := R = {  2 V : "ij () = 0, i, j = 1, 2, 3 } denote the space of all rigid-body displacements. Then (1) holds (for details see [24, p. 121]). Accordingly, if (2) (with  < 4) and (3) are fulfilled and, moreover, the compatibility condition Z Z Z b i  i d˝ C Fi  i d < j1 () d ˝

F

S

is valid for any  2 R \ { 0 }, then the hypotheses of the theorem mentioned in Remark 9 are satisfied. Consequently, the existence of solutions to the hemivariational inequality (66) is ensured. References 1. Aubin JP, Ekeland I (1984) Applied nonlinear analysis. Wiley, New York 2. Baiocchi C, Buttazzo G, Gastaldi F, Tomarelli F (1988) General existence theorems for unilateral problems in continuum mechanics. Arch Rational Mechanics Anal 100:149–180 3. Brèzis H (1968) Èquations et inèquations non-lin èaires dans les espaces vèctoriels en dualitè. Ann Inst Fourier Grenoble 18:115–176 4. Brèzis H, Nirenberg L (1978) Characterizations of the ranges of some nonlinear operators and applications to boundary value problems. Ann Scuola Norm Sup Pisa Cl Sci IV V(2):225–326 5. Browder FE, Hess P (1972) Nonlinear mappings of monotone type in Banach spaces. J Funct Anal 11:251–294 6. Clarke FH (1983) Optimization and nonsmooth analysis. Wiley, New York 7. Duvaut G, Lions JL (1972) Les inèquations en mècanique et en physique. Dunod, Paris 8. Ekeland I, Temam R (1976) Convex analysis and variational problems. North-Holland, Amsterdam 9. Fichera G (1972) Boundary value problems in elasticity with unilateral constraints. Handbuch der Physik, vol VIa/2. Springer, Berlin, pp 347–389

10. Goeleven D (1996) Noncoercive variational problems and related topics. Res Notes Math, vol 357. Longman 11. Hedberg LI (1978) Two approximation problems in function space. Ark Mat 16:51–81 12. Kufner A, John O, Fuˇcik S (1977) Function spaces. Academia, Prague 13. Motreanu D, Naniewicz Z (1996) Discontinuous semilinear problems in vector-valued function spaces. Differential Integral Eq 9:581–598 14. Motreanu D, Panagiotopoulos PD (1995) Nonconvex energy functions, related eigenvalue hemivariational inequalities on the sphere and applications. J Global Optim 6:163–177 15. Motreanu D, Panagiotopoulos PD (1996) On the eigenvalue problem for hemivariational inequalities: Existence and multiplicity of solutions. J Math Anal Appl 197:75–89 16. Naniewicz Z (1994) Hemivariational inequalities with functions fulfilling directional growth condition. Appl Anal 55:259–285 17. Naniewicz Z (1994) Hemivariational inequality approach to constrained problems for star-shaped admissible sets. J Optim Th Appl 83:97–112 18. Naniewicz Z (1995) Hemivariational inequalities with functionals which are not locally Lipschitz. Nonlinear Anal 25:1307–1320 19. Naniewicz Z (1995) On variational aspects of some nonconvex nonsmooth global optimization problem. J Global Optim 6:383–400 20. Naniewicz Z (1997) Hemivariational inequalities as necessary conditions for optimality for a class of nonsmooth nonconvex functionals. Nonlinear World 4:117–133 21. Naniewicz Z, Panagiotopoulos PD (1995) Mathematical theory of hemivariational inequalities and applications. M. Dekker, New York 22. Panagiotopoulos PD (1981) Nonconvex superpotentials in the sense of F.H. Clarke and applications. Mechanics Res Comm 8:335–340 23. Panagiotopoulos PD (1983) Noncoercive energy function, hemivariational inequalities and substationarity principles. Acta Mechanics 48:160–183 24. Panagiotopoulos PD (1985) Inequality problems in mechanics and applications. Convex and nonconvex energy functions. Birkhäuser, Basel 25. Panagiotopoulos PD (1991) Coercive and semicoercive hemivariational inequalities. Nonlinear Anal 16:209–231 26. Panagiotopoulos PD (1993) Hemivariational inequalities. Applications in mechanics and engineering. Springer, Berlin 27. Rauch J (1977) Discontinuous semilinear differential equations and multiple valued maps. Proc Amer Math Soc 64:277–282 28. Webb JRL (1980) Boundary value problems for strongly nonlinear elliptic equations. J London Math Soc 21:123– 132

Heuristic and Metaheuristic Algorithms for the Traveling Salesman Problem

Heuristic and Metaheuristic Algorithms for the Traveling Salesman Problem YANNIS MARINAKIS Department of Production Engineering and Management, Decision Support Systems Laboratory, Technical University of Crete, Chania, Greece

H

G D (V ; E) of n nodes, with node set V D f1; : : : ; ng, arc set E D f(i; j)ji; j D 1; : : : ; ng, and nonnegative costs cij associated with the arcs [8]: XX c  D min ci j xi j (1) i2V j2V

s.t. X

x i j D 1;

i2V

(2)

x i j D 1;

j2V

(3)

j2V

X MSC2000: 90C59

i2V

XX Article Outline Introduction Heuristics for the Traveling Salesman Problem Metaheuristics for the Traveling Salesman Problem References Introduction The Traveling Salesman Problem (TSP) is one of the most representative problems in combinatorial optimization. If we consider a salesman who has to visit n cities [46], the Traveling Salesman Problem asks for the shortest tour through all the cities such that no city is visited twice and the salesman returns at the end of the tour back to the starting city. Mathematically, the problem may be stated as follows: Let G D (V; E) be a graph, where V is a set of n nodes and E is a set of arcs, let C D [c i j ] be a cost matrix associated with E, where cij represents the cost of going from city i to city j, (i; j D 1; : : : ; n), the problem is to find a permutation (i1 ; i2 ; i3 ; : : : ; i n ) of the integers from 1 through n that minimizes the quantity c i1 i2 C c i2 i3 C : : : C c i n i1 . We speak of a Symmetric TSP, if for all pairs (i, j), the distance cij is equal to the distance cji . Otherwise, we speak of the Asymmetric TSP [7]. If the triangle inequality holds (c i j  c i i 1 Cc i 1 j ; 8i; j; i1 ), the problem is said to be metric. If the cities can be represented as points in the plain such that cij is the Euclidean distance between point i and point j, then the corresponding TSP is called the Euclidean TSP. Euclidean TSP obeys in particular the triangle inequality c i j  c i i 1 C c i 1 j for all i; j; i1 . An integer programming formulation of the Traveling Salesman Problem is defined in a complete graph

x i j  jSj  1;

8S  V ; S ¤ ;

(4)

i2S j2S

x i j 2 f0; 1g;

for all i; j 2 V ;

(5)

where x i j D 1 if arc (i, j) is in the solution and 0 otherwise. In this formulation, the objective function clearly describes the cost of the optimal tour. Constraints (2) and (3) are degree constraints: they specify that every node is entered exactly once and left exactly once. Constraints (4) are subtour elimination constraints. These constraints prohibit the formation of subtours, i. e. tours on subsets of less than V nodes. If there was such a subtour on a subset S of nodes, this subtour would contain jSj edges and as many nodes. Constraints (4) would then be violated for this subset since the left-hand side of (4) would be equal to jSj while the right-hand side would be equal to jSj  1. Because of degree constraints, subtours over one node (and hence, over n  1 nodes) cannot occur. For more formulations of the problem see [34,60]. The Traveling Salesman Problem (TSP) is one of the most famous hard combinatorial optimization problems. It has been proven that TSP is a member of the set of NP-complete problems. This is a class of difficult problems whose time complexity is probably exponential. The members of the class are related so that if a polynomial time algorithm was found for one problem, polynomial time algorithms would exist for all of them [41]. However, it is commonly believed that no such polynomial algorithm exists. Therefore, any attempt to construct a general algorithm for finding optimal solutions for the TSP in polynomial time must (probably) fail. That is, for any such algorithm it is possible to construct problem instances for which the execution time grows at least exponentially with the size of the input. Note, however, that time complexity here

1499

1500

H

Heuristic and Metaheuristic Algorithms for the Traveling Salesman Problem

refers to the worst case behavior of the algorithm. It can not be excluded that there exist algorithms whose average running time is polynomial. The existence of such algorithms is still an open question. Since 1950s many algorithms have been proposed, developed and tested for the solution of the problem. Algorithms for solving the TSP may be divided into two categories, exact algorithms and heuristic–metaheuristic algorithms. Heuristics for the Traveling Salesman Problem There is a great need for powerful heuristics that find good suboptimal solutions in reasonable amounts of computing time. These algorithms are usually very simple and have short running times. There is a huge number of papers dealing with finding near optimal solutions for the TSP. Our aim is to present the most interesting and efficient algorithms and the most important ones for facing practical problems. In the 1960s, 1970s and 1980s the attempts to solve the Traveling Salesman Problem focused on tour construction methods and tour improvement methods. In the last fifteen years, metaheuristics, such as simulated annealing, tabu search, genetic algorithms and neural networks, were introduced. These algorithms have the ability to find their way out of local optima. Heuristics and metaheuristics constitute an increasingly essential component of solution approaches intended to tackle difficult problems, in general, and global and combinatorial problems in particular. When a heuristic is designed, the question which arises is about the quality of the produced solution. There are three different ways that one may try to answer this question. 1. Empirical. The heuristic is applied to a number of test problem instances and the solutions are compared to the optimal values, if there are known, or to lower bounds on these values [33,35]. 2. Worst Case Analysis. The idea is to derive bounds on the worst possible deviation from the optimum that the heuristic could produce and to devise bad problem instances for which the heuristic actually achieves this deviation [42]. 3. Probabilistic Analysis. In the probabilistic analysis it is assumed that problem instances are drawn from certain simple probability distributions, and it is, then, proven mathematically that particular solu-

tion methods are highly likely to yield near-optimal solutions when the number of cities is large [43]. Tour Construction methods build up a tour step by step. Such heuristics build a solution (tour) from scratch by a growth process (usually a greedy one) that terminates as soon as a feasible solution has been constructed. The problem with construction heuristics is that although they are usually fast, they do not, in general, produce very good solutions. One of the simplest tour construction methods is the nearest neighborhood in which, a salesman starts from an arbitrary city and goes to its nearest neighbor. Then, he proceeds from there in the same manner. He visits the nearest unvisited city, until all cities are visited, and then returns to the starting city [65,68]. An extension of the nearest neighborhood method is the double-side nearest neighborhood method where the current path can be extended from both of its endnodes. Some authors use the name Greedy for Nearest Neighborhood, but it is more appropriately reserved for the special case of the greedy algorithm of matroid theory [39]. Bentley [11] proposed two very fast and efficient algorithms, the K-d Trees and the Lazily Update Priority Queues. In his paper, it was the first time that somebody suggested the use of data structures for the solution of the TSP. A priority queue contains items with associated values (the priorities) and support operations that [40]:  remove the highest priority item from the queue and deliver it to the user,  insert an item,  delete an item, and  modify the priority of an item. The insertion procedures [68] take a subtour of V nodes and attempt to determine which node (not already in the subtour) should join the subtour next (the selection step) and then determine where in the subtour it should be inserted (the insertion step). The most known of these algorithms is the nearest insertion algorithm. Similar to the nearest insertion procedure are the cheapest insertion [65], the arbitrary insertion [12], the farthest insertion [65], the quick insertion [12], and the convex hull insertion [12] algorithms. There is a number of heuristic algorithms that are designed for speed rather for quality of the tour they construct [40]. The three most known heuristic algo-

Heuristic and Metaheuristic Algorithms for the Traveling Salesman Problem

rithms of this category are the Strip algorithm, proposed by Beardwood et al. [10], the Spacefilling Curve proposed by Platzmann and Bartholdi [58] and the Fast Recursive Partitioning heuristic proposed by Bentley [11]. The saving algorithms are exchange procedures. The most known of them is the Clarke-Wright algorithm [17]. Christofides [12,65] suggested a procedure for solving the TSP based on spanning trees. He proposed a method of transforming spanning trees to Eulerian graphs. The improvement methods or local search methods start with a tour and try to find all tours that are “neighboring” to it and are shorter than the initial tour and, then, to replace it. The tour improvements methods can be divided into three categories according to the type of the neighborhood that they use [64]. Initially, the constructive neighborhood methods, which successively add new components to create a new solution, while keeping some components of the current solution fixed. Some of these methods will be presented in the next section where the most known metaheuristics are presented. Secondly, the transition neighborhood methods, which are the classic local search algorithms (classic tour improvement methods) and which iteratively move from one solution to another based on the definition of a neighborhood structure. Finally, the population based neighborhood methods, which generalize the two previous categories by considering neighborhoods of more than one solution. The most known of the local search algorithms is the 2-opt heuristic, in which two edges are deleted and the open ends are connected in a different way in order to obtain a new tour [48]. Note that there is only one way to reconnect the paths. The 3-opt heuristic is quite similar with the 2-opt but it introduces more flexibility in modifying the current tour, because it uses a larger neighborhood. The tour breaks into three parts instead of only two [48]. In the general case, ı edges in a feasible tour are exchanged for ı edges not in that solution as long as the result remains a tour and the length of that tour is less than the length of the previous tour. Lin-Kernighan method (LK) was developed by Lin and Kernighan [37,49,54] and for many years was considered to be the best heuristic for the TSP. The Or-opt procedure, well known as node exchange heuristic, was first introduced by Or [56]. It removes a sequence of up-to-three adjacent nodes and inserts it

H

at another location within the same route. Or-opt can be considered as a special case of 3-opt (three arcs exchanges) where three arcs are removed and substituted by three other arcs. The GENI algorithm was presented by Gendreau, Hertz and Laporte [22]. GENI is a hybrid of tour construction and local optimization. Metaheuristics for the Traveling Salesman Problem The last fifteen years an incremental amount of metaheuristic algorithms have been proposed. Simulated annealing, genetic algorithms, neural networks, tabu search, ant algorithms, together with a number of hybrid techniques are the main categories of the metaheuristic procedures. These algorithms have the ability to find their way out of local optima. A number of metaheuristic algorithms have been proposed for the solution of the Traveling Salesman Problem. The most important algorithms published for each metaheuristic algorithm are given in the following:  Simulated Annealing (SA) belongs [1,2,45,64] to a class of local search algorithms that are known as threshold accepting algorithms. These algorithms play a special role within local search for two reasons. First, they appear to be quite successful when applied to a broad range of practical problems. Second, some threshold accepting algorithms such as SA have a stochastic component, which facilitates a theoretical analysis of their asymptotic convergence. Simulated Annealing [3] is a stochastic algorithm that allows random uphill jumps in a controlled fashion in order to provide possible escapes from poor local optima. Gradually the probability allowing the objective function value to increase is lowered until no more transformations are possible. Simulated Annealing owes its name to an analogy with the annealing process in condensed matter physics, where a solid is heated to a maximum temperature at which all particles of the solid randomly arrange themselves in the liquid phase, followed by cooling through careful and slow reduction of the temperature until the liquid is frozen with the particles arranged in a highly structured lattice and minimal system energy. This ground state is reachable only if the maximum temperature is sufficiently high and the cooling sufficiently slow.

1501

1502

H

Heuristic and Metaheuristic Algorithms for the Traveling Salesman Problem

Otherwise a metastable state is reached. The metastable state is also reached with a process known as quenching, in which the temperature is lowered instantaneously. The predecessor of simulated annealing is the so-called Metropolis filter. Simulated Annealing algorithms for the TSP are presented in [15,55,65]; a minimal sketch of the annealing acceptance rule applied to the TSP is given after this list.

• Tabu search (TS) was introduced by Glover [24,25] as a general iterative metaheuristic for solving combinatorial optimization problems. Computational experience has shown that TS is a well-established approximation technique which can compete with almost all known techniques and which, by its flexibility, can beat many classic procedures. It is a form of local neighborhood search. Each solution S has an associated set of neighbors N(S). A solution S′ ∈ N(S) can be reached from S by an operation called a move. TS can be viewed as an iterative technique which explores a set of problem solutions by repeatedly making moves from one solution S to another solution S′ located in the neighborhood N(S) of S [31]. TS moves from a solution to its best admissible neighbor, even if this causes the objective function to deteriorate. To avoid cycling, solutions that have been recently explored are declared forbidden, or tabu, for a number of iterations. The tabu status of a solution is overridden when certain criteria (aspiration criteria) are satisfied. Sometimes, intensification and diversification strategies are used to improve the search. In the first case, the search is accentuated in promising regions of the feasible domain; in the second case, an attempt is made to consider solutions in a broad area of the search space. The first Tabu Search algorithm implemented for the TSP appears to be the one described by Glover [23,29]. Limited results for this implementation and variants on it were reported by Glover [26]. Other Tabu Search algorithms for the TSP are presented in [74].

• Genetic Algorithms (GAs) are search procedures based on the mechanics of natural selection and natural genetics. The first GA was developed by John H. Holland in the 1960s to allow computers to evolve solutions to difficult search and combinatorial problems, such as function optimization and machine learning [38]. Genetic algorithms offer a particularly attractive approach for problems like the traveling salesman problem, since they are generally quite effective for rapid global search of large, non-linear and poorly understood spaces. Moreover, genetic algorithms are very effective in solving large-scale problems. Genetic algorithms mimic the evolution process in nature. GAs are based on an imitation of the biological process in which new and better populations among different species are developed during evolution. Thus, unlike most standard heuristics, GAs use information about a population of solutions, called individuals, when they search for better solutions. A GA is a stochastic iterative procedure that maintains a constant population size in each iteration, called a generation. The basic operation is the mating of two solutions in order to form a new solution. To form a new population, a binary operator called crossover and a unary operator called mutation are applied [61,62]. Crossover takes two individuals, called parents, and produces two new individuals, called offspring, by swapping parts of the parents. Genetic algorithms for the TSP are presented in [9,51,59,64,67].

• The Greedy Randomized Adaptive Search Procedure (GRASP) [66] is an iterative two-phase search method which has gained considerable popularity in combinatorial optimization. Each iteration consists of two phases, a construction phase and a local search procedure. In the construction phase, a randomized greedy function is used to build up an initial solution. This randomized technique provides a feasible solution within each iteration. This solution is then subjected to improvement attempts in the local search phase. The final result is simply the best solution found over all iterations. GRASP algorithms for the TSP are presented in [50,51].

• The use of Artificial Neural Networks to find good solutions to combinatorial optimization problems has recently attracted some attention. A neural network consists of a network [57] of elementary nodes (neurons) that are linked through weighted connections. The nodes represent computational units capable of performing a simple computation, consisting of a summation of the weighted inputs, followed by the addition of a constant called the threshold or bias, and the application of a nonlinear response (activation) function. The result of the computation of a unit constitutes its output. This output is used as an input for the nodes to which it is linked through an outgoing connection. The overall task of the network is to achieve a certain network configuration, for instance a required input-output relation, by means of the collective computation of the nodes. This process is often called self-organization. Neural network algorithms for the TSP are presented in [4,6,53,69].

• The Ant Colony Optimization (ACO) metaheuristic is a relatively new technique for solving combinatorial optimization problems (COPs). Based strongly on the Ant System (AS) metaheuristic developed by Dorigo, Maniezzo and Colorni [19], ant colony optimization is derived from the foraging behavior of real ants in nature. The main idea of ACO is to model the problem as the search for a minimum cost path in a graph. Artificial ants walk through this graph, looking for good paths. Each ant has a rather simple behavior, so that it will typically only find rather poor-quality paths on its own. Better paths are found as the emergent result of the global cooperation among ants in the colony. An ACO algorithm consists of a number of cycles (iterations) of solution construction. During each iteration a number of ants (which is a parameter) construct complete solutions using heuristic information and the collected experiences of previous groups of ants. These collected experiences are represented by a digital analogue of trail pheromone which is deposited on the constituent elements of a solution. Small quantities are deposited during the construction phase, while larger amounts are deposited at the end of each iteration in proportion to solution quality. Pheromone can be deposited on the components and/or the connections used in a solution, depending on the problem. Ant Colony Optimization algorithms for the TSP are presented in [16,18,19,70].

• One way to invest extra computation time is to exploit the fact that many local improvement heuristics have random components, even if only in their initial tour construction phase. Thus, if one runs the heuristic multiple times, one will get different results and can take the best. The Iterated Lin–Kernighan algorithm (ILK) [54] was proposed by Johnson [39] and is considered to be one of the best algorithms for obtaining a first local minimum. To improve this local minimum, the algorithm examines other local minimum tours 'near' the current local minimum. To generate these tours, ILK first applies a random and unbiased nonsequential 4-opt exchange to the current local minimum and then optimizes this 4-opt neighbor using the LK algorithm. If the tour obtained by this process is better than the current local minimum, then ILK makes this tour the current local minimum and continues from there using the same neighbor generation process. Otherwise, the current local minimum remains as it is and further random 4-opt moves are tried. The algorithm stops when a stopping criterion based either on the number of iterations or on the computational time is satisfied. Two other approaches are Iterated 3-opt and Chained Lin–Kernighan [5], where random kicks are generated from the solution and from these new points the exploration for a better solution is continued [40].

• The Ejection Chain Method provides a wide variety of reference structures, which have the ability to generate moves not available to neighborhood search approaches traditionally applied to the TSP [63,64]. Ejection chains are variable depth methods that generate a sequence of interrelated simple moves to create a more complex compound move. An ejection consists of a succession of operations performed on a given set of elements, where the mth operation changes the state of one or more elements, which are said to be ejected in the (m+1)st operation. Of course, changes may also appear in the state of other elements, which will lead to further ejections, until no more operations can be made [27]. Other ejection chain algorithms are presented in [20,21].

• Scatter Search is an evolutionary strategy originally proposed by Glover [28,30]. Scatter Search operates on a set of reference solutions to generate a new set of solutions by weighted linear combinations of structured subsets of solutions. The reference set is required to be made up of high-quality and diverse solutions, and the goal is to produce weighted centers of selected subregions and to project these centers into regions of the solution space that are to be explored by auxiliary heuristic procedures.

• Path Relinking [28,30] combines solutions by generating paths between them using local search neighborhoods, and selecting new solutions encountered along these paths.

• Guided Local Search (GLS), originally proposed by Voudouris and Tsang [71,72], is a general optimization technique suitable for a wide range of combinatorial optimization problems. The main focus is on the exploitation of problem- and search-related information to effectively guide local search heuristics in the vast search spaces of NP-hard optimization problems. This is achieved by augmenting the objective function of the problem to be minimized with a set of penalty terms which are dynamically manipulated during the search process to steer the heuristic being guided. GLS augments the cost function of the problem to include a set of penalty terms and passes this, instead of the original one, for minimization by the local search procedure. Local search is confined by the penalty terms and focuses attention on promising regions of the search space. Iterative calls are made to local search. Each time local search gets caught in a local minimum, the penalties are modified and local search is called again to minimize the modified cost function. Guided Local Search algorithms for the TSP are presented in [71,72].

• The Noising Method was proposed by Charon and Hudry [13]. To minimize a function f, this metaheuristic does not take the true values of f into account but considers them perturbed in some way by noise, in order to obtain a noised function f_noised. During the run of the algorithm, the range of the perturbing noise decreases (typically to zero), so that, at the end, there is no significant noise and the optimization of f_noised leads to the same solution as the one provided by a descent algorithm applied to f with the same initial solution. This algorithm was applied to the Traveling Salesman Problem by Charon and Hudry [14].

• Particle Swarm Optimization (PSO) is a population-based swarm intelligence algorithm. It was originally proposed by Kennedy and Eberhart as a simulation of the social behavior of organisms such as flocking birds and schooling fish [44]. PSO uses the physical movements of the individuals in the swarm and has a flexible and well-balanced mechanism to enhance and adapt the global and local exploration abilities. PSO algorithms for the solution of the Traveling Salesman Problem are presented in [32,47,73].

• Variable Neighborhood Search (VNS) is a metaheuristic for solving combinatorial optimization problems whose basic idea is the systematic change of neighborhood within a local search [36]. Variable Neighborhood Search algorithms for the TSP are presented in [52].
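The acceptance rule of simulated annealing translates directly into code. The following minimal sketch (in Python) applies it to the TSP with random 2-opt moves; it is illustrative only and is not drawn from any of the implementations cited above. The distance-matrix representation, the geometric cooling schedule, and all parameter defaults are assumptions made for the example.

```python
import math
import random

def tour_length(tour, dist):
    """Total length of a closed tour; dist is a symmetric distance matrix."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def simulated_annealing_tsp(dist, temp=100.0, cooling=0.995, iters=20000, seed=0):
    """Simulated annealing for the TSP with random 2-opt moves (sketch)."""
    rng = random.Random(seed)
    n = len(dist)
    tour = list(range(n))
    rng.shuffle(tour)
    cur_len = tour_length(tour, dist)
    best, best_len = tour[:], cur_len
    for _ in range(iters):
        # Propose a 2-opt move: reverse the segment tour[i:j].
        i, j = sorted(rng.sample(range(n), 2))
        cand = tour[:i] + tour[i:j][::-1] + tour[j:]
        cand_len = tour_length(cand, dist)
        delta = cand_len - cur_len
        # Accept improvements always; accept deteriorations with prob exp(-delta/T).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            tour, cur_len = cand, cand_len
            if cur_len < best_len:
                best, best_len = tour[:], cur_len
        temp *= cooling  # geometric cooling schedule (a common simple choice)
    return best, best_len
```

Running the sketch several times with different seeds and keeping the best tour mirrors the multi-start strategy mentioned for local improvement heuristics above.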

References

1. Aarts E, Korst J (1989) Simulated Annealing and Boltzmann Machines – A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley and Sons, Chichester
2. Aarts E, Ten Eikelder HMM (2002) Simulated Annealing. In: Pardalos PM, Resende MGC (eds) Handbook of Applied Optimization. Oxford University Press, Oxford, pp 209–221
3. Aarts E, Korst J, Van Laarhoven P (1997) Simulated Annealing. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. John Wiley and Sons, Chichester, pp 91–120
4. Ansari N, Hou E (1997) Computational Intelligence for Optimization, 1st edn. Kluwer, Boston
5. Applegate D, Cook W, Rohe A (2003) Chained Lin-Kernighan for Large Traveling Salesman Problems. Informs J Comput 15:82–92
6. Bai Y, Zhang W, Jin Z (2006) A New Self-Organizing Maps Strategy for Solving the Traveling Salesman Problem. Chaos Solitons Fractals 28(4):1082–1089
7. Balas E, Fischetti M (2002) Polyhedral Theory for the Asymmetric Traveling Salesman Problem. In: Gutin G, Punnen A (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 117–168
8. Balas E, Toth P (1985) Branch and Bound Methods. In: Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB (eds) The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley and Sons, Chichester, pp 361–401
9. Baraglia R, Hidalgo JI, Perego R (2001) A Hybrid Heuristic for the Traveling Salesman Problem. IEEE Trans Evol Comput 5(6):613–622
10. Beardwood J, Halton JH, Hammersley JM (1959) The Shortest Path Through Many Points. Proc Cambridge Philos Soc 55:299–327
11. Bentley JL (1992) Fast Algorithms for Geometric Traveling Salesman Problems. ORSA J Comput 4:387–411
12. Bodin L, Golden B, Assad A, Ball M (1983) The State of the Art in the Routing and Scheduling of Vehicles and Crews. Comput Oper Res 10:63–212
13. Charon I, Hudry O (1993) The Noising Method: A New Combinatorial Optimization Method. Oper Res Lett 14:133–137
14. Charon I, Hudry O (2000) Applications of the Noising Method to the Traveling Salesman Problem. Eur J Oper Res 125:266–277
15. Chen Y, Zhang P (2006) Optimized Annealing of Traveling Salesman Problem from the nth-Nearest-Neighbor Distribution. Physica A: Stat Theor Phys 371(2):627–632
16. Chu SC, Roddick JF, Pan JS (2004) Ant Colony System with Communication Strategies. Inf Sci 167(1–4):63–76
17. Clarke G, Wright J (1964) Scheduling of Vehicles from a Central Depot to a Number of Delivery Points. Oper Res 12:568–581
18. Dorigo M, Gambardella LM (1997) Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem. IEEE Trans Evol Comput 1(1):53–66
19. Dorigo M, Stützle T (2004) Ant Colony Optimization. A Bradford Book, The MIT Press, Cambridge, Massachusetts/London
20. Gamboa D, Rego C, Glover F (2005) Data Structures and Ejection Chains for Solving Large-Scale Traveling Salesman Problems. Eur J Oper Res 160(1):154–171
21. Gamboa D, Rego C, Glover F (2006) Implementation Analysis of Efficient Heuristic Algorithms for the Traveling Salesman Problem. Comput Oper Res 33(4):1154–1172
22. Gendreau M, Hertz A, Laporte G (1992) New Insertion and Postoptimization Procedures for the Traveling Salesman Problem. Oper Res 40:1086–1094
23. Glover F (1986) Future Paths for Integer Programming and Links to Artificial Intelligence. Comput Oper Res 13:533–549
24. Glover F (1989) Tabu Search I. ORSA J Comput 1(3):190–206
25. Glover F (1990) Tabu Search II. ORSA J Comput 2(1):4–32
26. Glover F (1990) Tabu Search: A Tutorial. Center for Applied Artificial Intelligence, University of Colorado, pp 1–47
27. Glover F (1992) Ejection Chains, Reference Structures and Alternating Path Algorithms for the Traveling Salesman Problem. Discrete Appl Math 65:223–253
28. Glover F (1997) A Template for Scatter Search and Path Relinking. In: Lecture Notes in Computer Science, vol 1363, pp 13–54
29. Glover F, Laguna M (2002) Tabu Search. In: Pardalos PM, Resende MGC (eds) Handbook of Applied Optimization. Oxford University Press, Oxford, pp 194–209
30. Glover F, Laguna M, Marti R (2003) Scatter Search and Path Relinking: Advances and Applications. In: Glover F, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Boston, pp 1–36
31. Glover F, Laguna M, Taillard E, de Werra D (eds) (1993) Tabu Search. J.C. Baltzer AG, Science Publishers, Basel, Switzerland
32. Goldbarg EFG, Souza GR, Goldbarg MC (2006) Particle Swarm Optimization for the Traveling Salesman Problem. EVOCOP 2006, LNCS 3906:99–110
33. Golden BL, Stewart WR (1985) Empirical Analysis of Heuristics. In: Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB (eds) The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley and Sons, Chichester, pp 207–249


34. Gutin G, Punnen A (eds) (2002) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht
35. Haimovich M, Rinnooy Kan AHG, Stougie L (1988) Analysis of Heuristics for Vehicle Routing Problems. In: Golden BL, Assad AA (eds) Vehicle Routing: Methods and Studies. Elsevier Science Publishers, North Holland, pp 47–61
36. Hansen P, Mladenovic N (2001) Variable Neighborhood Search: Principles and Applications. Eur J Oper Res 130:449–467
37. Helsgaun K (2000) An Effective Implementation of the Lin–Kernighan Traveling Salesman Heuristic. Eur J Oper Res 126:106–130
38. Holland JH (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor
39. Johnson DS, McGeoch LA (1997) The Traveling Salesman Problem: A Case Study. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. John Wiley and Sons, Chichester, pp 215–310
40. Johnson DS, McGeoch LA (2002) Experimental Analysis of Heuristics for the STSP. In: Gutin G, Punnen A (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 369–444
41. Johnson DS, Papadimitriou CH (1985) Computational Complexity. In: Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB (eds) The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley and Sons, Chichester, pp 37–85
42. Johnson DS, Papadimitriou CH (1985) Performance Guarantees for Heuristics. In: Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB (eds) The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley and Sons, Chichester, pp 145–181
43. Karp RM, Steele JM (1985) Probabilistic Analysis of Heuristics. In: Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB (eds) The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley and Sons, Chichester, pp 181–206
44. Kennedy J, Eberhart R (1995) Particle Swarm Optimization. Proc 1995 IEEE Int Conf Neural Netw 4:1942–1948
45. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by Simulated Annealing. Science 220:671–680
46. Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB (eds) (1985) The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley and Sons, Chichester
47. Li X, Tian P, Hua J, Zhong N (2006) A Hybrid Discrete Particle Swarm Optimization for the Traveling Salesman Problem. SEAL 2006, LNCS 4247:181–188
48. Lin S (1965) Computer Solutions of the Traveling Salesman Problem. Bell Syst Tech J 44:2245–2269
49. Lin S, Kernighan BW (1973) An Effective Heuristic Algorithm for the Traveling Salesman Problem. Oper Res 21:498–516
50. Marinakis Y, Migdalas A, Pardalos PM (2005) Expanding Neighborhood GRASP for the Traveling Salesman Problem. Comput Optim Appl 32:231–257
51. Marinakis Y, Migdalas A, Pardalos PM (2005) A Hybrid Genetic–GRASP Algorithm Using Lagrangean Relaxation for the Traveling Salesman Problem. J Combinat Optim 10:311–326
52. Mladenovic N, Hansen P (1997) Variable Neighborhood Search. Comput Oper Res 24:1097–1100
53. Modares A, Somhom S, Enkawa T (1999) A Self-Organizing Neural Network Approach for Multiple Traveling Salesman and Vehicle Routing Problems. Int Trans Oper Res 6(6):591–606
54. Neto DM (1999) Efficient Cluster Compensation for Lin–Kernighan Heuristics. PhD Thesis, Department of Computer Science, University of Toronto, Canada
55. Ninio M, Schneider JJ (2005) Weight Annealing. Physica A: Stat Theor Phys 349(3–4):649–666
56. Or I (1976) Traveling Salesman-Type Combinatorial Problems and their Relation to the Logistics of Regional Blood Banking. PhD Thesis, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston IL
57. Söderberg B, Peterson C (1997) Artificial Neural Networks. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. John Wiley and Sons, Chichester, pp 173–214
58. Platzman LK, Bartholdi JJ (1989) Spacefilling Curves and the Planar Traveling Salesman Problem. J Assoc Comput Mach 36:719–735
59. Potvin JY (1996) Genetic Algorithms for the Traveling Salesman Problem. Ann Oper Res 63:339–370
60. Punnen AP (2002) The Traveling Salesman Problem: Applications, Formulations and Variations. In: Gutin G, Punnen A (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 1–28
61. Reeves CR (1995) Genetic Algorithms. In: Reeves CR (ed) Modern Heuristic Techniques for Combinatorial Problems. McGraw-Hill, London, pp 151–196
62. Reeves CR (2003) Genetic Algorithms. In: Glover F, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Dordrecht, pp 55–82
63. Rego C (1998) Relaxed Tours and Path Ejections for the Traveling Salesman Problem. Eur J Oper Res 106:522–538
64. Rego C, Glover F (2002) Local Search and Metaheuristics. In: Gutin G, Punnen A (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 309–367
65. Reinelt G (1994) The Traveling Salesman: Computational Solutions for TSP Applications. Springer, Berlin
66. Resende MGC, Ribeiro CC (2003) Greedy Randomized Adaptive Search Procedures. In: Glover F, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Boston, pp 219–249
67. Ronald S (1995) Routing and Scheduling Problems. In: Chambers L (ed) Practical Handbook of Genetic Algorithms. CRC Press, New York, pp 367–430
68. Rosenkrantz DJ, Stearns RE, Lewis PM (1977) An Analysis of Several Heuristics for the Traveling Salesman Problem. SIAM J Comput 6:563–581
69. Siqueira PH, Steiner MTA, Scheer S (2007) A New Approach to Solve the Traveling Salesman Problem. Neurocomputing 70(4–6):1013–1021
70. Taillard ED (2002) Ant Systems. In: Pardalos PM, Resende MGC (eds) Handbook of Applied Optimization. Oxford University Press, Oxford, pp 130–138
71. Voudouris C, Tsang E (1999) Guided Local Search and its Application to the Travelling Salesman Problem. Eur J Oper Res 113:469–499
72. Voudouris C, Tsang E (2003) Guided Local Search. In: Glover F, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Dordrecht, pp 185–218
73. Wang Y, Feng XY, Huang YX, Pu DB, Zhou WG, Liang YC, Zhou CG (2007) A Novel Quantum Swarm Evolutionary Algorithm and its Applications. Neurocomputing 70(4–6):633–640
74. Zachariasen M, Dam M (1996) Tabu Search on the Geometric Traveling Salesman Problem. In: Osman IH, Kelly JP (eds) Meta-heuristics: Theory and Applications. Kluwer, Boston, pp 571–587

Heuristic Search

ALEXANDER REINEFELD
ZIB Berlin, Berlin, Germany

MSC2000: 68T20, 90B40, 90C47

Article Outline

Keywords and Phrases
Introduction
Depth-First Search
Best-First Search
Applications
See also
References

Keywords and Phrases

Optimization; Heuristic search

Introduction

Heuristic search [7,9] is a common technique for finding a solution in a decision tree or graph containing one or more solutions. Many applications in operations research and artificial intelligence rely on heuristic search as their primary solution method. Heuristic search techniques can be classified into two broad categories: depth-first search (DFS) and best-first search (BFS). As a consequence of its better information base, BFS usually examines fewer nodes but occupies more storage space for maintaining the already explored nodes.

Depth-First Search

DFS expands an initial state by generating its immediate successors. At each subsequent step, one of the most recently generated successors is selected and expanded. At terminal states, or when it can be determined that the current state does not lead to a solution, the search backtracks, that is, the node expansion proceeds with the next most recently generated state. Practical implementations use a stack data structure for maintaining the states (nodes) on the path to the currently explored state. The space complexity of the stack, O(d), increases only linearly with the search depth d.

Backtracking is the most rudimentary variant of DFS. It terminates as soon as any solution has been found; hence, there is no guarantee of finding an optimal (least-cost) solution. Moreover, backtracking might not terminate in graphs containing cycles or when the search depth is unbounded. Depth-first branch and bound (DFBB) [6] employs a heuristic function to eliminate parts of the search space that cannot contain an optimal solution. It continues after finding a first solution until the search space is completely exhausted. Whenever a better solution is found, the current solution path and its value are updated. Inferior subtrees, i.e., subtrees that are known to be worse than the current solution, are eliminated. The alpha-beta algorithm [2] used in game-tree searching is a variant of DFBB that operates on trees with alternating levels of AND and OR nodes [5]. Because the strength of play correlates with the depth of the search, much effort has been spent on devising efficient parallel implementations (cf. Parallel Heuristic Search).
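A minimal sketch of DFBB follows, assuming user-supplied callbacks; the names children, is_leaf, cost, and lower_bound are hypothetical and introduced only for this illustration:

```python
def dfbb(root, children, is_leaf, cost, lower_bound):
    """Depth-first branch and bound (sketch).

    children(s)    -> iterable of successor states of s
    is_leaf(s)     -> True if s is a complete solution
    cost(s)        -> objective value of a complete solution
    lower_bound(s) -> optimistic estimate of the best solution below s
    """
    best_state, best_cost = None, float("inf")
    stack = [root]  # explicit stack: O(d) space for search depth d
    while stack:
        s = stack.pop()
        # Prune subtrees that provably cannot beat the incumbent.
        if lower_bound(s) >= best_cost:
            continue
        if is_leaf(s):
            c = cost(s)
            if c < best_cost:  # update the incumbent solution
                best_state, best_cost = s, c
        else:
            stack.extend(children(s))
    return best_state, best_cost
```

Unlike plain backtracking, this loop does not stop at the first solution; it exhausts the (pruned) search space and is therefore guaranteed to return an optimal leaf.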


Best-First Search

BFS sorts the sequence of node expansions according to a heuristic function. The A* search algorithm [7] uses a heuristic evaluation function f(n) = g(n) + h(n) to decide which successor node n to expand next. Here, g(n) is the cost of the path from the initial state to the current node n and h(n) is the estimated completion cost to a nearest goal state. If h does not overestimate the remaining cost, A* is guaranteed to find an optimal (least-cost) solution: it is said to be admissible. It does so with a minimal number of node expansions [9]; no other search algorithm (with the same heuristic h) can do better. This is possible because A* keeps the search graph in memory, occupying O(w^d) memory cells for trees of width w and depth d.

Best-first frontier search [4] also finds an optimal solution, but with a much lower space complexity than A*. It only keeps the frontier nodes in memory and discards the interior (closed) nodes. Care must be taken to ensure that the search frontier does not contain gaps that would allow the search to leak back into interior regions. The memory savings are most pronounced in directed acyclic graphs. In the worst case, that is, in trees of width w, it still saves a fraction of 1/w of the nodes that BFS would need to store.

Iterative-deepening A* (IDA*) [3] simulates A*'s best-first node expansion by a series of DFSs, each with the cost bound on f(n) increased by the minimal amount. The cost bound is initially set to the heuristic estimate of the root node, h(root). Then, for each iteration, the bound is increased to the minimum value that exceeded the previous bound. Like A*, IDA* is guaranteed to find an optimal solution [3], provided the heuristic estimate function h is admissible and never overestimates the path to the goal. IDA* obeys the same asymptotic branching factor as A* [7], if the number of newly expanded nodes grows exponentially with the search depth [3]. This growth rate, the heuristic branching factor, depends on the average number of applicable operators per node and the discrimination power of the heuristic function h.
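The bound-update loop of IDA* can be sketched as follows; successors, h, and is_goal are assumed callbacks (not from the cited sources), and the recursion presumes a tree-structured search space:

```python
import math

def ida_star(root, successors, h, is_goal):
    """Iterative-deepening A* (sketch).

    successors(n) -> iterable of (child, step_cost) pairs
    h(n)          -> admissible heuristic estimate to the nearest goal
    is_goal(n)    -> True if n is a goal state
    Returns the optimal solution cost, or None if no solution exists.
    """
    def search(node, g, bound):
        f = g + h(node)
        if f > bound:
            return f, None            # minimal over-bound f-value, no solution
        if is_goal(node):
            return f, g               # solution found with cost g
        minimum = math.inf
        for child, step in successors(node):
            t, sol = search(child, g + step, bound)
            if sol is not None:
                return t, sol
            minimum = min(minimum, t)
        return minimum, None

    bound = h(root)                   # initial cost bound: h(root)
    while True:
        t, sol = search(root, 0, bound)
        if sol is not None:
            return sol                # guaranteed optimal for admissible h
        if t == math.inf:
            return None               # search space exhausted without a goal
        bound = t                     # raise the bound by the minimal amount
```

Each outer iteration repeats a depth-first search with a slightly larger bound, trading re-expansion of nodes for the O(d) memory footprint of DFS.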


Applications

Typical applications of heuristic search techniques may be found in many areas, not only in the fields of artificial intelligence and operations research, but also in other parts of computer science. In the two-dimensional rectangular cutting-stock problem [1], we are given a set R_s = {(l_i, w_i) : i = 1, ..., m} of m rectangles of width w_i and length l_i that are to be cut out of a single rectangular stock sheet S. Assuming that S is of width W and that the theoretically unbounded length is L, the problem is to find an optimal cut with minimal length expansion. Since the elements R_i are cut after the cutting pattern has been determined, we can look at the problem as a bin-packing or vehicle-routing problem, both of which are also known to be nondeterministic polynomial-time (NP) complete [8].

Very large scale integration (VLSI) floorplan optimization is a stage in the design of VLSI chips, where the dimensions of the basic building blocks (cells) must be determined, subject to the minimization of the total chip layout area. This can be done with a BFS or a DFBB approach. Again, only small problem cases can be solved optimally, because VLSI floorplan optimization is also NP-complete.

In the satisfiability problem, it must be determined whether a Boolean formula containing binary variables in conjunctive normal form is satisfiable, that is, whether an assignment of truth values to the variables exists for which the formula is true.

The 15-puzzle benchmark in single-agent game-tree search consists of 15 square tiles located in a square tray of size 4 × 4. One square, the "blank square," is kept empty so that an orthogonally adjacent tile can slide into its position, thus leaving an empty position at its origin. The problem is to rearrange a given initial configuration with the fewest moves into a goal configuration without lifting one tile over another. While it would seem easy to obtain any solution, finding optimal (shortest) solutions is NP-complete. The 15-puzzle spawns a search space of 16!/2 ≈ 10^13 states.

See also

• Asynchronous Distributed Optimization Algorithms
• Automatic Differentiation: Parallel Computation
• Load Balancing for Parallel Optimization Techniques
• Parallel Computing: Complexity Classes
• Parallel Computing: Models
• Parallel Heuristic Search
• Stochastic Network Problems: Massively Parallel Solution

References

1. Christofides N, Whitlock C (1977) An algorithm for two-dimensional cutting problems. Oper Res 25(1):30–44
2. Knuth DE, Moore RW (1975) An analysis of alpha-beta pruning. Artif Intell 6(4):293–326
3. Korf RE (1985) Depth-first iterative-deepening: An optimal admissible tree search. Artif Intell 27:97–109
4. Korf RE, Zhang W, Thayer I, Hohwald H (2005) Frontier search. J ACM 52:715–748
5. Kumar V, Nau DS, Kanal L (1988) A general branch-and-bound formulation for AND/OR graph and game-tree search. In: Kanal L, Kumar V (eds) Search in Artificial Intelligence. Springer, New York, pp 91–130
6. Lawler EL, Wood DE (1966) Branch and bound methods: A survey. Oper Res 14:699–719
7. Nilsson NJ (1980) Principles of artificial intelligence. Tioga Publ., Palo Alto
8. Papadimitriou CH, Steiglitz K (1982) Combinatorial optimization: Algorithms and complexity. Prentice-Hall, Englewood Cliffs, NJ
9. Pearl J (1984) Heuristics. Intelligent search strategies for computer problem solving. Addison-Wesley, Reading

Heuristics for Maximum Clique and Independent Set

MARCELLO PELILLO
University Ca' Foscari di Venezia, Venezia Mestre, Italy

MSC2000: 90C59, 05C69, 05C85, 68W01

Article Outline

Keywords
Sequential Greedy Heuristics
Local Search Heuristics
Advanced Search Heuristics
  Simulated Annealing
  Neural Networks
  Genetic Algorithms
  Tabu Search
Continuous Based Heuristics
Miscellaneous
Conclusions
See also
References

Keywords

Heuristics; Algorithms; Clique; Independent set


Throughout this article, G = (V, E) is an arbitrary undirected and weighted graph unless otherwise specified, where V = {1, ..., n} is the vertex set of G and E ⊆ V × V is its edge set. For each vertex i ∈ V, a positive weight w_i is associated with i, collected in the weight vector w ∈ R^n. For a subset S ⊆ V, the weight of S is defined as W(S) = Σ_{i∈S} w_i, and G(S) = (S, E ∩ (S × S)) is the subgraph induced by S. The cardinality of S, i.e., the number of its vertices, will be denoted by |S|.

A graph G = (V, E) is complete if all its vertices are pairwise adjacent, i.e., for all i, j ∈ V with i ≠ j we have (i, j) ∈ E. A clique C is a subset of V such that G(C) is complete. The clique number of G, denoted by ω(G), is the cardinality of the maximum clique. The maximum clique problem asks for cliques of maximum cardinality. The maximum weight clique problem asks for cliques of maximum weight. Given the weight vector w ∈ R^n, the weighted clique number is the total weight of the maximum weight clique, and will be denoted by ω(G, w). We should distinguish a maximum clique from a maximal clique. A maximal clique is one that is not a proper subset of any other clique. A maximum (weight) clique is a maximal clique that has the maximum cardinality (weight).

An independent set (also called a stable set or vertex packing) is a subset of V whose elements are pairwise nonadjacent. The maximum independent set problem asks for an independent set of maximum cardinality. The size of a maximum independent set is the stability number of G, denoted by α(G). The maximum weight independent set problem asks for an independent set of maximum weight. Given the weight vector w ∈ R^n, the weighted stability number, denoted α(G, w), is the weight of the maximum weight independent set. The complement graph of G = (V, E) is the graph Ḡ = (V, Ē), where Ē = {(i, j) : i, j ∈ V, i ≠ j and (i, j) ∉ E}. It is easy to see that S is a clique of G if and only if S is an independent set of Ḡ. Any result or algorithm obtained for one of the two problems has its equivalent form for the other one. Hence α(G) = ω(Ḡ) and, more generally, α(G, w) = ω(Ḡ, w).

The maximum clique and independent set problems are well-known examples of intractable combinatorial optimization problems [18]. Apart from the theoretical interest around these problems, they also find practical applications in such diverse domains as computer vision, experimental design, information retrieval, fault tolerance, etc. Moreover, many important problems turn out to be easily reducible to them, and these include, for example, the Boolean satisfiability problem, the subgraph isomorphism problem, and the vertex covering problem. The maximum clique problem also has a certain historical value, as it was one of the first problems shown to be NP-complete in the now classical paper of R.M. Karp on computational complexity [64].

Due to their inherent computational complexity, exact algorithms are guaranteed to return a solution only in a time which increases exponentially with the number of vertices in the graph, and this makes them inapplicable even to moderately large problem instances. Moreover, a series of recent theoretical results shows that the problems are in fact difficult to solve even in terms of approximation. Strong evidence of this fact came in 1991, when it was proved in [32] that if there is a polynomial time algorithm that approximates the maximum clique within a factor of 2^(log^(1−ε) n), then any NP-hard problem can be solved in 'quasi-polynomial' time (i.e., in 2^(log^O(1) n) time). The result was further refined in [6,7] one year later. Specifically, it was proved that there exists an ε > 0 such that no polynomial time algorithm can approximate the size of the maximum clique within a factor of n^ε, unless P = NP. Developments along these lines can be found in [14,15,49].

In light of these negative results, much effort has recently been directed towards devising efficient heuristics for maximum clique and independent set, for which no formal guarantee of performance may be provided, but which are anyway of interest in practical applications. Lacking (almost by definition) a general theory of how these algorithms work, their evaluation is essentially based on massive experimentation. In order to facilitate comparisons among different heuristics, a set of benchmark graphs arising from different applications and problems has been constructed in conjunction with the 1993 DIMACS challenge on cliques, coloring and satisfiability [63]. In this article we provide an informal survey of recent heuristics for maximum clique and related problems, and up-to-date bibliographic pointers to the relevant literature. A more comprehensive review and bibliography can be found in [18].
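To fix the notation above, the following minimal Python sketch verifies the clique/independent-set duality on a toy graph; the edge-set representation and all helper names are assumptions made purely for illustration:

```python
from itertools import combinations

def is_clique(S, edges):
    """True if every pair of vertices in S is adjacent (edges is a set of pairs)."""
    return all((i, j) in edges or (j, i) in edges for i, j in combinations(S, 2))

def is_independent_set(S, edges):
    """True if no pair of vertices in S is adjacent."""
    return all((i, j) not in edges and (j, i) not in edges
               for i, j in combinations(S, 2))

def complement_edges(V, edges):
    """Edge set of the complement graph of (V, edges)."""
    return {(i, j) for i, j in combinations(V, 2)
            if (i, j) not in edges and (j, i) not in edges}

# S is a clique of G exactly when S is an independent set of the complement:
V = range(1, 5)
edges = {(1, 2), (1, 3), (2, 3)}   # a triangle on {1, 2, 3} plus isolated vertex 4
S = {1, 2, 3}
assert is_clique(S, edges)
assert is_independent_set(S, complement_edges(V, edges))
```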


Sequential Greedy Heuristics

Many approximation algorithms in the literature for the maximum clique problem are called sequential greedy heuristics. These heuristics generate a maximal clique through the repeated addition of a vertex into a partial clique, or the repeated deletion of a vertex from a set that is not a clique. Decisions on which vertex is to be added in or moved out next are based on certain indicators associated with the candidate vertices, for example, the vertex degree. There is also a distinction between heuristics that update the indicators every time a vertex is added in or moved out, and those that do not. Examples of such heuristics can be found in [62,89]. The differences among these heuristics are their choice of indicators and how the indicators are updated. A heuristic of this type can run very fast; a sketch of the vertex-addition variant follows.
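This sketch uses the plain vertex degree as the indicator, recomputed (i.e., updated) within the shrinking candidate set at every step; the neighbors mapping is an assumed representation, not one taken from [62,89]:

```python
def greedy_clique(vertices, neighbors):
    """Sequential greedy heuristic for maximum clique (vertex-addition variant).

    neighbors[v] is the set of vertices adjacent to v (no self-loops).
    Returns a maximal, not necessarily maximum, clique.
    """
    clique = set()
    candidates = set(vertices)
    while candidates:
        # Indicator: degree restricted to the remaining candidate set.
        v = max(candidates, key=lambda u: len(neighbors[u] & candidates))
        clique.add(v)
        candidates &= neighbors[v]   # keep only vertices adjacent to all chosen
    return clique
```

The loop stops when no vertex is adjacent to the whole partial clique, so the result is maximal by construction; one call runs in roughly O(n^2) set operations.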

Local Search Heuristics

Let us define C_G to be the set of all the maximal cliques of G. Basically, a sequential greedy heuristic finds one set in C_G, hoping it is (close to) the optimal set, and stops. A possible way to improve our approximate solutions is to expand the search in C_G. For example, once we find a set S ∈ C_G, we can search its 'neighbors' to improve S. This leads to the class of local search heuristics [2]. Depending on the neighborhood structure and how the search is performed, different local search heuristics result. A well-known class of local search heuristics in the literature is the k-interchange heuristics. They are based on the k-neighbors of a feasible solution. In the case of the maximum clique problem, a set C ∈ C_G is a k-neighbor of S if |C Δ S| ≤ k, where k ≤ |S|. A k-interchange heuristic first finds a maximal clique S ∈ C_G, then it searches all the k-neighbors of S and returns the best clique found. Clearly, the main factors in the complexity of this class of heuristics are the size of the neighborhood and the searches involved. For example, in the k-interchange heuristic, the complexity grows roughly as O(n^k).

A class of heuristics designed to search various sets of C_G is called randomized heuristics. The main ingredient of this class of heuristics is the part that finds a random set in C_G. A possible way to do that is to include some random factors in the generation of a set of C_G. A randomized heuristic runs a heuristic (with random factors included) a number of times to find different sets over C_G. For example, we can randomize a sequential greedy heuristic and let it run N times. The complexity of a randomized heuristic depends on the complexity of the underlying heuristic and the number N. An elaborate implementation of a randomized heuristic for the maximum independent set problem can be found in [33], where local search is combined with a randomized heuristic. The computational results reported there indicated that the approach was effective in finding large cliques of randomly generated graphs. A different implementation of a randomized algorithm for the maximum independent set problem can be found in [5].

Advanced Search Heuristics

Local search algorithms are only capable of finding local solutions of an optimization problem. Powerful variations of the basic local search procedure have been developed which try to avoid this problem, many of which are inspired by various phenomena occurring in nature.

Simulated Annealing

In condensed-matter physics, the term 'annealing' refers to a physical process used to obtain a pure lattice structure, where a solid is first heated up in a heat bath until it melts, and next cooled down slowly until it solidifies into a low-energy state. During the process, the free energy of the system is minimized. Simulated annealing, introduced in 1983 by S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi [65], is a randomized neighborhood search algorithm based on the physical annealing process. Here, the solutions of a combinatorial optimization problem correspond to the states of the physical system, and the cost of a solution is equivalent to the energy of the state. In its original formulation, simulated annealing works essentially as follows. Initially, a tentative solution in the state space is somehow generated. A new neighboring state is then produced from the previous one and, if the value of the cost function f improves, the new state is accepted; otherwise it is accepted with probability exp{−Δf/T}, where Δf is the difference of the cost function between the new and the current state, and T is a parameter usually called the temperature, in analogy with physical annealing, which is varied carefully during the optimization process. The algorithm proceeds iteratively this way until a stopping condition is met. One of the critical aspects of the algorithm relates to the choice of a proper 'cooling schedule', i.e., how to decrease the temperature as the process evolves. While a logarithmically slow cooling schedule (yielding an exponential-time algorithm) provably guarantees the exact solution, faster cooling schedules, producing acceptably good results, are in widespread use. Introductory textbooks describing both theoretical and practical issues of the algorithm are [1,66].

E. Aarts and J. Korst [1], without presenting any experimental results, suggested the use of simulated annealing for solving the independent set problem, using a penalty function approach. Here, the solution space is the set of all possible subsets of vertices of the graph G, and the problem is formulated as one of maximizing the cost function f(V′) = |V′| − λ|E′|, where |E′| is the number of edges in G(V′) and λ is a weighting factor exceeding 1. M. Jerrum [61] conducted a theoretical analysis of the performance of a clique-finding Metropolis process, i.e., simulated annealing at fixed temperature, on random graphs. He proved that the expected time for the algorithm to find a clique that is only slightly bigger than that produced by a naive greedy heuristic grows faster than any polynomial in the number of vertices. This suggests that 'true' simulated annealing would be ineffective for the maximum clique problem. Jerrum's conclusion seems to be contradicted by practical experience. In [56], S. Homer and M. Peinado compare the performance of three heuristics, namely the greedy heuristic developed in [62], a randomized version of the Boppana–Halldórsson subgraph-exclusion algorithm [24], and simulated annealing, over very large graphs. The simulated annealing algorithm was essentially that proposed by Aarts and Korst, with a simple cooling schedule. This penalty function approach was found to work better than the method in which only cliques are considered, as proposed in [61]. The algorithms were tested on various random graphs as well as on DIMACS benchmark graphs. The authors ran the algorithms on an SGI workstation for graphs with up to 10,000 vertices, and on a Connection Machine for graphs with up to 70,000 vertices. The overall conclusion was that simulated annealing outperforms


the other competing algorithms; it also ranked among the best heuristics for maximum clique presented at the 1993 DIMACS challenge [63].

Neural Networks

Artificial neural networks (often simply referred to as 'neural networks') are massively parallel, distributed systems inspired by the anatomy and physiology of the cerebral cortex, which exhibit a number of useful properties such as learning and adaptation, universal approximation, and pattern recognition (see [50,52] for an introduction). In the mid-1980s, J.J. Hopfield and D.W. Tank [57] showed that certain continuous feedback neural models are capable of finding approximate solutions to difficult optimization problems such as the traveling salesman problem [57]. This application was motivated by the property that the temporal evolution of these models is governed by a quadratic Liapunov function (typically called an 'energy function' because of its analogy with physical systems) which is iteratively minimized as the process evolves. Since then, a variety of combinatorial optimization problems have been tackled within this framework. The customary approach is to formulate the original problem as one of energy minimization, and then to use a proper relaxation network to find minimizers of this function. Almost invariably, the algorithms developed so far incorporate techniques borrowed from statistical mechanics, in particular mean field theory, which allow one to escape from poor local solutions. We mention the articles [69,82] and the textbook [88] for surveys of this field. In [1], an excellent introduction to a particular class of neural networks (the Boltzmann machine) for combinatorial optimization is provided.

Early attempts at encoding the maximum clique and related problems in terms of a neural network were made in the late 1980s in [1,12,44,83], and [84] (see also [85]). However, few or no experimental results were presented, thereby making it difficult to evaluate the merits of these algorithms. In [68], F. Lin and K. Lee used the quadratic zero-one formulation from [78] as the basis for their neural network heuristic. On random graphs with up to 300 vertices, they found their algorithm to be faster than the implicit enumerative algorithm in [26], while obtaining slightly worse results in terms of clique size.
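To illustrate the energy-minimization encoding used by these networks, the following discrete Hopfield-style descent for maximum independent set minimizes a penalty energy analogous to the |V′| − λ|E′| cost above. It is a generic sketch, not any of the specific models cited in this section; the representation and parameter defaults are assumptions:

```python
def hopfield_independent_set(neighbors, n, lam=2.0, sweeps=50):
    """Discrete Hopfield-style descent for maximum independent set (sketch).

    State x[i] in {0, 1} selects vertex i. The energy
        E(x) = -sum_i x[i] + lam * sum_{(i,j) in E} x[i]*x[j]
    rewards large subsets and penalizes selected edges; with lam > 1 every
    stable state of the asynchronous dynamics is an independent set.
    """
    x = [0] * n
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            # Gain of setting x[i] = 1, given the current state of the rest.
            penalty = lam * sum(x[j] for j in neighbors[i])
            new_state = 1 if 1.0 - penalty > 0 else 0
            if new_state != x[i]:
                x[i], changed = new_state, True
        if not changed:
            break   # a stable state (local energy minimum) has been reached
    return {i for i in range(n) if x[i] == 1}
```

Each asynchronous update can only lower the energy, so the dynamics always settles into a local minimum; the annealing and reinforcement strategies discussed next are attempts to escape poor minima of exactly this kind.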


T. Grossman [45] proposed a discrete, deterministic version of the Hopfield model for maximum clique, originally designed for an all-optical implementation. The model has a threshold parameter which determines the character of the stable states of the network. The author suggests an annealing strategy on this parameter, and an adaptive procedure to choose the network's initial state and threshold. On DIMACS graphs the algorithm performs satisfactorily, but it does not compare well with more powerful heuristics such as simulated annealing. A. Jagota [58] developed several variations of the Hopfield model, both discrete and continuous, to approximate maximum clique. He evaluated the performance of his algorithms over randomly generated graphs as well as on harder graphs obtained by generating cliques of varying size at random and taking their union. Experiments on graphs coming from the Solomonoff–Levin, or 'universal', distribution are also presented in [59]. The best results were obtained using a stochastic steepest descent dynamics and a mean-field annealing algorithm, an efficient deterministic approximation of simulated annealing. These algorithms, however, were also the slowest, and this motivated Jagota et al. [60] to improve their running time. The mean-field annealing heuristic was implemented on a 32-processor Connection Machine, and a two-temperature annealing strategy was used. Additionally, a 'reinforcement learning' strategy was developed for the stochastic steepest descent heuristic, to automatically adjust its internal parameters as the process evolves. On various benchmark graphs, all their algorithms obtained significantly larger cliques than other, simpler heuristics but ran slightly slower. Compared to more sophisticated heuristics, they obtained significantly smaller cliques on average but were considerably faster. M. Pelillo [80] takes a completely different approach to the problem, by exploiting a continuous formulation of maximum clique and the dynamical properties of the so-called relaxation labeling networks. His algorithm is described in the next section.

Genetic Algorithms

Genetic algorithms are parallel search procedures inspired by the mechanisms of evolution in natural systems [45,55]. In contrast to more traditional optimization techniques, they work on a population of points, which, in genetic algorithm terminology, are called chromosomes or individuals. In the simplest and most popular implementation, chromosomes are simply long strings of bits. Each individual has an associated 'fitness' value which determines its probability of survival in the next 'generation': the higher the fitness, the higher the probability of survival. The genetic algorithm starts out with an initial population of members generally chosen at random and, in its simplest version, makes use of three basic operators: reproduction, crossover and mutation. Reproduction usually consists of choosing the chromosomes to be copied in the next generation according to a probability proportional to their fitness. After reproduction, the crossover operator is applied between pairs of selected individuals to produce new offspring. The operator consists of swapping two or more subsegments of the strings corresponding to the two chosen individuals. Finally, the mutation operator is applied, which randomly reverses the value of every bit within a chromosome with a fixed probability. The procedure just described is sometimes referred to as the 'simple' genetic algorithm [45].

One of the earliest attempts to solve the maximum clique problem using genetic algorithms was made in 1993 by B. Carter and K. Park [27]. After showing the weakness of the simple genetic algorithm in finding large cliques, even on small random graphs, they introduced several modifications in an attempt to improve performance. However, despite their efforts they did not get satisfactory results, and their general conclusion was that genetic algorithms need to be heavily customized in order to be competitive with traditional approaches, and that they are computationally very expensive. In a later study [79], genetic algorithms were shown to be less effective than simulated annealing. At almost the same time, T. Bäck and S. Khuri [8], working on the maximum independent set problem, arrived at the opposite conclusion. By using a straightforward, general-purpose genetic algorithm called GENEsYs and a suitable fitness function which included a graded penalty term to penalize infeasible solutions, they got interesting results over random and regular graphs with up to 200 vertices. These results indicate that the choice of the fitness function is crucial for genetic algorithms to provide satisfactory results.
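The GENEsYs fitness function itself is not reproduced here; a generic graded-penalty fitness of the kind just described might look as follows, with the bit-string representation and the penalty weight as illustrative assumptions:

```python
def fitness(bits, edges, penalty=2.0):
    """Graded penalty fitness for maximum independent set (illustrative).

    bits[i] = 1 selects vertex i; each edge with both endpoints selected
    violates feasibility and subtracts a graded penalty, so infeasible
    strings score worse the more constraints they violate, while still
    giving the search a gradient toward feasibility.
    """
    size = sum(bits)
    violations = sum(1 for i, j in edges if bits[i] and bits[j])
    return size - penalty * violations
```

A penalty weight above 1 guarantees that removing a violating vertex never decreases the fitness, so the best strings of a converged population tend to be feasible.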


A.S. Murthy et al. [74] also experimented with a genetic algorithm using a novel 'partial copy crossover' and a modified mutation operator. However, they presented results over very small (i.e., up to 50 vertices) graphs, thereby making it difficult to properly evaluate the algorithm. T.N. Bui and P.H. Eppley [25] obtained encouraging results by using a hybrid strategy which incorporates a local optimization step at each generation of the genetic algorithm, and a vertex-ordering preprocessing phase. They tested the algorithm over some DIMACS graphs, obtaining results comparable to those in [39]. Instead of using the standard binary representation for chromosomes, J.A. Foster and T. Soule [36] employed an integer-based encoding scheme. Moreover, they used a time-weighted fitness function similar in spirit to those in [27]. The results obtained are interesting, but still not comparable to those obtained using more traditional search heuristics. C. Fleurent and J.A. Ferland [35] developed a general-purpose system for solving graph coloring, maximum clique, and satisfiability problems. As far as the maximum clique problem is concerned, they conducted several experiments using a hybrid genetic search scheme which incorporates tabu search and other local search techniques as alternative mutation operators. The results presented are encouraging, but the running time is quite high. In [53], M. Hifi modifies the basic genetic algorithm in several respects: a) a particular crossover operator creates two new, different children; b) the mutation operator is replaced by a specific heuristic feasibility transition adapted to the weighted maximum stable set problem. This approach is also easily parallelizable. Experimental results on randomly generated graphs and also on some (unweighted) instances from the DIMACS testbed [63] are reported to validate this approach. Finally, E. Marchiori [71] has developed a simple heuristic-based genetic algorithm which consists of a combination of the simple genetic algorithm and a naive greedy heuristic procedure. Unlike previous approaches, here there is a neat division of labor, the search for a large subgraph and the search for a clique being incorporated into the fitness function and the heuristic procedure, respectively. The algorithm outperforms previous genetic-based clique-finding procedures over various DIMACS graphs, both in terms of quality of solutions and speed.

Tabu Search

Tabu search, introduced independently by F. Glover [41,42] and P. Hansen and B. Jaumard [48], is a modified local search algorithm in which a prohibition-based strategy is employed to avoid cycles in the search trajectories and to explore new regions in the search space. At each step of the algorithm, the next solution visited is always chosen to be the best legal neighbor of the current state, even if its cost is worse than the current solution. The set of legal neighbors is restricted by one or more tabu lists, which prevent the algorithm from going back to recently visited solutions. These lists are used to store historical information on the path followed by the search procedure. Sometimes the tabu restriction is relaxed, and tabu solutions are accepted if they satisfy some aspiration level condition. The standard example of a tabu list is one which contains the last k solutions examined, where k may be fixed or variable. Additional lists containing the last modifications performed, i.e., changes that occurred when moving from one solution to the next, are also common. These types of lists are referred to as short-term memories; other forms of memory are also used to intensify the search in a promising region or to diversify the search to unexplored areas. Details on the algorithm and its variants can be found in [43] and [51].

In 1989, C. Friden et al. [37] proposed a heuristic for the maximum independent set problem based on tabu search. The size of the independent set to search for is fixed, and the algorithm tries to minimize the number of edges in the current subset of vertices. They used three tabu lists: one for storing the last visited solutions, and the other two to contain the last introduced/deleted vertices. They showed that by using hashing to implement the first list and choosing a small value for the dimensions of the other two lists, a best neighbor may be found in almost constant time. A sketch of a basic add/drop tabu scheme for maximum clique is given below.
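The sketch below is a minimal add/drop tabu scheme in the spirit of the algorithms discussed here; the single vertex-based tabu list, the tenure, the tie-breaking rule, and all defaults are illustrative assumptions rather than any of the cited implementations:

```python
import random

def tabu_clique(vertices, neighbors, iters=1000, tenure=7, seed=0):
    """Basic tabu search for maximum clique (sketch).

    Moves add or drop a single vertex; a vertex involved in a recent move
    is tabu for `tenure` iterations, unless the move would beat the best
    clique found so far (a simple aspiration criterion).
    """
    rng = random.Random(seed)
    current, best = set(), set()
    tabu_until = {v: -1 for v in vertices}
    for t in range(iters):
        # Vertices adjacent to every member of the current clique.
        additions = [v for v in vertices
                     if v not in current and current <= neighbors[v]]
        allowed = [v for v in additions
                   if tabu_until[v] < t or len(current) + 1 > len(best)]
        if allowed:
            # Prefer additions that keep many future candidates available.
            v = max(allowed, key=lambda u: len(neighbors[u] & set(additions)))
            current.add(v)
        elif current:
            v = rng.choice(sorted(current))   # no admissible addition: drop one
            current.remove(v)
        else:
            break                              # degenerate case: empty graph
        tabu_until[v] = t + tenure             # forbid reversing this move soon
        if len(current) > len(best):
            best = set(current)
    return best
```

Dropping a vertex even though it worsens the current clique is exactly the controlled deterioration that lets tabu search leave local optima; the tabu tenure prevents the dropped vertex from being re-added immediately.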


In [38,86], three variants of tabu search for maximum clique are presented. Here the search space consists of complete subgraphs whose size has to be maximized. The first two versions are deterministic algorithms in which no sampling of the neighborhood is performed. The main difference between the two algorithms is that the first one uses just one tabu list (of the last solutions visited), while the second one uses an additional list (with an associated aspiration mechanism) containing the last vertices deleted. Also, two diversification strategies were implemented. The third algorithm is probabilistic in nature, and uses the same two tabu lists and aspiration mechanism as the second one. It differs from it because it performs a random sampling of the neighborhood, and also because it allows for multiple deletions of vertices in the current solution. Here no diversification strategy was used. In [38,86] results on randomly generated graphs were presented and the algorithms were shown to be very effective. P. Soriano and M. Gendreau [87] tested their algorithms over the DIMACS benchmark graphs, and the results confirmed the early conclusions.

R. Battiti and M. Protasi [13] extended the tabu search framework by introducing a reactive local search method. They modified a previously introduced reactive scheme by exploiting the particular neighborhood structure of the maximum clique problem. In general, reactive schemes aim at avoiding the manual selection of control parameters by means of an internal feedback loop. Battiti and Protasi's algorithm adopts such a strategy to automatically determine the so-called prohibition parameter k, i.e., the size of the tabu list. Also, an explicit memory-influenced restart procedure is activated periodically to introduce diversification. The search space consists of all possible cliques, as in the approach by Friden et al., and the function to be maximized is the clique size. The worst-case computational complexity of this algorithm is O(max{n, m}), where n and m are the numbers of vertices and edges of the graph, respectively. They noticed, however, that in practice the number of operations tends to be proportional to the average degree of the vertices of the complement graph. They tested their algorithm over many DIMACS benchmark graphs, obtaining better results than those presented at the DIMACS workshop, in competitive time.

Continuous Based Heuristics

In 1965, T.S. Motzkin and E.G. Straus [73] established a remarkable connection between the maximum clique problem and a certain quadratic programming problem. Let G = (V, E) be an undirected (unweighted) graph and let Δ denote the standard simplex in the n-dimensional Euclidean space R^n:

Δ = { x ∈ R^n : x_i ≥ 0 for all i ∈ V, eᵀx = 1 },

where the letter e is reserved for a vector of appropriate length, consisting of unit entries (hence eᵀx = Σ_{i∈V} x_i). Now, consider the following quadratic function, sometimes called the Lagrangian of G:

g(x) = xᵀ A_G x,   (1)

where A_G = (a_ij) is the adjacency matrix of G, i.e., the symmetric n × n matrix where a_ij = 1 if (i, j) ∈ E and a_ij = 0 if (i, j) ∉ E, and let x* be a global maximizer of g on Δ. In [73] it is proved that the clique number of G is related to g(x*) by the following formula:

ω(G) = 1 / (1 − g(x*)).

Additionally, it is shown that a subset of vertices S is a maximum clique of G if and only if its characteristic vector x^S, which is the vector of Δ defined as x^S_i = 1/|S| if i ∈ S and x^S_i = 0 otherwise, is a global maximizer of g on Δ. In [40,81], the Motzkin–Straus theorem has been extended by providing a characterization of maximal cliques in terms of local maximizers of g on Δ. One drawback associated with the original Motzkin–Straus formulation relates to the existence of spurious solutions, i.e., maximizers of g which are not in the form of characteristic vectors [77,81]. In principle, spurious solutions represent a problem since, while providing information about the cardinality of the maximum clique, they do not allow us to easily extract its vertices. During the 1990s, there has been much interest around the Motzkin–Straus and related continuous formulations of the maximum clique problem. They suggest in fact a fundamentally new way of solving the maximum clique problem, by allowing us to shift from the discrete to the continuous domain in an elegant manner. As pointed out in [76], continuous formulations of discrete optimization problems turn out to be particularly attractive. They not only allow us to exploit the full arsenal of continuous optimization techniques, thereby leading to the development of new algorithms, but may also reveal unexpected theoretical properties.
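The Motzkin–Straus program can be attacked with very simple dynamics. The sketch below iterates discrete-time replicator equations, of the kind discussed in the following paragraphs, on the regularized objective introduced in (2) below; it is illustrative only (thresholds, iteration limits, and the starting point are assumptions), not the specific procedures of [39,77,80]:

```python
import numpy as np

def replicator_clique(adjacency, iters=2000, tol=1e-12):
    """Replicator dynamics for the regularized Motzkin-Straus program (sketch).

    The update x_i <- x_i * (A x)_i / (x' A x) leaves the simplex invariant
    and never decreases x' A x for a nonnegative symmetric A. Taking
    A = A_G + I/2, the regularization of Eq. (2) below, makes every local
    maximizer the characteristic vector of a maximal clique, so the support
    of the limit point identifies the clique directly.
    """
    A = np.asarray(adjacency, dtype=float) + 0.5 * np.eye(len(adjacency))
    x = np.full(len(A), 1.0 / len(A))           # start at the simplex barycenter
    for _ in range(iters):
        Ax = A @ x
        x_next = x * Ax / (x @ Ax)               # discrete-time replicator step
        if np.abs(x_next - x).sum() < tol:
            x = x_next
            break
        x = x_next
    return set(np.flatnonzero(x > 1e-6))         # heuristic support threshold
```

Because the objective is nondecreasing along trajectories, the iteration behaves as a parameter-free local optimizer; like any local method, it may stop at a maximal clique far smaller than the maximum one.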


In [77], P.M. Pardalos and A.T. Phillips developed a global optimization approach based on the Motzkin–Straus formulation and implemented an iterative clique retrieval process to find the vertices of the maximum clique. However, due to its high computational cost they were not able to run the algorithm over graphs with more than 75 vertices. Pelillo [80] used relaxation labeling algorithms to approximately determine the size of the maximum clique using the original Motzkin–Straus formulation. These are parallel, distributed algorithms developed and studied in computer vision and pattern recognition, which are also surprisingly related to replicator equations, a class of dynamical systems widely studied in evolutionary game theory and related fields [54]. The model operates in the simplex Δ and possesses a quadratic Liapunov function which drives its dynamical behavior. It is these properties that naturally suggest using them as a local optimization algorithm for the Motzkin–Straus program. The algorithm is especially suited for parallel implementation, and is attractive for its operational simplicity, since no parameters need to be determined. Extensive simulations over random graphs with up to 2000 vertices have demonstrated the effectiveness of the approach and showed that the algorithm outperforms previous neural network heuristics.

In order to avoid time-consuming iterative procedures to extract the vertices of the clique, L.E. Gibbons, D.W. Hearn and Pardalos [39] have proposed a heuristic which is based on a parameterized formulation of the Motzkin–Straus program. They consider the problem of minimizing the function

h(x) = ½ xᵀ A_G x + ( Σ_{i=1}^n x_i − 1 )²

on the domain

S(k) = { x ∈ R^n : Σ_{i=1}^n x_i² = 1/k, x_i ≥ 0 ∀i },

where k is a fixed parameter. Let x* be a global minimizer of h on S(k), and let V(k) = h(x*). In [39] it is proved that V(k) = 0 if and only if there exists an independent set S of G with size |S| ≥ k. Moreover, the vertices of G associated with the indices of the positive components of x* form an independent set of size greater than or equal to k. These properties motivated the following procedure to find a maximum independent set of G or, equivalently, a maximum clique of Ḡ. Minimize the function h over S(k) for different values of k between predetermined upper and lower bounds. If V(k) = 0 and V(k+1) ≠ 0 for some k, then the maximum clique of Ḡ has size k, and its vertices are determined by the positive components of the solution. Since minimizing h on S(k) is a difficult problem, they developed a heuristic based on the observation that, by removing the nonnegativity constraints, the problem becomes that of minimizing a quadratic form over a sphere, a problem which is solvable in polynomial time. However, in so doing a heuristic procedure is needed to round the approximate solutions of this new problem to approximate solutions of the original one. Moreover, since the problem is solved approximately, we have to find the value of the spherical constraint 1/k which yields the largest independent set. A careful choice of k is therefore needed. The resulting algorithm was tested over various DIMACS benchmark graphs [63] and the results obtained confirmed the effectiveness of the approach.

The spurious solution problem has been solved in [16]. Consider the following regularized version of the function g:

ĝ(x) = xᵀ A_G x + ½ xᵀ x,   (2)

which is obtained from (1) by substituting the adjacency matrix A_G of G with

Â_G = A_G + ½ I,

where I is the identity matrix. Unlike the Motzkin–Straus formulation, it can be proved that all maximizers of ĝ on Δ are strict, and are characteristic vectors of maximal/maximum cliques in the graph. In an exact sense, therefore, a one-to-one correspondence exists between maximal cliques and local maximizers of ĝ on Δ on the one hand, and maximum cliques and global maximizers on the other hand. In [16,20], replicator equations are used in conjunction with this spurious-free formulation to find maximal cliques of G. Note that here the vertices comprising the clique are directly given by the positive components of the converged vectors, and no iterative procedure is needed to determine

1515

1516

H

Heuristics for Maximum Clique and Independent Set

them, as in [77]. The results obtained over a set of random as well as DIMACS benchmark graphs were encouraging, especially considering that replicator equations do not incorporate any mechanism to escape from local optimal solutions. This suggests that the basins of attraction of the global solution with respect to the quadratic functions g and b g are quite large; for a thorough empirical analysis see also [23]. One may wonder whether a subtle choice of initial conditions and/or a variant of the dynamics may significantly improve the results, but experiments in [22] indicate this is not the case. In [19] the properties of the following function are studied: b g ˛ (x) D x > A G x C ˛x > x: It is shown that when ˛ is positive all the properties enjoyed by the standard regularization approach [16] hold true. Specifically, in this case a one-to-one correspondence between local/global maximizers in the continuous space and local/global solutions in the discrete space exists. For negative ˛’s an interesting picture emerges: as the absolute value of ˛ grows larger, local maximizers corresponding to maximal cliques disappear. In [19], bounds on the parameter ˛ are derived which affect the stability of these solutions. These results have suggested an annealed replication heuristic, which consists of starting from a large negative ˛ and then properly reducing it during the optimization process. For each value of ˛ standard replicator equations are run in order to obtain local solutions of the corresponding objective function. The rationale behind this idea is that for values of ˛ with a proper large absolute value only local solutions corresponding to large maximal cliques will survive, together with various spurious maximizers. As the value of ˛ is reduced, spurious solutions disappear and smaller maximal cliques become stable. An annealing schedule is proposed which is based on the assumption that the graphs being considered are random. In this respect, the proposed procedure differs from usual simulated annealing approaches, which mostly use a ‘black-box’ cooling schedule. Experiments conducted over several DIMACS benchmark graphs confirm the effectiveness of the proposed approach and the robustness of the annealing strategy. The overall conclusion is that the annealing procedure does help to avoid inefficient lo-

cal solutions, by initially driving the dynamics towards promising regions in state space, and then refining the search as the annealing parameter is increased. The Motzkin–Straus theorem has been generalized to the weighted case in [40]. Note that the Motzkin– Straus program can be reformulated as a minimization problem by considering the function f (x) D x > (I C A G )x; where A G is the adjacency matrix of the complement graph G. It is straightforward to see that if x is a global minimizer of f in , then we have: !(G) = 1/f(x ). This is simply a different formulation of the Motzkin– Straus theorem. Given a weighted graph G = (V, E) with weight vector w, let M(G, w) be the class of symmetric n × n matrices B = (bij )i, j 2 V defined as 2bij  bii + bjj if (i, j) 62 E and bij = 0 otherwise, and b i i D 1/w i for all i 2 V. Given the following quadratic program, which is in general indefinite, (

min

f (x) D x > Bx

s.t.

x 2 ;

(3)

in [40] it is shown that for any B 2 M(G, w) we have: !(G; w) D

1 ; f (x  )

where x is a global minimizer of program (3). Furthermore, denote by xS the weighted characteristic vector of S, which is a vector with coordinates x Si = wi /W(S) if i 2 S and x Si = 0 otherwise. It can be seen that a subset S of vertices of a weighted graph G is a maximum weight clique if and only if its characteristic vector xS is a global minimizer of (3). Notice that the matrix I C A G belongs to M(G, e). In other words, the Motzkin–Straus theorem turns out to be a special case of the preceding result. As in the unweighted case, the existence of spurious solutions entails the lack of one-to-one correspondence between the solutions of the continuous problem and those of the original, discrete one. In [21] these spurious solutions are characterized and a regularized version which avoids this kind of problems is introduced, exactly as in the unweighted case (see also [17]). Replicator equations are then used to find maximal weight

Heuristics for Maximum Clique and Independent Set

cliques in weighted graphs, using this formulation. Experiments with this approach on both random graphs and DIMACS graphs are reported. The results obtained are compared with those produced by a very efficient maximum weight clique algorithm of the branch and bound variety. The algorithm performed remarkably well especially on large and dense graphs, and it was typically an order of magnitude more efficient than its competitor. Finally, we mention the work by Massaro and Pelillo [72], who transformed the Motzkin–Straus program into a linear complementarity problem [31], and then solved it using a variation of Lemke’s well-known algorithm [67]. The preliminary results obtained over many weighted and unweighted DIMACS graphs show that this approach substantially outperforms all other continuous based heuristics. Miscellaneous Another type of heuristics that finds a maximal clique of G is called the subgraph approach (see [11]). It is based on the fact that a maximum clique C of a subgraph G0 G is also a clique of G. The subgraph approach first finds a subgraph G0 G such that the maximum clique of G0 can be found in polynomial time. Then it finds a maximum clique of G0 and use it as the approximation solution. The advantage of this approach is that in finding the maximum clique C G0 , one has (implicitly) searched many other cliques of G0 (C G0 C G ). Because of the special structure of G0 , this implicit search can be done efficiently. In [11], G0 is a maximal induced triangulated subgraph of G. Since many classes of graphs have polynomial algorithms for the maximum clique problem, the same idea also applies there. For example, the class of edge-maximal triangulated subgraphs was chosen in [9,90], and [91]. Some of the greedy heuristics, randomized heuristics and subgraph approach heuristics are compared in [90] and [91] on randomly generated weighted and unweighted graphs. Various new heuristics were presented at the 1993 DIMACS challenge devoted to clique, coloring and satisfiability [63]. In particular, in [10] an algorithm is proposed which is based on the observation that finding the maximum clique in the union of two cliques can be done using bipartite matching techniques. In [46] re-

H

stricted backtracking is used to provide a trade-off between the size of the clique and the completeness of the search. In [70] an edge projection technique is proposed to obtain a new upper bound heuristic for the maximum independent set problem. This procedure was used, in conjunction with the Balas–Yu branching rule [11], to develop an exact branch and bound algorithm which works well especially on sparse graphs. See [3] for a new population-based optimization heuristic inspired by the natural behavior of human or animal scouts in exploring unknown regions, and applied it to maximum clique. The results obtained over a few DIMACS graphs are comparable with those obtained using continuous-based heuristics but are inferior to those obtained by reactive local search. Recently, DNA computing [4] has also emerged as a potential technique for the maximum clique problem [75,92]. The major advantage of DNA computing is its high parallelism, but at present the size of graphs this algorithm can handle is limited to a few tens. Additional heuristics for the maximum clique/ independent set and related problems on arbitrary or special class of graphs can be found in [28,29,30,34]. Conclusions During the 1990s, research on the maximum clique and related problems has yielded many interesting heuristics. This article has provided an expository survey on these algorithms and an up-to-date bibliography (as of 2000). However, the activity in this field is so extensive that a survey of this nature is outdated before it is written. See also  Graph Coloring  Greedy Randomized Adaptive Search Procedures  Replicator Dynamics in Combinatorial Optimization References 1. Aarts E, Korst J (1989) Simulated annealing and Boltzmann machines. Wiley, New York 2. Aarts E, Lenstra JK (eds) (1997) Local search in combinatorial optimization. Wiley, New York 3. Abbattista F, Bellifemmine F, Dalbis D (1998) The Scout algorithm applied to the maximum clique problem. Ad-

1517

1518

H 4. 5.

6.

7.

8.

9.

10.

11. 12.

13.

14.

15.

16. 17. 18.

19.

20.

Heuristics for Maximum Clique and Independent Set

vances in Soft Computing—Engineering and Design. Springer, Berlin Adleman LM (1994) Molecular computation of solutions to combinatorial optimization. Science 266:1021–1024 Alon N, Babai L, Itai A (1986) A fast and simple randomized parallel algorithm for the maximal independent set problem. J Algorithms 7:567–583 Arora S, Lund C, Motwani R, Sudan M, Szegedy M (1992) Proof verification and the hardness of approximation problems. Proc. 33rd Ann. Symp. Found. Comput. Sci., Pittsburgh, pp 14–23 Arora S, Safra S (1992) Probabilistic checking of proofs: A new characterization of NP. Proc. 33rd Ann. Symp. Found. Comput. Sci., Pittsburgh, pp 2–13 Bäck T, Khuri S (1994) An evolutionary heuristic for the maximum independent set problem. Proc. 1st IEEE Conf. Evolutionary Comput., 531–535 Balas E (1986) A fast algorithm for finding an edge-maximal subgraph with a TR-formative coloring. Discrete Appl Math 15:123–134 Balas E, Niehaus W (1996) Finding large cliques in arbitrary graphs by bipartite matching. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 29–51 Balas E, Yu CS (1986) Finding a maximum clique in an arbitrary graph. SIAM J Comput 14:1054–1068 Ballard DH, Gardner PC, Srinivas MA (1987) Graph problems and connectionist architectures. Techn Report Dept Comput Sci Univ Rochester 167 Battiti R, Protasi M (1995) Reactive local search for the maximum clique problem. TR-95-052 Techn Report Internat Comput Sci Inst Berkeley, to appear in Algorithmica Bellare M, Goldwasser S, Lund C, Russell A (1993) Efficient probabilistically checkable proofs and application to approximation. Proc. 25th Ann. ACM Symp. Theory Comput., pp 294–304 Bellare M, Goldwasser S, Sudan M (1995) Free bits, PCPs and non-approximability – Towards tight results. Proc. 36th Ann. Symp. Found. Comput. Sci., pp 422–431 Bomze IM (1997) Evolution towards the maximum clique. J Global Optim 10:143–164 Bomze IM (1998) On standard quadratic optimization problems. J Global Optim 13:369–387 Bomze IM, Budinich M, Pardalos PM, Pelillo M (1999) The maximum clique problem. In: Du D-Z, Pardalos PM (eds) Handbook Combinatorial Optim., Suppl. A. Kluwer, Dordrecht, pp 1–74 Bomze IM, Budinich M, Pelillo M, Rossi C (2000) An new “annealed” heuristic for the maximum clique problem. In: Pardalos PM (eds) Approximation and Complexity in Numerical Optimization: Continuous and Discrete Problems. Kluwer, Dordrecht, pp 78–95 Bomze IM, Pelillo M, Giacomini R (1997) Evolutionary approach to the maximum clique problem: Empirical evi-

21.

22.

23.

24.

25.

26. 27.

28.

29.

30. 31. 32.

33.

34.

35.

36.

37.

dence on a larger scale. In: Bomze IM, Csendes T, Horst R, Pardalos PM (eds) Developments in Global Optimization. Kluwer, Dordrecht, pp 95–108 Bomze IM, Pelillo M, Stix V (2000) Approximating the maximum weight clique using replicator dynamics. IEEE Trans Neural Networks 11(6) Bomze IM, Rendl F (1998) Replicator dynamics for evolution towards the maximum clique: Variations and experiments. In: De Leone R, Murli A, Pardalos PM, Toraldo G (eds) High Performance Algorithms and Software in Nonlinear Optimization. Kluwer, Dordrecht, pp 53–67 Bomze IM, Stix V (1999) Genetic engineering via negative fitness: Evolutionary dynamics for global optimization. Ann Oper Res 90 Boppana R, Halldóorsson MM (1992) Approximating maximum independent sets by excluding subgraphs. BIT 32:180–196 Bui TN, Eppley PH (1995) A hybrid genetic algorithm for the maximum clique problem. Proc. 6th Internat. Conf. Genetic Algorithms, pp 478–484 Carraghan R, Pardalos PM (1990) An exact algorithm for the maximum clique problem. Oper Res Lett 9:375–382 Carter B, Park K (1993) How good are genetic algorithms at finding large cliques: An experimental study. Techn Report Comput Sci Dept Boston Univ no. BU-CS-93-015 Chiba N, Nishizeki T, Saito N (1983) An algorithm for finding a large independent set in planar graphs. Networks 13:247–252 Chrobak M, Naor J (1991) An efficient parallel algorithm for computing a large independent set in a planar graph. Algorithmica 6:801–815 Chvátal V (1979) A greedy heuristic for the set-covering problem. Math Oper Res 4:233–23 Cottle RW, Pang J, Stone RE (1992) The linear complementarity problem. AP Feige U, Goldwasser S, Lováasz L, Safra S, Szegedy M (1991) Approximating clique is almost NP-complete. Proc. 32nd Ann. Symp. Found. Comput. Sci., San Juan, Puerto Rico, pp 2–12 Feo TA, Resende MGC, Smith SH (1994) A greedy randomized adaptive search procedure for maximum independent set. Oper Res 42:860–878 Fisher ML, Wolsey LA (1982) On the greedy heuristic for continuous covering and packing problems. SIAM J Alg Discrete Meth 3:584–591 Fleurent C, Ferland JA (1996) Object-oriented implementation of heuristic search methods for graph coloring, maximum clique, and satisfiability. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 619–652 Foster JA, Soule T (1995) Using genetic algorithms to find maximum cliques. Techn Report Dept Comput Sci Univ Idaho no. LAL 95-12 Friden C, Hertz A, de Werra M (1989) STABULUS: A tech-

Heuristics for Maximum Clique and Independent Set

38.

39.

40.

41. 42. 43.

44.

45.

46.

47.

48. 49.

50. 51.

52. 53.

54. 55. 56.

nique for finding stable sets in large graphs with tabu search. Computing 42:35–44 Gendreau A, Salvail L, Soriano P (1993) Solving the maximum clique problem using a tabu search approach. Ann Oper Res 41:385–403 Gibbons LE, Hearn DW, Pardalos PM (1996) A continuous based heuristic for the maximum clique problem. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 103–124 Gibbons LE, Hearn DW, Pardalos PM, Ramana MV (1997) Continuous characterizations of the maximum clique problem. Math Oper Res 22:754–768 Glover F (1989) Tabu search–Part I. ORSA J Comput 1:190–260 Glover F (1990) Tabusearch–Part II. ORSA J Comput 2:4–32 Glover F, Laguna M (1993) Tabu search. In: Reeves C (ed) Modern Heuristic Techniques for Combinatorial Problems. Blackwell, Oxford pp 70–141 Godbeer GH, Lipscomb J, Luby M (1988) On the computational complexity of finding stable state vectors in connectionist models (Hopfield nets). Techn Report Dept Comput Sci Univ Toronto 208, Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA Goldberg MK, Rivenburgh RD (1996) Constructing cliques using restricted backtracking. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 89–101 Grossman T (1996) Applying the INN model to the max clique problem. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 125–146 Hansen P, Jaumard B (1990) Algorithms for the maximum satisfiability problem. Computing 44:279–303 Håstad J (1996) Clique is hard to approximate within n1 ’ :. Proc. 37th Ann. Symp. Found. Comput. Sci., pp 627–636 Haykin S (1994) Neural networks: A comprehensive foundation. MacMillan, New York Hertz A, Taillard E, de Werra D (1997) Tabu search. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. Wiley, New York pp 121–136 Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Reading, MA Hifi M (1997) A genetic algorithm-based heuristic for solving the weighted maximum independent set and some equivalent problems. J Oper Res Soc 48:612–622 Hofbauer J, Sigmund K (1998) Evolutionary games and population dYnamics. Cambridge Univ. Press, Cambridge Holland JH (1975) Adaptation in natural and artificial systems. Univ. Michigan Press, Ann Arbor, MI Homer S, Peinado M (1996) Experiments with polynomial-

57. 58.

59.

60.

61. 62. 63.

64.

65. 66. 67. 68.

69. 70.

71.

72.

73. 74.

H

time CLIQUE approximation algorithms on very large graphs. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 147–167 Hopfield JJ, Tank DW (1985) Neural computation of decisions in optimization problems. Biol Cybern 52:141–152 Jagota A (1995) Approximating maximum clique with a Hopfield neural network. IEEE Trans Neural Networks 6:724–735 Jagota A, Regan KW (1997) Performance of neural net heuristics for maximum clique on diverse highly compressible graphs. J Global Optim 10:439–465 Jagota A, Sanchis L, Ganesan R (1996) Approximately solving maximum clique using neural networks and related heuristics. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 169–204 Jerrum M (1992) Large cliques elude the Metropolis process. Random Struct Algorithms 3:347–359 Johnson DS (1974) Approximation algorithms for combinatorial problems. J Comput Syst Sci 9:256–278 Johnson DS, Trick MA (eds) (1996) Cliques, coloring, and satisfiability: 2nd DIMACS implementation challenge, DIMACS 26. AMS, Providence, RI Karp RM (1972) Reducibility among combinatorial problems. In: Miller RE, Thatcher JW (eds) Complexity of Computer Computations. Plenum, New York, pp 85–103 Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680 van Laarhoven PJM, Aarts EHL (1987) Simulated annealing: Theory and applications. Reidel, London Lemke CE (1965) Bimatrix equilibrium points and mathematical programming. Managem Sci 11:681–689 Lin F, Lee K (1992) A parallel computation network for the maximum clique problem. Proc. 1st Internat. Conf. Fuzzy Theory Tech., Looi C-K (1992) Neural network methods in combinatorial optimization. Comput Oper Res 19:191–208 Mannino C, Sassano A (1996) Edge projection and the maximum cardinality stable set problem. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 205–219 Marchiori E (1998) A simple heuristic based genetic algorithm for the maximum clique problem. Proc. ACM Symp. Appl. Comput., pp 366–373 Massaro A, Pelillo M (2000) A pivoting-based heuristic for the maximum clique problem. Presented at the Internat. Conf. Advances in Convex Analysis and Global Optimization, Samos, June 2000 Motzkin TS, Straus EG (1965) Maxima for graphs and a new proof of a theorem of Turán. Canad J Math 17:533–540 Murthy AS, Parthasarathy G, Sastry VUK (1994) Clique finding–A genetic approach. Proc. 1st IEEE Conf. Evolutionary Comput., pp 18–21

1519

1520

H

High-order Maximum Principle for Abnormal Extremals

75. Ouyang Q, Kaplan PD, Liu S, Libchaber A (1997) DNA solutions of the maximal clique problem. Science 278:1446–449 76. Pardalos PM (1996) Continuous approaches to discrete optimization problems. In: Di Pillo G, Giannessi F (eds) Nonlinear Optimization and Applications. Plenum, New York pp 313–328 77. Pardalos PM, Phillips AT (1990) A global optimization approach for solving the maximum clique problem. Internat J Comput Math 33:209–216 78. Pardalos PM, Rodgers GP (1990) Computational aspects of a branch and bound algorithm for quadratic zero-one programming. Computing 45:131–144 79. Park K, Carter B (1994) On the effectiveness of genetic search in combinatorial optimization. no.BU-CS-9010 Techn Report Computer Sci Dept Boston Univ 80. Pelillo M (1995) Relaxation labeling networks for the maximum clique problem. J Artif Neural Networks 2:313–328 81. Pelillo M, Jagota A (1995) Feasible and infeasible maxima in a quadratic program for maximum clique. J Artif Neural Networks 2:411–420 82. Peterson C, Söderberg B (1997) Artificial neural networks. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. Wiley, New York pp 173–213 83. Ramanujam J, Sadayappan P (1988) Optimization by neural networks. Proc. IEEE Internat. Conf. Neural Networks, pp 325–332 84. Shrivastava Y, Dasgupta S, Reddy SM (1990) Neural network solutions to a graph theoretic problem. Proc. IEEE Internat. Symp. Circuits Syst., pp 2528–2531 85. Shrivastava Y, Dasgupta S, Reddy SM (1992) Guaranteed convergence in a class of Hopfield networks. IEEE Trans Neural Networks 3:951–961 86. Soriano P, Gendreau M (1996) Diversification strategies in tabu search algorithms for the maximum clique problem. Ann Oper Res 63:189–207 87. Soriano P, Gendreau M (1996) Tabu search algorithms for the maximum clique problem. In: Johnson DS, Trick MA (eds) Cliques, Coloring, and Satisfiability: 2nd DIMACS Implementation Challenge, DIMACS 26. AMS, Providence, RI, pp 221–242 88. Takefuji Y (1992) Neural network parallel computing. Kluwer, Dordrecht 89. Tomita E, Mitsuma S, Takahashi H (1988) Two algorithms for finding a near-maximum clique. Techn Report UEC-TRC1 90. Xue J (1991) Fast algorithms for vertex packing and related problems. PhD Thesis GSIA Carnegie-Mellon Univ. 91. Xue J (1994) Edge-maximal triangulated subgraphs and heuristics for the maximum clique problem. Networks 24:109–120 92. Zhang B-T, Shin S-Y (1998) Code optimization for DNA computing of maximal clique’s. Advances in Soft Computing–Engineering and Design. Springer, Berlin

High-order Maximum Principle for Abnormal Extremals URSZULA LEDZEWICZ1 , HEINZ SCHÄTTLER2 1 Department Math. and Statist., Southern Illinois University at Edwardsville, Edwardsville, USA 2 Department Systems Sci. and Math., Washington University, St. Louis, USA MSC2000: 49K15, 49K27, 41A10, 47N10 Article Outline Keywords Regularity in the Equality Constraint Critical Directions p-Order Local Maximum Principle Conclusion See also References Keywords Local maximum principle; High-order tangent sets; High-order necessary conditions for optimality; Abnormal processes We formulate a generalized local maximum principle which gives necessary conditions for optimality of abnormal trajectories in optimal control problems. The results are based on a hierarchy of primal constructions of high-order approximating cones (consisting of tangent directions for equality constraints, feasible directions for inequality constraints, and directions of decrease for the objective) and dual characterizations of empty intersection properties of these cones (see  High-order necessary conditions for optimality for abnormal points). Characteristic for the theorem is that the multiplier associated with the objective is nonzero. We consider an optimal control problem in Bolza form with fixed terminal time: (OC) Minimize the functional ZT L(x(t); u(t); t) dt C `(x(T))

I(x; u) D 0

(1)

High-order Maximum Principle for Abnormal Extremals

subject to the constraints x˙ (t) D f (x(t); u(t); t); x(0) D 0;

q(x(T)) D 0;

r u() 2 U D fu 2 L1 (0; T) : u(t) 2 Ug :

The terminal time T is fixed and we make the following regularity assumptions on the data: L: Rn × Rm ×[0, T] ! R and f : Rn × Rm × ! [0, T] Rn are C1 in (x, u) for every t 2 [0, T]; both functions and their derivatives are measurable in t for every (x, u) and the functions and all partial derivatives are bounded on compact subsets of Rn ×Rm × [0, T]; `: Rn ! R and q: Rn ! Rk are C1 and the rows of the Jacobian matrix qx (i. e. the gradients of the equations defining the terminal constraint) are linearly independent; U  Rm is a closed and convex set with nonempty interior. Let H(0 ; ; x; u; t) D 0 L(x; u; t) C  f (x; u; t)

(2)

be the Hamiltonian for the control problem. If the input-trajectory pair (x , u ) is optimal for problem (OC), then the local maximum principle [7] states that there exist a constant 0  0, an absolutely continuous function :[0, T] ! (Rn ) (which we write as a row vector), which is a solution to the adjoint equation ˙ D H x (0 ; (t); x(t); u (t); t);  with terminal condition (T) D 0 `x (x (T)) C q x (x (T));

(3)

(for some row vector  2 (Rk ) ) such that (0 , (t)) 6D 0 for all t 2 [0, T] and the following local minimum condition holds for all u 2 U: hHu (0 ; (t); x (t); u (t); t); u  u (t)i  0:

(4)

Input-trajectory pairs (x , u ) for which multipliers 0 and  exist such that these conditions are satisfied are called (weak) extremals. If 0 > 0, then it is possible to normalize 0 = 1 and the extremal is called normal while extremals with 0 = 0 are called abnormal. Although the terminology abnormal, which has its origins in the calculus of variations [4], seems to suggest that these type of extremals are an aberration, for optimal control problems this is not the case. The phenomenon is quite general and abnormal extremals

H

cannot be excluded from optimality a priori. For instance, there exist optimal abnormal trajectories for the standard problem of stabilizing the harmonic oscillator time-optimally in minimum time, a simple timeinvariant linear system. In the abnormal case conventional necessary conditions for optimality provide conditions which only describe the structure of the constraints. For example, if there are no control constraints, then these conditions only involve the equality constraint defined by the dynamics and terminal conditions as zero set of an operator F: Z ! Y between Banach spaces. If F 0 (z ) is not onto, but ImF 0 (z ) is closed (and this is always the case for the optimal control problem) then the standard Lagrange multiplier type necessary conditions for optimality (which imply the local maximum principle [7]) can be satisfied trivially by choosing a multiplier which annihilates the image of F 0 (z ) and setting all other multipliers to zero.) The corresponding necessary conditions are independent of the objective and describe only the structure of the constraint yielding little information about the optimality of the abnormal trajectory. Much of the difficulty in analyzing abnormal points in extremum problems can be traced back to the fact that the equality constraint is typically no longer a manifold near abnormal points, but intersections of manifolds are common. Hence, in order to develop necessary and/or sufficient conditions for optimality of abnormal extremals, it is imperative to analyze different branches of the zero-set of F. Finding these branches is at the heart of the matter. Generalizing a result of E.R. Avakov [2,3] in [10] a high-order generalization of the classical Lyusternik theorem is given which for general p 2 N describes the structure of p-order tangent directions to an operator equality constraint in a Banach space for nonregular operators under a more general surjectivity assumption involving the first p derivatives of the operator. Based on these results p-order tangent cones to the equality constraint can explicitly be calculated along critical directions which comprise the low order terms. Combining these cones with standard constructions of high-order cones of decrease for the functional and high-order feasible cones to inequality constraints, all taken along critical directions, generalized necessary conditions for optimality for extremum problems in Banach spaces can be derived which allow to incorporate the objective with a nonzero mul-

1521

1522

H

High-order Maximum Principle for Abnormal Extremals

tiplier. Characteristic of these results is that they are parametrized by critical directions as it is ‘natural’ near abnormal points. In [12] (see  High-order necessary conditions for optimality for abnormal points) an abstract formulation of these results is presented for minimization problems in Banach spaces. The main result gives a dual characterization for the empty intersection property of the various approximating cones along critical directions, but primal arguments using the cones themselves are often equally effective. In this article we formulate these abstract results for the optimal control problem, but we only consider the so-called weak or local version of the maximum principle. This result is weaker than the Pontryagin maximum principle [15] in the sense that the Pontryagin maximum principle asserts that the Hamiltonian of the control problem is indeed minimized over the control set at every time along the reference trajectory by the reference control. The local version only gives the necessary conditions for optimality for this property. However, it is well-known how to use an argument of A. Ya. Dubovitskii to derive the Pontryagin maximum principle from the local version [7, Lecture 13] and a preliminary strong version of the results of this article is given in [9]. Other theories of necessary conditions which are tailored to abnormal processes include a method known as ‘weakening equality constraints’ introduced in [14] and developed further in [5]. References [2,3] are along the lines of the results described here and give necessary conditions for optimality of abnormal extremals based on quadratic approximations. Similarly, both weak and strong versions of a second order generalized maximum principle are given by the authors in [8]. While mostly optimization related techniques are used in these papers, on a different level [1] uses differential geometric techniques to develop a theory of the second variation for abnormal extremals. They give both necessary and sufficient conditions for so-called corank-1 abnormal extremals (extremals for which there exists a unique multiplier) in terms of the Jacobi equation and related Morse indices and nullity theorems. Second order necessary conditions for optimality in the type of accessory problem results without normality assumptions have first been given in [6]. Also, the results in [16] are derived without making normality assumptions.

Regularity in the Equality Constraint We model the optimal control problem (OC) in the framework of optimization theory as a minimization problem in a Banach space under equality and inequaln (0, T) denote the Banach space ity constraints. Let W 11 of all absolutely continuous functions x: [0, T] ! Rn RT with norm jxj D kx(0)k C 0 kx˙ (s)k ds and let n W 11 (0; T) D W11n (0; T) \ fx 2 W11n (0; T) : x(0) D 0g : Then the problem is to minimize the functional I over a class A of input-trajectory pairs (x; u) 2 n m W 11 (0; T)  L1 (0; T) which is defined by equality constraints and the convex inequality constraint u 2 U.˚The equality constraints can be modeled as n m F D (x; u) 2 W 11 (0; T)  L1 (0; T) : F(x; u) D 0 where F is the operator n m n (0; T)  L1 (0; T) ! W 11 (0; T)  R k F : W 11

with F(x, u) given by 0 Z() @x()  f (x(s); u(s); s) ds;

1 q(x(T))A :

0

It is easy to see that the operator F has continuous Fréchet derivatives of arbitrary order. For instance, n m (0; T)L1 (0; T) is given F 0 (x, u) acting on (; ) 2 W 11 by 1 0 Zt @(t)  f x  C f u  ds; q x (x(T))(T)A : 0

All partial derivatives of f are evaluated along a reference input-trajectory pair (x, u) 2 A. The formulas for higher order derivatives are given by equally straightforward multilinear forms. We first describe the image of the operator F 0 (x , u ) for a reference input-trajectory pair (x , u ). Denote the fundamental matrix of the variational equation by ˚(t, s), i. e. @ ˚(t; s) D f x (x(t); u(t); t)˚(t; s); @t ˚(s; s) D Id: Furthermore, let R  Rn denote the reachable subspace of the linearized system ˙ D f x h C f u v; h(t)

h(0) D 0;

(5)

H

High-order Maximum Principle for Abnormal Extremals

at time T. It is well-known that R is a linear subspace of Rn and that R = Rn if and only if equation (5) is completely controllable. In general we have that

in general for k  2, Gk = Gk [F](z;H k  1 ): Z ! Y, v ! Gk (v), is given by G k (v) D

0

Lemma 1 ImF (x , u ) consists of all pairs (a; b) 2 n W 11 (0; T)  R k such that 0 T 1 Z b 2 q x (x (T)) @ ˚(T; s) a˙ (s) ds C RA :

0

k1 X 1 r! rD1

X

@

1 F (rC1) (z )(h j 1 ; : : : ; h j r ; v)A :

j 1 CC j r Dk1

(6)

(8)

0

In particular, ImF 0 (x , u ) is closed and of finite codimension. The following characterizations of the nonregularity of the operator F and its codimension are well-known.

We also denote by Rq [F](z;H ` ) those terms in the TayPp lor expansion of F(z + iD1 "i hi ) which are homogeneous of degree q  2, but only involve vectors from H ` . The general structure of these remainders is given by 1 0

Proposition 2 The codimension of F 0 (x , u ) is equal to the number of linearly independent solu˙ tions to (t) D (t) f x (x (t); u (t); t) which satisfy (t)f u (x (t), u (t), t) on [0, T] and for which (T) is orthogonal to ker qx (x (T)).

C B q C X 1B C B X F (r) (z )(h j 1 ; : : : ; h j r )C : B C B r! rD2 A @ j 1 CC j r Dq;

(9)

1 j k `; 1kr

Let Proposition 3 The operator F is nonregular at  = (x , u ) if and only if  is an abnormal weak extremal which satisfies H u (0, (t), x (t), u (t), t) 0 on [0, T].

Critical Directions We describe the set of critical directions along which high-order tangent approximations to the equality conn m (0; T)  L1 (0; T) straint F can be set up. Let Z D W 11 and suppose an admissible process z = (x , u ) 2 A and a finite sequence H p1 = (h1 , . . . , hp  1) 2 Zp 1 are given. The following operators allow to formalize high-order approximations to an equality constraint at nonregular points (see,  High-order necessary conditions for optimality for abnormal points). For k = 1, . . . , p  1, the directional derivatives r k F(z )(H k ) of F at z along the sequence H k = (h1 , . . . , hk ) are given by 0 k X 1@ r! rD1

X

1 F (r) (z )(h j 1 ; : : : ; h j r )A

(7)

j 1 CC j r Dk

and we let Gk [F](z ;H k  1 ) denote the Fréchetderivatives of the (k  1)th directional derivative of F at z along H k  1 . Thus formally G1 [F](z ) = F 0 (z ) and

Yi D

i X

Im G k [F](z ; H k1 );

i D 1; : : : ; p: (10)

kD1

The following conditions are necessary for the existence of a p-order tangent vector along H p  1 [10]: i) the first p  1 directional derivatives of F along H p  1 vanish, r i F(z )(H i ) D 0; 8i D 1; : : : ; p  1; ii) the compatibility conditions R p1Ci [F](z ; H p1 ) 2 Yi ; i D 1; : : : ; p  1; are satisfied. In these equations all partial derivatives of f are evaluated along the reference trajectory. These conditions are also sufficient if the operator F is p-regular at z in direction of the sequence H p  1 in the sense of the following definition. Definition 4 Let F: Z ! Y be an operator between Banach spaces. We say the operator F is p-regular at z in direction of the sequence H p  1 2 Zp  1 if the following conditions are satisfied:

1523

1524

H

High-order Maximum Principle for Abnormal Extremals

A1) F: Z ! Y is (2p  1)-times continuously Fréchet differentiable in a neighborhood of z . A2) The subspaces Y i , i = 1, . . . , p, are closed. A3) The map G p = G p [F](z ; H p  1 ), Y2 Y G p : Z ! Y1   Y1 Yp1   v 7! G p (v) D G1 (v); 1 G2 (v); : : : ; p1 G p (v) ; where i : Y i + 1 ! Y i + 1 /Y i denotes the canonical projection onto the quotient space, is onto. In the sense of this definition 1-regularity corresponds to the classical Lyusternik condition while 2-regularity is similar to Avakov’s definition [3]. Under these assumptions vectors hp exist which extend H p  1 to porder tangent vectors to F at z [10,12]. For the critical directions for the objective I we focus on the least degenerate critical case and therefore make the following assumption: iii) I 0 (z ) is not identically zero and r i I(z )(H i ) = 0 for i = 1, . . . , p  1. The assumption that the first p  1 directional derivatives vanish is directly tied in with optimality. If there exists a first nonzero directional derivative r j I(z )(H j ) with j < i which is positive, then z indeed is a local Pp minimum for any curve z(") = z + iD1 "i hi + o("p ), " > 0, and none of the directions H p  1 is of any use in improving the value. We restrict to "  0 since we also want to include inequality constraints. On the other hand, if r j I(z )(H j ) < 0, then H j is indeed a direction of decrease and arbitrary high-order extensions of this sequence will give better values. Thus the reference trajectory is not optimal. We also need to define the critical directions for the inequality constraint U in the optimal control problem. More generally, we define a p-order feasible set to an inequality constraint in a Banach space. Definition 5 Let S  Z be a subset with nonempty interior. We call v a p-order feasible vector for S at z in direction of H p  1 = (h1 , . . . , hp  1 ) 2 Zp  1 if there exist an "0 > 0 and a neighborhood V of v so that for all 0 2. Since no other I-derivatives arise in the directional derivatives r i I(0, 0)(H i ) for i = 3, . . . , p  1, the direction H p  1 = ((1 ,  1 );(0, 0);    ;(0, 0)) with [3] [2] [1] 1 = 1 0 and a nonzero 1 is a nonzero p-regular critical direction for the problem to minimize I subject to F = 0 for any p  2. We thus can apply Theorem 7. Since there are no control constraints we can normalize the multipliers so that  0 = 1. The additional multipliers i , i = 1, . . . , p  1, are associated with elements in the dual spaces of the quotients Y i + 1 /Y i (see  High-order necessary conditions for optimality for abnormal points). But here Y i = Im F 0 (0, 0) for i = 1, . . . , p  1, and Y p is the full space. Thus we have i 0 for i = 2, . . . , p  1 and the only nonzero multipliers are and p  1 which for simplicity of notation we just call . Now (14) states that  is an adjoint multiplier for which the conditions of the local Maximum Principle for an abnormal extremal are satisfied. This multiplier is unique and of the form (t) = (, 0, ), but  2 R could be zero. For the extended adjoint equation and minimum condition (19) we need to evaluate the directional derivatives r p  1 f (x, u)(H i ). Straightforward, but a bit tedious calculations show that 1 0 0 0 0   p1 C B f (0; 0)(H i ) x D @0 0  0 A r p1 0 0 [2] 1 and 

r p1 f (0; 0)(H i )

 u

0:

Thus the extended minimum condition reduces to B 0, the minimum condition of the weak maximum principle. Hence also 2 (t) 0 and 1 (t) = 3 (t). But now the extended adjoint equation is given by 1 0 0 0 0 C ˙ (t) D (2; 0; 2)   B @0 0  0 p1 A 0 0 [2] 1

High-order Necessary Conditions for Optimality for Abnormal Points

and thus 

4 D ˙ 1 (t) ˙ 3 (t) [2] 1 (t)

 p1



D  [2] 1 (t)

 p1

:

6.

But we can certainly choose [2] 1 nonconstant to violate this condition. This contradiction proves that  cannot be optimal for the problem to minimize I for any p  2.

7. 8. 9.

Conclusion Theorem 7 is based on p-order approximations. If these remain inconclusive, higher order approximations can easily be set up. If the operator F is p-regular in direction of H p  1 , then given a p-regular tangent direction, it is possible to set up higher order approximations of arbitrary order. In fact, only a system of p linear equations needs to be solved in every step. These results provide a complete hierarchy of primal constructions of higher-order approximating directions and dual characterizations of empty intersection properties of approximating cones which can be used to give necessary conditions for optimality for increasingly more degenerate structures. For these results see [13].

10.

11.

12.

13.

14.

See also  Dynamic Programming: Continuous-time Optimal Control  Hamilton–Jacobi–Bellman Equation  Pontryagin Maximum Principle

15.

16.

H

Congress of Nonlinear Analysts, Part 4, Athens 1996. Nonlinear Anal 30:2439–2448 Gilbert EG, Bernstein DS (1983) Second-order necessary conditions in optimal control: accessory-problem results without normality conditions. J Optim Th Appl 41:75– 106 Girsanov IV (1972) Lectures on mathematical theory of extremum problems. Springer, Berlin Ledzewicz U, Schättler H (1997) An extended maximum principle. Nonlinear Anal 29:59–183 Ledzewicz U, Schättler H (1998) High order extended maximum principles for optimal control problems with nonregular constraints. In: Hager WW, Pardalos PM (eds) Optimal Control: Theory, Algorithms and Applications. Kluwer, Dordrecht, pp 298–325 Ledzewicz U, Schättler H (1998) A high-order generalization of the Lyusternik theorem. Nonlinear Anal 34:793– 815 Ledzewicz U, Schättler H (1998) A high-order generalization of the Lyusternik theorem and its application to optimal control problems. In: Chen W, Hu S (eds) Dynamical Systems and Differential Equations II. pp 45–59 Ledzewicz U, Schättler H (1999) High-order approximations and generalized necessary conditions for optimality. SIAM J Control Optim 37:33–53 Ledzewicz U, Schättler H (2000) A high-order generalized local maximum principle. SIAM J Control Optim 38:823–854 Milyutin AA (1981) Quadratic conditions of an extremum in smooth problems with a finite-dimensional image. Methods of the Theory of Extremal Problems in Economics. Nauka Moscow, Moscow, pp 138–177 (In Russian.) Pontryagin LS, Boltyanskii VG, Gamkrelidze RV, Mishchenko EF (1962) The mathematical theory of optimal processes. Wiley, New York Stefani G, Zezza PL (1996) Optimality conditions for a constrained control problem. SIAM J Control Optim 34:635–659

References 1. Agrachev AA, Sarychev AV (1995) On abnormal extremals for Lagrange variational problems. J Math Syst, Estimation and Control 5:127–130 2. Avakov ER (1988) Necessary conditions for a minimum for nonregular problems in Banach spaces. Maximum principle for abnormal problems of optimal control. Trudy Mat Inst Akad Nauk SSSR 185:3–29 (In Russian.) 3. Avakov ER (1989) Necessary extremum conditions for smooth anormal problems with equality-and inequality constraints. J Soviet Math 45. Matematicheskie Zametki 45:3–11 4. Caratheodory C (1935) Variationsrechnung und partielle Differentialgleichungen erster Ordnung. Teubner, Leipzig 5. Dmitruk AV (1998) Quadratic order conditions of a local minimum for abnormal extremals. In: Proc. 2nd World

High-order Necessary Conditions for Optimality for Abnormal Points URSZULA LEDZEWICZ1 , HEINZ SCHÄTTLER2 1 Department Math. and Statist., Southern Illinois University at Edwardsville, Edwardsville, USA 2 Department Systems Sci. and Math., Washington University, St. Louis, USA

MSC2000: 49K27, 46N10, 41A10, 47N10

1527

1528

H

High-order Necessary Conditions for Optimality for Abnormal Points

Article Outline Keywords A High-Order Formulation of the Dubovitskii–Milyutin Theorem High-Order Directional Derivatives High-Order Tangent Cones High-Order Cones of Decrease High-Order Feasible Cones to Inequality Constraints Given by Smooth Functionals High-Order Feasible Cones to Closed Convex Inequality Constraints Generalized Necessary Conditions for Optimality See also References Keywords Lyusternik theorem; High-order tangent sets; High-order necessary conditions for optimality; Abnormal processes We consider the problem of minimizing a functional I: X ! R in a Banach space X under both equality and inequality constraints. The inequality constraints are of two types, either described by smooth functionals f : X ! R as P = {x 2 X: f (x)  0} or described by closed convex sets C with nonempty interior. The equality constraints are given in operator form as Q = {x 2 X : F(x) = 0} where F: X ! Y is an operator between Banach spaces. Models of this type are common in optimal control problems. The standard first order Lagrange multiplier type necessary conditions for optimality at the point x state that there exist multipliers 0 , . . . , m , y which do not all vanish identically such that the Euler–Lagrange equation 0

0 I (x ) C

m X

 j f j0 (x ) C F 0 (x )y D 0;

(1)

jD1

is satisfied (see for instance [7,9]). This article addresses the case when the Fréchet-derivative F 0 (x ) of the operator defining the equality constraint is not onto, i. e. the regular case. In this case the classical Lyusternik theorem [14] does not apply to describe the tangent space to Q and (1) can be satisfied trivially by choosing a nonzero multiplier y from the annihilator of Im

F 0 (x ) while setting all other multipliers zero. This generates so-called abnormal points for which the standard necessary conditions for optimality only describe the degeneration of the equality constraint without any relation to optimality. Here we describe an approach to high-order necessary conditions for optimality in these cases which is based a high-order generalization of the Lyusternik theorem [12]. By using this theorem one can determine the precise structure of polynomial approximations to Q at x when the surjectivity condition on F 0 (x ) is not satisfied, but when instead a certain operator Gp which takes into account all derivatives up to and including order p is onto. The order p is chosen as the minimum number for which the operator Gp becomes onto. If Gp is onto, then the precise structure of q-order polynomial approximations to Q at x for any q  p can be determined. This leads to the notion of high-order tangent cones to the equality constraint Q at points x in a nonregular case. Combining these with high-order feasible cones for the inequality constraints and highorder cones of decrease, a generalization of the Dubovitskii–Milyutin theorem is formulated. From this theorem generalized necessary conditions for optimality can be deduced which reduce to classical conditions for normal cases, but give new and nontrivial conditions for abnormal cases. First results of this type have been obtained for quadratic approximations (p = 2) in [3,4,5] and [11]. Some of these conditions have been analyzed further also in connection with sufficient conditions for optimality, [1,2]. In [10] also quadratic approximations for problems with inequality constraints are considered. For the regular case when F 0 (x ) is onto second order approximating sets were introduced in [6] to derive second order necessary conditions for optimality, while higher order necessary conditions for optimality in this case are given, for instance, in [8] or [15]. These, however, are not the topic of this article. A High-Order Formulation of the Dubovitskii–Milyutin Theorem Let X and Y be Banach spaces. Let I: X ! R be a functional, F: X ! Y an operator, f j : X ! R, j = 1, . . . , m, functionals and let C  X be a closed convex set with nonempty interior. We assume that I, the functionals f j

High-order Necessary Conditions for Optimality for Abnormal Points

and the operator F are sufficiently often continuously Fréchet-differentiable and consider the problem 8 ˆ ˆmin ˆ x ˆ ˆ < s.t. (P) ˆ ˆ ˆ ˆ ˆ :

I

Definition 1 Let H p  1 = (h1 , . . . , hp  1 ) 2 X p  1 and P p1 : set x(") D x + iD1 "i hi . We call H p  1 a (p  1)-order approximating sequence to a set S X at x 2 Clos S, respectively we call x:" ! x("), a (p  1)-order approximating curve, if there exist an "0 > 0 and a function r defined on [0, "0 ] with values in X, r: [0, "0 ] ! X, with the property that " i h i C r(") 2 S

(2)

iD1

kr(")k D 0: "!0 " p1

"i h i C " p v (4)

The collection of all p-order vectors of decrease for I at x in direction of the sequence H p  1 will be called the p-order set of decrease to I at x in direction of the sequence H p  1 and will be denoted by DS(p) (I;x , H p  1 ). Definition 3 We call v0 a p-order feasible vector for an inequality constraint P at x 2 X in direction of H p  1 if there exist an "0 > 0 and a neighborhood V of v0 so that for all 0 < "  "0 x C

p1 X

" i h i C " p V D x(") C " p V  P:

(5)

iD1

The collection of all p-order feasible vectors v0 for P at x in direction of the sequence H p  1 will be called the p-order feasible set to P at x in direction of the sequence H p  1 and will be denoted by FS(p) (P;x , H p  1 ). Note that by definition the p-order set of decrease to I and the p-order feasible set to P, both at x in direction of the sequence H p  1 , are open.

and lim

!

D I(x(") C " p v)  I(x ) C ˛" p :

We define high-order polynomial approximations to the admissible domain A. We denote sequences (h1 , . . . , hk ) 2 X k by H k with the subscript giving the length of the sequence.

p1 X

p1 X iD1

Q D fx 2 X : F(x) D 0g :

x(") C r(") D x C

a neighborhood V of v0 and a number ˛ < 0 so that for all v 2 V we have I x C

  x 2 A D \mjD1 Pj \ Q \ C; ˚ Pj D x 2 X : f j (x)  0

H

(3)

We call a (p  1)-order approximating sequence/curve (p  1)-order feasible if S is an inequality constraint, respectively (p  1)-order tangent if S is an equality constraint. Let x 2 F and assume as given a (p  1)-order approximating sequence H p  1 = (h1 , . . . , hp  1 ) 2 X p  1 with corresponding (p  1)-order approximation x(") P p1 : D x + iD1 "i hi . It is implicitly assumed that x has not been ruled out for optimality. Then we extend the existing (p  1)-order approximations to p-order approximations and derive the corresponding necessary conditions for optimality. The following definitions are direct generalizations of standard existing definitions [7]. Definition 2 We call v0 a p-order vector of decrease for a functional I: X ! R at x 2 X in direction of the sequence H p  1 = (h1 , . . . , hp  1 ) 2 X p  1 if there exist

Definition 4 We call hp a p-order tangent vector to an equality constraint Q at x in direction of the sequence H p  1 if H p = (h1 , . . . , hp ) 2 X p is a p-order approximating sequence to the set Q at x 2 Q. The collection of all p-order tangent vectors to Q at x in direction of the sequence H p  1 will be called the p-order tangent set to Q at x in direction of the sequence H p  1 and will be denoted by TS(p) (Q;x , H p  1 ). These approximating sets can be embedded into cones in the extended state-space X × R. This has the advantage that many classical results like the Minkowski– Farkas lemma or the annihilator lemma can be directly applied in calculating dual cones (see also [11]). Let us generally refer to p-order sets of decrease, feasible sets and tangent sets as p-order approximating sets and denote them by AS(p) (Z;x , H p  1 ). Then we define the corresponding approximating cones as follows: Definition 5 Given a p-order approximating set AS(p) (Z;x , H p  1 ) to a set Z  X at x in direction

1529

1530

H

High-order Necessary Conditions for Optimality for Abnormal Points

of the sequence H p  1 , the p-order approximating cone to Z at x in direction of H p  1 , AC(p) (Z;x , H p  1 ), is the cone in X × R generated by the vectors (v, 1) 2 AS(p) (Z;x , H p  1 ) × R. Thus we talk of the p-order cone of decrease for the functional I, p-order feasible cones for inequality constraints and p-order tangent cones for equality constraints, all at x in direction of the sequence H p  1 . Definition 6 Let C Z be a cone in a Banach space Z with apex at 0. The dual (or polar) cone to C consists of all continuous linear functionals  2 Z which are nonnegative on C, i. e. 



C D f 2 Z : h; vi  0; 8v 2 Cg :

(6)

Theorem 7 ([11,13], p-order Dubovitskii–Milyutin theorem) Suppose the functional I attains a local minimum for problem (P) at x* ∈ A. Let H_{p−1} = (h_1, ..., h_{p−1}) ∈ X^{p−1} be a (p−1)-order approximating sequence such that the p-order cone of decrease for the functional I, the p-order feasible cones for the inequality constraints P_j, j = 1, ..., m, and C, and the p-order tangent cone to the equality constraint Q, all at x* in direction of the sequence H_{p−1}, are nonempty and convex. Then there exist continuous linear functionals

\[
\begin{aligned}
\bar\lambda_0 &= (\lambda_0, \mu_0) \in DC^{(p)}(I; x^*, H_{p-1})^*, \\
\bar\lambda_j &= (\lambda_j, \mu_j) \in FC^{(p)}(f_j; x^*, H_{p-1})^*, \quad j = 1, \dots, m, \\
\bar\lambda_{m+1} &= (\lambda_{m+1}, \mu_{m+1}) \in FC^{(p)}(C; x^*, H_{p-1})^*, \\
\bar\lambda_{m+2} &= (\lambda_{m+2}, \mu_{m+2}) \in TC^{(p)}(Q; x^*, H_{p-1})^*,
\end{aligned}
\]

all depending on H_{p−1}, such that

\[ \sum_{j=0}^{m+2} \lambda_j = 0, \qquad \sum_{j=0}^{m+2} \mu_j \le 0 \tag{7} \]

hold. Furthermore, not all the λ̄_j, j = 0, ..., m + 2, vanish identically.

High-Order Directional Derivatives

We describe a formalism to calculate higher derivatives [12,13] which will be needed to describe high-order approximating cones. Let F: X → Y be an operator between Banach spaces which is sufficiently often continuously Fréchet differentiable in a neighborhood of x* ∈ X and consider the Taylor expansion of F along a curve

\[ \gamma(\varepsilon) = x^* + \sum_{i=1}^{m} \varepsilon^i h_i. \]

Then we have

\[ F(\gamma(\varepsilon)) = F(x^*) + \sum_{i=1}^{m} \varepsilon^i \nabla^i F(x^*)(h_1, \dots, h_i) + \tilde r(\varepsilon), \]

where ∇^i F(x*)(h_1, ..., h_i) is given by

\[ \nabla^i F(x^*)(h_1, \dots, h_i) = \sum_{r=1}^{i} \frac{1}{r!} \Bigg( \sum_{j_1 + \dots + j_r = i} F^{(r)}(x^*)(h_{j_1}, \dots, h_{j_r}) \Bigg) \tag{8} \]

and r̃(ε) is a function of order o(ε^m) as ε → 0. Note that ∇^i F(x*)(h_1, ..., h_i) simply collects the ε^i-terms in this expansion. These terms, which we call the ith-order directional derivatives of F along the sequence H_i = (h_1, ..., h_i), 1 ≤ i ≤ m, are easily calculated by straightforward Taylor expansions. For example,

\[ \nabla^1 F(x^*)(H_1) = F'(x^*) h_1, \qquad \nabla^2 F(x^*)(H_2) = F'(x^*) h_2 + \tfrac{1}{2} F''(x^*)(h_1, h_1). \]

The higher-order directional derivative ∇^i F(x*) is homogeneous of degree i in the directions in the sense that

\[ \nabla^i F(x^*)(\varepsilon h_1, \varepsilon^2 h_2, \dots, \varepsilon^i h_i) = \varepsilon^i\, \nabla^i F(x^*)(h_1, \dots, h_i). \]

In particular, no indices j_1 and j_2 with j_1 + j_2 > i can occur together as arguments in any of the terms in ∇^i F(x*). Thus all vectors h_j whose index satisfies 2j > i appear linearly in ∇^i F(x*) and are multiplied by terms which are homogeneous of degree i − j. In fact, there exist linear operators G_k = G_k[F](x*; H_{k−1}), k ∈ N, depending on the derivatives up to order k of F in the point x* and on the vectors H_{k−1} = (h_1, ..., h_{k−1}), which describe the contributions of these components. We have G_1[F](x*) = F'(x*) and in general G_k = G_k[F](x*; H_{k−1}): X → Y, v ↦ G_k(v), is given by

\[ G_k(v) = \sum_{r=1}^{k-1} \frac{1}{r!} \Bigg( \sum_{j_1 + \dots + j_r = k-1} F^{(r+1)}(x^*)(h_{j_1}, \dots, h_{j_r}, v) \Bigg). \tag{9} \]

These operators G_k[F](x*; H_{k−1}) are the Fréchet derivatives of the (k−1)th directional derivative of F at x* along H_{k−1}. Note that these terms are homogeneous of degree k − 1. For simplicity of notation we often suppress the arguments. For example, we write

\[ G_1(v) = F'(x^*) v, \qquad G_2(v) = F''(x^*)(h_1, v), \qquad G_3(v) = F''(x^*)(h_2, v) + \tfrac{1}{2} F'''(x^*)(h_1, h_1, v). \]

Given an order p ∈ N, it follows that we can separate the linear contributions of the vectors h_p, ..., h_{2p−1} in derivatives of orders p through 2p − 1, and for i = 1, ..., p, we have an expression of the form

\[ \nabla^{p-1+i} F(x^*)(H_{p-1+i}) = \sum_{k=1}^{i} G_k[F](x^*; H_{k-1})\, h_{p+i-k} + R_{p-1+i}[F](x^*; H_{p-1}). \]

Here, among the terms which are homogeneous of degree p − 1 + i, the sum gives the terms which contain one of the vectors h_p, ..., h_{p−1+i}, and the remainder R combines all other terms, which only include vectors of index ≤ p − 1. The general structure of the remainder R_q[F](x*; H_ℓ) for arbitrary q ≥ 2 and ℓ is given by

\[ R_q[F](x^*; H_\ell) = \sum_{r=2}^{q} \frac{1}{r!} \Bigg( \sum_{\substack{j_1 + \dots + j_r = q \\ 1 \le j_k \le \ell,\ 1 \le k \le r}} F^{(r)}(x^*)(h_{j_1}, \dots, h_{j_r}) \Bigg). \tag{10} \]

Thus R_q(H_ℓ) consists of the terms which are homogeneous of degree q, but only involve vectors from H_ℓ. For example, R_3[F](x*; H_2) is given by

\[ R_3[F](x^*; H_2) = F''(x^*)(h_1, h_2) + \tfrac{1}{6} F^{(3)}(x^*)(h_1, h_1, h_1). \]

Note that the remainders only have contributions from derivatives of at least order two. These operators allow us to formalize high-order approximations to an equality constraint at nonregular points [13].
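As a quick sanity check of formula (8) in the scalar case, the ε- and ε²-coefficients of a symbolic Taylor expansion can be compared against ∇¹F and ∇²F above. The following sketch (an illustration added here, not part of the original article) uses sympy:

```python
import sympy as sp

# Scalar illustration of formula (8): expand F(x + eps*h1 + eps^2*h2) in eps
# and read off the eps^1 and eps^2 coefficients, which should agree with
#   grad^1 F = F'(x) h1   and   grad^2 F = F'(x) h2 + (1/2) F''(x) h1^2.
eps, x, h1, h2 = sp.symbols('epsilon x h_1 h_2')
F = sp.Function('F')

expansion = F(x + eps * h1 + eps**2 * h2).series(eps, 0, 3).removeO().expand()
print(expansion.coeff(eps, 1))   # h_1 * F'(x)
print(expansion.coeff(eps, 2))   # h_2 * F'(x) + h_1**2 * F''(x) / 2
```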

High-Order Tangent Cones

We first describe the set of critical directions along which high-order tangent approximations to the equality constraint Q can be set up. For a given admissible process x* ∈ A and a finite sequence H_{p−1} = (h_1, ..., h_{p−1}) ∈ X^{p−1}, let

\[ Y_i = \sum_{k=1}^{i} \operatorname{Im} G_k[F](x^*; H_{k-1}), \qquad i = 1, \dots, p. \]

It is clear that the first p − 1 directional derivatives of F along H_{p−1} must vanish,

\[ \nabla^i F(x^*)(H_i) = 0, \qquad \forall i = 1, \dots, p-1, \tag{11} \]

if H_{p−1} is a (p−1)-order tangent direction. But additional compatibility conditions of the form

\[ R_{p-1+i}[F](x^*; H_{p-1}) \in Y_i, \qquad i = 1, \dots, p-1, \tag{12} \]

are necessary as well if we want to extend H_{p−1} to a p-order tangent direction H_p = (H_{p−1}; h_p). Conditions (11) and (12) are indeed sufficient for the existence of p-order approximations along H_{p−1} under the following regularity condition:

Definition 8 Let F: X → Y be an operator between Banach spaces. We say the operator F is p-regular at x* in direction of the sequence H_{p−1} ∈ X^{p−1} if the following conditions are satisfied:
A1) F: X → Y is (2p−1)-times continuously Fréchet differentiable in a neighborhood of x*;
A2) the subspaces Y_i, i = 1, ..., p, are closed;
A3) the map 𝒢_p = 𝒢_p[F](x*; H_{p−1}),

\[ \mathcal{G}_p : X \to Y_1 \times \frac{Y_2}{Y_1} \times \dots \times \frac{Y}{Y_{p-1}}, \qquad v \mapsto \mathcal{G}_p(v) = \big(G_1(v),\ \pi_1 G_2(v),\ \dots,\ \pi_{p-1} G_p(v)\big), \]

where π_i: Y_{i+1} → Y_{i+1}/Y_i denotes the canonical projection onto the quotient space, is onto.

In the sense of this definition, 1-regularity corresponds to the classical Lyusternik condition, while 2-regularity is similar to Avakov's definition [5].

Theorem 9 [12] Let H_{p−1} be a sequence so that ∇^i F(x*)(H_i) = 0 for i = 1, ..., p−1, and suppose the operator F is p-regular at x* in direction of H_{p−1}. Then


TS^{(p)}(Q; x*, H_{p−1}) is nonempty if and only if for i = 1, ..., p−1 the compatibility conditions R_{p−1+i}[F](x*; H_{p−1}) ∈ Y_i are satisfied. In this case TS^{(p)}(Q; x*, H_{p−1}) is the closed affine subspace of X given by the solutions v to the linear equation

\[ \mathcal{G}_p[F](x^*; H_{p-1})(v) + \mathcal{R}_{p-1}[F](x^*; H_{p-1}) = 0, \tag{13} \]

where ℛ_{p−1}[F](x*; H_{p−1}) ∈ Z is the point with components

\[ \mathcal{R}_{p-1}[F](x^*; H_{p-1}) = \big( R_p[F](x^*; H_{p-1}),\ \pi_1 R_{p+1}[F](x^*; H_{p-1}),\ \dots,\ \pi_{p-1} R_{2p-1}[F](x^*; H_{p-1}) \big). \]

This formulation of the result clearly brings out the geometric structure of the p-order tangent sets as closed affine linear subspaces of X generated by the kernel of 𝒢_p, ker 𝒢_p.

Corollary 10 [12] Let H_{p−1} be a sequence such that the operator F is p-regular at x* in direction of H_{p−1}. Suppose the first (p−1) directional derivatives ∇^i F(x*)(H_i) vanish for i = 1, ..., p−1, and the compatibility conditions R_{p−1+i}[F](x*; H_{p−1}) ∈ Y_i are satisfied for i = 1, ..., p. Then the p-order tangent cone to Q = {x ∈ X : F(x) = F(x*)} at x* in direction of H_{p−1}, TC^{(p)}(Q; x*, H_{p−1}), consists of all solutions (w, γ) ∈ X × R_+ (i.e. γ > 0) of the linear equation

\[ \mathcal{G}_p[F](w) + \gamma\, \mathcal{R}_{p-1}[F](x^*; H_{p-1}) = 0. \]

For applications to optimization problems we need the subspace of continuous linear functionals which annihilate ker 𝒢_p. Since the operator 𝒢_p is onto, it follows by the annihilator lemma or the closed-range theorem [9] that

\[ (\ker \mathcal{G}_p)^{\perp} = \operatorname{Im}(\mathcal{G}_p^*), \]

where

\[ \mathcal{G}_p^* : Z^* = Y_1^* \times \Big(\frac{Y_2}{Y_1}\Big)^* \times \dots \times \Big(\frac{Y}{Y_{p-1}}\Big)^* \to X^* \]

denotes the adjoint map. Let

\[ \rho_i : \Big(\frac{Y_{i+1}}{Y_i}\Big)^* \to Y_i^{\perp_{i+1}} \]

denote the canonical isomorphism. Here ⊥_{i+1} denotes the annihilator in Y*_{i+1}, i.e.

\[ Y_i^{\perp_{i+1}} = \{ y^* \in Y_{i+1}^* : \langle y^*, v \rangle = 0, \ \forall v \in Y_i \}, \]

and we formally set Y_0 = {0}, so that Y_0^{⊥_1} ≅ Y_1^*. Then we have:

Proposition 11 [11,13] A functional λ ∈ X* lies in (ker 𝒢_p)^⊥ if and only if it can be represented in the form

\[ \lambda = \sum_{i=1}^{p} G_i[F](x^*; H_{i-1})^*\, y_i^* \tag{14} \]

for some functionals y_i^* ∈ Y_{i−1}^{⊥_i}, i = 1, ..., p.

Proposition 12 [11,13] The dual or polar p-order tangent cone consists of all linear functionals (λ, μ) ∈ X* × R which can be represented in the following form: there exist functionals y_i^* ∈ Y_{i−1}^{⊥_i}, i = 1, ..., p, and a number r ≥ 0 such that

\[ \lambda = \sum_{i=1}^{p} G_i[F](x^*; H_{i-1})^*\, y_i^*, \qquad \mu = \sum_{i=1}^{p} \big\langle y_i^*, R_{p-1+i}[F](x^*; H_{p-1}) \big\rangle + r. \]
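For orientation, it may help to write out the lowest nontrivial case p = 2; this specialization is not spelled out in the original text but follows directly from the formulas above, with Y_1 = Im F'(x*) and critical direction h_1 satisfying F'(x*)h_1 = 0. Equation (13) then says that v is a second-order tangent direction along h_1 precisely when

\[
F'(x^*)v + \tfrac{1}{2} F''(x^*)(h_1, h_1) = 0
\quad\text{and}\quad
F''(x^*)(h_1, v) + \tfrac{1}{6} F'''(x^*)(h_1, h_1, h_1) \in \operatorname{Im} F'(x^*),
\]

which recovers the second-order conditions associated with Avakov's 2-regularity mentioned after Definition 8.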

High-Order Cones of Decrease

We now consider critical directions for the objective I and determine the p-order sets of decrease of a functional I: X → R. These results also apply to p-order feasible sets for inequality constraints defined by smooth functionals. We assume as given a (p−1)-order sequence H_{p−1}, and we calculate the p-order set of decrease of I at x* along H_{p−1}. Trivial cases arise if there exists a first nonzero directional derivative ∇^i I(x*)(H_i) of I with i ≤ p−1. In this case we have either DS^{(p)}(I; x*, H_{p−1}) = ∅ if ∇^i I(x*)(H_i) > 0 or DS^{(p)}(I; x*, H_{p−1}) = X if ∇^i I(x*)(H_i) < 0. In the first case the sequence H_{p−1} cannot be used to exclude optimality of x*, since indeed x* is a local minimum along the approximating curve generated by H_{p−1}. In the second case h_i is an ith-order direction of decrease along H_{i−1} and thus every vector v ∈ X is admissible as a pth-order component. The only nontrivial case arises if ∇^i I(x*)(H_i) = 0 for all i with i ≤ p−1 and if I′(x*) ≠ 0.

Proposition 13 [13] Suppose I′(x*) ≠ 0 and for all i with i ≤ p−1 we have ∇^i I(x*)(H_i) = 0. Then the p-order cone of decrease for the functional I at x* in direction of H_{p−1}, DC^{(p)}(I; x*, H_{p−1}), consists of all vectors (w, γ) ∈ X × R which satisfy

\[ I'(x^*) w + \gamma\, R_p[I](x^*; H_{p-1}) < 0. \]

Thus DC^{(p)}(I; x*, H_{p−1}) is nonempty, open and convex. The dual or polar cone to DC^{(p)}(I; x*, H_{p−1}) can easily be calculated using the Minkowski–Farkas lemma [7].

High-Order Feasible Cones to Inequality Constraints Given by Smooth Functionals

In this section we give the form of the p-order feasible cones, FC^{(p)}(P; x*, H_{p−1}), for inequality constraints P described by smooth functionals,

\[ P = \{ x \in X : f(x) \le 0 \}. \]

As with the sets of decrease, if there exists a first index i ≤ p−1 such that ∇^i f(x*)(H_i) ≠ 0, then the constraint will either be satisfied for any p-order vector v ∈ X if ∇^i f(x*)(H_i) < 0, or it will be violated if ∇^i f(x*)(H_i) > 0. This leads to the definition of p-order active constraints.

Definition 14 The inequality constraint P is said to be p-order active along the sequence H_{p−1} if for all i, i = 1, ..., p−1, we have ∇^i f(x*)(H_i) = 0.

Only p-order active constraints enter the necessary conditions for optimality derived via p-order approximations along an admissible sequence H_{p−1}; p-order inactive constraints generate zero multipliers since DS^{(p)}(P; x*, H_{p−1}) = X (p-order complementary slackness conditions) and can be ignored for high-order approximations.

Proposition 15 If the constraint P = {x ∈ X : f(x) ≤ 0} is p-order active along the sequence H_{p−1}, then the p-order feasible cone, FC^{(p)}(P; x*, H_{p−1}), consists of all vectors (w, γ) ∈ X × R_+ which satisfy

\[ f'(x^*) w + \gamma\, R_p[f](x^*; H_{p-1}) < 0. \]

Hence, if f′(x*) ≠ 0, then FC^{(p)}(P; x*, H_{p−1}) is nonempty, open and convex.

High-Order Feasible Cones to Closed Convex Inequality Constraints

Let C ⊂ X be a closed convex set with nonempty interior. Again we assume that H_{p−1} is a (p−1)-order feasible sequence. Note that it follows from Definition 3 that FS^{(p)}(C; x*, H_{p−1}) is open (since any vector in the neighborhood V of v also lies in FS^{(p)}(C; x*, H_{p−1})). It is also clear that FS^{(p)}(C; x*, H_{p−1}) is convex, since C is. Thus FC^{(p)}(C; x*, H_{p−1}) is an open, convex cone. Furthermore, if there exists an integer j < p so that h_j ∈ FS^{(j)}(C; x*, H_{j−1}), then any vector v is allowed as a p-order feasible direction and thus trivially FS^{(p)}(C; x*, H_{p−1}) = X, i.e. the convex constraint x ∈ C is not p-order active. In this case the necessary conditions for optimality along H_{p−1} are exactly the same as without C. The dual or polar cone FC^{(p)}(C; x*, H_{p−1})* can be identified with all supporting hyperplanes to FS^{(p)}(C; x*, H_{p−1}) at x*. More precisely, it consists of all linear functionals (λ, μ) ∈ X* × R which satisfy

\[ \langle \lambda, v \rangle + \mu \ge 0, \qquad \forall v \in FS^{(p)}(C; x^*, H_{p-1}). \]

Corollary 16 [13] Let C ⊂ X be a closed convex set with nonempty interior and suppose the p-order feasible set FS^{(p)}(C; x*, H_{p−1}) is nonempty. If (λ, μ) ∈ FC^{(p)}(C; x*, H_{p−1})*, then λ is a supporting hyperplane to C at x*.

Generalized Necessary Conditions for Optimality

We now give generalized necessary conditions for optimality for problem (P) based on general p-order approximations. We assume as given a sequence H_{p−1} = (h_1, ..., h_{p−1}) ∈ X^{p−1} with the following properties:
P1) The first p−1 directional derivatives of F along H_{p−1} vanish,

\[ \nabla^i F(x^*)(H_i) = 0, \qquad \forall i = 1, \dots, p-1, \]

the compatibility conditions R_{p−1+i}[F](x*; H_{p−1}) ∈ Y_i are satisfied for i = 1, ..., p−1, and the operator F is p-regular at x* in direction of the sequence H_{p−1}.


P2) Either the first nonvanishing derivative ∇^i I(x*)(H_i) is negative, or ∇^i I(x*)(H_i) = 0 for i = 1, ..., p−1.
P3) If the jth constraint is not p-order active, then the first nonzero derivative ∇^i f_j(x*)(H_i) is negative.
P4) FS^{(p)}(C; x*, H_{p−1}) is nonempty.

These conditions guarantee, respectively, that the corresponding p-order approximating cones to the constraints or to the functional I are nonempty and convex. The next theorem generalizes the classical first order necessary conditions for optimality for a mathematical programming problem with convex inequality constraints [7, Thm. 11.4].

Theorem 17 If x* is optimal for problem (P), then given any sequence H_{p−1} = (h_1, ..., h_{p−1}) ∈ X^{p−1} for which conditions P1)–P4) are satisfied, there exist Lagrange multipliers λ_i ≥ 0, i = 0, ..., m, functionals y_i^* ∈ Y_{i−1}^{⊥_i}, i = 1, ..., p, and a supporting hyperplane ⟨λ, v⟩ + μ ≥ 0 for all v ∈ FS^{(p)}(C; x*, H_{p−1}), all depending on the sequence H_{p−1}, such that the multipliers λ_i, i = 0, ..., m, and λ do not all vanish, and

\[ \lambda_0 I'(x^*) + \sum_{j=1}^{m} \lambda_j f_j'(x^*) + \lambda + \sum_{i=1}^{p} G_i^* y_i^* = 0, \tag{15} \]

\[ \mu \ge \lambda_0 R_p[I](x^*; H_{p-1}) + \sum_{j=1}^{m} \lambda_j R_p[f_j](x^*; H_{p-1}) + \sum_{i=1}^{p} \big\langle y_i^*, R_{p-1+i}[F](H_{p-1}) \big\rangle. \tag{16} \]

Furthermore, the following p-order complementary slackness conditions hold:
• λ_0 = 0 if DS^{(p)}(I; x*, H_{p−1}) = X;
• λ_j = 0 if FS^{(p)}(P_j; x*, H_{p−1}) = X;
• λ = 0 if FS^{(p)}(C; x*, H_{p−1}) = X.

Remark 18 This theorem gives the formulation for the case which is nondegenerate in the sense that the operator 𝒢_p is onto, and it is this condition which implies the nontriviality of the multipliers λ_j, j = 0, ..., m, and λ. If 𝒢_p is not onto, but Im 𝒢_p is closed, while all the other conditions remain in effect, then a degenerate version of this theorem can easily be obtained by choosing a nontrivial multiplier ỹ* ∈ (Im 𝒢_p)^⊥. This then gives rise to nontrivial multipliers y_i^* ∈ Y_{i−1}^{⊥_i} which have the property that Σ_{i=1}^{p} G_i^* y_i^* ≡ 0. Thus (15) still holds if we set λ_j = 0 for j = 0, ..., m, and λ = 0. The difference is that it can only be asserted that not all of the multipliers λ_j, j = 0, ..., m, y_i^* ∈ Y_{i−1}^{⊥_i}, i = 1, ..., p, and λ vanish.

See also
• Kuhn–Tucker Optimality Conditions

References
1. Arutyunov AV (1991) Higher-order conditions in anormal extremal problems with constraints of equality type. Soviet Math Dokl 42(3):799–804
2. Arutyunov AV (1996) Optimality conditions in abnormal extremal problems. Systems Control Lett 27:279–284
3. Avakov ER (1985) Extremum conditions for smooth problems with equality-type constraints. USSR Comput Math Math Phys 25(3):24–32 (Zh Vychisl Mat Mat Fiz 25(5))
4. Avakov ER (1988) Necessary conditions for a minimum for nonregular problems in Banach spaces. Maximum principle for abnormal problems in optimal control. Trudy Mat Inst Akad Nauk SSSR 185:3–29; 680–693 (in Russian)
5. Avakov ER (1989) Necessary extremum conditions for smooth anormal problems with equality- and inequality constraints. J Soviet Math 45:3–11 (Matematicheskie Zametki 45)
6. Ben-Tal A, Zowe J (1982) A unified theory of first and second order conditions for extremum problems in topological vector spaces. Math Program Stud 19:39–76
7. Girsanov IV (1972) Lectures on mathematical theory of extremum problems. Springer, Berlin
8. Hoffmann KH, Kornstaedt HJ (1978) Higher-order necessary conditions in abstract mathematical programming. J Optim Theory Appl 26:533–568
9. Ioffe AD, Tikhomirov VM (1979) Theory of extremal problems. North-Holland, Amsterdam
10. Izmailov AF (1994) Optimality conditions for degenerate extremum problems with inequality-type constraints. Comput Math Math Phys 34:723–736
11. Ledzewicz U, Schättler H (1995) Second-order conditions for extremum problems with nonregular equality constraints. J Optim Theory Appl 86:113–144
12. Ledzewicz U, Schättler H (1998) A high-order generalization of the Lyusternik theorem. Nonlinear Anal 34:793–815
13. Ledzewicz U, Schättler H (1999) High-order approximations and generalized necessary conditions for optimality. SIAM J Control Optim 37:33–53
14. Lyusternik LA (1934) Conditional extrema of functionals. Math USSR Sb 31:390–401
15. Tretyakov AA (1984) Necessary and sufficient conditions for optimality of p-th order. USSR Comput Math Math Phys 24(1):123–127


Hilbert's Thirteenth Problem
VICTOR KOROTKICH
Central Queensland University, Mackay, Australia

MSC2000: 01A60, 03B30, 54C70, 68Q17

Article Outline
Keywords
See also
References

Keywords
Superpositions of functions; Algebraic equations; ε-entropy; Information

The formulation of Hilbert's thirteenth problem [8] reads: 'impossibility of solving the general equation of degree 7 by means of any continuous functions depending only on two variables' [21]. On this basis, D. Hilbert proposed that the complexity of functions is specified essentially by the number of variables. However, as it turned out later, this proposal, while valid for analytic functions, is not true in the general case. In particular, the complexity of r times continuously differentiable functions of n variables depends not on the number of variables n but on the ratio n/r.

It is known that the equation of third degree can be reduced by translation to

\[ X^3 + pX + q = 0, \]

which has the solution (S. del Ferro, 16th century)

\[ X = \sqrt[3]{-\frac{q}{2} + \sqrt{\frac{4p^3 + 27q^2}{4 \cdot 27}}} + \sqrt[3]{-\frac{q}{2} - \sqrt{\frac{4p^3 + 27q^2}{4 \cdot 27}}}. \]

The equation of fourth degree can be solved by superposition of addition, multiplication, square roots, cube roots and fourth roots.

To try to solve algebraic equations of higher degree (a vain hope according to N.H. Abel and E. Galois), the idea of W. Tschirnhausen in 1683 [24] was to adjoin a new equation, i.e., to P(X) = 0 one adjoins

\[ Y = Q(X), \]

where Q is a polynomial of degree strictly less than that of P, chosen expediently. In this way one can show that the roots of an equation of degree 5 can be expressed via the usual arithmetic operations in terms of radicals and of the solution φ(x) of the quintic equation

\[ X^5 + xX + 1 = 0 \]

depending on the parameter x. Similarly, for the equation of degree 6 the roots are expressible in the same way if we include also a function φ(x, y), a solution of a 6th-degree equation depending on two parameters x and y. For degree 7 we would have to include also a function φ(x, y, z), solution of the equation

\[ X^7 + xX^3 + yX^2 + zX + 1 = 0. \]

Hence the natural question: Can φ(x, y, z) be expressed by superposition of algebraic functions of two variables [10]?

A great number of papers are devoted to the representability of functions as superpositions of functions depending on a smaller number of variables and satisfying certain additional conditions such as algebraicity, analyticity and smoothness. Hilbert was aware of the fact that superpositions of discontinuous functions represent all functions of a larger number of variables. He also knew about the existence of analytic functions of three variables that cannot be represented by any finite superpositions of analytic functions of two variables [8]. In the statement of his 13th problem, Hilbert proceeded from a result of Tschirnhausen [24], according to which a root of an algebraic equation of degree n > 5, i.e., a function f(x_1, ..., x_n) determined by an equation

\[ f^n + x_1 f^{n-1} + \dots + x_n = 0, \tag{1} \]

can be expressed as a superposition of algebraic functions of n − 4 variables [21]. Hilbert assumed that the

can be expressed as a superposition of algebraic functions of n 4 variables [21]. Hilbert assumed that the

1535

1536

H

Hilbert’s Thirteenth Problem

number n − 4 cannot be reduced for n = 6, 7, 8, and also proved that in order to solve an equation of degree n = 9 it suffices to have functions of n − 5 variables [9]. A. Wiman [26] extended the latter result to n > 9, while N. Chebotarev [6] reduced the number of variables involved in the representation of functions to n − 6 for n ≥ 21 and to n − 7 for n ≥ 121. Chebotarev was the first to attempt to find topological obstructions to the representability of algebraic functions as superpositions of algebraic functions, but his proofs were not convincing [5,17]. Using topological notions related to the behavior of a many-valued algebraic function on and near a branching manifold, it is proved that algebraic functions cannot be represented by complete superpositions of integral algebraic functions. Completeness means that the represented function must involve all the branches of the many-valued functions and not only one of them as, for example, in the formulas expressing solutions to equations of the 3rd and the 4th degree [21]. Certain topological obstructions to the representation by complete superpositions of algebraic functions were constructed in this way [2]. V. Lin [15] established the following, most complete, result: In any neighborhood of the origin for n ≥ 3 the root f(x_1, ..., x_n) of equation (1) is not a complete superposition of entire algebroid functions of fewer than n − 1 variables and single-valued holomorphic functions of an arbitrary number of variables. Thus, from the standpoint of complete superpositions of entire algebraic functions, even fourth-degree equations cannot be solved without using functions of three variables [21].

Hilbert had another motivation for his thirteenth problem: nomography, the method of solving equations by drawing a one-parameter family of curves. This problem, arising in the methods of computation of Hilbert's time, inspired the development of Kolmogorov's notion of ε-entropy [20]. Applications of ε-entropy play a crucial role in the theories of approximation now used in computer science [22]. In Kolmogorov's ε-entropy, a natural characteristic of a function class F is

\[ H_\varepsilon(F) = \log_2 N_\varepsilon(F), \]

where N_ε(F) is the minimum number of points in an ε-net in F. Broadly speaking, the ε-entropy of a function class F is the amount of information needed to specify with accuracy ε a function of the class F. A main problem in ε-entropy is estimating the rate of growth of H_ε(F) as ε → 0 for Lipschitz functions, classes of analytic functions and functions possessing a given number of derivatives. A.N. Kolmogorov showed that the ε-entropy of r times continuously differentiable functions of n variables grows as ε^{−n/r} [20].

Since a digital computer can store only a finite set of numbers, functions must be replaced by such finite sets. Therefore, studies in ε-entropy are important for the correct estimation of the possibilities of computational methods for approximately representing functions, their implementation on computers and their storage in the computer memory. ε-entropy also has many other applications [23]. An ε-net of Lipschitz functions of n variables has been constructed to design global optimization algorithms. This ε-net is based on Kolmogorov's minimal ε-net of one-dimensional Lipschitz functions and is encoded in terms of monotone functions of k-valued logic. This construction gives a representation of an n-dimensional global optimization problem by a minimal number of one-dimensional ones without loss of information [13].

Let us briefly recall the history of the solution of Hilbert's thirteenth problem by Kolmogorov and V. Arnol'd. Hilbert's problem was first attacked using techniques developed by A. Kronrod [14]. In this way Kolmogorov proved that any continuous function of n ≥ 4 variables can be represented as a superposition of continuous functions of three variables [11]. For an arbitrary function of four variables the representation has the form

\[ f(x_1, x_2, x_3, x_4) = \sum_{r=1}^{4} h_r\big[x_4,\ g_1^r(x_1, x_2, x_3),\ g_2^r(x_1, x_2, x_3)\big]. \]

The question whether an arbitrary continuous function of three variables can be represented as a superposition of continuous functions of two variables remained open. The method reduced the representability of functions of three variables as superpositions of functions of two variables to a representability problem for functions defined on universal trees of three-dimensional space [21].

Contrary to the expectations of Hilbert and of his contemporary mathematicians, in 1957 Arnol'd [1], who was a student of Kolmogorov, solved the latter problem and gave the final solution to Hilbert's thirteenth problem in the form of a theorem asserting that any continuous function of n ≥ 3 variables can be represented as a superposition of functions of two variables [21]. A few weeks later Kolmogorov showed that any continuous function f of n variables can be represented as a superposition

\[ f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \chi_q \Bigg[ \sum_{p=1}^{n} \phi_{pq}(x_p) \Bigg] \tag{2} \]

of continuous functions of one variable and the operation of addition [12]. In Kolmogorov's representation (2) the inner functions φ_pq are fixed and only the outer functions χ_q depend on the represented function f. The results of [11] do not follow from the theorem presented in [12] in their exact statements, but their essence (in the sense of the possibility of representing functions of several variables by means of superpositions of functions of a smaller number of variables, and of approximating them by superpositions of a fixed form involving polynomials in one variable and addition) is obviously contained in it [12]. The method for proving the theorem is more elementary than that in [1,11] and reduces to direct constructions and calculations. In Kolmogorov's opinion, the proof of the theorem was his most technically difficult achievement [21].

Thorough proofs of Kolmogorov's theorem and the lemmas of his paper [12] were published in [16,18,20] and others. G. Lorenz [16] noted that the outer functions χ_q can be replaced by a single function χ. D. Sprecher [18] reduced all the inner functions to translations and extensions of a single function ψ, with the property that there exist ε > 0 and λ > 0 such that any continuous function of n variables can be represented as

\[ f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \chi\Bigg[ \sum_{p=1}^{n} \lambda^p \psi(x_p + \varepsilon q) + q \Bigg]. \tag{3} \]
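The strength of representations like (2) and (3) is easy to underestimate: even elementary functions of two variables reduce to one-variable functions combined only through addition. The following toy sketch (an illustration added here; it is not Kolmogorov's construction) shows this for multiplication.

```python
# Multiplication as a superposition of one-variable functions and addition:
#   x * y = ((x + y)^2 - (x - y)^2) / 4,
# so squaring and scaling (one-variable maps) plus addition/subtraction
# suffice -- a toy instance of the phenomenon behind representation (2).
def sq(t):
    return t * t

def mul(x, y):
    return (sq(x + y) - sq(x - y)) / 4.0

print(mul(3.0, 7.0))   # 21.0
```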

B. Fridman [7] proved that the inner functions φ_pq in (2) can be chosen so that they satisfy a Lipschitz condition. Sprecher [19] extended this result to the representation (3) (the function ψ can be chosen to satisfy a Lipschitz condition). It follows from Kolmogorov's representation (2) and Bari's representation [3] of any continuous function of one variable as a sum of three superpositions Σ f_k ∘ g_k of absolutely continuous functions that all continuous functions of any number of variables can be represented by means of superpositions of absolutely continuous functions of one variable and the operation of addition [21].

In the opposite direction are the results of A. Vitushkin [25] and L. Bassalygo [4]. When we deal with superpositions of formal series or analytic functions, it can be shown that, for example, almost every entire function has at an arbitrary point of C^3 a germ which is not expressible by superposition of series in two variables. So there are many more entire functions of three variables than of two [10]. The result of Vitushkin is that there exist r times continuously differentiable functions of n variables that cannot be expressed in terms of finite superpositions of s ≥ 1 times continuously differentiable functions of k < n variables if n/r > k/s [25]; representability depends on n/r. Bassalygo proved that for any three continuous functions ψ_k on a square there exists a continuous function f which cannot be represented as Σ ψ_k ∘ φ_k for any continuous φ_k [4].

See also
• History of Optimization

References
1. Arnold V (1957) On the representation of continuous functions of three variables as superpositions of continuous functions of two variables. Dokl Akad Nauk SSSR 114(4):679–681 (in Russian)
2. Arnold V (1970) On cohomology classes of algebraic functions invariant under a Tschirnhausen transformation. Funkts Anal i Prilozhen 4(1):84–85 (in Russian)
3. Bari N (1930) Mémoire sur la représentation finie des fonctions continues. Math Ann 103:185–248
4. Bassalygo L (1966) On the representation of continuous functions of two variables by continuous functions of one variable. Vestn MGU Ser Mat-Mekh 21:58–63 (in Russian)
5. Chebotarev N (1943) The resolvent problem and critical manifolds. Izv Akad Nauk SSSR Ser Mat 7:123–146 (in Russian)
6. Chebotarev N (1954) On the resolvent problem. Uchen Zap Kazan Univ 114(2):189–193 (in Russian)


7. Fridman B (1967) Increasing smoothness of functions in Kolmogorov's theorem on superpositions. Dokl Akad Nauk SSSR 177:1019–1022 (in Russian)
8. Hilbert D (1902) Sur les problèmes futurs des mathématiques. In: Proc. Second Internat. Congress of Mathematicians. Gauthier-Villars, pp 58–114
9. Hilbert D (1927) Über die Gleichung neunten Grades. Math Ann 97:243–250
10. Kantor J (1996) Hilbert's problems and their sequels. Math Intelligencer 18(1):21–30
11. Kolmogorov A (1956) On the representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables. Dokl Akad Nauk SSSR 108(2):179–182 (in Russian)
12. Kolmogorov A (1957) On the representation of continuous functions of several variables as superpositions of continuous functions of one variable and addition. Dokl Akad Nauk SSSR 114(5):953–956 (in Russian)
13. Korotkich V (1990) Multilevel dichotomy algorithm in global optimization. In: Sebastian H, Tammer K (eds) System Modelling and Optimization. Springer, Berlin, pp 161–169
14. Kronrod A (1950) Uspekhi Mat Nauk 5(1):24–134 (in Russian)
15. Lin V (1976) Superpositions of algebraic functions. Funkts Anal i Prilozhen 10(1):37–45 (in Russian)
16. Lorenz G (1962) Metric entropy, width and superpositions of functions. Amer Math Monthly 69:469–485
17. Morozov V (1954) On some questions in the resolvent problem. Uchen Zap Kazan Univ 114(2):173–187 (in Russian)
18. Sprecher D (1965) On the structure of continuous functions of several variables. Trans Amer Math Soc 115:340–355
19. Sprecher D (1972) An improvement in the superposition theorem of Kolmogorov. J Math Anal Appl 38:208–213
20. Tikhomirov V (1963) A.N. Kolmogorov's work on ε-entropy of function classes and superpositions of functions. Uspekhi Mat Nauk 18(5):55–92 (in Russian)
21. Tikhomirov V (ed) (1991) Selected works of A.N. Kolmogorov: Mathematics and mechanics, vol 1. Kluwer, Dordrecht
22. Traub J, Wasilkowski G, Wozniakowski H (1988) Information-based complexity. Acad. Press, New York
23. Traub J, Wozniakowski H (1980) Theory of optimal algorithms. Acad. Press, New York
24. Tschirnhausen W (1683) Methodus auferendi omnes terminos intermedios ex data equatione. Acta Eruditorum
25. Vitushkin A (1955) On multidimensional variations. Gostekhteorizdat
26. Wiman A (1927) Über die Anwendung der Tschirnhausen-Transformation auf die Reduktion algebraischer Gleichungen. Nova Acta R Soc Sci Uppsaliensis, extraordin. edit.: 3–8

History of Optimization
DING-ZHU DU^1, PANOS M. PARDALOS^2, WEILI WU^3
^1 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
^2 Center for Applied Optimization, Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA
^3 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA

MSC2000: 01A99

Article Outline
Keywords
See also
References

Keywords
History; Optimization

Did you ever watch how a spider catches a fly or a mosquito? Usually, a spider hides at the edge of its net. When a fly or a mosquito hits the net, the spider picks up each line in the net to find the tense one and then goes rapidly along that line to its prey. Why does the spider choose the tense line? Some biologists explain that this line gives the shortest path from the spider to its prey.

Did you hear the following story about a wise general? He had a duty to capture a town behind a mountain. When he and his soldiers reached the top of the mountain, he found that his enemy had already approached the town very closely by another way. His dilemma was how to get into the town before the enemy arrived. It was a challenging problem for the general. He solved it by asking each soldier to roll down the mountain in a blanket. Why is this faster? Physicists tell us that a free ball rolling down a mountain always chooses the most rapid way.

Do you know the tale of the horse match of Tian Gi? It is a story from BC times. Tian Gi was a general in one of several small states of China, called Qi. The King of Qi knew that Tian Gi had several good horses and ordered Tian Gi to have a horse match with him. The match consisted of three rounds. In each round, each


side chose a horse to compete with the other side. Tian Gi knew that his best horse could not compete with the best one of the King, his second best horse could not compete with the second best one of the King, and his third best horse could not compete with the third best one of the King. Therefore, he did not use his best horse against the best horse of the King. Instead, he put his third best horse in the first round against the best one of the King, his best horse in the second round against the second best one of the King, and his second best horse in the third round against the third best one of the King. The final result was that although he lost the first round of the match, he won the last two rounds. Tian Gi's strategy was the best way to win this match. Today, economists tell us that many economic and social systems can be modeled as games. Each contestant in the game tries to maximize certain benefits.

Optimality is a fundamental principle, establishing natural laws, ruling biological behaviors, and conducting social activities. Therefore, optimization started from the earliest stages of human civilization. Of course, before mathematics was well established, optimization could be done only by simulation. One may find many wise men's stories in human history about it. For example, to find the best way to get out of a mountain, someone followed a stream, and to find the best way to get out of a desert, someone set an old horse free and followed the horse's trace.

In the 19th century, or even today, simulation is still used for optimization. For example, to find a shortest path on a network, one may make a net with rope in proportional size and pull the net tightly between two destinations. The tense rope shows the shortest path. To find an optimal location of a school for three villages, one may drill three holes in a table and put a piece of rope through each hole. Then tie the three rope-ends above the table together and hang a one-kg weight on each rope-end under the table. When this mechanical system is balanced, the knot of the three rope-pieces points out the location of the school.

The history of optimization in mathematics can be divided into three periods.

In the first period, no general method was known for finding a maximum/minimum point of a function. Only special techniques were found to maximize/minimize some special functions. A typical function is the quadratic function of one variable

\[ y = ax^2 + bx + c. \]

The study of quadratic functions was closely related to the study of constantly-accelerating movement. What is the highest point reached by a stone thrown with a certain initial speed at a certain angle? What is the farthest point that a stone thrown with a certain initial speed can reach as the throwing angle varies? These were questions considered by some physicists and generals. In fact, the stone-throwing machine was an important weapon in the military. Today (as of 2000), computing maximum/minimum points of a quadratic function is still an important technique of optimization, found in elementary mathematics books. The technique has also been extended to other functions such as

\[ y = \frac{x^2 + x + 1}{x^2 + 2x + 3}. \]

Actually, multiplying both sides by x² + 2x + 3 and simplifying, we obtain

\[ (y-1)x^2 + (2y-1)x + (3y-1) = 0. \]

Since x is a real number, we must have

\[ (2y-1)^2 - 4(y-1)(3y-1) \ge 0. \]

Therefore,

\[ -8y^2 + 12y - 3 \ge 0, \]

that is,

\[ \frac{3-\sqrt{3}}{4} \le y \le \frac{3+\sqrt{3}}{4}. \]
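The discriminant computation above is easy to verify symbolically; a minimal sketch (added here for illustration) using sympy:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)

# y = (x^2 + x + 1) / (x^2 + 2x + 3) takes the value y for some real x exactly
# when the quadratic (y-1)x^2 + (2y-1)x + (3y-1) has nonnegative discriminant.
quadratic = sp.expand(y * (x**2 + 2*x + 3) - (x**2 + x + 1))
disc = sp.discriminant(quadratic, x)
print(sp.expand(disc))                         # -8*y**2 + 12*y - 3
print(sp.solveset(disc >= 0, y, sp.S.Reals))   # Interval((3-sqrt(3))/4, (3+sqrt(3))/4)
```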

It is interesting to note that with this technique we obtained the global maximum and minimum of y.

A new period started in 1646 with P. de Fermat. He proposed, in his paper [5], a general approach to compute local maxima/minima of a differentiable function, namely, setting the derivative of the function to zero. Today, this approach is still included in almost all textbooks of calculus as an application of differentiation. In this period, optimization appeared scattered and unorganized within mathematics. Because optimization had not yet become an important branch of applied mathematics, some mathematicians did not pay much attention to results on optimization, and some contributions


were never even published. This left many mysteries in the history of optimization. For example, who was the first person to propose the Steiner tree problem? This was one such mystery. To obtain a clear view, let us explain it in a little detail.

In the same paper mentioned above, Fermat also studied the problem of finding a point that minimizes the total distance from it to three given points in the Euclidean plane. Suppose the three given points are (x_1, y_1), (x_2, y_2), and (x_3, y_3). Then the total distance from a point (x, y) to these three points is

\[ f(x, y) = \sum_{i=1}^{3} \sqrt{(x - x_i)^2 + (y - y_i)^2}. \]

By Fermat's general method, the minimum point of f(x, y) must satisfy the following equations:

\[ \frac{\partial f}{\partial x} = \sum_{i=1}^{3} \frac{x - x_i}{\sqrt{(x - x_i)^2 + (y - y_i)^2}} = 0, \qquad \frac{\partial f}{\partial y} = \sum_{i=1}^{3} \frac{y - y_i}{\sqrt{(x - x_i)^2 + (y - y_i)^2}} = 0. \]

However, obtaining x and y from this system of equations seems hopeless. Therefore, Fermat mentioned this problem again in a letter to A. Mersenne, remarking that it would be nice if a clear solution could be obtained for it. E. Torricelli, a student of G. Galilei, obtained a clever solution with a geometric method. He showed that if the three given points form a triangle with no angle of at least 120°, then the solution is the point at which the three segments from it to the three given points form three angles of 120°. Otherwise, the solution is the given point at which the triangle formed by the three given points has an angle of at least 120°. This result can also be proved with the mechanical system described at the beginning of this article. In the first case, the knot of the three rope-pieces does not stay at any given point, and hence the balance condition of the three forces of equal magnitude yields the condition on the angles. In the second case, the knot falls into one of the three holes, and the condition on the angle guarantees that the knot does not move away from the hole.

Fermat's problem was extensively studied later and was generalized to four points by J.Fr. Fagnano in 1775 and to n points by P. Tedenat and S. L'Huiller in 1810.
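Although the stationarity system above has no convenient closed-form solution, it is easily solved numerically. The following sketch (added here for illustration; the iteration is due to E. Weiszfeld, 1937, not to Fermat) repeatedly re-weights the given points by their inverse distances:

```python
import numpy as np

def weiszfeld(points, iters=100, tol=1e-10):
    """Approximate the Fermat point of the given points by Weiszfeld's iteration."""
    pts = np.asarray(points, dtype=float)
    z = pts.mean(axis=0)                       # start at the centroid
    for _ in range(iters):
        d = np.linalg.norm(pts - z, axis=1)
        if np.any(d < tol):                    # iterate landed on a given point
            break
        w = 1.0 / d
        z_new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            z = z_new
            break
        z = z_new
    return z

# Equilateral triangle: the Fermat point is the centroid, and the three
# segments to the vertices meet there at 120-degree angles.
print(weiszfeld([(0, 0), (1, 0), (0.5, np.sqrt(3) / 2)]))
```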

Fagnano pointed out that it is very easy to find the solution of Fermat's problem for four points. When the four given points form a convex quadrilateral, the solution of Fermat's problem is the intersection of the two diagonals, i.e., the intersection of the two diagonals minimizes the total distance from one point to the four given points. Otherwise, one of the given points must lie inside the triangle formed by the other three given points; this given point is the solution.

On March 19, 1836, H.C. Schumacher wrote a letter to C.F. Gauss. In his letter, he mentioned a paradox about Fermat's problem: Consider a convex quadrilateral ABCD. It is known that the solution of Fermat's problem for the four points A, B, C, and D is the intersection E of the diagonals AC and BD. Suppose extending DA and CB gives an intersection F. Now move A and B to F; then E also moves to F. However, when the angle at F is less than 120°, the point F cannot be the solution of Fermat's problem for the three given points F, D, and C. What happened?

On March 21, 1836, Gauss wrote a letter to Schumacher in which he explained the mistake in Schumacher's paradox: it occurs at the place where Fermat's problem for the four points A, B, C, and D is changed into Fermat's problem for the three points F, C, and D. When A and B are identical to F, the total distance from E to the four points A, B, C, and D equals 2EF + EC + ED, not EF + EC + ED. Thus, the point E may not be the solution of Fermat's problem for F, C, and D. More importantly, Gauss proposed a new problem. He said that it is more interesting to find a shortest network rather than a point. Gauss also presented several possible connections of the shortest network for four given points.

Unfortunately, Gauss' letter was discovered only in 1986. From 1941 to 1986, many publications followed R. Courant and H. Robbins, who in their popular book [2] called Gauss' problem the Steiner tree problem. The Steiner tree has become a popular and important name. If you search 'Steiner tree' with 'yahoo.com' on the internet, you will receive a list of 4675 webpages on Steiner trees. We have no way to change the name back from Steiner trees to Gauss trees. It may be worth mentioning that J. Steiner, a geometer of the 19th century whose name is used for these shortest networks, has not so far been found to have made any significant contribution to Steiner trees.


G.B. Dantzig, who first proposed the simplex method for solving linear programming in 1947, stated in [4]: 'What seems to characterize the pre-1947 era was lack of any interest in trying to optimize'. Due to this lack of interest in optimization, many important works that appeared before 1947 were ignored. This happened not only to Steiner trees, but also to other areas of optimization, including some important contributions in linear and nonlinear programming.

The discovery of linear programming started a new age of optimization. However, in [4], Dantzig made the following comment: 'Linear programming was unknown prior to 1947'. This is not quite correct; there were some isolated exceptions. J.B.J. Fourier (of Fourier series fame) in 1823 and the well-known Belgian mathematician Ch. de la Vallée Poussin in 1911 each wrote a paper about it. Their work had as much influence on post-1947 developments as would finding in an Egyptian tomb an electronic computer built in 3000 BC. L.V. Kantorovich's remarkable 1939 monograph on the subject was also neglected for ideological reasons in the USSR. It was resurrected two decades later, after the major developments had already taken place in the West. An excellent paper by F.L. Hitchcock in 1941 on the transportation problem was also overlooked until others in the late 1940s and early 1950s had independently rediscovered its properties.

He also recalled how he made his discovery: 'My own contribution grew out of my World War II experience in the Pentagon. During the war period (1941–1945), I had become an expert on programming-planning methods using desk calculators. In 1946 I was mathematical advisor to the US Air Force Comptroller in the Pentagon. I had just received my PhD (for research I had done mostly before the war) and was looking for an academic position that would pay better than a low offer I had received from Berkeley. In order to entice me not to take another job, my Pentagon colleagues, D. Hitchcock and M. Wood, challenged me to see what I could do to mechanize the planning process. I was asked to find a way to more rapidly compute a time-staged development, training and logistical supply program. In those days mechanizing planning meant using analog devices or punch-card equipment. There were no electronic computers'.

This challenge problem made Dantzig discover his great work in linear programming without an electronic


computer. But we have to point out that it is due to the rapid development of computer technology that applications of linear programming have become so wide and so great, and that the areas of optimization have grown so fast.

In 1951, A.W. Tucker and his student H.W. Kuhn published the Kuhn–Tucker conditions. This is considered the starting point of nonlinear programming. However, A. Takayama has an interesting comment on these conditions: 'Linear programming aroused interest in constraints in the form of inequalities and in the theory of linear inequalities and convex sets. The Kuhn–Tucker study appeared in the middle of this interest with a full recognition of such developments. However, the theory of nonlinear programming when constraints are all in the form of equalities has been known for a long time – in fact, since Euler and Lagrange. The inequality constraints were treated in a fairly satisfactory manner already in 1939 by Karush. Karush's work is apparently under the influence of a similar work in the calculus of variations by Valentine. Unfortunately, Karush's work has been largely ignored'. Yet this is another work that appeared before 1947 and was ignored.

In the 1960s, G. Zoutendijk, J.B. Rosen, P. Wolfe, M.J.D. Powell, and others published a number of algorithms for solving nonlinear optimization problems. These algorithms form the basis of contemporary nonlinear programming.

In 1954, L.R. Ford and D.R. Fulkerson initiated the study of network flows. This is considered the starting point of combinatorial optimization, although Fermat was the first to study a major combinatorial optimization problem. In fact, it was because of the influence of the results of Ford and Fulkerson that interest in combinatorial optimization grew, and many problems, including Steiner trees, were proposed or rediscovered. In 1958, R.E. Gomory published the cutting plane method. This is considered the initiation of integer programming, an important direction of combinatorial optimization.

In 1955, Dantzig published his paper [3], and E.M.L. Beale proposed an algorithm to solve similar problems. They started the study of stochastic programming. R. J-B. Wets in the 1960s, and J.R. Birge and A. Prékopa in the 1980s, made important contributions to this branch of optimization.


Now, optimization has merged into almost every corner of economics. New branches of optimization have appeared in almost every decade: global optimization, nondifferentiable optimization, geometric programming, large scale optimization, etc. No one in his/her whole life is able to study all branches of optimization. Each researcher can only be an expert in a few branches of optimization.

Of course, the rapid development of optimization has been accompanied by recognition of its achievements. One important fact is that several researchers in optimization have received the Nobel Prize in economics, including Kantorovich and T.C. Koopmans. They received the Nobel Prize in economics in 1975 for their contributions to the theory of optimum allocation of resources. H.M. Markowitz received the Nobel Prize in economics in 1990 for his contribution on the quadratic programming model of financial analysis.

Today, optimization has become a very large and important interdisciplinary area between mathematics, computer science, industrial engineering, and management science. The 'International Symposium on Mathematical Programming' is one of the major conferences on optimization. From the growing number of papers presented at this conference we may see the projection of the growing optimization area:
1949) Chicago, USA, 34 papers;
1951) Washington DC, USA, 19 papers;
1955) Washington DC, USA, 33 papers;
1959) Santa Monica, USA, 57 papers;
1962) Chicago, USA, 43 papers;
1964) London, UK, 83 papers;
1967) Princeton, USA, 91 papers;
1970) The Hague, The Netherlands, 137 papers;
1973) Stanford, USA, about 250 papers;
1976) Budapest, Hungary, 327 papers;
1979) Montreal, Canada, 458 papers;
1982) Bonn, FRG, 554 papers;
1985) Cambridge, USA, 589 papers;
1988) Tokyo, Japan, 624 papers.
(This data is quoted from [1].)

With the current fast growth of computer technology, optimization is expected to continue developing at great speed. These developments may include a deeper understanding of the successful heuristics for combinatorial optimization problems based on nonlinear programming approaches. They may also include digital simulation of natural optimization processes. As many mysteries and open problems still exist in optimization, it will remain an area receiving great attention.

See also
• Carathéodory, Constantine
• Carathéodory Theorem
• Inequality-constrained Nonlinear Optimization
• Kantorovich, Leonid Vitalyevich
• Leibniz, Gottfried Wilhelm
• Linear Programming
• Operations Research
• Von Neumann, John

References
1. Balinski ML (1991) Mathematical programming: Journal, society, recollections. In: Lenstra JK, Rinnooy Kan AHG, Schrijver A (eds) History of Mathematical Programming. North-Holland, Amsterdam, pp 5–18
2. Courant R, Robbins H (1941) What is mathematics? Oxford Univ. Press, Oxford
3. Dantzig GB (1955) Linear programming under uncertainty. Manag Sci 1:197–206
4. Dantzig GB (1991) Linear programming: The story about how it began. In: Lenstra JK, Rinnooy Kan AHG, Schrijver A (eds) History of Mathematical Programming. North-Holland, Amsterdam, pp 19–31
5. de Fermat P (1934) Abhandlungen über Maxima und Minima. In: Oswalds Klassiker der exakten Wissenschaft, vol 238. H. Miller (reprint from original)

Homogeneous Selfdual Methods for Linear Programming
ERLING D. ANDERSEN
Odense University, Odense M, Denmark

MSC2000: 90C05

Article Outline
Keywords
See also
References

Keywords
Optimization; Linear programming; Interior point methods; Homogeneous; Selfdual

The linear program

\[ \begin{aligned} \min\quad & c^\top x \\ \text{s.t.}\quad & Ax = b, \\ & x \ge 0 \end{aligned} \tag{1} \]

may have an optimal solution, be primal infeasible or be dual infeasible for a particular set of data c ∈ R^n, b ∈ R^m, and A ∈ R^{m×n}. In fact, the problem can be both primal and dual infeasible for some data, where (1) is called dual infeasible if the dual problem

\[ \begin{aligned} \max\quad & b^\top y \\ \text{s.t.}\quad & A^\top y + s = c, \\ & s \ge 0 \end{aligned} \tag{2} \]

corresponding to (1) is infeasible. The vector s is the so-called vector of dual slacks. However, most methods for solving (1) assume that the problem has an optimal solution. This is in particular true for interior point methods. To overcome this problem, it has been suggested to solve the homogeneous and selfdual model

\[ \begin{aligned} \min\quad & 0 \\ \text{s.t.}\quad & Ax - b\tau = 0, \\ & -A^\top y + c\tau \ge 0, \\ & b^\top y - c^\top x \ge 0, \\ & x \ge 0, \quad \tau \ge 0 \end{aligned} \tag{3} \]

instead of (1). Clearly, (3) is a homogeneous LP, and it is selfdual, which essentially follows from the fact that the constraints form a skew-symmetric system. The interpretation of (3) is that τ is a homogenizing variable and the constraints represent primal feasibility, dual feasibility, and reversed weak duality. The homogeneous model (3) was first studied by A.J. Goldman and A.W. Tucker [2] in 1956, and they proved that (3) always has a nontrivial solution (x*, y*, τ*) satisfying

\[ \begin{aligned} & x_j^* s_j^* = 0, \quad \forall j, \\ & x_j^* + s_j^* > 0, \quad \forall j, \\ & \tau^* \kappa^* = 0, \\ & \tau^* + \kappa^* > 0, \end{aligned} \tag{4} \]

where s* := cτ* − A^⊤y* ≥ 0 and κ* := b^⊤y* − c^⊤x* ≥ 0. A solution to (3) satisfying the condition (4) is said to be a strictly complementary solution. Moreover, Goldman and Tucker showed that if (x*, τ*, y*, s*, κ*) is any strictly complementary solution, then exactly one of the two following situations occurs:
• τ* > 0 if and only if (1) has an optimal solution. In this case (x*, y*, s*)/τ* is an optimal primal-dual solution to (1).
• κ* > 0 if and only if (1) is primal or dual infeasible. In the case b^⊤y* > 0 (c^⊤x* < 0), then (1) is primal (dual) infeasible.

The conclusion is that a strictly complementary solution to (3) provides all the information required, because in the case τ* > 0 an optimal primal-dual solution to (1) is trivially given by (x, y, s) = (x*, y*, s*)/τ*. Otherwise, the problem is primal or dual infeasible. Therefore, the main algorithmic idea is to compute a strictly complementary solution to (3) instead of solving (1) directly. Y. Ye, M.J. Todd, and S. Mizuno [6] suggested to solve (3) by solving the problem

\[ \begin{aligned} \min\quad & n^0 z \\ \text{s.t.}\quad & Ax - b\tau - \bar b z = 0, \\ & -A^\top y + c\tau + \bar c z \ge 0, \\ & b^\top y - c^\top x + \bar d z \ge 0, \\ & \bar b^\top y - \bar c^\top x - \bar d \tau = -n^0, \\ & x \ge 0, \quad \tau \ge 0, \end{aligned} \tag{5} \]

where

\[ \begin{aligned} \bar b &:= Ax^0 - b\tau^0, \\ \bar c &:= -c\tau^0 + A^\top y^0 + s^0, \\ \bar d &:= c^\top x^0 - b^\top y^0 + \kappa^0, \\ n^0 &:= (x^0)^\top s^0 + \tau^0 \kappa^0, \end{aligned} \]

and

\[ (x^0, \tau^0, y^0, s^0, \kappa^0) = (e, 1, 0, e, 1), \]

with e the n-vector of all ones.
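As a quick numerical check of this construction, the point (x, τ, y, z) = (x⁰, τ⁰, y⁰, 1) can be verified to satisfy the constraints of (5); the following sketch (added here, not part of the original entry; the random instance and dimensions are arbitrary) uses numpy:

```python
import numpy as np

# Verify that (x, tau, y, z) = (e, 1, 0, 1), with s = e and kappa = 1,
# is feasible for (5), using the definitions of b_bar, c_bar, d_bar, n0 above.
rng = np.random.default_rng(0)
m, n = 3, 5
A, b, c = rng.normal(size=(m, n)), rng.normal(size=m), rng.normal(size=n)

x0, tau0, y0, s0, kappa0 = np.ones(n), 1.0, np.zeros(m), np.ones(n), 1.0
b_bar = A @ x0 - b * tau0
c_bar = -c * tau0 + A.T @ y0 + s0
d_bar = c @ x0 - b @ y0 + kappa0
n0 = x0 @ s0 + tau0 * kappa0                                   # = n + 1

x, tau, y, z = x0, tau0, y0, 1.0
print(np.allclose(A @ x - b * tau - b_bar * z, 0))             # equality rows
print(np.all(-A.T @ y + c * tau + c_bar * z >= 0))             # slack equals s0
print(b @ y - c @ x + d_bar * z >= 0)                          # slack equals kappa0
print(np.isclose(b_bar @ y - c_bar @ x - d_bar * tau, -n0))    # last equality
```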


It can be proved that the problem (5) always has an optimal solution. Moreover, the optimal value is identical to zero, and it is easy to verify that if (x, τ, y, z) is an optimal strictly complementary solution to (5), then (x, τ, y) is a strictly complementary solution to (3). Hence, the problem (5) can be solved using any method that generates an optimal strictly complementary solution, because the problem always has a solution. Note that by construction (x, τ, y, z) = (x⁰, τ⁰, y⁰, 1) is an interior feasible solution to (5). This implies that the problem (1) can be solved by most feasible-interior point algorithms.

X. Xu, P.-F. Hung, and Ye [4] suggest an alternative solution method which is also an interior point algorithm, but specially adapted to the problem (3). The so-called homogeneous algorithm can be stated as follows:

1) Choose (x⁰, τ⁰, y⁰, s⁰, κ⁰) such that (x⁰, τ⁰, s⁰, κ⁰) > 0. Choose ε_f, ε_g > 0 and γ ∈ (0, 1) and let η := 1 − γ.
2) k := 0.
3) Compute:

\[ \begin{aligned} r_p^k &:= b\tau^k - Ax^k, \\ r_d^k &:= c\tau^k - A^\top y^k - s^k, \\ r_g^k &:= \kappa^k + c^\top x^k - b^\top y^k, \\ \mu^k &:= \frac{(x^k)^\top s^k + \tau^k \kappa^k}{n+1}. \end{aligned} \]

4) If ‖(r_p^k, r_d^k, r_g^k)‖ ≤ ε_f and μ^k ≤ ε_g, then terminate.
5) Solve the linear equations

\[ \begin{aligned} A d_x - b\, d_\tau &= \eta\, r_p^k, \\ A^\top d_y + d_s - c\, d_\tau &= \eta\, r_d^k, \\ -c^\top d_x + b^\top d_y - d_\kappa &= \eta\, r_g^k, \\ S^k d_x + X^k d_s &= -X^k s^k + \gamma \mu^k e, \\ \kappa^k d_\tau + \tau^k d_\kappa &= -\tau^k \kappa^k + \gamma \mu^k \end{aligned} \]

for (d_x, d_τ, d_y, d_s, d_κ), where X^k := diag(x^k) and S^k := diag(s^k).
6) For some θ ∈ (0, 1), let α^k be the optimal objective value of

\[ \begin{aligned} \max\quad & \alpha \\ \text{s.t.}\quad & (x^k, \tau^k, s^k, \kappa^k) + \alpha\, (d_x, d_\tau, d_s, d_\kappa) \ge 0, \\ & \alpha \le \theta^{-1}. \end{aligned} \]

7)

\[ (x^{k+1}, \tau^{k+1}, y^{k+1}, s^{k+1}, \kappa^{k+1}) := (x^k, \tau^k, y^k, s^k, \kappa^k) + \alpha^k (d_x, d_\tau, d_y, d_s, d_\kappa). \]

8) k := k + 1.
9) Go to 3).

The following facts can be proved about the algorithm:

\[ \begin{aligned} (r_p^{k+1}, r_d^{k+1}, r_g^{k+1}) &= \big(1 - (1-\gamma)\alpha^k\big)\,(r_p^k, r_d^k, r_g^k), \\ (x^{k+1})^\top s^{k+1} + \tau^{k+1} \kappa^{k+1} &= \big(1 - (1-\gamma)\alpha^k\big)\,\big((x^k)^\top s^k + \tau^k \kappa^k\big), \end{aligned} \]

which shows that the primal residuals (r_p), the dual residuals (r_d), the gap residual (r_g), and the complementarity gap (x^⊤s + τκ) are all reduced strictly if α^k > 0, and at the same rate. This shows that the iterates (x^k, τ^k, y^k, s^k, κ^k) generated by the algorithm converge towards an optimal solution to (3) (and the termination criterion in step 4) is ultimately reached).
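A dense, self-contained sketch of steps 1)–9) is given below (an illustration added here, not part of the original entry). It forms and solves the Newton system of step 5) directly with numpy and uses a simple boundary-fraction stepsize for step 6); it makes no claim to the efficiency or robustness of a real implementation such as [1]:

```python
import numpy as np

def homogeneous_lp(A, b, c, gamma=0.5, theta=0.9, eps=1e-9, max_iter=500):
    """Minimal sketch of the homogeneous algorithm (steps 1-9 above).

    Returns (x, tau, y, s, kappa); tau > 0 signals that x/tau is optimal,
    kappa > 0 signals primal or dual infeasibility of the original LP.
    """
    m, n = A.shape
    x, tau, y, s, kappa = np.ones(n), 1.0, np.zeros(m), np.ones(n), 1.0
    eta = 1.0 - gamma
    for _ in range(max_iter):
        r_p = b * tau - A @ x
        r_d = c * tau - A.T @ y - s
        r_g = kappa + c @ x - b @ y
        mu = (x @ s + tau * kappa) / (n + 1)
        if max(np.linalg.norm(r_p), np.linalg.norm(r_d), abs(r_g)) <= eps and mu <= eps:
            break
        # Newton system in u = (dx, dtau, dy, ds, dkappa), assembled densely.
        N = 2 * n + m + 2
        M, rhs = np.zeros((N, N)), np.zeros(N)
        ix, itau = slice(0, n), n
        iy, is_, ik = slice(n + 1, n + 1 + m), slice(n + 1 + m, n + 1 + m + n), N - 1
        r = 0
        M[r:r+m, ix] = A;          M[r:r+m, itau] = -b;   rhs[r:r+m] = eta * r_p; r += m
        M[r:r+n, iy] = A.T;        M[r:r+n, is_] = np.eye(n)
        M[r:r+n, itau] = -c;       rhs[r:r+n] = eta * r_d; r += n
        M[r, ix] = -c;  M[r, iy] = b;  M[r, ik] = -1.0;    rhs[r] = eta * r_g; r += 1
        M[r:r+n, ix] = np.diag(s); M[r:r+n, is_] = np.diag(x)
        rhs[r:r+n] = -x * s + gamma * mu; r += n
        M[r, itau] = kappa;        M[r, ik] = tau;         rhs[r] = -tau * kappa + gamma * mu
        u = np.linalg.solve(M, rhs)
        dx, dtau, dy, ds, dk = u[ix], u[itau], u[iy], u[is_], u[ik]
        # Step 6 (simplified): fraction theta of the largest step keeping
        # (x, tau, s, kappa) nonnegative, capped so that alpha <= 1.
        pos = np.concatenate([x, [tau], s, [kappa]])
        d = np.concatenate([dx, [dtau], ds, [dk]])
        neg = d < 0
        alpha = theta * min(1.0 / theta, (pos[neg] / -d[neg]).min() if neg.any() else np.inf)
        x, tau, y = x + alpha * dx, tau + alpha * dtau, y + alpha * dy
        s, kappa = s + alpha * ds, kappa + alpha * dk
    return x, tau, y, s, kappa

# Tiny example: min -x1 - x2  s.t.  x1 + x2 + x3 = 1, x >= 0.
A = np.array([[1.0, 1.0, 1.0]]); b = np.array([1.0]); c = np.array([-1.0, -1.0, 0.0])
x, tau, y, s, kappa = homogeneous_lp(A, b, c)
print("tau =", tau, "kappa =", kappa, "x/tau =", x / tau, "obj =", c @ x / tau)
```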

In principle, the initial point and the stepsize α^k should be chosen such that

\[ \min_j \{ x_j^k s_j^k,\ \tau^k \kappa^k \} \ge \beta \mu^k, \qquad k = 0, 1, \dots, \]

is satisfied for some β ∈ (0, 1), because this guarantees that (x^k, τ^k, y^k, s^k, κ^k) converges towards a strictly complementary solution. Finally, it is possible to prove that the algorithm has the complexity O(n^{3.5} L) given an appropriate choice of the starting point and the algorithmic parameters. Further details about the homogeneous algorithm can be found in [3,5]. Issues related to implementing the homogeneous algorithm are discussed in [1,4].

See also
• Entropy Optimization: Interior Point Methods
• Interior Point Methods for Semidefinite Programming
• Linear Programming: Interior Point Methods
• Linear Programming: Karmarkar Projective Algorithm
• Potential Reduction Methods for Linear Programming
• Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems
• Successive Quadratic Programming: Solution by Active Sets and Interior Point Methods

References
1. Andersen ED, Andersen KD (2000) The MOSEK interior point optimizer for linear programming: An implementation of the homogeneous algorithm. In: Frenk H, Roos K, Terlaky T, Zhang S (eds) High Performance Optimization. Kluwer, Dordrecht, pp 197–232
2. Goldman AJ, Tucker AW (1956) Theory of linear programming. In: Kuhn HW, Tucker AW (eds) Linear Inequalities and Related Systems. Princeton Univ. Press, Princeton, pp 53–97
3. Roos C, Terlaky T, Vial J-P (1997) Theory and algorithms for linear optimization: An interior point approach. Wiley, New York
4. Xu X, Hung P-F, Ye Y (1996) A simplified homogeneous and self-dual linear programming algorithm and its implementation. Ann Oper Res 62:151–171
5. Ye Y (1997) Interior point algorithms: Theory and analysis. Wiley, New York
6. Ye Y, Todd MJ, Mizuno S (1994) An O(√n L)-iteration homogeneous and self-dual linear programming algorithm. Math Oper Res 19:53–67

Hyperplane Arrangements
PETER ORLIK
Department of Mathematics, University of Wisconsin, Madison, USA

MSC2000: 52C35, 05B35, 57N65, 20F36, 20F55

Article Outline
Keywords
Some Examples
Combinatorics
Divisor
Complement
Ball Quotients
Logarithmic Forms
Hypergeometric Integrals
See also
References

Keywords
Hyperplane arrangement; Geometric semilattice; Orlik–Solomon algebra; Divisor; Singularity; Complement; Homotopy type; Poincaré polynomial; Ball quotient; Logarithmic form; Hypergeometric integral

Let V be an ℓ-dimensional affine space over the field K. An arrangement of hyperplanes, A, is a finite collection of codimension one affine subspaces in V, [5].

Some Examples
1) A subset of the coordinate hyperplanes is called a Boolean arrangement.
2) An arrangement is in general position if at each point it is locally Boolean.
3) The braid arrangement consists of the hyperplanes {x_i = x_j : 1 ≤ i < j ≤ ℓ}. It is the set of reflecting hyperplanes of the symmetric group on ℓ letters.
4) The set of reflecting hyperplanes of a finite reflection group is a reflection arrangement.

Combinatorics
An edge X of A is a nonempty intersection of elements of A. Let L(A) be the set of edges partially ordered by reverse inclusion. Then L is a geometric semilattice with minimal element V, rank given by codimension, and maximal elements of the same rank, r(A). The Möbius function on L is defined by μ(V) = 1 and, for X > V, Σ_{V ≤ Y ≤ X} μ(Y) = 0. The characteristic polynomial of A is

\[ \chi(A, t) = \sum_{X \in L} \mu(X)\, t^{\dim X}. \]

The β-invariant of A is β(A) = (−1)^{r(A)} χ(A, 1). For a generic arrangement of n hyperplanes,

\[ \chi(A, t) = \sum_{k=0}^{r(A)} (-1)^k \binom{n}{k} t^{\ell - k}. \]

For the braid arrangement, χ(A, t) = t(t−1)(t−2) ⋯ (t−(ℓ−1)). Similar factorizations hold for all reflection arrangements, involving the (co)exponents of the reflection group.

Given a p-tuple of hyperplanes, S = (H_1, ..., H_p), let ∩S = H_1 ∩ ⋯ ∩ H_p, and note that ∩S may be empty. We say that S is dependent if ∩S ≠ ∅ and codim(∩S) < |S|. Let E(A) be the exterior algebra on symbols (H) for H ∈ A, where the product is juxtaposition. Define ∂: E → E by ∂1 = 0, ∂(H) = 1 and, for p ≥ 2,

\[ \partial(H_1 \cdots H_p) = \sum_{k=1}^{p} (-1)^{k-1} (H_1 \cdots \widehat{H_k} \cdots H_p). \]

Let I(A) be the ideal of E(A) generated by {S : ∩S = ∅} ∪ {∂S : S is dependent}. The Orlik–Solomon algebra of A is


A(A) = E(A)/I(A). See also connections with matroid theory [3].

Divisor
The divisor of A is the union of the hyperplanes, N(A). If K = R or K = C, then N has the homotopy type of a wedge of β(A) spheres of dimension r(A) − 1, [4]. The singularities of N are not isolated. The divisor of a general position arrangement has normal crossings, but this is not true for arbitrary A. Blowing up N along all edges where it is not locally a product of arrangements yields a normal crossing divisor.

Complement
The complement of A is M(A) = V − N(A).
1) If K = F_q, then M is a finite set of cardinality |M| = χ(A, q).
2) If K = R, then M is a disjoint union of open convex sets (chambers) of cardinality (−1)^ℓ χ(A, −1). If r(A) = ℓ, M contains β(A) chambers with compact closure, [7].
3) If K = C, then M is an open complex (Stein) manifold of the homotopy type of a finite CW complex. Its cohomology is torsion-free and its Poincaré polynomial is Poin(M, t) = (−t)^ℓ χ(A, −t^{−1}). The product structure is determined by the isomorphism of graded algebras H*(M) ≅ A(A).
The fundamental group of M has an effective presentation, but the higher homotopy groups of M are not known in general. The complement of a Boolean arrangement is a complex torus. In a general position arrangement of n > ℓ hyperplanes, M has nontrivial higher homotopy groups. For the braid arrangement, M is called the pure braid space and its higher homotopy groups are trivial. The symmetric group acts freely on M with orbit space the braid space, whose fundamental group is the braid group. The quotient of the divisor by the symmetric group is called the discriminant, which has connections with singularity theory.

Ball Quotients
Examples of algebraic surfaces whose universal cover is the complex ball were constructed as 'Kummer' covers of the projective plane branched along certain arrangements of projective lines, [2].
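The finite field count in 1) is easy to verify by brute force for the braid arrangement, where χ(A, q) = q(q − 1) ⋯ (q − ℓ + 1). The following minimal sketch (not part of the original entry) counts points of F_q^ℓ with pairwise distinct coordinates and compares with the characteristic polynomial:

```python
from itertools import product

def complement_count(q, ell):
    # count points of F_q^ell avoiding every hyperplane x_i = x_j (braid arrangement)
    return sum(
        1
        for x in product(range(q), repeat=ell)
        if all(x[i] != x[j] for i in range(ell) for j in range(i + 1, ell))
    )

def chi_braid(q, ell):
    # characteristic polynomial chi(A, t) = t(t-1)...(t-(ell-1)) evaluated at t = q
    val = 1
    for k in range(ell):
        val *= q - k
    return val

for q in (5, 7):
    for ell in (2, 3):
        assert complement_count(q, ell) == chi_braid(q, ell)
print("finite field counts match chi(A, q)")
```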

Logarithmic Forms
For H ∈ A choose a linear polynomial α_H with H = ker α_H, and let Q(A) = ∏_{H∈A} α_H. Let Ω^p[V] denote all global regular (i.e., polynomial) p-forms on V. Let Ω^p(V) denote the space of all global rational p-forms on V. The space Ω^p(A) of logarithmic p-forms with poles along A is

Ω^p(A) = {ω ∈ Ω^p(V) : Qω ∈ Ω^p[V], Q(dω) ∈ Ω^{p+1}[V]} .

The arrangement is free if Ω¹(A) is a free module over the polynomial ring. A free arrangement A has integer exponents {b₁, …, b_ℓ} so that χ(A, t) = ∏_{k=1}^{ℓ} (t − b_k). Reflection arrangements are free. This explains the factorization of their characteristic polynomials.

Hypergeometric Integrals
Certain rank one local system cohomology groups of M may be identified with spaces of hypergeometric integrals, [1]. If the local system is suitably generic, these cohomology groups may be computed using the algebra A(A). Only the top cohomology group is nonzero and it has dimension β(A). See [6] for connections with the representation theory of Lie algebras and quantum groups, and with the Knizhnik–Zamolodchikov differential equations of physics.

See also
Hyperplane Arrangements in Optimization

References
1. Aomoto K, Kita M (1994) Hypergeometric functions. Springer, Berlin
2. Barthel G, Hirzebruch F, Höfer T (1987) Geradenkonfigurationen und Algebraische Flächen. Vieweg, Braunschweig/Wiesbaden
3. Björner A, Las Vergnas M, Sturmfels B, White N, Ziegler GM (1993) Oriented matroids. Cambridge Univ. Press, Cambridge
4. Goresky M, MacPherson R (1988) Stratified Morse theory. Springer, Berlin
5. Orlik P, Terao H (1992) Arrangements of hyperplanes. Springer, Berlin
6. Varchenko A (1995) Multidimensional hypergeometric functions and representation theory of Lie algebras and quantum groups. World Sci., Singapore


7. Zaslavsky T (1975) Facing up to arrangements: Face-count formulas for partitions of space by hyperplanes. Memoirs Amer Math Soc 154

Hyperplane Arrangements in Optimization
PANOS M. PARDALOS
Center for Applied Optimization, Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA


MSC2000: 05B35, 20F36, 20F55, 52C35, 57N65

Article Outline
Keywords
See also
References

Keywords
Hyperplane arrangement; Polynomial time algorithm

A finite set S of hyperplanes in R^d defines a dissection of R^d into connected sets of various dimensions. We call this dissection the arrangement A(S) of S. Given a vector η = (η₁, …, η_d) ∈ R^d \ {0} and a number η₀ ∈ R, we may define a hyperplane H and associated halfspaces H⁻, H⁺ by

H = {x ∈ R^d : η · x = η₀} ,
H⁻ = {x ∈ R^d : η · x < η₀} ,
H⁺ = {x ∈ R^d : η · x > η₀} .

Clearly, H, H⁻, H⁺ are disjoint and H ∪ H⁻ ∪ H⁺ = R^d. We may now specify the location of a point relative to the set of hyperplanes S = {H₁, …, H_n}. For a point p and 1 ≤ j ≤ n, define

s_j(p) = −1 if p ∈ H_j⁻ ,  0 if p ∈ H_j ,  +1 if p ∈ H_j⁺ .

The vector s(p) = (s₁(p), …, s_n(p)) is called the position vector of p. Clearly there are at most 3^n possible position vectors; however, in general most of these will not occur. We say that points p and q lie on the same face if s(p) = s(q). The nonempty set of points with position vector r is called the face f(r):

f(r) = {p ∈ R^d : s(p) = r} .

The nonempty sets of this form are called the faces of the arrangement A(S). The position vector of a face g = f(r) is defined to be r, i.e., s(f(r)) = r.
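For illustration (this snippet is not part of the original entry), the position vector is immediate to compute once each hyperplane is given as {x : a_j · x = b_j}; the three lines below are hypothetical data:

```python
import numpy as np

# three hypothetical lines in R^2: x1 = 0, x2 = 0, x1 + x2 = 1
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])

def position_vector(p, A, b, tol=1e-12):
    # s_j(p) = -1, 0, +1 according to whether p lies in H_j^-, on H_j, or in H_j^+
    vals = A @ p - b
    return tuple(0 if abs(v) <= tol else (1 if v > 0 else -1) for v in vals)

print(position_vector(np.array([0.25, 0.25]), A, b))  # (1, 1, -1): an open cell
print(position_vector(np.array([0.00, 0.50]), A, b))  # (0, 1, -1): a face lying on H_1
```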

A face f is called a k-face if its dimension is k. Special names are used to denote k-faces for special values of k: a 0-face is called a vertex, a 1-face is called an edge, a (d−1)-face is called a facet, and a d-face is called a cell. A face f is said to be a subface of another face g if the dimension of f is one less than the dimension of g and f is contained in the boundary of g; it follows that s_i(f) = 0 unless s_i(f) = s_i(g), for 1 ≤ i ≤ n. If f is a subface of g, then we also say that f and g are incident (upon each other) or that they define an incidence. An arrangement A(S) of n ≥ d hyperplanes is called simple if any d hyperplanes of S have a unique point in common and if any d + 1 hyperplanes have no point in common. If n < d, we say that A(S) is simple if the common intersection of the n hyperplanes is a (d−n)-flat. For more details see [3,4] and [5]. As an application of hyperplane arrangements in algorithm design for optimization problems, see [1]. In it, the problem of minimizing the Euclidean distance function on R^n subject to m equality constraints and upper and lower bounds (box constraints) is considered. A parametric characterization in R^m of the family of solutions to this problem is provided, thereby showing equivalence with a problem of search in an arrangement of hyperplanes in R^m. This characterization and the technique for constructing arrangements due to H. Edelsbrunner, J. O'Rourke and R. Seidel are used to develop an exact algorithm for the problem; a sketch of the single-constraint case is given below.
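The following is a minimal sketch (not the algorithm of [1], whose arrangement-search machinery is far more general) of the case m = 1, which reduces to a one-dimensional search for the multiplier; all data are hypothetical:

```python
import numpy as np

# min ||x - y||^2  s.t.  a @ x = b,  l <= x <= u   (hypothetical instance, m = 1)
rng = np.random.default_rng(0)
n = 6
y = rng.normal(size=n)
a = rng.uniform(0.5, 2.0, size=n)
l, u, b = -np.ones(n), np.ones(n), 1.0

def x_of(lam):
    # for a fixed multiplier the problem separates by coordinate: clip to the box
    return np.clip(y - lam * a, l, u)

# a @ x_of(lam) is nonincreasing in lam (a > 0), so bisection finds the multiplier
lo, hi = -1e6, 1e6
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if a @ x_of(mid) > b else (lo, mid)

x = x_of(0.5 * (lo + hi))
print(x, a @ x)  # a @ x ~ b up to bisection tolerance
```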


The algorithm is strongly polynomial, running in time O(n^m) for each fixed m.

See also
Hyperplane Arrangements

References
1. Berman P, Kovoor N, Pardalos PM (1993) Algorithms for the least distance problem. In: Complexity in Numerical Optimization. World Sci., Singapore, pp 33–56

2. Chazelle B, Guibas J, Lee DT (1985) The power of geometric duality. In: Proc. 15th Annual ACM Symp. Theory of Computing. ACM, New York, pp 217–225
3. Edelsbrunner H (1987) Algorithms in combinatorial geometry. Springer, Berlin
4. Edelsbrunner H, O'Rourke J, Seidel R (1986) Constructing arrangements of lines and hyperplanes with applications. SIAM J Comput 15:341–363
5. Orlik P, Terao H (1992) Arrangements of hyperplanes. Springer, Berlin
6. Pardalos PM, Kovoor N (1990) An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Math Program 46:321–328


Identification Methods for Reaction Kinetics and Transport
ANDRÉ BARDOW, WOLFGANG MARQUARDT
AVT – Process Systems Engineering, RWTH Aachen University, Aachen, Germany

MSC2000: 34A55, 35R30, 62G05, 62G08, 62P30, 62P10, 62J02, 62K05, 76R50, 80A23, 80A30, 80A20

Article Outline
Introduction
Methods and Applications
Model-Based Experimental Analysis
Incremental vs. Simultaneous Model Identification
Case Studies

References

Introduction
Kinetic phenomena drive the macroscopic behavior of biological, chemical, and physical systems. The lack of mechanistic understanding of these kinetic phenomena is still the major bottleneck for a more widespread application of model-based techniques in process design, optimization, and control. In recent years, kinetic phenomena have become of increasing importance given the rapidly developing capabilities for the numerical treatment of more complex models on the one hand and the need for predictive models on the other. Despite this demand, kinetic modeling of process systems is still a challenge. This contribution presents systematic work processes to derive and validate models that capture the underlying physicochemical mechanisms of an observed behavior. The work process of model-based experimental analysis (or MEXA for short)

is introduced in the next section. The key factor in the procedure is an incremental strategy for model structure refinement tailored for the identification of reaction kinetics and transport phenomena [30]. While identification of kinetic models from experimental data can, in principle, be performed by application of standard statistical tools of nonlinear regression [2] and model discrimination [39], this direct approach in general leads to a large number of NLP or even MINLP problems being solved [16,21,34,37] that may be computationally prohibitive and in particular does not reflect the underlying physics. In contrast, the incremental identification approach discussed here presents a physically motivated and adapted divide-and-conquer strategy to the complex optimization problem of kinetic model identification. Applications of this approach in the areas of (bio)chemical reactions [6,12,13,15,32], multicomponent diffusion [3,5], and heat transfer in fluid flow [22,25] are discussed.

Methods and Applications

Model-Based Experimental Analysis
The typical work flow of the MEXA procedure is as follows (Fig. 1):
1. An initial experiment with a suitable measurement system is designed on the basis of a priori knowledge and intuition.
2. A first mathematical model of experiment and measurement system is proposed.
3. Numerical simulation studies are performed to explore the expected behavior of the experiment.
4. The model is then employed for rigorous experimental design [41] to gain maximum information with respect to the goal of the investigation.


Identification Methods for Reaction Kinetics and Transport, Figure 1 Model-based experimental analysis [30]

5. The designed experiment is performed and at least some of the variables of interest are observed using appropriate measurement techniques.
6. Formulation and solution of inverse problems refers to combinations of state, parameter, and unknown input estimation as well as model structure identification and selection.
7. Typically, the first model does not reflect the studied phenomena with sufficient detail and accuracy. Therefore, iterative model refinement, intertwined with iterative improvement of the experimental and measurement techniques, must be carried out to improve the predictive capabilities of the model based on the extended understanding gained.

Work processes consisting of the steps design of experiments, data analysis, and modeling date back to at least the 1970s [26]. However, the development and benchmarking of such work processes has only recently been formulated as an important research objective, e.g., by the Collaborative Research Center CRC 540 "Model-based Experimental Analysis in Fluid Multi-Phase Reactive Systems" (http://www.sfb540.rwth-aachen.de/) at RWTH Aachen University as well as by Asprey and Macchietto [1]. The power of these work processes depends on the specific strategies employed for systematically improving both the model structure and the experimental setup in every refinement step during model identification.

While experimental design is the focus of the work of Asprey and Macchietto [1], the research in CRC 540 is complementary and emphasizes the strategy for model structure refinement, as discussed in what follows.

Incremental vs. Simultaneous Model Identification

Incremental Modeling and Identification
The key idea of the incremental approach for model structure refinement is to follow the incremental steps of systematic model development [29] also in model identification (Fig. 2). Therefore, the main steps of model development and their connection to incremental identification are outlined next.

Model B  In model development, balance envelopes and their interactions are determined first, the spatiotemporal resolution of the model is decided, and the extensive quantities x to be balanced are selected. The balance equation is formulated as a sum of generalized fluxes, e.g.,

∂x/∂t = −∇ · J_f + J_s ,   (1)
dx/dt = A(x) + B(x) w .   (2)


Identification Methods for Reaction Kinetics and Transport, Figure 2 Incremental modeling and identification [30]

Equation (1) exemplifies a balance for a distributed quantity x flowing with the flux J_f and being generated/consumed according to the source/sink term J_s. Note that further generalized fluxes may arise through initial or boundary conditions. A lumped quantity is balanced in Eq. (2), where A, B are matrix functions of appropriate dimensions describing, e.g., inter- and intraphase transport and source/sink terms. Note that no constitutive equations are considered yet to specify the generalized fluxes J(·) (here: J_f, J_s, w) as a function of the intensive thermodynamic state variables; the (·)-argument summarizes the spatial and/or temporal dependency of the quantity. In incremental model identification, the unknown generalized fluxes J(·) are estimated directly from the balance equation. For this purpose, measurements of the states x(·) with sufficient resolution in time t and/or space z are assumed. The unknown flux J(·) in the balance equation is then determined as a function of time and space coordinates – without the need for specifying a constitutive equation.

Model BF  In model development, constitutive equations are specified for each flux term in the balances on the next decision level:

J(·) = J(x(·), ∇x(·), … ; k(·)) .   (3)

This could be, e.g., correlations for interfacial fluxes or reaction rates.

Similarly, in incremental model identification on level BF, flux model candidates (3) are selected or generated to relate the flux to rate coefficients, to measured states, and to their derivatives. The flux estimates obtained on level B are now interpreted as inferential measurements. These can then be used, together with the real measurements, to determine a rate coefficient k(·) as a function of time and space. Often, the flux model can directly be solved for the rate coefficient function k(·).

Model BFR  In model development, the rate coefficients introduced in the correlations on the level BF – such as a reaction rate or heat and mass transfer coefficients – often themselves depend on the states. Consequently, a model relating rate coefficients and states has to be chosen on yet another level BFR:

k(·) = k(x(·), ∇x(·), … ; θ) .   (4)

This cascaded decision process can continue as long as the submodels considered involve not only constant parameters θ but also functions of the states. Mirroring this step in incremental model identification, a model for the rate coefficients is identified. This model (4) is assumed to depend only on the measured states and constant parameters. These parameters θ can be computed from the estimated rate coefficients k(·) and the measured states x(·) by solving an algebraic regression problem. A minimal numerical sketch of this three-level cascade is given below.
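To make the cascade concrete, here is a minimal numerical sketch (not from the original article) for a hypothetical first-order batch reaction A → B: level B differentiates smoothed data, level BF applies the stoichiometry, and level BFR fits the rate constant. The hand-picked smoothing parameter merely stands in for the systematic regularization choices discussed later in the article:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# hypothetical batch reactor A -> B with true rate r = k * cA
k_true, v = 0.5, 1.0
t = np.linspace(0.0, 5.0, 60)
cA = np.exp(-k_true * t)                       # exact solution of dcA/dt = -k*cA
cA_meas = cA + np.random.default_rng(1).normal(0.0, 1e-3, t.size)

# level B: flux dnA/dt estimated by smoothing-spline differentiation of the data
spline = UnivariateSpline(t, cA_meas, k=4, s=t.size * 1e-6)
f_r = v * spline.derivative()(t)

# level BF: with stoichiometric coefficient -1, the rate is r(t) = -f_r / v
r = -f_r / v

# level BFR: algebraic regression of the rate-law parameter k in r = k * cA
k_est = np.sum(r * spline(t)) / np.sum(spline(t) ** 2)
print(f"estimated k = {k_est:.3f}  (true value {k_true})")
```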


Such a structured approach during model identification renders the individual decisions in the modeling and identification process completely transparent: the modeler is in full control of the model refinement process.

Simultaneous Model Identification
In the previous section, it was shown that there exists a natural hierarchy in models of kinetic phenomena. Classic approaches to model identification, however, neglect this inherent structure. These simultaneous approaches assume that the model structure is correct and consider only the fully specified model (Fig. 2). Models for the flux expression (Model BF) and the phenomenological coefficients (Model BFR) have to be specified a priori. In practical situations, these models are initially uncertain. Now, all assumptions on the process will simultaneously influence the results of the model identification procedure. The estimates may be biased if the parameter estimation is based on a model containing structural errors [42]. The theoretically optimal properties of a maximum likelihood approach [2] are therefore lost in the presence of structural model mismatch. Initialization and convergence may be difficult since the whole problem is solved in one step [18]. More importantly, it may be difficult in a simultaneous approach to identify which part of the model introduced the error. Furthermore, several candidate model structures may exist for each kinetic phenomenon. The aggregation of such submodels with the balance equations will inevitably lead to a multitude of candidate models. Alternatively, general approximation schemes like neural nets can be used, often leading to several hundred unknown parameters. Both approaches may be prohibitive due to computational cost, especially when more complex or even distributed parameter systems are considered.

Discussion of Identification Approaches
The incremental approach splits the identification procedure into a sequence of inverse problems, thereby reducing uncertainty and computational complexity. It thus has the potential to overcome a number of the disadvantages of the simultaneous approach:
– Avoid combinatorial complexity: Rather than postulating large numbers of nested model structures, a structured, fully transparent process is used in the incremental model refinement strategy. An uncontrolled combinatorial growth of the number of model candidates is avoided.
– Reduce uncertainty: In the incremental approach, any decision on the model structure relates to a single physicochemical phenomenon. Submodel selection is guided by the previous estimation step, which provides input–output data inferred from the measurements. Identifiability can also be assessed more easily on the level of the submodel.
– Computational advantages: The decomposition inherent in incremental model refinement avoids the solution of many difficult output least-squares problems with (partial-)differential-algebraic constraints and potentially large data sets. Rather, an often linear inverse problem must be solved first. All the following problems are nonlinear regression problems with algebraic constraints – regardless of the complexity of the overall model. This decomposition not only facilitates initialization and convergence, but it also allows for incremental testing of model validity at every decision level for the submodels. Largely intractable estimation problems may become computationally feasible.

Still, it should be kept in mind that the incremental and the simultaneous methods were derived for different purposes: the incremental approach is aimed at gross elimination of candidate models and/or systematic derivation of suitable candidate model structures, whereas the simultaneous approach gives the best parameter estimates once the correct model structure is known [6]. Multistep approaches to model identification have been applied rather intuitively in the past. The sequence of flux estimation and parameter regression is, e.g., commonly employed in reaction kinetics as the so-called differential method [19]. More recently, a two-step approach has been applied for the hybrid modeling of fermentation processes [36,38]. First, reaction fluxes are estimated from measured data; then neural networks and fuzzy models are employed to correlate the fluxes with the measurements. Mahoney et al. [28] estimate the crystal growth rate directly from the population balance equations using a method of characteristics approach and indicate the possibility of correlating it with solute concentration next.


Though the incremental refinement approach is rather intuitive, a successful implementation requires tailored ingredients such as
– high-resolution field measurement techniques for state variables,
– algorithms for model-free flux estimation by inversion of the balance equations,
– methodologies for the generation, assessment, and selection of the most suitable model structures, and
– model-based experimental design methods.
A detailed discussion of these areas in relation to incremental model identification can be found in [30]. Various aspects are highlighted in the following case studies. Here, the progress in the development of the incremental model identification approach is reported for challenging kinetic modeling problems of gradually increasing complexity, from (bio)chemical reactions to diffusion in liquids and to heat transfer at falling liquid films. In addition, the incremental approach has been successful in the identification of hybrid process models [24].

Case Studies

(Bio)chemical Reaction Kinetics
The identification of the mechanism and kinetics of chemical reactions is one of the most relevant and still not yet fully satisfactorily solved tasks in process systems modeling [8]. In biological systems, the situation is often even more severe due to the complexity of living systems. The incremental identification approach has been applied to a variety of reaction systems [6,12,13,15,32]. Here, selected features are discussed to elucidate the general properties of this problem class.

Model B: Reaction flux estimation in lumped systems  For illustration, we assume a well-mixed and isothermal homogeneous reaction system. The balance equation for the mole number n_i of species i is then

dn_i/dt = f_i^in − f_i^out + f_i^r ,  i = 1, …, n_c ,   (5)

where f_i^in, f_i^out are, respectively, the molar flow rates into and out of the reactor and f_i^r is the unknown reaction flux of species i. It is worth noting that the fluxes enter the balance equations linearly and the equations are decoupled for each species.


All reaction fluxes f_i^r can thus be estimated individually by numerical differentiation of concentration data for each measured species on level B, from material balances only. Tikhonov–Arsenin filtering [31] or smoothing splines [6] with regularization parameter choice based on the L-curve or generalized cross-validation have been shown to give reliable estimates.

Model BF: Estimation of reaction rates and stoichiometry  If the reaction stoichiometry is unknown, target factor analysis (TFA) [11] is used to test possible stoichiometries and to determine the number of relevant reactions. The reaction rates r(t) can then be calculated from the typically nonsquare linear equation system relating the reaction fluxes f_i^r(t) and the rates by the stoichiometric matrix N:

f^r(t) = v(t) N^T r(t) ,   (6)

with v(t) denoting the reactor volume.

Model BFR: Estimation of kinetic coefficients  On the next level, concentrations are determined either from smoothed measurements using nonparametric methods [40], or unmeasured concentrations are reconstructed from stoichiometry and mass balances [13]. Since a complete set of concentration and rate data is now available, candidate reaction rate laws of the general form

r(t) = m(c(t), θ)   (7)

can now be discriminated by nonlinear algebraic regression [42]. Model identification may not immediately result in reliable model structures and parameters because of a lack of information content in the data. Iterative improvement with optimally chosen experimental conditions, as suggested in the MEXA work process, can then be employed [13]. The incremental identification method has been worked out for arbitrary reaction schemes with reversible or irreversible as well as dependent or independent reactions. The minimum type of concentration measurements required to guarantee identifiability has been assessed theoretically. The incremental identification strategy has been used in a benchmark study considering a homogeneous reaction system [13].
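A minimal sketch (not from the original article) of the level-BF step (6) and the level-BFR regression (7) for a hypothetical series network A → B → C; in practice, the flux data would come from the level-B differentiation above:

```python
import numpy as np

# hypothetical series reactions A -> B -> C
N = np.array([[-1.0,  1.0,  0.0],    # stoichiometry of reaction 1
              [ 0.0, -1.0,  1.0]])   # stoichiometry of reaction 2
v, k1, k2 = 1.0, 0.7, 0.3
t = np.linspace(0.0, 8.0, 40)
cA = np.exp(-k1 * t)
cB = k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
r_true = np.vstack([k1 * cA, k2 * cB])

# pretend level B delivered the species fluxes f_r = v * N^T r, cf. Eq. (6)
f_r = v * (N.T @ r_true)

# level BF: recover the rates r(t) from the nonsquare linear system (6)
r_est, *_ = np.linalg.lstsq(v * N.T, f_r, rcond=None)

# level BFR: fit the rate laws r1 = k1*cA and r2 = k2*cB by algebraic regression, Eq. (7)
print(np.sum(r_est[0] * cA) / np.sum(cA ** 2))   # ~ 0.7
print(np.sum(r_est[1] * cB) / np.sum(cB ** 2))   # ~ 0.3
```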


Computational effort for model identification could be reduced by almost two orders of magnitude using the structured search in the incremental method. The inclusion of data-driven model substructures in hybrid models is straightforward, as already exemplified for neural networks [15] and sparse grids [14]. The basic framework can easily be extended to nonisothermal systems, and even multiphase transport has been considered [12,23]. An application study to a biochemical reaction system is presented in [32].

Multicomponent Diffusion
While phase equilibrium models are available even for complex multicomponent mixtures [17], there is a lack of experimentally validated diffusion models, in particular for multicomponent liquid mixtures [9]. The incremental identification of diffusive mass transport models is therefore outlined in this section. The application is based on the recently introduced Raman diffusion experiment [4,7]. Here, one-dimensional interdiffusion of two initially layered liquid mixtures is observed by 1D-Raman spectroscopy. Concentration profiles c_i of all species are obtained with high resolution in time and space [20].

Model B: Estimation of 1D-diffusion fluxes  For the 1D-diffusion process, the mass balance equation for each species i can be given as

∂c_i/∂t = −∂J_i/∂z ,  i = 1, …, n_c − 1 .   (8)

The determination of the diffusive flux J_i falls into the class of interior flux estimation in distributed parameter systems [30]. While interior fluxes cannot be determined in 2D or 3D situations without specification of a constitutive model, model-free flux estimation is possible in the one-dimensional situation considered here. Only one nonzero mass flux component has to be determined from differentiated concentrations measured along a line in the direction of the diffusive flux. Such a strategy has been followed in [3,5]. The Raman concentration measurements were first differentiated with respect to time by means of spline smoothing [33] and subsequently integrated over the spatial coordinate to render a diffusive flux estimate without specifying a diffusion model:

J_i(z, t) = −∫₀^z ∂c_i(ζ, t)/∂t dζ .   (9)

This technique directly carries over to multicomponent diffusion [5], provided concentration measurements are available for every species. In particular, there is only a linear increase in complexity due to the natural decoupling of the multicomponent material balances (8).

Model BF: Estimation of diffusion flux models  A flux model has to be introduced on the next level. For example, generalized Fick or Maxwell–Stefan models could be selected as candidates. In the case of binary mixtures, the Fick diffusion coefficient can, e.g., be determined at any point in time and space:

D(z, t) = − J(z, t) / (∂c(z, t)/∂z) .   (10)
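The two model-free steps (9) and (10) are easy to mimic on synthetic data. The following sketch (not from the original article) assumes a binary mixture with a constant true Fick coefficient and anchors the integral in (9) at the closed left boundary, where the flux vanishes:

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import cumulative_trapezoid

D_true = 1e-9                              # m^2/s, hypothetical
z = np.linspace(-2e-3, 2e-3, 201)          # m
t = np.linspace(20.0, 100.0, 41)           # s
Z, T = np.meshgrid(z, t, indexing="ij")
c = 0.5 * (1.0 + erf(Z / (2.0 * np.sqrt(D_true * T))))   # interdiffusion profile

# Eq. (9): J(z, t) = -integral of dc/dt over z, anchored at the closed boundary
dcdt = np.gradient(c, t, axis=1)
J = -cumulative_trapezoid(dcdt, z, axis=0, initial=0.0)

# Eq. (10): pointwise Fick coefficient where the gradient is not flat
dcdz = np.gradient(c, z, axis=0)
mask = np.abs(dcdz) > 1e-2 * np.abs(dcdz).max()
print(np.median(-J[mask] / dcdz[mask]))    # ~ 1e-9
```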

Positivity requirements may now be used, e.g., to assess model assumptions on this level.

Model BFR: Estimation of model parameters  The estimated diffusion coefficient data can now be correlated with the measured concentrations to obtain a diffusion model:

D(z, t) = m(c(z, t), θ) .   (11)

Error-in-variables methods [10] and statistical model discrimination techniques [35] are employed to decide on the most appropriate model for the concentration dependence of the diffusion coefficient. This concentration dependence has been shown to be identifiable even from a single Raman diffusion experiment [3]. An application in food science has recently been presented in [27]. In the case of multicomponent diffusion, the last two levels have to be merged, because all species concentration gradients are determined by the diffusive flux of any species due to the cross effects of multicomponent diffusion [3,5]. The merged steps BF and BFR then allow for efficient initialization of these complex estimation problems.

Heat Transfer at Falling Films
Liquid falling films are a challenging benchmark problem for general fluid multiphase reaction systems, as they show all the relevant features of this problem class. Here, the first steps in the application of the incremental approach to heat transfer in falling films are considered [22,25].

Model B: Boundary flux estimation in distributed systems


In order to study its heat transfer characteristics, a laminar-wavy falling film is heated by resistance heating using a supporting wall as heater. Infrared thermography is employed to measure a transient 2D temperature field on the backside of the wall. An inverse heat conduction problem for the three-dimensional wall has to be solved to determine the boundary heat flux between the wall and the falling film as the first step of the incremental approach:

∂T/∂t = a ΔT ,   (12)
−λ ∇T |_Γ = w(z_Γ, t) ,   (13)
−λ ∇T |_Σ = q(z_Σ, t) ,   (14)

with Γ and Σ being the parts of the surface with unknown and with known boundary heat fluxes w(z_Γ, t) and q(z_Σ, t), respectively. The boundary flux estimation problem is solved by means of a multigrid finite-element discretization of the heat conduction equation (12) in conjunction with the conjugate gradient method. Gradient computation is performed using the adjoint method. This framework allows for the solution of the discretized problem, involving about three million variables, on a desktop computer [22]. These results show that the identification of kinetic phenomena may become feasible even in complex flow problems using the structured search strategy of the incremental approach. A generalization of the presented problem to work out the full incremental identification concept for heat transfer problems in falling films is currently in progress [25].

References
1. Asprey SP, Macchietto S (2000) Statistical tools for optimal dynamic model building. Comput Chem Eng 24(2–7):1261–1267
2. Bard Y (1974) Nonlinear Parameter Estimation. Academic, New York
3. Bardow A, Göke V, Koß HJ, Lucas K, Marquardt W (2005) Concentration-dependent diffusion coefficients from a single experiment using model-based Raman spectroscopy. Fluid Phase Equilib 228:357–366
4. Bardow A, Göke V, Koß HJ, Marquardt W (2006) Ternary diffusivities by model-based analysis of Raman spectroscopy measurements. AIChE J 52(12):4004–4015


5. Bardow A, Marquardt W (2004) Identification of diffusive transport by means of an incremental approach. Comput Chem Eng 28(5):585–595
6. Bardow A, Marquardt W (2004) Incremental and simultaneous identification of reaction kinetics: methods and comparison. Chem Eng Sci 59(13):2673–2684
7. Bardow A, Marquardt W, Göke V, Koß HJ, Lucas K (2003) Model-based measurement of diffusion using Raman spectroscopy. AIChE J 49(2):323–334
8. Berger R, Stitt E, Marin G, Kapteijn F, Moulijn J (2001) Eurokin – chemical reaction kinetics in practice. CATTECH 5(1):30–60
9. Bird RB (2004) Five decades of transport phenomena. AIChE J 50(2):273–287
10. Boggs PT, Byrd RH, Rogers J, Schnabel RB (1992) User's reference guide for ODRPACK version 2.01 – software for weighted orthogonal distance regression. Technical Report NISTIR 92-4834. US Department of Commerce, National Institute of Standards and Technology
11. Bonvin D, Rippin DWT (1990) Target factor-analysis for the identification of stoichiometric models. Chem Eng Sci 45(12):3417–3426
12. Brendel M (2006) Incremental identification of complex reaction systems. PhD thesis, RWTH Aachen University
13. Brendel M, Bonvin D, Marquardt W (2006) Incremental identification of kinetic models for homogeneous reaction systems. Chem Eng Sci 61(16):5404–5420
14. Brendel M, Marquardt W (2008) An algorithm for multivariate function estimation based on hierarchically refined sparse grids. Computing and Visualization in Science, doi:10.1007/s00791-008-0085-1
15. Brendel M, Mhamdi A, Bonvin D, Marquardt W (2004) An incremental approach for the identification of reactor kinetics. In: Allgöwer F, Gao F (eds) ADCHEM – 7th International Symposium on Advanced Control of Chemical Processes. Preprints vol I, pp 177–182, Hong Kong
16. Brink A, Westerlund T (1995) The joint problem of model structure determination and parameter-estimation in quantitative IR spectroscopy. Chemom Intell Lab Syst 29(1):29–36
17. Chen C-C, Mathias PM (2002) Applied thermodynamics for process modelling. AIChE J 48(2):194–200
18. Cheng ZM, Yuan WK (1997) Initial estimation of heat transfer and kinetic parameters of a wall-cooled fixed-bed reactor. Comput Chem Eng 21(5):511–519
19. Froment GF, Bischoff KB (1990) Chemical Reactor Analysis and Design. Wiley, New York
20. Göke V (2005) Messung von Diffusionskoeffizienten mittels eindimensionaler Raman-Spektroskopie. PhD thesis, RWTH Aachen University
21. Goodwin GC, Payne RL (1977) Dynamic system identification: experiment design and data analysis. Academic, New York
22. Gross S, Somers M, Mhamdi A, Al Sibai F, Reusken A, Marquardt W, Renz U (2005) Identification of boundary heat fluxes in a falling film experiment using high resolution temperature measurements. Int J Heat Mass Tran 48(25–26):5549–5562
23. Kahrs O, Brendel M, Marquardt W (2005) Incremental identification of NARX models by sparse grid approximation. In: Proceedings of the 16th IFAC World Congress, 3–8 July 2005, Prague
24. Kahrs O, Marquardt W (2008) Incremental identification of hybrid process models. Comput Chem Eng 32(4–5):694–705
25. Karalashvili M, Groß S, Mhamdi A, Reusken A, Marquardt W (2007) Incremental identification of transport phenomena in wavy films. In: Plesu V, Agachi P (eds) 17th European Symposium on Computer Aided Process Engineering – ESCAPE17. Elsevier, Amsterdam (CD-ROM)
26. Kittrell J (1970) Mathematical modeling of chemical reactions. Adv Chem Eng 8:97–183
27. Lucas T, Bohuon P (2005) Model-free estimation of mass fluxes based on concentration profiles. I. Presentation of the method and of a sensitivity analysis. J Food Eng 70(2):129–137
28. Mahoney AW, Doyle FJ, Ramkrishna D (2002) Inverse problems in population balances: growth and nucleation from dynamic data. AIChE J 48(5):981–990
29. Marquardt W (1995) Towards a process modeling methodology. In: Berber R (ed) Methods of Model-Based Control. Kluwer, Dordrecht, pp 3–41
30. Marquardt W (2005) Model-based experimental analysis of kinetic phenomena in multi-phase reactive systems. Chem Eng Res Des 83(A6):561–573
31. Mhamdi A, Marquardt W (1999) An inversion approach to the estimation of reaction rates in chemical reactors. In: Proceedings of the European Control Conference (ECC'99), Karlsruhe, Germany, pp F1004–1
32. Michalik C, Schmidt T, Zavrel M, Ansorge-Schumacher M, Spiess A, Marquardt W (2007) Application of the incremental identification method to the formate dehydrogenase. Chem Eng Sci 62(18–20):5592–5597
33. Reinsch CH (1967) Smoothing by spline functions. Numer Math 10(3):177–183
34. Skrifvars H, Leyffer S, Westerlund T (1998) Comparison of certain MINLP algorithms when applied to a model structure determination and parameter estimation problem. Comput Chem Eng 22(12):1829–1835
35. Stewart WE, Shon Y, Box GEP (1998) Discrimination and goodness of fit of multiresponse mechanistic models. AIChE J 44(6):1404–1412
36. Tholudur A, Ramirez WF (1999) Neural-network modeling and optimization of induced foreign protein production. AIChE J 45(8):1660–1670
37. Vaia A, Sahinidis NV (2003) Simultaneous parameter estimation and model structure determination in FTIR spectroscopy by global MINLP optimization. Comput Chem Eng 27(6):763–779

38. van Lith PF, Betlem BHL, Roffel B (2002) A structured modeling approach for dynamic hybrid fuzzy-first principles models. J Process Control 12(5):605–615
39. Verheijen PJT (2003) Model selection: an overview of practices in chemical engineering. In: Asprey SP, Macchietto S (eds) Dynamic Model Development: Methods, Theory and Applications. Elsevier Science, Amsterdam, pp 85–104
40. Wahba G (1990) Spline Models for Observational Data. SIAM, Philadelphia
41. Walter E, Pronzato L (1990) Qualitative and quantitative experiment design for phenomenological models – a survey. Automatica 26(2):195–213
42. Walter E, Pronzato L (1997) Identification of Parametric Models from Experimental Data. Springer, Berlin

Ill-posed Variational Problems
IVP
ALEXANDER KAPLAN, RAINER TICHATSCHKE
Universität Trier, Trier, Germany

MSC2000: 65K05, 65M30, 65M32, 49M30, 49J40

Article Outline
Keywords
See also
References

Keywords
Ill-posed variational problem; Well-posedness; Tikhonov regularization; Proximal point methods

It is generally accepted that the notion 'ill-posed problem' originates from a considered concept of well-posedness: a problem is called ill-posed if it is not well-posed. There are a lot of different notions of well-posedness (cf. [15,23,27,35,38] and [40]), which correspond to certain classes of variational problems and numerical methods and take into account the 'quality' of the input data, in particular their exactness. For a comparison of different concepts of well-posedness see [12,15] and [35]. For instance, Tikhonov well-posedness [35,38] is convenient if we deal with methods generating feasible minimizing sequences, and it is not appropriate to analyse the stability of exterior penalty methods.


We shall proceed from two concepts of well-posedness which are suitable for wide classes of problems and methods. The first concept applies to the problem

min {J(u) : u ∈ K} ,   (1)

where K is a nonempty closed subset of a Banach space V with the norm ‖·‖ and J : V → R ∪ {+∞} is a proper lower-semicontinuous functional.

Definition 1 (cf. [27]) The sequence {u_n} ⊂ V is said to be a generalized minimizing sequence (Levitin–Polyak minimizing sequence) for (1) if

lim_{n→∞} d(u_n, K) = 0  and  lim_{n→∞} J(u_n) = inf_{u∈K} J(u) ,

with d(u, K) = inf_{v∈K} ‖u − v‖ the distance function.

Definition 2 (1) is called well-posed (Levitin–Polyak well-posed) if
i) it is uniquely solvable, and
ii) any generalized minimizing sequence converges to ū = arg min{J(u) : u ∈ K}.

The second concept (cf. [20,23]) concerns (1) with

K = {u ∈ U₀ : B(u) ≤ 0} ,   (2)

U₀ ⊂ V a convex closed set, B : U₀ → Y a convex continuous mapping into a Banach space Y, and J : U₀ → R a convex continuous functional. The relation '≤' in (2) and the convexity of B are defined according to a positive cone in Y. In this case, the study of the dependence of a solution on data perturbations is often more natural and simpler than the analysis of the convergence of a generalized minimizing sequence. We suppose that U₀ is exactly given and a violation of the condition u ∈ U₀ does not arise. For a fixed δ > 0, the set of variations is defined by

Φ_δ = {φ_δ = (J_δ, B_δ) : ‖J − J_δ‖_{C(U₀)} ≤ δ, sup_{u∈U₀} ‖B(u) − B_δ(u)‖_Y ≤ δ} ,   (3)

where J_δ : U₀ → R, B_δ : U₀ → Y are assumed to be continuous. Then, the problem

min {J_δ(u) : u ∈ U₀, B_δ(u) ≤ 0}   (4)

corresponds to an arbitrary but fixed variation φ_δ ∈ Φ_δ. The set of optimal solutions of (4) will be denoted by U*(φ_δ).

Definition 3 Problem (1), (2) is called well-posed if
i) it is uniquely solvable,
ii) there exists a constant δ₀ > 0 such that for any δ ∈ (0, δ₀) and any φ_δ ∈ Φ_δ the set U*(φ_δ) is nonempty,
iii) lim_{δ→0} d(u*, U*(φ_δ)) = 0 for arbitrary φ_δ ∈ Φ_δ.

Depending on the peculiarities of the problem considered, the 'quality' of the data as well as the requirements on an approximation, other norms in (3) and additional assumptions w.r.t. J_δ, B_δ can be considered (for instance, convexity of J_δ, B_δ). For a relaxation of the inequalities in (3) see [39,40]. Of course, Definitions 2 and 3 are not equivalent, and in the framework of the chosen concept of well-posedness the problem is called ill-posed if any condition in the corresponding definition is violated.

Example 4 Problem (1), (2) with V = R³, Y = R³, J(u) = u₂,

U₀ = {u ∈ R³ : 0 ≤ u_k ≤ 2} ,
B(u) = (u₁ + u₂ − 1/2, u₃ − 1/2, u₁ − u₂ − u₃ + 1)^⊤ ,

is well-posed according to Definition 2, but it is ill-posed according to Definition 3. This example reflects, in particular, the situation that an arbitrarily small data perturbation may lead to an unsolvable problem.

Example 5 The unconstrained problem

minimize J(u) = Σ_{k=1}^{∞} k^{−1} u_k²  over V = l₂

is ill-posed according to both Definitions 2 and 3. To verify that, take δ_n = 1/n,

u^n = (0, …, 0, 1, 0, …)  (with the unit entry in the n-th coordinate),
J_{δ_n}(u) = Σ_{k=1, k≠n}^{∞} k^{−1} u_k² + max{n^{−1} u_n², n^{−1}} .

If it is supposed that V is a reflexive Banach space, Y = C(T) (with T a compact set), that Problem (1), (2) is uniquely solvable and that Slater's condition is valid, then the condition

lim_{δ→+0} sup_{u∈W_δ} ‖u − u*‖ = 0 ,

with

W_δ = {u ∈ U₀ : J(u) ≤ J(u*) + δ, max_{t∈T} B(u)(t) ≤ δ} ,

is necessary and sufficient for this problem to be well-posed according to Definition 3 (with convex J_δ, B_δ) (cf. [20]).

Let us mention also concepts of well-posedness using different notions of hyper- or epiconvergence. As an example, identifying the functions with their epigraphs, in [9] for the class of problems (1) the closeness of data is measured in the Attouch–Wets metric defined on the data space. Here, Problem (1) is said to be well-posed if it is uniquely solvable and its solution depends continuously (in V) on the data perturbation (for details see [35]). These concepts are closely related to the classical idea of Hadamard of the continuous dependence of the solution on the data. Some notions of well-posedness do not suppose uniqueness of a solution of the problem considered (cf. [35,40]). A corresponding generalization of Definition 2 leads to the following conditions:
i) the optimal set U* is nonempty,
ii) each generalized minimizing sequence has a subsequence converging to an element of U*, or (the weaker condition)
ii') d(u_n, U*) → 0 for each generalized minimizing sequence {u_n}.

If the problem is ill-posed, the following difficulties occur:
1) using approximate data one cannot be sure that a solution of the 'perturbed' problem is close to the solution (or to the solution set) of the original problem;
2) in the majority of the numerical methods it is possible that the calculated minimizing sequence does not converge (in a suitable sense) to a solution of the problem.
It may also happen that standard solution methods break down for such problems.

Example 6 Problem (1), (2) with V = R², Y = C[0, 1], J(u) = −u₁, U₀ = {u ∈ R² : u₂ ≥ 0}, and

(Bu)(t) = u₁ − (t − 1/√2)² u₂ .

Obviously, solutions of this linear semi-infinite problem are the points u ∈ U* = {(0, a) : a ≥ 0}. Choosing a finite grid T₀ on [0, 1] with

t₀ = arg min {|t − 1/√2| : t ∈ T₀}

and t₀ ≠ 1/√2, then for the approximate problem (with T₀ instead of [0, 1]) the ray

{u ∈ R² : u₁ = (t₀ − 1/√2)² u₂, u₂ ≥ 0}

is feasible and J(u) → −∞ on this ray if ‖u‖ → ∞. This example shows the typical behavior of ill-posed semi-infinite problems: although the original problem is solvable, the discretized ones may be unsolvable, even if dense grids are used. Due to the unsolvability of the discretized problems, the direct application of discretization and exchange methods for solving semi-infinite programs is impossible. Moreover, the assumptions required for the application of reduction methods are violated in this example, too. (For a conceptual description of the methods mentioned see [16].) Nevertheless, it is well known that some classical methods, applied to ill-posed problems, possess stabilizing qualities: they generate minimizing sequences with better convergence properties than those guaranteed for an arbitrary minimizing sequence.
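Returning to Example 6, the grid effect is easy to reproduce numerically. The sketch below (not part of the original entry) hands the discretized linear program to a standard LP solver, which reports unboundedness for every finite grid:

```python
import numpy as np
from scipy.optimize import linprog

# discretization of Example 6: max u1 s.t. u1 - (t - 1/sqrt(2))**2 * u2 <= 0, t in T0
for npts in (11, 101, 1001):
    T0 = np.linspace(0.0, 1.0, npts)
    A_ub = np.column_stack([np.ones_like(T0), -(T0 - 1.0 / np.sqrt(2.0)) ** 2])
    res = linprog(c=[-1.0, 0.0],            # minimize -u1, i.e., maximize u1
                  A_ub=A_ub, b_ub=np.zeros(npts),
                  bounds=[(None, None), (0.0, None)])
    print(npts, res.status)                 # status 3: unbounded on every finite grid
```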


For instance, for the ill-posed problem (1), where J is a convex functional of the class C^{1,1} and K is a convex closed subset of a Hilbert space V, the gradient projection method (with a constant steplength parameter and an inexact calculation of the gradient p_k ≈ ∇J(u_k) at each step k) converges weakly to some element of U* if U* ≠ ∅ and ‖p_k − ∇J(u_k)‖ ≤ ε_k, Σ_{k=1}^{∞} ε_k < ∞ (cf. [33]). In [20] it is shown that penalty methods applied to a finite-dimensional convex programming problem, for which the conditions ii), iii) in Definition 3 may be violated, converge to the unique solution of this problem if the exactness of the data is improved within the solution process by a special rule, depending on the change of the penalty parameter.

Stable methods for solving convex ill-posed variational problems are mainly based on Tikhonov's regularization approach (cf. [29,39,40]) and the proximal point approach (cf. [30,37]). Nowadays the direct application of these approaches (when multiple regularization of the original problem is performed and the regularized problems are solved with high accuracy) loses its importance in comparison with techniques using regularization inside the basic numerical algorithm, which is suitably chosen for solving well-posed problems of the corresponding class. Let us briefly describe these techniques under the assumption that V is a Hilbert space. Suppose a certain basic method (for instance, a discretization or penalization method) generates the sequence of auxiliary problems

J_i(u) → min ,  u ∈ K_i ⊂ V ;

then in the Tikhonov approach successively the auxiliary problems

J_i(u) + α_i ‖u − ū‖² → min ,  u ∈ K_i   (5)
(α_i > 0, lim_{i→∞} α_i = 0, ū ∈ V a fixed element)

are solved, whereas the proximal point approach leads to the following sequence of auxiliary problems:

J_i(u) + χ_i ‖u − u^{i−1}‖² → min ,  u ∈ K_i ,   (6)

with 0 < χ_i < χ̄, u^{i−1} an approximate solution of (6) at the stage i := i − 1, and u⁰ ∈ V an arbitrary starting point. A minimal numerical comparison of (5) and (6) is sketched below.
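The following sketch (hypothetical data, not from the original entry) runs both schemes on an ill-conditioned least-squares functional, solving each regularized subproblem exactly in closed form:

```python
import numpy as np

# ill-conditioned quadratic J(u) = 0.5 * ||A u - b||^2 (hypothetical data)
n = 8
A = np.vander(np.linspace(0.0, 1.0, n), n)      # notoriously ill-conditioned
u_star = np.ones(n)
b = A @ u_star
H, g = A.T @ A, A.T @ b
I = np.eye(n)

# Tikhonov iterative regularization (5): alpha_i -> 0, fixed anchor u_bar = 0
u_bar, u_tik = np.zeros(n), None
for i in range(1, 31):
    alpha = 1.0 / i ** 2
    u_tik = np.linalg.solve(H + alpha * I, g + alpha * u_bar)

# proximal point method (6): fixed chi > 0, the anchor moves to the previous iterate
chi, u_prox = 1e-2, np.zeros(n)
for i in range(30):
    u_prox = np.linalg.solve(H + chi * I, g + chi * u_prox)

print(np.linalg.norm(u_tik - u_star), np.linalg.norm(u_prox - u_star))
```

Note how the proximal scheme keeps its regularization parameter bounded away from zero, so every subproblem stays uniformly strongly convex, which is exactly the stability advantage stressed in the text.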


1559

1560

I

Ill-posed Variational Problems

The scheme in [20] includes a generalization of (6), where the proximal iterations are repeated for fixed J i , K i until they providean‘appropriate’ decrease of the functional J i . See also  Sensitivity and Stability in NLP References 1. Alart P, Lemaire B (1991) Penalization in non-classicalconvex programming via variational convergence. Math Program 51:307–331 2. Antipin AS (1975) Regularization methods for convex programming problems. Ekonomika i Mat Metody 11:336–342 (in Russian). 3. Antipin AS (1976) On a method for convex programs using a symmetrical modification of the Lagrange function. Ekonomika i Mat Metody 12:1164–1173 (in Russian). 4. Auslender A (1987) Numerical methods for nondifferentiable convex optimization. Math Program Stud 30:102–126 5. Auslender A, Crouzeix JP, Fedit P (1987) Penalty proximal methods in convex programming. J Optim Th Appl 55:1–21 6. Auslender A, Haddou M (1995) An interior-proximal method for convex linearly constrained problems and its extension to variational inequalities. Math Program 71:77–100 7. Bakushinski AB, Goncharski AV (1989) Ill-posed problems – Numerical methods and applications. Moscow Univ. Press, Moscow 8. Bakushinski AB, Polyak BT (1974) About the solution of variational inequalities. Soviet Math Dokl 15:1705–1710 9. Beer G (1993) Topologies on closed and convex closed sets. Math and its Appl, vol 268. Kluwer, Dordrecht 10. Bonnans J, Gilbert JCh, Lemaréchal C, Sagastizabal C (1995) A family of variable metric proximal methods. Math Program 68:15–47 11. Chen G, Teboulle M (1992) A proximal-based decomposition method for convex minimization problems. Math Program 66:293–318 12. Dontchev AL, Zolezzi T (1993) Well-posed optimization problems. Lecture Notes Math, vol 1543. Springer, Berlin 13. Eckstein J (1993) Nonlinear proximal point algorithms using Bregman functions, with application to convex programming. Math Oper Res 18:202–226 14. Eckstein J, Bertsekas D (1992) On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math Program 55:293–318 15. Furi M, Vignoli A (1970) About well-posed minimization problems for functionals in metric spaces. J Optim Th Appl 5:225–290

16. Hettich R,Kortanek KO (1993) Semi-infinite programming: Theory, methods and applications. SIAM Rev 35:380– 429 17. Ibaraki S, Fukushima M, Ibaraki T (1992) Primal-dual proximal point algorithm for linearly constrained convex programming problems. Comput Optim Appl 1:207–226 18. Iusem AN, Svaiter BF (1995) A proximal regularization of the steepest descent method. RAIRO Rech Opérat 29:123–130 19. Kaplan A (1982) Algorithm for convex programming using a smoothing for exact penalty functions. Sibirsk Mat Zh 23:53–64 20. Kaplan A, Tichatschke R (1994) Stable methods for illposed variational problems: Prox-regularization of elliptical variational inequalities and semi-infinite optimization problems. Akad. Verlag, Berlin 21. Kaplan A, Tichatschke R (1996) Path-following proximal approach for solving ill-posed convex semi-infinite programming problems. J Optim Th Appl 90:113–137 22. Kaplan A, Tichatschke R (1997) Prox-regularization and solution of ill-posed elliptic variational inequalities. Appl Math 42:111–145 23. Karmanov VG (1980) Mathematical programming. Fizmatgiz, Moscow 24. Kiwiel K (1995) Proximal level bundle methods for convex nondifferentiable optimization, saddle point problems and variational inequalities. Math Program 69B:89–109 25. Kiwiel K (1996) Restricted step and Levenberg–Marquardt techniques in proximal bundle methods for non-convex nondifferentiable optimization. SIAM J Optim 6:227–249 26. Lemaire B (1988) Coupling optimization methods and variational convergence. ISNM 84:163–179 27. Levitin ES, Polyak BT (1966) Minimization methods under constraints. J Vycisl Mat i Mat Fiz 6:787–823 28. Liskovets OA (1987) Regularization of problems with monotone operators when the spaces and operators are discretly approximated. USSR J Comput Math Math Phys 27:1–8 29. Louis AK (1989) Inverse and schlecht gestellte Probleme. Teubner, Leipzig Studienbücher Math 30. Martinet B (1970) Régularisation d’inéquations variationelles par approximations successives. RIRO 4:154–159 31. Mifflin R (1996) A quasi-second-order proximal bundle algorithm. Math Program 73:51–72 32. Mosco U (1969) Convergence of convex sets and of solutions of variational inequalities. Adv Math 3:510–585 33. Polyak BT (1987) Introduction to optimization. Optim. Software, New York 34. Qi L, Chen X (1997) A preconditioning proximal Newton method for non-differentiable convex optimization. Math Program 76B:411–429 35. Revalski JP (1995) Various aspects of well-posedness of optimization problems. In: Lucchetti R, Revalski J (eds) Recent Developments in Well-Posed Variational Problems. Kluwer, Dordrecht

I


Image Space Approach to Optimization FRANCO GIANNESSI Department Math., Universita di Pisa, Pisa, Italy MSC2000: 90C30 Article Outline Keywords See also References Keywords Image space; Separation; Separation functions; Necessary conditions; Sufficient condition; Lagrange multipliers; Lagrange function; Euler equation The study of the properties of the image of a real-valued functionis an old one; recently, it has been extended to multifunctions and to vector-valued functions. However, in most cases the properties of the image have not been the purpose of study and their investigation has occurred as an auxiliary step toward other achievements (see, e. g., [4,16,17]). Traces of the idea of studying the images of functions involved ina constrained extremum problem go back to the work of C. Carathéodory (in 1935, [3, Chap.5]). In the 1950s R. Bellman [1], with his celebrated maximum principle, proposed – for the first time in the field of optimization – to replace the given

unknown by a new one which runs in the image; however, alsohere the image is not the main purpose. Only in the late 1960s and 1970ssome Authors, independently from each other, have brought explicitly such a study into the field of optimization [2,6,7,10,11]. The approach consists in introducing the space, call it image space (IS), where the images of the functions of the given optimization problem run. Then, a new problem is defined in the IS, which is equivalent to the given one. In a certain sense, such an approach has some analogieswith what happens in the measure theory when one goes from Mengoli–Cauchy–Riemann measure to the Lebesgue one. The approach will now briefly be described. Assume we are given the integers m and p with m  0 and 0  p  m, the subset X of a Hilbert space H whose scalar product is denoted by h, i, and the functions f :X ! R, g i :X ! R, i = 1, . . . , m. Consider the minimization problem: 8 ˆ ˆ 0, (4) can be satisfied. In fact, at x D 0, (4) is equivalent to log(x C 1)  x exp(x), 8x 2 ] 1, + 1[, which is true if  D 1 and  is large enough. Hence, Theorem 2 can now be applied to state that x D 0 is a global minimum point of (P). Example 4 Let us set X  H = AC 2 (T), where T := [0, ] is the domain of the elements x = (x1 , x2 ) 2 H; x1 = x1 (t) and x2 = x2 (t), t 2 T, are the parametric equations of a curve  in R2 ; given a positive real `,  must be such that the length of  be `. X is now the set of pairs x =

$(x_1, x_2) \in H$, such that $x_1(0) = x_2(0) = 0$, $x_2(t) \ge 0$, $\forall t \in T$, each $x_i$ is regular in the sense of Jordan and closed. Moreover, we set $p = m = 1$, $T = [0, 2\pi]$, and
$$f(x) = \oint x_1 \, dx_2, \qquad g(x) = \oint \sqrt{dx_1^2 + dx_2^2} \; - \; \ell.$$
Consider the problem
$$P(\ell) \qquad \max f(x), \quad \text{s.t.} \quad g(x) = 0, \quad x \in X.$$

The solution of this classic isoperimetric problem is well known:
$$\bar{x}_1(t; \ell) = \frac{\ell}{2\pi} \cos\Big(t - \frac{\pi}{2}\Big), \qquad \bar{x}_2(t; \ell) = \frac{\ell}{2\pi} \Big[1 + \sin\Big(t - \frac{\pi}{2}\Big)\Big], \qquad t \in T = [0, 2\pi],$$
or, in nonparametric form, $x_1^2 + x_2^2 - x_2 \ell/\pi = 0$, and the maximum is $\ell^2/4\pi$. If in $P(\ell)$ we replace $g(x) = 0$ with $g(x) = \varepsilon$, so that we consider $P(\ell + \varepsilon)$, then $\forall \bar{x} \in X$ we have
$$\max_{(u,v) \in K(\bar{x}),\ v = \varepsilon} u \;=\; f(\bar{x}) - \frac{(\ell + \varepsilon)^2}{4\pi}.$$
It follows that $K(\bar{x})$ is included in a convex (with respect to the $u$-axis) parabola; hence $H$ and $K(\bar{x})$ can be separated by a line, so that (1) and (4) can be verified at $\mu = 0$. Any $\bar{x} \in X$ (and not necessarily an optimal one) allows one to carry on the analysis in the image space. Of course, in general, it is impossible to have an explicit form of $K(\bar{x})$. In the present example, to show explicitly a part of $K(\bar{x})$, namely the perturbation function, we have exploited the knowledge of the maximum point. Let us stress that the sufficient condition (1), as well as (4), is an important result; however, in general, it is not necessary, and it is difficult to verify, since the inequality must be fulfilled $\forall x \in X$. Therefore, it is useful to weaken the analysis by replacing $K(\bar{x})$ with an 'approximation' which is 'easier' to handle. A natural way of doing this consists in approximating $K(\bar{x})$ at $F(\bar{x}) := (\varphi(\bar{x}; \bar{x}), g(\bar{x}))$ by means of its tangent cone (e.g., in the sense of Bouligand [4,6]); in general, a cone is obviously easier to handle than an arbitrary set. For the sake of simplicity, we will now consider a particular case of (P)


and adopt a separation scheme less general than above, which however embeds the classic theory [12]; for more general results see [6,7,9,10,11,13,14,18,19,20]. Consider the particular case where $p = 0$ (so that $C = \mathbb{R}^m_+$; the presence of bilateral constraints makes the analysis extremely difficult, unless $f$ and $g$ are assumed to fulfill conditions which make the Dini or Lyusternik implicit function theorems applicable) and $X$ is open. Denote by $\mathcal{C}$ the set of sublinear real-valued functions (i.e., positively homogeneous of degree one and convex) defined on $H$; $f$ is superlinear if and only if $-f$ is sublinear. $f$ is said to be $\mathcal{C}$-differentiable at $\bar{x}$ if and only if there exists $D_C f : X \times X \to \mathbb{R}$, such that $D_C f(\bar{x}; \cdot) \in \mathcal{C}$, and
$$\lim_{z \to 0} \frac{1}{\|z\|}\, \sigma(\bar{x}; z) = 0, \qquad \sigma(\bar{x}; z) := f(\bar{x} + z) - f(\bar{x}) - D_C f(\bar{x}; z), \tag{7}$$
where $\|\cdot\|$ is the norm in $H$ generated by the scalar product and $z$ belongs to a neighborhood, say $Z$, of the origin. $D_C f$ is said to be the $\mathcal{C}$-derivative of $f$ at $\bar{x}$. If $D_C f$ is linear (the linear functions are obviously elements of $\mathcal{C}$), then $f$ is differentiable [7]. It is easy to see that a $\mathcal{C}$-differentiable function is directionally differentiable in any direction $z$ (in the sense that the limit of $[f(\bar{x} + \alpha z) - f(\bar{x})]/\alpha$ as $\alpha \downarrow 0$ exists); the converse is not true, as shown by the following example (which generalizes the so-called Peano function exhibited by G. Peano to detect a famous mistake by Lagrange):
Example 5 $H = X = Z = \mathbb{R}^2$, $x = (x_1, x_2)$, $\bar{x} = 0$, $\|z\| = \|z\|_2$, $f(x) = (x_1^2 + x_2^2)^{1/2}$ if $x \not> 0$ and $f(x) = \alpha(x_2/x_1^2)\,(x_1^2 + x_2^2)^{1/2}$ if $x > 0$, where $\alpha : \mathbb{R}_+ \setminus \{0\} \to \mathbb{R}$ is defined by $\alpha(t) = 1$ if $0 < t \le 1$ or $t \ge 3$, $\alpha(t) = 3 - 2t$ if $1 < t \le 2$, and $\alpha(t) = 2t - 5$ if $2 < t < 3$. In this example, at $x = \bar{x}$ the directional derivative exists and is $f'(\bar{x}; z) = (z_1^2 + z_2^2)^{1/2}$, while it is not possible to verify (7). Note that $f$ is continuous, but not locally Lipschitz, and $f'(0; z) > 0$, $\forall z \in \mathbb{R}^2$.
Example 6 $H = X = Z = \mathbb{R}$, $f : \mathbb{R} \to \mathbb{R}_+$ with $f(x) = |x| + x^2$ if $x \in \mathbb{Q}$ and $f(x) = |x| + 2x^2$ if $x \notin \mathbb{Q}$, $\mathbb{Q}$ being the set of rational numbers. $f$ is $\mathcal{C}$-differentiable at $\bar{x} = 0$ with $\mathcal{C}$-derivative $D_C f(0; z) = |z|$. Note that $f$ is continuous at $\bar{x} = 0$ only.
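The failure of (7) in Example 5 can also be checked numerically. The following sketch is illustrative only (the function name, the sampling scheme, and the chosen curve are ours, not part of the original text): along every fixed direction the difference quotients converge to $\|z\|$, but along the curve $x_2 = 2x_1^2$, where $\alpha = -1$, the quotient $f(x)/\|x\|$ stays at $-1$, so the convergence cannot be uniform as (7) would require.

```python
import numpy as np

def alpha(t):
    # Piecewise coefficient from Example 5.
    if 0 < t <= 1 or t >= 3:
        return 1.0
    if 1 < t <= 2:
        return 3.0 - 2.0 * t
    return 2.0 * t - 5.0  # 2 < t < 3

def peano_like(x1, x2):
    # f from Example 5: radial outside the open positive orthant,
    # modulated by alpha(x2 / x1^2) inside it.
    r = np.hypot(x1, x2)
    if x1 > 0 and x2 > 0:
        return alpha(x2 / x1 ** 2) * r
    return r

# Along every fixed direction z the difference quotient tends to ||z|| = 1 ...
for theta in np.linspace(0.01, np.pi / 2 - 0.01, 5):
    z = (np.cos(theta), np.sin(theta))
    q = [peano_like(a * z[0], a * z[1]) / a for a in (1e-2, 1e-4, 1e-6)]
    print("direction quotients -> 1:", [round(v, 4) for v in q])

# ... but along the curve x2 = 2*x1**2 (where alpha = -1) the quotient
# stays at -1, so no sublinear D_C f can satisfy (7).
for a in (1e-2, 1e-4, 1e-6):
    x1, x2 = a, 2 * a ** 2
    print("curve quotient -> -1:", round(peano_like(x1, x2) / np.hypot(x1, x2), 4))
```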

The $\mathcal{C}$-subdifferential of a $\mathcal{C}$-differentiable function $f$ at $\bar{x} \in X$ is defined by
$$\partial_C f(\bar{x}) := \left\{ z^* \in H' : D_C f(\bar{x}; z) \ge \langle z^*, z \rangle,\ \forall z \in Z \right\},$$
where $H'$ is the continuous dual of $H$; $z^*$ is called a $\mathcal{C}$-subgradient of $f$ at $\bar{x}$. When $f$ is convex, then $\partial_C f(\bar{x})$ collapses to the classic subdifferential, which is denoted simply by $\partial f(\bar{x})$; hence, $\partial_C f(\bar{x})$ is nothing more than the subdifferential of $D_C f(\bar{x}; \cdot)$, or $\partial_C f(\bar{x}) = \partial D_C f(\bar{x}; \cdot)$. When $D_C f(\bar{x}; \cdot)$ is linear, then $\partial_C f(\bar{x})$ is a singleton and collapses to the classic differential. In the latest example, $\partial_C f(0) = [-1, 1]$. Consider the further example: $H = X = Z = \mathbb{R}$, $f(x) = x^2 \sin(1/x)$ if $x \ne 0$ and $f(0) = 0$. We find $D_C f(0; z) = 0$, $\forall z \in Z$ (indeed, $f$ is differentiable), so that $\partial_C f(0) = \{0\}$, while the Clarke subdifferential [4] is $[-1, 1]$. Now consider the following regularity condition:
$$(\text{RC}) \qquad T(K(\bar{x})) \cap \{(u, v) \in H : v = 0\} = \emptyset,$$
where $T(K(\bar{x}))$ is the Bouligand tangent cone of $K(\bar{x})$ at $F(\bar{x})$. Several conditions on $f$ and $g$ are well known (mainly when $H = \mathbb{R}^n$) which guarantee (RC). Consider, for instance, the case where $H = \mathbb{R}^n$ and $f$, $g$ are derivable. (RC) holds if the gradients $\nabla g_i(\bar{x})$, $i \in \{i \in I : g_i(\bar{x}) = 0\}$, are linearly independent. (RC) holds if $g$ is affine. (RC) holds if $g$ is concave and $\exists \hat{x} \in X$ such that $g(\hat{x}) > 0$. For additional conditions see  Theorems of the Alternative and Optimization; [6,12]. The approximation of (P) that we want to discuss in the present particular case $p = 0$ consists in replacing $f$ and $-g_i$, $i \in I$, with their $\mathcal{C}$-derivatives. More precisely, instead of the map $F = (\varphi, g)$, we consider the superlinear map
$$F_C(\bar{x}; z) := \big( D_C f(\bar{x}; z);\ g_i(\bar{x}) + D_C g_i(\bar{x}; z),\ i \in I \big),$$
which is the first-order expansion of the ($\mathcal{C}$-)differentiable map $F$. $K(\bar{x})$ is now replaced by the cone $K_C(\bar{x}) := F_C(\bar{x}; X - \bar{x})$. (S′) is now replaced by
$$(\text{S}''') \qquad H \cap K_C(\bar{x}) = \emptyset,$$
which holds if $\bar{x}$ is a minimum point of (P) (but not necessarily vice versa, due to the above approximation of $K(\bar{x})$); hence, from the necessity and sufficiency of


(S′) we jump to the sole necessity of (S′′′); since $H$ and $K_C(\bar{x})$ are linearly separable, (S′′′) can be proved by means of the subclass of $W^{pe}$ at $\mu = 0$. As a consequence, the Lagrangian function $L$ (which, as we have seen above, is, up to a formal transformation, the separation function $w$) will be used at $\mu = 0$; thus we set $L(x; \lambda) := L(x; \lambda, 0)$. This leads to the following necessary condition, whose proof can be found in [7].

Theorem 7 Let the functions $f$, $-g_i$, $i \in I$, be $\mathcal{C}$-differentiable, and assume that (RC) is fulfilled. If $\bar{x}$ is a minimum point of (P), then $\exists \bar{\lambda} \in \mathbb{R}^m$, such that
$$\inf_{z \in B} D_C L(\bar{x}; z; \bar{\lambda}) \ge 0, \tag{8}$$
$$g(\bar{x}) \ge 0, \qquad \bar{\lambda} \ge 0, \tag{9}$$
$$\langle \bar{\lambda}, g(\bar{x}) \rangle = 0, \tag{10}$$
where $D_C L(\bar{x}; z; \bar{\lambda})$ is the $\mathcal{C}$-derivative of $L$ at $x = \bar{x}$, $\lambda = \bar{\lambda}$, and $B := \{z : \|z\| = 1\}$. (8) is equivalent to
$$0 \in \partial_C f(\bar{x}) + \sum_{i \in I} \bar{\lambda}_i\, \partial_C(-g_i(\bar{x})), \tag{11}$$
which becomes
$$0 \in \partial f(\bar{x}) + \sum_{i \in I} \bar{\lambda}_i\, \partial(-g_i(\bar{x})), \tag{12}$$
if, in particular, $X$, $f$ and $-g$ are convex. When $f$ and $g$ are differentiable on $X$, then (8) collapses to $\delta L = 0$ along $x = \bar{x}$, where $\delta L$ is the first variation of $L$; in the case of $(\text{P})_{iso}$ it becomes
$$\psi_x(t; \bar{x}, \bar{x}'; \bar{\lambda}) - \frac{d}{dt}\, \psi_{x'}(t; \bar{x}, \bar{x}'; \bar{\lambda}) = 0, \tag{13}$$
where $\psi := \psi_0 - \sum_{i \in I} \bar{\lambda}_i \psi_i$ is the integrand of $L$. If $X = H = \mathbb{R}^n$, then (8) collapses to
$$L'_x(\bar{x}; \bar{\lambda}) = 0, \tag{14}$$
where $L'_x$ is the gradient of $L$ with respect to $x$. Note that (13) is the classic Euler equation and (14) is the classic Lagrange equation; $\bar{\lambda}$ is the vector of Lagrange multipliers, which turns out to be the gradient of the hyperplane ($w = 0$ at $\mu = 0$) which separates the two sets of (S′′).
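For a concrete illustration of (9), (10), and (14), consider the toy problem $\min x_1^2 + x_2^2$ subject to $g(x) = x_1 + x_2 - 1 \ge 0$, whose solution $\bar{x} = (1/2, 1/2)$ has multiplier $\bar{\lambda} = 1$. The snippet below is an illustration of our own choosing (the instance is not from the original article); it verifies the Lagrange equation and complementarity numerically, with $L(x; \lambda) = f(x) - \lambda g(x)$.

```python
import numpy as np

# Toy instance of (P) with p = 0, m = 1:
#   min f(x) = x1^2 + x2^2   s.t.   g(x) = x1 + x2 - 1 >= 0.
f_grad = lambda x: 2 * x
g = lambda x: x[0] + x[1] - 1.0
g_grad = lambda x: np.array([1.0, 1.0])

x_bar = np.array([0.5, 0.5])   # minimum point
lam_bar = 1.0                  # Lagrange multiplier

# Classic Lagrangian L(x; lam) = f(x) - lam * g(x); condition (14): L'_x = 0.
L_x = f_grad(x_bar) - lam_bar * g_grad(x_bar)
print("L'_x(x_bar; lam_bar) =", L_x)               # ~ [0, 0]  -> (14)
print("g(x_bar) >= 0:", g(x_bar) >= 0)             # feasibility     -> (9)
print("lam_bar * g(x_bar) =", lam_bar * g(x_bar))  # 0, complementarity -> (10)
```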

Now, let us go back to the separation scheme which led to the sufficient condition (1). The choice of proving (S′) indirectly through separation has a lot of interesting consequences, which go beyond the initial purpose. One of them is the introduction of a (nonlinear) dual space: that of the functionals $w$. When we restrict ourselves to $W^{pe}$, the dual space is isomorphic to $\mathbb{R}^{2m}$ (to $\mathbb{R}^m$ at $\mu = 0$; this is the classic duality scheme in finite-dimensional optimization). Such an isomorphism is characteristic of constrained extremum problems having finite-dimensional image (independently of the dimension of the space where the unknown runs). Having recognized that we have introduced a dual space, to define a dual problem is immediate. Indeed, looking at (1), since the inequality must be fulfilled $\forall x \in X$, it is straightforward, for each $\lambda$ and $\mu$, to search for $\max_{x \in X} w(\varphi(x; \bar{x}), g(x); \lambda, \mu)$ and then to find $\lambda$, $\mu$ which make such a maximum as small as possible and, hopefully, not greater than zero. Hence, we are led to study the problem:
$$(P^*) \qquad \max_{\lambda \in C^*,\ \mu \in \mathbb{R}^m_+}\ \min_{x \in X}\ L(x; \lambda, \mu),$$
which we call the generalized dual problem of (P); any pair $(\bar{\lambda}, \bar{\mu})$ which solves $(P^*)$ is a dual variable [6,19]. At $\mu = 0$, $(P^*)$ is the classic dual problem of (P) [12]; indeed, the classic duality theory starts by defining $(P^*)$ as a dual problem, independently of the separation scheme and hence of the other theories, like the saddle-point one. It is easy to show that the maximum in $(P^*)$ is $\le$ the minimum in (P); the difference between the latter and the former is called the duality gap; it is now clear that a positive duality gap corresponds to a lack of separation between $H$ and $K(\bar{x})$ at the minimum point $\bar{x}$. Another important topic which can be derived from the separation scheme is the penalization theory. Seemingly independent of the other topics of optimization, it is indeed strictly related to them, since it can be drawn from the separation scheme, as will now briefly be outlined (recall the remark after Theorem 1). Consider again the family $W^{pe}$, within which select a sequence, say $\{w_r := w(u, v; \lambda_r, \mu_r)\}_{r=1}^{\infty}$, of separation functions, such that the positive level set (with respect to $(u, v)$) of $w_{r+1}$ is strictly included in that of $w_r$. Then we can try to 'fulfill (1) asymptotically', or to set up the sequence of problems:
$$(P_r) \qquad \min_{x \in X} L(x; \lambda_r, \mu_r), \qquad r = 1, 2, \dots.$$


Under suitable conditions, a limit point of $\{x^r\}_1^{\infty}$ ($x^r$ being a solution of $(P_r)$) is a solution of (P). See [6,15] for details. Let us stress the fact that the separation scheme and its consequences come down from (S), and do not 'see' (P); they are unacquainted with the fact that the impossibility of (S) expresses optimality for (P). Therefore, it is obvious that the separation approach can be applied to every kind of problem which leads to the impossibility of a system like (S). In fact, such an approach can be applied to vector optimization and to variational inequalities [8], and to generalized systems [14,19].
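As a small numerical illustration of the classic case $\mu = 0$, the sketch below computes both sides of the weak-duality inequality for $\min x^2$ subject to $x - 1 \ge 0$ (a toy instance of our own choosing, with $L(x; \lambda) = f(x) - \lambda g(x)$); here the two sets are linearly separable and the duality gap is zero.

```python
import numpy as np

# Toy instance: f(x) = x^2, g(x) = x - 1 >= 0; Lagrangian L(x; lam) = f - lam*g.
f = lambda x: x ** 2
g = lambda x: x - 1.0
L = lambda x, lam: f(x) - lam * g(x)

xs = np.linspace(-3.0, 3.0, 6001)    # grid for the primal variable
lams = np.linspace(0.0, 5.0, 501)    # grid for the dual variable

primal = f(xs[g(xs) >= 0]).min()                # min of (P):  1 at x = 1
dual = max(L(xs, lam).min() for lam in lams)    # max of (P*): 1 at lam = 2

print("primal min:", primal)   # 1.0
print("dual max:  ", dual)     # 1.0 -> zero duality gap
```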

See also
 Vector Optimization

References
1. Bellman R (1957) Dynamic programming. Princeton Univ. Press, Princeton
2. Camerini PM, Galbiati G, Maffioli F (1991) The image of weighted combinatorial problems. Ann Oper Res 33:181–197
3. Carathéodory C (1982) Calculus of variations and partial differential equations of the first order. Chelsea, New York
4. Clarke FH (1983) Optimization and nonsmooth analysis. Wiley, New York
5. Courant R (1943) Variational methods for the solution of problems of equilibrium and vibrations. Bull Amer Math Soc 49:1–23
6. Giannessi F (1984) Theorems of the alternative and optimality conditions. J Optim Th Appl 42(3):331–365. Preliminary version published in: Proc. Math. Program. Symp., Budapest, 1976
7. Giannessi F (1994) General optimality conditions via a separation scheme. In: Spedicato E (ed) Algorithms for Continuous Optimization. Kluwer, Dordrecht, 1–23
8. Giannessi F, Mastroeni G, Pellegrini L (1999) On the theory of vector optimization and variational inequalities. Image space analysis and separation. In: Vector Variational Inequalities and Vector Equilibria. Mathematical Theories. Kluwer, Dordrecht, 153–215
9. Giannessi F, Rapcsák T (1995) Images, separation of sets and extremum problems. In: Agarwal RP (ed) Recent Trends in Optimization Theory and Applications. Ser Appl Anal. World Sci., Singapore, 79–106
10. Hestenes MR (1966) Calculus of variations and optimal control theory. Wiley, New York
11. Hestenes MR (1975) Optimization theory: The finite dimensional case. Wiley, New York
12. Mangasarian OL (1994) Nonlinear programming. SIAM Ser, vol 10. SIAM, Philadelphia
13. Mastroeni G, Pappalardo M, Yen ND (1994) Image of a parametric optimization problem and continuity of the perturbation function. J Optim Th Appl 81(1):193–202
14. Mastroeni G, Rapcsák T (1999) On convex generalized systems. J Optim Th Appl
15. Pappalardo M (1994) Image space approach to penalty methods. J Optim Th Appl 64(1):141–152
16. Pourciau BH (1980) Multiplier rules. Amer Math Monthly 87:443–452
17. Pourciau BH (1983) Multiplier rules and separation of convex sets. J Optim Th Appl 40:321–331
18. Quang PH (1992) Lagrangian multiplier rules via image space analysis. In: Nonsmooth Optimization: Methods and Applications. Gordon and Breach, New York, 354–365
19. Rapcsák T (1997) Smooth nonlinear optimization in Rn. Nonconvex Optim Appl, vol 19. Kluwer, Dordrecht
20. Tardella F (1989) On the image of a constrained extremum problem and some applications to the existence of a minimum. J Optim Th Appl 60(1):93–104

Implicit Lagrangian
MICHAEL V. SOLODOV
Institute Mat. Pura e Apl., Rio de Janeiro, Brazil
MSC2000: 90C33, 90C30

Article Outline
Keywords
Unconstrained Implicit Lagrangian
Restricted Implicit Lagrangian
Regularity Conditions
Derivative-Free Descent Methods
Error Bounds
Extensions
See also
References

Keywords
Complementarity problem; Merit function; Optimization

The nonlinear complementarity problem (see [3,15]) is to find a point $x \in \mathbb{R}^n$ such that
$$x \ge 0, \qquad F(x) \ge 0, \qquad \langle x, F(x) \rangle = 0, \tag{1}$$


where $F : \mathbb{R}^n \to \mathbb{R}^n$ and $\langle \cdot, \cdot \rangle$ denotes the usual inner product in $\mathbb{R}^n$. A popular approach for solving the nonlinear complementarity problem (NCP) is to construct a merit function $f$ such that solutions of NCP are related in a certain way to the optimal set of the problem
$$\min f(x) \quad \text{s.t.} \quad x \in C.$$

Of practical interest is the case when the set $C$ has simple structure, and smoothness properties of $F$ and dimensionality $n$ of the variables space are preserved. There are a number of ways to reformulate the NCP as an equivalent optimization problem (for a survey, see [7]).

Unconstrained Implicit Lagrangian
The first smooth unconstrained merit function was proposed by O.L. Mangasarian and M.V. Solodov [12]. This function is commonly referred to as the implicit Lagrangian; it has the following form:
$$M_\alpha(x) = \langle x, F(x) \rangle + \frac{1}{2\alpha}\Big( \|(x - \alpha F(x))_+\|^2 - \|x\|^2 \Big) + \frac{1}{2\alpha}\Big( \|(F(x) - \alpha x)_+\|^2 - \|F(x)\|^2 \Big),$$
where $\alpha > 1$ is a parameter and $(\cdot)_+$ denotes the orthogonal projection map onto the nonnegative orthant $\mathbb{R}^n_+$, i.e., the $i$th component of the vector $(z)_+$ is $\max\{0, z_i\}$. It turns out that $M_\alpha(x)$ is nonnegative on $\mathbb{R}^n$ provided $\alpha > 1$, and is zero if and only if $x$ is a solution of the NCP. If $F$ is differentiable on $\mathbb{R}^n$, then so is $M_\alpha(\cdot)$, and its gradient vanishes at all solutions of NCP for $\alpha > 1$. Hence, one can attempt to solve the NCP by means of solving the smooth unconstrained optimization problem
$$\min M_\alpha(x) \quad \text{s.t.} \quad x \in \mathbb{R}^n. \tag{2}$$
The implicit Lagrangian owes its name to the way the function was first derived in [12]. Consider the constrained minimization problem (MP)
$$\min \langle x, F(x) \rangle \quad \text{s.t.} \quad x \ge 0,\ F(x) \ge 0,$$
which is related to the NCP (1) in the sense that its global minima of zero coincide with the solutions of NCP. Because of the special structure of the MP (the objective function is the inner product of the functions defining the constraints), for every feasible $x$ such that $\langle x, F(x) \rangle = 0$ it can be observed that $x$ plays the role of the Lagrange multiplier [2] for the constraint $F(x) \ge 0$, while $F(x)$ plays a similar role for the constraint $x \ge 0$. Keeping in mind this observation, consider the augmented Lagrangian [1] for the above MP:
$$L_\alpha(x; u, v) = \langle x, F(x) \rangle + \frac{1}{2\alpha}\Big( \|(u - \alpha F(x))_+\|^2 - \|u\|^2 \Big) + \frac{1}{2\alpha}\Big( \|(v - \alpha x)_+\|^2 - \|v\|^2 \Big),$$
where $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$ are Lagrange multipliers corresponding to the constraints $F(x) \ge 0$ and $x \ge 0$, respectively. Since it is known a priori that at any solution $\bar{x}$ of MP (and NCP) one could take $\bar{u} = \bar{x}$ and $\bar{v} = F(\bar{x})$, it is intuitively reasonable to 'solve' for the multipliers $u$ and $v$ in terms of the original variables. Replacing $u$ by $x$ and $v$ by $F(x)$ in the augmented Lagrangian, one obtains the implicit Lagrangian function $M_\alpha(x)$. The parameter $\alpha$ must be strictly bigger than one, because it can be checked that $M_1(x) = 0$ for all $x \in \mathbb{R}^n$. Another interesting property is that the partial derivative of the implicit Lagrangian with respect to $\alpha$ is also nonnegative for all $x$, and is zero if and only if $x$ is a solution of the NCP [11]. However, a merit function based on this derivative is nonsmooth.

Restricted Implicit Lagrangian
When the implicit Lagrangian is restricted to the nonnegative orthant $\mathbb{R}^n_+$, where nonnegativity of $x$ is explicitly enforced, the last two terms in the expression for $M_\alpha(x)$ can be dropped. Thus the restricted implicit Lagrangian is obtained:
$$N_\alpha(x) = \langle x, F(x) \rangle + \frac{1}{2\alpha}\Big( \|(x - \alpha F(x))_+\|^2 - \|x\|^2 \Big),$$
where $\alpha > 0$. In this form, the function was introduced in [12]. It is also equivalent to the regularized gap function proposed by M. Fukushima [6] in the more general context of variational inequality problems (cf. also  Variational Inequalities). The restricted implicit Lagrangian is nonnegative provided the parameter $\alpha$ is positive, and for all $x \in \mathbb{R}^n_+$ its zeroes coincide with solutions of the NCP. It also inherits the differentiability of $F$. Thus the NCP can be solved via the bound constrained optimization problem
$$\min N_\alpha(x) \quad \text{s.t.} \quad x \ge 0. \tag{3}$$
Note that, since it is known a priori that every solution of the NCP is nonnegative, one may also consider a bound constrained problem with the function $M_\alpha(x)$. However, for the constrained reformulations the function $N_\alpha(x)$ is probably preferable because it is somewhat simpler.

Regularity Conditions
It should be emphasized that only global solutions of the optimization problems (2) and (3) are solutions of the underlying NCP (1). On the other hand, standard iterative methods are guaranteed to find stationary points rather than global minima of optimization problems. It is therefore important to derive conditions which ensure that these stationary points are also solutions of NCP. One such condition is convexity. However, the implicit Lagrangian is known to be convex only in the case of strongly monotone affine $F$, and provided the parameter $\alpha$ is large enough [17]. Clearly, this is very restrictive. Thus other regularity conditions were investigated. For the unconstrained problem (2), the first sufficient condition was given by N. Yamashita and Fukushima [24]. They established that if the Jacobian $\nabla F(x)$ is positive definite at a stationary point $x$ of (2), then $x$ solves the NCP. This result was later extended in [8] to the case when $\nabla F(x)$ is a P-matrix. Finally, F. Facchinei and C. Kanzow [5] obtained a certain regularity condition which is both necessary and sufficient for a stationary point of the unconstrained implicit Lagrangian to be a solution of NCP (it is similar to the condition stated below for the restricted case). For the constrained problem (3), Fukushima [6] first showed the equivalence of stationary points to NCP solutions under the positive definiteness assumption on the Jacobian of $F$. A regularity condition which is both necessary and sufficient was given by Solodov [20]: a point $x \in \mathbb{R}^n_+$ is said to be regular if $\nabla F(x)^\top$ reverses the sign of no nonzero vector $z \in \mathbb{R}^n$ satisfying
$$z_P > 0, \qquad z_C = 0, \qquad z_N < 0, \tag{4}$$

(4)

I

where C :D fi : x i  0; Fi (x)  0; x i Fi (x) D 0g ; P :D fi : x i > 0; Fi (x) > 0g ; N :D fi : x i  0; Fi (x) < 0g : Recall [4] that the matrix r F(x)> is said to reverse the sign of a vector z 2 Rn if z i [rF(x)> z] i  0;

8i 2 f1; : : : ; ng:

(5)

n RC

Therefore a point x 2 is regular if the only vector z n 2 R satisfying both (4) and (5) is the zero vector. A stationary point of (3) solves the NCP if and only if it is regular in the sense of the given definition. Derivative-Free Descent Methods When F is differentiable, so are the functions M ˛ (x) and N ˛ (x). Therefore, any standard optimization algorithm which makes use of first order derivatives can be applied to problems (2) and (3). However, taking advantage of the underlying structure one can also devise special descent algorithms which do not use derivatives of F. This can be especially useful in cases when derivatives are not readily available or are expensive to compute. In [24], it was shown that when F is strongly monotone and continuously differentiable, then the direction d(x) D (ˇ  ˛)(x  (x  ˛F(x))C ) C (1  ˛ˇ)(F(x)  (F(x)  ˛x)C ) is a descent direction for M ˛ () at x 2 Rn , provided ˇ > 0 is chosen appropriately. A descent method based on this direction with appropriate line search, converges globally to the unique solution of the NCP [24]. In [13], it was established that the rate of convergence is actually at least linear. For the restricted implicit Lagrangian, a descent method with the direction d(x) D (x  ˛F(x))C  x was proposed in [6]. The algorithm was proven to be convergent for the strongly monotone NCP (no rate of convergence has been established however). In [26], by using adaptive parameter ˛, this method was further extended to monotone (not necessarily strongly monotone) and Lipschitz continuous (not necessarily differentiable) functions.


Error Bounds
The implicit Lagrangian also appears useful for providing bounds on the distance from a given point to the solution set of the NCP. In particular, if $F$ is affine, then there exists a constant $c > 0$ such that (see [11])
$$\operatorname{dist}(x, S) \le c\, M_\alpha(x)^{1/2}$$
for all $x$ close to $S$, where $S$ denotes the solution set. This inequality is called a local error bound. X.-D. Luo and P. Tseng [10] proved that, in the affine case, this bound is global (i.e., it holds for all $x \in \mathbb{R}^n$) if and only if the associated matrix is of the class $R_0$. For the nonlinear case, Kanzow and Fukushima [9] showed that $M_\alpha(x)^{1/2}$ provides a global error bound if $F$ is a uniform P-function which is Lipschitz continuous. In the context of error bounds, the following relation established in [11] is useful: for all $x \in \mathbb{R}^n$,
$$\alpha^{-1}(\alpha - 1)\, \|r(x)\|^2 \;\le\; M_\alpha(x) \;\le\; (\alpha - 1)\, \|r(x)\|^2,$$
where $r(x) = x - (x - F(x))_+$ is the natural residual [14]. Therefore, the implicit Lagrangian $M_\alpha(x)$ provides a local/global error bound if and only if so does the natural residual $r(x)$. For the restricted implicit Lagrangian, one only has the following relation:
$$2\alpha N_\alpha(x) \ge \|r(x)\|^2.$$
Thus, in principle, $N_\alpha(x)$ may provide a bound in cases when the natural residual does not. For a general discussion of error bounds, see [16].

Extensions
The implicit Lagrangian can be extended to the context of generalized complementarity problems and variational inequality problems via its relation with the regularized gap function. As observed by J.-M. Peng and Y.X. Yuan [19], the function $M_\alpha(x)$ can be represented as a difference of two regularized gap functions with parameters $1/\alpha$ and $\alpha$. Since the regularized gap function can also be defined for variational inequalities, one might consider a similar expression in this more general context. Peng [18] established the equivalence of the variational inequality problem to unconstrained minimization of a difference of regularized gap functions. This result was further extended by Yamashita, K. Taji and Fukushima [25], who obtained similar results for differences of regularized gap functions whose parameters are not necessarily the inverse of each other. For algorithms based on this approach, see [22]. For a unified treatment of extensions of the implicit Lagrangian and the regularized gap function for the generalized complementarity problems, see [23]. Yet another context where the implicit Lagrangian can be used is optimization reformulation of the extended linear complementarity problem [21].

See also
 Kuhn–Tucker Optimality Conditions
 Lagrangian Duality: Basics

˛ 1 (˛  1) kr(x)k2  M˛ (x)  (˛  1) kr(x)k2 References where r(x) D x  (x  F(x))C is the natural residual [14]. Therefore the implicit Lagrangian M ˛ (x) provides a local/global error bound if and only if so does the natural residual r(x). For the restricted implicit Lagrangian, one only has the following relation: 2˛N˛ (x)  kr(x)k2 : Thus, in principle, N ˛ (x) may provide a bound in cases when the natural residual does not. For a general discussion of error bounds see [16]. Extensions The implicit Lagrangian can be extended to the context of generalized complementarity problems and variational inequality problems via its relation with the regularized gap function. As observed by J.-M. Peng and Y.X. Yuan [19], the function M ˛ (x) can be represented as a difference of two regularized gap functions with parameters 1/˛ and ˛. Since the regularized gap function can also be defined for variational inequalities, one might consider a similar expression in this more general context. Peng [18] established the equivalence of the variational inequality problem to unconstrained

1. Bertsekas DP (1982) Constrained optimization and Lagrange multiplier methods. Acad. Press, New York 2. Bertsekas DP (1995) Nonlinear programming. Athena Sci., Belmont, MA 3. Cottle RW, Giannessi F, Lions J-L (1980) Variational inequalities and complementarity problems: Theory and applications. Wiley, New York 4. Cottle RW, Pang J-S, Stone RE (1992) The linear complementarity problem. Acad. Press, New York 5. Facchinei F, Kanzow C (1997) On unconstrained and constrained stationary points of the implicit Lagrangian. J Optim Th Appl 92:99–115 6. Fukushima M (1992) Equivalent differentiable optimization problems and descent methods for asymmetric variational inequality problems. Math Program 53:99–110 7. Fukushima M (1996) Merit functions for variational inequality and complementarity problems. In: Di Pillo G, Giannessi F (eds) Nonlinear Optimization and Applications. Plenum, New York pp 155–170 8. Jiang H (1996) Unconstrained minimization approaches to nonlinear complementarity problems. J Global Optim 9:169–181 9. Kanzow C, Fukushima M (1996) Equivalence of the generalized complementarity problem to differentiable unconstrained optimization. J Optim Th Appl 90:581–603 10. Luo X-D, Tseng P (1997) On a global projection-type error bound for the linear complementarity problem. Linear Alg Appl 253:251–278 11. Luo Z-Q, Mangasarian OL, Ren J, Solodov MV (1994) New error bounds for the linear complementarity problem. Math Oper Res 19:880–892

Increasing and Convex-Along-Rays Functions on Topological Vector Spaces

12. Mangasarian OL, Solodov MV (1993) Nonlinear complementarity as unconstrained and constrained minimization. Math Program 62:277–297 13. Mangasarian OL, Solodov MV (1999) A linearly convergent descent method for strongly monotone complementarity problems. Comput Optim Appl 14:5–16 14. Pang J-S (1986) Inexact Newton methods for the nonlinear complementarity problem. Math Program 36(1):54–71 15. Pang J-S (1995) Complementarity problems. In: Horst R, Pardalos PM (eds) Handbook Global Optimization. Kluwer, Dordrecht, pp 271–338 16. Pang J-S (1997) Error bounds in mathematical programming. Math Program 79:299–332 17. Peng J-M (1996) Convexity of the implicit Lagrangian. J Optim Th Appl 92:331–341 18. Peng J-M (1997) Equivalence of variational inequality problems to unconstrained optimization. Math Program 78:347–356 19. Peng J-M, Yuan YX Unconstrained methods for generalized complementarity problems. J Comput Math (to appear) 20. Solodov MV (1997) Stationary points of bound constrained reformulations of complementarity problems. J Optim Th Appl 94:449–467 21. Solodov MV (1999) Some optimization reformulations for the extended linear complementarity problem. Comput Optim Appl 13:187–200 22. Solodov MV, Tseng P Some methods based on D-gap function for solving monotone variational inequalities. Comput Optim Appl (to appear) 23. Tseng P, Yamashita N, Fukushima M (1996) Equivalence of complementarity problems to differentiable minimization: A unified approach. SIAM J Optim 6:446–460 24. Yamashita N, Fukushima M (1995) On stationary points of the implicit Lagrangian for nonlinear complementarity problems. J Optim Th Appl 84:653–663 25. Yamashita N, Taji K, Fukushima M (1997) Unconstrained optimization reformulations of variational inequality problems. J Optim Th Appl 92:439–456 26. Zhu DL, Marcotte P (1993) Modified descent methods for solving the monotone variational inequality problem. Oper Res Lett 14:111–120

Increasing and Convex-Along-Rays Functions on Topological Vector Spaces
HOSSEIN MOHEBI
Mahani Mathematical Research Center and Department of Mathematics, University of Kerman, Kerman, Iran
MSC2000: 26A48, 52A07, 26A51


Article Outline
Keywords and Phrases
Introduction
ICAR Functions
Subdifferentiability of ICAR Functions
DCAR Functions
References

Keywords and Phrases
Monotonic analysis; ICAR functions; Abstract convexity

Introduction
The role of convex analysis in optimization is well known. One of the major properties of a convex function is its representation as the upper envelope of a family of affine functions. More specifically, every lower semicontinuous proper convex function can be expressed as the supremum of the family of affine functions majorized by it [4]. The subject of abstract convexity arose precisely by generalizing this idea (see [5,6]). A function is said to be abstract convex if and only if it can be represented as the upper envelope of a class of functions, usually called elementary functions. One of the first studies in abstract convexity concerned the analysis of increasing and positively homogeneous (IPH) functions. It was initially carried out for functions defined over $\mathbb{R}^n_{++}$ and $\mathbb{R}^n_+$, where $\mathbb{R}^n_{++} := \operatorname{int} \mathbb{R}^n_+$, and later on extended to an arbitrary closed convex cone in [1] and to an arbitrary topological vector space in [3]. This study was further extended to include increasing and convex-along-rays (ICAR) functions over $\mathbb{R}^n_+$. The study of IPH and ICAR functions has given rise to the subject of monotonic analysis, the study of increasing functions enjoying some additional properties, which has important applications in global optimization (see [5] for more details). The systematic study of this subject was started in [1] and [2] by J. Dutta, J.E. Martinez-Legaz, and A.M. Rubinov, where they analyzed IPH and ICAR functions defined on a cone. In the present article, we extend this analysis to the study of ICAR functions defined over an arbitrary topological vector space. We want to emphasize that the role of IPH functions in monotonic analysis is the same as the role of sublinear functions in


convex analysis, whereas ICAR functions play the role of convex functions. We define elementary functions, which can be considered as generalizations of min-type functions, and demonstrate that ICAR functions are abstract convex with respect to the class of such elementary functions. This leads us to develop suitable notions of subdifferential for ICAR functions. Finally, we study the class of decreasing and convex-along-rays (DCAR) functions.
Let $H$ be a set of functions $h : X \to \mathbb{R}_{+\infty}$ defined on a set $X$. Recall (see [6]) that a function $f : X \to \mathbb{R}_{+\infty}$ is called abstract convex with respect to $H$ ($H$-convex) if there exists a set $U \subseteq H$ such that $f(x) = \sup\{h(x) : h \in U\}$. Let
$$\operatorname{supp}(f; H) := \{h \in H : h \le f\}$$
be the support set of a function $f : X \to \mathbb{R}_{+\infty}$ with respect to $H$. The function $\operatorname{co}_H f : X \to \mathbb{R}_{+\infty}$ defined by
$$\operatorname{co}_H f(x) := \sup\{h(x) : h \in \operatorname{supp}(f; H)\}$$
is called the $H$-convex hull of $f$. Clearly, $f$ is $H$-convex if and only if $f = \operatorname{co}_H f$. Let $f : X \to \mathbb{R}_{+\infty}$ be a proper function and $x_0 \in \operatorname{dom} f = \{x \in X : f(x) < +\infty\}$. The set
$$\partial_H f(x_0) := \{h \in H : f(x) \ge f(x_0) + h(x) - h(x_0) \ \text{for all } x \in X\}$$
is called the $H$-subdifferential of $f$ at the point $x_0$. Obviously, $\partial_H f(x_0)$ is nonempty if $f(x_0) = \max\{h(x_0) : h \in \operatorname{supp}(f; H)\}$. Let $(X, Y)$ be a pair of sets with a coupling function $\varphi : X \times Y \to \mathbb{R}_{+\infty}$. Denote by $F_X$ the union of the set of all functions $f : X \to \mathbb{R}_{+\infty}$ and the function $-\infty$, where $-\infty(x) = -\infty$ for all $x \in X$. The Fenchel–Moreau conjugation corresponding to $\varphi$ is the mapping $f \to f^\varphi$ defined on $F_X$ by
$$f^\varphi(y) = \sup_{x \in X}\{\varphi(x, y) - f(x)\}, \qquad y \in Y.$$
Let $\varphi'$ be the function defined on $Y \times X$ by $\varphi'(y, x) = \varphi(x, y)$. Then the Fenchel–Moreau conjugation corresponding to $\varphi'$ is the mapping $g \to g^{\varphi'}$ defined on $F_Y$ by
$$g^{\varphi'}(x) := \sup_{y \in Y}\{\varphi'(y, x) - g(y)\} = \sup_{y \in Y}\{\varphi(x, y) - g(y)\}.$$
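For the classical coupling $\varphi(x, y) = xy$ on $X = Y = \mathbb{R}$, the biconjugate $f^{\varphi\varphi'}$ is the ordinary convex hull, which can be seen numerically. The sketch below is an illustration of our own (the grids and the sample function are assumptions, not from the original text):

```python
import numpy as np

# Discretized Fenchel-Moreau biconjugation with phi(x, y) = x * y.
xs = np.linspace(-2.0, 2.0, 401)   # grid for X
ys = np.linspace(-6.0, 6.0, 601)   # grid for Y

f = xs ** 4 - 2.0 * xs ** 2        # a nonconvex function on X

# f^phi(y) = sup_x { x*y - f(x) };  f^{phi phi'}(x) = sup_y { x*y - f^phi(y) }.
f_conj = np.array([np.max(xs * y - f) for y in ys])
f_biconj = np.array([np.max(x * ys - f_conj) for x in xs])

# The biconjugate is the H_Y-convex hull of f: here, its convex envelope.
print(np.all(f_biconj <= f + 1e-9))        # True: co f <= f
print(round(float(f_biconj[200]), 3))      # at x = 0: -1.0, the hull fills both wells
```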

set fh y; : y 2 Y;  2 Rg. Let ' : X  Y ! RC1 defined by '(x; y) D y(x). The following result is well known (see, e. g., [5], Theorem 8.8): 0

Theorem 1 Let f 2 F X . Then f '' D co H Y f . In par0 ticular, f '' D f if and only if f is H Y -convex. ICAR Functions Let X be a topological vector space. A set K X is called conic if K K for all  > 0. We assume that X is equipped with a closed convex pointed cone K (the letter means that K \ (K) D f0g). The increasing property of our functions will be understood to be with respect to the ordering  induced on X by K: x  y () y  x 2 K : A function f : X ! RC1 is called convex along rays (shortly CAR) if the function f x (˛) D f (˛x) is convex on the ray [0; C1) for each x 2 X. Similarly, f is called increasing along rays (shortly IAR) if the function f x (˛) D f (˛x) is increasing on the ray [0; C1) for each x 2 X. Also the function f : X ! RC1 is called increasing if x  y H) f (x)  f (y) and it is called decreasing if x  y H) f (x)  f (y). In the sequel, we shall study the increasing convex-along-rays (ICAR) and decreasing convex-along-rays (DCAR) functions. ¯ C deConsider the coupling function l : X  X ! R fined by l(x; y) D maxf  0 : y  xg ; with the conventions max ; :D 0 and max R :D C1. This function is introduced and examined in [3]. We shall include some properties of l for the sake of completeness. Proposition 1 For every x; x 0 ; y 2 X and  > 0, one has l( x; y) D  l(x; y) ;

(1)

1 l(x; y) ; 

(2)

l(x;  y) D

l(x; x) D 1 () x … K ;

(3)

x  x 0 H) l(x; y)  l(x 0 ; y) ;

(4)

x 2 K; y 2 K H) l(x; y) D C1 :

(5)

y2Y

In the case where Y is a set of functions defined on set X, for each y 2 Y and  2 R, consider the function h y; (x) :D y(x)  ; x 2 X. Denote by H Y the

Increasing and Convex-Along-Rays Functions on Topological Vector Spaces

Proof 1 We only prove (3). Note that, since $0 \in K$, we have $l(x, x) \ge 1$ for all $x \in X$. If $x \notin -K$, then $\lambda x \le x$ for some $\lambda > 0$ implies that $\lambda \le 1$; thus $l(x, x) = 1$. Conversely, let $l(x, x) = 1$. Assume, toward a contradiction, that $x \in -K$. Then $\lambda x \le x$ for all $\lambda > 1$, and so $l(x, x) = +\infty > 1$, which contradicts the assumption that $l(x, x) = 1$. Hence $x \notin -K$. □
In view of the above proposition, for example, when $x = y \in K$ the maximum in the definition of $l(x, y)$ is actually attained. The following proposition gives a necessary condition for $l(x, y)$ to be finite.

Proposition 2 If $y \notin -K$, then $l(x, y) < +\infty$ for all $x \in X$.

Proof 2 If $l(x, y) = +\infty$, then there exists a sequence $\lambda_n \to +\infty$ such that $y \le (1/\lambda_n)x$. Hence $y \le 0$. □

By ([3], Remark 2.1), $l_y : X \to \overline{\mathbb{R}}_+$ is an IPH function for each $y \in X$. We also have the following proposition:

Proposition 3 Let $y \notin -K$. The function $l_y : X \to \overline{\mathbb{R}}_+$ is upper semicontinuous.

Proof 3 Fix $x \in X$. Let $\{x_n\} \subseteq X$ be such that $x_n \to x$. Set $\beta = \limsup_n l_y(x_n)$. If $\beta = 0$, then, by the nonnegativity of $l_y$, we have $l_y(x) \ge \beta$. Let $\beta > 0$. It follows from $y \notin -K$ that $\beta < +\infty$. Consider a subsequence $\{n_s\}_{s \ge 1}$ such that $l_y(x_{n_s}) > 0$ and $l_y(x_{n_s}) \to \beta$. We have $x_{n_s} - l_y(x_{n_s})\, y \in K$ for all $s \ge 1$. Since $K$ is closed, we get $x - \beta y \in K$, and so, by the definition of $l$, $l(x, y) \ge \beta$. Hence $l_y$ is upper semicontinuous. □

Set $X_0 = X \setminus (-K)$ and $L_0 = \{l_y : y \notin -K\}$. Fix $y \in X_0$. Let $l : X \to \overline{\mathbb{R}}_+$ be defined by $l(x) := l_y(x)$. We have $l(\lambda x) = \lambda\, l(x)$ for all $\lambda \in [0, +\infty)$. (Note that $l_y(x) < +\infty$ for all $x \in X$.) For each $x \in X$, consider the function $l_x : \mathbb{R}_+ \to \mathbb{R}_+$ defined by $l_x(t) := l(tx) = l_y(tx)$. It is not difficult to check that $l_x$ is increasing and continuous. Hence the function $l$ is ICAR, IAR, and continuous along rays.

Proposition 4 Let $f : X \to \mathbb{R}_{+\infty}$ be increasing and IAR. Then $f(x) = f(0)$ for all $x \le 0$.


Proof 4 Fix $x \in X$ such that $x \le 0$. It follows from $x \le 0$ and the monotonicity of $f$ that $f(x) \le f(0)$. On the other hand, since $f$ is IAR, we have
$$f(x) = f_x(1) \ge f_x(0) = f(0).$$
Hence $f(x) = f(0)$. □



We give an example of an increasing function that is not IAR. Example 1 Consider the function f : R3 ! R defined by f (x) D min x i ; 1i3

8 x 2 R3 :

Recall that x  y if and only if x i  y i for all 1  i  3. It is easy to see that f is increasing but, if we set x D (2; 3; 4), then f x : [0; C1) ! R is not increasing. Therefore, f is not IAR. The following functions are samples of ICAR and IAR functions. Example 2 Consider the functions f : Rn ! R and g : Rn ! R defined by ( n max1in x i x … R ; f (x) D 0 otherwise: and g(x) D exp( f ) : It is easy to check that f and g are ICAR and IAR with respect to coordinatewise ordering on Rn . Let W D L0 [ f0g, where 0(x) D 0 for all x 2 X. Consider the set H D fl   : l 2 W;  2 Rg. We have the following result: Theorem 2 Let f : X ! RC1 be a function. Then f is ICAR, IAR, and lscAR if and only if it is H-convex. Proof 5 It is clear that each function h 2 H is ICAR, IAR, and continuous along rays. Therefore each H-convex function is ICAR, IAR, and lscAR. Conversely, let f : X ! RC1 be an ICAR, IAR, and lscAR function. Consider y 2 X. Since f y is increasing, convex, and lsc, it follows from ([5], Lemma 3.1) that there exists a set Vy RC  R such that f y (t) D supv2Vy fv1 t  v2 g, for each t  0, where


v D (v1 ; v2 ) 2 Vy . First, we suppose that y … K. For v D (v1 ; v2 ) 2 Vy , we set h v (x) D v1 l y (x)  v2 ;

x 2 X:

Clearly h v 2 H. Let v1 D 0 or l y (x) D 0. Since f is IAR, we have f (x) D f x (1)  f x (0) D f (0) D f y (0)  v2 D h v (x);

for all x 2 X :

Suppose now that v1 > 0 and l y (x) > 0. (Note that l y (x) < C1). Since x  l y (x)y 2 K and f is increasing, it follows that

v

Thus, h (x)  f (x) for all x 2 X, that is, h 2 supp( f ; H), where supp( f ; H) D fh 2 H : h(x)  f (x);

8x 2 Xg :

Therefore supfh v (y); v 2 Vy g  supfh(y) : h 2 supp( f ; H)g  f (y) :

( '(x; y) D

l y (x)

y 2 X0;

0

y D 0:

(8)

Theorem 3 A function f : X ! RC1 is ICAR, IAR, 0 and lscAR if and only if f D f '' : Subdifferentiability of ICAR Functions Consider the subdifferential @W of f : X ! RC1 at point x0 2 dom f : @W f (x0 ) D fh 2 W : f (x)  f (x0 ) C h(x)  h(x0)

(6)

On the other hand, in view of (3), we have v2V y

8x 2 Xg : We have the following result:

f (y) D f y (1) D sup (v1  v2 ) D sup (v1 l y (y)  v2 ) v2V y

D sup h v (y) : v2V y

Thus, f (y) D supfh(y) : h 2 supp( f ; H)g. We now assume that y 2 K. By Proposition 4, we have f (y) D f (0). It follows from the proof of ([5], Lemma 3.1) that v1 D 0 for all v 2 Vy . Consider v 2 Vy . We set h v (x) D v2 ;

It follows from ([3], Proposition 3.3) that there exists a bijection from X 0 [ f0g onto W. Therefore we can identify W with Y D X 0 [ f0g by means of the mapping . We define the coupling function ' : X  Y ! RC1 by

Combining Theorem 1 with Theorem 2, one gets:

h v (x)  f y (l y (x)) D f (l y (x)y)  f (x) : v

Hence fh v : v 2 Vy g supp( f ; H), and in view of (6) and (7), we get f (y) D supfh(y) : h 2 supp( f ; H)g. This completes the proof. 

x 2 X:

Proposition 5 Let f : X ! RC1 be an ICAR and IAR function and x0 2 X 0 . If x0 2 dom f for some  > 1; then @W f (x0 ) is nonempty and we have f l x 0 :  2 @ f x 0 (1)g @W f (x0 ) ;

(9)

where f x 0 (˛) D f (˛x0 ). Moreover, if f is strictly increasing at point x0 (i. e., x 2 X; x  x0 and x ¤ x0 imply f (x) < f x0 ) and f x0 is strictly increasing at point ˛ D 1, by replacing @W with @L 0 ; then equality holds in (9). Proof 6 Since the increasing convex function f x0 is continuous at 1, it has a subgradient   0 at this point. Thus

Then f (x) D f x (1)  f (0) D sup v2  h v (x); v2V y

for all x 2 X

 t  1  f x 0 (t)  f x 0 (1)

for all t  0 :

(10)

Let x 2 X be arbitrary. If l x 0 (x) D 0, then by setting t D l x 0 (x) D 0 in (10), we have

and f (y) D f (0) D sup v2 D sup h v (y) : v2V y

v2V y

(7)

 l x 0 (x)  1  f x 0 (0)  f x 0 (1) :

I

Increasing and Convex-Along-Rays Functions on Topological Vector Spaces

Since l x 0 (x0 ) D 1 and f is IAR, we get  l x 0 (x)   l x 0 (x0 )  f (x)  f (x0 ) : Assume now that l x 0 (x) > 0. We have l x 0 (x)x0  x. In view of the monotonicity of f , one has

f x (1)  f x (0). On the other hand, since f is IAR, f (x) D f x (1)  f x (0) D f (0). Hence f (x) D f (0) D  min x2X f (x). Recall that, by ([3], Proposition 3.3) , we can identify L0 with X 0 by means of the mapping 0 . Let us denote by @ X 0 f (x0 ) the set ( 0 )1 (@L 0 f (x0 )). Then

f (x)  f (l x 0 (x)x0 ) D f x 0 (l x 0 (x)) @ X 0 f (x0 )

 f x 0 (1) C  l x 0 (x)   :

D fy 2 X 0 : l y (x)l y (x0 )  f (x) f (x0 ); 8x 2 Xg :

Thus f (x)  f (x0 )   l x 0 (x)   l x 0 (x0 ) ; which shows that  l x 0 2 @W f (x0 ). This implies that @W f (x0 ) ¤ ; and we have (9). Now, let f x0 be strictly increasing at ˛ D 1 and f be strictly increasing at x0 . In this case  is different from zero in the left-hand side of (9). Consider l 2 @L 0 f (x0 ). Then there exists y 2 X 0 such that l D l y . We have f (x)  f (x0)  l y (x)  l y (x0 );

Proposition 7 Let f : X ! R be an ICAR and IAR function. Then @ X 0 f (0) D fy 2 X 0 : l y (x)  ( f x )0C (0);

8x 2 X 0 g ;

where ( f x )0C is the right derivative of the function f x given by f x (˛) D f (˛x). Proof 8 For each y 2 X 0 , one has y 2 @ X 0 f (0) () l y (x)  l y (0)  f (x)  f (0); 8x 2 X

for all x 2 X : (11)

() l y (˛x)  f x (˛)  f x (0); We will show that l y (x0 ) > 0. Reasoning by contradiction, let us assume that l y (x0 ) D 0. It follows from (11) that f x 0 (t)  f x 0 (1)  0 for all t  0. Since f x0 is strictly increasing at point 1, we get a contradiction. Thus l y (x0 ) > 0. Since l y (x0 )y  x0 , one has f (x0 )  f (l y (x0 )y)  f (x0 ) C l y (l y (x0 )y)  l y (x0 ) D f (x0 ) C l y (x0 )l y (y)  l y (x0 ) ;

8x 2 X 0 ; 8˛ > 0 f x (˛)  f x (0) ; ˛ 8x 2 X 0 ; 8˛ > 0

() l y (x) 

() l y (x)  ( f x )0C (0);

8x 2 X 0 :

The second of equivalence is a consequence of Proposition 1 and ([3], Remark 2.1). 

D f (x0 ) : Hence f (x0 ) D f (l y (x0 )y). Since f is strictly increasing at x0 , we get l y (x0 )y D x0 or y D x0 /(l y (x0 )). Moreover, for all t  0, by (11), we have f x 0 (t) D f (tx0)  f (x0 ) C l y (tx0 )  l y (x0 ) ; D f x 0 (1) C l y (x0 )(t  1) ; which shows that l y (x0 ) 2 @ f x 0 (1): We set  D l y (x0 ).  Then l D l x 0 / D  l x 0 . This completes the proof. Proposition 6 Let f : X ! RC1 be an ICAR and IAR function. If x 2 X 0 is a point such that the one-sided derivative f 0 (x; x) D 0, then x is a global minimizer of f over X. Proof 7 Since the function f x is convex and f x0 (1) D f 0 (x; x) D 0, we have f x (1) D min t2[0;C1) f x (t). Thus

DCAR Functions In this section, we shall study decreasing convex-alongrays (DCAR) functions defined on X. To this end, we ¯ deintroduce the coupling function u : X  X ! R fined by u(x; y) D maxf 2 R : x  yg : Let x; x 0 ; y 2 X be arbitrary and  > 0. It is easy to check that the function u has the following properties: (1) x  x 0 H) u(x 0 ; y)  u(x; y), (2) u( x; y) D  u(x; y), (3) u(x; y) D C1 H) y 2 K. We have also u(x; y) D l(x; y). For each y 2 X, consider the cone Q y D fx 2 X : 0  u(x; y)  C1g :


Lemma 1 Let $y \in X$ and $x, x' \in Q_y$. The following inequality holds:
$$u(x + x', y) \ge u(x, y) + u(x', y). \tag{12}$$

Proof 9 Let A D f˛ 2 R : ˛x  yg, B D fˇ 2 R : ˇx 0  yg and C D f 2 R :  (x C x 0 )  yg. In view of the transitive property of the relation , we get A C B C and this yields (12).  It follows from the properties of u and Lemma 1 that Qy is a downward convex cone. Fix y 2 X. We define the ¯ C by function r y : X ! R ( u(x; y) x 2 Q y ; r y (x) D (13) 0 otherwise : By the properties of u, we get ry is a decreasing and positively homogenous function of degree one. Let y … K and set r D r y . It is not difficult to see that the function r x : RC ! RC defined by r x (t) D r(tx) is increasing, convex, and continuous. Thus, for each y … K, the function ry is DCAR, IAR, and continuous along rays. n of Example 3 Let X D Rn and K be the cone RC n all vectors in R with nonnegative coordinates. Let I D f1; 2; : : : ; ng: Each vector x 2 Rn generates the following sets of indices:

IC (x) D fi 2 I : x i > 0g ; I0 (x) D fi 2 I : x i D 0g ; I (x) D fi 2 I : x i < 0g : Let x 2 Rn and c 2 R: Denote by ordinates ( c c ; i … I0 (x); D xi x i 0; i 2 I0 (x):

c x

the vector with co-

n Then, for each x; y 2 RC ; we have 8 0; n x 2 W; n D 1; 2; : : : ; n ! ;  > 0) H) x 2 W. Definition 2 A nonempty subset A of K is called normal if x 2 A; x 0 2 K and x 0  x imply x 0 2 A. Definition 3 A nonempty subset B of K is called conormal if x 2 B, x 0 2 K and x  x 0 imply x 0 2 B. A normal subset A of K is radiant, that is, x 2 A and 0 <  < 1 imply x 2 A. A conormal subset B of K is coradiant, that is, x 2 B and  > 1 imply x 2 B. A set W X is called downward if (x 2 W; x 0  x) H) x 0 2 W. (In particular, the empty set is downward). Similarly, a set V X is called


upward if (x 0 2 V ; x 0  x) H) x 2 V . Let W X be a radiant set. The Minkowski gauge W : X ! R¯C of this set is defined by x 2 Wg : 

l(x; x) D 1 () x … K;

(8)

x 2 K; y 2 K H) l(x; y) D C1 ;

(9)

(2)

x  x 0 H) l(x; y)  l(x 0 ; y) ;

(10)

The Minkowski cogauge V : X ! R¯C of a coradiant set V X is defined by

y  y0 H) l(x; y)  l(x; y0 ) :

(11)

W (x) D inff > 0 :

V (x) D supf > 0 :

x 2 Vg : 

(3)

It is easy to check that the Minkowski gauge of a downward set and the Minkowski cogauge of an upward set are IPH. Consider the function l : K  K ! R¯C defined by l(x; y) D maxf 2 RC : y  xg : This function is introduced and examined in [2]. To motivate our study, we characterize the IPH functions defined on K. Theorem 1 ([2], Theorem 16) Let p : K ! RC be a function. Then p is IPH if and only if p(x)  l(x; y)p(y) for all x; y 2 K; with the convention (C1)  0 D 0. Characterizations of Nonnegative IPH Functions Carrying forward the motivation from Theorem 1, we shall now proceed to develop a similar type of property for IPH functions p : X ! R¯C . To achieve this, we need to introduce the coupling function ¯ C defined by l : X  X ! R l(x; y) :D maxf  0 : y  xg

Proof We only prove parts (7) and (10). Let l(x; y) D C1 for some x; y 2 X. By (4) there exists a sequence fn gn1 such that n ! C1 and y  1/n x for all n  1. Since K is a closed cone, we get y  0. This proves (7). To prove (10), let x  x 0 ; x;y D f  0 : y  xg and x 0 ;y D f  0 :  y  x 0 g. It is clear that x;y x 0 ;y (notice that if x 0 ;y D ;, then  x;y D ;). Hence l(x; y)  l(x 0 ; y). n Example 1 Let X D Rn and K be the cone RC of n all vectors in R with nonnegative coordinates. Let I D f1; 2; : : : ; ng. Each vector x 2 Rn generates the following sets of indices:

IC (x) D fi 2 I : x i > 0g; I0 (x) D fi 2 Ix i D 0g; I (x) D fi 2 I : x i < 0g : Let x 2 Rn and c 2 R: Denote by c/x the vector with coordinates ( c c ; i … I0 (x) ; D xi x i 0; i 2 I0 (x) : Then, for each x; y 2 Rn ; we have ( l(x; y) D

(4)

(we use the conventions max ; :D 0 and max RC :D C1). The next proposition gives some properties of the coupling function l. Proposition 1 For every x; x 0 ; y 2 X and  > 0, one has

min i2I C (y) 0;

xi yi ;

x 2 KC y ; x … KC y ;

where n KC y D fx 2 R : 8 i 2 IC (y) [ I0 (y); x i  0; xi xi max :  min i2I  (y) y i i2I C (y) y i

We also need to introduce the coupling function u : X  ¯ C defined by X ! R

l( x; y) D  l(x; y) ;

(5)

1 l(x; y) ; 

(6)

u(x; y) :D minf  0 : x  yg

(7)

(with the convention min ; :D C1).

l(x;  y) D

l(x; y) D C1 H) y 2 K ;

(12)

I

Increasing and Positively Homogeneous Functions on Topological Vector Spaces

The following proposition gives some properties of the coupling function u. Proposition 2 For every x; x 0 ; y 2 X and  > 0, one has u( x; y) D  u(x; y) ;

(13)

1 u(x; y) ; 

(14)

u(x;  y) D

u(x; y) D 0 () x 2 K ;

(15)

u(x; x) D 1 () x … K ;

(16)

x  x 0 H) u(x; y)  u(x 0 ; y);

(17)

y  y0 H) u(x; y)  u(x; y0 ) :

(18)

n of all Example 2 Let X D Rn and K be the cone RC n vectors in R with nonnegative coordinates. Then, for each x; y 2 Rn , we have ( max i2I C (y) xy ii ; x 2 c C y ; u(x; y) D C 0; x … cy ;

where n cC y D fx 2 R : 9 i 2 IC (y) s.t. x i  0

(iii). We shall now prove the implication (iii) ! (i). Consider x; y 2 X such that y  x. By (4) we get l(x; y)  1. Then (iii) yields that p(x)  p(y). Hence p is increasing. Let x 2 X;  > 0 and l(x; x) D C1. It follows from (6) and (7) that x; x 2 K. Since p is increasing, we get p(x) D p(x) D 0. Let x … K and set y D x. Then, by (6) and (8), we have l(x; x) D 1/. Thus p(x)  1/p(x), and so p(x)  p(x). By replacing  with 1/ and x with x, we obtain p(x)  p(x). This proves that p is positively homogeneous. We next prove the implication (i) ! (iv). Let u(x; y) D 0. By (15) we get x  0. Then p(x) D 0, and so p(x)  u(x; y)p(y). If u(x; y) D C1, then, in view of the convention (C1)  0 D C1, we have p(x)  u(x; y)p(y). We now assume that 0 < u(x; y) < C1. Then, in view of (12) and the closedness of K, we get x  u(x; y)y. Hence p(x)  u(x; y)p(y). Finally, the proof of the implication (iv) ! (i) can be done in a manner analogous to that of the implication (iii) ! (i).  We shall now describe a class of elementary functions with respect to which the IPH functions are supremally generated. Given y 2 X, let us set l y (x) :D l(x; y) for all x 2 X. Thus, by (4), l y (x) D maxf  0 : y  xg; 8 x 2 X:

and

8 i 2 I (y) [ I0 (y); x i  0g: ¯ C be a function. Then the Theorem 2 Let p : X ! R following assertions are equivalent: (i) p is IPH. (ii) p(x)  p(y) for all x; y 2 X, and  > 0 such that y  x. (iii) p(x)  l(x; y)p(y) for all x; y 2 X; with the convention (C1)  0 D 0. (iv) p(x)  u(x; y)p(y) for all x; y 2 X; with the convention (C1)  0 D C1. Proof It is clear that (i) implies (ii). To prove the implication (ii) ! (iii), notice first that, due to (7), l(x; y) D C1 implies that y 2 K and so p(y) D 0. Then, by the convention (C1)  0 D 0, we have p(x)  l(x; y)p(y). If l(x; y) D 0, then, by the nonnegativity of p, we get p(x)  l(x; y)p(y). Finally, let 0 < l(x; y) < C1. Then in view of (4) and the closedness of K, we have x  l(x; y)y, and so (ii) implies

(19)

¯ C is an IPH funcRemark 1 The function l y : X ! R tion for each y 2 X. It obviously follows from (5) and (10). Let L be the set of all supremally generating elementary functions, defined by (19), that is, L :D fl y : y 2 Xg : Consider the mapping (y) D l y ;

(20) : X ! L defined by

y 2 X:

We have the following proposition: Proposition 3 The mapping Moreover, it is antitone:

: X ! L is onto.

y1  y2 H) l y2  l y1

(21)

and antihomogeneous (positively homogeneous of degree  1): l y D 1 l y 8 y 2 X; 8  > 0 :

(22)


Proof By the definition of L, is obviously onto. Implications (21) and (22) follow from (11) and (6), respectively.  The following example shows that the mapping one-to-one.

is not

Example 3 Let X D R2 and K D R2C . Consider the distinct points y1 D (a; b) and y2 D (c; d), where a; b; c and d are negative numbers. By the results obC tained in Example 1, we have K C y 1 D K y 2 D X. Since IC (y1 ) D IC (y2 ) D ;, it follows that l y1 (x) D l y2 (x) D C1 for every x 2 X. Thus, l y1 D l y2 . Let X 0 D X n (K) and L0 D fl y : y 2 X 0 g. We can get: Proposition 4 The mapping 0 D j X 0 is a bijection from X 0 onto L0 , where |X 0 is the restriction of to X 0 . Proof Since, by the definition of L0 , 0 is obviously onto, we only have to prove that 0 is one-to-one. To this aim, assume that y1 ; y2 2 X 0 are such that l y1 D l y2 . By (8), we have 1 D l(y1 ; y1 ) D l(y1 ; y2 ). Hence, by (4), we get y2  y1 . By symmetry it follows that y1  y2 . Since K is pointed, we conclude that  y1 D y2 . ¯ C is called Recall (see [8]) that a function p : X ! R abstract convex with respect to the set L or L-convex if and only if there exists a set W L such that p(x) D sup l 2W l(x). If W L0 , then using 0 we can identify W with some subset of X. In terms of X, p is L0 -convex if there exists a subset Y X 0 such that p(x) D sup y2Y l y (x). It follows from Remark 1 that L consists of nonnegative IPH functions, hence each L-convex function is IPH. ¯ C be a function and L be Theorem 3 Let p : X ! R the set described by Eq. (20). Then p is IPH if and only if there exists a set Y X such that p(x) D max l y (x) 8 x 2 X y2Y

(with the convention max ; :D 0). In this case, one can ¯ C is take Y D fy 2 X : p(y)  1g. Hence, p : X ! R IPH if and only if it is L-convex. Proof We shall only show that every IPH function ¯ C satisfies p(x) D max y2Y l y (x), for all p : X ! R x 2 X, with Y D fy 2 X : p(y)  1g :

It is clear that Y \ (K) D ;. For any x; y 2 X with p(y)  1, it follows from Theorem 2 that p(x)  l y (x). This means that p  l y for all y 2 Y, and so p  max y2Y l y . If p(x) D 0, then, by nonnegativity of the function ly , we have max y2Y l y (x) D 0 D p(x). Assume now that 0 < p(x) < C1. Since p(x/p(x)) D 1, we get x/p(x) 2 Y. Moreover, it follows from (6) that p(x) D l(x; x/p(x)). Therefore, p(x) D max y2Y l y (x). Finally, suppose that p(x) D C1. It follows from the positive homogeneity of p that (1/)x 2 Y for all  > 0. Then, max y2Y l y (x)  l(1/ )x (x) D  for all  > 0. This means that max y2Y l y (x) D C1 D p(x). This completes the proof.  The IPH functions are also infimally generated by the ¯ C ; y 2 X; defined elementary functions u y : X ! R by u y (x) :D u(x; y) D minf  0 : x  yg; 8 x 2 X: In view of (13) and (17), it is clear that the function uy is IPH. Set U :D fu y : y 2 Xg :

(23)

¯ C by We define the mapping ' : X ! R '(y) :D u y ;

y 2 X:

We omit the proof of the following results, which are similar to those of Propositions 3 and 4. Proposition 5 The mapping ' : X ! U is onto. Moreover, it is antitone and antihomogeneous (positively homogeneous of degree -1). Let U 0 D fu y : y 2 X 0 g: We can get: Proposition 6 The mapping ' 0 D 'j X 0 is a bijection from X 0 onto U 0 , where ' |X 0 is the restriction of ' to X0 . ¯ C is called abstract concave with A function p : X ! R respect to the set U, or U-concave, if there exists a set W U such that p(x) D infu2W u(x). Since U consists of nonnegative IPH functions, we get each U-concave function is IPH. The proof of the following theorem can be done in a manner analogous to the one of Theorem 3. ¯ C be a function and U be Theorem 4 Let p : X ! R the set described by (23). Then p is IPH if and only if

Increasing and Positively Homogeneous Functions on Topological Vector Spaces

there exists a set W U such that p(x) D min u y (x) u y 2W

8x 2 X:

In this case, one can take W D fu y : y 2 X; p(y)  1g. ¯ C is IPH if and only if it is UHence p : X ! R concave. Abstract Convexity of Nonnegative IPH Functions We are now going to develop an abstract convexity (resp. abstract concavity) approach to IPH functions. The set L (resp. U) will play the role of the conjugate space in the usual linear model, while IPH functions will be regarded as analogous to sublinear functions. The well-known dual object related to a sublinear function is the so-called polar function (see, for example, [12,14]). We now give an analog of this concept for IPH functions and define also a related notion of polar set of a set W  X. Definition 4 The lower polar function of p : X ! ¯ C defined by ¯ C is the function p0 : L ! R R p0 (l y ) D sup x2X

l y (x) ; p(x)

1 p(y)

We shall call supp(p; X) the X  support of p. ¯ C be a function. Then, Proposition 7 Let p : X ! R p is IPH if and only if supp(p; X) D fy 2 X : p(y)  1g :

(28)

D fy 2 X : p0 (l y )  1g :

8 ly 2 L ;

(25)

8 ly 2 L :

(26)

Proof By (8), (9), and (24) we have p0 (l y )  l y (y)/p(y)  1/p(y) for every y 2 X. Let p be an IPH function and x; y 2 X be arbitrary. Suppose that 0 < p(x) < C1 and 0 < p(y) < C1. It follows from Theorem 2 that l y (x) 1  : p(x) p(y)

supp(p; X) D fy 2 X : l y  pg :

supp(p; X) D fy 2 X : l y (x)  p(x) 8 x 2 Xg

and p is IPH if and only if p0 (l y ) D

The set supp(p; L) D fl y 2 L : l y (x)  p(x) 8 x 2 Xg ¯C is called the support set of the function p : X ! R with respect to set L. If p is finite-valued or IPH, then, in view of (9), we get supp(p; L) L0 , and using 0 we can identify supp(p; L) with some subset of X. Let us denote by supp(p; X) the set ( 0 )1 (supp(p; L)). Then

(24)

¯ C be a function. Then Theorem 5 Let p : X ! R 1 p(y)

Therefore, p0 (l y ) D supx2X l y (x)/p(x)  1/p(y). This, together with (25), yields that p0 (l y ) D 1/p(y). To prove the converse, let x; y 2 X be arbitrary. It follows from (26) that l y (x)/p(x)  1/p(y). Thus, l y (x)p(y)  p(x). Since x and y are arbitrary, by Theorem 2 (the implication (iii) H) (i)), we conclude that p is IPH. 

Proof Let p be an IPH function. We have ly 2 L

(with the conventions 0/0 D 0 and 1/1 D 0).

p0 (l y ) 

I

(27)

If p(x) D 0, then, by part (iii) of Theorem 2, we have l y (x) D 0 or p(y) D 0, which in both cases (27) holds. In view of (1), (27) holds in the other cases.

Then (26) immediately yields (28). To prove the converse, let x; y 2 X be arbitrary. If p(y) D 0; then it is clear that p(y)l y (x)  p(x). Let 0 < p(y) < C1. Then, by hypothesis, we have r D y/p(y) 2 supp(p; X). Thus lr (x)  p(x), and by (6) we get l y (x)p(y)  p(x). Finally, let p(y) D C1. By (28), y 2 supp(p; X). Thus, l y (x)  p(x). If p(x) D 0, then the nonnegativity of ly yields that l y (x) D 0, and so l y (x)p(y)  p(x). Clearly the latter inequality holds for p(x) D C1. Let 0 < p(x) < C1. Then r D x/p(x) 2 supp(p; X). Therefore, lr (y)  p(y) and by (6), p(x)l x (y)  p(y). Hence, by Theorem 2, p is IPH, which completes the proof.  Proposition 8 For any set W X, the following assertions are equivalent: (i) W is upward, coradiant and closed along rays. ¯ C such (ii) There exists an IPH function p : X ! R that supp(p; X) D W. Furthermore, function p of (ii) is unique, namely, p is the Minkowski cogauge W of W.

1581

1582

I

Increasing and Positively Homogeneous Functions on Topological Vector Spaces

Proof (i) H) (ii). Let p D W . It is clear that p is positively homogeneous. Moreover, since W is upward, p is increasing. By [9, Proposition 5.6], since W is closed along rays and coradiant, one has W D fy 2 X : p(y)  1g :

(29)

Whence in view of (28), W D supp(p; X). (ii) H) (i). Let W D supp(p; X) for an IPH func¯ C . By (28) W is coradiant, upward, tion p : X ! R and closed along rays. Finally, the uniqueness of p in (ii) follows from the following equalities, the last one of which uses the convention sup ; D 0 and is a consequence of (29): p(y) D supf > 0 :   p(y)g n y o D sup  > 0 : 1  p( )  y D supf > 0 : 2 Wg  D W (y): This completes the proof.



¯ C , the L-subdifferential at For a function p : X ! R a point x0 2 X is defined as follows: @L p(x0 ) D fl y 2 L : p(x)  p(x0 )  l y (x)  l y (x0 )g: If @L p(x0 ) L0 , then the set @ X p(x0 ) D ( 0 )1 (@L p(x0 )) will be called X-subdifferential of p at x0 (note that @ X p(x0 ) X 0 ). Thus @ X p(x0 ) D fy 2 X : p(x)  p(x0 )  l y (x)  l y (x0 )g: (30) The following simple statement will be useful in the sequel. ¯ be an IPH function Proposition 9 Let p : X ! R and x 2 dom p be a point such that p(x) ¤ 0. Then, r D x/p(x) … K. Proof Let p(x) > 0. Then p(r) D 1 > 0. Since p is an IPH function, we get r … K: If p(x) < 0; then p(r) D 1. Then, in view of the monotonicity of p, we get r … K or r … K. This completes the proof.  ¯ C be an IPH function Theorem 6 Let p : X ! R and x 2 dom p be a point such that p(x) ¤ 0. Let r D x/p(x). Then lr 2 @L p(x), and hence @L p(x) is nonempty.

Proof It follows from Proposition 9 and (7) that r … K and l(y; r) < C1 for all y 2 X. Clearly p(x) 2 f  0 : r  xg. Then, by the definition of l, we have l(x; r)  p(x) > 0. We shall now show that lr (y)  p(y) for any y 2 X. To this end, let y 2 X be arbitrary. If l(y; r) D 0, then lr (y)  p(y). Let 0 < l(y; r) < C1. We have l(y; r)r  y. Since p is IPH, we get l(y; r)p(r)  p(y). Because of p(r) D 1, we get lr (y)  p(y). Since y 2 X was arbitrary, we conclude that lr (x) D p(x) and lr (y)  p(y) for all y 2 X. This yields that  lr 2 @L p(x). Remark 2 Let int K ¤ ;. Consider nonzero IPH func¯ C and x 2 X such that p(x) D 0. We tion p : X ! R can show @L p(x) ¤ ;. Indeed, since p 6 0, there exists r 2 int K such that p(r) > 0 (see [2], Proposition 6). Set r 0 D r/p(r). It is clear that p(r 0 ) D 1, and so, by (28), r 0 2 supp(p; X), that is, lr 0 (t)  p(t) for all t 2 X. It follows from the nonnegativity of lr 0 that lr 0 (x) D p(x) D 0. Hence, lr 0 2 @L p(x). ¯C We next define the upper polar function p0 : U ! R ¯ C by of the function p : X ! R p0 (u y ) D inf

x2X

u y (x) ; p(x)

uy 2 U

(31)

(with the conventions 0/0 D C1 and C1/C1 D C1). The proof of the following result can be done in a manner analogous to that of Theorem 5. ¯ C be a function. Then Theorem 7 Let p : X ! R p0 (u y ) 

1 ; p(y)

8 uy 2 U ;

(32)

and p is IPH if and only if p0 (u y ) D

1 ; p(y)

8 uy 2 U :

(33)

We shall now study the structure of support sets from above, which are characterized by the elementary functions uy rather than by the functions ly (which characterize support sets from below). We shall denote the support set from above, or upper support set, of the ¯ C with respect to set U as function p : X ! R SuppC (p; U) D fu y 2 U : u y  pg :

Increasing and Positively Homogeneous Functions on Topological Vector Spaces

In what follows, we state the counterpart of Proposition 7 for the support set from above. ¯ C be a function. Then Proposition 10 Let p : X ! R SuppC (p; U) D fu y 2 U : p0 (u y )  1g :

(34)

Furthermore, p is IPH if and only if SuppC (p; U) D fu y 2 U : p(y)  1g :

(35)

Proof Equality (34) follows easily from the definitions of SuppC (p; U) and p0 . Furthermore, if p is IPH, then (35) follows from (33) and (34). To prove the converse, let x; y 2 X be arbitrary. If 0 < p(y) < C1, then u y/p(y) 2 SuppC (p; U). Thus, u y/p(y) (x)  p(x). By (14) we get p(y)u(x; y)  p(x). If p(y) D C1, we have p(x)  u(x; y)p(y) (here we use the convention (C1)  0 D C1). Finally, let p(y) D 0. It is clear that u y 2 SuppC (p; U). Thus, u y (x)  p(x). If u y (x) D 0, then, in view of the nonnegativity of p, we get p(x) D 0, and so p(x)  u(x; y)p(y). Now, suppose that 0 < u y (x) < C1. It follows from p(y) D 0 for all  > 0 that u y 2 SuppC (p; U) for all  > 0, and in view of (14), we get (1/)u y (x) D u y (x)  p(x) for all  > 0. This means that p(x) D 0. Therefore, p(x)  u(x; y)p(y). Hence, by Theorem 2 (implication (iv) H) (i)), p is IPH.  ¯ C , in a manner analogous For the function p : X ! R to the case of L-subdifferential, we now define the U-superdifferential of p at x0 as follows: @C U p(x0 ) :D fu y 2 U : u y (x)u y (x0 )  p(x) p(x0 )g: One can prove the following result for U-superdifferential in a manner analogous to the proof of Theorem 6, and therefore we omit its proof. ¯ C be an IPH function Theorem 8 Let p : X ! R and x 2 dom p be a point such that p(x) ¤ 0. Let r D x/p(x). Then u r 2 @C U 0 p(x). Definition 5 Let U X. Then the left polar set of W is defined by W ol D fx 2 X : l(x; y)  1 8 y 2 Wg : Analogously, we define the right polar set of V X.

I

Definition 6 Let V X. Then the right polar set of V is defined by V or D fy 2 X : l(x; y)  1 8 x 2 Vg : In the following theorem, we assume that int K ¤ ;. Theorem 9 Let W; V X and V \ int K ¤ ;. Then the following assertions are true: (i) One has W D W ol or if and only if W is upward, coradiant and closed along rays. (ii) One has V D V orol if and only if V is downward, radiant, and closed along rays. Proof Since X ol D X or D ;; ;ol D ;or D X, and X is upward, downward, radiant, coradiant and closed in itself, both statements are true when W D V D X. For the rest of the proof we shall assume that W ¤ X and V ¤ X. (i) Let W X and W or ¤ ;. By the definition of W or , Remark 1, and Proposition 3, W or is coradiant, upward, and closed along rays. Therefore, W D W ol or implies that W is coradiant, upward, and closed along rays. To prove the converse, we shall first show that W W ol or . Let y 2 W. Since for any x 2 W ol we have l y (x)  1, it is clear that y 2 W ol or . We shall now show that W ol or W. Let y 2 W ol or . By Proposition 8 we have W D supp(p; X) for some IPH function ¯ C . Let x 2 X and  2 (p(x); C1) be p : X ! R arbitrary. For every y0 2 W D supp(p; X), since l y 0 (x)  p(x) < , using (5), one gets l y 0 (x/) D 1 l y 0 (x) < 1, whence x/ 2 W ol . Therefore, l y (x) D  l y (x/) < . Hence, l y (x)  p(x). This proves that l y  p, that is, y 2 supp(p; X) D W. (ii) Suppose that V ol is a nonempty set. Then, by the definition of V ol , Proposition 3, and Remark 1, V ol is downward, radiant, and closed along rays. Therefore, V D V orol implies that V is downward, radiant, and closed along rays. To prove the converse, we shall first show that V V orol . Let x 2 V. Since for any y 2 V or we have l y (x)  1, it follows that x 2 V orol . We shall now show that V orol V. Let x 2 V orol . Consider ¯ C . It follows the Minkowski gauge V : X ! R from [9], Proposition 5.1 that V D ft 2 X : V (t)  1g :

(36)

1583

1584

I

Increasing and Positively Homogeneous Functions on Topological Vector Spaces

Thus, if V (x) D 0, then x 2 V . Assume that V (x) > 0. Since V \ int K ¤ ;, we get V (x) < C1. Set r D x/V (x). By (28), r 2 supp(V ; X). Then lr (t)  V (t) for each t 2 X. In view of (36), we obtain lr (t)  1 for all t 2 V, that is, r 2 V or . Thus, lr (x)  1, and so by (6) and (8), V (x)  1 (note that V (x) > 0 implies that x … K). This proves that x 2 V , which completes the proof.  Abstract Concavity of DPH Functions ¯ is called decreasRecall that a function q : X ! R ing if x  y H) q(x)  q(y). If p is an IPH function, then the functions q(x) D p(x) and q (x) D p(x) are DPH (decreasing and positively homogeneous of degree one). Hence, DPH functions can be investigated by using the properties of IPH functions. In this section, we shall study DPH functions separately. To this end, ¯ dewe need to introduce the function g : X  X ! R fined by g(x; y) :D minf 2 R : y  xg

(37)

(with the conventions min ; :D C1 and min R :D 1). The following proposition can be easily proved: Proposition 11 For every x; x 0 ; y 2 X and  > 0, one has g(x; y) D g(x; y) ;

(38)

¯ C y D fx 2 C y : g(x; y) 2 R g :

(45)

It is easy to check that C  y is an upward convex cone and CC y is a downward cone. Each element y 2 X generates the following functions: ( g(x; y); x 2 C C y C (46) f y (x) D C1; otherwise; and f y (x)

( D

g(x; y);

x 2 C y

C1;

otherwise:

(47)

Let F be the set of all functions defined by (46) and (47). Remark 3 The function f y is DPH for each y 2 X. The proof of the following proposition is similar to that of Proposition 9, and therefore we omit its proof. ¯ be a DPH function Proposition 12 Let q : X ! R and x 2 dom q be a point such that q(x) ¤ 0. Then r D x/q(x) … K. ¯ be a DPH funcProposition 13 Let q : X ! R tion and x 2 dom q be a point such that q(x) ¤ 0. Let r D x/q(x). Then the superdifferential @C F q(x) is nonempty and the following assertions are true: 1. If q(x) > 0; then f rC 2 @C F q(x). 2. If q(x) < 0; then f r 2 @C F q(x). Proof We only prove part (i). Since q(x) 2 f 2 R : r  xg, by (37), we get g(x; r)  q(x). In view of (39) and (42), we have g(x; r) D q(x)g(x; x) D q(x) > 0

1 g(x; y) D g(x; y) ; 

(39)

x  x 0 H) g(x; y)  g(x 0 ; y) ;

(40)

g(x; y) D 1 H) y 2 K ;

(41)

g(x; x) D 1 () x … K :

(42)

It is worth noting that in (37) we cannot restrict the definition of g to   0 because we shall lose property (42). For each y 2 X, we consider the cones  C y ; CC y and C y defined by C y D fx 2 X : g(x; y) 2 R1 g ;

(43)

CC y D fx 2 C y : g(x; y) > 0g ;

(44)

(note that since q(x) ¤ 0, it follows from Proposition 12 that x … K). By (44) and (46) we have x 2 CrC and f rC (x) D g(x; r)  q(x). We shall now show that f rC (y)  q(y) for every y 2 X. Let y 2 X be arbitrary. If y … CrC , then f rC (y) D C1  q(y). Assume that y 2 CrC . Then g(y; r)r  y. Since q is DPH, we get g(y; r)q(r)  q(y). It follows from q(r) D 1 and (46) that f rC (y)  q(y). This yields that f rC (x) D q(x) and f rC (y)  q(y) for each y 2 X. Hence f rC 2 @C F q(x).  It follows from the preceding proposition that we do not need functions of the form (47) in the study of nonpositive DPH functions. For each r 2 X, we can consider the function sr : X ! R¯ defined by ( g(x; r); x 2 Cr sr (x) D (48) 0; x … Cr ;

Increasing and Positively Homogeneous Functions on Topological Vector Spaces

I

instead of the function f r defined by (47). Let S be the set of all functions defined by (48). Since Cr is an upward set, we get that set S consists of nonpositive DPH functions; hence each S-concave function is DPH. We shall now give an infimal representation of DPH functions.

For each y 2 X, consider the cones

Proposition 14 Let q : X ! R be a nonzero function. Then q is DPH if and only if it is S-concave.

Clearly, K  y is a downward cone. Each element y 2 X ¯ generates the function g  y : X ! R defined by

Proof We only prove the part if. Let W 0 D suppC (q; S). We shall show that W 0 ¤ ;. Consider x 2 X such that q(x) < 0. Set r D x/q(x). It follows from Proposition 12 and (41) that r … K and g(y; r) >  1 for all y 2 X. Since q(x) 2 f 2 R : r  xg, by (37) we get g(x; r)  q(x) < 0. Then x 2 Cr , and by (48) we obtain sr (x)  q(x). We shall now show that sr (y)  q(y) for each y 2 X. Let y 2 X be arbitrary. If y … Cr , then sr (y) D 0  q(y). Assume that y 2 Cr . Then (g(y; r))(r) D g(y; r)r  y. Since q is DPH, we get g(y; r)q(r)  q(y). It follows from q(r) D 1 and y 2 Cr that sr (y)  q(y). Thus sr 2 W 0 D suppC (q; S) and sr (x) D q(x). Finally, if q(x) D 0, then s(x) D 0 for each s 2 W 0 . Hence  q(x) D mins2W 0 s(x), that is, q is S-concave. In the sequel, we introduce the function h : X  X ! ¯ defined by R h(x; y) :D maxf 2 R : y  xg

(49)

(we use the conventions max ; :D 1 and max R :D C1). The next proposition gives some properties of the coupling function h. We omit its easy proof. Proposition 15 For every x; x 0 ; y 2 X and  > 0, one has h( x; y) D  h(x; y) ;

(50)

1 h(x; y) ; 

(51)

h(x;  y) D

K y D fx 2 X : h(x; y) 2 RC1 g

(56)

and K y D fx 2 K y : h(x; y) < 0g :

( g y (x)

D

h(x; y);

x 2 K y

1;

x … K y:

(57)

(58)

Let G be the set of all functions defined by (52). We conclude this section by a result on negative IPH functions. Theorem 10 Let p : X ! R¯ be an IPH function and x 2 dom p be a point such that p(x) ¤ 0. Let r D x/p(x). Then gr 2 @G p(x), and hence @G p(x) is nonempty. Proof It is clear that p(x) < 0. Since p(x) 2 f 2 R : r  xg, by (49) we get h(x; r)  p(x). In view of Proposition 15, we have h(x; r) D p(x)h(x; x) D p(x) < 0 (note that since p(x) ¤ 0; it follows from Proposition 9 that x … K). By (51) and (52) we have x 2 K r and gr (x) D h(x; r)  p(x). We shall now show that gr (y)  p(y) for every y 2 X. Let y 2 X be arbitrary. If y … K r , then gr (y) D 1  p(y). Assume that y 2 K r . Then (h(y; r))(r) D h(y; r)r  y. Since p is IPH, we have h(y; r)p(r)  p(y). It follows from p(r) D 1 and (52) that gr (y)  p(y). This yields that gr (x) D p(x) and gr (y)  p(y) for each y 2 X.  Hence gr 2 @G p(x). References

h(x; y) D C1 H) y 2 K ;

(52)

h(x; x) D 1 () x … K ;

(53)

x  x 0 H) h(x; y)  h(x 0 ; y) ;

(54)

x 2 K; y 2 K H) h(x; y) D C1 :

(55)

1. Billera LJ (1974) On games without side payments arising from a general class of markets. J Math Econom 1(2):129– 139 2. Dutta J, Martinez-Legaz JE, Rubinov AM (2004) Monotonic analysis over cones: I. Optimization 53:129–146 3. Gunawardena J (1998) An introduction to idempotency. Cambrige University Press, Cambridge 4. Gunawardena J (1999) From max-plus algebra to nonexpansive mappings: a nonlinear theory for discrete event systems. Theor Comp Sci 293(1):141–167

1585

1586

I

Inequality-constrained Nonlinear Optimization

5. Martinez-Legaz JE, Rubinov AM (2001) Increasing positively homogeneous functions on Rn . Acta Math Vietnam 26(3):313–331 6. Martinez-Legaz JE, Rubinov AM, Singer I (2002) Downward sets and their separation and approximation properties. J Glob Optim 23(2):111–137 7. Mohebi H, Sadeghi H (2007) Monotonic analysis over ordered topological vector spaces: I. Optimization 56(3):1–17 8. Rockafellar RT (1970) Convex analysis. In: Princeton Mathematical Series, vol 28. Princeton University Press, Princeton 9. Rubinov AM (2003) Monotonic analysis: convergence of sequences of monotone functions. Optimization 52:673– 692 10. Rubinov AM (2000) Abstract convex analysis and global optimization. Kluwer, Boston Dordrecht London 11. Rubinov AM, Singer I (2000) Best approximation by normal and co-normalsets. J Approx Theory 107:212–243 12. Rubinov AM, Singer I (2001) Topical and sub-topical functions, downward sets and abstract convexity. Optimization 50:307–351 13. Sharkey WW (1981) Convex games without sidepayments. Int J Game Theory 10(2):101–106 14. Singer I (1997) Abstract convex analysis. Wiley-Interscience, New York

Inequality-constrained Nonlinear Optimization WALTER MURRAY Stanford University, Stanford, USA MSC2000: 49M37, 65K05, 90C30 Article Outline Keywords Synonyms The Problem First Order Optimality Conditions Second Order Optimality Conditions Algorithms See also References Keywords Constrained optimization; Optimality conditions; Algorithms Synonyms IEQNO

The Problem An inequality-constrained nonlinear programming problem may be posed in the form 8 < min f (x) x2Rn (1) :s.t. c(x)  0; where f (x) is a nonlinear function and c(x) is an mvector of nonlinear functions with ith component ci (x), i = 1, . . . , m. We shall assume that f and c are sufficiently smooth. Let x denote a solution to (1). We are mainly concerned about smoothness in the neighborhood of x . In such a neighborhood we assume that both the gradient of f (x) denoted by g(x) and the m × n Jacobian of c(x) denoted by J(x) exist and are Lipschitz continuous. As is the case with the unconstrained problem a solution to this problem may not exist. Typically additional assumptions are made to ensure a solution does exist. A common assumption is to assume that the objective f (x) is bounded below on the feasible set. However, even this is not sufficient to assure a minimizer exists but it is obviously a necessary condition for an algorithm to be assured of converging. If the feasible region is compact then a solution does exist. We shall only be concerned with local solutions. First Order Optimality Conditions The problem is closely related to the equality-constrained problem. If it was known which constraints were active (exactly satisfied) at a solution and which were slack (strictly positive) then the optimality conditions for (1) could be replaced by the optimality conditions for the equality case. Note that this does not imply the inequality problem could be replaced by an equality problem when it comes to determining a solution by an algorithm. The inequality problem may have solutions corresponding to different sets of constraints being active. Also an equality problem may have solutions that are not solutions of the inequality problem. Nonetheless this equivalence in a local neighborhood enables us to determine the optimality conditions for this problem from those of an equality-constrained problem. In order to study the optimality conditions it is necessary to introduce some notation. Let b c(x) and c(x) denote the constraints active and slack at x respectively. Likewise, let b J(x) and J denote

Inequality-constrained Nonlinear Optimization

their respective Jacobians. Assume that b J(x  ) is full rank. Points at which the Jacobian of the active constraints is full rank are said to be regular. It follows from the necessary conditions for the equality case that  D 0; J(x  )>b g(x  )  b

It follows that f (x  C ˛p)  f (x  ) C ˛(p> g(x  ) C ˛M): From the necessary conditions on x we get p> g(x  ) D p>b ; J >b

b c(x  ) D 0; c(x  ) > 0;

which implies

where b  is vector of Lagrange multipliers. These equations may be written in the form: 

I

f (x  C ˛p)  f (x  ) C ˛(p>b J >b  C ˛M): Using the definition of p gives

 > 

g(x )  J(x )  D 0;

 Cb  j C ˛M): f (x  C ˛p)  f (x  ) C ˛(ıe >b



c(x )  0; T c(x  ) D 0; where  is the extended set of Lagrange multipliers. The set is extended by defining a multiplier to be zero for the slack constraints at x c(x  ). The above first order optimality conditions are not the only necessary conditions. Unlike the equality case there may be a feasible arc that moves off one or more of the active constraints along which the objective is reduced. In other words we need some characterization that is necessary for the active set to be binding. The key to identifying the binding set is to examine the sign of b . It follows from the definition of b  that b J g;  D (b Jb J > )1b

(2)

where the argument x has been

dropped for simplicity.

b

Note that (2) implies that  is bounded. Define p as b J p D ıe C e j ; where ı > 0, e denotes the vector of ones and ej is the unit column with one in the jth position. It follows from the assumption on the continuity of the Jacobian that x + ˛ p is feasible for 0  ˛  ˛ is sufficiently small. From the mean value theorem we have f (x  C ˛p) D f (x  ) C ˛p> g(x  C ˛p); where 0    1. The Lipschitz continuity of g implies M exists such that p> g(x  C ˛p)  p> g(x  ) C ˛M:

It follows from the boundedness of b  that if b j < 0 then for ı sufficiently small there exists ˛ such that for 0 < ˛  ˛, f (x  C ˛p) < f (x  ): Consequently, a necessary condition for x to be a minimizer under the assumptions made is that b   0. Equivalently,   0. For different assumptions such as b J not being full rank the condition need not hold as the following simple case illustrates. Suppose we have an equalityconstrained problem with c(x) = 0 then an equivalent inequality-constrained problem is 8 ˆ minn f (x) ˆ  = 0 is a complementarity condition. At least one of (ci (x ), i ) must be zero. It is possible for both to be zero. If there is no index for which both are zero then c(x ) and  are is said to satisfy strict complementarity. If b J(x  ) is full rank then it follows from (2) that  is an isolated point.

1587

1588

I

Inequality-constrained Nonlinear Optimization

The function L(x, ), L(x; ) D F(x)  > c(x); is known as the Lagrangian. The optimality condition g(x  )  J(x  )>  D 0 is equivalent to r x L(x ,  ) = 0. It is also equivalent to Z(x )> g(x ) = 0, where the columns of Z(x) are a basis for the null space of the rows of b J(x). The vector Z(x)> g(x) is called the reduced gradient. Clearly Lagrange multipliers play a significant role in defining the solution of an inequality-constrained problem. There is a significant difference in that role between linear and nonlinear constraints. In the case of linear constraints the numerical value of the multiplier plays no role in defining x only the sign of the multiplier is significant. For nonlinear constraints the numerical value as well as the sign is of significance. To appreciate why it first necessary to appreciate that for problems that are nonlinear in either the constraints or the objective, curvature of the functions are relevant in defining x . More precisely the curvature of the Lagrangian. It easily seen that curvature of the objective is relevant since for unconstrained problems no solution would exist otherwise. To appreciate that curvature in c(x) is relevant note that any problem can be transformed into a problem with just a linear objective by adding an extra variable. For example, add the constraint xn + 1  f (x)  0 and minimize xn + 1 instead of f (x). Since we have established the curvature of f (x) is relevant that relevance must still be there even though f (x) now appears only within a constraint. It is harder to appreciate that it is the relative curvature of the various constraints and objective that is of significance. Second Order Optimality Conditions We shall now assume that the problem functions are twice continuous differentiable. From the unconstrained case it is known that a necessary condition is that r 2 f (x ) is positive semidefinite. Obviously a generalization of this condition needs to hold for (1). Again the Lagrangian will be shown to play a key role. We start by examining the behavior of f (x) along a feasible arc emanating from x . Although the first order optimality conditions make the first order change in the objective

along a feasible arc nonnegative, it could be zero. Consequently, the second order change needs to be nonnegative for arcs where this is true. We restrict our interest to feasible arcs that remain on the set of constraints active at x . If x(˛) represents a twice differentiable arc, with x(0) = x , that lies on the active set thenb c(x(˛)) D 0. Define p d(x(0))/d ˛ and h d2 (x(0))/d ˛ 2 . We have d d b c i (x(˛))> x(˛); c i (x(˛)) D r(b d˛ d˛ d2 d d b x(˛)> r 2b c i (x(˛)) D c i (x(˛)) x(˛) d˛ 2 d˛ d˛ d2 C rb c i (x(˛))> 2 x(˛): d˛ Sinceb c(x(˛)) D 0 it follows that d2 b c i (x(0)) D rb c i (x  )p D 0: (3) c i (x  )> h C p> r 2b d˛ 2 Similarly we get d2 f (x(0)) D g(x  )> h C p> r 2 f (x  )p: d˛ 2 Since d f (x(0)) D g(x  )> p D 0 d˛ (otherwise there would be a descent direction from x ) we require that g(x  )> h C p> r 2 f (x  )p  0: Substituting for g(x ) using the first order optimality conditions gives h > J(x  )>  C p> r 2 f (x  )p  0: It follows from (3) and the definition of the extended multipliers that we require 

m X

i p> r 2 c i (x  )p C p> r 2 f (x  )p  0:

iD1

From the definition of L(x ,  ) and b J(x  )p D 0 this condition is equivalent to requiring that Z(x )> r 2 L(x ,  ) Z(x ) be positive semidefinite. This matrix is called the reduced Hessian of the Lagrangian. Since the

Inequality-constrained Nonlinear Optimization

condition is on the second derivatives it is termed a second order optimality condition. It can now be appreciated that the numerical value of the Lagrange multipliers play a role in defining the solution of a nonlinearlyconstrained problem. Note that when there are n active constraints then there is no feasible arc that remains on the active set and the second order optimality condition is empty. When b J has n rows then the reduced Hessian has zero dimension. For convenience we can define symmetric matrices of zero dimension to be positive definite. Necessary and sufficient conditions for x to be a minimizer are complex. However, sufficient conditions are easy to appreciate. We have established no feasible descent direction exists that moves off any of the active constraints. Consequently, if b  > 0 then f (x) increases along any feasible arc emanating from x that moves off a constraint. We now only need to be sure the same is true for all arcs emanating from x that remaining on the active set. This is assured if d2 f (x(0)) D g(x  )> h C p> r 2 f (x  )p > 0; d˛ 2 which implies Z(x )> r 2 L(x ,  ) Z(x ) is positive definite. Assuming that x is a regular point, strict complementarity hold, the first order necessary conditions hold, and the reduced Hessian at x is positive definite then x is a minimizer and an isolated point. Algorithms Algorithms for inequality problems have a combinatorial element not present in algorithms for equalityconstrained problems. The simplest case of linear programming (LP) illustrates the point. Under mild assumptions the solution of an LP is given by the solution of a set of linear equations, i. e. a vertex of the feasible region. The difficult issue is determining which of the constraints define those equations. If there are m inequality constraints and n variables there are m!/n! (n m)! choices of active constraints. Even for modest values of m and n the possible choices are astronomical. This clearly rules out methods based on exhaustive search. One class of methods to solve inequality problems are so-called active set methods, an example being the simplex method for LP. First a guess is made of the

I

active set (called the working set) and then an estimate to the solution of the resulting equality-constrained problem is computed (in the case of LP or quadratic programming (QP) this would be precise) and at the new point a new guess is made of the active set. The estimate of the solution of the equality-constrained problem is usually made by finding a point that satisfies an approximation to the first order necessary conditions. Unless an intelligent guess is made of the active set such algorithms are doomed to fail. Typically after the initial active set such algorithms generate subsequent working sets automatically. For linearly-constrained problems this is usually a very simple procedure. Assuming the current iterate is feasible an attempt is made to move to the new estimate of the solution. If this is infeasible the best (or a point better than the current iterate) is found along the direction to the new estimate. The constraints active at the new feasible point are then used to define the working set. Usually the active set will be the working set but occasionally we need to move off a constraint that is currently active. How to identify such a constraint is usually straightforward and can be done by examining an estimate to the Lagrange multipliers (obtained from the solution to the approximation of the first order necessary conditions). More complex strategies are possible that move off several constraints simultaneously. An initial feasible point is found by solving an LP. One consequence of this strategy is that it is only necessary to consider working sets for which the objective function has a lower value than at the current iterate. Once we are in a neighborhood of the solution the working set does not change if strict complementarity holds at the solution and x is a regular point. Typically the change in the working set at each iteration of active set methods for linearly-constrained problems is small (usually one), which results in efficiencies when computing the estimate to the new equality-constrained problem. In practice active set methods work well and usually identify the active set at the solution with very little difficulty. For an LP the number of iterations required to identify the active set usually grows linearly with the size of the problem. However, pathological cases exist in which the number of iterations is astronomical and real LP problems do arise where the number of iterates required is much greater than the typical case. Nonetheless algorithms for linearly-constrained

1589

1590

I

Inequality-constrained Nonlinear Optimization

problems based on active set methods are highly successful. For nonlinear problem the issue of identifying the active set at the solution is usually less significant since even when the active set is known the number of iterations required to solve a problem may be large. A more relevant issue is that not knowing the active set causes some problems such as making the linear algebra routines much more complicated. For small problems this is of little consequence but in the large scale case it complicates the data structures required. Nonlinearly-constrained problems are usually an order of magnitude more complicated to solve than linearly-constrained problems. One reason is that algorithms for problems with nonlinear constraints usually do not maintain feasible iterates. If a problem has just one nonlinear equality constraint then generating each member of a sequence that lies on that constraint is itself an infinite process. Methods that generate infeasible iterates need to have some means of assessing whether a point is better than another point. For feasible-point algorithms this is a simple issue since the objective provides a measure of merit. A typical approach is to define a merit function, which balances a change in the objective against the change in the degree of infeasibility. A commonly used merit function is M(x; ) D f (x) C 

m X

maxf0; c i (x)g;

iD1

where  is a parameter that needs to be sufficiently large. Usually it will not be known what ‘sufficiently’ large is so this parameter is adjusted as the sequence of iterates is generated. Note that M(x, ) is not a smooth function and has a discontinuity in its derivative when any element of c(x) is zero. In particular it is not continuous at x when a constraint is active at x . Were this not the case then constrained problems could be transformed to unconstrained problems and solved as such. While transforming a constrained problem into a simple single smooth unconstrained problem is not possible the transformation approach is the basis of a variety of methods. A popular alternative to direct methods is to transform the problem into that of solving a sequence of smooth linearly-constrained problems. This is the method at the heart of MINOS (see [8,9]) one of the most widely used methods for solving

problems with nonlinear constraints. Other transformation methods transform the problem to that of solving a sequence of unconstrained or bounds-constrained problem. Transformation methods have an advantage of over direct methods when developing software. For example, if you have a method for solving large scale linearly-constrained problems then it can be used as a kernel in an algorithm to solve large scale nonlinearlyconstrained problems. See also  Equality-constrained Nonlinear Programming: KKT Necessary Optimality Conditions  First Order Constraint Qualifications  History of Optimization  Kuhn–Tucker Optimality Conditions  Lagrangian Duality: Basics  Redundancy in Nonlinear Programs  Relaxation in Projection Methods  Rosen’s Method, Global Convergence, and Powell’s Conjecture  Saddle Point Theory and Optimality Conditions  Second Order Constraint Qualifications  Second Order Optimality Conditions for Nonlinear Optimization  SSC Minimization Algorithms  SSC Minimization Algorithms for Nonsmooth and Stochastic Optimization References 1. Bazaraa MS, Sherali HD, Shetty CM (1993) Nonlinear programming: Theory and algorithms. second Wiley, New York 2. Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Sci., Belmont, MA 3. Fletcher R (1988) Practical methods of optimization. Wiley, New York 4. Gill PE, Murray W, Wright MH (1981) Practical optimization. Acad. Press, New York 5. Karush W (1939) Minima of functions of several variables with inequalities as side constraints. Dept Math Univ Chicago 6. Kuhn HW (1991) Nonlinear programming: A historical note. In: Lenstra JK, Rinnooy Kan AHG, Schrijver A (eds) Elsevier, AmsterdamHistory of Mathematical Programming: A Collection of Personal Reminiscences. , pp 82–96 7. Kuhn HW, Tucker AW (1951) Nonlinear programming. In: Neyman J (ed) Proc. 2nd Berkeley Symposium on Mathe-

Inference of Monotone Boolean Functions

matical Statistics and Probability. Univ Calif Press, Berkeley, CA, pp 481–492 8. Murtagh BA, Saunders MA (1982) A projected Lagrangian algorithm and its implementation for sparse nonlinear constraints. Math Program Stud 16:84–117 9. Murtagh BA, Saunders MA (1993) MINOS 5.4 user’s guide. SOL Report Dept Oper Res Stanford Univ 83-20R, 10. Nash SG, Sofer A (1996) Linear and nonlinear programming. McGraw-Hill, New York

Inference of Monotone Boolean Functions VETLE I. TORVIK, EVANGELOS TRIANTAPHYLLOU Department Industrial and Manufacturing Systems Engineering, Louisiana State University, Baton Rouge, USA MSC2000: 90C09 Article Outline Keywords Inference of Monotone Boolean Functions The Shannon Function and the Hansel Theorem Hansel Chains Devising a Smart Question-Asking Strategy Conclusions See also References Keywords Boolean function; Monotone Boolean function; Isotone Boolean function; Antitone Boolean function; Classification problem; Boolean function inference problem; Free distributive lattice; Conjunctive normal form; CNF; Disjunctive normal form; DNF; Interactive learning of Boolean functions; Shannon function; Hansel theorem; Hansel chain; Sequential Hansel chains question-asking strategy; Binary search-Hansel chains question-asking strategy; Binary search The goal in a classification problem is to uncover a system that places examples into two or more mutually exclusive groups. Identifying a classification system is beneficial in several ways. First of all, examples can be organized in a meaningful way, which will make

I

the exploration and retrieval of examples belonging to specific group(s) more efficient. The tree-like directory structure, used by personal computers in organizing files, is an example of a classification system which enables users to locate files quickly by traversing the directory paths. A classification system can make the relations between the examples easy to understand and interpret. A poor classification strategy, on the other hand, may propose arbitrary, confusing or meaningless relations. An extracted classification system can be used to classify new examples. For an incomplete or stochastic system, its structure may pose questions whose answers may generalize the system or make it more accurate. A special type of classification problem, called the Boolean function inference problem, is when all the examples are represented by binary (0 or 1) attributes and each example belongs to one of two categories. Many other types of classification problems may be converted into a Boolean function inference problem. For example, a multicategory classification problem may be converted into several two-category problems. In a similar fashion, example attributes can be converted into a set of binary variables. In solving the Boolean function inference problem many properties of Boolean logic are directly applicable. A Boolean function will assign a binary value to each Boolean vector (example). See [22] for an overview of Boolean functions. Usually, a Boolean function is expressed as a conjunction of disjunctions, called the conjunctive normal form (CNF), or a disjunction of conjunctions, called the disjunctive normal form (DNF). CNF can be written as: 0 1 k _ ^ @ xi A ; jD1

i2 j

where xi is either the attribute or its negation, k is the number of attribute disjunctions and j is the jth index set for the jth attribute disjunction. Similarly, DNF can be written as: 0 1 k ^ _ @ xi A : jD1

i2 j

It is well known that any Boolean function can be written in CNF or DNF form. See [20] for an algorithm con-

1591

1592

I

Inference of Monotone Boolean Functions

verting any Boolean expression into CNF. Two functions in different forms are regarded as equivalent as long as they assign the same function values to all the Boolean vectors. However, placing every example into the correct category is only one part of the task. The other part is to make the classification criteria meaningful and understandable. That is, an inferred Boolean function should be as simple as possible. One part of the Boolean function inference problem that has received substantial research efforts is that of simplifying the representation of Boolean functions, while maintaining a general representation power.

Boolean function, consider ordering the binary vectors as follows [21]:

Inference of Monotone Boolean Functions

and

When the target function can be any Boolean function with n attributes, all of the 2n examples have to be examined to reconstruct the entire function. When we have a priori knowledge about the subclass of Boolean functions the target function belongs to, on the other hand, it may be possible to reconstruct it using a subset of the examples. Often one can obtain the function values on examples one by one. That is, at each inference step, an example is posed as a question to an oracle, which, in return, provides the correct function value. A function, f , can be defined by its oracle Af which, when fed with a vector x = hx1 , . . . , xn i, returns its value f (x). The inference of a Boolean function from questions and answers is known as interactive learning of Boolean functions. In many cases, especially when it is either difficult or costly to query the oracle, it is desirable to pose as few questions as possible. Therefore, the choice of examples should be based on the previously classified examples. The monotone Boolean functions form a subset of the Boolean functions that have been extensively studied not only because of their wide range of applications (see [2,7,8] and [24]) but also their intuitive interpretation. Each attribute’s contribution to a monotone function is either nonnegative or nonpositive (not both). Furthermore, if all of the attributes have nonnegative (or nonpositive) effects on the function value then the underlying monotone Boolean function is referred to as isotone (respectively antitone). Any isotone function can be expressed in DNF without using negated attributes. In combinatorial mathematics, the set of isotone Boolean functions is often represented by the free distributive lattice (FDL). To formally define monotone

Definition 1 Let En denote the set of all binary vectors of length n; let x and y be two such vectors. Then, the vector x = hx1 , . . . , xn i precedes vector y = hy1 , . . . , yn i (denoted as x  y) if and only if xi  yi for 1  i  n. If, at the same time x 6D y, then x strictly precedes y (denoted as x  y). According to this definition, the order of vectors in E2 can be listed as follows: h11i  h01i  h00i

h11i  h10i  h00i : Note that the vectors h01i and h10i are in a sense incomparable. Based on the order of the Boolean vectors, a nondecreasing monotone (isotone) Boolean function can be defined as follows [21]: Definition 2 A Boolean function f is said to be an nondecreasing monotone Boolean function if and only if for any vectors x, y 2 En , such that x  y, then f (x)  f (y). A nonincreasing monotone (antitone) Boolean function can be defined in a similar fashion. As the method used to infer an antitone Boolean function is the same as that of a isotone Boolean function, we will restrict our attention to the isotone Boolean functions. When analyzing a subclass of Boolean functions, it is always informative to determine its size. This may give some indications of how general the functions are and how hard it is to infer them. The number of isotone Boolean functions,  (n), defined on En is sometimes referred to as the nth Dedekind number after R. Dedekind, [6] who computed it for n = 4. Since then it has been computed for up to E8 .   (1) = 3;   (2) = 6;   (3) = 20;   (4) = 168 [6];   (5) = 7, 581 [4];   (6) = 7, 828, 354 [28];   (7) = 2, 414, 682, 040, 998 [5];   (8) = 56, 130, 437, 228, 687, 557, 907, 788 [29].

Inference of Monotone Boolean Functions

Wiedeman’s algorithm [29] employed a Cray-2 processor for 200 hours to compute the value for n = 8. This gives a flavor of the complexity of computing the exact number of isotone Boolean functions. The computational infeasibility for larger values of n provides the motivation for approximations and bounds. The best known bound on  (n) is due to D. Kleitman, [12] and Kleitman and G. Markowsky, [13]: n



 (n)  2(bn/2c)

1Cc

log n n



;

where c is a constant and bn/2c is the integer part of n/2. This bound, which is an improvement over the first bound obtained by G. Hansel, [11], are also based on the Hansel chains described below. Even though these bounds can lead to good approximations for  (n), when n is large, the best known asymptotic is due to A.D. Korshunov, [15]: ( n for even n; 2(n/2) e f (n)  (n)  n C1 g(n) 2(n/21/2) e for odd n; where n f (n) D n/2  1 g(n) n D n/2  3/2

!

!

1 2n/2

1

C



2(nC3)/2 ! n 1 C n/2  1/2 2(nC1)/2

n2 2nC5

n2



n 2nC4

 ;

n nC6 n 2 2 C3  n2 C nC4 : 2





I. Shmulevich [24] achieved a similar but slightly inferior asymptotic for even n in a simpler and more elegant manner, which led to some interesting distributional conjectures regarding isotone Boolean functions. Even though the number of isotone Boolean functions is large, it is a small fraction of the number of general Boolean functions, 22n . This is the first hint towards the feasibility of efficiently inferring monotone Boolean functions. Intuitively, one would conjecture that the generality of this class was sacrificed. That is true, however, a general Boolean function consists of a set of areas where it is monotone. In fact, any Boolean function q(x1 , . . . , xn ) can be represented by several nondecreasing g i (x) and nonincreasing hj (x) monotone Boolean

I

functions in the following manner [17]: 0 1 ^ _ @ g i (x) h j (x)A : q(x) D i

j

As a result, one may be able to solve the general Boolean function inference problem by considering several monotone Boolean function inference problems. Intuitively, the closer the target function is to a monotone Boolean function, the fewer monotone Boolean functions are needed to represent it and more successful this approach might be. In [17] the problem of joint restoration of two nested monotone Boolean functions f 1 and f 2 is stated. The approach in [17] allows one to further decrease the dialogue with an expert (oracle) and restore a complex function of the form f 1 & : f 2 , which is not necessarily monotone. The Shannon Function and the Hansel Theorem The complexity of inferring isotone Boolean functions was mentioned in the previous section, when realizing that the number of isotone Boolean functions is a small fraction of the total number of general Boolean functions. In defining the most common complexity measure for the Boolean function inference problem, consider the following notation. Let M n denote the set of all monotone Boolean functions, and A = {F} be the set of all algorithms which infer f 2 M n , and ' (F, f ) be the number of questions to the oracle Af required to infer f . The Shannon function ' (n) can be introduced as follows [14]:   '(n) D min max '(F; f ) : F2A

f 2M n

An upper bound on the number of questions needed to restore a monotone Boolean function is given by the following equation (known as the Hansel theorem) [11]: ! ! n n '(n) D C : bn/2c bn/2c C 1 That is, if a proper question-asking strategy is applied, the total number of questions needed to infer any monotone Boolean function should not exceed ' (n). The Hansel theorem can be viewed as the worst-case scenario analysis. Recall, from the previous section, that

1593

1594

I

Inference of Monotone Boolean Functions

all of the 2n questions are necessary to restore a general Boolean function. D.N. Gainanov [9] proposed three other criteria for evaluating the efficiency of algorithms used to infer isotone Boolean functions. One of them is the average case scenario and the two others consider two different ways of normalizing the Shannon function by the size of the target function.

Inference of Monotone Boolean Functions, Table 1 Hansel chains for E3

chain # 1

2 Hansel Chains The vectors in En can be placed in chains (sequences of vectors) according to monotonicity. The Hansel chains is a particular set of chains that can be formed using a dimensionally recursive algorithm [11]. It starts with the single Hansel chain in E1 : H 1 D fh0i ; h1ig: To form the Hansel chains in E2 , three steps are required, as follows: 1 2 3

Attach the element ‘0’ to the font of each vector in H 1 and get chain C 2 min = fh00i; h01ig. Attach the element ‘1’ to the front of each vector in H 1 and get chain C 2 max = fh10i; h11ig. Move the last vector in C 2 max , i.e. vector h11i, to the end of C 2 min : H 2;1 = fh00i; h01i; h11ig; H 2;2 = fh10ig.

3

vector in-chain index 1 2 3 4 1 2 1 2

vector 000 001 011 111 100 101 010 110

order. That is, if the vectors V j and V k are in the same chain then V j < V k (i. e., V j strictly precedes V k when j < k). Therefore, if the underlying Boolean function is isotone, then one can classify vectors within a chain easily. For example, if a vector V j is negative (i. e., f (V j ) = 0), then all the vectors preceding V j in the same chain are also negative (i. e., f (V k ) = 0 for any k < j). Similarly, if a vector V j is positive, then all the vectors succeeding V j in the same chain are also positive. The monotone ordering of the vectors in Hansel chains motivates the composition of an efficient question-asking strategy discussed in the next section.

Devising a Smart Question-Asking Strategy 3

To form the Hansel chains in E , these steps are repeated: 1 C 3;1 min = fh000i; h001i; h011ig; C 3;2 min = fh010ig. 2 C 3;1 max = fh100i; h101i; h111ig; C 3;2 max = fh110ig. 3 H 3;1 = fh000i; h001i; h011i; h111ig; H 3;2 = fh100i; h101ig; H 3;3 = fh010i; h110ig. Note that since there is only one vector in the C3, 2 max chain, it can be deleted after the vector h110i is moved to C3, 2 min . This leaves the three chains listed in Table 1. In general, the Hansel chains in En can be generated recursively from the Hansel chains in En 1 by following the three steps described above. A nice property of the Hansel chains is that all the vectors in a particular chain are arranged in increasing

The most straightforward question-asking strategy, which uses Hansel chains, sequentially moves from chain to chain. Within each chain one may also sequentially select vectors to pose as questions. After an answer is given, the vectors (in other chains also) that are classified as a result of monotonicity are eliminated from further questioning. Once all the vectors have been eliminated, the underlying function is revealed. The maximum number of questions for this method, called the sequential Hansel chains question-asking strategy, will not exceed the upper limit ' (n), given in the Hansel theorem, as long as the chains are searched in increasing size. Although the sequential question-asking strategy is easy to implement and effective in reducing the total number of questions, there is still room for improvements. N.A. Sokolov [25] introduced an algorithm that sequentially moves between the Hansel chains in de-

I

Inference of Monotone Boolean Functions

Inference of Monotone Boolean Functions, Table 2 Iteration 1

chain #

1

2 3

index of vectors in the chain 1 2 3 4 1 2 1 2

vector

000 001 011 111 100 101 010 110

vector middle reward P if reward N if classi- vector the vector the vector fied in the is positive is negative chain 4

2

selected middle answer vector with the largest min(P; N)

other vectors determined

1 1 1

4

2

4

2

1

creasing size and performs a middle vector search of each chain. His algorithm does not require storing all the Hansel chains since at each iteration it only requires a single chain. This advantage is obtained at the cost of asking more questions than needed. In an entirely different approach, Gainanov [9] presented a heuristic that has been used in numerous algorithms for inferring a monotone Boolean function, such as in [3] and in [18]. This heuristic takes as input an unclassified vector and finds a border vector (maximal false or minimal true) by sequentially questioning neighboring vectors. The problem with most of the inference algorithms based on this heuristic is that they do not keep track of the vectors classified, only the resulting border vectors. Note that for an execution of this heuristic, all of the vectors questioned are not necessarily covered by the resulting border vector, implying that valuable information may be lost. In fact, several border vectors may be unveiled during a single execution of this heuristic, but only one is stored. Many of these methods are designed to solve large problems where it might be inefficient or even infeasible to store all of the information gained within the execution of the heuristic. However, these methods are not efficient (not even for small size problems), in terms of the number of queries they require. One may look at each vector as carrying a ‘reward’ value in terms of the number of other vectors that will be classified concurrently. This reward value is a random variable that takes on one of two (one if the two values are the same) values depending on whether the

vector is a positive or a negative example of the target function. The expected reward is somewhere between these two possible values. If one wishes to maximize the expected number of classified vectors at each step, the probabilities associated with each of these two values need to be computed in addition to the actual values. Finding the exact probabilities is hard, while finding the reward values is relatively simple for a small set of examples. This is one of the underlying ideas for the new inference algorithm termed the binary search-Hansel chains question-asking strategy. This method draws its motivation, for calculating and comparing the ‘reward’ values for the middle vectors in each Hansel chain, from the widely used binary search algorithm (see, for instance, [19]). Within a given chain, a binary search will dramatically reduce the number of questions (to the order of log2 while the sequential search is linear). Once the ‘reward’ values of all the middle vectors have been found, the most promising one will be posed as a question to the oracle. Because each vector has two values, selecting the most promising vector is subjective and several different evaluative criteria can be used. The binary search-Hansel chains question-asking strategy can be divided into the following steps: 1) Select the middle vector of the unclassified vectors in each Hansel chain. 2) Calculate the reward values for each middle vector. That is, calculate the number of vectors that can be classified as positive (denoted as P) if it is positive and negative (denoted as N) if it is negative.

1595

1596

I

Inference of Monotone Boolean Functions

Inference of Monotone Boolean Functions, Table 3 Iteration 2. The vector h100i is chosen and based on the answer, the class membership of the vectors h100i and h000i is determined

chain #

1

2 3

index of vectors in the chain 1 2 3 4 1 2 1 2

vector

000 001 011 111 100 101 010 110

vector middle reward P if classi- vector the vector fied in the is positive chain 4 1 1 1 2 1 2

3) Select the most promising middle vector, based on the (P, N) pairs of the middle vectors, and ask the oracle for its membership value. 4) Based on the answer in Step 3, eliminate all the vectors that can be classified as a result of the previous answer and the property of monotonicity. 5) Redefine the middle vectors in each chain as necessary. 6) Unless all the vectors have been classified, go back to Step 2. The inference of a monotone Boolean function on E3 by using the binary search-Hansel chains question-asking strategy is illustrated below. The specifics of Iteration 1, described below, are also shown in Table 2. At the beginning of first iteration, the middle vectors in each Hansel chain (as described in Step 1) are selected and marked with the ‘ ’ symbol in Table 2. Then, according to Step 2, the reward value for each one of these middle vectors is calculated. For instance, if h001i (the second vector in chain 1) has a function value of 1, then the three vectors h000i, h001i and h010i are also classified as positive. That is, the value of P for vector h001i equals 4. Similarly, h000i will be classified as 0 if h001i is classified as 0 and thus its reward value N equals 2. Once the ‘reward’ values of all the middle vectors have been evaluated, the most promising middle vector will be selected based on their (P, N) pairs. Here we choose the vector whose min (P, N) value is the largest among the middle vectors. If there is a tie, it will be broken randomly. Based on this evaluative criterion, vector

reward N if selected middle answer the vector vector with is negative the largest min(P; N) 1

2

other vectors determined 0

0

2

2 is chosen in chain 1 and is marked with ‘ ’ in the column ‘selected middle vector with the largest min (P, N)’. After receiving the function value of 1 for vector h001i, its value is placed in the ‘answer’ column. This answer is used to eliminate all of the vectors succeeding h001i. The middle vector in the remaining chains are updated as needed. At least one more iteration is required, as there still are unclassified vectors. After the second iteration, no unclassified vectors are left in chains 1 and 2, and the middle of these chains need not be considered anymore. Therefore, an ‘X’ is placed in the column called ‘middle vector in the chain’ in Table 4. At the beginning of the third iteration, the vector h010i is chosen and the function value of the remaining two vectors h010i and h110i are determined. At this point all the vectors have been classified and the question-asking process stops. The algorithm posed a total of three questions in order to classify all the examples. The final classifications listed in Table 5. corresponds to the monotone Boolean function x2 _ x3 . Conclusions This paper described some approaches and some of the latest developments in the problem of inferring monotone Boolean functions. As it has been described here, by using Hansel chains in the sequential questionasking strategy, the number of questions will not exceed the upper bound stated in the Hansel theorem. How-

I

Inference of Monotone Boolean Functions

Inference of Monotone Boolean Functions, Table 4 Iteration 3

chain #

1

2 3

index of vectors in the chain 1 2 3 4 1 2 1 2

vector

000 001 011 111 100 101 010 110

vector middle reward P if classi- vector the vector fied in the is positive chain 0 1 X 1 1 0 X 1 2

1 2 3

vector inchain index 1 2 1 2 1 2 3 4

1

other vectors determined

1 1

Inference of Monotone Boolean Functions, Table 5 The resulting class memberships

chain #

reward N if selected middle answer the vector vector with is negative the largest min(P; N)

vector 100 101 010 110 000 001 011 111

function value 0 1 1 1 0 1 1 1

ever, by combining the binary search of Hansel chains with the notion of an evaluative criterion, the number of questions asked can be further reduced. At present, the binary search-Hansel chains question-asking strategy is only applied to Hansel chains with a dimension of less than 10. However, it is expected that this method can be applied to infer monotone Boolean functions of larger dimensions with slight modifications. See also  Alternative Set Theory  Boolean and Fuzzy Relations  Checklist Paradigm Semantics for Fuzzy Logics  Finite Complete Systems of Many-valued Logic Algebras

 Optimization in Boolean Classification Problems  Optimization in Classifying Text Documents References 1. Alekseev DVB (1988) Monotone Boolean functions. Encycl. Math., vol 6. Kluwer, Dordrecht, 306–307 2. Bioch JC, Ibaraki T (1995) Complexity of identifixation and dualization of positive Boolean functions. Inform and Comput 123:50–63 3. Boros E, Hammer PL, Ibaraki T, Kawakami K (1997) Polynomial-time recognition of 2-monotonic positive Boolean functions given by an oracle. SIAM J Comput 26(1):93–109 4. Church R (1940) Numerical analysis of certain free distributive structures. J Duke Math 9:732–734 5. Church R (1965) Enumeration by rank of the elements of the free distributive lattice with 7 generators. Notices Amer Math Soc 12:724 6. Dedekind R (1897) Ueber Zerlegungen von Zahlen durch ihre grössten gemeinsamen Teiler. Festchrift Hoch Braunschweig u Ges Werke, II:103–148 7. Eiter T, Gottlob G (1995) Identifying the minimal transversals of a hybergraph and related problems. SIAM J Comput 24(6):1278–1304 8. Fredman ML, Khachiyan L (1996) On the complexity of dualization of monotone disjunctive normal forms. J Algorithms 21:618–628 9. Gainanov DN (1984) On one criterion of the optimality of an algorithm for evaluating monotonic Boolean functions. USSR Comput Math Math Phys 24(4):176– 181 10. Gorbunov Y, Kovalerchuk B (1982) An interactive method of monotone Boolean function restoration. J Acad Sci USSR Eng 2:3–6 (in Russian.) 11. Hansel G (1966) Sur le nombre des fonctions Boolenes

1597

1598

I 12.

13.

14. 15. 16.

17.

18.

19. 20.

21. 22.

23.

24.

25.

26.

27.

28. 29.

Infinite Horizon Control and Dynamic Games

monotones den variables. CR Acad Sci Paris 262(20):1088– 1090 Kleitman D (1969) On Dedekind’s problem: The number of monotone Boolean functions. Proc Amer Math Soc 21(3):677–682 Kleitman D, Markowsky G (1975) On Dedekind’s problem: The number of isotone Boolean functions. II. Trans Amer Math Soc 213:373–390 Korobkov VK (1965) On monotone functions of the algebra of logic. Probl Kibernet 13:5–28 (in Russian.) Korshunov AD (1981) On the number of monotone Boolean functions. Probl Kibernet 38:5–108 (in Russian.) Kovalerchuk B, Triantaphyllou E, Deshpande AS, Vityaev E (1996) Interactive learning of monotone Boolean functions. Inform Sci 94(1–4):87–118 Kovalerchuk B, Triantaphyllou E, Vityaev E (1995) Monotone Boolean functions learning techniques integrated with user interaction. Proc Workshop Learning from Examples vs Programming by Demonstrations: Tahoe City, pp 41–48 Makino K, Ibaraki T (1997) The maximum latency and identification of positive Boolean functions. SIAM J Comput 26(5):1363–1383 Neapolitan R, Naimipour K (1996) Foundations of algorithms. D.C. Heath, Lexington, MA Peysakh J (1987) A fast algorithm to convert Boolean expressions into CNF. Techn Report IBM Comput Sci RC 12913(57971) Rudeanu S (1974) Boolean functions and equations. NorthHolland, Amsterdam Schneeweiss WG (1989) Boolean functions: With engineering applications and computer applications. Springer, Berlin Schneeweiss WG (1996) A necessary and sufficient criterion for the monotonicity of Boolean functions with deterministic and stochastic applications. IEEE Trans Comput 45(11):1300–1302 Shmulevich I (1997) Properties and applications of monotone Boolean functions and stack filters. PhD Thesis, Dept. Electrical Engin. Purdue Univ. http://shay.ecn.purdue.edu/ ~shmulevi/ Sokolov NA (1982) On the optimal evaluation of monotonic Boolean functions. USSR Comput Math Math Phys 22(2):207–220 Torvik VI, Triantaphyllou E (2000) Minimizing the average query complexity of learning monotone Boolean functions. Working Paper Dept Industrial and Manuf Syst Engin, Louisiana State Univ Triantaphyllou E, Lu J (1998) The knowledge acquisition problem in monotone Boolean systems. Encycl. Computer Sci. and Techn. In: Kent A, Williams JG (eds) MD Ward M (1946) Note on the order of free distributive lattices. Bull Amer Math Soc no. Abstract 135 52:423 Wiedemann D (1991) A computation of the eight Dedekind number. Order 8:5–6

Infinite Horizon Control and Dynamic Games IHDG DEAN A. CARLSON1 , ALAIN B. HAURIE2 1 University Toledo, Toledo, USA 2 University Geneva, Geneva, Switzerland MSC2000: 91Axx, 49Jxx Article Outline Keywords Unbounded Cost Nonzero-Sum Infinite Horizon Games Cooperative Solution Noncooperative Solutions See also References Keywords Dynamic optimization; Noncooperative equilibrium; Cooperative equilibrium; Overtaking equilibrium In economics or biology there is no natural end time for a process. Nations as well as species have a very long future to consider. A mathematical abstraction for this phenomenon is the concept of infinite time horizon simply defined as an unbounded time interval of the form [0, + 1). The study of competing agents in a dynamic deterministic setting over a long time period can be cast in the framework of an infinite horizon dynamic game. This game is defined by the following ‘objects’:  A system evolving over an infinite horizon is characterized by a state x 2 X Rm0 . Some agents also called the players i = 1, . . . , p can influence the state’s evolution through the choice of an appropriate control in an admissible class. The control value at a given time n for player i is denoted ui (n) 2 U i Rmi .  The state evolution of such a dynamical system may be described either as a difference equation, if discrete time is used, or a differential equation in a continuous time framework. For definiteness we fix our attention here on a stationary difference equation

Infinite Horizon Control and Dynamic Games

and merely remark that similar comments apply for the case when other types of dynamical systems are considered. x(n C 1) D f (x(n); u1 (n); : : : ; u p (n)) for n = 0, 1, . . . , where f: Rm0 × . . . × Rmp ! Rm0 is a given state transition function.  We assume that the agents can observe the state of the system and remember the history of the system evolution up to the current time n, that is, the sequence h n D fx(0); u(0); : : : ; u(n  1); x(n)g ; where u(n) denotes the controls chosen by all players at period n (i. e., u(n) = (u1 (n), . . . , up (n))). A policy or a strategy is a way for each agent, to adapt his/her current control choice to the history of the system, that is a mapping  i : (n, hn ) ! U i which tells player i which control ui (n) 2 U i to select given that the time period is n and the state history is hn .  Once such a model is formulated the question arises as to what strategy or policy should each agent adopt so that his/her decision provides him/her with the most benefit. The decision to adopt a good strategy is based on a performance criterion defined over the life of the agent (in this case [0, + 1)), that is, for each time horizon N the payoff to player i is determined by J Ni (x; u) D

N X

ˇ in g i (x(n); u(n));

nD1

where x and u denote the state and control evolutions over time, g i : Rm0 ×    × Rm p ! R is a given reward function and ˇ i 2 [0, 1] is a discount factor for each player i = 1, . . . , p. Two categories of difficulties have to be dealt with when one studies infinite horizon dynamic games:  the consideration of an unbounded time horizon gives rise to the possibility of having diverging values for the performance criterion (i. e., tending to + 1 on all possible evolutions). This happens typically when there is no discounting (ˇ i = 1). A related issue is the stability vs. instability of the optimally controlled system.

I

 A second category of difficulties are associated with the consideration of all possible actions and reactions of the different agents over time, since an infinite time horizon will always give any agent enough time to correct his/her strategy choice, if necessary. The first difficulty is already present in a single agent system where the problem reduces to a dynamic optimization problem and is typically cast in the framework of the calculus of variations or optimal control in either discrete or continuous time. The second type of difficulty arises typically in nonzero-sum games. Unbounded Cost To introduce the difficulties involved in studying infinite horizon problems we first consider the single player case. The single player case is the most studied of these problems with a relatively rich history beginning with the seminal paper of F. Ramsey [8]. Therefore we shall introduce the subject with the Ramsey model, using simpler notations than the one introduced above. In Ramsey’s work a continuous time model for the economic growth of a nation is developed and analyzed. In discrete time, the dynamics for Ramsey’s model is described by the difference equation x nC1 D x n C f (x n )  c n with a fixed initial condition x0 . Here, xn  0 denotes the amount of capital stock at the end of the time period n; f (xn ) is a nonnegative valued function, known as the production function, which is defined for all positive xn and represents the rate at which capital stock is produced given a stock level xn ; and cn > 0 represents the rate at which the nation consumes the capital stock. Since a nation usually does not consume at a rate faster than it produces we also have the inequality constraint 0  c n  f (x n )

for all n D 1; 2; : : : :

The performance of the system is measured as an accumulation of social welfare over the time scale. Thus, up to a fixed time N, this is represented by the sum J N (fc n g) D

N X

U(c n );

nD1

in which U(cn ) is called a social utility function and represents the ‘rate of enjoyment’ of society at a consump-

1599

1600

I

Infinite Horizon Control and Dynamic Games

tion rate cn . The goal of a decision maker in this model is to determine cn , n = 1, 2, . . . , so that lim J N (fc n g) D

N!C1

lim

N!C1

N X

U(c n )

nD1

is maximized. An immediate concern in attempting to solve such a problem is that the performance criterion is well defined. That is, for a given feasible element {xn , cn }, n = 1, 2, . . . , is the above infinite series convergent? Additionally, if there exists feasible elements for which the convergence is assured how does one know if the supremum is finite. Ramsey was aware of these two difficulties and these issues were addressed in his work. In dealing with this lack of precision two ideas have arisen. The first of these is to introduce the notion of discounting to ‘level the playing field’ by scaling units to present value terms. This manifests itself through a positive weighting scheme. Specifically the performance criterion is modified through the introduction of a constant ‘discount rate’, ˇ, between 0 and 1. That is, the above infinite series is replaced by lim J N (fc n )g D

N!C1

lim

N!C1

N X

ˇ n U(c n ):

nD1

It is now an easy matter to see that if the sequence {U(cn )} is bounded then the infinite series converges. Moreover, if all feasible sequences {cn } are bounded and U() is a continuous function it is easy to see that the supremum (as well as the infimum) over all such sequences is bounded above and the optimization problem is well defined. A criticism of discounting voiced by Ramsey is that it weights a decision makers preference toward the present at the expense of the past. Consequently Ramsey seeks another approach. This alternate idea was that the rate at which a nation consumes is bounded and that ideally the best system would be one in which the rate is as large as possible. Thus Ramsey introduced the notion of a ‘maximal sustainable rate of enjoyment’ which he referred to as bliss. The notion of bliss, denoted by B, is defined now as an optimal steady state problem. That is, B D max fU(c) : c D f (x); x  0g D max fU(c) : c  0g :

With this idea, the performance index is replaced by a new performance given as lim J N (fc n g) D

N!C1

lim

N!C1

N X

B  U(c n );

nD1

and the goal is to choose {cn } as a minimizer instead of a maximizer. Observe that B U(cn )  0 for all n so that the above limit is bounded below by zero. Thus, if bliss is attained by some feasible sequence (that is, c n D c for all n sufficiently large with B D U(c), then the performance criterion is finite for at least one feasible element {xn , cn } and the minimization problem is well defined. Using the notion of bliss, Ramsey solved this problem using classical variational analysis (i. e., the Euler–Lagrange equation from the calculus of variations) to arrive at what is now referred to as Ramsey’s rule of economic growth. Finally we remark that the solution, say {xn , cn }, obtained by Ramsey asymptotically approaches fx; cg, where x the unique solution to the equation c D f (x). The approach adopted by Ramsey in his model has become a prototype for studying more complex problems. In particular, the notion of bliss and the optimal steady state problem combined with the idea that bliss is obtained in finite time is now referred to as a reduction to finite costs. Finally the asymptotic convergence to the optimal steady state is referred to as an asymptotic turnpike property. Since Ramsey ‘solved’ his problem through an application of necessary conditions he did not directly address the question of existence of an optimal solution. He assumed that the solution to the necessary condition was in fact a solution. However, in 1962, S. Chakravarty [4] gave a simple example in which the solution of Ramsey’s rule was not a minimizer but a maximizer! This led to the quest for the existence of optimal solutions for these problems. As the performance objective is unbounded, the traditional notion of a minimizer is no longer valid. Thus, new types of optimality were introduced in the 1960s by C.C. von Weizäcker [10] to deal with this problem. These notions are now known as overtaking optimality, weakly overtaking optimality, and finite optimality. The most useful and strongest of these three types of optimality is overtaking optimality. In words, a sequence {xn , cn } is overtaking optimal if when compared with any other fea-

Infinite Horizon Control and Dynamic Games

sible sequence {xn , cn } the finite horizon performance criterion, J N ({cn }) is larger than J N ({cn }) to within an arbitrarily small margin of error for all N sufficiently large. The introduction of new types of optimality led to new important results concerning these problems. The first necessary conditions for these types optimality were given in 1974 by H. Halkin [5] in which the classical Pontryagin maximum principle was extended. Of particular notice in this result was the fact that the classical transversality condition found in corresponding finite horizon problems does not necessarily hold. This fact led to many results which insure some sort of boundary condition holds at infinity. The first general existence theorem for these optimization problems was given by W.A. Brock and H. Haurie [1] in 1976. During the 1980s these major results were extended in a variety of directions and many of these results are discussed in [3]. Nonzero-Sum Infinite Horizon Games We now turn our attention to p-player games. We use from now on the general notations introduced in the introduction. To simplify a little the exposition we shall use a simplified paradigm where each player is controlling his/her own dynamical system. Hence each player enjoys his/her own state and control, say {xi (n), ui (n)} for i = 1, . . . , p and n = 1, 2, . . . , and has a performance criterion, say J iN (x, u), which is described in discrete time up to the end of period N as J Ni (x; u) D

N X

g i (x(n); u(n)):

I

ference equations x i (n C 1) D f i (x(n); u(n)) for n = 0, 1, . . . and i = 1, . . . , p. The goal of each of the players is to ‘play’ the game so that their decisions provide them with the best performance possible. This action is in conflict with the other players and therefore generally it is not possible for the players to minimize or maximize their performance. The way one defines optimality in a game depends on the mood of play, i. e. if the players behave in a cooperative or in a noncooperative fashion. Cooperative Solution If players cooperate they will want to reach an undominated solution, also called a Pareto solution after its originator, V. Pareto [7], who introduced the concept in 1896. A pair {x, u} is called a cooperative solution if there does not exist a feasible point {y, v} satisfying J i (x, u)  J i (y, v) for all players i = 1, . . . , p with at least one strict inequality for one of the players. It is well known that such an equilibrium can be obtained by solving an appropriate single player game in which the payoff is a weighted sum of the payoffs of all of the players Jr (x; u) D

X

r j J i (x; u);

jD1;:::;p

r j  0;

j D 1; : : : ; p:

In this way the problem is reduced to the case of infinite horizon optimization and the remarks made earlier apply.

nD1

Here we use the notation x(n) D f(x1 (n); : : : ; x p (n))g and u(n) D f(u1 (n); : : : ; u p (n))g: From the notation we see that each players performance measure depends not only on their own decision but also those of the other players. This coupling may also occur in the dynamical system as well. In discrete time these systems may be represented by a system of p dif-

Noncooperative Solutions If players do not cooperate one may consider that they will be satisfied of the outcome if, for each player, his/her strategy is the best response he/she can make to the strategies adopted by the other players. This is the concept of equilibrium, introduced in 1951 by J.F. Nash [6] in the context of matrix games. In general, the search for a Nash equilibrium can not be reduced to an optimization problem. Since each players decision is his/her best decision under the assumption that the decisions of the other players are fixed, the search for an equilibrium is equivalent to the search for a fixed-point

1601

1602

I

Infinite Horizon Control and Dynamic Games

of a reaction mapping that associates with each strategy choice by the p players the set of optimal responses by each of them. To better understand this concept it is preferable to consider first a game defined in its normal form. Let  j 2  j design the strategies of player j. Let V j ( 1 , . . . ,  p ) 2 R be the payoff to player j associated with the strategy choices  D (1 ; : : : ;  p ) of the p players.   is a Nash equilibrium if Vj (1 ; : : : ;  j ; : : : ;  p )  Vj (  ); 8 j 2  j ;

j D 1; : : : ; p:

Now we introduce the product strategy set  Qp jD1  j and the mapping

D

 :    ! R; p

( 1 ;  2 ) D

X

Vj (11 ; : : : ;  j2 ; : : : ;  p1 ):

jD1

Finally let us define the point to set mapping  :  ! 2 defined by (

) C

C

0

 ( ) D ˜ : ( ;  ) D sup ( ;  )

:

 0 2

 is the best response mapping for the game. A fixedpoint of  is a strategy vector   such that   2  (  ):   is a fixed-point of  if and only if it is a Nash equilibrium. In a dynamic setting the concept of strategy is closely related to the information structure of the game. We have assumed, in the beginning that the players can remember the whole (state and control) history of the dynamical system they contribute to control. This is the most precise information that can be available to the players at each instant of time. On the other end we can assume that the only information available to a player is the initial state of the system x0 = x(0) and the current time t. A strategy  j for player j will thus be an open-loop control {uj (n)}n = 0, . . . , 1. An equilibrium in this class of strategies is called an open-loop Nash equilibrium. An intermediate case is the one where each player can observe the state of the system at each time

period but does not recall the previous history of the system, neither the state nor the control values. A (stationary) strategy  j for player j will thus be a closed-loop control or a feedback control  j :x 7! uj =  j (x). An equilibrium in this class of strategies is called a feedback Nash equilibrium. In the economics literature, feedback strategies are also called Markov strategies to emphasize the lack of memory in the information structure. For a single agent deterministic system, i. e. an optimal control problem, the information structure does not really matter. The agent will not be able to do better than the optimal open-loop control, even if he/she has a perfect memory. In a two-player zero-sum dynamic game this will also be the case. In a nonzero-sum game the different information structures lead to different types of equilibria. A criticism of the open-loop Nash equilibrium is that it is not necessarily subgame perfect in the sense of R. Selten [9]. This means that if a player deviates from the equilibrium control for a while and then decides to play again ‘correctly’, then the previously defined equilibrium is not an equilibrium any more. A feedback Nash equilibrium can be made subgame perfect if one uses dynamic programming to characterize it. A memory strategy Nash equilibrium can also be made subgame perfect. Furthermore, the possibility to remember past actions or state values permit the player to define a so-called communication equilibrium where, before the play the agents communicate with each other and decide to use a specific memory strategy equilibrium. The memory permits the inclusion of threats that would support a cooperative outcome. The cooperative outcome becomes also a Nash equilibrium outcome. This type of results have been known as the ‘folks theorem’ in economics. The infinite horizon is essential to obtain this type of result. Nevertheless, the open-loop concept still has wide interest for a variety of reasons. In infinite horizon games the notion of overtaking Nash equilibrium is defined analogously to the concept in the single-player. These ideas have just recently begun to be studied extensively with the first existence theory for open-loop Nash equilibria and a corresponding turnpike theory being given in 1996 in [2]. Finally, from a practical setting, the numerical computation of a feedback Nash equilibrium is much less understood than the computation of an open-loop Nash equilibrium. The analo-

Information-based Complexity and Information-based Optimization

gous theory for feedback (or closed-loop) equilibria is still waiting to be developed. In closing, the theory of infinite horizon dynamic games is for the most part still in its infancy and much remains to be studied and researched. One important open question concerns the existence of overtaking feedback Nash equilibria and another is that once such an equilibrium is known to exist can a robust numerical procedure for computation of equilibrium be developed. See also  Control Vector Iteration  Duality in Optimal Control with First Order Differential Equations  Dynamic Programming: Continuous-time Optimal Control  Dynamic Programming and Newton’s Method in Unconstrained Optimal Control  Dynamic Programming: Optimal Control Applications  Hamilton–Jacobi–Bellman Equation  MINLP: Applications in the Interaction of Design and Control  Multi-objective Optimization: Interaction of Design and Control  Optimal Control of a Flexible Arm  Optimization Strategies for Dynamic Systems  Robust Control  Robust Control: Schur Stability of Polytopes of Polynomials  Semi-infinite Programming and Control Problems  Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems  Suboptimal Control

I

5. Halkin H (1974) Necessary conditions for optimal control problems with infinite horizon. Econometrica 42:267–273 6. Nash JF (1951) Non-cooperative games. Ann of Math 54:286–295 7. Pareto V (1896) Cours d‘economie politique. Rouge, Lausanne 8. Ramsey F (1928) A mathematical theory of saving. Economic J 38:543–549 9. Selten R (1975) Reexamination of the perfectness concept for equilibrium points in extensive games. Internat J Game Theory 4:25–55 10. Von Weizäcker CC (1965) Existence of optimal programs of accumulation for an infinite time horizon. Review of Economic Studies 32:85–104

Information-based Complexity and Information-based Optimization J. F. TRAUB1 , A. G. WERSCHULZ2,1 1 Department Computer Sci., Columbia University, New York, USA 2 Department Computer and Information Sci., Fordham University, New York, USA MSC2000: 65K05, 68Q05, 68Q10, 68Q25, 90C05, 90C25, 90C26 Article Outline Keywords Information-Based Complexity Computational Complexity of High-Dimensional Integration Mathematical Finance General Theory

Information-Based Optimization See also References

References 1. Brock WA, Haurie A (1976) On existence of overtaking optimal trajectories over an infinite time horizon. Math Oper Res 1:337–346 2. Carlson DA, Haurie A (1996) A turnpike theory for infinite horizon open-loop competitive processes. SIAM J Control Optim 34(4):1405–1419 3. Carlson DA, Haurie A, Leizarowitz A (1991) Infinite horizon optimal control: Deterministic and stochastic systems, 2nd edn. Springer, Berlin 4. Chakravarty S (1962) The existence of an optimum savings program. Econometrica 30:178–187

Keywords Information-based complexity; Information-based optimization; Real number model; Linear programming; Nonlinear optimization; High-dimensional integration; Mathematical finance; Curse of dimensionality This article concerns optimization in two senses. The first is that information-based complexity (IBC) is the

1603

1604

I

Information-based Complexity and Information-based Optimization

study of the minimal computational resources to solve continuous mathematical problems. (Other types of mathematical problems are also studied; the problems studied by IBC will be characterized later.) J.F. Traub and A.G. Werschulz [14] provide an expository introduction to the theory and applications of IBC, with over 400 recent papers and books. A general formulation with proofs can be found in [13]. The second is that the computational complexity of optimization problems is one of the areas studied in IBC. S.A. Vavasis [16 pag. 135] calls this information-based optimization. We will discuss information-based complexity and information-based optimization in turn. Information-Based Complexity To introduce computational complexity, we first define the model of computation. The model of computation states which operations are permitted and how much they cost. The model of computation is based on two assumptions: 1) We can perform arithmetic operations and comparisons on real numbers at unit cost. 2) We can perform an information operation at cost c. Usually,  1. We comment on these assumptions. The real number model (Assumption 1) is used as an abstraction of the floating-point model typically used in scientific computation. Except for the possible effect of roundoff errors and numerical stability, complexity results will be the same in these two models. The real number model should be contrasted with the Turing machine model, typically used for discrete problems. The cost of an operation in a Turing machine model depends on the size of the operands, which is not a good assumption for floating point numbers. For a full discussion of the pros and cons of the Turing machine and real number models see [14 Chapt. 8]. Whether the real number or Turing machine model is used can make an enormous difference. For example, L.G. Khachiyan [3] shows that linear programming is polynomial in the Turing machine model. In 1982, Traub and H. Wo´zniakowski [15] showed that Khachiyan’s algorithm is not polynomial in the real number model and conjectured that linear programming is not polynomial in this model. This conjecture is still open.

The purpose of information operations (Assumption 2) is to replace the input by a finite set of numbers. For integration, the information operations are typically function evaluations. Computational Complexity of High-Dimensional Integration We illustrate some of the important ideas of IBC with the example of high-dimensional integration. We wish to compute the integral of a real-valued function f of d variables over the unit cube in d dimensions. Typically, we have to settle for computing a numerical approximation with an error ". To guarantee an "-approximation we have to know some global information about the integrand. We assume that the class F of integrands has smoothness r. One such class is F r , which consists of those functions having continuous derivatives of order through r, these derivatives satisfying a uniform bound. A real function of a real variable cannot be entered into a digital computer. We evaluate f at a finite number of points and we call the set of values of f the local information, for brevity information, about f . An algorithm combines the function values into a number that approximates the integral. In the worst-case setting we want to guarantee an error at most " for every f 2 F. The computational complexity, for brevity complexity, is the least cost of computing the integral to within " for every f . We want to stress that the complexity depends on the problem and on ", but not on the algorithm. Every possible algorithm, whether or not it is known, and all possible points at which the integrand is evaluated are permitted to compete when we consider least possible cost. It can be shown that if F = F r , then the complexity of our integration problem is of order "dr . If r = 0, e. g., if our set of integrands consists of uniformly bounded continuous functions, the complexity is infinite. That is, it is impossible to solve the problem to within ". Let r be positive and in particular let r = 1. Then the complexity is of order " d . Because of the exponential dependence on d, we say the problem is computationally intractable. This is sometimes called the curse of dimensionality. We will compare this d-dimensional integration problem with the well-known traveling salesman prob-

Information-based Complexity and Information-based Optimization

lem (TSP), an example of a discrete combinatorial problem. The input is the location of the n cities and the desired output is the minimal route; the city locations are usually represented by a finite number of bits. Therefore the input can be exactly entered into a digital computer. The complexity of this problem is unknown but conjectured to be exponential in the number of cities. That is, the problem is conjectured to be computationally intractable and many other combinatorial problems are conjectured to be intractable. Most problems in scientific computation which involve multivariate functions belonging to F r have been proven computationally intractable in the number of variables in the worst-case setting. These include nonlinear equations [10], partial differential equations [19], function approximation [7], integral equations [19], and optimization [6]. Material on the computational complexity of optimization will be presented in the second half of this article. Very high-dimensional integrals occur in many disciplines. For example, problems with dimension ranging from the hundreds to the thousands occur in mathematical finance. Path integrals, which are of great importance in physics, are infinite-dimensional, and therefore invite high-dimensional approximations. This motivates our interest in breaking the curse of dimensionality. Since this is a complexity result, we cannot get around it by a clever algorithm. We can try to break the curse by settling for a stochastic assurance rather than a worst-case deterministic assurance. Examples of stochastic assurance are provided by the randomized and average case settings which we will consider below. We can also try to break the curse by changing the class of inputs. A good example of this occurs in mathematical finance. Mathematical Finance The valuation of financial instruments often requires the calculation of very high-dimensional integrals. Dimensions of 360 and higher are not unusual. Furthermore, since the integrals can be very complicated requiring between 105 and 106 floating point operations per integrand evaluation, it is important to minimize the number of evaluations. Extensive numerical testing shows that these problems do not suffer from the curse

I

of dimensionality. A possible explanation is given by I. Sloan and Wo´zniakowski [11], who show that the curse can be broken by changing the class of integrands to capture the essence of the mathematical finance problem. See [14 Chapt. 4] for a survey of high-dimensional integration and mathematical finance. General Theory In general, IBC is defined by the assumptions that the information concerning the mathematical model is  partial,  contaminated, and  priced. Referring to the integration example, the mathematical input is the integrand and the information is a finite set of function values. It is partial because the integral cannot be recovered from function values. For a partial differential equation the mathematical input consists of the functions specifying the initial value and/or boundary conditions. Generally, the mathematical input is replaced using a finite number of information operations. These operations may be functionals on the mathematical input or physical measurements that are fed into a mathematical model. In addition to being partial the information is often contaminated by, for example, round-off or measurement error ([8]). If the information is partial or contaminated it is impossible to solve the problem exactly. Finally, the information is priced. As examples, functions can be costly to evaluate or information needed for oil exploration models can be obtained by setting off shocks. With the exception of certain finitedimensional problems, such as roots of systems of polynomial equations and problems in numerical linear algebra, the problems typically encountered in scientific computation have information that is partial and/or contaminated and priced. As part of our study of complexity we investigate optimal algorithms, that is, algorithms whose cost is equal or close to the complexity of the problem. This has sometimes led to new solution methods. The reason that we can often obtain the complexity and an optimal algorithm for IBC problems is that partial and/or contaminated information permits arguments at the information level. This level does not exist for combinatorial problems where we usually have to settle for trying

1605

1606

I

Information-based Complexity and Information-based Optimization

to establish a complexity hierarchy and trying to prove conjectures such as P 6D NP. A powerful tool at the information level is the notion of the radius of information, R. The radius of information measures the intrinsic uncertainty of solving a problem using given information. We can compute an "-approximation if and only if R  ". The radius depends only on the problem being solved and the available information; it is independent of the algorithm. The radius of information is defined in all IBC settings.

for a stochastic assurance, or by changing the class of inputs. For the constrained optimization problem, we first describe changing the class of functions, and then turn to weakening the assurance. Nemirovsky and Yudin [6] take F = F conv to be the class of convex functions that satisfy a Lipschitz condition with a uniform constant on a bounded convex set D. Then   1 ; comp(") D log "

Information-Based Optimization

where the constant in the -notation depends polynomially on the dimension d of D and m, the number of constraints. Thus, convexity breaks the curse of dimensionality. The worst-case deterministic assurance may be weakened to a stochastic assurance; we report on the randomized and average case settings. Nemirovsky and Yudin [6] show that randomization does not break the curse of dimensionality for computing the minimum value of the nonlinear constrained problem. G.W. Wasilkowski [17] establishes an even more negative result if an "-approximation to the value of x that minimizes f 0 is sought. He permits randomization and shows that for all " < 1/2, this problem is unsolvable even if d = 1. The results considered so far use a sequential model of computation. One could also ask about the complexity under a parallel model of computation. If we have k processors running in parallel, how much can the computation of the minimum be sped up? Clearly, the best possible speedup is k. Nemirovsky [5] considers this problem for the case F = F conv , showing that

We turn to the application of IBC concepts to information-based optimization. In their seminal book, A.S. Nemirovsky and D.B. Yudin [6] study a constrained optimization problem. They wish to minimize a nonlinear function subject to nonlinear constraints. Let f = [f 0 , . . . , f m ], where f 0 denotes the objective function and f 1 , . . . , f m denote constraints. Let F be the product of m+ 1 copies of F r . Then  d/r ! 1 comp(") D

: " Thus this problem suffers from the curse of dimensionality. Vavasis [16 Chapt. 6] reports on the worst-case complexity of minimizing an objective function with box constraints. He assumes objective functions defined on the unit cube in d dimensions and takes F as the class of continuous functions with uniform Lipschitz constant L. For global minimization,  comp(") D

L 2"

d ! :

Thus global minimization is intractable. In contrast to global minimization, the problem of computing a local minimum is tractable with suitable conditions on F. Let F consist of continuously differentiable real functions on [0, 1]d whose gradients satisfy a uniform Lipschitz condition with constant M. Then 4d(M/")2 function and gradient evaluations are sufficient. As discussed above, there are two ways one can attempt to break the curse of dimensionality: by settling

par

comp ("; k) D ˝



d ln(2kd)

1/3

 ! 1 ; ln "

where the ˝-constant is independent of k and ". Hence we find that   ! ln(2kd) 1/3 comp(") ; DO comppar ("; k) d which is much less than k. Thus parallel computation is not very attractive for this problem. The average case setting looks more promising than the randomized setting, but since it is technically very

Information-based Complexity and Information-based Optimization

difficult, the results to date are quite limited. In the average case setting we want to guarantee that the expected error is at most " and we minimize the expected cost. In the average case setting, an a priori measure must be placed on F. Typically, this measure is Gaussian; in particular, Wiener measures are used. Since the distribution of the random variable minx f (x) is difficult to obtain, the average case analysis of the global optimization problem is very difficult. Only partial results have been obtained. Let d = 1 and F  Cr [0, 1]. Assume that F is endowed with the r-fold Wiener measure. p Wasilkowski [18] shows that approximately ("1 ln "1 )1/(rC1/2) function evaluations suffice. This is better than the worst case, where some " 1/r function values are needed. Stronger results have been obtained for the case of d = 1 and r = 0, i. e., optimization for continuous scalar functions, equipped with the Wiener measure. K. Ritter [9] considers the case of nonadaptive methods, showing that  2 ! 1 non : comp (") D

" Moreover, the optimal evaluation points are equidistant knots. More recently (1997), J.M. Calvin [1] investigates adaptive methods for this problem, showing that for any ı 2 (0, 1), !

compad (") D O

 1/(1ı) 1 : "

The study of optimization in the average case setting is a very promising area for future research. Important open problems include:  obtaining multivariate results,  obtaining lower bounds,  obtaining better upper bounds. We now restrict our attention to the special optimization problem of linear programming (LP), which we discuss in the worst-case setting. In 1979, Khachiyan [3] studied an ellipsoid algorithm and proved that LP is polynomial in the Turing machine model. In 1982, Traub and Wo´zniakowski [15] showed that the cost of this ellipsoid algorithm is not polynomial in the real-number model, and conjec-

I

tured that the LP problem is not polynomial in the realnumber model. This nicely illustrates the difference between the cost of an algorithm and the complexity of a problem, since the result concerning the cost of the ellipsoid algorithm leaves open the question of problem complexity. The Traub–Wo´zniakowski conjecture remains open. A related open question is whether LP can be solved in strongly polynomial time. (Note that the underlying models of computation are different: the real-number model versus the Turing machine model.) This question is also still open, with results known only for special cases. In 1984, N. Megiddo [4] showed that LP can be solved in linear time if the number of variables is fixed, while in 1986, É. Tardos [12] showed that many LP problems that arise from combinatorial applications can be solved in strongly polynomial time. We now discuss the computation of fixed points, which we include here because the result involves ellipsoid methods. The problem is to compute the fixed point of f (x); that is, to solve the nonlinear equation x = f (x) for any f 2 F, where F is the class of functions on [0, 1]d having a Lipschitz constant of q, with q 2 (0, 1). The simple iteration algorithm xi+ 1 = f (xi ), with x0 = 0, can compute an "-approximation with at most   ln 1/" nsi ("; q) D ln 1/q evaluations of f . Thus the simple iteration algorithm behaves poorly if q is close to one. Z. Huang, Khachiyan, and K. Sikorski [2] show that an inscribed ellipsoid algorithm computes an "approximation with    1 1 ne ("; q) D O d ln C ln " 1q function evaluations. Thus their algorithm is excellent for computing fixed points of functions with q close to unity; that is, almost noncontracting functions. See also  Complexity Classes in Optimization  Complexity of Degeneracy  Complexity of Gradients, Jacobians, and Hessians  Complexity Theory  Complexity Theory: Quadratic Programming

1607

1608

I

Integer Linear Complementary Problem

 Computational Complexity Theory  Fractional Combinatorial Optimization  Kolmogorov Complexity  Mixed Integer Nonlinear Programming  NP-complete Problems and Proof Methodology  Parallel Computing: Complexity Classes

References 1. Calvin JM (1997) Average performance of adaptive algorithms for global optimization. Ann Appl Probab 34:711–730 2. Huang Z, Khachiyan L, Sikorski K (1999) Approximating fixed points of weakly contracting mappings. J Complexity 15:200–213 3. Khachiyan LG (1979) A polynomial algorithm in linear programming. Soviet Math Dokl 20:191–194 4. Megiddo N (1984) Linear programming in linear time when the dimension is fixed. J ACM 31:114–127 5. Nemirovsky AS (1994) On parallel complexity of nonsmooth convex optimization. J Complexity 10:451–463 6. Nemirovsky AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. Wiley/Interscience, New York 7. Novak E (1988) Deterministic and stochastic error bounds in numerical analysis, vol 1349. Springer, Berlin 8. Plaskota L (1996) Noisy information and computational complexity. Cambridge Univ. Press, Cambridge 9. Ritter K (1990) Approximation and optimization on the Wiener space. J Complexity 337–364 10. Sikorski K (2000) Optimal solution of nonlinear equations. Oxford Univ. Press, Oxford 11. Sloan IH, Wo´zniakowski H (1998) When are quasi-Monte Carlo algorithms efficient for high dimensional integrals? J Complexity 14:1–33 12. Tardos E (1986) A strongly polynomial algorithm to solve combinatorial linear programs. Oper Res 34(2):250–256 13. Traub JF, Wasilkowski GW, Wo´zniakowski H (1988) Information-based complexity. Acad. Press, New York 14. Traub JF, Werschulz AG (1998) Complexity and information. Cambridge Univ. Press, Cambridge 15. Traub JF, Wo´zniakowski H (1982) Complexity of linear programming. Oper Res Lett 1:59–62 16. Vavasis SA (1991) Nonlinear optimization: Complexity issues. Oxford Univ. Press, Oxford 17. Wasilkowski GW (1989) Randomization for continuous problems. J Complexity 5:195–218 18. Wasilkowski GW (1992) On average complexity of global optimization problems. Math Program 57(2)(Ser. B):313–324 19. Werschulz AG (1991) The computational complexity of differential and integral equations: An information-based approach. Oxford Univ. Press, Oxford

Integer Linear Complementary Problem Integer LCP PANOS M. PARDALOS1 , YASUTOSHI YAJIMA2 1 Center for Applied Optim. Department Industrial and Systems Engineering, University Florida, Gainesville, USA 2 Tokyo Institute Technol., Tokyo, Japan MSC2000: 90C25, 90C33 Article Outline Keywords See also References Keywords LCP; Zero-one integer programming Let M be a given n × n matrix, and let q be a given n vector. The linear complementary problem (LCP; cf. also  Linear complementarity problem) is to find a vector x which satisfies the following system: 8 ˆ ˆ < x  0;

Mx C q  0; ˆ ˆ : x > (Mx C q) D 0:

(1)

When some or all the variables are required to be integers, the problem is called integer linear complementary problem (ILCP). Suppose that for each i (i = 1, . . . , k), the variable xi is required to be integer among x i 2 f0; : : : ; n i g; while for each i (i = k+ 1, . . . , n), the variable xi is continuous and 0  x i  ˇi :

Integer Linear Complementary Problem

The problem can be formulated as the feasibility problem which finds a solution x and z such that 8 ˆ 0  M i x C q i  B i (1  z i ); ˆ ˆ ˆ ˆ ˆ 0  ni zi  xi ; ˆ ˆ ˆ ; : : : ; X n> ; v 1 ; : : : ; v n ; Pn

iD1

mi

n

‚ …„ ƒ ‚ …„ ƒ q> D (0; : : : ; 0; 1; : : : ; 1); and 0

0 B A21 B B : B :: B B n1 BA MDB > Be B B 0 B B :: @ : 0

A12 0 :: :

 

A1n A2n :: :

e 0 :: :

0 e :: :

 

An2 0 e> :: :

  

0 0 0 :: :

0 0 0 :: :

0 0 0 :: :

  

0



e>

0

0



0 0 :: :

1

C C C C C C e C C; 0C C 0C C :: C : A 0

where e is a vector of all ones whose dimension is given by context. Then, the above polymatrix game can be equivalently written as (1). Moreover, suppose that some players, i (i = 1, . . . , k), can select only one pure strategies, while the other players, i (i = k+ 1, . . . , n), can select mixed strategies. For each player i (i = 1, . . . , k), the vector X i is required to be zero-one integer, which results (ILCP). See also  Branch and Price: Integer Programming with Column Generation  Convex-simplex Algorithm  Decomposition Techniques for MILP: Lagrangian Relaxation  Equivalence Between Nonlinear Complementarity Problem and Fixed Point Problem  Generalized Nonlinear Complementarity Problem  Integer Programming  Integer Programming: Algebraic Methods  Integer Programming: Branch and Bound Methods  Integer Programming: Branch and Cut Algorithms  Integer Programming: Cutting Plane Algorithms  Integer Programming Duality  Integer Programming: Lagrangian Relaxation  LCP: Pardalos–Rosen Mixed Integer Formulation  Lemke Method  Linear Complementarity Problem  Linear Programming  Mixed Integer Classification Problems

 Multi-objective Integer Linear Programming  Multi-objective Mixed Integer Programming  Multiparametric Mixed Integer Linear Programming  Order Complementarity  Parametric Linear Programming: Cost Simplex Algorithm  Parametric Mixed Integer Nonlinear Optimization  Principal Pivoting Methods for Linear Complementarity Problems  Sequential Simplex Method  Set Covering, Packing and Partitioning Problems  Simplicial Pivoting Algorithms for Integer Programming  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Time-dependent Traveling Salesman Problem  Topological Methods in Complementarity Theory References 1. Howson JT Jr (1972) Equilibria of polymatrix games. Managem Sci 18(5):312–318 2. Pardalos PM, Nagurney A (1990) The integer linear complementarity problem. Internat J Comput Math 31:205–214 3. Pardalos PM, Rosen JB (1987) Global optimization approach to the linear complementarity problem. SIAM J Sci Statist Comput 9(2):341–353

Integer Linear Programs for Routing and Protection Problems in Optical Networks MEEYOUNG CHA1 , W. ART CHAOVALITWONGSE2 , ZHE LIANG2 , JENNIFER YATES3 , AMAN SHAIKH3 , SUE B. MOON4 1 Division of Computer Science, Korean Advanced Institute of Science and Technology, Daejeon, Korea 2 Department of Industrial and Systems Engineering, Rutgers University, Piscataway, USA 3 AT&T Labs – Research, Florham Park, USA 4 Division of Computer Science, Korean Advanced Institute of Science and Technology, Daejeon, Korea MSC2000: 68M10, 90B18, 90B25, 46N10

ILPs for Routing and Protection Problems in Optical Networks

Article Outline Keywords and Phrases Introduction Motivations and Challenges in Optimization Models Path-Protection (Diverse) Routing Problem Minimum Color Problem Finding the Dual Link Problem SRRG-Diverse Routing Problem Shared Path-Protection Problem

Path-Restoration Problem Optical Path Restoration Problem IP Path–Restoration Problem

Concluding Remarks References Keywords and Phrases Integer linear programming; Routing; Optical networks; Diverse path Introduction Owing to the exponential growth of traffic demand, there is an emerging challenge as well as an opportunity for service providers to employ Internet Protocol (IP) backbone networks carried on top of optical transport networks, forming IP over optical infrastructure. This technology is poised to take over most of broadband operational services as an integrated transport platform for the following reasons. Optical networks offer the capability to carry numerous wavelength signals or channels simultaneously without interaction between each wavelength, known as wavelength division multiplexing (WDM) [14,16,18]. Also WDM optical switches are known to be reliable, support high speed, and are economical, which makes them an attractive selection for the default modern transport network. Moreover, new emerging services that require high bandwidth and reliability (such as Internet Protocol TV – IPTV [2]) are considering optical networks as the underlying network to directly carry the traffic. However, to best efficiently utilize WDM networks, network operators face a number of management and operation challenges, which often require complex mathematical models and advanced optimization techniques. This article focuses on these challenges and briefly review how integer linear programming (ILP) formulation and algorithms have been

I

developed and applied to the domain of optical networks. Motivations and Challenges in Optimization Models The management and operation of WDM networks involve a number of challenges, which should address the physical topology formation, logical topology formation, survivability and fault management. Designing a new transport network is very complex, as it requires one to make decisions on where to place optical nodes so as to provide survivability, connectivity, and cost-effectiveness. Once the physical topology has been fixed, the logical topology of the backbone is decided by setting up lightpaths from one optical node to another. In transport networks, providing survivability and fault management is the most important task. Especially, routing should rapidly recover from any failure in the logical topology, because even a short outage reflects a massive amount of traffic loss in high-speed transport networks. The management and operation of these complex challenges benefit greatly from using mathematical modeling and optimization techniques. Today’s backbone mostly takes a form of a layered IP over optical network. For survivability, it is extremely important to address the practical issue of how IP routing and protection schemes can effectively exploit the lower layer path diversity. IP layer failures are known to occur most frequently, while fiber span failures are catastrophic in that they lead to multiple simultaneous upper layer failures. Moreover, some of the IP layer failures can be only addressed in the IP layer, as lower layer (optical) survivability mechanisms cannot detect failures occurring at higher (IP or applications) layers. In order to rapidly recover from network-wide failures and provide persistent end-to-end path quality, service providers may set up two diverse paths: the service (primary) path and the restoration (backup) path. Any failure in the service path can be hidden, as traffic can be instantly rerouted to the restoration path. Obviously, the efficacy of the restoration path depends heavily on how disjoint these two paths are (under the most frequent single failures). Therefore, it is important to understand how the layering employed in the network affects the correlation of failures among paths.

1611

1612

I

ILPs for Routing and Protection Problems in Optical Networks

This demonstrates the importance of protection against failures in layered networks arising out of shared risk resource groups (SRRG). An example of SRRG is multiple IP links sharing a common optical component, including ducts or conduits through which multiple optical links are routed under the ground. To effectively deliver high-quality services to customers, network providers are required to incorporate this SRRG information into their routing and protection schemes, which has become one of the most challenging problems in networking practice. Note that the SRRGdiverse constraint involves multiple link failure models, in the form of shared risk link groups (SRLGs) and shared risk node groups (SRNGs). Most of previous studies in IP-over-WDM networks considered only two possible alternatives of routing and protection schemes: protection at the optical layer or restoration at the IP layer [7,11,15,16,17].

Path-Protection (Diverse) Routing Problem Finding a backup path that is disjoint for each working or “primary” path, in general, has been recognized as path-protection schemes and has been widely studied in optical networks. Medard et al. [12] focused on the problem of identifying two redundant trees from a single source to a set of destinations that can survive any single link failure (i. e., the elimination of any vertex in the graph leaves each destination vertex connected to the source via at least one of the directed trees). Ellinas et al. [5] focused on the problem of identifying two diverse paths that are SRRG failure resilient. They were the first to theoretically prove that if an arbitrary set of links can belong to a common SRLG, then the problem of finding SRLG-diverse paths between a given source and destination is NP-complete for unicast traffic. Subsequently, Zang et al. [22] proposed heuristic algorithms for the combined problem of finding SRLGdiverse paths and wavelength assignment for one-toone (unicast) traffic. Most recently, Cha et al. [2] studied the SRLG-diverse routing for one-to-many (multicast) traffic, where they focused on the combined problem of minimizing the network cost of multicasting traffic from dual sources to multiple destinations while providing path protection against a single SRLG failure.

Minimum Color Problem Coudert et al. [4] proposed new techniques for the minimum color path problem for multiple failure tolerance from a SRRG failure. The consequence minimum color st-cut problem was also shown to be NP-complete and hard to approximate. Each SRRG is associated with a so-called color in a colored graph G c D (V; E; C), where C is a family of subsets of E. The minimum color path problem is to find a path from a node s to a node t that minimizes the number of different colors of its links. This problem was proven to bee NP-complete in [21] and polynomial in the special case where all the edges of each color have a common extremity. Many insightful theoretical results of this problem were reported in [4]. Definition 1 [4] Let G D (V; E; C) be a colored graph, where C is a partition of E. The minimum color cut consists in finding a minimal set of colors disconnecting G. Let s; t 2 V be two distinct vertices in G. The minimum color st-cut problem is to find a minimal set of colors disconnecting s from t. Theorem 1 [4] The minimum color st-cut is NP-hard. Coudert et al. [4] proved this theorem by proposing the reduction of each set of the set cover instance to a color of the minimum st-cut instance. Theorem 2 [4] The minimum color st-cut problem is not approximable within a factor o(log n) unless N P TIME(n O(log log n) ). Theorem 3 [4] When the edges of each color induce a connected subgraph the minimum color st-cut and the minimum color cut problems are polynomial. Theorem 4 [4] The minimum color st-cut is k-approximable when the number of connected components of the subgraph induced by edges of each color is bounded by k. Define a nonnegative variable for each node, where some edge e between nodes i and j institutes a cut if x i ¤ x j . If any edge of color c institutes a cut, then c is a color cut. Let a binary variable y c be associated with each color c, where y c D 1 when the color c is selected to be in a set of color cut, and y c D 0 otherwise. The minimum color st-cut problem can be formulated as a mixed integer linear program as follows: X yc (1) min c2C

ILPs for Routing and Protection Problems in Optical Networks

subject to xi  0

y c  jx i  x j j8c 2 C ; 8i; j 2 c ;

8i 2 V ;

xs D 0 ; x t D 1 :

(2) (3) (4)

Finding the Dual Link Problem The problems related to finding a pair of link or node disjoint paths in single cost networks have been studied since the mid 1980s [19]. The min–sum problem of dual link is to minimize the sum of the costs of the two disjoint paths and can be solved using a polynomial time algorithm called the shortest pair of paths [19]. In a recent study, the min–sum problem was shown to be a special case of the min–cost flow problem [2]. In contrast to the min–sum problem, the min–max problem, whose objective is to minimize the length of the longer one of the two paths was proven to be NPcomplete [10]. The min–min problem, whose objective is to minimize the length of the shorter one of the two paths, can also be proven to be NP-complete [20] by using the reduction of a the well-known 3-satisfiability (3SAT) problem. The proof can be described as follows [20]. An instance of 3SAT is a boolean formula that is the AND of m clauses C j ( j D 1; : : : ; m). A clause is the OR of three literals, each of which is an occurrence of variable x i (i D 1; : : : ; n) or its negation. A truth assignment is a function  : fx i g ! ftrue; falseg. C j is satisfied by  if it contains a literal with truth value. The question of 3SAT is to determine whether there is a truth assignment that satisfies all m clauses simultaneously. With the 3SAT approach, Xu et al. [20] proposed the following theorem. Theorem 5 [20] The problem of finding two node/linkdisjoint paths between a pair of source and destination nodes in a directed/undirected network with minimum cost for the shorter one is NP-complete. SRRG-Diverse Routing Problem The SRRG-diverse routing problem can be considered to be a generalization of the link-diverse/disjoint routing problem. In a multicast context, the linkdisjoint path-protection problem can be viewed as a diverse routing problem of identifying two redundant trees from a single source to a set of destinations

I

SRRG-Diverse Routing Problem

The SRRG-diverse routing problem can be considered a generalization of the link-diverse/disjoint routing problem. In a multicast context, the link-disjoint path-protection problem can be viewed as a diverse routing problem of identifying two redundant trees from a single source to a set of destinations that can survive any single link failure [6,12]. The diverse routing problem has previously been shown to be NP-complete [1,7,8]. During the past few years, there has been increasing interest in the diverse routing problem with SRRG-diverse constraints, as SRRG-diversity requirements play a very crucial role in real-life network provisioning problems. An example of such real-life problems is finding a pair of diverse paths at the optical layer, which involves the search for two SRLG-diverse paths, as each link at the OXC (optical cross-connect) layer may be related to several SRLGs. Many recent studies have shown that generalized SRLG (or SRRG) diverse routing is a special case of the diverse routing problem, which is also NP-complete [5,8,11,13,22]. Among those studies, the diverse routing problem for unicast (one-to-one) traffic with SRLG-diverse constraints has been proven to be NP-complete [5]. In later studies, the diverse routing problem was extended to many special cases (e.g., diverse routing under both wavelength capacity and path length constraints [22], and multicast routing under SRLG-diverse constraints [3]).

Cha et al. [3] proposed a generalized case of the SRRG-diverse routing problem in which there are two source nodes. The problem can be formally defined as follows. Let G = (V, E) be an undirected graph representing the backbone network. We denote the set of network nodes by V, while E is the set of duplex communication links (edges). Let the number of nodes and edges be n and m, respectively. There is a set of two source nodes, denoted by S ⊆ V, and a set of destination nodes, denoted by D ⊆ V. Denote by B a set of SRLGs. Each link (i, j) ∈ E in the graph has a cost c_ij associated with it and belongs to a subset of the SRLGs in B. Note that the cost c_ij is the sum of the port cost at nodes i and j and the transport cost relative to the distance of link (i, j). This problem was proven to be NP-hard in [3] by a reduction from the SRLG-diverse path problem [5].

Theorem 6 The two-source SRRG-diverse routing problem is strongly NP-hard.

This theorem was proven in [3], where the problem was shown to be a generalization of the problem of finding SRLG-diverse paths between a source and a destination in a given graph (SRG (shared risk group)-diverse routing), proposed in the paper by Ellinas et al. [5].


We add two new nodes (|S| = 2) to the graph and two links, each connecting one of the new nodes to the source, with costs of 0. Assume that the two new links do not share any SRLGs. We then add |D| nodes and 2|D| edges, where each of these nodes is connected to the destination node by two edges that do not share any SRLGs. The transformation is then complete and clearly polynomial.

The ILP of the two-source SRRG-diverse routing problem can be formulated as follows. Define the following decision variables: Y^s_{ij} = 1 if link (i, j) is used by the multicast tree rooted at source node s, and 0 otherwise; X^s_{ij,d} = 1 if link (i, j) is used by the multicast tree rooted at source node s to reach destination d, and 0 otherwise; Z^s_{b,d} = 1 if the path from source s to destination d uses an SRLG b, and 0 otherwise. The ILP formulation is given by

min Σ_{s∈S} Σ_{(i,j)∈E} Y^s_{ij} c_{ij}    (5)

subject to

Y^s_{ij} ≥ X^s_{ij,d},  ∀(i, j) ∈ E, ∀s ∈ S, ∀d ∈ D,    (6)

Σ_{j:(i,j)∈E} X^s_{ij,d} − Σ_{j:(j,i)∈E} X^s_{ji,d} = σ^s_{i,d},  ∀i ∈ V, ∀s ∈ S, ∀d ∈ D,    (7)

Z^s_{b,d} ≥ X^s_{ij,d},  ∀(i, j) ∈ b, ∀s ∈ S, ∀d ∈ D, ∀b ∈ B,    (8)

Σ_{s∈S} Z^s_{b,d} ≤ 1,  ∀d ∈ D, ∀b ∈ B,    (9)

X^s_{ij,d}, Y^s_{ij}, Z^s_{b,d} ∈ {0, 1},  ∀s ∈ S, ∀d ∈ D, ∀(i, j) ∈ E, ∀b ∈ B,    (10)

where

σ^s_{i,d} = 1 if i = s;  −1 if i = d;  0 otherwise.    (11)

The constraints in (6) ensure that an edge must be selected by the multicast tree whenever it is used by the multicast tree to carry any traffic. The flow constraints in (7) ensure flow conservation at each node, allowing each destination to have a flow path from the source. More precisely, σ^s_{i,d} is the net flow generated, carried, or destined at node i for destination d by the multicast tree rooted at node s, which should have the value 1 if node i is the source, −1 if node i is the destination (acting as a sink), and 0 otherwise (whether node i belongs to the multicast tree or not). The constraints in (8) ensure that an SRLG must be selected by the path from a source to a destination whenever a link that belongs to the SRLG is used by the path. The constraints in (9) state that, for each destination d, at most one of the two sources may use paths traversing SRLG b; this is what makes the two trees SRRG-diverse.
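A direct transcription of (5)–(11) is sketched below in PuLP; the five-node network, the SRLG sets, and all costs are hypothetical, and each duplex edge is expanded into two directed arcs so that the flow constraints (7) apply verbatim.

```python
import pulp

# Hypothetical instance: two sources, one destination, two SRLGs.
E = {("s1", "u"): 1, ("u", "d"): 1, ("s2", "v"): 1, ("v", "d"): 1,
     ("s1", "v"): 2, ("s2", "u"): 2}
V = ["s1", "s2", "u", "v", "d"]
S, D = ["s1", "s2"], ["d"]
B = {"b1": [("u", "d")], "b2": [("v", "d")]}

arcs = list(E) + [(j, i) for (i, j) in E]              # duplex -> two arcs
cost = {a: (E[a] if a in E else E[(a[1], a[0])]) for a in arcs}

prob = pulp.LpProblem("two_source_srrg", pulp.LpMinimize)
Y = {(s, a): pulp.LpVariable(f"Y_{s}_{a[0]}_{a[1]}", cat="Binary")
     for s in S for a in arcs}
X = {(s, a, d): pulp.LpVariable(f"X_{s}_{a[0]}_{a[1]}_{d}", cat="Binary")
     for s in S for a in arcs for d in D}
Z = {(s, b, d): pulp.LpVariable(f"Z_{s}_{b}_{d}", cat="Binary")
     for s in S for b in B for d in D}

prob += pulp.lpSum(cost[a] * Y[s, a] for s in S for a in arcs)   # (5)
for s in S:
    for d in D:
        for a in arcs:
            prob += Y[s, a] >= X[s, a, d]                        # (6)
        for i in V:                                              # (7)
            sigma = 1 if i == s else (-1 if i == d else 0)
            prob += (pulp.lpSum(X[s, a, d] for a in arcs if a[0] == i)
                     - pulp.lpSum(X[s, a, d] for a in arcs if a[1] == i)
                     ) == sigma
        for b, edges in B.items():                               # (8)
            for (i, j) in edges:
                prob += Z[s, b, d] >= X[s, (i, j), d]
                prob += Z[s, b, d] >= X[s, (j, i), d]
for d in D:
    for b in B:
        prob += pulp.lpSum(Z[s, b, d] for s in S) <= 1           # (9)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(prob.objective))   # 4.0: trees s1-u-d and s2-v-d
```

Constraint (9) is what forbids the cheaper but SRLG-sharing alternative in this toy instance.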

Shared Path-Protection Problem

Sahasrabuddhe et al. [17] proposed fault management in IP-over-WDM networks using two techniques: protection at the WDM layer and restoration at the IP layer. "Protection" refers to preprovisioned failure recovery (i.e., set up a backup lightpath for every primary lightpath), whereas "restoration" refers to more dynamic recovery (i.e., overprovision the network so that, after a fiber failure, the network is still able to carry all the traffic it was carrying before the failure). Typically, their protection scheme focuses on shared path protection against single fiber span failures, where multiple independent primary paths share the backup path capacity so as to minimize the total capacity required in the network.

Given E as the set of unidirectional fiber links in the network, F as the set of bidirectional fibers, R_ij as the set of alternate routes for node pair ij, and W as the maximum number of wavelengths on a link, the ILP of the shared path-protection routing problem can be formulated as follows. Define the following decision variables: w_k is the number of wavelengths used by primary lightpaths on link k; s_k is the number of spare wavelengths used on link k; V_ij is the number of primary lightpaths between node pair ij; m^w_k = 1 if one or more backup lightpaths are using wavelength w on link k, and 0 otherwise; ρ^w_{ij,r} = 1 if the rth route between node pair ij utilizes wavelength w before any fiber failure, and 0 otherwise; δ^{b,w}_{ij,p} = 1 if a primary lightpath on route p between node pair ij is protected by backup route b between the same node pair employing wavelength w, and 0 otherwise. The ILP of the shared path-protection routing problem is rather complicated, as it considers end-to-end lightpath assignment on the physical links, physical diversity of the primary and backup paths, and sharing of backup path


capacity for failure-independent primary paths. Therefore, the key ILP is formulated as follows (please refer to [17] for the complete set of equations):

min Σ_{k=1}^{E} (w_k + s_k)    (12)

subject to

Σ_{ij} Σ_{p∈R_ij: f∈p} Σ_{b∈R_ij: k∈b} δ^{b,w}_{ij,p} ≤ 1,  1 ≤ f ≤ F, 1 ≤ k ≤ E, 1 ≤ w ≤ W,    (13)

w_k + s_k ≤ W,  1 ≤ k ≤ E,    (14)

Σ_{r∈R_ij} Σ_{w=1}^{W} ρ^w_{ij,r} = V_ij,  ∀ij,    (15)

Σ_{ij} Σ_{r∈R_ij: k∈r} Σ_{w=1}^{W} ρ^w_{ij,r} = w_k,  1 ≤ k ≤ E,    (16)

Σ_{w=1}^{W} m^w_k = s_k,  1 ≤ k ≤ E,    (17)

Σ_{ij} Σ_{r∈R_ij: k∈r} ρ^w_{ij,r} + m^w_k ≤ 1,  1 ≤ k ≤ E, 1 ≤ w ≤ W,    (18)

Σ_{w=1}^{W} ρ^w_{ij,p} = Σ_{b∈R_ij: b≠p} Σ_{w=1}^{W} δ^{b,w}_{ij,p},  ∀ij, ∀p ∈ R_ij.    (19)

The objective of the ILP is to minimize the total capacity used, as given in (12). The crux of the formulation is the set of constraints in (13), which ensure that two backup lightpaths share wavelength w on link k only if the corresponding primary paths are fiber-disjoint. The formulation also includes constraints for setting up lightpaths with shared path protection: the number of channels on each link is bounded by (14), the number of primary lightpaths between a node pair is defined by (15), the number of primary lightpaths traversing a link is defined by (16), the spare capacity of each link is defined by (17), the usage of a wavelength on a link by primary or backup lightpaths is restricted by (18), and every primary lightpath is ensured to be protected by a backup lightpath by (19). In addition, multicommodity flow constraints (omitted here) are added to ensure that the amount of traffic sourced from a node to a destination node is covered by the capacity of the wavelengths. Note that additional constraints on the number of receivers and transmitters used can be incorporated in the model [17].

Path-Restoration Problem

Path-restoration techniques have been frequently employed to provide highly capacity-efficient, close to real-time restoration after a network failure. In contrast to path-protection routing, the operation mode of path restoration uses only the full bandwidth from the primary while finding an alternative path. The path-restoration routing problem can involve different network layers: physical (optical) and IP. Optical path restoration directly replaces the prefailure capacity at the transmission carrier signal level, which has no performance effects on the upper layers. On the other hand, IP path restoration dynamically reroutes the signals around failures using routing table updates or dynamic call-routing. An interior gateway protocol (IGP) is widely used to dynamically find an alternative path and to perform load sharing in traffic distribution.

Optical Path Restoration Problem

Iraschko and Grover [9] studied the path-restoration routing problem, where the task is to deploy a set of replacement signal paths between the two end nodes of the failed span, capable of yielding the maximum total amount of replacement capacity, while respecting the finite number of spare links on each span. They assume that, for given failure scenarios, a predefined set of distinct eligible routes is precomputed for the end node pairs (i.e., primary paths). The goal of the path-restoration routing problem is then to maximize the total of all restoration flow assignments for those primary paths, using only the commodities selected by the failure scenario and only the surviving spans in the reserve network. It is also required that the flow assignments made over all routes, for all simultaneously restored node pairs, not exceed the spare capacity of any span in the reserve network. The outcome of such an optimized design is a reserve network requiring minimal capacity. The ILP of the path-restoration routing problem can be formulated as follows. Define the following decision variables and parameters. Let i represent a failure scenario, such as a single span cut or a node loss.


D_i is the number of end node pairs that have lost one or more units of demand owing to failure i, and r is the index indicating these node pairs, r ∈ {1, …, D_i}. X^r_i is the number of demand units lost by end node pair r under failure i. Finally, the network is G(N, E, s), where N is the set of nodes, E is the set of spans, and s is the vector of spare capacities s_j on the spans. The ILP formulation is given by

max Σ_{r=1}^{D_i} Σ_{p=1}^{P^r_i} f^{r,p}_i    (20)

subject to

Σ_{p=1}^{P^r_i} f^{r,p}_i ≤ X^r_i,  ∀r ∈ {1, …, D_i},    (21)

Σ_{r=1}^{D_i} Σ_{p=1}^{P^r_i} δ^{r,p}_{i,j} f^{r,p}_i ≤ s_j,  ∀j ∈ E,    (22)

f^{r,p}_i ≥ 0 and integer,  ∀(r, p),    (23)

where f^{r,p}_i is a whole-number assignment of flow to the pth route available for restoration of node pair r under failure scenario i, P^r_i is the total number of eligible restoration routes available to node pair r for the restoration of failure i, and δ^{r,p}_{i,j} = 1 if span j is in the pth eligible route for restoration of node pair r in the event of failure scenario i, and 0 otherwise. (The restored flow of a pair cannot exceed its lost demand X^r_i, which is why (21) is an upper bound.)

IP Path-Restoration Problem

In IP-layer communication, as opposed to the above-mentioned optical path-restoration problem, the path-restoration problem is defined on the basis of the WDM protection model from Sect. "Shared Path-Protection Problem" (for the complete model, see [17]). The key ILP model for this IP path-restoration problem is given by

max Σ_{ij} Σ_{sd} λ^{ij}_{sd}    (24)

subject to

Σ_j Σ_{r∈R_ij} ρ^w_{ij,r} ≤ Trans^w_i,  ∀i, 1 ≤ w ≤ W,    (25)

Σ_i Σ_{r∈R_ij} ρ^w_{ij,r} ≤ Rec^w_j,  ∀j, 1 ≤ w ≤ W,    (26)

Σ_{ij} Σ_{r∈R_ij: k∈r} ρ^w_{ij,r} + m^w_k ≤ 1,  1 ≤ k ≤ E, 1 ≤ w ≤ W,    (27)

where λ^{ij}_{sd} denotes the traffic between node pair sd routed on lightpath ij (see [17] for the complete notation), the objective function in (24) is to minimize the average hop distance before a fault, the constraints in (25) and (26) ensure that node i uses at most Trans^w_i transmitters and node j uses at most Rec^w_j receivers on wavelength w, and the constraints in (27) ensure that wavelength w on link k is used either by a primary lightpath or by backup lightpaths.
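For concreteness, the optical path-restoration model (20)–(23) is compact enough to transcribe directly. The following is a minimal PuLP sketch for a single hypothetical failure scenario; the lost demands, eligible routes (given as span lists), and spare capacities are made-up illustration data.

```python
import pulp

# Hypothetical failure scenario i: lost demand X[r] per node pair r,
# eligible restoration routes per pair, and spare capacity per span.
X = {"r1": 2, "r2": 1}
routes = {"r1": [["j1", "j2"], ["j3"]], "r2": [["j2", "j3"]]}
spare = {"j1": 1, "j2": 2, "j3": 1}

prob = pulp.LpProblem("path_restoration", pulp.LpMaximize)
f = {(r, p): pulp.LpVariable(f"f_{r}_{p}", lowBound=0, cat="Integer")
     for r in routes for p in range(len(routes[r]))}

prob += pulp.lpSum(f.values())                                   # (20)
for r in routes:                                                 # (21)
    prob += pulp.lpSum(f[r, p] for p in range(len(routes[r]))) <= X[r]
for j in spare:                                                  # (22)
    prob += pulp.lpSum(f[r, p] for (r, p) in f
                       if j in routes[r][p]) <= spare[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({k: v.value() for k, v in f.items()})   # 2 units restorable here
```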

Concluding Remarks

In this article, we reviewed how ILP formulations are used in WDM optical networking. In optical networks, preprovisioning the network to support fast restoration after failures is critical, which means setting up physically disjoint backup paths (links) for the primary paths (links). Here, ILP formulations are valuable in finding the globally optimal backup paths among all possible alternative paths for the traffic demands of interest. A set of constraints in the ILP can be set up to represent the restoration flow balance, the link capacity, and the physical diversity of the primary and backup paths.

References
1. Bhandari R (1999) Survivable Networks: Algorithms for Diverse Routing. Kluwer, Norwell
2. Cha M, Chaovalitwongse WA, Ge Z, Yates J, Moon S (2006) Path Protection Routing with SRLG Constraints to Support IPTV in Mesh WDM Networks. IEEE Global Internet Symposium
3. Cha M, Choudhury G, Yates J, Shaikh A, Moon S (2006) Case Study: Resilient Backbone Network Design for IPTV Services. WWW IPTV Workshop
4. Coudert D, Datta P, Rivano H, Voge M-E (2005) Minimum Color Problems and Shared Risk Resource Group in Multilayer Networks. Research Report RR-2005-37-FR, I3S
5. Ellinas G, Bouillet E, Ramamurthy R, Labourdette J-F, Chaudhuri S, Bala K (2003) Routing and Restoration Architectures in Mesh Optical Networks. Opt Netw Mag 4(1):91–106
6. Fei A, Cui J, Gerla M, Cavendish D (2001) A "Dual-Tree" Scheme for Fault-Tolerant Multicast. IEEE ICC, June 2001
7. Grover WD (2003) Mesh-Based Survivable Networks. Prentice Hall, Upper Saddle River
8. Hu JQ (2003) Diverse Routing in Optical Mesh Networks. IEEE Trans Commun 51(3):489–494
9. Iraschko R, Grover W (2002) A Highly Efficient Path-Restoration Protocol for Management of Optical Network Transport Integrity. IEEE JSAC 18(5):779–793
10. Li C, McCormick ST, Simchi-Levi D (1992) Finding Disjoint Paths with Different Path Costs: Complexity and Algorithms. Networks 22:653–667


11. Li G, Kalmanek C, Doverspike R (2002) Fiber Span Failure Protection in Mesh Optical Networks. Opt Netw Mag 3(3):21–31
12. Medard M, Finn SG, Barry RA (1999) Redundant Trees for Preplanned Recovery in Arbitrary Vertex-Redundant or Edge-Redundant Graphs. IEEE/ACM ToN 7(5):641–652
13. Modiano E, Narula-Tam A (2002) Survivable Lightpath Routing: A New Approach to the Design of WDM-Based Networks. IEEE JSAC 20(4):800–809
14. Murthy CSR, Gurusamy M (2001) WDM Optical Networks. Prentice Hall, Englewood Cliffs
15. Ramamurthy S, Sahasrabuddhe L, Mukherjee B (2003) Survivable WDM Mesh Networks. J Lightwave Tech 21(11):870–883
16. Ramaswami R, Sivarajan K (1998) Optical Networks: A Practical Perspective. Morgan Kaufmann, Los Altos
17. Sahasrabuddhe L, Ramamurthy S, Mukherjee B (2002) Fault Management in IP-over-WDM Networks: WDM Protection Versus IP Restoration. IEEE JSAC 20(1):21–33
18. Stern TE, Bala K (1999) Multiwavelength Optical Networks: A Layered Approach. Addison-Wesley, Boston
19. Suurballe J, Tarjan R (1984) A Quick Method for Finding Shortest Pairs of Disjoint Paths. Networks 14:325–336
20. Xu D, Chen Y, Xiong Y, Qiao C, He X (2006) On the Complexity of and Algorithms for Finding the Shortest Path with a Disjoint Counterpart. IEEE/ACM ToN 14(1):147–158
21. Xu D, Xiong Y, Qiao C (2003) Protection with Multi-Segments (PROMISE) in Networks with Shared Risk Link Group (SRLG). IEEE/ACM ToN 11(2):248–258
22. Zang H, Ou C, Mukherjee B (2003) Path-Protection Routing and Wavelength Assignment (RWA) in WDM Mesh Networks Under Duct-Layer Constraints. IEEE/ACM ToN 11(2):248–258

Integer Programming
EGON BALAS
Graduate School of Industrial Administration, Carnegie Mellon University, Pittsburgh, USA

MSC2000: 90C10, 90C11, 90C27, 90C57

Article Outline
Keywords
Scope and Applicability
Combinatorial Optimization
Solution Methods
The State of the Art
See also
References


Keywords
Integer programming; Combinatorial optimization; Enumeration; Polyhedral methods; Disjunctive programming; Lift-and-project

One of the byproducts of World War II was the discovery that the routing of ship convoys, and other human activities like transportation, production, allocation, etc., could be modeled mathematically, i.e. the often intricate choices that they involve could be captured in a system of equations and inequalities and could be optimized according to some agreed-upon criterion. The simplest such model, involving only linear functions, became known as linear programming. Parallel developments led to the invention of the computer, which made it practical to solve linear programs of realistic size. A few years later the theory was extended to systems involving nonlinear convex functions. Convexity was needed to ensure that any local optimum is a global optimum.

An amazing variety of activities and situations can be adequately modeled as linear programs (LPs) or convex nonlinear programs (NLPs). Adequately means that the degree of approximation to reality that such a representation involves is acceptable. However, as the world that we inhabit is neither linear nor convex, the most common obstacle to the acceptability of these models is their inability to represent nonconvexities or discontinuities of different sorts. This is where integer programming comes into play: it is a universal tool for modeling nonconvexities of various kinds.

To illustrate, imagine a factory that produces two items and whose capacity is determined by the four linear constraints represented in Fig. 1. These inequalities, along with x₁ ≥ 0, x₂ ≥ 0 (only nonnegative amounts can be produced), define the feasible set, shown as the shaded area. Since the latter is a convex polyhedron, if profit is a linear function of the amounts produced, then an optimal (i.e. profit-maximizing) production plan will correspond to one of the vertices of the polyhedron. Imagine now that the following reasonable condition is imposed: for each item there is a threshold quantity below which it is not worth producing it. The threshold is b units for item 1 and d units for item 2. Furthermore, at least one of the two items must be produced. As shown in Fig. 2, the feasible set now consists


(nonlinear) integer programming problem, or simply an integer program (IP, linear unless otherwise stated). If only some of the variables are restricted to integer values, we have a mixed integer program (MIP). Such a problem can be stated as

min { cx : Ax ≥ b, x ≥ 0, x_j integer for j ∈ N₁ },

where N₁ is the index set of the integer-constrained variables. The condition that x₁ > 0 or x₂ > 0, imposed in the variant shown in Fig. 2, can be formulated by introducing two 0–1 variables, δ₁ and δ₂, and the constraints

bδ₁ ≤ x₁ ≤ cδ₁,  dδ₂ ≤ x₂ ≤ eδ₂,  δ₁ + δ₂ ≥ 1,  δ₁, δ₂ ∈ {0, 1}.
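A minimal sketch of this threshold construction in the Python library PuLP; the numerical values of b, c, d, e and the linear profit are hypothetical stand-ins for the quantities in the figures.

```python
import pulp

b, c_up, d, e = 2, 8, 3, 6       # hypothetical thresholds and capacities

prob = pulp.LpProblem("production_thresholds", pulp.LpMaximize)
x1 = pulp.LpVariable("x1", lowBound=0)
x2 = pulp.LpVariable("x2", lowBound=0)
d1 = pulp.LpVariable("delta1", cat="Binary")
d2 = pulp.LpVariable("delta2", cat="Binary")

prob += 3 * x1 + 2 * x2           # hypothetical linear profit
prob += x1 >= b * d1              # b*delta1 <= x1 <= c*delta1:
prob += x1 <= c_up * d1           # either x1 = 0 or b <= x1 <= c
prob += x2 >= d * d2              # d*delta2 <= x2 <= e*delta2
prob += x2 <= e * d2
prob += d1 + d2 >= 1              # at least one item is produced

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(x1.value(), x2.value())     # here both items at capacity: 8.0 6.0
```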

Next we present a few well-known pure and mixed integer models.

The fixed charge problem asks for the minimization, subject to linear constraints, of a function of the form Σ_i c_i(x_i), with

c_i(x_i) := f_i + c_i x_i if x_i > 0;  c_i(x_i) := 0 if x_i = 0.

Whenever x_i is bounded by U_i and f_i > 0 for all i, such a problem can be restated as a (linear) MIP by setting

c_i(x_i) = c_i x_i + f_i y_i,  x_i ≤ U_i y_i,  y_i ∈ {0, 1} for all i.

Clearly, when x_i > 0 then y_i is forced to 1, and when x_i = 0 the minimization of the objective function drives y_i to 0.

The facility location problem consists of choosing among m potential sites (and associated capacities) of facilities to serve n clients at a minimum total cost:

min Σ_{i=1}^{m} Σ_{j=1}^{n} c_ij x_ij + Σ_{i=1}^{m} f_i y_i
s.t. Σ_{i=1}^{m} x_ij = d_j,  j = 1, …, n,
     Σ_{j=1}^{n} x_ij ≤ a_i y_i,  i = 1, …, m,
     x_ij ≥ 0,  i = 1, …, m, j = 1, …, n,
     y_i ∈ {0, 1},  i = 1, …, m.

Here d_j is the demand of client j, a_i is the capacity of a potential facility to be located at site i, c_ij is the per-unit cost of shipments from facility i to client j, and f_i is the fixed cost of opening a facility of capacity a_i at location i. In any feasible solution, the indices i such that y_i = 1 designate the chosen locations for the facilities to be opened. A small solver sketch follows.
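As with most models in this article, the facility location problem is straightforward to hand to a MIP solver. The sketch below uses PuLP on a hypothetical instance with m = 2 sites and n = 3 clients.

```python
import pulp

# Hypothetical instance data.
fixed = [10, 12]             # f_i
cap = [15, 15]               # a_i
demand = [5, 7, 4]           # d_j
c = [[2, 4, 5], [3, 1, 2]]   # c_ij

m, n = len(fixed), len(demand)
prob = pulp.LpProblem("facility_location", pulp.LpMinimize)
x = [[pulp.LpVariable(f"x_{i}_{j}", lowBound=0) for j in range(n)]
     for i in range(m)]
y = [pulp.LpVariable(f"y_{i}", cat="Binary") for i in range(m)]

prob += (pulp.lpSum(c[i][j] * x[i][j] for i in range(m) for j in range(n))
         + pulp.lpSum(fixed[i] * y[i] for i in range(m)))
for j in range(n):                                  # meet every demand
    prob += pulp.lpSum(x[i][j] for i in range(m)) == demand[j]
for i in range(m):                                  # ship only from open sites
    prob += pulp.lpSum(x[i][j] for j in range(n)) <= cap[i] * y[i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([i for i in range(m) if y[i].value() > 0.5])  # chosen sites
```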


Variants of this problem include the uncapacitated facility location problem (where d_j = 1 for all j and the constraints involving the capacities can be replaced by x_ij ≤ y_i, i = 1, …, m, j = 1, …, n), the warehouse location problem (which considers cheap bulk shipments from plants to warehouses and expensive packaged shipments to retailers), and various emergency facility location problems (where one chooses locations to minimize the maximum distance traveled by any user of a facility, rather than the sum of travel costs).

The knapsack problem is an integer program with a single constraint:

max { cx : ax ≤ b, x ≥ 0 integer },

where c and a are positive n-vectors, while b is a positive scalar. When the variables are restricted to 0 or 1, we have the 0–1 knapsack problem.

A variety of situations can be fruitfully modeled as set covering problems: given a set M and a family of weighted subsets S₁, …, S_n of M, find a minimum-weight collection C of subsets whose union is M. If A is a 0–1 matrix whose rows correspond to the elements of M and whose columns are the incidence vectors of the subsets S₁, …, S_n, and c is the n-vector of subset weights, the problem can be stated as

min { cx : Ax ≥ 1, x ∈ {0, 1}ⁿ },

where 1 denotes the vector of all ones.

Integer Programming: Algebraic Methods

… and c · α_i > c · β_i.

Lemma 3 If G_c = {x^{α_i} − x^{β_i} : i = 1, …, t} is the reduced Gröbner basis of I_A with respect to c, then:
i) {x^{α_i} : i = 1, …, t} is the minimal generating set of the initial ideal in_c(I_A); and
ii) for each binomial x^{α_i} − x^{β_i} ∈ G_c, β_i is the unique optimal solution to the integer program IP_{A,c}(Aα_i).

Proof Part i) follows from the definition of reduced Gröbner bases. For each binomial x^{α_i} − x^{β_i} ∈ G_c we have Aα_i = Aβ_i, with α_i, β_i ∈ ℕⁿ and c · α_i > c · β_i. If β_i were a nonoptimal solution to IP_{A,c}(Aα_i), then x^{β_i} would lie in in_c(I_A) by Lemma 1, and hence some x^{α_j} for j = 1, …, t


would divide x^{β_i}, contradicting the definition of a reduced Gröbner basis.

The conditions in Lemma 3 are in fact also sufficient for a finite subset of binomials in I_A to be the reduced Gröbner basis of I_A with respect to c. Given f ∈ I_A, the normal form of f with respect to G_c is the unique remainder obtained upon dividing f by G_c. See [11] for details on the division algorithm in k[x]. The structure of G_c implies that the normal form of a monomial x^{v'} with respect to G_c is a monomial x^v such that both v and v' are solutions to IP_{A,c}(Av). The Conti–Traverso algorithm for IP_{A,c} can be summarized as follows.

1. Input: the matrix A and the cost vector c.
2. Pre-processing: find a generating set for the toric ideal I_A. Compute the reduced Gröbner basis G_c of I_A with respect to the cost vector c.
3. To solve IP_{A,c}(b): find a solution v to IP_{A,c}(b).
4. Compute the normal form x^{v*} of the monomial x^v with respect to the reduced Gröbner basis G_c. Then v* is the optimal solution to IP_{A,c}(b).

Conti–Traverso algorithm: how to solve programs in IP_{A,c}
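Step 4 is mechanical once a Gröbner basis (test set) is in hand. The self-contained sketch below illustrates it for the toy matrix A = [1 2 3] with c = (1, 1, 1), ties broken lexicographically with x₁ > x₂ > x₃; the four moves listed were checked by hand for this small case, whereas for a general A they would have to be computed with a dedicated tool (e.g., the packages discussed under "Computational Issues" below).

```python
# Reduced Groebner basis for A = [1 2 3], c = (1,1,1) refined by lex
# (x1 > x2 > x3), written as moves alpha -> beta with A*alpha == A*beta
# and alpha of higher (or lex-larger) cost than beta.
GB = [((2, 0, 0), (0, 1, 0)),   # x1^2  -> x2
      ((1, 1, 0), (0, 0, 1)),   # x1*x2 -> x3
      ((1, 0, 1), (0, 2, 0)),   # x1*x3 -> x2^2
      ((0, 3, 0), (0, 0, 2))]   # x2^3  -> x3^2

def normal_form(v, basis):
    """Reduce a feasible point v by the test set until no move applies;
    for a Groebner basis the result is independent of the move order."""
    v = list(v)
    applied = True
    while applied:
        applied = False
        for alpha, beta in basis:
            if all(v[i] >= alpha[i] for i in range(len(v))):
                v = [v[i] - alpha[i] + beta[i] for i in range(len(v))]
                applied = True
    return tuple(v)

# Solve IP_{A,c}(13): min x1+x2+x3 s.t. x1 + 2*x2 + 3*x3 = 13, x integer >= 0.
print(normal_form((13, 0, 0), GB))   # (0, 2, 3): optimal value 5
```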

Proof In order to prove the correctness of this algorithm, it suffices to show that for each solution v of IP_{A,c}(b), the normal form of x^v modulo G_c is the monomial x^{v*}, where v* is the unique optimal solution to IP_{A,c}(b). Suppose x^w is the normal form of the monomial x^v. Then w is also a solution to IP_{A,c}(b), since the exponent vectors of all monomials x^{w'} obtained during division of x^v by G_c satisfy b = Av = Aw', w' ∈ ℕⁿ. If w ≠ v*, then x^w − x^{v*} ∈ I_A and in_c(x^w − x^{v*}) = x^w since c · w > c · v*. This implies that x^w ∈ in_c(I_A) and hence can be further reduced by G_c, contradicting the definition of the normal form.

Computational Issues

The Conti–Traverso algorithm above raises several computational issues. In Step 1, we require a generating set of the toric ideal I_A, which can be a computationally challenging task as the size of A increases. The original Conti–Traverso algorithm starts with the ideal J_A := ⟨x_j t^{a_j⁻} − t^{a_j⁺} : j = 1, …, n; t₀t₁⋯t_d − 1⟩ in the larger polynomial ring k[t₀, …, t_d, x₁, …, x_n], where a_j = a_j⁺ − a_j⁻ is the jth column of the matrix A. The toric ideal is I_A = J_A ∩ k[x], and hence the reduced Gröbner basis of I_A with respect to c can be obtained by elimination (see [11, Chapt. 3]). Although conceptually simple, this method has its limitations as the size of A increases, since it requires d + 1 extra variables over those present in I_A, and the Buchberger algorithm for computing Gröbner bases [8] is sensitive to the number of variables involved. Two different algorithms for computing a generating set for I_A without introducing extra variables can be found in [5] and [18], respectively.

Once the generating set of I_A has been found, one needs to compute the reduced Gröbner basis G_c of I_A. This can be done by any computer algebra package that does Gröbner basis computations, such as Macaulay2, Maple, Reduce, Singular, or CoCoA, to name a few. CoCoA has a dedicated implementation for toric ideals [6]. As the size of the problem increases, a straightforward computation of reduced Gröbner bases of I_A can become expensive and even impossible. Several tricks can be applied to help the computation, many of which are problem specific. In Step 3 of the Conti–Traverso algorithm above, one requires an initial solution to IP_{A,c}(b). The original Conti–Traverso algorithm achieves this indirectly during the elimination procedure. Theoretically this task can be as hard as solving IP_{A,c}(b) itself, although in practice this depends on the specific problem at hand. The last step – computing the normal form of a monomial with respect to the current reduced Gröbner basis – is (relatively speaking) a computationally easy task.

In practice, one is often only interested in solving IP_{A,c}(b) for a fixed b. In this situation, the Buchberger algorithm can be truncated to produce a sufficient set of binomials that will solve this integer program [35]. This idea was originally introduced in [36] in the context of 0–1 integer programs in which all the data is nonnegative. See also [10]. A 'nontoric algorithm' for solving integer programs with fixed right-hand sides was recently proposed in [4].

Test Sets in Integer Programming

A geometric interpretation of the Conti–Traverso algorithm above, and more generally of the Buchberger


algorithm for toric ideals, can be found in [34]. A test set for IP_{A,c} is a finite subset of vectors in ker_Z(A) such that for an integer program IP_{A,c}(b) and a nonoptimal solution v to this program, there is some u in the test set such that c · v > c · (v − u). By interpreting a binomial x^{α_i} − x^{β_i} ∈ G_c as the vector α_i − β_i ∈ ker_Z(A), it can be seen that G_c is the unique minimal test set for the family IP_{A,c}. A closely related test set for integer programming is the set of neighbors of the origin introduced by H.E. Scarf [26].

The binomial x^{α_i} − x^{β_i} ∈ G_c can also be viewed as the line segment [α_i, β_i] directed from α_i to β_i. For each b ∈ pos_Z(A) we now construct a directed graph F_{b,c} as follows: the vertices of this graph are the solutions to IP_{A,c}(b), and the edges are all possible directed line segments from G_c that connect two vertices of the graph. Then G_c is a necessary and sufficient set of directed line segments such that F_{b,c} is a connected graph with a unique sink (at the optimal solution) for each b ∈ pos_Z(A). This geometric interpretation of G_c can be used to solve several problems. By reversing the directions of all edges in F_{b,c}, one obtains a directed graph with a unique root. One can enumerate all lattice points in P_b by searching this graph starting at its root. This idea was used in [33] to solve a class of manufacturing problems. The graphs F_{b,c} provide a way to connect all the feasible solutions of an integer program by lattice moves. This idea was applied to statistical sampling in [13].

Universal Gröbner Bases

A subset U_A of I_A is a universal Gröbner basis for I_A if U_A contains a Gröbner basis of I_A with respect to every (generic) cost vector c ∈ ℝⁿ. The Graver basis of A [16] is a finite universal Gröbner basis of I_A that can be described as follows. For each σ ∈ {+, −}ⁿ, let H_σ be the unique minimal generating set (over ℕ) of the semigroup ker_Z(A) ∩ ℝⁿ_σ. Then the Graver basis is Gr_A := ∪_σ H_σ \ {0}. An algorithm to compute Gr_A can be found in [30]. It was shown in [34] that all reduced Gröbner bases of I_A are contained in Gr_A, which implies that there are only finitely many distinct reduced Gröbner bases for I_A as c varies over generic cost vectors. Let UGB_A denote the union of all the distinct reduced Gröbner bases of I_A. Then UGB_A is a universal Gröbner basis of I_A that is contained in the Graver basis Gr_A. The


following theorem from [30] characterizes the elements of UGB_A and thus allows one to test whether a binomial x^{α_i} − x^{β_i} ∈ Gr_A belongs to UGB_A. A second test can also be found in [30]. A vector u ∈ Zⁿ is said to be primitive if the g.c.d. of its components is one.

Theorem 4 For a primitive vector u ∈ ker_Z(A), the binomial x^{u⁺} − x^{u⁻} belongs to UGB_A if and only if the line segment [u⁺, u⁻] is a primitive edge in the polytope P_{Au⁺}.

The degree of a binomial x^{α_i} − x^{β_i} ∈ I_A is defined to be Σ_j α_{ij} + Σ_j β_{ij}. The degree of the universal Gröbner basis UGB_A is then simply the maximum degree of any binomial in UGB_A. This number is an important complexity measure for the family of integer programs that have A as coefficient matrix. The current best bound for the degree of UGB_A is as follows. See [29, Chapt. 4] for a full discussion.

Theorem 5 The degree of a binomial x^{α_i} − x^{β_i} ∈ UGB_A is at most (n − d)(d + 1)D(A), where D(A) is the maximum absolute value of the determinant of a d × d submatrix of A.

It has been conjectured that this bound can be improved to (d + 1)D(A), and some partial results in this direction can be found in [17].

The universal Gröbner bases of several special instances of A have been investigated in the literature, a few of which we mention here. For the family of 1 × n matrices A(n) := [1, …, n] it was shown in [12] that the Graver basis of A(n) is in bijection with the primitive partition identities with largest part n. A matrix A ∈ Z^{d×n} is unimodular if the absolute values of the determinants of all its nonsingular maximal minors are the same positive constant. For u ∈ ker_Z(A), the binomial x^{u⁺} − x^{u⁻} ∈ I_A is a circuit of A if u is primitive and has minimal support with respect to inclusion. Let C_A denote the set of circuits of A. Then, in general, C_A ⊆ UGB_A ⊆ Gr_A. If A is unimodular, then all of the above containments hold at equality, although the converse is false: there are nonunimodular matrices for which C_A = Gr_A. If A_n is the node–edge incidence matrix of the complete graph K_n, then the elements in UGB_{A_n} can be identified with certain subgraphs of K_n. Gröbner bases of these matrices were investigated in [23]. The integer programs associated with A_n are the b-matching


problems in the literature [24]. See [29, Chapt. 14] for some other specific examples of Gröbner bases.

Variation of Cost Functions in Integer Programming

We now consider all cost vectors in ℝⁿ (not just the generic ones) and study the effect of varying them. As seen earlier, I_A has only finitely many distinct reduced Gröbner bases as c varies over the generic cost vectors. We say that two cost vectors c¹ and c² are equivalent with respect to IP_A if for each b ∈ pos_Z(A), the integer programs IP_{A,c¹}(b) and IP_{A,c²}(b) have the same set of optimal solutions. The Gröbner basis approach to integer programming allows a complete characterization of the structure of these equivalence classes of cost vectors.

Theorem 6 [30]
i) There exist only finitely many equivalence classes of cost vectors with respect to IP_A.
ii) Each equivalence class is the relative interior of a convex polyhedral cone in ℝⁿ.
iii) The collection of all these cones defines a complete polyhedral fan in ℝⁿ, called the Gröbner fan of A.
iv) Let db denote any probability measure with support pos_Z(A) such that ∫_b b db < ∞. Then the Minkowski integral St(A) = ∫_b P_b db is an (n − d)-dimensional convex polytope, called the state polytope of A. The normal fan of St(A) equals the Gröbner fan of A.

Gröbner fans and state polytopes of graded polynomial ideals were introduced in [25] and [2], respectively. For a toric ideal, both these entities have self-contained construction methods that are rooted in the combinatorics of these ideals [30]. For a software system for computing Gröbner fans of toric ideals see [21]. We call P_b for b ∈ pos_Z(A) a Gröbner fiber of A if there is some x^{u⁺} − x^{u⁻} ∈ UGB_A such that b = Au⁺ = Au⁻. Since there are only finitely many elements in UGB_A, the matrix A has only finitely many Gröbner fibers. The Minkowski sum of all Gröbner fibers of A is a state polytope of A. For a survey of algorithms to construct state polytopes and Gröbner fans of graded polynomial ideals see [29, Chapt. 2, 3]. The Gröbner fan of A provides a model for global sensitivity analysis for the family of integer programs IP_{A,c}.

We now briefly discuss a theory analogous to the above for linear programming, based on results in [7] and [14]. For a comparison of integer and linear programming from this point of view see [30]. Let LP_{A,c}(b) := min{c · x : Ax = b, x ≥ 0}, where A and c are as before and b is any vector in the rational polyhedral cone pos(A) := {Ax : x ≥ 0}. We define two cost vectors c¹ and c² to be equivalent with respect to LP_A if the linear programs LP_{A,c¹}(b) and LP_{A,c²}(b) have the same set of optimal solutions for all b ∈ pos(A).

Let A := {a₁, …, a_n} be the vector configuration in Z^d consisting of the columns of A. For a subset σ ⊆ A, we let pos(σ) denote the cone generated by σ. A polyhedral subdivision Δ of A is a collection of subsets of A such that {pos(σ) : σ ∈ Δ} is the set of cones of a polyhedral fan whose support is pos(A). The elements of Δ are called the faces or cells of Δ. For convenience we identify A with the set of indices [n] and any subset of A with the corresponding subset σ ⊆ [n]. A cost vector c ∈ ℝⁿ induces the regular subdivision Δ_c of A [7,14] as follows: σ is a face of Δ_c if there exists a vector y ∈ ℝ^d such that a_j · y = c_j whenever j ∈ σ and a_j · y < c_j otherwise. A cost vector c ∈ ℝⁿ is said to be generic with respect to LP_A if every linear program in the family LP_{A,c} has a unique optimal solution. When c is generic for LP_A, the regular subdivision Δ_c is in fact a triangulation, called the regular triangulation of A with respect to c. Two cost vectors c¹ and c² are equivalent with respect to LP_A if and only if Δ_{c¹} = Δ_{c²}. The equivalence class of c with respect to LP_A is hence {c' ∈ ℝⁿ : Δ_{c'} = Δ_c}, which is the relative interior of a polyhedral cone in ℝⁿ called the secondary cone of c, denoted S_c. The cone S_c is n-dimensional if and only if c is generic with respect to LP_A. The set of all equivalence classes of cost vectors fit together to form a complete polyhedral fan in ℝⁿ called the secondary fan of A. This fan is the normal fan of a polytope called the secondary polytope of A. See [7] for construction methods for both the secondary fan and the secondary polytope of A.

We conclude this section by showing that the Gröbner and secondary fans of A are related. The Stanley–Reisner ideal of Δ_c is the square-free monomial ideal ⟨x_{i₁} ⋯ x_{i_r} : {i₁, …, i_r} is a nonface of Δ_c⟩ ⊆ k[x].

Theorem 7 [28] The radical of the initial ideal in_c(I_A) is the Stanley–Reisner ideal of the regular triangulation Δ_c.


Corollary 8 [28]
i) The Gröbner fan of A is a refinement of the secondary fan of A.
ii) A secondary polytope of A is a summand of a state polytope of A.

Corollary 8 reaffirms the view that integer programming is an arithmetic refinement of linear programming.

Group Relaxations in Integer Programming

We now investigate group relaxations of integer programs in the family IP_{A,c} from an algebraic point of view. The results in this section are taken from [19,20] and [32], sometimes after an appropriate translation into polyhedral language. See these papers for the algebraic motivations that led to these results. The group relaxation of IP_{A,c}(b) [15] is the program

Group_σ(b) := min{ c̃_τ · x_τ : A_σ x_σ + A_τ x_τ = b, x_τ ≥ 0, x = (x_σ, x_τ) ∈ Zⁿ },

where A_σ, the submatrix of A whose columns are indexed by σ ⊆ [n], is the optimal basis of the linear program LP_{A,c}(b), τ = [n] \ σ, and c̃_τ = c_τ − c_σ A_σ⁻¹ A_τ. Here the cost vector c has also been partitioned as c = (c_σ, c_τ) using the set σ ⊆ [n].

Definition 9 Suppose L is any sublattice of Zⁿ, w ∈ ℝⁿ and v ∈ ℕⁿ. The lattice program Lat_{L,w}(v) defined by this data is

min{ w · x : x ≡ v (mod L), x ∈ ℕⁿ }.

I

Integer Programming: Branch and Bound Methods, Table 1 General branch and bound algorithm

1 (Initialization): Set L = {IP⁰}, z̄⁰ = +∞, and z_ip = −∞.
2 (Termination): If L = ∅, then the solution x* which yielded the incumbent objective value z_ip is optimal. If no such x* exists (i.e., z_ip = −∞), then (IP) is infeasible.
3 (Problem selection and relaxation): Select and delete a problem IP^i from L. Solve a relaxation of IP^i. Let z_R^i denote the optimal objective value of the relaxation, and let x^{iR} be an optimal solution if one exists. (Thus, z_R^i = c x^{iR}, or z_R^i = −∞.)
4 (Fathoming and pruning):
 i) If z_R^i ≤ z_ip, go to Step 2.
 ii) If z_R^i > z_ip and x^{iR} is integral feasible, update z_ip = z_R^i. Delete from L all problems with z̄^i ≤ z_ip. Go to Step 2.
5 (Partitioning): Let {S^{ij}}_{j=1}^{k} be a partition of the constraint set S^i of problem IP^i. Add the problems {IP^{ij}}_{j=1}^{k} to L, where IP^{ij} is IP^i with feasible region restricted to S^{ij} and z̄^{ij} = z_R^i for j = 1, …, k. Go to Step 2.
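A compact rendering of Table 1 in Python may make the bookkeeping concrete. The sketch below is a depth-first variant with variable-dichotomy branching; the instance data are hypothetical, PuLP's bundled CBC is used merely as the LP solver for each relaxation (all variables are kept continuous), and the feasible region is assumed bounded.

```python
import math
import pulp

def branch_and_bound(c, A, b, int_idx):
    """Maximize c'x s.t. Ax <= b, x >= 0, x_j integer for j in int_idx."""
    n = len(c)
    z_ip, x_best = float("-inf"), None
    active = [{}]                      # Step 1: a node = extra bounds {j: (lb, ub)}
    while active:                      # Step 2: stop when the list is empty
        bnds = active.pop()            # Step 3: select (depth-first) and relax
        prob = pulp.LpProblem("relax", pulp.LpMaximize)
        x = [pulp.LpVariable(f"x{j}",
                             lowBound=bnds.get(j, (0, None))[0],
                             upBound=bnds.get(j, (0, None))[1])
             for j in range(n)]
        prob += pulp.lpSum(c[j] * x[j] for j in range(n))
        for row, rhs in zip(A, b):
            prob += pulp.lpSum(row[j] * x[j] for j in range(n)) <= rhs
        if pulp.LpStatus[prob.solve(pulp.PULP_CBC_CMD(msg=False))] != "Optimal":
            continue                   # Step 4: fathom an infeasible node
        z = pulp.value(prob.objective)
        if z <= z_ip:
            continue                   # Step 4i: fathom by bound
        xv = [v.value() for v in x]
        frac = [j for j in int_idx if abs(xv[j] - round(xv[j])) > 1e-6]
        if not frac:                   # Step 4ii: integral -> new incumbent
            z_ip, x_best = z, xv
            continue
        j = max(frac, key=lambda j: min(xv[j] % 1, 1 - xv[j] % 1))
        lb, ub = bnds.get(j, (0, None))        # Step 5: variable dichotomy
        active.append({**bnds, j: (lb, math.floor(xv[j]))})
        active.append({**bnds, j: (math.ceil(xv[j]), ub)})
    return z_ip, x_best

# Hypothetical instance: max 13x1 + 8x2, x1 + 2x2 <= 10, 5x1 + 2x2 <= 20.
print(branch_and_bound([13, 8], [[1, 2], [5, 2]], [10, 20], [0, 1]))
# expected: (58.0, [2.0, 4.0])
```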

The actual implementation of a branch and bound algorithm is typically viewed as a tree search, where the problem at the root node of the tree is the original (IP). The tree is constructed in an iterative fashion, with new nodes formed by branching on an existing node for which the optimal solution of the relaxation is fractional (i.e., some of the integer-restricted variables have fractional values). Typically, two child nodes are formed by selecting a fractional-valued variable and adding appropriate constraints in each child subproblem to ensure that the associated constraint sets do not include solutions for which this chosen branching variable assumes the same fractional value. The phrase fathoming a node is used in reference to criteria that imply that a node need not be explored further. As indicated in Step 4, these criteria include:
a) the objective value of the subproblem relaxation at the node is less than or equal to the incumbent objective value; and


b) the solution for the subproblem relaxation is integer valued.
Note that a) includes the case when the relaxation is infeasible, since in that case its objective value is −∞. Condition b) provides an opportunity to prune the tree, effectively fathoming nodes for which the objective value of the relaxation is less than or equal to the updated incumbent objective value. The tree search ends when all nodes are fathomed.

A variety of strategies have been proposed for intelligently selecting branching variables, for problem partitioning, and for selecting nodes to process. However, no single collection of strategies stands out as being best in all cases. In the remainder of this article, some of the strategies that have been implemented or proposed are summarized. An illustrative example is presented. Some of the related computational strategies – preprocessing and reformulation, heuristic procedures, and the concept of reduced cost fixing – which have proved to be highly effective in branch and bound implementations are considered. Finally, there is a discussion of recent linear programming based branch and bound algorithms that have employed interior point methods for the subproblem relaxation solver, in contrast to the more traditional simplex-based solvers.

Though branch and bound is a classic approach for solving integer programs, there are practical limitations to its success in applications. Often integer feasible solutions are not readily available, and node pruning becomes impossible. In this case, branch and bound fails to find an optimal solution due to memory explosion as a result of excessive accumulation of active nodes. In fact, general integer programs are NP-hard, and consequently, as of this writing (1998), there exists no known polynomial time algorithm for solving general integer programs [30].

In 1983, a breakthrough in the computational possibilities of branch and bound came as a result of the research by H. Crowder, E.L. Johnson, and M.W. Padberg. In their paper [22], cutting planes were added at the root node to strengthen the LP formulation before branch and bound was called. In addition, features such as reduced cost fixing, heuristics and preprocessing were added within the tree search algorithm to facilitate the solution process. See Integer Programming: Cutting Plane Algorithms for

details on cutting plane applications to integer programming. Since branch and bound itself is an inherently parallel technique, there has been active research among the computer science and operations research communities in developing parallel algorithms to improve its solution capability. Most commercial integer programming solvers use a branch and bound algorithm with linear programming relaxations. Unless otherwise mentioned, the descriptions of the strategies discussed herein are based on using the linear programming relaxation. See [48] for references not listed here; [51] also includes useful material about branch and bound.

Partitioning Strategies

When linear programming relaxation is employed, partitioning is done via the addition of linear constraints. Typically, two new nodes are formed on each division. Suppose x^R is an optimal solution to the relaxation of a branch and bound node. Common partitioning strategies include:
• Variable dichotomy [23]. If x_j^R is fractional, then two new nodes are created, one with the simple bound x_j ≤ ⌊x_j^R⌋ and the other with x_j ≥ ⌈x_j^R⌉, where ⌊·⌋ and ⌈·⌉ denote the floor and the ceiling of a real number. In particular, if x_j is restricted to be binary, then the branching reduces to fixing x_j = 0 and x_j = 1, respectively. One advantage of simple bounds is that they maintain the size of the basis among branch and bound nodes, since the simplex method can be implemented to handle both upper and lower bounds on variables without explicitly increasing the dimensions of the basis.
• Generalized-upper-bound dichotomy (GUB dichotomy) [8]. If the constraint Σ_{j∈Q} x_j = 1 is present in the original integer program, and x_i^R, i ∈ Q, are fractional, one can partition Q = Q₁ ∪ Q₂ such that Σ_{j∈Q₁} x_j^R and Σ_{j∈Q₂} x_j^R are approximately of equal value. Then two branches can be formed by setting Σ_{j∈Q₁} x_j = 0 and Σ_{j∈Q₂} x_j = 0, respectively.
• Multiple branches for a bounded integer variable. If x_j^R is fractional and x_j ∈ {0, …, l}, then one can create l + 1 new nodes, with x_j = k for node k, k = 0, …, l. This idea was proposed in the first branch and


bound algorithm by A.H. Land and A.G. Doig [39], but currently (1998) is not commonly used.

Branching Variable Selection

During the partitioning process, branching variables must be selected to help create the children nodes. Clearly the choice of a branching variable affects the running time of the algorithm. Many different approaches have been developed and tested on different types of integer programs. Some common approaches are listed below:
• Most/least infeasible integer variable. In this approach, the integer variable whose fractional value is farthest from (closest to) an integral value is chosen as the branching variable.
• Driebeck–Tomlin penalties [25,57]. Penalties give a lower bound on the degradation of the objective value for branching in each direction from a given variable. The penalties are the cost of the dual pivot needed to remove the fractional variable from the basis. If many pivots are required to restore primal feasibility, these penalties are not very informative. The up penalty, when forcing the value of the kth basic variable up, is

u_k = min_{j: ā_kj < 0} (1 − f_k) |c̄_j| / |ā_kj|,

and the down penalty, when forcing it down, is

d_k = min_{j: ā_kj > 0} f_k |c̄_j| / ā_kj,

where c̄_j is the reduced cost of nonbasic variable x_j and ā_kj is the corresponding entry of row k of the updated tableau.

Once the penalties have been computed, a variety of rules can be used to select the branching variable (e.g., max_k max(u_k, d_k), or max_k min(u_k, d_k)). A penalty can be used to eliminate a branch if the LP objective value for the parent node minus the penalty is worse than the incumbent integer solution. Penalties are out of favor because their cost is considered too high for their benefit.
• Pseudocost estimate. Pseudocosts provide a way to estimate the degradation to the objective value caused by forcing a fractional variable to an integral value. The technique was introduced in 1970 by M. Benichou et

al. [10]. Pseudocosts attempt to reflect the total cost, not just the cost of the first pivot, as with penalties. Once a variable x_k is labeled as a candidate branching variable, the pseudocosts are computed as

U_k = (z_k − z_k^u) / (1 − f_k)  and  D_k = (z_k − z_k^d) / f_k,

where z_k is the objective value of the parent, z_k^u is the objective value resulting from forcing up, and z_k^d is the objective value from forcing down. (If the subproblem is infeasible, the associated pseudocost is not calculated.) If a variable has been branched upon repeatedly, an average may be used. The branching variable is chosen as that with the maximum degradation, where the degradation is computed as D_k f_k + U_k (1 − f_k). Pseudocosts are not considered to be beneficial on problems where there is a large percentage of integer variables.
• Pseudoshadow prices. Similar to pseudocosts, pseudoshadow prices estimate the total cost of forcing a variable to an integral value. Up and down pseudoshadow prices for each constraint and pseudoshadow prices for each integer variable are specified by the user or given an initial value. The degradation in the objective function for forcing an integer variable x_k up or down to an integral value can then be estimated. The branching variable is chosen using criteria similar to penalties and pseudocosts. See [27,40] for precise mathematical formulations of this approach.
• Strong branching. This branching strategy arose in connection with research on solving difficult instances of the traveling salesman problem and general mixed zero-one integer programming problems [2,12,13]. Applied to zero-one integer programs within a simplex-based branch and cut setting, strong branching works as follows. Let N and K be positive integers. Given the solution of some linear programming relaxation, make a list of N binary variables that are fractional and closest to 0.5 (if there are fewer than N fractional variables, take all fractional variables). Suppose that I is the index set of this list. Then, for each i ∈ I, fix x_i first to 0 and then to 1 and perform K iterations (starting with the optimal basis for the LP relaxation of the current node) of the dual simplex method with steepest edge pricing. Let L_i, U_i, i ∈ I, be the objective values


that result from these simplex runs, where L_i corresponds to fixing x_i to 0 and U_i to fixing it to 1. A branching variable can be selected based on the best weighted sum of these two values.
• Priorities selection. Variables are selected based on their priorities. Priorities can be user-assigned, or based on objective function coefficients, or on pseudocosts.

Node Selection

Given a list of active problems, one has to decide which subproblem should be selected to be examined next. This in turn will affect the possibility of improving the incumbent, the chance of node fathoming, and the total number of problems that need to be solved before optimality is achieved. Below, various strategies given in [7,10,11,20,27,29,31,47] are presented; a small scoring sketch follows the list.
• Depth-first search with backtracking. Choose a child of the previous node as the next node; if it is pruned, choose the other child. If this node is also pruned, choose the most recently created unexplored node, which will be the other child node of the last successful node.
• Best bound. Among all unexplored nodes, choose the one which has the best LP objective value. In the case of maximization, the node with the largest LP objective value will be chosen. The rationale is that since nodes can only be pruned when the relaxation objective value is less than the current incumbent objective value, the node with the largest LP objective value cannot be pruned, since the best objective value corresponding to an integer feasible solution cannot exceed this largest value.
• Sum of integer infeasibilities. The sum of infeasibilities at a node is calculated as s = Σ_j min(f_j, 1 − f_j).

Choose the node with either the maximum or the minimum sum of integer infeasibilities.
• Best estimate using pseudocosts. This technique was introduced [10] along with the idea of using pseudocosts to select a branching variable. The individual pseudocosts can be used to estimate the resulting integer objective value attainable from node k:

E_k = z_k − Σ_i min(D_i f_i, U_i (1 − f_i)),

where z_k is the value of the LP relaxation at node k. The node with the best estimate is chosen.
• Best estimate using pseudoshadow prices. Pseudoshadow prices can also be used to provide an estimate of the resulting integer objective value attainable from the node, and the node with the best estimate can then be chosen.
• Best projection [29,47]. Choose the node among all unexplored nodes which has the best projection. The projection is an estimate of the objective function value associated with an integer solution obtained by following the subtree starting at this node. It takes into account both the current objective function value and a measure of the integer infeasibility. In particular, the projection p_k associated with node k is defined as

p_k = z_k − s_k (z_0 − z_ip) / s_0,

where z_0 denotes the objective value of the LP at the root node, z_ip denotes an estimate of the optimal integer solution, and s_k denotes the sum of the integer infeasibilities at node k. The projection is a weighting between the objective function and the sum of infeasibilities. The weight (z_0 − z_ip)/s_0 corresponds to the slope of the line between node 0 and the node producing the optimal integer solution. It can be thought of as the cost to remove one unit of infeasibility. Let n_k be the number of integer infeasibilities at node k. A more general projection formula is to let w_k = α n_k + (1 − α) s_k, where α ∈ [0, 1], and define

p_k = z_k − w_k (z_0 − z_ip) / w_0.
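The two estimate-based rules translate directly into code. A minimal sketch, with all numbers hypothetical:

```python
def best_estimate(z_k, fracs, U, D):
    """Pseudocost estimate E_k = z_k - sum_i min(D_i*f_i, U_i*(1 - f_i));
    fracs maps variable index i to its fractional part f_i."""
    return z_k - sum(min(D[i] * f, U[i] * (1 - f)) for i, f in fracs.items())

def best_projection(z_k, s_k, z0, z_ip, s0):
    """Projection p_k = z_k - s_k*(z0 - z_ip)/s0."""
    return z_k - s_k * (z0 - z_ip) / s0

# Choose the active node with the largest projection (maximization).
nodes = {1: (57.0, 0.9), 2: (60.5, 1.4)}          # k -> (z_k, s_k)
z0, z_ip, s0 = 62.5, 52.0, 1.25
best = max(nodes, key=lambda k: best_projection(*nodes[k], z0, z_ip, s0))
print(best)   # node 1 in this toy data: its lower infeasibility wins
```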

Example 1 Here, a two-variable integer program is solved using branch and bound. The most infeasible integer variable is used as the branching variable, and best bound is used for node selection. Consider the problem

max 13x₁ + 8x₂
s.t. x₁ + 2x₂ ≤ 10, …

… a_j > b for some j ∈ K, replace a_j by b. A stronger version of this procedure is possible when the problem formulation involves other constraints of appropriate structure.
7) Logical implications and probing:
a) Logical implications: Choose a binary variable x_k and fix it to 0 or 1. Perform 4). This analysis may yield logical implications such as x_k = 1 implies x_j = 0, or x_k = 1 implies x_j = 1, for some other variable x_j. The implied equality is then added as an explicit constraint.
b) Probing: Perform logical implications recursively. An efficient implementation of probing appears to be very difficult. Details of computational issues regarding probing are discussed in [33] and [54].

Heuristics

Heuristic procedures provide a means for obtaining integer feasible solutions quickly, and can be used repeatedly within the branch and bound search tree. A good heuristic – one that produces good integer feasible solutions – is a crucial component of the branch and bound algorithm, since it provides a bound for reduced cost fixing (discussed later) at the root, and thus allows a reduction in the size of the linear program that must be solved. This in turn may reduce the time required to solve subsequent linear programs at nodes within the search tree. In addition, a good incumbent bound increases the likelihood of being able to fathom active nodes, which is extremely important when solving large scale integer programs, as they tend to create many active nodes, leading to memory explosion.

Broadly speaking, five ideas are commonly used in developing heuristics. The first idea is that of greediness. Greedy algorithms work by successively choosing variables based on the best improvement in the objective value. Kruskal's algorithm [37], which is an exact algorithm for finding the minimum-weight spanning tree in a graph, is one of the most well-known greedy

algorithms. Greedy algorithms have been applied to a variety of problems, including 0–1 knapsack problems [36,41,53], uncapacitated facility location problems [38,56], set covering problems [3,4], and the traveling salesman problem [52].

A second idea is that of local search, which involves searching a local neighborhood of a given integer feasible solution for a feasible solution with a better objective value. The k-interchange heuristic is a classic example of a local search heuristic [38,44,46]. Simulated annealing is another example, but with a bit of a twist: it allows, with a certain probability, updated solutions with less favorable objective values in order to increase the likelihood of escaping from a local optimum [16].

Randomized enumeration is a third idea that is used to obtain integer feasible solutions. One such method is that of genetic algorithms (cf. Genetic Algorithms), where the randomness is modeled on the biological mechanisms of evolution and natural selection [32]. Recent work on applying a genetic algorithm to the set covering problem can be found in [9].

The term primal heuristics refers to certain LP-based procedures for constructing integral feasible solutions from points that are in some sense good, but fail to satisfy integrality. Typically, these nonintegral points are obtained as optimal solutions of LP relaxations. Primal heuristic procedures involve successive variable fixing and rounding (according to rules usually governed by problem structure) and subsequent re-solves of the modified primal LP [6,12,14,34,35].

The fifth general principle is that of exploiting the interplay between primal and dual solutions. For example, an optimal or heuristic solution to the dual of an LP relaxation may be used to construct a heuristic solution for the primal (IP). Problem-dependent criteria based on the generated primal–dual pair may suggest seeking an alternative heuristic solution to the dual, which would then be used to construct a new heuristic solution to the primal. Iterating back and forth between primal and dual heuristic solutions would continue until an appropriate termination condition is satisfied [21,26,28]. As a concrete instance of the first idea, a greedy sketch for set covering follows.
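This is a minimal rendering of the classical greedy rule for weighted set covering (repeatedly pick the subset with the smallest weight per newly covered element); the instance data are hypothetical.

```python
def greedy_set_cover(universe, subsets, weights):
    """Greedy heuristic; assumes the union of all subsets covers the universe."""
    uncovered, chosen = set(universe), []
    while uncovered:
        # cheapest cost per newly covered element
        j = min((j for j in range(len(subsets)) if subsets[j] & uncovered),
                key=lambda j: weights[j] / len(subsets[j] & uncovered))
        chosen.append(j)
        uncovered -= subsets[j]
    return chosen

print(greedy_set_cover({1, 2, 3, 4},
                       [{1, 2}, {2, 3}, {3, 4}, {1, 4}],
                       [1, 1, 1, 1.5]))   # -> [0, 2]
```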

Integer Programming: Branch and Bound Methods

obtained, the algorithm performs a local search in an attempt to obtain a better integer feasible solution [5]. Obviously, within a branch and bound implementation, the structure of the problems that the implementation is targeted at influences the design of an effective heuristic [2,12,13,14,22,26,34,35,43]. Continuous Reduced Cost Implications Reduced cost fixing is a well-known and important idea in the literature of integer programming [22]. Given an optimal solution to an LP relaxation, the reduced costs c j are nonpositive for all nonbasic variables xj at lower bound, and nonnegative for all nonbasic variables at their upper bounds. Let xj be a nonbasic variable in a continuous optimal solution having objective value zLP , and let z i p be the objective value associated with an integer feasible solution to (IP). The following are true: a) If xj is at its lower bound in the continuous solution and z LP  z i p  c j , then there exists an optimal solution to the integer program with xj at its lower bound. b) If xj is at its upper bound in the continuous solution and z LP  z i p  c j , then there exists an optimal solution to the integer program with xj at its upper bound. When reduced cost fixing is applied to the root node of a branch and bound tree, variables which are fixed can be removed from the problem, resulting in a reduction in the size of the integer program. A variety of studies have examined the effectiveness of reduced cost fixing within the branch and bound tree search [12,14,22,34,35,49,50]. Subproblem Solver When linear programs are employed as the relaxations within a branch and bound algorithm, it is common to use a simplex-based algorithm to solve each subproblem, using dual simplex to reoptimize from the optimal basis of the parent node. This technique of advanced basis has been shown to reduce the number of simplex iterations to solve the child node to optimality, and thus speedup the overall computational effort. Recently with the advancement in computational technology, the increase in the size of integer programs, and the success of interior point methods (cf. also  Linear programming: Interior point methods) to solve large scale linear programs [1,45] there are some branch and bound

I

algorithms employing interior point algorithms as the linear programming solver [17,42,43,55]. In this case, advanced basis is no longer available and care has to be taken to take advantage of warmstart vectors for the interior point solver so as to facilitate effective computational results. In [42,43], a description of the ideas of ‘advanced warmstart’ and computational results are presented. See also  Branch and Price: Integer Programming with Column Generation  Decomposition Techniques for MILP: Lagrangian Relaxation  Integer Linear Complementary Problem  Integer Programming  Integer Programming: Algebraic Methods  Integer Programming: Branch and Cut Algorithms  Integer Programming: Cutting Plane Algorithms  Integer Programming Duality  Integer Programming: Lagrangian Relaxation  LCP: Pardalos-Rosen Mixed Integer Formulation  Mixed Integer Classification Problems  Multi-objective Integer Linear Programming  Multi-objective Mixed Integer Programming  Multiparametric Mixed Integer Linear Programming  Parametric Mixed Integer Nonlinear Optimization  Set Covering, Packing and Partitioning Problems  Simplicial Pivoting Algorithms for Integer Programming  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Time-dependent Traveling Salesman Problem References 1. Andersen ED, Gondzio J, Mészáros C, Xu X (1996) Implementation of interior point methods for large scale linear programming. In: Terlaky T (ed) Interior Point Methods in Mathematical Programming. Kluwer, Dordrecht, ftp://ftp.sztaki.hu/pub/oplab/PAPERS/kluwer.ps.Z 2. Applegate D, Bixby RE, Chvátal V, Cook W (1998) On the solution of travelling salesman problems. Documenta Math no. Extra Vol. Proc. ICM III:645–656 3. Baker EK (1981) Efficient heuristic algorithms for the weighted set covering problem. Comput Oper Res 8:303– 310

1641

1642

I

Integer Programming: Branch and Bound Methods

4. Baker EK, Fisher ML (1981) Computational results for very large air crew scheduling problems. OMEGA Internat J Management Sci 19:613–618 5. Balas E, Martin CH (1980) Pivot and complement: A heuristic for 0/1 programming. Managem Sci 26:86–96 6. Baldick R (1992) A randomized heuristic for inequalityconstrained mixed-integer programming. Techn Report Dept Electrical and Computer Engin Worcester Polytechnic Inst 7. Beale EML (1979) Branch and bound methods for mathematical programming systems. Ann Discret Math 5:201– 219 8. Beale EML, Tomlin JA (1970) Special facilities in a general mathematical programming system for nonconvex problems using ordered sets of variables. In: Lawerence J (ed) Proc. Fifth Internat. Conf. Oper. Res., Tavistock Publ., pp 447–454 9. Beasley JE, Chu PC (1996) A genetic algorithm for the set covering problem. Europ J Oper Res 194:392–404 10. Benichou M, Gauthier JM, Girodet P, Hehntges G, Ribiere G, Vincent O (1971) Experiments in mixed integer linear programming. Math Program 1:76–94 11. Benichou M, Gauthier JM, Hehntges G, Ribiere G (1977) The efficient solution of large-scale linear programming problems - some algorithmic techniques and computational results. Math Program 13:280–322 12. Bixby RE, Cook W, Cox A, Lee EK (1995) Parallel mixed integer programming. Techn Report Center Res Parallel Computation, Rice Univ CRPC-TR95554 13. Bixby RE, Cook W, Cox A, Lee EK (1999) Computational experience with parallel mixed integer programming in a distributed environment. Ann Oper Res 90:19–43 14. Bixby RE, Lee EK (1998) Solving a truck dispatching scheduling problem using branch-and-cut. Oper Res 46:355–367 15. Bixby RE, Wagner DK (1987) A note on detecting simple redundancies in linear systems. Oper Res Lett 6:15–18 16. Bonomi E, Lutton JL (1984) The N-city traveling salesman problem: Statistical mechanics and the metropolis algorithm. SIAM Rev 26:551–568 17. Borchers B, Mitchell JE (March 1991) Using an interior point method in a branch and bound algorithm for integer programming. Techn Report Math Sci Rensselaer Polytech Inst 195 18. Bradley GH, Hammer PL, Wolsey L (1975) Coefficient reduction in 0–1 variables. Math Program 7:263–282 19. Brearley AL, Mitra G, Williams HP (1975) Analysis of mathematical programming problems prior to applying the simplex method. Math Program 5:54–83 20. Breu R, Burdet CA (1974) Branch and bound experiments in zero-one programming. Math Program 2:1–50 21. Conn AR, Cornuejols G (1987) A projection method for the uncapacitated facility location problem. Techn Report Graduate School Industr Admin Carnegie-Mellon Univ 26-86-87

22. Crowder H, Johnson EL, Padberg M (1983) Solving largescale zero-one linear programming problem. Oper Res 31:803–834 23. Dakin RJ (1965) A tree search algorithm for mixed integer programming problems. Comput J 8:250–255 24. Dietrich B, Escudero L (1990) Coefficient reduction for knapsack-like constraints in 0/1 programs with variable upper bounds. Oper Res Lett 9:9–14 25. Driebeek NJ (1966) An algorithm for the solution of mixed integer programming problems. Managem Sci 21:576–587 26. Erlenkotter D (1978) A dual-based procedure for uncapacitated facility location. Oper Res 26:992–1009 27. Fenelon M (1991) Branching strategies for MIP. CPLEX 28. Fisher ML, Jaikumer R (1981) A generalized assignment heuristic for vehicle routing. Networks 11:109–124 29. Forrest JJ, Hirst JPH, Tomlin JA (1974) Practical solution of large mixed integer programming problems with UMPIRE. Managem Sci 20:736–773 30. Garey MR, Johnson DS (1979) Computers and intractability – A guide to the theory of NP-completeness. Freeman, New York 31. Gauthier JM, Ribiere G (1977) Experiments in mixed integer programming using pseudo-costs. Math Program 12:26–47 32. Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA 33. Guignard M, Spielberg K (1981) Logical reduction methods in zero-one programming. Oper Res 29:49–74 34. Hoffman KL, Padberg M (1991) Improving LP-representations of zero-one linear programs for branch-and-cut. ORSA J Comput 3:121–134 35. Hoffman KL, Padberg M (1992) Solving airline crewscheduling problems by branch-and-cut. Managem Sci 39:657–682 36. Ibarra OH, Kim CE (1975) Fast approximation algorithms for the knapsack and sum of subset problems. J ACM 22:463–468 37. Kruskal JB (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Amer Math Soc 7:48–50 38. Kuehn AA, Hamburger MJ (1963) A heuristic program for locating warehouses. Managem Sci 19:643–666 39. Land AH, Doig AG (1960) An automatic method for solving discrete programming problems. Econometrica 28:497–520 40. Land AH, Powell S (1979) Computer codes for problems of integer programming. Ann Discret Math 5:221–269 41. Lawler EL (1979) Fast approximation algorithms for the knapsack problems. Math Oper Res 4:339–356 42. Lee EK, Mitchell JE (1996) Computational experience in nonlinear mixed integer programming. In: The Oper. Res. Proc. 1996. Springer, Berlin, pp 95–100 43. Lee EK, Mitchell JE (2000) Computational experience of an interior-point SQP algorithm in a parallel branch-and-

Integer Programming: Branch and Cut Algorithms

44.

45.

46.

47.

48. 49.

50.

51. 52.

53. 54.

55.

56. 57.

bound framework. In: Frenk H et al (eds.) High Performance Optimization. Kluwer, Dordrecht, pp 329–347 (Chap. 13). Lin S, Kernighan BW (1973) An effective heuristic algorithm for the traveling salesman problem. Oper Res 21:498– 516 Lustig IJ, Marsten RE, Shanno DF (1994) Interior point methods for linear programming: Computational state of the art. ORSA J Comput 6(1):1–14. see also the following commentaries and rejoinder Manne AS (1964) Plant location under economies of scale-decentralization and computation. Managem Sci 11:213–235 Mitra G (1973) Investigations of some branch and bound strategies for the solution of mixed integer linear programs. Math Program 4:155–170 Nemhauser GL, Wolsey LA (1988) Integer and combinatorial optimization. Wiley, New York Padberg M, Rinaldi G (1989) A branch-and-cut approach to a traveling salesman problem with side constraints. Managem Sci 35:1393–1412 Padberg M, Rinaldi G (1991) A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Rev 33:60–100 Parker RG, Rardin RL (1988) Discrete optimization. Acad. Press, New York Rosenkrantz DJ, Stearns RE, Lewis PM (1977) An analysis of several heuristics for the traveling salesman problem. SIAM J Comput 6:563–581 Sahni S (1975) Approximate algorithms for the 0–1 knapsack problem. J ACM 22:115–124 Savelsbergh MWP (1994) Preprocessing and probing for mixed integer programming problems. ORSA J Comput 6:445–454 de Silva A, Abramson D (1998) A parallel interior point method and its application to facility location problems. Comput Optim Appl 9:249–273 Spielberg K (1969) Algorithms for the simple plant location problem with some side-conditions. Oper Res 17:85–111 Tomlin JA (1971) An improved branch and bound method for integer programming. Oper Res 19:1070–1075

Integer Programming: Branch and Cut Algorithms Branch and Cut JOHN E. MITCHELL Math. Sci. Rensselaer Polytechnic Institute, Troy, USA

MSC2000: 90C10, 90C11, 90C05, 90C08, 90C06

I

Article Outline Keywords A Standard Form Primal Heuristics Preprocessing Families of Cutting Planes When to Add Cutting Planes Lifting Cuts Implementation Details Solving Large Problems Conclusions See also References

Keywords Cutting planes; Branch and bound; Integer program; Exact algorithms Branch and cut methods are exact algorithms for integer programming problems. They consist of a combination of a cutting plane method (cf.  Integer programming: Cutting plane algorithms) with a branch and bound algorithm (cf.  Integer programming: Branch and bound methods). These methods work by solving a sequence of linear programming relaxations of the integer programming problem. Cutting plane methods improve the relaxation of the problem to more closely approximate the integer programming problem, and branch and bound algorithms proceed by a sophisticated divide-and-conquer approach to solve problems. The material in this entry builds on the material contained in the entries on cutting plane and branch and bound methods. Perhaps the best known branch and cut algorithms are those that have been used to solve traveling salesman problems. This approach is able to solve and prove optimality of far larger instances than other methods. Two papers that describe some of this research and also serve as good introductions to the area of branch and cut algorithms are [21,32]. A more recent work on the branch and cut approach to the traveling salesman problem is [1]. Branch and cut methods have also been used to solve other combinatorial optimization problems; recent references include [8,10,13,23,24,26]. For these problems, the cutting planes are typically derived

1643

1644

I

Integer Programming: Branch and Cut Algorithms

from studies of the polyhedral combinatorics of the corresponding integer program. This enables the addition of strong cutting planes (usually facet defining inequalities), which make it possible to considerably reduce the size of the branch and bound tree. Far more detail about these strong cutting planes can be found in  Integer programming: Cutting plane algorithms. Branch and cut methods for general integer programming problems are also of great interest (see, for example, the papers [4,7,11,16,17,22,28,30]). It is usually not possible to efficiently solve a general integer programming problem using just a cutting plane approach, and it is therefore necessary to also to branch, resulting in a branch and cut approach. A pure branch and bound approach can be sped up considerably by the employment of a cutting plane scheme, either just at the top of the tree, or at every node of the tree. For general problems, the specialized facets used when solving a specific combinatorial optimization problem are not available. Some useful families of general inequalities have been developed; these include cuts based on knapsack problems [17,22,23], Gomory cutting planes [5,12,19,20], lift and project cutting planes [3,4,29,33], and Fenchel cutting planes [9]. All of these families of cutting planes are discussed in more detail later in this entry. The software packages MINTO [30] and ABACUS [28] implement branch and cut algorithms to solve integer programming problems. The packages use standard linear programming solvers to solve the relaxations and they have a default implementation available. They also offer the user many options, including how to add cutting planes and how to branch. Example 1 Consider the integer programming problem 8 ˆ min 5x1  6x2 ˆ ˆ ˆ x: Ax = b, 0  x  u } may give the optimal solution to the integer program. This is guaranteed to happen if the constraint matrix A is totally unimodular, that is, the determinant of every square submatrix of A is either 0 or ˙ 1. Examples of totally unimodular matrices include the nodearc incidence matrix of a directed graph, the nodeedge incidence matrix of a bipartite undirected graph, and interval matrices (where each row of A consists of a possibly empty set of zeroes followed by a set of ones followed by another possibly empty set of zeros). It therefore suffices to solve the linear programming relaxation of maximum flow problems and shortest path problems on directed graphs, the assignment problem, and some problems that involve assigning workers to shifts, among others.

1651

1652

I

Integer Programming: Cutting Plane Algorithms

Chvátal–Gomory Cutting Planes One method of generating cutting planes involves combining together inequalities from the current description of the linear programming relaxation. This process is known as integer rounding, and the cutting planes generated are known as Chv átal-Gomory cutting planes. Integer rounding was implicitly described by Gomory in [12,13], and described explicitly by V. Chv átal in [7]. Consider again the example problem given earlier. The first step is to take a weighted combination of the inequalities. For example, 0:2(x1 C 2x2  7) C 0:4(2x1  x2  3) gives the valid inequality for the relaxation: x1  2:6: In any feasible solution to the integer programming problem, the left hand side of this inequality must take an integer value. Therefore, the right hand side can be rounded down to give the following valid inequality for the integer programming problem: x1  2: This process can be modified to generate additional inequalities. For example, taking the combination 0.5 (x1 + 2x2  7) + 0 (2x1  x2  3) gives 0.5 x1 + x2  3.5, which is valid for the relaxation. Since all the variables are constrained to be nonnegative, rounding down the left hand side of this inequality will only weaken it, giving x2  3.5, also valid for the LP relaxation. Now rounding down the right hand side gives x2  3, which is valid for the integer programming problem, even though it is not valid for the LP relaxation. Gomory originally derived constraints using the optimal simplex tableau. The LP relaxation of the simple example above can be expressed in equality form as: 8 ˆ ˆmin 2x1  x2 ˆ ˆ zU B  z. Similar tests can be derived for nonbasic variables at their upper bounds. It is also possible to fix variables when an interior point method is used to solve the relaxations [26]. Once some variables have been fixed in this manner, it is often possible to fix further variables using logical implications. For example, in a traveling salesman problem, if xe has been set equal to one for two edges incident to a particular vertex, then all other edges incident to that vertex can have their values fixed to zero. Solving Large Problems It is generally accepted that interior point methods are superior to the simplex algorithm for solving sufficiently large linear programming problems. The situation for cutting plane algorithms for large integer programming problems is not so clear, because the dual simplex method is very good at reoptimizing if only a handful of cutting planes are added. Nonetheless, it does appear that interior point cutting plane algorithms may well have a role to play, especially for problems with very large relaxations (thousands of variables and constraints) and where a large number of cutting planes are added simultaneously (hundreds or thousands). LP relaxations of integer programming problems can experience severe degeneracy, which can cause the simplex

Integer Programming: Cutting Plane Algorithms

method to stall. Interior point methods suffer far less from the effects of degeneracy. In [27], an interior point cutting plane algorithm is used for a maximum cut problem on a sparse graph, and the use of the interior point solver enables the solution of far larger instances than with a simplex solver, because of both the size of the problems and their degeneracy. A combined interior point and simplex cutting plane algorithm for the linear ordering problem is described in [30]. In the early stages, an interior point method is used, because the linear programs are large and many constraints are added at once. In the later stages, the dual simplex algorithm is used, because just a few constraints are added at a time and the dual simplex method can then reoptimize very quickly. The combined algorithm is up to ten times faster than either a pure interior point cutting plane algorithm or a pure simplex cutting plane algorithm on the larger instances considered. The polyhedral combinatorics of the quadratic assignment problem are investigated in [21]. It was found necessary to use an interior point method to solve the relaxations, because of the size of the relaxations. Provably Good Solutions Even if a cutting plane algorithm is unable to solve a problem to optimality, it can still be used to generate good feasible solutions with a guaranteed bound to optimality. This approach for the traveling salesman problem is described in [23]. The value of the current LP relaxation provides a lower bound on the optimal value of the integer programming problem. The optimal solution to the current LP relaxation (or a good feasible solution) can often be used to generate a good integral feasible solution using a heuristic procedure. The value of an integral solution obtained in this manner provides an upper bound on the optimal value of the integer programming problem. For example, for the traveling salesman problem, edges that have xe close to one can be set equal to one, edges with xe close to zero can be set to zero, and the remaining edges can be set so that the solution is the incidence vector of a tour. Further refinements are possible, such as using 2-change or 3-change procedures to improve the tour, as described in [25].

I

This has great practical importance. In many situations, it is not necessary to obtain an optimal solution, and a good solution will suffice. If it is only necessary to have a solution within 0.5% of optimality, say, then the cutting plane algorithm can be terminated when the gap between the lower bound and upper bound is smaller than this tolerance. If the objective function value must be integral, then the algorithm can be stopped with an optimal solution once this gap is less than one. Equivalence of Separation and Optimization The separation problem for an integer programming problem can be stated as follows: Given an instance of an integer programming problem and a point x, determine whether x is in the convex hull of feasible integral points. Further, if it is not in the convex hull, find a separating hyperplane that cuts off x from the convex hull. An algorithm for solving a separation problem is called a separation routine, and it can be used to solve an integer programming problem. The ellipsoid algorithm [17,24] is a method for solving linear programming problems in polynomial time. It can be used to solve an integer programming problem with a cutting plane method, and it will work in a polynomial number of stages, or calls to the separation routine. If the separation routine requires only polynomial time then the ellipsoid algorithm can be used to solve the problem in polynomial time. It can also be shown that if an optimization problem can be solved in polynomial time then the corresponding separation problem can also be solved in polynomial time. There are instances of any NP-hard problem that cannot be solved in polynomial time unless P = NP.Therefore, a cutting plane algorithm cannot always generate good cutting planes quickly for NP-hard problems. In practice, fast heuristics are used, and these heuristics may occasionally be unable to find a cutting plane even when one exists. Conclusions Cutting plane methods have been known for almost as long as the simplex algorithm. They have come back into favor since the early 1980s because of the development of strong cutting planes from polyhedral theory.

1655

1656

I

Integer Programming: Cutting Plane Algorithms

In practice, cutting plane methods have proven very successful for a wide variety of problems, giving provably optimal solutions. Because they solve relaxations of the problem of interest, they make it possible to obtain bounds on the optimal value, even for large instances that cannot currently be solved to optimality. See also  Branch and Price: Integer Programming with Column Generation  Decomposition Techniques for MILP: Lagrangian Relaxation  Integer Linear Complementary Problem  Integer Programming  Integer Programming: Algebraic Methods  Integer Programming: Branch and Bound Methods  Integer Programming: Branch and Cut Algorithms  Integer Programming Duality  Integer Programming: Lagrangian Relaxation  LCP: Pardalos-Rosen Mixed Integer Formulation  Mixed Integer Classification Problems  Multi-objective Integer Linear Programming  Multi-objective Mixed Integer Programming  Multiparametric Mixed Integer Linear Programming  Parametric Mixed Integer Nonlinear Optimization  Set Covering, Packing and Partitioning Problems  Simplicial Pivoting Algorithms for Integer Programming  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Time-dependent Traveling Salesman Problem References 1. Applegate D, Bixby RE, Chvátal V, Cook W (1998) On the solution of travelling salesman problems. Documenta Math, no. Extra Vol. Proc. ICM III:645–656 2. Balas E, Ceria S, Cornuéjols G (1996) Mixed 0–1 programming by lift-and-project in a branch-and-cut framework. Managem Sci 42:1229–1246, ftp: cumparsita.gsb.columbia.edu 3. Balas E, Ceria S, Cornuéjols G, Natraj N (1996) Gomory cuts revisited. Oper Res Lett 19:1–9, ftp: cumparsita.gsb.columbia.edu 4. Barahona F, Grötschel M, Jünger M, Reinelt G (1988) An application of combinatorial optimization to statistical physics and circuit layout design. Oper Res 36(3):493–513

5. Boyd EA (1994) Fenchel cutting planes for integer programs. Oper Res 42:53–64 6. Ceria S, Cornuéjols G, Dawande M (1995) Combining and strengthening Gomory cuts. In: Balas E, Clausen J (eds) Lecture Notes Computer Sci., vol 920. Springer, Berlin, ftp: cumparsita.gsb.columbia.edu 7. Chvátal V (1973) Edmonds polytopes and a hierarchy of combinatorial problems. Discret Math 4:305–337 8. Crowder HP, Johnson EL, Padberg M (1983) Solving largescale zero-one linear programming problems. Oper Res 31:803–834 9. Dantzig GB, Fulkerson DR, Johnson SM (1954) Solutions of a large-scale travelling salesman problem. Oper Res 2:393–410 10. Edmonds J (1965) Maximum matching and a polyhedron with 0, 1 vertices. J Res Nat Bureau Standards 69B:125– 130 11. Garey MR, Johnson DS (1979) Computers and intractibility: A guide to the theory of NP-completeness. Freeman, New York 12. Gomory RE (1958) Outline of an algorithm for integer solutions to linear programs. Bull Amer Math Soc 64:275–278 13. Gomory RE (1963) An algorithm for integer solutions to linear programs. In: Graves RL, Wolfe P (eds) Recent Advances in Mathematical Programming. McGraw-Hill, New York, pp 269–302 14. Grötschel M, Holland O (1985) Solving matching problems with linear programming. Math Program 33:243–259 15. Grötschel M, Holland O (1991) Solution of large-scale travelling salesman problems. Math Program 51(2):141–202 16. Grötschel M, Jünger M, Reinelt G (1984) A cutting plane algorithm for the linear ordering problem. Oper Res 32:1195–1220 17. Grötschel M, Lovasz L, Schrijver A (1988) Geometric algorithms and combinatorial optimization. Springer, Berlin 18. Grötschel M, Martin A, Weismantel R (1996) Packing Steiner trees: A cutting plane algorithm and computational results. Math Program 72:125–145 19. Hoffman KL, Padberg M (1985) LP-based combinatorial problem solving. Ann Oper Res 4:145–194 20. Hoffman KL, Padberg M (1991) Improving LP-representation of zero-one linear programs for branch-and-cut. ORSA J Comput 3(2):121–134 21. Jünger M, Kaibel V (1996) A basic study of the QAP polytope. Techn Report Inst Informatik Univ Köln 96.215 22. Jünger M, Reinelt G, Thienel S (1995) Practical problem solving with cutting plane algorithms in combinatorial optimization. In: Combinatorial Optimization. In: DIMACS. Amer. Math. Soc., Providence, RI, pp 111–152 23. Jünger M, Thienel S, Reinelt G (1994) Provably good solutions for the traveling salesman problem. ZOR - Math Meth Oper Res 40:183–217 24. Khachiyan LG (1979) A polynomial algorithm in linear programming. Soviet Math Dokl 20:1093–1096 (Dokl Akad Nauk SSSR 224:1093–1096)

Integer Programming Duality

25. Lin S, Kernighan BW (1973) An effective heuristic for the traveling salesman problem. Oper Res 21:498–516 26. Mitchell JE (1997) Fixing variables and generating classical cutting planes when using an interior point branch and cut method to solve integer programming problems. Europ J Oper Res 97:139–148 27. Mitchell JE (1998) An interior point cutting plane algorithm for Ising spin glass problems. In: Kischka P, Lorenz H-W (eds) Oper. Res. Proc., SOR 1997, Jena, Germany. Springer, Berlin, pp 114–119. www.math.rpi.edu/mitchj/ papers/isingint.ps 28. Mitchell JE (2000) Computational experience with an interior point cutting plane algorithm. SIAM J Optim 10(4):1212–1227 29. Mitchell JE, Borchers B (1996) Solving real-world linear ordering problems using a primal-dual interior point cutting plane method. Ann Oper Res 62:253–276 30. Mitchell JE, Borchers B (2000) Solving linear ordering problems with a combined interior point/simplex cutting plane algorithm. In: Frenk H et al (eds) High Performance Optimization. Kluwer, Dordrecht, pp 349–366 (Chap. 14) 31. Nemhauser GL, Sigismondi G (1992) A strong cutting plane/branch-and-bound algorithm for node packing. J Oper Res Soc 43:443–457 32. Nemhauser GL, Wolsey LA (1988) Integer and combinatorial optimization. Wiley, New York 33. Padberg M, Rinaldi G (1991) A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Rev 33(1):60–100 34. Schrijver A (1986) Theory of linear and integer programming. Wiley, New York 35. Schrijver A (1995) Polyhedral combinatorics. In: Graham RL, Grötschel M, Lovász L (eds) Handbook Combinatorics, vol 2. Elsevier, Amsterdam, pp 1649–1704 36. De Simone C, Diehl M, Jünger M, Mutzel P, Reinelt G, Rinaldi G (1995) Exact ground states of Ising spin glasses: New experimental results with a branch and cut algorithm. J Statist Phys 80:487–496

I

Surrogate Duality Lagrangian Duality Superadditive Duality Solving the Superadditive Dual Another Functional Dual Inference Duality Conclusions See also References Keywords Integer programming; Duality One of the more elegant and satisfying ideas in the theory of optimization is linear programming duality. The dual of a linear programming problem is not only interesting theoretically but has great practical value, because it provides sensitivity analysis, bounds on the optimal value, and marginal values for resources. It is natural to want to extend duality to integer programming in order to obtain these same benefits. The matter is not so simple, however. Linear programming duality actually represents several concepts of duality that happen to coincide in the case of linear programming but diverge as one moves to other types of optimization problems. The benefits also decouple, because each duality concept provides some of them but not others. Five types of integer programming duality are surveyed here. None is clearly superior to the others, and their strengths and weaknesses are summarized in at the end of the article. Linear Programming Duality

Integer Programming Duality J. N. HOOKER Graduate School of Industrial Admin., Carnegie Mellon University, Pittsburgh, USA

A brief summary of linear programming duality will provide a foundation for the rest of the discussion. Consider the linear programming (primal) problem, 8 ˆ ˆ S o xo D x p> S p x p and therefore 1 (S) D 1 (S). Note that similar results as in Theorem 1 hold for real skew-symmetric interval matrices, see [13]. Remark 2 Let S[S; S] be defined as before with S D > S > and S D S . Define the real interval matrix B[S; S]  S[S; S] by B fB D [b k` ] : [s k`  b k`  s k` ]; k; ` D 1; : : : ; ng. Using Bendixon’s theorem [11 Thm. 5.3] (i. e., for B 2 B, min ) , (B0 ) and max 1); if (u k v` > 0; v k u` < 0)

Proof We will prove that 1 (H) D 1 (H ). The rest of the proof is similar and will therefore be omitted. Because the minimization in (3) is over a compact set (i. e., {x, H: x 2 Cn , kxk = 1, H 2 H}) and x Hx is a real continuous function of x and H, it follows that x Hx attains its minimal value for some xo 2 Op and H o 2 H. By expanding xo  Hxo as in (11) and noting that xo is constant, it can be seen that there is an H p` 2 H p for which xo H o xo  xo H p` xo  x p` H p` x p` , where xp` denotes the unit-length eigenvector of H p` corresponding to 1 (H p` ). Moreover, because xo and H o solve the optimization problem (9), it follows that xo H o xo  x p` H p` x p` . Hence xo H o xo D x p` H p` x p` and therefore 1 (H) D 1 (H ). Note that similar results as in Theorem 4 hold for skewHermitian interval matrices. Remark 5 This remark is similar to Remark 2. Let H[H; H] be defined as before with H D H  , H D H  , and =H D =H. Define the complex interval matrix A[H; H]  H[H; H] by 8 9 [ k > 1 either an entry of B or of C, see [2] for more details. Hence, since the number of free parameters in ˇ ˇ (n 2 3nC2) that ˇH p ˇ D 2 2 . Further, H is n2 , it follows  (n 2 3nC2) and H D let H p D H p` : ` D 1; : : : ; 2 2 ˚ p (n 2 Cn2) 2 H : p D 1; : : : ; 22n2 hence jH j D 2 . Similarly as above, by maximizing x Hx over H 2 H and

Using Bendixon’s theorem and Theorem 4, it follows that min v D 1:

(14)

Let A = B+ iC, B 2 B, C 2 C. Since Ax = x we obtain (r C i i )x D (B C iC)(u C iv) D Bu  Cv C i(Bv C Cu): Premultiplying the above equation by x = u|  iv| , equating the real and imaginary parts, and noting that x x = 1 we obtain r D u> Bu  u> Cv C v> Bv C v> Cu and  i D v> Bu C v> Cv C u> Bv C u> Cu: We have that u> Bu D u> B0 u  (B0 )u> u and v> Bv D v> B0 v  (B0 )v> v, see [11], where B0

>

BCB : 2

(15)

r  (B0 )  u> Cv C v> Cu:

(B0 ) D D

(BCB) 2

and C c D

max

kxkD1;x2R n

max

kxkD1;x2R n

 (B0c ) C

(CCC) , 2

(16) then

x> Bx  >  x B c x C x> (B  B c )x max

D u> C c v C v> C c u  u> (C  C c )v C v> (C  C c )u 

max (u> C c v C v> C c u) k(u> ;v> )kD1

max (juj> C jvj C jvj> C juj) kD1 k  >    u 0 C c u  max Cc 0 v k(u> ;v> )kD1 v  >    0 C juj juj C max > >  0 jvj jvj C k(u ;v )kD1 C

(u> ;v> )

 (C 00c ) C (00C ); (19) where C has similar meaning as B defined in (18), ! C c CC > c 0 00 2 ; C c C C > c c 0 2 (20)   0 0  C 00C ; 0C 0 and 0C has similar meaning as B0 defined in (15). Hence, using (16), (17), and (19) we finally obtain

kxkD1;x2R n

The lower bound on r , r , can be obtained by noting that   =  r  ii is an eigenvalue of  A 2  A =  B + i( C), where B D fB :  B  B  B; B 2 Rnn g and, C D fC :  C  C  C; C 2 Rnn g using (21) and then replacing the roles of B and C by  B and  C we obtain r  r D (B0c )  (0B ) C (C 00c )  (00C ): (22)

jxj> B jxj

D (B0c ) C (0B );

(17)

where |x| abs(x) taken elementwise, B B  B c ;

 u> Cv C v> Cu

r  r D (B0c ) C (0B ) C (C 00c ) C (00C ): (21)

Note that similar results pertain to the real matrix C. Hence, using (14) we obtain

Choose B c D

To obtain the final form of the upper bound on r , r , it remains to carry out the following derivation:

(18)

and both 0B and B0c have similar meaning as B0 defined in (15).

The upper and lower bounds on i can be similarly obtained by noting that  i = i  ir is an eigenvalue of  iA 2  iA = C + i( B), using (21) and (22), respectively, and then replacing the roles of B and C by C and  B. We thus obtain Theorem 6 Let A = B + iC be as defined above, A c D (ACA) D B c C iC c be the central matrix of A, and  = 2 r + ii be any eigenvalue of the matrix A 2 A, then r  r  r

and  i   i   i ;

Interval Analysis: Eigenvalue Bounds of Interval Matrices

where r D r D i D

I

See also (B0c )  (0B ) C (C 00c )  (00C ); (B0c ) C (0B ) C (C 00c ) C (00C ); (C 0c )  (0C )  (B00c )  (00B );

 i D (C 0c ) C (0C )  (B00c ) C (00B ); all the primed matrices (i. e., B0c , C0c , 0B , and 0C ) have similar meaning as B0 defined in (15); B is as in (18) and C has similar meaning; and, C0c and 00C are as in (20) with B00c and 00B having similar meaning, respectively. Corollary 7 Note the following consequences: i) if r < 0, then the interval matrix A is Hurwitz stable. ii) if the rectangle o n (x; y) : r  x  r ;  i  y   i is contained in the open unit disk, then A is Schur stable. Some computational simplifications for Theorem 6 can be obtained by using the following lemmas. Lemma 8 Let Cc 00 be as defined in (20), then (C 00c ) D (C C > )

(C 00c ) D ( c 2 c ), where (A) = {||:  2 (A)} denotes the spectral radius of the matrix A. Proof Let G = (Cc  C>) c /2 and Gv =  v (note that since G is skew symmetric,  is purely imaginary, see [11]); the eigenvalues of C0c are ˙ i with corresponding eigenvectors (v| ,  iv| )| , which gives the desired result. Note that this Lemma can also be applied to B00c . Lemma 9 Let   0 S D ; S 0 where S 2 Rn × n and S = S| , then (D) D (D) D (S). Proof 4 The eigenvectors of D are (w| , ˙ w| )| , where w is an eigenvector of S. Hence the eigenvalues of D are ˙, with  an eigenvalue of S, which gives the desired result. Note that this Lemma can also be applied to 00B and 00C .

 ˛BB algorithm  Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators  Bounding Derivative Ranges  Eigenvalue Enclosures for Ordinary Differential Equations  Global Optimization: Application to Phase Equilibrium Problems  Hemivariational Inequalities: Eigenvalue Problems  Interval Analysis: Application to Chemical Engineering Design Problems  Interval Analysis: Differential Equations  Interval Analysis: Intermediate Terms  Interval Analysis: Nondifferentiable Problems  Interval Analysis: Parallel Methods for Global Optimization  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods  Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility  Interval Constraints  Interval Fixed Point Theory  Interval Global Optimization  Interval Linear Systems  Interval Newton Methods  Semidefinite Programming and Determinant Maximization

References 1. Adjiman CS, Floudas CA (1996) Rigorous convex underestimators for general twice-differentiable problems. J Global Optim 9:23–40 2. Hertz D (1992) The extreme eigenvalues and stability of Hermitian interval matrices. IEEE Trans Circuits and Systems I 39(6):463–466 3. Hertz D (1992) The extreme eigenvalues and stability of real symmetric interval matrices. IEEE Trans Autom Control 37(4):532–535 4. Hertz D (1992) Simple bounds on the extreme eigenvalues of Toeplitz matrices. IEEE Trans Inform Theory 38(1):175– 176 5. Hertz D (1993) The maximal eigenvalue and stability of a class of real symmetric interval matrices. IEEE Trans Cir-

1695

1696

I 6.

7.

8. 9. 10.

11. 12.

13.

14.

Interval Analysis: Intermediate Terms

cuits and Systems I: Fundamental Theory and Applications 40(1):56–57 Hertz D (1993) On the extreme eigenvalues of Toeplitz and Hankel interval matrices. Multidimensional Signals and Systems 4:83–90 Hertz D (1993) Root clustering of interval matrices. In: Jamshidi M, Mansour M, Anderson BDO (eds) Fundamentals of Discrete-Time Systems: A Tribute to Professor Eliahu I. Jury. IITSI Press, Albuquerque, NM, pp 271–278 Horn RA, Johnson CR (1985) Matrix analysis. Cambridge Univ. Press, Cambridge Horn RA, Johnson CR (1991) Topics in matrix analysis. Cambridge Univ. Press, Cambridge Jorge P, Ferreira SG (1994) Localization of the eigenvalues of Toeplitz matrices using additive decomposition, embedding in circulants, and the Fourier transform. SYSID, Copenhagen, Denmark 3:271–275; 175–176 Marcus M, Minc H (1966) Introduction to linear algebra. MacMillan, New York Rohn J (1992) Positive definiteness and stability of interval matrices. Report NI-92-05, Numerik Inst Denmarks Tekniske Hojskole, Lyngby 2800, March Rohn J (1996) Bounds on eigenvalues of interval matrices. Techn Report Inst Computer Sci Acad Sci Prague, Czech Republic 688, October Rudin W (1978) Principles of mathematical analysis. McGraw-Hill, New York

Interval Analysis: Intermediate Terms R. BAKER KEARFOTT Department Math., University Louisiana at Lafayette, Lafayette, USA MSC2000: 65G20, 65G30, 65G40, 65H20 Article Outline Keywords Use In Automatic Differentiation Use In Constraint Satisfaction Techniques. Use In Symbolic Preprocessing. See also References Keywords Expression parsing; Constraint satisfaction techniques; Interval dependency; Verification; Interval computations; Global optimization

Interval Analysis: Intermediate Terms, Table 1

(i) (ii) (iii) (iv) (v)

v1 v2 v3 v4 v5 v6 v7

= x1 , = x2 , = v12 , = v22 , = v4 v3 , = v3  v5 , = v6 + v4 .

Interval Analysis: Intermediate Terms, Table 2

OP 5 5 4 21 20

p 3 4 5 6 7

q 1 2 4 3 6

r   3 5 4

In global optimization algorithms, the computer must repeatedly evaluate an objective function, as well as, possibly, inequality and equality constraints. Such functions are given as algebraic expressions or as subroutines or sections of computer code. When such computer code is executed, operations are applied to the independent variables, producing intermediate terms. These intermediate terms are, in turn, combined to produce other intermediate terms, or, eventually, the objective function value. For example, consider the problem ( min (x) D x12  x12 x22 C x22 (1) over the box x D ([1; 1]; [1; 1])>: To evaluate , the computer may start with the independent variable values v1 = x1 and v2 = x2 internally produce quantities v3 , v4 , v5 , and v6 , to finally produce the dependent variable value (x) = v7 . Table 1 indicates how this may be done. A list such as in Table 1 may be represented as a table of addresses of variables and operations. For examx2q corresponds to operation ple, if the operation xp xq xr corresponds to operation code 4, xp code 5, xp xq + xr corresponds to operation code 20, and xp xq  xr corresponds to operation code 21, then the set of relations in Table 1 is represented by Table 2. Such a sequence of operations is called a code list, but is sometimes called other things, such as a tape. As-

Interval Analysis: Intermediate Terms

Interval Analysis: Intermediate Terms, Table 3

v1 = x 1 v2 = x 2 (i) v3 = v12 , (ii) v4 = v22 , (iii) v5 = v4 v3 , (iv) v6 = v3  v5 , (v) v7 = v6 + v4 , (vi) v8 = 2v1 , (vii) v9 = v8 v4 , (viii) v10 = v8  v9 , (ix) v11 = 2v2 , (x) v12 = v3 v11 , (xi) v13 = v12 , (xii) v14 = v13 + v11 , = v7 , @ @x 1 = v10 , @ = v14 . @x 2

suming the axioms of real arithmetic hold for evaluation, code lists for a given algebraic expression or portion of a computer program are not unique. The concept of a code list is familiar to computer science students who have worked with compilers, since a compiler produces such lists while translating algebraic expressions into machine language. However, code lists and access to the intermediate expressions are of particular importance in interval global optimization, for the following reasons.  Code lists provide a convenient internal representation for the objective and constraints, to be used for automatic differentiation, for both point and interval evaluation of objectives, gradients, and Hessian matrices.  The values of the intermediate quantities can be used within the optimization algorithm in processes that reduce the size of the search region.  Symbolic manipulation can reduce the overestimation, or interval dependency that would otherwise occur with interval evaluations. Details are given below.

entiation or as a symbolic representation of the system of equations to be solved in the backward mode. See [7] for an in-depth look at the forward mode of automatic differentiation, and see [3] for somewhat more recent research on the subject. See [6, pp. 37–39] for some examples and additional references. Also see  Automatic differentiation: Introduction, history and rounding error estimation. Use In Constraint Satisfaction Techniques. Since each intermediate variable in the code list is connected to one or two others via an elementary, invertible operation, narrow bounds on one such intermediate variable can be used to obtain narrow bounds on others. For example, suppose that the code list in Table 1 has been symbolically differentiated, to get the code list in Table 3. Then, if the subbox x = ([0.5, 1], [ 1,  0.5])| is to be considered for possible inclusion of optima, the derivative code list in Table 3 can be evaluated by forward substitution to obtain the interval set of intermediate values in Table 4. Furthermore, since (1) is an unconstrained problem, an optimum must occur where @/ @x1 = 0 and @/ @x2 = 0. In particular, any global optimizer x must have v10 (x  ) D 0:

A code list can be used either as a pattern to specify the computations in the forward mode of automatic differ-

(2)

Using (2) in line (viii) of the derivative code list in Table 5, v9 D v8  v10 ; whence v˜ 9

[1; 2]  0;

v9

e v9 \ v9 D [1; 2] :

Now, using (vii) of Table 5, e v4 v4

v9 [1; 2] D [0:5; 2]; D v8 [1; 2] e v4 \ v4 D [0:5; 1]:

Now using (ii) of Table 5 gives p p v4 [  v4 e v2 [0:70; 1] [ [1; 0:70];

Use In Automatic Differentiation

I

v2

e v2 \ v2 D [1; 0:70]:

(3) (4)

The last computation represents a narrowing of the range of one of the independent variables.

1697

1698

I

Interval Analysis: Intermediate Terms

Interval Analysis: Intermediate Terms, Table 4

v1 = [:5; 1], v2 = [1; :5], v3 = [:25; 1], v4 = [:25; 1], v5 = [:0625; 1], v6 = [:75:9375], v7 = [:5; 1:9375], v8 = [1; 2], v9 = [:25; 2], v10 = [1; 1:75], v11 = [2; 1], v12 = [2; :25], v13 = [:25; 2], v14 = [1:75; 1], 2 [:5; 1:9375], @ @x 1 2 [1; 1:75], @ 2 [1:75; 1]. @x 2

Interval Analysis: Intermediate Terms, Table 5

(i) (ii) (iii) (iv) (v) (vi) (vii) (viii)

v1 = x 1 , v2 = x 2 , v3 = v12 , v4 = v22 , v5 = v3 v2 , v6 = v13 , v7 = v6 + v4 , v8 = v7 + 1, v8 + v5 = 0, v8  3v5 = 0

(A similar computation could also have been carried out to obtain narrower bounds on v1 .) If, in addition, an upper bound  D 0 for the global optimum of  is known, then v7 2 [1; 0] \ [0:5; 1:9375] D [0:5; 0]: This can now be used in Table 3, (v), along with new intermediate variable bounds, wherever possible, to obtain e v4

v7  v6 D [0:5; 0]  [0:75; 0:9375]

D [1:4375; 0:75]; v4

v4 \e v4 D [0:5; 0:75]:

Now using Table 3, (vii), e v9

[1; 2][0:5; 0:75] D [0:5; 1:5];

v9

e v9 \ v9 D [1; 1:5];

then using (viii) and v10 = 0 gives v8 = [1, 1.5]. Finally, using Table 3, (vi), gives v1

[0:5; 0:75] \ [0:5; 1] D [0:5; 0:75]:

(5)

Now, evaluating  in (1) (or redoing the forward substitution represented in Table 4) at (x1 , x2 ) = ([0.5, 0.75], [1, 0.70]) gives  2 [:5:75]2  [:5:75]2 [1; :7]2 C [1; :7]2 D [:25:5625]  [:25:5625][:49; 1] C [:49; 1] D [:25:5625] C [:5625; :1225] C [:49; 1] D [:1775; 1:44]; contradicting the known upper bound  D 0. This proves that there can be no global optimizer of (1) within ([0.5, 1], [1, 0.5])| . (Note that, in fact, there are no global optimizers in ([1, 1], [1, 1])| if the problem is considered to be unconstrained.) The above procedure is easily automated, as is done in, say, GlobSol [2,6], UniCalc [1], or other interval constraint propagation software. This example illustrates a more general technique, associated with constraint propagation and logic programming. See [4] for an introduction to this view of the subject, and see [5] for alternate techniques of interval constraint satisfaction. Use In Symbolic Preprocessing. To understand how symbolic analysis based on the code list may help, consider the following example: 8 ˆ Find all solutions to ˆ ˆ ˆ ˆ ˆ f (x) D 0; f D ( f1 ; f 2 )> ˆ < (6) within the box x D ([2; 0]; [1; 1])> ˆ ˆ 3 2 2 ˆ ˆ where f 1 (x1 ; x2 ) D x1 C x1 x2 C x2 C 1 ˆ ˆ ˆ : f (x ; x ) D x 3  3x 2 x C x 2 C 1: 2

1

2

1

1 2

2

A possible code list is There is much interval dependency in this system, both in the individual equations (since each variable occurs in various terms), and between the equations (since the

I

Interval Analysis: Nondifferentiable Problems

equations share common terms). However, examination of the code list in Table 5 reveals that a change of variables can make the system more amenable to interval computation. Seeing that (vii) and (viii) are linear in v5 and v8 = v4 + v6 + 1, define ( y1 D v5 D x12 x2 ; (7) y2 D v4 C v6 D x13 C x22 : Then the system becomes ( y2 C y1 C 1 D 0; y2  3y1 C 1 D 0:

(8)

Thus, the linear system (8) may be solved easily for y1 and y2 . The interval bounds may then be plugged into (7) to obtain x1 and x2 . There is no overestimation in any of the expressions for function components or partial derivatives in either (8) or (7). Additional research should reveal how to automate this change of variables process. See also  Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators  Bounding Derivative Ranges  Global Optimization: Application to Phase Equilibrium Problems  Interval Analysis: Application to Chemical Engineering Design Problems  Interval Analysis: Differential Equations  Interval Analysis: Eigenvalue Bounds of Interval Matrices  Interval Analysis: Nondifferentiable Problems  Interval Analysis: Parallel Methods for Global Optimization  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods  Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility  Interval Constraints  Interval Fixed Point Theory  Interval Global Optimization  Interval Linear Systems  Interval Newton Methods

References 1. Babichev AB, Kadyrova OB, Kashevarova TP, Leshchenko AS, Semenov AL (1993) UniCalc, a novel approach to solving systems of algebraic equations. Interval Comput 2:29–47 2. Corliss GF, Kearfott RB (1998) Rigorous global search: Industrial applications. In: Csendes T (ed) (Special issues of the journal ‘Reliable Computing’). Kluwer, Dordrecht 3. Griewank A and Corliss GF (eds) (1991) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia 4. Van Hentenryck P (1989) Constraint satisfaction in logic programming. MIT, Cambridge, MA 5. Van Hentenryck P, Michel L, Deville Y (1997) Numerica: A modeling language for global optimization. MIT, Cambridge, MA 6. Kearfott RB (1996) Rigorous global search: Continuous problems. Kluwer, Dordrecht 7. Rall LB (1981) Automatic differentiation: Techniques and applications. Lecture Notes Computer Sci, vol 120. Springer, Berlin

Interval Analysis: Nondifferentiable Problems R. BAKER KEARFOTT Department Math., University Louisiana at Lafayette, Lafayette, USA MSC2000: 65G20, 65G30, 65G40, 65H20 Article Outline Keywords Posing As Continuous Problems A Special Method for Minimax Problems Treating As Continuous Problems

See also References Keywords Nondifferentiability; Interval slopes; Verification; Interval computations; Global optimization Nondifferentiable problems arise in various places in global optimization. One example is in l1 and l1 optimization. That is, min (x) D min kFk1 D min x

x

m X iD1

j f i (x)j

(1)

1699

1700

I

Interval Analysis: Nondifferentiable Problems

and





min (x) D min kFk1 D min x

x

max j f i (x)j ; (2)

1im

where x is an n-vector, arise in data fitting, etc., and  has a discontinuous gradient. In other problems, piecewise linear or piecewise quadratic approximations are used, and the gradient or the Hessian matrix are discontinuous. In fact, in some problems, even the objective function can be discontinuous. Much thought has been given to nondifferentiability in algorithms to find local optima, and various techniques have been developed for local optimization. Some of these techniques can be used directly in interval global optimization algorithms. However, the power of interval arithmetic to bound the range of a pointvalued function, even if that function is discontinuous, can be used to design effective algorithms for nondifferentiable or discontinuous problems whose structure is virtually identical to that of algorithms for differentiable or continuous problems. Posing As Continuous Problems Several techniques are available for re-posing problems as differentiable problems, in particular for Problem (1) and Problem (2). One such technique, suggested in [2, p. 74] and elsewhere, involves rewriting the forms |e|, max{e1 , e2 }, and min{e1 , e2 } occurring in variable expressions in the objective and constraints in terms of additional constraints, as follows:  Replace an expression |e| by a new variable xn + 1 and the two constraints xn + 1  0 and x2nC1 = e2 .  Replace max{e1 , e2 } by e1 C e2 C je1  e2 j : 2  Replace min{e1 , e2 } by e1 C e2  je1  e2 j : 2 Alternately, as explained in [1] and elsewhere, the entire Problems (1) and (2) can be replaced by constrained problems. In particular, (1) can be replaced by 8 m X ˆ ˆ ˆ min vi ˆ < iD1 (3) ˆ s.t. v i  f i (x); i D 1; : : : ; m; ˆ ˆ ˆ : v i   f i (x); i D 1; : : : ; m;

where the vi are new variables. Likewise, (2) can be replaced by 8 ˆ ˆ 1; and suppose the interval [2, 2] is to be searched for global minima. For illustration purposes, suppose (0.25) = 0.125 has been evaluated, so that 0.125 is an upper bound on the global optimum, and suppose the subinterval x = [0.5, 1.5] is to be analyzed. To obtain an interval enclosure for the range of  over x, we take (x) 2 [0:5; 1:0]2 [(1 C [1:0; 1:5]) D [0:25; 1:0][[2:0; 2:5] D [0:25; 2:5];

Interval Analysis: Nondifferentiable Problems

where a [ b is the smallest interval that contains both a and b. Thus, since 0.125 < [0.25, 2.5], a minimum of  cannot possibly occur within the interval [0.5, 1.5]. Similar considerations apply if the gradient r is discontinuous. In such cases, the gradient test (see  Interval analysis: Unconstrained and constrained optimization) will keep boxes that either contain zeros of the gradient or critical points corresponding to gradient discontinuities where the gradient changes sign. When the gradient is discontinuous, interval Newton methods can still be used for iteration, as well as to verify existence. (See [3, (6.4) and (6.5), p. 217] for a formula; see  Interval Newton methods for an introduction to interval Newton methods; see  Interval fixed point theory for an explanation of interval fixed point theory.) Application to problems with discontinuous gradients is based on extended interval arithmetic (with infinities) and astute computation of slope bounds; see [3, pp. 214–215] for details. Example 1 Consider ˇ ˇ f (x) D ˇx 2  x ˇ  2x C 2 D 0:

I

Consider using the interval Newton method e x

xˇ (k) 

x(kC1)

f (xˇ(k) ) ; S( f ; x(k) ; xˇ (k) )

x(k) \e x;

with xˇ (k) equal to the midpoint xˇ D 0:9 of x(k) , and x(0) = [0.7, 1.1], where S( f ; x(k) ; xˇ (k) ) is a bound on the slope enclosure of f at xˇ . (See Fig. 1 for the concept of slope range.) An initial slope enclosure is then S(f , [0.7, 1.1], 0.9) = [3, 1], e x D :9 

:29 D [:996; 1:19]; [3; 1]

and S( f ; [0:7; 1:1]; 0:9) D [3; 1]. If this interval Newton method is iterated, then on iteration 3, existence of a root within x(3) was proven, since x(3)  intx(2) , where intx(2) is the interior of x(2) . For details, see [3, pp. 224–225]. See also

(6)

This function has both a root and a cusp at x = 1, with a left derivative of  3 and a right derivative of  1 at x = 1. If 1 2 x, then a slope enclosure is given by S(f , x, x) = [1, 1](x + x  1)  2.

Interval Analysis: Nondifferentiable Problems, Figure 1 The concept of a slope range for a nondifferentiable function

 Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators  Bounding Derivative Ranges  Global Optimization: Application to Phase Equilibrium Problems  Interval Analysis: Application to Chemical Engineering Design Problems  Interval Analysis: Differential Equations  Interval Analysis: Eigenvalue Bounds of Interval Matrices  Interval Analysis: Intermediate Terms  Interval Analysis: Parallel Methods for Global Optimization  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods  Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility  Interval Constraints  Interval Fixed Point Theory  Interval Global Optimization  Interval Linear Systems  Interval Newton Methods

1701

1702

I

Interval Analysis for Optimization of Dynamical Systems

References 1. Gill PE, Murray W, Wright M (1981) Practical optimization. Acad. Press, New York 2. Van Hentenryck P, Michel L, Deville Y (1997) Numerica: A modeling language for global optimization. MIT, Cambridge, MA 3. Kearfott RB (1996) Rigorous global search: Continuous problems. Kluwer, Dordrecht 4. Shen Z, Neumaier A, Eiermann MC (1990) Solving minimax problems by interval methods. BIT 30:742–751

Interval Analysis for Optimization of Dynamical Systems

YOUDONG LIN, MARK A. STADTHERR
Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, USA

Article Outline

Introduction
Formulation
Methods
  Taylor Models
  Verifying Solver for Parametric ODEs
  Deterministic Global Optimization Method
Cases
  Catalytic Cracking of Gas Oil
  Singular Control Problem
Conclusions
References

Introduction

There are many applications of optimization for dynamical systems, including parameter estimation from time series data, determination of optimal operating profiles for batch and semibatch processes, and optimal start-up, shutdown, and switching of continuous systems. To address such problems, one approach is to discretize any control profiles that appear as decision variables. There are then basically two types of methods available: (1) the complete discretization or simultaneous approach [20,28], in which both state variables and control profiles are discretized, and (2) the control parameterization or sequential approach [3,26], in which only the control profiles are discretized. In this article, only the sequential approach is considered.

Since these problems are often nonconvex and thus may exhibit multiple local solutions, the classical techniques based on solving the necessary conditions for a local minimum may fail to determine the global optimum. This is true even for a rather simple temperature-control problem with a batch reactor [12]. Therefore, there is interest in global optimization algorithms which can rigorously guarantee optimal performance. There has been significant recent work on this problem. For example, Esposito and Floudas [6,7] used the αBB approach [1,2] to address the global optimization of dynamic systems. In this method, convex underestimating functions are used in connection with a branch-and-bound framework. A theoretical guarantee of attaining an ε-global solution is offered as long as rigorous underestimators are used, and this requires that sufficiently large values of α be used. However, this is difficult in this context, because determining proper values of α depends on the Hessian of the function being underestimated, and this matrix is not available in explicit functional form when the sequential approach is used. Thus, as discussed in more detail by Papamichail and Adjiman [21], this approach does not provide a theoretical guarantee of global optimality. Alternative approaches have been given by Chachuat and Latifi [4] and by Papamichail and Adjiman [21,22] that do provide a theoretical guarantee of ε-global optimality; however, this is achieved at a high computational cost. Singer and Barton [25] have described a branch-and-bound approach for determining a theoretically guaranteed ε-global optimum with significantly less computational effort. In this method, convex underestimators and concave overestimators are used to construct two bounding initial value problems (IVPs), which are then solved to obtain lower and upper bounds on the trajectories of the state variables [24]. However, the bounding IVPs are solved using standard numerical methods that do not provide guaranteed error estimates, and so this approach does not provide fully guaranteed results from a computational standpoint. In this article we discuss an approach [8,9] for the deterministic global optimization of dynamical systems based on interval analysis. A key feature of the method is the use of a verifying solver [10] for parametric ordinary differential equations (ODEs), which is used to produce guaranteed bounds on the solutions of dynamic systems with interval-valued parameters.


This is combined with a technique for domain reduction based on using Taylor models [19] in an efficient constraint propagation scheme. The result is that problems can be solved to global optimality with both mathematical and computational certainty.

Formulation

In this section we give the mathematical formulation of the nonlinear dynamic optimization problem to be addressed. Assume the system is described by the nonlinear ODE model ẋ = f(x, θ). Here x is the vector of state variables (length n) and θ is a vector of adjustable parameters (length p), which may be a parameterization of a control profile θ(t). The model is given as an autonomous system; a nonautonomous system can easily be converted into autonomous form by treating the independent variable t as an additional state variable with derivative equal to 1. The objective function φ is expressed in terms of the adjustable parameters and the values of the states at discrete points t_μ, μ = 0, 1, ..., r. That is, φ = φ(x_μ(θ), θ; μ = 0, 1, ..., r), where x_μ(θ) = x(t_μ; θ). If an integral appears in the objective function, it can be eliminated by introducing an appropriate quadrature variable. The optimization problem is then stated as

$$ \min_{\theta, x} \; \varphi\bigl(x_\mu(\theta), \theta;\; \mu = 0, 1, \ldots, r\bigr) \qquad (1) $$

subject to

$$ \dot{x} = f(x, \theta), \quad x_0 = x_0(\theta), \quad t \in [t_0, t_r], \quad \theta \in \Theta. $$

Here Θ is an interval vector that provides upper and lower parameter bounds (uppercase will be used to denote interval-valued quantities, unless noted otherwise). We assume that f is (k − 1) times continuously differentiable with respect to the state variables x, and (q + 1) times continuously differentiable with respect to the parameters θ. We also assume that φ is (q + 1) times continuously differentiable with respect to the parameters θ. Here k is the order of the truncation error in the interval Taylor series (ITS) method to be used in the integration procedure, and q is the order of the Taylor model to be used to represent parameter dependence.
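As a concrete aside, the autonomous conversion and the quadrature variable mentioned above can be sketched in a few lines of Python. The right-hand side g, its integrand, and the parameter values below are hypothetical placeholders of our own, not a model from the article.

    import numpy as np

    def g(t, x, theta):
        # hypothetical nonautonomous right-hand side x' = g(t, x)
        return np.array([-theta[0] * x[0] + np.sin(t)])

    def f_autonomous(z, theta):
        # z = [x, t_state, q]: t is carried as a state with dt/dt = 1,
        # and q accumulates the integrand of an integral objective.
        x, t = z[:1], z[1]
        dx = g(t, x, theta)
        dt = 1.0
        dq = x[0] ** 2          # hypothetical integrand
        return np.concatenate([dx, [dt, dq]])

    z0 = np.array([1.0, 0.0, 0.0])   # initial x, t, and quadrature q
    print(f_autonomous(z0, theta=[0.3]))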

When a typical sequential approach is used, an ODE solver is applied to the constraints with a given set of parameter values, as determined by the optimization routine. This effectively eliminates x_μ, μ = 0, 1, ..., r, and leaves a bound-constrained minimization in the adjustable parameters only. The method discussed here can also be extended to optimization problems with general state path constraints, and to more general equality or inequality constraints on the parameters. This is done by adapting the constraint propagation procedure (CPP) discussed below to handle the additional constraints.

Methods

Taylor Models

Makino and Berz [13] have described a remainder differential algebra (RDA) approach that uses Taylor models for bounding function ranges. This represents an approach for controlling the "dependency problem" of interval arithmetic, which leads to overestimation of function ranges. In the RDA approach, a function is represented using a model consisting of a Taylor polynomial and an interval remainder bound. One way of forming a Taylor model of a function is by using a truncated Taylor series. Consider a function f : x ∈ X ⊂ R^m → R that is (q + 1) times partially differentiable on X, and let x₀ ∈ X. The Taylor theorem states that for each x ∈ X, there exists a λ ∈ R with 0 < λ < 1 such that

$$ f(x) = \sum_{i=0}^{q} \frac{1}{i!} \bigl[(x - x_0) \cdot \partial\bigr]^i f(x_0) + \frac{1}{(q+1)!} \bigl[(x - x_0) \cdot \partial\bigr]^{q+1} f\bigl(x_0 + \lambda (x - x_0)\bigr), \qquad (2) $$

where the partial differential operator [g · ∂]^k is

$$ [g \cdot \partial]^k = \sum_{\substack{j_1 + \cdots + j_m = k \\ 0 \le j_1, \ldots, j_m \le k}} \frac{k!}{j_1! \cdots j_m!} \, g_1^{j_1} \cdots g_m^{j_m} \, \frac{\partial^k}{\partial x_1^{j_1} \cdots \partial x_m^{j_m}}. \qquad (3) $$

The last (remainder) term in (2) can be quantitatively bounded over 0 < λ < 1 and x ∈ X using interval arithmetic or other methods to obtain an interval remainder bound R_f.


The summation in (2) is a qth-order polynomial (truncated Taylor series) in (x − x₀), which we denote by p_f(x − x₀). A qth-order Taylor model T_f for f(x) then consists of the polynomial p_f and the interval remainder bound R_f, and is denoted by T_f = (p_f, R_f). Note that f(x) ∈ T_f for x ∈ X, and thus T_f encloses the range of f over X.

In practice, it is more useful to compute Taylor models of functions by performing Taylor model operations. Arithmetic operations with Taylor models can be done using the RDA operations described by Makino and Berz [13,14], which include addition, multiplication, reciprocal, and intrinsic functions. Therefore, it is possible to compute a Taylor model for any function representable in a computer environment by simple operator overloading through RDA operations. When RDA operations are performed, only the coefficients of p_f are stored and operated on; however, rounding errors are bounded and added to R_f. It has been shown that, compared with other rigorous bounding methods, the Taylor model can be used to obtain sharper bounds for modest to complicated functional dependencies [13,19].

An interval bound on a Taylor model T = (p, R) over X is denoted by B(T), and is found by determining an interval bound B(p) on the polynomial part p and then adding the remainder bound; that is, B(T) = B(p) + R. The range bounding of the polynomial, B(p) = P(X − x₀), is an important issue, which directly affects the performance of Taylor model methods. Unfortunately, the exact range bounding of an interval polynomial is NP-hard, and direct evaluation using interval arithmetic is very inefficient, often yielding only loose bounds. Thus, various bounding schemes [15,19] have been used, mostly focused on exact bounding of the dominant parts of P, i.e., the first- and second-order terms. However, exact bounding of a general interval quadratic is also computationally expensive (in the worst case, exponential in the number of variables m). Lin and Stadtherr [8] have adopted a very simple compromise approach, in which only the first-order and the diagonal second-order terms are considered for exact bounding, and the other terms are evaluated directly. That is,

$$ B(p) = \sum_{i=1}^{m} \bigl[ a_i (X_i - x_{i0})^2 + b_i (X_i - x_{i0}) \bigr] + Q, \qquad (4) $$

where Q is the interval bound of all other terms, obtained by direct evaluation with interval arithmetic. In (4), since X_i occurs twice, there is a dependency problem. For |a_i| ≥ ω, where ω is a small positive number, (4) can be rearranged so that each X_i occurs only once; that is,

$$ B(p) = \sum_{i=1}^{m} a_i \left[ \left( X_i - x_{i0} + \frac{b_i}{2 a_i} \right)^2 - \frac{b_i^2}{4 a_i^2} \right] + Q. \qquad (5) $$

In this way, the dependency problem in bounding the interval polynomial is alleviated, so that a sharper bound can be obtained. If |a_i| < ω, direct evaluation can be used instead.
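As a toy numerical illustration (our own sketch, with naive tuple intervals and no outward rounding), the Python snippet below bounds a single quadratic term a(X − x₀)² + b(X − x₀) both directly, as in (4), and in the completed-square form (5). For x² − x over X = [−1, 1], the direct form gives [−1, 2], while (5) gives the exact range [−0.25, 2].

    def add(u, v): return (u[0] + v[0], u[1] + v[1])

    def sq(u):
        lo, hi = u
        hi2 = max(lo*lo, hi*hi)
        return (0.0, hi2) if lo <= 0.0 <= hi else (min(lo*lo, hi*hi), hi2)

    def scale(c, u): return (c*u[0], c*u[1]) if c >= 0 else (c*u[1], c*u[0])

    a, b, x0 = 1.0, -1.0, 0.0
    X = (-1.0, 1.0)
    d = (X[0] - x0, X[1] - x0)                 # X - x0

    naive = add(scale(a, sq(d)), scale(b, d))  # form (4): (-1.0, 2.0)
    shift = b / (2.0 * a)
    r = b*b / (4.0 * a * a)
    tight = scale(a, add(sq(add(d, (shift, shift))), (-r, -r)))  # form (5)
    print(naive, tight)                        # (-1.0, 2.0) (-0.25, 2.0)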

Verifying Solver for Parametric ODEs

When a traditional sequential approach is applied to the optimization of nonlinear dynamical systems, the objective function φ is evaluated, for a given value of θ, by applying an ODE solver to the constraints to eliminate the state variables x. In the global optimization approach discussed here, a sequential approach based on interval analysis is used. This approach requires the evaluation of bounds on φ, given some parameter interval Θ. Thus, an ODE solver is needed that can compute bounds on x_μ, μ = 0, 1, ..., r, for the case in which the parameters are interval-valued. Interval methods (also called validated methods or verified methods) for ODEs [16] provide a natural approach for computing the desired enclosure of the state variables at t_μ, μ = 0, 1, ..., r. An excellent review of interval methods for IVPs has been given by Nedialkov et al. [17]. Much work has been done for the case in which the initial values are given by intervals, and there are several software packages available that deal with this case. However, less work has been done on the case in which the parameters are also given by intervals. In the global optimization method discussed here, a verifying solver for parametric ODEs [10], called VSPODE, is used to produce guaranteed bounds on the solutions of dynamic systems with interval-valued initial states and parameters. In this section, we review the key ideas behind the method used in VSPODE and outline the procedures used. Additional details are given by Lin and Stadtherr [10]. Consider the parametric ODE system

$$ \dot{x} = f(x, \theta), \quad x_0 \in X_0, \quad \theta \in \Theta, \qquad (6) $$


where t ∈ [t₀, t_r] for some t_r > t₀. The interval vectors X₀ and Θ represent enclosures of the initial values and parameters, respectively. It is desired to determine a verified enclosure of all possible solutions to this initial value problem. We denote by

$$ x(t; t_j, X_j, \Theta) = \bigl\{\, x(t; t_j, x_j, \theta) \mid x_j \in X_j,\ \theta \in \Theta \,\bigr\} $$

the set of solutions, where x(t; t_j, x_j, θ) denotes a solution of ẋ = f(x, θ) for the initial condition x = x_j at t = t_j. We will outline a method for determining enclosures X_j of the state variables at each time step j = 1, ..., r, such that x(t_j; t₀, X₀, Θ) ⊆ X_j.

Assume that at t_j we have an enclosure X_j of x(t_j; t₀, X₀, Θ), and that we want to carry out an integration step to compute the next enclosure X_{j+1}. Then, in the first phase of the method, the goal is to find a step size h_j = t_{j+1} − t_j > 0 and an a priori enclosure (coarse enclosure) X̃_j of the solution such that a unique solution x(t; t_j, x_j, θ) ∈ X̃_j is guaranteed to exist for all t ∈ [t_j, t_{j+1}], all x_j ∈ X_j, and all θ ∈ Θ. One can apply a traditional interval method, with high-order enclosure, to the parametric ODEs by using an ITS with respect to time. That is, h_j and X̃_j are determined such that, for X_j ⊆ X̃_j⁰,

$$ \widetilde{X}_j = \sum_{i=0}^{k-1} [0, h_j]^i F^{[i]}(X_j, \Theta) + [0, h_j]^k F^{[k]}(\widetilde{X}_j^0, \Theta) \subseteq \widetilde{X}_j^0. \qquad (7) $$

Here X̃_j⁰ is an initial estimate of X̃_j, k denotes the order of the Taylor expansion, and the coefficients F^{[i]} are interval extensions of the Taylor coefficients f^{[i]} of x(t) with respect to time. Satisfaction of (7) demonstrates [5] that there exists a unique solution x(t; t_j, x_j, θ) ∈ X̃_j for all t ∈ [t_j, t_{j+1}], all x_j ∈ X_j, and all θ ∈ Θ.
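The flavor of this phase-1 test can be shown on a toy problem. The sketch below is our own illustration (not VSPODE) for the scalar ODE ẋ = −θx, whose time Taylor coefficients are f^[i] = (−θ)^i x / i!. Given X_j, Θ, a trial step h, and a trial coarse enclosure, it checks that the interval Taylor series maps the trial enclosure into itself, which establishes existence and uniqueness on the step. Plain tuples stand in for intervals, and outward rounding is omitted.

    def add(a, b): return (a[0] + b[0], a[1] + b[1])

    def mul(a, b):
        p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
        return (min(p), max(p))

    def inside(a, b): return b[0] <= a[0] and a[1] <= b[1]

    def taylor_coeffs(x, theta, k):
        # coefficients f[i] = (-theta)^i x / i! for x' = -theta*x
        coeffs, c, fact = [x], x, 1.0
        for i in range(1, k + 1):
            c = mul((-theta[1], -theta[0]), c)   # multiply by -theta
            fact *= i
            coeffs.append((c[0] / fact, c[1] / fact))
        return coeffs

    def validate(Xj, X0t, theta, h, k):
        F = taylor_coeffs(Xj, theta, k - 1)      # F[0..k-1] over X_j
        Fk = taylor_coeffs(X0t, theta, k)[k]     # F[k] over the trial enclosure
        enc = F[0]                               # the [0,h]^0 term is just F[0]
        for i in range(1, k):
            enc = add(enc, mul((0.0, h**i), F[i]))
        enc = add(enc, mul((0.0, h**k), Fk))
        return inside(enc, X0t), enc

    ok, enc = validate(Xj=(0.9, 1.1), X0t=(0.0, 1.5), theta=(0.5, 1.0), h=0.1, k=3)
    print(ok, enc)   # True, roughly (0.79, 1.11): the step is validated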


In the second phase of the method, a tighter enclosure X_{j+1} ⊆ X̃_j is computed such that x(t_{j+1}; t₀, X₀, Θ) ⊆ X_{j+1}. This is done by using an ITS approach to compute a Taylor model T_{x_{j+1}} of x_{j+1} in terms of the parameter vector θ and the initial state vector x₀, and then obtaining the enclosure X_{j+1} = B(T_{x_{j+1}}) by bounding T_{x_{j+1}} over θ ∈ Θ and x₀ ∈ X₀. To determine enclosures of the ITS coefficients f^[i](x_j, θ), an approach combining RDA operations with the mean value theorem is used to obtain the Taylor models T_{f^[i]}. Now using an ITS for x_{j+1} with coefficients given by T_{f^[i]}, one can obtain a result for T_{x_{j+1}} in terms of the parameters and initial states. In order to address the wrapping effect [16], results are propagated from one time step to the next using a type of Taylor model in which the remainder bound is not an interval but a parallelepiped. That is, the remainder bound is a set of the form P = {Av | v ∈ V}, where A ∈ R^{n×n} is a real and regular matrix. If A is orthogonal, as from a QR-factorization, then P can be interpreted as a rotated n-dimensional rectangle. Complete details of the computation of T_{x_{j+1}} were given by Lin and Stadtherr [10].

The approach outlined above, as implemented in VSPODE, has been tested by Lin and Stadtherr [10], who compared its performance with results obtained using the popular VNODE package [18]. For the test problems used, VSPODE provided tighter enclosures on the state variables than VNODE, and required significantly less computation time.

Deterministic Global Optimization Method

In this section, we summarize a method for the deterministic global optimization of dynamical systems, based on the use of the tools described above. As noted previously, when a sequential approach is used, the state variables are effectively eliminated using the ODE constraints, in this case by employing VSPODE, leaving a bound-constrained minimization of φ(θ) with respect to the adjustable parameters (decision variables) θ. The optimization method discussed here can be thought of as a type of branch-and-bound method, with a CPP used for domain reduction; therefore, it can also be viewed as a branch-and-reduce algorithm. The basic idea is that only those parts of the decision variable space Θ that satisfy the constraint c(θ) = φ(θ) − φ̂ ≤ 0, where φ̂ is a known upper bound on the global minimum found using local minimization, need to be retained. To perform this domain reduction, a CPP can be used. Partial information expressed by a constraint can be used to eliminate incompatible values from the domains of its variables. This domain reduction can then be propagated to all constraints on those variables, where it may be used to further reduce the domains of other variables. This process is known as constraint propagation.


It is applied to a sequence of subintervals of Θ, which arises in a bisection process. For a subinterval Θ^(k), the Taylor model T_{φk} of the objective function φ over Θ^(k) is computed. To do this, Taylor models of x_μ, the state variables at times t_μ, μ = 1, ..., r, in terms of θ are determined using VSPODE. Note that T_{φk} then consists of a qth-order polynomial in the decision variables θ, plus a remainder bound. The part of Θ^(k) that can contain the global minimum must satisfy the constraint c(θ) = φ(θ) − φ̂ ≤ 0. In the CPP outlined here, B(T_c) is determined, and then there are three possible outcomes (in the following, an underline indicates the lower bound of an interval and an overline the upper bound):

1. If B̲(T_c) > 0, then no θ ∈ Θ^(k) will ever satisfy the constraint; thus, the CPP can be stopped and Θ^(k) discarded. Testing for this outcome amounts to checking whether the lower bound of T_{φk}, B̲(T_{φk}), is greater than φ̂. If so, then Θ^(k) can be discarded, because it cannot contain the global minimum and need not be tested further.
2. If B̄(T_c) ≤ 0, then every θ ∈ Θ^(k) will always satisfy the constraint; thus, Θ^(k) cannot be reduced and the CPP can be stopped. This amounts to checking whether the upper bound of T_{φk}, B̄(T_{φk}), is less than φ̂. This also indicates, with certainty, that there is a point in Θ^(k) that can be used to update φ̂, which can then be done using a local optimization routine.
3. If neither of the previous two cases occurs, then part of the interval Θ^(k) may be eliminated. To do this, an approach [8,9] based on the range-bounding strategy for Taylor models is used, as given by (5). If insufficient reduction of Θ^(k) occurs, then it is bisected, and the resulting subintervals are added to the sequence of subintervals to be processed (a schematic sketch of this three-way test is given below).

Complete details of the optimization method based on these ideas were given by Lin and Stadtherr [8,9]. It can be implemented either as an ε-global algorithm or, by incorporating interval-Newton steps in the method, as an exact (ε = 0) algorithm. The latter requires the application of VSPODE to the first- and second-order sensitivity equations. An exact algorithm using interval-Newton steps was implemented by Lin and Stadtherr [8] for the special case of parameter estimation problems. However, this has not been fully implemented for more general cases.
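The following schematic sketch (our own simplification; the names are illustrative, not from the source) shows the three-way test applied to one subinterval, given interval bounds on the objective's Taylor model:

    def classify_box(B_phi, phi_hat):
        """B_phi = (lower, upper) bound of the Taylor model of phi
        over the current subinterval Theta_k."""
        lo, hi = B_phi
        if lo > phi_hat:            # outcome 1: c > 0 everywhere -> discard
            return "discard"
        if hi < phi_hat:            # outcome 2: c < 0 everywhere -> keep whole;
            return "keep_update"    # a local solve in Theta_k can improve phi_hat
        return "reduce_or_bisect"   # outcome 3: try domain reduction, else bisect

    print(classify_box((1.2, 3.0), phi_hat=1.0))   # discard
    print(classify_box((-0.5, 0.4), phi_hat=1.0))  # keep_update
    print(classify_box((0.5, 1.5), phi_hat=1.0))   # reduce_or_bisect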

Cases

Lin and Stadtherr [8,9] have tested the performance of the algorithm discussed above on a variety of test problems; in this section we summarize the results for two of them. Both example problems were solved using an Intel Pentium 4 3.2 GHz machine running Red Hat Linux. The VSPODE package [9], with a k = 17 order ITS, a q = 3 order Taylor model, and the QR approach for wrapping, was used to integrate the dynamical system in each problem. Using a smaller value of k will result in the need for smaller step sizes in the integration, and so will tend to increase computation time. Using a larger value of q will result in somewhat tighter bounds on the states, though at the expense of additional complexity in the Taylor model computations.

Catalytic Cracking of Gas Oil

This problem involves parameter estimation in a model representing the catalytic cracking of gas oil (A) to gasoline (Q) and other side products (S), as described by Tjoa and Biegler [27] and also studied by several others [4,7,22,25]. The reaction scheme is

A → Q (rate constant k₁), Q → S (rate constant k₂), A → S (rate constant k₃).

Only the concentrations of A and Q were measured. This reaction scheme involves nonlinear reaction kinetics. A least-squares objective was used for the parameter estimation, resulting in the optimization problem

$$ \min_{\theta} \; \varphi = \sum_{\mu=1}^{20} \sum_{i=1}^{2} \bigl( \hat{x}_{\mu,i} - x_{\mu,i} \bigr)^2 $$

subject to

$$ \dot{x}_1 = -(\theta_1 + \theta_3) x_1^2, \quad \dot{x}_2 = \theta_1 x_1^2 - \theta_2 x_2, $$
$$ t \in [0, 0.95], \quad x_\mu = x(t_\mu), \quad x_0 = (1, 0)^T, $$
$$ \theta \in [0, 20] \times [0, 20] \times [0, 20], $$

where x̂ is given experimental data. Here the state vector x is defined as the concentration vector (A, Q)^T, and the parameter vector θ is defined as (k₁, k₂, k₃)^T.
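A pointwise (non-validated) rendition of this estimation problem is easy to set up; the sketch below is our own, with placeholder data values rather than the measurements of [27].

    import numpy as np
    from scipy.integrate import solve_ivp

    def rhs(t, x, k1, k2, k3):
        # A -> Q (k1), Q -> S (k2), A -> S (k3); x = (A, Q)
        return [-(k1 + k3) * x[0] ** 2, k1 * x[0] ** 2 - k2 * x[1]]

    def phi(theta, t_meas, x_hat):
        sol = solve_ivp(rhs, (0.0, 0.95), [1.0, 0.0], t_eval=t_meas,
                        args=tuple(theta), rtol=1e-10, atol=1e-12)
        return float(np.sum((x_hat - sol.y.T) ** 2))

    t_meas = np.linspace(0.05, 0.95, 20)    # hypothetical sample times
    x_hat = np.zeros((20, 2))               # placeholder "data"
    print(phi([12.0, 8.0, 2.0], t_meas, x_hat))

A global method, by contrast, must bound phi over a whole interval vector Θ, which is exactly the role VSPODE plays above.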


For the ε-global algorithm, with a relative convergence tolerance of ε_rel = 10⁻³, 14.3 s was required to solve this problem. For the exact (ε = 0) global algorithm using interval-Newton steps, 11.5 s was required. For this problem, the exact algorithm required less computation than the ε-global algorithm; however, this may or may not be the case for other problems [8]. Papamichail and Adjiman [22] solved this problem to ε-global optimality in 35,478 s (Sun UltraSPARC-II, 360 MHz; Matlab), and Chachuat and Latifi [4] obtained an ε-global solution in 10,400 s (unspecified machine; prototype implementation). Singer and Barton [25] solved this problem to ε-global optimality for a series of absolute tolerances, so their results are not directly comparable; however, the computational cost of their method on this problem appears to be quite low. These other methods all provide for ε-convergence only.

Singular Control Problem

This example is a nonlinear singular optimal control problem originally formulated by Luus [11] and also considered by Esposito and Floudas [7], Chachuat and Latifi [4], and Singer and Barton [25]. This problem is known to have multiple local solutions. In autonomous form and using a quadrature variable, this problem is given by

$$ \min_{\theta(t)} \; \varphi = x_5(t_f) \qquad (8) $$

subject to

$$ \dot{x}_1 = x_2, $$
$$ \dot{x}_2 = -x_3 \theta + 16 x_4 - 8, $$
$$ \dot{x}_3 = \theta, $$
$$ \dot{x}_4 = 1, $$
$$ \dot{x}_5 = x_1^2 + x_2^2 + 0.0005 \bigl( x_2 + 16 x_4 - 8 - 0.1 x_3 \theta^2 \bigr)^2, $$
$$ x_0 = (0, -1, -\sqrt{5}, 0, 0)^T, \quad t \in [t_0, t_f] = [0, 1], \quad \theta \in [-4, 10]. $$
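For reference, a plain (non-validated) simulation of these dynamics under a piecewise-constant control can be written as follows; this sketch is our own illustration, not code from the article.

    import numpy as np
    from scipy.integrate import solve_ivp

    def make_rhs(theta_vals):
        n = len(theta_vals)
        def rhs(t, x):
            th = theta_vals[min(int(t * n), n - 1)]  # piecewise-constant profile
            x1, x2, x3, x4, x5 = x
            return [x2,
                    -x3 * th + 16 * x4 - 8,
                    th,
                    1.0,
                    x1**2 + x2**2
                    + 0.0005 * (x2 + 16 * x4 - 8 - 0.1 * x3 * th**2) ** 2]
        return rhs

    x0 = [0.0, -1.0, -np.sqrt(5.0), 0.0, 0.0]
    sol = solve_ivp(make_rhs([4.071]), (0.0, 1.0), x0, rtol=1e-9, atol=1e-10)
    print(sol.y[4, -1])   # objective x5(t_f); cf. the one-interval row of Table 1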

The control θ(t) is parameterized as a piecewise-constant profile with a specified number of equal time intervals. Five problems are considered, corresponding to one, two, three, four, and five time intervals in the parameterization. Each problem was solved to an absolute tolerance of ε_abs = 10⁻³. Computational results [9] are presented in Table 1, which shows, for each problem, the globally optimal objective value φ* and the corresponding optimal controls θ*, as well as the CPU time (in seconds) and the number of iterations required. Chachuat and Latifi [4] solved the two-interval problem to ε-global optimality using four different strategies, with the most efficient requiring 502 CPU seconds, using an unspecified machine and a "prototype" implementation. Singer and Barton [25] solved the one-, two-, and three-interval cases with ε_abs = 10⁻³ using two different problem formulations (with and without a quadrature variable) and two different implementations (with and without branch-and-bound heuristics). The best results in terms of efficiency were achieved with heuristics and without a quadrature variable, with CPU times of 1.8, 22.5, and 540.3 s (1.667 GHz AMD Athlon XP2000+) for the one-, two-, and three-interval problems, respectively. This compares with CPU times of 0.02, 0.32, and 10.88 s (3.2 GHz Intel Pentium 4) for the method discussed here. Even accounting for the roughly factor-of-2 difference in the speeds of the machines used, the method described here appears to be well over an order of magnitude faster. The four- and five-interval problems were solved [9] in 369 and 8,580.6 CPU seconds, respectively, and apparently had not been solved previously using a method rigorously guaranteed to find an ε-global minimum. It should be noted that the solution to the three-interval problem, as given in Table 1, differs from the result reported by Singer and Barton [25], which is known to be a misprint [23].

Interval Analysis for Optimization of Dynamical Systems, Table 1
Results [9] for the singular control problem

Time intervals | φ*     | θ*                                   | CPU time (s) | No. of iterations
1              | 0.4965 | (4.071)                              | 0.02         | 9
2              | 0.2771 | (5.575, 4.000)                       | 0.32         | 71
3              | 0.1475 | (8.001, 1.944, 6.042)                | 10.88        | 1,414
4              | 0.1237 | (9.789, 1.200, 1.257, 6.256)         | 369.0        | 31,073
5              | 0.1236 | (10.00, 1.494, 0.814, 3.354, 6.151)  | 8,580.6      | 493,912

Conclusions

In this article, we have described an approach for the deterministic global optimization of dynamical systems, including parameter estimation and optimal control problems. This method [8,9] is based on interval analysis and Taylor models and employs a type of sequential approach. A key feature of the method is the use of a new verifying solver [10] for parametric ODEs, which is used to produce guaranteed bounds on the solutions of dynamic systems with interval-valued parameters. This is combined with techniques for domain reduction based on using Taylor models in an efficient constraint propagation scheme. The result is that problems can be solved to global optimality with both mathematical and computational certainty.

On parameter estimation problems, an exact (ε = 0) algorithm, using interval-Newton steps, can be applied at a cost comparable to, and perhaps less than, that of the ε-global algorithm. The new approach can provide significant improvements in computational efficiency, potentially well over an order of magnitude, relative to other recently described methods.

References
1. Adjiman CS, Androulakis IP, Floudas CA, Neumaier A (1998) A global optimization method, αBB, for general twice-differentiable NLPs – I. Theoretical advances. Comput Chem Eng 22:1137–1158
2. Adjiman CS, Dallwig S, Floudas CA, Neumaier A (1998) A global optimization method, αBB, for general twice-differentiable NLPs – II. Implementation and computational results. Comput Chem Eng 22:1159–1179
3. Brusch R, Schappelle R (1973) Solution of highly constrained optimal control problems using nonlinear programming. AIAA J 11:135–136
4. Chachuat B, Latifi MA (2004) A new approach in deterministic global optimisation of problems with ordinary differential equations. In: Floudas CA, Pardalos PM (eds) Frontiers in Global Optimization. Kluwer, Dordrecht
5. Corliss GF, Rihm R (1996) Validating an a priori enclosure using high-order Taylor series. In: Alefeld G, Frommer A (eds) Scientific Computing: Computer Arithmetic, and Validated Numerics. Akademie Verlag, Berlin
6. Esposito WR, Floudas CA (2000) Deterministic global optimization in nonlinear optimal control problems. J Global Optim 17:97–126
7. Esposito WR, Floudas CA (2000) Global optimization for the parameter estimation of differential-algebraic systems. Ind Eng Chem Res 39:1291–1310
8. Lin Y, Stadtherr MA (2006) Deterministic global optimization for parameter estimation of dynamic systems. Ind Eng Chem Res 45:8438–8448
9. Lin Y, Stadtherr MA (2007) Deterministic global optimization of nonlinear dynamic systems. AIChE J 53:866–875

10. Lin Y, Stadtherr MA (2007) Validated solutions of initial value problems for parametric ODEs. Appl Numer Math 58:1145–1162
11. Luus R (1990) Optimal control by dynamic programming using systematic reduction in grid size. Int J Control 51:995–1013
12. Luus R, Cormack DE (1972) Multiplicity of solutions resulting from the use of variational methods in optimal control problems. Can J Chem Eng 50:309–311
13. Makino K, Berz M (1999) Efficient control of the dependency problem based on Taylor model methods. Reliab Comput 5:3–12
14. Makino K, Berz M (2003) Taylor models and other validated functional inclusion methods. Int J Pure Appl Math 4:379–456
15. Makino K, Berz M (2005) Verified global optimization with Taylor model-based range bounders. Trans Comput 11:1611–1618
16. Moore RE (1966) Interval Analysis. Prentice-Hall, Englewood Cliffs
17. Nedialkov NS, Jackson KR, Corliss GF (1999) Validated solutions of initial value problems for ordinary differential equations. Appl Math Comput 105:21–68
18. Nedialkov NS, Jackson KR, Pryce JD (2001) An effective high-order interval method for validating existence and uniqueness of the solution of an IVP for an ODE. Reliab Comput 7:449–465
19. Neumaier A (2003) Taylor forms – use and limits. Reliab Comput 9:43–79
20. Neuman C, Sen A (1973) A suboptimal control algorithm for constraint problems using cubic splines. Automatica 9:601–613
21. Papamichail I, Adjiman CS (2002) A rigorous global optimization algorithm for problems with ordinary differential equations. J Global Optim 24:1–33
22. Papamichail I, Adjiman CS (2004) Global optimization of dynamic systems. Comput Chem Eng 28:403–415
23. Singer AB (2006) Personal communication
24. Singer AB, Barton PI (2006) Bounding the solutions of parameter dependent nonlinear ordinary differential equations. SIAM J Sci Comput 27:2167–2182
25. Singer AB, Barton PI (2006) Global optimization with nonlinear ordinary differential equations. J Global Optim 34:159–190


26. Teo K, Goh G, Wong K (1991) A Unified Computational Approach to Optimal Control Problems. Pitman Monographs and Surveys in Pure and Applied Mathematics, vol 55. Wiley, New York
27. Tjoa TB, Biegler LT (1991) Simultaneous solution and optimization strategies for parameter estimation of differential-algebraic equation systems. Ind Eng Chem Res 30:376
28. Tsang TH, Himmelblau DM, Edgar TF (1975) Optimal control via collocation and nonlinear programming. Int J Control 21:763–768

Interval Analysis: Parallel Methods for Global Optimization

ANTHONY P. LECLERC
College of Charleston, Charleston, USA

MSC2000: 65K05, 65Y05, 65Y10, 65Y20, 68W10

Article Outline

Keywords and Phrases
Introduction
Definitions
  Interval Arithmetic
  Sequential IA Branch and Bound
Formulation
  Parallel Computer Models
  PIAGO
  Workload Management
  Load Balancing
  Superlinear Speedup
Methods
  Distributed Approaches
  Centralized Approaches
  Hybrid Approaches
Conclusions
See also
References

Keywords and Phrases

Optimization; Branch and bound; Covering methods; Parallel algorithms; Mathematical programming; Interval analysis

Introduction

The ability of interval arithmetic (IA) [21,22,23,24] to automatically compute reliable solution bounds in numerical computations makes it an ideal mechanism for solving continuous nonlinear global optimization problems. To date, most efforts at developing parallel IA methods for global optimization have used the branch and bound (B&B) global search strategy [1,4,8]. The sequential B&B-based IA global optimization algorithm [10,17] executes a tree-like search process which is naturally parallelized and amenable to massive coarse-grained data parallelism (i.e., workload scalable [14]). Several noteworthy advances in parallel algorithms for global optimization using interval arithmetic have occurred over the past few years [7,15,26]. In addition, new software packages have been developed as a result of recent implementations of new or existing parallel IA global optimization algorithms [13,29]. A parallel programming language expressing a message-driven model is utilized in one implementation, resulting in a significantly different computational flow than is typical with the more classic and popular message-passing (e.g., MPI, PVM) and shared-memory (e.g., pthreads) parallel implementations [20]. Recently, the ubiquity of multi-core processor architectures has opened up new possibilities for exploiting thread-level parallelism.

In the sections that follow, a sequential (B&B) IA global optimization algorithm is presented along with relevant IA and parallel computing definitions. Next, a general formulation of a parallel IA global optimization algorithm (PIAGO) based on the B&B global search strategy is presented. In the methods section, a survey of recent algorithmic advances, novel implementations, and pertinent language and programming environments is discussed. Finally, some concluding remarks are made along with thoughts on fertile future research avenues.

Definitions

Interval Arithmetic

A "box" is an n-dimensional interval:

$$ X = \{\, \vec{x} : \underline{x}_i \le x_i \le \overline{x}_i,\ i = 0, 1, \ldots, n-1 \,\} = ([\underline{x}_0, \overline{x}_0], [\underline{x}_1, \overline{x}_1], \ldots, [\underline{x}_{n-1}, \overline{x}_{n-1}])^T = (x_0, x_1, \ldots, x_{n-1})^T. $$

Boldface letters and capital letters are used to denote interval quantities and vectors, respectively (as proposed in [17]).
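In code, this notation can be mirrored directly. The sketch below is our own (not from the article), with w(X) taken as the largest component width, one common convention:

    # a box is a list of (lo, hi) component intervals
    def midpoint(box):
        return [0.5 * (lo + hi) for lo, hi in box]

    def width(box):
        return max(hi - lo for lo, hi in box)

    X = [(0.0, 1.0), (-2.0, 2.0)]
    print(midpoint(X), width(X))   # [0.5, 0.0] and 4.0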


The midpoint of X is denoted m(X), and the width of X is denoted w(X). The greatest lower bound and the least upper bound of the interval x are denoted x̲ and x̄, respectively.

Sequential IA Branch and Bound

A canonical sequential (B&B) IA global optimization algorithm, SIAGO, iterates over a prioritized list of boxes representing the current candidate subregions (of the initial search space) for containing global minimizer(s). The prioritized list, Q, is typically implemented as a heap data structure (see Algorithm 1). In each iteration, a box X is removed from Q. If w(X) and w(f(X)) are less than the prescribed tolerances ε_x and ε_f, respectively, then X is placed on the solution list S; otherwise, X is subdivided into smaller boxes X_0, X_1, ..., X_{k-1}. Each of the k boxes X_i is subjected to a set of deletion/reduction tests, and the surviving X_i boxes are placed onto Q. The Boolean operator Delete() takes as input a box X and a floating-point number U* (the upper bound on the smallest function value known thus far), and returns TRUE if and only if one of the following tests returns TRUE:

- f̲(X) > U*;
- X is strictly feasible (i.e., does not lie on the boundary of the feasible space) and 0 ∉ ∇_i f(X) (the gradient) for some i = 0, ..., n − 1;
- X is strictly feasible and the Hessian, ∇²f(X), is not positive semidefinite anywhere in X;
- interval Newton's method can eliminate all of X.

These tests are known as the midpoint test, monotonicity test, Hessian test, and Newton test, respectively. More elaborate versions of SIAGO exist today (e.g., Newton's method box reduction, unique critical point existence tests) [10,16] but have little effect on the survey of parallel IA global optimization algorithms in Sect. "Methods".¹

¹ SIAGO efficiency can affect experimental parallel speedup measurements, as noted in Sect. "Superlinear Speedup".

Formulation

The following two facts about SIAGO (see Algorithm 1) reveal a potential for scalable parallelism. First, Delete() can be performed independently on different feasible subregions and therefore can be done in parallel.

U* = ∞; Q.insert(X)  // initial box
while true do
    repeat
        if Q.empty then S.print, Halt
        Q.remove(X)  // cut-off test
    until f̲(X) ≤ U*
    if WithinTol(X, ε_x, ε_f, U*) then
        S.insert(X)
    else
        Subdivide(X; X_0, X_1, ..., X_{k-1})
        for i = 0 to k−1 do
            if not Delete(X_i, U*) then
                U* = min(f(m(X_i)), U*)
                Q.insert(X_i)
            end
        end
    end
end

Interval Analysis: Parallel Methods for Global Optimization, Algorithm 1 SIAGO
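To make the loop concrete, here is a simplified, runnable rendition of SIAGO for a one-dimensional objective. It is our own sketch: it applies only the midpoint (cut-off) test, uses naive interval arithmetic without outward rounding, and omits the monotonicity, Hessian, and Newton tests that a full implementation would include.

    import heapq

    def f(x): return x**4 - 2*x**2          # point evaluation

    def f_interval(x):                      # enclosure of f over box x
        lo, hi = x
        if lo <= 0.0 <= hi:
            sq = (0.0, max(lo*lo, hi*hi))   # enclosure of x^2
        else:
            sq = (min(lo*lo, hi*hi), max(lo*lo, hi*hi))
        return (sq[0]*sq[0] - 2*sq[1], sq[1]*sq[1] - 2*sq[0])

    def siago(x0, eps_x=1e-6):
        u_star = f(0.5 * (x0[0] + x0[1]))   # U* from the initial midpoint
        q, sol = [(f_interval(x0)[0], x0)], []
        while q:
            lb, x = heapq.heappop(q)
            if lb > u_star:                 # midpoint (cut-off) test
                continue
            if x[1] - x[0] < eps_x:         # within tolerance: keep as solution
                sol.append(x)
                continue
            mid = 0.5 * (x[0] + x[1])       # bisect and re-enqueue survivors
            for piece in ((x[0], mid), (mid, x[1])):
                u_star = min(u_star, f(0.5 * (piece[0] + piece[1])))
                heapq.heappush(q, (f_interval(piece)[0], piece))
        return u_star, sol

    u, boxes = siago((-2.0, 2.0))
    print(u)   # close to the true global minimum, -1, attained at x = +-1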

This allows for "massive" data parallelism, as sub-boxes can be distributed across all processors. Actually, some dependence exists for the midpoint test (see U* in Algorithm 1); however, this dependence only affects the "sharpness" of this test and not its correctness. In practice, newly discovered lower U* values are shared among all participating processors via broadcasts or shared memory (see Sect. "Parallel Computer Models"). Second, if a feasible region is not deleted (and not reduced via interval Newton's method), the procedure Subdivide() will divide it into k subregions which together entirely cover the whole feasible region. The workload has just grown by k. Such workload growth makes possible workload scalability [14]. This means that the workload can scale to match the parallel computing power (i.e., CPU utilization is optimized). In fact, the workload growth of SIAGO is potentially exponential, and for high-dimensional problems it can overtake the parallel computing power and memory resources. The exponential workload growth of SIAGO is no surprise, in that the global optimization problem in general is NP-hard (i.e., no algorithm has yet been found which is better than simply performing a complete space search for the solution, requiring exponential time in the worst case).


IA allows one to remove (or reduce, using interval Newton's method) potentially large "chunks" of the search space, with the hope of pruning/squeezing one's way to a solution.

Parallel Computer Models

A parallel version of SIAGO is implementable on two basic categories of parallel computers: shared-memory multiprocessors (including the multi-core processors common today) and distributed-memory multicomputers. Although multiprocessors are easier to program than multicomputers (one does not have to worry about communication primitives), multicomputers have been the most popular choice for PIAGO for several reasons. First, multicomputers are more scalable. Second, there are freely available, robust, easy-to-use parallel programming language extensions such as Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI). Third, multicomputers are more cost effective (e.g., a simple cluster of workstations (COW) with inexpensive gigabit Ethernet). Fourth, the massive workload generated by PIAGO implementations on the hard global optimization problems that PIAGO algorithms were designed to solve keeps each processor busy working on a local subregion of the search space. If an effective workload management scheme is adopted (see Sect. "Workload Management"), CPU utilization will be maximized and communication will not be a limiting factor.

PIAGO

A generalized distributed-memory parallel IA global optimization algorithm (PIAGO) has the following form:

Initialize/start up all processors
Perform SIAGO in parallel
    manage workload
    broadcast improved U* values
Detect global termination state
Terminate all processors

Interval Analysis: Parallel Methods for Global Optimization, Algorithm 2 PIAGO
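The "broadcast improved U* values" step can be realized with a collective reduction. The following sketch is an assumed set-up of our own using mpi4py (the values are illustrative), not code from any of the surveyed packages; it would be run under mpiexec with several ranks.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    local_u_star = 10.0 - comm.Get_rank()   # stand-in for a locally found bound

    # Periodically exchanging bounds with allreduce(MIN) gives every rank
    # the smallest (best) upper bound found anywhere, to prune against.
    u_star = comm.allreduce(local_u_star, op=MPI.MIN)
    print(f"rank {comm.Get_rank()}: global U* = {u_star}")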


Workload Management

Workload in PIAGO algorithms is characterized at any given time by the set of boxes remaining to be processed or searched. PIAGO methods distinguish themselves primarily in the manner in which they manage workload (see Algorithm 2). In SIAGO, workload resides on a single priority queue of boxes, Q. In PIAGO, workload can be centrally managed on a single "master" node (processor), or it can be distributed among all nodes, with each processor managing its own local Q. Hybrid schemes can also be employed, consisting of a centrally managed global priority search queue on the master node working in concert with local search queues on each slave node.

Distributed Management

In this scheme, workload is distributed either statically to all processors at the beginning of the computation (static load balancing) or dynamically during computation (dynamic load balancing). With dynamic load balancing, processors coordinate and redistribute workload during computation in order to maximize CPU utilization and minimize total execution time. Workload state information (e.g., local search queue size) is continually (but not necessarily frequently) broadcast among all processors. Dynamic load balancing is generally scalable. However, each processor must communicate (by request, event, or at programmed time intervals) workload state information in order to make effective workload balancing decisions. Too much state information being broadcast too frequently detracts from box processing and may saturate the machine's bandwidth. Stale information concerning a processor's state risks poor load balancing decisions being made on an inaccurate depiction of the current global state.

Centralized Management

In this scheme (sometimes called master/slave), one master node is responsible for managing (scheduling) the workload. Slave (worker) nodes request work from (or are "pushed" work by) the master. The master node is responsible for scheduling the workload in a way that maximizes CPU utilization and (hopefully) minimizes total execution time. One advantage of centralized control is that workload can be prioritized globally (e.g., boxes X ordered on a priority queue based on minimum f̲(X)). Global


termination detection is easy: the computation is done when the master node has no more workload and all slave processors are idle. Load balancing is achieved through effective scheduling. Centralized workload management is not scalable in the theoretical sense. In practice, a centralized scheme is successful provided the master node does not become a "bottleneck". Communication between one master node and a large set of worker nodes can become intensive and exhaust the communication bandwidth of the parallel machine. Moreover, memory and CPU resources on one processor are limited (relative to the total CPU power and memory of the parallel machine) and can easily become saturated if stressed with too much workload or communication.

Hybrid Workload Management

Hybrid schemes allow each worker processor to manage its own local Q while still maintaining a master process responsible for handling work requests from idle processors. The benefits of the hybrid approach over the pure centralized approach are twofold:

- fewer requests for work (to the master) are required, since each worker must first complete its local workload (including self-generated workload resulting from box splitting) before it becomes idle;
- the potential memory bottleneck at the master is mitigated, since the local memory resources on each worker are utilized.

The main disadvantage of the hybrid approach versus the centralized approach is the sacrifice of total (global) workload ordering. The master node "running out of boxes" and a worker process generating too much work to be held in local memory are two other issues that need to be addressed. The main advantages of the hybrid approach compared to the distributed approach are twofold:

- a better approximation to a total workload ordering;
- fewer possible retransmissions for work, as the master node is (usually) guaranteed to have boxes.

Because the hybrid approach still uses a master node for scheduling workload, this method inherits the scalability weakness of its centralized parent.

Load Balancing

One necessary condition for load balancing is ensuring no worker processor sits idle. A second goal of load

balancing in PIAGO algorithms is the distribution of "quality" boxes among the worker processors. A quality box is defined as a box more likely to contain a minimizer (or near-minimizer). It is natural to expect that global minima will be discovered more quickly if participating processors focus their efforts on subregions of the workspace that are more likely to contain minimizers. Early improvements to the SIAGO algorithm recognized this fact and (efficiently) sorted boxes X on increasing f̲(X) using a priority search queue Q.

Superlinear Speedup

Speedup is defined as S_m = T₁/T_m, where T₁ is the sequential execution time (e.g., SIAGO on one processor) and T_m is the parallel execution time (e.g., PIAGO on m processors). Theoretically, superlinear speedup (i.e., S_m > m) of an efficient algorithm is not possible [6]. In practice, however, superlinear speedup has been reported often for B&B algorithms in general and PIAGO algorithms in particular [2,5,7,13,15,18,20,25,26]. One reason why superlinear speedup may be achieved in practice is that the sequential algorithm may be inefficient. Some of the earliest PIAGO implementations reported large superlinear speedups. For example, a superlinear speedup of 170 is reported on 32 nodes in [25]. Using a priority search queue ordered on lowest f̲(X) [9], Leclerc [18] reports only sublinear speedup, with an efficiency of approximately 1/2, for the same problem. In [2] a theorem is presented that "clearly indicates that no substantial superlinear speedup is possible, assuming that the best-first strategy is used". Here, best-first strategy refers to the same lowest-f̲(X) ordering of boxes on the search queue used by Leclerc. Note, the theorem does not claim that the best-first strategy is the best strategy to use; it only claims that if the best-first strategy is used for both the sequential version and the parallel version, then superlinear speedup is not expected. In fact, most of the superlinear speedups that have been reported recently are just slightly above linear. This can be explained by the combination of one or more of the following factors:

- high memory utilization in the sequential case may result in poor caching and possible paging, thus extending execution time;


- non-deterministic timing anomalies (race conditions) that occur in parallel executions may not have been "smoothed out" by averaging the results of many execution runs;
- the partial breadth-first search that parallelization introduces into the computation may indeed accelerate finding global solutions for some problems.

Methods

Following is a survey of PIAGO methods that have evolved over the past 15 years. Performance comparisons of the various methods based on execution times are difficult to make; for example, differences in implementation hardware, the problems being solved, and the IA software used will all affect execution times. Instead, most articles report speedup as a measure of the efficiency of the parallel algorithm. However, speedup is also dependent on several factors, including box ordering on the search queue, memory utilization, non-deterministic parallel "race condition" effects, implementation hardware, and the specific problem being solved (see Sect. "Superlinear Speedup"). For these reasons, no effort is made to compare the various algorithms with regard to reported efficiency. Various acceleration techniques or general improvements to SIAGO are not considered; it is assumed that such improvements would benefit most if not all of the methods surveyed. Finally, no discussion of global termination detection is made. Although this is an interesting topic [28], the methods (both centralized and distributed) are few, well analyzed, and not affected by the particular nature of B&B IA global optimization algorithms. Moreover, the contribution of global termination detection to the total execution time for hard problems is negligible. The key component differentiating the various PIAGO algorithms is workload management (see Sect. "Workload Management"). Each considered PIAGO algorithm is categorized as distributed, centralized, or hybrid. A discussion of the workload management scheme is given along with relevant comments concerning scalability, code complexity, and communication costs.

Distributed Approaches

As mentioned in Sect. "Distributed Management", distributed workload approaches are generally scalable. Asynchronous non-blocking communication is more efficient, but also more difficult to program. By either interleaving message probing (e.g., MPI_Iprobe) within the main computation loop (see Algorithm 1) or dedicating a separate thread to the task of receiving messages, one can use efficient non-blocking communication in the approaches that follow. No further discussion of synchronous versus asynchronous communication is made.

Let P₀, P₁, ..., P_{m−1} represent m processors on a parallel machine. Let W̃₀, W̃₁, ..., W̃_{m−1} represent recorded workload state information for each processor. A given processor can query the (approximate) current workload queue size or minimum f̲(X) on processor j using W̃_j.Qsize or W̃_j.Qlbf (the lower bound on the function over all boxes in the queue), respectively.


The Leclerc Approach  This approach [18,19] is fully distributed and utilizes the best-first queuing strategy. It uses the load balancing procedure listed in Procedure loadbalance, with the function WorkloadBalanced returning FALSE precisely when the processor's Q is empty (i.e., no work), which triggers a request. This is a simple demand-driven load balancing scheme. The lowest f̲(X) value for the boxes on each processor's local Q is broadcast at regular intervals to all processors and recorded in W̃.

// Load balance on processor P_i
if not WorkloadBalanced(Q, W̃) then
    E = {i}
    repeat
        choose b ∉ E minimizing W̃_b.Qlbf
        request a fraction of boxes from P_b
        if no boxes received then E = E ∪ {b}
    until boxes received
end

Interval Analysis: Parallel Methods for Global Optimization, Procedure loadbalance(i)
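The donor-selection step of Procedure loadbalance can be sketched as follows (our own simplification; the state layout is illustrative):

    def pick_donor(state, excluded):
        """state: dict rank -> {'Qsize': int, 'Qlbf': float}; returns the
        non-excluded, non-empty rank whose queue looks most promising."""
        candidates = {j: s for j, s in state.items()
                      if j not in excluded and s["Qsize"] > 0}
        if not candidates:
            return None
        return min(candidates, key=lambda j: candidates[j]["Qlbf"])

    state = {0: {"Qsize": 0,  "Qlbf": float("inf")},
             1: {"Qsize": 12, "Qlbf": -3.2},
             2: {"Qsize": 4,  "Qlbf": -5.0}}
    print(pick_donor(state, excluded={0}))   # -> 2, the best-quality queue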

The Hu, Kearfott, Xu, and Yang Approach This approach [13] is similar to the one used by Leclerc, but


with an initial static assignment of one box to each processor on startup. One box is requested, instead of a fraction of boxes, when a processor becomes idle. From the paper it is unclear to which processor(s) a request for workload is made; it is also unclear whether workload state information, W̃, is maintained.

The Caprani and Madsen Approach  This simple, yet promising approach [3] uses static load balancing rather than dynamic load balancing. First, a "good" U* is computed on one processor. Next, a "sufficient number" (e.g., 10m) of sub-boxes is generated using SIAGO and placed into m sets of "approximately equal difficulty". The m sets, along with U*, are statically distributed onto the m processors. SIAGO is then performed on each processor with no communication.

The Eriksson and Lindström Approach  Here [5], load balancing is considered on a specialized parallel computer, an Intel iPSC/2 hypercube. No workload state information, W̃, is maintained. In order to load balance qualitatively as well as quantitatively, a hybrid of two load balancing strategies is used: receiver-initiated and sender-initiated. The receiver-initiated load balancer is conceptually similar to Procedure loadbalance, but rather than a selection based on min(W̃_j.Qlbf), an un-prioritized linear search (for a non-idle node) along a ring is performed. This ensures no processor stays idle for very long. The sender-initiated load balancer seeks to balance qualitatively. Here, the "best" box on the Q (i.e., the one with the lowest f̲(X)) is "pushed" to a random processor each time G boxes have been split. The push period G on a particular processor is decremented by one when the pushed box gets placed at the front of the Q of the randomly selected processor; otherwise, G is incremented by one. The net effect is that if truly "good" boxes are being pushed, then they will continue to be pushed at a high frequency; otherwise, pushes will occur less often (a small sketch of this adaptive rule follows).
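The adaptive push period just described can be captured in a few lines; the feedback rule below is our reading of the scheme, not code from [5].

    def update_push_period(G, landed_at_front, G_min=1):
        # pushes that prove useful (land at the front of the receiver's
        # priority queue) make future pushes more frequent; useless
        # pushes make them rarer
        return max(G_min, G - 1) if landed_at_front else G + 1

    G = 8
    for feedback in [True, True, False, True]:
        G = update_push_period(G, feedback)
        print(G)   # 7, 6, 7, 6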

The Gau and Stadtherr Approach  Here [7], two fundamental algorithms are proposed. First is the synchronous work stealing (SWS) approach, which is very similar to the approaches of Hu, Kearfott, Xu, and Yang, and of Leclerc; the difference is that the largest Q length is used instead of the lowest f̲(X). Next, an asynchronous diffusive load balancing (ADLB) scheme is proposed. A group of "nearest neighbors" is defined, and neighbors exchange workload information. Boxes are then either "pushed" or "pulled" to/from neighbors depending on workload distribution inequities, as determined by each processor. The mechanism is analogous to heat or mass diffusion. In theory this approach should be able to handle qualitative issues regarding workload; however, this is not considered in the paper.

The Martínez, Casado, Alvarez, and García Approach  This recent approach [20] is most novel for its implementation language, Charm++. The execution model of Charm++ is message-driven (i.e., the arrival of messages "triggers" associated computations); this model is similar to a data flow machine. Essentially, a process (chare) runs on each processor. This process responds to (is triggered by) messages to either process a box (Process-Box) or update U* (update-U*). A Process-Box message can either:

- reject the box, with no messages generated;
- subdivide the box, generating two Process-Box messages sent to two random processors; or
- send a message to the main chare to enqueue a new solution.

Messages can be prioritized so that update-U* messages take precedence over Process-Box messages, which should help improve the efficiency of the parallel algorithm. Also, Process-Box messages can be prioritized on lowest f̲(X) in order to load balance qualitatively. Data flow solutions are truly elegant: load balancing, quantitatively and qualitatively, is achieved via randomness and built-in message prioritization.

Centralized Approaches

The Henriksen and Madsen Approach  An early implementation of a PIAGO algorithm using a centralized workload manager is that of Henriksen and Madsen [11]. A master node maintains the priority workload queue, Q, and schedules work to each slave processor. When a slave node splits a box, it keeps only one box and sends the remainder back to the master, to be inserted into Q. U* is also maintained at the master. The algorithm is load balanced (both quantitatively and qualitatively) and has the advantage of total ordering of boxes.


I

ing of boxes. However, its weakness is poor scalability. The master quickly becomes a memory bottleneck and communication “hotspot” on parallel machines with 32 or more nodes [2]. To be fair, however, such an algorithm is better suited to shared memory multiprocessors, and in particular, multi-core processors (e. g. AMD Opteron, Intel Core 2 Duo). Though multi-core processors don’t offer as great an opportunity for massive parallelism (usually 16 cores or less on a processor), they are ubiquitous today and inexpensive. Therefore, one can envision Henriksen and Madsen’s approach being used on a distributed memory multicomputer in which the individual processors are multi-core. A more scalable algorithm would be used on the multicomputer architecture as a whole, but the centralized approach could be used as a multithreaded PIAGO application running on each multi-core processor. The advantage of this hierarchical workload management approach is a better approximation to the bestfirst strategy. In addition, more efficiency would be obtained with the centralized implementation on each multi-core processor, since shared memory is faster than message passing. The main disadvantage would be code complexity.

lower f (X) “challenge” the current leader. The current leader makes a determination as to the next leader and broadcasts the index of the new leader along with the improved f (X) to all processors. In this approach, no effort is made to approximate a totally ordered global Q. Rather, the approach seeks only to ensure that work requests are made to the processor with the best quality boxes.

Hybrid Approaches

The pure centralized workload management scheme is clearly impractical to implement on large distributed memory multicomputers due to issues of scalability. Fully distributed algorithms are scalable but some would question their efficiency based on concerns that the following phenomena may significantly impact performance:  frequent broadcasting of workload state  repeated retransmissions for workload due to idle Pb in Procedure loadbalance  a global best-fit exploration of boxes is not being performed (i. e. perhaps the best quality boxes are not being evenly distributed). Hybrid methods were apparently developed to resolve one or more of the perceived deficiencies of distributed methods and the scalability problem of the pure centralized method. Though hybrid methods have reduced bottleneck potential, they still suffer from poor scalability. A closer examination of the apparent deficiencies of the distributed methods is worth making. Efficient (up

As was mentioned in Sect. “Centralized Approaches”, pure centralized approaches, though offering total ordering of the workload Q, are not scalable. Hybrid approaches are theoretically not scalable either. However, some of the scalability issues are mitigated by leveraging local memory on worker processors. Three hybrid approaches are considered. The Berner Approach Here [2], a master node handles requests from idle processors. A dynamically adjusted variable, max, is used to “throttle” the workload on the worker processors as well as help ensure the master does not run out of work. Processors with more than max boxes on the local Q will send “some of them” to the master. The Ibraev Approach This approach [15] is a variation on the Berner approach, with the master (leader) node continually “floating” to the processor that discovers a better f (X). Workers discovering a possibly

The Tapamo and Frommer Approach Tapamo and Frommer [26] propose a variation of the Berner approach which allows non-idle processors to serve requests. The master node keeps track of the lengths of each processor’s local Q. When one or more processors become idle, the master then instructs non-idle processors (in decreasing order of Q length) to concurrently satisfy requests from idle processors. Workload state information (i. e. local Q sizes) must be sent to the master at some frequency. The same issues regarding this frequency are present in the various distributed approaches (See Sect. “Distributed Management”). Delay is introduced due to the indirection of requests having to “pass” through the master node. Conclusions

1715

1716

I

Interval Analysis: Parallel Methods for Global Optimization

to practically constant-time) broadcast primitives have been implemented [27,12]. Thus, it would seem, that frequent broadcasting of workload state may not significantly effect performance. Moreover, the frequency of broadcast can easily be throttled if required. A good estimate of the workload state on each processor for large problems is reasonable to expect. Thus, a high probability exists that the first or possibly second request will fall on a non-idle processor with “good” work. Retransmissions may in fact be few. Finally, a global best-fit exploration of boxes is not being performed using distributed schemes. However, such a totally ordered exploration is not being done using any of the hybrid methods either. An argument claiming hybrid methods yield better approximations to a global ordering is difficult to make. A complete and fair assessment of the various PIAGO algorithms (in particular distributed methods versus hybrid methods) should cover a wide range of difficult global optimization test problems. The same efficient SIAGO algorithm (e. g. using best-first ordering) should be used in each and a common hardware platform should be utilized. Furthermore, multiple runs of each test case should be run and averaged in order to “smooth out” non-deterministic parallel computation effects. To date no such comprehensive analysis has been performed. See also  Interval Analysis: Intermediate Terms  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods  Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility  Interval Global Optimization  Interval Newton Methods References 1. Bader DA, Hart WE, Phillips CA (2004) Parallel Algorithm Design for Branch and Bound. In: Greenberg H (ed) Tutorials on Emerging Methodologies and Applications in Operations Research, pp 1–44 2. Berner S (1996) Parallel methods for verified global optimization practice and theory. J Global Optim 9(1):1–22

3. Caprani O, Madsen K (1998) An Almost Embarrassingly Parallel Interval Global Optimization Method
4. Crainic T, Cun BL, Roucairol C (2006) Parallel Branch and Bound Algorithms. In: Talbi E-G (ed) Parallel Combinatorial Optimization, Chap 1. Wiley, New York, pp 1–28
5. Eriksson J, Lindstrom P (1995) A parallel interval method implementation for global optimization using dynamic load balancing. Reliab Comput 1(1):77–92
6. Faber V, Lubeck O, White A (1986) Superlinear Speedup of an Efficient Sequential Algorithm is Not Possible. Parallel Comput 3:259–260
7. Gau C, Stadtherr M (2001) Parallel Interval-Newton Using Message Passing: Dynamic Load Balancing Strategies. In: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, pp 23–23
8. Gendron B, Crainic TG (1994) Parallel Branch-And-Bound Algorithms: Survey and Synthesis. Oper Res 42(6):1042–1066
9. Hansen ER (1992) Global Optimization Using Interval Analysis. Marcel Dekker, New York
10. Hansen ER, Walster GW (2004) Global Optimization Using Interval Analysis, 2nd edn. CRC Press, Boca Raton
11. Henriksen T, Madsen K (1992) Use of a depth-first strategy in parallel global optimization. Technical Report 92-10, Institute for Numerical Analysis, Technical University of Denmark, Lyngby
12. Hoefler T, Siebert C, Rehm W (2007) A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast. In: Parallel and Distributed Processing Symposium, 2007, pp 1–8
13. Hu C, Kearfott B, Xu S, Yang X (2000) A parallel software package for nonlinear global optimization
14. Hwang K (1993) Advanced Computer Architecture. McGraw-Hill, New York
15. Ibraev S (2002) A new parallel method for verified global optimization. Proc Appl Math Mech PAMM 1(1):470–471
16. Kearfott RB (1996) A Review of Techniques in the Verified Solution of Constrained Global Optimization Problems. In: Kearfott RB, Kreinovich V (eds) Applications of Interval Computations. Kluwer, Dordrecht, pp 23–59
17. Kearfott RB (1996) Rigorous Global Search: Continuous Problems. Kluwer, Dordrecht
18. Leclerc A (1993) Parallel interval global optimization in C++. Interval Comput 3:148–163
19. Leclerc A (2001) Interval Analysis: Parallel Methods for Global Optimization. Encycl Optim 3:23–30
20. Martinez J, Casado L, Alvarez J, Garcia I (2006) Interval Parallel Global Optimization with Charm++. In: Dongarra J, Madsen K, Wasniewski J (eds) PARA'04 State-of-the-Art in Scientific Computing, pp 161–168
21. Moore RE (1962) Interval Arithmetic and Automatic Error Analysis in Digital Computing. Applied Mathematics and Statistics Laboratories Technical Report, Stanford University
22. Moore RE (1966) Interval Analysis. Prentice-Hall, Englewood Cliffs


23. Moore RE (1979) Methods and Applications of Interval Analysis. SIAM Studies in Applied Mathematics. SIAM, Philadelphia
24. Moore RE (1988) Reliability in Computing. Academic Press. See especially the papers by Hansen E, pp 289–308; Walster GW, pp 309–324; Ratschek H, pp 325–340; and Lodwick WA, pp 341–354
25. Moore RE, Hansen ER, Leclerc AP (1992) Recent Advances in Global Optimization. Princeton University Press, Princeton, pp 321–342
26. Tapamo H, Frommer A (2007) Two acceleration mechanisms in verified global optimization. J Comput Appl Math 199(2):390–396
27. Tinetti F, Barbieri A (2003) An efficient implementation for broadcasting data in parallel applications over Ethernet clusters. Advanced Information Networking and Applications, pp 593–596
28. Topor RW (1984) Termination Detection for Distributed Computations. Inform Process Lett 18(1):33–36
29. Zilinskas J (2005) A Package for Development of Algorithms for Global Optimization. In: Proceedings of the 10th International Conference MMA2005&CMAM2, pp 185–190

Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods

TIBOR CSENDES
University of Szeged, Szeged, Hungary

MSC2000: 65K05, 90C30

Article Outline

Keywords and Phrases
Subdivision Directions
Properties of Direction Selection Rules
Theoretical Properties
Numerical Properties
References

Keywords and Phrases

Branch-and-bound; Interval arithmetic; Optimization; Subdivision direction

The selection of the subdivision direction is one of the points where the efficiency of the basic  branch-and-bound algorithm for unconstrained global optimization can be improved (see  Interval analysis: unconstrained and constrained optimization). The traditional approach is to choose for subdivision the direction in which the actual box has the largest width. If the inclusion function φ(x) is the only available information about the problem

  min_{x ∈ x^0} φ(x),

then this is usually the best possible choice. If, however, other information such as an inclusion of the gradient (∇φ), or even an inclusion of the Hessian (H), is calculated, then a better decision can be made.

Subdivision Directions

All the rules select a direction with a merit function:

  k := arg max_{i=1,...,n} D(i),   (1)

where D(i) is determined by the given rule. If several such optimal indices k exist, then the algorithm can choose the smallest one, or it can select an optimal direction randomly.

Rule A. The first rule was the interval-width oriented rule. This rule chooses the coordinate direction with

  D(i) := w(x_i).   (2)

This rule is justified by the idea that, if the original interval is subdivided in a uniform way, then the width of the actual subintervals goes to zero most rapidly. The algorithm with Rule A is convergent both with and without the monotonicity test [8]. This rule allows a relatively simple analysis of the convergence speed (as in [8], Chapter 3, Theorem 6).

Rule B. E. Hansen described another rule (initiated by G. W. Walster [5]). The direct aim of this heuristic direction selection rule is to find the component for which

  W_i = max_{t ∈ x_i} φ(m_1, ..., m_{i−1}, t, m_{i+1}, ..., m_n) − min_{t ∈ x_i} φ(m_1, ..., m_{i−1}, t, m_{i+1}, ..., m_n)

is the largest (where m_i is the midpoint of the interval x_i). The factor W_i, which should reflect how much φ varies as x_i varies over x_i, is then approximated by w(∇φ_i(x)) w(x_i) (where ∇φ_i(x) denotes the ith component of ∇φ(x)). The latter is not an upper bound for W_i (cf. [5], page 131, and Example 2 in Section 3 of [4]), yet it can be useful as a merit function. Rule B selects the coordinate direction for which (1) holds with

  D(i) := w(∇φ_i(x)) w(x_i).   (3)

It should be noted that the basic bisection algorithm represents only one way in which Rule B was applied in [5]. There the subdivision was, e. g., also carried out for many directions in a single iteration step.

Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods, Figure 1
Remaining subintervals after 250 iteration steps of the model algorithm with the direction selection Rules A and B for the Three-Hump-Camel-Back problem [3]

Rule C. The next rule was defined by Ratz [9]. The underlying idea was to minimize the width of the inclusion:

  w(φ(x)) = w(φ(x) − φ(m(x))) ≤ w(∇φ(x)(x − m(x))) = Σ_{i=1}^{n} w(∇φ_i(x)(x_i − m(x_i))).

Obviously, that component is to be chosen for which the term w(∇φ_i(x)(x_i − m(x_i))) is the largest. Thus, Rule C can also be formulated with (1) and

  D(i) := w(∇φ_i(x)(x_i − m(x_i))).   (4)

The important difference between (3) and (4) is that in Rule C the width of the multiplied intervals is maximized, not the multiplied widths of the respective intervals (and these are in general not equal). After a short calculation, the right-hand side of (4) can be written as max{|min ∇φ_i(x)|, |max ∇φ_i(x)|} w(x_i). This corresponds to the maximum smear defined by R.B. Kearfott (used as a direction selection merit function for solving systems of nonlinear equations [6,7]) for the case φ: R^n → R. It is easy to see that Rules B and C give the same merit function value if and only if either the lower or the upper bound of ∇φ_i(x) is zero.

Rule D. The fourth rule, Rule D, is derivative-free like Rule A, and reflects the machine representation of the inclusion function φ(x) (see [5]). It is again defined by (1) and by

  D(i) := w(x_i)            if 0 ∈ x_i,
  D(i) := w(x_i) / <x_i>    otherwise,   (5)

where <x> is the mignitude of the interval x: <x> := min_{x ∈ x} |x|. This rule may decrease the excess width w(φ(x)) − w(φ_u(x)) of the inclusion function (where φ_u(x) is the range of φ on x) that is caused in part by the floating point computer representation of real numbers. Consider the case when the component widths are of similar order, and the absolute value of one component is dominant. The subdivision of the latter component may result in a worse inclusion, since the representable numbers are sparser in this direction.

Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods

Rule E. Similar to Rule C, the underlying idea of Rule E is to minimize the width of the inclusion, but this time based on second order information (suggested by Ratz [10]):

I

Rule C 3

D(i) :D w((x i  m(x i ))(r i (m(x)) n

1X C (H i j (x)(x i  m(x i ))) : (6) 2 jD1

0

Many interval optimization codes use  automatic differentiation to produce the gradient and Hessian values. For such an implementation the subdivision selection Rule E requires not much overhead. Properties of Direction Selection Rules Both the theoretical and numerical properties of subdivision direction selection rules have been studied extensively [1,3,4,10,11]. The exact definitions, theorems and details of numerical comparison tests can be found in these papers. Denote the global minimum value by   .
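To make the branching decision concrete, the following is a minimal Python sketch (illustrative only; the tiny interval class and the example gradient enclosure below are assumptions, not part of this article) of the merit-function selection (1) for Rules A, B, and C:

import math

class Interval:
    # minimal interval type for the sketch; no outward rounding
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def width(self):                      # w(x)
        return self.hi - self.lo
    def mid(self):                        # m(x)
        return 0.5 * (self.lo + self.hi)
    def __sub__(self, other):             # interval subtraction
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):             # interval multiplication
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))

def select_direction(box, grad, rule):
    """k = argmax_i D(i) for a box (list of Intervals) and a gradient
    inclusion grad (list of Intervals), per Rules A, B, C."""
    def D(i):
        if rule == "A":                   # (2): D(i) = w(x_i)
            return box[i].width()
        if rule == "B":                   # (3): D(i) = w(grad_i(x)) w(x_i)
            return grad[i].width() * box[i].width()
        if rule == "C":                   # (4): D(i) = w(grad_i(x)(x_i - m(x_i)))
            m = box[i].mid()
            return (grad[i] * (box[i] - Interval(m, m))).width()
        raise ValueError(rule)
    return max(range(len(box)), key=D)    # ties: smallest index wins

# Example box, with a purely hypothetical gradient enclosure over it:
box = [Interval(-3.0, 3.0), Interval(-1.0, 1.0)]
grad = [Interval(-40.0, 40.0), Interval(-8.0, 8.0)]
for r in ("A", "B", "C"):
    print("Rule", r, "-> split coordinate", select_direction(box, grad, r))

Because Python's max returns the first maximizer, ties are broken in favor of the smallest index, matching the convention stated after (1).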

Properties of Direction Selection Rules

Both the theoretical and the numerical properties of subdivision direction selection rules have been studied extensively [1,3,4,10,11]. The exact definitions, theorems, and details of the numerical comparison tests can be found in these papers. Denote the global minimum value by φ*.

Theoretical Properties

In [4] the property of balanced direction selection has been defined. A subdivision direction selection rule is balanced, basically, if the B&B algorithm with this direction selection rule is not unfair to any coordinate direction: each direction is selected an infinite number of times in each infinite subdivision sequence of the leading boxes generated by the optimization algorithm. A global minimizer point x′ ∈ x^0 is called a hidden global minimizer point if there exists a subbox x̃ ⊆ x^0 with positive volume for which x′ ∈ x̃ and the lower bound of the inclusion φ(x̃) equals φ*, while there exists another global minimizer point x″ of the same problem such that the lower bound of φ(x̃″) is below φ* for each subbox x̃″ ⊆ x^0 with positive volume that contains x″ [11]. Now the following statements can be made:
1. The basic branch-and-bound algorithm converges in the sense that lim_{s→∞} w(x_s) = 0 if and only if the interval subdivision selection rule is balanced [4] (where x_s is the leading box of the algorithm in iteration step s).
2. Assume that the subdivision direction selection rule is balanced. Then the basic B&B algorithm converges to global minimizer points in the sense that lim_{s→∞} φ(x_s) = φ*, the set of accumulation points A of the leading box sequence is not empty, and A contains only global minimizer points.
3. Assume that the optimization algorithm converges for a given problem in the sense that lim_{s→∞} φ(x_s) = φ*. Then either the algorithm proceeds on the problem as one with a balanced direction-selection rule, or there exists a box y such that φ(x) = φ* for all x ∈ y, and w(y_i) > 0 (i = 1, 2, ..., n) for all coordinate directions that are selected only a finite number of times.
4. The subdivision selection Rules A and D are balanced, and thus the related algorithms converge to global minimizer points.
5. Either the subdivision selection Rules B and C choose each direction an infinite number of times (they then behave as balanced), or the related algorithms converge to a positive-width subinterval of the search region x^0 that contains only global minimizer points.
6. Sonja Berner proved that the basic algorithm is convergent with Rule E in the sense that lim_{s→∞} φ(x_s) = φ*, if an additional condition holds for the inclusion function [1].
7. If the branch-and-bound algorithm with any of the direction selection Rules A–E converges to a global minimizer point, then it converges to all non-hidden global minimizer points [11].

Numerical Properties

The numerical comparison tests were carried out on a wide set of test problems and in several computational environments. The set of numerical test problems contained the standard global optimization test problems [3,4], the set of problems studied in [5], and also some additional ones [10,11]. The computing environments included IBM RISC 6000-580 and HP 9000/730 workstations and Pentium PCs. The programs were coded in FORTRAN-90, PASCAL-XSC, and also in C++. The tests were carried out both with simple natural interval extension and with more sophisticated inclusion functions involving centered forms. The derivatives were hand-coded in some tests [4], while they were generated by automatic differentiation in the others [3,10,11]. The range of the investigated algorithms included simple B&B procedures and also optimization codes with many acceleration devices (such as the  interval Newton method). The conclusions were essentially the same: Rules B, C, and E showed similar, substantial efficiency improvements over Rules A and D, and these improvements were the greater the more difficult the solved problem was. The average performance of Rule D was the worst.

Rule C was usually the best, closely followed by Rules B and E. It seems that the use of Rule E is justified only if the second derivatives are calculated for other purposes anyway. The numerical results were diverse; thus, if the user has a characteristic problem set, then it is worth testing all the subdivision direction selection rules to find the most fitting one. A computationally intensive numerical study [2] has shown that the most efficient subdivision direction selection rules are not those that minimize the width of the objective function inclusions for the resulting subintervals (which was the common belief), but those that maximize the lower bound of the worse subinterval obtained, or that minimize the width of the intersection of the resulting subintervals. The decisions of these a posteriori rules coincide most with those of the a priori Rules B, C, and E. These findings confirm the numerical efficiency results mentioned above.

References

1. Berner S (1996) New results on verified global optimization. Computing 57:323–343
2. Csendes T, Klatte R, Ratz D (2000) A Posteriori Direction Selection Rules for Interval Optimization Methods. CEJOR 8:225–236
3. Csendes T, Ratz D (1996) A review of subdivision direction selection in interval methods for global optimization. ZAMM 76:319–322
4. Csendes T, Ratz D (1997) Subdivision direction selection in interval methods for global optimization. SIAM J Numer Anal 34:922–938
5. Hansen E (1992) Global optimization using interval analysis. Marcel Dekker, New York
6. Kearfott RB (1996) Rigorous global search: continuous problems. Kluwer, Dordrecht
7. Kearfott RB, Novoa M (1990) INTBIS, a Portable Interval Newton/Bisection Package. ACM Trans Math Softw 16:152–157
8. Ratschek H, Rokne J (1988) New Computer Methods for Global Optimization. Ellis Horwood, Chichester
9. Ratz D (1992) Automatische Ergebnisverifikation bei globalen Optimierungsproblemen. Dissertation, Universität Karlsruhe
10. Ratz D (1996) On Branching Rules in Second-Order Branch-and-Bound Methods for Global Optimization. In: Alefeld G, Frommer A, Lang B (eds) Scientific Computing and Validated Numerics. Akademie-Verlag, Berlin, pp 221–227
11. Ratz D, Csendes T (1995) On the Selection of Subdivision Directions in Interval Branch-and-Bound Methods for Global Optimization. J Glob Optim 7:183–207


Interval Analysis: Systems of Nonlinear Equations

RAMON E. MOORE
Worthington, USA

MSC2000: 65G20, 65G30, 65G40

Article Outline

Keywords
Numerical Example
See also
References

Keywords

Interval Newton operator; Nonlinear equations

A system of nonlinear equations can be represented in vector form as f(x) = 0, where the components are f_i(x) = f_i(x_1, ..., x_n) = 0, i = 1, ..., n. Sometimes we seek one solution; sometimes we are interested in locating all solutions.

A naive interval approach can be used to search a box (an interval vector) V for solutions. Using repeated bisections in various coordinate directions, we can chisel off parts of V that cannot contain a solution. That is, if f(W) does not contain the zero vector for some W in V, then we can delete W as containing no solutions to f(x) = 0. The remaining parts of V contain all the solutions, if any, that were in the initial V. For differentiable systems, there are much more efficient methods for finding a solution or the set of all solutions. Even so, the naive approach does have its uses. In practice it often pays to combine a number of techniques.

One approach to solving f(x) = 0 is to formulate an equivalent fixed-point problem, and use iterative methods to solve it. We can define

  g(x) = x + Y f(x)

for any linear mapping Y. If Y is nonsingular, then f(x) = 0 is equivalent to g(x) = x. If g is continuous and S is a compact, convex subset of R^n, and g maps S into itself, then g has a fixed point in S, and so f(x) = 0 has a solution in S. An interval vector V is a compact, convex set, so g(V) ⊆ V implies f(x) = 0 for at least one point x in V.

Classical iterative methods consider sequences of points generated by

  x^(k+1) = g(x^(k)),

starting from some initial point x^(0). If we denote the Jacobian matrix for the system by f′(x), then choosing Y = −[f′(x)]^{−1}, we obtain Newton's method. If we take Y as an approximation to −[f′(x)]^{−1}, then we obtain a Newton-like method. Interval versions of Newton's method, however, also involve intersections, as we will see.

An interval Newton method for finite systems of nonlinear equations was introduced by R.E. Moore [11,12]. Subsequently, many improvements have been made, e. g., [4,6,7,8,10,13,16,17,18]. In order to explain as clearly as possible, consider the one-dimensional case. We have the mean value theorem for continuously differentiable f:

  f(x) = f(x^(0)) + f′(ξ)(x − x^(0))

for some ξ between x^(0) and x. We have f(x) = 0 if x satisfies

  x = x^(0) − [f′(ξ)]^{−1} f(x^(0)).

Now the ordinary Newton method replaces the unknown ξ by x^(0). The initial idea was to use an interval for ξ and use interval computation throughout the iterations. If we start with an interval, say X^(0), that contains x^(0) and happens to also contain a solution, say x*, of f(x) = 0, then X^(0) also contains ξ, and therefore x* is contained in

  N(X^(0)) = x^(0) − [f′(X^(0))]^{−1} f(x^(0))

(N for Newton), where f′(X^(0)) ⊇ {f′(x): x ∈ X^(0)}. The first idea was to iterate X^(k+1) = N(X^(k)), but this turns out not to converge in all cases. Then the following idea was proposed [12]. Since a solution x* in X^(0) is also in N(X^(0)), it follows that x* is also contained in the intersection X^(0) ∩ N(X^(0)). Therefore, we iterate

  X^(k+1) = X^(k) ∩ N(X^(k))

with

  N(X^(k)) = y^(k) − [f′(X^(k))]^{−1} f(y^(k)),

choosing y^(k) in X^(k), say the midpoint of X^(k). With this modification, the interval Newton method does what we want, as will be explained.

From the above arguments we have proved that, for an interval X:
1) If N(X) ∩ X is empty, then there is no solution in X.
If we divide by an interval containing zero, we may obtain one, or the union of two, semi-infinite intervals for N(X). The intersection with the finite interval X in the interval Newton method reduces the result to a finite interval, or the union of two finite intervals, or the empty set. During the iterations, if X^(k) turns out to be the union of two intervals, we put one on a list and proceed to further iterate with the other one. This idea was first presented in E.R. Hansen [6]. We can also prove that [5,6]:
2) If N(X) ⊆ X, then there exists a unique solution in N(X).
The existence follows for the compact, convex interval X, and from the continuity of f′. The uniqueness follows from the boundedness of N(X) ⊆ X. If there were two solutions in N(X), then f′(y) would be zero for some y in X, and N(X) would be unbounded. If f is twice continuously differentiable, then we can also prove the following [16]:
3) If N(X) ⊆ X, then the interval Newton method converges quadratically to the unique solution in X, as does the ordinary Newton method from any starting point in X.
'Quadratically' here means there is a constant C such that w(X^(k+1)) < C w(X^(k))², k = 1, 2, ..., where w(X) denotes the width of an interval X; thus, w([a, b]) = b − a.

Numerical Example

We illustrate the different behaviors of the ordinary Newton method and the interval Newton method in the following figures. Fig. 1 shows that the ordinary Newton method cannot find the middle solution unless we start very close to it. The first three iterations of the ordinary Newton method

  x_{k+1} = x_k − f(x_k)/f′(x_k)

Interval Analysis: Systems of Nonlinear Equations, Figure 1
The ordinary Newton method

for f(x) = x³ − x + 0.2, starting with x^(0) = −0.375, are shown in Fig. 1. The algorithm produces x^(1) = 0.528..., x^(2) = −0.584..., x^(3) = −22.569.... In order to converge to the middle root, we need an initial guess x^(0) very close to that root. The interval Newton method finds all three solutions on the starting interval X^(0) = [−1.2, 1.2] without difficulty. We choose that starting interval because the roots of a polynomial p(z) = a_n zⁿ + ... + a_1 z + a_0 with a_n ≠ 0 are well known to lie in the complex disk

  |z| ≤ max{ 1, (|a_0| + |a_1| + ... + |a_{n−1}|) / |a_n| },

which for this f gives |z| ≤ (0.2 + 1)/1 = 1.2.
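The following is a minimal Python sketch of the one-dimensional interval Newton iteration just described, applied to this example. It is illustrative only: outward rounding is omitted, division by an interval containing zero is handled by the gap construction described above, and bisection is used as a fallback when a step makes little progress.

import math

def f(t):
    return t**3 - t + 0.2

def dfr(lo, hi):
    # enclosure of f'(t) = 3 t**2 - 1 over [lo, hi]
    hi2 = max(lo * lo, hi * hi)
    lo2 = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return 3.0 * lo2 - 1.0, 3.0 * hi2 - 1.0

def newton_pieces(lo, hi):
    # one step: the pieces of X intersect N(X), N(X) = y - f(y)/f'(X)
    y = 0.5 * (lo + hi)
    fy = f(y)
    dlo, dhi = dfr(lo, hi)
    if dlo > 0.0 or dhi < 0.0:            # 0 not in f'(X): ordinary division
        a, b = sorted((y - fy / dlo, y - fy / dhi))
        a, b = max(lo, a), min(hi, b)
        return [(a, b)] if a <= b else []
    if dlo == 0.0 or dhi == 0.0:          # degenerate endpoint: just bisect
        return [(lo, y), (y, hi)]
    # 0 strictly inside f'(X): N(X) excludes an open gap around y
    g1, g2 = sorted((y - fy / dlo, y - fy / dhi))
    pieces = []
    if lo <= min(hi, g1):
        pieces.append((lo, min(hi, g1)))
    if max(lo, g2) <= hi:
        pieces.append((max(lo, g2), hi))
    return pieces

work, roots = [(-1.2, 1.2)], []
while work:
    a, b = work.pop()
    if b - a < 1e-10:
        roots.append(0.5 * (a + b))       # candidate enclosure (width < tol;
        continue                          # nearby duplicates are possible)
    pieces = newton_pieces(a, b)
    if len(pieces) == 1 and pieces[0][1] - pieces[0][0] > 0.9 * (b - a):
        m = 0.5 * (a + b)                 # little progress: bisect instead
        work += [(a, m), (m, b)]
    else:
        work += pieces
print(sorted(roots))  # three roots, near -1.0895, 0.2097, 0.8795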

Interval Analysis: Unconstrained and Constrained Optimization

Suppose φ̄ is a known upper bound on the global optimum, and suppose the lower bound of the interval evaluation φ(x) over a box x exceeds φ̄. Then there cannot be any global optimizers of φ within x. The value φ̄ can be obtained through an interval function value. This process is illustrated in the following figure.

Interval Analysis: Unconstrained and Constrained Optimization, Figure 1
The midpoint test: rejecting x̃ because of a high objective value

The lower bound for the objective over the box x need not be obtained via interval computations. Indeed, if a Lipschitz constant L_x for φ is known over x, and φ(x̌) is known for x̌, the center of x, then, for any x̃ ∈ x,

  φ(x̃) ≥ φ(x̌) − (1/2) L_x ‖w(x)‖,

where w(x) is the vector of widths of the components of the interval vector x. However, getting rigorous bounds on Lipschitz constants can require more human effort than the interval computation, and often results in bounds that are not as sharp as those from interval computation. (However, heuristically obtained approximate Lipschitz constants, as employed in the calculations in [4], have been highly successful at solving practical problems, albeit not rigorously.) Similarly, automated computations for Lipschitz constants as presently formulated result in bounds that are provably not as sharp as interval computations. Furthermore, use of properly rounded interval arithmetic, if used both in computing φ̄ and φ(x), allows one to conclude with mathematical rigor that there are no global optima of φ within x.

Use of this lower bound for φ is sometimes called the midpoint test, since the points x at which φ(x) is evaluated are often taken to be the vectors of midpoints of the boxes x produced during the subdivision process. (Actually, some implementations use the output of an approximate or local optimizer as x, to get an upper bound on the global optimum that is as low as possible.)

The simplest possible branch and bound algorithms need to contain both a box rejection mechanism and a subdivision mechanism. A common subdivision mechanism is to form two sub-boxes by bisecting the widest coordinate interval of x (with possible scaling factors). Heuristics and scaling factors, as well as several references to the literature, appear in [3, §4.3.2, p. 157 ff]. Alternatives to bisection, such as trisection, forming two boxes by cutting other than at a midpoint, etc., have also been discussed at conferences and studied empirically [1].

Acceleration Tools

Early and simple algorithms contain only the midpoint test mechanism and the bisection mechanism described above. Such algorithms produce as output a large list U of small boxes (with diameters smaller than a stopping tolerance) and no list C of boxes that contain verified critical points. The list U in such algorithms contains clusters of boxes around actual global optimizers. Some Lipschitz constant-based algorithms are of this form. Note, however, that such algorithms are of limited use in high dimensions, since the number of boxes produced increases exponentially with the dimension n.

Interval computations provide more powerful tools for accelerating the algorithm. For a start, if an interval extension ∇φ(x) of the gradient is computable, then 0 ∉ ∇φ(x) implies that x cannot contain a critical point, and x can be rejected. This tool for rejecting a box x is sometimes called the monotonicity test, since 0 ∉ (∇φ(x))_i implies φ is monotonic over x in the ith component x_i, where (∇φ(x))_i represents the ith component of the interval evaluation of the gradient ∇φ.
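As an illustration, the following Python sketch (a hypothetical one-dimensional toy, not code from the cited literature) combines the midpoint test and the monotonicity test in a simple branch and bound loop for φ(t) = t⁴ − 2t² on [−2, 2]; plain floating point stands in for outwardly rounded interval arithmetic, and the global minimizers (±1, interior critical points) are assumed not to lie on the search boundary.

def phi(t):
    return t**4 - 2.0 * t**2

def phi_range(lo, hi):
    # natural interval extension of t**4 - 2 t**2 via the enclosure of t**2
    s_hi = max(lo * lo, hi * hi)
    s_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return s_lo * s_lo - 2.0 * s_hi, s_hi * s_hi - 2.0 * s_lo

def dphi_range(lo, hi):
    # enclosure of phi'(t) = 4 t**3 - 4 t (cubing is monotone)
    return 4.0 * lo**3 - 4.0 * hi, 4.0 * hi**3 - 4.0 * lo

best = phi(0.0)                          # upper bound from a sample point
work, kept = [(-2.0, 2.0)], []
while work:
    lo, hi = work.pop()
    f_lo, _ = phi_range(lo, hi)
    if f_lo > best:                      # midpoint test: reject the box
        continue
    d_lo, d_hi = dphi_range(lo, hi)
    if d_lo > 0.0 or d_hi < 0.0:         # monotonicity test: no interior
        continue                         # critical point in this box
    best = min(best, phi(0.5 * (lo + hi)))
    if hi - lo < 1e-6:
        kept.append((lo, hi))
    else:
        m = 0.5 * (lo + hi)              # bisect the box
        work += [(lo, m), (m, hi)]
print("upper bound on the minimum:", best)
print("candidate boxes cluster near:",
      sorted(set(round(0.5 * (a + b), 3) for a, b in kept)))

The surviving boxes cluster near the two global minimizers t = ±1, and the upper bound approaches the true minimum value −1.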


Perhaps the most powerful interval acceleration tool is the interval Newton method, applied to the system ∇φ = 0. Interval Newton methods can result in quadratic convergence to a critical point, in the sense that the widths of the coordinates of the image of x are proportional to the squares of the widths of the coordinates of x. Interval Newton methods also can prove existence and uniqueness of a critical point, or nonexistence of a critical point, in x. Thus, the need to subdivide a relatively large x is often eliminated, making a previously impractical algorithm practical. See  Interval Newton methods and  Interval fixed point theory. For a more detailed algorithm, and for a discussion of parallelization of the branch and bound process, see  Interval analysis: Parallel methods for global optimization.

Differences Between Unconstrained and Constrained Optimization

If m1 > 0 or m2 > 0 in problem (1), then the problem is one of constrained optimization. The midpoint test cannot be applied directly to constrained problems, since φ(x) is guaranteed to be an upper bound on the global optimum only if the constraints c(x) = 0 and g(x) ≤ 0 are also satisfied at x. If there are only inequality constraints and none of the inequality constraints are active at x, then an interval evaluation of g(x) will rigorously verify g(x) < 0, and x can be used in the midpoint test. However, if there are equality constraints (or if one or more of the inequality constraints is active), then an interval evaluation will yield 0 ∈ c(x) (or 0 ∈ g_i(x) for some i), and it cannot be concluded that x is feasible. In such cases, a small box x̌ can be constructed about x, and it can be verified with interval Newton methods that x̌ contains a feasible point. The upper bound of the interval evaluation φ(x̌) then serves as an upper bound on the global optimum, for use in the midpoint test. For details and references, see  Interval analysis: Verifying feasibility.

On the other hand, constraints can be beneficial in eliminating infeasible boxes x. In particular, 0 ∉ c(x) or g(x) > 0 implies that x can be rejected.

It is sometimes useful to consider bound constraints of the form x_i ≤ x̄_i and x_j ≥ x̲_j separately from the general inequality constraints g(x) ≤ 0. Such bound constraints can generally coincide with the limits on


the search region x^0, but are distinguished from simple search bounds. (It is possible for an unconstrained problem to have no optima within a search region, but it is not possible if all of the search region limits represent bound constraints.) See [3, §5.2.3, p. 180 ff] for details.

Example 1 Consider

  min φ(x) = −(x_1 + x_2)²
  s.t. c(x) = x_2 − 2x_1 = 0.   (2)

Example (2) represents a constrained optimization problem with a single equality constraint and no bound constraints or inequality constraints. To apply the midpoint test in a rigorously verified algorithm, a box must first be found in which a feasible point is verified to exist. Suppose that a point algorithm, such as a generalized Newton method, has been used to find an approximate feasible point, say x̌ = (1/4, 1/2)^T. Now observe that ∇c ≡ (−2, 1)^T. Therefore, as suggested in  Interval analysis: Verifying feasibility, x_2 can be held fixed at x_2 = 1/2. Thus, to prove existence of a feasible point in a neighborhood of x̌, an interval Newton method can be applied to f(x_1) = c(x_1, 0.5) = 0.5 − 2x_1. We may choose the initial interval x_1 = [0.25 − ε, 0.25 + ε] with ε = 0.1, to obtain

  x_1 = [0.15, 0.35],
  x̃_1 = 0.25 − f(0.25)/(−2) = 0.25 − 0/(−2) = [0.25, 0.25] ⊂ x_1.

This computation proves that, for x_2 = 0.5, there is a feasible point with x_1 ∈ [0.25, 0.25]. (See  Interval Newton methods and  Interval fixed point theory.) We may now evaluate φ over the box ([0.25, 0.25], [0.5, 0.5])^T (which is degenerate in the second coordinate, and also happens to be degenerate in the first coordinate for this example). We thus obtain

  φ(([0.25, 0.25], [0.5, 0.5])) = [−9/16, −9/16],

and −9/16 has been proven to be an upper bound on the global optimum for example problem (2).
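A minimal Python sketch of the feasibility verification in Example 1 (plain floating point stands in for outwardly rounded interval arithmetic) might read:

def c(x1, x2):
    return x2 - 2.0 * x1

x1 = (0.15, 0.35)            # [0.25 - eps, 0.25 + eps], eps = 0.1
y = 0.25                     # guess from the approximate feasible point
fy = c(y, 0.5)               # = 0.0
df = -2.0                    # f'(x1) = -2 exactly over the whole interval
# N = y - f(y)/f'(x1); here the quotient interval is degenerate
N = (y - fy / df, y - fy / df)
assert x1[0] <= N[0] and N[1] <= x1[1]   # N inside x1 => feasibility proved
print("N =", N, "is contained in", x1)
print("=> for x2 = 0.5 there is a feasible point with x1 in", N)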

See also

 Automatic Differentiation: Point and Interval
 Automatic Differentiation: Point and Interval Taylor Operators
 Bounding Derivative Ranges
 Direct Search Luus–Jaakola Optimization Procedure
 Global Optimization: Application to Phase Equilibrium Problems
 Interval Analysis: Application to Chemical Engineering Design Problems
 Interval Analysis: Differential Equations
 Interval Analysis: Eigenvalue Bounds of Interval Matrices
 Interval Analysis: Intermediate Terms
 Interval Analysis: Nondifferentiable Problems
 Interval Analysis: Parallel Methods for Global Optimization
 Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
 Interval Analysis: Systems of Nonlinear Equations
 Interval Analysis: Verifying Feasibility
 Interval Constraints
 Interval Fixed Point Theory
 Interval Global Optimization
 Interval Linear Systems
 Interval Newton Methods

References

1. Csallner AE, Csendes T (1995) Convergence speed of interval methods for global optimization and the joint effects of algorithmic modifications. Talk given at SCAN'95, Wuppertal, Germany, Sept. 26–29
2. Hansen ER (1992) Global optimization using interval analysis. M. Dekker, New York
3. Kearfott RB (1996) Rigorous global search: continuous problems. Kluwer, Dordrecht
4. Pinter JD (1995) Global optimization in action: continuous and Lipschitz optimization. Kluwer, Dordrecht
5. Ratschek H, Rokne J (1988) New computer methods for global optimization. Wiley, New York

Interval Analysis: Verifying Feasibility

R. BAKER KEARFOTT
Department of Mathematics, University of Louisiana at Lafayette, Lafayette, USA

MSC2000: 65G20, 65G30, 65G40, 65H20

Article Outline

Keywords
Introduction
General Feasibility: the Fritz John Conditions
Feasibility of Inequality Constraints
Infeasibility
Feasibility of Equality Constraints
See also
References

Keywords

Constrained optimization; Automatic result verification; Interval computations; Global optimization

Introduction

Constrained optimization problems are of the form

  min φ(x)
  s.t. c_i(x) = 0, i = 1, ..., m,
       g_j(x) ≤ 0, j = 1, ..., q_1,
  x ∈ R^n.   (1)

A search region

  x = ([x̲_1, x̄_1], ..., [x̲_n, x̄_n])   (2)

is generally given, where some of the sides in (2) correspond to bound constraints of problem (1), and the other sides merely define the extent of the search region. If there are no constraints c_i and g_j, then the box x is systematically tessellated into sub-boxes. The branch and bound algorithm, in its most basic form, proceeds as follows: over each sub-box x̃, φ(x̌) is computed for some x̌ ∈ x̃, and the range of φ over x̃ is bounded (e. g. with a straightforward interval evaluation). If there are no constraints c_i and g_j, then the value φ(x̌) represents an upper bound on the minimum of φ. The minimum such value φ̄ is kept as the tessellation and search proceed; if any box x̃ has a lower range bound greater than φ̄, it is rejected as not containing a global optimum. See [1,2], or [3] for details of such algorithms.

The situation is more complicated in the constrained case. In particular, the values φ(x̌) cannot be taken as upper bounds on the global optimum unless it is known that x̌ is feasible. More generally, an upper bound on the range of φ over a small box x̌ can be taken as an upper bound for the global optimum provided it is proven that there exists a feasible point of problem (1) within x̌. This article outlines how this can be done.

General Feasibility: the Fritz John Conditions

An interval Newton method (see  Interval Newton methods) can sometimes be used to prove existence of a feasible point of problem (1) that is a critical point of φ. In particular, the interval Newton method can sometimes prove existence of a solution to the Lagrange multiplier or Fritz John system within x̌. For the Fritz John system, it is convenient to consider the q_2 bound constraints in the same form as the q_1 general inequality constraints, so that there are q = q_1 + q_2 general inequality constraints of the form g_j(x) ≤ 0. With that, the Fritz John system can be written as

  F(x, u, v) =
    ( u_0 ∇φ(x) + Σ_{j=1}^{q} u_j ∇g_j(x) + Σ_{i=1}^{m} v_i ∇c_i(x) )
    ( u_1 g_1(x) )
    ( ... )
    ( u_q g_q(x) )
    ( c_1(x) )
    ( ... )
    ( c_m(x) )
    ( u_0 + Σ_{j=1}^{q} u_j + Σ_{i=1}^{m} v_i² − 1 )
  = 0,   (3)

where u_j ≥ 0, j = 1, ..., q, the v_i are unconstrained, and the last equation is one of several possible normalization conditions. For details, see [1, §10.5] or [2, §5.2.5].

However, computational problems occur in practice with the system (3). It is more difficult to find a good approximate critical point (for an appropriate small box x̌) of the entire system (3) than it is to find a point where the inequality and equality constraints are satisfied. Furthermore, if an interval Newton method is applied to (3) over a large box, the corresponding interval Jacobi matrix or slope matrix typically contains singular matrices and hence is useless for


existence verification. This is especially true if it is difficult to get good estimates for the Lagrange multipliers u_j and v_i. For this reason, the techniques outlined below are useful.

Feasibility of Inequality Constraints

Proving feasibility of the inequality constraints is sometimes possible by evaluating the g_j with interval arithmetic: if the upper bound of g_j(x̌) is nonpositive, then every point in x̌ is feasible with respect to the constraint g_j(x) ≤ 0; see [3]. However, if x̌ corresponds to a point at which g_j is active, then 0 ∈ g_j(x̌), and no conclusion can be reached from an interval evaluation. In such cases, feasibility can sometimes be proven by treating g_j(x) = 0 as one of the equality constraints, then using the techniques below.

Infeasibility

An inequality constraint g_j is proven infeasible over x̌ if the lower bound of g_j(x̌) is positive, and an equality constraint c_i is infeasible over x̌ if either the lower bound of c_i(x̌) is positive or the upper bound of c_i(x̌) is negative. See [3].

Feasibility of Equality Constraints

The equality constraints c(x) = (c_1(x), ..., c_m(x))^T = 0, c: R^n → R^m, n ≥ m, can be considered an underdetermined system of equations, whereas interval Newton methods generally prove existence and/or uniqueness for square systems. However, fixing n − m coordinates at point values x̌_i allows interval Newton methods to work with a square function c̃: R^m → R^m, to prove existence of a feasible point within x̌. In principle, the indices of the coordinates to be held fixed are chosen to correspond to coordinates in which c is varying least rapidly. For a set of test problems, the most successful way appears to be choosing those coordinates corresponding to the rightmost columns after Gaussian elimination with complete pivoting has been applied to the rectangular matrix c′(x̌) for some x̌ ∈ x. Figure 1 illustrates the process in two dimensions.

Interval Analysis: Verifying Feasibility, Figure 1
Proving that there exists a feasible point of an underdetermined constraint system

Certain complications arise. For example, if bound constraints or inequality constraints are active, then either the point x̌ must be perturbed or else the bound or inequality constraints must be treated as equality constraints. Handling this case by perturbation is discussed in [2, p. 191 ff]. For the original explanation of the Gaussian elimination-based process, see [1, §12.4]. In [2, §5.2.4], additional background, discussion, and references appear.
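The following Python sketch (an illustration under the assumption that c′(x̌) has full rank m; not the original implementation) shows how the coordinates to hold fixed can be selected with Gaussian elimination with complete pivoting:

def coordinates_to_fix(J):
    """J is the m x n Jacobian c'(x_check) as a list of row lists,
    assumed to have full rank m; returns the indices of the n - m
    coordinates to hold fixed (the rightmost, never-pivoted columns)."""
    m, n = len(J), len(J[0])
    A = [row[:] for row in J]
    cols = list(range(n))                 # track the column permutation
    for k in range(m):
        # complete pivoting: largest remaining entry in magnitude
        p, q = max(((i, j) for i in range(k, m) for j in range(k, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        A[k], A[p] = A[p], A[k]           # row swap
        for row in A:                     # column swap
            row[k], row[q] = row[q], row[k]
        cols[k], cols[q] = cols[q], cols[k]
        for i in range(k + 1, m):         # eliminate below the pivot
            t = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= t * A[k][j]
    return cols[m:]

# Example 1 of the previous article: c'(x) = [[-2.0, 1.0]]; the pivot
# lands on the entry -2, so the second coordinate (index 1) is fixed.
print(coordinates_to_fix([[-2.0, 1.0]]))  # -> [1]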

See also

 Automatic Differentiation: Point and Interval
 Automatic Differentiation: Point and Interval Taylor Operators
 Bounding Derivative Ranges
 Global Optimization: Application to Phase Equilibrium Problems
 Interval Analysis: Application to Chemical Engineering Design Problems
 Interval Analysis: Differential Equations
 Interval Analysis: Eigenvalue Bounds of Interval Matrices
 Interval Analysis: Intermediate Terms
 Interval Analysis: Nondifferentiable Problems
 Interval Analysis: Parallel Methods for Global Optimization
 Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
 Interval Analysis: Systems of Nonlinear Equations
 Interval Analysis: Unconstrained and Constrained Optimization
 Interval Constraints
 Interval Fixed Point Theory
 Interval Global Optimization
 Interval Linear Systems
 Interval Newton Methods


References

1. Hansen ER (1992) Global optimization using interval analysis. M. Dekker, New York
2. Kearfott RB (1996) Rigorous global search: continuous problems. Kluwer, Dordrecht
3. Ratschek H, Rokne J (1988) New computer methods for global optimization. Wiley, New York

Interval Constraints

Interval Propagation

FRÉDÉRIC BENHAMOU
IRIN, Université de Nantes, Nantes, France

MSC2000: 68T20, 65G20, 65G30, 65G40

Article Outline

Keywords
See also
References

Keywords

Constraint programming; Continuous constraint satisfaction problems; Local consistency; Propagation

Interval constraint processing is an alternative technology designed to process sets of (generally nonlinear) continuous or mixed constraints over the real numbers. It combines propagation and search techniques developed in artificial intelligence with methods from interval analysis. Interval constraints are used in the design of the constraint solving and optimization engines of most modern constraint programming languages, and have been used to solve industrial applications in areas like mechanical design, chemistry, aeronautics, medical diagnosis, and image synthesis.

The term interval constraint is a generic term denoting a constraint (that is, a first order atomic formula such as an equation, an inequation or, more generally, a relation) in which variables are associated with intervals. These intervals denote domains of possible values for these variables. In general, intervals are defined over the real numbers, but the concept is general enough to address other constraint domains (e. g. nonnegative integers, Booleans, lists, sets, etc.). When defined over the real numbers, interval constraint sets are often called continuous or numerical constraint satisfaction problems.

The main idea underlying interval constraint processing (also called interval propagation) is, given a set of constraints S involving variables {v_1, ..., v_n} and a set of floating point intervals {I_1, ..., I_n} representing the domains of possible values of the variables, to isolate a set of n-ary canonical boxes (Cartesian products of subintervals of the I_i whose bounds are either equal or consecutive floating point numbers) approximating the constraint system solution space. To compute such a set, a search procedure navigates through the Cartesian product I_1 × ... × I_n, alternating pruning and branching steps. The pruning step uses a relational form of interval arithmetic [1,11]. Given a set of constraints over the reals, interval arithmetic is used to compute local approximations of the solution space for a given constraint. This approximation results in the elimination of values from the domains of the variables, and these domain modifications are propagated through the whole constraint set until a stable state is reached. This stable state is closely related to the notion of arc consistency [9,10], a well-known concept in artificial intelligence. The branching step consists in a bisection-based divide-and-conquer procedure, on which a number of strategies and heuristics can be applied.

Interval constraints were first introduced by J.G. Cleary in [5]. The initial goal was to address the incorrectness of floating point numerical computations in the Prolog language while introducing a relational form of arithmetic more adapted to the language's formal model. These ideas, clearly connected to the concepts developed in constraint logic programming [6,7], were then implemented in BNR-Prolog [12]. Since then, many other constraint languages and systems have used interval constraints as their basic constraint solving engine, for example CLP(BNR) [4], Prolog IV [13] and Numerica [16].

In the interval framework, the basic data structure is a set of ordered pairs of numbers taken in a finite set of particular rational numbers augmented with the two infinities (this set generally coincides with a set of IEEE floating point numbers). Such a pair, called a floating point interval or, more concisely, an interval, denotes, as expected, the set of real numbers between the lower and upper bounds. Operations and relations over the reals can be lifted to intervals (using floating point operations and outward rounding) so as to keep numerical errors under control. In particular, correctness of computations is guaranteed by a fundamental theorem due to R.E. Moore [11].

Assuming a finite set of intervals closed under intersection, every relation ρ over R can be approximated with its interval enclosure (i. e. the intersection of all intervals containing it). The approximation of any n-ary relation is then defined as the Cartesian product of its projection approximations. These Cartesian products of intervals are called boxes. The set of boxes, partially ordered by inclusion, is the complete lattice made of the fixed points of the closure operator that maps n-ary relations over R to their approximations. The intersection of all boxes containing an n-ary relation defines an outer approximation notion. A dual notion of inner approximation can be defined as the union of all boxes contained in the relation.

Given a finite set of constraints S and an initial n-ary box X representing the domains (intervals) of all variables occurring in S, every constraint in S represents an n-ary relation ρ (modulo an appropriate cylindrification). The main idea is then to compute a box approximating the solution set defined by S and X. In the interval constraint framework, this approximation is generally computed by applying the following algorithm, called here NC3 to reflect its similarity to the arc consistency algorithm AC3 [10].

The call to the function narrow in NC3 is an algorithmic narrowing process. Every constraint c in S with its corresponding relation ρ is associated with an operator N_c, called a constraint narrowing operator, mapping boxes to boxes and verifying the properties of correctness, contractance, monotonicity, and idempotence; that is, for all boxes X, X′:
1) X ∩ ρ ⊆ N_c(X);
2) N_c(X) ⊆ X;
3) X ⊆ X′ implies N_c(X) ⊆ N_c(X′);
4) N_c(N_c(X)) = N_c(X).
When such operators are associated with the constraints of a set S, the function narrow(X, c) simply returns N_c(X). The algorithm stops when a stable state is reached, i. e. no (strict) narrowing is possible with respect to any constraint.

function NC3
input:  S, a (nonempty) constraint system
        X, a (nonempty) box
output: X′ ⊆ X

Queue all constraints from S in Q
REPEAT
  select a constraint c from Q
  % Narrow down X with respect to c
  X′ ← narrow(X, c)
  % If X′ is empty, S is inconsistent
  IF X′ = ∅ THEN return ∅
  % Queue the constraints whose variables'
  % domains have changed; delete c from Q
  Let S′ = {c′ ∈ S : ∃v ∈ var(c′), X′_v ⊂ X_v}
  Q ← (Q ∪ S′) \ {c}
  X ← X′
UNTIL Q is empty
return X
END % NC3

NC3: A generic narrowing algorithm
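As an illustration of a constraint narrowing operator at work, the following Python sketch (simplified semantics, not an excerpt from any of the cited systems) iterates hull-consistency narrowing for the single primitive constraint x + y = z to a fixed point:

def narrow_sum(x, y, z):
    """Narrow (x, y, z) under x + y = z; each domain is a (lo, hi) pair."""
    z = (max(z[0], x[0] + y[0]), min(z[1], x[1] + y[1]))   # z in x + y
    x = (max(x[0], z[0] - y[1]), min(x[1], z[1] - y[0]))   # x in z - y
    y = (max(y[0], z[0] - x[1]), min(y[1], z[1] - x[0]))   # y in z - x
    if x[0] > x[1] or y[0] > y[1] or z[0] > z[1]:
        return None                                        # inconsistent
    return x, y, z

doms = ((0.0, 10.0), (4.0, 10.0), (0.0, 5.0))
while True:                    # propagate until a stable state (fixed point)
    new = narrow_sum(*doms)
    if new is None or new == doms:
        break
    doms = new
print(doms)   # x narrows to [0, 1], y to [4, 5], z to [4, 5]

A real implementation would, as in NC3, maintain a queue of constraints and requeue only those whose variables' domains changed, and would round the interval bounds outward.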

The result of the main step is to remove (some) incompatible values from the domains of the variables occurring in c. Furthermore, it can be shown that NC3 terminates, is correct (the final box contains all solutions of the initial system included in X), is confluent (the selection of constraints in the main loop is strategy independent), and computes the greatest common fixed point of the constraint narrowing operators that is included in the initial box [2].

Over the real numbers, different constraint narrowing operators can be defined, resulting in different local consistency notions. A system is said to be locally consistent (with respect to a family of constraint narrowing operators) if the Cartesian product of the domains of its variables is a common fixed point of the constraint narrowing operators associated with its constraints. The main local consistency notions used in continuous constraint satisfaction problems are: first order local consistencies deriving from arc consistency (hull (or 2B) consistency [4,8] and box consistency [3]), and higher order local consistencies deriving from k-consistency (3B- and kB-consistency [8], box(2)-consistency [14]).

More precisely, let apx(c) denote the smallest box enclosing the relation associated with a constraint c. The family of constraint narrowing operators N defined, for every box X and every constraint c, by N_c(X) = apx(X ∩ c) is the support of hull consistency. These operators can be computed for very simple constraints, often named primitive constraints (e. g. x + y = z, sin(x) = y, ...), and complex constraints are decomposed into primitive constraints, possibly adding fresh variables.

The definition of box consistency involves the introduction of projection constraints. Given a multivariate constraint c over the reals, the projection constraint of c with respect to a variable v is obtained by computing an interval extension of the constraint and by replacing every variable but v with the interval corresponding to its domain. The constraint narrowing operator associated with this projection constraint computes the greatest interval [a, b] such that [a, a⁺] and [b⁻, b], where a⁺ (resp. b⁻) denotes the successor (resp. the predecessor) of a (resp. b), verify the projection constraint. Besides the fact that this technique does not require the addition of any fresh variables, these operators can be computed with an algorithm mixing interval Newton methods (cf.  Interval Newton methods), propagation, and bisection-based search.

Higher-order local consistencies are based on their first order counterparts. Operationally, the idea is to improve the accuracy of the enclosures by eliminating subintervals of the locally consistent domains that can be detected locally inconsistent. The general procedure is as follows: consider a hull consistent interval constraint system (S, X). An equivalent 3B-consistent system is a system (S, X′) such that, for every variable v, if X′_v = [a, b], the systems (S, X′_{v←[a,a⁺]}) and (S, X′_{v←[b⁻,b]}) are hull consistent, where X′_{v←I} denotes the Cartesian product X′ in which X′_v is replaced with I. Box(2)-consistency is defined in the same manner with respect to box consistency. The computational cost of higher-order local consistencies is generally high, but the local gain in accuracy has been shown to outperform most existing techniques on several challenging problems (see, for example, the circuit design problem in [14]). Finally, the above mentioned interval constraint techniques are also used for unconstrained and constrained optimization problems (see for example [15,16]).

See also

 Automatic Differentiation: Point and Interval
 Automatic Differentiation: Point and Interval Taylor Operators

 Bounding Derivative Ranges
 Global Optimization: Application to Phase Equilibrium Problems
 Interval Analysis: Application to Chemical Engineering Design Problems
 Interval Analysis: Differential Equations
 Interval Analysis: Eigenvalue Bounds of Interval Matrices
 Interval Analysis: Intermediate Terms
 Interval Analysis: Nondifferentiable Problems
 Interval Analysis: Parallel Methods for Global Optimization
 Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
 Interval Analysis: Systems of Nonlinear Equations
 Interval Analysis: Unconstrained and Constrained Optimization
 Interval Analysis: Verifying Feasibility
 Interval Fixed Point Theory
 Interval Global Optimization
 Interval Linear Systems
 Interval Newton Methods

References

1. Alefeld G, Herzberger J (1983) Introduction to interval computations. Acad. Press, New York
2. Benhamou F, Granvilliers L (1997) Automatic generation of numerical redundancies for non-linear constraint solving. Reliable Computing 3(3):335–344
3. Benhamou F, McAllester D, Van Hentenryck P (1994) CLP(intervals) revisited. Proc. of ILPS'94, MIT, Cambridge, MA, pp 1–21
4. Benhamou F, Older WJ (1997) Applying interval arithmetic to real, integer and Boolean constraints. J Logic Programming 32(1):1–24
5. Cleary JG (1987) Logical arithmetic. Future Computing Systems 2(2):125–149
6. Colmerauer A (1990) An introduction to Prolog III. Comm ACM 33(7):69–90
7. Jaffar J, Lassez JL (1987) Constraint logic programming. Proc. 14th ACM Symp. Principles of Programming Languages (POPL'87), ACM, New York, pp 111–119
8. Lhomme O (1993) Consistency techniques for numeric CSPs. In: Bajcsy R (ed) Proc. 13th IJCAI. IEEE Computer Soc. Press, New York, pp 232–238
9. Mackworth A (1977) Consistency in networks of relations. Artif Intell 8(1):99–118
10. Montanari U (1974) Networks of constraints: Fundamental properties and applications to picture processing. Inform Sci 7(2):95–132


11. Moore RE (1966) Interval analysis. Prentice-Hall, Englewood Cliffs, NJ
12. Older W, Vellino A (1993) Constraint arithmetic on real intervals. In: Benhamou F, Colmerauer A (eds) Constraint Logic Programming: Selected Research. MIT, Cambridge, MA
13. PrologIA (1994) Prolog IV: Reference manual and user's guide
14. Puget J-F, Van Hentenryck P (1998) A constraint satisfaction approach to a circuit design problem. J Global Optim 13:75–93
15. Van Hentenryck P, McAllester D, Kapur D (1997) Solving polynomial systems using a branch and prune approach. SIAM J Numer Anal 34(2):797–827
16. Van Hentenryck P, Michel L, Deville Y (1997) Numerica: A modeling language for global optimization. MIT, Cambridge, MA

Interval Fixed Point Theory

R. BAKER KEARFOTT
Department of Mathematics, University of Louisiana at Lafayette, Lafayette, USA

MSC2000: 65G20, 65G30, 65G40, 65H20

Article Outline

Keywords
Classical Fixed Point Theory and Interval Arithmetic
The Krawczyk Method and Fixed Point Theory
Interval Newton Methods and Fixed Point Theory
Uniqueness
Infinite-Dimensional Problems
See also
References

Keywords

Fixed point iteration; Automatic result verification; Interval computations; Global optimization

Interval methods (interval Newton methods and the Krawczyk method) can be used to prove existence and uniqueness of solutions to linear and nonlinear finite-dimensional and infinite-dimensional systems, given floating-point approximations to such solutions (see  Interval Newton methods and [6,8]). In turn, these existence-proving interval operators have a close relationship with the classical theory of fixed-point iteration. This relationship is sketched here.

Classical Fixed Point Theory and Interval Arithmetic

Various fixed point theorems, applicable in finite- or infinite-dimensional spaces, state roughly that, if a mapping maps a set into itself, then that mapping has a fixed point within that set. For example, the Brouwer fixed point theorem states that, if D is homeomorphic to the closed unit ball in R^n and P is a continuous mapping such that P maps D into D, then P has a fixed point in D, that is, there is an x ∈ D with x = P(x).

Interval arithmetic can be naturally used to test the hypotheses of the Brouwer fixed point theorem. An interval extension P of P has the property that, if x is an interval vector with x ⊆ D, then P(x) contains the range {P(x): x ∈ x}, and an interval extension P can be obtained simply by evaluating P with interval arithmetic. Furthermore, with outward roundings, this evaluation can be carried out so that the floating point intervals (whose end points are machine numbers) rigorously contain the actual range of P. Thus, if P(x) ⊆ x, one can conclude that P has a fixed point within x.

Another fixed point theorem, Miranda's theorem, follows from the Brouwer fixed point theorem, and is directly useful in theoretical studies of several interval methods. Miranda's theorem is most easily stated with the notation of interval computations. Suppose x ⊆ R^n is an interval vector, and for each i, look at the lower ith face x̲_i of x, defined to be the interval vector all of whose components except the ith component are those of x, and whose ith component is the lower bound x̲_i of the ith component x_i of x. Define the upper ith face x̄_i of x similarly. Let P: x → R^n, P(x) = (P_1(x), ..., P_n(x)), be continuous, and let P = (P_1, ..., P_n) be any interval extension of P. Miranda's theorem states that, if

  P_i(x̲_i) P_i(x̄_i) ≤ 0   for i = 1, ..., n,   (1)

then P has a fixed point within x.
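For instance, the following one-dimensional Python sketch tests the hypothesis P(x) ⊆ x for P = cos; the interval extension below is a hand-coded assumption valid on [0, π], padded outward by one ulp (via math.nextafter, Python 3.9+) in place of full directed rounding:

import math

def cos_extension(lo, hi):
    """Interval extension of cos on [lo, hi] for 0 <= lo <= hi <= pi,
    where cos is decreasing; bounds are padded outward by one ulp."""
    assert 0.0 <= lo <= hi <= math.pi
    return (math.nextafter(math.cos(hi), -math.inf),
            math.nextafter(math.cos(lo), math.inf))

x = (0.72, 0.76)                       # candidate interval (here n = 1)
Px = cos_extension(*x)                 # encloses the range {cos(t): t in x}
if x[0] <= Px[0] and Px[1] <= x[1]:    # P(x) inside x
    print("cos maps", x, "into itself:", Px)
    print("=> Brouwer: a fixed point of cos (x = cos x) exists in", x)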

I

Interval Fixed Point Theory

The Krawczyk Method and Fixed Point Theory

R.E. Moore provided one of the earlier careful analyses of interval Newton methods in [5]. There, the Krawczyk method was analyzed as follows. The chord method is defined as

  P(x) = x − Y f(x),  (2)

where the iteration matrix is normally taken to be Y ≈ [f′(x̃)]⁻¹ for some Jacobian matrix f′(x̃) with x̃ ∈ x, where solutions of f(x) = 0, f: D ⊆ ℝⁿ → ℝⁿ, are sought. A mean value extension is then used:

  P(x) ∈ P(x̌) + P′(x)(x − x̌)
       = x̌ − Y f(x̌) + (I − Y f′(x))(x − x̌)
       =: K(x, x̌),  (3)

whence K(x, x̌) = P(x) is an interval extension of P. Thus, the fact that the range of P obeys {P(x) : x ∈ x} ⊆ P(x) = K(x, x̌), coupled with the Brouwer fixed point theorem, implies that, if K(x, x̌) ⊆ x, then there exists a fixed point of P, and hence a solution x* ∈ K(x, x̌) with f(x*) = 0. By analyzing the norm ‖I − Y f′(x)‖, Moore further concludes, basically, that if ‖I − Y f′(x)‖ < 1, then any solution x* ∈ x must be unique; for an exact statement and details, see [5].
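The following one-dimensional sketch (with illustrative choices of f, x, and x̌, and no directed rounding) evaluates K(x, x̌) for f(x) = x² − 2 and checks the containment K(x, x̌) ⊆ x:

# One-dimensional Krawczyk sketch for f(x) = x^2 - 2 on x = [1.3, 1.5],
# with x̌ the midpoint and Y ≈ 1/f'(x̌).  Intervals are (lo, hi) tuples.

def imul(a, b):
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

x = (1.3, 1.5)
xc = 1.4                                  # x̌, the base point
Y = 1.0 / (2.0 * xc)                      # approximate inverse of f'(x̌)
fprime = (2.0 * x[0], 2.0 * x[1])         # interval extension f'(x) = 2x

# K(x, x̌) = x̌ - Y f(x̌) + (1 - Y f'(x)) (x - x̌)
center = xc - Y * (xc**2 - 2.0)
slope = (1.0 - Y * fprime[1], 1.0 - Y * fprime[0])  # valid since Y > 0
K = iadd((center, center), imul(slope, (x[0] - xc, x[1] - xc)))

print(K)                                  # ~ (1.4071, 1.4214)
assert x[0] <= K[0] and K[1] <= x[1]      # K ⊆ x: a zero of f exists in x

Here the containment certifies a zero of f (namely √2 ≈ 1.41421) in K(x, x̌), and the slope enclosure 1 − Y f′(x) ≈ [−0.0714, 0.0714] has magnitude below 1, matching Moore's uniqueness condition.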

Interval Newton Methods and Fixed Point Theory

Traditional interval Newton methods are of the form

  N(f; x, x̌) = x̌ + v,  (4)

where v is an interval vector that contains all solutions v of the point systems Av = −f(x̌) for A ∈ f′(x), where f′(x) is either an interval extension of the Jacobian matrix of f over x or an interval slope matrix; see ▶ Interval Newton Methods. [7, Thm. 5.1.7] asserts that, if N(f; x, x̌) ⊆ int x, where f′(x) is a 'Lipschitz set' for f, int x denotes the interior of x, and x̌ ∈ int(x), then there is a solution of f(x) = 0 within N(f; x, x̌), and this solution is unique within x. Classical fixed point theory is used in the succinct proof of this general theorem. When the interval Gauss–Seidel method is used to find the solution set bounds v, a very clear correspondence to Miranda's theorem can be set up. This is done in [3].

Uniqueness

In classical fixed point theory, the contractive mapping theorem (a nongeneric property) is often used to prove uniqueness. For example, suppose P is Lipschitz with Lipschitz constant L < 1, that is,

  ‖P(x) − P(y)‖ ≤ L ‖x − y‖ for some L < 1.  (5)

Then x = P(x) and y = P(y) implies ‖x − y‖ = ‖P(x) − P(y)‖ ≤ L ‖x − y‖, which can only happen if x = y. (This argument appears in many elementary numerical analysis texts, such as [4].)

An alternate proof of uniqueness involves nonsingularity (i.e., regularity) of the mapping f for which we seek x with f(x) = 0. In particular, if f(x) = Ax is linear, corresponding to a nonsingular matrix A, then f(x) = 0 and f(y) = 0 implies

  0 = f(x) − f(y) = Ax − Ay = A(x − y),  (6)

whence nonsingularity of A implies x − y = 0, i.e. x = y. Without interval arithmetic, the argument in (6) cannot be generalized easily to nonlinear systems. Basically, invertibility implies uniqueness, and one must somehow prove invertibility. However, with interval arithmetic, uniqueness follows directly from an equation similar to (6), and regularity can be proven directly with an interval Newton method. In particular, if the image under the interval Newton method (4) is bounded, then every point matrix A ∈ f′(x) must be nonsingular. (This is because the bounds on the solution set to the linear system f′(x)v = −f(x̌) must contain the set of solutions to all systems of the form Av = −f(x̌), A ∈ f′(x).) Then, the mean value theorem implies that, for every x ∈ x, y ∈ x,

  f(x) − f(y) = A(x − y) for some A ∈ f′(x).  (7)

This is in spite of the fact that, in (7), A is in general not equal to any f′(x) for some x ∈ x. In fact, (7) follows


from considering f componentwise:

  fᵢ(y) = fᵢ(x) + ∇fᵢ(cᵢ)ᵀ (y − x),

for some cᵢ, different for each i, on the line connecting x and y; the matrix A ∈ f′(x) can be taken to have its ith row equal to (∇fᵢ(cᵢ))ᵀ. Thus, because of the nonsingularity of A in (7), f(x) = 0, f(y) = 0 implies 0 = A(x − y) and x = y.

Summarizing the actual results: if

  N(f; x, x̌) ⊆ int x,  (8)

where N(f; x, x̌) is as in (4), then classical fixed-point theory combined with properties of interval arithmetic implies that there is a unique solution to f(x) = 0 in N(f; x, x̌), and hence in x. If slope matrices are used in place of an interval Jacobian matrix f′(x), then (7) no longer holds, and (8) no longer implies uniqueness. However, a two-stage process, involving evaluation of an interval derivative over a small box containing the solution and evaluation of a slope matrix over a large box containing the small box, leads to an even more powerful existence and uniqueness test than using interval Jacobian matrices alone. This technique perhaps originally appeared in [9]. A statement and proof of the main theorem can also be found in [3, Thm. 1.23, p. 64].
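In one dimension, test (8) reads N(f; x, x̌) = x̌ − f(x̌)/f′(x) ⊆ int x, with the quotient taken in interval arithmetic. A small sketch, with the same illustrative f as above and rounding ignored:

# One-dimensional interval Newton step for f(x) = x^2 - 2; containment of
# N(f; x, x̌) in the interior of x proves existence and uniqueness in x.

def idiv_scalar(s, b):
    """Enclose { s / t : t in b } for an interval b with 0 not in b."""
    assert b[0] > 0 or b[1] < 0
    q = (s / b[0], s / b[1])
    return (min(q), max(q))

x = (1.3, 1.5)
xc = 1.4
fxc = xc**2 - 2.0                         # f(x̌) = -0.04
fprime = (2.0 * x[0], 2.0 * x[1])         # f'(x) ⊆ [2.6, 3.0]

v = idiv_scalar(-fxc, fprime)             # all solutions of f'(x)·v = -f(x̌)
N = (xc + v[0], xc + v[1])                # N(f; x, x̌) = x̌ + v

print(N)                                  # ~ (1.41333, 1.41538)
assert x[0] < N[0] and N[1] < x[1]        # N ⊆ int x: unique zero in x

Because N is bounded, every point value in f′(x) is nonzero here, which is exactly the one-dimensional instance of the nonsingularity argument above.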

Infinite-Dimensional Problems

Many problems in infinite-dimensional spaces (e.g. certain variational optimization problems) can be written in the form of a compact operator fixed point equation, x = P(x), where P: S → S is some compact operator on some normed linear space S. In many such cases, P is approximated numerically from a finite-dimensional space of basis functions {φᵢ : i = 1, ..., n} (e.g. splines or finite element basis functions φᵢ), and the approximation error can be computed. That is, P(x) = Pₙ(y) + Rₙ(y), where y ∈ ℝⁿ is an approximation to x ∈ S, and Rₙ(y) is the error, computable as a function of y. Thus, a fixed point iteration can be set up of the form

  y ← P̃ₙ(y) := Pₙ(y) + Rₙ(y),  (9)

where y ∈ ℝⁿ. (The dimension n can be increased as the iteration proceeds.)

For (9), the Schauder fixed point theorem is an analogue of the Brouwer fixed point theorem; see [1, p. 154]. Furthermore, interval extensions can be provided to both Pₙ and Rₙ, so that an analogue to finite-dimensional computational fixed point theory exists. In particular, if

  P̃ₙ(y) ⊆ int y,  (10)

then there exists a fixed point of P within the ball in S centered at the midpoint of y and with radius equal to the radius of y. (For these purposes,

  y = Σᵢ₌₁ⁿ aᵢ φᵢ

can be identified with the interval vector (a₁, ..., aₙ)ᵀ corresponding to the coefficients in the expansion.) For details, see [6, Chap. 15]. Also see [2] for a theoretical development and various examples worked out in detail.

See also

▶ Automatic Differentiation: Point and Interval
▶ Automatic Differentiation: Point and Interval Taylor Operators
▶ Bounding Derivative Ranges
▶ Global Optimization: Application to Phase Equilibrium Problems
▶ Interval Analysis: Application to Chemical Engineering Design Problems
▶ Interval Analysis: Differential Equations
▶ Interval Analysis: Eigenvalue Bounds of Interval Matrices
▶ Interval Analysis: Intermediate Terms
▶ Interval Analysis: Nondifferentiable Problems
▶ Interval Analysis: Parallel Methods for Global Optimization
▶ Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
▶ Interval Analysis: Systems of Nonlinear Equations
▶ Interval Analysis: Unconstrained and Constrained Optimization
▶ Interval Analysis: Verifying Feasibility
▶ Interval Constraints
▶ Interval Global Optimization
▶ Interval Linear Systems
▶ Interval Newton Methods


References
1. Istrăţescu VI (1981) Fixed point theory: An introduction. Reidel, London
2. Kaucher EW, Miranker WL (1984) Self-validating numerics for function space problems. Acad. Press, New York
3. Kearfott RB (1996) Rigorous global search: Continuous problems. Kluwer, Dordrecht
4. Kincaid D, Cheney W (1991) Numerical analysis. Brooks/Cole, Pacific Grove, CA
5. Moore RE (1977) A test for existence of solutions to nonlinear systems. SIAM J Numer Anal 14(4):611–615
6. Moore RE (1985) Computational functional analysis. Horwood, Westergate
7. Neumaier A (1990) Interval methods for systems of equations. Cambridge Univ. Press, Cambridge
8. Plum M (1994) Inclusion methods for elliptic boundary value problems. In: Herzberger J (ed) Topics in Validated Computations. Stud Computational Math. North-Holland, Amsterdam, pp 323–380
9. Rump SM (1994) Verification methods for dense and sparse systems of equations. In: Herzberger J (ed) Topics in Validated Computations. Elsevier, Amsterdam, pp 63–135

Interval Global Optimization

H. RATSCHEK¹, J. ROKNE²
¹ Math. Institut, Universität Düsseldorf, Düsseldorf, Germany
² Department Computer Sci., University Calgary, Calgary, Canada

MSC2000: 65K05, 90C30, 65G20, 65G30, 65G40

Article Outline

Keywords
Introduction
Interval Arithmetic
Interval Newton Methods
  The Preconditioning Step
  The Relaxation Steps (Gauss–Seidel)
Three Prototype Algorithms for the Unconstrained Problem
Convergence Properties of the Prototype Algorithms
Accelerating and Related Devices
  The Monotonicity Test
  The Interval Newton Method
  Finding a Function Value as Small as Possible
  Bisections
  Use of Good Inclusion Functions, Slope Arithmetic
  The Nonconvexity Test
  Thin Evaluation of the Hessian Matrix
  Constraint Logic Programming
  Automatic Differentiation
  Parallel Computations
Global Optimization Over Unbounded Domains and Nonsmooth Optimization
  Global Optimization over Unbounded Domains
  Nonsmooth Optimization
Constrained Optimization
  The Penalty Approach
  The Direct Approach
Applications
  Chemistry and Chemical Engineering
  Physics, Electronics and Mechanical Engineering
  Economics
See also
References

Keywords

Global optimization; Nonsmooth optimization; Unbounded optimization; Interval methods

We give an overview of the general ideas involved in solving constrained and unconstrained global optimization problems using interval arithmetic. We include a discussion of a few prototype optimization algorithms and enumerate some applications in engineering, chemistry, manufacturing, economics and physics.

Introduction

Let I be the set of real compact intervals, ℝ the set of reals, m a positive integer, X ∈ Iᵐ, and f: X → ℝ the objective function. We assume that a global minimum f* of f exists over X. Let X* be the set of global minimizers of f over X. Then the global unconstrained optimization problem is written down concisely as

  min_{x ∈ X} f(x),  (1)

which means that f* or X* is to be determined. The global constrained optimization problem arises if a more general set M ⊆ ℝᵐ, the so-called feasible domain, is considered. Solution methods for global constrained problems use the tools for global unconstrained problems, but additionally, further concepts are needed such


as numerical proofs of the guaranteed existence of feasible points in subareas of the working domain. Therefore we separate the treatment of the constrained case from the treatment of the unconstrained case.

The first interval techniques for treating global optimization problems were established by [13,20,30,31,46,64,65,66,67,71,72,74,81,98,103,104,105], etc. Although some of these references focused on special problems such as convex or signomial programming, they provided concepts which would give insight into more general problems where they were later applied. The overview that we provide in this article can only cast a quick glance at the various topics considered; their thorough investigation may be found in [33,39,47,48,49,56,70,90,95,100].

Solving an optimization problem such as (1) requires, in general, the repetitive comparison of continua of values and the choosing of an optimum value. Since interval computation is a tool for handling continua, it provides competitive methods for solving global optimization problems. Simple prototype algorithms for unconstrained problems are discussed in order to keep the presentation from becoming too sophisticated. We choose three variants, on the one hand to keep track of the historical origins, and on the other hand to show how small changes in the prototypes influence their convergence behavior. These prototypes are based on ideas of S. Skelboe [103], R.E. Moore [74], N.S. Asaithambi, Z. Shen and Moore [2], E.R. Hansen [30,31], and K. Ichida and Y. Fujii [46]. We do not have the space to provide prototype algorithms for constrained problems as well in this article; thus we only discuss the parts which have to be added to the unconstrained prototypes in order to get a procedure for constrained problems.

In general, interval algorithms for solving global optimization problems consist of
i) the main algorithm,
ii) accelerating devices.
The main algorithm is a sequential deterministic algorithm where branch and bound techniques are used. (An algorithm is called sequential if the nth step of the computation depends on the former steps. A method is deterministic if stochastic methods are avoided. By branch and bound principles is meant that the whole area X or M is not searched uniformly for the global minimizers; instead some parts (branches) are preferred. The branching depends on the bounding. It is

required that for any box Y of the working area a lower bound for f over Y is known or computable.) Interval arithmetic is used for point i) to achieve the bounds needed for the branch and bound techniques (f need not be Lipschitz, convex, etc.) and for point ii) to remove superfluous parts of the domains X or M.

The contents of this article is as follows: In the next two sections we introduce the interval tools which are required in the article. In section 4, three algorithms for solving (1) are presented. They are seemingly very similar, but their convergence properties, which are discussed in section 5, are different. The three algorithms are also of interest for historical reasons. A survey of acceleration devices, which aim to speed up the computation, is given in section 6. It is shown in section 7 that interval analysis is an excellent means for dealing with problems which have an unbounded domain or a nonsmooth objective function. In section 8, the constrained case is touched upon. Applications of these methods are collected in the final section 9.

Interval Arithmetic

The interval tools which are needed for the explanation of the basic features of interval methods in global optimization are described in this section. A thorough introduction to the whole area of interval arithmetic can be found, for example, in [1,4,52,74,102], etc. More advanced readers will enjoy [79]. The development of interval tools appropriate for dealing with optimization problems is presented in [88,90]; cf. also the Appendix of [86].

The interval arithmetic operations are defined by

  A ∘ B = {a ∘ b : a ∈ A, b ∈ B} for A, B ∈ I,  (2)

where the symbol ∘ may denote +, −, ·, or /. In general, A/B is not defined if 0 ∈ B. (But see the sections on 'interval Newton methods' and 'global optimization over unbounded domains' below.) The meaning of (2) is the following: If some unknown reals α, β are included in known intervals, say α ∈ A, β ∈ B, then it is guaranteed that the desired result α ∘ β, which is in general unknown, is contained in the known interval A ∘ B. Definition (2) is equivalent to the following rules:

  [a, b] + [c, d] = [a + c, b + d],
  [a, b] − [c, d] = [a − d, b − c],
  [a, b] · [c, d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)],
  [a, b] / [c, d] = [a, b] · [1/d, 1/c] if 0 ∉ [c, d].

Therefore, the interval arithmetic operations can easily be realized on a computer. The algebraic properties of (2) are different from those of real arithmetic operations. The distributive law, for instance, does not hold for (2). A summary of the algebraic behavior of interval arithmetic is given in [85].

The main interval arithmetic tool applied to optimization problems is the concept of an inclusion function. Let again X ∈ Iᵐ and f: X → ℝ. The set of compact intervals contained in X is denoted by I(X). Let f(Y) = {f(x) : x ∈ Y} for Y ∈ I(X) be the range of f over Y. A function F is called an inclusion function for f if

  f(Y) ⊆ F(Y) for any Y ∈ I(X).
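The four rules above transcribe directly into code. The following Python sketch omits the outward rounding that machine interval arithmetic (discussed below) supplies, so it is illustrative rather than rigorous:

# Direct transcription of the rules for (2); no rounding control.

def iadd(a, b): return (a[0] + b[0], a[1] + b[1])
def isub(a, b): return (a[0] - b[1], a[1] - b[0])
def imul(a, b):
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))
def idiv(a, b):
    if b[0] <= 0.0 <= b[1]:
        raise ZeroDivisionError("0 in denominator interval")
    return imul(a, (1.0/b[1], 1.0/b[0]))

A, B = (1.0, 2.0), (-1.0, 3.0)
print(iadd(A, B))                    # (0.0, 5.0)
print(isub(A, B))                    # (-2.0, 3.0)
print(imul(A, B))                    # (-2.0, 6.0)
print(idiv((1.0, 2.0), (2.0, 4.0)))  # (0.25, 1.0)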

The left and the right endpoint of F(Y) will be denoted by min F(Y) and max F(Y), respectively.

Inclusion functions can be constructed in any programming language in which interval arithmetic is simulated or implemented via natural interval extensions: Firstly, let g be any function pre-declared in some programming language (like sin, cos, exp, etc.). Then the corresponding pre-declared interval function IG is defined by IG(Y) = g(Y) for any Y ∈ I contained in the domain of g. Since the monotonicity intervals of pre-declared functions g are well known, it is easy to realize the interval functions IG on a computer. Nevertheless, the influence of rounding errors may be considered, see [30], for instance. Secondly, let f(x) be any function expression in the variable x ∈ ℝᵐ. So, f(x) may be an explicit formula or described by an algorithm not containing logical connectives at the moment. For simplicity, we assume that f(x) is representable in a programming language. Let Y ∈ Iᵐ or let Y be an interval variable over Iᵐ. Then the expression which arises if each occurrence of x in f(x) is replaced by Y, if each occurrence of a pre-declared function g in f(x) is replaced by IG, and if the arithmetic operations in f(x) are replaced by the corresponding interval arithmetic operations, is called the natural interval extension of f(x) to Y, and it is denoted by f(Y), see [71]. Due to (2) and the definition of the IG's we get the inclusion principle for (programmable) functions:

  a ∈ Y implies f(a) ∈ f(Y).  (3)

Therefore, f(Y), seen as a function in Y, is an inclusion function for the function f(x). For example, if f(x) = x₁ sin x₂ − x₃ for x ∈ ℝ³, then f(Y) = Y₁ sin(Y₂) − Y₃ (with sin understood as its pre-declared interval version) is the natural interval extension of f(x) to Y ∈ I³. If logical connectives occur in an expression, the extensions are similar, cf. [55,87].

Due to the algebraic properties of interval arithmetic, different expressions for a real function f can lead to interval expressions which are different as functions. For example, if f₁(x) = x − x² and f₂(x) = x(1 − x) for x ∈ ℝ, then f₁(Y) = Y − Y² = [−1, 1] and f₂(Y) = Y(1 − Y) = [0, 1] for Y = [0, 1]. For comparison, the exact range is f(Y) = [0, ¼]. In general, the problem arises as to how to find expressions of a given function that lead to natural interval extensions that are as good as possible. A partial solution to this problem can be found in [88].

A measure of the quality of an inclusion function F for f: X → ℝ is the so-called excess width ([71]), defined as w(F(Y)) − w(f(Y)) for all Y ∈ I(X), where w([a, b]) = b − a is the width of an interval. F is called of order α > 0 if

  w(F(Y)) − w(f(Y)) = O(w(Y)^α) for Y ∈ I(X),

where the width of a box Y = Y₁ × ⋯ × Yₘ is defined by w(Y) = max_{i=1,...,m} w(Yᵢ). In order to obtain good computational results it is necessary to choose inclusion functions having as high an order α as possible, when w(Y) is small, see for example [88].
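The dependency effect and the excess width can be observed directly in code; this sketch reuses the example f(x) = x − x² on Y = [0, 1] from above (again without rounding control):

# Two equivalent real expressions, two different natural interval extensions.

def isub(a, b): return (a[0] - b[1], a[1] - b[0])
def imul(a, b):
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))

Y = (0.0, 1.0)
f1 = isub(Y, imul(Y, Y))             # Y - Y*Y     -> (-1.0, 1.0)
f2 = imul(Y, isub((1.0, 1.0), Y))    # Y*(1 - Y)   -> (0.0, 1.0)
true_range = (0.0, 0.25)             # exact range of x - x^2 over [0, 1]

def width(a): return a[1] - a[0]
print(width(f1) - width(true_range))   # excess width 1.75
print(width(f2) - width(true_range))   # excess width 0.75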

The endpoints of the intervals must be machine numbers if interval arithmetic is implemented on a machine. This leads to a special topic called machine interval arithmetic. It can be considered as an approximation to interval arithmetic on computer systems. Machine interval arithmetic is based on the inclusion isotonicity of the interval operations in the following manner: Let us again assume that α, β are the unknown exact values at any stage of the calculation, and that only including intervals are known, α ∈ A, β ∈ B.


Then A, B might not be representable on the machine. Therefore A and B are replaced by the smallest machine intervals that contain A and B,

  A ⊆ A_M, B ⊆ B_M.

A machine interval is an interval which has left and right endpoints that are machine numbers. From (2) it follows that

  A ∘ B ⊆ A_M ∘ B_M.

The interval A_M ∘ B_M need not be a machine interval, and it is therefore approximated by (A_M ∘ B_M)_M, which is the smallest interval representable on the machine and which contains A_M ∘ B_M. This leads to the inclusion principle of machine interval arithmetic:

  α ∈ A, β ∈ B implies α ∘ β ∈ (A_M ∘ B_M)_M.  (4)
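A minimal way to mimic the outer approximation (A_M ∘ B_M)_M in Python (3.9 or later) is to push each computed endpoint to the adjacent floating-point number; real packages use directed hardware rounding instead of this crude but safe device:

# Outward rounding after each operation, in the spirit of (4).

import math

def outward(iv):
    return (math.nextafter(iv[0], -math.inf), math.nextafter(iv[1], math.inf))

def iadd_m(a, b):
    return outward((a[0] + b[0], a[1] + b[1]))

A = (0.1, 0.2)        # the decimal endpoints are themselves already rounded
B = (0.3, 0.4)
print(iadd_m(A, B))   # a machine interval certain to contain A + B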

Thus, the basic principle of interval arithmetic is retained in machine interval arithmetic, that is, the exact unknown result is contained in the corresponding known interval, and rounding errors are under control. We sum up: When a concrete problem has to be solved, our procedure is as follows: Firstly, the theory is done in interval arithmetic; secondly, the calculation is done in machine interval arithmetic; and finally, the inclusion principle provides the transition from interval arithmetic to machine interval arithmetic.

Many software packages for interval arithmetic are meanwhile available, which work under Fortran 77, Fortran 90, Pascal, C, C++, Prolog, etc. A good survey can be found, for instance, in [57].

Interval Newton Methods

Interval Newton methods are excellent methods for determining all zeros of a continuously differentiable vector-valued function φ: X → ℝᵐ where X ∈ Iᵐ. These methods are important tools for nonlinear optimization problems since they can be used for computing all critical points of the objective function f, by applying the methods to φ = ∇f, where φ is the gradient function of f and J_φ the Jacobian of φ, or for solving the Karush–Kuhn–Tucker or John conditions in constrained optimization. The interval Newton method was introduced by Moore [71] and it has been further extensively developed by many researchers. The latest state of the art for interval Newton methods may be found in [79]. See also ▶ Interval Newton Methods. The extensive treatment of the interval Newton method is not part of this introductory article, so we sketch it in an extremely simplified manner just in order to make the aim of the method understandable. For a detailed treatment see, for instance, [1,33,79,90,93], etc.

Interval Newton methods are closely connected to solving systems of linear interval equations. An unfortunate notation is widely used to describe this situation since it uses the notation of interval arithmetic in a doubtful manner which can lead to misunderstandings. That is, let A ∈ I^{m×m}, B ∈ Iᵐ; then the solution of the linear interval equation (with respect to x or X)

  Ax = B or AX = B

is not an interval vector X₀ that satisfies the equation, AX₀ = B, as one would expect. The solution is defined as the set

  X = {x ∈ ℝᵐ : ax = b for some a ∈ A, b ∈ B}.

Thus, for example, the solution of the linear interval equation [1, 2]x = [1, 2] is X = [1/2, 2]. In general, the solution set is not a box if m ≥ 2. It is therefore the aim of interval arithmetic solution methods to find at least a box which contains the solution set. Accordingly, if c ∈ ℝᵐ, then the solution of the linear interval equation

  A(x − c) = B or A(X − c) = B

with respect to x or X is defined to be the set X := c + Y := {c + y : y ∈ Y}, where Y is the solution of the interval equation Ay = B.

The following prototype algorithm aims to determine the zeros of φ: X → ℝᵐ in X ∈ Iᵐ. Let J(Y) be an inclusion function for the Jacobian matrix J_φ(x) for Y ∈ I(X).


1. Set X₀ := X.
2. For n = 0, 1, ...:
   a) choose xₙ ∈ Xₙ;
   b) determine a superbox Zₙ₊₁ of the solution Yₙ₊₁ of the linear interval equation (with respect to Y) J(Xₙ)(xₙ − Y) = φ(xₙ);
   c) set Xₙ₊₁ := Zₙ₊₁ ∩ Xₙ.

The interval Newton algorithm
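A one-dimensional transcription of the algorithm, with φ(x) = x² − 2 and J(X) = 2X as an illustrative choice, Zₙ₊₁ taken as the box hull, and rounding ignored:

def newton_iterates(X, steps=5):
    """phi(x) = x^2 - 2, J(X) = 2X; returns a box enclosing the zeros in X."""
    for _ in range(steps):
        xn = 0.5 * (X[0] + X[1])               # a) choose x_n (midpoint)
        J = (2.0 * X[0], 2.0 * X[1])           # inclusion of phi' over X_n
        if J[0] <= 0.0 <= J[1]:
            break                              # would need extended division
        q = ((xn*xn - 2.0) / J[0], (xn*xn - 2.0) / J[1])
        Z = (xn - max(q), xn - min(q))         # b) superbox Z_{n+1}
        X = (max(X[0], Z[0]), min(X[1], Z[1])) # c) X_{n+1} = Z_{n+1} ∩ X_n
        if X[0] > X[1]:
            return None                        # empty: no zero in X (property 2)
    return X

print(newton_iterates((1.0, 2.0)))             # tight enclosure of sqrt(2)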

Since we use it later, we emphasize that one iteration of the interval Newton algorithm is just the execution of a), b) and c) for a particular value of n. Interval Newton methods are distinguished by the particular choice of the superbox Zₙ₊₁. For example, if Zₙ₊₁ is the box hull of Yₙ₊₁, that is, the smallest box containing Yₙ₊₁, then the method is called the interval Newton method (in the proper sense). If Zₙ₊₁ is obtained by using interval Gauss–Seidel steps combined with preconditioning, as will be explained in the sequel, the method is named after Hansen and S. Sengupta [35]. Krawczyk's method [60] and Hansen–Greenberg's methods [34] are also widely used.

Convergence properties exist under certain assumptions. The following general properties are useful for understanding the principle of application of the algorithm, see [1,71,73,79]:
1) If a zero ζ of φ exists in X, then ζ ∈ Xₙ for all n. This means that no zero is ever lost! This implies that:
2) If Xₙ is empty for some n, then φ has no zeros in X.
3) If Zₙ₊₁ is obtained by Gauss–Seidel or Gauss elimination, possibly combined with preconditioning as mentioned below, then
   i) if Zₙ₊₁ ⊆ Xₙ for some n, then φ has a zero in X;
   ii) if Zₙ₊₁ ⊆ int Xₙ for some n, then φ has a unique zero in X (where int means topological interior).
4) Under certain conditions one obtains w(Xₙ₊₁) ≤ α (w(Xₙ))² for some constant α ≥ 0.

A very promising realization of the interval Newton algorithm is the Hansen–Sengupta version [35], where the linear system occurring in the Newton iteration step is solved by a preconditioning step and by relaxation steps (Gauss–Seidel). Now we discuss just one iteration of the Hansen–Sengupta variant and suppress the index n when writing down the formulas that occur in the nth iteration. That is, we write

  J(X)(x − Y) = φ(x)  (5)

instead of J(Xₙ)(xₙ − Y) = φ(xₙ), and, accordingly, we search for a superset Z of the solution set of (5), where X, J(X), x and φ(x) are given. The solution set of (5) is also denoted by Y.

The Preconditioning Step

It was already argued by Hansen and R.R. Smith [37] that (5) was best solved by pre-multiplying by an approximate inverse of the midpoint of J(X). If the approximate inverse is B, we obtain BJ(X)(x − Y) = Bφ(x), or

  M(x − Y) = b,  (6)

where M = BJ(X) and b = Bφ(x). In this manner the system has been modified to a system that is almost diagonally dominant, provided the widths of the Jacobian entries are not too large, and it is then amenable to Gauss–Seidel type iterations. It is obvious that the solution set of (6) contains the solution set of (5), such that no solution is lost in the above transformation. During the last years much research has been focusing on the preconditioning step, cf. for example, [59].

The Relaxation Steps (Gauss–Seidel)

The relaxation procedure for linear interval equations was developed in [36]. It consists mainly in the interpretation of the well-known noninterval Gauss–Seidel iteration procedure in an interval context. But much care is taken in the interval realization if division through intervals that contain zero occurs. We do not have the space for a complete discussion and refer, for example, to [33,90,93]. Instead of a relaxation iteration, interval Gauss elimination can be used. This is nothing more than the well-known Gauss elimination performed in an interval setting. Interval Gauss elimination is not as robust as the interval Gauss–Seidel steps. It is, however, more effective under certain conditions (for instance, if the Jacobian or the preconditioned Jacobian matrix is diagonally dominant, see [79]). Practical experiences show that it is best to combine Gauss–Seidel steps with Gaussian elimination, cf. [33,79,90].
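The preconditioning step can be sketched for a 2×2 system as follows; the interval Jacobian and the right-hand side are invented for illustration, and B is the exact inverse of the midpoint matrix:

# Forming the preconditioned system (6): M = B*J(X), b = B*phi(x).

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def iscale(s, a):                       # point * interval
    return (min(s*a[0], s*a[1]), max(s*a[0], s*a[1]))

J = [[(3.8, 4.2), (0.9, 1.1)],          # interval Jacobian J(X)
     [(0.9, 1.1), (2.8, 3.2)]]
phi_x = [0.5, -0.25]                    # phi evaluated at the point x

mid = [[0.5 * (e[0] + e[1]) for e in row] for row in J]
det = mid[0][0]*mid[1][1] - mid[0][1]*mid[1][0]
B = [[ mid[1][1]/det, -mid[0][1]/det],  # inverse of the midpoint matrix
     [-mid[1][0]/det,  mid[0][0]/det]]

M = [[iadd(iscale(B[i][0], J[0][j]), iscale(B[i][1], J[1][j]))
      for j in range(2)] for i in range(2)]
b = [B[i][0]*phi_x[0] + B[i][1]*phi_x[1] for i in range(2)]
print(M)   # near-identity interval matrix: almost diagonally dominant
print(b)

The resulting M is close to the identity matrix, which is exactly the diagonal dominance that makes the subsequent Gauss–Seidel sweeps effective.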

There is no urgent need for discussing convergence properties of the interval Newton algorithm, since only single iterations are incorporated into the optimization algorithms, cf. the sections on 'accelerating and related devices' and 'applications' below, and hence the convergence theory of the latter is applicable, cf. the sections on 'convergence properties of the prototype algorithms' and 'applications' below. Only if it is already certain or very likely that the computation is approaching a global minimizer does it make sense to switch to the complete interval Newton algorithm and finally enjoy the quadratic convergence property (cf. property 4). Such a situation occurs, for example, if the objective function, in the unconstrained case, or the Lagrangian, in the constrained case, is convex.

Three Prototype Algorithms for the Unconstrained Problem

The algorithms are designed to determine f* or X* or both, as will be described later. They have the box X, the inclusion function F for f: X → ℝ, and some accuracy parameters which may occur in the termination criteria as input parameters. The termination criteria will depend on the actual case and will not be specified here, but see, for example, item c) in the section on 'convergence properties of the prototype algorithms'.

For historical reasons, we go back to the roots of interval arithmetic optimization theory. We start with Moore's algorithm [71], which used uniform subdivision, but we already incorporate the first branch and bound steps as proposed by Skelboe [103], and finally we land at Hansen's algorithm [30,31], which was the first algorithm featuring convergence to both f* and X*.

Algorithm 1 initializes a list L = L₁ consisting of one pair (X, y), see Step 3. Then the list is modified and enlarged at each iteration, see Steps 8 and 9. At the nth iteration a list L = Lₙ consisting of n pairs is present,

  Lₙ = ((Zₙᵢ, zₙᵢ))ᵢ₌₁ⁿ, where zₙᵢ = min F(Zₙᵢ).

1. Calculate F(X).
2. Set y := min F(X).
3. Initialize list L = ((X, y)).
4. Choose a coordinate direction k parallel to an edge of maximum length of X = X₁ × ⋯ × Xₘ, i.e. k ∈ {i : w(X) = w(Xᵢ)}.
5. Bisect X normal to direction k, obtaining boxes V₁, V₂ such that X = V₁ ∪ V₂.
6. Calculate F(V₁), F(V₂).
7. Set vᵢ := min F(Vᵢ) for i = 1, 2.
8. Remove (X, y) from the list L.
9. Enter the pairs (V₁, v₁) and (V₂, v₂) into the list such that the second members of all pairs of the list do not decrease.
10. Denote the first pair of the list by (X, y).
11. If the termination criteria hold, go to 13.
12. Go to 4.
13. End.

Algorithm 1: Moore–Skelboe
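A compact transcription of Algorithm 1 in Python, with the list kept as a heap ordered by the lower bounds; F is an illustrative natural interval extension (of f(x₁, x₂) = x₁² + x₂²), and the termination criterion and rounding control are simplified:

import heapq

def F(box):
    """Natural interval extension of f(x1, x2) = x1^2 + x2^2."""
    def sq(iv):
        lo, hi = iv
        if lo >= 0.0:
            return (lo*lo, hi*hi)
        if hi <= 0.0:
            return (hi*hi, lo*lo)
        return (0.0, max(lo*lo, hi*hi))
    s1, s2 = sq(box[0]), sq(box[1])
    return (s1[0] + s2[0], s1[1] + s2[1])

def moore_skelboe(X, F, tol=1e-8):
    L = [(F(X)[0], X)]                         # Step 3: list of (y, box)
    while True:
        y, X = L[0]                            # Step 10: leading pair
        if F(X)[1] - y < tol:                  # a simple termination criterion
            return y, X
        k = max(range(len(X)), key=lambda i: X[i][1] - X[i][0])   # Step 4
        m = 0.5 * (X[k][0] + X[k][1])
        V1 = X[:k] + ((X[k][0], m),) + X[k+1:]                    # Step 5
        V2 = X[:k] + ((m, X[k][1]),) + X[k+1:]
        heapq.heapreplace(L, (F(V1)[0], V1))   # Steps 8-9
        heapq.heappush(L, (F(V2)[0], V2))

y, box = moore_skelboe(((-1.0, 2.0), (-1.5, 1.0)), F)
print(y, box)                                  # y -> f* = 0, box -> (0, 0)

The midpoint test discussed in the following paragraphs would be added by discarding heap entries whose lower bound exceeds the best verified function value f̃ₙ.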

The leading pair of the list Lₙ will be denoted by (Xₙ, yₙ) = (Zₙ₁, zₙ₁). The boxes Xₙ are called the leading boxes of the algorithm. It is assumed that the termination criteria of Step 11 are not satisfied during the whole computation, such that the algorithm will not stop. In this case an infinite sequence of lists is produced.

Algorithm 1 was mainly established to determine f*. Now, Ichida and Fujii [46] and Hansen [30,31] focused on the boxes Zₙᵢ in order to get reasonable inclusions for X*. While midpoint tests (cf. [2,30,31,46]) have no impact on the convergence properties of Algorithm 1, they are important when getting inclusions of X*. Midpoint tests are incorporated as follows: Let f̃ₙ be the lowest function value which has been calculated up to the completion of the list Lₙ. (If no function values are available, then min_{i=1,...,n} max F(Xᵢ) can be taken as f̃ₙ.) Then all pairs (Zₙᵢ, zₙᵢ) of L are discarded that satisfy

  f̃ₙ < zₙᵢ.

This gives a reduced list L̃ₙ. Let Uₙ = ∪ Zₙᵢ over all Zₙᵢ of the reduced list. Then two different procedures are known:

• Algorithm 2 [46] emerges from Algorithm 1 by keeping track of L̃ₙ instead of Lₙ (and thus having Uₙ available at each iteration).
• Algorithm 3 [30,31] is like Algorithm 2, but the reduced lists L̃ₙ are ordered with respect to the age or the widths of the boxes.

Variants of the three prototypes occur if the ordering of the lists and the bisection directions are changed, cf. the next two sections.

Convergence Properties of the Prototype Algorithms

The results presented in this section are proven in [77,86,89]. Let us first consider Algorithm 1. As in the previous section, we denote the leading pairs of Algorithm 1 by (Xₙ, yₙ). One can show that

a) w(Xₙ) → 0 as n → ∞.

This fact seems to be self-evident but it is not. For example, small modifications of the basic algorithm do not satisfy a), as is the case with the cyclic bisection method [74]. From the assumption

  w(F(Y)) − w(f(Y)) → 0 as w(Y) → 0 (Y ∈ I(X)),  (7)

it follows that

b) yₙ ≤ f* for any n; yₙ → f* as n → ∞; and f* − yₙ ≤ w(F(Xₙ)) (error estimate).

Assumption (7) is not very restrictive. It is almost always satisfied if natural interval extensions are used. However, (7) does not imply continuity, a Lipschitz condition on f, etc. Let F now satisfy

  w(F(Y)) → 0 as w(Y) → 0.  (8)

Clearly, (8) implies (7) and the continuity of f. Then

c) w(F(Xₙ)) → 0 as n → ∞ (that is, the error estimate tends to 0 and can thus be used for termination criteria),
d) each accumulation point of the sequence (Xₙ) is a global minimizer.

The convergence order of the approach yₙ → f* is described by the following two results:

e) Let any α > 0 and any converging sequence of reals be given. Then, to any f, there exists an inclusion function of order α for which (yₙ) converges more slowly than the given sequence.

This result indicates that the convergence can be arbitrarily slow and that no worst case exists, which is usually taken in order to establish formulas for the convergence speed or convergence order. If, however, only isotone inclusion functions (F is called isotone if Y ⊆ Z implies F(Y) ⊆ F(Z)) are considered, then the following estimate of the convergence speed is valid. Practically, this estimate characterizes the complete convergence theory, since it is always possible to find isotone inclusion functions with small effort.

f) If F is isotone and of order α, then f* − yₙ = O(n^{−α/m}).

In [16] some variants of this assertion are proven.

Algorithms 2 and 3 have nearly the same behavior as Algorithm 1 if the convergence to f* is considered. Their properties with respect to a determination of X* are as follows: Let (Uₙ) be the sequence of unions produced by Algorithm 2. If (8) is assumed, then

g) the sequence (Uₙ) is nested and converges (with respect to the Hausdorff metric for compact sets) to a superset D ⊇ X*. The probability that D is not equal to X* is zero, however.

Let now (Uₙ) be the sequence of unions produced by Algorithm 3. If (8) is assumed, then

h) the sequence (Uₙ) is nested and converges to X*.

Therefore, Hansen's Algorithm 3 is the only one of the three which features a satisfactory and guaranteed convergence to f* and X*. This algorithm will therefore play the main role in our further considerations.

Accelerating and Related Devices

Algorithm 3 and its predecessors which we have treated so far are based on the exhaustion principle, that is, the principle of removing areas (subboxes of X) which cannot contain a global minimizer. In the same manner we realize that the branch and bound principle forms the


overlying structure, that is, areas are processed which have the largest chance to contain a global minimizer. This is a time-consuming process, and it is therefore important to combine the principle mentioned with techniques for speeding up the computations. In this section we deal with only a few of these techniques in order to demonstrate how they may be combined with the basic algorithm. Much research is done in order to find an optimal combination of the basic algorithms and the acceleration devices, cf. for instance, [26,31,33,48,49,93,95,96]. In the following we give an overview of several acceleration devices and other tools that are used to improve the computational efficiency of unconstrained interval based global optimization. Most of them are also developed for constrained optimization.

The Monotonicity Test

It can be applied if f is differentiable and if an inclusion function for ∇f is available ([30,31,72]). It allows one to automatically recognize that f is strictly monotone in one of the variables in the subbox Y ⊆ X on which the algorithm is focusing. Then Y can be discarded from the list if Y lies in the interior of X, or otherwise Y can be replaced by an edge piece of Y. This can be done since the parts removed do not contain a global minimizer. That is, let Gᵢ be an inclusion function of ∂f/∂xᵢ for i = 1, ..., m. If now 0 ∉ Gᵢ(Y) for just one index i, then f is strictly monotone in the variable xᵢ over Y, such that Y can be discarded or replaced by an edge piece as mentioned before. (For the application of the test it is already sufficient that f is locally Lipschitz, cf. the next section.)
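A sketch of the test, with the gradient enclosure G invented for the illustrative function f(x₁, x₂) = x₁² + x₂:

def G(Y):
    """Gradient enclosure for f(x1, x2) = x1^2 + x2 over the box Y."""
    return [(2.0*Y[0][0], 2.0*Y[0][1]), (1.0, 1.0)]

def monotone_direction(Y):
    for i, gi in enumerate(G(Y)):
        if gi[0] > 0.0 or gi[1] < 0.0:    # 0 not in G_i(Y)
            return i                      # f strictly monotone in x_i over Y
    return None

Y = ((0.5, 1.0), (-1.0, 1.0))
print(monotone_direction(Y))   # 0: Y can be reduced to its lower x1-edge,
                               # or discarded entirely if Y lies in int X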

The Interval Newton Method

If f is twice continuously differentiable and if an inclusion function for the Hessian matrix function of f exists, then the interval Newton algorithm can be applied to f′ in order to get boxes that contain all zeros of f′. Together with the monotonicity test, the interval Newton algorithm counts as one of the most effective tools for solving optimization problems. The main advantage is not only the localization of the zeros of f′, but also a computationally very successful performance. This is based on the properties mentioned in the section on 'interval Newton methods', which result in reducing or splitting the search area. Finally, the contraction shows quadratic convergence under reasonable conditions.

Interval Newton methods can be applied in two different manners:
i) The method is applied to f′ in X (necessarily combined with some splittings of the search area) until all critical points of f are included in sufficiently small boxes Z, for example where w(Z) < ε. Then the search for the global minimizers is restricted to these remaining boxes Z and to the facet of X. This approach is, however, not too effective, since these zeros can be saddle points, local maximizers, or even local but not global minimizers. Hence the following procedure is used generally:
ii) Each iteration of the optimization algorithm is combined with the monotonicity test and one or two interval Newton iterations. That is, after X has been bisected into the subboxes V₁ and V₂, cf. Step 5 of Algorithm 3, the midpoint test, the monotonicity test and one interval Newton iteration are applied to V₁ and V₂ in order to diminish the size of V₁, V₂ or to discard them. This procedure avoids superfluous and costly interval Newton iterations in boxes in which f is strictly monotone or which have too large function values.

The interval Newton method can be improved by using slopes whenever possible, cf. [79]. See also 'use of good inclusion functions' below.

Finding a Function Value as Small as Possible

The smaller the smallest known or computed function value is at the nth iteration, the more effective is the midpoint test, that is, boxes are removed earlier than without these values. There are many possible techniques for getting lower function values, such as statistical and line search methods, bundle methods (line search in the nonsmooth case), descent methods, and Newton-like methods, where the application of the methods depends on the differentiability of the objective function. Many of these variations lead to so-called globally convergent methods. This does not mean that a global solution is found; however, it does mean that a local solution is always found. Good results in finding small function values have been attained with generating a not very dense set


of points and to use them as starting points for the globally convergent methods mentioned above.

Bisections

The computations can be accelerated by a good choice of
A) the next box of the list to be bisected; and
B) the bisection direction of that box.
These two topics did not draw too much attention in the first years of interval optimization. They were considered as a tiresome task for completing the algorithms rather than as topics important for the success of the algorithms. Meanwhile it has been recognized how important the right choice of box and bisection direction is for keeping computation time and costs low. The right choice of bisection direction is equally important for the global zero search of systems of functions.

Strategies for choosing the next box include uniform subdivision [71], bisecting a box which has a minimal lower bound, cf. Algorithms 1 and 2, bisecting the box which has been longest on the list [31], bisecting a box which has maximum width [31], and last in–first out [79], that is, the youngest boxes are always processed first, which keeps the list length short under certain circumstances.

When a box has been selected for bisection, one has to choose the bisection direction. Historically, the first three criteria were uniform subdivision [71] (that is, bisections were done in all m directions), cyclic bisection [74] (that is, the bisection directions change cyclically, i.e., the first box gets bisected normal to the first coordinate direction, the second normal to the second coordinate direction, etc.), and bisection normal to one of the longest box edges [31]. It turned out that using the box width as the only criterion for deciding the bisection direction could be very ineffective. For a typical example, see [91]. The conclusion from such examples is that the choice of the bisection direction should also consider the behavior of the function f over the box. Hence, formulas for deciding a bisection direction are built up using bounds for the box width of the objective function and bounds for the first and second partial derivatives. Natural interval extensions of noninterval scaling formulas are also used. Our own tests and experiments show that an optimum bisection strategy does not exist and that


it is reasonable to use several bisection strategies, each pursuing another heuristic aim. This led to systematic investigations of bisections and also trisections by several authors, mainly [15,18,53,54,96,97,106]. For further strategies see [93], where also a survey of convergence properties of some of the strategies can be found, and [92].
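As an illustration of such derivative-weighted rules, the following sketch implements one widely used heuristic (a 'maximal smear' rule of the Hansen/Ratz type, not the specific formulas of the references above): bisect the coordinate where width times an enclosure of |∂f/∂xᵢ| is largest:

def bisect_direction(Y, G):
    """Direction of largest width * |gradient| enclosure ('maximal smear')."""
    def smear(i):
        gi = G(Y)[i]
        return (Y[i][1] - Y[i][0]) * max(abs(gi[0]), abs(gi[1]))
    return max(range(len(Y)), key=smear)

G = lambda Y: [(10.0, 10.0), (1.0, 1.0)]   # for f(x1, x2) = 10*x1 + x2
Y = ((0.0, 1.0), (0.0, 5.0))
print(bisect_direction(Y, G))   # 0: the narrow but influential coordinate
                                # (a pure width rule would choose 1)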


Use of Good Inclusion Functions, Slope Arithmetic

The better the inclusion functions are, the more effective are tests like the midpoint test, monotonicity test, etc., cf. for instance, [88]. The derivatives can frequently be replaced by slopes, which leads to inclusions with smaller width. There is also an automatic slope arithmetic available which is comparable to automatic differentiation, cf. 'automatic differentiation' below. The interested reader is referred to [1,40,61,62,79,88,101].

The Nonconvexity Test

The aim of this test is to verify that the objective function is nowhere convex in some subbox Y ∈ I(X) by computationally checking whether the Hessian of the objective function does not satisfy some standard conditions of convexity. Then the interior of Y cannot contain a minimizer. Here f ∈ C² is assumed. The first such test seems to date back to [64].

Thin Evaluation of the Hessian Matrix

If interval Newton steps are incorporated, they will be applied to φ = f′, where the matrix J_φ(Y) = H_f(Y) is required and H_f(Y) is the natural interval extension of the Hessian matrix. By certain rearrangements of H_f(Y) and a special method of getting an interval extension, where not all real entries are replaced by intervals, it is possible to obtain an interval matrix which is thinner, hence better (cf. 'use of good inclusion functions' above) than H_f(Y). A detailed discussion and formulas can be found in [33,90].

Constraint Logic Programming

(also known as constraint solving) involves techniques where, among others, equations (for example, the Karush–Kuhn–Tucker or F. John conditions) are primarily not evaluated numerically but seen as constraints for, or as relations between, the variables (a concept taken over from artificial intelligence). The relation is then used to shrink the search domain. For example, the equation (constraint) y − x² = 0 immediately enables the halfspace defined by y < 0 to be removed from the working area. There are several methods based on that idea which are best embedded in appropriate languages where symbolic manipulation, such as in PROLOG, is available. For example, a method called relational interval algebra is embedded in the computer language CLP having PROLOG as metalanguage (cf. [7] or [83]). In this connection it is also opportune to automatically add redundant constraints in order to accelerate the computations (see, for example, [5] or [82]).

Another approach is called branch and prune ([41]). The pruning concept aims to shrink the search area by several tests. The crucial property which is searched for is the so-called box consistency, which was introduced in [6] and is also known in connection with discrete combinatorial search problems. Box consistency is primarily used to indicate the existence of solutions in the considered subarea and is some kind of substitute for interval Newton techniques. An interesting means for proving box consistency is bound consistency, which requires the checking of the facets of the box instead of the box itself. The branch and prune algorithm is embedded in NUMERICA, which is designed as a modeling language for global optimization and related problems, cf. [41]. There are several other approaches that are based on constraint logic programming, such as the use of relational manipulations or of set-valued operations; see, for example, [3] or [45] and the references listed there.

Automatic Differentiation

This technique seems to go back to [108]. It helps to reduce costs when computing derivatives or their inclusion functions, or expressions like (x − c)ᵀ f′(c), (Y − c) f′(Y), (x − c)ᵀ f′(Y), (Y − c)ᵀ f′(Y)(Y − c), etc., where x, c ∈ X, Y ∈ I(X). There are two modes of automatic differentiation, a forward and a reverse mode. Both modes use recursive techniques for evaluating function values and chain rules of differentiation. In the forward mode all intermediate values of the function are simultaneously determined with the corresponding intermediate values of derivative, Hessian, etc., and all these intermediate values are computed from values calculated in former

steps. The reverse mode requires some structural planning of the formulas, similar to the construction of Kantorovich graphs of functional expressions, where a new variable is assigned to each node. The differentiation finally starts backwards from the function in dependency of the variables introduced. Both modes have advantages. Our own experiences, however, show that in the case of interval expressions like (Y − c) f′(Y), or in the case of computing generalized gradients, it is not always wise to use automatic differentiation. The reason is that in such cases information about dependencies between intervals can be lost, so that the widths of the resulting interval values increase unnecessarily. For a detailed description of automatic differentiation cf. for instance, [23,24,29,84].

Parallel Computations

for global optimization were investigated and implemented primarily by [8,12,21,22].

Global Optimization Over Unbounded Domains and Nonsmooth Optimization

Global Optimization over Unbounded Domains

Almost all methods for solving global optimization problems need the assumption that a bounded domain which contains the solution points is known. The boundedness is necessary for the numerical computation as well as for guaranteeing the convergence properties. If an a-priori box X as search area for the global solutions is not known, it is possible to extend the previous algorithms, especially Algorithm 3, in such a manner that they can operate over unbounded boxes as well, cf. [90,94]. It is not even necessary to change the algorithms formally; one only has to define the midpoint and width of infinite intervals (both values have to be finite) and an arithmetic for infinite intervals. This arithmetic should provide intervals with minimal widths in order to get reasonable inclusion functions. It would go too far to present this arithmetic here, but a short example could be illustrative: This arithmetic assigns to the quotient [0, 1]/[0, 1] the value [0, ∞], whereas by an arithmetic which is called Kahan–Novea–Ratz arithmetic in [55] the value [−∞, ∞] results.
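The quotient example can be sketched as follows (simplified to nonnegative operands; div_minimal follows the minimal-width convention described here, div_kahan the Kahan-style rule of [55]):

from math import inf

def div_minimal(a, b):
    """Nonnegative a, b with b != [0, 0]: smallest interval containing
    {s/t : s in a, t in b, t != 0}."""
    lo = a[0] / b[1]
    hi = inf if b[0] == 0.0 else a[1] / b[0]
    return (lo, hi)

def div_kahan(a, b):
    if a[0] <= 0.0 <= a[1] and b[0] <= 0.0 <= b[1]:
        return (-inf, inf)                 # 0/0 possible: the whole line
    return div_minimal(a, b)

print(div_minimal((0.0, 1.0), (0.0, 1.0)))   # (0.0, inf)
print(div_kahan((0.0, 1.0), (0.0, 1.0)))     # (-inf, inf)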


Most of the convergence properties of the section on 'convergence properties of the prototype algorithms' remain valid under slight modifications of the assumptions, since one can interpret the algorithms as algorithms in (ℝ̄)ᵐ, where ℝ̄ is the two-point compactification of the real axis ℝ. Thus compact intervals are generated by the algorithms, and that is all one needs for convergence proofs. For details see [90,94] or the survey in [93].

Nonsmooth Optimization

A broad spectrum of mathematical programming problems can be reduced to nondifferentiable problems without constraints or with simple constraints. The use of exact nonsmooth penalty functions in problems of nonlinear programming, maximum functions to estimate discrepancies in constraints, piecewise smooth approximation of technical-economic characteristics in practical problems of optimal planning and design, and minimax compromise functions in problems of multicriterion optimization all generate problems of nonsmooth optimization. Thus, the objective function f of the optimization problem may look like f(x) = max{f₁(x), ..., fₙ(x)} where fᵢ ∈ C¹, or like f(x) = f₀(x) + (1/ε) Σᵢ₌₁ᵏ max(0, fᵢ(x)), which is a typical objective function arising from penalty methods, where f₀, fᵢ ∈ C¹ and ε > 0 is a (reciprocal) penalty factor.

Interval methods have no difficulties at all in handling nonsmooth problems, a fact which was discovered in [87] and rediscovered in [55] with great emphasis. The construction of inclusion functions does not depend at all on the smoothness of a function. The application of monotonicity tests and other devices where gradients are used (for instance, local noninterval methods, cf. 'finding a function value as small as possible' in the section on 'accelerating and related devices') is still possible as long as the function is locally Lipschitz, which means that, at any argument x of the function f, an open neighborhood of x, say Uₓ, exists in which f satisfies a Lipschitz condition. It follows by a theorem of H. Rademacher that f is differentiable almost everywhere in Uₓ. Let Ω be the set of points in Uₓ at which f is not differentiable, and let S be any other set of Lebesgue measure 0. Then the generalized gradient (also called subdifferential) of f at x is defined as

  ∂f(x) = conv { lim_{n→∞} ∇f(xₙ) : xₙ → x, xₙ ∉ S ∪ Ω },

where conv denotes the convex hull, cf. [14]. Let (x, y) ⊆ ℝᵐ denote the open line segment between x and y. A theorem of G. Lebourg says that, if y ∈ Uₓ with (x, y) ⊆ Uₓ is given, then some u ∈ (x, y) exists such that

  f(y) − f(x) ∈ (y − x)ᵀ ∂f(u).  (9)

Locally, (9) can be approximated by means of the Lipschitz constant. Globally, (9) can be used to find inclusion functions of f of a mean value type explicitly: If G(Y) is a (not necessarily bounded) box that contains ∂f(u) for any u ∈ Y, then

  F(Y) = f(c) + (Y − c)ᵀ G(Y) for Y ∈ I(X),

where c denotes the midpoint of Y (any other point of Y may also be chosen), is an inclusion function of f appropriate for use in Algorithms 1 to 3. Furthermore, G(Y) can be used for the monotonicity test: If only one component of G(Y) does not contain zero, then f is strictly monotone with respect to the corresponding direction. Algorithms 1 to 3 as well as the monotonicity test can therefore be applied to problem (1) without modifications if the objective function f is locally Lipschitz. It is, however, only possible to apply the interval Newton algorithm for a very restricted class of functions, since second 'derivatives' of locally Lipschitz functions are not yet explored satisfactorily. With the aid of the infinite interval arithmetic mentioned in the subsection above, one can also admit unbounded subdifferentials and handle them. For the construction of inclusion functions of the objective function and the subdifferential, and for numerical tests (with bounded and unbounded search areas), see, for instance, [27,28,55,87,90,94]. For further results in connection with estimates of the penalty factor see [111].
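For a concrete nonsmooth example, the mean value form F(Y) = f(c) + (Y − c)ᵀ G(Y) can be built for f(x) = |x|, whose generalized gradient is enclosed by [−1, 1] on boxes straddling 0 (one-dimensional, illustrative, no rounding control):

def G(Y):
    """Enclosure of the generalized gradient of |x| over Y."""
    if Y[0] >= 0.0:
        return (1.0, 1.0)
    if Y[1] <= 0.0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

def imul(a, b):
    p = (a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])
    return (min(p), max(p))

def F(Y):                                  # F(Y) = f(c) + (Y - c) * G(Y)
    c = 0.5 * (Y[0] + Y[1])
    t = imul((Y[0] - c, Y[1] - c), G(Y))
    return (abs(c) + t[0], abs(c) + t[1])

print(F((-0.5, 1.0)))   # (-0.5, 1.0): encloses the exact range [0, 1]

The same G also drives the monotonicity test: on a box with Y[0] ≥ 0 its enclosure [1, 1] excludes zero, so |x| is strictly monotone there.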

Constrained Optimization

The principles which were developed in the previous sections are also useful for constrained problems, that is,

  min_{x ∈ M} f(x),  (10)

where M ⊆ ℝᵐ is the feasible set defined by the constraints

  gᵢ(x) ≤ 0, i = 1, ..., k,
  hⱼ(x) = 0, j = 1, ..., s.


For simplicity, we assume that M ⊆ X for some X ∈ Iᵐ and that the functions f, gᵢ and hⱼ are defined on X. For a successful treatment of problem (10) we need inclusion functions F, Gᵢ and Hⱼ of f, gᵢ and hⱼ, respectively, which satisfy (8) and which have the property that

  w(Gᵢ(Y)) → 0 as w(Y) → 0,
  w(Hⱼ(Y)) → 0 as w(Y) → 0,  (11)

for i = 1, ..., k, j = 1, ..., s, and Y ∈ I(X). Then a very effective means of interval arithmetic is the infeasibility test, which is applicable to any Y ∈ I(X): If either

  Gᵢ(Y) > 0 for some i ∈ {1, ..., k}, or 0 ∉ Hⱼ(Y) for some j ∈ {1, ..., s},

then all points of Y are infeasible. (The notation [a, b] > 0 or [a, b] ≤ 0 is used to indicate that a > 0 or b ≤ 0 holds, respectively.) Hence the box Y can never contain a solution of (10), such that Y can be discarded from any procedure for solving (10). Conversely, if

  Gᵢ(Y) ≤ 0 for i = 1, ..., k, and Hⱼ(Y) = 0 for j = 1, ..., s,

then all points of Y are feasible (feasibility test). This is due to the inclusion principle (3), by which a ∈ Y implies gᵢ(a) ∈ Gᵢ(Y) as well as hⱼ(a) ∈ Hⱼ(Y) for all indices i and j, that is, gᵢ(a) ≤ 0 and hⱼ(a) = 0 for all i and j. This gives, in fact, the guarantee that every point a ∈ Y is feasible. However, if equality constraints are present in (10), it is extremely unlikely that conditions like Hⱼ(Y) = 0 are satisfied, such that the feasibility test is rather an academic tool if s > 0.

There are principally two main possibilities for solving the constrained problem. The first possibility is to transform the problem to an unconstrained problem within a penalty setup and apply the methods of the former sections together with the feasibility, respectively infeasibility, test in order to have the guarantee of being in M or to discard infeasible areas. The second possibility is a direct approach where Algorithm 3 is enriched by the feasibility and infeasibility tests and adapted to handle the constrained case. We will now give a brief discussion of these possibilities.
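Both tests transcribe directly; G and H below return lists of interval enclosures of the gᵢ and hⱼ over a box Y, and the single constraint is invented for illustration:

# Infeasibility and feasibility tests for a box Y.

def infeasible(Y, G, H):
    return any(gi[0] > 0.0 for gi in G(Y)) or \
           any(not (hj[0] <= 0.0 <= hj[1]) for hj in H(Y))

def certainly_feasible(Y, G, H):
    # with equality constraints this almost never fires (see above)
    return all(gi[1] <= 0.0 for gi in G(Y)) and \
           all(hj == (0.0, 0.0) for hj in H(Y))

# one inequality constraint g(x) = x1 + x2 - 1 <= 0, no equalities
G = lambda Y: [(Y[0][0] + Y[1][0] - 1.0, Y[0][1] + Y[1][1] - 1.0)]
H = lambda Y: []
print(infeasible(((2.0, 3.0), (1.0, 2.0)), G, H))          # True
print(certainly_feasible(((0.0, 0.2), (0.0, 0.3)), G, H))  # True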

The Penalty Approach

There are two kinds of penalty functions which are usually preferred. The first one is the so-called L1-exact penalty function,

  Φ(x) = f(x) + (1/ε) ( Σᵢ₌₁ᵏ max(0, gᵢ(x)) + Σⱼ₌₁ˢ |hⱼ(x)| ),

cf. also the subsection 'nonsmooth optimization' in the previous section. The second one, already introduced by R. Courant, is defined as

  ψ(x) = f(x) + (1/ε) ( Σᵢ₌₁ᵏ max(0, gᵢ(x))² + Σⱼ₌₁ˢ (hⱼ(x))² ).

In both cases, ε is a penalty factor. For details, and how penalty methods are applied to solve constrained optimization problems, cf. [25]. (Augmented Lagrangian functions could also be taken for the penalty approach.) When locally solving (10) with standard noninterval methods, Φ has the advantage that there exists an ε so that the local minimizers of Φ are also local minimizers of (10), but it has the disadvantage of being nonsmooth. The use of ψ has the advantage of dealing with a smooth function (provided f and the constraints are smooth), but the disadvantage that the minimizers of ψ might attain the solutions of (10) only asymptotically as ε tends to zero. If f and the constraint functions are smooth, there exists a value ε in both cases of penalty functions so that the global minimizers of Φ and ψ are also global solutions of (10) when solving (10) with interval methods. The explicit determination of this number ε is still under investigation, cf. [111]. On the other hand, the knowledge of the value is not necessary if only convergence is expected, because infeasible areas are removed by the infeasibility test which has to be incorporated in a prototype algorithm such as Algorithm 3. The knowledge of the value of ε accelerates the computation. A further discussion would be too extensive for this article.
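A small sketch of the two penalty constructions, with the reciprocal-factor convention used above (the example problem is invented):

def phi_l1(f, gs, hs, eps):
    """L1-exact penalty  f + (1/eps)*(sum max(0, g_i) + sum |h_j|)."""
    return lambda x: f(x) + (1.0/eps) * (
        sum(max(0.0, g(x)) for g in gs) + sum(abs(h(x)) for h in hs))

def psi_courant(f, gs, hs, eps):
    """Courant penalty  f + (1/eps)*(sum max(0, g_i)^2 + sum h_j^2)."""
    return lambda x: f(x) + (1.0/eps) * (
        sum(max(0.0, g(x))**2 for g in gs) + sum(h(x)**2 for h in hs))

f  = lambda x: (x[0] - 1.0)**2
gs = [lambda x: x[0] - 0.5]          # constraint x1 <= 0.5
hs = []
p = psi_courant(f, gs, hs, eps=1e-3)
print(p([0.5]), p([1.0]))            # feasible boundary point vs. penalized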


The Direct Approach

Algorithm 3 is also appropriate as a base algorithm for dealing with the constrained case. In order to take the constraints into account, one just has to add the feasibility and infeasibility tests and to apply the latter test as a box-deleting device to the boxes V₁ and V₂ after Step 5 of the algorithm. If it turns out that a box is feasible, it should be marked as feasible by a flag or a Boolean value. The remaining boxes of the list are indeterminate, that is, the tests executed up to the current state of the computation have not yet been able to decide whether the box is feasible or not. It can happen that boxes Vᵢ which are feasible (respectively, infeasible) are not recognized as feasible (respectively, infeasible) by the feasibility (respectively, infeasibility) test. This is due to the excess width (see the section 'interval arithmetic' above), which, for instance, can cause 0 ∈ Gᵢ(V) to occur for some box V even if gᵢ(V) > 0 holds. The continued processing of the indeterminate boxes of the list by the steps of Algorithm 3, however, reduces the box widths to zero, so that their excess widths also tend to zero as long as (11) is assumed. This also implies that the union of the boxes of L tends, as the computation proceeds, to M with respect to the Hausdorff metric (cf. [90]) if one drops the midpoint test.

The midpoint test itself helps to discard feasible as well as indeterminate boxes which contain no global minimizer. The execution is as in the unconstrained case: Let f̃ₙ be the lowest function value which has been calculated up to the completion of the list Lₙ. Then all pairs (Zₙᵢ, zₙᵢ) of L are discarded that satisfy f̃ₙ < zₙᵢ. It is important for the correctness of the algorithm that only function values of points x ∈ M are admitted. Hence, if x is taken from a feasible box of the list, x is certainly feasible. If the list contains only indeterminate boxes, no direct access to feasible points of M is at hand. This is regularly the case if equality constraints are present. But without the knowledge of points x ∈ M, the midpoint test cannot be executed. Two possibilities are known for overcoming this hurdle.

The first possibility is the so-called ε-inflation. It accepts that the constraints are satisfied within a tolerance of ε. If ε-inflation, which is widely used in noninterval computations, is applied, then the reliability of the computation is lost. Thus this possibility is avoided as far as possible in interval computations. The second possibility to overcome the difficulties arising from equality constraints is based on the application of Moore's test for the existence of solutions of equations [73]. Hansen and G.W. Walster [38] were the first to suggest applying this test to constrained optimization. It is used in the following manner: If Y is an indeterminate box under processing and one looks for a feasible point in Y, the equality constraints and the inequality constraints which are active with respect to Y are combined into a system of equations. Then interval Newton iterations are applied to this system in Y, not to solve the system but only to prove the existence of


Two possibilities are known for overcoming this hurdle. The first possibility is the so-called ε-inflation. It accepts that the constraints are satisfied within a tolerance of ε. If ε-inflation, which is widely used in noninterval computations, is applied, then the reliability of the computation is lost; thus this possibility is avoided as far as possible in interval computations. The second possibility to overcome the difficulties arising from equality constraints is based on the application of Moore's test for the existence of solutions of equations [73]. Hansen and G.W. Walster [38] were the first to suggest applying this test to constrained optimization. It is used in the following manner: If Y is an indeterminate box under processing and one looks for a feasible point in Y, the equality constraints and the inequality constraints which are active with respect to Y are combined into a system of equations. Then interval Newton iterations are applied to this system in Y, not to solve the system, but only to prove the existence of a solution within Y by a contraction of the Newton operator. Then all boxes (Z_i^n, z_i^n) of the list (feasible or indeterminate) can be discarded that satisfy max f(Y) < z_i^n, since max f(Y) is an upper bound for a function value at a feasible point. If the system of equations has more variables than equations, some variables are replaced by constants. The existence test in Y is best done in the following manner: Apply a simple local noninterval optimization algorithm to the objective function

φ(x) = Σ_{i=1}^k max(0, g_i(x))² + Σ_{j=1}^s (h_j(x))²

(this is the Courant penalty function for f(x) ≡ 0, cf. the first subsection of this section) in order to come near a feasible point, say c. Put a small box, which has to lie in Y, around c, and apply the existence test to the system in this box (possibly cleaned up by removing inequality constraints that have meanwhile become inactive). If the test is positive, the existence of a feasible point in the box, and hence in Y, is guaranteed. However, if the test fails, this is not at all a proof that Y is infeasible. An improvement is due to [58], where techniques to search for points c ∈ Y are designed so that the chances of finding a nearby feasible point are maximized; there, the number of variables can also be larger than the number of equations in the underlying system. The convergence of the union of the list boxes to the set of global minimizers can be shown if the test for the existence of feasible points is applied systematically and successfully to the boxes of the list (as far as they are indeterminate). Other convergence proofs can be found in [8,90]. In order to obtain not only a convergent but also a rapidly convergent algorithm, acceleration devices and related techniques are again extremely important for practical computations. Well-known techniques are the following:
i) Interval Newton iterations. They are applied to the F. John conditions to enclose the stationary points, similarly to the unconstrained case. Since the number of equations exceeds the number of variables by 1 in the F. John conditions, an additional equation is added which does not influence the determination of the stationary points, cf. [39]. As in the unconstrained case, the interval Newton iterations are not executed until termination, but they merge with the steps of the optimization algorithm. Again, if an iteration shows the existence of an F. John point, it is a feasible point and can be used for the midpoint test.


In contrast to several authors, we do not count the interval Newton iterations as basic steps of an optimization algorithm, since they do not influence the convergence, only the convergence speed, of the algorithm.
ii) Monotonicity and nonconvexity tests, cf. the section on 'accelerating and related devices'. These tests are best applied to feasible boxes, but there are also exceptions to this suggestion, cf. [70,93]. Linearization of the constraints, supplementing the infeasibility test, is also used [33].
iii) Good inclusion functions, slope arithmetic, automatic differentiation, bisections, parallel algorithms, and constraint logic programming have already been mentioned in the section on 'accelerating and related devices'.
iv) Local search devices. In order to obtain function values of feasible points early, local noninterval optimization procedures are applied to the function φ, as defined above, related to the current box, until one reaches a feasible point or until one is near a feasible point. In the latter case, the existence test has to be applied at this approximation with respect to a small surrounding box in order to guarantee existence. In the case of full-dimensional feasible domains, the local search can be continued with f instead of φ, but one has to take care not to leave the domain M.
It turned out that the performance of the algorithm is greatly influenced by how the steps of the optimization algorithm and the acceleration devices are combined. Several investigations dealing with this matter have been done, cf. for example [8,18,19,26,38,39,49,56,95,100,109,110].

Applications

Global optimization using interval arithmetic has been applied to optimization problems in a variety of science, engineering and social science areas. Below we briefly describe representative examples from several areas.

Chemistry and Chemical Engineering

Many optimization problems in the fields of chemistry and chemical engineering can be investigated effectively using the tools described in the previous sections.

As a first example we consider the diagram of a chemical process showing the processing units and the connections between them. Such a diagram depicts the flow of chemical components through the system; it is often referred to as a process flowsheet, and the associated optimization problems are called process flowsheeting problems. They require the solution of large sparse differential-algebraic systems. In [99] a parallel interval Newton algorithm combined with bisection techniques is applied to solve a number of simple problems of this type, where the parallelization is required in order to complete the computations within a reasonable timeframe. The reliable prediction of phase stability in a chemical process simulation has been considered in [42,43]. It is pointed out that conventional methods that are initialization dependent may converge to trivial or nonphysical solutions or to a nonglobal local minimum. It is furthermore shown that these difficulties can be avoided using a cubic equation of state model combined with interval tools. Their technique is initialization independent, and it solves the phase stability problem with complete reliability. In [44] the approach is further developed with respect to computational efficiency. An enhanced method is presented, based on sharpening the ranges of the interval functions that occur in the algorithm. It is shown that the computation time can be reduced by nearly an order of magnitude in some cases. The paper [69] addresses the problem of minimizing the Gibbs free energy in the m-component multiphase chemical and phase equilibrium problem involving different thermodynamic models. The solution method is based on the tangent-plane criterion of Gibbs, and the problem is reduced to a finite sequence of local optimization steps in K(m−1)-dimensional space, where K ≤ m is the number of phases, and global optimization steps in (m−1)-dimensional space. The algorithm developed for the lower-dimensional space uses techniques from interval analysis. Some promising results are reported for the algorithm. A parallel interval algorithm for the problem was developed in [9]. Chemists performing photoelectron spectroscopy collide photons with atoms or molecules. These collisions result in the ejection of photoelectrons.


The chemist is left with a photoelectron spectrum, which is a plot of the number of photoelectrons ejected as a function of the kinetic energy of the photoelectron. A typical spectrum consists of a number of peaks, and the chemist would like to resolve the individual peaks in the spectrum. In the paper [76] a test problem is constructed as a sum of two Gaussian functions involving a number of parameters. These parameters are found using interval techniques of global optimization.

Physics, Electronics and Mechanical Engineering

A wide variety of problems in physics, electronics and mechanical engineering can be formulated as optimization problems amenable to the techniques described in the previous sections. We provide some representative examples below. An early application is found in [68], where interval global optimization is applied to electronic switching systems for efficiency reasons. In [10] interval global optimization is used to determine rigorous bounds on Taylor maps of general optical systems. It is also pointed out that stability for storage rings and other weakly nonlinear systems can be guaranteed using these developments. In [78] Hansen's method is applied to a demagnifying system for an electron beam lithography device, finding all real minimizers of a real-valued objective function of several variables. Computer-aided simulation tools for liquid crystal displays have been developed in recent years. These tools calculate the molecular orientation of the liquid crystal material by minimizing an energy function. The results of such simulations are used to optimize notebook computer displays. In the paper [80] interval global optimization is used to calculate all minimizing molecule configurations. Interval global optimization is applied to the optimal design of a flat composite plate and a composite stiffened panel structure in [63]. The methodology is to generate a feasible suboptimal interval which is used to examine the manufacturing tolerance in the design optimization.

Economics

Global optimization using interval analysis has also found applications in economics. Two examples are presented below.


A model of copyable products such as software is considered in [107], based on the model developed by I.E. Besanko and W.L. Winston [11]. In the paper [107] this model is solved for a globally optimal result using an interval branch and bound method. In [50] another problem in economics is considered. The problem is to minimize an econometric function

Σ_t ( y_t − β̂_1 − β̂_2 X_t2 − β̂_2² X_t3 )²,

where the data are artificially generated for the variables. Several tests are performed, and it is shown that interval methods are competitive with other methods such as simulated annealing.

See also

αBB Algorithm
Automatic Differentiation: Point and Interval
Automatic Differentiation: Point and Interval Taylor Operators
Bounding Derivative Ranges
Continuous Global Optimization: Applications
Continuous Global Optimization: Models, Algorithms and Software
Global Optimization in the Analysis and Management of Environmental Systems
Global Optimization: Application to Phase Equilibrium Problems
Global Optimization in Batch Design Under Uncertainty
Global Optimization in Generalized Geometric Programming
Global Optimization Methods for Systems of Nonlinear Equations
Global Optimization in Phase and Chemical Reaction Equilibrium
Interval Analysis: Application to Chemical Engineering Design Problems
Interval Analysis: Differential Equations
Interval Analysis: Eigenvalue Bounds of Interval Matrices
Interval Analysis: Intermediate Terms
Interval Analysis: Nondifferentiable Problems
Interval Analysis: Parallel Methods for Global Optimization


Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
Interval Analysis: Systems of Nonlinear Equations
Interval Analysis: Unconstrained and Constrained Optimization
Interval Analysis: Verifying Feasibility
Interval Constraints
Interval Fixed Point Theory
Interval Linear Systems
Interval Newton Methods
MINLP: Branch and Bound Global Optimization Algorithm
MINLP: Global Optimization with αBB
Mixed Integer Nonlinear Programming
Smooth Nonlinear Nonconvex Optimization

References
1. Alefeld G, Herzberger J (1983) Introduction to interval computations. Acad. Press, New York
2. Asaithambi NS, Shen Z, Moore RE (1982) On computing the range of values. Computing 28:225–237
3. Babichev AB, Kadyrova AB, Kashevarova TP, Leshchenko AS, Semenov AL (1993) UniCalc, a novel approach to solving systems of algebraic equations. Interval Comput 2:29–47 (Special Issue)
4. Bauch H, Jahn K-U, Oelschlägel D, Süsse H, Wiebigke V (1987) Intervallmathematik: Theorie und Anwendungen. Teubner, Leipzig
5. Benhamou F, Granvilliers L (1997) Automatic generation of numerical redundancies for non-linear constraint solving. Reliable Computing 3:335–344
6. Benhamou F, McAllester D, Van Hentenryck P (1994) CLP(intervals) revisited. In: Logic Programming, Proc. 1994 Internat. Symp., Ithaca, NY. MIT, Cambridge, MA, pp 124–138
7. Benhamou F, Older WJ (1997) Applying interval arithmetic to real, integer and boolean constraints. J Logic Programming 32:1–24
8. Berner S (1995) Ein paralleles Verfahren zur verifizierten globalen Optimierung. Diss. Univ. Wuppertal
9. Berner S, McKinnon KIM, Millar C (1997) A parallel algorithm for the global minimization of Gibbs free energy. Ann Oper Res 90:271–292
10. Berz M (1997) Differential algebras with remainder and rigorous proofs of long-term stability. AIP Conf Proc 391:221–228
11. Besanko IE, Winston WL (1989) Copyright protection and intertemporal pricing: does unauthorized copying lead to higher prices? Working Paper, School of Business, Indiana Univ.

12. Caprani O, Godthaab B, Madsen K (1993) Use of a real-valued local minimum in parallel interval global optimization. Interval Comput 2:71–82
13. Caprani O, Madsen K (1979) Interval methods for global optimization. In: Vogt WG, Mickle MH (eds) Modeling and Simulation, 10:3. Instrument Soc. Amer., Pittsburgh, PA, pp 589–593
14. Clarke FH (1983) Optimization and nonsmooth analysis. Wiley, New York
15. Csallner AE, Csendes T (1995) Convergence speed of interval methods for global optimization and the joint effects of algorithmic modifications. Talk given at SCAN'95, Wuppertal
16. Csallner AE, Csendes T (1996) The convergence speed of interval methods for global optimization. Comput Math Appl 31:173–178
17. Csendes T, Pinter J (1993) A new interval method for locating the boundary of level sets. Internat J Comput Math 49:53–59
18. Csendes T, Ratz D (1997) Subdivision direction selection in interval methods for global optimization. SIAM J Numer Anal 34:922–938
19. Csendes T, Zabinsky ZB, Kristinsdottir BP (1995) Constructing large feasible suboptimal intervals for constrained linear optimization. Ann Oper Res 58:279–293
20. Dussel R (1972) Einschliessung des Minimalpunktes einer streng konvexen Funktion auf einem n-dimensionalen Quader. Diss. Univ. Karlsruhe
21. Eriksson J (1991) Parallel global optimization using interval analysis. PhD Thesis Univ. Umeå
22. Eriksson J, Lindstroem P (1995) A parallel interval method implementation for global optimization using dynamic load balancing. Reliable Computing 1:77–92
23. Fischer H (1993) Automatic differentiation and applications. In: Adams E, Kulisch U (eds) Scientific Computing with Automatic Result Verification. Acad. Press, New York, pp 105–142
24. Fischer H (1995) Automatisches Differenzieren. In: Herzberger J (ed) Wissenschaftliches Rechnen: Eine Einführung in das Scientific Computing. Akad. Verlag, Berlin, pp 53–104
25. Fletcher R (1987) Practical methods of optimization, 2nd edn. Wiley, New York
26. Goos A, Ratz D (1997) Praktische Realisierung und Test eines Verifikationsverfahrens zur Lösung globaler Optimierungsprobleme mit Ungleichungsnebenbedingungen. Inst. Angew. Math., Univ. Karlsruhe, Karlsruhe
27. Görges C, Ratschek H (1997) -Interval methods in nonsmooth optimization. In: Gritzmann P, Horst R, Sachs E, Tichatschke R (eds) Recent Advances in Optimization. Springer, Berlin, pp 75–89
28. Görges C, Ratschek H (1999) Global interval methods for local nonsmooth optimization. J Global Optim 14:157–179


29. Griewank A, Corliss G (eds) (1991) Automatic differentiation of algorithms: Theory, implementation, and application. SIAM, Philadelphia
30. Hansen ER (1979) Global optimization using interval analysis: the one-dimensional case. J Optim Th Appl 29:331–344
31. Hansen ER (1980) Global optimization using interval analysis: the multidimensional case. Numer Math 34:247–270
32. Hansen ER (1988) An overview of global optimization using interval analysis. In: Moore RE (ed) Reliability in Computing: The Role of Interval Methods in Scientific Computing. Acad. Press, New York, pp 289–307
33. Hansen ER (1992) Global optimization using interval analysis. M. Dekker, New York
34. Hansen ER, Greenberg RI (1983) An interval Newton method. Appl Math Comput 12:89–98
35. Hansen ER, Sengupta S (1980) Global constrained optimization using interval analysis. In: Nickel K (ed) Interval Mathematics. Acad. Press, New York, pp 25–47
36. Hansen ER, Sengupta S (1981) Bounding solutions of systems of equations using interval analysis. BIT 21:203–211
37. Hansen ER, Smith RR (1967) Interval arithmetic in matrix computations, Part II. SIAM J Numer Anal 4:1–9
38. Hansen ER, Walster GW (1987) Nonlinear equations and optimization. Preprint
39. Hansen ER, Walster GW (1993) Bounds for Lagrange multipliers and optimal points. Comput Math Appl 25:59–69
40. Hansen P, Jaumard B, Xiong J (1992) The cord-slope form of Taylor's expansion in univariate global optimization. JOTA 80:441–464
41. Van Hentenryck P, Michel L, Deville Y (1997) Numerica: A modeling language for global optimization. MIT, Cambridge, MA
42. Hua JZ (1997) Interval methods for reliable computations of phase equilibrium from equation of state models. PhD Thesis Univ. Illinois at Urbana-Champaign
43. Hua JZ, Brennecke JF, Stadtherr MA (1996) Reliable phase stability analysis for cubic equation of state models. Comput Chem Eng 20:S395–S400
44. Hua JZ, Brennecke JF, Stadtherr MA (1998) Enhanced interval analysis for phase stability: Cubic equation of state models. Industr Eng Chem Res 37:1519–1527
45. Hyvönen E, De Pascale S (1996) Interval computations on the spreadsheet. In: Kearfott RB, Kreinovich V (eds) Applications of Interval Computations. Kluwer, Dordrecht
46. Ichida K, Fujii Y (1979) An interval arithmetic method of global optimization. Computing 23:85–97
47. Van Iwaarden RJ (1996) An improved unconstrained global optimization algorithm. PhD Thesis Univ. Colorado at Denver
48. Jansson C (1994) On self-validating methods for optimization problems. In: Herzberger J (ed) Topics in Validated Computations. Elsevier, Amsterdam, pp 381–438


49. Jansson C, Knüppel O (1995) A branch and bound algorithm for bound constrained optimization problems without derivatives. J Global Optim 7:297–331
50. Jerrell M (1994) Global optimization using interval arithmetic. J Comput Economics 7:55–62
51. Kahan WM (1968) A more complete interval arithmetic. Lecture Notes, Univ. Michigan, Ann Arbor, MI
52. Kalmykov SA, Shokin YuI, Yuldashev ZKh (1986) Methods of interval analysis. Nauka, Moscow (in Russian)
53. Kearfott RB (1987) Abstract generalized bisection and a cost bound. Math Comput 49:187–202
54. Kearfott RB (1990) Preconditioners for the interval Gauss–Seidel method. SIAM J Numer Anal 27:804–822
55. Kearfott RB (1996) Interval extensions of non-smooth functions for global optimization and nonlinear systems solvers. Computing 57:149–162
56. Kearfott RB (1996) A review of techniques in the verified solution of constrained global optimization problems. In: Kearfott RB, Kreinovich V (eds) Applications of Interval Computations. Kluwer, Dordrecht, pp 23–59
57. Kearfott RB (1996) Rigorous global search: continuous problems. Kluwer, Dordrecht
58. Kearfott RB (1997) On proving existence of feasible points in equality constrained optimization. Preprint
59. Kearfott RB, Shi X (1996) Optimal preconditioners for the interval Gauss–Seidel method. In: Alefeld G, Frommer A, Lang B (eds) Scientific Computing and Validated Numerics. Akad. Verlag, Berlin, pp 173–178
60. Krawczyk R (1969) Newton-Algorithmen zur Bestimmung von Nullstellen mit Fehlerschranken. Computing 4:187–201
61. Krawczyk R (1983) Intervallsteigungen für rationale Funktionen und zugeordnete zentrische Formen. Freiburger Intervall-Ber, Inst Angew Math Univ Freiburg 83(2):1–30
62. Krawczyk R, Neumaier A (1985) Interval slopes for rational functions and associated centered forms. SIAM J Numer Anal 22:604–616
63. Kristinsdottir BP, Zabinsky ZB, Csendes T, Tuttle ME (1993) Methodologies for tolerance intervals. Interval Comput 3:133–147
64. Mancini LJ (1975) Applications of interval arithmetic in signomial programming. SOL Techn Report 75-23, Stanford Univ
65. Mancini LJ, McCormick GP (1979) Bounding global minima with interval arithmetic. Oper Res 27:743–754
66. Mancini LJ, Wilde DJ (1978) Interval arithmetic in unidimensional signomial programming. J Optim Th Appl 26:277–289
67. Mancini LJ, Wilde DJ (1979) Signomial dual Kuhn–Tucker intervals. J Optim Th Appl 28:11–27
68. Maruyama K (1986) Global optimization with interval analysis (electronic switching systems). Trans Inform Process Soc Japan 27:837–844
69. McKinnon KIM, Millar C, Mongeau M (1995) Global optimization for the chemical and phase equilibrium problem using interval analysis. In: Floudas CA, Pardalos PM (eds) State of the Art in Global Optimization. Kluwer, Dordrecht, pp 365–382


70. Mohd IB (1962) Global optimization using interval arithmetic. PhD Thesis Univ. St. Andrews, Scotland
71. Moore RE (1966) Interval analysis. Prentice-Hall, Englewood Cliffs, NJ
72. Moore RE (1976) On computing the range of values of a rational function of n variables over a bounded region. Computing 16:1–15
73. Moore RE (1977) A test for existence of solutions to nonlinear systems. SIAM J Numer Anal 14:611–615
74. Moore RE (1979) Methods and applications of interval analysis. SIAM, Philadelphia
75. Moore RE (1988) Reliability in computing: the role of interval methods in scientific computing. Acad. Press, New York
76. Moore RE, Hansen ER, Leclerc A (1992) Rigorous methods for global optimization. In: Floudas CA, Pardalos PM (eds) Recent Advances in Global Optimization. Princeton Univ. Press, Princeton, pp 336–342
77. Moore RE, Ratschek H (1988) Inclusion functions and global optimization II. Math Program 41:341–356
78. Munack H (1992) Global optimization of an electron beam lithography system using interval arithmetic. Optik 90:175–183
79. Neumaier A (1990) Interval methods for systems of equations. Cambridge Univ. Press, Cambridge
80. Nonnenmacher A, Mlynski DM (1995) Liquid crystal simulation using automatic differentiation and interval arithmetic. In: Alefeld G, Frommer A, Lang B (eds) Scientific Computing and Validated Numerics. Akad. Verlag, Berlin, pp 334–340
81. Oelschlägel D, Süsse H (1978) Fehlerabschätzung beim Verfahren von Wolfe zur Lösung Quadratischer Optimierungsprobleme mit Hilfe der Intervallarithmetik. Math Operationsforsch Statist Ser Optim 9:389–396
82. Older WD (1993) Using interval arithmetic for non-linear constrained optimization. Manuscript, WCLP invited talk
83. Older WD, Vellino A (1993) Constraint arithmetic on real intervals. In: Benhamou F, Colmerauer A (eds) Constraint Logic Programming. Selected Res. MIT, Cambridge, MA, pp 175–195
84. Rall LB (1981) Automatic differentiation: techniques and applications. Springer, Berlin
85. Ratschek H (1975) Nichtnumerische Aspekte der Intervallarithmetik. In: Nickel K (ed) Interval Mathematics. Springer, Berlin, pp 48–74
86. Ratschek H (1985) Inclusion functions and global optimization. Math Program 33:300–317
87. Ratschek H (1988) Some recent aspects of interval algorithms for global optimization. In: Moore RE (ed) Reliability in Computing: The Role of Interval Methods in Scientific Computing. Acad. Press, New York, pp 325–339

88. Ratschek H, Rokne J (eds) (1984) Computer methods for the range of functions. Horwood, Westergate
89. Ratschek H, Rokne J (1987) Efficiency of a global optimization algorithm. SIAM J Numer Anal 24:1191–1201
90. Ratschek H, Rokne J (1988) New computer methods for global optimization. Horwood, Westergate
91. Ratschek H, Rokne J (1992) Nonuniform variable precision bisecting. In: Brezinski C, Kulisch U (eds) Comput. Appl. Math., vol I. Elsevier, Amsterdam, pp 419–428
92. Ratschek H, Rokne J (1992) The transistor modeling problem again. Microelectronics and Reliability 32:1725–1740
93. Ratschek H, Rokne J (1995) Interval methods. In: Horst R, Pardalos PM (eds) Handbook Global Optim. Kluwer, Dordrecht, pp 751–828
94. Ratschek H, Voller RL (1990) Unconstrained optimization over unbounded domains. SIAM J Control Optim 28:528–539
95. Ratz D (1992) Automatische Ergebnisverifikation bei globalen Optimierungsproblemen. Diss. Univ. Karlsruhe
96. Ratz D (1994) Box-splitting strategies for the interval Gauss–Seidel step in a global optimization method. Computing 53:337–354
97. Ratz D (1996) On branching rules in second-order branch-and-bound methods in a global optimization method. In: Alefeld G, Frommer A, Lang B (eds) Scientific Computing and Validated Numerics. Akad. Verlag, Berlin, pp 221–227
98. Robinson SM (1973) Computable error bounds for nonlinear programming. Math Program 5:235–242
99. Schnepper CA, Stadtherr MA (1993) Application of a parallel interval Newton/generalized bisection algorithm to equation-based chemical process flowsheeting. Interval Comput 3:40–64
100. Sengupta S (1981) Global nonlinear constrained optimization. Diss. Dept. Pure Appl. Math., Washington State Univ.
101. Shen Z, Wolfe MA (1990) On interval enclosures using slope arithmetic. Appl Math Comput 39:89–105
102. Shokin YuI (1981) Interval analysis. Nauka, Moscow (in Russian)
103. Skelboe S (1974) Computation of rational interval functions. BIT 14:87–95
104. Stroem T (1971) Strict estimation of the maximum of a function of one variable. BIT 11:199–211
105. Süsse H (1977) Intervallarithmetische Behandlung von Optimierungsproblemen und damit verbundener numerischer Aufgabenstellungen. Diss. Techn. Hochsch. 'Carl Schorlemmer', Leuna-Merseburg
106. Vaidyanathan R, El-Halwagi M (1994) Global optimization of nonconvex nonlinear programs via interval analysis. Comput Chem Eng 18:889–897
107. Venkataramanan MA, Cabot AV, Winston WL (1995) An interval branch and bound algorithm for global optimization of a multiperiod pricing model. Comput Oper Res 22:681–687


108. Wengert RE (1964) A simple automatic derivative evaluation program. Comm ACM 7:463–464
109. Wolfe MA (1994) An interval algorithm for constrained global optimization. J Comput Appl Math 50:605–612
110. Wolfe MA (1996) Interval methods for global optimization. Appl Math Comput 75:179–206
111. Wolfe MA, Zhang LS (1994) An interval algorithm for nondifferentiable global optimization. Appl Math Comput 63:101–122


Interval Linear Systems

C. JANSSON
Informatik III, Techn. Universität Hamburg-Harburg, Hamburg, Germany

MSC2000: 15A99, 65G20, 65G30, 65G40, 90C26

Article Outline

Keywords
An Iterative Interval Method
Optimal Bounds
See also
References

Keywords

Interval arithmetic; Optimization; Linear systems of equations

In many applications the coefficients of real linear systems are, due to measurement or approximation errors, not known exactly. Therefore, the family of real linear systems

A · x = b,  (1)

where A, b satisfy the inequalities

|A^c − A| ≤ Δ,  |b − b^c| ≤ δ,  (2)

is considered. The absolute value and the comparisons are used entrywise. Here A^c, A, Δ ∈ R^{n×n} are real n × n matrices, b^c, b, δ ∈ R^n, and Δ, δ, which describe the perturbation bounds, are assumed to be nonnegative. This family of real linear systems is called an interval linear system, because each matrix A and each right-hand side b is contained in the interval matrix A := [A^c − Δ, A^c + Δ] and the interval vector b := [b^c − δ, b^c + δ], respectively. A^c and b^c are called the centers of the interval linear system. The corresponding solution set X is defined as the union of all solutions of this family, that is,

X := {x ∈ R^n : x, A, b satisfy (1), (2)}.  (3)

Naturally, the main interest is to determine the exact range of each component of the solution set, that is, to calculate the exact or optimal componentwise bounds

min {x_i : x ∈ X},  max {x_i : x ∈ X}  (4)

for i = 1, . . . , n. The minima and maxima exist provided A is regular, that is, all matrices A ∈ A are nonsingular. Otherwise A is called singular, and X is unbounded or empty. In general, the solution set X is not convex and has a complicated shape: see Fig. 1, which is taken from the book of A. Neumaier [18, p. 97].

Interval Linear Systems, Figure 1: A projection of a three-dimensional solution set

Hence, calculating bounds for the solution set X is a global optimization problem. Moreover, X need not be connected or bounded. This is shown by the simple one-dimensional equation A · x = 1, A ∈ [−1, 1], with solution set X = (−∞, −1] ∪ [1, ∞). From the point of view of complexity theory, J. Rohn [25] has proved that the problem of calculating bounds for the solution set is NP-hard. Roughly speaking, he has shown that there is no polynomial time algorithm which calculates bounds of the solution set with overestimation less than any given positive constant.


This is true even if the interval matrix A is strongly regular, i.e., if the spectral radius ρ(|(A^c)^{−1}| · Δ) is less than one. If A is strongly regular, then the regularity of A follows immediately by observing that for A ∈ A there holds

A = A^c − Ẽ = A^c · (I − (A^c)^{−1} · Ẽ),  (5)

where |Ẽ| ≤ Δ. Hence, singularity of A is equivalent to the fact that (A^c)^{−1} · Ẽ has the eigenvalue 1. Since ρ(|(A^c)^{−1}| · Δ) < 1, it follows that A is regular. By Perron–Frobenius theory, strong regularity implies that the radius matrix Δ is not too large. For further NP-hardness results related to other interval problems see [27]. During the last three decades the problem of calculating componentwise bounds for X, not necessarily optimal bounds, has received much attention, and many methods were developed. No attempt can be made in this short survey to review all the different approaches, but the literature given in this section shall serve as a guide for further reading. The first algorithm for calculating optimal componentwise bounds was given by W. Oettli and W. Prager [19,20]. There the solution set X is described as the set of feasible solutions of a special system of nonlinear inequalities:

X = {x ∈ R^n : |A^c x − b^c| ≤ Δ · |x| + δ}.  (6)
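The Oettli–Prager characterization (6) is easy to evaluate numerically. The following fragment is a minimal sketch, not code from this article; it uses NumPy, the variable names are assumptions, and the test data are those of Example 2 below.

```python
import numpy as np

# Minimal sketch of the Oettli-Prager test (6): x belongs to the solution
# set X of the interval system iff |A_c @ x - b_c| <= Delta @ |x| + delta
# holds componentwise.

def in_solution_set(x, A_c, Delta, b_c, delta):
    residual = np.abs(A_c @ x - b_c)
    slack = Delta @ np.abs(x) + delta
    return bool(np.all(residual <= slack))

# The midpoint solution of A_c x = b_c always lies in X
# (data of Example 2 below).
A_c = np.array([[1.2, 1.2], [-1.2, 1.2]])
b_c = np.array([1.5, 3.5])
Delta = np.full((2, 2), 0.2)
delta = np.full(2, 0.5)
x_check = np.linalg.solve(A_c, b_c)
print(in_solution_set(x_check, A_c, Delta, b_c, delta))  # True
```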

But in each orthant this system describes a convex polyhedron. Hence, in each orthant optimal bounds can be calculated using linear programming techniques. Unfortunately, there are 2^n orthants, and therefore this method a priori needs exponential time for each instance and can work only for problems of very small size. More recently, based on the result of Oettli and Prager, a more efficient method for calculating optimal bounds was presented in [9]. This method uses linear programming techniques only in those orthants which are intersected by the solution set X. Starting with the pioneering book of R.E. Moore [15], a large number of methods were proposed using the tools of interval arithmetic. Many algorithms can be found, for example, in the monographs [2,16] and [18]. These methods are polynomial time algorithms, calculate only componentwise (not optimal) bounds, and work under special assumptions: in almost all cases strong regularity of A is required.

In interval arithmetic the elementary operations for intervals x = [x̲, x̄], y = [y̲, ȳ] ∈ IR are defined by

x ∘ y = {x̃ ∘ ỹ : x̃ ∈ x, ỹ ∈ y},  (7)

where ∘ ∈ {+, −, ·, /}, and in the case of division 0 ∉ y is assumed. By a simple monotonicity argument it follows that

x ∘ y = [min S, max S],  (8)

where the set S is defined by S := {x̲ ∘ y̲, x̲ ∘ ȳ, x̄ ∘ y̲, x̄ ∘ ȳ}. Interval operations between real matrices, interval matrices, real vectors and interval vectors are defined as in the real case, only the real operations are replaced by the corresponding interval operations (7). For example, if R = (r_ij) ∈ R^{n×n} is a real n × n matrix and b ∈ IR^n, then R · b is defined as follows: the real coefficients r_ij are replaced by the point intervals r_ij = [r_ij, r_ij], and

(R · b)_i := Σ_{j=1}^n r_ij · b_j.
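A toy implementation makes the definitions (7)-(8) concrete. The sketch below is an illustration only, not the article's code: intervals are (lo, hi) tuples, and the directed (outward) rounding that a rigorous interval library must perform is omitted.

```python
# Toy model of the elementary interval operations (7)-(8); intervals are
# (lo, hi) tuples. A rigorous implementation would additionally round the
# lower bound down and the upper bound up (outward rounding).

def iadd(x, y):
    return (x[0] + y[0], x[1] + y[1])

def isub(x, y):
    return (x[0] - y[1], x[1] - y[0])

def imul(x, y):
    s = [x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1]]   # the set S of (8)
    return (min(s), max(s))

def idiv(x, y):
    if y[0] <= 0.0 <= y[1]:
        raise ZeroDivisionError("0 must not lie in the divisor interval")
    s = [x[0]/y[0], x[0]/y[1], x[1]/y[0], x[1]/y[1]]
    return (min(s), max(s))

def mat_ivec(R, b):
    """Point matrix times interval vector, componentwise as in the text."""
    n = len(b)
    out = []
    for i in range(n):
        acc = (0.0, 0.0)
        for j in range(n):
            acc = iadd(acc, imul((R[i][j], R[i][j]), b[j]))
        out.append(acc)
    return out
```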

By definition (7), for all i the equation

(R · b)_i = { Σ_{j=1}^n r_ij · b̃_j : b̃ ∈ b }  (9)

holds; that is, for a point matrix R the interval product R · b encloses the range of R · b̃ over all b̃ ∈ b componentwise without overestimation.

An Iterative Interval Method

Let x̌ denote an approximate solution of the midpoint system A^c x = b^c, and let R denote an approximate inverse of the center matrix A^c. An enclosure of the error x − x̌ for all solutions x ∈ X is computed by the iteration

x^{k+1} := R · (b − A · x̌) + (I − R · A) · y^k,  (12)

where the box y^k is obtained from x^k by the so-called ε-inflation

y^k := x^k · [1 − ε, 1 + ε] + η · [−e, e];

here ε, η > 0 are called inflation parameters, and e is the vector with 1 in each component. The main property of this iteration is that (cf. [30]) for every starting box x^0 there holds

∃ k ∈ N : x^{k+1} ⊆ int(y^k)  ⟺  ρ(|I − R · A|) < 1,

where ρ denotes the spectral radius of the absolute value of I − R · A. This means that after a finite number of steps bounds x̌ + x^{k+1} of X are calculated, provided the spectral radius ρ(|I − R · A|) < 1; this means that A is strongly regular. For practical applications it is recommended to execute at most k = 10 iteration steps, and η should be greater than the smallest positive floating point number. Obviously, by using these parameters we get an O(n³) polynomial time algorithm.

Example 2. To demonstrate how this algorithm works, the following interval linear system with centers

A^c = ( 1.2  1.2 ; −1.2  1.2 ),  b^c = (1.5, 3.5)ᵀ,

and perturbation bounds

Δ = ( 0.2  0.2 ; 0.2  0.2 ),  δ = (0.5, 0.5)ᵀ

is considered. This system is a slight modification of an example of Rohn [23]. We have chosen ε = 0.05 and η equal to the smallest positive machine number. In the following, five (appropriately rounded) decimal digits are displayed. The two-dimensional interval vector with both components equal to [−1, 1] is denoted by [−1, 1]. The spectral radius ρ(|I − R · A|) ≈ 0.3333 < 1, where R ≈ (A^c)^{−1}, and therefore the iteration (12) will compute a box containing the solution set X in finitely many steps. The approximate solution of the center system is x̌ = (−0.83333, 2.0833)ᵀ, yielding the starting box x^0 := R · (b − A · x̌) = 0.9028 · [−1, 1]. Iteration (12) yields

y^0 = 0.9480 · [−1, 1],  x^1 = 1.2188 · [−1, 1],
y^1 = 1.2797 · [−1, 1],  x^2 = 1.3294 · [−1, 1],
y^2 = 1.3959 · [−1, 1],  x^3 = 1.3681 · [−1, 1].

Hence, for k = 2 it follows that x^3 ⊆ int(y^2), and the solution set X is contained in

x̌ + x^3 = ( [−2.2014, 0.5348] ; [0.7152, 3.4514] ).  (13)
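The iteration (12) for Example 2 can be reproduced in a few lines. The following sketch is illustrative only: the variable names are assumptions, and since all boxes here are symmetric around zero, only the radius vectors are propagated; plain floating point is used without outward rounding, so this is not verified arithmetic.

```python
import numpy as np

# Illustrative reconstruction of iteration (12) for Example 2 in
# midpoint-radius form (all boxes are symmetric around 0 here).

A_c   = np.array([[1.2, 1.2], [-1.2, 1.2]])
b_c   = np.array([1.5, 3.5])
Delta = np.full((2, 2), 0.2)
delta = np.full(2, 0.5)
eps, eta = 0.05, np.finfo(float).tiny

R     = np.linalg.inv(A_c)           # approximate midpoint inverse
x_hat = np.linalg.solve(A_c, b_c)    # approx. solution of A_c x = b_c

rad_z = np.abs(R) @ (delta + Delta @ np.abs(x_hat))  # radius of R(b - A x_hat)
C_rad = np.abs(R) @ Delta                            # radius of I - R A

rad_x = rad_z.copy()                 # x^0 = R(b - A x_hat)
for k in range(10):
    rad_y = rad_x * (1.0 + eps) + eta          # epsilon-inflation y^k
    rad_next = rad_z + C_rad @ rad_y           # x^{k+1} = z + (I - R A) y^k
    if np.all(rad_next < rad_y):               # x^{k+1} in int(y^k)?
        print("verified after", k + 1, "steps")
        print("X lies in", x_hat - rad_next, x_hat + rad_next)
        break
    rad_x = rad_next
```

Running this reproduces the radii 0.9028, 1.2188, 1.3294, 1.3681 of Example 2 and the enclosure (13).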

For numerical results of this method and its generalization to sparse systems, see [29,30]. There, examples with up to 1000000 variables, including the 'Harwell test cases', are presented.

Optimal Bounds

As pointed out in the introduction, a polynomial time algorithm may overestimate the solution set X drastically or may fail. Therefore, in this section a method (cf. [9]) which produces optimal bounds of X if and only if A is regular is described. An immediate consequence of (6) is that the solution set X is a finite union of convex polyhedra. To see this, let {−1, 1}^n denote the set of all sign vectors with components equal to 1 or −1. For a sign vector s ∈ {−1, 1}^n let D(s) denote the diagonal matrix with diagonal s, and R^n(s) := {x ∈ R^n : D(s) · x ≥ 0}. Then the intersection X(s) := X ∩ R^n(s) of the solution set with the orthant corresponding to s is given by the following system of linear inequalities:

(A^c − Δ · D(s)) · x ≤ b^c + δ,
(A^c + Δ · D(s)) · x ≥ b^c − δ,  (14)
D(s) · x ≥ 0.

Therefore, for a fixed orthant R^n(s), optimal bounds of X(s) can be calculated by minimizing and maximizing each coordinate x_i subject to the constraints (14). These are linear programming problems which can be solved in polynomial time, implying that optimal bounds of X(s) can also be calculated in polynomial time. Now one can get optimal bounds of X by calculating optimal bounds of X(s) for each orthant R^n(s). Unfortunately, there are 2^n orthants, and this approach can work only for very small dimension n. For interval linear systems with Δ = 0, δ = 0, the solution set X is by definition equal to the exact solution of the corresponding real linear system, and therefore X will lie in one orthant (with the exception of degenerate


cases). With growing radii , ı the solution set may intersect more orthants. But in many cases only few orthants will intersect the solution set. Then most of the computing time of the above approach will be spent for checking that X \ Rn (s) is empty for almost all orthants. Therefore, the question arises if it is possible to construct an algorithm which picks up exactly those orthants where X \ Rn (s) is nonempty. In the following such an algorithm is presented. This approach heavily relies on the following topological alternative statement, which says that for nonempty X exactly one of the following two statements is true: i) X is compact and connected, and A is regular; ii) X is unbounded, each topologically connected component of X is unbounded, and A is singular. An immediate consequence is that the solution set X cannot be the union of bounded and unbounded topologically connected components. Therefore, each method which only calculates optimal bounds of a topologically connected component of X suffices to solve the problem. To do this, the representation graph G = (V, E) of the solution set X with the set of nodes V D fs 2 f1; 1gn : X(s) ¤ ;g ; and the set of edges  s; t 2 V ; s and t differ in E D fs; tg : exactly one component

(15)

(16)

is defined. Now the following basic relationship between the solution set and its representation graph can be proved: a) Each nonempty topologically connected component b X of X can be represented in the form b X D [ fX(s) : s 2 Ug ;

(17)

where U is the node set of a connected component of G. b) If X is nonempty and bounded, then G = (V, E) is a connected graph, and X D [ fX(s) : s 2 Vg :

(18)

This property gives the possibility to apply to the implicitly defined representation graph G the wellknown graph search method (see for example [21]) for calculating a connected component:

I

1) Compute a starting node s 2 V by solving the midpoint system Ac x = bc . The vector s is defined as the sign vector of this solution, and stored in a list L. 2) Put a sign vector s 2 L, and solve the linear programming problems ( min fx i : x 2 X(s)g ; (19) max fx i : x 2 X(s)g for i = 1, . . . , n. If a problem is unbounded, then an unbounded topologically connected component of X is found. Hence, each other topologically connected component of X is unbounded, A is singular and the method is stopped. Otherwise, the linear programming problems calculate optimal bounds of X(s), which are also stored. By definition of the edge set E, it follows immediately that t :D (s1 ; : : : ; s i1 ; s i ; s iC1 ; : : : ; s n )

(20)

is adjacent to s, if and only if one of the lp ’s in (19) has the exact bound equal to zero. All neighbored nodes t of s are stored in list L, except those which have been already treated. Then we proceed by going to 2), and repeat this process until L is empty. It follows that this algorithm terminates in a finite number of steps, and either calculates optimal bounds of the solution set and proves regularity of A, or shows that X is unbounded and A is singular. The algorithm searches only in those orthants which have a nonempty intersection with the solution set, and avoids all other ones. Therefore, |V| calls of a polynomial time algorithm are needed, where |V| is the number of nonempty intersections of the solution set with the orthants. In many cases in practice, due to physical or economical requirements, only few variables will change the sign implying that only few orthants will be intersected by the solution set. In those cases the method works efficiently. Nevertheless, due to the mentioned NP-hardness results of Rohn, there are also cases where an exponential computing time occurs. Example 3 In order to see how this algorithm behaves in detail, the example of the previous section is discussed. The solution xˇ D (0:8333; 2:0833)> gives the sign vector s = ( 1, 1) which is stored in L. Now we take this sign vector from list L (then L is empty) and solve the lp ’s (19) which gives the optimal

1761

1762

I


bounds of X(s)   [1:9167; 0:0595] : [1:3095; 3:1667]

(21)

No optimal bound has a value equal to zero, which implies that s = ( 1, 1) has no neighbor with respect to the edge set E. It follows that X(s) is a topologically connected component and X = X(s). Therefore, (21) gives the optimal bounds of X. Following, the original example of Rohn [24] is discussed, which differs from the previous one by changing   500:5 500:5 ; A c :D 500:5 500:5   499:5 499:5  :D : 499; 5 499:5 Thus very large perturbations  are allowed, and the spectral radius  (|(Ac )1 |  ) = 1.9960. Hence the iteration method of the previous section cannot work, because A is not strongly regular. The solution xˇ D (0:001998; 0:004995)> gives s = ( 1, 1)| and L = {s}. Now s is removed from list L (then L is empty) and the lp ’s (19) yield the following optimal bounds of X(s):   [3:9950; 0] : (22) [0:001002; 3:9980] One optimal bound of the first component has a value equal to zero. Therefore, by (20) t = ( s1 , s2 )| = (1, 1)| is adjacent to s and list L := {t}. Now we take t from list L (then L is empty), and the lp ’s (19) yield the optimal bounds of X(t):   [0; 1:9950] : (23) [0:0030; 2:0000] Only the lower optimal bound of the first component is equal to zero. This gives the adjacent sign vector s = ( t 1 , t 2 ) = ( 1, 1)| . But this is the sign vector already treated, and therefore not stored in list L. Since list L is empty, the algorithm is finished, and the optimal bounds (22) and (23) together deliver the optimal bounds   [3:9950; 1:99950] (24) [0:001002; 3:9980] for the solution set X.

By comparing the bounds (21) and (13), we see that the optimal bounds (21) clearly improve the bounds (13) calculated by the iteration method of the previous section. This overestimation is mainly due to the preconditioning with the midpoint inverse. However, the bounds (13) give additionally the information that the solution set X intersects at most 2 orthants.Thus, an a priori estimation on the computing time for the exact method in this section is given: the above method has only to search in two orthants. Hence, first using in the strongly regular case a polynomial time method, provides rough bounds for X as well as a bound for the computing time which is needed for calculating exact bounds. Several other examples up to dimension n = 50 can be found in [9] and [10]. See also  ABS Algorithms for Linear Equations and Linear Least Squares  Automatic Differentiation: Point and Interval  Automatic Differentiation: Point and Interval Taylor Operators  Bounding Derivative Ranges  Cholesky Factorization  Global Optimization: Application to Phase Equilibrium Problems  Interval Analysis: Application to Chemical Engineering Design Problems  Interval Analysis: Differential Equations  Interval Analysis: Eigenvalue Bounds of Interval Matrices  Interval Analysis: Intermediate Terms  Interval Analysis: Nondifferentiable Problems  Interval Analysis: Parallel Methods for Global Optimization  Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods  Interval Analysis: Systems of Nonlinear Equations  Interval Analysis: Unconstrained and Constrained Optimization  Interval Analysis: Verifying Feasibility  Interval Constraints  Interval Fixed Point Theory  Interval Global Optimization  Interval Newton Methods

Interval Newton Methods

 Large Scale Trust Region Problems  Large Scale Unconstrained Optimization  Linear Programming  Nonlinear Least Squares: Trust Region Methods  Orthogonal Triangularization  Overdetermined Systems of Linear Equations  QR Factorization  Solving Large Scale and Sparse Semidefinite Programs  Symmetric Systems of Linear Equations References 1. Alefeld G (1994) Inclusion methods for systems of nonlinear equations. In: Herzberger J (ed) Topics in Validated Computations: Studies in Computational Mathematics. North-Holland, Amsterdam, pp 7–26 2. Alefeld G, Herzberger J (1983) Introduction to interval computations Acad. Press, New York 3. Arithmos (1986) Benutzerhandbuch, Siemens, Bibl.-Nr. U 2900-I-Z87-1. 4. Hammer R, Hocks M, Kulisch U, Ratz D (1993) PASCAL-XSC: Basic numerical problems Springer, Berlin 5. Hansen ER (1992) Bounding the solution set of interval linear systems. SIAM J Numer Anal 29:1493–1503 6. Hansen E, Sengupta S (1981) Bounding solutions of systems of equations using interval analysis. BIT 21:203–211 7. Hansen E, Smith R (1967) Interval arithmetic in matrix computations. SIAM Numer Anal 2(4):1–9 8. IBM (1986) High-accuracy arithmetic subroutine library (ACRITH). Program Description and User’s Guide SC 336164-02 9. Jansson C (1997) Calculation of exact bounds for the solution set of linear interval equations. Linear Alg & Its Appl 251:321–340 10. Jansson C, Rohn J (1999) An algorithm for checking regularity of interval matrices. SIAM J Matrix Anal Appl 20(3):756–776 11. Kearfott RB (1990) Preconditioners for the interval-GaussSeidel method. SIAM J Numer Anal 27(3):804–822 12. Kearfott RB (1996) Rigorous global search: continuous problems. Kluwer, Dordrecht 13. Knüppel O (1994) PROFIL/BIAS: A fast interval library. Computing 53:277–287 14. Krawczyk R (1969) Newton-Algorithmen zur Bestimmung von Nullstellen mit Fehlerschranken. Computing 4:187– 201 15. Moore RE (1966) Interval analysis. Prentice-Hall, Englewood Cliffs, NJ 16. Moore RE (1979) Methods and applications of interval analysis. SIAM, Philadelphia 17. Neumaier A (1984) New techniques for the analysis of linear interval equations. Linear Alg & Its Appl 58:273–325

I

18. Neumaier A (1990) Interval methods for systems of equations. Encycl Math Appl. Cambridge Univ. Press, Cambridge 19. Oettli W (1965) On the solution set of a linear system with inaccurate coefficients. SIAM J Numer Anal 2:115–118 20. Oettli W, Prager W (1964) Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides. Numer Math 6:405–409 21. Papadimitriou CH, Steiglitz K (1982) Combinatorial optimization: Algorithms and complexity. Prentice-Hall, Englewood Cliffs, NJ 22. Ris FN (1972) Interval analysis and applications to linear algebra. PhD Thesis Oxford Univ. 23. Rohn J (1986) A note on the sign-accord algorithm. Freiburger Intervall-Ber 86(4):39–43 24. Rohn J (1989) Systems of linear interval equations. Linear Alg & Its Appl 126:39–78 25. Rohn J (1991) Linear interval equations: computing enclosures with bounded relative or absolute overestimation is NP-hard. In: Kearfott RB, Kreinovich V (eds) Applications of Interval Computations. Kluwer, Dordrecht, pp 81–89 26. Rohn J (1992) Cheap and tight bounds: the result of E. Hansen can be made more efficient. Interval Comput 4:13–21 27. Rohn J (1994) NP-hardness results for linear algebraic problems with interval data. In: Herzberger J (ed) Topics in Validated Computations: Studies in Computational Mathematics. Elsevier, Amsterdam, 463–472 28. Rump SM (1983) Solving algebraic problems with high accuracy. Habilitationsschrift. In: Kulisch UW, Miranker WL (eds) A New Approach to Scientific Computation. Acad. Press, New York, pp 51–120 29. Rump SM (1993) Validated solution of large linear systems. In: Albrecht R, Alefeld G, Stetter HJ (eds) Computing Supplementum 9, Validation Numerics. Springer, Berlin, pp 191–212 30. Rump SM (1994) Verification methods for dense and sparse systems of equations. In: Herzberger J (ed) Topics in Validated Computations: Studies in Computational Mathematics. Elsevier, Amsterdam, pp 63 –136 31. Shary SP (1991) Optimal solutions of interval linear algebraic systems. Interval Comput 2:7–30

Interval Newton Methods R. BAKER KEARFOTT Department Math., University Louisiana at Lafayette, Lafayette, USA MSC2000: 65G20, 65G30, 65G40, 65H20, 65K99


Article Outline

Keywords
Introduction
Univariate Interval Newton Methods
Multivariate Interval Newton Methods
Existence-Proving Properties
See also
References

Keywords

Nonlinear system of equations; Automatic result verification; Interval computations; Global optimization

Introduction

Interval Newton methods combine the classical Newton method, the mean value theorem, and interval analysis. The result is an iterative method that can be used to refine enclosures to solutions of nonlinear systems of equations, to prove existence and uniqueness of such solutions, and to provide rigorous bounds on such solutions, including tight and rigorous bounds on critical points of constrained optimization problems. Interval Newton methods can also prove nonexistence of solutions within regions. Such capabilities can be used in isolation, for example, to provide rigorous error bounds for an approximate solution obtained with floating point computations, or as an integral part of global branch and bound algorithms.

Univariate Interval Newton Methods

Suppose f : x = [x̲, x̄] → R has a continuous first derivative on x, suppose that there exists x* ∈ x such that f(x*) = 0, and suppose that x̌ ∈ x. Then, since the mean value theorem implies

0 = f(x*) = f(x̌) + f′(ξ)(x* − x̌),

we have x* = x̌ − f(x̌)/f′(ξ) for some ξ ∈ x. If f′(x) is any interval extension of the derivative of f over x, then

x* ∈ x̌ − f(x̌)/f′(x).  (1)

(Note that, in certain contexts, a slope set for f centered at x̌ may be substituted for f′(x); see [1] for further references.) Equation (1) forms the basis of the univariate interval Newton operator:

N(f; x, x̌) = x̌ − f(x̌)/f′(x).  (2)

Because of (1), any solutions of f(x) = 0 that are in x must also be in N(f; x, x̌). Furthermore, local convergence of iteration of the interval Newton method (2) is quadratic in the sense that the width of N(f; x, x̌) is roughly proportional to the square of the width of x. Furthermore, if an interval derivative extension (in contrast to an interval slope) is used for f′(x), then

N(f; x, x̌) ⊆ int(x),

where int(x) represents the interior of x, implies that there is a unique solution of f(x) = 0 within N(f; x, x̌), and hence within x.

Multivariate Interval Newton Methods

Multivariate interval Newton methods are analogous to univariate ones in the sense that they obey an iteration equation similar to equation (2), and in the sense that they have quadratic convergence properties and can be used to prove existence and uniqueness. However, multivariate interval Newton methods are complicated by the necessity to bound the solution set of a linear system of equations with interval coefficients. Suppose now that f : R^n → R^n, suppose x is an interval vector (i.e. a box), and suppose that x̌ ∈ R^n. (If interval derivatives, rather than slope sets, are to be used, then further suppose that x̌ ∈ x.) Then a general form for multivariate interval Newton methods is

N(f; x, x̌) = x̌ + v,  (3)

where v is an interval vector that contains all solutions v of the point systems Av = −f(x̌) for A ∈ f′(x), where f′(x) is an interval extension of the Jacobi matrix of f over x. (Under certain conditions, f′ may be replaced by an interval slope matrix.) As with the univariate interval Newton method, under certain natural smoothness conditions:
- N(f; x, x̌) must contain all solutions x* ∈ x with f(x*) = 0. (Consequently, if N(f; x, x̌) ∩ x = ∅, then there are no solutions of f(x) = 0 in x.)
- For x containing a solution of f(x) = 0 and the widths of the components of x sufficiently small, the width of N(f; x, x̌) is roughly proportional to the square of the widths of the components of x.
- If N(f; x, x̌) ⊆ int(x), where int(x) represents the interior of x, then there is a unique solution of f(x) = 0 within N(f; x, x̌), and hence within x.
For details and further references, see [1, §1.5]. Finding the interval vector v in the iteration formula (3), that is, bounding the solution set of the interval linear system

f′(x)v = −f(x̌),

is a major aspect of the multivariate interval Newton method. Finding the narrowest possible intervals for the components of v is, in general, an NP-hard problem (see Complexity Classes in Optimization). However, procedures that are asymptotically good, in the sense that the overestimation in v decreases as the square of the widths of the elements of f′, can be based on first preconditioning the interval matrix f′(x) by the inverse of its matrix of midpoints, or by other special preconditioners (see [1, Chapt. 3]), and then applying the interval Gauss–Seidel method or interval Gaussian elimination.

Existence-Proving Properties

The existence-proving properties of interval Newton methods can be analyzed in the framework of classical fixed-point theory. See Interval Fixed Point Theory, or [1, §1.5.2]. Of particular interest in this context is a variant interval Newton method, not fitting directly into the framework of formula (3), that is derived directly by considering the classical chord method (Newton method with fixed iteration matrix) as a fixed point iteration. Called the Krawczyk method, this method has various nice theoretical properties, but its image is usually not as narrow as that of other interval Newton methods. See [1, p. 56]. Uniqueness-proving properties of interval Newton methods are based on proving that each point matrix formed elementwise from the interval matrix f′(x) is nonsingular.

Example 1. For an example of a multivariate interval Newton method, take

f_1(x) = x_1² − x_2² − 1,
f_2(x) = 2x_1x_2,

with

x = ( [0.9, 1.2] ; [−0.1, 0.1] ),  x̌ = (1.05, 0)ᵀ.

An interval extension of the Jacobi matrix for f is

f′(x) = ( 2x_1  −2x_2 ; 2x_2  2x_1 ),

and its value at x is

( [1.8, 2.4]  [−0.2, 0.2] ; [−0.2, 0.2]  [1.8, 2.4] ).

The usual procedure (although not required in this special case) is to precondition the system f′(x)v = −f(x̌), say, by the inverse of the midpoint matrix,

Y = ( 2.1  0 ; 0  2.1 )^{−1} = ( 0.476  0 ; 0  0.476 ),

to obtain Y f′(x)v = −Y f(x̌), i.e., rounded out,

( [0.85, 1.15]  [−0.096, 0.096] ; [−0.096, 0.096]  [0.85, 1.15] ) v = ( [−0.0488, −0.0487] ; 0 ).

(Rigor is not lost by taking floating point approximations for the preconditioner, but the interval arithmetic should be outwardly rounded.) The interval Gauss–Seidel method can then be used to compute sharper bounds on v = x − x̌, beginning with v = ( [−0.15, 0.15] ; [−0.1, 0.1] ). That is,

ṽ_1 ⊆ ( [−0.0488, −0.0488] − [−0.096, 0.096] v_2 ) / [0.85, 1.15] ⊆ [−0.0688, −0.034].

Thus, the first component of N(f; x, x̌) is

x̌_1 + ṽ_1 ⊆ [0.9833, 1.016].

In the second step of the interval Gauss–Seidel method,

ṽ_2 = ( 0 − [−0.096, 0.096] ṽ_1 ) / [0.85, 1.15] ⊆ [−0.00778, 0.00778],

so, rounded out, N(f; x, x̌) is computed to be

( [0.981, 1.016] ; [−0.00778, 0.00778] ) ⊆ ( [0.9, 1.2] ; [−0.1, 0.1] ).

This last inclusion proves that there exists a unique solution to f(x) = 0 within x, and hence within N(f; x, x̌). Furthermore, iteration of the procedure will result in bounds on the exact solution that become narrow quadratically.

See also

Automatic Differentiation: Calculation of Newton Steps
Automatic Differentiation: Point and Interval
Automatic Differentiation: Point and Interval Taylor Operators
Bounding Derivative Ranges
Dynamic Programming and Newton's Method in Unconstrained Optimal Control
Global Optimization: Application to Phase Equilibrium Problems
Interval Analysis: Application to Chemical Engineering Design Problems
Interval Analysis: Differential Equations
Interval Analysis: Eigenvalue Bounds of Interval Matrices
Interval Analysis: Intermediate Terms
Interval Analysis: Nondifferentiable Problems
Interval Analysis: Parallel Methods for Global Optimization
Interval Analysis: Subdivision Directions in Interval Branch and Bound Methods
Interval Analysis: Systems of Nonlinear Equations
Interval Analysis: Unconstrained and Constrained Optimization
Interval Analysis: Verifying Feasibility
Interval Constraints
Interval Fixed Point Theory
Interval Global Optimization
Interval Linear Systems
Nondifferentiable Optimization: Newton Method
Unconstrained Nonlinear Optimization: Newton–Cauchy Framework
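As a small computational companion to Example 1, the following Python sketch performs one preconditioned interval Gauss–Seidel sweep. It is an illustration only, not the article's code: intervals are plain (lo, hi) tuples, names are assumptions, and no outward rounding is performed, so it is not verified arithmetic.

```python
# One interval Gauss-Seidel sweep for Y f'(x) v = -Y f(x_check), with
# intervals as (lo, hi) tuples. Illustrative sketch only; data follow
# Example 1, and a rigorous code would round outward.

def imul(x, y):
    s = [x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1]]
    return (min(s), max(s))

def isub(x, y):
    return (x[0] - y[1], x[1] - y[0])

def idiv(x, y):
    assert not (y[0] <= 0.0 <= y[1])
    s = [x[0]/y[0], x[0]/y[1], x[1]/y[0], x[1]/y[1]]
    return (min(s), max(s))

def gauss_seidel_sweep(M, rhs, v):
    """v[i] := old v[i] intersected with (rhs[i] - sum_{j!=i} M[i][j] v[j]) / M[i][i]."""
    for i in range(len(v)):
        acc = rhs[i]
        for j in range(len(v)):
            if j != i:
                acc = isub(acc, imul(M[i][j], v[j]))
        new = idiv(acc, M[i][i])
        v[i] = (max(v[i][0], new[0]), min(v[i][1], new[1]))
    return v

M = [[(0.85, 1.15), (-0.096, 0.096)],
     [(-0.096, 0.096), (0.85, 1.15)]]
rhs = [(-0.0488, -0.0487), (0.0, 0.0)]
v = [(-0.15, 0.15), (-0.1, 0.1)]
print(gauss_seidel_sweep(M, rhs, v))   # narrow enclosures as in Example 1
```

One sweep already reproduces enclosures of roughly [−0.069, −0.034] and [−0.0078, 0.0078] for the two components of v.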

References 1. Kearfott RB (1996) Rigorous global search: continuous problems. Kluwer, Dordrecht

Inventory Management in Supply Chains

IM in SC

SANDRA DUNI EKSIOGLU
Industrial and Systems Engineering Department, University of Florida, Gainesville, USA

MSC2000: 90B50

Article Outline

Keywords
Single Stage Inventory Management Models
Multistage Inventory Management Models
Conclusions
See also
References

Keywords

Inventory management; Multistage inventory management; Supply chain; EOQ; Newsboy problem; (s, S) policy; Periodic review model; Continuous review model; METRIC

A supply chain (SC) can be defined as an integrated system in which various firms work together, including suppliers of raw materials, manufacturers, distributors and retailers. Their efforts are concentrated on transforming the raw materials into final products that satisfy customer requirements, and on delivering these products to the right place, at the right time. A SC contains two basic, integrated processes: a) production planning and inventory management (IM); and b) distribution and logistics processes [6]. This article gives a brief review of the literature on single-stage IM and multistage IM models. The objective is to provide an overview of this research and to emphasize current achievements in this field. Inventories exist

Inventory Management in Supply Chains

throughout the SC in the form of raw materials, work-in-process, and finished goods. Typical relevant inventory costs are inventory carrying costs, order costs, and shortage costs. These costs often tend to conflict; in other words, decreasing one generally requires increasing another. The main motivation for keeping inventories is to cope with the uncertainty of external demand, supply and lead-time [18]. Keeping inventories is important to increase the customer service level and to reduce distribution costs, but it is estimated [5] that inventories cost approximately 20% to 40% of their value per year. Thus, managing inventories in a scientific manner, to maintain the minimal levels required for meeting service objectives, makes economic sense. K. Arrow [2] presents an interesting discussion of the motives of a firm for holding inventories. There are several opportunities for streamlining SC inventories. It is important to understand that, for a given service level, the lowest inventory investment results when the entire SC is considered as a single system. Such coordinated decisions at Xerox and Hewlett Packard reduced their inventory levels by over 25% [9].

Single Stage Inventory Management Models

The simplest inventory model is the deterministic economic order quantity (EOQ) model presented by F. Harris [12], who recognized this problem in 1913 in his work at Westinghouse. The model determines the constant order quantity that minimizes the average annual cost of purchasing and carrying inventory, assuming a deterministic and constant demand rate, no shortages, and zero order lead-times. A number of important scholars turned their attention to mathematical inventory models during the 1950s. A collection of mathematical models by Arrow, S. Karlin and H.E. Scarf [3] influenced later work in this area. At about the same time, H.M. Wagner and T.M. Whitin [24] developed a solution algorithm for the dynamic lot-sizing problem subject to time-varying demand. Their model assumes periodic, deterministic demand over a finite planning horizon, no capacity restrictions on production, and zero inventory at the beginning and the end of the planning horizon. This problem is formulated as a mixed integer linear program (MILP) and can be represented as a fixed-charge network flow problem. The Wagner–Whitin algorithm is best illustrated using a shortest-path graph representation. Although the Wagner–Whitin model gives an optimal solution, in practice other heuristic lot-sizing algorithms are adopted; see [18] for a survey of the EOQ lot-sizing, Silver–Meal, least unit cost heuristics, etc. These models trade off productivity losses from making small batches against the opportunity costs of tying up capital in inventory due to large batches. U.S. Karmarkar [14] extends the lot-sizing model to include lead-time related costs. Inventory control models subject to uncertain demand are basically of two types: periodic review models and continuous review models. Periodic review models exist for one planning period or for multiple planning periods. The single-period, stochastic inventory model is known as the newsboy model. The case of single-period models with fixed order costs and initial inventories leads to the optimality of (s, S) policies. These policies state that if the inventory position is less than s, then order up to S; otherwise do not order. The periodic review models with an infinite horizon are formulated in a dynamic programming framework [23]. Continuous review systems under uncertain demand track demands as they occur, and the inventory position is always known. These models lead to the (Q, R) policy, under which a fixed amount of Q units is ordered each time the inventory position reaches a certain level R. The model typically assumes either backordering or lost sales when shortages occur.

Multistage Inventory Management Models

Coordinating decisions at different levels of an organization comes as a need to reduce operating costs. This coordination can be seen in terms of integrating different decision types, e.g., facility location, inventory planning, distribution, etc., or linking decisions within the same function at different stages in the SC. Multistage inventory management models (MSIM models) concentrate on integrating IM policies at different stages of the SC. The typical MSIM problem analyzed in the literature is a two-level system composed of a number of retailers being served by a central warehouse. The demand at each retailer is satisfied using on-hand inventory. When insufficient inventory is available, a backorder typically occurs, and the demand must be satisfied later using inventory from the warehouse. The model decides on the inventory level at each retailer and the

I

trated using a shortest-path graph representation. Although the Wagner–Whitin model gives an optimal solution, in practice other heuristic lot-sizing algorithms are adopted. See [18] for a survey on the EOQ lot-sizing, silver-meal, least unit cost heuristics, etc. These models trade-off productivity losses from making small batches and the opportunity costs of tying up capital in inventory due to large batches. U.S. Karmarkar [14] extends the lot-sizing model to include lead-time related costs. Inventory control models subject to uncertain demand are basically of two types: periodic review models and continuous review models. Periodic review models exist for one planning period or for multiple planning periods. The single-period, stochastic inventory model is known as the newsboy model. The case of single period models with fixed order cost and initial inventories, leads to the optimality of (s, S) optimal policies. These policies state that if inventory position is less than s, then order up to S, otherwise do not order. The periodic review models with an infinite horizon are formulated in a dynamic programming framework [23]. Continuous review systems under uncertain demand track demands as they occur and the inventory position is always known. These models lead to the (Q, R) policy, under which a fixed amount of Q units is ordered each time the inventory position reaches a certain level R. The model typically assumes either backordering or lost sales when shortages occur. Multistage Inventory Management Models Coordinating decisions at different levels of an organization comes as a need to reduce operating costs. This coordination can be seen in terms of integrating different decision types e. g., facility location, inventory planning, distribution, etc., or linking decisions within the same function at different stages in the SC. Multistage inventory management models (MSIM models) concentrate on integrating IM policies in different stages of the SC. The typical MSIM problem analyzed in the literature is a two-level system composed of a number of retailers being served by a central warehouse. The demand at each retailer is satisfied using on-hand inventory. When insufficient inventory is available, a backorder typically occurs, and demand must be satisfied later using inventory from the warehouse. The model decides on the inventory level at each retailer and the

1767

1768

I

Inventory Management in Supply Chains

warehouse, such that a set of prespecified criteria is satisfied at minimum inventory-related costs. The first MSIM model was developed by A. Clark and Scarf [7]. They consider a system with a single product and N facilities, where facility i supplies facility i + 1, for i = 1, . . . , N  1. The model considers a periodic review of the inventory level and assumes fixed lead-times, a finite planning horizon, backordering of demand shortages and variable order cost. The aim is to find IM policies to be applied in each of the echelons, such that system cost is minimized. They show that under the above assumptions an optimal policy for the system can be found by decomposing the problem into N separate singlelocation problems and solving the problem recursively. The above model was extended to incorporate an infinite horizon and lead time uncertainty. A generalization of the system described above is the multi-echelon arborescence system, where each location has a unique supplier. A.F. Veinott [23] provides an excellent summary of these early modeling efforts. One of the earliest continuous review MSIM models was presented by C.C. Sherbrooke [21]. He considers a two-stage system with several retailers and a single warehouse that supplies to these retailers. He introduces the well-known METRIC approximation to determine the optimal level of inventory in the system. The METRIC approximation assumes a Poisson distribution of demand and constant replenishment lead-times. S.C. Graves [11] extends the METRIC approximation by estimating the mean and the variance of the outstanding retailer orders. He fits the negative binomial distribution to these parameters to determine the optimal inventory policy. S. Axsäter} [4] provides an exact solution to the problem and shows that the METRIC approximation provides an underestimate, whereas Graves ’ two-parameter approximation [11] overestimates the retailer backorders. The above studies use the one-for-one ordering policy (S  1, S), i. e., an order is placed as soon as a demand occurs. This policy is appropriate for items with high value and a low demand rate. Axsäter [4] shows that the models used for the one-for-one ordering policy can be extended in the case of batch ordering with only one retailer. Analysis of batch ordering policies in arborescent systems (when the number of retailers is greater than one) is similar to Sherbrooke ’s model. B.L. Deuermeyer and L.B. Schwarz [8] were the first to analyze such a system. They estimate the mean and the vari-

ance of lead-time demand to obtain average inventory levels and backorders at the warehouse, assuming that lead-time demand is normally distributed. The retailer lead-time demand is also approximated using a normal distribution. In addition to reviewing the literature in the area, [15,17] and [22] also provide several extensions to the Deuermeyer and Schwarz model.In [10] the concept of stability in a capacitated, multi-echelon production-inventory system under a base-stock policy is introduced. W.L. Maxwell and others [16] extend the analysis to multiproduct, continuous review and deterministic demand, MSIM problems. Their model tends to schedule the orders for each of the products over an infinite horizon so as to minimize the long-run average cost. The authors define a new class of policies in which each product uses a stationary interval of time between successive orders. Their model finds a lot-sizing rule that is within 6% of the average cost of the optimal policy. R. Roundy [19] develops a similar multistage, multiproduct lot-sizing model. Under the assumption that the ratio of the order intervals of any two products is an integer power of two, it is shown that the solution is within 2% of optimality. D. Sculli and others [20] extend the analysis of MSIM systems for the case when two suppliers are used to replenish stock of a single item. They calculate the mean and the standard deviation of the effective lead time demand and interarrival time when replenishment orders are placed at the same time with the two suppliers, in a continuous review system. The lead-time distribution of each supplier is assumed to be normal. R. Ganeshan [9] presents a near-optimal (s, Q) inventory policy for a production/distribution network with multiple suppliers replenishing a central warehouse, which distributes to a large number of retailers. The model concentrates on inventory analysis at the retailers and the warehouse, and demand process at the warehouse. The model finds a near-optimal order quantity and a reorder point at both the retailer and the distribution center (DC) under stochastic demand and lead-time, subject to customer service constraints. The main contribution of this model is the integration of the above components for analyzing simple supply chains. P. Afentakis and others [1] develop a procedure for optimally solving medium size lot-scheduling problems in multistage structures with periodic review of the inventory and dynamic deterministic demand. They formu-

Inventory Management in Supply Chains

late the problem in terms of echelon stock, which simplifies its decomposition by Lagrangian relaxation. An efficient branch and bound algorithm is used to solve the problem. M.C. van der Heijden and others [13] consider the periodic review, order-up-to (R, S) inventory system under stochastic demand. They propose a new approach to calculate the mean physical stock. The standard approximation appears to yield inaccurate results in the case of low service levels. Low service levels usually occur at intermediate nodes in optimal solutions for multi-echelon systems. Conclusions With the trend toward just-in-time deliveries and reduction of inventories, many firms are reexamining their inventory and logistics policies. Some firms are altering their inventory, production and shipping policies, and others are working on coordinating inventory decisions throughout their SC, with the goal of reducing costs and improving service. Single stage IM models give some insights on how to manage inventories under certain demand and lead-time considerations, while MSIM models take the analysis further, coordinating inventory decisions throughout the SC. This article reviews the literature on single-stage and multistage inventory management models, with an emphasis on achievements in this field. See also  Global Supply Chain Models  Nonconvex Network Flow Problems  Operations Research Models for Supply Chain Management and Design  Piecewise Linear Network Flow Problems References 1. Afentakis P, Gavish B, Karmarkar U (1984) Computationally efficient optimal solutions to the lot-sizing problem in multistage assembly systems. Managem Sci 30(2):222–239 2. Arrow KA (1958) Historical background. In: Arrow KA, Karlin S, Scarf HE (eds) Studies in the Math. Theory of Inventory and Production. Stanford Univ. Press, Palo Alto, CA 3. Arrow KA, Karlin S, Scarf HE (eds) (1958) Studies in the mathematical theory of inventory and production. Stanford Univ. Press, Palo Alto, CA

I

4. Axsäter S (1993) Continuous review policies for multi-level inventory system with stochastic demand. In: Graves SC, Rinnooy Kan AHG, Zipkin P (eds) Handbook Oper. Res. and Management Sci.: Logistics of Production and Inventory, vol 4. North-Holland, Amsterdam, pp 175–197 5. Ballou RH (1992) Business logistics management. Third edn. Prentice-Hall, Englewood Cliffs, NJ 6. Beamon BM (1998) Supply chain design and analysis: Models and methods. Internat J Production Economics 55:281– 294 7. Clark A, Scarf H (1960) Optimal policies for a multi-echelon inventory problem. Managem Sci 6:475–490 8. Deuermeyer BL, Schwarz LB (1981) A model for the analysis of system service level in warehouse retailer distribution systems: The identical retailer case. In: Schwarz LB (ed) Studies in Management Sci.: Multi-Level Production/Inventory Control Systems: Theory and Practice, vol 16. North-Holland, Amsterdam, pp 163–193 9. Ganeshan R (1999) Managing supply chain inventories: A multiple retailer, one warehouse, multiple supplier model. Internat J Production Economics 59:341–354 10. Glasserman P, Tayur S (1994) The stability of a capacitated, multi-echelon production-inventory system under a basestock policy. Oper Res 42(5):913–926 11. Graves SC (1985) A multi-echelon inventory model for a repairable item with one-for-one replenishment. Managem Sci 31(10):1247–1256 12. Harris FW (1913) How many parts to make at once. Factory, The Magazine of Management 10(2):135–136; 152 13. Heijden MC van der, Kok Tde (1998) Estimating stock levels in periodic review inventory systems. Oper Res Lett 22:179–182 14. Karmarkar US (1987) Lot sizes, lead times and in-process inventories. Managem Sci 33(3):409–418 15. Lee H, Moinzadeh K (1987) Two-parameter approximations for multi-echelon repairable inventory models with batch ordering policy. IIE Trans 19:140–147 16. Maxwell WL, Muckstadt JA (1985) Establishing consistent and realistic reorder intervals in production-distribution systems. Oper Res 33:1316–1341 17. Moinzadeh K, Lee H (1986) Batch size and stocking levels in multi-echelon repairable systems. Managem Sci 32(12):1567–1581 18. Nahmias S (1997) Production and operations analysis, 3rd edn. IRWIN Publ., Homewood, IL 19. Roundy R (1986) A 98%-effective lot-sizing rule for a multiproduct, multi-stage production/inventory system. Math Oper Res 11(4):699–727 20. Sculli D, Wu SY (1981) Stock control with two suppliers and normal lead times. J Oper Res Soc 32:1003–1009 21. Sherbrooke CC (1968) METRIC: A multi-echelon technique for recoverable item control. Oper Res 16:122–141 22. Svoronos A, Zipkin P (1988) Estimating the performance of multi-level inventory systems. Oper Res 36(1):57–72

1769

1770

I

Invexity and its Applications

23. Veinott AF (1966) The status of mathematical inventory. Managem Sci 12:745–775 24. Wagner HM, Whitin TM (1958) Dynamic version of the economic lot size model. Managem Sci 5(1):89–96

Invexity and its Applications B. D. CRAVEN University Melbourne, Melbourne, Australia MSC2000: 90C26 Article Outline Keywords See also References Keywords Optimization; Invex; Lagrangian conditions; Constrained minimization; Scale function; E-convex; Wolfe dual; Mond–Weir dual; Generalized invex; V-invex; Convexifiable; Protoconvex; Quasi-invex; Pseudo-invex; Pseudoconvex; Quasiconvex; Basic alternative theorem; Convex-like; Optimal control; Continuous programming Lagrangian conditions are often necessary, but not sufficient, for a minimum of an optimization problem in continuous variables. Sufficiency holds under convex assumptions, which are however often not satisfied in applications. Invexity is a less restrictive assumption than convexity, under which Lagrangian conditions are sufficient for a minimum, and also related duality results hold. It also provides a structure, showing relations with various other kinds of generalized convexity. Consider a constrained minimization problem 8 ˆ ˆ F 00 (p) v C    ; 2 1 > (x; p) D v C v Q v C    ; 2 where v = x  p, and v| F 00 (p) v means that component j of F has second order term v| F j 00 (p) v, and similarly for v| Q v; denote the matrix component k of Q by Qk . Then ([5]), by substituting in (INV), local invexity implies that X F 0 (p)sk Q k F 00 (p)s  k

is positive semidefinite, for each s. Conversely, if each of these matrices is positive definite, then F is locally invex at p. Some further classes of invex functions are described as follows (see [9]). Let X 0 be an open domain in Rn , let A: X 0 ! Rm be convex, and let B: X 0 ! R be differentiable and satisfy B(X 0 )  (0, 1). Then A()/B()

I

Invexity and its Applications

m is invex if also B is convex with A(X)   RC , or if B m m is concave with A(X 0 )  RC . If g: X 0 ! R is differentiable, and g 0 (a) d < 0 for some direction d, then g is invex at the point a. Let : Rn ! Rn be an invertible C2 mapping; let r: R+ ! R+ be strictly increasing, with r(0) = 0, r0 (0) = 0, and r00 (0) < 0 on some interval. Then r and h() := r ı k  k are pseudoconvex, and h ı  is pseudo-invex (hence also quasi-invex). The invex property, and also pseudo-invex, quasiinvex and V-invex, are also applicable when the spaces Rn and Rm are replaced by infinite-dimensional normed spaces of functions, such as occur in optimal control (see [3,8]) and continuous programming (see [7,20]). The definitions, and proofs of basic properties (see [3,7,8]), are unchanged, interpreting a  b as a  b 2 Q, where the order cone Q is a closed convex cone. Examples of spaces of control functions are the spaces C(Rr ) (respectively, PC(I, Rr )) of continuous (respectively, piecewise continuous) functions from an interval I into Rr , with the uniform norm, and the space L2 (I, Rr ) of square-integrable functions from I into Rr ). Consider, for R T example, an integral objective function f (x) :D 0 (x(t); x˙ (t); t) dt, where f 2 C1 (0, T),  is differentiable, and x˙ (t) D ( ddt )x(t). Assume boundary conditions x(0) = x0 , x(T) = xT . De@ )(x(t); t), and similarly note x (x(t); x˙ (t); t) :D ( @x 0 x˙ . Then the gradient f (p) of f at p 2 C1 [0, T) is given by Z T x (p(t); p˙(t); t)z(t) dt f 0 (p)z D 0  Z T d x (p(t); p˙(t); t)  x˙ (p(t); p˙(t); t) D dt 0

 z(t) dt after integrating by parts. Then f is imvex if, for some scale function , f (x)  f (p)  Z T d x (p(t); p˙(t); t)  x˙ (p(t); p˙(t); t)  dt 0  (x(t); p(t); t) dt: For a constraint (x(t), t)  0 (8t 2 [0, T]), the analog P of the term j g j (x) in the Lagrangian is Z T (t) (x(t); t) dt; 0

and invexity requires that Z T   (t) (x(t); t)  (p(t); t) dt 0

Z

T

(t)



x (p(t); t)(x(t);

p(t); t) dt:

0

There are converse KKT and duality properties for such infinite-dimensional problems (see e. g. [8,20]), using invexity quite similarly to finite-dimensional problems. See also  Generalized Concavity in Multi-objective Optimization  Isotonic Regression Problems  L-convex Functions and M-convex Functions References 1. Ben-Israel A, Mond B (1986) What is invexity? J Austral Math Soc (Ser B) 22:1–9 2. Clarke FH (1983) Optimization and nonsmooth analysis. Wiley/Interscience, New York 3. Craven BD (1978) Mathematical programming and control theory. Chapman and Hall, London 4. Craven BD (1981) Duality for generalized convex fractional programs. In: Generalized Concavity in Optimization and Economics. Acad. Press, New York, pp 473–489 5. Craven BD (1981) Invex functions and constrained local minima. Bull Austral Math Soc 24:357–366 6. Craven BD (1986) Nondifferentiable optimization by smooth approximations. Optim 17:3–17 7. Craven BD (1993) On continuous programming with generalized convexity. Asia–Pacific J Oper Res 10:219–231 8. Craven BD (1995) Control and optimization. Chapman and Hall, London 9. Craven BD, Glover BM (1985) Invex functions and duality. J Austral Math Soc (Ser A) 39:1–20 10. Craven BD, Luu DV (1997) Optimization with set-functions described by functions. Optim 42:39–50 11. Hanson MA (1980) On the sufficiency of the Kuhn–Tucker conditions. J Math Anal Appl 80:545–550 12. Hanson MA, Mond B (1987) Necessary and sufficient conditions in constrained optimization. Math Program 37:51– 56 13. Jeyakumar V (1985) Convexlike alternative theorems and mathematical programming. Optim 16:643–652 14. Jeyakumar V, Mond B (1992) On generalized convex mathematical programming. J Austral Math Soc (Ser B) 34:43– 53 15. Kaul RN, Kaur S (1985) Optimality criteria in nonlinear programming involving nonconvex functions. J Math Anal Appl 105:104–112

1773

1774

I

Isotonic Regression Problems

16. Mangasarian OL (1969) Nonlinear programming. McGrawHill, New York 17. Mond B, Hanson MA (1984) On duality with generalized convexity. Math Operationsforsch Statist Ser Optim 15:313–317 18. Mond B, Weir T (1981) Generalized concavity and duality. In: Generalized Concavity in Optimization and Economics. Acad. Press, New York, pp 263–275 19. Mond B, Weir T (1982) Duality for fractional programming with generalized convex conditions. J Inform Optim Sci 3:105–124 20. Weir T, Hanson MA, Mond B (1984) Generalized concavity and duality in continuous programming. J Math Anal Appl 104:212–218 21. Weir T, Mond B (1988) Pre-invex functions in multiple objective optimization. J Math Anal Appl 137:29–38

Isotonic Regression Problems VALENTINA DE SIMONE, MARINA MARINO, GERARDO TORALDO University Naples ‘Federico II’ and CPS, Naples, Italy MSC2000: 62G07, 62G30, 65K05 Article Outline Keywords Synonyms Problem Statement The Pool Adjacent Violators Algorithm Minimum Lower Set Algorithm Other Algorithms See also References Keywords Order restriction; Algorithms; Optimization Synonyms IR Problem Statement Given a finite set X with an ordering 4, a real function g on X and a positive weight function w on X, the isotonic regression problem is to find a function g  which

minimizes X [g(x)  f (x)]2w(x); x2X

among the class F of isotonic functions f defined on X, i. e. F D f f : 8x; y 2 X and x  y ) f (x)  f (y)g : The function g  is called isotonic regression, and it exists and is unique [5]. Isotonic regression can be viewed as a least squares problem under order restrictions; here, order restrictions on parameters can be regarded as requiring that the parameter, as a function of an index, will be isotonic (the adjective ‘isotonic’ is used as a synonym for ‘order preserving’) with respect to an order on the index set. If 4 is reflexive, transitive, antisymmetric and every pair of elements are comparable, the problem is called simple order isotonic regression. A very important result in the theory of isotonic regression, is that the increasing function f closest to a given function g on X in the (weighted) least squares sense, can be constructed graphically. A geometrical interpretation of isotonic regression over a simple order finite set X = {x1 , . . . , xn } is the following. Let W j = Pj Pj iD1 w(xi ) and Gj = iD1 g(xi ) w(xi ); the points Pj = (W j , Gj ) obtained plotting the cumulative sums Gj against the cumulative sums W j , j = 0, . . . , n (W 0 = 0, G0 = 0), constitute the cumulative sum diagram (CSD) of a given function g with weights w. The isotonic regression of g is given by the slope of the greatest convex minorant (GCM) (i. e., the graph of the supremum of all convex functions whose graphs lie below the CSD) of the CSD; the slope of the segment joining Pj  1 to Pj is just g(xj ), j = 1, . . . , n, while the slope of the segment joining Pi  1 to Pj , i  j is the weighted average Pj Avfx i ; : : : ; x j g D

rDi

Pj

g(x r )w(x r )

rDi

:

w(x r )

Therefore, the value of the isotonic regression g  at the point xj is just the slope of the GCM at the point Pj = Pj (W j , Gj ), where Gj = iD1 g  (xi ) w(xi ). Note that, if Pj is a ‘corner’ of the graph, g  is the slope of the segment extending to the left. An illustrative example is shown in Fig. 1.

Isotonic Regression Problems

I

The Pool Adjacent Violators Algorithm

Isotonic Regression Problems, Figure 1 Example of CSD and GCM

The first and the most widely used algorithm for the simply ordered isotonic regression is the pool adjacent violators algorithm (PAV) proposed by M. Ayer et al. [1] in 1955. This algorithm follows directly from the geometrical interpretation of isotonic regression. As it is said before, the solution of the problem under consideration is given by the left derivative of the greatest convex function lying below the CSD. If, for some index i, g(xi  1 ) > g(xi ), then the graph of the part of the GCM between the points Pi2 and Pi is a straight line segment. Thus the CSD could be altered by connecting Pi  2 with Pi by a straight line segment, without changing the GCM. The PAV algorithm is based on this idea of successive approximation to the GCM. (See Fig. 2 for a geometrical interpretation of ‘pooling’ adjacent violators.) In describing the algorithm, an arbitrary set of consecutive elements of X will be referred to as a block. The

Other isotonic regression problems are based on a less restrictive kind of order: partial order and quasiorder. In the partial order isotonic regression problem the binary relation 4 on X is reflexive, transitive and antisymmetric, but there may be noncomparable elements. In the quasi-order isotonic regression problem, the ordering relation satisfies only the first two conditions. The isotonic regression problem arises in both statistics and operations research. Applications in statistics are discussed in [2,10] and [14]. Applications in operations research can be found in [11] and [15]. The problem under consideration is also of theoretical interest being one of the very few quadratic problems known for which strongly polynomial solution algorithms exist (an algorithm is said to be strongly polynomial if the number of elementary arithmetic operations it requires is a polynomial in the problem parameter and not just the size of the input data). That is why many researcher in the area of order restricted statistical inference have paid a great deal of attention to the problem of developing algorithms for isotonic regression. Most of the algorithms proposed involve averaging g over suitably selected subsets S of X on each of which g  (x) is constant.

Isotonic Regression Problems, Figure 2 Geometrical interpretation of pooling adjacent violators

1775

1776

I

Isotonic Regression Problems

aim is to find the solution blocks, that is a partitioning of X into sets on each of which the isotonic regression function g  is constant. The PAV algorithm starts from the initial block class  consisting of the singleton sets {xi }, 1  i  n. At each stage of the algorithm, a new block class is obtained from the previous block class by joining the blocks together until the final partition is reached. If g(x1 )      g(xn ), then the initial partition is also the final partition, and g  (xi ) = g(xi ), i = 1, . . . , n. Otherwise, select any of the pairs of violators of the ordering; that is, select an index i such that g(xi ) > g(xi + 1 ).‘Pool’ these two values of g: i. e., join the two points xi and xi + 1 in a block {xi , xi + 1 } ordered between {xi  1 } and {xi + 2 }, and associate to this block the average value Av {xi , xi + 1 } and the weight w(xi ) + w(xi + 1 ). After each step in the algorithm, the average values associated with the blocks are examined to see whether or not they are in the required order. If so, the final partition has been reached, and the value of g  at each point of a block is the ‘pooled’ value associated with the block. If not, a pair of adjacent violating blocks is selected, and pooled to form a single block, with associated weight the sum of their weights and associated average value the weighted average of their average values, completing another step of the algorithm. A pseudocode for PAV algorithm is presented below, where B is the first block in  and B+ is the block that follows B in the blocks partition.

= ffx1 g; : : : ; fx n gg REPEAT set B and B+ WHILE B+ ¤ 0 IF AvB  AvB+ THEN = n fB; B+ g [ fB [ B+ g B = B [ B+ g  (x) = AvB, 8x 2 B ELSE B = B+ ENDIF B+ = succ(B) ENDWHILE UNTIL there are no violating blocks A pseudocode for PAV algorithm

S.J. Grotzinger and C. Witzgall [9] have shown that the computational complexity of the PAV algorithm is O(n). Minimum Lower Set Algorithm The first algorithm proposed for partially order isotonic regression is the minimum lower sets (MLS) due to H.D. Brunk [4]. For describing this algorithm, as for most of the algorithms for general partial order, it is convenient to introduce the concept of ‘level set’ that generalizes the concept of ‘block’. In order to define this set, important concepts are lower and upper set. A set L X is called lower set if 8x 2 X and 8y 2 L with x 4 y ) x 2 L. A set U X is called upper set if 8x 2 X and 8y 2 U with x 4 y ) x 2 U. Finally, S X is called level set if there are a lower set L X and an upper set U X such that S = L \ U. The isotonic regression with respect to any partial order is constant on level sets. The MLS algorithm computes the isotonic regression function by partitioning X into l level sets S1 , . . . , Sl such that AvS1 <    < AvSl . In doing that, the algorithm performs l steps in each of which searches for the largest level set of minimum average Si among the level sets Li + 1 \ LCi , where LCi is the complement of Li . The isotonic regression values are given by the weighted average of the observations in each of the level set that belong to the solution partition. In the following a pseudocode for MLS algorithm is given, where L is the lower set family of X. M.J. Best and N. Chakravarti [3] have proved that the MLS algorithm is of computational complexity O(n2 ). select L1  X : AvB1 = AvL1 = minfAvL : L 2 Lg i=1 REPEAT i = i+1 select L2  X : AvB i = Av(L2 \ L1C ) = minfAv(L \ L1C ) : L 2 Lg L1 = L2 UNTIL X is exhausted g  (x) = AvB j ; 8x 2 B j ; j = 1; : : : ; i A pseudocode for MLS algorithm

Isotonic Regression Problems

Other Algorithms Several other algorithms are available for solving the isotonic regression problem as well as its various special cases. Their description are provided in [2,3,6,7,8, 10,11,12,13,15], among others. Best and Chakravarti, in their paper [3], have pointed out that several of the proposed algorithms are active set quadratic programming methods and that this methodology provides a unifying framework for studying algorithms for isotonic regression. See also  Regression by Special Functions References 1. Ayer M, Brunk HD, Ewing GM, Reid WT, Silverman E (1955) An empirical distribution function for sampling with incomplete information. Ann Math Statist 26:641– 647 2. Barlow RE, Barthlomew DJ, Bremner JM, Brunk HD (1972) Statistical inference under order restrictions. Wiley, New York 3. Best MJ, Chakravarti N (1990) Active set algorithms for isotonic regression: A unifying framework. Math Program 47:425–439

I

4. Brunk HD (1955) Maximum likelihood estimates of monotone parameters. Ann Math Statist 26:607–616 5. Brunk HD (1965) Conditional expectation given a  -lattice and applications. Ann Math Statist 36:1339–1350 6. Dykstra RL (1981) An isotonic regression algorithm. J Statist Planning Inference 5:355–363 7. Eeden Cvan (1957) Maximal likelihood estimation of partially or completely ordered parameters. I. Indag Math 19:128–136 8. Gebhardt F (1970) An algorithm for monotone regression with one or more indipendent variables. Biometrika 57:263–271 9. Grotzinger SJ, Witzgall C (1984) Projection onto order simplexes. Appl Math Optim 12:247–270 10. Lee CIC (1983) The min-max algorithm and isotonic regression. Ann Statist 11:467–477 11. Maxwell WL, Muchstadt JA (1985) Establishing consistent and realistic reorder intervals in production-distributioni systems. Oper Res 33:1316–1341 12. Pardalos PM, Xue G (1998) Algorithms for a class of isotonic regression problems. Algorithmica 23:211–222 13. Pardalos PM, Xue G, Young L (1995) Efficient computation of the isotonic median regression. Applied Math Lett 8:67– 70 14. Robertson T, Wright FT, Dykstra RL (1988) Ordered restricted statistical inference. Wiley, New York 15. Roundy R (1986) A 98% effective lot-sizing rule for a multiproduct multistage production/inventory system. Math Oper Res 11:699–727

1777

Jaynes’ Maximum Entropy Principle

J

J

Jaynes’ Maximum Entropy Principle

further bolstered its importance as a viable tool for statistical inference.

MaxEnt H. K. KESAVAN University Waterloo, Waterloo, Canada MSC2000: 94A17 Article Outline Keywords Entropy and Uncertainty Why Choose Maximum Uncertainty? Shannon Entropy Jaynes’ Maximum Entropy Formalism Applications of MaxEnt and Conclusions See also References Keywords Entropy; Uncertainty; Optimization; Jaynes; Shannon C.E. Shannon’s seminal discovery [7] (1948) of his entropy measure in connection with communication theory has found useful applications in several other probabilistic systems. E.T. Jaynes has further extended its scope by discovering the maximum entropy principle (MaxEnt) [1] (1957) which is inherent in the process of optimization of the entropy measure when some incomplete information is given about a system in the form of moment constraints. MaxEnt has, over the past four decades, given rise to an interdisciplinary methodology for the formulation and solution of a large class of probabilistic systems. Furthermore, MaxEnt’s natural kinship with the Bayesian methods of analyses has

Entropy and Uncertainty The word entropy first originated in the discipline of thermodynamics, but Shannon entropy has a much broader meaning since it deals with the more pervasive concept of information. The word entropy itself has now crept into common usage to mean transformation of a quantity, or phenomenon, from order to disorder. This implies an irreversible rise in uncertainty. In fact, the word uncertainty would have been more unambiguous as to its intended meaning in the context of information theory, but for historic reasons, the usage of the word entropy has come to stay in the literature. Uncertainty arises both in probabilistic phenomena such as in the tossing of a coin and, equally well, in deterministic phenomena where we know that the outcome is not a chance event, but we are merely fuzzy about the possibility of the specific outcome. What is germane to our study of MaxEnt is only probabilistic uncertainty. The concept of probability that is used in this context is what is generally known as the subjective interpretation as distinct from the objective interpretation based on frequency of outcome of an event. The subjective notion of probability considers a probability distribution as representing a state of knowledge and hence it is observer dependant. The underlying basis for an initial probability assignment is given by the Laplace’s principle of insufficient reason. According to this, if in an experiment with n possible outcomes, we have no information exPn cept that each probability pi  0 and iD1 pi = 1, then the most unbiased choice is the uniform distribution: (1/n, . . . , 1/n). Laplace’s principle underscores the

1779

1780

J

Jaynes’ Maximum Entropy Principle

choice of maximum uncertainty based on logical reasoning only.

to maximize it. However, what we are ensuring by the principle of maximum uncertainty is that we are maximally uncertain about what we do not know.

Why Choose Maximum Uncertainty? We shall now consider the example of a die in order to highlight the importance of maximum uncertainty as a preamble to our later discussion of MaxEnt. When the only information available is that the die has six faces, the uniform probability distribution (1/6, . . . , 1/6), satisfying the natural constraint n X

p i D 1;

p1  0; : : : ; p6  0;

(1)

iD1

represents the maximum uncertainty. If, in addition, we are also given the mean number of points on the die, that is, if we are given that p1 C 2p2 C 3p3 C 4p4 C 5p5 C 6p6 D 4:5;

(2)

the choice of distributions is restricted to the incomplete information given by both (1) and (2), and, consequently, the maximum uncertainty encountered at the first stage is reduced. Since there are only two independent equations in six variables, there is an infinity of probability distributions satisfying the constraints. Out of all such distributions, one can anticipate the existence of a distribution giving rise to the maximum uncertainty Smax . The importance of Smax can be deduced from a careful consideration of the process by which uncertainty is reduced (or never increased) by providing more and more information in terms of moment constraints. If we use any distribution from amongst the infinity of distributions satisfying the constraints that is different from the one corresponding to Smax , it would imply that we would be using some information in addition to those given by (1) and (2). But scientific objectivity would behoove that we should use only the information that is given to us, and scrupulously avoid using any extraneous information. The principle of maximum uncertainty can, accordingly, be stated as: Out of all probability distributions consistent with a given set of constraints, the distribution with maximum uncertainty should be chosen. At first glance, it may seem paradoxical that while the goal is reduction of uncertainty, we are actually trying

Shannon Entropy The conclusion from the example of the die is that in making inferences based on incomplete information, the probability distribution that has the maximum uncertainty permitted by the available information should be used. It is therefore necessary to have a quantitative measure of uncertainty in a probability distribution. A unique function was defined by Shannon [7] to measure uncertainty. Let p = (p1 , . . . , pn ) be a probability distribution satisfying the natural constraint n X

p i D 1:

(3)

iD1

Shannon’s measure of entropy (uncertainty) for this distribution is given by SD

n X

p i ln p i

(4)

iD1

Shannon arrived at this unique measure by first stating the desirable properties that such a measure should have. Since not all this long list of properties are independent, he considered a smaller independent set of properties and deduced the uniqueness of (4). Similarly, A.I. Kinchin [5] assumed a different independent set and arrived at the same measure. The Shannon entropy measure is the basis for Jaynes’ maximum entropy principle. Of particular importance is the property of concavity of the measure which guarantees the existence of a maximum entropy distribution with all pi  0. Shannon’s work in information theory did not involve optimization and as such he did not make use of the concavity property whereas here, it is central to the development. Jaynes’ Maximum Entropy Formalism We will now present Jaynes’ maximum entropy formalism based on discrete multivariate distributions of the random variable X and state some important results arising from it. The ensemble, (X; p) D ((x1 ; p1 ); : : : ; (x n ; p n ));

Jaynes’ Maximum Entropy Principle

where n is finite, represents all the possible realizations of X and their probabilities of occurrence. p is estimated by maximizing the Shannon measure (4) subject to the natural constraint (3) and the moment constraints n X

p i gr i D ar ;

r D 1; : : : ; m;

p i  0:

(5)

iD1

The Lagrangian is given by LD

n X

n X

p i ln p i  (0  1)

iD1



m X

r

rD1

! pi  1

iD1 n X

!

p i gr i  ar

(6)

iD1

where 0 , . . . , m are the (m + 1) Lagrange multipliers corresponding to the (m + 1) constraints of (3) and (5). m X @L D 0 )  ln p i  0  r g r i D 0 @p i rD1

or, p i D exp(0  1 g1i      m g mi );

(8)

for i = 1, . . . , n. The m multipliers are determined by substituting for pi from (8), in (3) and (5) so that 1 0 n m X X (9) exp @0   j g ji A D 1 iD1

jD1

and n X

0 gr i exp @0 

iD1

1  j g ji A D ar ;

(10)

jD1

for r = 1, . . . , m, or exp(0 ) D

m X

n X

0 exp @

iD1

and ar exp(0 ) D

n X iD1

m X

1  j g ji A

gr i exp @

iD1

q i ln

qi 0 pi

(14)

where q is the probability distribution with entropy S.  Jaynes’ formalism also leads to Jaynes’ entropy concentration theorem that asserts that the constrained maximum probability distribution is the one that best represents our state of knowledge about the behavior of the system and that MaxEnt is the preferred method of inferring that distribution. The conclusion is based on (14) and the chi-square test.  Jaynes’ formalism is applicable to continuousvariate probability distributions also.  In our earlier discussion, we had stated that the statement of the Laplace’s principle of insufficient reason was based purely on logic. We can now show that uniform distribution results from MaxEnt when only the natural constraint (3) is specified.

(11) Applications of MaxEnt and Conclusions

jD1

0

r = 1, . . . , m. Equation (11) gives 0 as a function of 1 , . . . , m . Equations (13) give a1 , . . . , am as functions of 1 , . . . , m . Based on the above formalism, we can derive the following results which are useful in applications.  The Lagrange multiplier 0 is a convex function of 1 , . . . , m .  The value of the maximum entropy Smax = 0 + 1 a1 +    + m a m .  Smax is a concave function of a1 , . . . , am .  The Lagrange multipliers 1 , . . . , m are the partial derivatives of Smax with respect to a1 , . . . , am , respectively.  An alternative proof that MaxEnt gives globally maximum values of entropy, that is, Smax S  0, can be given on the basis of Shannon’s inequality n X

(7)

J

m X

1  j g ji A ;

(12)

jD1

for r = 1, . . . , m so that   P Pn m g exp   g r i j ji iD1 jD1  ;  P ar D P n n iD1 exp  jD1  j g ji

(13)

As the very first application, Jaynes demonstrated the power of MaxEnt by deriving all the principal distributions of statistical mechanics without reference to the classical methods of derivation [1]. An important application that closely followed the application of MaxEnt to statistical mechanics, was the correspondence that M. Tribus [9,10] established with the laws of thermodynamics. This application, incidentally, clarified the connection between the Shannon entropy and ther-

1781

1782

J

Job-shop Scheduling Problem

modynamic entropy. He also demonstrated that most of the statistical distributions that are commonly encountered can be derived from MaxEnt by making use of appropriate moment constraints, which, later, came to be known as characterizing moments. Thus, he established the integral link between information theory and statistics. For example, the normal distribution is a maximum-entropy distribution resulting from maximizing the Shannon entropy with respect to the characterizing moments of mean and variance. These remarkable successes set in motion the applications of MaxEnt in several other disciplines. To name only a few, MaxEnt has been applied to problems in urban and regional planning, transportation, business, economics, finance, statistical inference, operations research, queueing theory, nonlinear spectral analysis, pattern recognition and image processing, computerized tomography, risk analysis, population growth models, chemical reactions and many other areas. These are all problems that are inherently probabilistic in nature or, alternatively, where the MaxEnt model is made to fit by artificially interpreting probabilities as proportions. References to these problems can be found in [2,3,4]. For the past ten years, the direction of research into MaxEnt has gone in the direction of using the principle in conjunction with Bayes theorem. There has been a series of workshops conducted under this heading which appears in the series [6]. Also of great interest is the concept of minimum entropy which is found useful in recognizing patterns contained in an information structure. However, research in this direction has been hampered by the computational complexity in determining the quantity because it results from the global minimization of a concave function which is a NP-hard problem. Closely associated with MaxEnt are the methods of optimization based on Kullback–Leibler measure [8] to measure distance between two probability distributions. However, the school dedicated to the use of MaxEnt steers clear of this approach since it does not involve the concept of entropy. See also  Entropy Optimization: Interior Point Methods  Entropy Optimization: Parameter Estimation

 Entropy Optimization: Shannon Measure of Entropy and its Properties  Maximum Entropy Principle: Image Reconstruction References 1. Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev 106:620–630 2. Kapur JN (1989) Maximum entropy models in science and engineering. Wiley Eastern, New Delhi 3. Kapur JN, Kesavan HK (1987) Generalized maximum entropy principle (with applications). Sandford Educational Press Univ. Waterloo, Waterloo, Canada 4. Kapur JN, Kesavan HK (1992) Entropy optimization principles with applications. Acad. Press, New York 5. Kinchin AI (1957) Mathematical foundations of information theory. Dover, Mineola, NY 6. Series: Maximum entropy and Bayesian methods. Kluwer, Dordrecht 7. Shannon CE (1948) A mathematical theory of communication. Bell System Techn J 27:379–423, 623–659 8. Shore JE, Johnson RW (1980) Properties of cross-entropy minimization. IEEE Trans Inform Theory IT-27:472–482 9. Tribus M (1961) Thermostatics and thermodynamics. v. Nostrand, Princeton, NJ 10. Tribus M (1966) Rational descriptions, decisions and designs. Pergamon, Oxford

Job-shop Scheduling Problem PETER BRUCKER Fachber. Math./Informatik, Universität Osnabrück, Osnabrück, Germany MSC2000: 90B35 Article Outline Keywords Complexity Results Branch and Bound Algorithms Upper Bounds Lower Bounds Branching Immediate Selection Heuristic Procedures Priority Rule Based Heuristics Shifting Bottleneck Heuristic Local Search

See also References

Job-shop Scheduling Problem

Keywords Job-shop; Scheduling; Complexity; Heuristics The job-shop problem may be formulated as follows. Given are n jobs j = 1, . . . , n and m machines M 1 , . . . , M m . Job j consists of a sequence O1 j ; : : : ; O n j ; j of nj operations which must be processed in the given order, i. e. Oi + 1, j cannot start before Oij is completed for i = 1, . . . , nj 1. Associated with each operation Oij there is a processing time pij and a machine ij 2 {M 1 , . . . , M m }. Oij must be processed for pij time units on machine ij . Each job can be processed by at most one machine at a time and each machine can process only one operation at a time. If not stated differently preemptions of operations are not allowed. One has to find a feasible schedule which minimizes the makespan. It is assumed that all processing times are nonnegative integers and that all jobs and machines are available at starting time zero. Furthermore, if not stated differently, machine repetition is allowed, i. e. i + 1, j = ij is possible. For a precise formulation of the job-shop problem, let O be the set of all operations, let p(k) be the processing time of operation k 2 O, and define (k, l) 2 C if and only if k = Oij and l = Oi + 1, j for some job j and some i = 1, . . . , nj  1. Finally, let M(k) be the machine on which operation k must be processed. Then the job-shop problem may be formulated as disjunctive linear program (cf.  Linear programming): min maxfs(k) C p(k)g

(1)

k2O

such that s(k) C p(k)  s(l) for all k; l 2 O; (k; l) 2 C; s(k) C p(k)  s(l) for all s(k)  0

k; l 2 O;

or

s(l) C p(l)  s(k)

k ¤ l;

for all k 2 O:

(2)

M(k) D M(l); (3) (4)

s(k) represents the starting time of operation k. Due to (2) all operations of the same job are processed in the

J

right order. The constraints (3) make sure that a machine cannot process two operations at the same time. The job-shop problem may be represented by a mixed graph G = (O, C, D) with vertex set O, the set C of (directed) arcs, and a set D = {{k, l}: k, l 2 O; k 6D l; M(k) = M(l)} of (undirected) edges. Furthermore, the processing time p(k) is associated with each vertex k 2 O. The arcs are called conjunctions and the edges are called disjunctions. The basic scheduling decision is to define a processing order of the operations on each machine, i. e. to fix precedence relations between these operations. In the mixed graph model this is done by orienting edges, i. e. by turning disjunctions into conjunctions. A set S of these oriented edges is called an orientation. An orientation S is called a complete orientation if  every edges becomes oriented; and  the resulting directed graph G(S) = (O, C [ S) has no cycles. A complete orientation S defines a feasible schedule which is calculated in the following way. For each operation k let l(k) be the length of a longest path to vertex k in G(S). A path to k is a sequence of vertices v1 , . . . , vr = k with (vi , vi + 1 ) is an arc for i = 1, . . . , r 1. The length of a path P to k is the sum of all processing times of operations in P excluding operation k. Choose l(k) as the starting time of operation k. It is not difficult to see that this schedule is feasible. Furthermore, the length of the longest path in G(S) defines the makespan of this schedule. A corresponding path is called critical path. On the other hand a feasible schedule s D (s(k)) k2O defines a complete orientation S and the critical path length in G(S) is not greater than the makespan of the schedule s. Thus, one may restrict attention to schedules defined by complete orientations. There are only a few special cases of the job-shop problem which are polynomially solvable (cf.  Complexity classes in optimization). They will be discussed next. Complexity Results The two-machine job-shop problem in which each job has at most two operations can be solved by a simple extension of Johnson’s algorithm for the two machine flow-shop problem [16]. Let I i be the set of jobs with operations on M i only (i = 1, 2), and let I 1, 2 (I 2, 1 ) be the

1783

1784

J

Job-shop Scheduling Problem

set of jobs which are processed first on M 1 (M 2 ) and then on M 2 (M 1 ). Order the latter two sets by means of Johnson’s algorithm and the former two sets arbitrarily. Then one obtains an optimal schedule by executing the jobs on M 1 in order (I 12 , I 1 , I 21 ) and on M 2 in order (I 21 , I 2 , I 12 ). In [15] the two-machine job-shop problem with unit-time operations (pij = 1) and no machine repetition is solved in time linear in the total number of operations, through a rule that gives priority to the longest remaining job. Despite the fact that this algorithm is fast it is not polynomial if we represent each job j by the machine which processes the first operation O1j and the number nj of operation of j. However, there exists a clever implementation of this algorithm which is polynomial ([17,26]). Surprisingly, the problem is NPhard if we allow repetition of machines [12]. This, however, is probably as far as one can get if the number of jobs is not fixed but part of the input. Twomachine job-shop problems with nj  3 or pij 2 {1, 2} are NP-hard as well as three machine problems with nj  2 or pij = 1 ([18,19]). The job-shop problem with two jobs may be formulated as a shortest path problem in the plane with regular objects as obstacles [2]. Figure 1 shows a shortest path problem with obstacles which corresponds to a job-shop problem with two jobs with n1 = 4 and n2 = 3. The processing times of the operations of job 1 (job 2) are represented by intervals on the x-axis (y-axis) which are arranged in the order in which the corresponding operations are to be processed. Furthermore, the intervals are labeled by the machines on which the corresponding operations must be processed. A feasible schedule corresponds to a path from 0 to F consisting of segments which are either parallel to one of the axes or diagonal, and avoids the interior of any rectangular obstacle. If one defines the length of the diagonal parts of the path to be equal to the projections of these parts on one of the axes then the length of the path corresponds to the length of the schedule. It can be shown that this geometric problem can be formulated as a shortest path problem in an acyclic network with O(r2 ) arcs where r = max{n1 , n2 } and that this network can be calculated in time O(r2 log r) which is also the complexity for solving the problem [7]. The corresponding preemptive problem can be solved in O(r3 ) time by allowing

Job-shop Scheduling Problem, Figure 1 Path problem with obstacles

the paths to go horizontally or vertically through the obstacles [24]. The two-machine job-shop problem with a fixed number k of jobs has been solved with time complexity O(n2k ) [9]. However, the three machine job-shop problem with k = 3 is NP-hard [25] (cf. also  Complexity theory). If one allows preemption even the twomachine problem with k = 3 is NP-hard [12]. This is very surprising because the corresponding problem without preemption is polynomially solvable. Branch and Bound Algorithms Effective branch and bound algorithms (cf.  Integer programming: Branch and bound methods) have been developed for the job-shop scheduling problem from the 1990s onwards ([3,11,13,20]). Rather than a description of each of these algorithms in detail some of the main concepts, like lower and upper bounds, branching rules, and immediate selection are presented. Most of the branch and bound algorithms use the mixed graph model. The nodes of the enumeration tree correspond to orientations of edges representing sets of feasible schedules. Branching is done by adding further orientations in different ways. The leaves of the enumeration tree correspond to complete orientations while the root is defined by the empty orientation. Given an orientation S one may define heads and tails. A head r(k) of operation k is a lower bound for an earliest possible starting time of k. A tail q(k) of operation k is a lower bound for the time period between the finishing time of operation k and the optimal makespan. A simple way to derive a head r(k) would be to calcu-

J

Job-shop Scheduling Problem

late the length of a longest path to k in G(S). Similarly, a tail q(k) could be calculated as the length of a longest path starting in k excluding p(k). Let P be a critical path in G(S) and L(S) be the length of P. A maximal sequence u1 , . . . , ul of successive operations in P to be processed on the same machine is called a block if it contains at least two operations. The following block theorem is used in connection with branch and bound algorithms. It also plays an important role when defining neighborhoods for local search methods. Theorem 1 (block theorem) Let S be a complete orientation. If there exists another complete orientation S0 such that L(S0 )< L(S), then in S0 at least one operation of some block B of G(S) has to be processed before the first or after the last operation of B. Upper Bounds Each feasible solution provides an upper bound. At the root of the enumeration tree some time is invested for calculating a good upper bound to start with. This is accomplished by applying tabu search using an appropriate neighborhood. Some branch and bound algorithms also calculate heuristically a feasible solution satisfying the given orientation in each vertex of the enumeration tree. If this solution provides a better upper bound than the current one then the current bound is updated. Furthermore, informations provided by the heuristic solution are used for the branching process. Lower Bounds Lower bounds are calculated for each node of the enumeration tree, i. e. for the set of solutions feasible with respect to the current orientation S. Lower bounds may be calculated constructively or destructively. Constructive lower bounds are calculated by solving relaxations of the problem. The destructive methods work as follows. For a given integer U one tries to prove that there exists no feasible solution with value Cmax  U. In this case U + 1 is a valid lower bound. In case of a failure one repeats the test for a smaller U-value. Binary search can be applied to find a large lower bound. The length of a critical path in G(S) provides a constructive lower bound which can be calculated fast. Good bounds are provided by certain one-machine relaxations denoted as head-body-tail problem: Given

a set of jobs j = 1, . . . , n with release times (heads) r(j), processing times p(j), and tails q(j) to be processed on a single machine. Find a schedule with starting times s(j) satisfying the release times such that maxnjD1 {s(j) + p(j) + q(j)} is minimized. Unfortunately the one-machine head-body-tail problem is NP-hard. However, the preemptive version of this problem can be solved in time O(n log n) by applying the following scheduling rules:  Take the release times and completion times as decision points.  Schedule jobs in increasing order of decision points preferring an available job with longest tail. By applying this algorithm to all operations to be processed on M k one gets a lower bound Lk (k = 1, . . . , m). The best of all these Lk -values is chosen. Other constructive lower bounds are based on two job relaxations [10] and cutting plane approaches (cf. also  Integer programming: Cutting plane algorithms) [3]. For destructive methods one assumes that U is a fictive upper bound and wants to prove that no feasible schedule satisfying Cmax  U exists. From U one derives the time window [r(k), d(k)] with d(k) = U q(k) in which each operation k must be processed if U is valid. For each job j = 1, . . . , n let Sj the set of schedules for j which are feasible with respect to its time window. The feasibility problem can be reduced to a zero-one linear program as follows. For each schedule  2 Sj (j = 1, . . . , n) one defines a(, i, t) = 1 if  requires machine M i in time-period [t 1, t] and a(, i, t) = 0 otherwise. Let xj  a 0–1 decision variable that indicates whether job j is scheduled according to schedule . Then there exists no feasible schedule if and only if the following linear program has an optimal solution value  > 1 (see [20]). min 

(5)

such that X x j D 1;

j D 1; : : : ; n;

(6)

2S j n X X

a(; i; t)x j  ;

jD1 2S j

x j 2 f0; 1g;

i D 1; : : : ; m; j D 1; : : : ; n;

t D 1; : : : ; U;  2 S j:

(7) (8)

1785

1786

J

Job-shop Scheduling Problem

Due to (6) exactly one schedule is chosen from each set Sj . The left-hand side of (7) counts the number of jobs to be processed on machine M i in time-period [t1, t]. Thus, one has a feasible schedule if and only if  = 1. To check infeasibility one uses the continuous version of (5) to (8) where (8) is replaced by x j  0;

j D 1; : : : ; n;

 2 Sj:

A second destructive lower bound based on immediate selection will be discussed later. Branching The simplest branching scheme is to choose a not yet oriented edge and orient it in the two possible ways. Another more sophisticated branching scheme is based on the block theorem. There is a branch to several children of the same father node in the enumeration tree. The idea behind such a branching is to orient many edges when branching (see [11]). In [20] a time oriented branching schemes has been used. Immediate Selection By branching disjunctions are oriented. There is another method to orient disjunctions which is due to [13]. This method is called immediate selection. It uses an upper bound UB for the optimal makespan and simple lower bounds based on heads r(k) and tails q(k) of operations k. Let I be a set of n operations to be processed on the same machine and consider a strict subset J  I and an operation c 2 I \ J. If condition X p( j) C min q( j)  U B (9) min r( j) C j2J[fcg

j2J

j2J[fcg

holds, then all operations j 2 J must be processed before operation c if we want to improve the current upper bound UB. This follows from the fact that the left-hand side of (9) is a lower bound for all schedules in which c does not succeed all jobs in J. Due to integrality, (9) is equivalent to X p( j) C min q( j) > U B  1 min r( j) C j2J[fcg

or min r( j) C

j2J[fcg

j2J

j2J[fcg

X

p( j) > max d( j);

j2J[fcg

where d(j) := UB  q(j) 1.

j2J

(10)

(J, c) is called a primal pair if (9) or, equivalently, (10) holds. The corresponding arcs j ! c with j 2 J are called primal arcs. Similarly, (c, J) is called a dual pair and arcs c ! j are called dual arcs if X p( j) > max d( j) (11) min r( j) C j2J

j2J[fcg

j2J[fcg

holds. In this case all operations j 2 J must be processed after operation c if we want to improve the current solution value UB. It can be shown [9] that all primal and dual arcs for the set I can be calculated in O(n2 ) time. Immediate selection is applied to speed up a branch and bound algorithm. For each machine the set I of all operations to be processed on this machine is considered and all corresponding primal and dual arcs are calculated. Then heads and tails are recalculated and the whole process is repeated until there are no new primal or dual arcs to be added. By this method the orientation S is increased step by step. A possible outcome of this process is that one deduces a graph G(S) which contains cycles. This implies that there exists no feasible solution with makespan  UB. Immediate selection and a corresponding cycle check is applied to each node of the enumeration tree. If the cycle check is positive, one can backtrack. Immediate selection is also used to calculate a lower bound by the destructive method. Heuristic Procedures Using a branch and bound algorithm and immediate selection J. Carlier and E. Pinson [13] were able to solve the notorious 10 × 10 benchmark problem in [21] for the first time. Recently (2000), 15 × 15 benchmark problems have been solved [6]. Problems of dimension n × n for n > 15 are still out of the reach if one is interested in optimal solutions. Thus, the only way to find solutions for larger job-shop problems is to apply heuristics, which provide solutions which are not too far away from the optimum. Some of these heuristics will be discussed next. Priority Rule Based Heuristics These are probably most frequently applied due to their ease of implementation and low computation times.

Job-shop Scheduling Problem

The idea of a priority heuristic is to schedule operations step by step each as early as possible. In each step among all unscheduled operations with the property that their conjunctive predecessors are scheduled a candidate with the highest priority is chosen to be scheduled next. For extended summaries and discussions of priority rules see [5,14,23]. Shifting Bottleneck Heuristic (See [1,4].) This is one of the most powerful heuristics for the job-shop scheduling problem. In each iteration a machine M i is chosen and all disjunctions between operations to be processed on M i are oriented. This is done according to the exact solution of the head-bodytail problem for M i . Thus, after k steps the disjunctions for k machines are oriented. Let Mk the set of these k machines. Before choosing a new machine M i 62 Mk in the next step the orientations for the machines in Mk are updated by applying the head-body-tail algorithm to each of the machines in Mk in a given machine order. As a machine M i 62 Mk added to Mk in the next step a bottleneck machine is chosen. A bottleneck machine is a machine M i 62 Mk with a largest head-body-tail problem solution value. It is important to note that before each application of the solution procedure for a headbody-tail problem heads and tails are updated according to the current orientation. Local Search An important class of heuristics are local search methods like iterative improvement, simulated annealing (cf.  Simulated annealing), tabu search and genetic algorithms (cf.  Genetic algorithms). All these methods have been applied to the job-shop problem (see [27] for an excellent survey). The local search techniques are based on the concept of local improvement. Given an existing solution or representation of such a solution, a modification is made in order to obtain a different (usually better) solution. For the job-shop problem solutions are represented by complete orientations. To modify a solution one usually restricts to critical paths (which must be destroyed for improving the current makespan). One possibility for modifications is to choose an arc (v, w) on

J

a critical path such that v and w are processed on the same machine and replace (v, w) by the reverse arc (w, v). It can be shown that the corresponding new orientation is complete again, i. e. no cycles are created by such a reversal. Other modifications are based on the block theorem. One modifies an orientation by shifting an operation of some block at the beginning or the end of the block. Such modifications are not defined if the resulting selections are not complete, i. e. contain cycles. One of the best local search methods in terms of solution quality and computation time is a tabu search procedure described in [22].

See also  MINLP: Design and Scheduling of Batch Processes  Stochastic Scheduling  Vehicle Scheduling References 1. Adams J, Balas E, Zawack D (1988) The shifting bottleneck procedure for job shop scheduling. Managem Sci 34:391– 401 2. Akers SB, Friedman J (1955) A non-numerical approach to production scheduling problems. Oper Res 3:429–442 3. Applegate D, Cook W (1991) A computational study of the job-shop scheduling problem. ORSA J Comput 3:149– 156 4. Balas E, Lenstra JK, Vazacopoulos A (1995) One machine scheduling with delayed precedence constraints. Managem Sci 41:94–1096 5. Blackstone JH, Phillips DT, Hogg GL (1982) A state of art survey of dispatching rules for manufacturing. Internat J Production Res 20:27–45 6. Brinkkötter W, Brucker P (2001) Solving open benchmark instances for the jobshop problem by parallel head-tail adjustments. J Scheduling 4:53–64 7. Brucker P (1988) An efficient algorithm for the job-shop problem with two jobs. Computing 40:353–359 8. Brucker P (1994) A polynomial algorithm for the two machine job-shop scheduling problem with fixed number of jobs. Oper Res Spektrum 16:5–7 9. Brucker P (1998) Scheduling algorithms, 2nd edn. Springer, Berlin 10. Brucker P, Jurisch B (1993) A new lower bound for the jobshop scheduling problem. Europ J Oper Res 64:156–167 11. Brucker P, Jurisch B, Sievers B (1994) A branch and bound algorithm for the job-shop problem. Discrete Appl Math 49:107–127

1787

1788

J

Job-shop Scheduling Problem

12. Brucker P, Kravchenko SA, Sotskov YN (1999) Preemptive job-shop scheduling problems with a fixed number of jobs. Math Meth Oper Res 49:41–76 13. Carlier J, Pinson E (1989) An algorithm for solving the jobshop problem. Managem Sci 35:164–176 14. Haupt R (1989) A survey of priority-rule based scheduling. OR Spektrum 11:3–16 15. Hefetz N, Adiri I (1982) An efficient algorithm for the two-machine unit-time jobshop schedule-length problem. Math Oper Res 7:354–360 16. Jackson JR (1956) An extension of Johnson’s results on job lot scheduling. Naval Res Logist Quart 3:201–203 17. Kubiak W, Sethi S, Sriskandarajah C (1995) An efficient algorithm for a job shop problem. Ann Oper Res 57:203–216 18. Lenstra JK, Rinnooy Kan AHG (1979) Computational complexity of discrete optimization problems. Ann Discret Math 4:121–140 19. Lenstra JK, Rinnooy Kan AHG, Brucker P (1977) Complexity of machine scheduling problems. Ann Discret Math 1:343– 362

20. Martin PD, Shmoys DB (1996) A new approach to computing optimal schedules for the job shop scheduling problem. Proc. 5th Internat. IPCO Conf. 21. Muth JF, Thompson GL (1963) Industrial scheduling. Prentice-Hall, Englewood Cliffs, NJ 22. Nowicki E, Smutnicki C (1996) A fast tabu search algorithm for the job shop problem. Managem Sci 42:797–813 23. Panwalker SS, Iskander W (1977) A survey of scheduling rules. Oper Res 25:45–61 24. Sotskov YN (1991) On the complexity of shop scheduling problems with two or three jobs. Europ J Oper Res 53:323– 336 25. Sotskov YN, Shakhlevich NV (1995) NP-hardness of shopscheduling problems with three jobs. Discrete Appl Math 59:237–266 26. Timkovsky VG (1985) On the complexity of scheduling an arbitrary system. Soviet J Comput Syst Sci 23(5):46–52 27. Vaessens RJM, Aarts EHL, Lenstra JK (1996) Job shop scheduling by local search. INFORMS J Comput 8:302– 317

Kantorovich, Leonid Vitalyevich

K

K

Kantorovich, Leonid Vitalyevich PANOS M. PARDALOS Center for Applied Optim. Department Industrial and Systems Engineering, University Florida, Gainesville, USA MSC2000: 01A99 Article Outline Keywords See also References Keywords Kantorovich; Linear programming; Economics; Functional analysis L.V. Kantorovich was born in St. Petersburg, Russia, on January 6, 1912 and died on April 7, 1986. Kantorovich shared the 1975 Nobel Prize for Economics with T. Koopmans for their work on the optimal allocation of scarce resources [4,5]. Kantorovich was educated at Leningrad State Univ., receiving his doctorate in mathematics (1930) there at the age of 18. He became a professor at Leningrad in 1934, a position he held until 1960. He headed the department of mathematics and economics in the Siberian branch of the U.S.S.R. Academy of Sciences from 1961 to 1971 and then served as head of the research laboratory at Moscow’s Institute of National Economic Planning (1971–1976). Kantorovich was elected to the prestigious Academy of Sciences of the Soviet Union (1964) and was awarded the Lenin

Prize in 1965. For detailed interesting information on the life and scientific views of Kantorovich, see his paper [2] Kantorovich was one of the first to use linear programming as a tool in economics. His most famous work is [1]. The characteristic of Kantorovich’s work is a combination of theoretical and applied research. His first works concerned delicate problems of set theory. Later he became one of the first Soviet specialists on functional analysis. In the 1930s he laid down the foundations of the theory of semi-ordered spaces which constitutes now a vast chapter of functional analysis bordering algebra and measure theory. At the same time he anticipated the ideas of the future theory of generalized functions which became current only in the 1950s. Kantorovich obtained beautiful results on approximation theory. The approach to Sobolev’s embedding theorem suggested by Kantorovich (based on his estimations of integral operators) is well known.

See also  History of Optimization  Linear Programming

References 1. Kantorovich LV (1959) The best use of economic resources 2. Kantorovich LV (1987) My way in mathematics. Uspekhi Mat Nauk 42(2) 3. Leifman LJ (ed) (1990) Functional analysis, optimization and mathematical economics (dedicated to the memory of L.V. Kantorovich). Oxford Univ. Press, Oxford 4. Linbeck A (ed) (1992) Nobel Lectures Economic Sciences 1969–1980. World Sci., Singapore 5. Mäler KG (ed) (1992) Nobel Lectures Economic Sciences 1981–1990. World Sci., Singapore

1789

1790

K

Kolmogorov Complexity

Kolmogorov Complexity VICTOR KOROTKICH Central Queensland University, Mackay, Australia MSC2000: 90C60 Article Outline Keywords and Phrases See also References Keywords and Phrases Complexity; Computation; Information; Randomness; Algorithmic complexity; Algorithmic information; Algorithmic entropy; Solomonoff–Kolmogorov–Chaitin complexity; Descriptional complexity; Shortest program length; Algorithmic randomness In the mid 1960s R. Solomonoff [16], A.N. Kolmogorov [11] and G. Chaitin [4] independently invented the field now generally known as Kolmogorov complexity. It is also known variously as algorithmic complexity, algorithmic information, algorithmic entropy, SolomonoffKolmogorov-Chaitin complexity, descriptional complexity, shortest program length, algorithmic randomness and others. An extensive history of the field can be found in [14]. The Kolmogorov complexity formalizes the notion of amount of information necessary to uniquely describe a digital object. A digital object means one that can be represented as a finite binary string, for example, a genome, an Ising microstate, or an appropriately coarse-grained representation of a point in some continuum state space. In particular, the Kolmogorov complexity of a string of bits is the length of the shortest computer program that prints that string and stops running. The Kolmogorov complexity of an object is a form of absolute information of the individual object. This is not possible to do by Shannon’s information theory. Unlike Kolmogorov complexity, information theory is only concerned with the average information of a random source [14]. Solomonoff was addressing the problem: How do we assign a priori probabilities to hypotheses when we

begin an experiment? He represented a scientist’s observations as a series of binary digits and weighted together all the programs for a given result into a probability measure. The scientist seeks to explain these observations through a theory, which can be regarded as an algorithm capable of generating the series and extending it, that is, predicting future observations. For any given series of observations there are always several competing theories and the scientist must choose among them. The model demands that the smallest algorithm, the one consisting of the fewest bits, be selected. Stated another way, this rule is the familiar formulation of Occam’s razor: Given differing theories of apparently equal merit, the simplest is to be preferred [6]. Thus in the Solomonoff model a theory that enables one to understand a series of observations is seen as a small computer program that reproduces the observations and makes predictions about possible future observations. The smaller the program, the more comprehensive the theory and the greater the degree of understanding. Observations that are random cannot be reproduced by a small program and therefore cannot be explained by a theory. In addition the future behavior of a random system cannot be predicted. For random data the most compact way for the scientist to communicate his or her observations is to publish them in their entirety [6]. Kolmogorov and Chaitin independently suggested that computers be applied to the problem of defining what is meant by a random finite binary string of 0s and 1s [5]. In the traditional foundations of the mathematical theory of probability, as expounded by Kolmogorov in his classic [10], there is no place for the concept of an individual random string of 0s and 1s. Yet it is not altogether meaningless to say that the string 001110100001110011010000111110 is more random than the string 000000000000000000000000000000 for we may describe the second string as thirty 0s, but there is no shorter way to specify the first string than by just writing it all out [5]. We believe that the random strings of a given length are those that require the longest programs. Those

Kolmogorov Complexity

strings of length n that can be obtained by putting into a computer program much shorter than n are the nonrandom strings. The more possible it is to compress a binary string into a short program calculation, the less random is the string. Solomonoff, Kolmogorov and Chaitin saw that the notion of ‘computable’ lay at the heart of their questions. They arrived at equivalent notions, showing that these two questions are fundamentally related. The Kolmogorov complexity of a string is low if it can easily be obtained by a computation, whereas it will be high if it is difficult to obtain it [1]. The difficulty is measured by the length of the shortest program that computes the string on a universal Turing machine. The use of Turing machines to determine the length of the shortest program that computes a particular bit-string is intuitive: since a universal Turing machine can simulate any other Turing machine, the length of the program computing string s on Turing machine T, can only differ from the program computing the same string on Turing machine T 0 by a finite length l(T, T 0 ), the length of the prefix code necessary to simulate T on T 0 . As this difference is constant (for each string s), the length of the shortest program to compute string s on a universal Turing machine is constant in the limit of infinitely long strings s and the Kolmogorov complexity of string s is defined as K(s) D min fjpj : s D A T (p)g ; where |p| stands for the length of program p and AT (p) represents the result of running program p on Turing machine T. This measure can be illustrated by a few examples. A blank tape (the string with all zeros) is clearly a highly regular string and correspondingly its Kolmogorov complexity will be low. Indeed, the program needed to produce this string can be very short: print zero, advance, repeat. The same is true, of course, for every string with a repetitive pattern. Another way of viewing algorithmic regularity is by saying that an algorithmically regular string can be compressed to a much smaller size: the size of the smallest program that computes it. More interesting is the regularity of a string that can be obtained by the application of a finite but nontrivial algorithm, such as the calculation of the transcendental number . The string representing the binary equivalent of certainly

K

appears completely random, yet the minimal program necessary to compute it is finite. Thus, such a string is also classified as algorithmically regular (though not quite as regular as the blank tape) [1]. Kolmogorov complexity also provides a means to define randomness in this context. According to the Kolmogorov measure, a string r is declared random if the size of the smallest program to compute r is as long as r itself, i. e., K(r) jrj : Why should this definition of randomness be preferable to any other we might come up with? The answer to that was provided by P. Martin-Loef, who was a postdoc of Kolmogorov. Roughly, he demonstrated that the definition ‘an n-bit string s is random if and only if K(s)  n’ ensures that every such individual random string possesses with certainty all effectively testable properties of randomness that hold for strings produced by random sources on the average [9]. The algorithmic definition of randomness provides a new foundation for the theory of probability. By no means does it supersede classical probability theory, which is based on an ensemble of possibilities, each of which is assigned a probability. Rather, the algorithmic approach complements the ensemble method by giving precise meaning to concepts that had been intuitively appealing but that could not be formally adopted [6]. The Kolmogorov complexity of a string s is also defined as the negative base-2 logarithm of the string’s algorithmic probability P(s) [2,18]. This in turn is defined as the probability that a standard universal computer T, randomly programmed, would embark on a computation yielding s as its sole output, afterward halting. The algorithmic probability P(s) may be thought of a weighted sum of contributions from all programs that produce s, each weighted according to the negative exponential of its binary length. P Turning to the sum of P(s) over outputs, the sum s P(s) is not equal to unity, because, as is well known, an undecidable subset of all universal computations fail to P halt and so produce no output. Therefore s P(s) is an uncomputable irrational number less than 1. This number, called Chaitin’s Omega [7], has many remarkable properties [8], such as the fact that its uncomputable digit string is a maximally compressed form of the information required to solve the halting problem [2].

1791

1792

K

Kolmogorov Complexity

Though very differently defined, Kolmogorov complexity is typically very close to ordinary statistical enP tropy  p log p in value. To take a simple example, it is known that almost all N-bit strings drawn from a uniform distribution (of statistical entropy N bits) have Kolmogorov complexity nearly N bits. More generally, in any concisely describable ensemble of digital objects, i. e., a canonical ensemble of Ising microstates at a given temperature, the ensemble average of the objects’ Kolmogorov complexity closely approximates the whole ensemble’s statistical entropy [2,18]. In the case of continuous ensembles, the relation between Kolmogorov complexity and statistical entropy is less direct because it depends on the choice of coarse-graining. Some of the conceptual issues involved are discussed in [17]. The basic flaw in the Kolmogorov construction (as far as physical complexity is concerned) is the absence of a context [1]. This is easily rectified by providing the Turing machine with a tape u, which represents the physical ‘universe’, while the Turing machine with u as input computes various strings from u. The conditional Kolmogorov complexity of a string s is defined as the length of the shortest program that computes s given string u [12] K(sju) D min fjpj : s D A T (p; u)g ; where the notation AT (p, u) is introduced as the result of the computation running p on Turing machine T with u as input tape. The conditional complexity measures the remaining randomness in string s, i. e., it counts those bits that are not correlated with bits in u. In other words, the program p is the maximally compressed string containing those bits that cannot be computed from u, as well as the instructions necessary to compute those bits of s that can be obtained from u. The latter part of the program is of vanishing length in the limit of infinitely long strings, which implies that the program p mainly contains the remaining randomness of s. The mutual complexity is defined by K(s : u) D K(s)  K(sju); which clearly just measures the number of bits that mean something in the universe u. Let us consider K(s : u) in more detail. Its meaning becomes clearer if instead of considering a string s obtained by Turing machine T with universe u, the ensemble of strings S that can be obtained from a universe

u with T is considered. This ensemble can be thought of as a probabilistic mixture subject to random bit-flips. In other words, the output tapes to be connected to a heat bath can be imagined. In that case, an entropy H(S) can be associated with the ensemble of strings S. Consider a Turing machine operating on u, a specific universe. Obtaining s from u then constitutes a measurement on the universe U and consequently not only reduces the conditional entropy of S given u, but also the conditional entropy of U given s. Note that the universe is assumed here to be fully known, i. e., there is only one tape u in the ensemble U. While this must not strictly be so, sometimes it is convenient to assume that there is no randomness in the universe. Also, the length of the smallest program that computes s from u, averaged over the possible realizations of s, then just equals the conditional entropy of S given u [1]. It is known that the average Kolmogorov complexity over an ensemble of strings just equals the entropy of the ensemble. Then X p(sju) log p(sju) (1) H(Sju) D hK(sju)iS D  s

and hK(s)  K(sju)iS D H(S)  H(Sju): Note that (1) is not strictly a conditional entropy, as no average over different realizations of the universe takes place. Indeed, it looks just like a conventional Shannon entropy only with all probabilities being probabilities conditional on u. Determining the Kolmogorov complexity of a string is a difficult problem. For this reason, Kolmogorov complexity remained more of a curiosity than a practical mathematical tool. In the last few years, mainly due to P. Vitanyi and M. Li, a significant progress in using Kolmogorov complexity has been made [14]. In particular, several successful applications of Kolmogorov complexity in the theory of computation are made and the general pattern of the incompressibility method emerged [14]. The incompressibility method and Kolmogorov complexity is truly a versatile mathematical tool. The incompressibility method is a basic general technique such as the ‘pigeon hole argument’, the ‘counting method’ or the ‘probabilistic method’. It is

Kolmogorov Complexity

a sharper relative of classical information theory (absolute information of individual object rather than average information over a random ensemble) and yet satisfies many of the laws of classical information theory, although with a slight error term. Applications of Kolmogorov complexity have been given in a number of areas, including [14]:  randomness of individual finite objects or infinite strings, Martin–Loef tests for randomness, Gödel’s incompleteness result, information theory of individual objects;  universal probability, general inductive reasoning, inductive inference, prediction, mistake bounds, computational learning theory, inference in statistics;  the incompressibility method, combinatorics, graph theory, Kolmogorov 0–1 laws, probability theory;  theory of parallel and distributed computation, time and space complexity of computations, average case analysis of algorithms, language recognition, string matching, routing in computer networks, circuit theory, formal language and automata theory, parallel computation, Turing machine complexity, complexity of tapes, stacks, queues, average complexity, lower bound proof techniques;  structural complexity theory, oracles;  logical depth, universal optimal search, physics and computation, dissipationless reversible computing, information distance and picture similarity, thermodynamics of computing, statistical thermodynamics and Boltzmann entropy. Based on the Turing model of computation, the field of Kolmogorov complexity probably will need to be modified to account for the new quantum modes of computation. From recent studies there appear more facts that this modification is likely to be based on notions that go beyond the framework of space-time (for example [15]) and sought within the world view, which considers natural systems not as separate entities but as integrated parts of a undivided whole [3]. An attempt to contribute to such a modification is made in [13]. The results are based on a mathematical structure, called a web of relations, that is a collection of hierarchical formations of integer relationships. The web of relations allows to introduce a concept of structural complexity to measure the complexity of a binary string in terms of corresponding hierarchical forma-

K

tions of integer relationships. Importantly, the concept of structural complexity is based on the integers only and does not rely on notions that derive from spacetime. See also  Complexity Classes in Optimization  Complexity of Degeneracy  Complexity of Gradients, Jacobians, and Hessians  Complexity Theory  Complexity Theory: Quadratic Programming  Computational Complexity Theory  Fractional Combinatorial Optimization  Information-based Complexity and Information-based Optimization  Mixed Integer Nonlinear Programming  Parallel Computing: Complexity Classes References 1. Adami C (1998) Introduction to artificial life. Springer, Berlin 2. Bennett C (1982) The thermodynamics of computation a review. Internat J Theoret Physics 21:905–940 3. Bohm D (1980) Wholeness and the implicate order. Routledge and Kegan Paul, London 4. Chaitin G (1966) On the length of programs for computing binary sequences. J ACM 13:547–569 5. Chaitin G (1970) On the difficulty of computations. IEEE Trans Inform Theory 16:5–9 6. Chaitin G (1975) Randomness and mathematical proof. Scientif Amer 232:47–52 7. Chaitin G (1975) A theory of program size formally identical to information theory. J ACM 22:329–340 8. Gardner M (1979) Mathematical games. Scientif Amer 10:20–34 9. Kirchherr W, Li M, Vitanyi P (1997) The miraculous universal distribution. Math Intelligencer 19(4):7–15 10. Kolmogorov A (1950) Foundations of the theory of probability. Chelsea, New York 11. Kolmogorov A (1965) Three approaches to the definition of the concept “quantity of information”. Probl Inform Transmission 1:1–7 12. Kolmogorov A (1983) Combinatorial foundations of information theory and the calculus of probabilities. Russian Math Surveys 38:29 13. Korotkich V (1999) A mathematical structure for emergent computation. Kluwer, Dordrecht 14. Li M, Vitanyi P (1997) An introduction to Kolmogorov complexity and its applications. second, revised and expanded edn. Springer, Berlin

1793

1794

K

Krein–Milman Theorem

15. Penrose R (1995) Shadows of the mind. Vintage, London 16. Solomonoff R (1964) A formal theory of inductive inference. Inform and Control 7:1–22 17. Zurek W (1989) Algorithmic randomness and physical entropy. Phys Rev A 40:4731–4751 18. Zvonkin A, Levin L (1970) The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Math Surv 256:83–124

the (relative) boundary of C there exists a supporting hyperplane H of C, that contains u. So u 2 C \ H = conv(ext(C \ H)) (by induction, since dim(C \ H)  d1). But ext(C \ H) = ext(C) \ H and so u 2 ext(C) \ H  ext(C). An analogous result holds for v. Since x 2 [u, v], x = u + (1  )v with  2 ]0, 1[ and so x 2 conv(ext(C)). See also

Krein–Milman Theorem GABRIELE E. DANNINGER-UCHIDA University Vienna, Vienna, Austria

 Carathéodory Theorem  Linear Programming References

MSC2000: 90C05 Article Outline Keywords See also References Keywords Convex; Convex hull; Extreme point A theorem stating that a compact closed set can be represented as the convex hull of its extreme points. First shown by H. Minkowski [4] and studied by some others ([1,2,5]), it was named after the paper by M. Krein and D. Milman [3]. See also, for example, [6,7,8]. Theorem 1 Let C  Rn be convex and compact, let S = ext(C) be the set of extreme points of C. Then conv(S) = C, i. e. the convex hull of the extreme points of C coincides with the set C. Proof Since S  C, conv(S)  conv(C) = C. So we are left to show that C  conv(S). We prove this by induction. Let d = dim C. For d = 1(C = ;), d = 0 and d = 1 the proof is trivial. Let us assume that the theorem is true for all convex compact sets of dimension d  1  0. If x 2 C, but not in conv(S), there exists a line segment in C such that x is in the interior of it (since x is not an extreme point). This line segment intersects the (relative) boundary of C in two points u and v. At least one of them is not extremal, else x 2 conv(S). Assume u 62 S. Since u is on

1. Klee VL (1957) Extremal structure of convex sets. Arch Math 8:234–240 2. Klee VL (1958) Extremal structure of convex sets II. Math Z 69:90–104 3. Krein M, Milman D (1940) On extreme points of regular convex sets. Studia Math 9:133–138 4. Minkowski H (1911) Gesammelte Abhandlungen. Teubner, Leipzig-Berlin 5. Price GB (1937) On the extreme points of convex sets. Duke Math J 3:56–67 6. Rockafellar RT (1970) Convex analysis. Princeton Univ. Press, Princeton 7. Valentine FA (1964) Convex sets. McGraw-Hill, New York 8. Wets RJ-B (1976) Grundlagen Konvexer Optimierung. Lecture Notes Economics and Math Systems, vol 137. Springer, Berlin

Kuhn–Tucker Optimality Conditions Kuhn–Tucker Conditions, KT Conditions PANOS M. PARDALOS Center for Applied Optim. Department Industrial and Systems Engineering, University Florida, Gainesville, USA MSC2000: 90C30 Article Outline Keywords See also References

Kuhn–Tucker Optimality Conditions

Keywords Optimality conditions; KT point; Constraint qualification; Complexity In this article we discuss necessary conditions for local optimality for an optimization problem in terms of a system of equations and inequalities which form the well-known Kuhn–Tucker KT conditions. Under suitable convexity assumptions about the objective function and the feasible domain, the Kuhn–Tucker conditions are sufficient even for global optimality. However, in the nonconvex case sufficiency is no longer guaranteed. The material of this article has been adapted from [2] and [4]. First, we consider the nonlinear optimization problem with inequality constraints, min f (x); x2S

(1)

where S = {x: g i (x)  0, i = 1, . . . , p} Rn . We may assume that all the functions in the optimization problem are continuously differentiable on an open set containing S. A vector d 2 Rn is called a feasible direction at x if there exists  > 0 such that x + d 2 S for every 0 <    . Let Z(x ) denote the set of all feasible directions at x . A local minimum point x 2 S satisfies d| rf (x )  0 for every feasible direction d 2 Z(x ). Let I(x  ) :D fi 2 f1; : : : ; pg : g i (x  ) D 0g be the index set of the active constraints at x . Recall that d| rg i (x )< 0 implies g i (x + d) < g i (x ) for 0 <    i , for some  i > 0, and d| rg i (x ) > 0 implies  i , for some e  i > 0. g i (x + d) > g i (x ), 0 <   e Moreover, each constraint which is not active in x , i. e., for which we have g i (x ) < 0, does not influence Z(x ), because g i (x)  0 holds in a neighborhood of x . It follows that {d: d| rg i (x ) < 0, i 2 I(x )} Z(x ) {d: d| rg i (x )  0, i 2 I(x )} =: L(x ). It is easy to see that for linearly constrained prob| lems, where g i (x) = ai x  bi , ai 2 Rn \ {0}, bi 2 R, we   have Z(x ) = L(x ). On the other hand, one can readily construct examples of nonlinear constraints such that ˚ Z(x  ) D d : d > r g i (x  ) < 0; i 2 I(x  ) :

K

Now {d: d| rg i (x ) < 0} is an open set whereas L(x ) is closed. Recall that the closure cl M of a set M  Rn is the smallest closed set containing M. Because of the continuity of the inner product d| rg i (x ), we see that d| rf (x )  0 for every d 2 Z(x ) implies that d| rf (x )  0 for every d 2 cl Z(x ). Hence, the condition d| rf (x )  0 for every d 2 cl Z(x ) is as well necessary for x to be a local minimum point. Clearly, by the discussion so far, we have cl Z(x ) L(x ). One would expect that also cl {d: d| rg i (x < 0, i 2 I(x )} = L(x ), and hence cl Z(x ) = L(x ). Indeed, this is true, apart from a few rather pathological cases. An example of such a pathological case is S = {x 2 R: g 1 (x) := x2  0} and x = 0. Here we have S = Z(x ) = cl Z(x ) = {0}, but L(x ) = {d 2 R: d  2  0  0} = R. The constraints of the optimization problem minx 2 S f (x) are said to be regular in x 2 S when L(x ) = cl Z(x ). Every condition which ensures regularity in this sense is called a constraint qualification. Three of the most well-known constraint qualifications are given in the next theorem. Theorem 1 Each of the following conditions is a constraint qualification: |  g i (x) = ai x  bi , ai 2 Rn \ {0}, bi 2 R (i = 1, . . . , p; linear constraints).  g i (x) is convex (i = 1, . . . , p), and there exists x satisfying g i (x) < 0, . . . , p; Slater condition).  The vectors rg i (x ), i 2 I(x ) are linearly independent. The first two conditions ensure regularity in every x 2 S whereas the third requires knowledge of x . Finally, one applies the well-known Farkas lemma (cf.  Farkas lemma). This states that, whenever d| rf (x )  0 for every d satisfying d| rg i (x )  0, i 2 I(x ), there exists i  0, i 2 I(x ) such that r f (x  ) C

X

 i r g i (x  ) D 0:

i2I(x  )

Since I(x ) is not known in advance, one formulates this equation in the following equivalent form: Theorem 2 Let f , g i be continuously differentiable on an open set containing S, and let x be a local minimum point such that the constraints are regular in x . Then the following KT conditions hold:

1795

1796

K

Kuhn–Tucker Optimality Conditions

 g i (x )  0, i = 1, . . . , p;  there exist i  0 such that i g i (x ) = 0, i = 1, . . . , p; Pp  rf (x )+ iD1 i rg i (x ) = 0.

When the functions f , g i (i = 1, . . . , p) are convex, and the functions hi (x) are affine then the above two conditions are again sufficient for x to be a global minimum.

Theorem 3 The KT conditions are sufficient for a constrained global minimum at x provided that the functions f (x) and g i (x), i = 1, . . . , p, are convex.

Next, we consider the situation when Kuhn–Tucker theory is applied to nonconvex programming. We illustrate some difficulties arisen from nonconvexity by the following simple examples of concave minimization problems.

Proof By convexity of f (x) and g i (x) we have f (x)  f (x  ) C (x  x  )> r f (x  ); g(x)  g(x  ) C (x  x  )> r g(x  ): Multiplying the last inequalities by i and adding to the first one we obtain:

Example 5 Consider the problem ( min (x12 C x22 ) s.t.

x1  1:

p

f (x) C

X iD1



 f (x ) C

The KT conditions for this problem are

 i g i (x) p X iD1



 i g i (x ) "

 >

C (x  x )

x1  1 D 0;



r f (x ) C

1  p X

#  i r g i (x  ) :

iD1

Since the last two terms vanish because of the KT Pp conditions, this implies f (x)  f (x )  i21 i g i (x)  f (x ) for all x 2 S, that is x is a global minimum. Note that no constraint qualification is required in the above Theorem. Consider now problems with inequality and equality constraints min f (x); 

S D x:

D 0;

2x2

1  0;

D 0:

It is easy to see that the KT conditions are satisfied at x = (0, 0) with 1 = 0 and at x = (1, 0) with 1 = 2. The first is a global maximum. The second is neither a local minimum nor a local maximum. The problem has no local minima. Moreover, inf{f (x): x 2 S} = 1 since f is unbounded from below over S. Example 6 ( min 2x  x 2 s.t.

0  x  3:

The KT conditions for this problem are

x2S

where

2x1

1 (x1  1) D 0;

g i (x)  0 (i D 1; : : : ; p); h i (x) D 0 (i D 1; : : : ; t)

1 (x   3) D 0;

:

Theorem 4 Let f , g i (i = 1, . . . , p), hi (i = 1, . . . , t) be continuously differentiable in an open set containing S, and let x be a local minimum point. Further, assume that the vectors rg i (x ) (i 2 I(x )), rhi (x ) (i = 1, . . . , t) are linearly independent. Then the following KT conditions hold:  g i (x )  0 (i = 1, . . . , p), hi (x ) = 0 (i = 1, . . . , t).  There exist i  0 (i = 1, . . . , p) and i 2 R (i = 1, . . . , t) such that r f (x  ) C

p X iD1

 i g i (x  ) D 0

 i r g i (x  ) C

t X iD1

(i D 1; : : : ; p):

 i r h i (x  ) D 0;

2 x  D 0;

2(1  x  ) C 1  2 D 0; 1  0; 2  0: Since the objective function is strictly concave local minima occur at the endpoints of the interval [0, 3]. The point x = 3 is the global minimizer. The endpoints satisfy the KT conditions (x = 0 with 1 = 0, 2 = 2 and x = 3 with 1 = 4, 2 = 0). However, we can easily see that the KT conditions are also satisfied at x = 1 (with 1 = 2 = 0) and that this is a global maximum point. These examples show that for minimization problems with nonconvex functions KT points may not be local minima. Next, we consider the complexity of the problem of deciding existence of a Kuhn–Tucker point for

Kuhn–Tucker Optimality Conditions

quadratic programming problems. When the feasible domain is unbounded, we prove that the decision problem is NP-hard. Most classical optimization algorithms compute points that satisfy the Kuhn–Tucker KT conditions. When the feasible domain is not bounded it is not easy to check existence of a KT point. More precisely, consider the following quadratic problem ( min f (x) D c > x C 12 x > Qx (2) s.t. x0 where Q is an n × n symmetric matrix, and c 2 Rn . The Kuhn–Tucker optimality conditions for this problem become the following so-called linear complementarity problem (denoted by LCP(Q, c)): Find x 2 Rn (or prove that no such an x exists) such that Qx C c  0;

x  0;

x > (Qx C c) D 0:

Hence, the complexity of finding (or proving existence) of Kuhn–Tucker points for the above quadratic problem is reduced to the complexity of solving the corresponding LCP. Theorem 7 The problem LCP(Q, c), where Q is symmetric, is NP-hard. Proof Consider the LCP(Q, c) problem in Rn+3 defined by 1 0 I n e n e n 0n B e> 1 1 1C n C Q(nC3)(nC3) D B > @e 1 1 1A

K

This problem is known to be NP-complete [1]. Next we will show that LCP(Q, c) is solvable if and only if the associated knapsack problem is solvable. Obviously, if x solves the knapsack problem, then y = (a1 x1 , . . . , an xn , 0, 0, 0)| solves LCP(Q, c). Conversely, assume the point y solves LCP(Q, c) given above. Since Qy + c  0, y  0 we obtain yn+1 P = yn+2 = yn+3 = 0. This in turn implies that niD1 yi = b and 0  yi  ai . Finally, if yi < ai , then y| (Qy + c) = 0 enforces yi = 0. Hence, x = (y1 /a1 , . . . , yn /an ) solves the knapsack problem. Therefore, even in quadratic programming, the problem of ‘deciding whether a Kuhn–Tucker point exists’ is NP-hard. See also  Equality-constrained Nonlinear Programming: KKT Necessary Optimality Conditions  First Order Constraint Qualifications  High-order Necessary Conditions for Optimality for Abnormal Points  Implicit Lagrangian  Inequality-constrained Nonlinear Optimization  Lagrangian Duality: Basics  Rosen’s Method, Global Convergence, and Powell’s Conjecture  Saddle Point Theory and Optimality Conditions  Second Order Constraint Qualifications  Second Order Optimality Conditions for Nonlinear Optimization

n

0> n

1

1

1

| cn+3

and = (a1 , . . . , an ,  b, b, 0), where ai , i = 1, . . . , n, and b are positive integers, I n is the (n × n)-unit matrix and the vectors en 2 Rn , 0n 2 Rn are defined by e> n D (1; : : : ; 1);

0> n D (0; : : : ; 0):

Define now the following knapsack problem: Find a feasible solution to the system n X iD1

a i x i D b;

x i 2 f0; 1g (i D 1; : : : ; n):

References 1. Garey MR, Johnson DS (1979) Computers and intractability: A guide to the theory of NP-completeness. Freeman, New York 2. Horst R, Pardalos PM, Thoai NV (1995) Introduction to global optimization. Kluwer, Dordrecht 3. Kuhn HW, Tucker AW (1951) Nonlinear programming. In: Neyman J (ed) Proc. Second Berkeley Symp. Math. Statist. and Probab. Univ. Calif. Press, Berkeley, CA, pp 481–492 4. Pardalos PM, Rosen JB (1987) Constrained global optimization: Algorithms and applications. Lecture Notes Computer Sci, vol 268. Springer, Berlin

1797

Lagrange, Joseph-Louis

L

L

Lagrange, Joseph-Louis TANIA QUERIDO, DUKWON KIM University Florida, Gainesville, USA MSC2000: 01A99 Article Outline Keywords See also References Keywords Calculus of variations; Lagrange multipliers J.L. Lagrange (1736–1813) made significant contributions to many branches of mathematics and physics, among them the theory of numbers, the theory of equations, ordinary and partial differential equations, the calculus of variations, analytic geometry and mechanics. By his outstanding discoveries he threw the first seeds of thought that later nourished C.F. Gauss and N.H. Abel. During the first thirty years of his life he lived in Turin (France, now Italy) and, as a boy, his tastes were more classical than scientific. His interest in mathematics began while he was still in school when he read a paper by E. Halley on the uses of algebra in optics. He then began a course of independent study, and excelled so rapidly in the field of mathematical analysis that by the age of nineteen he was appointed professor at the Royal Artillery School and helped to found the Royal Academy of Science in 1757. His ideas had greatly impressed L. Euler, one of the giants of Euro-

pean mathematics. Euler and Lagrange, together, would join the first rank of the eighteenth century mathematicians, and their careers and research where often related [5]. In 1759 Lagrange focused his research in analysis and mechanics and wrote ‘Sur la Propagation du son dans les fluides’, a very difficult issue for that time [4]. From 1759 to 1761 he had his first publications in the ‘Miscellanea of the Turin Academy’. His reputation was established. Lagrange developed a new calculus which would enrich the sciences, called calculus of variations. In its simplest form the subject seeks to determine a functional R relationship y = f (x) such that an ba g(x, y) dx could produce a maximum or a minimum. It was viewed as a mathematical study of economy or the ‘best income’ [4]. That was Lagrange’s earliest contribution to the optimization area. In 1766, Lagrange was appointed the Head of the Berlin Academy of Science, succeeding Euler. In offering this appointment, Frederick the Great wanted to turn his Academy into one of the best institutes of its day, proclaiming that the ‘greatest mathematician in Europe’ should live near the ‘greatest king in Europe’ [1]. During this period, he had a prosperous time, developing important works in the field of calculus, introducing the strictness in the calculus differential and integral. Later (1767) he published a memoir on the approximation of roots of polynomial equations by means of continued fractions; in 1770 he wrote a paper considering the solvability of equations in terms of permutations on their roots. After Frederick’s death, Lagrange left Berlin and became a member of the Paris Academy of Science by the invitation of Louis XVI (1787). He remained in Paris for the rest of his career, making a lengthy treatise on

1799

1800

L

Lagrange-Type Functions

the numerical solution of equations, representing a significant portion of his mathematical research. His papers on solution of third - and fourth-degree polynomial equations by radicals received considerable attention. His methods, laid in the early development of group theory to solving polynomials, were later taken by E. Galois. Lagrange’s name was attached to one of the most important theorems of group theory [3]: Theorem 1 If o is the order of a subgroup g of a group G of order O, then o is a factor of O. In 1788 he published his masterpiece, the treatise ‘Méchanique Analytique’, which summarized and unified all the work done in the field of general mechanics since the time of I. Newton. This work, notable for its use of theory of differential equations, transformed mechanics into a branch of mathematical analysis. As W. Hamilton later said, ‘he made a kind of scientific poem’ [6]. In 1793, Lagrange headed a commission, which included P.S. Laplace and A. Lavoisier, to devise a new system of weights and measures. Out of this came the metric system. Lagrange developed the method of variation of parameters in the solution of nonhomogeneous linear differential equations. In the determination of maxima and minima of a function, say f (x, y, z, w), subject to constraints such as g(x, y, z, w) = 0 and h(x, y, z, w) = 0, he suggested the use of Lagrange multipliers to provide an elegant algorithm. By this method two undetermined constants  and  are introduced, forming the function F f + g + h, from the related equations F x = 0, F y = 0, F z = 0, F w = 0, g = 0, and h = 0, the multipliers  and  are then eliminated, and the problem is solved. This procedure and its variations have emerged as a very important class of optimization method [1,2]. One can characterize Lagrange’s contribution to optimization as his formalist foundation. Most of his results were retained and developed further by the following generations, who gave to his theory a different and practical course. By the end of his life, Lagrange could not think futuristically for the mathematics. He felt that other sciences such as chemistry, physics and biology would attract the ablest minds of the future. His pessimism was unfounded. Much more was to be forthcoming with

Gauss and his successors, making the nineteenth century the richest in the history of mathematics. See also  Decomposition Techniques for MILP: Lagrangian Relaxation  Integer Programming: Lagrangian Relaxation  Lagrangian Multipliers Methods for Convex Programming  Multi-objective Optimization: Lagrange Duality References 1. Bertsekas DP (1986) Constrained optimization and Lagrange multiplier methods. Athena Sci., Belmont, MA 2. Boyer CB (1968) A history of mathematics. Wiley, New York 3. Fraser CG (1990) Lagrange’s analytical mathematics, its Cartesian origins and reception in Comte’s positive philosophy. Studies History Philos Sci 21(2):243–256 4. Julia G (1942-1950) La vie et l’oeuvre de J.-L. Lagrange. Enseign Math 39:9–21 5. Koetsier T (1986) Joseph Louis Lagrange (1736-1813): His life, his work and his personality. Nieuw Arch Wisk (4) 4(3):191–205 6. Simmons GF (1972) Differential equations, with applications and historical notes. McGraw-Hill, New York

Lagrange-Type Functions A. M. RUBINOV1 , X. Q. YANG2 1 School of Information Technology and Mathematical Sciences, The University of Ballarat, Ballarat, Australia 2 Department of Applied Mathematics, The Hong Kong Polytechnic University, Kowloon, Hong Kong MSC2000: 90C30, 90C26, 90C46 Article Outline Keywords and Phrases References Keywords and Phrases Lagrange-type function; IPH functions; Multiplicative inf-convolution; Zero duality gap; Exact penalty function

L

Lagrange-Type Functions

Lagrange and penalty function methods provide a powerful approach, both as a theoretical tool and a computational vehicle, for the study of constrained optimization problems. However, for a nonconvex constrained optimization problem, the classical Lagrange primaldual method may fail to find a minimum as a zero duality gap is not always guaranteed. A large penalty parameter is, in general, required for classical quadratic penalty functions in order that minima of penalty problems are a good approximation to those of the original constrained optimization problems. It is well-known that penalty functions with too large parameters cause an obstacle for numerical implementation. Thus the question arises how to generalize classical Lagrange and penalty functions, in order to obtain an appropriate scheme for reducing constrained optimization problems to unconstrained ones that will be suitable for sufficiently broad classes of optimization problems from both the theoretical and computational viewpoints. One of the approaches for such a scheme is as follows: an unconstrained problem is constructed, where the objective function is a convolution of the objective and constraint functions of the original problem. While a linear convolution leads to a classical Lagrange function, different kinds of nonlinear convolutions lead to interesting generalizations. We shall call functions that appear as a convolution of the objective function and the constraint functions, Lagrange-type functions. It can be shown that these functions naturally arise as a result of a nonlinear separation of the image set of the problem and a cone in the image-space of the problem under consideration (see [4]). The class of Lagrange-type functions includes also augmented Lagrangians, corresponding to the so-called canonical dualizing parameterization. However, augmented Lagrangians constructed by means of some general dualizing parameterizations cannot be included in this scheme. Consider the following problem P( f ,g): min f (x) subject to

x 2 X; g(x)  0 ;

where X is a metric space, f is a real-valued function defined on X, and g maps X into Rm , that is, g(x) D (g1 (x); : : : ; g m (x)), where g i are real-valued functions defined on X. We assume that the set of feasible solutions X0 D fx 2 X : g(x)  0g is nonempty and that the objective function f is bounded from below on X.

Let ˝ be a set of parameters and h : R1Cm ˝ ! R be a function. Let  2 R. Then the function L(x; !) D h( f (x)  ; g(x); !) C ;

x 2 X; ! 2 ˝ ; (1)

is called a Lagrange-type function for problem P( f ,g) corresponding to h and , and h is called a convolution function. If h is linear with respect to the first variable, more specifically: h(u; v; !) D u C (v; !) ; where : Rm  ˝ ! R is a real-valued function, then the parameter  can be omitted. Indeed, for each  2 R, we have L(x; !) D f (x) C (g(x); !) : However in general nonlinear situation the presence of  is important and different  lead to Lagrangetype functions with different properties. One of the possible choices of the number  is  D f (x ) where x is a reference point, in particular x is a solution of P( f ,g) (see [4]). Then the Lagrangetype function has the form L(x; !) D h( f (x)  f (x ); g(x); !) C f (x ) ; x 2 X; ! 2 ˝ : The Lagrange-type function (1) is a very general scheme and includes linear Lagrange functions, classical penalty functions, and augmented Lagrange functions as special cases. m and p be a real-valued function deLet ˝ D RC 1Cm . Define fined on R h(u; v; !) D p(u; !1 v1 ; : : : ; !m v m ) :

(2)

The Lagrange-type function has the form L p (x; !) D p( f (x)  ; !1 g1 (x); : : : ; !m g m (x)) C  : We can obtain fairly good results if the function p enjoys some properties. In particular, we assume that (i) p is increasing; (ii) p(u; 0m )  u; for all u 2 R. (Here 0m is the origin of Rm .) One more assumption is useful for applications.

1801

1802

L

Lagrange-Type Functions

(iii) p is positively homogeneous (p(x) D p(x) for  > 0). If both (i) and (iii) hold, then p is called an IPH function. Let p be a real-valued function defined on R1Cm and h be a convolution function defined by (2). Then (a) If p enjoys properties (i) and (ii), then m 8u 2 R; v 2 R :

sup h(u; v; w)  u; w2˝

(b) If p is an IPH function and p(u; e i ) > 0, where ei is the i-th unit vector, i D 1; : : : ; m, then sup h(u; v; w) D C1;

m 8v … R :

w2˝

Then the Lagrange-type function, corresponding to  D 0, coincides with the augmented Lagrangian [5], that is, L(x; (y; r)) D h( f (x); g(x); (y; r)) D

inf

zCg(x)0

( f (x)  [y; z] C r(z)) :

4) Morrison-type functions. Let ˝ D RC and h(u; v; !) D ((u  !)C )2 C (v1C ; : : : ; v C m); where  is an augmenting function. Then the Lagrangetype function corresponding to  = 0 has the form L(x; !) D (( f (x)  !)C)2 C (g1 (x)C ; : : : ; g m (x)C ) :

We now give some examples of Lagrange-type functions. First two examples correspond to functions of the form (2). P 1) Let p(u; v) D uC m iD1 v i . Then L p (x; !) D f (x)C Pm ! g (x) coincides with the classical Lagrange iD1 i i function. P C v C D max(v; 0). 2) Let p(u; v) D u C m iD1 v i where Pm C coinThen L p (x; !) D f (x) C iD1 ! i g i (x) cides with the classical (linear) penalty function. If P C 2 p(u; v) D u C m iD1 (v i ) , then L p (x; !) D f (x) C Pm C 2 iD1 ! i (g i (x) ) is a quadratic penalty function. We now give the definition of a penalty-type function. Let ˝ be a set of parameters and h : R1Cm  ˝ ! R be a convolution function with the property: h(u; v; !) D u;

m ;! 2 ˝ : u 2 R; v 2 R

Then the Lagrange-type function L(x,!), corresponding to h, is called a penalty-type function. Next two examples cannot be presented in the form (2). 3) Augmented Lagrangians Let  : Rm ! R be an augmenting function, i. e., (0) D 0 and (z) > 0; for z ¤ 0; and ˝  f(y; r) : y 2 Rm ; r  0g be a set of parameters satisfying (0; 0) 2 ˝ and (y; r) 2 ˝ implying (y; r 0 ) 2 ˝, for all r 0  r. Let h : Rm  ˝ ! R be the convolution function defined by h(u; v; (y; r)) D

inf (u  [y; z] C r(z))

zCv0

D u C inf ([y; z] C r(z)) : zCv0

Functions of this kind have been introduced by Morrison [6]. Consider problem P( f ,g), a convolution function h : R1Cm  ˝ ! R and the corresponding Lagrangetype function L(x; !) D h( f (x)  ; g(x); !) C  : ¯ D R [ f1; C1g The dual function q : ˝ ! R of P( f ,g) with respect to h and  is defined by q(!) D inf h( f (x)  ; g(x); !) C ; x2X

! 2˝:

Consider the dual problem to P( f ,g) with respect to h and : max q(!);

subject to ! 2 ˝ :

We are interested in the following questions: Find conditions under which 1) the weak duality holds, i. e., M( f ; g) :D inf f (x)  sup q(!) :D M  ( f ; g) ; x2X0

!2˝

2) the zero duality gap property holds, i. e., inf f (x) D sup q(!) ;

x2X0

!2˝

3) an exact Lagrange parameter exists, i. e., the weak duality holds and there exists !¯ 2 ˝ such that ¯ ; inf f (x) D inf L(x; !)

x2X0

x2X

L

Lagrange-Type Functions

4) a strong exact parameter exists: there exists an exact parameter !¯ 2 ˝ such that argminP( f ; g) :D argmin x2X0 f (x)

inf h(u; v; !)  u  ;

¯ ; D argmin x2X L(x; !)

!2˝

5) a saddle point exists and generates a solution of P( f ,g). The first part of this question means that there exists (x ; ! ) 2 X  ˝ such that L(x; ! )  L(x ; ! )  L(x ; !);

(3)

x 2 X ;! 2 ˝ :

The second part means that (3) implies x 2 argmin P( f ; g). The weak duality allows one to estimate from below the optimal value M( f ,g) by solving the unconstrained problem infx2X L(x; !). The zero duality gap property allows one to find M( f ,g) by solving a sequence of unconstrained problems infx2X L(x; ! t ) where f! t g  ˝. The existence of an exact Lagrange parameter !¯ means that M( f ,g) can be found by solving one unconstrained ¯ The existence of a strong exact problem infx2X L(x; !). parameter !¯ means that the solution set of P( f ,g) is the ¯ same as that of minx2X L(x; !). 1Cm  ˝ ! R be a convolution function Let h : R such that sup h(u; v; !)  u; !2˝

m for all (u; v) 2 R  R : (4)

;w 2

m RC

;

and p : R1Cm ! R is an IPH function satisfying p(1; 0m )  1;

p(1; 0m )  1 :

Assume that  is a lower estimate of the function f over the set X, i. e., f (x)  b > 0, for all x 2 X. Then, in order to establish the weak duality, we need only to consider convolution functions defined on [b; C1)  Rm  ˝ such that sup h(u; v; !)  u; !2˝

(6)

and that, for each c > 0, there exists !¯ 2 ˝ such that ¯  cr(v); h(u; v; !)

8u  b; v 2 Rm ;

(7)

m : where r : Rm ! R is such that r(v)  0 () v 2 R Assume further that

( f 1 ) The function f is uniformly positive on X 0 , i. e., inf f (x) D M( f ; g) > 0 ;

x2X0

( f 2 ) The function f is uniformly continuous on an open set containing the set X 0 ; ( g) The mapping g is continuous and the set-valued mapping D(ı) D fx 2 X : r(g(x))  ıg is upper semi-continuous at the point ı D 0. Theorem 1 Under the assumptions (5)–(7) and ( f 1 ), ( f 2 ) and (g), the zero duality gap property holds for P( f ,g) with respect to the Lagrange-type function L(x; !), corresponding to h and  = 0.

h(u; v; !) D p(u; !1 v1 ; : : : ; !m v m ) ;

h(u; v; w) D p(u; w1 v1 ; : : : ; w m v m ) ; (u; v) 2 R

8u  b; r(v)  ı ;

Let b  0. Define a convolution function h : [b; C1)  Rm ! R by

Then the weak duality holds. Condition (4) can be guaranteed if

1Cm

To investigate the zero duality gap property, we further assume that, for any 2 (0; b), there exists ı > 0 such that

m 8(u; v) 2 [b; C1)  R : (5)

where p : RC  Rm ! R is an increasing function satisfying p(u; 0m )  u;

for all u  0 :

(8)

Consider the P( f ,g) with uniformly positive objective function f on X. Let L be the Lagrange-type function defined by L(x; !) D p( f (x); !1 g1 (x); : : : ; !m g m (x)) ; where p is defined on RC Rm . Define the perturbation function ˇ(y) of P( f ,g) by ˇ(y) D inff f (x) : x 2 X; g(x)  yg;

y 2 Rm :

1803

1804

L

Lagrange-Type Functions

Theorem 2 Let p be a continuous increasing function satisfying (7). Let the zero duality gap property with respect to p holds. Then the perturbation function ˇ is lower semi-continuous at the origin. Further assume that p satisfies the following property: there exist positive numbers a1 ; : : : ; a m such that, for all u > 0; (v1 ; : : : ; v m ) 2 Rm , we have p(u; v1 ; : : : ; v m )  max(u; a1 v1 ; : : : ; a m v m ) :

(9)

Theorem 3 Assume that p is an increasing convolution function that possesses properties (8) and (9). Let perturbation function ˇ of problem P(f ,g) be lower semicontinuous at the origin. Then the zero duality gap property with respect to p holds. Remark 1 The perturbation function ˇ depends on P( f ,g) and doesn’t depend on the exogenous function p. It is worth noting that Theorems 2 and 3 establish equivalence relations between the zero duality gap property with respect to different p from a broad class of convolution functions. Remark 2 If p is a linear function, then the lower semicontinuity does not imply the zero duality gap property, so we need to impose a condition that does not hold for linear functions. This is the role of (9). The results similar to Theorem 2 and Theorem 3 hold also for penalty type functions, where p(u,v) is m and L(x; !) D a function defined on RC  RC C C p( f (x); !1 g1 (x) ; : : : ; g m (x) ). In such a case (9) m . This requireshould be valid only for u > 0, v 2 RC ment is very weak and is valid for many increasing funcP tions including the function p(u; v) D u C m iD1 v i . Let the Lagrange-type function be of the following form L(x; !) D f (x) C (g(x); !);

x 2 X; ! 2 ˝ :

Consider set K of functions : Rm  ˝ ! R with the following two properties (i) (; !) is lower semi-continuous for all ! 2 ˝; m : (ii) sup!2˝ (v; !) D 0, for all v 2 R Consider a point (x ; ! ) 2 X  ˝ such that L(x ; ! ) D min L(x; ! ) ;

(10)

(g(x ); ! ) D 0 :

(11)

x2X

Theorem 4 Let 2 K: If (10) and (11) hold for x 2 X0 and ! 2 ˝, then ! is an exact Lagrange parameter. The most advanced theory has been developed for two special classes of Lagrange-type functions. One of them is augmented Lagrangians (see article in encyclopedia). The other class consists of penalty-type functions for problems with a positive objective and a single constraint. This penalty-type functions are composed by convolutions functions of the form (2) with IPH functions p. Remark 3 Consider problem P( f ,g) with m constraints g1 ; : : : ; g m . We can convert these constraints to a single one by many different ways. In particular, the system g1 (x)  0; : : : ; g m (x)  0 is equivalent to the single P C inequality f1 (x) :D m iD1 g i (x)  0. The function f 1 is non-smooth. If all functions g i (x) are smooth then a smoothing procedure can be applied to f 1 (see [13] for details). Problems with a single constraint are convenient to be dealt with from many points of view. Let P( f , f 1 ) be a problem with a positive objective f and a single constraint f 1 . We consider here only IPH functions sk defined on R2C by: s k (u; v) D (u k C v k )1/k ;

u; v  0 :

(12)

(Many results that are valid for sk can be extended also for IPH functions p : R2C ! RC with properties p(1, 0) = 1, limu!C1 p(1; u) D C1.) A penalty-type function LC k corresponding to sk has k (x; d) D ( f (x) C d k f 1C (x) k )1/k . Here dk is the form LC k a penalty parameter. It can be shown that the exact parameter does not exist if k > 1 for the ‘regular’ problems in a certain sense, so we will here consider only the classical penalty function with k=1 and lower order penalty functions with k < 1. It can be shown that the existence of an exact parameter for k  1 implies the existence of exact parameters for k0 with 0 < k 0 < k. One of the main questions that can be studied in the framework of this class of penalty-type functions is the size of exact penalty parameters. Generally speaking, we can diminish the size of exact parameter using the choice of k and some simple reformulations of the problem P( f , f 1 ) in hand. For the function LC k an explicit value of the least exact penalty parameter can be expressed through the

Lagrangian Duality: BASICS

perturbation function. Let ˇ(y) be the perturbation function of the problem P( f , f 1 ). Note that ˇ(0) D M( f ; f 1 ) and ˇ is a decreasing function, so ˇ(y)  M( f ; f 1 ). For the sake of simplicity, we assume that ˇ(y) < M( f ; f1) for all y > 0. Let (M( f ; f 1) k  ˇ k (y))1/k : d¯k D sup y y >0

(13)

Then the least exact parameter exists if and only if the supremum in (13) is finite and the least exact parameter is equal to d¯k . For k = 1 the existence easily follows from the calmness results of Burke [1]. Let f c (x) = f (x) + c with c > 0 and dc,k be the least exact parameter for problem P( f c , f 1 ). Then it can be proved that d c;k ! 0 as c ! C1. Assume that functions f and f 1 are Lipschitz. Since k < 1 the function LC k is not locally Lipschitz at points x where f 1 (x) = 0 , so we need to have a special smoothing procedure in order to apply numerical method for the unconstrained minimization of this function. Such a procedure is described in [14]. This procedure can be applied for different types of lower order penalty functions. Another approach for constructing a Lipschitz penalty function with a small exact parameter is also of interest (see [11] and references therein). Let  be a strictly increasing continuous concave function defined on [a; C1) where a > 0. Assume that 0 is the right (a)  0 and lim y!C1 (y) D 0 where C derivative of the concave function . Consider the function f ;c (x) D ( f (x) C c) and the classical penalty function for LC 1;;c (x; d) D ( f (x) C c) C d f 1 (x) for the problem P( f ;c ; f 1 ). Let d;c be the least exact parameter of LC 1;;c (assuming that this parameter exists). Then we can assert that d;c ! 0 as c ! 0 under very mild assumptions.

5. Huang XX, Yang XQ (2003) A unified augmented lagrangian approach to duality and exact penalization. Math Oper Res 28(3):533–552 6. Morrison DD (1968) Optimization by least squares. SIAM J Numer Anal 5:83–88 7. Rockafellar RT, Wets RJ-B (1998) Variational analysis. Springer, Berlin 8. Rubinov AM (2000) Abstract convexity and global optimization. Kluwer, Dordrecht 9. Rubinov AM, Glover BM, Yang XQ (2000) Decreasing functions with application to penalization. SIAM J Optim 10:289–313 10. Rubinov AM, Glover BM, Yang XQ (1999) Modified lagrangian and penalty functions in continuous optimization. Optimization 46:327–351 11. Rubinov AM, Yang XQ (2003) Lagrange-type functions in constrained non-convex optimization. Kluwer, Dordrecht 12. Rubinov AM, Yang XQ, Bagirov AM (2002) Nonlinear penalty functions with a small penalty parameter. Optim Methods Softw 17:931–964 13. Teo KL, Goh CJ, Wong KH (1991) A unified computational approach to optimal control problems. In: Pitman monographs and surveys in pure and applied mathematics, vol 55. Longman Scientific and Technical, Harlow, p 329 14. Wu ZY, Bai FS, Yang XQ, Zhang LS (2004) An exact lower order penalty function and its smoothing in nonlinear programming. Optimization 53:51–68 15. Yang XQ, Huang XX (2001) A nonlinear lagrangian approach to constrained optimization problems. SIAM J Optim 14:1119–1144 16. Yevtushenko YG, Zhadan VG (1990) Exact auxiliary functions in optimization problems. USSR Comput Math Math Phys 30:31–42

Lagrangian Duality: BASICS DONALD W. HEARN1 , TIMOTHY J. LOWE2 1 University Florida, Gainesville, USA 2 University Iowa, Iowa City, USA

References 1. Burke JV (1991) Calmness and exact penalization. SIAM J Control Optim 29:493–497 2. Burke JV (1991) An exact penalization viewpoint of constrained optimization. SIAM J Control Optim 29:968–998 3. Clarke FH (1983) Optimization and nonsmooth analysis. Wiley, New York 4. Giannessi F (2005) Constrained optimization and image space analysis, vol 1. Separation of sets and optimality conditions. Mathematical concepts and methods in science and engineering, vol 49. Springer, New York, p 395

L

MSC2000: 90C30 Article Outline Keywords The Primal Problem and the Lagrangian Dual Problem Weak and Strong Duality Properties of the Lagrangian Dual Function

1805

1806

L

Lagrangian Duality: BASICS

Geometrical Interpretations of Lagrangian Duality The Resource-Payoff Space Gap Function

Summary See also References Keywords Primal optimization problem; Dual optimization problem; Subgradient; Duality gap; Constraint qualification

sentation below. For more thorough treatments, see the references. Given (P), define the Lagrangian function L(x, u, v) = f (x)+ u| g(x) + v| h(x). The Lagrangian dual problem is then ( max (u; v) (D) s.t. u  0; where, for fixed (u, v), the dual function  is defined in terms of the infimum of the Lagrangian function with respect to x 2 S: (u; v) D inf L(x; u; v): x2S

The Primal Problem and the Lagrangian Dual Problem For a given primal optimization problem (P) it is possible to construct a related dual problem which depends on the same data and often facilitates the analysis and solution of (P). This section focuses on the Lagrangian dual, a particular form of dual problem which has proven to be very useful in many optimization applications. A general form of primal problem is

(P)

8 ˆ min ˆ ˆ ˆ (b  Ax): x0

This reduces to >

(u) D b u C

(

0

if (c  A> u)  0;

1

otherwise:

Assuming there are nonnegative values of u such that c  A| u, these would be the only viable choices for the maximization of (u) and therefore (D2) takes the form familiar from linear programming duality: 8 > ˆ ˆ u  c;

u  0:

Example 3 (differentiable convex programming) One of the first nonlinear duals was developed by P. Wolfe [27] for the primal problem 8 ˆ ˆ

ˆ ˆ

1 > ˆ ˆ (b C AH 1 d)  12 d > H 1 d ˆ ˆ :s.t. u  0: Thus, the dual of (P4) is also a quadratic program in the dual variables u. Example 5 (integer program) The following numerical example of a linear problem with binary variables will be used to illustrate various dual properties in the following sections. 8 ˆ ˆ h(x)  f (x): The first inequality follows since x 2 S and the second  from u> g(x)  0 and h(h(x) D 0. If the optimal primal and dual objective values are equal, strong duality is said to hold for the primal and dual pair. The following theorem illustrates such a result for the the pair (P3) and (D3). Theorem 7 Let x be an optimal solution for (P3) and assume the function g satisfies some constraint qualification. Then there exists a vector u such that (x , u ) solves (D3) and f (x  ) D L(x  ; u  ):

It is important to note that the above result is true under very general conditions. In particular, it is true when the set S is discrete. Since  (u, v) is concave, it is known that at least one linear supporting function exists at each (u, v). Collectively, the gradients of all linear supports at (u, v) is called the set of subgradients of  at (u, v). For any (u, v) for which (u, v) is finite, denote S(u, v) as the solution set of the minimization defining (u, v). Theorem 9 For fixed (u; v), let x 2 S(u; v). Then (g(x); h(x)) is a subgradient of  at (u; v). Proof For any (u, v)



Proof Under the assumptions there exists a u  0 such that (x , u ) satisfies the Karush–Kuhn–Tucker conditions:

(u; v) D inf f (x) C u > g(x) C v > h(x) x2S

 f (x) C u > g(x) C v > h(x)

rx L(x  ; u  ) D 0; u

T

D f (x) C (u  u)> g(x) C u> g(x)



g(x ) D 0;

C (v  v)> h(x) C v > h(x):

from which it follows that 



f (x ) D L(x ; u ) 

Hence





and that (x , u ) is feasible to (D3). Using this and the weak duality theorem gives L(x  ; u  )  L(x; u) for any (x, u) satisfying the constraints of (D3). The results of the theorem follow.  The references contain additional strong duality results, including cases where differentiability is not required. However, as will be seen in examples below, it often happens that there is a difference, known as the duality gap, between the optimal values of the primal and dual objective functions.

(u; v)  (u; v) C g(x)> (u  u) C h(x)> (v  v):  If S(u; v) is a single point x, then there is only one subgradient of  at (u; v) in which case  is differential at (u; v), i. e., r(u; v) D (g(x); h(x)). From the above,  is always concave and it is relatively easy to calculate a slope at any point. Much use of this is made in algorithms for large scale integer programs. Also, the fact that the maximum value of the dual provides a lower bound to the optimal objective function value in methods (such as branch and bound) for solving the primal problem. While strong duality

Lagrangian Duality: BASICS

L

generally holds for convex programs, this is rarely true for integer programs. Revisiting the examples of the first section, for Example 1 the Karush–Kuhn–Tucker conditions can be employed to derive p (u) D  (1  u1 )2 C (1  u2 )2 : There is no duality gap for this problem, the dual maximum occurs at (u1 , u2 ) = (1, 1) where  is zero, in agreement with the primal minimum. The dual function is differentiable except at its maximizing point. The dual function of Example 2, a linear program, is linear and thus it is concave and differentiable everywhere. Similarly, in Example 4, since H is positive definite, H 1 is also positive definite and the dual function is again concave and differentiable everywhere. For Example 5, the integer program, values of u feasible to the dual problem, S(u) and (u) are given in Table 1. S(u) is the triple (x1 (u), x2 (u), x3 (u)). Figure 1 is a graph of the function (u). Again, (u) is a concave function and it is differentiable except at 5 7 u D 1; ; : 3 4 The maximum dual value is   5 1  D 11 ; 3 3

Lagrangian Duality: BASICS, Table 1 Values of the dual function for Example 5

S(u) (1; 1; 1) f(1; 1; 1) [ (0; 1; 1)g (0; 1; 1) f(0; 1; 1) [ (0; 0; 1)g (0; 0; 1) f(0; 0; 1) [ (0; 0; 0)g (0; 0; 0)

Geometrical Interpretations of Lagrangian Duality The Resource-Payoff Space One interpretation of the dual problem is provided via the resource-payoff set RP for problem (P). To illustrate geometrically, assume that (P) has just one inequality constraint g(x)  0 and there are no explicit equality constraints. Then the resource-payoff set for the problem is the set of points defined by RP D f(g(x); f (x)) : x 2 S)g :

which indicates a duality gap of size 2/3 since the optimal value of (P5) is f (1, 0, 1) = 12. By contrast, Theorem 8 does not apply in Example 3 because the objective of (D3), a Lagrangian function, depends on both x and u, rather than the dual variables alone. Lagrangian functions are generally not concave.

u 0 z < 0;

Az D 0;

z0

has no solution. However Az = 0, z  0 imply that x +  z 2 S for all   0. Since S is assumed to be compact, the only possibility is z = 0 and the alternative system has no solution. Thus D(x) is nonempty. The dual constraints imply that u| x = r f (x)| x  | v Ax, so 8 b v d(x) D :s.t. A> v  r f (x): By linear programming duality 8 ˆ ( ˆ v D s.t. ˆ s.t. A> v  r f (x) ˆ :

r f (x)> y Ay D b y  0;

and it follows that d(x) D f (x) C min r f (x)> (y  x) y2S

D f (x)  G(x):  Expressing the duality gap in terms of x allows a simple interpretation of weak and strong duality in the convex case. Figure 4 illustrates the gap function in one variable with S being the interval [a, b]. Let x = x1 . The linear function f (x 1 ) C r f (x 1 )> (y  x 1 ) is the tangent line shown. It has a minimum in S at y(x1 ) = a which, by convexity, must lie below f (x ). Hence the weak duality result holds: f (x )  f (x1 ) G(x1 ) =

1811

1812

L

Lagrangian Duality: BASICS

Lagrangian Duality: BASICS, Figure 4 A one variable interpretation of weak and strong duality

d(x1 ). Strong duality occurs when x1 = x and the minimum of the linear function (i. e., the tangent at x ) has the value f (x ). In this case G(x ) = 0. If x were at an interior point of S, and/or if x1 is infeasible to S, this same interpretation holds provided only that f (x1 ) and r f (x1 ) are defined. Summary This section has illustrated basic results and geometrical interpretations of Lagrangian duality. The reference list below is a selection of texts and journal articles on this topic for further reading. See also  Equality-constrained Nonlinear Programming: KKT Necessary Optimality Conditions  First Order Constraint Qualifications  Inequality-constrained Nonlinear Optimization  Kuhn–Tucker Optimality Conditions  Rosen’s Method, Global Convergence, and Powell’s Conjecture  Saddle Point Theory and Optimality Conditions  Second Order Constraint Qualifications  Second Order Optimality Conditions for Nonlinear Optimization References 1. Balinski ML, Baumol WJ (1968) The dual in nonlinear programming and its economic interpretation. Rev Economic Stud 35:237–256

2. Bazaraa MS, Goode JJ (1979) A survey of various tactics for generating Lagrangian multipliers in the context of Lagrangian duality. Europ J Oper Res 3:322–338 3. Bertsekas DP (1975) Nondifferentiable optimization. North-Holland, Amsterdam 4. Bertsekas DP (1982) Constrained optimization and Lagrange multiplier methods. Acad. Press, New York 5. Bertsekas DP (1995) Nonlinear programming. Athena Sci., Belmont, MA 6. Brooks R, Geoffrion A (1966) Finding Everett’s Lagrange multipliers by linear programming. Oper Res 16:1149– 1152 7. Everett H (1973) Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Oper Res 4:72–97 8. Falk JE (1967) Lagrange multipliers and nonconvex programming. J Math Anal Appl 19:141–159 9. Fiacco AV, McCormick GP (1968) Nonlinear programming: Sequential unconstrained minimization techniques. Wiley, New York 10. Fisher ML, Northup WD, Shapiro JF (1975) Using duality to solve discrete optimization problems: theory and computational experience. In: Balinski ML, Wolfe P (eds) Nondifferentiable Optimization. North-Holland, Amsterdam 11. Fletcher R (ed) (1969) Optimization. Acad. Press, New York 12. Geoffrion AM (1970) Elements of large- scale mathematical programming I-II. Managem Sci 16:652–675; 676–691 13. Geoffrion AM (1971) Duality in nonlinear programming: A simplified application-oriented development. SIAM Rev 13:1–7 14. Hearn DW (1982) The gap function of a convex program. Oper Res Lett 1:67–71 15. Hearn DW, Lawphongpanich S (1989) Lagrangian dual ascent by generalized linear programming. Oper Res Lett 8:189–196 16. Hearn DW, Lawphongpanich S (1990) A dual ascent algorithm for traffic assignment problems. Transport Res B 248(6):423–430 17. Kiwiel KC (1985) Methods of descent for nondifferentiable optimization. Springer, Berlin 18. Lasdon LS (1970) Optimization theory for large systems. MacMillan, New York 19. Luenberger DG (1969) Optimization by vector space methods. Wiley, New York 20. Luenberger DG (1973) Introduction to linear and nonlinear programming. Addison-Wesley, Reading, MA 21. Mangasarian OL (1969) Nonlinear programming. McGrawHill, New York 22. Nemhauser GL, Wolsey LA (1988) Integer and combinatorial optimization. Wiley, New York 23. Powell MJD (1978) Algorithms for nonlinear constraints that use Lagrangian functions. Math Program 14:224– 248 24. Rockafellar RT (1970) Convex analysis. Princeton Univ. Press, Princeton

Lagrangian Multipliers Methods for Convex Programming

25. Rockafellar RT (1975) Lagrange multipliers in optimization. In: Cottle RW, Lemke CE (eds) Nonlinear Programming, SIAM-AMS Proc. vol IX, pp 23–24 26. Whittle P (1971) Optimization under constraints. Wiley, New York 27. Wolfe P (1961) A duality theorem for nonlinear programming. Quart Appl Math 19:239–244 28. Zangwill WI (1969) Nonlinear programming: A unified approach. Prentice-Hall, Englewood Cliffs, NJ

Lagrangian Multipliers Methods for Convex Programming LMM MARC TEBOULLE School Math. Sci., Tel-Aviv University, Ramat-Aviv, Tel-Aviv, Israel MSC2000: 90C25, 90C30 Article Outline Keywords Augmented Lagrangians Quadratic Lagrangian Proximal Minimization Modified Lagrangians See also References Keywords Augmented Lagrangians; Convex optimization; Lagrangian multipliers; Primal-dual methods; Proximal algorithms Optimization problems concern the minimization or maximization of functions over some set of conditions called constraints. The original treatment of constrained optimization problems was to deal only with equality constraints via the introduction of Lagrange multipliers which found their origin in basic mechanics. Modeling real world situations often requires using inequality constraints leading to more challenging optimization problems. Lagrange multipliers are used in optimality conditions and play a key role to devise algorithms for constrained problems. What will be summarized here are the basic elements of various algorithms

L

based on Lagrangian multipliers to solve constrained optimization problems, and particularly convex optimization problems. A standard formulation of an optimization problem is: (O)

min f f (x) : x 2 X \ Cg ;

where X is a certain subset of Rn and C is the set of constraints described by equality and inequality constraints  C D x 2 Rn :

g i (x)  0; i D 1; : : : ; m; h i (x) D 0; i D 1; : : : ; p

:

All the functions in problem (O) are real valued functions on Rn , and the set X can described more abstract constraints of the problem. A point x 2 X \ C is called a feasible solution of the problem, and an optimal solution is any feasible point where the local or global minimum of f relative to X \ C is actually attained. By a convex problem we mean the case where X is a convex set, the functions f , g 1 , . . . , g m are convex and h1 , . . . , hp are affine. Recall that a set S  Rn is convex if the line segment joining any two different points of S is contained in it. Let S be a convex subset of Rn . A real valued function f : S ! R is convex if for any x, y 2 S and any  2 [0, 1], f (x C (1  )y)   f (x) C (1  ) f (y) : Convexity plays a fundamental role in optimization (even in nonconvex problems). One of the key fact is that when a convex function is minimized over a convex set, every local optimal solution is global. Another, fundamental point is that a powerful duality theory can be developed for convex problems, which as we shall see, is also at the root of the development and analysis of Lagrangian multiplier methods. Augmented Lagrangians The basic idea of augmented Lagrangian methods for solving constrained optimization problems, also called multiplier methods, is to transform a constrained problem into a sequence of unconstrained problems. The approach differs from the penalty-barrier methods, [13] from the fact that in the functional defining the unconstrained problem to be solved, in addition to a penalty parameter, there are also multipliers associated with the

1813

1814

L

Lagrangian Multipliers Methods for Convex Programming

constraints. Multiplier methods can be seen as a combination of penalty and dual methods. The motivation for these methods came from the desire of avoiding illconditioning associated with the usual penalty-barrier methods. Indeed, in contrast to penalty methods, the penalty parameter need not to go to infinity to achieve convergence of the multiplier methods. As a consequence, the augmented Lagrangian has a ‘good’ conditioning, and the methods are robust for solving nonlinear programs. Augmented Lagrangians methods were proposed independently by M.R. Hestenes [16] and M.J.D. Powell [26] for the case of equality constraints, and extended for the case of inequality constraints by R.T. Rockafellar [27]. Many other researchers have contributed to the development of augmented Lagrangian methods, and for an excellent treatment and comprehensive study of multiplier methods, see [7] and references therein. Quadratic Lagrangian We start by briefly describing the basic steps involved in generating a multiplier method for the equality constrained problem (E)

min f f (x) : h i (x) D 0; i D 1; : : : ; pg :

Here f and hi are real valued functions on Rn and no convexity is assumed (which will not help anyway because of the nonlinear equality constraints). Also for simplicity we let X = Rn . The ordinary Lagrangian associated with (E) is l(x; y) D f (x) C

p X

y i h i (x):

iD1

One of the oldest and simplest way to solve (E) is by sequential minimization of the Lagrangian ([2]). Namely, we start with an initial multiplier yk and minimize l(x, yk ) over x 2 Rn to produce xk . We then update the multiplier sequence via the formula: y ikC1 D y ik C s k h i (x k );

i D 1; : : : ; p;

where sk is a stepsize parameter. The rational behind the above method is that it can be simply interpreted as a gradient-type algorithm to solve an associated dual problem. Unfortunately, such a method while simple requires too many assumptions on the problem’s data

to generate points converging rapidly toward an optimal solution. Thus this primal-dual framework is not in general particularly attractive. However, combining the primal-dual idea to the one of penalty leads to another class of algorithms called multiplier methods. In these methods one uses instead of the classical Lagrangian l(x, y) a ‘penalized’ Lagrangian of the form: Pc (x; y) D f (x) C

p X

p

y i h i (x) C

iD1

cX 2 h (x); 2 iD1 i

where c > 0 is a penalty parameter. Then, starting with an initial multiplier yk and penalty parameter ck , the augmented Lagrangian Pc is minimized with respect to x and at the end of each minimization, the multipliers (and sometimes also the penalty parameter) are updated according to some scheme and we continue the process until convergence. More precisely, the method of multipliers generates the sequences {yk }  Rm , {xk }  Rn as follows. Given a sequence of nondecreasing scalars ck > 0, compute o n x kC1 2 arg min L c k (x; y k ) : x 2 Rn ; y ikC1 D y ik C c k h i (x kC1 );

i D 1; : : : ; p:

The rational behind the updating of the multipliers yk is that if the generated sequence xk converges to a local minimum then the sequence yk will converge to the corresponding Lagrange multiplier y . Under reasonable assumptions, this happens without increasing the parameter ck to infinity and thus avoids the difficulty with ill-conditioning. The above scheme provides with the key steps in devising a multiplier method for equality constrained optimization problems. We now turn to the case of problems with inequality constraints: (I)

min f f (x) : g i (x)  0; i D 1; : : : ; mg :

One simple way to treat this case is to transform the inequality constraints to equality using squared variables and then apply the multiplier framework previously outlined. Thus, we convert problem (I) to the equality constrained problem in the variables (x, z): ( min f (x) s.t.

g i (x) C z2i D 0;

i D 1; : : : ; m;

where z 2 Rm are additional variables. The quadratic augmented Lagrangian to be minimized with respect to

Lagrangian Multipliers Methods for Convex Programming

(x, z) thus takes the form: Q c (x; z; y) D f (x) C

m X

y i (g i (x) C z2i )

iD1 m

C

cX (g i (x) C z2i )2 : 2 iD1

The key observation here is that the minimization with respect to z can be carried out analytically. One can verify via simple calculus that for fixed (x, y), minz2Rm Qc (x, z, y) = Lc (x, y), with L c (x; y) D f (x)C

m  1 X max 2 f0; y i C cg i (x)g  y2i : 2c

L

and any limit point of the sequence xk is an optimal solution of the convex program. Note that we do not require that ck is sufficiently large and convergence is obtained from any starting point y0 2 Rm . The multiplier method for inequality constrained problems was derived by using slack variables in the inequality constraints and then by applying the multiplier method which was originally devised for problems having only equality constraints. An alternative way of constructing an augmented Lagrangian method is via the proximal framework. Proximal Minimization Consider the convex optimization problem

iD1

Summarizing, the multiplier method for the inequality constrained problem (I) consists of the following two steps: o n x kC1 2 arg min L c k (x; y k ) : x 2 Rn ; y kC1 D maxf0; y k C c k g(x kC1 )g: For the general optimization problem (O), namely the case of mixed equality and inequality constraints, Lagrangian multiplier methods can be developed in a similar fashion. Convergence results to a local minimum for the above schemes can be established under second order sufficiency assumptions, ([7,28]). In the case of convex programs, namely when in problem (I) the functions f , g 1 , . . . , g m are assumed convex functions, (or more generally in problem (O), if we also assume hi affine and X convex), much stronger convergence results can be established under mild assumptions ([29]). A typical result is as follows. Assumption 1 The set of optimal solutions of the convex problem (I) is nonempty and compact and the set of multiplier is nonempty and compact. The assumption on the optimal set of multipliers is guaranteed under the standard Slater constraint qualification: x)  0; 9xˆ : g i (b

i D 1; : : : ; m:

Under assumption 1, one can prove that the sequence yk converges to some Lagrange multiplier y

(C)

min fF(x) : x 2 Rn g ;

where F: Rn ! ( 1, + 1] is a proper, lower semicontinuous convex function. One method to solve (C) is to ‘regularize’ the objective function using the proximal map of J.-J. Moreau [22]. Given a real positive number c, a proximal approximation of f is defined by: ˚ Fc (x) D inf F(u) C (2c)1 kx  uk2 : u

(1)

The resulting function F c enjoys several important properties: it is convex and differentiable with gradient which is Lipschitz with constant (c1 ) and when minimized possesses the same set of minimizers and the same optimal value than problem (C). The quadratic regularization process of the function f leads to an iterative procedure for solving problem (C), called the proximal point algorithm [21,30]. The method is as follows: given an initial point x0 2 Rn a sequence {xk } is generated by solving: 

2 1



(2) x kC1 D arg min F(x) C

x  x k ; 2c k where {ck }1 kD1 is a sequence of positive numbers. One of the most powerful application of the proximal algorithm is when applied to the dual of an optimization problem. Indeed, as shown by Rockafellar [27,29], a direct calculation shows that Lc can be written as  1 (3) L c (x; y) D maxm l(x; )  k  yk2 ; 2c 2RC

1815

1816

L

Lagrangian Multipliers Methods for Convex Programming

where the maximum is attained uniquely at i = max {0, yi + c g i (x)}, i = 1, . . . , m. Here l: Rn × Rm + ! R denotes the usual Lagrangian associated with the inm stands for equality constrained problem (I) and RC the nonnegative orthant. This shows that the quadratic augmented Lagrangian is nothing else but the Moreau proximal regularization of the ordinary Lagrangian, and the quadratic multiplier method can be interpreted as applying the proximal minimization algorithm on the dual problem associated with (I): (D)

sup fd(y) : y  0g ;

where d(y) := infx l(x, y) is the dual functional. This interplay between the proximal algorithm and multiplier methods is particularly interesting since it offers the possibility of designing and analyzing the convergence properties of the later from the former, and also leads to consider useful potential extensions of multiplier methods which are discussed next.

choice of we then have a multiplier method which consists of the sequence of unconstrained minimization problems x kC1 2 arg minn B c k (x; y k ); x2R

followed by the multiplier updates 0

y ikC1 D y ik

i D 1; : : : ; m:

The multiplier updating formula can be simply explained as follows. Suppose the functions in problem (I) are given differentiable, then xk + 1 minimizes Bc k (x, yk ) means that r x Bc k (xk + 1 , yk ) = 0, i. e., r f (x kC1 ) C

m X

y ik

0

(c k g i (x kC1 )r g i (x kC1 ) D 0;

iD1

and using the multiplier updates defined above the equation reduces to:

Modified Lagrangians One of the main disadvantages of the quadratic multiplier methods for inequality constrained problems is that even when the original problem is given twice continuously differentiable, the corresponding functional Lc is not. Indeed, note that with twice continuously differentiable data {f , g i }, the augmented Lagrangian Lc is continuously differentiable in x. However, the Hessian matrix of Lc is discontinuous for all x such that g i (x) =  c1 yi . This may cause difficulties in designing an efficient unconstrained minimization algorithm for Lc and motivates the search for alternative augmented Lagrangian to handle inequality constrained problems, which we call here modified Lagrangians. These Lagrangians possess better differentiability properties to allow the use of efficient Newton-like methods in the minimization step. Modified Lagrangians can be found in several works, [1,15,19,20]. An approach originally developed in [19] proposed a class of methods which uses instead of Lc a modified Lagrangian of the form:

(c k g i (x kC1 );

r f (x

kC1

)C

m X

y ikC1 r g i (x kC1 ) D 0;

iD1

showing that (xk + 1 , yk + 1 ) also satisfies the optimality conditions for minimizing the classical Lagrangian, namely r x l(xk + 1 , yk + 1 ) = 0. Interesting special cases of the generic method described above includes the exponential method ([23,35]) with the choice (t) = et  1 and the modified barrier method [24] which is based on the choice (t) =  ln(1 t). More examples and further analysis of these methods can be found in [25]. Another way of constructing modified Lagrangians is in view of the results from the previous section, to try alternative proximal regularization terms which could lead to better differentiability properties of the corresponding augmented Lagrangian functional. This approach was considered in [32], who suggested new classes of proximal approximation of a function given by F (x) :D inff f (u) C 1 D(u; x)g:

B c (x; y) :D f (x) C c 1

m X

u

y i (cg i (x));

iD1

where is a scalar penalty function which is at least C2 and satisfies some other technical conditions. For each

(4)

Here, D(, ), which replaces the quadratic proximal term in (1), is a measure of ‘closeness’ between x, y satisfying D(x, y)  0 with equality if and only if x = y. One generic form for D is the use of a ‘proximal-like’

Lagrangian Multipliers Methods for Convex Programming

term defined by D(x; y) :D d' (x; y) :D

n X

y i '(y1 i x i );

iD1

where ' is a given convex function defined on the nonnegative real line and which satisfies some technical conditions ([33]). The motivation of using such functional emerges from the desire of eliminating nonnegativity constraints such as the ones present in the dual problem. Thus, by mimicking (2) and (3) with the proximal term d' , one can design a wide variety of modified Lagrangians methods with an appropriate choice of '. The basic steps of the modified multipliers method then emerging can be described as follows: Given a sequence of positive numbers {ck }, and initial points xk 2 Rn , yk m 2 RC (the positive orthant) generate iteratively the next points by solving o n (5) x kC1 2 arg min M c k (x; y k ) : x 2 Rn ; followed by the multiplier updates k y kC1 2 arg maxfy0 g(x kC1 )  c 1 k d' (y; y )g; y0

function, the modified Lagrangian for various choices of d' is twice continuously differentiable if the problem’s data f , g are. Thus, this opens the possibility of using Newton methods for solving efficiently (5). Under assumption 1 and appropriate condition on the kernel ' one can prove convergence results for these modified multiplier methods similar to the one obtains in the quadratic case ([17]). There has been considerable recent research on modified Lagrangian methods and for further results see [3,4,5,11,18,25]. The Lagrangian functional plays a central role in the analysis and algorithmic development of constrained optimization problems. Lagrangian based methods and the related proximal framework have been used in other optimization contexts, such as convexification of nonconvex optimization problems [6,28], decomposition algorithms [9,12,31,34], semidefinite programming [10] and in many other applications, see e. g., [8,14] where more references can be found. See also

(6)

where M c is the modified Lagrangian defined by M c (x; y) D sup fl(x; )  c 1 d' (; y)g

L

(7)

m

2RC

i. e., the proximal-like regularization of the usual Lagrangian l(x, ) associated with problem (I). In the equation (6), g(x) denotes the column vector (g 1 (x), . . . , g m (x))0 2 Rm and the prime denotes transposition. The method is viable since both (6) and (7) can be solved analytically, and the computational analysis and effort should concentrate on (5). This method of multipliers is nothing else but a proximal-like algorithm applied to m , the dual problem (D) ([17]) i. e., starting with y0 2 RC k generate a sequence {y } by solving k y kC1 D arg maxfd(y)  c 1 k d' (y; y )g: y0

The above scheme gives rise to a rich family of numerical methods, which includes (with an appropriate choice of ') several classes of nonquadratic multiplier methods ([7,24,35]). One of the main advantage of using these modified multiplier methods is that in contrast with the usual quadratic augmented Lagrangian

 Convex Max-functions  Decomposition Techniques for MILP: Lagrangian Relaxation  Integer Programming: Lagrangian Relaxation  Lagrange, Joseph-Louis  Multi-objective Optimization: Lagrange Duality References 1. Arrow KJ, Gould FJ, Howe SM (1973) A general saddle point result for constrained optimization. Math Program 5:225– 234 2. Arrow KJ, Hurwicz L, Uzawa H (1958) Studies in linear and nonlinear programming. Stanford Univ. Press, Palo Alto, CA 3. Auslender AA, Cominetti R, Haddou M (1997) Asymptotic analysis of penalty and barrier methods in convex and linear programming. Math Oper Res 22:43–62 4. Auslender AA, Teboulle M, Ben-Tiba S (1999) Interior proximal and multiplier methods based on second order homogeneous kernels. Math Oper Res 24:645–668 5. Ben-Tal A, Zibulevsky M (1997) Penalty-barrier methods for convex programming problems. SIAM J Optim 7:347–366 6. Bertsekas D (1979) Convexification procedures and decomposition methods for nonconvex optimization problems. J Optim Th Appl 29:169–197 7. Bertsekas D (1982) Constrained optimization and Lagrangian multipliers. Acad. Press, New York

1817

1818

L

Laplace Method and Applications to Optimization Problems

8. Bertsekas D, Tsitsiklis JN (1989) Parallel and distributed computation: Numerical methods. Prentice-Hall, Englewood Cliffs, NJ 9. Chen G, Teboulle M (1994) A proximal-based decomposition method for convex minimization problems. Math Program 64:81–101 10. Doljanski M, Teboulle M (1998) An interior proximal algorithm and the exponential multiplier method for semidefinite programming. SIAM J Optim 9:1–13 11. Eckstein J (1993) Nonlinear proximal point algorithms using Bregman functions with applications to convex programming. Math Oper Res 18:202–226 12. Eckstein J, Bertsekas. DP (1992) On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math Program 55:293– 318 13. Fiacco AV, McCormick GP (1990) Nonlinear programming: Sequential unconstrained minimization techniques. Classics Appl Math. SIAM, Philadelphia 14. Glowinski R, Le Tallec P (1989) Augmented Lagrangians and operator-splitting methods in nonlinear mechanics. Stud Appl Math. SIAM, Philadelphia 15. Golshtein EG, Tretyakov NV (1996) Modified Lagrangians and monotone maps in optimization. Discrete Math and Optim. Wiley, New York 16. Hestenes MR (1969) Multiplier and gradient methods. J Optim Th Appl 4:303–320 17. Iusem A, Teboulle M (1995) Convergence analysis of nonquadratic proximal methods for convex and linear programming. Math Oper Res 20:657–677 18. Kiwiel KC (1997) Proximal minimization methods with generalized Bregman functions. SIAM J Control Optim 35:1142–1168 19. Kort KBW, Bertsekas DP (1972) A new penalty function method for constrained minimization. In: Proc. IEEE Conf. Decison Control, 162–166 20. Mangasarian OL (1975) Unconstrained Lagrangians in nonlinear programming. SIAM J Control 13:772–791 21. Martinet B (1978) Perturbation des méthodes D, optimisation application. RAIRO Anal Numer/Numer Anal 93(12):152–171 22. Moreau JJ (1965) Proximité and dualité dans un espace Hilbertien. Bull Soc Math France 93:273–299 23. Nguyen VH, Strodiot JJ (1979) On the convergence rate of a penalty function method of the exponential type. J Optim Th Appl 27:495–508 24. Polyak RA (1992) Modified barrier functions: Theory and methods. Math Program 54:177–222 25. Polyak RA, Teboulle M (1997) Nonlinear rescaling and proximal-like methods in convex optimization. Math Program 76:265–284 26. Powell MJD (1969) A method for nonlinear constraints in minimization problems. In: Fletcher R (ed) Optimization. Acad. Press, New York, 283–298

27. Rockafellar RT (1973) A dual approach to solving nonlinear programming problems by unconstrained optimization. Math Program 5:354–373 28. Rockafellar RT (1974) Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM J Control 12:268–285 29. Rockafellar RT (1976) Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math Oper Res 1:97–116 30. Rockafellar RT (1976) Monotone operators and the proximal point algorithm. SIAM J Control Optim 14:877– 898 31. Spingarn JE (1985) Applications of the method of partial inverses to convex programming: Decomposition. Math Program 32:199–223 32. Teboulle M (1992) Entropic proximal mappings in nonlinear programming and applications. Math Oper Res 17:670–690 33. Teboulle M (1997) Convergence of proximal-like algorithms. SIAM J Optim 7:1069–1083 34. Tseng P (1991) Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J Control Optim 29:119–138 35. Tseng P, Bertsekas DP (1993) On the convergence of the exponential multiplier method for convex programming. Math Program 60:1–19

Laplace Method and Applications to Optimization Problems PANOS PARPAS, BERÇ RUSTEM Department of Computing, Imperial College, London, GB Article Outline Abstract Background Heuristic Foundations of the Method

Applications Stochastic Methods for Global Optimization Phase Transitions in Combinatorial Optimization Worst Case Optimization

References Abstract The Laplace method has found many applications in the theoretical and applied study of optimization problems. It has been used to study: the asymptotic behavior of stochastic algorithms, ‘phase transitions’ in combinatorial optimization, and as a smoothing technique

L

Laplace Method and Applications to Optimization Problems

for non–differentiable minimax problems. This article describes the theoretical foundation and practical applications of this useful technique. Background Laplace’s method is based on an ingenious trick used by Laplace in one his papers [19]. The technique is most frequently used to perform asymptotic evaluations to integrals that depend on a scalar parameter t, as t tends to infinity. Its use can be theoretically justified for integrals in the following form: 

Z I(t) D

exp A

 f (x) d(x) : T(t)

Where f : Rn ! R, T : R ! R, are assumed to be smooth, and T(t) ! 0 as t tends to 1. A is some compact set, and  is some measure on B (the field generated by A). We know that since A is compact, the continuous function f will have a global minimum in A. For simplicity, assume that the global minimum x* is unique, and that it occurs in the interior A. Under these conditions, and as t tends to infinity, only points that are in the immediate neighborhood of x* contribute to the asymptotic expansion of I(t) for large t. The heuristic argument presented above can be made precise. The complete argument can be found in [2], and in [4]. Instead we give a heuristic but didactic argument that is usually used when introducing the method.

Expanding f to second order, and by noting that f 0 (c) D 0, we obtain the following approximation: (

) f (c) C 12 f 00 (c)(x  c)2 exp  K(t; ) Ð dx t c Z cC   f 00 (c)(x  c)2 f (c) dx : exp  D exp  t 2t c Z

cC

The limits of the integral above can be extended to infinity. This extension can be justified by the fact only points around c contribute to the asymptotic evaluation of the integral. K(t; )

Z C1   f (c) f 00 (c)(x  c)2 dx Ð exp  exp  t 2t 1 s  f (c) 2 t : D exp  t f 00 (c)

In conclusion we have that: 

f (c) lim K(t) D exp  t!1 t

s

2 t : f 00 (c)

Rigorous justifications of the above arguments can be found in [4]. These types of results are standard in the field of asymptotic analysis. The same ideas can be applied to optimization problems. Applications

Heuristic Foundations of the Method For the purpose of this subsection only, assume that f is a function of one variable, and that A is given by some interval [a; b]. It will be instructive to give a justification of the method based on the one dimensional integral: Z

b

K(t) D a

f (x) dx : exp  t 

Suppose that f has a unique global minimum, say c, such that c 2 (a; b). As t is assumed to be large, we only need to take into account points near c when evaluating K(t). We therefore approximate K(t) by K(t; ). The latter quantity is given by: Z

cC

K(t; ) D c

f (x) dx : exp  t 

Consider the following problem: F  D min f (x) s:t g i (x)  0

i D 1; : : : ; l :

(1)

Let S denote the feasible region of the problem above, and assume that it is nonempty, and compact, then: lim  ln c(t) D F  : t#0

(2)

Where,  f (x) d t S  Z  f (x) D I x (S)d: exp t Rn 

Z

c(t) ,

exp

(3)

1819

1820

L

Laplace Method and Applications to Optimization Problems

 is any measure on (Rn ; B). A proof of Eq. (2) can be found in [16]. The relationship in Eq. (3) can be evaluated using the Laplace method. The link between the Laplace method and optimization has been explored in:  Stochastic methods for global optimization.  Phase transitions in combinatorial optimization.  Algorithms for worst case analysis. These application areas will be explored next.

that would enable us to escape from local minima is to add noise. One then considers the diffusion process: p (5) dX(t) D r f (X(t))dt C 2T(t)dB(t) : Where B(t) is the standard Brownian motion in Rn . It has been shown in [3,7,8], under appropriate conditions on f , that if the annealing schedule is chosen as follows: T(t) ,

Stochastic Methods for Global Optimization Global optimization is concerned with the computation of global solutions of Eq. (1). In other words, one seeks to compute F * , and if possible obtaining points from the following set: S  D fx 2 S j f (x) D F  g : Often the only way to solve such problems is by using a stochastic method. Deterministic methods are also available but are usually applicable to low dimensional problems. When designing stochastic methods for global optimization, it is often the case that the algorithm can be analyzed as a stochastic process. Then in order to analyze the behavior of the algorithm we can examine the asymptotic behavior of the stochastic process. In order to perform this analysis we need to define a probability measure that has its support in S* . This strategy has been implemented in [3,6,7,8,9,10,16]. A well known method for obtaining a solution to an unconstrained optimization problem is to consider the following Ordinary Differential Equation (ODE): dX(t) D r f (X(t))dt :

(4)

By studying the behavior of X(t) for large t, it can be shown that X(t) will eventually converge to a stationary point of the unconstrained problem. A review of, so called, continuous-path methods can be found in [22]. More recently, application of this method to large scale problems was considered by Li-Zhi et al. [13]. A deficiency of using Eq. (4) to solve optimization problems is that it will get trapped in local minima. In order to allow the trajectory to escape from local minima, it has been proposed by various authors (e. g. [1,3,7,8,12,16]) to add a stochastic term that would allow the trajectory to “climb” hills. One possible augmentation to Eq. (4)

c ; log(2 C t)

for some

c  c0 ;

(6)

where c0 is a constant positive scalar (the exact value of c0 is problem dependent). Under these conditions, as t ! 1, the transition probability of X(t) converges (weakly) to a probability measure ˘ . The latter, has its support on the set of global minimizers. A characterization of ˘ was given by Hwang in [11]. It was shown that ˘ is the weak limit of the following, so called, Boltzmann density: 



f (x) p(t; x) D exp  T(t)

 Z

1 f (x) dx : exp  T(t) Rn (7) 

Discussion of the conditions for the existence of ˘ , can be found in [11]. A description of ˘ in terms of the Hessian of f can also be found in [11]. Extensions of these results to constrained optimization problems appear in [16]. Phase Transitions in Combinatorial Optimization The aim in combinatorial optimization is to select from a finite set of configurations of the system, the one that minimizes an objective function. The most famous combinatorial problem is the Travelling Salesman Problem (TSP). A large part of theoretical computer science is concerned with estimating the complexity of combinatorial problems. Loosely speaking, the aim of computational complexity theory is to classify problems in terms of their degree of difficulty. One measure of complexity is time complexity, and worst case time complexity has been the aspect that received most attention. We refer the interested reader to [15] for results in this direction. We will briefly summarize results that have to do with average time complexity, the Laplace method, and phase transitions.

Laplace Method and Applications to Optimization Problems

Most of complexity theory is concerned with worst case complexity. However, many useful methods (e. g. the simplex method) will require an exponential amount of time to converge only in pathological cases. It is therefore of great interest to estimate average case complexity. The physics community has recently proposed the use of tools from statistical mechanics as one way of estimating average case complexity. A review in the form of a tutorial can be found in [14]. Here we just briefly adumbrate the main ideas. The first step in the statistical mechanics approach is to define a probability measure on the configuration of the system. This definition is done with the Boltzmann density: ˚ exp  1t f (C) ˚ : p t (C) D P exp  1t f (C) C

The preceding equation is of course the discrete version of Eq. (7). Using the above definition, the average value of the objective function is given by: X p t (C) f (C) : h fti D C

Tools and techniques of statistical mechanics can be used to calculate ‘computational phase transitions’. A computational phase transition is an abrupt change in the computational effort required to solve a combinatorial optimization problem. It is beyond the scope of this article to elaborate on this interesting area of optimization. We refer the interested reader to the review in [14]. The book of Talagrand [20] presents some rigorous results on this subject. Worst Case Optimization In many areas where optimization methods can be fruitfully applied, worst case analysis can provide considerable insight into the decision process. The fundamental tool for worst case analysis is the continuous minimax problem: min ˚(x) ; x2X

where ˚(x) D max y2Y f (x; y). The continuous minimax problem arises in numerous disciplines, including n–person games, finance, economics and policy optimization (see [18] for a review). In general, they are used by the decision maker to assess the worst-case

L

strategy of the opponent and compute the optimal response. The opponent can also be interpreted as nature choosing the worst-case value of the uncertainty, and the solution would be the strategy which ensures the optimal response to the worst–case. Neither the robust decision maker nor the opponent would benefit by deviating unilaterally from this strategy. The solution can be characterized as a saddle point when f (x; ) is convex in x and f (; y) is concave in y. A survey of algorithms for computing saddle points can be found in [5,18]. Evaluating ˚(x) is extremely difficult due to the fact that global optimization is required over Y. Moreover, this function will in general be non-differentiable. For this reason, it has been suggested by many researchers (e. g. [17,21]) to approximate ˚(x) with ˚(x; t) given by:  Z f (x; y) dy : exp  ˚(x; t) D t Y This is of course another application of the Laplace method, and it can easily be seen that: lim t ln ˚(x; t) D ˚(x) : t#0

This idea has been implemented in [17,21] with considerable success. References 1. Aluffi-Pentini F, Parisi V, Zirilli F (1985) Global optimization and stochastic differential equations. J Optim Theory Appl 47(1):1–16 2. Bender CM, Orszag SA (1999) Advanced mathematical methods for scientists and engineers I. Asymptotic methods and perturbation theory, Reprint of the 1978 original. Springer, New York 3. Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in Rn . SIAM J Control Optim 25(3):737–753 4. de Bruijn NG (1981) Asymptotic methods in analysis, 3rd edn. Dover Publications Inc., New York 5. Dem0 yanov VF, Malozëmov VN (1990) Introduction to minimax. Translated from the Russian by Louvish D, Reprint of the 1974 edn. Dover Publications Inc., New York 6. Gelfand SB, Mitter SK (1991) Recursive stochastic algorithms for global optimization in Rd . SIAM J Control Optim 29(5):999–1018 7. Geman S, Hwang CR (1986) Diffusions for global optimization. SIAM J Control Optim 24(5):1031–1043 8. Gidas B (1986) The Langevin equation as a global minimization algorithm. In: Disordered systems and biological organization (Les Houches 1985). NATO Adv Sci Inst Ser F Comput Systems Sci, vol 20. Springer, Berlin, pp 321–326

1821

1822

L

Large Scale Trust Region Problems

9. Gidas B (1987) Simulations and global optimization. In: Random media (Minneapolis, MN, 1985), IMA Vol Math Appl, vol 7. Springer, New York, pp 129–145 10. Gidas B (1985) Metropolis-type Monte Carlo simulation algorithms and simulated annealing. In: Topics in contemporary probability and its applications. Probab Stochastics Ser. CRC, Boca Raton, FL, pp 159–232 11. Hwang CR (1980) Laplace’s method revisited: weak convergence of probability measures. Ann Probab 8(6):1177– 1182 12. Kushner HJ (1987) Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: global minimization via Monte Carlo. SIAM J Appl Math 47(1):169–185 13. Li-Zhi L, Liqun Q, Hon WT (2005) A gradient-based continuous method for large-scale optimization problems. J Glob Optim 31(2):271 14. Martin OC, Monasson R, Zecchina R (2001) Statistical mechanics methods and phase transitions in optimization problems. Theoret Comput Sci 265(1–2):3–67 15. Papadimitriou CH (1994) Computational complexity. Addison-Wesley, Reading, MA 16. Parpas P, Rustem B, Pistikopoulos E (2006) Linearly constrained global optimization and stochastic differential equations. J Glob Optim 36(2):191–217 17. Polak E, Royset JO, Womersley RS (2003) Algorithms with adaptive smoothing for finite minimax problems. J Optim Theory Appl 119(3):459–484 18. Rustem B, Howe M (2002) Algorithms for worst-case design and applications to risk management. Princeton University Press, Princeton, NJ 19. Stigler SM (1986) Laplace’s 1774 memoir on inverse probability. Statist Sci 1(3):359–378 20. Talagrand M (2003) Spin glasses: a challenge for mathematicians. Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge (Results in Mathematics and Related Areas. 3rd Series). A Series of Modern. Surveys in Mathematics, vol 46. Springer, Berlin 21. Xu S (2001) Smoothing method for minimax problems. Comput Optim Appl 20(3):267–279 22. Zirilli F (1982) The use of ordinary differential equations in the solution of nonlinear systems of equations. In: Nonlinear optimization (Cambridge 1981). NATO Conf Ser II: Systems Sci. Academic Press, London, pp 39–46

Large Scale Trust Region Problems LSTR LAURA PALAGI DIS, Universitá Roma ‘La Sapienza’, Rome, Italy MSC2000: 90C30

Article Outline Keywords Algorithms Based on Successive Improvement of KKT Points Exact Penalty Function Based Algorithm (EPA) D.C. Decomposition Based Algorithm (DCA)

Parametric Eigenvalue Reformulation Based Algorithms Inverse Interpolation Parametric Eigenvalue Formulation (IPE) Semidefinite Programming Approach (SDP)

Conclusion See also References

Keywords Large scale trust region problem; Exact penalty function; D.C. programming; Eigenvalue problem; Semidefinite programming The trust region (TR) problem consists in minimizing a general quadratic function q: Rn ! R of the type q(x) D

1 > x Qx C c > x 2

subject to an ellipsoidal constraint x| Hx  r2 with the symmetric matrix H positive definite and r a positive scalar. By rescaling and without loss of generality, it can be assumed for sake of simplicity H = I, hence the TR problem is (

min

q(x)

s.t.

kxk2  r 2 ;

(1)

where k  k denotes the `2 norm. The interest in this problem initially arose in the context of unconstrained optimization when q(x) is a local quadratic model of the objective function which is ‘trusted’ to be valid over a restricted ellipsoidal region centered around the current iterate. However, it has been shown later that problems with the same structure of (1) are at the basis of algorithms for solving general constrained nonlinear programming problems (e. g. [2,14,19,21,27,28] and references therein), and for obtaining bounds for integer programming problems

Large Scale Trust Region Problems

(e. g. [10,11,12,17,18,26]; cf. also  Integer programming). Many papers have been devoted to study the specific features of Problem (1). It is well known [7,22] that a feasible point x is a global solution for (1) if and only if there exists a scalar   0 such that the following KKT conditions are satisfied: (Q C  I)x  D c; 2

 (kx  k  r 2 ) D 0; and furthermore Q +  I < 0, where < denotes positive semidefinitness of the matrix. Note that a complete characterization of global minimizers is given without requiring any convexity assumption on the matrix Q. Moreover, it has been proved that an approximation to the global solution can be computed in polynomial time (see, for example, [1,24,25]). Hence Problem (1) can be considered an ‘easy’ problem from a theoretical point of view. These peculiarities led to the development of ‘ad hoc’ algorithms for finding a global solution of Problem (1). The first ones proposed in [7,16,22] were essentially based on the solution of a sequence of linear system of the type (Q + k I) x =  c for a sequence {k }. These algorithms produce an approximate global minimizer of Problem (1), but rely on the ability to compute a Cholesky factorization of the matrix (Q + k I) at each iteration k, and hence these methods are appropriate when forming a factorization for different values of k is realistic in terms of both memory and time requirements. Indeed, they are appropriate for large scale problems with special structure, but in the general case, when no sparsity pattern is known, one cannot rely on factorizations of the matrices involved. Thus one concentrates on iterative methods of conjugate gradient type (cf.  Conjugate-gradient methods) that require only matrix-vector products. Among the methods that have been proposed to solve large scale trust region problems, the following two main categories can be identified:  methods that produce a sequence of KKT points of (1) with progressive improvement of the objective function;  methods that solve (1) via a sequence of parametric eigenvalue problems.

L

Algorithms Based on Successive Improvement of KKT Points Methods in this class are based on special properties of KKT points of Problem (1). Indeed one can prove the following properties: 1) given a KKT point that is not a global minimizer, it is possible to find a new feasible point with a lower value of the objective function [5,13]; 2) the number of distinct values of the objective function q(x) at KKT points is bounded from above by 2m + 2 where m is the number of negative eigenvalues of Q [13]. Exploiting these properties, a global minimizer of Problem (1) can be found, by applying a finite number of times an algorithm that, starting from a feasible point, locates a KKT point with a lower value of the objective function. An algorithmic scheme of methods in this framework is summarized in the pseudocode of Table 1. The procedure described above is well-posed in the sense that it enters the ‘DO cycle’ a finite number of steps, since by Property 2, the function can assume at most a finite number of values at a KKT point. To complete the scheme of Table 1 and obtain an efficient algorithm for the solution of Problem (1), it remains to specify how to move from a non global KKT

Large Scale Trust Region Problems, Table 1 A pseudocode for TR problem based on successive improvement of KKT points

procedure TR-IMPROVE-KKT() input instance (Q; c; r; x 0 ); Set k = 0; x = x k ; (starting point) find a KKT point xˆ k s.t. q(xˆ k )  q(x k ); DO (until a global minimizer is found) (escape from a nonglobal KKT point) find x s.t. k x k r; q(x) < q(xˆ k ); (update starting point) set k = k + 1; x k = x; (find a ‘better’ KKT point) find a KKT point xˆ k s.t. q(xˆ k )  q(x k ); OD; RETURN (solution) END TR-IMPROVE-KKT;

1823

1824

L

Large Scale Trust Region Problems

point to a feasible point while improving the objective function, and how to define a globally and ‘fast’ convergent algorithm to locate a KKT point. To check global optimality of a KKT point (i. e. to check if Q +  I < 0), one needs an estimate of the KKT multiplier  corresponding to the point x, and has to verify whether    min (Q). To obtain  the following multiplier function can be used (x) D 

1 > x (Qx C c); 2r 2

(2)

which is consistent, namely at a KKT point (x) = . If <  min (Q), then (x, ) is a nonglobal KKT point and a negative curvature direction for the matrix Q +  I exists, namely a vector z such that z| (Q +  I) z < 0. To perform the step ‘escape from a non global KKT point’, one can use such a direction. Roughly speaking and without discussing the details (see [5,13]), a new feasible point can be obtained by moving from x along z itself or along a direction easily obtainable from z of a computable quantity ˛. The efficiency of this step depends on the ability of finding efficiently such a vector z. Hence a procedure that finds an approximation of the minimum eigenvalue of (Q +  I) and of the corresponding eigenvector is needed. In the large scale setting, this can be done efficiently by using a Lanczos method [3,23] which meets the requirement of limited storage and needs only matrix-vector products. In the algorithmic scheme of Table 1, it remains to define how to find efficiently a KKT point for Problem (1). Two different approaches have been recently (1998) proposed to perform this step; one is based on a continuously differentiable exact penalty function approach, the other is based on a difference of convex function approach. In both cases, the basic idea is to reformulate the constrained Problem (1) in a different form that allows one to use ideas typical of other fields of mathematical programming. Both approaches, which are described briefly in the sequel, treat indifferently the so called ‘easy and hard’ cases of Problem (1) and require only matrix vector products. Exact Penalty Function Based Algorithm (EPA) The main idea at the basis of a continuously differentiable exact penalty function approach is the reformulation of the constrained Problem (1) as an unconstrained

one. In particular, a continuously differentiable function P(x) can be defined [13] such that Problem (1) is ‘equivalent’ to the unconstrained problem min P(x):

x2R n

The merit function takes full advantage of the structure of Problem (1) and it is a piecewise quartic function, whose definition relies on the particular multiplier function (2). The analytic expression of P is " P(x) D q(x)  (x)2 4  2 2 " C max 0; (kxk2  r 2 ) C (x) ; 4 " where 0 < " < 2r4 /[r2 (k Q k + 1)+ kck2 ]. The function P(x) has the following features:  it has compact level sets;  stationary (global minimum) points of P(x) are KKT (global minimum) points of Problem (1) and vice versa; moreover P(x) = q(x) at these points;  the penalty parameter " need not be updated;  for points such that kxk2  r2 it results P(x)  q(x);  P(x) is twice continuously differentiable in a neighborhood of a KKT point that satisfies strict complementarity. The unconstrained reformulation of Problem (1) can be exploited to define an algorithm for finding a KKT point while improving the value of objective function with respect to the initial one. Indeed any unconstrained method for the minimization of P(x) can be used. Starting from a point x0 , any of these algorithms produce a sequence of the type x kC1 D x k C ˛ k d k ;

(3)

where dk is a suitable direction, ˛ k is a stepsize along dk . The sequence {xk } need not to be feasible for Problem (1). The boundedness of the level sets of P(x) guarantees the boundedness of the iterates and that any convergent unconstrained method obtains a stationary point x for P such that P(x) < P(x0 ). Furthermore a stationary point of P(x) is a KKT point of Problem (1) and P(x) D q(x). If, in addition, x0 is a feasible point, the following relation holds: q(x) D P(x) < P(x0 )  q(x0 );

Large Scale Trust Region Problems

which means that x is a KKT point of Problem (1) with a value of the objective function lower than the value at the starting point. As regard the efficiency of the algorithms, in terms of rate of convergence and computational requirement, a ‘good’ direction dk can be defined, by further exploiting the features of the unconstrained reformulation. Indeed, in a neighborhood of points satisfying the strict complementarity assumption, P(x) 2 C2 and therefore any unconstrained truncated Newton algorithm [4] can be easily adapted in order to define globally convergent methods which show a superlinear rate of convergence. Methods in this class include conjugate gradient based iterative method that requires only matrix-vector products and hence are suitable for large scale instances. The resulting algorithmic scheme is reported in Table 2. In the nonconvex case (Q  0) strict complementarity holds in a neighborhood of every global minimizer of Problem (1) [13]. However, this may not be true in a neighborhood of a KKT point and the function P(x) may be not twice differentiable there. Nevertheless algorithms which exhibit superlinear rate of convergence can be defined. In fact, drawing inspiration from the results in [6], the direction dk is defined as the approximate solution of one of the following linear systems: 8

2 k ˆ if x k  r 2 < " 2 ; then ˆ ˆ ˆ ˆ ˆ (Q C  k I)d k D (Qx k C c); ˆ
0; set 0 = (x 0 ) and k = 0; DO (until a KKT point (x k ;  k ) is found) set x k+1 = x k + ˛ k d k and  k+1 = (x k+1 ); k = k + 1; OD; RETURN(KKT point); END KKT point by EPA;

exists a neighborhood of b x where the rate of convergence of the algorithm is superlinear. D.C. Decomposition Based Algorithm (DCA) This algorithm is based on an appropriate reformulation of Problem (1) as the minimization of the difference of convex functions [5]. DCA has been proposed for solving large scale d.c. programming problems. The key aspect in d.c. optimization (cf.  D.C. programming) relies on the particular structure of the objective function to be minimized on Rn that is expressed as f (x) = g(x)  h(x), with g and h being convex. One uses the tools of convex analysis applied to the two components g and h of the d.c. function. In particular d.c. duality plays a fundamental role to understand how DCA works. Indeed for a generic d.c. problem, DCA constructs two sequences {xk } and {yk } and it can be viewed as a sort of decomposition approach of the primal and dual d.c. problems. It must be pointed out that a d.c. function has infinitely many d.c. decompositions that give rise to different primal dual pairs of d.c. problems and so to different DCA relative to these d.c. decompositions. Thus, choosing a d.c. decomposition may have an important influence on the qualities (such as robustness, stability, rate of convergence) of the DCA. This aspect is related to regularization techniques in d.c. programming. In the special case of Problem (1), a quite appropriate d.c. decomposition has been proposed, so that DCA becomes very simple and it requires only matrix-vector products. To apply DCA to Problem (1), a d.c. decomposition of the objective function f (x) = q(x) + F (x)

1825

1826

L

Large Scale Trust Region Problems

must be defined, where F (x) is the indicator function for the feasible set, namely ( 0 if kxk2  r 2 ; F (x) D 1 otherwise: From the computational point of view, the most efficient decomposition that has been proposed is 1 g(x) D  kxk2 C c > x C F (x); 2 1 h(x) D x > (I  Q)x; 2 with  > 0 and such that ( I  Q) < 0. In this case the sequence {yk } is obtained by the following rule yk = (I  Q) xk and xk + 1 is obtained as the solution of the problem min

x2R n

1  kxk2 C x > (c  y k ) C F (x): 2

Thus xk + 1 is the projection of (yk  c)/ onto the feasible region kxk2  r2 . The scheme for obtaining KKT points by DCA is reported in Table 3. It has been proved [5] that algorithm DCA generates a sequence of feasible points {xk } with strictly decreasing value of the objective function and such that {xk } converges to a KKT point. In practice the convergence rate depends on the choice of the parameter . A possible choice (the best one according to some numerical experimentations Large Scale Trust Region Problems, Table 3 A pseudocode for finding a KKT point by DCA

procedure KKT POINT by DCA() Given x 0 ;  > 0 such that (I  Q) 0; DO (until a KKT point is found) IF k (I  Q)x k  c k r THEN x k+1 = 1 [(I  Q)x k  c] (I  Q)x k  c ELSE x k+1 = r k (I  Q)x k  c k END IF; IF k x k+1  x k k tol exit; set k = k + 1; OD; RETURN (KKT point); END KKT POINT by DCA;

performed in [5]) consists in taking  as close as possible to the largest eigenvalue of the matrix Q, namely  = max{max (Q) + ", 103 } with "> 0 and sufficiently small. Actually only a low accuracy estimate of max (Q), which can be found by using a Lanczos method, is needed. Parametric Eigenvalue Reformulation Based Algorithms The algorithms in this framework are based on the reformulation of the TR problem into a parametric eigenvalue problem of a bordered matrix. It must be noted that, if the linear term is not present in the function q(x), i. e. c = 0, Problem (1) is a pure quadratic problem that corresponds to finding the smallest eigenvalue of the matrix Q. Indeed the intuitive observation behind this idea is that given a real number t, one can write  >  1 1 1 t t C q(x) D c 2 2 x

c> Q

  1 x

and for a fixed t the goal is to minimize the function q(x) over the set {x: kxk2 + 1 = r2 + 1}, that is to minimize a pure quadratic form z| D(t) z/2 over a spherical region where 

t D(t) D c

 c> : Q

This suggests that a solution of (1) may be found using eigenpairs of the matrix D(t) where t is a parameter to be adjusted. Indeed, in both the algorithms proposed in this framework a key role is played by eigenpairs of the matrix D(t). At each iteration the main computational step is the calculation of the smallest eigenvalue and a corresponding normalized eigenvector of the parametric matrix D(t). The evaluation of the eigenvalueeigenvector pair can be done by using Lanczos method as a black box. Therefore methods can exploit sparsity in the matrices and requires only matrix-vector multiplications. Moreover, only one element of the matrix D(t) is changed at each iteration of both the algorithms and so consecutive steps of Lanczos algorithm become cheaper. Both algorithms have to distinguish between the easy and hard case of Problem (1). The hard case is said to occur when the vector c is orthogonal to the eigenspace associated to the smallest eigenvalue of Q,

L

Large Scale Trust Region Problems

i. e. c| y = 0, for all y 2 Smin with Smin D fx 2 Rn : Qx D min (Q)xg : Depending on whether the easy or the hard case occurs, eigenpairs of the perturbed matrix D(t) satisfies different properties. In the easy case, the smallest eigenvalue min (D(t)) is simple and such that min (D(t)) < min (Q) for all values t. Moreover in this case the corresponding eigenvector has the first component not equal to zero and this plays a fundamental role in defining the iteration of both the algorithms. In the hard case caution should be used, due to the fact that the first component of the eigenvector corresponding to the smallest eigenvalue of D(t) may be zero. Actually, any vector of the form (0, y| )| with y 2 Smin is an eigenvector of the matrix D(t) if and only if c ? Smin . The two algorithms in this framework are briefly described below. Although the basic idea behind both the algorithms is the same, namely inverse interpolation for a parametric eigenvalue problem, the second one is embedded in a semidefinite programming framework. So the first one is referred to as ‘inverse interpolation parametric eigenvalue’ (IPE) approach and the second one as ‘semidefinite programming approach’ (SDP). Inverse Interpolation Parametric Eigenvalue Formulation (IPE) In [23] it is observed that if an eigenvector z of D(t) corresponding to a given eigenvalue  can be normalized so that its first component is one, that is z = (1, x| )| , then a solution of the TR problem can be found in terms of eigenpairs of D(t). This corresponds to the easy case and indeed the pair (x, ) satisfies 

t c

    c> 1 1 D ; Q x x

from which we get:  t   D c > x; (Q  I)x D c:



For  < min (Q), that holds in the easy case with  = min (D(t)), the matrix (Q   I) is positive definite and hence one can define the function () D c > x D c > (Q  I)1 c;

whose derivative is  0 () D c > (Q  I)2 c D kxk2 : For a given value of t, finding the smallest eigenvalue (t) := min (D(t)) < min (Q) and the corresponding eigenvector of D(t) and then normalizing the eigenvec| tor to have its first component equal to one (1, x (t) )| will provide a mean to evaluate the function () and its derivative. If t can be adjusted so that the corresponding x (t) satisfies  0 ((t)) = k x (t) k2 = r2 with t  (t) =  c| x (t) , and (t)  0 then (x,  (t)) satisfies the optimality conditions for Problem (1). Whereas if, during the course of adjusting t, it happens that (t) > 0 with k x (t) k2 < r2 then the optimal solution of Problem (1) is actually unconstrained and can be found by solving the system Qx =  c with any iterative method. Hence using the parametric eigenvalue formulation, the optimal value of (x ,  ) of Problem (1) can be found by solving a sequence of eigenvalue problems adjusting iteratively the parameter t. In order to make this observation useful, a modified Lanczos methods, the implicit restarted Lanczos method [23], is used for computing the smallest eigenvalue and the corresponding eigenvector of D(t). Moreover a rapidly convergent iteration to adjust t has been developed, based on a twopoint interpolant method. Recalling that the goal is to adjust t so that () = t   and  0 () = r2 , an interpolation based iteration that exploits the structure of the problem is proposed. The method is based upon an interpolant b () of () of the form b () D

2 C ˇ(˛  ) C ı: ˛

The values of the parameters ˛, ˇ,  , ı appearing in the interpolant function b () are determined using the values of two iterations (xk , k ), (xk 1 , k 1 ) according to the following rules. The value ı is chosen so as to provide the current estimate ı min of min (Q). In particular, if k xk k < r or k xk 1 k < r ! (x k )> Qx k ; ı D min ımin; 2

x k

if k xk k > r and k xk 1 k > r then (x k )> Qx k (x k1 )> Qx k1 ı D min

2 ;



x k

x k1 2

!

1827

1828

L

Large Scale Trust Region Problems

Large Scale Trust Region Problems, Table 4 A pseudocode for TR based on (IPE)

procedure TR INTERPOL-PARAM-EIG() input instace (Q; c; r; x 0 ); (initialization) Find min (Q) and its eigenvector x; set k= 0; t k = 0; x k = x;  k =min (Q). k x k k2 r 2 DO until j j tol r2 ˆ construct the interpolar (); ˆ = r 2 , that is: ˆ : ˆ 0 () find   2 1/2

ˆ =˛ 2 ;  r +ˇ ˆ ), ˆ + ( ˆ that is: set t k+1 = 

2 ˆ + ı + ˇ(˛  ) ˆ + ; t k+1 =  ˆ ˛ compute  k+1 = min (D(t k+1 )) and the corresponding normalized eigenvec>  k+1 > ; tor 1; (x ) set k = k + 1; OD; RETURN(solution) END TR INTERPOLATION-PARAM-EIG;

and ı min = min(ı min , ı)  min (Q). The other coefficient

2  0 ( k ) D x k , are chosen to satisfy b ( k ) D c > x k , b

2 b  0 ( k1 ) D x k1 . An algorithmic scheme for finding the global minimizer of Problem (1) in the easy case, is reported in Table 4. It has been proved in [23] that there exists a neighborhood of   such that if 0 , 1 are in this neighborhood, all the sequence {k } is well defined, remains in the neighborhood and converge superlinearly to  with the corresponding iterates xk converging superlinearly to x . Unfortunately, the iteration described above can break down in the hard case. Indeed the iteration is based on the ability to normalize the eigenvector of the bordered matrix D(t). This is not possible when the first component is equal to zero, that is in the hard case. From the computational point of view, also a nearhard case can be difficult and it is important to detect these cases and to define alternative rules so as to obtain a convergent iteration. This can be done, by using

again eigenpairs of the bordered matrix and additional information such as the value of an upper bound U on the optimal value  . When the hard case is detected the new iteration should be used. The convergence of this new iteration can be established but unfortunately the rate of convergence is no longer superlinear. Semidefinite Programming Approach (SDP) In [20] a primal-dual simplex type method for Problem (1) has been proposed, which is essentially based on a primal dual pair of semidefinite programming problems. Primal-dual pairs of SDP provide a general framework for TR problem. The idea arises from the fact that Problem (1) enjoys strict duality, that is there is no duality gap and q(x  ) D min max L(x; ) D max min L(x; ); x



x



2

2

where L(x, ) = q(x) + (kxk  r ) denotes the Lagrangian function. By exploiting this feature it is possible to define a primal-dual pair of linear SDP problems that are strictly connected with the TR problem. In particular, a dual for Problem (1) is ( max (r 2 C 1)min (D(t))  t; (5) s.t. min (D(t))  0: The objective function in (5) is a real valued concave function. When the constraint in Problem (1) is an equality one, its dual problem (5) is an unconstrained problem, and as an immediate consequence, the non convex constrained TR problem is transformed into a convex problem and hence it can be solved in polynomial time by the results for general convex programs. Problem (5) can be easily reformulated as a SDP problem, by introducing an additional variable  2 R: 8 2 ˆ ˆ w C 12 w > H(x k )w

(3)

and it is defined by iterations of the form x kC1 D x k C s k

(4)

where the search direction sk is obtained by minimizing the quadratic model of the objective function (3) over Rn . On the one hand, Newton method presents quadratic convergence rate and it is scale invariant, but, on the other hand, in its pure form it is not globally convergent. Globally convergent modifications of the Newton method has been defined following the line search approach and the trust region approach (see, e. g. [11,12,27]; cf. also  Large scale trust region problems), but the main difficulty, in dealing with large scale problems, is represented by the possibility to efficiently solve, at each iteration, linear systems which arise in computing the search direction sk . In fact, the problem dimension could be too large for any explicit use of the Hessian matrix and iterative methods must be used to solve systems of linear equations instead of factorizations of the matrices involved. Indeed, whereas in the small scale setting the Newton direction sk is usually determined by using direct methods for solving the linear system H(x k )s D g(x k );

(5)

when n is large, it is impossible to store or factor the full n × n Hessian matrix unless it is a sparse matrix. Moreover the exact solution, at each iteration, of the system (5) could be too burdensome and not justified when xk is far from a solution. In fact, since the benefits of using the Newton direction are mainly local (i. e. in the neighborhood of a solution), it should not be necessary a great computational effort to get an accurate solution of system (5) when g(xk ) is large. On the basis of these remarks, in [8] the inexact Newton methods were proposed. They represent the basic approach underlying most of the Newton-type large scale unconstrained algorithms. The main idea is to approximately solve the system (5) still ensuring a good convergence rate of the method by using a particular trade-off rule between the computational burden required to solve the system (5) and the accuracy with which it is solved. The measure of this accuracy is the relative residual kr k k ; kg(x k )k

where r k D H(x k )s k C g(x k )

(6)

and sk is an approximate solution of (5). The analysis given in [8] shows that if the sequence {xk } generated by (4) converges to a point x? and if kr k k D 0; k!1 kg(x k )k lim

(7)

then {xk } converges superlinearly to x? . This result is at the basis of the truncated Newton methods which represent one of the most effective approach for solving large scale problems. This class of methods was introduced in [9] within the line search based Newtontype methods. They are based on the fact that whenever the Hessian matrix H(xk ) is positive definite, to solve the Newton equation (5) is equivalent to determine the minimizer of the quadratic model (3). Therefore, in these methods, a Newton-type direction, i. e. an approximate solution of (5), is computed by applying the (linear) conjugate gradient (CG) method (cf.  Conjugate-gradient methods) [23] to approximately minimize the quadratic function (3). A scheme of a line search based truncated Newton algorithm is the following:

Large Scale Unconstrained Optimization

Line search based truncated Newton algorithm OUTER iterations For k = 0; 1; : : : Compute g(x k ) Test for convergence INNER iterations (Computation of the direction s k ) Iterate CG algorithm until a termination criterion is satisfied Compute a stepsize ˛ k by a line search procedure Set x k+1 = x k + ˛ k s k A scheme for a truncated Newton algorithm

Given a starting point x0 , at each iteration k, a Newton-type direction sk is computed by truncating the CG iterates – the inner iterations – whenever a required accuracy is obtained. The definition of an effective truncation criterion represents a key aspect of any truncated Newton method and a natural choice is represented by monitoring when the relative residual (6) is sufficiently small. Moreover, by requiring that krk k / kg(xk )k  k with limk ! 1 k ! 0, the condition given by (7) is satisfied and hence the superlinear convergence is guaranteed [9]. In particular k can be chosen to ensure that, as a critical point is approached, more accuracy is required. Other truncation criteria based on the reduction of the quadratic model can be defined [31]. Numerical experiences showed that a relatively small number of CG iterations is needed, in most cases, for obtaining a good approximation of the Newton direction and this is one the main advantage of the truncated Newton methods since a considerable computational savings can be obtained still ensuring a good convergence rate. The performance of the CG algorithm used in the inner iterations can be improved by using a preconditioning strategy based either on the information gained during the outer iterations or on some scaling of the variables. Several different preconditioning schemes have been proposed and tested [29,40]. Truncated Newton methods can be modified to enable their use whenever the Hessian matrix is not available; in fact, the CG method only needs the product of the Hessian matrix with a displacement vector, and this product can be approximated by finite difference [35]. The resulting method is called discrete truncated Newton method. In [41] a Fortran package (TNPACK) imple-

L

menting a line search based (discrete) truncated Newton algorithm which uses a preconditioned conjugate gradient is proposed. However, additional safeguard is needed within truncated Newton algorithms since the Hessian matrix could be not positive definite. In fact, the CG inner iterations may break down before satisfying the termination criterion when the Hessian matrix is indefinite. To handle this case, whenever a direction of negative curvature (i. e. a direction dk such that d> k H(xk ) dk < 0) is encountered, the inner iterations are usually terminated and a descent direction (i. e. a direction dk such that g(xk )| dk < 0) is computed [9]. More sophisticated strategies can be applied for iteratively solving the system (5) when it is indefinite [6,15, 36,43]. In particular, the equivalent characterization of the linear conjugate gradient algorithm via the Lanczos method can be exploited to define a truncated Newton algorithm which can be used to solve problems with indefinite Hessian matrices [28]. In fact, the Lanczos algorithm does not requires the Hessian matrix to be positive definite and hence it enables to obtain an effective Newton-type direction. A truncated Newton method which uses a nonmonotone line search (i. e. which does not enforce the monotone decrease of the objective function values) was proposed in [20] and the effectiveness of this approach was shown especially in the solution of illconditioned problems. Moreover in the CG-truncated scheme proposed in [20] an efficient strategy to handle the indefinite case is also proposed. A new class of truncated Newton algorithms for solving large scale unconstrained problems has been defined in [25]. In particular, a nonmonotone stabilization framework is proposed based on a curvilinear line search, i. e. a line search along the curvilinear path x(˛) D x k C ˛ 2 s k C ˛d k ; where sk is a Newton-type direction and dk is a particular negative curvature direction which has some resemblance to an eigenvector of the Hessian matrix corresponding to the minimum eigenvalue. The use of the combination of these two directions enables, also in the large scale case, to define a class of line search based algorithms which are globally convergent towards points which satisfy second order necessary optimality conditions, i. e. stationary points where the Hessian matrix is

1833

1834

L

Large Scale Unconstrained Optimization

positive semidefinite. Besides satisfying this important theoretical property, this class of algorithms was also shown to be very efficient in solving large scale unconstrained problems [25,26]. This is also due to the fact that a Lanczos based iterative scheme is used to compute both the directions without terminating the inner iterations when indefiniteness is detected and, as result, more information about the curvature of the objective function are conveyed. Truncated Newton methods have been also defined within the trust region based methods. These methods are characterized by iterations of the form (4) where, at each iteration k, the search direction sk is determined by minimizing the quadratic model of the objective function (3) in a neighborhood of the current iterate, namely by solving the problem min  k (s);

ksk

(8)

where  is the trust region radius. Also in this framework most of the existing algorithms require the solution of systems of linear equations. Some approaches are the dogleg methods [10,38] which aim to solve problem (8) over a one-dimensional arc and the method proposed in [5] which solves problem (8) over a twodimensional subspace. However, whenever the problem dimension is large, it is impossible to rely on matrix factorizations, and iterative methods must be used. If the quadratic model (3) is positive definite and the trust region radius is sufficiently large that the trust region constraint is inactive at the unconstrained minimizer of the model, problem (8) can be solved by using the preconditioned conjugate gradient method [42,44]. Of course, a suitable strategy is needed whenever the unconstrained minimizer of the quadratic model is no longer lying within the trust region and the desired solution belongs to the trust region boundary. A simple strategy to handle this case was proposed in [42] and [44] and it considers the piecewise linear path connecting the CG iterates, stopping at the point where this path leaves the trust region. If the quadratic model (3) is indefinite, the solution must also lie on the trust region boundary and the piecewise linear path can be again followed until either it leaves the trust region, or a negative curvature direction is found. In this latter case, two possibilities have been considered: in [42] the path is continued along this direction until the bound-

ary is reached; in [44] the minimizer of the quadratic model within the trust region along the steepest descent direction (the Cauchy point) is considered. This class of algorithms represents a trust region version of truncated Newton methods and an efficient implementation is carried out within the LANCELOT package [7]. These methods have become very important in large scale optimization, due to both their strong theoretical convergence properties and good efficiency in practice, but they are known to possess some drawbacks. Indeed, they are essentially unconcerned with the trust region until they blunder into its boundary and stop. Moreover, numerical experiences showed that very frequently this untimely stop happens during the first inner iterations when a negative curvature is present and this could deteriorate the efficiency of the method. In order to overcome this drawback an alternative strategy is proposed in [16] where ways of continuing the process once the boundary of the trust region is reached are investigated. The key point of this approach is the use of the Lanczos method and the fact that preconditioned conjugate gradient and Lanczos methods generate different bases for the same Krylov space. Several other large scale trust region methods (cf.  Large scale trust region problems) have been proposed. Another class of methods which can be successfully applied to solve large scale unconstrained optimization problems is the wide class of the nonlinear conjugate gradient methods [14,23]. They are extensions to the general (nonquadratic) case of the already mentioned linear conjugate gradient method. They represent a compromise between steepest descent method and Newton method and they are particularly suited for large scale problems since there is never a need to store a full Hessian matrix. They are defined by the iteration scheme (2) where the search direction is of the form d k D g(x k ) C ˇ k d k1

(9)

with d0 =  g(x0 ) and where ˇ k is a scalar such that the algorithm reduces to the linear conjugate gradient method if the objective function f is a strictly convex quadratic function and ˛ k in (2) is obtained by means of an exact line search (i. e., ˛ k is the one-dimensional minimizer of f (xk + ˛ dk ) with respect to ˛). The most widely used formulas for ˇ k are Fletcher–Reeves (FR)

Large Scale Unconstrained Optimization

and Polak–Ribière (PR) formulas given by

ˇ FR k D ˇ PR k D

kg(x k )k2

; kg(x k1 )k2   g(x k )> g(x k )  g(x k1 ) kg(x k1 )k2

:

Many efforts have been devoted to investigate the global convergence for nonlinear conjugate gradient methods. A widespread technique to enforce the global convergence is the use of a regular restart along the steepest descent direction every n iterations obtained by setting ˇ k = 0. However, computational experiences showed that this restart can have a negative effect on the efficiency of the method; on the other hand, in the large scale setting, restarting does not play a significant role since n is large and very few restarts can be performed. Global convergence results have been obtained for the Fletcher–Reeves method without restart both in the case of exact line search [46] and when ˛ k is computed by means of an inexact line search [1]; then, the global convergence was extended to methods with |ˇ k |  ˇ FR k [14]. As regards the global convergence of the Polak– Ribière method, for many years it was proved with exact line search only under strong convexity assumptions [37]. Global convergence both for exact and inexact line search can also be enforced by modifying the Polak–Ribière method by setting ˇ k = max{ˇ PR k , 0} [14]; this strategy correspond to restart the iterations along the steepest descent direction whenever a negative value of ˇ k occurs. However, an inexact line search which ensures global convergence of the Polak–Ribière method for nonconvex function has been obtained in [21]. As regards the numerical performance of these two methods, extensive numerical experiences showed that, in general, Polak–Ribière method is usually more efficient than the Fletcher–Reeves method. An efficient implementation of the Polak–Ribière method (with restarts) is available as routine VA14 within the Harwell subroutine library [22]. See, e. g., [34] for a detailed survey on the nonlinear conjugate gradient methods. Another effective approach to large scale unconstrained optimization is represented by the limitedmemory BFGS method (L-BFGS) proposed in [32] and then studied in [24,30]. This method resembles

L

the BFGS quasi-Newton method, but it is particularly suited for large scale (unstructured) problems because the storage of matrices is avoided. It is defined by the iterative scheme (2) with the search direction given by d k D H k g(x k ) and where H k is the approximation to the inverse Hessian matrix of the function f at the kth iteration. In the BFGS method the approximation H k is updated by means of the BFGS correction given by H kC1 D Vk> H k Vk C  k s k s> k where V k = I  k yk s> k , sk = xk + 1  xk , yk = g(xk + 1 )  g(xk ), and k = 1/y> k sk . In the L-BFGS method, instead of storing the matrices H k , a prefixed number (say m) of vectors pairs {sk , yk } that define them implicitly are stored. Therefore, during the first m iterations the L-BFGS and the BFGS methods are identical, but when k > m only information from the m previous iterations are used to obtain H k . The number m of BFGS corrections that must be kept can be specified by the user. Moreover, in the L-BFGS the product H k g(xk ) which represents the search direction is obtained by means of a recursive formula involving g(xk ) and the most recent vectors pairs {sk , yk }. An implementation of L-BFGS method is available as VA15 routine within the Harwell subroutine library [22]. An interesting numerical study of L-BFGS method and a comparison of its numerical performance with the discrete truncated Newton method and the Polak–Ribière conjugate gradient method are reported in [30]. The results of a numerical experience with limited-memory quasi-Newton and truncated Newton methods on standard library test problems and on two real life large scale unconstrained optimization applications can be found in [45]. A method which combines the discrete Newton method and the L-BFGS method is proposed in [4] to produce an efficient algorithm able to handle also ill-conditioned problems. Limited memory quasi-Newton methods represent an adaptation of the quasi-Newton methods to large scale unstructured optimization. However, the quasiNewton approach can be successfully applied to large scale problems with a particular structure. In fact, fre-

1835

1836

L

Large Scale Unconstrained Optimization

quently, an optimization problem has some structure which may be reflected in the sparsity of the Hessian matrix. In this framework, the most effective method is the partitioned quasi-Newton method proposed in [18,19]. It is based on the fact that a function f with a sparse Hessian is a partially separable function, i. e. it can be written in the form f (x) D

ne X

f i (x)

iD1

where the element functions f i depends only on a few variables. Many practical problems can be formulated (or recasted) in this form showing a wide range of applicability of this approach. The basic idea of the partitioned quasi-Newton method is to decompose the Hessian matrix into a sum of Hessians of the element functions f i . Each approximation to the Hessian of f i is then updated by using dense updating techniques. These small matrices are assembled to define an approximation to the Hessian matrix of f used to compute the search direction. However, the element Hessian matrices may not be positive definite and hence BFGS formula cannot be used, and in this case a symmetric rank one formula is used. Global convergence results have been obtained under convexity assumption of the function f i [17]. An implementation of the partitioned quasi-Newton method is available as VE08 routine of the Harwell subroutine library [22]. A comparison of the performance of partitioned quasi-Newton, LBFGS, CG Polak–Ribière and truncated discrete Newton methods is reported in [33]. Another class of methods which has been extended to large sparse unconstrained optimization are tensor methods [3]. Tensor methods are based on fourth order model of the objective function and are particularly suited for problems where the Hessian matrix has a small rank deficiency. To conclude, it is worthy to outline that in dealing with large scale unconstrained problems with a very large number of variables (more than 104 ) high performance computer architectures must be considered. See e. g. [2] for the solution of large scale optimization problems on vector and parallel architectures. The reader can find the details of the methods mentioned in this brief survey in the specific cited references.

See also  ABS Algorithms for Linear Equations and Linear Least Squares  Broyden Family of Methods and the BFGS Update  Cholesky Factorization  Conjugate-gradient Methods  Continuous Global Optimization: Models, Algorithms and Software  Interval Linear Systems  Large Scale Trust Region Problems  Linear Programming  Modeling Languages in Optimization: A New Paradigm  Nonlinear Least Squares: Trust Region Methods  Optimization Software  Orthogonal Triangularization  Overdetermined Systems of Linear Equations  QR Factorization  Solving Large Scale and Sparse Semidefinite Programs  Symmetric Systems of Linear Equations  Unconstrained Nonlinear Optimization: Newton–Cauchy Framework  Unconstrained Optimization in Neural Network Training References 1. Al-Baali M (1985) Descent property and global convergence of the Fletcher–Reeves method with inexact line search. IMA J Numer Anal 5:121–124 2. Averick BM, Moré JJ (1994) Evaluation of large-scale optimization problems on vector and parallel architectures. SIAM J Optim 4:708–721 3. Bouaricha A (1997) Tensor methods for large, sparse unconstrained optimization. SIAM J Optim 7:732–756 4. Byrd RH, Nocedal J, Zhu C (1995) Towards a discrete Newton method with memory for large-scale optimization. In: Di Pillo G, Giannessi F (eds) Nonlinear Optimization and Applications. Plenum, New York, pp 1–12 5. Byrd RH, Schnabel RB, Shultz GA (1988) Approximate solution of the trust region problem by minimization over twodimensional subspaces. Math Program 40:247–263 6. Chandra R (1978) Conjugate gradient methods for partial differential equations. PhD Thesis Yale Univ. 7. Conn AR, Gould NIM, Toint PhL (1992) LANCELOT: A Fortran package for large-scale nonlinear optimization (release A). Springer, Berlin 8. Dembo RS, Eisenstat SC, Steihaug T (1982) Inexact Newton methods. SIAM J Numer Anal 19:400–408

Large Scale Unconstrained Optimization

9. Dembo RS, Steihaug T (1983) Truncated-Newton algorithms for large-scale unconstrained optimization. Math Program 26:190–212 10. Dennis JE, Mei HHW (1979) Two new unconstrained optimization algorithms which use function and gradient values. J Optim Th Appl 28:453–482 11. Dennis JE, Schnabel RB (1989) A view of unconstrained optimization. In: Nemhauser GL, Rinnooy Kan AHG, Tood MJ (eds) Handbook Oper. Res. and Management Sci., vol 1. North-Holland, Amsterdam, pp 1–72 12. Fletcher R (1987) Practical methods of optimization. Wiley, New York 13. Fletcher R (1994) An overview of unconstrained optimization. In: Spedicato E (ed) Algorithms for continuous optimization. The state of the art. Kluwer, Dordrecht, pp 109– 143 14. Gilbert JC, Nocedal J (1992) Global convergence properties of conjugate gradient methods for optimization. SIAM J Optim 2:21–42 15. Gill PE, Murray W, Ponceleon DB, Saunders MA (1992) Preconditioners for indefinite systems arising in optimization. SIAM J Matrix Anal Appl 13:292–311 16. Gould NIM, Lucidi S, Roma M, Toint PhL (1999) Solving the trust-region subproblem using the Lanczos method. SIAM J Optim 9:504–525 17. Griewank A (1991) The global convergence of partitioned BFGS on problems with convex decomposition and Lipschitzian gradients. Math Program 50:141–175 18. Griewank A, Toint PhL (1982) Local convergence analysis of partitioned quasi-Newton updates. Numerische Math 39:429–448 19. Griewank A, Toint PhL (1982) Partitioned variable metric updates for large structured optimization problems. Numerische Math 39:119–137 20. Grippo L, Lampariello F, Lucidi S (1989) A truncated Newton method with nonmonotone linesearch for unconstrained optimization. J Optim Th Appl 60:401–419 21. Grippo L, Lucidi S (1997) A globally convergent version of the Polak–Ribière conjugate gradient method. Math Program 78:375–391 22. Harwell Subroutine Library (1998) A catalogue of subroutines. AEA Techn. 23. Hestenes MR (1980) Conjugate direction methods in optimization. Springer, Berlin 24. Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program 45:503–528 25. Lucidi S, Rochetich F, Roma M (1998) Curvilinear stabilization techniques for truncated Newton methods in large scale unconstrained optimization. SIAM J Optim 8:916– 939 26. Lucidi S, Roma M (1997) Numerical experiences with new truncated Newton methods in large scale unconstrained optimization. Comput Optim Appl 7:71–87

L

27. Moré JJ, Sorensen DC (1984) Newton’s method. In: Golub GH (ed) Studies in Numerical Analysis. Math. Assoc. Amer., Washington, DC, pp 29–82 28. Nash SG (1984) Newton-type minimization via the Lanczos method. SIAM J Numer Anal 21:770–788 29. Nash SG (1985) Preconditioning of truncated-Newton methods. SIAM J Sci Statist Comput 6:599–616 30. Nash SG, Nocedal J (1991) A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. SIAM J Optim 1:358–372 31. Nash SG, Sofer A (1990) Assessing a search direction within a truncated-Newton method. Oper Res Lett 9:219–221 32. Nocedal J (1980) Updating quasi-Newton matrices with limited storage. Math Comput 35:773–782 33. Nocedal J (1990) The performance of several algorithms for large-scale unconstrained optimization. In: Coleman TF, Li Y (eds) Large-scale Numerical Optimization. SIAM, Philadelphia, pp 138–151 34. Nocedal J (1992) Theory and algorithms for unconstrained optimization. Acta Numer 1:199–242 35. O’Leary DP (1982) A discrete Newton algorithm for minimizing a function of many variables. Math Program 23:20– 33 36. Paige CC, Saunders MA (1975) Solution of sparse indefinite systems of linear equations. SIAM J Numer Anal 12:617– 629 37. Polak E, Ribière G (1969) Note sur la convergence de methodes de directions conjugées. Revue Franc Inform et Rech Oper 16:35–43 38. Powell MJD (1970) A new algorithm for unconstrained optimization. In: Mangasarian OL, Ritter K (eds) Nonlinear programming. Acad. Press, New York, pp 31–65 39. Raydan M (1997) The Barzilai and Borwein gradient method for large scale unconstrained minimization problems. SIAM J Optim 7:26–33 40. Schlick T (1993) Modified Cholesky factorization for sparse preconditioners. SIAM J Sci Comput 14:424–445 41. Schlick T, Fogelson A (1992) TNPACK – A truncated Newton package for large-scale problems: I. Algorithm and usage. ACM Trans Math Softw 18:46–70 42. Steihaug T (1983) The conjugate gradient method and trust regions in large-scale optimization. SIAM J Numer Anal 20:626–637 43. Stoer J (1983) Solution of large linear systems of equations by conjugate gradient type methods. In: Bachem A, Grötschel M, Korte B (eds) Mathematical Programming. The State of the Art. Springer, Berlin, pp 540–565 44. Toint PhL (1981) Towards an efficient sparsity exploiting Newton method for minimization. In: Duff IS (ed) Sparse Matrices and Their Uses. Acad. Press, New York, pp 57–88 45. Zou X, Navon IM, Berger M, Phua KH, Schlick T, Dimet FX (1993) Numerical experience with limited-memory quasiNewton and truncated Newton methods. SIAM J Optim 3:582–608

1837

1838

L

L-convex Functions and M-convex Functions

46. Zoutendijk G (1970) Nonlinear programming computational methods. In: Abadie J (ed) Integer and Nonlinear Programming. North-Holland, Amsterdam, pp 37–86

= {p 2 ZV : g(p) < +1}, called the effective domain of g. A function g: ZV ! Z [ {+1} with dom g 6D ; is called L-convex if g(p) C g(q)  g(p _ q) C g(p ^ q) (p; q 2 ZV );

L-convex Functions and M-convex Functions KAZUO MUROTA Res. Institute Math. Sci. Kyoto University, Kyoto, Japan MSC2000: 90C27, 90C25, 90C10, 90C35 Article Outline Keywords Definitions of L- and M-Convexity L-Convex Sets M-Convex Sets Properties of L-Convex Functions Properties of M-Convex Functions L\ - and M\ -Convexity Duality Network Duality Subdifferentials Algorithms Applications See also References Keywords L-convexity; M-convexity; Discrete convex analysis; Submodular function; Matroid In the field of nonlinear programming (in continuous variables), convex analysis [20,21] plays a pivotal role both in theory and in practice. An analogous theory for discrete optimization (nonlinear integer programming), called ‘discrete convex analysis’ [15,16], is developed for L-convex and M-convex functions by adapting the ideas in convex analysis and generalizing the results in matroid theory. The L- and M-convex functions are introduced in [15] and [12,18], respectively. Definitions of L- and M-Convexity Let V be a nonempty finite set and Z be the set of integers. For any function g: ZV ! Z [{+1} define dom g

9r 2 Z : g(p C 1) D g(p) C r

(p 2 ZV );

where p _ q = (max(p(v), q(v)) |v 2 V) 2 ZV , p ^ q = (minp(v), q(v))|v 2 V) 2 ZV , and 1 is the vector in ZV with all components being equal to 1. A set D ZV is said to be an L-convex set if its indicator function ı D (defined by ı D (p) = 0 if p 2 D, and = + 1 otherwise) is an L-convex function, i. e., if i) D 6D ;; ii) p, q 2 D ) p _ q, p ^ q 2 D; and iii) p 2 D ) p ˙ 1 2 D. A function f : ZV ! Z [ {+1} with dom f 6D ; is called M-convex if it satisfies  M-EXC) For x, y 2 dom f and u 2 supp+ (x y), there exists v 2 supp (x  y) such that f (x) C f (y)  f (x  u C v ) C f (y C u  v ); where, for any u 2 V, u is the characteristic vector of u (defined by u (v) = 1 if v = u, and = 0 otherwise), and suppC (z) D fv 2 V : z(v) > 0g

(z 2 ZV );

supp (z) D fv 2 V : z(v) < 0g

(z 2 ZV ):

A set B ZV is said to be an M-convex set if its indicator function is an M-convex function, i. e., if B satisfies  B-EXC) For x, y 2 B and for u 2 supp+ (x  y), there exists v 2 supp (x  y) such that x  u + v 2 B and y + u  v 2 B. This means that an M-convex set is the same as the set of integer points of the base polyhedron of an integral submodular system (see [8] for submodular systems). L-convexity and M-convexity are conjugate to each other under the integral Fenchel–Legendre transformation f 7! f  defined by ˚ f  (p) D sup hp; xi  f (x) : x 2 ZV ; p 2 ZV ; P where hp, xi = v 2 V p(v) x(v). That is, for L-convex function g and M-convex function f , it holds [15] that g  is M-convex, f  is L-convex, g  = g, and f  = f .

L

L-convex Functions and M-convex Functions

Example 1 (Minimum cost flow problem) L-convexity and M-convexity are inherent in the integer minimumcost flow problem, as pointed out in [12,15]. Let G = (V, A) be a graph with vertex set V and arc set A, and let T V be given. For : A ! Z its boundary @: V ! Z is defined by @(v) X˚ X (a) : a 2 ı C v  D f(a) : a 2 ı  vg (v 2 V ); where ı + v and ı  v denote the sets of out-going and incoming arcs incident to v, respectively. For e p: V ! Z its coboundary ıe p : A ! Z is defined by p(@ a) (a 2 A); ıe p(a) D e p(@C a)  e where @+ a and @ a mean the initial and terminal vertices of a, respectively. Denote the class of onedimensional discrete convex functions by C1 D f' : Z ! Z [ fC1gj dom ' ¤ ;;

'(t  1) C '(t C 1)  2'(t) (t 2 Z)g: For ' a 2 C1 (a 2 A), representing the arc-cost in terms of flow, the total cost function f : ZT ! Z [ {+1} defined by

f (x) D inf 

8 ˆ ˆ

;

< + 1 for any set function : 2V ! Z [ {+1}. For a set function , define 8 9 x(X)  (X) = < P() D x 2 RV : ; (8X  V); : ; x(V ) D (V) P where x(X) = v 2 X x(v). If  is submodular, P() is a nonempty integral polyhedron, B = P() \ ZV is an M-convex set, and (X) D sup fx(X) : x 2 P()g

(X V ):

Conversely, for any nonempty B ZV , define a set function  by (X) D sup fx(X) : x 2 Bg

(X V):

If B is M-convex, then  is submodular and B D P(). Thus there is a one-to-one correspondence between M-convex set B and submodular set function . In particular, B ZV is M-convex if and only if B = P()\ ZV for some submodular . The correspondence B $  is a restatement of a well-known fact [4,8]. For M-convex sets B1 , B2 ZV , it holds that B1 C B2 D B1 C B2 \ ZV and B1 \ B2 D B1 \ B2 . It is also true that a submodular set function  corresponds one-to-one to a positively homogeneous L-convex function g. The correspondence g 7!  is given by the restriction

(x 2 ZV ): (X) D g( X ) The correspondence between L-convex sets and positively homogeneous M-convex functions via functions with triangle inequality is a special case of the conjugacy relationship between L- and M-convex functions. M-Convex Sets An M-convex set B ZV has ‘no holes’ in the sense that B D B \ ZV . Hence it is natural to consider the polyhedral description of B, ‘M-convex polyhedron’. A set function : 2V ! Z [ {+1} is said to be submodular if (X) C (Y)  (X [ Y) C (X \ Y) (X; Y V ); where the inequality is satisfied if (X) or (Y) is equal to +1. It is assumed throughout that (;) = 0 and (V)

(X V )

( X is the characteristic vector of X), whereas  7! g by the Lovász extension (explained below). The correspondence between M-convex sets and positively homogeneous L-convex functions via submodular set functions is a special case of the conjugacy relationship between M- and L-convex functions. For a set function : 2V ! Z [ { + 1}, the Lovász extension [11] of  is a function b  : RV ! R [ fC1g defined by b (p) D

n X (p j  p jC1 )(Vj ) (p 2 RV ); jD1

where, for each p 2 RV , the elements of V are indexed as {v1 , . . . , vn } (with n = |V|) in such a way that p(v1 )      p(vn ); pj = p(vj ), V j = {v1 , . . . , vj } for j = 1, . . . , n, and

L-convex Functions and M-convex Functions

pn + 1 = 0. The right-hand side of the above expression is equal to + 1 if and only if pj  pj + 1 > 0 and (V j ) = +1 for some j with 1  j  n 1. The Lovász extension b  is indeed an extension of , since b ( X ) D (X) for X V. The relationship between submodularity and convexity is revealed by the statement [11] that a set function  is submodular if and only if its Lovász extension b  is convex. The restriction to ZV of the Lovász extension of a submodular set function is a positively homogeneous L-convex function, and any positively homogeneous L-convex function can be obtained in this way [15]. Properties of L-Convex Functions For any g: ZV ! Z [ {+ 1} and x 2 RV , define g[ x]: ZV ! R [ {+1} by g[x](p) D g(p)  hp; xi

(p 2 ZV ):

The set of the minimizers of g[ x] is denoted as argmin(g[ x]). Let g: ZV ! Z [ {+1} be L-convex. Then dom g is an L-convex set. For each p 2 dom g,

for p, q 2 ZV , where dpe (or bpc) for any p 2 RV denotes the vector obtained by rounding up (or down) the components of p to the nearest integers. The minimum of an L-convex function g is characterized by the local minimality in the sense that, for p 2 dom g, g(p)  g(q) for all q 2 ZV if and only if g(p + 1) = g(p)  g(p + X ) for all X V. The minimizers of an L-convex function, if nonempty, form an L-convex set. For any x 2 RV , argmin (g[ x]), if nonempty, is an L-convex set. Conversely, this property characterizes L-convex functions under an auxiliary assumption. A number of operations can be defined for L-convex functions [15,16]. For x 2 ZV , g[ x] is an L-convex function. For a 2 ZV and ˇ 2 Z, g(a + ˇ p) is L-convex in p. For U V, the projection of g to U: o n (p0 2 ZU ) g U (p0 ) D inf g(p0 ; p00 ) : p00 2 ZV nU is L-convex in p0 , provided that g U >  1. For (v 2 V), # " X e g(p) D inf g(q) C v (p(v)  q(v)) q2Z V

 p (X) D g(p C X )  g(p)

(X V)

is a submodular set function with p (;) = 0 and p (V) < + 1. An L-convex function g can be extended to a convex function g : RV ! R [ fC1g through the Lovász extension of the submodular set functions p for p 2 dom g. Namely, for p 2 dom g and q 2 [0, 1]V , it holds [15] that g(p C q) n X (q j  q jC1 )(g(p C V j )  g(p)); D g(p) C jD1

where, for each q, the elements of V are indexed as {v1 , . . . , vn } (with n = |V|) in such a way that q(v1 )      q(vn ); qj = q(vj ), V j = {v1 , . . . , vj } for j = 1, . . . , n, and qn + 1 = 0. The expression of g shows that an L-convex function is an integrally convex function in the sense of [5]. An L-convex function g enjoys discrete midpoint convexity:  

  pCq pCq Cg g(p) C g(q)  g 2 2

L

v

2 C1

v2V

g > 1. The is L-convex in p 2 ZV , provided that e sum of two (or more) L-convex functions is L-convex, provided that its effective domain is nonempty. Properties of M-Convex Functions Let f : ZV ! Z[ {+1} be M-convex. Then dom f is an M-convex set. For each x 2 dom f , x (u; v) D f (x  u C v )  f (x) (u; v 2 V ) satisfies [16] triangle inequality. An M-convex function f can be extended to a convex function f : RV ! R [ fC1g, and the value of f (x) for x 2 RV is determined by {f (y): y 2 ZV , bxc  y  dxe. That is, an M-convex function is an integrally convex function in the sense of [5]. The minimum of an M-convex function f is characterized by the local minimality in the sense that for x 2 dom f , f (x)  f (y) for all y 2 ZV if and only if f (x)  f (x  u + v ) for all u, v 2 V [12,15,18]. The minimizers of an M-convex function, if nonempty, form an M-convex set. Moreover, for any

1841

1842

L

L-convex Functions and M-convex Functions

p 2 RV , argmin(f [p]), if nonempty, is an M-convex set. Conversely, this property characterizes M-convex functions, under an auxiliary assumption that the effective domain is bounded or the function can be extended to a convex function over RV (see [12,15]). The level set of an M-convex function is not necessarily an M-convex set, but enjoys a weaker exchange property. Namely, for any p 2 RV and ˛ 2 R, S = {x 2 ZV : f [p](x)  ˛} (the level set of f [p]) satisfies: For x, y 2 S and for u 2 supp+ (x  y), there exists v 2 supp (x  y) such that either x  u + v 2 S or y + u  v 2 S. Conversely, this property characterizes M-convex functions [25]. A number of operations can be defined for M-convex functions [15,16]. For p 2 ZV , f [ p] is an M-convex function. For a 2 ZV , f (a  x) and f (a + x) are M-convex in x. For U V, the restriction of f to U: f U (x 0 ) D f (x 0 ; 0V nU )

(x 0 2 ZU )

(where 0V \ U is the zero vector in ZV \ U ) is M-convex in x0 , provided that dom f U 6D ;. For ' v 2 C1 (v 2 V), e f (x) D f (x) C

X

L\ - and M\ -Convexity L\ - and M\ -convexity are variants of, and essentially equivalent to, L- and M-convexity, respectively. L\ - and M\ -convex functions are introduced in [9] and [19], respectively. e D Let v0 be a new element not in V and define V V fv0 g [ V . A function g: Z ! Z[{+1} with dom g 6D ; is called L\ -convex if it is expressed in terms of an V ! Z [ fC1g as g(p) D L-convex function e g : Ze \ e g(0; p). Namely, an L -convex function is a function obtained as the restriction of an L-convex function. Conversely, an L\ -convex function determines the corresponding L-convex function up to the constant r in the definition of L-convex function. An L\ -convex function is essentially the same as a submodular integrally convex function of [5], and hence is characterized by discrete midpoint convexity [9]. An L-convex function, enjoying discrete midpoint convexity, is an L\ -convex function. Quadratic function g(p) D

v2V

is M-convex, provided that dom e f ¤ ;. In particular, P a separable convex function e f (x) D v2V 'v (x(v)) with dom e f being an M-convex set is an M-convex function. For two M-convex functions f 1 and f 2 , the integral convolution

x D x1 C x2 x1 ; x2 2 ZV

ai j pi p j

(p 2 Zn )

iD1 jD1

'v (x(v)) (x 2 ZV )

( f 1  f 2 )(x)  D inf f 1 (x1 ) C f 2 (x2 ) :

n n X X

(x 2 ZV )

is either M-convex or else (f 1  f 2 )(x) = ˙ 1 for all x 2 ZV . Sum of two M-convex functions is not necessarily M-convex; such function with nonempty effective domain is called M 2 -convex. Convolution of two L-convex functions is not necessarily L-convex; such function with nonempty effective domain is called L2 -convex. M2 - and L2 -convex functions are in one-to-one correspondence through the integral Fenchel–Legendre transformation.

with aij = aji 2 Z is L\ -convex if and only if aij  0 (i 6D P j) and njD1 aij  0 (i = 1,    , n). For { i 2 C1 : i = 1, . . . , n}, a separable convex function g(p) D

n X

i (p i )

(p 2 Zn )

iD1

is L\ -convex. The properties of L-convex functions mentioned above are carried over, mutatis mutandis, to L\ -convex functions. In addition, the restriction of an L\ -convex function g to U V, denoted g U , is L\ -convex. A subset of ZV is called an L\ -convex set if its indicator function is an L\ -convex function. A set E ZV is an L\ -convex set if and only if 

 pCq pCq ; 2 E: p; q 2 E ) 2 2 A function f : ZV ! Z [{+1} with dom f 6D ; is called M \ -convex if it is expressed in terms of an M-conV ! Z [ fC1g as vex function e f : Ze X 8 x(u) D 0 < f (x) if x0 C e f (x0 ; x) D u2V : C1 otherwise:

L

L-convex Functions and M-convex Functions

Namely, an M\ -convex function is a function obtained as the projection of an M-convex function. Conversely, an M\ -convex function determines the corresponding M-convex function up to a translation of dom f in the direction of v0 . A function f : ZV ! Z [ {+1} with dom f 6D ; is M\ -convex if and only if (see [19]) it satisfies  M\ -EXC) For x, y 2 dom f and u 2 supp+ (x  y), f (x) C f (y)   min f (x  u ) C f (y C u ); min 

v2supp (xy)

 f f (x  u C v ) C f (y C u  v )g :

Since M-EXC) implies M\ -EXC), an M-convex function is an M\ -convex function. Quadratic function f (x) D

n X

ai xi 2 C b

iD1

X

(x 2 Zn )

xi x j

i< j \

with ai 2 Z (1  i  n), b 2 Z is M -convex if 0  b  2 min1  i  n ai (cf. [19]). For {' i 2 C1 : i = 0, . . . , n}, a function of the form ! n n X X xi C ' i (x i ) (x 2 Zn ) f (x) D '0 iD1

iD1

is M\ -convex [19]; a separable convex function is a special case of this (with ' 0 = 0). More generally, for {' X 2 C1 : X 2 T} indexed by a laminar family T 2V , the function X ' X (x(X)) (x 2 ZV ) f (x) D X2T

is M -convex [1], where T is called laminar if for any X, Y 2 T, at least one of X \ Y, X \ Y, Y \ X is empty. The properties of M-convex functions mentioned above are carried over, mutatis mutandis, to M\ -convex functions. In addition, the projection of an M\ -convex function f to U V, denoted f U , is M\ -convex. A subset of ZV is called an M \ -convex set if its indicator function is an M\ -convex function. A set Q ZV is an M\ -convex set if and only if Q is the set of integer points of an integral generalized polymatroid (cf. [7] for generalized polymatroids). As a consequence of the conjugacy between Land M-convexity, L\ -convex functions and M\ -convex functions are conjugate to each other under the integral Fenchel–Legendre transformation.

Duality Discrete duality theorems hold true for L-convex/ concave and M-convex/concave functions. A function g: ZV ! Z [ {1} is called L-concave (respectively, L\ -, M-, or M\ -concave) if g is L-convex (respectively, L\ -, M-, or M\ -convex); dom g means the effective domain of g. The concave counterpart of the discrete Fenchel–Legendre transform is defined as ˚ (p 2 ZV ): g ı (p) D inf hp; xi  g(x) : x 2 ZV A discrete separation theorem for L-convex/ concave functions, named L-separation theorem [15] (see also [9]), reads as follows. Let f : ZV ! Z [ {+1} be an L\ -convex function and g: ZV ! Z [ { 1} be an L\ -concave function such that dom f \ dom g 6D ; or dom f  \ dom g° 6D ;. If f (p)  g(p) (p 2 ZV ), there exist ˇ  2 Z and x 2 ZV such that f (p)  ˇ  C hp; x  i  g(p)

(p 2 ZV ):

Since a submodular set function can be identified with a positively homogeneous L-convex function, the L-separation theorem implies Frank’s discrete separation theorem for a pair of sub/supermodular functions [6], which reads as follows. Let : 2V ! Z [ {+1} and : 2V ! Z [ {1} be submodular and supermodular functions, respectively, with (;) = (;) = 0, (V) < +1, (V)> 1, where  is called supermodular if  is submodular. If (X)  (X) (X V), there exists x 2 ZV such that (X)  x  (X)  (X)

(X V):

\

Another discrete separation theorem, M-separation theorem [12,15] (see also [9]), holds true for M-convex/concave functions. Namely, let f : ZV ! Z [ {+1} be an M\ -convex function and g: ZV ! Z [ {1} be an M\ -concave function such that dom f \ dom g 6D ; or dom f  \ dom g° 6D ;. If f (x)  g(x) (x 2 ZV ), there exist ˛  2 Z and p 2 ZV such that f (x)  ˛  C hp ; xi  g(x)

(x 2 ZV ):

The L- and M-separation theorems are conjugate to each other, while a self-conjugate statement can be made in the form of the Fenchel-type duality [12,15], as follows. Let f : ZV ! Z [ {+1} be an L\ -convex function and g: ZV ! Z [ {1} be an L\ -concave function

1843

1844

L

L-convex Functions and M-convex Functions

such that dom f \ dom g 6D ; or dom f  \ dom g° 6D ;. Then it holds that ˚ inf f (p)  g(p) : p 2 ZV ˚ D sup g ı (x)  f  (x) : x 2 ZV : Moreover, if this common value is finite, the infimum is attained by some p 2 dom f \ dom g and the supremum is attained by some x 2 dom f  \ dom g°. Example 3 Here is a simple example to illustrate the subtlety of discrete separation for discrete functions. Functions f : Z2 ! Z and g: Z2 ! Z defined by f (x1 , x2 ) = max(0, x1 + x2 ) and g(x1 , x2 ) = min(x1 , x2 ) can be extended respectively to a convex function f : R2 ! R and a concave function g : R2 ! R according to the defining expressions. With p D ( 12 ; 12 ), we have f (x)  hp; xi  g(x) for all x 2 R2 , and a fortiori, f (x)  hp; xi  g(x) for all x 2 Z2 . However, there exists no integral vector p 2 Z2 such that f (x) hp, xi  g(x) for all x 2 Z2 . Note also that f is M\ -convex and g is L-concave. Network Duality A conjugate pair of M- and L-convex functions can be transformed through a network ([12,16]; see also [23]). Let G = (V, A) be a directed graph with arc set A and vertex set V partitioned into three disjoint parts as V = V + [ V 0 [ V  . For ' a 2 C1 (a 2 A) and M-convex f : C  f : ZV ! Z [ f˙1g by ZV ! Z [ {+1}, define e e f (y) D inf ;x 8 ˆ < X f (x) C ' a ((a)) : ˆ : a2A

9 > @ D (x; 0; y) = C 0  : 2 ZV [V [V > ; A  2Z C

a ((a)) :

 D ı(p; r; q)  2 ZA C 0  (p; r; q) 2 ZV [V [V

Subdifferentials The subdifferential of f : ZV ! Z [ {+ 1} at x 2 dom f is defined by {p 2 RV : f (y) f (x) hp, y  xi (8y 2 ZV )}. The subdifferential of an L2 - or M2 -convex function forms an integral polyhedron. More specifically:  The subdifferential of an L-convex function is an integral base polyhedron (an M-convex polyhedron).  The subdifferential of an L2 -convex function is the intersection of two integral base polyhedra (Mconvex polyhedra).  The subdifferential of an M-convex function is an L-convex polyhedron.  The subdifferential of an M2 -convex function is the Minkowski sum of two L-convex polyhedra. Similar statements hold true with L and M replaced respectively by L\ and M\ . Algorithms

For a 2 C1 (a 2 A) and L-convex g: ZV ! Z [{+1},  define e g : ZV ! Z [ f˙1g by e g(q) D inf ;p;r 8 ˆ < X g(p) C ˆ : a2A

g De f  . A special case (V + = V) of the ' a (a 2 A), then e last statement yields the network duality: 9 8 @ D x; = < inf ˚(x; ) : x 2 ZV ; ; :  2 ZA 8 9  D ı p; = < D sup  (p; ) : p 2 ZV ; ; : A ; 2Z P where ˚(x, ) = f (x)+ a 2 A ' a ((a)),  (p, ) = P g(p) a 2 A a ((a)) and the finiteness of inf ˚ or sup  is assumed. The network duality is equivalent to the Fenchel-type duality.

9 > = > ;

:

Then e f is M-convex, provided that e f > 1, and e g is L-convex, provided that e g > 1. If g = f  and a =

On the basis of the equivalence of L\ -convex functions and submodular integrally convex functions, the minimization of an L-convex function can be done by the algorithm of [5], which relies on the ellipsoid method. The minimization of an M-convex function can be done by purely combinatorial algorithms; a greedy-type algorithm [2] for valuated matroids and a domain reduction-type polynomial time algorithm [24] for M-convex functions. Algorithms for duality of M-convex functions (in other words, for M2 convex functions) are also developed; polynomial algorithms [14,22] for valuated matroids, and a finite primal algorithm [18] and a polynomial time conjugate-scaling algorithm [10] for the submodular flow problem.

LCP: Pardalos–Rosen Mixed Integer Formulation

Applications A discrete analog of the conjugate duality framework [21] for nonlinear optimization is developed in [15]. An application of M-convex functions to engineering system analysis and matrix theory is in [13,17]. M-convex functions find applications also in mathematical economics [1]. See also  Generalized Concavity in Multi-objective Optimization  Invexity and its Applications  Isotonic Regression Problems References 1. Danilov V, Koshevoy G, Murota K (May 1998) Equilibria in economies with indivisible goods and money. RIMS Preprint Kyoto Univ 1204 2. Dress AWM, Wenzel W (1990) Valuated matroid: A new look at the greedy algorithm. Appl Math Lett 3(2):33–35 3. Dress AWM, Wenzel W (1992) Valuated matroids. Adv Math 93:214–250 4. Edmonds J (1970) Submodular functions, matroids and certain polyhedra. In: Guy R, Hanani H, Sauer N, Schönheim J (eds) Combinatorial Structures and Their Applications. Gordon and Breach, New York, pp 69–87 5. Favati P, Tardella F (1990) Convexity in nonlinear integer programming. Ricerca Oper 53:3–44 6. Frank A (1982) An algorithm for submodular functions on graphs. Ann Discret Math 16:97–120 7. Frank A, Tardos É (1988) Generalized polymatroids and submodular flows. Math Program 42:489–563 8. Fujishige S (1991) Submodular functions and optimization, vol 47. North-Holland, Amsterdam 9. Fujishige S, Murota K (2000) Notes on L-/M-convex functions and the separation theorems. Math Program 88:129– 146 10. Iwata S, Shigeno M (1998) Conjugate scaling technique for Fenchel-type duality in discrete optimization. IPSJ SIG Notes 98-AL-65 11. Lovász L (1983) Submodular functions and convexity. In: Bachem A, Grötschel M, Korte B (eds) Mathematical Programming – The State of the Art. Springer, Berlin, pp 235– 257 12. Murota K (1996) Convexity and Steinitz’s exchange property. Adv Math 124:272–311 13. Murota K (1996) Structural approach in systems analysis by mixed matrices – An exposition for index of DAE. In: Kirchgässner K, Mahrenholtz O, Mennicken R (eds) ICIAM 95. Math Res. Akad. Verlag, Berlin, pp 257–279

L

14. Murota K (1996) Valuated matroid intersection, I: optimality criteria, II: algorithms. SIAM J Discret Math 9:545–561, 562–576 15. Murota K (1998) Discrete convex analysis. Math Program 83:313–371 16. Murota K (1998) Discrete convex analysis. In: Fujishige S (ed) Discrete Structures and Algorithms, vol V. KindaiKagaku-sha, Tokyo, pp 51–100 (In Japanese.) 17. Murota K (1999) On the degree of mixed polynomial matrices. SIAM J Matrix Anal Appl 20:196–227 18. Murota K (1999) Submodular flow problem with a nonseparable cost function. Combinatorica 19:87–109 19. Murota K, Shioura A (1999) M-convex function on generalized polymatroid. Math Oper Res 24:95–105 20. Rockafellar RT (1970) Convex analysis. Princeton Univ. Press, Princeton 21. Rockafellar RT (1974) Conjugate duality and optimization. SIAM Regional Conf Appl Math, vol 16. SIAM, Philadelphia 22. Shigeno M (1996) A dual approximation approach to matroid optimization problems. PhD Thesis Tokyo Inst. Techn. 23. Shioura A (1998) A constructive proof for the induction of M-convex functions through networks. Discrete Appl Math 82:271–278 24. Shioura A (1998) Minimization of an M-convex function. Discrete Appl Math 84:215–220 25. Shioura A (2000) Level set characterization of M-convex functions. IEICE Trans Fundam Electronics, Commun and Comput Sci E83-A:586–589

LCP: Pardalos–Rosen Mixed Integer Formulation PANOS M. PARDALOS Center for Applied Optim. Department Industrial and Systems Engineering, University Florida, Gainesville, USA MSC2000: 90C33, 90C11 Article Outline Keywords See also References Keywords Linear complementarity problem; Mixed integer programming; Bimatrix games; Mixed integer problem; Minimum norm solution

1845

1846

L

LCP: Pardalos–Rosen Mixed Integer Formulation

In this article we consider the general linear complementarity problem (LCP) of finding a vector x 2 Rn such that Mx C q  0;

x  0;

x > Mx C q> x D 0

(or proving that such an x does not exist), where M is an n × n rational matrix and q 2 Rn is a rational vector. For given data M and q, the problem is generally denoted by LCP(M, q). The LCP unifies a number of important problems in operations research. In particular, it generalizes the primal-dual linear programming problem, convex quadratic programming, and bimatrix games [1,2]. For the general matrix M, where S = {x: Mx + q  0, x  0} can be bounded or unbounded, the LCP can always be solved by solving a specific zero-one, linear, mixed integer problem with n zero-one variables. Consider the following mixed zero-one integer problem: 8 ˆ max ˛ ˆ ˆ ˛;y;z ˆ ˆ < s.t. 0  My C ˛q  e  z; (MIP) ˆ ˆ ˛  0; 0  y  z; ˆ ˆ ˆ : z 2 f0; 1gn : Theorem 1 Let (˛  , y , z ) be any optimal solution of (MIP). If ˛  > 0, then x = y /˛  solves the LCP. If in the optimal solution ˛  = 0, then the LCP has no solution. The equivalent mixed integer programming formulation (MIP) was first given in [3]. Every feasible point (˛, y, z) of (MIP), with ˛ > 0, corresponds to a solution of LCP. Therefore, solving (MIP), we may generate several solutions of the corresponding LCP. J.B. Rosen [4] proved that the solution obtained by solving (MIP) is the minimum norm solution to the linear complementarity problem. See also  Branch and Price: Integer Programming with Column Generation  Convex-simplex Algorithm  Decomposition Techniques for MILP: Lagrangian Relaxation  Equivalence Between Nonlinear Complementarity Problem and Fixed Point Problem  Generalized Nonlinear Complementarity Problem

 Integer Linear Complementary Problem  Integer Programming  Integer Programming: Algebraic Methods  Integer Programming: Branch and Bound Methods  Integer Programming: Branch and Cut Algorithms  Integer Programming: Cutting Plane Algorithms  Integer Programming Duality  Integer Programming: Lagrangian Relaxation  Lemke Method  Linear Complementarity Problem  Linear Programming  Mixed Integer Classification Problems  Multi-objective Integer Linear Programming  Multi-objective Mixed Integer Programming  Multiparametric Mixed Integer Linear Programming  Order Complementarity  Parametric Linear Programming: Cost Simplex Algorithm  Parametric Mixed Integer Nonlinear Optimization  Principal Pivoting Methods for Linear Complementarity Problems  Sequential Simplex Method  Set Covering, Packing and Partitioning Problems  Simplicial Pivoting Algorithms for Integer Programming  Stochastic Integer Programming: Continuity, Stability, Rates of Convergence  Stochastic Integer Programs  Time-dependent Traveling Salesman Problem  Topological Methods in Complementarity Theory

References 1. Cottle RW, Dantzig GB (1968) Complementarity pivot theory of mathematical programming. In: Dantzig GB, Veinott AF (eds) Mathematics of the Decision Sci., Part 1. Amer. Math. Soc., Providence, RI, pp 115–136 2. Horst R, Pardalos PM, Thoai NV (1995) Introduction to global optimization. Kluwer, Dordrecht 3. Pardalos PM, Rosen JB (1988) Global optimization approach to the linear complementarity problem. SIAM J Sci Statist Comput 9(2):341–353 4. Rosen JB (1990) Minimum norm solution to the linear complementarity problem. In: Leifman LJ (ed) Functional Analysis, Optimization and Mathematical Economics. Oxford Univ. Press, Oxford, pp 208–216

Least-index Anticycling Rules

Least-index Anticycling Rules LindAcR TAMÁS TERLAKY Department Comput. & Software, McMaster University, West Hamilton, Canada MSC2000: 90C05, 90C33, 90C20, 05B35 Article Outline Keywords Consistent Labeling For the Max-Flow Problem Linear Optimization Least-Index Rules for Feasibility Problem The Linear Optimization Problem Least-Index Pivoting Methods for LO

Linear Complementarity Problems Least-Index Rules and Oriented Matroids See also References

L

work (N, A, u) is given, where N, the set of nodes, is a finite set; A  N × N is the set of directed arcs; finally, u 2 RA denotes the nonnegative capacity upper bound for flows through the arcs. Let further s, t 2 N be specified as the source and the sink in the network. A vector f 2 RA is a flow in the network, if the incoming flow at each node, different from the source and the sink, is equal to the flow going out from the node. The goal is to find a maximal flow, namely a flow for which the total flow flowing out of the source or, equivalently, flowing in to the sink is the largest possible. The Ford–Fulkerson algorithm is the best known algorithm to find such a maximal flow. It is based on generating augmenting path’s subsequently. A path P connecting the source s and the sink t is a finite subset of arcs, where the source is the tail of the first arc; the sink is the head of the last arc; finally, the tail of an arc is always equal to the head of its predecessor. For ease of simplicity let us assume that if (v1 , v2 ) 2 A, then (v2 , v1 ) 2 A as well. If the opposite arc were not present, we can introduce it with zero capacity.

Keywords Pivot rules; Anticycling; Least-index; Recursion; Oriented matroids From the early days of mathematical optimization people were looking for simple rules that ensure that certain algorithms terminate in a finite number of steps. Specifically, on combinatorial structures the lack of finite termination imply that the algorithm cycles, i. e. periodically visits the same solutions. That is why rules ensuring finite termination of algorithms on finite structures are frequently referred to as anticycling rules. One frequently used anticycling rule in linear optimization (cf.  Linear programming) is the so-called lexicographic pivoting rule [9]. The other large class of anticycling procedures, the ‘least-index’ rules, is the subject of this paper. least-index rules were designed for network flow problems, linear optimization problems, linear complementarity problems and oriented matroid programming problems. These classes will be considered in the sequel. Consistent Labeling For the Max-Flow Problem The maximal flow problem (see e. g. [11]; [24]) is one of the basic problems of mathematical programming. The problem is given as follows. A directed capacitated net-

0

1

2

3

Initialization. Let f be equal to zero. Let a free capacity network (N; A; u) be defined. Initially let A = fa 2 A : u a > 0g and u = u. Augmenting path. Let P be a path from s to t in the free capacity network. IF no such path exists, THEN STOP; A maximal flow is obtained. Augmenting the flow. Let # be the minimum of the arc capacities along the path P. Clearly # > 0. Increase the flow f on each arc of P by #. Update the free-capacity network. Decrease (increase) u a by # if the (opposite) of arc a is on the path P. Let A = fa 2 A : u a > 0g. Go to Step 1.

The Ford–Fulkerson max-flow algorithm

At each iteration cycle the flow value strictly increases. Thus, if the vector u is integral and the maxflow problem is bounded, then the Ford–Fulkerson algorithm provides a maximal flow in a finite number of steps. However, if the vector u contains irrational com-

1847

1848

L

Least-index Anticycling Rules

ponents, then the algorithm does not terminate in a finite number of steps and, even worse, it might converge to a nonoptimal flow. For such an example see [11,24]. An elegant solution for this problem is the consistent labeling algorithm of A.W. Tucker [28]. This most simple refinement reads as follows:

Least-Index Rules for Feasibility Problem The feasibility problem Ax D b;

and its alternative pair b > y > 0;

Be consistent at any time during the algorithm, specifically when building the augmenting path by using the labeling procedure. Whenever a labeled but unscanned subset of nodes is given during the procedure pick always the same from the same subset to be scanned. Particularly, if we assign an index to each node, then we are supposed to choose always the least-indexed node among the possibilities.

Tucker writes [28]: ‘Fulkerson (unpublished) conjectured that a consistent labeling procedure would be polynomially bounded; a proof of this conjecture appears to be very difficult.’ Linear Optimization Before discussing the general LO problem, first the linear feasibility problem is considered. 0

1

2

3

Initialization. Let T(B) be an arbitrary basis tableau and fix an arbitrary ordering of the variables. Leaving variable selection. Let K P be the set of the indices of the infeasible variables in the basis. IF K P = ;, THEN STOP; the feasibility problem is solved. ELSE, let p be the least-index in K P and then x p will leave the basis. Entering variable selection. Let K D be the set of the column indices of the negative elements in row p of T(B). IF K D = ;, THEN STOP; Row p of the tableau T(B) gives an evidence that the feasibility problem is inconsistent and row p of the inverse basis is a solution of the alternative system. ELSE, let q be the least-index in K D and then x q will enter the basis. Basis transformation. Pivot on (p; q). Go to Step 1.

Pivot rule

x  0;

A> y  0;

can be solved by a very simple least-index pivot algorithm. A fundamental result, the so-called Farkas lemma (cf. also  Farkas lemma;  Farkas lemma: Generalizations) [10] says that exactly one of the two alternative systems has a solution. This result is also known as the theorem of the alternatives. When a simple finite pivot rule gives a solution to either of the two alternatives, an elementary constructive proof for the Farkas lemma and its relatives is obtained. The above simple finite least-index pivot rule for the feasibility problem is a special case (see below) of Bland’s algorithm [5]. It is taken from [19] where the role of pivoting, and specifically the role of finite, least-index pivot rules in linear algebra is explored. The Linear Optimization Problem The general linear optimization (LO), linear programming (cf.  Linear programming), problem will be considered in the standard primal form ˚ min c > x : Ax D b; x  0 ; together with its standard dual ˚ max b > y : A> y  c : One of the most efficient, and for a long time the only, practical method to solve LO problems was the simplex method of G.B. Dantzig. The simplex method is a pivot algorithm that traverses through feasible basic solutions while the objective value is improving. The simplex method is in practice one of the most efficient algorithms but it is theoretically a finite algorithm only for nondegenerate problems. A basis is called primal degenerate if at least one of the basic variables is zero; it is called dual degenerate if the reduced cost of at least one nonbasic variable is zero. In general, the basis is degenerate if it is either primal or dual, or both primal and dual degenerate. The LO problem is degenerate, if it has a degenerate ba-

Least-index Anticycling Rules

sis. A pivot is called degenerate when after the pivot the objective remains unchanged. When the problem is degenerate the objective might stay the same in subsequent iterations and the simplex algorithm may cycle, i. e. starting from a basis, after some iterations the same basis is revisited and this process is repeated endlessly. Because the simplex method produces a sequence with monotonically improving objective values, the objective stays constant in a cycle, thus each pivot in the cycle must be degenerate. The possibility of cycling was recognized shortly after the invention of the simplex algorithm. Cycling examples were given by E.M.L. Beale [2] and by A.J. Hoffman [17]. Recently (1999) a scheme to construct cycling LO examples is presented in [15]. These examples made evident that extra techniques are needed to ensure finite termination of simplex methods. The first and widely used such tool is the class of lexicographic pivoting rules (cf.  Lexicographic pivoting rules). Other, more recent techniques are the leastindex anticycling rules and some more general recursive schemes.

0

1

2

3 Least-Index Pivoting Methods for LO Cycling of the simplex method is possible only when the LO problem is degenerate. In that case not only many variables might be eligible to enter, but also to leave the basis. The least-index primal simplex rule makes the selection of both the entering and the leaving variable uniquely determined. Least-index rules are based on consistent selection among the possibilities. The first such rule for the simplex method was published by R.G. Bland [4,5]. The least-index simplex method is finite. The finiteness proofs are quite elementary. All are based on the simple fact that there is a finite number of different basis tableaus. Further, orthogonality of the primal and dual spaces on some recursive argumentation is used [4, 5,27] It is straightforward to derive the least-index dual simplex algorithm. The only restriction relative to the dual simplex algorithm is, that when there are more candidates to leave or to enter the basis, always the least-indexed candidate has to be selected. An interesting use of least index-resolution is used in [18] by designing finite primal-dual type Hungarian methods for LO. Note that finite criss-cross rules (cf.

L

Initialization Let T(B) be a given primal feasible basis tableau and fix an arbitrary ordering of the variables. Entering variable selection. Let K D be the set of the indices of the dual infeasible variables, i.e. those with negative reduced cost. IF K D = ;, THEN STOP; The tableau T(B) is optimal and this way a pair of solutions is obtained. ELSE, let q be the least-index in K D and x q , will enter the basis. Leaving variable selection. Let K P be the set of the indices of those candidate pivot elements in column q that satisfy the usual pivot selection conditions of the primal simplex method. IF K P = ;, THEN STOP; the primal problem is unbounded, and so the dual problem is infeasible. ELSE, let p be the least-index in K P and then x p will leave the basis. Basis transformation. Pivot on (p; q). Go to Step 1.

The least-index primal simplex rule

also  Criss-cross pivoting rules) [14,26] make maximum possible use of least-index resolution. Least-index simplex methods are not polynomial, they might require exponential number of steps to solve a LO problem, as it was shown by D. Avis and V. Chvátal [1]. Their example is essentially the Klee–Minty polytope [21]. Another example, again on the Klee– Minty polytope, is Roos’s exponential example [25] for the least-index criss-cross method. Here the initial basis is feasible and, although it is not required, feasibility happens to be preserved, thus the criss-cross method reduces to a least index simplex method. Linear Complementarity Problems A linear complementarity problem (cf.  Linear complementarity problem) (LCP) is given as follows: Mx C s D t;

x; s  0;

x > s D 0:

Pivot algorithms are looking for a complementary basis solution of the LCP. A basis is called complementary, if

1849

1850

L

Least-index Anticycling Rules

exactly one of the complementary variables xi and si for all i is in the basis. The solvability of LCP depends on the properties of the matrix M. One of the simplest case is when M is a P-matrix. The matrix M is a P-matrix if all of its principal minors are positive. K.G. Murty [22] presented an utmost simple finite pivot algorithm for solving the Pmatrix LCP. This algorithm is a least-index principal pivot algorithm. Two extremal behaviors, exponential in the worst case and polynomial in average, of this finite pivot rule is studied in [13]. Finite least-index pivot rules are developed for larger classes of LCPs. All are least-index principal pivoting methods, some more classical feasibility preserving simplex type methods [7,8,23], others are leastindex criss-cross pivoting rules (cf.  Criss-cross pivoting rules) [6,16,20]. More details are given in  Principal pivoting methods for linear complementarity problems. 0

1

2

Initialization. Let T(B) be complementary basis tableau and fix an arbitrary ordering of the variables. (We can choose x = 0; s = t i.e., x nonbasic, s basic.) Leaving variable selection. Let K be the set of the infeasible variables. IF K = ; , THEN STOP; a complementary solution for LCP is obtained. ELSE, let p be the least-index in K. Basis transformation. Pivot on (p; p), i.e. replace the least-indexed infeasible variable in the basis by its complementary pair. Go to Step 1.

Murty’s Bard-type schema

Least-Index Rules and Oriented Matroids The least-index simplex method was originally designed for oriented matroid linear programming (cf. also  Oriented matroids) [3,4]. It turned soon out, that this is not a finite algorithm in the oriented matroid context. The reason is the possibility of nondegenerate cycling [3,12], a phenomenon what is impossible in the linear case. An apparent difference between the linear

and the oriented matroid context is that for oriented matroids none of the finite-, recursive- or least-indextype rules yield a simplex method, i. e. a pivot method that preserves feasibility of the basis throughout. This discrepancy is also due to the possibility of nondegenerate cycling. See also  Criss-cross Pivoting Rules  Lexicographic Pivoting Rules  Linear Programming  Pivoting Algorithms for Linear Programming Generating Two Paths  Principal Pivoting Methods for Linear Complementarity Problems  Probabilistic Analysis of Simplex Algorithms References 1. Avis D, Chvátal V (1978) Notes on Bland’s rule. Math Program Stud 8:24–34 2. Beale EML (1955) Cycling in the dual simplex algorithm. Naval Res Logist Quart 2:269–275 3. Bjorner A, Las Vergnas M, Sturmfels B, White N, Ziegler G (1993) Oriented matroids. Cambridge Univ. Press, Cambridge 4. Bland RG (1977) A combinatorial abstraction of linear programming. J Combin Th B 23:33–57 5. Bland RG (1977) New finite pivoting rules for the simplex method. Math Oper Res 2:103–107 6. Chang YY (1979) Least index resolution of degeneracy in linear complementarity problems. Techn Report Dept Oper Res Stanford Univ 79-14 7. Chang YY, Cottle RW (1980) Least index resolution of degeneracy in quadratic programming. Math Program 18:127–137 8. Cottle R, Pang JS, Stone RE (1992) The linear complementarity problem. Acad. Press, New York 9. Dantzig GB (1963) Linear programming and extensions. Princeton Univ. Press, Princeton 10. Farkas J (1902) Theorie der Einfachen Ungleichungen. J Reine Angew Math 124:1–27 11. Ford LR Jr, Fulkerson DR (1962) Network flows. Princeton Univ. Press, Princeton 12. Fukuda K (1982) Oriented matroid programming. PhD Thesis Waterloo Univ. 13. Fukuda K, Namiki M (1994) On extremal behaviors of Murty’s least index method. Math Program 64:365–370 14. Fukuda K, Terlaky T (1997) Criss-cross methods: A fresh view on pivot algorithms. In: Mathematical Programming, (B) Lectures on Mathematical Programming, ISMP97, vol 79. Lausanne, pp 369–396

Least Squares Orthogonal Polynomials

15. Hall J, McKinnon KI (1998) A class of cycling counterexamples to the EXPAND anti-cycling procedure. Techn. Report Dept. Math. and Statist. Univ. Edinburgh 16. Den Hertog D, Roos C, Terlaky T (1993) The linear complementarity problem, sufficient matrices and the criss-cross method. LAA 187:1–14 17. Hoffman AJ (1953) Cycling in the simplex method. Techn Report Nat Bureau Standards 2974 18. Klafszky E, Terlaky T (1989) Variants of the Hungarian method for solving linear programming problems. Math Oper Statist Ser Optim 20:79–91 19. Klafszky E, Terlaky T (1991) The role of pivoting in proving some fundamental theorems of linear algebra. LAA 151:97–118 20. Klafszky E, Terlaky T (1992) Some generalizations of the criss-cross method for quadratic programming. Math Oper Statist Ser Optim 24:127–139 21. Klee V, Minty GJ (1972) How good is the simplex algorithm? In: Shisha O (ed) Inequalities-III. Acad. Press, New York, pp 1159–1175 22. Murty KG (1974) A note on a Bard type scheme for solving the complementarity problem. Oper Res 11(2–3):123–130 23. Murty KG (1988) Linear complementarity, linear and nonlinear programming. Heldermann, Berlin 24. Murty KG (1992) Network programming. Prentice-Hall, Englewood Cliffs, NJ 25. Roos C (1990) An exponential example for Terlaky’s pivoting rule for the criss-cross simplex method. Math Program 46:78–94 26. Terlaky T (1985) A convergent criss-cross method. Math Oper Statist Ser Optim 16(5):683–690 27. Terlaky T, Zhang S (1993) Pivot rules for linear programming: A survey on recent theoretical developments. Ann Oper Res 46:203–233 28. Tucker A (1977) A note on convergence of the Ford– Fulkerson flow algorithm. Math Oper Res 2(2):143–144

Least Squares Orthogonal Polynomials CLAUDE BREZINSKI, ANA C. MATOS Lab. d’Anal. Numérique et d’Optimisation, Université Sci. et Techn. Lille Flandres–Artois, Lille, France MSC2000: 33C45, 65K10, 65F20, 65F22 Article Outline Keywords Existence and Uniqueness Computation

L

Location of the Zeros Applications See also References Keywords Orthogonal polynomials; Least squares; Padé-type approximation; Quadrature methods Let c be the linear functional on the space of complex polynomials defined by ( c i 2 C; i D 0; 1; : : : ; c(x i ) D 0; i < 0: It is said that {Pk } forms a family of (formal) orthogonal polynomials with respect to c if 8k:  Pk has the exact degree k,  c(xi Pk (x)) = 0 for i = 0, . . . , k  1. Such a family exists if, 8k, the Hankel determinant ˇ ˇ ˇ c0 c1    c k1 ˇˇ ˇ ˇ c1 c2    c k ˇˇ ˇ H (0) k D ˇ       ˇˇ ˇ ˇc c k    c2k2 ˇ k1 is different from zero. Such polynomials enjoy most of the properties of the usual orthogonal polynomials, when the functional c is given by Z b c(x i ) D x i d˛(x); a

where ˛ is bounded and non decreasing in [a, b] (see [1] for these properties). In this paper we study the polynomials Rk such that m X 

c(x i R k (x))

2

iD0

is minimized, where m is an integer strictly greater than k  1 (since, for m = k  1, we recover the previous formal orthogonal polynomials) and which can possibly depend on k. They will be called least squares (formal) orthogonal polynomials. They depend on the value of m but for simplicity this dependence will not be indicated in our notations. Such polynomials arise naturally in problems of Padé approximation for power series with perturbed

1851

1852

L

Least Squares Orthogonal Polynomials

coefficients, and in Gaussian quadrature (as described in the last section). Some properties of these polynomials are derived, together with a recursive scheme for their computation.

where uk is a column vector, vk a row vector and ak a scalar. We then have   1 1 1 1 A1 A k C A1 k u k ˇk v k A k k u k ˇk ; A1 D kC1 1 ˇ 1 ˇ 1 k vk Ak k

Existence and Uniqueness

where ˇ k = ak  vk A1 k uk . Instead of choosing the normalization bk = 1 we could impose the condition b0 = 1. In that case we have the system

Since the polynomials Rk will be defined apart from a multiplying factor, and since it is asked that the degree of Rk is exactly k we shall write R k (x) D b0 C b1 x C    C b k x k

b10 (1 ;  j ) C    C b 0k ( k ;  j ) D (0 ;  j )

with bk = 1. We set ˚(b0 ; : : : ; b k1 ) D

m X

[c(x i R k (x))]2

iD0

and we seek for the values of b0 , . . . , bk  1 that minimize this quantity. That is, such that @˚ D0 @b j

for j D 0; : : : ; k  1:

(1)

Setting  n = (cn , . . . , cn + m )| , this system can be written b0 (0 ;  j ) C    C b k1 ( k1 ;  j ) D ( k ;  j )

(2)

for j = 0, . . . , k  1. Thus Rk exists and is unique if and only if the matrix Ak of this system is non singular. Setting X = (1, x, . . . , xk  1 ) and calling the right-hand side of the preceding system  we see that ˇ ˇ ˇA k  ˇ ˇ ˇ ˇ X xkˇ R k (x) D : jA k j If we set

0

c0 @ Bk D    cm

  

1 c k1  A; c mCk1

> then Ak = B> k Bk ,  = B k  k and we recover the usual solution of a system of linear equations in the least squares sense.

Computation The polynomials Rk can be recursively computed by inverting the matrix Ak of the above system (2) by the bordering method, see [5]. This method is as follows. Set   Ak uk A kC1 D vk ak

(3)

for j = 1, . . . , k, and the bordering method can be used not only for computing the inverses of the matrices of the system recursively but also for obtaining its solution, since the new right-hand side contains the previous one. Let Ak 0 be the matrix of (3) and dk 0 be the right-hand side. We then have   0  0 A k u 0k dk 0 0 ; d kC1 D A kC1 D v 0k a0k f k0 with u 0k D (( kC1 ; 1 ); : : : ; ( kC1 ;  k ))> ; v 0k D ((1 ;  kC1 ); : : : ; ( k ;  kC1 )) ; a0k D ( kC1 ;  kC1 ); 0

d k D ((0 ; 1 ); : : : ; (0 ;  k ))> ; f k0 D (0 ;  kC1 ): Setting zk 0 = (b1 0 , . . . , bk 0 )| we have  0  0  f 0  v 0 z0 A k1 u 0k z z0kC1 D k C k 0 k k 0 1 ˇk 0 with ˇ k 0 = ak 0  vk 0 A01 k uk . Of course the bordering method can only be used if ˇ k (or ˇ k 0 in the second case) is different from zero. If it is not the case, instead of adding one new row and one new column to the system it is possible to add several rows and columns until a non singular ˇ k (which is now a square matrix) has been found (see [3] and [4]).

Location of the Zeros We return to the normalization bk = 1. As c(x i R k (x)) D b0 c i C    C b k c iCk

Least Squares Orthogonal Polynomials

and @c(xi Rk (x))/ @bj = ci + j , from (1) we obtain m X

c(x i R k (x))c iC j D 0

for j D 0; : : : ; k  1: (4)

iD0

This relation can be written as

for i = 0, . . . , k  1. Let us now assume that Z b ci D x i d˛(x); i D 0; 1; : : : ; a

with ˛ bounded and nondecreasing in [a, b]. We have

b

D

1 X1> B : C X k D @ :: A 0

a

0 yi @

m X

a

1 x j y j A d˛(y):

Set b

w(x; ) D

0 y @

m X

a

and

 k D (0 ; : : : ;  k1 ):

X> k

jD0

Z

| where X i = (1, xi , . . . , x m i ) , the xi ’s being arbitrary distinct points in [a, b], and thus ˇ ˇ ˇ (0 ; X1 )    ( k1 ; X1 ) ˇ ˇ ˇ ˇ D det(X k  k ) ˇ    ˇ ˇ ˇ( ; X )    ( ; X )ˇ 0 k k1 k

with

c i C c iC1 x C    C c iCm x m "Z # m b X iC j D y d˛(y) x j Z

and the condition of regularity of [7,8] is equivalent to our condition for the existence and uniqueness of Rk . According to [7,8], we now have to look at the interpolation property of w. We have w(x i ;  j ) D ( j1 ; X i )

c(R k (x)(c i C c iC1 x C    C c iCm x m )) D 0;

jD0

L

1 x j y j A d˛(y):

jD0

The interpolation property holds if and only if det(Xk  k ) 6D 0, that is, if and only if the matrix Xk  k has rank k. Thus, using the theorem of [7,8], we have proved the following result: Theorem 1 If Ak is regular and if Xk  k has rank k, then Rk exists and has k distinct zeros in [a, b]. Remark 2 When 0  a < b, it can be proved that det(Xk  k ) 6D 0 (see [2] for the details).

Thus w(x; i) D c i C c iC1 x C    C c iCm x m

Applications

and it follows that Z

b

c(R k (x)w(x; i)) D

R k (x)w(x; i) d˛(x) D 0 a

for i = 0, . . . , k  1, which shows that the polynomial Rk is biorthogonal in the sense of [7,8]. Let us now study the location of the zeros of Rk . For that purpose we shall apply [7, Thm. 3], also given as [8, Thm. 5]. Set

Our first application deals with Padé-type approximation. Let vk be an arbitrary polynomial of degree k and let wk (t) = a0 +    + ak 1 t k 1 be defined by a i D c(x i1 v k (x)); We set e v k (t) D t k v k (t 1 )

d˚(x; ) D w(x; )d˛(x)

i D 0; : : : ; k  1:

and e w k (t) D t k1 w k (t 1 ):

Let f be the formal power series

and Z

b

I k () D

x k d˚(x; );

k D 0; 1; : : : :

f (t) D

a

In our case,  takes the values i = i  1, i = 1, 2, . . . . Thus     det I i ( j ) D det ( j1 ;  i )

1 X

ci t i :

iD0

Then it can be proved that f (t) 

e w k (t) D O(t k ) e v k (t)

(t ! 0):

1853

1854

L

Least Squares Orthogonal Polynomials

w k (t) The rational function e is called a Padé-type approxe v k (t) imant of f and it is denoted by (k  1/k)f (t), [1]. Moreover it can also be proved that

  e tk tk v k (x) w k (t) D c D f (t)  e e e 1  xt v k (t) v k (t) v k (t)   k k  x t k1 k1 v k (x) : C c 1 C xt C    C x t 1  xt

In the following table we compare the number of exact figures given by the Padé approximant with those of the least squares Padé-type approximant, both computed with the same number of coefficients ci . We can see that the least squares Padé-type approximant has better stability properties. z

That is, f (t)e v k (t)  e w k (t) D t

k

1 X

i

1:5 1:9 2:1

i

c(x v k (x))t :

iD0

Thus if the polynomial vk , which is called the generating polynomial of (k  1/k), satisfies c(x i v k (x)) D 0 for i D 0; : : : ; k  1; then

Padé approx [7/8] 6:7 5:7 5:2

LS Padé-type approx [6/7] (m = 8) 7:7 7:0 6:7

Another application concerns quadrature methods. We have already shown that if the functional c is given by Z b x i d˛(x); i D 0; 1; : : : ; 0  a < b; ci D a

e w k (t) f (t)  D O(t 2k ): e v k (t) In this case vk is the formal orthogonal polynomial Pk w k (t) e of degree k with respect to c and e is the usual Padé v k (t) approximant [k  1/k] of f . As explained in [10], Padé approximants can be quite sensitive to perturbations on the coefficients ci of the series f . Hence the idea arises to take as vk the least squares orthogonal polynomial Rk of degree k instead of the usual orthogonal polynomial, an idea which in fact motivated our study. Of course such a choice decreases the degree of approximation, since the approximants obtained are only of the Padé-type, but it can increase the stability properties of the approximants and also P i 2 their precision since m iD0 [c(x vk (x))] is minimized by the choice vk = Rk . We give a numerical example that illustrates this fact. We consider the function 1

X ln(1 C z) D ci zi f (z) D z iD0 and we assume that we know the coefficients ci with a certain precision. For example, we know approximate values ci such that ˇ ˇ ˇc i  c  ˇ  108 ; i

i D 0; 1; : : : :

with ˛ bounded and nondecreasing, then the corresponding least squares orthogonal polynomial of degree k, Rk , has k distinct zeros in [a, b]. We can then construct quadrature formulas of the interpolatory type. If 1 , . . . , k are the zeros of Rk , we can approximate the integral Z b f (x) d˛(x) ID a

by I k D A1 f (1 ) C    C A k f ( k )

(5)

where Z

b

Ai D a

(x) d˛(x)

0 ( i )(x   i )

and

(x) D

k Y (x   j ): jD1

This corresponds to replacing the function f by its interpolating polynomial at the knots 1 , . . . , k . The truncation error of (5) is given by Z b f [1 ;    ;  k ; x]R k (x) d˛(x): I  Ik D ET D a

Least Squares Orthogonal Polynomials

Expanding the divided difference we see f [1 ; : : : ;  k ; x] D

k X

f [1 ;    ;  kCi ](x   kC1 )    (x   kCi1 )

iD1

C f [1 ; : : : ;  kCmC1 ; x](x   kC1 )    (x   kCmC1 ) for k + 1 , . . . , k + m + 1 any points in the domain of definition Df of f . If 0 2 Df , then we can choose  kC1 D    D  kCmC1 D 0: Setting M i D f [1 ; : : : ;  kCi ]

L

with i 2 [a, b], i = 0, . . . , m, ,  2 [a, b]. We remark that in the case where m = k  1, Rk is the orthogonal polynomial with respect to the functional c and so (5) corresponds to a Gaussian quadrature formula. An advantage of the quadrature formulas (5) is that they are less sensitive to perturbations on the sequence of moments ci , as is shown in the following numerical example. Such a case can arise in some applications where the formula giving the moments ci is sensitive to rounding errors, see [11] for example. Consider the functional c defined by Z 1 1 x i dx D ci D iC1 0 and perturb the coefficients in the following way

we get f [1 ; : : : ;  k ; x] mC1 X

D

M i x i1 C x mC1 f [1 ; : : : ;  kCmC1 ; x]

iD1

and hence, for the truncation error ET D

m X

iD0 Z b

C

Z

!

b

M iC1

R k (x)x d˛(x) a

f [1 ; : : : ;  kCmC1 ; x]x mC1 R k (x) d˛(x)

with

iD0

i 6 7 8 9 10 11

c i 0:14285700 0:12500000 0:11111109 0:10000000 0:09090899 0:08333300

We can construct from these coefficients the least squares orthogonal polynomials and the corresponding quadrature formulas (5). The R precision of the numerical approximations of I = 10 f (x) dx is given in the following table

!2

b i

R k (x)x d˛(x) a

minimised. Moreover, if f 2 Ck + m + 1 ([a, b]) and, since xm + 1 is positive over [a, b], we obtain Z

c i 1:00000011 0:50000029 0:33333340 0:25000101 0:20000070 0:16666600

i

a

Z m X

i 0 1 2 3 4 5

b

f [1 ; : : : ;  kCmC1 ; x]x mC1 R k (x) d˛(x) a

D

c mC1 R k () f (kCmC1) () (k C m C 1)!

with ,  2 [a, b], and, for the error, ! Z b m X f (kCi) ( i ) i R k (x)x d˛(x) ET D (k C i)! a iD0 c mC1 R k () f (kCmC1) () (6) C (k C m C 1)!

f (x)

k = 5; m = 4 Gauss quad.

k = 5; m = 6 least sq. quad.

1/(x + 0:5) 1/(x + 0:3)

2:2 105 2:1 104

6:2 106 1:2 105

We can obtain other applications from the following generalization. Instead of minimizing Pm i 2 iD0 [c(x Rk (x))] we can introduce weights and minimize m X  2 p i c(x i Rk (x)) ˚  (b0 ; : : : ; b k1 ) D iD0

with pi > 0, i = 0, . . . , m. If we choose the inner product 

( i ;  j ) D

m X kD0

p k c iCk c jCk

1855

1856

L

Least Squares Problems

the solution of this problem can be computed as in the previous case and all the properties of the polynomials are still true. It can be seen, from numerical examples, that if the sequence of moments ci has a decreasing precision, we can expect that the least squares Padé-type approximants constructed with a decreasing sequence of weights will give a better result. In the same way, for the quadrature formulas (5), from the expression (6) of the truncation error and the knowledge of the magnitude of the derivatives, we can reduce this error by choosing appropriate weights. Some other possible applications of least squares orthogonal polynomials will be studied in the future.

10. Mason JC (1981) Some applications and drawbacks of Padé approximants. In: Ziegler Z (ed) Approximation Theory and Appl. Acad. Press, New York, pp 207–223 11. Morandi Cecchi M, Redivo Zaglia M (1991) A new recursive algorithm for a Gaussian quadrature formula via orthogonal polynomials. In: Brezinski C, Gori L, Ronveaux A (eds) Orthogonal Polynomials and Their Applications. Baltzer, Basel, pp 353–358

See also

MSC2000: 65Fxx

 ABS Algorithms for Linear Equations and Linear Least Squares  ABS Algorithms for Optimization  Gauss–Newton Method: Least Squares, Relation to Newton’s Method  Generalized Total Least Squares  Least Squares Problems  Nonlinear Least Squares: Newton-type Methods  Nonlinear Least Squares Problems  Nonlinear Least Squares: Trust Region Methods

Least Squares Problems ÅKE BJÖRCK Linköping University, Linköping, Sweden

Article Outline Keywords Synonyms Introduction Historical Remarks Statistical Models Characterization of Least Squares Solutions

Pseudo-inverse and Conditioning Singular Value Decomposition and Pseudo-inverse Conditioning of the Least Squares Problem

Numerical Methods of Solution References 1. Brezinski C (1980) Padé type approximation and general orthogonal polynomials. ISNM, vol 50. Birkhäuser, Basel 2. Brezinski C, Matos AC (1993) Least squares orthogonal polynomials. J Comput Appl Math 46:229–239 3. Brezinski C, Redivo Zaglia M (1991) Extrapolation methods. Theory and practice. North-Holland, Amsterdam 4. Brezinski C, Redivo Zaglia M, Sadok H (1992) A breakdownfree Lanczos type algorithm for solving linear systems. Numer Math 63:29–38 5. Faddeeva VN (1959) Computational methods of linear algebra. Dover, Mineola, NY 6. Gantmacher FR (1959) The theory of matrices. Chelsea, New York 7. Iserles A, Nørsett SP (1985) Bi-orthogonal polynomials. In: Brezinski C, Draux A, Magnus AP, Maroni P, Ronveaux A (eds) Orthogonal Polynomials and Their Applications. Lecture Notes Math. Springer, Berlin, pp 92–100 8. Iserles A, Nørsett SP (1988) On the theory of biorthogonal polynomials. Trans Amer Math Soc 306:455–474 9. Karlin S (1968) Total positivity. Stanford Univ. Press, Palo Alto, CA

The Method of Normal Equations Least Squares by QR Factorization Rank-Deficient and Ill-Conditioned Problems Rank Revealing QR Factorizations

Updating Least Squares Solutions Recursive Least Squares Modifying Matrix Factorizations

Sparse Problems Banded Least Squares Problems Block Angular Form General Sparse Problems

See also References Keywords Least squares Synonyms LSP

L

Least Squares Problems

The variance-covariance matrix of the least squares estimate b x is given by

Introduction Historical Remarks The linear least squares problem originally arose from the need to fit a linear mathematical model to given observations. In order to reduce the influence of errors in the observations one uses a greater number of measurements than the number of unknown parameters in the model. The algebraic procedure of the method of least squares was first published by A.M. Legendre [25]. It was justified as a statistical procedure by C.F. Gauss [13]. A famous example of the use of the least squares principle is the prediction of the orbit of the asteroid Ceres by Gauss in 1801. After this success, the method of least squares quickly became the standard procedure for analysis of astronomical and geodetic data. Gauss gave the method a sound theoretical basis in two memoirs: ‘Theoria Combinationis’ [11,12]. In them, Gauss proves the optimality of the least squares estimate without any assumptions that the random variables follow a particular distribution. Statistical Models

(1)

where A 2 Rm × n is a known matrix of full column rank. Further, is a vector of random errors with zero means and covariance matrix  2 W 2 Rm × m , where W is known but  2 > 0 unknown. The standard linear model is obtained for W = I. Theorem 1 (Gauss–Markoff theorem) Consider the standard linear model (1) with W = I. The best linear unbiased estimator of any linear function c| x is x, where b x is obtained by minimizing the sum of the c >b squared residuals, krk22

D

m X

r 2i ;

(2)

iD1

where r = b  Ax and k  k2 denotes the Euclidean vector norm. Furthermore, E(s2 ) D  2 , where s2 is the quadratic form s2 D

1 (b  Ab x)> (b  Ab x): mn

(4)

The residual vectorb r D b  Ab x satisfies A>b r D 0, and hence there are n linear relations among the m components of b r. It can be shown that the residuals b r, and therefore also the quadratic form s2 , are uncorrelated x) D 0. with b x, i. e., cov(b r;b x) D 0, cov(s2 ;b If the errors in are uncorrelated but not of equal variance, then the covariance matrix W is diagonal. Then the least squares estimator is obtained by solving the weighted least squares problem min kD(Ax  b)k2 ;

1

D D W 2 :

(3)

(5)

For the general case with no restrictions on A and W, see [23]. The assumption that A is known made in the linear model is frequently unrealistic since sampling or modeling errors often also affect A. In the errors-in-variables model one instead assumes a linear relation (A C E)x D b C r;

In the general univariate linear model the vector b 2 Rm of observations is related to the unknown parameter vector x 2 Rn by a linear relation Ax D b C ;

V(b x) D  2 (A> A)1 :

(6)

where (E, r) is an error matrix whose rows are independently and identically distributed with zero mean and the same variance. An estimate of the parameters x in the model (6) is obtained from the total least squares (TLS) problem. Characterization of Least Squares Solutions Let S be set of all solutions to a least squares problem, S D fx 2 Rn : kAx  bk2 D ming :

(7)

Then x 2 S if and only if A| (b  Ax) = 0 holds. Equivalently, x 2 S if and only if x satisfies the normal equations A> Ax D A> b:

(8)

Since A| b 2 R(A| ) = R(A| A) the normal equations are always consistent. It follows that S is a nonempty, convex subset of Rn . Any least squares solution x uniquely decomposes the right-hand side b into two orthogonal components b D Ax C r;

Ax 2 R(A) ? r 2 N (A> );

1857

1858

L

Least Squares Problems

where R(A) and N(A| ) denote the range of A and the nullspace of A| , respectively. When rank A < n there are many least squares solutions x, although the residual b  Ax is still uniquely determined. There is always a unique least squares solution in S of minimum length. The following result applies to both overdetermined and underdetermined linear systems. Theorem 2 Consider the linear least squares problem min kxk2 ; x2S

S D fx 2 Rn : kb  Axk2 D ming ; (9)

where A 2 Rm × n and rank(A) = r  min(m, n). This problem always has a unique solution, which is distinguished by the property that x ? N (A):

Pseudo-inverse and Conditioning

1) 2) 3) 4)

AXA = A; XAX = X; (AX)| = AX; (XA)| = XA.

It can be directly verified that A† given by (11) satisfies these four conditions. The total least squares problem (TLS problem) involves finding a perturbation matrix (E, r) having minimal Frobenius norm, which lowers the rank of the matrix (A, b). Consider the singular value decomposition of the augmented matrix (A, b): (A; b) D U˙ V > ;

˙ D diag(1 ; : : : ; nC1 );

where  1       n + 1  0. Then, in the generic case, (x,  1)| is a right singular vector corresponding to  n + 1 and min k (E, r) kF =  n + 1 . An excellent survey of theoretical and computational aspects of the total least squares problem is given in [22].

Singular Value Decomposition and Pseudo-inverse

Conditioning of the Least Squares Problem

A matrix decomposition of great theoretical and practical importance for the treatment of least squares problems is the singular value decomposition (SVD) of A,

Consider a perturbed least squares problem where e AD e A C ıA, b D b C ıb, and let the perturbed solution be e x D x C ıx. Then, assuming that rank(A) = rank(A + ı A) = n one has the first order bound  1 1  kıb1 k2 C kıAk2 kxk2 C 2 kıAk2 krk2 : kıxk2  n n

A D U˙ V > D

n X

u i i v> i :

(10)

iD1

Here  i are the singular values of A and ui and vi the corresponding left and right singular vectors. Using this decomposition the solution to problem (9) can be written x = A† b, where  1  ˙r 0 A D V (11) U > 2 Rnm : 0 0 Here A† is called the pseudo-inverse of A. It is the unique matrix which minimizes kAX  IkF , where k  kF denotes the Frobenius norm. Note that the pseudoinverse A† is not a continuous function of A, unless one allows only perturbations which do not change the rank of A. The pseudo-inverse was first introduced by E.H. Moore in 1920. R. Penrose [30] later gave the following elegant algebraic characterization. Theorem 3 (Penrose’s conditions) The pseudo-inverse X = A† is uniquely determined by the four conditions:

The condition number of a matrix A 2 Rm × n (A 6D 0) is defined as



1 (12) (A) D kAk2 A 2 D ; r where  1       r > 0, are the nonzero singular values of A. Hence, the normwise relative condition number of the least squares problem can be written as LS (A; b) D (A) C (A)2

krk2 : kAk2 kxk2

(13)

For a consistent problem (r = 0) the last term is zero. However, in general the condition number depends on the size of r and involves a term proportional to (A)2 . A more refined perturbation analysis, which applies to both overdetermined and underdetermined systems, has been given in [34]. In order to prove any meaningful result it is necessary to assume that rank(A + ı A) = rank(A). If rank(A) = min(m, n), the condition  kA† k2 kıAk2 < 1 suffices to ensure that this is the case.

Least Squares Problems

Set x0 = 0, r0 = 0, and for s = 0, 1, . . . until convergence do

Numerical Methods of Solution The Method of Normal Equations The first step in the method of normal equations for the least squares problem is to form the cross-products C D A> A 2 Rnn ;

d D A> b 2 Rn :

(14)

Since the matrix C is symmetric, it is only necessary to compute and store its upper triangular part. When m  n this step will result in a great reduction in the amount of data. The computation of C and d can be performed either using an inner product form (operating on columns of A) or an outer product form (operating on rows of A). Row-wise accumulation of C and d is advantageous if the matrix A is sparse or held in secondary storage. Partitioning A by rows, one has CD

m X

a i a> i ;

dD

iD1

m X

b i a i ;

(15)

iD1

where e a> i denotes the ith row of A. This expresses C as a sum of matrices of rank. Gauss solved the symmetric positive definite system of normal equation by elimination, preserving symmetry, and solving for x by back-substitution. A different sequencing of this algorithm is to compute the Cholesky factorization C D R> R;

(16)

where R is upper triangular with positive diagonal elements, and then solve the two triangular systems R> z D d;

Rx D z;

L

(17)

by forward- and back-substitution, respectively. The Cholesky factorization, named after the French officer A.L. Cholesky, who worked on geodetic survey problems in Africa, was published by C. Benoit [1]. (In statistical applications this method is often known as the square-root method, although the proper square root of A should satisfy B2 = A.) The method of normal equations is suitable for moderately ill-conditioned problems but is not a backward stable method. The accuracy can be improved by using fixed precision iterative refinement in solving the normal equations.

r s D b  Ax s ; R> (Rıx s ) D A> r s ; x sC1 D x s C ıx s : (Here, x1 corresponds to the unrefined solution of the normal equations.) The method of normal equations can fail when applied to weighted least squares problems. To see this consider a problem with two different weights  and 1,

   

 A1  b1



; x (18) min

x A2 b2 2 for which the matrix of normal equations is A| A =  2 > A> 1 A1 + A2 A2 . When   1 this problem is called stiff . In the limit  ! 1 the solution will satisfy the subsystem A1 x = b1 exactly. If  > u 1/2 (u is the unit roundoff), the information in the matrix A2 may completely disappear when forming A| A. For possible ways around this difficulty, see [4, Chap. 4.4]. Least Squares by QR Factorization The QR factorization and its extensions are used extensively in modern numerical methods for solving least squares problems. Let A 2 Rm × n with rank(A) = n. Then there are an orthogonal matrix Q 2 Rm × m and an upper triangular R 2 Rn × n such that   R ADQ (19) 0 Since orthogonal transformations preserve the Euclidean length, it follows that



(20) kAx  bk2 D Q > (Ax  b) 2 for any orthogonal matrix Q 2 Rm × m . Hence using the QR factorization (19) the solution to the least squares problem can be obtained from   d (21) Q > b D 1 ; Rx D d1 : d2 An algorithm based on the QR decomposition by Householder transformations was first developed in a seminal paper by G.H. Golub [18]. Here, Q is compactly represented as a product of Householder ma-

1859

1860

L

Least Squares Problems

trices Q = P1    Pn , where Pk = I  ˇ k uk u> k . Only the Householder vectors uk are stored, and advantage is taken of the fact that the first k  1 components of uk are zero. Golub’s method for solving the standard least squares problem is normwise backward stable, see [24, pp. 90ff]. Surprisingly, this method is stable also for solving the weighted least squares problems (5) provided only that the equations are sorted after decreasing row norms in A, see [8]. Due to storage considerations the matrix Q in a QR decomposition is often discarded when A is large and sparse. This creates a problem, since then it may not be possible to form Q| b. If the original matrix A is saved one can use the corrected seminormal equations (CSNE) R> Rx D A> b; >

>

R Rıx D A r;

r D b  Ax; x c D x C ıx:

(Note that unless the correction step is carried out the numerical stability of this method is no better than the method of normal equations.) An error analysis of the CSNE method is given in [2]. A comparison with the bounds for a backward stable method shows that in most practical applications the corrected seminormal equations is forward stable. Applying the Gram–Schmidt orthogonalization process to the columns of A produces Q1 and R in the factorization A D (a1 ; : : : ; a n ) D Q1 R;

Q1 D (q1 ; : : : ; q n );

where Q1 has orthogonal columns and R is upper triangular. There are two computational variants of Gram–Schmidt orthogonalization, the classical Gram– Schmidt orthogonalization (CGS) and the modified Gram–Schmidt orthogonalization (MGS). In CGS there may be a catastrophic loss of orthogonality unless reorthogonalization is used. In MGS the loss of orthogonality can be shown to occur in a predictable manner. Using an equivalence between MGS and Householder QR applied to A with a square matrix of zeros on top, backward stable algorithm based on MGS for solving least squares problems have been developed, see [3]. Rank-Deficient and Ill-Conditioned Problems The mathematical notion of rank is not always appropriate in numerical computations. For example, if

a matrix A 2 Rn × n , with (mathematical) rank k < n, is randomly perturbed by roundoff, the perturbed matrix most likely has full rank n. However, it should be considered to be ‘numerically’ rank deficient. When solving rank-deficient or ill-conditioned least squares problems, correct assignment of the ‘numerical rank’ of A is often the key issue. The numerical rank should depend on a tolerance which reflects the error level. Overestimating the rank may lead to a computed solution of very large norm, which is totally irrelevant. This behavior is typical in problems arising from discretizations of ill-posed problems, see [21]. Assume that the ‘noise level’ ı in the data is known. Then a numerical rank k, such that  k > ı   k + 1 , can be assigned to A, where  i are the singular values of A. The approximate solution xD

k X ci vi ;  iD1 i

c D U > b;

is known as the truncated singular value decomposition solution (TSVD). This solution solves the related least squares problem minx kAk x  bk2 , where Ak D

k X

u i i v> i ;

kA  A k k2  ı;

iD1

is the best rank k approximation of A, The subspace R(V2 );

V2 D (v kC1 ; : : : ; v n );

is called the numerical nullspace of A. An alternative to TSVD is Tikhonov regularization, where one considers the regularized problem min kAx  bk22 C  2 kDxk22 ; x

(22)

for some positive diagonal matrix D = diag(d1 , . . . , dn ). The problem (22) is equivalent to the least squares problem

   

D 0

; min

(23) x

x A b 2 where the matrix A has been modified by appending the matrix  D on top. An advantage of using the regularized problem (23) instead of the TSVD is that its solution can be computed from a QR decomposition. When  > 0 this problem is always of full column rank and has

Least Squares Problems

a unique solution. For D = I it can be shown that x() will approximately equal the TSVD solution for  = ı. Problem (23) also appears as a subproblem in trust region algorithms for solving nonlinear least squares, and in interior point methods for constrained linear least squares problems. A more difficult case is when the noise level ı is unknown and has to be determined in the solution process. Such problems typically arise in the treatment of discrete ill-posed problems, see [21]. Rank Revealing QR Factorizations In some applications it is too expensive to compute the SVD. In such cases so called ‘rank revealing’ QR factorizations, often are a good substitute. It can be shown that for any 0 < k < n a column permutation ˘ exists such that the QR decomposition of A ˘ has the form   R11 R12 ; (24) A˘ D Q 0 R22 where 1  k (R11 )   k ; c

kR22 k2  c kC1 ;

(25)

and c < (n + 1)/2. In particular, if A has numerical ırank equal to k, then there is a column permutation such that k R22 k2  c ı. Such a QR factorization is called a rank revealing QR factorization (RRQR). No efficient numerical method is known which can be guaranteed to compute an RRQR factorization satisfying (25), although in practice Chan’s method [7] often gives satisfactory results. A related rank revealing factorization is the complete orthogonal decomposition of the form   R11 R12 V >; (26) ADU 0 R22 where U and V are orthogonal matrices, R11 2 Rk × k ,  k (R11 )   k /c, and  1 kR12 k2F C kR22 k2F 2  c kC1 : This is also often called a rank revealing URV factorization. (an alternative lower triangular form ULV is sometimes preferable to use.) If V = (V 1 V 2 ) is partitioned conformably the orthogonal matrix V 2 can be taken as an approximation to the numerical nullspace N(A).

L

Updating Least Squares Solutions It is often desired to solve a sequence of modified least squares problems min kAx  bk2 ; x

A 2 Rmn ;

(27)

where in each step rows of data in (A, b) are added, deleted, or both. This need arises, e. g., when data are arriving sequentially. In various time-series problems a window moving over the data is used; when a new observation is added, an old one is deleted as the window moves to the next step in the sample. In other applications columns of the matrix A may be added or deleted. Such modifications are usually referred to as updating (downdating) of least squares solutions. Important applications where modified least squares problems arise include statistics, optimization, and signal processing. In statistics an efficient and stable procedure for adding and deleting rows to a regression model is needed; see [6]. In regression models one may also want to examine the different models, which can be achieved by adding or deleting columns (or permuting columns). Recursive Least Squares Applications in signal processing often require near real-time solutions. It is then critical that the modification should be performed with as few operations and as little storage requirement as possible. Methods based on the normal equations and/or updating of the Cholesky factorization are still often used in statistics and signal processing, although these algorithms lack numerical stability. Consider a least squares problem where an observation w| x = ˇ is added. The updated solution e x then satisfies the modified normal equations x D A> b C ˇw: (A> A C ww > )e

(28)

A straightforward method for computing e x is based on updating the (scaled) covariance matrix C = (A| A)1 . By the Sherman–Morrison formula one obtains e C 1 D 1 > C C ww , and e CDC

1 uu > ; 1 C w>u

u D Cw:

(29)

From this follows the updating formula e u; x D x C (ˇ  w > x)e

e uDe Cw:

(30)

1861

1862

L

Least Squares Problems

The equations (29), (30) define a recursive least squares (RLS) algorithm. They can, with slight modifications, also be used for ‘deleting’ observations. The simplicity of this updating algorithm is appealing, but a disadvantage is its serious sensitivity to roundoff errors. Modifying Matrix Factorizations The first area where algorithms for modifying matrix factorizations seems to have been systematically used is optimization. Numerous aspects of updating various matrix factorizations are discussed in [17]. There is a simple relationship between the problem of updating matrix factorizations and that of updating least squares solutions. If A has full column rank and the R-factor of the matrix (A, b) is   R z ; (31) 0  then the solution to the least squares problem (27) is given by Rx D z;

kAx  bk2 D :

(32)

Hence updating algorithms for the QR or Cholesky factorization can be applied to (A, b) in order to give updating algorithms for least squares solutions. Backward stable algorithms, which require O(m2 ) multiplications, exist for updating the QR decomposition for three important kinds of modifications:  General rank one change of A.  Deleting (adding) a column of A.  Adding (deleting) a row of A. In these algorithms, Q 2 Rm × m is stored explicitly as an m × m matrix. In many applications it suffices to update the ‘Gram–Schmidt’ QR decomposition A D Q1 R;

Q1 2 Rmn ;

Rank revealing QR factorizations can be updated more cheaply, and are often a good alternative to use. G.W. Stewart [33] has shown how to compute and update a rank revealing complete orthogonal decomposition from an RRQR decomposition. Most updating algorithms can be modified in a straightforward fashion to treat cases where a block of rows/columns are added or deleted. which are more amenable to efficient implementation on vector and parallel computers. Sparse Problems The gain in operations and storage in solving the linear least squares problems where the matrix A is sparse can be huge, making otherwise intractable problems possible to solve. Sparse least squares problems of huge size arise in a variety of applications, such as geodetic surveys, photogrammetry, molecular structure, gravity field of the earth, tomography, the force method in structural analysis, surface fitting, and cluster analysis and pattern matching. Sparse least squares problems may be solved either by direct or iterative methods. Preconditioned iterative methods can often be considered as hybrids between these two classes of solution. Below direct methods are reviewed for some classes of sparse problems. Banded Least Squares Problems A natural distinction is between sparse matrices with regular zero pattern (e. g., banded structure) and matrices with an irregular pattern of nonzero elements. A rectangular banded matrix A 2 Rm × n has the property that the nonzero elements in each row lie in a narrow band. A is said to have row bandwidth w if

(33)

where Q1 2 Rm × n consists of the first n columns of Q, [10,31]. These only require O(mn) storage and operations. J.R. Bunch and C.P. Nielsen [5] have developed methods for updating the SVD   ˙ ADU V >; 0 where U 2 Rm × m and V 2 Rn × n , when A is modified by adding or deleting a row or column. However, their algorithms require O(mn2 ) flops.

w(A) D max (l i (A)  f i (A) C 1): 1im

(34)

where ˚ f i (A) D min j : a i j ¤ 0 ; ˚ l i (A) D max j : a i j ¤ 0 are column subscripts of the first and last nonzeros in the ith row of A. For this structure to have practical significance one needs to have w  n. Note that, although the row bandwidth is independent of the row ordering, it will depend on the column ordering. To permute the

Least Squares Problems

columns in A so that a small bandwidth is achieved the method of choice is the reverse Cuthill–McKee ordering, see [15]. It is easy to see that if the row bandwidth of A is w then the matrix of normal equations C = A| A has at most upper bandwidth p = w  1, i. e., j j  kj  w

)

(A> A) jk D

m X

a i j a i k D 0:

iD1

If advantage is taken of the band structure, the solution of a least squares problem where A has bandwidth w by the method of normal equations requires a total of 1 2

(mw(w C 3) C n(w  1)(w C 2)) C n(2w  1)

flops. Similar savings can be obtained for methods based on Givens QR decomposition used to solve banded least squares problem. However, then it is essential that the rows of A are sorted so that the column indices f i (A), i = 1, . . . , m, of the first nonzero element in each row form a nondecreasing sequence, i. e., ik

)

f i (A)  f k (A):

A matrix whose rows are sorted in this way is said to be in standard form. Since the matrix R in the QR factorization has the same structure as the Cholesky factor, it must be a banded matrix with nonzero elements only in the first p = w  1 superdiagonals. In the sequential row orthogonalization scheme an upper triangular matrix R is initialized to zero. The orthogonalization then proceeds row-wise, and R is updated by adding a row of A at a time. If A has constant bandwidth and is in standard form then in the ith step of reduction the last (n  li (A)) columns of R have not been touched and are still zero as initialized. Further, the first (f i (A)  1) rows of R are already finished at this stage and can be read out to secondary storage. Thus, as with the Cholesky method, very large problems can be handled since primary storage is needed only for the active part of R. The complete orthogonalization requires about 2mw2 flops, and can be performed in w(w + 3)/2 locations of primary storage. The Givens rotations could also be applied to one or several right-hand sides b. Only if right-hand sides

L

which are not initially available are to be treated, need the Givens rotations be saved. The algorithm can be modified to also handle problems with variable row bandwidth wi . For the case when m  n a more efficient schemes uses Householder transformations, see [24, Chap. 11]. Let Ak consist of the rows of A for which the first nonzero element is in column k. Then, in step k of this algorithm, the Ak is merged with Rk  1 , by computing the QR factorization   R k1 D Rk : Q> k Ak Note that this and later steps will not involve the first k  1 rows and columns of Rk  1 . Hence the first k  1 rows of Rk  1 are rows in the final matrix R. The reduction using this algorithm takes about w(w + 1)(m + 3n/2) flops, which is approximately half as many as for the Givens method. As in the Givens algorithm the Householder transformations can also be applied to one or several right-hand sides b to produce c = Q| b. The least squares solution is then obtained from Rx = c1 by back-substitution. It is essential that the Householder transformations be subdivided as outlined above, otherwise intermediate fill will occur and the operation count increase greatly, see the example in [32]. Block Angular Form There is often a substantial similarity in the structure of large sparse least squares problems. The matrices possess a block structure, perhaps at several levels, which reflects a ‘local connection’ structure in the underlying physical problem. In particular, the problem can often be put in the following bordered block diagonal or block angular form: 1 0 B1 A1 B :: C ; :: (35) AD@ : : A 0

1

x1 B :: C x D @ : A; x MC1

BM 0 1 b1 B :: C b D @ : A:

AM

(36)

bM

From (35) it follows that the variables x1 , . . . , xM are coupled only to the variables xM + 1 . Some applications

1863

1864

L

Least Squares Problems

where the form (35) arises naturally are photogrammetry, Doppler radar positioning [27], and geodetic survey problems [20]. Problems of block angular form can be efficiently treated either by using normal equations of by QR factorization. It is easily seen that the matrix R from Cholesky or QR will have a block structure similar to that of A, 1 0 S1 R1 :: C B :: B : : C (37) RDB C; @ RM SM A R MC1 where the Ri 2 Rn i n i are upper triangular. This factor can be computed by first performing a sequence of orthogonal transformations yielding Q> i (A i ; B i ) D



Ri 0

 Si ; Ti

Q> i bi D

  ci : di

Any sparse structure in the blocks Ai should be exploited. The last block row RM + 1 , cM + 1 is obtained by computing the QR decomposition  e Q> MC1 T

  R MC1 d D 0

 c MC1 ; d MC1

where 0

1

T1 B :: C T D @ : A;

0

1

d1 B :: C d D @ : A:

TM

dM

The unknown xM + 1 is determined from the triangular system RM + 1 xM + 1 = cM + 1 . Finally xM , . . . , x1 are computed by back-substitution in the sequence of triangular systems Ri xi = ci  Si xM + 1 , i = M, . . . , 1. Note that a large part of the computations can be performed in parallel on the M subsystems. Several modifications of this basic algorithm have been suggested in [19] and [9]. General Sparse Problems If A is partitioned by rows, then (15) can be used to compute the matrix C = A| A. Make the ‘nocancellation assumption’ that whenever two nonzero numerical quantities are added or subtracted, the result

is nonzero. Then it follows that the nonzero structure of A| A is the direct sum of the nonzero structures of > ai  a> i , i = 1, . . . , m, where a i denotes the ith row of A. Hence the undirected graph G(A| A) representing the structure of A| A can be constructed as the direct sum of all the graphs G(ai  a> i ), i = 1, . . . , m. The nonzeros will generate a subgraph, where all pairs of in row a> i nodes are connected. Such a subgraph is called a clique. From the graph G(A| A) the structure of the Cholesky factor R can be predicted by using a graph model of Gaussian elimination. The fill under the factorization process can be analyzed by considering a sequence of undirected graphs Gi = G(A(i) ), i = 0, . . . , n 1, where A(0) = A. These elimination graphs can be recursively formed in the following way. Form Gi from G(i  1) by removing the node i and its incident edges and adding fill edges. The fill edges in eliminating node v in the graph G are ˚ ( j; k) : ( j; k) 2 AdjG (v); j ¤ k : Thus, the fill edges correspond to the set of edges required to make the adjacent nodes of v pairwise adjacent. The filled graph GF (A) of A is a graph with n vertices and edges corresponding to all the elimination graphs Gi , i = 0, . . . , n  1. The filled graph bounds the structure of the Cholesky factor R, G(R> C R)  G F (A):

(38)

This also give an upper bound for the structure of the factor R in the QR decomposition. A reordering of the columns of AP of A corresponds to a symmetric reordering of the rows and columns of A| A. Although this will not affect the number of nonzeros in A| A, only their positions, it may greatly affect the number of nonzeros in the Cholesky factor R. Before carrying out the Cholesky or QR factorization numerically, it is therefore important to find a permutation matrix P such that P| A| AP has a sparse Cholesky factor R. By far the most important local ordering method is the minimum degree ordering In terms of the Cholesky factorization this ordering is equivalent to choosing the ith pivot column as one with the minimum number of nonzero elements in the unreduced part of A| A. This will minimize the number of entries that will be modified in the next elimination step. Remarkably fast

Least Squares Problems

symbolic implementations of the minimum degree algorithm exist, which use refinements of the elimination graph model of the Cholesky factorization. See [16] for a survey of the extensive development of efficient versions of the minimum degree algorithm. Another important ordering method is substructuring or nested dissection, which results in a nested block angular form. Here the idea is to choose a set of nodes B in the graph G(A| A), which separates the other nodes into two sets A1 and A2 so that node variables in A1 are not connected to node variables in A2 . The variables are then ordered so that those in A1 appear first, those in A2 second, and those in B last. Finally the equations are ordered so that those including A1 come first, those including A2 next, and those only involving variables in B come last. This dissection can be continued recursively, first dissecting the regions A1 and A2 each into two subregions, and so on. An algorithm using the normal equations for solving sparse linear least squares problems is usually split in a symbolical and a numerical phase as follows. 1) Determine symbolically a column permutation Pc | such that P> c A APc has a sparse Cholesky factor R. | 2) Perform the Cholesky factorization of P> c A APc symbolically to generate a storage structure for R. | > | 3) Compute B = P> c A APc and c = P c A b numerically, storing B in the data structure of R. 4) Compute the Cholesky factor R numerically and solve R| z = c, Ry = z, giving the solution x = Pc y. Here, steps 1 and 2 involve only symbolic computation and apply also to a sparse QR algorithm. For details of the implementation of the numerical factorization see [15, Chap. 5]. For moderately ill-conditioned problems a sparse Cholesky factorization, possibly used with iterative refinement, is a satisfactory choice. Orthogonalization methods are potentially more accurate since they work directly with A. The number of operations needed to compute the QR decomposition depends on the row ordering, and the following heuristic row ordering algorithm should be applied to A before the numerical factorization takes place: First sort the rows after increasing f i (A), so that f i (A)  f k (A) if i < k. Then for each group of rows with f i (A) = k, k = 1, . . . , maxi f i (A), sort all the rows after increasing Li (A). In the sparse case, applying the usual sequence of Householder reflections may cause a lot of intermedi-

L

ate fill-in, with consequent cost in operations and storage. In the row sequential algorithm by J.A. George and M.T. Heath [14], this problem is avoided by using a row-oriented method employing Givens rotations. Even more efficient are multifrontal methods, in which Householder transformations are applied to a sequence of small dense subproblems. Note that in most sparse QR algorithms the orthogonal factor Q is not stored. The corrected seminormal equations are used for treating additional right-hand sides. The reason is that for rectangular matrices A the matrix Q is usually much less sparse than R. In the multifrontal algorithm, however, Q can efficiently be represented by the Householder vectors of the frontal orthogonal transformations, see [26]. A Fortran multifrontal sparse QR subroutine, called QR27, has been developed by P. Matstoms [28]. He [29] has also developed a version of this to be used with MATLAB, implemented as four M-files and available from netlib. See also  ABS Algorithms for Linear Equations and Linear Least Squares  ABS Algorithms for Optimization  Gauss, Carl Friedrich  Gauss–Newton Method: Least Squares, Relation to Newton’s Method  Generalized Total Least Squares  Least Squares Orthogonal Polynomials  Nonlinear Least Squares: Newton-type Methods  Nonlinear Least Squares Problems  Nonlinear Least Squares: Trust Region Methods References 1. Benoit C (1924) Sur la méthode de résolution des, équationes normales, etc. (Procédés du commandant Cholesky). Bull Géodésique 2:67–77 2. Björck Å (1987) Stability analysis of the method of seminormal equations for least squares problems. Linear Alg & Its Appl 88/89:31–48 3. Björck Å (1994) Numerics of Gram–Schmidt orthogonalization. Linear Alg & Its Appl 197-198:297–316 4. Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia 5. Bunch JR, Nielsen CP (1978) Updating the singular value decomposition. Numer Math 31:111–129

1865

1866

L

Leibniz, Gottfried Wilhelm

6. Chambers JM (1971) Regression updating. J Amer Statist Assoc 66:744–748 7. Chan TF (1987) Rank revealing {QR}-factorizations. LAA 88/89:67–82 8. Cox AJ, Higham NJ (1997) Stability of Householder QR factorization for weighted least squares problems. Numer Anal Report Manchester Centre Comput Math, Manchester, England, 301 9. Cox MG (1990) The least-squares solution of linear equations with block-angular observation matrix. In: Cox MG, Hammarling SJ (eds) Reliable Numerical Computation. Oxford Univ. Press, Oxford, pp 227–240 10. Daniel J, Gragg WB, Kaufman L, Stewart GW (1976) Reorthogonalization and stable algorithms for updating the Gram–Schmidt QR factorization. Math Comput 30:772–95 11. Gauss CF (1880) Theoria combinationis observationum erroribus minimis obnoxiae, pars posterior. In: Werke, IV. Königl. Gesellschaft Wissenschaft, Göttingen, pp 27–53, First published in 1823. 12. Gauss CF (1880) Theoria combinationis observationum erroribus minimis obnoxiae, pars prior. In: Werke, IV. Königl. Gesellschaft Wissenschaft. Göttingen, Göttingen, pp 1–26, First published in 1821. 13. Gauss CF (1963) Theory of the motion of the heavenly bodies moving about the Sun in conic sections. Dover, Mineola, NY (Translation by Davis CH); first published in 1809 14. George JA, Heath MT (1980) Solution of sparse linear least squares problems using Givens, rotations. Linear Alg & Its Appl 34:69–83 15. George JA, Liu JW-H (1981) Computer solution of large sparse positive definite systems. Prentice-Hall, Englewood Cliffs, NJ 16. George JA, Liu JW-H (1989) The evolution of the minimum degree ordering algorithm. SIAM Rev 31:1–19 17. Gill PE, Golub GH, Murray W, Saunders MA (1974) Methods for modifying matrix factorizations. Math Comput 28:505– 535 18. Golub GH (1965) Numerical methods for solving least squares problems. Numer Math 7:206–216 19. Golub GH, Manneback P, Toint P (1986) A comparison between some direct and iterative methods for large scale geodetic least squares problems. SIAM J Sci Statist Comput 7:799–816 20. Golub GH, Plemmons RJ (1980) Large-scale geodetic leastsquares adjustment by dissection and orthogonal decomposition. Linear Alg & Its Appl 34:3–28 21. Hansen PC (1998) Rank-deficient and discrete ill-posed problems. Numerical aspects of linear inversion. SIAM, Philadelphia 22. Van Huffel S, Vandewalle J (1991) The total least squares problem: Computational aspects and analysis. Frontiers in Appl Math, vol 9. SIAM, Philadelphia 23. Kourouklis S, Paige CC (1981) A constrained approach to the general Gauss–Markov, linear model. J Amer Statist Assoc 76:620–625

24. Lawson CL, Hanson RJ (1974) Solving least squares problems. Prentice-Hall, Englewood Cliffs, NJ 25. Legendre AM (1805) Nouvelles méthodes pour la détermination des orbites des comètes. Courcier, Paris 26. Lu S-M, Barlow JL (1996) Multifrontal computation with the orthogonal factors of sparse matrices. SIAM J Matrix Anal Appl 17:658–679 27. Manneback P, Murigande C, Toint PL (1985) A modification of an algorithm by Golub and Plemmons for large linear least squares in the context of Doppler positioning. IMA J Numer Anal 5:221–234 28. Matstoms P (1992) QR27-specification sheet. Techn. Report Dept. Math. Linköping Univ. 29. Matstoms P (1994) Sparse QR factorization in MATLAB. ACM Trans Math Softw 20:136–159 30. Penrose R (1955) A generalized inverse for matrices. Proc Cambridge Philos Soc 51:406–413 31. Reichel L, Gragg WB (1990) FORTRAN subroutines for updating the QR decomposition. ACM Trans Math Softw 16:369–377 32. Reid JK (1967) A note on the least squares solution of a band system of linear equations by Householder reductions. Computer J 10:188–189 33. Stewart GW (1992) An updating algorithm for subspace tracking. IEEE Trans Signal Processing 40:1535–1541 34. Wedin P-Å (1973) Perturbation theory for pseudo-inverses. BIT 13:217–232

Leibniz, Gottfried Wilhelm SANDRA DUNI EKSIOGLU Industrial and Systems Engineering Department, University Florida, Gainesville, USA MSC2000: 01A99 Article Outline Keywords See also References Keywords Gottfried Wilhelm Leibniz; Integration; Differentiation; Theory of envelops; Infinitesimal calculus G.W. Leibniz (1646–1716) was a well-known German philosopher and mathematician. He is considered a de-

Leibniz, Gottfried Wilhelm

scendant of German idealism and a pioneer of the Enlightenment. Leibniz is known as the inventor of the differential and integral calculus [7]. Leibniz’s contribution in philosophy is as significant as in mathematics. In philosophy Leibniz is known for his fundamental philosophical ideas and principles including truth, necessary and contingent truths, possible worlds, the principle of sufficient reason (i. e., there is a reason behind everybody’s action), the principle of pre-established harmony (i. e., the universe is created in such a way that corresponding mental and physical events occur simultaneously), and the principle of noncontradiction (i. e., if a contradiction can be derived from a proposition, this proposition is false). Leibniz was fond on the idea that the principles of reasoning could be organized into a formal symbolic system, an algebra or calculus of thought, where disagreements could be settled by calculations [4]. Leibniz was the son of a professor of moral philosophy at Leipzig Univ. Leibniz learned to read from his father before going to school. He taught himself Latin and Greek by age 12, so that he could read the books in his father’s library. He studied law at the Univ. of Leipzig from 1661 to 1666. In 1666 he was refused the degree of doctor of laws at Leipzig. He went to the Univ. of Altdorf, which awarded him doctorate in jurisprudence in 1667 [1]. Leibniz started his career at the courts of Mainz where he worked until 1672. The Elector of Mainz promoted him to diplomatic services. In 1672 he visited Paris to try to dissuade Louis XIV from attacking German areas. Leibniz remained in Paris until 1676, where he continued to practice law. In Paris he studied mathematics and physics under Chr. Huygens. During this period he developed the basic features of his version of the calculus. He spent the rest of his life, from 1676 until his death (November 14, 1716) at Hannover [6]. Leibniz’s most important achievement in mathematics was the discovery of infinitesimal calculus. The significance of calculus is so important that it was marked as the starting point of modern mathematics. Leibniz’s formulations were different from previous investigation by I. Newton. Newton was mainly concentrated in the geometrical representation of calculus, while Leibniz took it towards analysis. Newton considered variables changing with time. Leibniz thought of variables x, y as ranging over sequences of infinitely

L

close values. For Newton integration and differentiation were inverses, while Leibniz used integration as a summation. At that time, neither Leibniz nor Newton thought in terms of functions, both always thought in terms of graphs. In November 1675 he wrote a manuscript using the R notation f (x) dx for the first time [5]. In the same manuscript he presented the product rule for differentiation. The quotient rule first appeared two years later, in July 1677. In 1676 Leibniz arrived in the conclusion that he was in possession of a method that was highly important because of its generality. Whether a function was rational or irrational, algebraic or transcendental (a word that Leibniz coined), his operations of finding sums and differences could always be applied. In November 1676 Leibniz discovered the familiar notation d(xn ) = nxn  1 dx for both integral and fractional n. Newton claimed that: ‘not a single previously unsolved problem was solved here’, but the formalism of Leibniz’s approach proved to be vital in the development of the calculus. Leibniz never thought of the derivative as a limit. This does not appear until the work of J. d’Alembert. Leibniz was convinced that good mathematical notations were the key to progress so he experimented with different notation for coefficient systems. His language was fresh and appropriate, incorporating such terms as differential, integral, coordinate and function [8]. His notations which we still use today, were clear and elegant. His unpublished manuscripts contain more than 50 different ways of writing coefficient systems, which he worked on during a period of 50 years beginning in 1678. Leibniz used the word resultant for certain combinatorial sums of terms of a determinant. He proved various results on resultants including what is essentially Cramer’s rule. He also knew that a determinant could be expanded using any column, what is now called Laplace expansion. As well as studying coefficient systems of equations which led him to determinants, Leibniz also studied coefficient systems of quadratic forms which led naturally towards matrix theory [9]. He thought about continuity, space and time [2]. In 1684 Leibniz published details of his differentiable calculus in ‘Acta Eruditorum’, a journal established in Leipzig two years earlier. He described a general method for finding maxima and minima, and drawing tangents to curves. The paper contained the

1867

1868

L

Lemke Method

rules for computing the derivatives of powers, products and quotient. In 1686 Leibniz published a paper on the principles of new calculus [3] in ‘Acta Eruditorum’. Leibniz emphasized the inverse relationship between differentiation and integration in the fundamental theorem of calculus. In 1692 Leibniz wrote a paper that set the basis of the theory of envelopes. This was further developed in another paper published on 1694 where he introduced for the first time the terms coordinates and axes of coordinates. Leibniz published many papers on mechanical subjects as well [1]. In 1700 Leibniz founded the Berlin Academy and was its first president. Leibniz’s principal works are: 1) ‘De Arte Combinatoria’ (On the Art of Combination), 1666; 2) ‘Hypothesis Physica Nova’ (New Physical Hypothesis), 1671; 3) ‘Dicours de Metaphysique’ (Discourse on Metaphysics), 1686; 4) Unpublished Manuscripts on the Calculus of Concepts, 1690; 5) ‘Nouveaux Essais sur L’entendement Humaine’ (New Essays on Human Understanding), 1705; 6) ‘Theodicee’ (Theodicy), 1710; 7) ‘Monadologia’ (The Monadology), 1714. See also  History of Optimization References 1. Aiton EJ (1985) Leibniz, A biography. Adam Hilger Ltd, Bristol 2. Anapolitanos D (1999) Leibniz: Representation, continuity and the spatiotemporal. Kluwer, Dordrecht 3. Boyer BB (1968) A history of mathematics. Wiley, New York 4. MacDonald GR (1984) Leibniz. Oxford Univ. Press, Oxford 5. O’Connor JJ (Oct. 1998) Gottfried Wilhelm von Leibniz. Dept. Math. and Statist. Univ. St. Andrews, Scotland), http://www. history.mcs.st-andrews.ac.uk/history/Mathematicians/ Leibniz.html 6. Pereira ME (2000) Gottfried Wilhelm von Leibniz. http:// www.geocities.com/Athens/Delphi/6061/Leibniz.html 7. Wingereid B (2000) Gottfried Wilhelm von Leibniz. http:// www.phs.princeton.k12.oh.us/Public/Lessons/enl/wing. html

8. Woolhouse RS (ed) (1981) Leibniz: Metaphysics and philosophy of science. Oxford Univ. Press, Oxford 9. Zalta EN (2000) Gottfried Wilhelm von Leibniz. http://mally. stanford.edu/leibniz

Lemke Method Lemke Algorithm MICHAEL M. KOSTREVA Department Math. Sci., Clemson University, Clemson, USA MSC2000: 90C33 Article Outline Keywords Lemke’s Algorithm See also References Keywords Linear complementarity; Pivoting The linear complementarity problem (LCP) is a well known problem in mathematical programming. Applications of the LCP to engineering, game theory, economics, and many other scientific fields have been found. The monograph of K.G. Murty [8] is a compendium of LCP developments. One of the most significant approaches to the solution of the linear complementarity problem is called Lemke’s method or Lemke’s algorithm. Two descriptions of the algorithm [6,7] provide many algorithmic proofs and details for the interested reader. Our treatment here is a sketch of the algorithm, together with pointers to related work in the literature. There are some important related works for those who wish to solve LCP. A. Ravindran [10] provided a FORTRAN implementation of Lemke’s algorithm in a set-up similar to the revised simplex method. C.B. Garcia [2] described some classes of matrices for which the associated LCPs can be solved by Lemke’s algorithm. J.J.M. Evers [1] enlarged the range of application

Lemke Method

of Lemke’s algorithm, and showed that it could solve the bimatrix game. P.M. Pardalos and J.B. Rosen [9] presented a global optimization approach to LCP. D. Solow and P. Sengupta [11] proposed a finite descent theory for the linear complementarity problem. M.M. Kostreva [4] showed that without the nondegeneracy assumption, Lemke’s algorithm may cycle, and showed that the minimum length of such a cycle is four. The linear complementarity problem considered is: Given an (n × n)-matrix M and an (n × 1) column vector q, problem LCP(q, M) is to find x (or prove that no such x exists) in Rn satisfying y D Mx C q;

(1)

y i  0;

(2)

x i  0;

(3)

y i  x i D 0;

(4)

for all i, i = 1, . . . , n. Clearly these conditions are equivalent to y| x = 0. The variables (yi , xi ) are called a complementary pair of variables. Lemke’s algorithm is organized relative to the following extended system of equations: y D Mx C q C x0 d;

(5)

where d is an (n × 1) column vector, and x0  0. Relative to the vector d, it is only required that (q + x0 d)  0 for some x0  0. It is assumed that the system of equations (5) is nondegenerate, that is, any solution has at most n + 1 zero values among the variables (y, x, x0 ). Lemke’s Algorithm If q > 0, terminate with a complementary feasible solution, y = q, x = 0. If q has some negative component, then on the first pivot x0 is increased until for the first time y = q + x0 d  0. When this occurs, some y variable, say yr becomes zero. The first pivot is to exchange the variables x0 and yr . Now the variable x0 is basic, and the variables yr and xr are two complementary non basic variables. If a pivot can be made on variable xr (complement of the most recently pivoted member of the complementary pair), then it leads to another similar situation with an-

L

other pair of complementary variables. If a pivot cannot be made, the sequence is terminated. If the variable x0 becomes non basic (zero), a solution is at hand. If not, the pivoting continues uniquely, with each new set of equations containing a non basic complementary pair of variables, one of which is most recently made non basic. Due to the unique choices of pivot row and pivot column, finite termination must occur. Under certain conditions, including the positive semidefinite matrices, the condition of termination without finding a pivot (also called secondary ray termination) can be shown to imply that the set {x: y = Mx + q  0, x  0} is empty. Under such conditions, Lemke’s algorithm is said to process the LCP: either it is solved, or it is shown not to have a feasible solution. The set of all LCPs which Lemke’s algorithm will process is unknown, but some recent papers shed light on its processing domain. Kostreva and M.M. Wiecek [5] use a multiple objective optimization approach which eventually results in a larger dimensioned LCP, while G. Isac, Kostreva and Wiecek [3] point out a set of problems which is impossible for Lemke’s method to process. Example 1 Consider the LCP corresponding to the quadratic programming problem 8 ˆ ˆ x : Ax D b; x  0 ; together with its standard dual ˚ max b > y : A> y  c : One of the most efficient, and for a long time the only, practical method to solve LO problems was the simplex method of G.B. Dantzig. The simplex method is a pivot algorithm that traverses through feasible basic solutions while the objective value is improving. The simplex method is practically one of the most efficient algorithms but it is theoretically a finite algorithm only for nondegenerate problems. A basis is called primal degenerate if at least one of the basic variables is zero; it is called dual degenerate if the reduced cost of at least one nonbasic variable is zero. In general, the basis is degenerate if it is either primal or dual, or both primal and dual degenerate. The LO problem is degenerate, if it has a degenerate basis. A pivot is called degenerate when after the pivot the objective remains unchanged. When the problem is degenerate the objective might stay the same in subsequent iterations and the simplex algorithm may cycle, i. e. starting from a basis, after some iterations the same basis is revisited and this process is repeated endlessly. Because the simplex method produces a sequence with monotonically improving objective values, the objective stays constant in a cycle, thus each pivot in the cycle must be degenerate. The possibility of cycling was recognized shortly after the invention of the simplex algorithm. Cycling examples were given by E.M.L. Beale [2] and by A.J. Hoffman [10]. Recently (1999) a scheme to construct cycling LO examples is presented in [9]. These examples made evident that extra techniques are needed to ensure finite termination of simplex methods. The first and widely used such tool is the lexico-

Lexicographic Pivoting Rules

graphic simplex rule. Other techniques, like the leastindex anticycling rules (cf.  Least-index anticycling rules) and more general recursive schemes were developed more recently.

0

1 Lexicographic Simplex Methods First we need to define an ordering, the so-called lexicographic ordering of vectors. Lexicographic Ordering

2

An n-dimensional vector u = (u1 , . . . , un ) is called lexicographically positive or, in other words, lexico-positive if its first nonzero coordinate is positive, i. e. for a certain j  n one has ui = 0 for i < j and xu > 0. Observe, that the zero vector is the only lexico-nonnegative vector which is not lexico-positive. The vector u0 is said to be lexicographically smaller than a vector u1 when the difference u1  u0 of the two vectors is lexico-positive. Further, if a finite set of vectors {u0 , . . . , uk } is given, then the vector u0 is said to be lexico-minimal in the given set, when u0 is lexicographically smaller than ui for all 1  i  k. The Lexicographic Primal Simplex Method Cycling of the simplex method is possible only when the LO problem is degenerate. In that case possibly many variables are eligible to enter and to leave the basis. The lexicographic primal simplex rule makes the selection of the leaving variable uniquely determined when the entering variable is already chosen. The Use of Lexicographic Ordering At start a feasible lexico-positive basis tableau is given. A basis tableau is called lexico-positive if, except the reduced cost row, all of its row vectors are lexico-positive. Any feasible basis tableau can be made lexico-positive by a simple rearrangement of its columns. Specifically, we can take the solution column as the first one, and then take the current basic variables, in an arbitrary order, followed by the nonbasic variables, again in an arbitrary ordering. The following lexicographic simplex pivot selection rule was first proposed by Dantzig, A. Orden and P. Wolfe [7].

3

L

Initialization. Let T(B) be a given primal feasible lexicopositive basis tableau. (Fix the order of the variables.) Entering variable selection. Choose a dual infeasible variable, i.e. one with negative reduced cost. Let its index be q. IF no such variable exists, THEN STOP; The tableau T(B) is optimal and this way a pair of optimal solutions is obtained. Leaving variable selection. Collect in column q all the candidate pivot elements that satisfy the usual pivot selection conditions of the primal simplex method. Let K = fi1 ; : : : ; i k g be the set of the indices of the candidate leaving variables. IF there is no pivot candidate, THEN STOP; The primal problem is unbounded, and so the dual problem is infeasible. IF there is a unique pivot candidate fpg = K to leave the basis, THEN go to Step 3. IF there are more pivot candidates, THEN look at the row vectors t i ; i 2 K, of the basis tableau (note that by construction x i is the first coordinate of t i ). Let p be the pivot row if t p is lexico-minimal in this set of row vectors. Basis transformation. Pivot on (p; q). Go to Step 1.

The lexicographic primal simplex rule

The following two observations are important. First note that lexicographic selection plays role only when the leaving variable is selected. In that case some rows of the tableau are compared in the lexicographic ordering. If the basis variables were originally out right after the solution column, as proposed in order to get a lexicopositive initial tableau, then this comparison is already decided when one considers only the columns corresponding to the initial basis. This claim holds, because those columns form a basis, thus the related row vectors are linearly independent as well. On the other hand, when the initial basis is the unit matrix, then at each pivot the basis inverse can be found, in the place of the initial unit matrix. When these

1871

1872

L

Lexicographic Pivoting Rules

two observations are put together, it can be concluded that instead of using the rows of the basis tableau, the rows of the basis inverse headed by the corresponding solution coordinate, can be used in Step 2. to determine the unique leaving variable. As a consequence one do not need to calculate and store the complete basis tableau when implementing the lexicographic pivot rule. The solution and the basis inverse provide all the necessary information. The lexicographic simplex method is finite. The finiteness proof is based on the following simple properties: There is a finite number of different basis tableaus. The first row of the tableau, i. e. the vector, having the objective value as its first coordinate followed by the reduced cost vector, strictly increases lexicographically at each iteration. This fact ensures that no basis can be revisited, thus cycling is impossible. Lexicographic Ordering and Perturbation Independent of [7], A. Charnes [4] developed a technique of perturbation, that resulted in a finite simplex algorithm. This algorithm turned out to be equivalent to the lexicographic rule. The perturbation technique is as follows. Let be a sufficiently small numP ber. Let us replace bi by bi + j aij j for all i. If is small enough then the resulted problem is nondegenerate. Moreover, starting from a given primal feasible basis, the primal simplex method applied to the new problem produces exactly the same pivot sequence as the lexicographic simplex method on the original problem. In particular, when the problem is initialized with a feasible basis solution, it suffices to use the perturbation bi + i . This way only the basis part of the coefficient matrix is used in Charnes’ perturbation technic. An appealing property of the perturbation technique is that actually it is not needed to perform the perturbation with a concrete . It can be done symbolically. Lexicographic Dual Simplex Method The dual simplex method is nothing else, than the primal simplex method applied to the dual problem, when the dual problem is brought in the primal standard form. This way it is straightforward to develop the lexi-

cographic, or the equivalent perturbation technique for the dual simplex method. Extensions The lexicographic rule is extensively used in proving finiteness of pivot algorithms, see e. g. [1] for an application in a monotonic build-up scheme, [14] for further references in LO and [5] for references when lexicographic degeneracy resolution is applied for complementarity problems. Lexicography and Oriented Matroids Based on the perturbation interpretation, analogous lexicographic techniques and lexicographic pivoting rules were developed for oriented matroid programming [3] (cf. also  Oriented matroids). These techniques were particularly interesting, because nondegenerate cycling [3,8] is possible in oriented matroids. An apparent difference between the linear and the oriented matroid context is that for oriented matroids none of the finite – recursive or least index type – rules yield a simplex method, i. e. a pivot method that preserves feasibility of the basis throughout. This discrepancy is also due to the possibility of nondegenerate cycling. Interestingly, in the case of oriented matroid programming the finite lexicographic method of M.J. Todd [15,16] is the only one which preserves feasibility of the basis and therefore yields a finite simplex algorithm for oriented matroids. The equivalence of Dantzig’s self—dual parametric algorithm [6] and Lemke’s complementary pivot algorithm [11,12] applied to the linear complementarity problem (cf. also  Linear complementarity problem) defined by the primal and dual LO problem was proved by I. Lustig [13]. Todd’s lexicographic pivot rule is essentially a lexicographic Lemke method (or the parametric perturbation method), when applied to the specific linear complementary problem defined by the primal-dual pair of LO problems. Hence, using the equivalence mentioned above a simplex algorithm for LO can be derived. However, it is more complicated to present this method in the linear optimization than in the complementarity context. Now Todd’s rule will be sketched for the linear case.

Lexicographic Pivoting Rules

0

1

2

3

Initialization. Let a lexico-positive feasible tableau T(B) be given. Entering variable selection. Collect all the dual infeasible variables as the set of candidate entering variables. Let their set of indices be denoted by K D . IF no such variable exists, THEN STOP; The tableau T(B) is optimal and this way a pair of optimal solutions is obtained. IF there is a unique fqg = K D candidate to enter the basis, THEN go to Step 2. IF there are more pivot candidates, THEN let q be the index of that variable whose column is lexico-minimal in the set K D . (Analogous to the dual lexicographic simplex selection rule). Leaving variable selection. Collect in column q all the candidate pivot elements that satisfy the usual pivot selection conditions of the primal simplex method. Let K P be the set of the indexes of the candidate leaving variables. IF there is no pivot candidate, THEN STOP; the primal problem is unbounded, and so the dual problem is infeasible. IF there is a unique fpg = K P pivot candidate to leave the basis, THEN go to Step 3. IF there are more pivot candidates, THEN let p be the index of that variable whose row is lexico-minimal in the set K P . (Analogous to the primal lexicographic simplex selection rule.) Basis transformation. Pivot on (p; q). Go to Step 1.

Todd’s lexicographic Lemke rule (Phase II)

In Todd’s rule the perturbation is done first in the right-hand side and then in the objective (with increasing order of the perturbation parameter ). It finally gives a two phase simplex method. For illustration only the second phase [14] is presented here. Complete description of the algorithm can be found in [3,16]. This algorithm is not only a unique simplex method for oriented matroids, but it is a novel application of lexicography in LO as well.

L

See also  Criss-cross Pivoting Rules  Least-index Anticycling Rules  Linear Programming  Pivoting Algorithms for Linear Programming Generating Two Paths  Principal Pivoting Methods for Linear Complementarity Problems  Probabilistic Analysis of Simplex Algorithms References 1. Anstreicher KM, Terlaky T (1994) A monotonic build-up simplex algorithm. Oper Res 42:556–561 2. Beale EML (1955) Cycling in the dual simplex algorithm. Naval Res Logist Quart 2:269–275 3. Bjorner A, Las Vergnas M, Sturmfels B, White N, Ziegler G (1993) Oriented matroids. Cambridge Univ. Press, Cambridge 4. Charnes A (1952) Optimality and degeneracy in linear programming. Econometrica 20(2):160–170 5. Cottle R, Pang JS, Stone RE (1992) The linear complementarity problem. Acad. Press, New York 6. Dantzig GB (1963) Linear programming and extensions. Princeton Univ. Press, Princeton 7. Dantzig GB, Orden A, Wolfe P (1955) Notes on linear programming: Part I – The generalized simplex method for minimizing a linear form under linear inequality restrictions. Pacific J Math 5(2):183–195 8. Fukuda K (1982) Oriented matroid programming. PhD Thesis Waterloo Univ. 9. Hall J, McKinnon KI (1998) A class of cycling counterexamples to the EXPAND anti-cycling procedure. Techn. Report Dept. Math. Statist. Univ. Edinburgh 10. Hoffman AJ (1953) Cycling in the simplex method. Techn Report Nat Bureau Standards 2974 11. Lemke CE (1965) Bimatrix equilibrium points and mathematical programming. Managem Sci 11:681–689 12. Lemke CE (1968) On complementary pivot theory. In: Dantzig GB, Veinott AF (eds) Mathematics of the Decision Sci. Part I. Lect Appl Math 11. Amer. Math. Soc., Providence, RI, pp 95–114 13. Lustig I (1987) The equivalence of Dantzig’s self-dual parametric algorithm for linear programs to Lemke’s algorithm for linear complementarity problems applied to linear programming. SOL Techn Report Dept Oper Res Stanford Univ 87(4) 14. Terlaky T, Zhang S (1993) Pivot rules for linear programming: A survey on recent theoretical developments. Ann Oper Res 46:203–233 15. Todd MJ (1984) Complementarity in oriented matroids. SIAM J Alg Discrete Meth 5:467–485 16. Todd MJ (1985) Linear and quadratic programming in oriented matroids. J Combin Th B 39:105–133


Linear Complementarity Problem

RICHARD W. COTTLE Stanford University, Stanford, USA

MSC2000: 90C33

Article Outline

Keywords
Synonyms
Definition
Sources of Linear Complementarity Problems
Equivalent Formulations
The Importance of Matrix Classes
Algorithms for Solving LCPs
Software
Some Generalizations
See also
References

Keywords

Quadratic programming; Bimatrix games; Matrix classes; Equilibrium problems

Synonyms

LCP

Definition

In its standard form, a linear complementarity problem (LCP) is an inequality system stated in terms of a mapping $f : \mathbb{R}^n \to \mathbb{R}^n$ where $f(x) = q + Mx$. Given f, one seeks a vector $x \in \mathbb{R}^n$ such that, for $i = 1, \ldots, n$,
$$x_i \ge 0, \qquad f_i(x) \ge 0, \qquad \text{and} \qquad x_i f_i(x) = 0. \tag{1}$$
Because the affine mapping f is specified by the vector $q \in \mathbb{R}^n$ and the matrix $M \in \mathbb{R}^{n \times n}$, the problem is ordinarily denoted LCP(q, M) or sometimes just (q, M). A system of the form (1) in which f is not affine is called a nonlinear complementarity problem and is denoted NCP(f). The notation CP(f) is meant to cover both cases. If x is a solution to (1) satisfying the additional nondegeneracy condition $x_i + f_i(x) > 0$, $i = 1, \ldots, n$, the indices i for which $x_i > 0$ or $f_i(x) > 0$ form complementary subsets of $\{1, \ldots, n\}$. This is believed to be the origin of the term complementary slackness as used in linear and nonlinear programming. It was this terminology that inspired the name complementarity problem.

Sources of Linear Complementarity Problems

The linear complementarity problem is associated with the Karush–Kuhn–Tucker necessary conditions of local optimality found in quadratic programming. This connection (as well as the more general connection of nonlinear complementarity problems with other types of nonlinear programs) was brought out in [1,2] and later in [3]. Finding solutions to such systems was one of the original motivations for studying the subject. Another was the finding of equilibrium points in bimatrix and polymatrix games. This kind of application was emphasized in [16] and [22]. These early contributions also included essentially the first algorithms for this class of problems. There are numerous applications of the linear and nonlinear complementarity problems in computer science, economics, various engineering disciplines, finance, game theory, and mathematics. One application of the LCP is in algorithms for the nonlinear complementarity problem. Descriptions of (and references to) these applications can be found in [5,27] and [17]. The survey article [10] is a rich compendium on engineering and economic applications of linear and nonlinear complementarity problems.

Equivalent Formulations

Whether linear or nonlinear, the complementarity problem expressed by the system (1) can be formulated in several equivalent ways. An obvious one calls for a solution (x, y) to the system
$$y - f(x) = 0, \qquad x \ge 0, \qquad y \ge 0, \qquad x^\top y = 0. \tag{2}$$

Another is to find a zero x of the mapping
$$g(x) = \min\{x, f(x)\}, \tag{3}$$
where the symbol $\min\{a, b\}$ denotes the componentwise minimum of the two n-vectors a and b. A third equivalent formulation asks for a fixed point of the mapping $h(x) = x - g(x)$, that is, a vector $x \in \mathbb{R}^n$ such that $x = h(x)$.
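The min-map (3) yields a simple computational test for candidate solutions: x solves LCP(q, M) precisely when g(x) vanishes componentwise. A minimal sketch (the function name and the example data are illustrative choices, not part of the article):

```python
import numpy as np

def lcp_residual(q, M, x):
    """Componentwise min-map residual g(x) = min(x, q + Mx); it is zero
    (up to rounding) exactly at solutions of LCP(q, M)."""
    return np.minimum(x, q + M @ x)

# Tiny example: M = I, q = (-1, 2); x = (1, 0) solves the LCP.
M = np.eye(2)
q = np.array([-1.0, 2.0])
x = np.array([1.0, 0.0])
print(np.allclose(lcp_residual(q, M, x), 0.0))  # True
```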


The formulation given in (3) is related to the (often nonconvex) optimization problem
$$\min_x \; x^\top (q + Mx) \quad \text{subject to} \quad q + Mx \ge 0, \quad x \ge 0. \tag{4}$$
Indeed, x solves LCP(q, M) if and only if x is an optimal solution of (4) with objective value zero.

The Importance of Matrix Classes

A matrix $M \in \mathbb{R}^{n \times n}$ is positive definite (PD) if $x^\top M x > 0$ for all $x \ne 0$. In the context of the LCP, the term PD does not require symmetry. An analogous definition (and usage) holds for positive semidefinite (PSD) matrices, namely, M is PSD if $x^\top M x \ge 0$ for all x. Some authors refer to such matrices as monotone because of their connection with monotone mappings. PSD-matrices have the property that associated LCPs (q, M) are solvable whenever they are feasible, whereas LCPs (q, M) in which $M \in$ PD are always feasible and (since PD $\subset$ PSD) are always solvable. This distinction is given a more general matrix form in [25,26]. There Q is defined as the class of all square matrices for which LCP(q, M) has a solution for all q, and $Q_0$ as the class of all square matrices for which LCP(q, M) has a solution whenever it is feasible. Although the goal of usefully characterizing the classes Q and $Q_0$ has not yet been realized, much is known about some of their special subclasses. Indeed, there are now literally dozens of matrix classes for which LCP existence theorems have been established. See [5,27] and [17] for an abundance of information on this subject. From the theoretical standpoint, the class of ‘sufficient matrices’ [6] illustrates the intrinsic role of matrix classes in the study of the LCP. A matrix $M \in \mathbb{R}^{n \times n}$ is column sufficient if
$$[\, x_i (Mx)_i \le 0 \;\; \forall i \,] \;\Longrightarrow\; [\, x_i (Mx)_i = 0 \;\; \forall i \,],$$
and row sufficient if $M^\top$ is column sufficient. When M is both row and column sufficient, it is called sufficient. Row sufficient matrices always have nonnegative principal minors, hence so do (column) sufficient matrices. These classes include both P and PSD as distinct subclasses. The row sufficient matrices form a subclass of $Q_0$; this is not true of column sufficient matrices, however. The column sufficient matrices $M \in \mathbb{R}^{n \times n}$ are characterized by the property that the solution set of LCP(q, M) is convex for every $q \in \mathbb{R}^n$. In the same spirit, a real n × n matrix M is row sufficient if and only if for every $q \in \mathbb{R}^n$, the solutions of the LCP(q, M) are precisely the optimal solutions of the associated quadratic program (4). Rather surprisingly, the class of sufficient matrices turns out to be identical to the matrix class $P_*$ introduced in [19]. See [13] and [34].

Algorithms for Solving LCPs

The algorithms for solving linear complementarity problems are of two major types: pivoting (or, direct)


and iterative (or, indirect). Algorithms of the former type are finite procedures that attempt to transform the problem (q, M) to an equivalent system of the form (q', M') in which $q' \ge 0$. Doing this is not always possible; it depends on the problem data, usually on the matrix class (such as P, PSD, etc.) to which M belongs. When this approach works, it amounts to carrying out a principal pivotal transformation on the system of equations
$$w = q + Mz.$$
To such a transformation there corresponds an index set $\alpha$ (with complementary index set $\bar\alpha = \{1, \ldots, n\} \setminus \alpha$) such that the principal submatrix $M_{\alpha\alpha}$ is nonsingular. When this (block pivot) operation is carried out, the system
$$\begin{aligned} w_\alpha &= q_\alpha + M_{\alpha\alpha} z_\alpha + M_{\alpha\bar\alpha} z_{\bar\alpha}, \\ w_{\bar\alpha} &= q_{\bar\alpha} + M_{\bar\alpha\alpha} z_\alpha + M_{\bar\alpha\bar\alpha} z_{\bar\alpha} \end{aligned}$$
becomes
$$\begin{aligned} z_\alpha &= q'_\alpha + M'_{\alpha\alpha} w_\alpha + M'_{\alpha\bar\alpha} z_{\bar\alpha}, \\ w_{\bar\alpha} &= q'_{\bar\alpha} + M'_{\bar\alpha\alpha} w_\alpha + M'_{\bar\alpha\bar\alpha} z_{\bar\alpha}, \end{aligned}$$
where
$$\begin{aligned} q'_\alpha &= -M_{\alpha\alpha}^{-1} q_\alpha, & q'_{\bar\alpha} &= q_{\bar\alpha} - M_{\bar\alpha\alpha} M_{\alpha\alpha}^{-1} q_\alpha, \\ M'_{\alpha\alpha} &= M_{\alpha\alpha}^{-1}, & M'_{\alpha\bar\alpha} &= -M_{\alpha\alpha}^{-1} M_{\alpha\bar\alpha}, \\ M'_{\bar\alpha\alpha} &= M_{\bar\alpha\alpha} M_{\alpha\alpha}^{-1}, & M'_{\bar\alpha\bar\alpha} &= M_{\bar\alpha\bar\alpha} - M_{\bar\alpha\alpha} M_{\alpha\alpha}^{-1} M_{\alpha\bar\alpha}. \end{aligned}$$
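The block formulas above transcribe directly into code. A sketch (illustrative only; the index set alpha is supplied by the caller, and the Schur-complement structure is the point of the example):

```python
import numpy as np

def principal_pivot(q, M, alpha):
    """Principal pivotal transform of w = q + Mz on index set alpha.
    Returns (q', M') following the block formulas above; requires the
    principal submatrix M[alpha, alpha] to be nonsingular."""
    n = len(q)
    a = np.asarray(alpha)
    b = np.setdiff1d(np.arange(n), a)            # complementary index set
    Maa_inv = np.linalg.inv(M[np.ix_(a, a)])
    Mab, Mba, Mbb = M[np.ix_(a, b)], M[np.ix_(b, a)], M[np.ix_(b, b)]
    qp, Mp = np.empty(n), np.empty((n, n))
    qp[a] = -Maa_inv @ q[a]
    qp[b] = q[b] - Mba @ Maa_inv @ q[a]
    Mp[np.ix_(a, a)] = Maa_inv
    Mp[np.ix_(a, b)] = -Maa_inv @ Mab
    Mp[np.ix_(b, a)] = Mba @ Maa_inv
    Mp[np.ix_(b, b)] = Mbb - Mba @ Maa_inv @ Mab  # Schur complement
    return qp, Mp
```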

There are two main pivoting algorithms used in processing LCPs. The more robust of the two is due to C.E. Lemke [21]. Lemke’s method embeds the LCP (q, M) in a problem having an extra ‘artificial’ nonbasic (independent) variable $z_0$ with coefficients specially chosen so that when $z_0$ is sufficiently large, all the basic variables become nonnegative. At the least positive value of $z_0$ for which this is so, there will (in the nondegenerate case) be exactly one basic variable whose value is zero. That variable is exchanged with $z_0$. Thereafter the method executes a sequence of (almost complementary) simple pivots. In each case, the variable becoming basic is the complement of the variable that became nonbasic in the previous exchange. The method terminates if either $z_0$ decreases to zero (in which case the problem is solved) or else there is no basic variable whose value decreases as the incoming nonbasic variable is increased. The latter outcome is called termination on a secondary ray. For certain matrix classes, termination on a secondary ray is an indication that the given LCP has no solution. Lemke’s method is studied from this point of view in [7]. The other pivoting algorithm for the LCP is called the principal pivoting method (PPM), expositions of which are given in [3] and [5]. The algorithm has two versions: symmetric and asymmetric. The former executes a sequence of principal (block) pivots of order 1 or 2, whereas the latter does sequences of almost complementary pivots, each of which results in a block principal pivot of order potentially larger than 2.

Iterative methods are often favored for the solution of very large linear complementarity problems. In such problems, the matrix M tends to be sparse (i.e., to have a small percentage of nonzero elements) and structured. Since iterative methods do not modify the problem data, these features of large scale problems can be used to advantage. Ordinarily, however, an iterative method does not terminate finitely; instead, it generates a convergent sequence of trial solutions. The older iterative LCP algorithms are based on equation-solving methods (e.g., Gauss–Seidel, Jacobi, and successive overrelaxation); the more contemporary ones are varieties of the interior point type (a small sketch of one equation-solving iteration is given after the Software paragraph below). In addition to the usual concerns about practical performance, considerable interest attaches to the development of polynomial time algorithms. Not unexpectedly, the available analysis and applicability of iterative algorithms depend heavily on the matrix class to which M belongs. Details on several such algorithms are presented in [36,37] and the monographs [5,27] and [17].

Software

For decades researchers have experimented with computer codes for various linear (and nonlinear) complementarity algorithms. By the late 1990s, this activity reached the stage where the work could be distributed as something approaching commercial software. An overview of available software for complementarity problems (mostly nonlinear) is available as [35].
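As a concrete instance of the older equation-solving iterations mentioned above (Gauss–Seidel, Jacobi, SOR), here is a minimal sketch of a projected SOR method for LCP(q, M). Convergence requires suitable assumptions on M (for example, symmetric positive definite), and the parameter choices are illustrative:

```python
import numpy as np

def psor_lcp(q, M, omega=1.0, tol=1e-10, max_iter=10_000):
    """Projected SOR for LCP(q, M): sweep through the coordinates,
    update x_i by the SOR formula, and project onto x_i >= 0."""
    n = len(q)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            r = q[i] + M[i] @ x              # i-th component of q + Mx
            x[i] = max(0.0, x[i] - omega * r / M[i, i])
        if np.linalg.norm(x - x_old) < tol:
            break
    return x
```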


Some Generalizations

Both linear and nonlinear complementarity problems have been generalized in numerous ways. One of the earliest generalizations, given in [14] and [18], is the problem CP(K, f) of finding a vector x in the closed convex cone K such that $f(x) \in K^*$ (the dual cone) and $x^\top f(x) = 0$. Through this formulation, a connection can be made between complementarity problems and variational inequality problems, that is, problems VI(X, f) wherein one seeks a vector $x^* \in X$ (a nonempty subset of $\mathbb{R}^n$) such that
$$f(x^*)^\top (y - x^*) \ge 0 \quad \text{for all } y \in X.$$

It was established in [18] that when X is a closed convex cone, say K, with dual cone $K^*$, then CP(K, f) and VI(X, f) have exactly the same solutions (if any). See [15] for connections with variational inequalities, etc. In [29] the generalized complementarity problem CP(K, f) defined above is considered as an instance of a generalized equation, namely to find a vector $x \in \mathbb{R}^n$ such that
$$0 \in f(x) + \partial \psi_K(x),$$

where $\psi_K$ is the indicator function of the closed convex cone K and $\partial$ denotes the subdifferential operator as used in convex analysis. Among the diverse generalizations of the linear complementarity problem, the earliest appears in [30]. There, for given n × n matrices A and B and an n-vector c, the authors considered the problem of finding n-vectors x and y such that
$$Ax + By = c, \qquad x, y \ge 0, \qquad x^\top y = 0.$$

A different generalization was introduced in [4]. In this sort of problem, one has an affine mapping $f(x) = q + Nx$ where N is of order $\sum_{j=1}^{k} p_j \times n$ and partitioned into k blocks; the vectors q and $y = f(x)$ are partitioned conformably. Thus,
$$y^j = q^j + N^j x \quad \text{for } j = 1, \ldots, k.$$
The problem is to find a solution of the system
$$y = q + Nx, \qquad x, y \ge 0, \qquad x_j \prod_{i=1}^{p_j} y_i^j = 0, \quad j = 1, \ldots, k.$$


In recent years, many publications, e.g. [9] and [24], have further investigated this vertical linear complementarity problem (VLCP). Interest in the model which is at the heart of [30], and is now called the horizontal linear complementarity problem (HLCP), was revived in [38], where it is used as the conceptual framework for the convergence analysis of infeasible interior point methods. (The problem also comes up in [20].) In some cases, HLCPs can be reduced to ordinary LCPs. This subject is explored in [33], which gives an algorithm for doing this when it is possible. A further generalization called the extended linear complementarity problem (ELCP) was introduced in [23] and subsequently developed in [11,12] and [32]. To this collection of LCP variants can be added the ELCP presented in [31]. The form of this model captures the previously mentioned HLCP, VLCP and ELCP.

See also

Convex-simplex Algorithm
Equivalence Between Nonlinear Complementarity Problem and Fixed Point Problem
Generalized Nonlinear Complementarity Problem
Integer Linear Complementary Problem
LCP: Pardalos–Rosen Mixed Integer Formulation
Lemke Method
Linear Programming
Order Complementarity
Parametric Linear Programming: Cost Simplex Algorithm
Principal Pivoting Methods for Linear Complementarity Problems
Sequential Simplex Method
Splitting Method for Linear Complementarity Problems
Topological Methods in Complementarity Theory

References
1. Cottle RW (1964) Nonlinear programs with positively bounded Jacobians. Univ. Calif., Berkeley, CA
2. Cottle RW (1966) Nonlinear programs with positively bounded Jacobians. SIAM J Appl Math 14:147–158
3. Cottle RW, Dantzig GB (1968) Complementary pivot theory of mathematical programming. Linear Alg & Its Appl 1:103–125
4. Cottle RW, Dantzig GB (1970) A generalization of the linear complementarity problem. J Combin Th 8:79–90


5. Cottle RW, Pang JS, Stone RE (1992) The linear complementarity problem. Acad. Press, New York
6. Cottle RW, Pang JS, Venkateswaran V (1989) Sufficient matrices and the linear complementarity problem. Linear Alg & Its Appl 114/115:231–249
7. Eaves BC (1971) The linear complementarity problem. Managem Sci 17:612–634
8. Eaves BC, Lemke CE (1981) Equivalence of LCP and PLS. Math Oper Res 6:475–484
9. Ebiefung AA (1995) Existence theory and Q-matrix characterization for the generalized linear complementarity problem. Linear Alg & Its Appl 223/224:155–169
10. Ferris MC, Pang JS (1997) Engineering and economic applications of complementarity problems. SIAM Rev 39:669–713
11. Gowda MS (1995) On reducing a monotone horizontal LCP to an LCP. Appl Math Lett 8:97–100
12. Gowda MS (1996) On the extended linear complementarity problem. Math Program 72:33–50
13. Guu S-M, Cottle RW (1995) On a subclass of $P_0$. Linear Alg & Its Appl 223/224:325–335
14. Habetler GJ, Price AJ (1971) Existence theory for generalized nonlinear complementarity problems. J Optim Th Appl 7:223–239
15. Harker PT, Pang JS (1990) Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Math Program B 48:161–220
16. Howson JT Jr (1963) Orthogonality in linear systems. Rensselaer Inst. Techn., Troy, NY
17. Isac G (1992) Complementarity problems. Lecture Notes Math, vol 1528. Springer, Berlin
18. Karamardian S (1971) Generalized complementarity problem. J Optim Th Appl 8:161–168
19. Kojima M, Megiddo N, Noma T, Yoshise A (1991) A unified approach to interior point algorithms for linear complementarity problems. Lecture Notes Computer Sci, vol 538. Springer, Berlin
20. Kuhn D, Löwen R (1987) Piecewise affine bijections of $\mathbb{R}^n$ and the equation $Sx^+ - Tx^- = y$. Linear Alg & Its Appl 96:109–129
21. Lemke CE (1965) Bimatrix equilibrium points and mathematical programming. Managem Sci 11:681–689
22. Lemke CE, Howson JT Jr (1964) Equilibrium points of bimatrix games. SIAM J Appl Math 12:413–423
23. Mangasarian OL, Pang JS (1995) The extended linear complementarity problem. SIAM J Matrix Anal Appl 16:359–368
24. Mohan SR, Neogy SK (1997) Vertical block hidden Z-matrices and the generalized linear complementarity problem. SIAM J Matrix Anal Appl 18:181–190
25. Murty KG (1968) On the number of solutions to the complementarity problem and spanning properties of complementary cones. Univ. Calif., Berkeley, CA
26. Murty KG (1972) On the number of solutions to the complementarity problem and spanning properties of complementary cones. Linear Alg & Its Appl 5:65–108
27. Murty KG (1988) Linear complementarity: linear and nonlinear programming. Heldermann, Berlin
28. Pang JS (1995) Complementarity problems. In: Horst R, Pardalos PM (eds) Handbook Global Optim. Kluwer, Dordrecht, pp 271–338
29. Robinson SM (1979) Generalized equations and their solutions, Part I: Basic theory. Math Program Stud 10:128–141
30. Samelson H, Thrall RM, Wesler O (1958) A partition theorem for Euclidean n-space. Proc Amer Math Soc 9:805–807
31. De Schutter B, De Moor B (1996) The extended linear complementarity problem. Math Program 71:289–326
32. Sznajder R, Gowda MS (1995) Generalizations of $P_0$- and $P_*$-properties; Extended vertical and horizontal LCPs. Linear Alg & Its Appl 223/224:695–715
33. Tütüncü RH, Todd MJ (1995) Reducing horizontal linear complementarity problems. Linear Alg & Its Appl 223/224:717–730
34. Väliaho H (1996) $P_*$-matrices are just sufficient. Linear Alg & Its Appl 239:103–108
35. Website: www.cs.wisc.edu/cpnet
36. Ye Y (1993) A fully polynomial-time approximation algorithm for computing a stationary point of the general linear complementarity problem. Math Oper Res 18:334–346
37. Yoshise A (1996) Complementarity problems. In: Terlaky T (ed) Interior point methods of mathematical programming. Kluwer, Dordrecht, pp 297–367
38. Zhang Y (1994) On the convergence of a class of infeasible interior-point algorithms for the horizontal linear complementarity problem. SIAM J Optim 4:208–227

Linear Optimization: Theorems of the Alternative
ThAlt

KEES ROOS
Department ITS/TWI/SSOR, Delft University Technol., Delft, The Netherlands

MSC2000: 15A39, 90C05

Article Outline

Keywords
See also
References


Keywords

Inequality systems; Duality; Certificate; Transposition theorem

If one has two systems of linear relations, where each relation is either a linear equation (or linear equality relation) or a linear inequality relation (of type >, ≥, < or ≤), the two systems are called a pair of alternative systems if exactly one of them is solvable; a theorem stating that two systems form such a pair is called a theorem of the alternative. The table lists ten pairs of alternative systems.

Author                     System I                              System II
—                          Ax ≥ 0, Bx > 0, Cx = 0                y^T A + v^T B + w^T C = 0, y ≥ 0, 0 ≠ v ≥ 0
P. Gordan (1873) [7]       Ax > 0                                y^T A = 0, 0 ≠ y ≥ 0
J. Farkas (1902) [3]       Ax = b, x ≥ 0                         y^T A ≥ 0, y^T b < 0
Farkas (1902) [3]          Ax ≤ b                                y ≥ 0, y^T A = 0, y^T b < 0
E. Stiemke (1915) [13]     Ax = 0, x > 0                         y^T A ≥ 0, y^T A ≠ 0
W.B. Carver (1921) [2]     Ax < b                                y^T A = 0, y ≥ 0, y^T b ≤ 0, y ≠ 0
T.S. Motzkin (1936) [10]   Ax ≤ 0, Bx < 0                        y^T A + v^T B = 0, y ≥ 0, v ≥ 0, v ≠ 0
J. Ville (1938) [15]       Ax > 0, x > 0                         y^T A ≤ 0, y ≥ 0, y ≠ 0
A.W. Tucker (1956) [14]    Ax ≥ 0, Ax ≠ 0, Bx ≥ 0, Cx = 0        y^T A + v^T B + w^T C = 0, y > 0, v ≥ 0
D. Gale (1960) [5]         Ax ≤ b                                y^T A = 0, y^T b = −1, y ≥ 0

Ten pairs of alternative systems

Given any system of linear relations, a crucial question is whether the system has a solution or not. Knowing the answer to this question one is able to answer many other questions. For example, if one has a linear optimization problem LO in the standard form
$$\min_x \{\, c^\top x : Ax = b, \; x \ge 0 \,\},$$

a given real number z is a strict lower bound for the optimal value of the problem if and only if the system
$$Ax = b, \qquad c^\top x \le z, \qquad x \ge 0$$

has no solution, i. e. is infeasible. On the other hand, a given real number z is an upper bound for the optimal


value of the problem if and only if the system
$$Ax = b, \qquad c^\top x \le z, \qquad x \ge 0$$

has a solution, i.e. is feasible. If a system S has a solution then this is easy to certify, namely by giving a solution of the system. The solution then serves as a certificate for the feasibility of S. If S is infeasible, however, it is more difficult to give an easy certificate. One is then faced with the problem of how to certify a negative statement. This is in general a very nontrivial problem that also occurs in many real life situations. For example, when accused of murder, how should one prove one’s innocence? In circumstances like these it may be impossible to find an easy-to-verify certificate for the negative statement ‘not guilty’. A practical solution is the rule ‘a person is innocent until his/her guilt is certified’. Clearly, from the mathematical point of view this approach is unsatisfactory. Now suppose that there is an alternative system T and there exists a theorem of the alternative for S and T. Then we know that exactly one of the two systems has a solution. Therefore, S has a solution if and only if T has no solution. In that case, any solution of T provides a certificate for the unsolvability of S. Thus it is clear that a theorem of the alternative provides an easy-to-verify certificate for the unsolvability of a system of linear relations. The proof of any theorem of the alternative consists of two parts. Assuming the existence of a solution of one system, one needs to show that the other system is infeasible, and vice versa. It has been demonstrated above for Farkas’ lemma that one of the two implications is easy to prove. This seems to be true for each theorem of the alternative: in all cases one of the implications is almost trivial, but the other implication is highly nontrivial and very hard to prove. On the other hand, having proved one theorem of the alternative, the other theorems of the alternative easily follow. In this sense one might say that all the listed theorems of the alternative are equivalent: accepting one of them to be true, the validity of each of the other theorems can be verified easily. The situation resembles a number of cities on a high plateau: travel between them is not too difficult; the hard part is the initial ascent from the plains below [1].
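The certificate idea is easy to exercise computationally. The sketch below (my own illustration, relying on SciPy's linprog) searches for a Farkas-type certificate y with A^T y ≥ 0 and b^T y < 0 for the system Ax = b, x ≥ 0, in line with the corresponding row of the table above; any such y certifies infeasibility of the primal system:

```python
import numpy as np
from scipy.optimize import linprog

def farkas_certificate(A, b):
    """Search for y with A^T y >= 0 and b^T y < 0; any such y certifies
    that {x : Ax = b, x >= 0} is empty.  Returns y or None.
    The normalization b^T y >= -1 keeps the auxiliary LP bounded."""
    m, n = A.shape
    A_ub = np.vstack([-A.T, -b.reshape(1, -1)])   # A^T y >= 0 and b^T y >= -1
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    res = linprog(c=b, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m, method="highs")
    if res.success and res.fun < -1e-9:
        return res.x
    return None

# Example: x1 + x2 = -1 has no nonnegative solution; y = 1 certifies it.
print(farkas_certificate(np.array([[1.0, 1.0]]), np.array([-1.0])))
```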

It should be pointed out that Farkas’ lemma, or each of the other theorems of the alternative, is equivalent to the deepest result in linear optimization, namely the duality theorem for linear optimization: this theorem can be easily derived from Farkas’ lemma, and vice versa (cf. also Linear programming). In fact, in many textbooks on linear optimization the duality theorem is derived in this way [5,17], whereas in other textbooks the opposite occurs: the duality theorem is proved first and then Farkas’ lemma follows as a corollary [11]. This phenomenon is a consequence of a simple, and basic, logical principle that any duality theorem is actually equivalent to a theorem of the alternative, as has been shown in [9]. Both Farkas’ lemma and the duality theorem for linear optimization can be derived from a more general result which states that for any skew-symmetric matrix K (i.e., $K = -K^\top$) there exists a vector x such that
$$Kx \ge 0, \qquad x \ge 0, \qquad x + Kx > 0.$$

This result is due to Tucker [14], who also derives Farkas’ lemma from it, whereas A.J. Goldman and Tucker [6] show how this result implies the duality theorem for linear optimization. For recent proofs, see [12].

See also

Farkas Lemma
Linear Programming
Motzkin Transposition Theorem
Theorems of the Alternative and Optimization
Tucker Homogeneous Systems of Linear Relations

References
1. Broyden CG (1998) A simple algebraic proof of Farkas’ lemma and related theorems. Optim Methods Softw 3:185–199
2. Carver WB (1921) Systems of linear inequalities. Ann Math 23(2):212–220
3. Farkas J (1902) Theorie der Einfachen Ungleichungen. J Reine Angew Math 124:1–27
4. Fourier JBJ (1826) Solution d’une question particulière du calcul des inégalités. Nouveau Bull. Sci. Soc. Philomath. Paris, 99–100
5. Gale D (1960) The theory of linear economic models. McGraw-Hill
6. Goldman AJ, Tucker AW (1956) Theory of linear programming. In: Kuhn HW, Tucker AW (eds) Linear Inequalities and Related Systems. Ann Math Stud. Princeton Univ. Press, Princeton, 53–97


7. Gordan P (1873) Über die Auflösung Linearer Gleichungen mit Reelen Coefficienten. Math Ann 6:23–28
8. Mangasarian OL (1994) Nonlinear programming. No. 10 in Classics Appl Math. SIAM, Philadelphia
9. McLinden L (1975) Duality theorems and theorems of the alternative. Proc Amer Math Soc 53(1):172–175
10. Motzkin TS (1936) Beiträge zur Theorie der Linearen Ungleichungen. PhD Thesis, Basel
11. Padberg M (1995) Linear optimization and extensions. Algorithms and Combinatorics, vol 12. Springer, Berlin
12. Roos C, Terlaky T, Vial J-Ph (1997) Theory and algorithms for linear optimization. An interior approach. Wiley, New York
13. Stiemke E (1915) Über Positive Lösungen Homogener Linearer Gleichungen. Math Ann 76:340–342
14. Tucker AW (1956) Dual systems of homogeneous linear relations. In: Kuhn HW, Tucker AW (eds) Linear Inequalities and Related Systems. Ann Math Stud. Princeton Univ. Press, Princeton, 3–18
15. Ville J (1938) Sur la théorie générale des jeux où intervient l’habileté des joueurs. In: Ville J (ed) Applications aux Jeux de Hasard. Gauthier-Villars, Paris, pp 105–113
16. Website: www-math.cudenver.edu/~hgreenbe
17. Zoutendijk G (1976) Mathematical programming methods. North-Holland, Amsterdam

Linear Ordering Problem
LOP

PAOLA FESTA
Dip. Mat. e Inform., Università Salerno, Baronissi (SA), Italy

MSC2000: 90C10, 90C11, 90C20

Article Outline

Keywords
Problem Description
Review of Exact and Approximation Algorithms
Branch and Bound Algorithms
Linear Programming Algorithms
See also
References

Keywords

Combinatorial optimization; Greedy technique; Graph optimization; Branch and bound; Linear programming


The linear ordering problem (LOP) has a wide range of applications in several fields, such as scheduling, sports, social sciences, and economics. Due to its combinatorial nature, it has been shown to be NP-hard [5]. Like many other computationally hard problems, the linear ordering problem has captured researchers’ attention, motivating the development of efficient solution procedures. A comprehensive treatment of the state-of-the-art approximation algorithms for solving the linear ordering problem is contained in [15]. The scope of this article is to introduce the reader to this problem, providing its definition and some of the algorithms proposed in the literature for solving it efficiently.

Problem Description

The linear ordering problem (LOP) can be formulated as follows: Given a complete digraph $D_n = (V_n, E_n)$ on n nodes and given arc weights $c(i, j)$ for each arc $(i, j) \in E_n$, find a spanning acyclic tournament in $D_n$ such that the sum of the weights of its arcs is as large as possible. An equivalent mathematical formulation of the LOP ([11]) is the following: Given a matrix of weights $E = \{e_{ij}\}_{m \times m}$, find a permutation p of the columns (and rows) in order to maximize the sum of the weights in the upper triangle. Formally, the problem is to maximize
$$C_E(p) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} e_{p_i p_j},$$


where $p_i$ is the index of the column (and row) occupying position i in the permutation. The best known among the applications of the LOP occurs in economics. In fact, it is equivalent to the so-called triangulation problem for input-output tables. In this economic application, the economy (regional or national) is subdivided into sectors. An m × m input-output matrix is then created, whose entry (i, j) represents the flow of money from sector i to sector j. The sectors have to be ordered so that suppliers tend to come first, followed by customers. This can be achieved by permuting the rows and the columns of the matrix so that the sum of the entries above the diagonal is maximized, which is exactly the objective of the linear ordering problem.
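Evaluating $C_E(p)$ and improving a permutation by simple exchanges is straightforward; the following sketch (an illustration only, not one of the published algorithms discussed below) shows both:

```python
import numpy as np

def lop_value(E, p):
    """C_E(p): sum of the weights in the upper triangle under permutation p."""
    return sum(E[p[i], p[j]] for i in range(len(p) - 1)
                             for j in range(i + 1, len(p)))

def swap_local_search(E, p):
    """Accept any pairwise swap of positions that increases C_E(p)."""
    p = list(p)
    best = lop_value(E, p)
    improved = True
    while improved:
        improved = False
        for i in range(len(p) - 1):
            for j in range(i + 1, len(p)):
                p[i], p[j] = p[j], p[i]
                val = lop_value(E, p)
                if val > best:
                    best, improved = val, True
                else:
                    p[i], p[j] = p[j], p[i]  # undo the swap
    return p, best

E = np.array([[0, 3, 1], [7, 0, 2], [5, 6, 0]])
print(swap_local_search(E, [0, 1, 2]))  # a locally optimal ordering
```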


Review of Exact and Approximation Algorithms

The pioneering heuristic method for solving the LOP was proposed by H.B. Chenery and T. Watanabe [3]. Their method tries to obtain plausible rankings of the sectors of an input-output table in the triangulation problem by ranking first those sectors that have a small share of inputs from other sectors and of outputs to final demand. An extensive discussion of the heuristics proposed up to 1981 can be found in [16], while more recent work has been done in [2,11]. In [11] a heuristic algorithm based on the tabu search methodology is proposed, incorporating strategies for search intensification and diversification. For search intensification, M. Laguna and others experimented with path relinking, a strategy proposed in connection with tabu search by F. Glover and Laguna [6] and still rarely used in actual implementations. In [2] an algorithm is presented implementing a scatter search strategy, a population-based method that has been shown to lead to promising outcomes for solving difficult combinatorial and nonlinear problems. The development of exact algorithms for the LOP can be seen as connected to the development of methods for solving general integer programming problems, since any such method can be slightly modified to solve the triangulation problem. Most of those exact algorithms belong either to the branch and bound family or to the linear programming methods.

Branch and Bound Algorithms

One of the earliest published computational results using a branch and bound strategy is due to J.S. DeCani in 1972 [4]. He originally studied how to rank n objects on the basis of a number of paired comparisons. Since k persons have to pairwise compare n objects according to some criterion, a matrix $E = \{e_{ij}\}$ is built, where $e_{ij}$ is the number of persons that prefer object i to object j. The problem is to find a linear ranking of the objects reflecting the outcome of the experiment as closely as possible. In the branch and bound strategy proposed by DeCani, partial rankings are built up, and each branching operation in the tree corresponds to inserting a further object at some position in the partial ranking. At level n of the tree a complete ranking of the objects is found. The upper bounds are exploited in the usual way

for backtracking and excluding parts of the tree from further consideration. A further method for solving the LOP is the lexicographic search algorithm proposed in [9,10]. It lexicographically enumerates all permutations of the n sectors by fixing, at level k of the enumeration tree, the kth position of the permutations. In more detail, if at level k a node is generated, then the first k positions $p(1), \ldots, p(k)$ are fixed. Based on this fixing, several of Helmstädter’s conditions can be tested. If one of them is violated, then there is no relative optimum having $p(1), \ldots, p(k)$ in the first k positions. Therefore, the node currently under consideration can be ignored and a backtracking step is performed. By using this method all relatively optimal solutions are enumerated, since there is no bounding according to objective function values. At the end, the best one among them is kept. Starting from lexicographic search, [8] proposed a lexicographic branch and bound scheme. Other authors have also proposed branch and bound methods, such as [7,12], and [14].

Linear Programming Algorithms

All linear programming approaches are based on the consideration that the triangulation problem can be formulated as a 0–1 integer programming problem using the 3-dicycle inequalities. In [13] the LP relaxation using the tournament polytope $P_C^n$ is proposed and the corresponding full linear program is solved in its dual version. In [1] the LP relaxation is used for solving scheduling problems with precedence constraints. It is easy to see that the scheduling problem of minimizing the total weighted completion time of a set of processes on a single processor can be formulated as a linear ordering problem. Other possibilities for theoretically solving linear ordering problems are methods such as dynamic programming, or formulating the problem as a quadratic assignment problem ([10]).

See also

Assignment and Matching
Assignment Methods in Clustering
Bi-objective Assignment Problem
Communication Network Assignment Problem
Complexity Theory: Quadratic Programming


Feedback Set Problems
Frequency Assignment Problem
Generalized Assignment Problem
Graph Coloring
Graph Planarization
Greedy Randomized Adaptive Search Procedures
Maximum Partition Matching
Quadratic Assignment Problem
Quadratic Fractional Programming: Dinkelbach Method
Quadratic Knapsack
Quadratic Programming with Bound Constraints
Quadratic Programming Over an Ellipsoid
Quadratic Semi-assignment Problem
Standard Quadratic Optimization Problems: Algorithms
Standard Quadratic Optimization Problems: Applications
Standard Quadratic Optimization Problems: Theory


References
1. Boenchendorf K (1982) Reihenfolgenprobleme/Mean-flow-time sequencing. Math Systems in Economics. Athenäum–Hain–Scriptor–Hanstein, Königstein/Ts.
2. Campos V, Glover F, Laguna M, Martí R (1999) An experimental evaluation of a scatter search for the linear ordering problem. Manuscript, April
3. Chenery HB, Watanabe T (1958) International comparisons of the structure of production. Econometrica 26(4):487–521
4. DeCani JS (1972) A branch and bound algorithm for maximum likelihood paired comparison ranking. Biometrika 59:131–135
5. Garey MR, Johnson DS (1979) Computers and intractability: A guide to the theory of NP-completeness. Freeman, New York
6. Glover F, Laguna M (1997) Tabu search. Kluwer, Dordrecht
7. Hellmich K (1970) Ökonomische Triangulierung. Heft 54. Rechenzentrum Graz, Graz
8. Kaas R (1981) A branch and bound algorithm for the acyclic subgraph problem. Europ J Oper Res 8:355–362
9. Korte B, Oberhofer W (1968) Zwei Algorithmen zur Lösung eines Komplexen Reihenfolgeproblems. Unternehmensforschung 12:217–231
10. Korte B, Oberhofer W (1969) Zur Triangulation von Input-Output Matrizen. Jahrbuch f Nat Ök u Stat 182:398–433
11. Laguna M, Martí R, Campos V (1999) Intensification and diversification with elite tabu search solutions for the linear ordering problem. Comput Oper Res 26:1217–1230
12. Lenstra jr. HW (1973) The acyclic subgraph problem. Techn Report BW26, Math Centrum, Amsterdam
13. Marcotorchino JF, Michaud P (1979) Optimisation en analyse ordinale des données. Masson, Paris
14. Poetsch G (1973) Lösungsverfahren zur Triangulation von Input-Output Tabellen. Heft 79. Rechenzentrum Graz, Graz
15. Reinelt G (1985) The linear ordering problem: Algorithms and applications. In: Hofmann HH, Wille R (eds) Res. and Exposition in Math., vol 8. Heldermann, Berlin
16. Wessels H (1981) Triangulation und Blocktriangulation von Input-Output-Tabellen. Beiträge zur Strukturforschung, vol 63. Deutsches Inst. Wirtschaftsforschung, Berlin
17. Whitney H (1935) On the abstract properties of linear dependence. Amer J Math 57:509–533

Linear Programming
LP

PANOS M. PARDALOS
Center for Applied Optim., Department Industrial and Systems Engineering, University Florida, Gainesville, USA


MSC2000: 90C05

Article Outline

Keywords
Problem Description
The Simplex Method
See also
References

Keywords

Linear programming; Basic solution; Simplex method; Pivoting; Nondegenerate

Linear programming (LP) is a fundamental optimization problem in which a linear objective function is to be optimized subject to a set of linear constraints. Due to the wide applicability of linear programming models, an immense amount of work has appeared regarding theory and algorithms for LP since G.B. Dantzig proposed the simplex algorithm in 1947. It is not surprising that in a recent survey of Fortune 500 companies, 85% of those responding said that they had used linear programming. The history, theory, and applications of linear programming may be found in [3]. Several books


have been published on the subject (see the references section).

Problem Description

Consider the linear programming problem (in standard form):
$$\min\; c^\top x \quad \text{s.t.} \quad Ax = b, \quad x \ge 0, \tag{1}$$
where A is an m × n matrix, $b \in \mathbb{R}^m$, and $c, x \in \mathbb{R}^n$.

The Simplex Method

The simplex method solves the problem in two phases: Phase I finds an initial vertex (basic feasible solution) of the feasible region, or detects infeasibility, while Phase II starts from a vertex $x^0$ and generates a sequence of adjacent vertices $x^0, x^1, \ldots, x^N$ with $c^\top x^{i+1} < c^\top x^i$. The method terminates if either none of the edges adjacent to $x^N$ is decreasing the objective function (i.e., $x^N$ is the solution) or if an unbounded edge adjacent to $x^N$ is found, improving the objective function (i.e., the problem is unbounded).

Each step of the simplex method, moving from one vertex to an adjacent one, is called pivoting. The integer N gives the number of pivot steps in the simplex method. Phase I can be solved in a similar way to Phase II. In problems of the canonical form
$$\min\; c^\top x \quad \text{s.t.} \quad Ax \le b, \quad x \ge 0, \; b \ge 0, \tag{2}$$
there is no need for Phase I, because an initial vertex ($x_0 = 0$) is at hand. We start by considering Phase II of the simplex method, by assuming that an initial vertex (basic feasible solution) is available. Let $x_0$ be a basic feasible solution with $x_{10}, \ldots, x_{m0}$ its basic variables, and let $B = \{A_{B(i)} : i = 1, \ldots, m\}$ be the corresponding basis. If $A_j$ denotes the jth column of A ($A_j \notin B$), then
$$\sum_{i=1}^{m} x_{ij} A_{B(i)} = A_j. \tag{3}$$
In addition,
$$\sum_{i=1}^{m} x_{i0} A_{B(i)} = b. \tag{4}$$


Multiply (3) by $\theta > 0$ and subtract the result from (4) to obtain:
$$\sum_{i=1}^{m} (x_{i0} - \theta x_{ij}) A_{B(i)} + \theta A_j = b. \tag{5}$$

Assume that $x_0$ is nondegenerate. How much can we increase $\theta$ and still have a solution? We can increase $\theta$ until the first component of $(x_{i0} - \theta x_{ij})$ becomes zero, or equivalently
$$\theta_0 = \min_i \left\{ \frac{x_{i0}}{x_{ij}} : x_{ij} > 0 \right\}. \tag{6}$$
If $\theta_0 = x_{l0}/x_{lj}$, then column $A_l$ leaves the basis and $A_j$ enters the basis. If a tie occurs in (6), then the new solution is degenerate. In addition, if all $x_{ij} \le 0$, then we can move arbitrarily far without becoming infeasible; in that case the problem is unbounded. Define the new point $x'_0$ by
$$x'_{i0} = \begin{cases} x_{i0} - \theta_0 x_{ij}, & i \ne l, \\ \theta_0, & i = l, \end{cases} \tag{7}$$


and
$$B'(i) = \begin{cases} B(i), & i \ne l, \\ j, & i = l. \end{cases}$$
It is easy to see that the m columns $A_{B'(i)}$ are linearly independent. Let
$$\sum_{i=1}^{m} a_i A_{B'(i)} = a_l A_j + \sum_{\substack{i=1 \\ i \ne l}}^{m} a_i A_{B(i)} = 0.$$
Using (3) we have:
$$\sum_{\substack{i=1 \\ i \ne l}}^{m} (a_l x_{ij} + a_i) A_{B(i)} + a_l x_{lj} A_{B(l)} = 0,$$
and by linear independence of the columns $A_{B(i)}$ we have
$$a_l x_{lj} = 0, \quad a_l x_{ij} + a_i = 0 \;\Longrightarrow\; a_1 = \cdots = a_m = 0.$$
Hence, the new point $x'_0$, whose basic variables are given by (7), is a new basic feasible solution. When the basic feasible solution $x_0$ is degenerate, then some of the basic variables are zero. Therefore more than $n - m$ of the constraints $x_j \ge 0$ are satisfied as equations (are active), and so $x_0$ satisfies more than n equations. From (6) it follows that if $x_{i0} = 0$ and the corresponding $x_{ij} > 0$, then $\theta_0 = 0$ and therefore we remain at the same vertex. Note that when a basic feasible solution $x_0$ is degenerate, there can be an enormous number of bases associated with it. In fact, if $x_0$ has $k < m$ positive components, then there may be as many as $\binom{n-k}{m-k}$ different bases. In that case we may compute $x_0$ as many times as there are bases, but the sets of variables that we label basic and nonbasic are different. The cost (value of the objective function) at a basic feasible solution $x_0$, with corresponding basis B, is:
$$z_0 = \sum_{i=1}^{m} x_{i0} c_{B(i)}.$$
Suppose we bring column $A_j$ into the new basis. The following economic interpretation can be used to select the pivot column $A_j$: For every unit of the variable $x_j$ that enters the basis, an amount $x_{ij}$ of each of the variables $x_{B(i)}$ must leave. Hence, a unit increase in the variable $x_j$ results in a net change in the cost, equal to
$$\bar c_j = c_j - z_j$$
(the relative cost of column j), where $z_j = \sum_{i=1}^{m} x_{ij} c_{B(i)}$. It is profitable to bring column j into the basis exactly when $\bar c_j < 0$. Choosing the most negative $\bar c_j$ corresponds to a kind of steepest descent. However, many other selection criteria can be used (e.g., Bland’s rule, etc). If all reduced costs satisfy $\bar c_j \ge 0$, then we are at an optimal solution and the simplex method terminates. Note that relations (3) can be expressed in matrix notation by
$$BX = A \quad \text{or} \quad X = B^{-1} A;$$
that is, the matrix $X = (x_{ij})$ is obtained by diagonalizing the basic columns of A. Then
$$z_j = \sum_{i=1}^{m} x_{ij} c_{B(i)} \quad \text{or} \quad z^\top = c_B^\top X = c_B^\top B^{-1} A.$$
Suppose $\bar c = c - z \ge 0$. Let y be a feasible point. Then
$$c^\top y \ge z^\top y = c_B^\top B^{-1} A y = c_B^\top B^{-1} b = c^\top x_0,$$
and therefore $x_0$ is an optimal solution.


Under the assumption of nondegeneracy, with our pivot selection, $x_{l0} > 0$ (see (6)) and
$$z'_0 = z_0 - \frac{x_{l0}}{x_{lj}} (z_j - c_j) < z_0 \qquad (\text{since } c_j - z_j < 0).$$
Note that corresponding to any basis there is a unique $z_0$, and hence we can never return to a previous basis. Therefore, each iteration gives a different basis, and the simplex method terminates after $N \le \binom{n}{m}$ pivots.
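The mechanics above translate directly into code. The following is a minimal dense-matrix sketch of Phase II (my own illustration, not from the article; it has no anti-cycling safeguards, so it assumes nondegeneracy as in the text):

```python
import numpy as np

def simplex_phase2(A, b, c, basis):
    """Revised simplex, Phase II: 'basis' is a list of m column indices of A
    giving a basic feasible solution. Returns (x, basis) at optimality,
    or raises if the problem is unbounded."""
    m, n = A.shape
    while True:
        B = A[:, basis]
        x_B = np.linalg.solve(B, b)              # current basic solution
        y = np.linalg.solve(B.T, c[basis])       # simplex multipliers
        cbar = c - A.T @ y                       # reduced costs
        j = next((k for k in range(n)
                  if k not in basis and cbar[k] < -1e-9), None)
        if j is None:                            # all cbar >= 0: optimal
            x = np.zeros(n)
            x[basis] = x_B
            return x, basis
        d = np.linalg.solve(B, A[:, j])          # column (x_{ij}) of X
        ratios = [(x_B[i] / d[i], i) for i in range(m) if d[i] > 1e-9]
        if not ratios:
            raise ValueError("problem is unbounded")
        _, l = min(ratios)                       # ratio test, cf. (6)
        basis[l] = j                             # A_j replaces A_{B(l)}
```

For the canonical form (2), appending slack variables yields an identity block whose columns serve as the starting basis.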

See also

Affine Sets and Functions
Carathéodory Theorem
Convex-simplex Algorithm
Criss-cross Pivoting Rules
Farkas Lemma
Gauss, Carl Friedrich
Global Optimization in Multiplicative Programming
History of Optimization
Kantorovich, Leonid Vitalyevich
Krein–Milman Theorem
Least-index Anticycling Rules
Lemke Method
Lexicographic Pivoting Rules
Linear Complementarity Problem
Linear Optimization: Theorems of the Alternative
Linear Space
Motzkin Transposition Theorem
Multiparametric Linear Programming
Multiplicative Programming
Parametric Linear Programming: Cost Simplex Algorithm
Pivoting Algorithms for Linear Programming Generating Two Paths
Principal Pivoting Methods for Linear Complementarity Problems
Probabilistic Analysis of Simplex Algorithms
Sequential Simplex Method
Simplicial Pivoting Algorithms for Integer Programming
Tucker Homogeneous Systems of Linear Relations

References
1. Ahuja RK, Magnanti TL, Orlin JB (1993) Network flows: Theory, algorithms and applications. Prentice-Hall, Englewood Cliffs, NJ
2. Bertsimas D, Tsitsiklis JN (1997) Introduction to linear optimization. Athena Sci., Belmont, MA
3. Dantzig GB (1963) Linear programming and extensions. Princeton Univ. Press, Princeton
4. Fang S-C, Puthenpura S (1993) Linear optimization and extensions. Prentice-Hall, Englewood Cliffs, NJ
5. Papadimitriou CH, Steiglitz K (1982) Combinatorial optimization: Algorithms and complexity. Prentice-Hall, Englewood Cliffs, NJ
6. Roos C, Terlaky T, Vial J-Ph (1998) Theory and algorithms for linear optimization: An interior point approach. Wiley, New York

Linear Programming: Interior Point Methods

KURT M. ANSTREICHER
University Iowa, Iowa City, USA

MSC2000: 90C05

Article Outline

Keywords
See also
References

Keywords

Linear programming; Interior point methods; Polynomial time algorithm

An enormous amount of research on interior point algorithms for linear programming (LP) has been conducted since N.K. Karmarkar [8] announced his celebrated projective algorithm in 1984. Interior point algorithms for LP are interesting for two different reasons. First, many interior point methods are polynomial time algorithms for LP. Consider a standard form problem
$$\min\; c^\top x \quad \text{s.t.} \quad Ax = b, \quad x \ge 0.$$
Path following algorithms, the first of which is due to J. Renegar [17], follow the central path, whose points minimize the logarithmic barrier function
$$c^\top x - \mu \sum_{i=1}^{n} \ln(x_i)$$

over $\{x : Ax = b, \; x > 0\}$, for $\mu \in (0, \infty)$. Later C. Roos and J.-Ph. Vial [18], and Gonzaga [6], developed ‘long step’ path following algorithms. These algorithms are based on properties of the central path, but the iterates are not constrained to remain in a small neighborhood of the path. Long step path following algorithms are very closely related to the classical sequential unconstrained minimization technique (SUMT) of A.V. Fiacco and G.P. McCormick [4]. A different class of interior point algorithms is based on Karmarkar’s use of a potential function, a surrogate for the original objective, to monitor the progress of his projective algorithm. Gonzaga [7] and Y. Ye [23] devised the first potential reduction algorithms. These algorithms are based on reducing a potential function but do not employ projective transformations. Ye’s potential reduction algorithm requires $O(\sqrt{n}\,L)$ iterations, like path following algorithms, and provides an $O(n^3 L)$ algorithm for LP when implemented with partial updating. All of the algorithms mentioned to this point are based on solving LP, or alternatively the dual problem
$$\text{LD} \qquad \max\; b^\top y \quad \text{s.t.} \quad A^\top y + s = c, \quad s \ge 0.$$

Algorithms for solving LP typically generate feasible solutions to LD, and vice versa, but the algorithms are not symmetric in their treatment of the two problems. A different class of interior point methods, known as primal-dual algorithms, is completely symmetric in the


variables x and s. Primal-dual algorithms are based on applying Newton’s method directly to the system of equations
$$\text{PD}(\mu) \qquad \begin{cases} Ax = b, \\ A^\top y + s = c, \\ x \circ s = \mu e, \end{cases}$$

where $e \in \mathbb{R}^n$ is the vector of ones, $\mu$ is a positive scalar, and $x \circ s$ is the vector whose ith component is $x_i s_i$. Solutions $x > 0$ and $s > 0$ to PD($\mu$) are exactly on the central paths for LP and LD, respectively. Most primal-dual algorithms fit into the path following framework. The idea of a primal-dual path following algorithm was first suggested by N. Megiddo [13], and complete algorithms were first devised by R.C. Monteiro and I. Adler [15] and M. Kojima, S. Mizuno, and Y. Yoshise [10]. It is widely believed that primal-dual methods are in practice the best performing interior point algorithms for LP. One advantage of the system PD($\mu$) is that Newton’s method can be applied even when the current $x > 0$ and $s > 0$ are not feasible in LP and LD. This infeasible interior point (IIP) strategy was first employed in the OB1 code of I.J. Lustig, R.E. Marsten, and D.F. Shanno [11]. The solution to the Newton equations with $\mu = 0$ is referred to as the predictor, or primal-dual affine scaling direction, while the solution with $\mu = x^\top s / n$, for the current solutions x and s, is called the corrector, or centering direction. The primal-dual predictor-corrector algorithm alternates between the use of these two directions. One implementation of the IIP predictor-corrector strategy, due to S. Mehrotra [14], has worked particularly well in practice. Despite the fact that primal-dual IIP algorithms were very successfully implemented, it proved to be quite difficult to characterize the convergence of these methods. The first such analyses, by Kojima, Megiddo, and Mizuno [9], and Y. Zhang [25], were followed by a large number of papers giving convergence/complexity results for various IIP algorithms. Ye, M.J. Todd, and Mizuno [24] devised a ‘self-dual homogeneous’ interior point method that has many of the practical features of IIP methods but at the same time has stronger convergence properties. An implementation of the homogeneous algorithm [22] exhibits excellent behavior, particularly when applied to infeasible or near-infeasible problems.
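Returning to the system PD(μ) above: one Newton step solves a linear system in (Δx, Δy, Δs). The sketch below assembles and solves that system densely (a didactic formulation of my own; practical codes reduce to the normal equations or the augmented system):

```python
import numpy as np

def newton_step_pd(A, b, c, x, y, s, mu):
    """One Newton step for Ax = b, A^T y + s = c, x∘s = mu*e, linearized
    at the current (possibly infeasible) x > 0, s > 0."""
    m, n = A.shape
    r_p = b - A @ x                     # primal residual
    r_d = c - A.T @ y - s               # dual residual
    r_c = mu * np.ones(n) - x * s       # centering residual
    # KKT system in (dx, dy, ds); block rows: primal, dual, complementarity
    K = np.zeros((2 * n + m, 2 * n + m))
    K[:m, :n] = A
    K[m:m + n, n:n + m] = A.T
    K[m:m + n, n + m:] = np.eye(n)
    K[m + n:, :n] = np.diag(s)
    K[m + n:, n + m:] = np.diag(x)
    d = np.linalg.solve(K, np.concatenate([r_p, r_d, r_c]))
    return d[:n], d[n:n + m], d[n + m:]  # dx, dy, ds
```

A step length is then chosen to keep x + αΔx and s + αΔs positive; with μ = 0 this gives the predictor direction, and with μ = xᵀs/n the corrector.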

Many interior point algorithms for LP can be extended to more general optimization problems. Primal-dual algorithms generalize very naturally to the monotone linear complementarity problem (LCP; cf. Linear complementarity problem); in fact many papers on primal-dual algorithms (for example [25]) are written in terms of the LCP. As a result these algorithms immediately provide interior point solution methods for convex quadratic programming (QP) problems. Interior point algorithms can also be generalized to apply to quadratically constrained quadratic programming (QCQP), optimization over second order cone (SOC) constraints, and semidefinite programming (SDP); for details on these and other extensions see [16]. The application of interior point methods to SDP is particularly rich in applications, as described in [1] and [20], and remains the topic of extensive research.

See also

Entropy Optimization: Interior Point Methods
Homogeneous Selfdual Methods for Linear Programming
Interior Point Methods for Semidefinite Programming
Linear Programming: Karmarkar Projective Algorithm
Potential Reduction Methods for Linear Programming
Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems
Successive Quadratic Programming: Solution by Active Sets and Interior Point Methods

References
1. Alizadeh F (1995) Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM J Optim 5:13–51
2. Barnes ER (1986) A variation on Karmarkar’s algorithm for solving linear programming problems. Math Program 36:174–182
3. Dikin II (1967) Iterative solution of problems of linear and quadratic programming. Soviet Math Dokl 8:674–675
4. Fiacco AV, McCormick GP (1990) Nonlinear programming, sequential unconstrained minimization techniques. SIAM, Philadelphia
5. Gonzaga CC (1989) An algorithm for solving linear programming problems in O(n^3 L) operations. In: Megiddo N (ed) Progress in Mathematical Programming. Springer, Berlin, pp 1–28
6. Gonzaga CC (1991) Polynomial affine algorithms for linear programming. Math Program 49:7–21
7. Gonzaga CC (1991) Large-step path-following methods for linear programming, Part I: Barrier function method. SIAM J Optim 1:268–279
8. Karmarkar N (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4:373–395
9. Kojima M, Megiddo N, Mizuno S (1993) A primal-dual infeasible-interior-point algorithm for linear programming. Math Program 61:263–280
10. Kojima M, Mizuno S, Yoshise A (1989) A primal-dual interior point algorithm for linear programming. In: Megiddo N (ed) Progress in Mathematical Programming. Springer, Berlin, pp 29–47
11. Lustig IJ, Marsten RE, Shanno DF (1991) Computational experience with a primal-dual interior point method for linear programming. Linear Alg Appl 152:191–222
12. Mascarenhas WF (1997) The affine scaling algorithm fails for stepsize 0.999. SIAM J Optim 7:34–46
13. Megiddo N (1989) Pathways to the optimal set in linear programming. In: Megiddo N (ed) Progress in Mathematical Programming. Springer, Berlin, pp 131–158
14. Mehrotra S (1992) On the implementation of a primal-dual interior point method. SIAM J Optim 2:575–601
15. Monteiro RC, Adler I (1989) Interior path following primal-dual algorithms. Part I: linear programming. Math Program 44:27–41
16. Nesterov Y, Nemirovskii A (1994) Interior-point polynomial algorithms in convex programming. SIAM, Philadelphia
17. Renegar J (1988) A polynomial-time algorithm, based on Newton’s method, for linear programming. Math Program 40:59–93
18. Roos C, Vial J-Ph (1990) Long steps with the logarithmic penalty barrier function in linear programming. In: Gabszewicz J, Richard J-F, Wolsey L (eds) Economic Decision Making: Games, Economics, and Optimization. Elsevier, Amsterdam, pp 433–441
19. Vaidya PM (1990) An algorithm for linear programming which requires O(((m+n)n^2 + (m+n)^{1.5} n) L) arithmetic operations. Math Program 47:175–201
20. Vandenberghe L, Boyd S (1996) Semidefinite programming. SIAM Rev 38:49–95
21. Vanderbei RJ, Meketon MJ, Freedman BA (1986) A modification of Karmarkar’s linear programming algorithm. Algorithmica 1:395–407
22. Xu X, Hung P-F, Ye Y (1996) A simplified homogeneous self-dual linear programming algorithm and its implementation. Ann Oper Res 62:151–171
23. Ye Y (1991) An O(n^3 L) potential reduction algorithm for linear programming. Math Program 50:239–258
24. Ye Y, Todd MJ, Mizuno S (1994) An O(√n L)-iteration homogeneous and self-dual linear programming algorithm. Math Oper Res 19:53–67
25. Zhang Y (1994) On the convergence of a class of infeasible interior-point algorithms for the horizontal linear complementarity problem. SIAM J Optim 4:208–227


Linear Programming: Karmarkar Projective Algorithm
Karmarkar Algorithm

KURT M. ANSTREICHER
University Iowa, Iowa City, USA

MSC2000: 90C05

Article Outline

Keywords
See also
References

Keywords

Linear programming; Interior point methods; Projective transformation; Potential function; Polynomial time algorithm

In his groundbreaking paper [6], N.K. Karmarkar described a new interior point method for linear programming (LP). As originally described by Karmarkar, his algorithm applies to an LP problem of the form:

$$\text{KLP} \qquad \min\; c^\top x \quad \text{s.t.} \quad Ax = 0, \quad x \in S,$$

where $x \in \mathbb{R}^n$, A is an m × n matrix, and S is the simplex $S = \{x \in \mathbb{R}^n : x \ge 0, \; e^\top x = n\}$. Throughout, e denotes the vector with each component equal to one. It is assumed that e is feasible in KLP, and that the optimal objective value in KLP is exactly zero. These assumptions may seem restrictive, but it is easy to show that a standard form LP problem
$$\min\; c^\top x \quad \text{s.t.} \quad Ax = b, \quad x \ge 0 \tag{1}$$
can be transformed into an equivalent problem of the form KLP. For $x^k > 0$, $k \ge 0$, let $X^k$ be the diagonal matrix with $X^k_{ii} = x^k_i$, $i = 1, \ldots, n$. On the kth iteration, the algorithm uses a projective change of coordinates $T_k : S \to S$,

n(X k )1 x ; e > (X k )1 x

to map the point xk to e. Under the assumption that the optimal value in KLP is zero, KLP is equivalent to the transformed problem: 8 > ˆ ˆ x) 

n X

ln(x i ):

iD1

Karmarkar proved that on each iteration, the steplength ˛ in (2) can be chosen so that f () is reduced by an absolute constant ı. It is then easy to show that the iterates satisfy c| xk  ekı/n c| x0 for all k. For any positive L, it follows that if c| x0  2O(L) , then the algorithm obtains an iterate xk having c| xk  2O(L) in k =

O(nL) iterations, each requiring O(n3 ) operations. For a problem of the form KLP with integer data, it can be shown that if c| xk  2 O(L) , where L is the number of bits required to represent the problem, then an exact optimal solution can be obtained from xk via a ‘rounding’ procedure. These facts together imply that Karmarkar’s algorithm is a polynomial time algorithm for linear programming, requiring O(n4 L) operations for a problem with n variables, and integer data of bit size L. Karmarkar also described a partial updating technique that reduces the total complexity of his algorithm to O(n3.5 L) operations. Partial updating is based on using a scaling matrix e X k which is an approximation of X k , and only ‘updating’ components e X kii which differ from k X i i by more than a fixed factor. Karmarkar’s algorithm created a great deal of interest for two reasons. First, the algorithm was a polynomial time method for LP. Second, Karmarkar claimed that unlike the ellipsoid algorithm, the other wellknown polynomial time method for LP, his method performed extremely well in practice. There was some controversy at the time regarding these claims, and eventually it was discovered that most of Karmarkar’s computational results were based on the affine scaling algorithm, a simplified version of his algorithm that avoids the use of projective transformations. In any case it soon became clear that the performance of interior point algorithms for LP could be highly competitive with the simplex method, the usual solution technique, on large problems. There is a great deal of research connected with Karmarkar’s algorithm. Several authors ([1,3,4,5,9]) showed that the special form of KLP was unnecessary, and instead the projective algorithm could be directly applied to a standard form problem (1). This ‘standard form variant’ adds logic which maintains a lower bound on the unknown optimal value in (1). Later it was shown that the projective transformations could also be eliminated, giving rise to so-called potential reduction algorithms for LP. The best known potential reduction algorithm, due to Y. Ye [8], requires only p O( nL) iterations, and with an adaptation of Karmarkar’s partial updating technique has a total complexity of O(n3 L) operations. The survey articles [2] and [7] give extensive references to research connected with Karmarkar’s algorithm, and related potential reduction methods.

Linear Programming: Klee–Minty Examples

See also


Entropy Optimization: Interior Point Methods
Homogeneous Selfdual Methods for Linear Programming
Interior Point Methods for Semidefinite Programming
Linear Programming: Interior Point Methods
Potential Reduction Methods for Linear Programming
Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems
Successive Quadratic Programming: Solution by Active Sets and Interior Point Methods


References
1. Anstreicher KM (1986) A monotonic projective algorithm for fractional linear programming. Algorithmica 1:483–498
2. Anstreicher KM (1996) Potential reduction algorithms. In: Terlaky T (ed) Interior point methods of mathematical programming. Kluwer, Dordrecht, pp 125–158
3. de Ghellinck G, Vial J-Ph (1986) A polynomial Newton method for linear programming. Algorithmica 1:425–453
4. Gay DM (1987) A variant of Karmarkar’s linear programming algorithm for problems in standard form. Math Program 37:81–90
5. Gonzaga CC (1989) Conical projection algorithms for linear programming. Math Program 43:151–173
6. Karmarkar N (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4:373–395
7. Todd MJ (1997) Potential-reduction methods in mathematical programming. Math Program 76:3–45
8. Ye Y (1991) An O(n^3 L) potential reduction algorithm for linear programming. Math Program 50:239–258
9. Ye Y, Kojima M (1987) Recovering optimal dual solutions in Karmarkar’s polynomial algorithm for linear programming. Math Program 39:305–317

Linear Programming: Klee–Minty Examples

KONSTANTINOS PAPARRIZOS¹, NIKOLAOS SAMARAS¹, DIMITRIOS ZISSOPOULOS²
¹ Department Applied Informatics, University Macedonia, Thessaloniki, Greece
² Department Business Admin., Techn. Institute West Macedonia, Kozani, Greece

MSC2000: 90C05

Article Outline

Keywords
Introduction
Simplex Algorithm
Klee–Minty Examples
Applications
Smallest Index Rule
Largest Coefficient Rule
See also
References

L

Smallest Index Rule Largest Coefficient Rule

See also References Keywords Klee–Minty examples; Linear programming; Simplex algorithm; Pivoting rules The problem of determining the worst-case behavior of the simplex algorithm remained an outstanding open problem for more than two decades. In the beginning of the 1970s, V. Klee and G.J. Minty [9] solved this problem by constructing linear examples on which an exponential number of iterations is required before optimality occurs. In this article we present the Klee–Minty examples and show how they can be used to show exponential worst-case behavior for some well known pivoting rules. Introduction The problem of determining the worst-case behavior of the simplex algorithm remained an outstanding open problem for more than two decades. In the beginning of the 1970s, Klee and Minty in their classical paper [9] showed that the most commonly used pivoting rule, i. e., Dantzig’s largest coefficient pivoting rule [5], performs exponentially bad on some specially constructed linear problems, known today as Klee–Minty examples. Later on, R.G. Jeroslow [8] showed similar behavior for the maximum improvement pivoting rule. He showed this result by slightly modifying Klee–Minty examples. The Klee–Minty examples have been used by several researchers to show exponential worst-case behavior for the great majority of the practical pivoting rules. D. Avis and V. Shvatal [1] and independently, K.G. Murty [10, p. 439] showed exponential behavior for Bland’s least index pivoting rule [2] and D. Goldfarb and W. Sit [7] for the steepest edge simplex method [5]. Recently,

1891

1892

L

Linear Programming: Klee–Minty Examples

C. Roos [13] established exponential behavior for Terlaky’s criss-cross method [14] and K. Paparrizos [11] for a number of pivoting rules some of which use past history. Similar results have been derived by Paparrizos [12] for his dual exterior point algorithm and K. Dosios and Paparrizos [6] for a new primal dual pivoting rule [3]. In this paper we present the Klee–Minty examples and show some of their properties that are used in deriving complexity results of the simplex algorithm. These properties are then used to show exponential behavior for two pivoting rules; the least index and the maximum coefficient pivoting rule. The paper is self contained. Next section describes a particular form of the simplex algorithm. The Klee– Minty examples and their properties are presented in Section 3. Section 4 is devoted to complexity results.

If B is a nonsingular matrix we can set xN = 0 and compute xB from (2). Then, we find xB = B1 b. The non singular matrix B is called basic matrix or basis. The solution xN = 0 and xB = B1 b is called basic solution. If, in addition, it is xB = B1 b  0, then xB , xN is a basic feasible solution. Geometrically, a basic feasible solution of (1) corresponds to a vertex of the polyhedral set of the feasible region. If B is nonsingular, we can express the basic variables xB as a function of the non basic variable xN . We have from (2) that x B D B1 Nx N C B1 b:

(3)

Using (3), the objective function of problem (1) is written in the form > z D c> B xB C cN xN 1 1 > D c> B (B Nx N C B b) C c N x N 1 > > 1 D (c > B B N C c N )x N C c B B b:

Simplex Algorithm In describing our results we find it convenient to use the dictionary form [4] of the simplex algorithm. We will see in the next section that this form exhibits some advantages in describing the properties of the Klee–Minty examples. Consider the linear problem in standard form 8 > ˆ ˆ

> 1 z D (c > B B N C c N )x N C c B B b;

x B D B1 Nx N C B1 b:

(5)

We denote the coefficients of xN and the constant terms of (5) by H, i. e.,  >  1 1 c> c N  c> BB N BB b D H: B1 b B1 N The top row of H, row zero, is devoted to the objective function. Some times we call it cost row. The remaining rows are numbered 1, . . . , m. The ith row, 1  i  m, corresponds to the basic variable xB[i] , where B[i] denotes the ith element of B. Similarly, the jth column of H, 1  j  nm, corresponds to the nonbasic variable xN[j] . The last column of H corresponds to the constant terms. We denote the entries of H by hij . It is well known that if h0j  0, for j = 1, . . . , n  m, then xB , xN is an optimal solution to (1). In this case the algorithm terminates. Otherwise, a nonbasic variable xN[q] = xl such that h0, N[q] > 0 is chosen. Variable xl is called entering variable. If the condition h i;N[q]  0;

for i D 1; : : : ; m;

Linear Programming: Klee–Minty Examples

L

holds, problem (1) is unbounded and the algorithm stops. Otherwise, the basic variable xB[p] = xk , is determined by the following minimum ratio test x B[p] h r;N[q]  h i;nmC1 D min : 1  i  m; h i;N[q] < 0 : h i;N[q] The basic variable xk is called leaving variable. Then, the entering variable xl takes the place of the leaving variable and vice versa, i. e., it is set B[p]

N[q] and

N[q]

Linear Programming: Klee–Minty Examples, Figure 1 Feasible region of Klee–Minty example of order n = 2

B[p]:

Thus, a new basis B is constructed and the procedure is repeated. Let H be the tableau corresponding to the new basis B. It is easily seen that

hi j D

8 hpj ˆ ˆ h pq
1.

Linear Programming Models for Classification PAUL A. RUBIN The Eli Broad Graduate School of Management, Michigan State University, East Lansing, USA MSC2000: 62H30, 68T10, 90C05 Article Outline

See also  Criss-cross Pivoting Rules  Least-index Anticycling Rules  Lexicographic Pivoting Rules  Linear Programming

Keywords Introduction Models Pathologies Multiple Group Problems

Methods

1897

1898

L

Linear Programming Models for Classification

See also References Keywords Classification; Discriminant analysis; Linear programming Introduction The G-group classification problem (discriminant problem) seeks to classify members of some population into one of G predefined groups based on the value of a scoring function f applied to a vector x 2 < p of observed attributes. The scoring function is constructed using training samples drawn from each group. Of several criteria available for selecting a scoring function, expected accuracy (measured either in terms of frequency of misclassification or average cost of misclassification) predominates. The scoring function f can be vectorvalued, but when two groups are involved it is almost always scalar-valued, and scalar functions may be used even when there are more than two groups. As discussed in [8], statistical methods for constructing scoring functions revolve around estimating, directly or indirectly, the density functions of the distributions of the various groups. In contrast, a number of approaches have been proposed that in essence ignore the underlying distributions and simply try to classify the training samples with maximal accuracy, hoping that this accuracy carries over to the larger population. The use of mathematical programming was suggested at least as early as 1965 by Mangasarian [11]; interest in it grew considerably with the publication of a pair of papers by Freed and Glover in 1981 [3,4], which led to parallel streams of research in algorithm development and algorithm analysis. Though nonlinear scoring functions can be constructed, virtually all research into mathematical programming methods other than support vector machines [1] restricts attention to linear functions. This is motivated largely by tractability of the mathematical programming problems, but is bolstered by the fact that the Fisher linear discriminant function, the seminal statistically derived scoring function, is regarded as a good choice under a wide range of conditions. For the remainder of this article, we assume f to be linear. Directly maximizing accuracy on the training samples dic-

tates the use of a mixed integer program to choose the scoring function ( Mixed Integer Classification Problems). The number of binary variables in such a formulation is proportional to the size of the training samples, and so computation time grows in a nonpolynomial manner as the sample sizes increase. It is therefore natural that attention turned to more computationally efficient linear programming classification models (LPCMs). Erenguc and Koehler [2] give a thorough survey of the spectrum of mathematical programming classification models as of 1989, and Stam [14] provides a somewhat more recent view of the field. Comparisons, using both “real-world” data and Monte Carlo experiments, of the accuracy of scoring functions produced by mathematical programming models with that of statistically-derived functions has produced mixed results [14], but there is evidence that LPCMs are more robust than statistical methods to large departures from normality in the population (such as populations with mixture distributions, discrete attributes, and outlier contamination). Models When G D 2 and f is linear and scalar-valued, classification of x is based without loss of generality on whether f (x) < 0 or f (x) > 0. (If f (x) D 0; x can be assigned to either group with equal plausibility. This should be treated as a classification failure.) Barring the degenerate case f 0, the solution set to f (x) D 0 forms a separating hyperplane. Ideally, though not often in practice, each group resides within one of the half-spaces defined by that hyperplane. An early precursor to linear programming models, the perceptron algorithm [12], constructs an appropriate linear classifier in finite time when the samples are separable, but can fail if the samples are not separable. There being no way to count misclassifications in an optimization model without introducing integer variables, LPCMs must employ a surrogate criterion. A variety of criteria have been tried, all revolving around measurements of the displacement of the sample points from the separating hyperplane. Let f (x) D w0 x C w0 for some non-null coefficient vector w 2 < p and some scalar w0 . The euclidean distance from x toıthe separating hyperplane is easily shown to be j f (x)j kwk. So the value of the scoring function at each training observa-

Linear Programming Models for Classification

tion measures (to within a scalar multiple) how far the observation falls from the separating hyperplane. That distance is in turn identified as either an internal deviation or an external deviation depending on whether the observation falls in the correct or incorrect half-space. Figure 1 illustrates both types of deviation. The “hybrid” model of Glover et al. [6] is sufficiently flexible to capture the key features of most two-group models. Let X g be an N g  p matrix of training observations from group g, and let 0 and 1 denote vectors of appropriate dimension, all of whose components are 0 and 1 respectively. The core of the hybrid model, to be expanded later, is: min

2 X   ˛ g  10 e g  ˇ g  10 d g C  g e g0  ı g d g0 gD1

s.t. X1 w C w0  1 C d1  e1 C d10  1  e10  1  0 X2 w C w0  1  d2 C e2  d20  1 C e20  1  0 w; w0 free; d g ; e g ; d g0 ; e g0  0 : Variables d g and e g are intended to capture the internal and external deviations respectively of individual observations from group g, while e g0 and d g0 are intended to capture the maximum (or minimum) external and internal deviations respectively across the sample from group g. (The original hybrid model had d10 D d20 and e10 D e20 , which is unnecessarily restrictive.) The intent of Glover et al. in presenting the hybrid model was to subsume a number of previously proposed models, and so the hybrid model should be viewed as a framework. When applied, not all of the deviation variables need be present. For example, omission of e g and d g would yield a version of the “MMD” model [2], with e g0

Linear Programming Models for Classification, Figure 1 Two-Group Problem with Linear Classifier

L

the worst external deviation of any group g observation if any is misclassified (in which case d g0 D 0) and d g0 the minimum internal deviation of any group g observation if none is misclassified (in which case e g0 D 0). On the other hand, omission of e g0 and d g0 results in a variation of the “MSID” model [2], with the objective function penalizing individual external deviations (e g ) and rewarding individual internal deviations (d g ). The nonnegative objective coefficients ˛ g , ˇ g ,  g , ı g must be chosen so that the penalties for external deviations exceed the rewards for internal deviations; otherwise, the linear program becomes unbounded, as adding an equal amount to both e gn and d gn improves the objective value. Pathologies Due to their focus on minimizing error count, mixed integer classification models tend to be feasible (the trivial function f 0 is often a feasible solution) and bounded (one cannot do better than zero misclassifications). LPCMs, in contrast, tend to be “naturally” feasible but may require explicit bounding constraints. If the training samples are perfectly separable, a solution exists to the partial hybrid model with e gn D 0 for all g and n and d gn > 0 for some g and n; any positive scalar multiple of that solution is also feasible, and so the objective value is unbounded below. One way to correct this is to introduce bounds on the coefficients of the objective function, say 1  w  C1: Another potential problem has to do with what is variously referred to as the “trivial” or “unacceptable” solution, namely f 0. Consider the partial hybrid model above. The trivial solution (all variables equal to zero) is certainly feasible, with objective value zero. Given the requirement that the objective coefficients of external deviation variables dominate those of internal deviation variables, any solution with a negative objective value must perfectly separate the training samples. Contrapositively, then, if the training samples cannot be separated, the objective value cannot be less than zero, in which case the trivial solution is in fact optimal. This is undesirable: the trivial function does not classify anything. The trick is to make the trivial solution suboptimal. Some authors try to accomplish this by fixing the

1899

1900

L

Linear Programming Models for Classification

constant term w0 of the classification function at some nonzero value (typically w0 D 1). The trivial discriminant function w D 0 with nonzero constant term now misclassifies one group completely, and is unlikely to be the model’s optimal solution even when the training samples cannot be separated. There is the possibility, however, that the best linear classifier has w0 D 0, in which case this approach dooms the model to finding an inferior solution. Other approaches include various attempts to make w D 0 infeasible, such as adding the constraint kwk D 1. Unfortunately, trying to legislate the trivial solution out of existence results in a nonconvex feasible region, destroying the computational advantage of linear programming. Yet another strategy for weeding out trivial solutions is the introduction of a so-called normalization constraint. The normalization constraint proposed by Glover et al. for the hybrid model is Ng 2 X X

d gn D 1:

ing methods based on statistics or mixed integer programming, a common approach to the multiple group problem is to develop a separate scoring function for each group, and assign observations to the group whose scoring function yields the largest value at that observation. The linear programming analog would be to reward amounts by which the score f i (x) of an observation x from group i exceeds each f j (x); j ¤ i (or max j¤i f j (x)) and penalize differences in the opposite direction. This induces a proliferation of deviaP tion variables (on the order of (G  1) GgD1 N g ). Other approaches may construct discriminant functions for all pairs of groups, or for each group versus all others, and then using a “voting” procedure to classify observations [15]. A good example of the use of a vector-valued scoring function is the work of Gochet et al. [7]. They begin with one scoring function per group, and in cases where two of those functions wind up identical, add additional functions to serve as tie-breakers. Their model is:

gD1 nD0

Various pathologies have been connected to injudicious use of normalization constraints [9,10,13], including: unboundedness; trivial solutions; failure of the resulting discriminant function to adapt properly to rescaling or translation of the data (the optimal discriminant function after scaling or translating the data should be a scaled or translated version of the previously optimal discriminator, and the accuracy should be unchanged); and failure to find a discriminant function with perfect accuracy on the training samples when, in fact, they can be separated (which suggests that the discriminant function found will have suboptimal accuracy on the overall population). Indeed, Glover later changed the normalization of the hybrid model to [5] N2  10 X1 w C N1  10 X2 w D 1 to avoid some of these pathologies. Multiple Group Problems The use of a scalar-valued scoring function in an LPCM with G > 2 groups requires the a priori imposition of both a specific ordering and prescribed interval widths on the scores of the groups. This being impractical, attention turns to vector-valued functions. Whether us-

min

G G X X

10 e g h

gD1 g¤hD1

    s.t. X g w g  w h C w g0  w h0  1 C e g h  d g h D 0 G G X X

  10 d g h  e g h D q

gD1 g¤hD1

w g ; w g0 free; d g h ; e g h  0 : The scoring function corresponding to group g is f g (x) D w0g xCw g0 . “Internal” and “external” deviations now represent amounts by which the scores of observations generated by the correct functions exceed or fall short of their scores from functions belonging to other groups. The first constraint is repeated for every pair of groups g; h D 1; : : : ; G; g ¤ h. The second constraint, in which q is an arbitrary positive constant, is a normalization constraint intended to render infeasible both the trivial solution (all w g identical) and solutions for which the total of the external deviations exceeds that of the internal deviations. If w g D w h and w g0 D w h0 for some g ¤ h, the model is applied recursively to the subsamples from only those groups (possibly more than just g and h) that yielded identical scoring functions. The additional functions generated are used as tie-breakers.

Linear Space

Methods The number of constraints in an LPCM approximately equals the number of training observations, while the number of variables can range from slightly more than the number of attributes to slightly more than the sum of the number of observations and the number of attributes, depending on which deviation variables are included in the model. In practice, the number of observations will exceed the number of attributes; indeed, if the difference is not substantial, the model runs the risk of overfitting the scoring function (in the statistical sense). When the number of deviation variables is small, then, the LPCM tends to have considerably more constraints than variables, and a number of authors have suggested solving its dual linear program instead, to reduce the amount of computation. Improvements in both hardware and software have lessened the need for this, but it may still be useful when sample sizes reach the tens or hundreds of thousands (which can happen, for example, when rating consumer credit, and in some medical applications).

L

8. Hand DJ (1997) Construction and assessment of classification rules, Wiley, Chichester 9. Koehler GJ (1989) Unacceptable solutions and the hybrid discriminant model. Decis Sci 20:844–848 10. Koehler GJ (1990) Considerations for mathematical programming models in discriminant analysis. Manag Decis Econ 11:227–234 11. Mangasarian OL (1965) Linear and nonlinear separation of patterns by linear programming. Oper Res 13:444–452 12. Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408 13. Rubin PA (1991) Separation failure in linear pgogramming discriminant models. Decis Sci 22:519–535 14. Stam A (1997) Nontraditional approaches to statistical classification: Some perspectives on Lp-norm methods. Ann Oper Res 74:1–36 15. Witten IH, Frank E (2005) Data mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, Amsterdam

Linear Space LEONIDAS PITSOULIS Princeton University, Princeton, USA

See also  Deterministic and Probabilistic Optimization Models for Data Classification  Linear Programming  Mixed Integer Classification Problems  Statistical Classification: Optimization Approaches References 1. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297 2. Erenguc SS, Koehler GJ (1990) Survey of mathematical programming models and experimental results for linear discriminant analysis. Manag Decis Econ 11:215–225 3. Freed N, Glover F (1981) A linear programming approach to the discriminant problem. Decis Sci 12:68–74 4. Freed N, Glover F (1981) Simple but powerful goal programming models for discriminant problems. Eur J Oper Res 7:44–60 5. Glover F (1990) Improved linear programming models for discriminant analysis. Decis Sci 21:771–785 6. Glover F, Keene SJ, Duea RW (1988) A new class of models for the discriminant problem. Decis Sci 19:269–280 7. Gochet W, Stam A, Srinivasan V, Chen S (1997) Multigroup discriminant analysis using linear programming. Oper Res 45:213–225

MSC2000: 15A03, 14R10, 51N20 Article Outline Keywords See also Keywords Linear algebra Let F be a field, whose elements are referred to as scalars. A linear space V over F is a nonempty set on which the operations of addition and scalar multiplication are defined. That is, for any x, y 2 V, we have x + y 2 V, and for any x 2 V and ˛ 2 F we have ˛ x 2 V. Furthermore, the following properties must be satisfied: 1) x + y = y + x, 8 x, y 2 V. 2) (x + y) + z = x + (y + z), 8 x, y, z 2 V. 3) There exists an element 0 2 V, such that x + 0 = x, 8x 2 V. 4) 8 x 2 V, there exists x 2 V such that x + (x) = 0. 5) ˛ (x + y) = ˛ x + ˛ y, 8 ˛ 2 F, 8 x, y 2 V. 6) (˛ + ˇ)x = ˛x + ˇx, 8 ˛, ˇ 2 F, 8 x 2 V.

1901

1902

L

Lipschitzian Operators in Best Approximation by Bounded or Continuous Functions

7) (˛ˇ)x = ˛(ˇx), 8 ˛, ˇ 2 F, 8 x 2 V. 8) 1x = x, 8x 2 V. The elements of V are called vectors, and V is also called a vector space. See also  Affine Sets and Functions  Linear Programming

Lipschitzian Operators in Best Approximation by Bounded or Continuous Functions VASANT A. UBHAYA Department Computer Sci. and Operations Research, North Dakota State University, Fargo, USA MSC2000: 65K10, 41A30, 47A99 Article Outline Keywords Lipschitzian Selection Operators Examples and Applications See also References Keywords Approximation problem; Minimum distance problem; Best approximation; Best estimate; Selection; Continuous selection operator; Lipschitzian selection operator; Bounded function; Continuous function; Uniform norm; Quasiconvex function; Convex function; Isotone functions; Majorants and minorants

denote the shortest distance from f to K. Let also, for f in X, P( f ) D PK ( f ) D fh 2 K : k f  hk D d( f ; K)g : The set-valued mapping P on X is called the metric projection onto K. It is also called the nearest point mapping, best approximation operator, proximity map, etc. If P(f ) 6D ;, then each element in it is called a best approximation to (or a best estimate of) f from K. In practical curve fitting or estimation problems, f represents the given data and the set K is dictated by the underlying process that generates f . Because of random disturbance or noise, f is in general not in K, and it is required to estimate f by an element of K. See [7,12] and other references given there for a discussion of such problems and the use of various norms or distance functions in approximation. An approximation problem or a minimum distance problem such as (1) involves finding a best approximation, investigating its uniqueness and other properties, and developing algorithms for its computation. If P(f ) 6D ; (respectively, P(f ) is a singleton) for each f 2 X, then K is called proximinal (respectively, Chebyshev). If K is proximinal, then we define a selection operator, or simply a selection, to be any (single valued) function T on X into K so that T(f ) 2 P(f ) for every f 2 X. If K is Chebyshev, then clearly T = P and T is unique. A continuous selection operator is a selection T which is continuous. There is a vast literature available on the existence and properties of continuous selections including some survey papers. See, e. g., [1,2,3,6,8] and other references given there. A more difficult problem is finding a Lipschitzian selection operator (LSO) i. e., a selection T which satisfies kT( f )  T(h)k  c(T) k f  hk ;

Stated in simplest terms, this article considers, in an abstract mathematical framework, a curve fitting or estimation problem where a given set of data points f is approximated or estimated by an element from a set K so that the estimate of f is least affected by perturbations in f . Let X be a normed linear space with norm k  k and K be any (not necessarily convex) nonempty subset of X. For any f in X, let d( f ; K) D inf fk f  hk : h 2 Kg

(1)

all f ; h 2 X; (2)

where c(T) (a positive constant depending upon T) is the smallest value satisfying (2). An LSO T is called an optimal Lipschitzian selection operator (OLSO) if c(T)  c(T 0 ) for all LSO T 0 . If the operator T in (2) is OLSO, then (2) shows that the estimate T(f ) of f is least sensitive to changes in the given data f . Consequently, T(f ) is the most desirable estimate of f . The concept of an OLSO was introduced in [12] and the existence of an LSO and OLSO was investigated in [13,14,15,16,17]. If X is a Hilbert space and K  X is nonempty, closed and

Lipschitzian Operators in Best Approximation by Bounded or Continuous Functions

convex, then K is Chebyshev. Then T, which maps f to its unique best approximation, is an LSO, i. e., T satisfies (2) with c(T) = 1. For a proof see [5, p. 100]. Since T is unique, it is also trivially OLSO. For other spaces, the results are not so straightforward. In this paper we present several results which identify LSOs and OLSOs in approximation problems on the space of bounded or continuous functions. We illustrate these results by examples. Lipschitzian Selection Operators Let S be any set and B denote the Banach space of real bounded functions f on S with the uniform norm kf k = sup{|f (s)|: s 2 S}. Similarly, when S is topological, denote by C = C(S), the space of real bounded and continuous functions on S, again, with the uniform norm kk. Let X = B or C in what follows. We let f 2 X, K  X and d(f, K) as above. We let d(f ) = d(f, K) for convenience. For f in X, define K f = {k 2 K : k  f } and K 0 f = {k 2 K : k  f }. Let ˚ f (s) D sup k(s) : k 2 K f ; o n f (s) D inf k(s) : k 2 K 0f ;

s 2 S; s 2 S:

We state three conditions below, they are identical for X = B or C. 1) If k 2 K, then k + c 2 K for all real c. 2) If f 2 X, then f 2 K. 3) If f 2 X, then f 2 K. If f and f are in K, then they are called the greatest K-minorant and the smallest K-majorant of f , respectively. Note that condition 2) (respectively, 3)) implies that the pointwise maximum (respectively, minimum) of any two functions in K is also in K. This can be easily established by letting f = max{f 1 , f 2 } (respectively, f = min{f 1 , f 2 }) where f 1 , f 2 2 K. We call a g 2 K the maximal (respectively, minimal ) best approximation to f 2 X if g  g 0 (respectively, if g  g 0 ) for all best approximations g 0 to f . Theorem 1 Consider (1) with X = B or C, and any K  X. a) Assume K is not necessarily convex. Suppose that conditions 1) and 2) hold for K. Then d( f ) D k f  f k/2 and f 0 D f C d( f ) is the maximal best approximation to f . Also k f 0  h0 k  2 k f  h k for

L

all f , h 2 X. The operator T defined T(f ) = f 0 is an LSO with c(T) = 2. b) Assume K is not necessarily convex. Suppose that conditions 1) and 3) hold for K. Then a) holds with f replaced by f and with f 0 = f  d(f ), which is the minimal best approximation to f . c) Assume K is convex. Suppose that conditions 1), 2) and 3) hold for K. Then a) and b) given above apply. In addition, d( f ) D (k f  f k)/2. A g in K is a best approximation to f if and only if f  d( f )  g  f C d( f ). Moreover, if f 0 D ( f C f )/2, then f 0 is a best approximation to f and k f 0  h0 k  k f  h k for all f , h 2 X. The operator T defined by T(f ) = f 0 is an OLSO with c(T) = 1. The following theorem shows that the existence of a maximal (respectively, minimal) best approximation to (1) implies condition 2) (respectively, 3)). Theorem 2 Consider (1) with X = B or C, and any K  X. Assume condition 1) holds for K. Assume that the pointwise maximum (respectively, minimum) of two function in K is also in K. Then condition 2) (respectively, 3)) holds if the maximal (respectively, minimal) best approximation to f exists. This best approximation then equals f C d( f ) (respectively, f  d(f )). The above theorems and the next one appear in [14,15]. Their proofs are available there. We now define another approximation problem, closely related to (1). Let ˚ d( f ) D d( f ; K f ) D inf k f  hk : h 2 K f :

(3)

The problem is to find a g 2 {h 2 K f : k f  h k = d(f , K f )}, called a best approximation to f from K f . Theorem 3 Consider (3) with X = B or C, and any K  X which is not necessarily convex. a) Suppose that conditions 1) and 2) hold for K. Then f is best approximation to f and d( f ) D

the maximal





f  f D 2d( f ). The operator T defined by T( f ) D f is the unique OLSO with c(T) = 1. b) Assume condition 1) holds for K. Assume that the pointwise maximum of two functions in K is also in K. Then condition 2) holds if the maximum best approximation to f exists. This best approximation then equals f .

1903

1904

L

Lipschitzian Operators in Best Approximation by Bounded or Continuous Functions

mapping f to f is LSO with c(T) = 2. Now the example given in [13, p. 332] shows that T is OLSO.

Examples and Applications Example 4 (Approximation by quasiconvex functions.) Let S  Rn be nonempty convex and consider B = B(S). For C = C(S) assume S is nonempty, compact and convex. A function h 2 B is called quasiconvex if h(s C (1  )t)  maxfh(s); h(t)g; for all s; t 2 S;

0    1:

(4)

Equivalently, h in B is quasiconvex if one of the following conditions holds [9,10]:  {h  c} is convex for all real c;  {h < c} is convex for all real c. Let K be the set of all quasiconvex functions in B. It is easy to show that K and K \ C are closed cones which are not convex and both satisfy condition 1) above (K is a cone if h 2 K whenever h 2 K and   0.) The greatest K-minorant of f is called the greatest quasiconvex minorant of f . Using (4) it is easy to show that if f 2 B then such a minorant f exits in B. The next proposition shows that if f 2 C then f 2 C. Let ˘ be the set of all convex subsets of S. Clearly, ', S 2 ˘ . For any A  Rn , we denote by co(A) the convex hull of A, i. e., the smallest convex set containing A. Proposition 5 Let f 2 X and let f 0 (P) D inf f f (t) : t 2 SnPg ; P 2 ˘; ˚ f (s) D sup f 0 (P) : P 2 ˘; s 2 SnP ; s 2 S: Then the following holds:  If f 2 B (respectively, C) then f 2 B (respectively, C) and is quasiconvex. It is the greatest quasiconvex minorant of f .  An h 2 B is the greatest quasiconvex minorant of f 2 B if and only if fh < cg D cof f < cg for all real c:

(5)

 An h 2 B is the greatest quasiconvex minorant of f 2 C if and only if (5) holds or, equivalently, {h  c} = co{h  c} for all real c. This proposition and its proof appear in [15]. The proposition shows that condition 2) holds for K and K \ C. Hence, Theorems 1a) and 3a) apply to X = B and K, and also to X = C and K \ C. In particular, Theorem 1a) shows that in each of these two cases the operator T

Example 6 (Approximation by convex functions.) Let S  Rn be nonempty convex and consider B = B(S). A function h 2 B is called convex if h(s + (1  )t)   h(s) + (1  ) h(t), for all s, t 2 S and all 0    1. Clearly, a convex function is quasiconvex. Let K be the set of all convex functions in B. It is easy to show that K is a closed convex cone and satisfies condition 1). The greatest K-minorant of f is called the greatest convex minorant of f . It follows at once from the definition of a convex function that if f 2 B then such a minorant f exists in B. Condition 2) therefore holds for K. Hence, Theorems 1a) and 3a) apply to X = B and K. In particular, the LSO T of Theorem 1a) mapping f to f with c(T) = 2 can be shown to be an OLSO by using an example as in [13, p. 334]. Now consider approximation of a continuous function by continuous convex functions. For this case we let S  Rn be a polytope which is defined to be the convex hull of a finitely many points in Rn . It is compact, convex and locally simplicial [11]. Let K  C = C(S) be the set of continuous convex functions. It is easy to show that K is a closed convex cone. Again condition 1) holds for K. We assert that if f 2 C, then f is convex and continuous. This will establish that f is the greatest convex minorant of f . To establish the assertion note that f is convex since it is the pointwise supremum of convex functions. Since S is locally simplicial, [11, Corol. 17.2.1; Thm. 10.2] show that f is continuous on S. Thus, condition 2) holds for K. Hence Theorems 1a) and 3a) apply to X = C and K. In particular, the LSO T of Theorem 1a) mapping f to f with c(T) = 2 can be shown to be an OLSO by using the same example as in the bounded case above since the sequence used in that example consists of continuous functions [13]. Example 7 (Approximation by isotone functions.) Let S be any set with partial order . A partial order is a relation  on S satisfying [4, p. 4]:  reflexivity, i. e., s  s for all s 2 S; and  transitivity, i. e., if s, t, v 2 S, and s  t and t  v, then s  v. A partial order is antisymmetric if s  t and t  s imply s = t. We do not include this antisymmetry condition in the partial order for sake of generality. We consider B = B(S) as before, and define a function k in B to be isotone

Load Balancing for Parallel Optimization Techniques

if k(s)  k(t) whenever s, t 2 S and s  t. Let K  B be the set of all isotone functions. It is easy to see that K is a closed convex cone. It is nonempty since the zero function is in K. It is easy to verify that conditions 1), 2) and 3) apply to K. Thus the greatest isotone minorant f and the smallest isotone majorant f of an f in B exist. Theorem 1c) and 3a) apply and we conclude that the operator T of Theorem 1c), mapping f to ( f C f )/2, is OLSO with c(T) = 1 [15]. The next proposition gives explicit expressions for f and f . We call a subset E of S a lower (respectively, upper) set if whenever t 2 E and v  t (respectively, t  v), then v 2 E. For s in S, let Ls = {t 2 S, t  s} and U s = {t 2 S, s  t}. Then, Ls (respectively, U s ) is the smallest lower (respectively, upper) set containing s, as may be easily seen. Proposition 8 f (s) D sup f f (t) : t 2 Ls g ; f (s) D inf f f (t) : t 2 Us g : For a proof, see [15]. Now we consider an application to C. Define S = × {[ai , bi ]: 1  i  n}  Rn , where ai < bi , and let  be the usual partial order on vectors. Let C = C(S) and let K be the set of isotone functions in C. It is easy to verify that K is a closed convex cone. Furthermore, if f 2 C, then f ; f 2 C. We conclude, as before, that Theorems 1c) and 3a) apply. Various generalizations of this problem exist. See, for example, [12, Sect. 5], [15, Ex. 4.3], and [17]. As was observed in [16], the dual cone of K plays an important role in duality and approximation from K. Some properties of the cone K of isotone functions on a finite partially ordered set S and its dual cone are obtained in [18]. See also  Convex Envelopes in Optimization Problems References 1. Deutsch F (1983) A survey of metric selections. RC Sine (ed) Fixed points and Nonexpansive Mappings. In: Contemp. Math., vol 18. Amer Math Soc, Providence, pp 49–71 2. Deutsch F (1992) Selections for metric projections. SP Singh (ed) Approximation Theory, Spline Functions and Applications. Kluwer, Dordrecht, pp 123–137

L

3. Deutsch F, Li W, Park S-H (1989) Characterization of continuous and Lipschitz continuous metric selections in normed linear spaces. J Approx Theory 58:297–314 4. Dunford N, Schwartz JT (1958) Linear operators, Part I. Interscience, New York 5. Goldstein AA (1967) Constructive real analysis. Harper and Row, New York 6. Li W (1991) Continuous selections for metric projections and interpolating subspaces. In: Brosowski B, Deutsch F, Guddat J (eds) Approximation and Optimization, vol 1. P. Lang, Frankfurt, pp 1–108 7. Liu M-H, Ubhaya VA (1997) Integer isotone optimization. SIAM J Optim 7:1152–1159 8. Nurnberger G, Sommer M (1984) Continuous selections in Chebyshev approximation. In: Brosowski B, Deutsch F (eds) Parametric Optimization and Approximation. Internat Ser Numer Math, vol 72. Birkhäuser, Boston, pp 248–263 9. Ponstein J (1967) Seven kinds of convexity. SIAM Rev 9:115–119 10. Roberts AW, Varberg DE (1973) Convex functions. Acad. Press, New York 11. Rockafellar RT (1970) Convex analysis. Princeton Univ. Press, Princeton 12. Ubhaya VA (1985) Lipschitz condition in minimum norm problems on bounded functions. J Approx Theory 45:201– 218 13. Ubhaya VA (1988) Uniform approximation by quasiconvex and convex functions. J Approx Theory 55:326–336 14. Ubhaya VA (1989) Lipschitzian selections in approximation from nonconvex sets of bounded functions. J Approx Theory 56:217–224 15. Ubhaya VA (1990) Lipschitzian selections in best approximation by continuous functions. J Approx Theory 61:40– 52 16. Ubhaya VA (1991) Duality and Lispschitzian selections in best approximation from nonconvex cones. J Approx Theory 64:315–342 17. Ubhaya VA (1992) Uniform approximation by a nonconvex cone of continuous functions. J Approx Theory 68:83–112 18. Ubhaya VA (2001) Isotone functions, dual cones, and networks. Appl Math Lett 14:463–467

Load Balancing for Parallel Optimization Techniques LBDOP ANANTH GRAMA1 , VIPIN KUMAR2 1 Purdue University, West Lafayette, USA 2 University Minnesota, Minneapolis, USA MSC2000: 68W10, 90C27

1905

1906

L

Load Balancing for Parallel Optimization Techniques

Article Outline Keywords Parallel Depth-First Tree Search Parallel Best-First Tree Search Searching State Space Graphs Anomalies in Parallel Search Applications of Parallel Search Techniques See also References Keywords Parallel algorithm; Load balancing; Tree search; Graph search Discrete optimization problems are solved using a variety of state space search techniques. The choice of technique is influenced by a variety of factors such as availability of heuristics and bounds, structure of state space, underlying machine architecture, availability of memory, and optimality of desired solution. The computational requirements of these techniques necessitates the use of large scale parallelism to solve realistic problem instances. In this chapter, we discuss parallel processing issues relating to state space search. Parallel platforms have evolved significantly over the past two decades. Symmetric multiprocessors (SMPs), tightly coupled message passing machines, and clusters of workstations and SMPs have emerged as the dominant platforms. From an algorithmic standpoint, the key issues of locality of data reference and load balancing are key to effective utilization of all these platforms. However, message latencies, hardware support for shared address space and mutual exclusion, communication bandwidth, and granularity of parallelism all play important roles in determining suitable parallel formulations. A variety of metrics have also been developed to evaluate the performance of these formulations. Due to the nondeterministic nature of the computation, traditional metrics such as parallel runtime and speedup are difficult to quantify analytically. The scalability metric, Isoefficiency, has been used with excellent results for analytical modeling of parallel state space search. The state spaces associated with typical optimization problems can be fashioned in the form of either a graph or a tree. Exploiting concurrency in graphs is more difficult compared to trees because of the need

for replication checking. The availability of heuristics for best-first search imposes constraints on parallel exploration of states in the state space. For the purpose of parallel processing, we can categorize search techniques loosely into three classes: depth-first tree search techniques (a tree search procedure in which the deepest of the current nodes is expanded at each step), best-first tree search techniques (a tree search procedure in which nodes are expanded based on a global (heuristic) measure of how likely they are to lead to a solution), and graph search techniques (a search requiring additional computation for checking if a node has been encountered before, since a node can be reached from multiple paths). Many variants of these basic schemes fall into each of these categories as well. Parallel Depth-First Tree Search Search techniques in this class include ordered depthfirst search, iterative deepening A (IDA ), and depth—first branch and bound (DFBB). In all of these techniques, the key ingredient is the depth-first search of a state space (cost-bounded in the case of IDA and DFBB). DFS was among the first applications explored on early parallel computers. This is due to the fact that DFS is very amenable to parallel processing. Each subtree in the state space can be explored independently of other subtrees in the space. In simple DFS, there is no exchange of information required for exploring different subtrees. This implies that it is possible to device simple parallel formulations by assigning a distinct subtree to each processor. However, the space associated with a problem instance can be highly unstructured. Consequently, the work associated with subtrees rooted at different nodes can be very different. Therefore, a naive assignment of a subtree rooted at a distinct node to each processor can result in considerable idling overhead and poor parallel performance. The problem of designing efficient parallel DFS algorithms can be viewed in two steps: the partitioning problem and the assignment problem. The partitioning problem addresses the issue of breaking up a given search space into two subspaces. The assignment problem then maps subspaces to individual processors. There are essentially two techniques for partitioning a given search space: node splitting and stack splitting. In node splitting, the root node of a subtree is expanded

Load Balancing for Parallel Optimization Techniques

to generate a set of successor nodes. Each of these nodes represents a distinct subspace. While node splitting is easy to understand and implement, it can result in search spaces of widely varying sizes. Since the objective of the assignment problem is to balance load while minimizing work transfers, widely varying subtask sizes are not desirable. An alternate technique called stack splitting attempts to partition a search space into two by assigning some nodes at all levels leading up to the specified node. Thus if the current node is at level 4, stack splitting will split the stack by assigning some nodes at levels 1, 2, and 3 to each partition. In general, stack splitting results in a more even partitioning of search spaces than node splitting. We can now formally state the assignment problem for parallel DFS as a mapping of subtasks to processors such that:  the work available at any processor can be partitioned into independent work pieces as long as it is more than some nondecomposable unit;  the cost of splitting and transferring work to another processor is not excessive (i. e. the cost associated with transferring a piece of work is much less than the computation cost associated with it);  a reasonable work splitting mechanism is available; i. e., if work w at one processor is partitioned in 2 parts w and (1  )w, then 1 ˛ > > ˛, where ˛ is an arbitrarily small constant;  it is not possible (or is very difficult) to estimate the size of total work at a given processor. A number of mapping techniques have been proposed and analyzed in literature [5,7,8,9,11,16]. These mapping techniques are either initiated by a processor with work (sender initiated, the processor with work initiates the work transfer) or a processor looking for work (receiver initiated, an idle processor initiates the work transfer). In the global round robin request (GRR, idle processors in the global round robin scheme request processors for work in a round-robin fashion using a single (global) counter) receiver initiated scheme, a single counter specifies the processor that must receive the next request for work. This ensures that work requests are uniformly distributed across all processors. However, this scheme suffers from contention at the processor holding the counter. Consequently, the performance of this scheme is poor beyond a certain number of processors. A message combining variant of this

L

scheme (GRR-M, a variant of the global round robin scheme in which requests for value of global counter are combined to alleviate contention overheads) relies on combining intermediate requests for the counter into single request. This alleviates the contention and performance bottleneck of the GRR scheme. The asynchronous round robin balancing (ARR, i. e. each processor selects a target for work request in a round robin manner using a local counter) uses one counter at each processor. Each processor uses its counter to determine the next processor to query for work. While this scheme balances work requests in a local sense, these requests may become clustered in a global sense. In the random polling scheme (RP, i. e. idle processors send work requests to a randomly selected target processor), each processor selects a random processor and requests work. In near-neighbor load balancing scheme (NN, i. e. an idle processor requests one of its immediate neighbors for work), processors request work from their immediate neighbors. This scheme has the drawback that localized hot-spots may take a long time to even out. In sender initiated schemes a processor with work can give some of its work to a selected processor [6,16]. This class of schemes includes the master-slave (MS) and randomized allocation (RA) schemes. In the MS scheme, a processor, designated master, generates a fixed number of work pieces. These work-pieces are assigned to processors as they exhaust previously assigned work. The master may itself become the bottleneck when the number of processors is large. Multilevel master-slave algorithms have been used to alleviate this bottleneck. Randomized allocation schemes are sender initiated counterparts of RP schemes. In randomized allocation, a processor sends a part of its work to a randomly selected processor. The performance and scalability of these techniques is often dependent on the underlying architecture. Many of these techniques are, in principle scalable, i. e., they result in linear speedup on increasing the number of processors p as long as the size of the search space grows fast enough with p. It is desirable that this required rate of growth of problem size (also referred to as the iso-efficiency metric [10]) be as small as possible since it allows the use of a larger number of processors effectively for solving a given problem instance. In Table 1, we summarize the iso-efficiency functions of various load balancing techniques.

1907

1908

L

Load Balancing for Parallel Optimization Techniques

Scalability results of receiver initiated load balancing schemes for various architectures

Arch Scheme ARR

Shared

H-cube

p2 log p

Mesh (2D) p2 log2 p p2:5 log p

NN GRR

p2 log p p2 log p

plog 2 p2 log p

k p p2 log p

GRR-M

p log p

p log2 p

p1:5 log p

RP Lower Bound

p log2 p p

p log2 p p1:5 log2 p p2 log2 p p log p p1:5 p2

1+1/˛

p

W/S Cluster p3 log p p3 log p p2 log p

IDA and DFBB search techniques use this basic parallel DFS algorithm for searching state space. In IDA , each processor has a copy of the global cost bound. Processors perform parallel DFS with this cost bound. At the end of each phase, the cost is updated using a single global operation. Some schemes for allowing different processors to work with different cost bound have also been explored. In this case, a solution cannot be deemed optimal until search associated with all previous cost bounds has been completed. DFBB technique uses a global current best solution to bound parallel DFS. Whenever a processor finds a better solution, it updates this global current best solution (using a broadcast in message passing machines and a lock-set in shared memory machines). DFBB and IDA using these parallel DFS algorithms has been shown to yield excellent performance for various optimization problems [3,13,19]. In many optimization problems, the successors of nodes tend to be strongly ordered. In such cases, naive parallel formulations that ignore this ordering information will perform poorly since they are likely to expand a much larger subspace than those that explore nodes in the right order. Parallel DFS formulations for such spaces associate priorities with nodes. Nodes with largest depth and highest likelihood of yielding a solution are assigned the highest priority. Parallel ordered DFS then expands these nodes in a prioritized fashion. Parallel Best-First Tree Search Best-first tree search algorithms rely on an open list (i. e. a list of unexplored configurations sorted on their qual-

ity) to sort available states on the basis of their heuristic solution estimate. If this heuristic solution estimate is guaranteed to be an underestimate (as is the case in the A algorithm), it can be shown that the solution found by BFS is the optimal solution. The presence of a globally ordered open list makes it more difficult to parallelize BFS. In fact, at the first look, BFS may appear inherently serial since a node with higher estimated solution cost must be explored only after all nodes with lower costs have been explored. However, it is possible that there may be multiple nodes with the best heuristic cost. If the number of such nodes is less than the number of available processors, then some of the nodes with poorer costs may also be explored. Since it is possible that these nodes are never explored by the serial algorithm, this may result in excess work by the parallel formulation resulting in deceleration anomalies. These issues of speedup anomalies resulting from excess (or lesser) work done by the parallel formulations of state space search are discussed later. A simple parallel formulation of BFS uses a global open list. Each processor locks the list, extracts the best available node and unlocks the list. The node is expanded and heuristic estimates are determined for each successor. The open list is locked again and all successors are inserted into the open list. Note that since the state space is a tree, no replication checking is required. The open list is typically maintained in the form of a global heap. The use of a global heap is a source of contention. If the time taken to lock, remove, and unlock the top element of the heap is t access and time for expansion is t exp , then the speedup of the formulation is bounded by (t access + t exp )/t access . A number of techniques have been developed to effectively reduce the access time [17]. These techniques support concurrent access to heaps stored in shared memory while maintaining strict insertion and deletion ordering. While these increase the upper bound on possible speedup, the performance of these schemes is still bounded. The contention associated with the global data structure can be alleviated by distributing the open list across processors. Now, instead of p processors sharing a single list, they operate on k distinct open lists. In the limiting case, each processor has its own open list. A simple parallel formulation based on this framework starts off with the initial state in one heap. As additional states are generated, they are shipped off to

Load Balancing for Parallel Optimization Techniques

the other heaps. As nodes become available in other heaps, processors start exploring associated state space using local BFS. While it is easy to keep all processors busy using this framework, it is possible that some of the processors may expand nodes with poor heuristic estimates that are never expanded by the serial formulation. To avoid this, we must ensure that all open lists have a share of the best globally available nodes. This is also referred to as quality equalization (the process of ensuring that all processors are working on regions of state-space of high quality). Since the quality of nodes evolves with time, quality equalization must be performed periodically. Several triggering mechanisms have been developed to integrate quality equalization with load balancing [1,19]. A simple triggering mechanism tracks the best node in the system. The best node in the local heap is compared to the best node in the system and if it is considerably worse, an equalization process is initiated. Alternately, an equalization process may be initiated periodically. The movement of nodes between various heaps may itself be fashioned in a well defined topology. Lists may be organized into rings, shared blackboards, or hierarchical structures. These have been explored for several applications and architectures. Speedups in excess of 950 have been demonstrated on 1024 processor hypercubes in the context of TSPs formulated as best-first tree search problems [2]. Searching State Space Graphs Searching state space graphs presents additional challenges since we must check for replicated states during search. The simplest strategy for dealing with graphs is to unroll them into trees. The overhead of unrolling a graph into a tree may range from a constant to an exponential factor. If the overhead is a small constant factor, the resulting tree may be searched using parallel DFS or BFS based techniques. However, for most graph search problems, this is not a feasible solution. Graph search problems rely on a closed list (i. e. a list of all configurations that have been previously encountered) that keeps track of all nodes that have already been explored. Closed lists are typically maintained as hash tables for searching. In a shared memory context, insertion of nodes into the closed list requires locking of the list. If there is a single lock associated with the entire list, the list must be locked approximately as many

L

times as the total number of nodes expanded. This represents a serial bottleneck. The bottleneck can be alleviated by associating multiple locks with the closed list. Processors lock only relevant parts of the closed list into which the node is being inserted. Distributed memory versions of this parallel algorithm physically distribute the closed list across the processors. As nodes are generated, they are hashed to the appropriate processor that holds the respective part of the hash table. Search is performed locally at this processor and the node is explored further at this processor if required. This has two effects: if the hash function associated with the closed list is truly randomized, this has the effect of load balancing using randomized allocation. Furthermore, since nodes are randomly allocated to processors, there is a probabilistic quality equalization for heuristic search techniques. These schemes have been studied by many researchers [14,15]. Assuming a perfectly random hash function, it has been shown that if the number of nodes originating at each processor grows as O(log p), then each processor will have asymptotically equal number of nodes after the hash operation [15]. Since each node is associated with a communication, this puts constraints on the architecture bandwidth. Specifically, the bisection width of the underlying architecture must increase linearly with the number of processors for this formulation to be scalable. A major drawback of graph search techniques such as BFS is that its memory requirement grows linearly with the search space. For large problems, this memory requirement becomes prohibitive. Many limitedmemory variants of heuristic search have been developed. These techniques rely on retraction or delayed expansion of less promising nodes to reduce memory requirement. In the parallel processing context, retractions lead to additional communication and indexing for parent-child relationships [4]. Anomalies in Parallel Search As we have seen above, it is possible for parallel formulations to do more or less work than the serial search algorithm. The ratio of nodes searched by the parallel and serial algorithms is called the search overhead factor (i. e. the ratio of excess work done by a parallel search formulation with respect to its serial formula-

1909

1910

L

Load Balancing for Parallel Optimization Techniques

tion). A search overhead factor of greater than one indicates a deceleration anomaly and less than one indicates an acceleration anomaly. An acceleration anomaly manifests itself in a speedup greater than p on p processors. It can be argued however that in these cases, the base sequential algorithm is suboptimal and a timemultiplexed serialization of the parallel algorithm is in fact a superior serial algorithm. In DFS and related techniques, parallel formulations might detect solutions available close to the root on alternate branches, whereas serial formulations might search large parts of the tree to the left before reaching this node. Conversely, parallel formulations might also expand a larger number of nodes than the serial version. There are situations, in which parallel DFS can have a search overhead factor of less than 1 on the average, implying that the serial search algorithm in the situation is suboptimal. V. Kumar and V.N. Rao [18] show that if no heuristic information is available to order the successors of a node, then on the average, the speedup obtained by parallel DFS is superlinear if the distribution of solutions is nonuniform. In BFS, the strength of the heuristic determines the search overhead factor. When strong heuristics are available, it is likely that expanding nodes with lower heuristic values will result in wasted effort. In general, it can be shown that for any given instance of BFS, there exists a number k such that expanding more than k nodes in parallel from a global open list leads to wasted computation [12]. This situation gets worse with distributed open lists since expanded nodes have locally minimum heuristics that are not the best nodes across all open lists. In contrast, the search overhead factor can be less than one if there are multiple nodes with identical heuristic estimates and one of the processors picks the right one. Applications of Parallel Search Techniques Parallel search techniques have been applied to a variety of problems such as integer and mixed integer programming, and quadratic assignment for applications ranging from path planning and resource location to VLSI packaging. Quadratic assignment problems from the Nugent–Eschermann test suites with up to 4.8 × 1010 nodes have been solved on parallel machines in days. Traveling salesman problems with thousands of cities and mixed integer programming problems with

thousands of integer variable are within the reach of large scale parallel machines. While the use of parallelism increases the range of solvable problems, designing effective heuristic functions is critical. This has the effect of reducing effective branching factor and thus inter-node concurrency. However, the computation of the heuristic can itself be performed in parallel. The use of intra-node parallelism in addition to inter-node parallelism has also been explored. While significant amounts of progress has been made in effective use of parallelism in discrete optimization, with the development of new heuristic functions, opportunities for significant contributions abound. See also  Asynchronous Distributed Optimization Algorithms  Automatic Differentiation: Parallel Computation  Heuristic Search  Interval Analysis: Parallel Methods for Global Optimization  Parallel Computing: Complexity Classes  Parallel Computing: Models  Parallel Heuristic Search  Stochastic Network Problems: Massively Parallel Solution References 1. Cun BL, Roucairol C (1995) BOB: A unified platform for implementing branch-and-bound like algorithms. Techn. Report Univ. Versailles Saint Quentin 16 2. Dutt S, Mahapatra NR (1994) Scalable load-balancing strategies for parallel A algorithms. J Parallel Distributed Comput 22(3):488–505, Special Issue on Scalability of Parallel Algorithms and Architectures (Sept. 1994) 3. Eckstein J (1997) Distributed versus centralized storage and control for parallel branch and bound: Mixed integer programming on the CM-5. Comput Optim Appl 7(2):199– 220 4. Evett M, Hendler J, Mahanti A, Nau D (1990) PRA : A memory-limited heuristic search procedure for the connection machine. Proc. Third Symp. Frontiers of Massively Parallel Computation, pp 145–149 5. Finkel RA, Manber U (Apr. 1987) DIB - A distributed implementation of backtracking. ACM Trans Program Languages and Systems 9(2):235–256 6. Furuichi M, Taki K, Ichiyoshi N (1990) A multi-level load balancing scheme for OR-parallel exhaustive search programs on the multi-PSI. Proc. Second ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp 50–59

Local Attractors for Gradient-related Descent Iterations

7. Janakiram VK, Agrawal DP, Mehrotra R (1988) A randomized parallel backtracking algorithm. IEEE Trans Comput C37(12):1665–1676 8. Karp R, Zhang Y (1993) Randomized parallel algorithms for backtrack search and branch-and-bound computation. J ACM 40:765–789 9. Karypis G, Kumar V (oct. 1994) Unstructured tree search on SIMD parallel computers. IEEE Trans Parallel and Distributed Systems 5(10):1057–1072 10. Kumar V, Grama A, Gupta A, Karypis G (1994) Introduction to parallel computing: Algorithm design and analysis. Benjamin Cummings and Addison-Wesley, Redwod City, CA/Reading, MA 11. Kumar V, Grama A, Rao VN (July 1994) Scalable load balancing techniques for parallel computers. J Parallel Distributed Comput 22(1):60–79 12. Lai TH, Sahni S (1984) Anomalies in parallel branch and bound algorithms. Comm ACM 27(6):594–602 13. Lee EK, Mitchell JE (1997) Computational experience of an interior-point algorithm in a parallel branch-and-cut framework. Proc. SIAM Conf. Parallel Processing for Sci. Computing,). 14. Mahapatra NR, Dutt S (July 1997) Scalable global and local hashing strategies for duplicate pruning in parallel A graph search. IEEE Trans Parallel and Distributed Systems 8(7):738–756 15. Manzini G, Somalvico M (1990) Probabilistic performance analysis of heuristic search using parallel hash tables. Proc. Internat. Symp. Artificial Intelligence and Math., 16. Ranade AG (1991) Optimal speedup for backtrack search on a butterfly network. Proc. Third ACM Symp. Parallel Algorithms and Architectures, 17. Rao VN, Kumar V (1988) Concurrent access of priority queues. IEEE Trans Comput C-37(12):1657–1665 18. Rao VN, Kumar V (Apr 1993) On the efficicency of parallel backtracking. IEEE Trans Parallel and Distributed Systems 4(4):427–437. Also available as: Techn. Report 90–55, Dept. Computer Sci. Univ. Minnesota 19. Tschvke S, L-ling R, Monien B (1995) Solving the traveling salesman problem with a distributed branch-and-bound algorithm on a 1024 processor network. Proc. 9th Internat. Parallel Processing Symp. (April 1995), 182–189

Article Outline Keywords Differentials and Gradients Gradient-Related Descent Methods Descent Method Prototypes The Armijo Steplength Rule Fixed Points Local Attractors: Necessary Conditions Local Attractors: Sufficient Conditions Nonsingular Attractors Singular Attractors and Local Convexity Local Convexity and Convergence Rates Concluding Remarks See also References Keywords Unconstrained minimization; Gradient-related descent; Newtonian descent; Singular local attractors; Asymptotic convergence rates In the classic unconstrained minimization problem, a continuously differentiable real-valued function f is given on a normed vector space X and the goal is to find points in X where the infimum of f is achieved or closely approximated. Descent methods for this problem start with some nonoptimal point x0 , search for a neighboring point x1 where f (x1 ) < f (x0 ), and so on ad infinitum. At each stage, the search is typically guided by a local model based on derivatives of f . If f is convex and every local minimizer is therefore automatically a global minimizer, then well-designed descent methods can indeed generate minimizing sequences, i. e., sequences {xk } for which lim f (x k ) D inf f (x):

k!1

Local Attractors for Gradient-related Descent Iterations JOSEPH C. DUNN Math. Department, North Carolina State University, Raleigh, USA

MSC2000: 49M29, 65K10, 90C06

L

x2X

(1)

On the other hand, nonconvex cost functions can have multiple local minimizers and any of these may attract the iterates of the standard descent schemes. This behavior is examined here for a large class of gradient-related descent methods, and for local minimizers that need not satisfy the usual nonsingularity hypotheses. In addition, the analytical formulation adopted yields nontrivial local convergence theorems in infinite-dimensional normed vector spaces X. Such theorems are not without computational significance since

1911

1912

L

Local Attractors for Gradient-related Descent Iterations

they often help to explain emerging trends in algorithm behavior for increasingly refined finite-dimensional approximations to underlying infinite-dimensional optimization problems. Differentials and Gradients In a general normed vector space X, the first (Fréchet) differential of f at a point x is a linear function f 0 (x): X ! R1 that satisfies the following conditions: ˇ ˇ ˇ 0 ˇ def ˇ f (x) ˇ D sup ˇ f 0 (x)u ˇ < 1;

(2)

and



r f (x) D

 @f @f (x); : : : ; (x) : @x1 @x n

When rf () is continuous, conditions (2)–(4) can be proved for the linear function in (5) with a straightforward application of the chain rule, Cauchy’s inequality and the one-dimensional mean value theorem. In addition, it can be shown that d = r f (x) is the unique solution of the equations,

ˇ ˇ

(6) kdk D ˇ f 0 (x) ˇ

kukD1

and j f (x C d)  f (x)  f 0 (x)dj lim D 0: kdk!0 kdk

(3)

Since f 0 (x)d is linear in d, condition (2) holds if and only if f 0 (x)d is continuous in d. Condition (2) is automatically satisfied in any finite-dimensional space X. The remaining condition (3) asserts that f (x)+ f 0 (x)d asymptotically approximates f (x+d) with an o(kdk) error as d approaches zero. At most one linear function can satisfy these conditions in some norm on X. If conditions (2) and (3) do hold in the norm kk, then f is said to be (Fréchet) differentiable at x (relative to the norm kk). If f is differentiable near x 2 X and if

ˇ ˇ

(4) lim ˇ f 0 (y)  f 0 (x) ˇ D 0; k yx k!0 then f is continuously differentiable at x. Note that in finite-dimensional spaces, all norms are equivalent and conditions (2)–(4) hold in any norm if they hold in some norm. However, two norms on the same infinitedimensional space need not be equivalent, and continuity and differentiability are therefore norm-dependent properties at this level of generality. In the Euclidean space X = Rn , f is continuously differentiable if and only if the partial derivatives of f are continuous; moreover, when f has continuous partial derivatives, f 0 (x) is specified by the familiar formula, f 0 (x)d D hr f (x); di ;

(5)

where h, i is the standard Euclidean inner product and rf (x) is the corresponding gradient of f at x, i. e., hx; yi D

n X iD1

ˇ

ˇ f 0 (x)d D ˇ f 0 (x) ˇ kdk ;

where kk and |kk| are induced by the Euclidean inner product on Rn . The circumstances in the Euclidean space Rn suggest a natural extension of the gradient concept in general normed vector spaces X. Let f be differentiable at x 2 X. Then any vector d 2 X that satisfies conditions (6)–(7) will be called a gradient vector for f at x. Note that the symbols kk and |kk| in (6)–(7) now signify the norm provided on X and the corresponding operator norm in (2). Depending on the space X, its norm kk and the point x, conditions (6)–(7) may have no solutions for d, or a unique solution, or infinitely many solutions. In any finite-dimensional space X, linear functions are continuous, the unit sphere {u 2 X: kuk = 1} is compact, the supremum in (2) is therefore attained at some unit vector u, and the existence of solutions d for (6)– (7) is consequently guaranteed. On the other hand, f may have infinitely many distinct gradients at a point x if the norm on X is not strictly convex. For example, if X = Rn and kxk = max1  i  n |xi |, then f 0 (x) is prescribed by (5), and ˇ n ˇ ˇ 0 ˇ X ˇ @f ˇ ˇ f (x) ˇ D ˇ ˇ ˇ @x (x)ˇ : i iD1 Moreover, d is a gradient vector for f at x if and only if

ˇ ˇ

d D ˇ f 0 (x) ˇ u and

xi yi

(7)



 @f u i 2 sgn (x) ; @x i

Local Attractors for Gradient-related Descent Iterations

where sgn(t) = {1} or [1, 1] or {1} for t < 0, t = 0 and t > 0, respectively. The existence of gradients can also be proved in reflexive infinite-dimensional spaces X where bounded linear functions are weakly continuous and closed unit balls are weakly compact. In nonreflexive spaces, conditions (6)–(7) may not have solutions d; however, in any normed vector space and for any fixed arbitrarily small  2 (0, 1), the relaxed conditions,

ˇ ˇ

(8) kdk D ˇ f 0 (x) ˇ and ˇ

ˇ f 0 (x)d  (1  ) ˇ f 0 (x) ˇ kdk ;

(9)

always have solutions d. This follows easily from (2) and the meaning of sup. The solutions of (8)–(9) will be called -approximate gradients of f at x. They occupy a central position in the present formulation of the subject algorithms. Gradient-Related Descent Methods If f 0 (x) = 0, then x is called a stationary point of f . If f 0 (x) 6D 0, then x is not stationary and the set {d 2 X: f 0 (x)d < 0} is a nonempty open half-space. An element d in this half-space is called a descent vector since condition (3) immediately implies that f (x + td) < f (x) when t is positive and sufficiently small. If d is a -approximate gradient at a nonstationary point x, then according to (8)–(9), ˇ ˇ2 f 0 (x)(d)  (1  ) ˇ f 0 ˇ < 0: Hence d is a descent vector. In particular, if d is a gradient at a nonstationary point x, then d is a steepest descent vector in the sense that 0

0

f (x)(d)  f (x)v;

(10)

for all v 2 X such that kvk = kdk. Suppose that , 1 , and 2 are fixed positive numbers, with  2 (0, 1) and 2  1 > 0. At each x 2 X, let G (x) denote the nonempty set of -approximate gradients for f at x and let G(x) be a nonempty subset of the set of all multiples d with  2 [1 , 2 ] and d 2 G (x), i. e., [ ; ¤ G(x)  G  (x): (11)

2[ 1 ; 2 ]

L

The corresponding set-valued mapping G() is referred to here as a gradient-related set function with parameters , 1 and 2 . In the present development, a gradient-related iterative descent method consists of a gradient-related set function G(), and a rule that selects a vector dk 2 G(xk ) at each iterate xk , and another rule that determines the steplength parameter sk 2 (0, 1] in the recursion, x kC1 D x k  s k d k ;

(12)

once dk has been chosen. The sequences {xk } generated by this recursion are called gradient-related successive approximations. (For related formulations, see [3,4,10].) The convergence theorems described later in this article depend only on basic properties of gradientrelated set functions and the steplengths sk , hence the precise nature of the rule for selecting dk in G(xk ) is not important here. This rule may refer to prior iterates {xi }i  k , or may even be random in nature. There are also many alternative steplength rules that achieve sufficient reductions in f at each iteration in (12) and move the successive approximations xk toward regions in the domain of f that are interesting in at least a local sense [3,4,10]. Descent Method Prototypes When gradients of f exist and f attains its infima on lines in X, the steepest descent and exact line minimization rules for d and s yield the prototype steepest descent method, x kC1 D x k  s k d k ;

(13)

where s k 2 arg min f (x k  td k ) t>0

(14)

and dk is any solution of (6)–(7) for x = xk . Note that the actual reduction in f achievable on a steepest descent half-line {y 2 X: 9t > 0, y = x  td} may be smaller than that attainable on other half-lines, since (10) merely refers to norm-dependent local directional rates of change for f at x. Thus the name of this method is somewhat misleading. Newtonian descent algorithms also amount to special gradient-related descent methods near a certain type of nonsingular local minimizer x . These schemes

1913

1914

L

Local Attractors for Gradient-related Descent Iterations

employ variants of the restricted line minimization steplength rule, s k 2 arg min f (x k  td k ); t2(0;1]

(15)

and replace the gradients dk in a steepest descent iteration by descent vectors that approximate the Newton increment, d N (x k ) D f 00 (x k )1 f 0 (x k ):

(16)

Gradient-related descent vector approximations to dN (xk ) are generated in some neighborhood of x by various quasi-Newton auxiliary recursions, provided that the following (interdependent) nonsingularity conditions hold: i) f is twice continuously (Fréchet) differentiable at x ; ii) f 00 (x ) satisfies the coercivity condition ( f 00 (x  )v)v  c kvk2 for some c > 0 and all v 2 X; iii) A bounded inverse map f 00 (x)1 exists for all x sufficiently near x ; iv) f 00 ()1 is continuous at x . Near a nonsingular local minimizer, the local convergence rates for Newtonian descent methods are generally much faster than the steepest descent convergence rate [8,10]. On the other hand, near singular local minimizers the Newton increments dN (xk ) and their quasi-Newton approximations are typically not confined to the image sets G(xk ) of some gradient-related set function G(), and may actually be undefined on continuous manifolds in X containing x . Under these circumstances, the unmodified Newtonian scaling principles can degrade or even destroy local convergence. In any case, the convergence properties of Newtonian descent methods near singular local minimizers x are not well-understood, and are likely to depend on the higher order structure of the singularity at x . The Armijo Steplength Rule The line minimization steplength rules in (14) and (15) can be very effective in special circumstances; however, they are more often difficult or impossible to implement, and are not intrinsically ‘optimal’ in any general sense when coupled with standard descent direction rules based on local models of f . By their very

nature, such schemes do not anticipate the effect of current search direction and steplength decisions on the reductions achievable in f in later stages of the calculation. Therefore, over many iterations, the exact line minimization rule may well produce smaller total reductions in f than other much simpler steplength rules that merely aim for local reductions in f that are ‘large enough’ compared with |kf 0 (x)k| at each iteration. A. Goldstein and L. Armijo proposed the first practical steplength rules of this kind in [1,8,9] for steepest descent and Newtonian descent methods in Rn . These rules and other related schemes described in [10] and [4] are easily adapted to general gradient-related iterations. The present development focusses on the local convergence properties of the simple Armijo rule described below; however, with minor modifications, the theorems set forth here extend readily to the Goldstein rule and other similar line search formulations. Let G() be a gradient-related set function with parameters , 1 and 2 . Fix ˇ in (0, 1) and ı in (0, 1), and for each x in X and d in G(x) construct s(x, d) 2 (0, 1] with the Armijo steplength rule, s(x; d) D max t

(17)

subject to t 2 f1; ˇ; ˇ 2 ; : : :g and f (x)  f (x  td)  ıt f 0 (x)d: When x is not stationary and d is any descent vector, the rule (17) admits precisely one associated steplength s(x, d) 2 (0, 1]. This is true because ˇ k converges to zero as k ! 1 and f (x)  f (x  td) D ı f 0 (x)td C (1  ı) f 0 (x)td C o(t)  ı f 0 (x)td for t positive and sufficiently small, in view of (3). When x is stationary, (17) yields s(x, d) = 1 trivially for every vector d. Fixed Points Descent methods based on gradient-related set functions and Armijo’s rule generate sequences {xk } that sat-

Local Attractors for Gradient-related Descent Iterations

isfy x kC1 2 T(x k );

k D 0; 1; : : : ;

(18)

where

L

with center x and radius  > 0 such that every sequence {xk } which satisfies (18) and enters the ball B(x , ) must converge to x , i. e.,





(21) 9l; x l 2 B(x  ; ) ) lim x k  x  D 0: k!1

def

T(x) D fy : 9d 2 G(x); y D x  s(x; d)dg :

(19)

The convergence theory outlined in the following sections addresses the behavior of all such Armijo gradient-related sequences near fixed points of the setvalued map T(): X ! 2X . The roots of this theory lie in Bertsekas’ convergence proof for steepest descent iterates near nonsingular local minimizers in Rn [2], and subsequent modifications of this proof strategy for gradient projection methods and singular local minimizers in finite-dimensional or infinite-dimensional vector spaces with inner products [6,7]. For related nonlocal theories, see [10] and [4]. By definition, x is a fixed point of T() if and only if x  2 T(x  ): Since Armijo’s rule produces nonzero steplengths s(x, d), it follows that x is a fixed point of T() if and only if x is a stationary point of f . More precisely, Proposition 1 Let T() be an Armijo gradient-related iteration map in (19). Then for all x 2 X, x 2 T(x) , T(x) D fxg , G(x) D f0g , f 0 (x) D 0:

(20)

According to Proposition 1, any Armijo gradientrelated sequence {xk } that intercepts a fixed point x of T() must terminate in x . Conversely, if {xk }terminates in a vector x , then x is a fixed point of T(), and hence a stationary point of f . On the other hand, Armijo gradient-related sequences that merely pass near some stationary point x may or may not converge to x . Local Attractors: Necessary Conditions A vector x is said to be a local attractor for an Armijo gradient-related iteration (18) if and only if there is a nonempty open ball, B(x  ; ) D fx 2 X : kx  x  k < g

With Proposition 1 and another rudimentary result for gradient-related set functions and Armijo steplengths, it is readily shown that a local attractor must be a strict local minimizer of f and an isolated stationary point of f. Proposition 2 Let  2 (0, 1), 1 > 0, and ı 2 (0, 1) be fixed parameter values in the gradient-related set function G() and Armijo rule (17), and put c1 = ı (1  )1 > 0. Then for all x 2 X and d 2 G(x), ˇ

ˇ2 f (x)  f (x  s(x; d)d)  c1 s(x; d) ˇ f 0 (x) ˇ : (22) Corollary 3 Let T() be the Armijo gradient-related iteration map in (19). If {xk } is generated by the corresponding gradient-related iteration (18), then for all k = 0, 1, . . . , f (x kC1 )  f (x k )

(23)

and f 0 (x k ) ¤ 0 ) f (x kC1 ) < f (x k ):

(24)

Since f is continuous, the claimed necessary conditions for local attractors are now immediate consequences of Proposition 1 and Corollary 3. Theorem 4 A vector x is a local attractor for an Armijo gradient-related iteration (18) only if x is an isolated stationary point and a strict local minimizer of f , i. e., only if there is a nonempty open ball B(x ,  ) that excludes every other stationary point x 6D x , and also excludes points x 6D x at which f (x)  f (x ). The conclusion in Theorem 4 actually applies more generally to set-valued iteration maps T() prescribed by any steplength rule that guarantees the fixed-point characterization (20) and the descent property (23)– (24). On the other hand, related converse assertions are tied more closely to special properties of the Armijo rule and its variants, and to certain local uniform growth conditions on f and |kf 0 ()k|. If X is a finitedimensional space, and x is a strict local minimizer

1915

1916

L

Local Attractors for Gradient-related Descent Iterations

and an isolated stationary point, then the requisite uniform growth conditions automatically hold near x and the full converse of Theorem 4 can be proved. If X is an infinite-dimensional space, the growth conditions become hypotheses in a weaker but still nontrivial partial converse of Theorem 4. This is explained in greater detail below. Local Attractors: Sufficient Conditions

the quantity f (x)  f (x ) is strictly positive when x 6D x . In finite-dimensional spaces, it is possible to say more. If dim X < 1, then for each t 2 (0,  ] the corresponding closed annulus, (25)

is compact. Since the function f ()  f (x ) is continuous and positive in A(t,  ), it must attain a positive minimum value in this set, i. e., def

min

x2A(t; )

f (x)  f (x  ) > 0;

(26)

for each t 2 (0,  ]. Put ˛(0) = 0 and note that for all t 1 , t 2 , 0 < t 1 < t 2   ) A(t 1 ,  )  A(t 2 ,  ) ) ˛(t 1 )  ˛(t 2 ). This establishes the following uniform growth property for strict local minimizers in finitedimensional spaces. Lemma 5 Let X be a finite-dimensional normed vector space. If x is a strict local minimizer for f , then there is a positive number  and a positive definite nondecreasing real-valued function ˛() on [0,  ] such that, f (x)  f (x  )  ˛(kx  x  k);

By Proposition 2, the simple descent property, f (x  s(x; d)d)  f (x)

B(x  ;  ) D fx 2 X : kx  x  k   g ;

˛(t) D

for all x 2 B(x ,   ). Now construct the corresponding set, I( ) D fx 2 B(x  ; ) : f (x)  f (x  ) < ˛( )g : (29)

If x is a strict local minimizer of f , then for some  > 0 and all x in the closed ball,

A(t;  ) D fx : t  kx  x  k   g ;

vector x must be a stationary point. Fix 2 (0,  ] and note that since f 0 () is continuous and f 0 (x ) = 0, there is a   2 (0, ] for which

ˇ ˇ

(28) kx  x  k C 2 ˇ f 0 (x) ˇ <

(27)

for all x 2 B(x  ;  ). In infinite-dimensional spaces, the uniform growth condition (27) need not hold at every strict local minimizer; however, when this condition is satisfied, the minimizer x has a crucial stability property for gradient-related descent methods. More specifically, suppose that (27) holds and T() is an Armijo iteration map (19) with associated parameter 2 > 0. Since descent directions can not exist at a local minimizer, the

(30)

holds for all x and all d 2 G(x), hence the restriction (28) and the properties of ˛() insure that I( ) is an invariant set for T(), i. e., T(x)  I( ) for all x 2 I( ). Moreover, since f is continuous, the minimizer x is clearly an interior point of the set I( ), and this proves the following stability lemma for Armijo gradient-related iterations (or indeed, any gradient-related method with the descent property (30)). Lemma 6 Suppose that the uniform growth condition (27) holds near a local minimizer x for f . Let T() be an Armijo gradient-related iteration map in (19). Then for every > 0 there is a corresponding  2 (0, ] such that for all sequences {xk } satisfying (18), and all indices l, x l 2 B(x  ; ) ) 8k  l x k 2 B(x  ; ):

(31)

According to Lemma 6, the uniform growth condition (27) guarantees that an Armijo gradient-related sequence {xk } will remain in any specified arbitrarily small open ball B(x , ) provided {xk } enters a sufficiently small sub-ball of B(x , ). This property alone does not imply that {xk } converges to x ; however, it is an essential ingredient in the local convergence proof outlined below. This proof requires two additional technical estimates for the Armijo rule and gradientrelated set functions, a local uniform growth condition for |kf 0 ()k| analogous to (27), and a local uniform continuity hypothesis on f 0 (). The first pair of estimates are straightforward consequences of the Armijo rule and the one-dimensional mean value theorem. The last two requirements are automatically satisfied in finitedimensional spaces, once again because closed bounded sets are compact in these spaces.

Local Attractors for Gradient-related Descent Iterations

Proposition 7 Let  2 (0, 1), 2 > 0, ˇ 2 (0, 1), and ı 2 (0, 1) be fixed parameter values in the gradientrelated set function G() and Armijo rule (17), and put c2 = ı(1)1 2 > 0. Then for all x 2 X and d 2 G(x), 2

2

f (x)  f (x  s(x; d)d)  c2 s(x; d) kdk :

(32)

Moreover, if s(x, d) < 1 and c3 = (1  ı)(1  ), then there is a vector  in the line segment joining x to x  ˇ 1 s(x, d)d such that

ˇ

ˇ ˇ 0 ˇ

ˇ f ()  f 0 (x) ˇ  c3 ˇ f 0 (x) ˇ (33)

and therefore





lim x k  x  D 0: k!1

(39)

To see that (38) must hold, construct the index sets, = {k: s(xk , dk ) = 1} and  = {k: s(xk , dk ) < 1}. If is an infinite set, then, ˇ

ˇ ˇ

ˇ lim ˇ f 0 (x k ) ˇ D 0; k2 k!1

by (36). On the other hand, if  is an infinite set, then ˇ

ˇ ˇ

ˇ lim ˇ f 0 (x k ) ˇ D 0; k2 k!1

and k  xk  ˇ 1 s(x; d) kdk :

(34)

Lemma 8 Let X be a finite-dimensional normed vector space. If x is an isolated stationary point for f , then there is a positive number  and a positive definite nondecreasing real-valued function ˇ() on [0,  ] such that, ˇ 0 ˇ ˇ f (x) ˇ  ˇ(kx  x  k); 

(35)

The proof of Lemma 8 is similar to the proof of Lemma 5. Now suppose that the growth conditions (27) and (35) both hold in the ball B(x  ;  ), and that f 0 () is uniformly continuous in this ball. By Lemma 6, there is a positive number  2 (0,  /2] such that every sequence {xk } which satisfies (18) and enters the ball B(x , ), thereafter remains in the larger ball B(x ,  /2). But if {xk } is eventually confined to the ball B(x ,  /2), then the mean value theorem insures that the nonincreasing real sequence {f (xk )} is bounded below and therefore converges to some finite limit. In this case, the differences f (xk )  f (xk+1 ) converge to zero and Propositions 2 and 7 therefore yield, ˇ

ˇ2 ˇ

ˇ (36) lim s(x k ; d k ) ˇ f 0 (x k ) ˇ D 0; k!1





lim s(x ; d ) d k D 0; k

k!1

k

Theorem 9 If the uniform growth conditions (27) and (35) hold simultaneously in the closed ball B(x  ;  ) for some  > 0, and if f 0 () is uniformly continuous in B(x  ;  ) then x is a local attractor for Armijo gradient-related iterations (18). Corollary 10 If X is a finite-dimensional normed vector space and x is a strict local minimizer and an isolated stationary point for f , then x is a local attractor for Armijo gradient-related iterations (18). Nonsingular Attractors The nonsingularity conditions i) and ii) and Taylor’s formula imply that in some neighborhood of x , the objective function f is convex and satisfies the local growth condition (27) with ˛(t) D a t 2

(37)

(40)

for some a > 0. But if f is locally convex near x , then f (x)  f (x  )  f 0 (x)(x  x  )

ˇ ˇ

 ˇ f 0 (x) ˇ kx  x  k

(41)

near x , and therefore (27) and (40) imply (35) with ˇ(t) D a t:

where dk 2 G(xk ) and s(xk , dk ) dk = xk+1  xk for all k. It follows easily from the remainder of Proposition 7 and the growth condition (35) that ˇ

ˇ ˇ

ˇ (38) lim ˇ f 0 (x k ) ˇ D 0 k!1

by (37), (33), (34), and the local uniform continuity of f 0 (). This establishes (38) and proves the following local convergence results.



for all x 2 B(x ;  ).

and

L

(42)

These observations and Theorem 9 immediately yield the following extension of the convergence result in [2] for steepest descent processes in Rn . Corollary 11 Every nonsingular local minimizer x is a local attractor for Armijo gradient-related iterations (18).

1917

1918

L

Local Attractors for Gradient-related Descent Iterations

Singular Attractors and Local Convexity The growth condition (27) alone does not imply local convexity of f , or condition (35), or the local attractor property. In fact, (27) can hold even if x is the limit of some infinite sequence of local minimizers for f . This is readily demonstrated by the following simple function F: R1 ! R1 :    p 5 p 2 2  3 ln x : (43) F(x) D x 2  sin 6 This function has a strict absolute minimizer at x = 0, with p p ( 2  1)x 2  F(x)  ( 2 C 1)x 2 for all x 2 R1 . However, F also has infinitely many (nonsingular) local minimizers,   (1  8m) ˙ p x m D ˙ exp 8 3 for m = 1, 2, . . . , and these local minimizers accumulate at 0. Since each x˙ m is a stationary point and not an absolute minimizer, it follows that F is not convex in any neighborhood of the absolute minimizer at x = 0, that (35) cannot hold at x , and that x is not a local attractor for gradient-related descent processes. Evidently, x = 0 is a singular minimizer for F; in fact, F 00 (x) does not exist at x = 0. (Apart from a minor alteration in one of its constants, (43) is taken directly from [6, Example 1.1]. The erroneous constant in [6] was kindly called to the author’s attention by D. Bertsekas.) The growth conditions (27) and (35) together still do not imply convexity of f near x , and indeed f may not be convex in any neighborhood of a singular local attractor. This is shown by another function F: R2 ! R1 from [6, Example 1.2], viz. F(x) D x12  1:98x1 kxk2 C kxk4 ;

(44)

where x = (x1 , x2 ) and kk is the Euclidean norm in R2 . This function has a singular absolute minimizer at x = 0, and F(x) and |kF 0 (x)k| grow like kxk4 and kxk3 , respectively, near 0. On the other hand, since every neighborhood of 0 contains points x where F 0 (x)(x  0) is negative, it follows that F is not convex (or even pseudoconvex) near 0. Nevertheless, x = 0 is a local attractor for Armijo gradient-related iterations, according to Corollary 10.

Although f need not be convex near a singular local attractor x , there are many instances where some sort of local convexity property is observed. (The function f (x) = x4 provides a simple illustration.) If the local pseudoconvexity condition, ( f (x)  f (x  ))  f 0 (x)(x  x  );

(45)

is satisfied for some  > 0 and all x in the ball B(x  ;  ), then ˇ

ˇ ( f (x)  f (x  ))  ˇ f 0 (x) ˇ kx  x  k near x , and condition (35) follows at once from (27), with ˇ(t) D ( )1 ˛(t) for all t 2 [0,  ]. These considerations immediately yield two additional corollaries of Theorem 9. Corollary 12 Suppose that the uniform growth condition (27) holds in the closed ball B(x  ;  ) for some  > 0. In addition, suppose that in B(x  ;  ), f 0 () is uniformly continuous and f satisfies the pseudoconvexity condition (45). Then x is a local attractor for Armijo gradient-related iterations (18). Corollary 13 If X is a finite-dimensional normed vector space, if x is a strict local minimizer for f , and if f satisfies the pseudoconvexity condition (45), then x is a local attractor for Armijo gradient-related iterations (18). Local Convexity and Convergence Rates A local version of the convergence rate proof strategy in [5] also works in the present setting when f 0 () is locally Lipschitz continuous and f satisfies the pseudoconvexity condition (45) and the growth condition (27) near x . Under these circumstances, the worst-case convergence rate estimate, f (x k )  f (x  ) D O(k 1 );

(46)

can be proved for Armijo gradient-related sequences {xk } that pass sufficiently near x . More refined order estimates are possible if the first two hypotheses hold and f (x)  f (x  )  a kx  x  kr

(47)

Location Routing Problem

for some a > 0 and r 2 (1, 1), and all x 2 B(x  ;  ). In such cases, it it can be shown that r

f (x k )  f (x  ) D O(k  (r2) )

(48)

for r 2 (2, 1), and 9 2 [0; 1)

f (x k )  f (x  ) D O( k )

L

nonequivalent norm. Similarly, the local attractor property for a minimizer x , and indeed local optimality itself, are also typically norm-dependent at this level of generality. See also

(49)

for r 2 (1, 2]. (The latter estimate is comparable to the basic geometric convergence rate theorem for steepest descent iterates near nonsingular local minimizers [2].) The proof strategy in [5] can also produce still more precise local convergence rate estimates that relate the constants implicit in the order estimates (48) and (49) to local Lipschitz constants for f 0 () and parameters in the gradient-related set functions G(), the growth condition (47), the pseudoconvexity condition (45), and the Armijo steplength rule (17). In the absence of local convexity assumptions, it is harder to establish analogous asymptotic convergence rate theorems; however, the analysis in [6] and [7] has established O(k2 ) rate estimates for Hilbert space steepest descent iterations and a class of nonlinear functions f that contains the example (44). Concluding Remarks In a finite-dimensional space any two norms are equivalent and it can be seen that the gradient-related property and the local attractor property are therefore norm-invariant qualitative features of set-valued maps G(): X ! 2X and local minimizers x . On the other hand, even in finite-dimensional spaces, the Lipschitz constants, growth rate constants, and gradient-related set function parameters in the present formulation are not norm-invariant, and this is reflected in normdependent convergence rates and norm-dependent size and shape parameters for the domains that are sent to a local attractor x by gradient-related iterations. These facts have potentially important computational manifestations when gradient-related methods are applied to large scale finite-dimensional problems that approximate some limiting problem in an infinite-dimensional space. Note that infinite-dimensional spaces can support multiple nonequivalent norms, and a set-valued function G() that is gradient-related in one norm need not be gradient-related relative to some other

 Conjugate-gradient Methods  Large Scale Trust Region Problems  Nonlinear Least Squares: Newton-type Methods  Nonlinear Least Squares: Trust Region Methods References 1. Armijo L (1966) Minimization of functions having continuous partial derivatives. Pacific J Math 16:1–3 2. Bertsekas DP (1982) Constrained optimization and Lagrange multiplier methods. Acad. Press, New York 3. Bertsekas DP (1995) Nonlinear programming. Athena Sci., Belmont, MA 4. Daniel JW (1971) Approximate minimization of functionals. Prentice-Hall, Englewood Cliffs, NJ 5. Dunn JC (1981) Global and asymptotic convergence rate estimates for a class of projected gradient processes. SIAM J Control Optim 12:659–674 6. Dunn JC (1987) Asymptotic decay rates from the growth properties of Liapunov functions near singular attractors. J Math Anal Appl 125:6–21 7. Dunn JC (1987) On the convergence of projected gradient processes to singular critical points. J Optim Th Appl 55:203–216 8. Goldstein A (1965) On Newton’s method. Numer Math 7:391–393 9. Goldstein A (1965) On steepest descent. SIAM J Control 3:147–151 10. Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables. Acad. Press, New York

Location Routing Problem YANNIS MARINAKIS Department of Production Engineering and Management, Decision Support Systems Laboratory, Technical University of Crete, Chania, Greece MSC2000: 90B06, 90B80 Article Outline Introduction Variants of the Location Routing Problem

1919

1920

L

Location Routing Problem

Exact Algorithms for the Solution of the Location Routing Problem Heuristic Algorithms for the Solution of the Location Routing Problem Metaheuristic Algorithms for the Solution of the Location Routing Problem References Introduction In the last few years, the need for an integrated logistic system has become a primary objective of every company manager. Managers recognize that there is a strong relation between the location of facilities, the allocation of suppliers, vehicles, and customers to the facilities, and the design of routes around the facilities. In a location routing problem (LRP), the optimal number, the capacity, and the location of facilities are determined, and the optimal set of vehicle routes from each facility is also sought. In most location models, it is assumed that the customers are served directly from the facilities being located. Each customer is served on his or her own route. In many cases, however, customers are not served individually from the facilities. Rather, customers are consolidated into routes that may contain many customers. One of the reasons for the added difficulty in solving these problems is that there are far more decisions that need to be made by the model. These decisions include:  How many facilities to locate,  Where the facilities should be,  Which customers to assign to which depots,  Which customers to assign to which routes,  In what order customers should be served on each route. In the LRP, a number of facilities are located among candidate sites and delivery routes are established for a set of users in such a way that the total system cost is minimized. As Perl and Daskin [51] pointed out, LRPs involve three interrelated, fundamental decisions: where to locate facilities, how to allocate customers to facilities, and how to route vehicles to serve customers. The difference between the LRP and the classic vehicle routing problem is that not only routing must be designed but the optimal depot location must be simultaneously determined as well. The main difference between the LRP and the classical location-allocation

problem is that, once the facility is located, the former requires a visitation of customers through tours while the latter assumes that the customer will be visited from the vehicle directly, and then the vehicle will return to the facility without serving any other customer ([47]). In general terms, the combined location routing model solves the joint problem of determining the optimal number, capacity, and location of facilities serving more than one customer and finding the optimal set of vehicle routes. In the LRP, the distribution cost is decreased due to the assignment of the customers to vehicles while the main objective is the design of the appropriate routes of the vehicles. Variants of the Location Routing Problem Laporte et al. [39] considered three variants of LRPs, including (1) capacity-constrained vehicle routing problems, (2) cost-constrained vehicle routing problems, and (3) cost-constrained location routing problems. The authors examined multidepot, asymmetrical problems and developed an optimal solution procedure that enables them to solve problems with up to 80 nodes. Chan et al. [11] solved a multidepot, multivehicle location routing problem with stochastically processed demands, which are defined as demands that are generated upon completing site-specific service on their predecessors. Min et al. [47] synthesized the past research and suggested some future research directions for the LRP. An extended recent literature review is included in the survey paper published by Nagy and Salhi [48]. They proposed a classification scheme and looked at a number of problem variants. The most important exact and heuristic algorithms were presented and analyzed in this survey paper. Exact Algorithms for the Solution of the Location Routing Problem A number of exact algorithms for the problem was presented by Laporte et al. [38]. Applications and formulations and exact and approximation algorithms for LRPs under capacity and maximum cost restrictions are studied in the survey of Laporte [34]. Nonlinear programming exact algorithms for the solution of the LRP have been proposed in [20,61]. Dynamic programming exact algorithms for the solution of the LRP have been proposed in [5]. Integer programming exact al-

Location Routing Problem

gorithms for the solution of the LRP have been proposed in [35,37,46]. Mixed integer goal programming exact algorithms for the solution of the LRP have been proposed in [65]. Two branching strategies have been proposed in [36]. An iterative exact procedure has been proposed in [9]. A branch-and-bound technique on the LP relaxation has been proposed in [17]. Heuristic Algorithms for the Solution of the Location Routing Problem The LRP is very difficult to solve using exact algorithms, especially if the number of customers or the candidate for location facilities is very large due to the fact that this problem belongs to the category of NP-hard problems, i. e. there are no known polynomial-time algorithms that can be used to solve them. Madsen [43] presented a survey of heuristic methods. Christofides and Eilon [16] were the first to consider the problem of locating a depot from which customers are served by tours rather than individual trips. They proposed an approximation algorithm for the solution of the problem. Watson-Gandy and Dohrn [63] proposed an algorithm where the problem is solved by transforming its location part into an ordinary location problem using the Christofides–Eilon approximation algorithm. The routing part of the algorithm is solved using the Clarke and Wright algorithm. Jacobsen and Madsen [31] proposed three algorithms. The first is called a tree-tour heuristic. The second is called ALA– SAV and is a three-phase heuristic, where in the first phase a location–allocation problem is solved and in the second and third phases a Clarke and Wright heuristic is applied for solving the problem. Finally, the third proposed algorithm is called SAV–DROP and is a heuristic algorithm that combines the Clarke–Wright method and the DROP algorithm. A two-phase heuristic is presented in [4], where in the first phase the set of open plants is determined and a priori routes are considered, while in the second phase the routes are optimized. Other two-phase heuristics have been proposed in [7,12,13,30,33,42,49,50,58]. Cluster analysis algorithms are presented in [6,18,60]. Iterative approaches have been proposed by [27,59]. Min ([46]) considered a two-level location–allocation problem of terminals to customer clusters and supply sources using a hierarchical approach consisting of both exact and

L

heuristic procedures. Insertion methods have been proposed in [15]. A partitioning heuristic algorithm is proposed in [35], and a sweep heuristic is proposed in [21]. Metaheuristic Algorithms for the Solution of the Location Routing Problem Several metaheuristic algorithms have been proposed for the solution of the LRP. In what follows, an analytical presentation of these algorithms is given.  Tabu search (TS) was introduced by Glover [22,23] as a general iterative metaheuristic for solving combinatorial optimization problems. Computational experience has shown that TS is a well-established approximation technique that can compete with almost all known techniques and that, by its flexibility, can beat many classic procedures. It is a form of local neighbor search. Each solution S has an associated set of neighbors N(S). A solution S 0 2 N(S) can be reached from S by an operation called a move. TS can be viewed as an iterative technique that explores a set of problem solutions by repeatedly making moves from one solution S to another solution S 0 located in the neighborhood N(S) of S [24]. TS moves from a solution to its best admissible neighbor, even if this causes the objective function to deteriorate. To avoid cycling, solutions that have been recently explored are declared forbidden or tabu for a number of iterations. The tabu status of a solution is overridden when certain criteria (aspiration criteria) are satisfied. Sometimes, intensification and diversification strategies are used to improve the search. In the first case, the search is accentuated in the promising regions of the feasible domain. In the second case, an attempt is made to consider solutions in a broad area of the search space. Tuzun and Burke [62] proposed a two-phase tabu search architecture for the solution of the LRP. TS algorithms for the LRP are also presented in [10,14,41,45,57].  Simulated annealing (SA) [1,3,32] plays a special role within local search for two reasons. First, SA appears to be quite successful when applied to a broad range of practical problems. Second, some threshold accepting algorithms such as SA have a stochastic component, which facilitates a theoretical analysis of their asymptotic convergence. SA [2] algorithms are stochastic algorithms that allow random

1921

1922

L

Location Routing Problem

uphill jumps in a controlled fashion in order to provide possible escapes from poor local optima. Gradually the probability allowing the objective function value to increase is lowered until no more transformations are possible. SA owes its name to an analogy with the annealing process in condensed-matter physics, where a solid is heated to a maximum temperature at which all particles of the solid randomly arrange themselves in the liquid phase, followed by cooling through careful and slow reduction of the temperature until the liquid is frozen with the particles arranged in a highly structured lattice and minimal system energy. This ground state is reachable only if the maximum temperature is sufficiently high and the cooling sufficiently slow. Otherwise a metastable state is reached. The metastable state is also reached with a process known as quenching, in which the temperature is instantaneously lowered. Its predecessor is the so-called Metropolis filter. Wu et al. [64] proposed an algorithm that divides the original problem into two subproblems, i. e., the location–allocation problem and the general vehicle routing problem, respectively. Each subproblem is, then, solved in a sequential and iterative manner by the SA algorithm embedded in the general framework for the problem-solving procedure. SA algorithms for the LRP are presented in [8,40,41].  Greedy randomized adaptive search procedure (GRASP) [56] is an iterative two-phase search method that has gained considerable popularity in combinatorial optimization. Each iteration consists of two phases, a construction phase and a local search procedure. In the construction phase, a randomized greedy function is used to build up an initial solution. This randomized technique provides a feasible solution within each iteration. This solution is then exposed for improvement attempts in the local search phase. The final result is simply the best solution found over all iterations. Prins et al. [52] proposed a GRASP with a pathrelinking phase for the solution of the capacitated location routing problem.  Genetic algorithms (GAs) are search procedures based on the mechanics of natural selection and natural genetics. The first GA was developed by John H. Holland in the 1960s to allow comput-

ers to evolve solutions to difficult search and combinatorial problems such as function optimization and machine learning [28]. Genetic algorithms offer a particularly attractive approach to problems like location routing problems since they are generally quite effective for the rapid global search of large, nonlinear, and poorly understood spaces. Moreover, GAs are very effective in solving large-scale problems. GAs [25] mimic the evolution process in nature. They are based on an imitation of the biological process in which new and better populations among different species are developed during evolution. Thus, unlike most standard heuristics, GAs use information about a population of solutions, called individuals, when they search for better solutions. A GA is a stochastic iterative procedure that maintains the population size constant in each iteration, called a generation. Their basic operation is the mating of two solutions to form a new solution. To form a new population, a binary operator called a crossover and a unary operator called a mutation are applied [54,55]. Crossover takes two individuals, called parents, and produces two new individuals, called offspring, by swapping parts of the parents. Marinakis and Marinaki [44] proposed a bilevel GA for a real-life LRP. A new formulation based on bilevel programming was proposed. Based on the fact that in the LRP decisions are made at a strategic level and at an operational level, we formulate the problem in such a way that in the first level, the decisions of the strategic level are made, namely, the top manager finds the optimal location of the facilities, while in the second level, the operational-level decisions are made, namely, the operational manager finds the optimal routing of vehicles. Other evolutionary approaches for the solution of the LRP have been proposed in [29,53].  Variable neighborhood search (VNS) is a metaheuristic for solving combinatorial optimization problems whose basic idea is systematic change of a neighborhood within a local search [26]. VNS algorithms for the LRP are presented in [45].  The ant colony optimization (ACO) metaheuristic is a relatively new technique for solving combinatorial optimization problems (COPs). Based strongly on the ant system (AS) metaheuristic developed by Dorigo, Maniezzo, and Colorni [19], ACO is derived

Location Routing Problem

from the foraging behavior of real ants in nature. The main idea of ACO is to model the problem as the search for a minimum cost path in a graph. Artificial ants walk through this graph looking for good paths. Each ant has a rather simple behavior so that it will typically only find rather poor-quality paths on its own. Better paths are found as the emergent result of the global cooperation among ants in the colony. An ACO algorithm consists of a number of cycles (iterations) of solution construction. During each iteration a number of ants (which is a parameter) construct complete solutions using heuristic information and the collected experiences of previous groups of ants. These collected experiences are represented by a digital analog of trail pheromone that is deposited on the constituent elements of a solution. Small quantities are deposited during the construction phase while larger amounts are deposited at the end of each iteration in proportion to solution quality. Pheromone can be deposited on the components and/or the connections used in a solution depending on the problem. ACO algorithms for the LRP are presented in [8].

9. 10.

11.

12.

13.

14.

15.

16. 17.

References 1. Aarts E, Korst J (1989) Simulated Annealing and Boltzmann Machines – A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley and Sons, Chichester 2. Aarts E, Korst J, Van Laarhoven P (1997) Simulated Annealing. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. Wiley, Chichester, pp 91–120 3. Aarts E, Ten Eikelder HMM (2002) Simulated Annealing. In: Pardalos PM, Resende MGC (eds) Handbook of Applied Optimization. Oxford University Press, New York, pp 209–221 4. Albareda-Sambola M, Diaz JA, Fernandez E (2005) A Compact Model and Tight Bounds for a Combined LocationRouting Problem. Comput Oper Res 32(3):407–428 5. Averbakh I, Berman O (1994) Routing and LocationRouting p-Delivery Men Problems on a Path. Transp Sci 28(2):162–166 6. Barreto S, Ferreira C, Paixao J, Santos BS (2007) Using Clustering Analysis in a Capacitated Location-Routing Problem. Eur J Oper Res 179(3):968–977 7. Bookbinder JH, Reece KE (1988) Vehicle Routing Considerations in Distribution System Design. Eur J Oper Res, 37:204-213 8. Bouhafs L, Hajjam A, Koukam A (2006) A Combination of Simulated Annealing and Ant Colony System for the Ca-

18.

19. 20.

21. 22. 23. 24. 25.

26.

L


Logconcave Measures, Logconvexity
ANDRÁS PRÉKOPA
RUTCOR, Rutgers Center for Operations Research, Piscataway, USA
MSC2000: 90C15
Article Outline
Keywords
See also
References
Keywords
Logconcave function; Logconcave measure; Logconvex function; Logconvex measure; α-concave function; α-concave measure; Quasiconcave function; Quasiconcave measure

A nonnegative function $f: \mathbb{R}^n \to \mathbb{R}^1$ is called a logconcave (point) function if for every $x, y \in \mathbb{R}^n$ and $0 < \lambda < 1$ we have the inequality
$$f(\lambda x + (1-\lambda)y) \ge \left(f(x)\right)^{\lambda} \left(f(y)\right)^{1-\lambda}.$$
A probability measure $P$ defined on the Borel sets of $\mathbb{R}^n$ is called logconcave if for any Borel sets $A, B \subset \mathbb{R}^n$ and $0 < \lambda < 1$ we have the inequality
$$P(\lambda A + (1-\lambda)B) \ge \left(P(A)\right)^{\lambda} \left(P(B)\right)^{1-\lambda},$$
provided that $\lambda A + (1-\lambda)B$ is also a Borel set. If $P$ is a logconcave measure in $\mathbb{R}^n$ and $A \subset \mathbb{R}^n$ is a convex set, then $P(A + z)$ is a logconcave point function in $\mathbb{R}^n$. In particular, the probability distribution function $F(z) = P(\{x: x \le z\}) = P(\{x: x \le 0\} + z)$ of the probability measure $P$ is a logconcave point function. If $n = 1$, then $1 - F(z)$ is also logconcave.
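As a quick numerical illustration (an addition to this entry, not part of the original text), the defining inequality can be spot-checked in Python for the standard normal density, a classical logconcave function; the sample count and tolerance are arbitrary choices.

import numpy as np

# Spot-check f(l*x + (1-l)*y) >= f(x)^l * f(y)^(1-l) for the standard
# normal density on randomly drawn points x, y and weights l.
def f(x):
    return np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = 3.0 * rng.normal(size=2)       # arbitrary test points
    lam = rng.uniform(0.0, 1.0)
    lhs = f(lam * x + (1.0 - lam) * y)
    rhs = f(x) ** lam * f(y) ** (1.0 - lam)
    assert lhs >= rhs - 1e-12
print("logconcavity inequality verified on 1000 random samples")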


The basic theorem concerning logconcave measures [5,6] states that if the probability measure $P$ is generated by a logconcave probability density function $f$, i.e.,
$$P(C) = \int_C f(x)\, dx$$

for every Borel set $C \subset \mathbb{R}^n$, then $P$ is a logconcave measure. Examples of logconcave probability distributions are the multivariate normal, the uniform (on a convex set), and, for special parameter values, the Wishart, the beta, the univariate, and some multivariate gamma distributions. A closely related theorem [5] states that if $f: \mathbb{R}^{n+m} \to \mathbb{R}^1$ is a logconcave function, then
$$\int_{\mathbb{R}^m} f(x, y)\, dy$$

is a logconcave function in $\mathbb{R}^n$. This implies that the convolution of two logconcave functions is also logconcave [3,5]. Logconcave probability distributions play an important role in probabilistic constrained stochastic programming problems.

The generic preflow-push algorithm
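As an illustration of this generic algorithm, the following Python sketch assembles the loop from the push/relabel description given below; the dict-based representation of arc capacities and the example data are assumptions of the sketch, not the article's own pseudocode.

from collections import defaultdict

def preflow_push(n, cap, s, t):
    """Generic preflow-push sketch. Nodes are 0..n-1; cap maps arc (i, j)
    to its capacity. Residuals r, excesses e, distance labels d follow
    the description in the surrounding text."""
    r = defaultdict(int)
    for (i, j), c in cap.items():
        r[(i, j)] += c
    e = [0] * n
    d = [0] * n
    d[s] = n                          # setting d(s) = n keeps the labels valid
    for j in range(n):                # preprocess: saturate all arcs leaving s
        if r[(s, j)] > 0:
            e[j] += r[(s, j)]
            r[(j, s)] += r[(s, j)]
            r[(s, j)] = 0
    active = [i for i in range(n) if i not in (s, t) and e[i] > 0]
    while active:
        i = active[0]                 # select any active node
        pushed = False
        for j in range(n):
            # arc (i, j) is admissible: positive residual and d(i) = d(j) + 1
            if r[(i, j)] > 0 and d[i] == d[j] + 1:
                delta = min(e[i], r[(i, j)])   # saturating push iff delta == r[(i, j)]
                r[(i, j)] -= delta
                r[(j, i)] += delta
                e[i] -= delta
                e[j] += delta
                pushed = True
                break
        if not pushed:                # relabel: create at least one admissible arc
            d[i] = 1 + min(d[j] for j in range(n) if r[(i, j)] > 0)
        active = [i for i in range(n) if i not in (s, t) and e[i] > 0]
    return e[t]                       # value of the maximum flow

# Illustrative data only (not the exact instance of Figure 4):
print(preflow_push(4, {(0, 1): 2, (0, 2): 4, (1, 2): 3, (1, 3): 1, (2, 3): 5}, 0, 3))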


The algorithm first saturates all arcs emanating from the source node; then each node adjacent to node s has a positive excess, so that the algorithm can begin pushing flow from active nodes. Since the preprocessing operation saturates all the arcs incident to node s, none of these arcs is admissible, and setting d(s) = n will satisfy the validity conditions (8), (9). But then, since d(s) = n and a distance label is a lower bound on the length of the shortest path from that node to node t, the residual network contains no directed path from s to t. The subsequent pushes maintain this property and drive the solution toward feasibility. Consequently, when there are no active nodes, the flow is a maximum flow. A push of δ units from node i to node j decreases both the excess e(i) of node i and the residual r_ij of arc (i, j) by δ units and increases both e(j) and r_ji by δ units. We say that a push of δ units of flow on an arc (i, j) is saturating if δ = r_ij and nonsaturating otherwise. A nonsaturating push at node i reduces e(i) to zero. We refer to the process of increasing the distance label of a node as a relabel operation. The purpose of the relabel operation is to create at least one admissible arc on which the algorithm can perform further pushes. It is instructive to visualize the generic preflow-push algorithm in terms of a physical network: arcs represent flexible water pipes, nodes represent joints, and the distance function measures how far nodes are above the ground. In this network, we wish to send water from the source to the sink. We visualize flow in an admissible arc as water flowing downhill. Initially, we move the source node upward, and water flows to its neighbors. Although we would like water to flow downhill toward the sink, occasionally flow becomes trapped locally at a node that has no downhill neighbors. At this point, we move the node upward, and again water flows downhill toward the sink. Eventually, no more flow can reach the sink. As we continue to move nodes upward, the remaining excess flow eventually flows back toward the source. The algorithm terminates when all the water flows either into the sink or back to the source. To illustrate the generic preflow-push algorithm, we use the example given in Figure 4. Figure 4a) specifies the initial residual network. We first saturate the arcs emanating from the source node, node 1, and set d(1) = n = 4. Figure 4b) shows the residual graph at this stage.


Maximum Flow Problem, Figure 4 Illustrating the preflow-push algorithm: a) the residual network G(x) for x = 0; b) the residual network after saturating arcs emanating from the source; c) the residual network after pushing flow on arc (2, 4); d) the residual network after pushing flow on arc (3, 4)

At this point, the network has two active nodes, nodes 2 and 3. Suppose that the algorithm selects node 2 for the push/relabel operation. Arc (2, 4) is the only admissible arc, and the algorithm performs a saturating push of value δ = min{e(2), r_24} = min{2, 1} = 1. Figure 4c) gives the residual network at this stage. Suppose the algorithm again selects node 2. Since no admissible arc emanates from node 2, the algorithm performs a relabel operation and gives node 2 the new distance label d(2) = min{d(3) + 1, d(1) + 1} = min{2, 5} = 2. The new residual network is the same as the one shown in Figure 4c) except that d(2) = 2 instead of 1. Suppose this time the algorithm selects node 3. Arc (3, 4) is the only admissible arc emanating from node 3, and so the algorithm performs a nonsaturating push of value δ = min{e(3), r_34} = min{4, 5} = 4. Figure 4d) specifies the residual network at the end of this iteration. Using this process for a few

more iterations, the algorithm will determine a maximum flow. The analysis of the computational (worst-case) complexity of the generic preflow-push algorithm is somewhat complicated. Without examining the details, we might summarize the analysis as follows. It is possible to show that the preflow-push algorithm maintains valid distance labels at all steps of the algorithm and increases the distance label of any node at most 2n times. The algorithm performs O(nm) saturating pushes and O(n²m) nonsaturating pushes. The nonsaturating pushes are the limiting computational operation of the algorithm, and so it runs in O(n²m) time. The preflow-push algorithm has several attractive features, particularly its flexibility and its potential for further improvements. Different rules for selecting active nodes for the push/relabel operations create many


different versions of the generic algorithm, each with different worst-case complexity. As we have noted, the bottleneck operation in the generic preflow-push algorithm is the number of nonsaturating pushes, and many specific rules for examining active nodes can produce substantial reductions in the number of nonsaturating pushes. The following specific implementations of the generic preflow-push algorithm are noteworthy: i) the FIFO preflow-push algorithm examines the active nodes in the first-in, first-out (FIFO) order and runs in O(n³) time; ii) the highest-label preflow-push algorithm pushes flow from an active node with the highest value of a distance label and runs in O(n²m^{1/2}) time; and iii) the excess-scaling algorithm uses the scaling of arc capacities to attain a time bound of O(nm + n² log U). These algorithms are due to A.V. Goldberg and R.J. Tarjan [10], J. Cheriyan and S.N. Maheshwari [4], and R.K. Ahuja and J.B. Orlin [3], respectively. These preflow-push algorithms are more general, more powerful, and more flexible than augmenting path algorithms. The best preflow-push algorithms currently outperform the best augmenting path algorithms in theory as well as in practice (see, for example, [1]). Combinatorial Implications of the Max-Flow Min-Cut Theorem The max-flow min-cut theorem has far-reaching consequences. It can be used to prove several important results in combinatorics that appear to be difficult to prove using other means. We will illustrate the use of the max-flow min-cut theorem to prove two such important results. Network Connectivity Given a directed network G = (N, A) and two specified nodes s and t, we are interested in the following two questions: i) what is the maximum number of arc-disjoint (directed) paths from node s to node t; and ii) what is the minimum number of arcs that we should remove from the network so that it contains no directed paths from node s to node t. We will show that these two questions are closely related. The second question shows how robust a net-


work, for example, a telecommunications network, is to the failure of its arcs. In the network G, let us define the capacity of each arc as equal to one. Consider any feasible flow x of value v in the resulting unit capacity network. We can decompose the flow x into flows along v directed paths from node s to node t, each path carrying a unit flow. Now consider any s-t-cut $[S, \bar{S}]$ in the network. The capacity of this cut is $|(S, \bar{S})|$, that is, it equals the number of forward arcs in the cut. Since each path joining nodes s and t contains at least one arc in the set $(S, \bar{S})$, the removal of all the arcs in $(S, \bar{S})$ disconnects all paths from node s to node t. Consequently, the network contains a disconnecting set of arcs of cardinality equal to the capacity of any s-t-cut $[S, \bar{S}]$. The max-flow min-cut theorem immediately implies the following result: Corollary 5 The maximum number of arc-disjoint paths from s to t in a directed network equals the minimum number of arcs whose removal will disconnect all paths from node s to node t. Matchings and Covers The max-flow min-cut theorem also implies a max-min result concerning matchings and node covers in a directed bipartite network G = (N_1 ∪ N_2, A), with arc set A ⊆ N_1 × N_2. In the network G, a subset M ⊆ A is a matching if no two arcs in M have an endpoint in common. A subset C ⊆ N_1 ∪ N_2 is a node cover of G if every arc in A has at least one endpoint in the node set C. Suppose we create the network G′ from G by adding two new nodes s and t, as well as arcs (s, i) of capacity 1 for each i ∈ N_1 and arcs (j, t) of capacity 1 for each j ∈ N_2. All other arcs in G′ correspond to the arcs in G and have infinite capacity. It is possible to show that each matching of cardinality v defines a flow of value v in G′, and each s-t cut of capacity v induces a corresponding node cover with v nodes. Consequently, the max-flow min-cut theorem establishes the following result: Corollary 6 In a bipartite network G = (N_1 ∪ N_2, A), the maximum cardinality of any matching equals the minimum cardinality of any node cover of G. These two examples illustrate important relationships between maximum flows, minimum cuts, and many other problems in the field of combinatorics. The maximum flow problem is of interest because it provides


a unifying tool for viewing many such results, because it arises directly in many applications, and because it has been a rich arena for developing new results concerning the design and analysis of algorithms.
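To make the construction behind Corollary 6 concrete, here is a small self-contained Python sketch (an illustration, with BFS augmenting paths standing in for any max-flow routine): it builds G′ with unit-capacity source and sink arcs and returns the maximum matching cardinality as the value of a maximum flow.

from collections import deque

def max_matching_size(n1, n2, arcs):
    """Maximum bipartite matching via the max-flow construction of
    Corollary 6: s -> N1 arcs of capacity 1, N1 -> N2 arcs of large
    capacity, N2 -> t arcs of capacity 1."""
    s, t = 0, n1 + n2 + 1
    INF = n1 + n2 + 1                  # stands in for the infinite capacity
    r = {}                             # residual capacities
    def add(u, v, c):
        r[(u, v)] = r.get((u, v), 0) + c
        r.setdefault((v, u), 0)
    for i in range(1, n1 + 1):
        add(s, i, 1)
    for j in range(n1 + 1, n1 + n2 + 1):
        add(j, t, 1)
    for (i, j) in arcs:                # i in 1..n1, j in 1..n2
        add(i, n1 + j, INF)
    flow = 0
    while True:
        parent = {s: None}             # BFS for an augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for (a, b), c in r.items():
                if a == u and c > 0 and b not in parent:
                    parent[b] = u
                    q.append(b)
        if t not in parent:
            return flow
        v = t                          # augment one unit along the path
        while parent[v] is not None:
            u = parent[v]
            r[(u, v)] -= 1
            r[(v, u)] += 1
            v = u
        flow += 1

# e.g. the complete bipartite graph K_{2,2} minus one arc: matching size 2
print(max_matching_size(2, 2, [(1, 1), (1, 2), (2, 2)]))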

See also
Auction Algorithms
Communication Network Assignment Problem
Directed Tree Networks
Dynamic Traffic Networks
Equilibrium Networks
Evacuation Networks
Generalized Networks
Minimum Cost Flow Problem
Network Design Problems
Network Location: Covering Problems
Nonconvex Network Flow Problems
Nonoriented Multicommodity Flow Problems
Piecewise Linear Network Flow Problems
Shortest Path Tree Algorithms
Steiner Tree Problems
Stochastic Network Problems: Massively Parallel Solution
Survivable Networks
Traffic Network Equilibrium

References
1. Ahuja RK, Kodialam M, Mishra AK, Orlin JB (1997) Computational investigations of maximum flow algorithms. Eur J Oper Res 97:509–542
2. Ahuja RK, Magnanti TL, Orlin JB (1993) Network flows: Theory, algorithms, and applications. Prentice-Hall, Englewood Cliffs, NJ
3. Ahuja RK, Orlin JB (1989) A fast and simple algorithm for the maximum flow problem. Oper Res 37:748–759
4. Cheriyan J, Maheshwari SN (1989) Analysis of preflow push algorithms for maximum network flow. SIAM J Comput 18:1057–1086
5. Dinic EA (1970) Algorithm for solution of a problem of maximum flow in networks with power estimation. Soviet Math Dokl 11:1277–1280
6. Edmonds J, Karp RM (1972) Theoretical improvements in algorithmic efficiency for network flow problems. J ACM 19:248–264
7. Elias P, Feinstein A, Shannon CE (1956) Note on maximum flow through a network. IRE Trans Inform Theory IT-2:117–119
8. Ford LR, Fulkerson DR (1956) Maximal flow through a network. Canad J Math 8:399–404
9. Gabow HN (1985) Scaling algorithms for network problems. J Comput Syst Sci 31:148–168
10. Goldberg AV, Tarjan RE (1988) A new approach to the maximum flow problem. J ACM 35:921–940; also: Proc 19th ACM Symp Theory of Computing, pp 136–146

Maximum Likelihood Detection via Semidefinite Programming MIKALAI KISIALIOU, ZHI-QUAN LUO Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, USA MSC2000: 65Y20, 68W25, 90C27, 90C22, 49N15 Article Outline Abstract Keywords and Phrases Introduction Formulation System Model Connection with Unconstrained Optimization Semidefinite Relaxation Strategy Bit-Error-Rate Performance

Method SDP Solver Randomized Rounding Procedure

Cases Performance of SDR Detector Simulation Results

Conclusions References Abstract Maximum-likelihood detection is a generic NP-hard problem in digital communications which requires an efficient solution in practice. Some existing quasi-maximum-likelihood detectors achieve polynomial complexity with significant bit-error-rate performance degradation (e.g. LMMSE Detector), while others exhibit near-maximum-likelihood bit-error-rate performance with exponential complexity (e.g. Sphere Decoder and its variants). We present an efficient suboptimal detector based on a semidefinite relaxation, called SDR Detector, which enjoys near-maximum-likelihood bit-error-rate with worst-case polynomial complexity.

Maximum Likelihood Detection via Semidefinite Programming

SDR Detector can be implemented with recently developed Interior-Point methods for convex optimization problems. For large systems SDR Detector provides a constant factor approximation for the maximum-likelihood detection problem. In the high signal-to-noise ratio region SDR Detector can solve the maximum-likelihood detection problem exactly. Efficient implementations of SDR Detector empirically deliver a near-optimal bit-error-rate with running time that scales well to large problems and in any signal-to-noise ratio region.

Keywords and Phrases Maximum-likelihood detection; Multiple-input multiple-output systems; Multiuser detection; Semidefinite relaxation Introduction Maximum-Likelihood (ML) detection is a fundamental problem in digital communications. Under the mild assumption of equiprobable transmitted signals, ML Detector achieves the best Bit-Error-Rate (BER). In general, the ML detection problem is NP-hard due to the discrete nature of a signal constellation. The exhaustive search can be applied for small problem sizes; however, this strategy is not practical for large systems. Large communication systems often arise in schemes with efficient rate and diversity utilization, e.g. systems based on Linear Dispersion Codes [6]. Various suboptimal detectors that have been developed to approximate ML Detector can be divided into two major categories:
– Accelerated versions of ML Detector with exponential complexity (e.g. versions of Sphere Decoder [3,16]),
– Polynomial complexity detectors with significant degradation in the BER performance (e.g. Linear Minimum Mean Square Error (LMMSE) Detector, Matched Filter, Decorrelator, etc.).
We focus on an alternative detector which is based on a semidefinite relaxation of the ML detection problem. This detector, called SDR Detector hereafter, enjoys worst-case polynomial complexity while delivering a near-optimal BER performance. In the next subsection we introduce the notation and system model used throughout the text.


Formulation System Model Consider a vector communication channel with n transmit and m receive antennas. In wireless communications a Rayleigh fading model is widely used in scenarios with a significantly attenuated line-of-sight signal component. Abundant research is based on this model, which is used in profound theoretical results on channel capacity, diversity, and multiplexing gain. Define the fading coefficient from the ith transmit antenna to the kth receive antenna to be a Gaussian zero-mean unit-variance, N(0, 1), variable H_ki, with a Rayleigh distributed amplitude |H_ki| and a uniformly distributed phase. The coefficients H_ki are assumed to be spatially and temporally independent and identically distributed (i.i.d.). The transmitted signals $s = [s_1, \dots, s_n]^T$ are drawn from a discrete n-dimensional complex set $\mathcal{C}^n$. The communication system is operating at an average Signal-to-Noise Ratio (SNR) denoted by ρ. Noise samples at each receive antenna, $v_k$, k = 1, ..., m, are modelled as i.i.d. N(0, 1) random variables. With these notations a Rayleigh memoryless vector channel can be represented by:
$$y = \sqrt{\rho/n}\, H s + v. \tag{1}$$
The coefficient $\sqrt{\rho/n}$ ensures that the expected value of the SNR at each receive antenna is equal to ρ, independent of the problem dimension n. Channel model (1) is quite generic and can be used to describe other communication systems, for example, a synchronous CDMA multi-access channel, where n denotes the number of users in the system. In the sequel, we will assume that the receiver has perfect information of the fading matrix H. In practice H is estimated by sending training signals which are known to the receiver. Given the vector of received signals y and the channel state H, the optimal detector computes an estimate of the transmitted signals such that the probability of an erroneous decision is minimized. For equiprobable input signals the minimal error probability is achieved by ML Detector, given by:
$$s_{ML} = \arg\max_{s \in \mathcal{C}^n} p(y|s, H),$$

where $p(\cdot|\cdot)$ is a conditional probability density function and $s_{ML}$ denotes the ML estimate of transmitted


signals. For Gaussian noise this optimization problem can be stated in the form of the Integer Least Squares (ILS) problem:
$$s_{ML} = \arg\min_{s \in \mathcal{C}^n} \left\| y - \sqrt{\rho/n}\, H s \right\|^2. \tag{2}$$
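For concreteness, a small Python illustration of model (1) and of the exhaustive search implied by (2), assuming BPSK signals (C = {−1, +1}) and real-valued Gaussian H as a stand-in for Rayleigh fading; all sizes and the SNR value are arbitrary choices of this sketch.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n, m, snr = 4, 6, 10.0                     # small sizes chosen for illustration
H = rng.normal(size=(m, n))                # real-valued stand-in for the fading matrix
s = rng.choice([-1.0, 1.0], size=n)        # BPSK transmit vector
v = rng.normal(size=m)
y = np.sqrt(snr / n) * H @ s + v           # channel model (1)

# ML detection (2) by exhaustive search over the 2^n BPSK vectors
best, s_ml = np.inf, None
for cand in itertools.product([-1.0, 1.0], repeat=n):
    cand = np.array(cand)
    cost = np.linalg.norm(y - np.sqrt(snr / n) * H @ cand) ** 2
    if cost < best:
        best, s_ml = cost, cand
print("transmitted:", s, "\ndetected:  ", s_ml)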

In general, this optimization problem is NP-hard, and the discrete constraint set $\mathcal{C}^n$ of dimension n is the source of intractability. We are interested in an efficient polynomial time approximation algorithm for (2) with theoretical performance guarantees. In the next section we briefly discuss common approaches to solving problem (2). Connection with Unconstrained Optimization Several strategies have been developed to overcome the high computational complexity of ML Detector. Some detectors achieve polynomial complexity by relaxing the integer constraint in the ML detection problem (2), e.g. LMMSE Detector, Decorrelator, and Matched Filter [5]. From the perspective of optimization theory these detectors can be jointly treated by dropping the discrete constraint in (2) and imposing a penalty function instead. For the BPSK constellation the relaxed problem can be written as:
$$\hat{s} = \arg\min_{s \in \mathbb{R}^n} \left\| y - \sqrt{\rho/n}\, H s \right\|^2 + \gamma \|s\|^2. \tag{3}$$

The modified optimization problem is usually followed by a rounding procedure which projects the optimal solution of the relaxed problem onto the set $\mathcal{C}^n$. Selecting proper values for γ, we can specialize (3) to LMMSE Detector, Decorrelator, or Matched Filter. An appealing advantage of this approach is that it can be solved analytically:
$$\hat{s} = \operatorname{sign}\!\left( \left( H^T H + \frac{\gamma n}{\rho}\, I \right)^{-1} H^T y \right). \tag{4}$$
This strategy achieves complexity O(n³) while sacrificing BER performance. Another type of detector preserves the near-ML BER while reducing the high complexity of the exhaustive search. The work originates in [3,16] with the algorithm to find the shortest vector on a lattice, known as the Sphere Decoder. The algorithm reduces the exhaustive search to an ellipse centered at the zero-forcing estimate of the transmitted signals:
$$s_{ZF} = \sqrt{n/\rho}\, \left( H^T H \right)^{-1} H^T y.$$
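A minimal Python sketch of the closed-form detectors (4); the scaling gamma*n/snr of the ridge term is a reconstruction from (4) above (an assumption of this sketch), and gamma = 0 recovers the zero-forcing (Decorrelator) estimate s_ZF.

import numpy as np

def relaxed_detector(H, y, snr, gamma):
    """Closed-form solution of the penalized relaxation (3)/(4):
    sign((H^T H + gamma*n/snr * I)^(-1) H^T y). gamma = 0 gives the
    Decorrelator (zero-forcing); other values give LMMSE-type detectors.
    The exact ridge scaling is a reconstruction, not a quoted formula."""
    n = H.shape[1]
    G = H.T @ H + (gamma * n / snr) * np.eye(n)
    return np.sign(np.linalg.solve(G, H.T @ y))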

Different variants of this approach use various intelligent strategies for the radius selection and the ordering of the points to be searched inside the ellipse. In the high SNR region for small problem sizes, Sphere Decoder empirically demonstrates fast running time [7]. However, a thorough theoretical analysis [9,10] has shown that both the worst-case and the expected complexity of this algorithm are still exponential. Semidefinite Relaxation Strategy We consider an alternative approach to solving (2) which is based on a convex relaxation of the ML detection problem. Convexity of an optimization problem is a good indicator of problem tractability. Efficient and powerful algorithms with complexity O(n^{3.5}) have recently been developed to solve convex optimization problems (e.g. Interior-Point methods). These algorithms make efficient use of theoretically computable stopping criteria, enjoy robustness, and offer a certificate of infeasibility when no solution exists. All these properties render convex optimization methods a primary tool for various fields of engineering. There are several generic types of convex problems, the simplest one being a Linear Program (LP), i.e. an optimization problem with a linear objective function and linear constraints. An LP allows a natural generalization of the notion of an inequality constraint to a so-called Linear Matrix Inequality (LMI). Instead of the regular componentwise meaning of the inequality in an LP, the LMI $X \succeq 0$ states that X belongs to the cone of symmetric positive semidefinite matrices, i.e. all eigenvalues of X are non-negative. Such a generalization leads to the generic class of Semi-Definite Programs (SDP), which can be written in standard form as follows:
$$\begin{array}{ll} \min & Q \bullet X \\ \text{s.t.} & A_k \bullet X = b_k, \quad k = 1, \dots, K, \\ & X \succeq 0, \end{array} \tag{5}$$
where $(\bullet)$ denotes the inner product in matrix space: $Q \bullet X = \operatorname{Tr}(QX)$. The class of SDP problems (5) includes Linear Programs as well as Second-Order Cone Programs as special cases. It is quite remarkable that any problem in the broad class of SDP problems (5) can be solved in polynomial time, which makes it a valuable asset for solving engineering problems,


including filter design, control, VLSI circuit layout design, etc. [2]. In addition to applications in numerical solvers, the SDP formulation (5) is widely used for the analysis and design of approximation algorithms for NP-hard problems. Traditional approaches involve relaxation of an NP-hard problem to an LP, which can be easily solved in polynomial time. With the advent of Interior-Point methods for non-linear convex optimization problems, some approximation algorithms have been significantly improved [4]. Such advanced non-linear approximation algorithms use weaker relaxations, thereby preserving most of the structure of the original NP-hard problem. The class of SDP problems represents a perfect candidate for the design of approximation algorithms since the SDP form is quite generic. The solution to the original NP-hard problem is generated from the solution of the relaxed SDP problem by a randomized or deterministic rounding procedure. For example, as will be shown later, the ML detection problem can be formulated as
$$f_{ML} := \begin{array}[t]{ll} \min & Q \bullet X \\ \text{s.t.} & X_{i,i} = 1, \quad i = 1, \dots, n+1, \\ & X \succeq 0, \\ & X \text{ is rank-1}. \end{array} \tag{6}$$
Relaxing the rank constraint of X reduces the problem to the standard SDP form (5):
$$f_{SDP} := \begin{array}[t]{ll} \min & Q \bullet X \\ \text{s.t.} & X_{i,i} = 1, \quad i = 1, \dots, n+1, \\ & X \succeq 0. \end{array} \tag{7}$$
A subsequent rounding procedure generates an estimate of the transmitted signals, with an objective value denoted $f_{SDR}$, based on the optimal solution $X_{opt}$ of this SDP problem. Since SDR Detector outputs an estimate that belongs to the feasible set of the ML detection problem, the objective value $f_{SDR}$ of SDR Detector satisfies $f_{ML} \le f_{SDR}$. Let $f_{opt}$ ($f_{apr}$) denote the optimal objective value of an NP-hard problem (approximation algorithm) in minimization form; then an approximation algorithm with ratio $c \ge 1$ guarantees to provide a solution with objective value $f_{apr}$ such that $f_{apr} \le c f_{opt}$. The quality of SDR Detector can be measured in terms of an approximation ratio c such that
$$f_{ML} \le f_{SDR} \le c\, f_{ML}, \quad c \ge 1,$$
where c is independent of problem size. Relaxation (5) was first applied to combinatorial optimization in [4], where the authors relaxed the MAX-CUT problem to an SDP problem in the standard form (5). This strategy resulted in a substantial improvement of the approximation ratio for the MAX-CUT problem, as compared to the classical relaxation to an LP. Unfortunately, we cannot pursue this approach because the ML detection problem involves minimization instead of the maximization (for a positive semidefinite matrix Q) used in the formulation of the MAX-CUT problem. Moreover, the ML detection problem does not allow a constant factor approximation algorithm for worst-case realizations of H and v. However, from the perspective of digital communications we are interested in the average performance of SDR Detector over many channel and noise realizations. It turns out that SDR Detector allows a probabilistic approximation ratio for the random channel model (1). In the high SNR region a typical behavior of the detection error probability is
$$P_e \simeq e^{-\kappa(\rho)},$$
where the function $\kappa(\rho)$ varies for different detectors. For example, $\kappa_{ml}(\rho) = O(\rho)$ for ML Detector, and $\kappa_{lmmse}(\rho) = O(\sqrt{\rho})$ for LMMSE Detector [5]. When a suboptimal detector is deployed instead of ML Detector, the incurred BER deterioration can be expressed in terms of the log-likelihood ratio:
$$\frac{\log(P_e(\mathrm{sdr}))}{\log(P_e(\mathrm{ml}))} = \frac{\kappa_{sdr}(\rho)}{\kappa_{ml}(\rho)} =: c(\rho).$$
Therefore, the approximation ratio $c(\rho)$ is an essential step in bounding the SNR gap between two detectors. Before we proceed with the probabilistic analysis of the performance, let us consider the empirical BER performance of SDR Detector in numerical simulations for channel model (1). Bit-Error-Rate Performance The detector based on a semidefinite relaxation (SDR) consists of two parts: a solver of the relaxation (7) and a randomized rounding procedure.


Maximum Likelihood Detection via Semidefinite Programming, Figure 1 Bit-Error-Rate as a function of Signal-to-Noise Ratio for different detectors

The SDP in (7) can be efficiently solved using Interior-Point (IP) methods with complexity O(n^{3.5}). For this purpose we use the SeDuMi optimization toolbox for Matlab. The randomized rounding procedure projects the solution of the SDP (7) onto the original discrete constraint set and will be discussed in detail in the next section. Figure 1 shows a comparison of the BER performance of the SeDuMi-based SDR Detector [13], LMMSE Detector, Matched Filter, Decorrelator, the Nulling and Cancelling strategy, Sphere Decoder, and ML Detector. We observe a significant BER improvement of SDR Detector compared to the other polynomial complexity detectors. Sphere Decoder with adjustable radius search [16] delivers the BER performance of ML Detector (with probability 1) with running time that scales exponentially [9] with problem size. In many real-time/embedded applications the detection latency is upper bounded and, in general, premature decisions cause significant BER degradation. For simulation purposes we suppose that an engineering system is designed with BPSK modulation, operates at SNR = 10 dB, and allows a 6.3 ms per-bit detection latency. Figure 2 demonstrates the BER performance of this system under the upper bound on the detection latency. The exponential complexity of Sphere Decoder reveals itself between dimensions 40 and 60, where we observe a rapid BER degradation because the running time of Sphere Decoder exceeds the fixed detection time

Maximum Likelihood Detection via Semidefinite Programming, Figure 2 BER degradation due to the limit on detection time. Simulation parameters: BPSK modulation, SNR = 10 dB, and time limit per bit = 6.3 ms

threshold for most channel realizations. At the same time, the running time of SDR Detector scales gracefully with problem size and, in most cases, the detector completes detection in time. As a result, SDR Detector does not suffer any significant BER degradation even for large problem sizes. In fact, the number of late detections for SDR Detector does not exceed 1% for all dimensions shown in Fig. 2. For different values of SNR and latency per bit we obtain essentially similar curves for both detectors. Such behavior is indicative of the exponentially growing computational effort of Sphere Decoder and comparably modest computational power required by SDR Detector. In the next section we will discuss the details of the SDP relaxation (11) and the randomized rounding procedure. After that we present theoretical guarantees that substantiate the observed empirical behavior of SDR Detector. Method SDR Detector consists of two components: an SDP solver and a randomized rounding procedure. SDP Solver A transformation of the original ML detection problem (2) into the standard SDP form (5) will help


us localize the place in (2) that makes the problem NP-hard. We start by homogenizing the objective function:
$$\left\| y - \sqrt{\rho/n}\, H s \right\|^2 = [s^T\ 1] \begin{bmatrix} (\rho/n)\, H^T H & -\sqrt{\rho/n}\, H^T y \\ -\sqrt{\rho/n}\, y^T H & \|y\|^2 \end{bmatrix} \begin{bmatrix} s \\ 1 \end{bmatrix} = \operatorname{Tr}(Q x x^T),$$
where the matrix $Q \in \mathbb{R}^{(n+1)\times(n+1)}$ and the vector $x \in \mathbb{R}^{n+1}$ are defined as
$$Q = \begin{bmatrix} (\rho/n)\, H^T H & -\sqrt{\rho/n}\, H^T y \\ -\sqrt{\rho/n}\, y^T H & \|y\|^2 \end{bmatrix}, \qquad x = \begin{bmatrix} s \\ 1 \end{bmatrix}. \tag{8}$$
Notice that the matrix Q is composed of parameters that are known at the receiver. We linearize the objective function by introducing a variable matrix X to comply with the standard SDP form (5):
$$f_{ML} := \begin{array}[t]{ll} \min & \operatorname{Tr}(QX) \\ \text{s.t.} & X = x x^T, \\ & X_{i,i} = 1, \quad i = 1, \dots, n+1. \end{array} \tag{9}$$
In this problem formulation we discarded the constraint $x_{n+1} = 1$ on the last entry of the vector x because the problem is not sensitive to the sign of x. If $\hat{x}_{n+1} = -1$ we output $-\hat{x}$ as the solution to (9). The constraint $X = xx^T$ is equivalent to the set $\{X \succeq 0, \operatorname{rank}(X) = 1\}$, where the notation $X \succeq 0$ means that the matrix X is symmetric positive semidefinite. Thus, we complete the transformation of the original ML detection problem over the BPSK constellation to the equivalent form stated in (6):
$$f_{ML} := \begin{array}[t]{ll} \min & \operatorname{Tr}(QX) \\ \text{s.t.} & X_{i,i} = 1, \quad i = 1, \dots, n+1, \\ & X \succeq 0, \\ & X \text{ is rank-1}. \end{array} \tag{10}$$
The rank-1 constraint is the only non-convex constraint in (10), which makes the above problem intractable. SDR Detector relaxes the rank constraint and solves the following convex optimization problem:
$$f_{SDP} := \begin{array}[t]{ll} \min & \operatorname{Tr}(QX) \\ \text{s.t.} & X_{i,i} = 1, \quad i = 1, \dots, n+1, \\ & X \succeq 0. \end{array} \tag{11}$$
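The article solves (11) with the SeDuMi toolbox under Matlab; purely as an illustration, the same relaxation can be written with the Python package CVXPY (a tool assumed here, not one used by the authors). The Q construction follows the reconstruction of (8) above.

import cvxpy as cp
import numpy as np

def sdr_relaxation(H, y, snr):
    """Solve the SDP relaxation (11): min Tr(QX), diag(X) = 1, X psd,
    with Q built from H and y as in (8)."""
    m, n = H.shape
    c = np.sqrt(snr / n)
    Q = np.block([[c * c * H.T @ H, -c * (H.T @ y)[:, None]],
                  [-c * (y.T @ H)[None, :], np.array([[y @ y]])]])
    X = cp.Variable((n + 1, n + 1), symmetric=True)
    prob = cp.Problem(cp.Minimize(cp.trace(Q @ X)),
                      [cp.diag(X) == 1, X >> 0])
    prob.solve()
    return X.value, prob.value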


To reveal the difference between this relaxation and the one in (3), we can take one step further by relaxing the set of constraints $\{X_{i,i} = 1,\ i = 1, \dots, n+1\}$ into $\{\operatorname{Tr}(X) = n+1\}$ while keeping the constraint $X \succeq 0$ intact. This extra-relaxed problem can be solved analytically and leads to the solution
$$\hat{s} = \sqrt{\frac{n}{\rho}}\, \left( H^T H \right)^{-1} H^T y,$$
which is exactly the soft output of the Decorrelator (4) with γ = 0. The relaxation in (11) compares favorably to the relaxations in (3) because it requires fewer modifications of the ML problem, although the complexity O(n^{3.5}) of (11) is higher than the O(n³) of the detectors in (3). Since we dropped the rank constraint in (11), a solution $X_{opt}$ of (11) is no longer rank-1; hence, we need to project $X_{opt}$ onto the feasible set of the original ML detection problem. Such a projection is usually done by a rounding procedure, which can be either deterministic, as in (4), or randomized [13]. It can also vary depending on the processing power available for the algorithm. In the next section we consider a randomized rounding procedure based on the principal eigenvector of the matrix $X_{opt}$. Randomized Rounding Procedure There are various rounding procedures that can be used to extract a rank-1 approximation of $X_{opt}$. Widely used approaches and their analysis can be found in [4,13,14]. For our purposes we consider the randomized strategy based on the principal eigenvector of the matrix $X_{opt}$. Notice that in the noise-free case we have v = 0, and a transmitted vector s belongs to the kernel of the matrix Q defined in (8). The optimal objective value is 0 and is achieved by the vector of transmitted signals s. Thus, in the noise-free case, the optimal solution of problem (11) is a rank-1 matrix:
$$X_{opt} = \begin{bmatrix} s \\ 1 \end{bmatrix} \begin{bmatrix} s^T & 1 \end{bmatrix}.$$

The structure of the optimal matrix Xopt in the noise-free case suggests that the principal component of the eigen-decomposition contains most reliable information on the transmitted signals in high SNR region. It turns out that the optimal matrix Xopt has a strong


principal component even in the low SNR region, justifying the randomized rounding procedure presented below:
INPUT: Solution $X_{opt}$ of (11), and number D of randomized rounding tries.
OUTPUT: Quasi-ML estimate $s_{SDR}$ and the best achieved objective value $f_{SDR}$.
RANDOMIZED ROUNDING PROCEDURE:
1. Take a spectral decomposition $X_{opt} = \sum_{i=1}^{n+1} \lambda_i u_i u_i^T$ and set $v_i = \sqrt{\lambda_i}\, u_i$, $i = 1, \dots, n+1$.
2. Pick the $v_i$ corresponding to the principal eigenvector, $v_{max} = \arg\max_{1 \le i \le n+1} \{\|v_i\|\}$.
3. For each entry $x_i$ define the Bernoulli distribution
$$\Pr\{x_i = +1\} = (1 + v_i^{max})/2, \quad \Pr\{x_i = -1\} = (1 - v_i^{max})/2, \tag{12}$$
where $v_i^{max}$ denotes the ith entry of the vector $v_{max}$.
4. Generate a fixed number D of i.i.d. (n+1)-dimensional vector samples $\bar{x}_d$, d = 1, ..., D, such that each entry $(\bar{x}_d)_i$, i = 1, ..., n+1, is drawn from distribution (12).
5. For all D samples, set $\bar{x}_d := -\bar{x}_d$ if the (n+1)-st entry of $\bar{x}_d$ is equal to $-1$.
6. Pick $x_{SDR} := \arg\min_d \bar{x}_d^T Q \bar{x}_d$ and set the best achieved objective value $f_{SDR} := x_{SDR}^T Q x_{SDR}$.
7. Return $f_{SDR}$ and $s_{SDR}$, which is given by the vector $x_{SDR}$ with the last bit discarded.
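A direct Python transcription of steps 1-7 (an illustration; the sign-flip convention of step 5 follows the reconstruction above):

import numpy as np

def randomized_rounding(X_opt, Q, D=50, rng=None):
    """Steps 1-7: scale the eigenvectors, take the principal one, sample
    BPSK vectors from the Bernoulli law (12), keep the best sample."""
    rng = rng or np.random.default_rng()
    w, U = np.linalg.eigh(X_opt)                  # spectral decomposition
    V = U * np.sqrt(np.clip(w, 0, None))          # columns v_i = sqrt(lambda_i) u_i
    v_max = V[:, np.argmax(np.linalg.norm(V, axis=0))]
    p = np.clip((1 + v_max) / 2, 0, 1)            # Pr{x_i = +1}, distribution (12)
    best_f, best_x = np.inf, None
    for _ in range(D):
        x = np.where(rng.random(p.size) < p, 1.0, -1.0)
        if x[-1] == -1:                           # enforce x_{n+1} = +1 by symmetry
            x = -x
        f = x @ Q @ x
        if f < best_f:
            best_f, best_x = f, x
    return best_x[:-1], best_f                    # discard the last bit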

This randomized rounding procedure is designed to ensure that the output $s_{SDR}$ is equal to the vector of transmitted signals with high probability. Whenever there is an error, the procedure selects $s_{SDR}$ so as to reduce the number of bits in error. Cases Performance of SDR Detector Constant Factor Optimality of SDR Detector The core component of SDR Detector is an approximation algorithm based on the convex relaxation (11) of the original ML detection problem. In this section we analyze the approximation ratio of this algorithm. A technique pioneered in [4] is widely used in the optimization literature to derive constant factor optimality for SDP-based relaxations. After the optimal solution $X_{opt}$ of problem (11) has been obtained,

the randomized rounding procedure used in [4] defines the Gaussian distribution $N(0, X_{opt})$ (compare with (12)) and implements the n-dimensional sign(·) operator with uniformly generated cutting hyperplanes:
– Generate D i.i.d. samples $\bar{x}_1, \dots, \bar{x}_D$ from the Gaussian distribution $N(0, X_{opt})$.
– Let $x_i = \operatorname{sign}(\bar{x}_i)$ and set the solution $x_{SDR}$ that achieves the minimum:
$$f_{SDR} := x_{SDR}^T Q x_{SDR} = \min_i x_i^T Q x_i.$$

The best objective value $f_{SDR}$ achieved with this randomized rounding procedure can be upper bounded as follows [4]:
$$E\{f_{SDR}\} = E\{x_{SDR}^T Q x_{SDR}\} \overset{P}{\le} E\{x_i^T Q x_i\} = \operatorname{Tr}\big(Q\, E\{x_i x_i^T\}\big) = \operatorname{Tr}\Big(Q\, \tfrac{2}{\pi} \arcsin(X_{opt})\Big), \tag{13}$$

where the inequality above holds in probability for sufficiently many samples D, and the last equality follows from the fact that for any scalar random samples $\bar{x}_i$ and $\bar{x}_j$ drawn from N(0, 1) we have:
$$E\{\operatorname{sign}(\bar{x}_i)\, \operatorname{sign}(\bar{x}_j)\} = \frac{2}{\pi} \arcsin\big( E\{\bar{x}_i \bar{x}_j\} \big).$$

By taking the Taylor expansion of arcsin(Y), we can see that for any matrix Y such that $Y \succeq 0$, $Y_{ii} = 1$, the following inequality holds:
$$\arcsin(Y) \succeq Y. \tag{14}$$

Suppose that $Q \preceq 0$; then we have the following upper bound:
$$\operatorname{Tr}\big(Q \arcsin(X_{opt})\big) \le \operatorname{Tr}(Q X_{opt}), \tag{15}$$
which allows us to bound $f_{SDR}$ as a constant factor away from $f_{ML}$:
$$f_{ML} \le E\{f_{SDR}\} \overset{P}{\le} \frac{2}{\pi} \operatorname{Tr}(Q X_{opt}) = \frac{2}{\pi} f_{SDP} \le \frac{2}{\pi} f_{ML},$$
where the first inequality holds because an output of SDR Detector belongs to the feasible set of the ML problem (10), the second inequality follows from (13)


combined with (15), the third equality is the definition of $f_{SDP}$, and the last inequality holds because the SDP problem (11) is a relaxation of the ML problem (10). Therefore, given $Q \preceq 0$, we obtain a 2/π-approximation ratio for the algorithm. Unfortunately, for the ML detection problem the reverse inequality takes place in (8):
$$Q = \begin{bmatrix} (\rho/n)\, H^T H & -\sqrt{\rho/n}\, H^T y \\ -\sqrt{\rho/n}\, y^T H & \|y\|^2 \end{bmatrix} \succeq 0.$$
We can attempt to cure the problem with an inequality similar to (14) in the reverse direction for some constant c:
$$\arcsin(Y) \preceq cY, \quad \text{for all } Y \succeq 0 \text{ with } Y_{ii} = 1.$$
For this inequality to hold, c must grow linearly with the problem dimension n. Hence, in the limit n → ∞ the constant c, together with the approximation ratio of the algorithm, grows unbounded. That is, we cannot obtain a constant factor approximation by applying the standard technique of [4] to the analysis of the SDP relaxation in (11). The technique presented above applies to any negative semidefinite matrix Q; hence, in the context of suboptimal detection it attempts to obtain constant factor optimality for the worst-case channel realization. However, from the perspective of digital communications, we are interested in the average performance of SDR Detector over many channel realizations. Unlike the technique we have discussed above, a probabilistic analysis of the Karush-Kuhn-Tucker (KKT) optimality conditions of the semidefinite problem (11) allows us to claim constant factor optimality for SDR Detector in probability [11]. The optimal objective value $f_{SDR}$ achieved by SDR Detector is within a constant factor $c(\rho, \gamma)$ of the optimal ML objective value in probability:
$$\lim_{\substack{n, m \to \infty \\ m/n \to \gamma \ge 1}} P\left\{ \frac{f_{SDR}}{f_{ML}} \le c(\rho, \gamma) \right\} = 1, \tag{16}$$
where $c(\rho, \gamma) = 1 + \dfrac{2(1+\sqrt{\gamma})^2\, \beta}{\alpha\, \rho} \ge 1$ and the constants $\{\alpha, \beta\}$ are given by
$$\alpha = \begin{cases} 1/3, & \text{if } \gamma = 1 \\ 1/2, & \text{if } \gamma > 1 \end{cases} \qquad \beta = \begin{cases} \sqrt{3}/4, & \text{if } \gamma = 1 \\ 4\sqrt[4]{\gamma - 1}, & \text{if } \gamma > 1 \end{cases}$$


The statement implies that the log-likelihood ratio of SDR and ML Detectors is bounded in probability by a constant which is fully specified by the SNR only. Performance of SDR Detector in High SNR Region We have argued in Sect. "Randomized Rounding Procedure" that the selected randomized rounding procedure provides the optimal solution in the noise-free case. The optimality condition can be extended to the case of large finite SNR: for sufficiently high SNR, SDR Detector solves the ML detection problem in polynomial time. For a given system dimension n and SNR ρ (both finite), the solution $X_{opt}$ of the relaxed problem (11) is rank-1 if the channel matrix H and noise v realizations satisfy:
$$\lambda_{min}(H^T H) > \sqrt{\frac{n}{\rho}}\, \left\| H^T v \right\|_{\infty}. \tag{18}$$
Since the random matrix $H^T H$ is full rank with probability 1, this claim can also be interpreted as follows: for any given n there exists a sufficiently high (finite) SNR level such that (18) holds and $X_{opt}$ is rank-1. In general, if (18) does not hold, $X_{opt}$ may still be rank-1. Notice that if condition (18) is satisfied, the solution of the SDP problem (11) belongs to the feasible set of (10); thus, $X_{opt}$ is also the solution of the ML detection problem. Hence, under the specified conditions SDR Detector solves the original ML detection problem. The asymptotic performance of SDR Detector for fixed problem size and ρ → ∞ has been analyzed in [8], where it is shown that for Rayleigh fading H, SDR Detector achieves the maximum diversity, i.e.
$$\lim_{\rho \to \infty} \frac{\log P\{s_{sdr} \ne s\}}{\log \rho} = \lim_{\rho \to \infty} \frac{\log P\{s_{ml} \ne s\}}{\log \rho} = -\frac{n}{2}.$$
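As a small illustration, condition (18) can be checked numerically for a sampled channel (an added sketch; the infinity-norm choice follows the reconstruction of (18) above):

import numpy as np

def rank_one_certificate(H, v, snr):
    """Check the sufficient condition (18) for the SDP solution to be
    rank-1: lambda_min(H^T H) > sqrt(n/snr) * ||H^T v||_inf."""
    n = H.shape[1]
    lam_min = np.linalg.eigvalsh(H.T @ H)[0]
    return lam_min > np.sqrt(n / snr) * np.abs(H.T @ v).max()

rng = np.random.default_rng(2)
H, v = rng.normal(size=(8, 4)), rng.normal(size=8)
print(rank_one_certificate(H, v, snr=100.0))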

Simulation Results In this section we compare the running time and the BER performance of various implementations of the detectors based on the semidefinite relaxation (11) and that of Sphere Decoder:
– SDP detector [13], implemented with the SeDuMi toolbox [15] for convex optimization problems.
– SDR Detector based on a dual-scaling interior-point method (DSDP implementation [1]) and a dimension reduction strategy [12].


– SDR Detector [12], implemented with a dual-scaling interior-point method, a dimension reduction strategy, and a warm start with a truncated version of Sphere Decoder.
– Sphere Decoder [16].
Figures 3 and 4 demonstrate the average running time and the BER performance achieved by the above detectors for problem size n = 60. Notice that the running time of the DSDP-based (SeDuMi-based) detector is insensitive to SNR, and the BER performance shows a 1 dB (2 dB)

Maximum Likelihood Detection via Semidefinite Programming, Figure 3 Running time comparison, n = 60

Maximum Likelihood Detection via Semidefinite Programming, Figure 4 Bit-error-rate comparison, n = 60

SNR loss. Sphere Decoder is faster than the semidefinite relaxation-based detectors in the high SNR regime but becomes significantly slower for SNR lower than 10 dB. SDR Detector matches the speed of Sphere Decoder in the high SNR region, matches the running time of the other semidefinite relaxation-based detectors in the low SNR regime, and enjoys near-ML BER performance. Figures 5 and 6 compare the average running time for large problems and in the low SNR region. The running time of the polynomial complexity detectors (SDR

Maximum Likelihood Detection via Semidefinite Programming, Figure 5 Running time for large problems, ρ = 10 dB

Maximum Likelihood Detection via Semidefinite Programming, Figure 6 Running time in low SNR regime, n = 40


Detector, SeDuMi and DSDP-based) scales well in both regimes, remaining in the sub-second region, while the running time of Sphere Decoder deteriorates in both scenarios. Conclusions We have considered the maximum likelihood detection problem. Among various quasi-ML detectors SDR Detector offers a near-optimal BER performance with the worst-case polynomial complexity. We have analyzed the underlying structure of the SDP relaxation which is the core of SDR Detector. For a given SNR SDR Detector delivers a constant factor approximation of the loglikelihood ratio for the original ML detection problem in probability, where the constant factor is independent of problem size. SDR Detector solves ML detection problem exactly in high SNR region. Numerical simulations of BER and running time empirically demonstrate the advantages of SDR Detector as compared to the computationally expensive ML Detector. References 1. Benson SJ, Ye Y (2007) DSDP5: Software for semidefinite programming. ACM Trans Math Soft 34(3) 2. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York 3. Fincke U, Pohst M (1985) Improved methods for calculating vectors of short length in a lattice, including a complexity analysis. Math Comput 44:463–471 4. Goemans MX, Williamson DP (1995) Improved approximation algorithms for maximum cut and satisfiability problem using semidefinite programming. J ACM 42:1115– 1145 5. Guo D, Verdú S (2005) Randomly spread CDMA: asymptotics via statistical physics. IEEE Trans Inf Theory 51:1982– 2010 6. Hassibi B, Hochwald BM (2002) High-rate codes that are linear in space and time. IEEE Trans Inf Theory 48(7): 1804–1824 7. Hassibi B, Vikalo H (2001) On the expected complexity of sphere decoding. Thirty-Fifth Asilomar Conference on Signals. Syst Comput 2:1051–1055 8. Jalden J (2006) Detection for multiple input multiple output channels. Ph.D. Thesis, KTH, School of Electrical Engineering, Stockholm 9. Jalden J, Ottersten B (2004) An exponential lower bound on the expected complexity of sphere decoding. Proc ICASSP ’04, vol 4, pp IV 393–IV 396 10. Jalden J, Ottersten B (2005) On the complexity of sphere decoding in digital communications. IEEE Trans Signal Process 53(4):1474–1484


11. Kisialiou M, Luo Z-Q (2005) Performance analysis of quasi-maximum-likelihood detector based on semi-definite programming. Proc ICASSP '05, vol 3, pp III 433–III 436
12. Kisialiou M, Luo Z-Q (2007) Efficient implementation of a quasi-maximum-likelihood detector based on semi-definite relaxation. Proc ICASSP '07, vol 4, pp IV 1329–IV 1332
13. Ma WK, Davidson TN, Wong KM, Luo Z-Q, Ching PC (2002) Quasi-maximum-likelihood multiuser detection using semidefinite relaxation. IEEE Trans Signal Process 50(4):912–922
14. Nesterov YE (1997) Quality of semidefinite relaxation for nonconvex quadratic optimization. CORE Discussion Paper, no 9719
15. Sturm JF (1999) Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim Meth Soft 11–12:625–653
16. Viterbo E, Boutros J (1999) A universal lattice code decoder for fading channels. IEEE Trans Inf Theory 45(5):1639–1642

Maximum Partition Matching MPM JIANER CHEN Texas A&M University, College Station, USA MSC2000: 05A18, 05D15, 68M07, 68M10, 68Q25, 68R05 Article Outline Keywords Definitions and Motivation Case I. Via Pre-Matching when kSk is Large Case II. Via Greedy Method when kSk is Small See also References Keywords Maximum matching; Greedy algorithm; Star network; Parallel routing algorithm The maximum partition matching problem was introduced recently in the study of routing schemes on interconnection networks [2]. In this article, we study the basic properties of the problem. An efficient algorithm for the maximum partition matching problem is presented.


Definitions and Motivation Let S = {C_1, ..., C_k} be a collection of subsets of the universal set U = {1, ..., n} such that $\bigcup_{i=1}^{k} C_i = U$ and $C_i \cap C_j = \emptyset$ for all i ≠ j. A partition (A, B) of S pairs two elements a and b in U if a is contained in a subset in A and b is contained in a subset in B. A partition matching (of order m) of S consists of two ordered subsets L = {a_1, ..., a_m} and R = {b_1, ..., b_m} of m elements of U (the subsets L and R may not be disjoint), together with a sequence of m distinct partitions of S: (A_1, B_1), ..., (A_m, B_m), such that for all i = 1, ..., m, the partition (A_i, B_i) pairs the elements a_i and b_i. The maximum partition matching problem is to construct a partition matching of order m for a given collection S with m maximized. The maximum partition matching problem arises in connection with the parallel routing problem in interconnection networks, in particular in the study of the star networks [1], which are attractive alternatives to the popular hypercube networks. It can be shown that constructing an optimal parallel routing scheme in the star networks can be effectively reduced to the maximum partition matching problem. Readers interested in this connection are referred to [2] for a detailed discussion. The maximum partition matching problem can be formulated in terms of the 3-dimensional matching problem as follows: given an instance S = {C_1, ..., C_k} of the maximum partition matching problem, we construct an instance M for the 3-dimensional matching problem such that a triple (a, b, P) is contained in M if and only if the partition P of S pairs the elements a and b. However, since the number of partitions of the collection S can be as large as $2^n$ and the 3-dimensional matching problem is NP-hard [4], this reduction does not hint at a polynomial time algorithm for the maximum partition matching problem. In the rest of this article, we study the basic properties of the maximum partition matching problem and present an algorithm of running time O(n² log n) for the problem. We first introduce the necessary terminology that will be used in our discussion. Let $\mathcal{M} = \langle L, R, (A_1, B_1), \dots, (A_m, B_m) \rangle$ be a partition matching of the collection S, where L = {a_1, ..., a_m} and R = {b_1, ..., b_m}. We will say that the partition (A_i, B_i) left-pairs the element a_i and right-pairs the element b_i. An element a is said to be left-paired if it is in the set

L. Otherwise, the element a is left-unpaired. Similarly we define right-paired and right-unpaired elements. The collections A_i and B_i are called the left-collection and right-collection of the partition (A_i, B_i). The partition matching $\mathcal{M}$ may also be written as [(a_1, b_1), ..., (a_m, b_m)] if the corresponding partitions are implied. For the rest of this article, we assume that U = {1, ..., n} and that S = {C_1, ..., C_k} is a collection of pairwise disjoint subsets of U such that $\bigcup_{i=1}^{k} C_i = U$. Case I. Via Pre-Matching when ‖S‖ is Large A necessary condition for two ordered subsets L = {a_1, ..., a_m} and R = {b_1, ..., b_m} of U to form a partition matching for the collection S is that a_i and b_i belong to different subsets in the collection S, for all i = 1, ..., m. We say that the two ordered subsets L and R of U form a pre-matching π = {(a_i, b_i): 1 ≤ i ≤ m} if a_i and b_i do not belong to the same subset in the collection S, for all i = 1, ..., m. The pre-matching π is maximum if m is the largest among all pre-matchings of S. A maximum pre-matching can be constructed efficiently by the algorithm pre-matching given below, where we say that a set is singular if it consists of a single element. See [3] for a proof of the correctness of the algorithm.

Input: the collection S = {C_1, ..., C_k} of subsets of U
Output: a maximum pre-matching π in S
1. T = S; π = ∅;
2. WHILE T contains more than one set but does not consist of exactly three singular sets DO
2.1. pick two sets C and C′ of largest cardinality in T;
2.2. pick an element a in C and an element b in C′;
2.3. π = π ∪ {(a, b), (b, a)};
2.4. C = C − {a}; C′ = C′ − {b};
2.5. if C or C′ is empty now, delete it from T;
3. IF T consists of exactly three singular sets C_1 = {a_1}, C_2 = {a_2}, and C_3 = {a_3} THEN π = π ∪ {(a_1, a_2), (a_2, a_3), (a_3, a_1)}.

Algorithm pre-matching
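A direct Python rendering of algorithm pre-matching (an illustration, assuming S is given as a list of disjoint sets):

def pre_matching(S):
    """Algorithm pre-matching: repeatedly pair elements drawn from the two
    largest sets; finish three leftover singletons with a 3-cycle."""
    T = [set(C) for C in S if C]
    pi = []
    while len(T) > 1 and not (len(T) == 3 and all(len(C) == 1 for C in T)):
        T.sort(key=len, reverse=True)
        C, C2 = T[0], T[1]               # two sets of largest cardinality
        a, b = C.pop(), C2.pop()
        pi += [(a, b), (b, a)]
        T = [C for C in T if C]          # drop emptied sets
    if len(T) == 3:                      # three singular sets remain
        a1, a2, a3 = (next(iter(C)) for C in T)
        pi += [(a1, a2), (a2, a3), (a3, a1)]
    return pi

print(pre_matching([{1, 2}, {3}, {4}]))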

Maximum Partition Matching

In the following, we show that when the cardinality of the collection S is large enough, a maximum partition matching of S can be constructed from the maximum pre-matching  produced by the algorithm prematching. Suppose that the collection S consists of k subsets C1 , . . . , Ck and 2k  4n. The pre-matching  contains at most n pairs. Let (a, b) be a pair in  and let C and C0 be two arbitrary subsets in S such that C contains a and C0 contains b. Note that the number of partitions (A, B) of S such that C is in A and C0 is in B is equal to 2k 2  n. Therefore, at least one such partition can be used to left-pair a and right-pair b. This observation results in the following theorem. Theorem 1 Let S = {C1 , . . . , Ck } be a collection of nonempty subsets of the universal set U = {1, . . . , n} such that [ kiD1 Ci = U and Ci \ Cj = ;, for i 6D j. If 2k  4n, then a maximum partition matching in S can be constructed in time O(n2 ). Proof Consider the following algorithm partitionmatching-I. the collection S = fC1 ; : : : ; C k g of subsets of U Output: a partition matching  in S 1. construct a maximum pre-matching  of S; 2. FOR each pair (a; b) in  DO use an unused partition of S to pair a and b.


Algorithm partition-matching-I

Suppose the pre-matching constructed in step 1 is π = {(a_1, b_1), ..., (a_m, b_m)}. According to the above discussion, for each pair (a_i, b_i) in π, there is always an unused partition of S that left-pairs a_i and right-pairs b_i. Therefore, step 2 of the algorithm partition-matching-I is valid and constructs a partition matching for the collection S. Since each partition matching for S induces a pre-matching in S and π is a maximum pre-matching, we conclude that the constructed partition matching is a maximum partition matching for the collection S. By carefully organizing the elements in U and the partitions of S, we can show that the algorithm partition-matching-I runs in time O(n²). See [3].


Case II. Via Greedy Method when ‖S‖ is Small Now we consider the case $2^k < 4n$. Since the number $2^k$ of partitions of the collection S is small, we can apply a greedy strategy that expands a current partition matching by trying to add each of the unused partitions to the partition matching. We show in this section that a careful use of this greedy method constructs a maximum partition matching for the given collection. Suppose we have a partition matching $\mathcal{M}$ = [(a_1, b_1), ..., (a_h, b_h)] and want to expand it. The partitions of the collection S can then be classified into two classes: h of the partitions are used to pair the h pairs (a_i, b_i), i = 1, ..., h, and the remaining $2^k - h$ partitions are unused. Now if there is an unused partition P = (A, B) such that there is a left-unpaired element a in A and a right-unpaired element b in B, then we simply pair the element a with the element b using the partition P, thus expanding the partition matching $\mathcal{M}$. Now suppose that there is no such unused partition, i.e., for all unused partitions (A, B), either A contains no left-unpaired elements or B contains no right-unpaired elements. This case does not necessarily imply that the current partition matching is maximum. For example, suppose that (A, B) is an unused partition such that there is a left-unpaired element a in A but no right-unpaired elements in B. Assume further that there is a used partition (A′, B′) that pairs elements (a′, b′), such that the element b′ is in B and there is a right-unpaired element b in B′. Then we can let the partition (A′, B′) pair the elements (a′, b), and then let the partition (A, B) pair the elements (a, b′), thus expanding the partition matching $\mathcal{M}$. An explanation of this process is that the used partitions have been incorrectly used to pair elements; thus, in order to construct a maximum partition matching, we must re-pair some of the elements. To further investigate this relation, we need to introduce a few notations. For a used partition P of S, we put an underline on a set in the left-collection (resp. the right-collection) of P to indicate that an element in the set is left-paired (resp. right-paired) by the partition P. These sets will be called the left-paired set and the right-paired set of the partition P, respectively. Definition 2 A used partition P is directly left-reachable from a partition P_1 = (A_1, B_1) if the left-paired set of P is contained in A_1 (the partition P_1

2031

2032

M

Maximum Partition Matching

can be either used or unused). The partition P is directly right-reachable from a partition P2 = (A2 , B2 ) if the right-paired set of P is contained in B2 . A partition Ps is left-reachable (resp. right-reachable) from a partition P1 if there are partitions P2 , . . . , Ps 1 such that Pi is directly left-reachable (resp. directly right-reachable) from Pi 1 , for all i = 2, . . . , s. The left-reachability and the right-reachability are transitive relations. Let P1 = (A1 , B1 ) be an unused partition such that there are no left-unpaired elements in A1 , and let Ps = (As , Bs ) be a partition left-reachable from P1 and there is a left-unpaired element as in As . We show how we can use a chain justification to make a left-unpaired element for the collection A1 . By the definition, there are used partitions P2 , . . . , Ps 1 such that Pi is directly left-reachable from Pi 1 , for i = 2, . . . , s. We can further assume that Pi is not directly left-reachable from Pi 2 for i = 3, . . . , s (otherwise we simply delete the partition Pi 1 from the sequence). Thus, these partitions can be written as P1 D (fC1 g [ A01 ; B1 ); P2 D (fC1 ; C2 g [ A02 ; B2 ); P3 D (fC2 ; C3 g [ A03 ; B3 ); ::

:

Ps1 D (fCs2 ; Cs1 g [

paired element in the right-collection Bs is unchanged). Thus, the element as 1 in the set Cs 1 of the partition Ps used to left-pair becomes left-unpaired. We then use the partition Ps 1 to left-pair the element as  1 and leave an element as 2 in the set Cs 2 left-unpaired, then we use the partition Ps 2 to left-pair as 2 , etc. At the end, we use the partition P2 to left-pair an element a2 in the set C2 and leave an element a1 in the set C1 leftunpaired. Therefore, this process makes an element in the left-collection A1 = {C1 } [ A1 0 of the partition P1 left-unpaired. The above process will be called a left-chain justification. Thus, given an unused partition P1 = (A1 , B1 ) in which the left-collection A1 has no left-unpaired elements and given a used partition Ps = (As , Bs ) leftreachable from P1 such that the left-collection As of Ps has a left-unpaired element, we can apply the left-chain justification that keeps all used partitions in the partition matching and makes a left-unpaired element for the partition P1 . A process called right-chain justification for right-collections of the partitions can be described similarly. A greedy method based on the left-chain and rightchain justifications is presented in the following algorithm greedy-expanding. the collection S = fC1 ; : : : ; C k g of subsets of U Output: a partition matching exp in S 1. exp = ;; 2. repeat until no more changes IF there is an unused partition P = (A; B) that has a left-unpaired element a in A and a right-unpaired element b in B THEN pair the elements (a; b) by the partition P and add P to the matching exp ELSE IF a left-chain justification or a right-chain justification (or both) is applicable to make an unused partition P = (A; B) to have a left-unpaired element in A and a right-unpaired element in B THEN apply the left-chain justification and/or the right-chain justification

Input: A0s1 ; Bs1 );

Ps D (fCs1 ; Cs g [ A0s ; Bs ); where A1 0 , . . . , As 0 are subcollections of S without an underlined set. We can assume that the left-unpaired element as in As D fCs1 ; Cs g [ A0s is in a nonunderlined set Cs in As (otherwise we consider the sequence P1 , . . . , Ps 1 instead). We modify the partition sequence into P1 D (fC1 g [ A01 ; B1 ); P2 D (fC1 ; C2 g [ A02 ; B2 ); P3 D (fC2 ; C3 g [ A03 ; B3 ); :: : Ps1 D (fCs2 ; Cs1 g [ A0s1 ; Bs1 ); Ps D (fCs1 ; Cs g [ A0s ; Bs ): The interpretation is as follows: we use the partition Ps to left-pair the left-unpaired element as (the right-

Algorithm greedy-expanding
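A minimal Python sketch of the expansion loop follows. It implements only the first branch of greedy-expanding (pairing through an unused partition that still has both a left-unpaired and a right-unpaired element); the left- and right-chain justifications, which are the heart of the full algorithm, are omitted here for brevity.

```python
def greedy_expand_simple(partitions):
    """partitions: list of (A, B) pairs, where A and B are lists of sets.
    Returns triples (p, a, b): partition index p pairs elements a and b.
    Only the plain greedy step; chain justifications are not performed."""
    left_paired, right_paired, used = set(), set(), set()
    matching = []
    changed = True
    while changed:
        changed = False
        for p, (A, B) in enumerate(partitions):
            if p in used:
                continue
            a = next((x for C in A for x in C if x not in left_paired), None)
            b = next((x for C in B for x in C if x not in right_paired), None)
            if a is not None and b is not None:
                used.add(p)
                left_paired.add(a)
                right_paired.add(b)
                matching.append((p, a, b))
                changed = True
    return matching
```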

In case 2^k < 4n, a careful organization of the elements and the partitions can make the running time of the algorithm greedy-expanding bounded by O(n^2 log n). Briefly speaking, we construct a graph G of 2^k vertices in which each vertex represents a partition of S. The direct left- and right-reachabilities of partitions are given by the edges in the graph G, so that checking left- and right-reachabilities and performing left- and right-chain justifications can be done efficiently. Interested readers are referred to [3] for a detailed description.

After execution of the algorithm greedy-expanding, we obtain a partition matching π_exp. For each partition P = (A, B) not included in π_exp, either A has no left-unpaired elements and no used partition left-reachable from P has a left-unpaired element in its left-collection, or B has no right-unpaired elements and no used partition right-reachable from P has a right-unpaired element in its right-collection.

Definition 3 Define L_free to be the set of partitions P not used by π_exp such that the left-collection of P has no left-unpaired elements and no used partition left-reachable from P has a left-unpaired element in its left-collection, and define R_free to be the set of partitions P′ not used by π_exp such that the right-collection of P′ has no right-unpaired elements and no used partition right-reachable from P′ has a right-unpaired element in its right-collection.

According to the algorithm greedy-expanding, each partition not used by π_exp is either in the set L_free or in the set R_free. The sets L_free and R_free may not be disjoint.

Definition 4 Define L_reac to be the set of partitions in π_exp that are left-reachable from a partition in L_free, and define R_reac to be the set of partitions in π_exp that are right-reachable from a partition in R_free.

According to the definitions, if a used partition P is in the set L_reac, then all elements in its left-collection are left-paired, and if a used partition P is in the set R_reac, then all elements in its right-collection are right-paired. We first show that if L_reac and R_reac are not disjoint, then we can construct a maximum partition matching from the partition matching π_exp constructed by the algorithm greedy-expanding. For this, we need the following technical lemma.


Lemma 5 If the sets L_reac and R_reac contain a common partition and the partition matching π_exp has less than n pairs, then there is a set C_0 in S, |C_0| ≤ n/2, such that either all elements in each set C ≠ C_0 are left-paired and every used partition whose left-paired set is not C_0 is contained in L_reac, or all elements in each set C ≠ C_0 are right-paired and every used partition whose right-paired set is not C_0 is contained in R_reac.

For a proof, see [3].

Theorem 6 If L_reac and R_reac have a common partition, then the collection S has a maximum partition matching of n pairs, which can be constructed in linear time from the partition matching π_exp.

Proof If π_exp has n pairs, then π_exp is already a maximum partition matching. Thus we assume that π_exp has less than n pairs. According to the above lemma, we can assume, without loss of generality, that all elements in each set C_i, i = 2, ..., k, are left-paired, and that every used partition whose left-paired set is not C_1 is in L_reac. Moreover, |C_1| ≤ Σ_{i=2}^k |C_i|.

Let t = Σ_{i=2}^k |C_i| and d = |C_1|. Then we can assume that the partition matching π_exp consists of the partitions

P_1, ..., P_t, P_{t+1}, ..., P_{t+h},

where P_1, ..., P_t are used by π_exp to left-pair the elements in ∪_{i=2}^k C_i, and P_{t+1}, ..., P_{t+h} are used by π_exp to left-pair the elements in C_1, h < d. Moreover, all partitions P_1, ..., P_t are in the set L_reac. Thus, the set C_1 must be contained in the right-collection of each of the partitions P_1, ..., P_t.

We ignore the partitions P_{t+1}, ..., P_{t+h} and use the partitions P_1, ..., P_t to construct a maximum partition matching of n pairs. Note that {P_1, ..., P_t} also forms a partition matching in the collection S. For a partition (A, B) of S, we say that the partition (B, A) is obtained by flipping the partition (A, B). In the following algorithm partition-flipping, we show that a maximum partition matching of n pairs can be constructed by flipping d partitions among P_1, ..., P_t.


Algorithm partition-flipping
Input: a partition matching {P_1, ..., P_t} that left-pairs all elements in ∪_{i=2}^k C_i, where t = Σ_{i=2}^k |C_i|, the set C_1 is contained in the right-collection of each partition P_i, i = 1, ..., t, and d = |C_1| ≤ t
Output: a maximum partition matching in S with n pairs
1. if not all elements in the set C_1 are right-paired by P_1, ..., P_t, replace a proper number of right-paired elements in ∪_{i=2}^k C_i by the right-unpaired elements in C_1 so that all elements in C_1 are right-paired by P_1, ..., P_t;
2. suppose that the partitions P_1, ..., P_{t−d} right-pair t − d elements b_1, ..., b_{t−d} in ∪_{i=2}^k C_i, and that P_{t−d+1}, ..., P_t right-pair the d elements in C_1;
3. suppose that P̄_1, ..., P̄_{t−d} are the t − d partitions in {P_1, ..., P_t} that left-pair the elements b_1, ..., b_{t−d};
4. flip each of the d partitions in {P_1, ..., P_t} − {P̄_1, ..., P̄_{t−d}} to get d partitions P′_1, ..., P′_d to left-pair the d elements in C_1; the right-paired element of each P′_i is its left-paired element before the flipping;
5. {P_1, ..., P_t, P′_1, ..., P′_d} is a partition matching of n pairs.

Step 1 of the algorithm is always possible: since C_1 is contained in the right-collection of each partition P_i, i = 1, ..., t, and t ≥ d, for each right-unpaired element b in C_1 we can always pick a partition P_i that right-pairs an element in ∪_{i=2}^k C_i and let P_i right-pair the element b instead. We keep doing this replacement until all d elements in C_1 are right-paired. At this point, the number of partitions in {P_1, ..., P_t} that right-pair elements in ∪_{i=2}^k C_i is exactly t − d. Step 3 is always possible since the partitions P_1, ..., P_t left-pair all elements in ∪_{i=2}^k C_i.

Now we verify that the constructed sequence {P_1, ..., P_t, P′_1, ..., P′_d} is a partition matching in S. No two partitions P_i and P_j can be identical since {P_1, ..., P_t} is supposed to be a partition matching in S. No two partitions P′_i and P′_j can be identical since they are obtained by flipping two different partitions in {P_1, ..., P_t}. No partition P_i is identical to a partition P′_j because P_i has C_1 in its right-collection while P′_j has C_1 in its left-collection. Therefore, the partitions P_1, ..., P_t, P′_1, ..., P′_d are all distinct.

Each of the partitions P_1, ..., P_t left-pairs an element in ∪_{i=2}^k C_i, and each of the partitions P′_1, ..., P′_d left-pairs an element in C_1. Thus, all elements in the universal set U get left-paired in {P_1, ..., P_t, P′_1, ..., P′_d}. Finally, the partitions P_1, ..., P_t right-pair all elements in C_1 and the elements b_1, ..., b_{t−d} in ∪_{i=2}^k C_i, and by our selection of the partitions, the partitions P′_1, ..., P′_d precisely right-pair all the elements in ∪_{i=2}^k C_i − {b_1, ..., b_{t−d}}. Thus, all elements in U also get right-paired in {P_1, ..., P_t, P′_1, ..., P′_d}. This shows that the constructed sequence {P_1, ..., P_t, P′_1, ..., P′_d} is a maximum partition matching in the collection S. The running time of the algorithm partition-flipping is obviously linear.

Now we consider the case when the sets L_reac and R_reac have no common partitions.

Theorem 7 If L_reac and R_reac have no common partitions, then the partition matching π_exp is a maximum partition matching.

Proof Let W_other be the set of used partitions in π_exp that belong to neither L_reac nor R_reac. Then L_free ∪ R_free ∪ L_reac ∪ R_reac ∪ W_other is the set of all partitions of the collection S, and L_reac ∪ R_reac ∪ W_other is the set of partitions contained in the partition matching π_exp. Since the sets L_reac, R_reac, and W_other are pairwise disjoint, the number of partitions in π_exp is precisely |L_reac| + |R_reac| + |W_other|.

Now consider the set W_L = L_free ∪ L_reac. Let U_L be the set of elements that appear in the left-collection of a partition in W_L. We have:
- every P ∈ L_reac left-pairs an element in U_L;
- every element in U_L is left-paired;
- if an element a in U_L is left-paired by a partition P, then P ∈ L_reac.

Therefore, the partitions in L_reac precisely left-pair the elements in U_L. This gives |L_reac| = |U_L|. Since there are only |U_L| elements that appear in the left-collections of partitions in L_free ∪ L_reac, we conclude that the partitions in W_L = L_free ∪ L_reac can be used to left-pair at most |U_L| = |L_reac| elements in any partition matching in S.


Similarly, the partitions in the set W_R = R_free ∪ R_reac can be used to right-pair at most |R_reac| elements in any partition matching in S. Therefore, any partition matching in the collection S can include at most |L_reac| partitions from the set W_L, at most |R_reac| partitions from the set W_R, and at most all partitions in the set W_other. Consequently, a maximum partition matching in S consists of at most |L_reac| + |R_reac| + |W_other| partitions. Since the partition matching π_exp constructed by the algorithm greedy-expanding contains exactly this many partitions, π_exp is a maximum partition matching in the collection S.

Now it is clear how the maximum partition matching problem is solved.

Theorem 8 The maximum partition matching problem is solvable in time O(n^2 log n).

Proof Suppose that we are given a collection S = {C_1, ..., C_k} of pairwise disjoint subsets of U = {1, ..., n}. In case 2^k ≥ 4n, we can call the algorithm partition-matching-I to construct a maximum partition matching in time O(n^2). In case 2^k < 4n, we first call the algorithm greedy-expanding to construct a partition matching π_exp and compute the sets L_reac and R_reac. If L_reac and R_reac have no common partition, then according to the previous theorem, π_exp is already a maximum partition matching. Otherwise, we call the algorithm partition-flipping to construct a maximum partition matching. All of this can be done in time O(n^2 log n). A detailed analysis of this algorithm can be found in [3].

See also
Assignment and Matching
Assignment Methods in Clustering
Bi-objective Assignment Problem
Communication Network Assignment Problem
Frequency Assignment Problem
Quadratic Assignment Problem

References
1. Akers SB, Krishnamurthy B (1989) A group-theoretic model for symmetric interconnection networks. IEEE Trans Comput 38:555–565
2. Chen C-C, Chen J (1997) Optimal parallel routing in star networks. IEEE Trans Comput 48:1293–1303
3. Chen C-C, Chen J (1999) The maximum partition matching problem with applications. SIAM J Comput 28:935–954
4. Garey MR, Johnson DS (1979) Computers and intractability: A guide to the theory of NP-completeness. Freeman, New York

Maximum Satisfiability Problem
MAX-SAT
ROBERTO BATTITI
Dip. Mat., Università di Trento, Povo (Trento), Italy

MSC2000: 03B05, 68Q25, 90C09, 90C27, 68P10, 68R05, 68T15, 68T20, 94C10

Article Outline
Keywords
See also
References

Keywords
Maximum satisfiability; Local search; Approximation algorithms; History-sensitive heuristics

In the maximum satisfiability (MAX-SAT) problem one is given a Boolean formula in conjunctive normal form, i. e., as a conjunction of clauses, each clause being a disjunction of literals. The task is to find an assignment of truth values to the variables that satisfies the maximum number of clauses. Let n be the number of variables and m the number of clauses, so that a formula has the following form:

$$\bigwedge_{1 \le i \le m} \Bigl( \bigvee_{1 \le k \le |C_i|} l_{ik} \Bigr),$$

where |C_i| is the number of literals in clause C_i and l_{ik} is a literal, i. e., a propositional variable u_j or its negation ū_j, for 1 ≤ j ≤ n. The set of clauses in the formula is denoted by C. If one associates a weight w_i with each clause C_i, one obtains the weighted MAX-SAT problem, denoted MAX W-SAT: one is to determine the assignment of truth values to the n variables that maximizes the sum of the weights of the satisfied clauses.


In the literature one often considers problems with different numbers k of literals per clause, defined as MAX-k-SAT, or MAX W-k-SAT in the weighted case. In some papers MAX-k-SAT instances contain up to k literals per clause, while in other papers they contain exactly k literals per clause. We consider the second option unless otherwise stated.

MAX-SAT is of considerable interest not only from the theoretical side but also from the practical one. On one hand, the decision version SAT was the first example of an NP-complete problem [16]; moreover, MAX-SAT and related variants play an important role in the characterization of different approximation classes like APX and PTAS [5]. On the other hand, many issues in mathematical logic and artificial intelligence can be expressed in the form of satisfiability or some of its variants, like constraint satisfaction. Some exemplary problems are consistency in expert system knowledge bases [46], integrity constraints in databases [4,23], approaches to inductive inference [35,40], and asynchronous circuit synthesis [32]. An extensive review of algorithms for MAX-SAT appeared in [9].

M. Davis and H. Putnam [19] started in 1960 the investigation of useful strategies for handling resolution in the satisfiability problem. Davis, G. Logemann and D. Loveland [18] avoid the memory explosion of the original DP algorithm by replacing the resolution rule with the splitting rule. A recent review of advanced techniques for resolution and splitting is presented in [31].

The MAX W-SAT problem has a natural integer linear programming formulation. Let y_j = 1 if Boolean variable u_j is 'true', y_j = 0 if it is 'false', and let the Boolean variable z_i = 1 if clause C_i is satisfied, z_i = 0 otherwise. The integer linear program is:

$$\max \sum_{i=1}^{m} w_i z_i$$

subject to the constraints

$$\sum_{j \in U_i^+} y_j + \sum_{j \in U_i^-} (1 - y_j) \ge z_i, \quad i = 1, \dots, m,$$

$$y_j \in \{0, 1\}, \quad j = 1, \dots, n, \qquad z_i \in \{0, 1\}, \quad i = 1, \dots, m,$$

where U_i^+ and U_i^- denote the sets of indices of variables that appear unnegated and negated in clause C_i, respectively. If one neglects the objective function and sets all z_i variables to 1, one obtains an integer programming feasibility problem associated to the SAT problem [11].

The integer linear programming formulation of MAX-SAT suggests that this problem could be solved by a branch and bound method (cf. also Integer programming: Branch and bound methods). A usable method uses Chvátal cuts. In [35] it is shown that the resolvents in the propositional calculus correspond to certain cutting planes in the integer programming model of inference problems. Linear programming relaxations of integer linear programming formulations of MAX-SAT have been used to obtain upper bounds in [27,33,55]. A linear programming and rounding approach for MAX-2-SAT is presented in [13]. A method for strengthening the generalized set covering formulation is presented in [47], where Lagrangian multipliers guide the generation of cutting planes.

The first approximation algorithms with a 'guaranteed' quality of approximation [5] were proposed by D.S. Johnson [38] and use greedy construction strategies. The original paper [38] demonstrated for both of them a performance ratio 1/2. In detail, let k be the minimum number of variables occurring in any clause of the formula, m(x, y) the number of clauses satisfied by the feasible solution y on instance x, and m*(x) the maximum number of clauses that can be satisfied. For any integer k ≥ 1, the first algorithm achieves a feasible solution y of an instance x such that

$$\frac{m(x, y)}{m^*(x)} \ge 1 - \frac{1}{k+1},$$

while the second algorithm obtains

$$\frac{m(x, y)}{m^*(x)} \ge 1 - \frac{1}{2^k}.$$

Recently (1997) it has been proved [12] that the second algorithm reaches a performance ratio 2/3. There are formulas for which the second algorithm finds a truth assignment such that the ratio is exactly 2/3; therefore this bound cannot be improved [12].
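As an illustration of the integer programming formulation above, here is a minimal sketch using the PuLP modeling library; the clause encoding (a list of signed variable indices) and all names are our own choices, not part of the original formulation.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def max_wsat_ilp(n, clauses, weights):
    """clauses: list of clauses, each a list of nonzero ints;
    +j stands for variable u_j, -j for its negation (1 <= j <= n)."""
    prob = LpProblem("MAX_W_SAT", LpMaximize)
    y = [LpVariable(f"y{j}", cat=LpBinary) for j in range(1, n + 1)]
    z = [LpVariable(f"z{i}", cat=LpBinary) for i in range(len(clauses))]
    prob += lpSum(w * zi for w, zi in zip(weights, z))   # total satisfied weight
    for i, clause in enumerate(clauses):
        # z_i can be 1 only if at least one literal of clause i is true
        prob += lpSum(y[l - 1] if l > 0 else 1 - y[-l - 1] for l in clause) >= z[i]
    prob.solve()
    return [int(v.value()) for v in y]

# example: (u1 or not u2) and (u2 or u3), unit weights
print(max_wsat_ilp(3, [[1, -2], [2, 3]], [1, 1]))
```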


One of the most interesting approaches in the design of new algorithms is the use of randomization. During the computation, random bits are generated and used to influence the algorithm process. In many cases randomization allows one to obtain better (expected) performance or to simplify the construction of the algorithm. Two randomized algorithms that achieve a performance ratio of 3/4 have been proposed in [27] and [55]. Moreover, it is possible to derandomize these algorithms, that is, to obtain deterministic algorithms that preserve the same bound 3/4 for every instance. The approximation ratio 3/4 can be slightly improved [28]. T. Asano [2] (following [3]) has improved the bound to 0.77. For the restricted case of MAX-2-SAT, one can obtain a more substantial improvement (performance ratio 0.931) with the technique in [21]. If one considers only satisfiable MAX W-SAT instances, L. Trevisan [54] obtains a 0.8 approximation factor, while H. Karloff and U. Zwick [41] claim a 0.875 performance ratio for satisfiable instances of MAX W-3-SAT. A strong negative result about the approximability can be found in [36]: unless P = NP, MAX W-SAT cannot be approximated in polynomial time within a performance ratio greater than 7/8.

MAX-SAT is among the problems for which local search has been very successful: in practice, local search and its variations are the only efficient and effective methods to address large and complex real-world instances. Different variations of local search with randomness techniques have been proposed for SAT and MAX-SAT starting from the late 1980s, see for example [30,52], motivated by previous applications of 'min-conflicts' heuristics in the area of artificial intelligence [44]. The general scheme is based on generating a starting point in the set of admissible solutions and trying to improve it through the application of basic moves. The search space is given by all possible truth assignments. Let us consider the elementary changes to the current assignment obtained by changing a single truth value. The definitions are as follows. Let U = {0, 1}^n be the discrete search space and let f be the number of satisfied clauses. In addition, let U^(t) ∈ U be the current configuration along the search trajectory at iteration t, and N(U^(t)) the neighborhood of point U^(t), obtained by applying a set of basic moves μ_i (1 ≤ i ≤ n), where μ_i complements the i-th bit u_i of the string: μ_i(u_1, ..., u_i, ..., u_n) = (u_1, ..., 1 − u_i, ..., u_n), so that

$$N(U^{(t)}) = \{ U \in \mathcal{U} : U = \mu_i(U^{(t)}),\ i = 1, \dots, n \}.$$

The version of local search that we consider starts from a random initial configuration U^(0) ∈ U and generates a search trajectory as follows:

$$V = \mathrm{BESTNEIGHBOR}(N(U^{(t)})), \quad (1)$$

$$U^{(t+1)} = \begin{cases} V & \text{if } f(V) > f(U^{(t)}),\\ U^{(t)} & \text{if } f(V) \le f(U^{(t)}), \end{cases} \quad (2)$$

where BESTNEIGHBOR selects V ∈ N(U^(t)) with the best f value and ties are broken randomly. V in turn becomes the new current configuration if f improves. Other versions are satisfied with an improving (or non-worsening) neighbor, not necessarily the best one. Clearly, local search stops as soon as the first local optimum point is encountered, when no improving moves are available, see (2). Let us define as LS+ a modification of LS where a specified number of iterations are executed and the candidate move obtained by BESTNEIGHBOR is always accepted, even if the f value remains equal or worsens.

Properties of the number of clauses satisfied at a local optimum have been demonstrated. Let m* be the best (optimal) value and k the minimum number of literals contained in the problem clauses. Let m_loc be the number of satisfied clauses at a local optimum of any instance of MAX-SAT with at least k literals per clause. m_loc satisfies the following bound [34]:

$$m_{loc} \ge \frac{k}{k+1}\, m,$$

where m is the total number of clauses, and the bound is sharp. Therefore, since m ≥ m*, if m_loc is the number of satisfied clauses at a local optimum, then

$$m_{loc} \ge \frac{k}{k+1}\, m^*. \quad (3)$$
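A compact Python sketch of this basic local search (with the LS+ variant controlled by a flag) might look as follows; the clause encoding is the same hypothetical one used in the ILP example above, and the code is illustrative rather than tuned.

```python
import random

def satisfied(clauses, x):
    # a clause is satisfied if some literal +j has x[j-1] == 1, or -j has x[j-1] == 0
    return sum(any((l > 0) == bool(x[abs(l) - 1]) for l in c) for c in clauses)

def local_search(n, clauses, max_iter=1000, plus=False):
    x = [random.randint(0, 1) for _ in range(n)]     # random U^(0)
    f = satisfied(clauses, x)
    for _ in range(max_iter):
        # evaluate all n single-bit flips and keep the best (ties broken randomly)
        scores = []
        for i in range(n):
            x[i] ^= 1
            scores.append(satisfied(clauses, x))
            x[i] ^= 1
        best = max(scores)
        if not plus and best <= f:
            break                                    # local optimum reached (LS)
        i = random.choice([i for i, s in enumerate(scores) if s == best])
        x[i] ^= 1                                    # LS+ always accepts the move
        f = satisfied(clauses, x)
    return x, f
```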

State-of-the-art heuristics for MAX-SAT are obtained by complementing local search with schemes that are capable of producing better approximations beyond the locally optimal points. In some cases, these schemes generate a sequence of points in the set of admissible solutions in a way that is fixed before the search


starts. An example is given by multiple runs of local search starting from different random points. The algorithm does not take into account the history of the previous phase of the search when the next points are generated. The term 'memory-less' denotes this lack of feedback from the search history. In addition to the cited multiple-run local search, these techniques are based on Markov processes (simulated annealing; cf. also Simulated annealing methods in protein folding), 'plateau' search and 'random noise' strategies, or combinations of randomized constructions and local search. The use of a Markov process to generate a stochastic search trajectory is adopted, for example, in [53].

The Gsat algorithm was proposed in [52] as a model-finding procedure, i. e., to find an interpretation of the variables under which the formula comes out 'true'. Gsat consists of multiple runs of LS+, each run consisting of a number of iterations that is typically proportional to the problem dimension n. An empirical analysis of Gsat is presented in [24,25]. Different 'noise' strategies to escape from attraction basins are added to Gsat in [50,51]. A hybrid algorithm that combines a randomized greedy construction phase to generate initial candidate solutions, followed by a local improvement phase, is the GRASP scheme proposed in [48] for SAT and generalized to the MAX W-SAT problem in [49]. GRASP is an iterative process, with each iteration consisting of two phases, a construction phase and a local search phase.

Different history-sensitive heuristics have been proposed to continue local search schemes beyond local optimality. These schemes aim at intensifying the search in promising regions and at diversifying the search into uncharted territories by using the information collected from the previous phase (the history) of the search. Because of the internal feedback mechanism, some algorithm parameters can be modified and tuned in an on-line manner, to reflect the characteristics of the task to be solved and the local properties of the configuration space in the neighborhood of the current point. This tuning has to be contrasted with the off-line tuning of an algorithm, where some parameters or choices are determined for a given problem in a preliminary phase and they remain fixed when the algorithm runs on a specific instance.

Tabu search is a history-sensitive heuristic proposed by F. Glover [26] and, independently, by P. Hansen and B. Jaumard, who used the term 'SAMD' (steepest ascent mildest descent) and applied it to the MAX-SAT problem in [34]. The main mechanism by which the history influences the search in tabu search is that, at a given iteration, some neighbors are prohibited: only a nonempty subset N_A(U^(t)) ⊆ N(U^(t)) of them is allowed. The general way of generating the search trajectory that we consider is given by:

$$N_A(U^{(t)}) = \mathrm{ALLOW}(N(U^{(t)}),\ U^{(0)}, \dots, U^{(t)}), \quad (4)$$

$$U^{(t+1)} = \mathrm{BESTNEIGHBOR}(N_A(U^{(t)})). \quad (5)$$

The set-valued function ALLOW selects a nonempty subset of N(U^(t)) in a manner that depends on the entire previous history of the search U^(0), ..., U^(t). A specialized tabu search heuristic is used in [37] to speed up the search for a solution (if the problem is satisfiable) as part of a branch and bound algorithm for SAT, which adopts both a relaxation and a decomposition scheme by using polynomial instances, i. e., 2-SAT and Horn-SAT.

Different methods to generate prohibitions produce discrete dynamical systems with qualitatively different search trajectories. In particular, prohibitions based on a list of moves lead to a faster escape from a locally optimal point than prohibitions based on a list of visited configurations [6]. In detail, the function ALLOW can be specified by introducing a prohibition parameter T (also called list size) that determines how long a move will remain prohibited after its execution. The fixed tabu search algorithm is obtained by fixing T throughout the search [26]. A neighbor is allowed if and only if it is obtained from the current point by applying a move that has not been used during the last T iterations. In detail, if LU(μ) is the last usage time of move μ (LU(μ) = −∞ at the beginning):

$$N_A(U^{(t)}) = \{ U = \mu(U^{(t)}) : LU(\mu) < (t - T) \}.$$

The reactive tabu search algorithm of [10] defines simple rules to determine the prohibition parameter by reacting to the repetition of previously visited configurations. One has a repetition if U^(t+R) = U^(t) for R ≥ 1.
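The fixed-prohibition rule can be sketched in a few lines of Python; last_used plays the role of LU(μ), and the MAX-SAT machinery reuses the hypothetical satisfied() helper from the local search sketch above.

```python
import random

def tabu_maxsat(n, clauses, T=5, max_iter=1000):
    # fixed tabu search: a move (bit flip) is prohibited for T iterations
    x = [random.randint(0, 1) for _ in range(n)]
    last_used = [float("-inf")] * n                 # LU(mu_i) = -infinity at start
    best_x, best_f = x[:], satisfied(clauses, x)
    for t in range(max_iter):
        allowed = [i for i in range(n) if last_used[i] < t - T]

        def flip_score(i):
            x[i] ^= 1
            s = satisfied(clauses, x)
            x[i] ^= 1
            return s

        i = max(allowed, key=flip_score)            # best allowed neighbor,
        x[i] ^= 1                                   # accepted even if it worsens f
        last_used[i] = t
        f = satisfied(clauses, x)
        if f > best_f:
            best_x, best_f = x[:], f
    return best_x, best_f
```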


The prohibition period T depends on the iteration t, and a reaction equation is added to the dynamical system:

$$T^{(t)} = \mathrm{REACT}(T^{(t-1)},\ U^{(0)}, \dots, U^{(t)}).$$

An algorithm that combines local search and non-oblivious local search [8], the use of prohibitions, and a reactive scheme to determine the prohibition parameter is the Hamming-reactive tabu search algorithm proposed in [7], which also contains a detailed experimental analysis.

Given the hardness of the problem and its relevance for applications in different fields, the emphasis on the experimental analysis of algorithms for the MAX-SAT problem has been growing in recent years (as of 2000). In some cases the experimental comparisons have been executed in the framework of 'challenges,' with support of electronic collection and distribution of software, problem generators and test instances. An example is the Second DIMACS algorithm implementation challenge on cliques, coloring and satisfiability, whose results have been published in [39]. Practical and industrial MAX-SAT problems and benchmarks, with significant case studies, are also presented in [20]. Some basic problem models that are considered both in theoretical and in experimental studies of MAX-SAT algorithms are described in [31].

Different algorithms demonstrate a different degree of effort, measured by number of elementary steps or CPU time, when solving different kinds of instances. For example, in [45] it is found that some distributions used in past experiments are of little interest because the generated formulas are almost always very easy to satisfy. It also reports that one can generate very hard instances of k-SAT, for k ≥ 3. In addition, it reports the following observed behavior for random fixed-length 3-SAT formulas: if r is the ratio of clauses to variables (r = m/n), almost all formulas are satisfiable if r < 4 and almost all formulas are unsatisfiable if r > 4.5. A rapid transition seems to appear for r ≈ 4.2, the same point where the computational complexity for solving the generated instances is maximized; see [17,42] for reviews of experimental results.

Let κ be the least real number such that, if r is larger than κ, then the probability of C being satisfiable converges to 0 as n tends to infinity. A notable result, found independently by many people including [22] and [14], is that

$$\kappa \le \log_{8/7} 2 = 5.191.$$

A series of theoretical analyses aim at approximating the unsatisfiability threshold of random formulas [1,15,29,43].

See also
Greedy Randomized Adaptive Search Procedures
Integer Programming

References
1. Achlioptas D, Kirousis LM, Kranakis E, Krizanc D (1997) Rigorous results for random (2 + p)-SAT. In: Proc. Work. on Randomized Algorithms in Sequential, Parallel and Distributed Computing (RALCOM 97), Santorini, Greece, pp 1–10
2. Asano T (1997) Approximation algorithms for MAX-SAT: Yannakakis vs. Goemans–Williamson. In: Proc. 3rd Israel Symp. on the Theory of Computing and Systems, Ramat Gan, Israel, pp 24–37
3. Asano T, Ono T, Hirata T (1996) Approximation algorithms for the maximum satisfiability problem. In: Proc. 5th Scandinavian Work. Algorithms Theory, pp 110–111
4. Asirelli P, de Santis M, Martelli A (1985) Integrity constraints in logic databases. J Logic Programming 3:221–232
5. Ausiello G, Crescenzi P, Protasi M (1995) Approximate solution of NP optimization problems. Theoret Comput Sci 150:1–55
6. Battiti R (1996) Reactive search: Toward self-tuning heuristics. In: Rayward-Smith VJ, Osman IH, Reeves CR, Smith GD (eds) Modern Heuristic Search Methods. Wiley, New York, pp 61–83
7. Battiti R, Protasi M (1997) Reactive search, a history-sensitive heuristic for MAX-SAT. ACM J Experimental Algorithmics 2:2
8. Battiti R, Protasi M (1997) Solving MAX-SAT with non-oblivious functions and history-based heuristics. In: Du D-Z, Gu J, Pardalos PM (eds) Satisfiability Problem: Theory and Applications. DIMACS, vol 35. Amer. Math. Soc. and ACM, Providence, RI, pp 649–667
9. Battiti R, Protasi M (1998) Approximate algorithms and heuristics for MAX-SAT. In: Du D-Z, Pardalos PM (eds) Handbook Combinatorial Optim. Kluwer, Dordrecht, pp 77–148
10. Battiti R, Tecchiolli G (1994) The reactive tabu search. ORSA J Comput 6(2):126–140
11. Blair CE, Jeroslow RG, Lowe JK (1986) Some results and experiments in programming for propositional logic. Comput Oper Res 13(5):633–645
12. Chen J, Friesen D, Zheng H (1997) Tight bound on Johnson's algorithm for MAX-SAT. In: Proc. 12th Annual IEEE Conf. Computational Complexity (Ulm, Germany), pp 274–281
13. Cheriyan J, Cunningham WH, Tuncel T, Wang Y (1996) A linear programming and rounding approach to MAX 2-SAT. In: Trick M, Johnson DS (eds) Proc. Second DIMACS Algorithm Implementation Challenge on Cliques, Coloring and Satisfiability. DIMACS, vol 26, pp 395–414
14. Chvátal V, Szemerédi E (1988) Many hard examples for resolution. J ACM 35:759–768
15. Chvátal V, Reed B (1992) Mick gets some (the odds are on his side). In: Proc. 33rd Ann. IEEE Symp. on Foundations of Computer Sci., IEEE Computer Soc., pp 620–627
16. Cook SA (1971) The complexity of theorem-proving procedures. In: Proc. Third Annual ACM Symp. Theory of Computing, pp 151–158
17. Cook SA, Mitchell DG (1997) Finding hard instances of the satisfiability problem: A survey. In: Du D-Z, Gu J, Pardalos PM (eds) Satisfiability Problem: Theory and Applications. DIMACS, vol 35. Amer. Math. Soc. and ACM, Providence, RI, pp 1–17
18. Davis M, Logemann G, Loveland D (1962) A machine program for theorem proving. Comm ACM 5:394–397
19. Davis M, Putnam H (1960) A computing procedure for quantification theory. J ACM 7:201–215
20. Du D-Z, Gu J, Pardalos PM (eds) (1997) Satisfiability problem: Theory and applications. DIMACS, vol 35. Amer. Math. Soc. and ACM, Providence, RI
21. Feige U, Goemans MX (1995) Approximating the value of two prover proof systems, with applications to MAX 2SAT and MAX DICUT. In: Proc. Third Israel Symp. Theory of Computing and Systems, pp 182–189
22. Franco J, Paull M (1983) Probabilistic analysis of the Davis–Putnam procedure for solving the satisfiability problem. Discrete Appl Math 5:77–87
23. Gallaire H, Minker J, Nicolas JM (1984) Logic and databases: A deductive approach. Computing Surveys 16(2):153–185
24. Gent IP, Walsh T (1993) An empirical analysis of search in GSAT. J Artif Intell Res 1:47–59
25. Gent IP, Walsh T (1993) Towards an understanding of hill-climbing procedures for SAT. In: Proc. Eleventh Nat. Conf. Artificial Intelligence, AAAI Press/MIT, pp 28–33
26. Glover F (1989) Tabu search: Part I. ORSA J Comput 1(3):190–260
27. Goemans MX, Williamson DP (1994) New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM J Discret Math 7(4):656–666
28. Goemans MX, Williamson DP (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J ACM 42(6):1115–1145
29. Goerdt A (1996) A threshold for unsatisfiability. J Comput Syst Sci 53:469–486
30. Gu J (1992) Efficient local search for very large-scale satisfiability problem. ACM SIGART Bull 3(1):8–12
31. Gu J, Purdom PW, Franco J, Wah BW (1997) Algorithms for the satisfiability (SAT) problem: A survey. In: Du D-Z, Gu J, Pardalos PM (eds) Satisfiability Problem: Theory and Applications. DIMACS, vol 35. Amer. Math. Soc. and ACM, Providence, RI
32. Gu J, Puri R (1995) Asynchronous circuit synthesis with Boolean satisfiability. IEEE Trans Computer-Aided Design Integr Circuits 14(8):961–973
33. Hammer PL, Hansen P, Simeone B (1984) Roof duality, complementation and persistency in quadratic 0-1 optimization. Math Program 28:121–155
34. Hansen P, Jaumard B (1990) Algorithms for the maximum satisfiability problem. Computing 44:279–303
35. Hooker JN (1988) Resolution vs. cutting plane solution of inference problems: Some computational experience. Oper Res Lett 7(1):1–7
36. Håstad J (1997) Some optimal inapproximability results. In: Proc. 28th Annual ACM Symp. on Theory of Computing, El Paso, Texas, pp 1–10
37. Jaumard B, Stan M, Desrosiers J (1996) Tabu search and a quadratic relaxation for the satisfiability problem. In: Trick M, Johnson DS (eds) Proc. Second DIMACS Algorithm Implementation Challenge on Cliques, Coloring and Satisfiability. DIMACS, vol 26, pp 457–477
38. Johnson DS (1974) Approximation algorithms for combinatorial problems. J Comput Syst Sci 9:256–278
39. Johnson DS, Trick M (eds) (1996) Cliques, coloring, and satisfiability: Second DIMACS implementation challenge. DIMACS, vol 26. Amer. Math. Soc., Providence, RI
40. Kamath AP, Karmarkar NK, Ramakrishnan KG, Resende MG (1990) Computational experience with an interior point algorithm on the satisfiability problem. Ann Oper Res 25:43–58
41. Karloff H, Zwick U (1997) A 7/8-approximation algorithm for MAX 3SAT? In: Proc. 38th Annual IEEE Symp. Foundations of Computer Sci., IEEE Computer Soc
42. Kirkpatrick S, Selman B (1994) Critical behavior in the satisfiability of random Boolean expressions. Science 264:1297–1301
43. Kirousis LM, Kranakis E, Krizanc D (1996) Approximating the unsatisfiability threshold of random formulas. In: Proc. Fourth Annual European Symp. Algorithms. Springer, Berlin, pp 27–38
44. Minton S, Johnston MD, Philips AB, Laird P (1990) Solving large-scale constraint satisfaction and scheduling problems using a heuristic repair method. In: Proc. 8th Nat. Conf. Artificial Intelligence (AAAI-90), pp 17–24
45. Mitchell D, Selman B, Levesque H (1992) Hard and easy distributions of SAT problems. In: Proc. 10th Nat. Conf. Artificial Intelligence (AAAI-92), pp 459–465
46. Nguyen TA, Perkins WA, Laffrey TJ, Pecora D (1985) Checking an expert system knowledge base for consistency and completeness. In: Proc. Internat. Joint Conf. on Artificial Intelligence, pp 375–378
47. Nobili P, Sassano A (1996) Strengthening Lagrangian bounds for the MAX-SAT problem. Techn. Report Inst. Informatik Köln Univ., Germany, no. 96–230; Franco J, Gallo G, Kleine Buening H (eds) Proc. Work. Satisfiability Problem, Siena, Italy
48. Resende MGC, Feo TA (1996) A GRASP for satisfiability. In: Trick M, Johnson DS (eds) Proc. Second DIMACS Algorithm Implementation Challenge on Cliques, Coloring and Satisfiability. DIMACS, vol 26. Amer. Math. Soc., Providence, RI, pp 499–520
49. Resende MGC, Pitsoulis LS, Pardalos PM (1997) Approximate solution of weighted MAX-SAT problems using GRASP. In: Du D-Z, Gu J, Pardalos PM (eds) Satisfiability Problem: Theory and Applications. DIMACS, vol 35. Amer. Math. Soc., Providence, RI
50. Selman B, Kautz H (1993) Domain-independent extensions to GSAT: Solving large structured satisfiability problems. In: Proc. Internat. Joint Conf. Artificial Intelligence, pp 290–295
51. Selman B, Kautz HA, Cohen B (1996) Local search strategies for satisfiability testing. In: Trick M, Johnson DS (eds) Proc. Second DIMACS Algorithm Implementation Challenge on Cliques, Coloring and Satisfiability. DIMACS, vol 26, pp 521–531
52. Selman B, Levesque H, Mitchell D (1992) A new method for solving hard satisfiability problems. In: Proc. 10th Nat. Conf. Artificial Intelligence (AAAI-92), pp 440–446
53. Spears WM (1996) Simulated annealing for hard satisfiability problems. In: Trick M, Johnson DS (eds) Proc. Second DIMACS Algorithm Implementation Challenge on Cliques, Coloring and Satisfiability. DIMACS, vol 26, pp 533–555
54. Trevisan L (1997) Approximating satisfiable satisfiability problems. In: Proc. 5th Annual European Symp. Algorithms, Graz. Springer, Berlin, pp 472–485
55. Yannakakis M (1994) On the approximation of maximum satisfiability. J Algorithms 17:475–502

Medium-Term Scheduling of Batch Processes
STACY L. JANAK, CHRISTODOULOS A. FLOUDAS
Department of Chemical Engineering, Princeton University, Princeton, USA

Article Outline
Introduction
Problem Statement
Formulation
  Models
  Short-Term Scheduling Model
Cases
  Case 1: Nominal Run without Campaign Mode Production
Conclusions
References

Introduction
In multiproduct and multipurpose batch plants, different products can be manufactured via the same or a similar sequence of operations by sharing available pieces of equipment, intermediate materials, and other production resources. Such plants are ideally suited to manufacture products that are produced in small quantities or for which the production recipe or the customer demand pattern is likely to change. The inherent operational flexibility of this type of plant provides the opportunity for increased savings through the realization of an efficient production schedule, which can reduce inventories, production and transition costs, and production shortfalls.

The problem of production scheduling and planning for multiproduct and multipurpose batch plants has received a considerable amount of attention during the last two decades. Extensive reviews have been written by Reklaitis [10], Pantelides [9], Shah [11] and, more recently, by Floudas and Lin [4,5]. Most of the work in the area of multiproduct batch plants has dealt with either the long-term planning problem or the short-term scheduling problem. Both planning and scheduling deal with the allocation of available resources over time to perform a set of tasks required to manufacture one or more products. However, long-term planning problems deal with longer time horizons (e. g., several months or years) and are focused on higher-level decisions such as timing and location of additional facilities and levels of production. In contrast, short-term scheduling models address shorter time horizons (e. g., several days) and are focused on determining detailed sequencing of various operational tasks. The area of medium-term scheduling, however, which involves medium time horizons (e. g., several weeks) and still aims to determine detailed production schedules, can result in very large-scale problems and has received much less attention in the literature.

For medium-term scheduling, relatively little work has been presented in the literature.


Medium-term scheduling can be quite computationally complex, thus it is common for mathematical programming techniques to be used in its solution. The most widely employed strategy to overcome the computational difficulty is based on the idea of decomposition. The decomposition approach divides a large and complex problem, which may be computationally expensive or even intractable when formulated and solved directly as a single MILP model, into smaller subproblems, which can be solved much more efficiently. A wide variety of decomposition approaches have been proposed in the literature. In addition to decomposition techniques developed for general forms of MILP problems, various approaches that exploit the characteristics of specific process scheduling problems have also been proposed. In most cases, the decomposition approaches only lead to suboptimal solutions; however, they substantially reduce the problem complexity and the solution time, making MILP-based techniques applicable to large, real-world problems.

In this chapter, we propose an enhanced State-Task Network MILP model for the medium-term production scheduling of a multipurpose, multiproduct industrial batch plant. The proposed approach extends the work of Ierapetritou and Floudas [6] and Lin et al. [8] to consider a large-scale production facility and account for various storage policies (UIS, NIS, ZW), variable batch sizes and processing times, batch mixing and splitting, sequence-dependent changeover times, intermediate due dates, products used as raw materials, and several modes of operation. The methodology consists of the decomposition of the whole scheduling period into successive short horizons of a few days. A decomposition model is implemented to determine each short horizon and the corresponding products to be included. Then, a novel continuous-time formulation for short-term scheduling of batch processes with multiple intermediate due dates is applied to each short horizon selected, leading to a large-scale mixed-integer linear programming (MILP) problem. The scheduling model includes over 80 pieces of equipment and can take into account the processing recipes of hundreds of different products. Several characteristics of the production plant are incorporated into the scheduling model and actual plant data are used to model all parameters.

Problem Statement
In the multiproduct batch plant investigated, there are several different types of operations (or tasks), termed operation type 1 to operation type 6. The plant has many different types of units, and over 80 are modeled explicitly. Hundreds of different products can be produced, and for each of them one of the processing recipes shown in Fig. 1, or a slight variation, is applied. The recipes are represented in the form of a State-Task Network (STN), in which a state node is denoted by a circle and a task node by a rectangle. The STN representation provides the flow of material through various tasks in the production facility to produce different types of final products; it does not represent the actual connectivity of equipment in the plant.

Medium-Term Scheduling of Batch Processes, Figure 1 State-task network (STN) representation of plant

For the first type of STN shown in Fig. 1, raw materials (or state F) are fed into a type 1 unit and undergo operation type 1 to produce an intermediate (or state I1). This intermediate then undergoes operation type 3 in a type 3 unit to produce another intermediate (or state I2). This second intermediate is then sent to a type 4b unit before the resulting intermediate material (or state I3) is sent to a type 6 unit to undergo an operation type 6 task to produce a final product (or state P). The information on which units are suitable for each product is given. All the units are utilized in a batch mode with the exception of the type 5 and 6 units, which operate in a continuous mode. The capacity limits of the type 1, type 2, and type 3 units vary from one product to another, while the capacity limits of the type 4a, 4b, 5 and 6 units are the same for all suitable products. The processing time or processing rate of each task in the suitable units is also specified. Also, some products require other products as their raw materials, creating very complicated state-task networks.

The time horizon considered for production scheduling is a few weeks or longer. Customer orders are fixed throughout the time horizon with specified amounts and due dates. There is no limitation on external raw materials, and we apply the zero-wait storage condition or limited intermediate storage capacity for all materials based on actual plant data. There are two different types of products produced, category 1 and category 2.

The sixth STN shown in Fig. 1 shows a special type of product, denoted as a campaign product. For this type of product, raw materials are fed into up to three type 1 units and undergo operation type 1 to produce an intermediate, or state I1. This intermediate is then sent to one of two type 4a units before being processed in the type 5 unit, which is a continuous unit. Finally, the intermediate material (or state I3) is sent to a type 6 unit, producing a final campaign product (or state P). Because product changeovers in the type 5 unit can be undesirable, there was a need to introduce the ability to fix campaigns for continuous production of a single product in the type 5 unit, called campaign mode production.

Formulation
The overall methodology for solving the medium-range production scheduling problem is to decompose the large and complex problem into smaller short-term scheduling subproblems in successive time horizons [8]. The flowchart for this rolling horizon approach is shown in Fig. 2. The first step is to input relevant data into the formulation. Then, if necessary, campaign mode production is determined. Next, the overall medium-term scheduling problem is considered. A decomposition model is formulated and solved

to determine the current time horizon and the corresponding products that should be included in the current subproblem. According to the solution of the decomposition model, a short-term scheduling model is formulated using the information on customer orders, inventory levels, and processing recipes. The resulting MILP is a large-scale, complex problem which requires a large computational effort for its solution. When a satisfactory solution is determined, the relevant data are output and the next time horizon is considered. The above procedure is applied iteratively in an automatic fashion until the whole time horizon under consideration has been scheduled. Note that the decomposition model determines how many days and products to consider in the shorter scheduling horizon, subject to an upper limit on the complexity of the resulting mathematical model. Products are selected for the scheduling horizon if there is an order for the product, if the product has an order within a set amount of time into the future, if the product is used as a raw material for another product which is included, if the product was still processing in the previous scheduling horizon, or if the product is a campaign product and is included in a campaign for the current horizon.
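The iterative procedure just described can be summarized in a short Python skeleton. The two solver callbacks and the inventory argument are hypothetical stand-ins for the decomposition MILP and the short-term scheduling MILP; only the control flow follows the flowchart of Fig. 2.

```python
def medium_term_schedule(orders, total_days, solve_decomposition,
                         solve_short_term, initial_inventory):
    """Rolling-horizon loop: repeatedly pick a short horizon and its
    products (decomposition model), solve the detailed short-term MILP,
    then roll the inventory state forward. All callback names are
    illustrative, not part of the original formulation."""
    schedules, inventory, day = [], initial_inventory, 0
    while day < total_days:
        # decomposition model: length of the next short horizon + products
        sub_days, products = solve_decomposition(orders, inventory, day)
        # detailed continuous-time scheduling on the selected horizon
        result = solve_short_term(products, orders, inventory, day, sub_days)
        schedules.append(result)
        inventory = result["final_inventory"]   # carry state into next horizon
        day += sub_days
    return schedules
```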


Medium-Term Scheduling of Batch Processes, Figure 2 Flowchart of the rolling horizon approach

Models
A key component of the rolling horizon approach is the determination of the time horizon and the products which should be included for each short-term scheduling subproblem. We extend the two-level decomposition formulation of Lin et al. [8], which partitions the entire scheduling horizon into shorter subhorizons by taking into account the trade-off between demand satisfaction, unit utilization, and model complexity. In the first level, the number of days in the time horizon and the main products which should be included are determined. In the second level, additional products are added to the horizon so that each of the first-stage units, or type 1 units, is fully utilized.

Short-Term Scheduling Model
Once the decomposition model has determined the days in the time horizon and the products to be included, a novel continuous-time formulation for short-term scheduling with multiple intermediate due dates is applied to determine the detailed production schedule. This formulation is based on the models of Floudas and coworkers [6,7,8] and is expanded and enhanced in this work to take into account specific aspects of the problem under consideration. The proposed short-term scheduling formulation requires the following indices, sets, parameters and variables:

Indices:
d: days;
i: processing tasks;
j: units;
k: orders;
n: event points representing the beginning of a task;
s: states;

Sets:
D: days in the overall scheduling horizon;
D^in: days in the current scheduling horizon;
I: processing tasks;
I_j: tasks which can be performed in unit (j);
I_k: tasks which process order (k);
I_s^c: tasks which consume state (s);
I_s^p: tasks which produce state (s);
I^in: tasks which are included in the current scheduling horizon;
I^T5: tasks which are used to determine the type 5 unit campaign;
I^T6b: tasks which are used to perform operation type 6 for category 1 products;
J: units;
J_i: units which are suitable for performing task (i);
J^p: units which are suitable for performing only processing tasks, or operation type 1, 2, 3, and 5 tasks;
J^T1: units which are suitable for performing only operation type 1 tasks;
J^T4: units which are suitable for performing only operation type 4a and 4b tasks;
J^T5: units which are used to determine the type 5 unit campaign;
J^T6: units which are suitable for performing only operation type 6 tasks;
K: orders;
K_i: orders which are processed by task (i);
K_s: orders which produce state (s);
K^in: orders which are included in the current scheduling horizon;
N: event points within the time horizon;
S: states;
S_k: states which are used to satisfy order (k);
S^cat1: states which are category 1 final products;
S^cat2: states which are category 2 final products;
S^cpm: states which have minimum or maximum storage limitations;
S^f: states which are final products, after operation type 6;
S^i: states which are intermediate products, before operation type 6;
S^in: states which are included in the current scheduling horizon;
S^p: states which are either final or intermediate products;
S^rw: states which are products and are used as raw materials for other products;
S^st: states which have no intermediate storage;
S^T5: states which are used to determine the type 5 unit campaign;
S^unl: states which have unlimited intermediate storage;
S^0: states which are external raw materials;

Parameters:
B_s^max: the maximum suitable batch size used to produce product state (s);
B_s^min: the minimum suitable batch size used to produce product state (s);
C: a large constant (e. g., 10000);
cap_{ij}^max: maximum capacity for task (i) in unit (j);
cap_{ij}^min: minimum capacity for task (i) in unit (j);
dem_s: demand for state (s) in the current scheduling horizon;
dem_s^rw: demand for raw material product state (s);
dem_s^tot: total demand for state (s) in the overall horizon;
duek_{ksd}: due date of order (k) for state (s) on day (d);
ExtraTime_i: amount of time needed for the operation type 3 task after processing task (i);
FixedTime_{ij}: constant term of the processing time for task (i) in unit (j);
H: time horizon;
mintasks: the minimum number of tasks that must occur in the first-stage processing units, J^T1;
N^max: the maximum number of event points in the scheduling horizon;
praw_{ss′}: 0-1 parameter relating final product (s) to its raw material product (s′);
price_s: price of state (s);
prior_s: priority of product state (s);
prior_s^raw: priority of raw material state (s);
RateCT_{ij}: variable term of the processing time for task (i) in unit (j);
rk_{ksd}: amount of order (k) for state (s) on day (d);
start_j: the time at which unit (j) first becomes available in the current scheduling horizon;
stcap_s^max: maximum capacity for storage of state (s);
stcap_s^min: minimum capacity for storage of state (s);
α: coefficient for the demand satisfaction of individual orders term;
β: coefficient for the due date satisfaction of individual orders term;
γ: coefficient for the overall demand satisfaction slack variable term;
δ: coefficient for the minimum inventory requirement in dedicated units term;
ε: coefficient for the artificial demands on raw material states term;
ζ: coefficient for the minimization of binary variables term;
η: coefficient for the minimization of active start times term;
θ: a small constant (e. g., 0.01);
ρ_{is}^c: proportion of state (s) consumed by task (i);
ρ_{is}^p: proportion of state (s) produced by task (i);
τ_{ii′}: sequence-dependent setup time between tasks (i) and (i′);
φ: coefficient for the satisfaction of orders term;
ω: coefficient for the overall production term;

Binary Variables:
wv(i, j, n): assigns the beginning of task (i) in unit (j) at event point (n);
y(i, k, n): assigns the delivery of order (k) through task (i) at event point (n);

Continuous Variables:
B(i, j, n): amount of material undertaking task (i) in unit (j) at event point (n);
D(s, n): amount of state (s) delivered at event point (n);
Df(s, n): amount of state (s) delivered after the last event point;
kD(k, s, n): amount of state (s) delivered at event point (n) for order (k);
kDf(k, s, n): amount of state (s) delivered after the last event point for order (k);
sla1(k, s, d): amount of state (s) due on day (d) for order (k) that is not delivered;
sla2(k, s, d): amount of state (s) due on day (d) for order (k) that is over-delivered;
slcap(s, n): amount of state (s) that is deficient in its dedicated storage unit at event point (n);
sll(s): amount of state (s) due in the current time horizon but not made;
sllraw(s): amount of raw material product state (s) artificially due in the current time horizon but not made;
slorder(k): 0-1 variable indicating if order (k) was met;
slt1(k, s, d): amount of time state (s) due on day (d) for order (k) is late;
slt2(k, s, d): amount of time state (s) due on day (d) for order (k) is early;
ST(s, n): amount of state (s) at event point (n);
STF(s): final amount of state (s) at the end of the current time horizon;
STO(s): initial amount of state (s) at the beginning of the current time horizon;
T_f(i, j, n): time at which task (i) finishes in unit (j) at event point (n);
T_s(i, j, n): time at which task (i) starts in unit (j) at event point (n);
tot(s): total amount of state (s) made in the current time horizon;
tts(i, j, n): starting time of the active task (i) in unit (j) at event point (n);

On the basis of this notation, the mathematical model for the short-term scheduling of an industrial batch plant with intermediate due dates involves the following constraints:

(1)  $\sum_{i \in I^{in} \cap I_j} wv(i,j,n) \le 1, \quad \forall j \in J,\ n \in N,\ n \le N^{max}$

(2)  $cap^{min}_{ij}\, wv(i,j,n) \le B(i,j,n) \le cap^{max}_{ij}\, wv(i,j,n), \quad \forall i \in I^{in},\ j \in J_i,\ n \in N,\ n \le N^{max}$

(3)  $ST(s,n) = 0, \quad \forall s \in S^{in} \cap S^{st},\ s \notin S^{cpm} \cup S^{unl},\ n \in N,\ n \le N^{max}$

(4)  $ST(s,n) \ge stcap^{min}_s - slcap(s,n), \quad \forall s \in S^{in} \cap S^{cpm},\ n \in N,\ n \le N^{max}$

(5)  $ST(s,n) \le stcap^{max}_s, \quad \forall s \in S^{in} \cap S^{cpm},\ n \in N,\ n \le N^{max}$

(6)  $ST(s,n) = ST(s,n-1) - D(s,n) + \sum_{i \in I^{p}_s} \rho^{p}_{is} \sum_{j \in J_i} B(i,j,n-1) - \sum_{i \in I^{c}_s} \rho^{c}_{is} \sum_{j \in J_i} B(i,j,n), \quad \forall s \in S^{in},\ n \in N,\ n > 1,\ n \le N^{max}$

(7)  $ST(s,n) = STO(s) - D(s,n) - \sum_{i \in I^{c}_s} \rho^{c}_{is} \sum_{j \in J_i} B(i,j,n), \quad \forall s \in S^{in},\ n = 1$

(8)  $STF(s) = ST(s,n) - D^f(s,n) + \sum_{i \in I^{p}_s} \rho^{p}_{is} \sum_{j \in J_i} B(i,j,n), \quad \forall s \in S^{in},\ n = N^{max}$

(9)  $T^f(i,j,n) = T^s(i,j,n) + FixedTime_{ij}\, wv(i,j,n) + RateCT_{ij}\, B(i,j,n), \quad \forall i \in I^{in},\ j \in (J^{p} \cup J^{T6}) \cap J_i,\ n \le N^{max}$

(10)  $T^f(i,j,n) \ge T^s(i,j,n), \quad \forall i \in I^{in},\ j \in J^{T4} \cap J_i,\ n \le N^{max}$

(11)  $T^f(i,j,n) = H, \quad \forall s \in S^{in} \cap S^{st},\ s \notin S^{unl},\ i \in I^{in} \cap I^{p}_s,\ j \in J^{T4} \cap J_i,\ n = N^{max}$

(12)  $T^s(i,j,n+1) \ge T^f(i,j,n) + ExtraTime_i\, wv(i,j,n), \quad \forall i \in I^{in},\ j \in J_i,\ n < N^{max}$

(13)  $T^s(i,j,n+1) \ge T^f(i',j,n) + (\tau_{i'i} + ExtraTime_{i'})\, wv(i',j,n) - H\,[1 - wv(i',j,n)], \quad \forall j \in J,\ i,i' \in I^{in} \cap I_j,\ i \ne i',\ n < N^{max}$

(14)  $T^s(i,j,n+1) \ge T^f(i',j',n) - H\,[1 - wv(i',j',n)], \quad \forall s \in S^{in},\ i \in I^{in} \cap I^{c}_s,\ i' \in I^{in} \cap I^{p}_s,\ j \in J_i,\ j' \in J_{i'},\ j \ne j',\ n < N^{max}$

(15)  $T^s(i,j,n+1) \le T^f(i',j',n) + H\,[2 - wv(i',j',n) - wv(i,j,n+1)], \quad \forall s \in S^{in} \cap S^{st},\ s \notin S^{unl},\ i \in I^{in} \cap I^{c}_s,\ i' \in I^{in} \cap I^{p}_s,\ j \in J_i,\ j' \in J_{i'},\ j \ne j',\ n < N^{max}$

The allocation constraints in (1) express the requirement that for each unit (j) and at each event point (n), only one of the tasks that can be performed in the unit (i. e., $i \in I_j$) should take place. The capacity constraints in (2) express the requirement for the batch size of a task (i) processing in a unit (j) at event point (n), B(i,j,n), to be greater than the minimum amount of material, $cap^{min}_{ij}$, and less than the maximum amount of material, $cap^{max}_{ij}$, that can be processed by task (i) in unit (j). The storage constraints in (3) enforce that those states with no intermediate storage have to be consumed by some processing task or storage task immediately after they are produced. Constraints (4) represent the minimum required storage for state (s) in a dedicated storage tank, where this amount can be violated, if necessary, by an amount slcap(s,n) which is penalized in the objective function. Constraints (5) represent the maximum available storage capacity for state (s) based on the maximum storage capacity of the dedicated storage tank. According to the material balance constraints in (6), the amount of material of state (s) at event point (n) is equal to that at event point (n−1), increased by any amounts produced at event point (n−1), decreased by any amounts consumed at event point (n), and decreased by the amount required by the market at event point (n), D(s,n). Constraints (7)–(8) represent the material balance on state (s) at the first and last event points, respectively. The duration constraints in (9) represent the relationship between the starting and finishing times of task (i) in unit (j) at event point (n) for all processing tasks (i. e., $J^{p}$) and all operation type 6 tasks (i. e., $J^{T6}$), where $FixedTime_{ij}$ is the fixed processing time for batch tasks and zero for continuous tasks, and $RateCT_{ij}$ is the inverse of the processing rate for continuous tasks and zero for batch tasks. Constraints (10) also represent the relationship between the starting and finishing times of task (i) in unit (j) at event point (n), but for operation type 4a and 4b tasks (i. e., $J^{T4}$). They do not impose exact durations for tasks in these units but just enforce that all tasks must end after they start. Constraints (11) are written only for tasks in units which are processing a nonstorable state (i. e., $S^{st}$ and not $S^{unl}$) and enforce that a task (i) taking place at the last event point (n) must finish at the end of the horizon. The sequence constraints in (12) state that task (i) starting at event point (n+1) should start after the end of the same task performed in the same unit (j) which has finished at the previous event point (n), where extra time is added after task (i) at event point (n), if necessary. The constraints in (13) are written for tasks (i) and (i′) that are performed in the same unit (j) at event points (n+1) and (n), respectively. If both tasks take place in the same unit, they should be at most consecutive. The third set of sequence constraints in (14) relates tasks (i) and (i′) which are performed in different units (j) and (j′) but take place consecutively according to the production recipe. The zero-wait constraints in (15) are written for different tasks (i) and (i′) that take place consecutively with the intermediate state (s) having no possible intermediate storage and thus subject to the zero-wait condition.

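To make the structure of the allocation and capacity constraints (1)–(2) concrete, the following minimal sketch states them in a modeling layer. It uses Python with PuLP rather than the GAMS/CPLEX setup of the original study, and the index sets and capacity data are illustrative placeholders, not the industrial instance.

```python
# Minimal, illustrative sketch of constraints (1)-(2) in PuLP.
# Index sets and capacities are placeholders, not the industrial data.
from pulp import LpProblem, LpVariable, LpBinary, lpSum, LpMaximize

tasks = ["i1", "i2"]                       # tasks
units = {"i1": ["j1"], "i2": ["j1"]}       # suitable units J_i for each task
events = [1, 2, 3]                         # event points n = 1..N_max
cap_min = {("i1", "j1"): 10, ("i2", "j1"): 5}
cap_max = {("i1", "j1"): 50, ("i2", "j1"): 40}

m = LpProblem("short_term_scheduling_sketch", LpMaximize)
wv = {(i, j, n): LpVariable(f"wv_{i}_{j}_{n}", cat=LpBinary)
      for i in tasks for j in units[i] for n in events}
B = {(i, j, n): LpVariable(f"B_{i}_{j}_{n}", lowBound=0)
     for i in tasks for j in units[i] for n in events}

m += lpSum(B.values())   # placeholder objective; the real profit terms are omitted

# (1) Allocation: at most one task active in each unit at each event point.
for j in {u for us in units.values() for u in us}:
    for n in events:
        m += lpSum(wv[i, j, n] for i in tasks if j in units[i]) <= 1

# (2) Capacity: batch size bounded by min/max capacity when the task is active.
for (i, j, n), batch in B.items():
    m += batch >= cap_min[i, j] * wv[i, j, n]
    m += batch <= cap_max[i, j] * wv[i, j, n]
```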

(16)  $\sum_{i \in I^{in} \cap I_k \cap I^{T6b}}\ \sum_{n \in N,\, n \le N^{max}} y(i,k,n) \ge 1, \quad \forall k \in K^{in},\ s \in S^{in} \cap S^{cat1} \cap S_k,\ d \in D^{in},\ rk(k,s,d) > 0$

(17)  $\sum_{i \in I^{in} \cap I_k \cap I^{T6b}}\ \sum_{n \in N,\, n \le N^{max}} y(i,k,n) \le \sum_{d \in D^{in},\, rk(k,s,d) > 0} \left\lceil \frac{rk(k,s,d)}{B^{min}_s} \right\rceil, \quad \forall k \in K^{in},\ s \in S^{in} \cap S^{cat1} \cap S_k$

(18)  $\sum_{j \in J_i \cap J^{T6}} wv(i,j,n) \le \sum_{k \in K^{in} \cap K_s} y(i,k,n), \quad \forall s \in S^{in} \cap S^{cat1},\ i \in I^{in} \cap I^{T6b},\ n \le N^{max}$

(19)  $y(i,k,n) \le \sum_{j \in J_i \cap J^{T6}} wv(i,j,n), \quad \forall s \in S^{in} \cap S^{cat1},\ k \in K^{in} \cap K_s,\ i \in I^{in} \cap I_k \cap I^{T6b},\ n \le N^{max}$

(20)  $kD(k,s,n+1) + kD^f(k,s,n+1) \le \sum_{j \in J_i} B(i,j,n) + C\,[1 - y(i,k,n)], \quad \forall s \in S^{in} \cap S^{cat1},\ k \in K^{in} \cap K_s,\ i \in I_k \cap I^{T6b},\ n < N^{max}$

(21)  $D(s,n) = \sum_{k \in K^{in} \cap K_s} kD(k,s,n), \quad \forall s \in S^{in} \cap S^{cat1},\ n \le N^{max}$

(22)  $D^f(s,n) = \sum_{k \in K^{in} \cap K_s} kD^f(k,s,n), \quad \forall s \in S^{in} \cap S^{cat1},\ n = N^{max}$

(23)  $\sum_{n \in N,\, n < N^{max}} \bigl[kD(k,s,n+1) + kD^f(k,s,n+1)\bigr] + sla1(k,s,d) \ge rk(k,s,d), \quad \forall s \in S^{in} \cap S^{cat1},\ k \in K^{in} \cap K_s,\ d \in D^{in},\ rk(k,s,d) > 0$

(24)  $\sum_{n \in N,\, n < N^{max}} \bigl[kD(k,s,n+1) + kD^f(k,s,n+1)\bigr] + STF(s) - sla2(k,s,d) \le rk(k,s,d), \quad \forall s \in S^{in} \cap S^{cat1},\ k \in K^{in} \cap K_s,\ d \in D^{in},\ rk(k,s,d) > 0$

(25)  $T^f(i,j,n) - slt1(k,s,d,n) \le duek(k,s,d) + H\,[2 - wv(i,j,n) - y(i,k,n)], \quad \forall s \in S^{in} \cap S^{cat1},\ k \in K^{in} \cap K_s,\ i \in I^{in} \cap I_k \cap I^{T6b},\ j \in J_i,\ n \le N^{max},\ d \in D^{in},\ rk(k,s,d) > 0$

(26)  $T^f(i,j,n) + slt2(k,s,d,n) \ge \bigl[duek(k,s,d) - 24\bigr] - H\,[2 - wv(i,j,n) - y(i,k,n)], \quad \forall s \in S^{in} \cap S^{cat1},\ k \in K^{in} \cap K_s,\ i \in I^{in} \cap I_k \cap I^{T6b},\ j \in J_i,\ n \le N^{max},\ d \in D^{in},\ rk(k,s,d) > 0$

The order satisfaction constraints in (16)–(26) are written to ensure that all orders for category 1 products are met on time and with the required amount. Both under- and overproduction as well as early and late production are represented with slack variables that are penalized in the objective function. Note that these constraints can be modified to represent different requirements for production, if desired. Constraints (16) try to ensure that each order (k) is met at least one time with an operation type 6 task (i), where task (i) is suitable for order (k) if $i \in I_k$ and is an operation type 6 task for a category 1 product if $i \in I^{T6b}$. Similarly, constraints (17) enforce the condition that each order (k) for category 1 product state (s) on day (d) can be met with at most $\lceil rk(k,s,d)/B^{min}_s \rceil$ tasks. Constraints (18) and (19) link the delivery of order (k) through task (i) at event point (n) to the beginning of task (i) in any suitable unit (j) at event point (n) so that every category 1 operation type 6 task must be linked to at least one order delivery and vice versa. Thus, constraint (18) enforces that if a binary variable is activated for operation type 6 task (i), then at least one order delivery must be activated. Similarly, constraint (19) ensures that if no binary variables are activated for operation type 6 task (i) at event point (n), then no delivery variables can be activated. Constraints (20) relate the individual order delivery variables to the batch size of the operation type 6 task used to satisfy the order. If an order (k) is met by task (i) at event point (n) (i. e., y(i,k,n) = 1), then at least one operation type 6 task is active for task (i) at event point (n) and thus at least one B(i,j,n) variable is greater than zero. Constraints (21) and (22) relate the individual order delivery variables to the overall delivery variables used in the material balance constraints. Constraints (23) and (24) determine the under- and overproduction, respectively, of order (k) for state (s) on day (d). Constraints (23) try to enforce the individual order delivery variables to exceed the amount due for order (k) (i. e., rk(k,s,d)), where slack variables sla1(k,s,d) are activated in the case of underproduction. Similarly, constraints (24) try to enforce the individual order delivery variables plus any amount of the product state left at the end of the horizon not to exceed the amount due for order (k), where slack variables sla2(k,s,d) are activated in the case of overproduction. Constraints (25) and (26) determine the late and early production, respectively, of order (k) for state (s) on day (d). Constraints (25) try to enforce the finishing time of task (i) used to satisfy order (k) at event point (n) to be less than the due date of order (k), where slack variables slt1(k,s,d,n) are activated in the case of late production. Similarly, constraints (26) try to enforce the finishing time of task (i) used to fulfill order (k) at event point (n) to be greater than the beginning of the day (d) on which the order is due (i. e., duek(k,s,d) − 24). Otherwise, slack variables slt2(k,s,d,n) are activated, indicating early production.

(27)  $tot(s) = STF(s) + \sum_{n \in N,\, n \le N^{max}} \bigl[D(s,n) + D^f(s,n)\bigr], \quad \forall s \in S^{in}$

(28)  $\sum_{n \in N,\, n \le N^{max}} \bigl[D(s,n) + D^f(s,n)\bigr] + sll(s) \ge dem_s, \quad \forall s \in S^{in} \cap S^{cat1}$

(29)  $tot(s) + sll(s) \ge dem_s, \quad \forall s \in S^{in} \cap S^{cat2}$

(30)  $\sum_{i \in I^{in} \cap I^{p}_s}\ \sum_{j \in J_i}\ \sum_{n \in N,\, n \le N^{max}} B(i,j,n) + sll^{raw}(s) \ge dem^{raw}_s, \quad \forall s \in S^{in} \cap S^{rw},\ s' \in S^{in} \cap S^{f},\ praw_{s's} > 0,\ dem_{s'} > 0,\ dem^{raw}_s > 0$

Constraints (27)–(29) are used to determine the overall underproduction for both category 1 and 2 products in the current time horizon. First, constraints (27) determine the total production for all product states (s) (i. e., tot(s)) in the current horizon. Then, constraints (28) sum the overall delivery variables for category 1 products and activate the slack variables sll(s) if the sum does not exceed the demand for category 1 product state (s). Similarly, constraints (29) calculate the amount of underproduction (i. e., sll(s)) for category 2 product state (s) based on its overall demand in the time horizon. The slack variable sll(s) is then penalized in the objective function, where category 1 and 2 products can be penalized at different weights. Constraints (30) determine the amount of underproduction for intermediate product states (s) that are needed as raw materials for final product states (s′). The bound constraints, collected in (31), are used to impose lower and upper bounds on the continuous variables, including slack variables. They are also used to fix some binary and continuous variables to be zero when necessary.

(31)  $T^s(i,j,n) \ge start_j,\quad T^f(i,j,n) \ge start_j, \quad \forall i \in I,\ j \in J_i,\ n \in N$
$T^s(i,j,n) \le H,\quad T^f(i,j,n) \le H, \quad \forall i \in I,\ j \in J_i,\ n \in N$
$STO(s) = 0, \quad \forall s \notin S^{0}$
$STF(s) \le demtot_s, \quad \forall s \in S^{f}$
$tot(s) \le demtot_s, \quad \forall s \in S^{f}$
$D(s,n),\ D^f(s,n) = 0, \quad \forall s \notin S^{p}\ \text{or}\ n > N^{max}$
$D(s,n),\ D^f(s,n) \le \sum_{d \in D^{in}} \sum_{k \in K^{in}} rk(k,s,d), \quad \forall s \in S^{in} \cap S^{cat1},\ n \le N^{max}$
$kD(k,s,n),\ kD^f(k,s,n) = 0, \quad \forall k \notin K^{in}\ \text{or}\ s \notin S_k\ \text{or}\ n > N^{max}$
$kD(k,s,n),\ kD^f(k,s,n) \le \sum_{d \in D^{in}} rk(k,s,d), \quad \forall s \in S_k,\ n \le N^{max}$
$slcap(s,n) \le stcap^{min}_s, \quad \forall s \in S^{cpm}$
$sla1(k,s,d),\ sla2(k,s,d) = 0, \quad \forall k \notin K^{in}\ \text{or}\ s \notin S_k\ \text{or}\ d \notin D^{in}\ \text{or}\ rk(k,s,d) = 0$
$sla1(k,s,d) \le rk(k,s,d), \quad \forall k \in K^{in},\ s \in S_k,\ d \in D^{in}$
$sla2(k,s,d) \le demtot_s, \quad \forall k \in K^{in},\ s \in S_k,\ d \in D^{in}$
$slt1(k,s,d,n),\ slt2(k,s,d,n) = 0, \quad \forall k \notin K^{in}\ \text{or}\ s \notin S_k\ \text{or}\ d \notin D^{in}\ \text{or}\ rk(k,s,d) = 0\ \text{or}\ n > N^{max}$
$slt1(k,s,d,n) \le H - duek(k,s,d), \quad \forall k \in K^{in},\ s \in S_k,\ d \in D^{in},\ n \le N^{max}$
$slt2(k,s,d,n) \le duek(k,s,d), \quad \forall k \in K^{in},\ s \in S_k,\ d \in D^{in},\ n \le N^{max}$
$sll(s) \le dem_s, \quad \forall s \in S^{p}$
$sll^{raw}(s) \le dem^{raw}_s, \quad \forall s \in S^{rw}$
$wv(i,j,n),\ B(i,j,n) = 0, \quad \forall i \notin I^{in}\ \text{or}\ j \notin J_i\ \text{or}\ n > N^{max}$
$D(s,n),\ D^f(s,n) = 0, \quad \forall s \notin S^{in}\ \text{or}\ n > N^{max}$

There are several different objective functions that can be employed with a general short-term scheduling problem. In this work, we maximize the sale of final products while penalizing several other terms, including the slack variables introduced previously. The overall objective function is as follows:

(32)  $\max\ \omega \sum_{s \in S^{in} \cap S^{p}} price_s\, tot(s) \;-\; \epsilon \sum_{i \in I^{in}} \sum_{j \in J_i} \sum_{n \le N^{max}} tts(i,j,n) \;-\; \gamma \Bigl[ \sum_{i \in I^{in}} \sum_{j \in J_i} \sum_{n \le N^{max}} wv(i,j,n) + \sum_{k \in K^{in}} \sum_{i \in I_k} \sum_{n \le N^{max}} y(i,k,n) \Bigr] \;-\; \alpha \sum_{s \in S^{f}} prior_s\, sll(s) \;-\; \lambda \sum_{k \in K^{in}} slorder(k) \;-\; \beta \sum_{k \in K^{in}} \sum_{s \in S^{cat1}} \sum_{d \in D^{in}} \bigl[ sla1(k,s,d) + \Gamma\, sla2(k,s,d) \bigr] \;-\; \theta \sum_{k \in K^{in}} \sum_{s \in S^{cat1}} \sum_{d \in D^{in}} \sum_{n \le N^{max}} \bigl[ slt1(k,s,d,n) + slt2(k,s,d,n) \bigr] \;-\; \mu \sum_{s \in S^{rw}} prior^{raw}_s\, sll^{raw}(s) \;-\; \delta \sum_{s \in S^{cpm}} \sum_{n \le N^{max}} slcap(s,n)$

where each of the coefficients is used to balance the relative weight of each term in the overall objective function. The first term is the maximization of the value of the final products and is the main term of the objective function. The second term seeks to minimize the sum of the starting times of all active processing tasks. This is done to encourage all tasks to start as early as possible in the scheduling horizon. Note that this results in a bilinear term which can be replaced with an equivalent linear term and set of constraints [3]. The third term seeks to minimize the number of active binary variables in the final production schedule. The fourth term seeks to minimize the slack variable that is activated when product state (s) does not meet its overall demand for the time horizon. Coefficient $prior_s$ allows different weights to be assigned to different product states. The fifth term minimizes the number of category 1 orders (k) that are not filled in the time horizon. The sixth term minimizes the amount of over- and underproduction of orders for category 1 products in the time horizon, where the coefficient Γ allows over- and underproduction to be penalized by different amounts. The seventh term seeks to minimize the amount of early and late production of orders for category 1 products due in the time horizon, where the coefficient θ allows early and late production to be penalized to different degrees. The eighth term minimizes the slack variables activated when insufficient raw material state (s) is produced during the time horizon, where $prior^{raw}_s$ allows different states to be penalized by different amounts. The ninth, and final, term seeks to minimize the slack variables activated when insufficient intermediate state (s) is stored in its dedicated storage tank at each event point.
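The bilinear term mentioned above arises because the second objective term multiplies the continuous start time $T^s(i,j,n)$ by the binary $wv(i,j,n)$. One standard big-M linearization, consistent with the definition of tts(i,j,n) as the start time of the active task, is sketched below; it is an illustration of the technique, not a quotation of the constraints in [3].

```latex
% Sketch of a standard linearization of T^s(i,j,n) * wv(i,j,n)
% via the continuous variable tts(i,j,n):
\begin{align*}
tts(i,j,n) &\ge T^s(i,j,n) - H\,\bigl[1 - wv(i,j,n)\bigr] \\
tts(i,j,n) &\le T^s(i,j,n) \\
0 \;\le\; tts(i,j,n) &\le H \cdot wv(i,j,n)
\end{align*}
```

When wv(i,j,n) = 1 the first two inequalities force tts(i,j,n) = $T^s(i,j,n)$, and when wv(i,j,n) = 0 the last inequality forces tts(i,j,n) = 0, so the product is represented exactly.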


Typical values for each of the coefficients are as follows: ω = 1, Γ = 1, γ = 10, λ = 1000, θ = 1000, α = 2000, β = 500, ε = 0.01, μ = 50, δ = 10.

Cases

In this section, an example problem is presented to demonstrate the effectiveness of the rolling horizon framework. The example utilizes the proposed framework to determine the medium range production schedule of an industrial batch plant for a two-week time period which satisfies customer orders for various products distributed throughout the time period. The example is implemented with GAMS 2.50 [1] and solved using CPLEX 9.0 [2] on a 3.20 GHz Linux workstation. The dual simplex method is used with best-bound search and strong branching. A relative optimality tolerance equal to 0.001% was used as the termination criterion, along with a three hour time limit and an integer solution limit of 40. The distribution of demands for the entire two-week time period is shown in Fig. 3, where the amounts are shown in relative terms. There are two categories of products, category 1 and 2, and a total of 67 different products have demands. There are two different campaign products that can be scheduled for campaign mode production, and an additional eight intermediate products are used to make final products, even though they do not have demands. It is assumed that no final products are available at the beginning of the time horizon, although some intermediate materials are available. Also, we assume no limitation on external raw materials, and the zero-wait condition is applied to all intermediate materials unless they are used as raw materials for other final products. In this case, unlimited intermediate storage is allowed. Note that finite intermediate storage is effectively modeled for those intermediates that have a dedicated storage task with a given capacity limit. In addition, there are two types of connections made between each consecutive short-term scheduling horizon in the rolling horizon framework: the initial available time for each unit and the inventory of intermediate materials.
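The rolling horizon logic just described can be summarized schematically. The Python sketch below is a schematic rendering only: `decompose_horizon`, `solve_short_term_milp`, and the result fields are hypothetical stand-ins for the decomposition model, the MILP of constraints (1)–(32), and the two horizon-to-horizon connections; they are not part of any published interface.

```python
# Schematic rolling-horizon driver; all helper names are hypothetical
# stand-ins for the decomposition model and the short-term MILP.
def rolling_horizon_schedule(orders, total_days):
    unit_available = {}   # start_j: first availability time of each unit
    inventory = {}        # STO(s): initial amounts of intermediate states
    schedule = []
    # The decomposition model splits the overall horizon into subhorizons,
    # each with its own days, products, and demands (cf. Table 1).
    for subhorizon in decompose_horizon(orders, total_days):
        result = solve_short_term_milp(
            subhorizon,
            start_times=unit_available,   # connection 1: unit availability
            initial_stock=inventory,      # connection 2: intermediate inventory
        )
        schedule.append(result.gantt)
        # Pass the final state of this subhorizon on to the next one.
        unit_available = result.final_unit_times
        inventory = result.final_inventory   # STF(s) becomes the next STO(s)
    return schedule
```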

Medium-Term Scheduling of Batch Processes, Figure 3: Distribution of demands

Case 1: Nominal Run without Campaign Mode Production

The example problem considers the production scheduling of an industrial batch plant where no type 5 unit campaign is imposed. Instead, demands for both campaign products are created throughout the time horizon with a total demand for each product equal to the production that would be imposed by a campaign. The total time period is 19 days, from D0 to D18. The rolling horizon framework decomposes the time horizon into 8 individual subhorizons, each with its own products and demands. The results of the decomposition for each time horizon can be seen in Table 1. The final production schedule for the entire time period can be seen in Figs. 4 and 5, where the processing units (operation type 1, 2, 3, and 5) are shown in the first figure and the other units (operation type 4a, 4b, and 6) are shown in the second. Each short-term scheduling horizon is represented with a different color, beginning with black for the first horizon, red for the second horizon, green for the third horizon, etc. The model and solution statistics for each short-term scheduling horizon are given in Table 2.

Medium-Term Scheduling of Batch Processes, Table 1: Decomposition results for case 1

Horizon  Days     Main Products  Additional Products
H1       D0–D2    27             2
H2       D3–D4    31             0
H3       D5–D6    50             0
H4       D7–D8    49             0
H5       D9–D10   37             0
H6       D11–D12  49             0
H7       D13–D14  54             0
H8       D15–D18  45             0

Medium-Term Scheduling of Batch Processes, Figure 4: Overall production schedule for processing units for case 1

Medium-Term Scheduling of Batch Processes, Figure 5: Overall production schedule for non-processing units for case 1

Medium-Term Scheduling of Batch Processes, Table 2: Model and solution statistics for case 1

Horizon  Days     Event Points  Objective Function  Binary Variables  Continuous Variables  Constraints
H1       D0–D2    8             14,001.69           4880              33,064                187,833
H2       D3–D4    6             4135.24             3660              24,923                125,374
H3       D5–D6    6             105,854.81          5478              32,621                258,852
H4       D7–D8    6             5496.19             5376              32,167                255,696
H5       D9–D10   6             15,352.37           4296              27,613                175,939
H6       D11–D12  6             11,326.13           5490              32,637                272,802
H7       D13–D14  6             19,401.39           5568              32,955                282,632
H8       D15–D18  10            37,054.00           7430              46,827                321,162

Each horizon runs up to the time limit of three hours. The total demand for the entire 14-day period is 2323.545 and the total production is 2744.005, where 51.674 of the demands are not met. The production schedules obtained satisfy demands for almost all the products, though some due dates are relaxed, and also produce 18.10% more material than the demands require. Many of the processing units are not fully utilized, as shown in Table 3, indicating the potential for even more production in the given time period. Also, note that the processing units become more idle towards the end of the overall time horizon. This is because no demands are specified for the days following day D14, including days D15 to D18. Additional demands at the end of the overall time horizon or in the following days would generate a more heavily utilized production schedule.

Medium-Term Scheduling of Batch Processes, Table 3: Unit utilization statistics for case 1

Unit       Time Used (h)  Time Left (h)  Percent Utilized
Type 1–1   98.00          358.00         21.49%
Type 1–2   341.00         115.00         74.78%
Type 1–3   329.60         126.40         72.15%
Type 1–4   396.00         60.00          80.92%
Type 1–5   283.20         172.80         62.06%
Type 1–6   402.00         54.00          88.16%
Type 1–7   408.00         48.00          89.47%
Type 1–8   281.00         175.00         61.62%
Type 1–9   322.00         134.00         70.61%
Type 1–10  322.20         133.80         70.66%
Type 1–11  312.20         143.80         68.46%
Type 1–12  177.00         279.00         38.82%
Type 1–13  201.00         255.00         44.08%
Type 5     362.04         93.96          79.39%

Conclusions

In this paper, a unit-specific event-based continuous-time formulation is presented for the medium-term production scheduling of a large-scale, multipurpose industrial batch plant. The proposed formulation takes into account a large number of processing recipes and units and incorporates several features, including various storage policies (UIS, NIS, ZW), variable batch sizes and processing times, batch mixing and splitting, sequence-dependent changeover times, intermediate due dates, products used as raw materials, and several modes of operation. The scheduling horizon is several weeks or longer; even longer time periods can be addressed with the proposed framework. A key feature of the proposed formulation is the use of a decomposition model to split the overall scheduling horizon into smaller subhorizons which are scheduled in a sequential fashion. Also, new constraints are added to the short-term scheduling model in order to model the delivery of orders at intermediate due dates. The effectiveness of the proposed approach is demonstrated with an industrial case study. Results indicate that the rolling horizon approach is effective at solving large-scale, medium-term production scheduling problems.

References 1. Brooke A, Kendrick D, Meeraus A, Raman R (2003) GAMS: A User's Guide. South San Francisco 2. CPLEX (2005) ILOG CPLEX 9.0 User's Manual 3. Floudas CA (1995) Nonlinear and Mixed-Integer Optimization. Oxford University Press, Oxford


4. Floudas CA, Lin X (2004) Continuous-Time versus Discrete-Time Approaches for Scheduling of Chemical Processes: A Review. Comput Chem Eng 28:2109 5. Floudas CA, Lin X (2005) Mixed Integer Linear Programming in Process Scheduling: Modeling, Algorithms, and Applications. Ann Oper Res 139:131 6. Ierapetritou MG, Floudas CA (1998) Effective Continuous-Time Formulation for Short-Term Scheduling: 1. Multipurpose Batch Processes. Ind Eng Chem Res 37:4341 7. Janak SL, Lin X, Floudas CA (2004) Enhanced Continuous-Time Unit-Specific Event-Based Formulation for Short-Term Scheduling of Multipurpose Batch Processes: Resource Constraints and Mixed Storage Policies. Ind Eng Chem Res 43:2516 8. Lin X, Floudas CA, Modi S, Juhasz NM (2002) Continuous-Time Optimization Approach for Medium-Range Production Scheduling of a Multiproduct Batch Plant. Ind Eng Chem Res 41:3884 9. Pantelides CC (1993) Unified Frameworks for Optimal Process Planning and Scheduling. In: Rippin DWT, Hale JC, Davis J (eds) Proceedings of the Second International Conference on Foundations of Computer-Aided Process Operations. CACHE, Crested Butte/Austin, pp 253–274 10. Reklaitis GV (1992) Overview of Scheduling and Planning of Batch Process Operations. In: Presented at NATO Advanced Study Institute – Batch Process Systems Engineering. Antalya 11. Shah N (1998) Single- and Multisite Planning and Scheduling: Current Status and Future Challenges. In: Pekny JF, Blau GE (eds) Proceedings of the Third International Conference on Foundations of Computer-Aided Process Operations. CACHE-AIChE, Snowbird/New York, pp 75–90

Metaheuristic Algorithms for the Vehicle Routing Problem

YANNIS MARINAKIS
Department of Production Engineering and Management, Decision Support Systems Laboratory, Technical University of Crete, Chania, Greece

MSC2000: 90B06, 90C59

Article Outline

Introduction
Metaheuristic Algorithms for the Vehicle Routing Problem
References

Introduction

The vehicle routing problem (VRP) or the capacitated vehicle routing problem (CVRP) is often described as a problem in which vehicles based at a central depot are required to visit geographically dispersed customers in order to fulfill known customer demands. Let $G = (V, E)$ be a graph, where $V = \{i_0, i_1, i_2, \ldots, i_n\}$ is the vertex set ($i_0$ refers to the depot and the customers are indexed $i_1, \ldots, i_n$) and $E = \{(i_l, i_{l-1}) : i_l, i_{l-1} \in V\}$ is the edge set. Each customer must be assigned to exactly one of the $k$ vehicles and the total size of deliveries for customers assigned to each vehicle must not exceed the vehicle capacity $Q_k$. If the vehicles are homogeneous, the capacity for all vehicles is equal and denoted by $Q$. A demand $q_{i_l}$ and a service time $st_{i_l}$ are associated with each customer node $i_l$. The travel cost between customers $i_l$ and $i_{l-1}$ is $c_{i_l i_{l-1}}$. The problem is to construct a low cost, feasible set of routes – one for each vehicle. A route is a sequence of locations that a vehicle must visit along with the indication of the service it provides. The vehicle must start and finish its tour at the depot. The most important variants of the vehicle routing problem can be found in [12,13,39,54,84]. The vehicle routing problem was first introduced by Dantzig and Ramser [21]. As it is an NP-hard problem, instances with a large number of customers cannot be solved to optimality within reasonable time. Due to the general inefficiency of the exact methods and their inability to solve large scale VRP instances, a large number of approximation techniques have been proposed. These techniques are classified into two main categories: the classical heuristics, developed mostly between 1960 and 1990, and the metaheuristics, developed in the last fifteen years. In the 1960s and 1970s the first attempts to solve the vehicle routing problem focused on route building, route improvement and two-phase heuristics. In the 1980s a number of mathematical programming procedures were proposed for the solution of the problem. The most important of them can be found in [6,18,19,22,28,29,33,62,88].
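To fix the notation, the following short Python sketch evaluates a candidate CVRP solution: it checks the capacity constraint on each route and accumulates the travel cost, with every route starting and finishing at the depot (node 0). The data layout is an illustrative assumption, not a standard benchmark format.

```python
# Evaluate a candidate CVRP solution: routes are lists of customer indices;
# each vehicle starts and ends at the depot (node 0).
def evaluate(routes, demand, cost, capacity):
    total_cost = 0.0
    for route in routes:
        load = sum(demand[i] for i in route)
        if load > capacity:                 # homogeneous capacity Q
            raise ValueError(f"route {route} exceeds capacity {capacity}")
        path = [0] + route + [0]            # start and finish at the depot
        total_cost += sum(cost[a][b] for a, b in zip(path, path[1:]))
    return total_cost

# Tiny illustrative instance: 4 customers, symmetric costs, capacity 10.
cost = [[0, 2, 4, 3, 5],
        [2, 0, 3, 4, 6],
        [4, 3, 0, 2, 3],
        [3, 4, 2, 0, 2],
        [5, 6, 3, 2, 0]]
demand = {1: 4, 2: 6, 3: 3, 4: 5}
print(evaluate([[1, 2], [3, 4]], demand, cost, capacity=10))
```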

Metaheuristic Algorithms for the Vehicle Routing Problem

Over the last fifteen years an increasing number of metaheuristic algorithms have been proposed.


Simulated annealing, genetic algorithms, neural networks, tabu search, and ant algorithms, together with a number of hybrid techniques, are the main categories of the metaheuristic procedures. These algorithms have the ability to find their way out of local optima. Surveys of metaheuristic algorithms have been published in [27,31,32,49,50,79]. A number of metaheuristic algorithms have been proposed for the solution of the capacitated vehicle routing problem. The most important algorithms published for each metaheuristic are given in the following:
• Simulated Annealing (SA) [1,3,47,72] plays a special role within local search for two reasons. First, SA-type algorithms appear to be quite successful when applied to a broad range of practical problems. Second, some of these algorithms, such as SA, have a stochastic component, which facilitates a theoretical analysis of their asymptotic convergence. Simulated Annealing [2] is a stochastic algorithm that allows random uphill jumps in a controlled fashion in order to provide possible escapes from poor local optima (see the acceptance-rule sketch after this list). Gradually the probability of allowing the objective function value to increase is lowered until no more transformations are possible. Simulated Annealing owes its name to an analogy with the annealing process in condensed matter physics, where a solid is heated to a maximum temperature at which all particles of the solid randomly arrange themselves in the liquid phase, followed by cooling through careful and slow reduction of the temperature until the liquid is frozen with the particles arranged in a highly structured lattice and minimal system energy. This ground state is reachable only if the maximum temperature is sufficiently high and the cooling sufficiently slow. Otherwise a meta-stable state is reached. The meta-stable state is also reached with a process known as quenching, in which the temperature is instantaneously lowered. Its predecessor is the so-called Metropolis filter. Simulated Annealing algorithms for the VRP are presented in [14,31,63].
• The Threshold Accepting Method is a modification of Simulated Annealing which, together with record-to-record travel [25,26], is known as a Deterministic Annealing method. These methods leave out the stochastic element in accepting worse solutions by introducing a deterministic threshold, denoted by $Th_m > 0$, and accept a worse solution if $\Delta = c(S') - c(S) \le Th_m$, where $c$ is the cost of the solution (see also the sketch after this list). This is the move acceptance criterion, and the subscript $m$ is an iteration index. Dueck and Scheurer [26] were the first to propose the Threshold Accepting Method for the VRP. Tarantilis et al. [81,82] proposed two very efficient algorithms belonging to this class: the Backtracking Adaptive Threshold Accepting (BATA) and the List-Based Threshold Accepting (LBTA). Other Deterministic Annealing methods were proposed by Golden et al. [40], the Record-to-Record Travel Method, and by Li et al. [51].
• Tabu search (TS) was introduced by Glover [34,35] as a general iterative metaheuristic for solving combinatorial optimization problems. Computational experience has shown that TS is a well established approximation technique, which can compete with almost all known techniques and which, by its flexibility, can beat many classic procedures. It is a form of local neighbor search. Each solution $S$ has an associated set of neighbors $N(S)$. A solution $S' \in N(S)$ can be reached from $S$ by an operation called a move. TS can be viewed as an iterative technique which explores a set of problem solutions by repeatedly making moves from one solution $S$ to another solution $S'$ located in the neighborhood $N(S)$ of $S$ [37]. TS moves from a solution to its best admissible neighbor, even if this causes the objective function to deteriorate. To avoid cycling, solutions that have been recently explored are declared forbidden or tabu for a number of iterations. The tabu status of a solution is overridden when certain criteria (aspiration criteria) are satisfied. Sometimes, intensification and diversification strategies are used to improve the search. In the first case, the search is accentuated in the promising regions of the feasible domain. In the second case, an attempt is made to consider solutions in a broad area of the search space. Tabu Search algorithms for the VRP are presented in [7,9,20,30,63,70,71,77,85,89,90].
• Genetic Algorithms (GAs) are search procedures based on the mechanics of natural selection and natural genetics. The first GA was developed by John H. Holland in the 1960s to allow computers to evolve solutions to difficult search and combinatorial problems, such as function optimization and machine learning [44]. Genetic algorithms offer a particularly attractive approach for problems like the vehicle routing problem since they are generally quite effective for rapid global search of large, non-linear and poorly understood spaces. Moreover, genetic algorithms are very effective in solving large-scale problems. Genetic algorithms [38,72] mimic the evolution process in nature. GAs are based on an imitation of the biological process in which new and better populations among different species are developed during evolution. Thus, unlike most standard heuristics, GAs use information about a population of solutions, called individuals, when they search for better solutions. A GA is a stochastic iterative procedure that maintains the population size constant in each iteration, called a generation. Their basic operation is the mating of two solutions in order to form a new solution. To form a new population, a binary operator called crossover and a unary operator called mutation are applied [65,66]. Crossover takes two individuals, called parents, and produces two new individuals, called offspring, by swapping parts of the parents. Genetic Algorithms for the VRP are presented in [4,5,8,11,45,53,56,60,64].
• The Greedy Randomized Adaptive Search Procedure (GRASP) [73] is an iterative two-phase search method which has gained considerable popularity in combinatorial optimization. Each iteration consists of two phases, a construction phase and a local search procedure. In the construction phase, a randomized greedy function is used to build up an initial solution. This randomized technique provides a feasible solution within each iteration. This solution is then exposed to improvement attempts in the local search phase. The final result is simply the best solution found over all iterations. GRASP algorithms for the VRP are presented in [17,42,55].
• The use of Artificial Neural Networks to find good solutions to combinatorial optimization problems has recently caught some attention. A neural network consists of a network [76] of elementary nodes (neurons) that are linked through weighted connections. The nodes represent computational units, which are capable of performing a simple computation, consisting of a summation of the weighted inputs, followed by the addition of a constant called the threshold or bias, and the application of a nonlinear response (activation) function. The result of the computation of a unit constitutes its output. This output is used as an input for the nodes to which it is linked through an outgoing connection. The overall task of the network is to achieve a certain network configuration, for instance a required input–output relation, by means of the collective computation of the nodes. This process is often called self-organization. Neural Network algorithms for the VRP are presented in [61,83].
• The Ant Colony Optimization (ACO) metaheuristic is a relatively new technique for solving combinatorial optimization problems (COPs). Based strongly on the Ant System (AS) metaheuristic developed by Dorigo, Maniezzo and Colorni [24], ant colony optimization is derived from the foraging behaviour of real ants in nature. The main idea of ACO is to model the problem as the search for a minimum cost path in a graph. Artificial ants walk through this graph, looking for good paths. Each ant has a rather simple behavior so that it will typically only find rather poor-quality paths on its own. Better paths are found as the emergent result of the global cooperation among ants in the colony. An ACO algorithm consists of a number of cycles (iterations) of solution construction. During each iteration a number of ants (which is a parameter) construct complete solutions using heuristic information and the collected experiences of previous groups of ants. These collected experiences are represented by a digital analogue of trail pheromone which is deposited on the constituent elements of a solution. Small quantities are deposited during the construction phase while larger amounts are deposited at the end of each iteration in proportion to solution quality. Pheromone can be deposited on the components and/or the connections used in a solution, depending on the problem. Ant Colony Optimization algorithms for the VRP are presented in [10,15,16,23,57,67,68,69].
• Path Relinking generates new solutions by exploring trajectories that connect high-quality solutions – by starting from one of these solutions, called the starting solution, and generating a path in the neighborhood space that leads towards the other solution, called the target solution [36]. Two new metaheuristic algorithms using the path relinking strategy, first as part of a Tabu Search metaheuristic and second as part of a Particle Swarm Optimization metaheuristic, are proposed in [43] and [52], respectively.
• Guided Local Search (GLS), originally proposed by Voudouris and Tsang [86,87], is a general optimization technique suitable for a wide range of combinatorial optimization problems. The main focus is on the exploitation of problem and search-related information to effectively guide local search heuristics in the vast search spaces of NP-hard optimization problems. This is achieved by augmenting the objective function of the problem to be minimized with a set of penalty terms which are dynamically manipulated during the search process to steer the heuristic to be guided. GLS augments the cost function of the problem to include a set of penalty terms and passes this, instead of the original one, for minimization by the local search procedure. Local search is confined by the penalty terms and focuses attention on promising regions of the search space. Iterative calls are made to local search. Each time local search gets caught in a local minimum, the penalties are modified and local search is called again to minimize the modified cost function. Guided Local Search algorithms for the VRP are presented in [58,59].
• Particle Swarm Optimization (PSO) is a population-based swarm intelligence algorithm. It was originally proposed by Kennedy and Eberhart as a simulation of the social behavior of social organisms such as bird flocking and fish schooling [46]. PSO uses the physical movements of the individuals in the swarm and has a flexible and well-balanced mechanism to enhance and adapt to the global and local exploration abilities. The first PSO algorithm for the solution of the vehicle routing problem was proposed in [52].
• One of the most interesting developments that have occurred in the area of TS in recent years is the concept of Adaptive Memory, developed by Rochat and Taillard [74]. It is mostly used in TS, but its applicability is not limited to this type of metaheuristic. An adaptive memory is a pool of good solutions that is dynamically updated throughout the search process.


towards the other solution, called the target solution [36]. Two new metaheuristic algorithms using the path relinking strategy as a part first of Tabu Search Metaheuristic is proposed in [43] and second as a part of a Particle Swarm Optimization Metaheuristic is proposed in [52].  Guided Local Search (GLS), originally proposed by Voudouris and Chang [86,87], is a general optimization technique suitable for a wide range of combinatorial optimization problems. The main focus is on the exploitation of problem and search-related information to effectively guide local search heuristics in the vast search spaces of NP-hard optimization problems. This is achieved by augmenting the objective function of the problem to be minimized with a set of penalty terms which are dynamically manipulated during the search process to steer the heuristic to be guided. GLS augments the cost function of the problem to include a set of penalty terms and passes this, instead of the original one, for minimization by the local search procedure. Local search is confined by the penalty terms and focuses attention on promising regions of the search space. Iterative calls are made to local search. Each time local search gets caught in a local minimum, the penalties are modified and local search is called again to minimize the modification cost function. Guided Local Search algorithms for the VRP are presented in [58,59].  Particle Swarm Optimization (PSO) is a population-based swarm intelligence algorithm. It was originally proposed by Kennedy and Eberhart as a simulation of the social behavior of social organisms such as bird flocking and fish schooling [46]. PSO uses the physical movements of the individuals in the swarm and has a flexible and well-balanced mechanism to enhance and adapt to the global and local exploration abilities. The first algorithm for the solution of the Vehicle Routing Problem was proposed by [52].  One of the most interesting developments that have occurred in the area of TS in recent years is the concept of Adaptive Memory developed by Rochat and Taillard [74]. It is, mostly, used in TS, but its applicability is not limited to this type of metaheuristic. An adaptive memory is a pool of good solutions that is dynamically updated throughout the search process.

Periodically, some elements of these solutions are extracted from the pool and combined differently to produce new good solutions. Very interesting and efficient algorithms based on the concept of Adaptive Memory have been proposed [74,78,79,80].  Variable Neighborhood Search (VNS) is a metaheuristic for solving combinatorial optimization problems whose basic idea is systematic change of neighborhood within a local search [41]. Variable Neighborhood Search algorithms for the VRP are presented in [48].

References 1. Aarts E, Korst J (1989) Simulated Annealing and Boltzmann Machines – A stochastic Approach to Combinatorial Optimization and Neural Computing. Wiley, Chichester 2. Aarts E, Korst J, Van Laarhoven P (1997) Simulated Annealing. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. Wiley, Chichester, pp 91–120 3. Aarts E, Ten Eikelder HMM (2002) Simulated Annealing. In: Pardalos PM, Resende MGC (eds) Handbook of Applied Optimization. Oxford University Press, Oxford, pp 209–221 4. Alba E, Dorronsoro B (2004) Solving the Vehicle Routing Problem by Using Cellular Genetic Algorithms, Conference on Evolutionary Computation in Combinatorial Optimization, EvoCOP’04. LNCS, vol 3004, 11–20, Portugal. Springer, Berlin 5. Alba E, Dorronsoro B (2006) Computing Nine New Best-SoFar Solutions for Capacitated VRP with a Cellular Genetic Algorithm. Inform Process Lett 98(6):225–230 6. Altinkemer K, Gavish B (1991) Parallel Savings Based Heuristics for the Delivery Problem. Oper Res 39(3): 456–469 7. Augerat P, Belenguer JM, Benavent E, Corberan A, Naddef D (1998) Separating Capacity Constraints in the CVRP Using Tabu Search. Eur J Oper Res 106(2–3):546–557 8. Baker BM, Ayechew MA (2003) A Genetic Algorithm for the Vehicle Routing Problem. Comput Oper Res 30(5): 787–800 9. Barbarosoglu G, Ozgur D (1999) A Tabu Search Agorithm for the Vehicle Routing Problem. Comput Oper Res 26:255–270 10. Bell JE, McMullen PR (2004) Ant Colony Optimization Techniques for the Vehicle Routing Problem. Adv Eng Inform 18(1):41–48 11. Berger J, Mohamed B (2003) A Hybrid Genetic Algorithm for the Capacitated Vehicle Routing Problem. In: Proceedings of the Genetic and Evolutionary Computation Conference, Chicago, pp 646–656 12. Bodin L, Golden B (1981) Classification in Vehicle Routing and Scheduling. Networks 11:97–108


13. Bodin L, Golden B, Assad A, Ball M (1983) The State of the Art in the Routing and Scheduling of Vehicles and Crews. Comput Oper Res 10:63–212 14. Breedam AV (2001) Comparing Descent Heuristics and Metaheuristics for the Vehicle Routing Problem. Comput Oper Res 28(4):289–315 15. Bullnheimer B, Hartl RF, Strauss C (1997) Applying the Ant System to the Vehicle Routing Problem. Paper presented at 2nd International Conference on Metaheuristics, Sophia-Antipolis, France 16. Bullnheimer B, Hartl RF, Strauss C (1999) An Improved Ant System Algorithm for the Vehicle Routing Problem. Ann Oper Res 89:319–328 17. Chaovalitwongse W, Kim D, Pardalos PM (2003) GRASP with a new local search scheme for Vehicle Routing Problems with Time Windows. J Combin Optim 7(2):179–207 18. Christofides N, Mingozzi A, Toth P (1979) The Vehicle Routing Problem. In: Christofides N, Mingozzi A, Toth P, Sandi C (eds) Combinatorial Optimization. Wiley, Chichester 19. Clarke G, Wright J (1964) Scheduling of Vehicles from a Central Depot to a Number of Delivery Points. Oper Res 12:568–581 20. Cordeau JF, Gendreau M, Laporte G, Potvin JY, Semet F (2002) A Guide to Vehicle Routing Heuristics. J Oper Res Soc 53:512–522 21. Dantzig GB, Ramser RH (1959) The Truck Dispatching Problem. Manag Sci 6:80–91 22. Desrochers M, Verhoog TW (1989) A Matching Based Savings Algorithm for the Vehicle Routing Problem. Les Cahiers du GERAD G-89–04, Ecole des Hautes Etudes Commerciales de Montreal 23. Doerner K, Gronalt M, Hartl R, Reimman M, Strauss C, Stummer M (2002) Savings Ants for the Vehicle Routing Problem. In: Cagnoni S (ed) EvoWorkshops02. LNCS, vol 2279. Springer, Berlin, pp 11–20 24. Dorigo M, Stutzle T (2004) Ant Colony Optimization, A Bradford Book. MIT Press, London 25. Dueck G (1993) New Optimization Heuristics: The Great Deluge Algorithm and the Record-To-Record Travel. J Comput Phys 104:86–92 26. Dueck G, Scheurer T (1990) Threshold Accepting: A General Purpose Optimization Algorithm. J Comput Phys 90: 161–175 27. Fisher ML (1995) Vehicle routing. In: Ball MO, Magnanti TL, Momma CL, Nemhauser GL (eds) Network Routing. Handbooks in Operations Research and Management Science. North Holland, Amsterdam 8:1–33 28. Fisher ML, Jaikumar R (1981) A Generalized Assignment Heuristic for Vehicle Routing. Networks 11:109–124 29. Foster BA, Ryan DM (1976) An Integer Programming Approach to the Vehicle Scheduling Problem. Oper Res 27:367–384 30. Gendreau M, Hertz A, Laporte G (1994) A Tabu Search Heuristic for the Vehicle Routing Problem. Manag Sci 40:1276–1290


31. Gendreau M, Laporte G, Potvin J-Y (1997) Vehicle Routing: Modern Heuristics. In: Aarts EHL, Lenstra JK (eds) Local search in Combinatorial Optimization. Wiley, Chichester, pp 311–336 32. Gendreau M, Laporte G, Potvin JY (2002) Metaheuristics for the Capacitated VRP. In: Toth P, Vigo D (eds) The Vehicle Routing Problem, Monographs on Discrete Mathematics and Applications. SIAM, Philadelphia, pp 129–154 33. Gillett BE, Miller LR (1974) A Heuristic Algorithm for the Vehicle Dispatch Problem. Oper Res 22:240–349 34. Glover F (1989) Tabu Search I. ORSA J Comput 1(3): 190–206 35. Glover F (1990) Tabu Search II. ORSA J Comput 2(1):4–32 36. Glover F, Laguna M, Marti R (2003) Scatter Search and Path Relinking: Advances and Applications. In: Glover F, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Boston, pp 1–36 37. Glover F, Laguna M, Taillard E, de Werra D (eds) (1993) Tabu Search. Baltzer, Basel 38. Goldberg DE (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, Reading 39. Golden BL, Assad AA (1988) Vehicle Routing: Methods and Studies. North Holland, Amsterdam 40. Golden BL, Wassil E, Kelly J, Chao IM (1998) The Impact of Metaheuristics on Solving the Vehicle Routing Problem: Algorithm, Problem Sets and Computational Results. In: Crainic TG, Laporte G (eds) Fleet Management and Logistics. Kluwer, Boston, pp 33–56 41. Hansen P, Mladenovic N (2001) Variable Neighborhood Search: Principles and Applications. Eur J Oper Res 130: 449–467 42. Hjorring C (1995) The Vehicle Routing Problem and Local Search Metaheuristics, PhD thesis, Department of Engineering Science, University of Auckland 43. Ho SC, Gendreau M (2006) Path Relinking for the Vehicle Routing Problem. J Heuristics 12:55–72 44. Holland JH (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor 45. Jaszkiewicz A, Kominek P (2003) Genetic Local Search with Distance Preserving Recombination Operator for a Vehicle Routing Problem. Eur J Oper Res 151:352–364 46. Kennedy J, Eberhart R (1995) Particle Swarm Optimization. In: Proceedings of (1995) IEEE International Conference on Neural Networks 4:1942–1948 47. Kirkpatrick S, Gelatt CD, Vecchi MP (1982) Optimization by Simulated Annealing. Science 220:671–680 48. Kytojoki J, Nuortio T, Braysy O, Gendreau M (2007) An Efficient Variable Neighborhood Search Heuristic for Very Large Scale Vehicle Routing Problems. Comput Oper Res 34(9):2743–2757 49. Laporte G, Gendreau M, Potvin J-Y, Semet F (2000) Classical and Modern Heuristics for the Vehicle Routing Problem. Int Trans Oper Res 7:285–300 50. Laporte G, Semet F (2002) Classical Heuristics for the Capacitated VRP. In: Toth P, Vigo D (eds) The Vehicle Routing

Problem, Monographs on Discrete Mathematics and Applications. SIAM, Philadelphia, pp 109–128 51. Li F, Golden B, Wasil E (2005) Very Large-Scale Vehicle Routing: New Test Problems, Algorithms and Results. Comput Oper Res 32(5):1165–1179 52. Marinakis Y, Marinaki M, Dounias G (2007) Nature Inspired Network Approaches in Management: Network Analysis and Optimization for the Vehicle Routing Problem Using NI-Techniques (submitted in AI Commun) 53. Marinakis Y, Marinaki M, Migdalas A (2006) A Hybrid Genetic – GRASP – ENS Algorithm for the Vehicle Routing Problem (submitted in Asia Pac J Oper Res) 54. Marinakis Y, Migdalas A (2002) Heuristic Solutions of Vehicle Routing Problems in Supply Chain Management. In: Pardalos PM, Migdalas A, Burkard R (eds) Combinatorial and Global Optimization. World Scientific, New Jersey, pp 205–236 55. Marinakis Y, Migdalas A, Pardalos PM (2006) Multiple Phase Neighborhood Search GRASP for the Vehicle Routing Problem (submitted in Comput Manag Sci) 56. Marinakis Y, Migdalas A, Pardalos PM (2007) A New Bilevel Formulation for the Vehicle Routing Problem and a Solution Method Using a Genetic Algorithm. J Global Optim 38:555–580 57. Mazzeo S, Loiseau I (2004) An Ant Colony Algorithm for the Capacitated Vehicle Routing. Electron Notes Discret Math 18:181–186 58. Mester D, Braysy O (2005) Active Guided Evolution Strategies for the Large Scale Vehicle Routing Problems with Time Windows. Comput Oper Res 32:1593–1614 59. Mester D, Braysy O (2007) Active-Guided Evolution Strategies for Large-Scale Capacitated Vehicle Routing Problems. Comput Oper Res 34(10):2964–2975 60. Mester D, Braysy O, Dullaert W (2007) A Multi-Parametric Evolution Strategies Algorithm for Vehicle Routing Problems. Expert Syst Appl 32(2):508–517 61. Modares A, Somhom S, Enkawa T (1999) A Self-Organizing Neural Network Approach for Multiple Traveling Salesman and Vehicle Routing Problems. Int Trans Oper Res 6(6):591–606 62. Mole RH, Jameson SR (1976) A Sequential Route-Building Algorithm Employing a Generalized Savings Criterion. Oper Res Q 27:503–511 63. Osman IH (1993) Metastrategy Simulated Annealing and Tabu Search Algorithms for Combinatorial Optimization Problems. Ann Oper Res 41:421–451 64. Prins C (2004) A Simple and Effective Evolutionary Algorithm for the Vehicle Routing Problem. Comput Oper Res 31:1985–2002 65. Reeves CR (1995) Genetic Algorithms. In: Reeves CR (ed) Modern Heuristic Techniques for Combinatorial Problems. McGraw-Hill, London, pp 151–196 66. Reeves CR (2003) Genetic Algorithms. In: Glover F, Kochenberger GA (eds) Handbooks of Metaheuristics. Kluwer, Dordrecht, pp 55–82

67. Reimann M, Doerner K, Hartl RF (2003) Analyzing a Unified Ant System for the VRP and Some of Its Variants. In: Cagnoni S et al. (eds) EvoWorkshops 2003. LNCS, vol 2611: 300–310. Springer, Berlin 68. Reimann M, Doerner K, Hartl RF (2004) D-Ants: Savings Based Ants Divide and Conquer the Vehicle Routing Problem. Comput Oper Res 31(4):563–591 69. Reimann M, Stummer M, Doerner K (2002) A Savings Based Ant System for the Vehicle Routing Problem. In: Proceedings of the Genetic and Evolutionary Computation Conference, New York, pp 1317–1326 70. Rego C (1998) A Subpath Ejection Method for the Vehicle Routing Problem. Manag Sci 44:1447–1459 71. Rego C (2001) Node-Ejection Chains for the Vehicle Routing Problem: Sequential and Parallel Algorithms. Parallel Comput 27(3):201–222 72. Rego C, Glover F (2002) Local Search and Metaheuristics. In: Gutin G, Punnen A (eds) The Traveling Salesman Problem and its Variations. Kluwer, Dordrecht, pp 309–367 73. Resende MGC, Ribeiro CC (2003) Greedy Randomized Adaptive Search Procedures. In: Glover F, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Boston, pp 219–249 74. Rochat Y, Taillard ED (1995) Probabilistic Diversification and Intensification in Local Search for Vehicle Routing. J Heuristics 1:147–167 75. Shi Y, Eberhart R (1998) A Modified Particle Swarm Optimizer. In: Proceedings of (1998) IEEE World Congress on Computational Intelligence, pp 69–73 76. Sodererg B, Peterson C (1997) Artificial Neural Networks. In: Aarts E, Lenstra JK (eds) Local Search in Combinatorial Optimization. Wiley, Chichester, pp 173–214 77. Taillard ED (1993) Parallel Iterative Search Methods for Vehicle Routing Problems. Networks 23:661–672 78. Taillard ED, Gambardella LM, Gendreau M, Potvin JY (2001) Adaptive Memory Programming: A Unified View of Metaheuristics. Eur J Oper Res 135(1):1–16 79. Tarantilis CD (2005) Solving the Vehicle Routing Problem with Adaptive Memory Programming Methodology. Comput Oper Res 32(9):2309–2327 80. Tarantilis CD, Kiranoudis CT (2002) BoneRoute: An Adaptive Memory-Based Method for Effective Fleet Management. Ann Oper Res 115(1):227–241 81. Tarantilis CD, Kiranoudis CT, Vassiliadis VS (2002) A Backtracking Adaptive Threshold Accepting Metaheuristic Method for the Vehicle Routing Problem. Syst Anal Modeling Simul (SAMS) 42(5):631–644 82. Tarantilis CD, Kiranoudis CT, Vassiliadis VS (2002) A List Based Threshold Accepting Algorithm for the Capacitated Vehicle Routing Problem. Int J Comput Math 79(5): 537–553 83. Torki A, Somhon S, Enkawa T (1997) A Competitive Neural Network Algorithm for Solving Vehicle Routing Problem. Comput Indust Eng 33(3–4):473–476


84. Toth P, Vigo D (2002) The Vehicle Routing Problem. Monographs on Discrete Mathematics and Applications. SIAM, Philadelphia 85. Toth P, Vigo D (2003) The Granular Tabu Search (and its Application to the Vehicle Routing Problem). INFORMS J Comput 15(4):333–348 86. Voudouris C, Tsang E (1999) Guided Local Search and its Application to the Travelling Salesman Problem. Eur J Oper Res 113:469–499 87. Voudouris C, Tsang E (2003) Guided Local Search. In: Glover F, Kochenberger GA (eds) Handbooks of Metaheuristics. Kluwer, Dordrecht, pp 185–218 88. Wark P, Holt J (1994) A Repeated Matching Heuristic for the Vehicle Routing Problem. J Oper Res Soc 45:1156–1167 89. Willard JAG (1989) Vehicle Routing Using r-Optimal Tabu Search. Master thesis, The Management School, Imperial College, London 90. Xu J, Kelly JP (1996) A New Network Flow-Based Tabu Search Heuristic for the Vehicle Routing Problem. Transp Sci 30:379–393

Metaheuristics

STEFAN VOSS
Institute of Information Systems (Wirtschaftsinformatik), University of Hamburg, Hamburg, Germany

MSC2000: 68T20, 90C59, 90C27, 68T99

Article Outline

Keywords and Phrases
Introduction
Definitions
  Local Search
  Metaheuristics
Metaheuristic Methods
  Simple Local Search Based Metaheuristics
  Simulated Annealing
  Tabu Search
  Evolutionary Algorithms
  Swarm Intelligence
  Miscellaneous
General Frames
  Adaptive Memory Programming
  A Pool Template
  Partial Optimization Metaheuristic Under Special Intensification Conditions
  Hybrids with Exact Methods
  Optimization Software Libraries
Applications
Conclusions
References

Keywords and Phrases

Heuristics; Metaheuristics; Greedy randomized adaptive search procedure; Pilot method; Variable neighborhood search; Simulated annealing; Tabu search; Genetic algorithm; Evolutionary algorithm; Scatter search; Hybridization; Optimization software library; POPMUSIC; Adaptive memory programming; Pool template

Introduction

Many decision problems in various areas such as business, engineering, economics, and science, including those in manufacturing, location, routing, and scheduling, may be formulated as optimization problems. Owing to the complexity of many of these optimization problems, particularly those of large sizes encountered in most practical settings, exact algorithms often perform very poorly, in some cases taking days or more to find moderately decent, let alone optimal, solutions even to fairly small instances. As a result, heuristic algorithms are conspicuously preferable in practical applications. As an extension of simple heuristics, a large number of local search approaches have been developed to improve given feasible solutions. The main drawback of these approaches is their inability to continue the search upon becoming trapped in local optima. This leads to consideration of techniques for guiding known heuristics to overcome local optimality. Following this theme, metaheuristics have become a most important class of approaches for solving optimization problems. They support managers in decision-making with robust tools that provide high-quality solutions to important applications in reasonable time horizons. We describe metaheuristics mainly from an operations research perspective. Earlier survey papers on metaheuristics include those of Blum and Roli [14] and Voß [95]. Here we occasionally rely on the latter. The general concepts have not become obsolete, and many changes are mainly based upon an update to most recent references. A handbook on metaheuristics is available describing a great variety of concepts by various authors in a comprehensive manner [44].


Definitions

The basic concept of heuristic search as an aid to problem solving was first introduced in [76]. A heuristic is a technique (consisting of a rule or a set of rules) which seeks (and hopefully finds) good solutions at a reasonable computational cost. A heuristic is approximate in the sense that it provides (hopefully) a good solution for relatively little effort, but it does not guarantee optimality. Heuristics provide simple means of indicating which among several alternatives seems to be best. That is, "Heuristics are criteria, methods, or principles for deciding which among several alternative courses of action promises to be the most effective in order to achieve some goal. They represent compromises between two requirements: the need to make such criteria simple and, at the same time, the desire to see them discriminate correctly between good and bad choices. A heuristic may be a rule of thumb that is used to guide one's action" [73].

Greedy heuristics are simple iterative approaches available for any kind of (e. g., combinatorial) optimization problem. A good characterization is their myopic behavior. A greedy heuristic starts with a given feasible or infeasible solution. In each iteration there are a number of alternative choices (moves) that can be made to transform the solution. From these alternatives, which consist in fixing (or changing) one or more variables, a greedy choice is made, i. e., the best alternative according to a given measure is chosen until no such transformations are possible any longer. Usually, a greedy construction heuristic starts with an incomplete solution and completes it stepwise. Savings and dual algorithms follow the same iterative scheme: dual heuristics change an infeasible low-cost solution until reaching feasibility; savings algorithms start with a high-cost solution and realize the highest savings as long as possible. Moreover, in all three cases, once an element has been chosen this decision is (usually) not reversed throughout the algorithm; it is kept. As each alternative has to be measured, in general we may define some sort of heuristic measure (providing, e. g., some priority values or some ranking information) which is iteratively followed until a complete solution is built. Usually this heuristic measure is applied in a greedy fashion.
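The myopic behavior just described can be stated in a few lines. The following sketch is a generic illustration, not a specific published algorithm: it completes a partial solution stepwise, always taking the alternative that is best under a given heuristic measure, and never reverses a choice.

```python
def greedy_construct(elements, feasible, measure):
    """Generic greedy construction: start from an empty partial solution
    and repeatedly fix the best-ranked alternative until none remains.
    `feasible(solution, e)` and `measure(solution, e)` encode the problem."""
    solution, remaining = [], set(elements)
    while remaining:
        candidates = [e for e in remaining if feasible(solution, e)]
        if not candidates:
            break                      # no transformation possible any longer
        best = min(candidates, key=lambda e: measure(solution, e))
        solution.append(best)          # greedy choices are never reversed
        remaining.remove(best)
    return solution
```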

For heuristics we usually distinguish between finding initial feasible solutions and improving them. In that sense we first discuss local search before characterizing metaheuristics.

Local Search
The basic principle of local search is to successively alter solutions locally. The related transformations are defined by neighborhoods, which for a given solution include all solutions that can be reached by one move. That is, neighborhood search is usually assumed to correspond to the process of iteratively moving from one solution to another by performing some sort of operation. More formally, each solution of a problem has an associated set of neighbors, called its neighborhood, i.e., the solutions that can be obtained by a single operation called a transformation or move. The most common transformations are, e.g., to add or drop some problem-specific individual components. Other options are to exchange two components simultaneously, or to swap them. Furthermore, components may be shifted from a certain position into other positions. All components involved in a specific move are called its elements or attributes. Moves must be evaluated by some heuristic measure to guide the search. Often one uses the implied change of the objective function value, which may provide reasonable information about the (local) advantage of moves. Following a greedy strategy, steepest descent (SD) corresponds to selecting and performing in each iteration the best move, until the search stops at a local optimum. Obviously, savings algorithms correspond to SD. As the solution quality of local optima may be unsatisfactory, we need mechanisms that guide the search to overcome local optimality. A simple strategy, called iterated local search, is to iterate/restart the local search process after a local optimum has been obtained, which requires some perturbation scheme to generate a new initial solution (e.g., performing some random moves). Of course, more structured ways to overcome local optimality may be advantageous. A general survey of local search can be found in [1] and the references from [2]. A simple template is provided in [90].
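A minimal sketch of SD with an iterated-local-search restart follows, under illustrative assumptions: solutions are permutations (lists of length at least two) and the neighborhood is defined by swapping two positions; the full-shuffle perturbation is a deliberately crude stand-in for the random moves mentioned above.

import random

def neighbors(sol):
    # Swap neighborhood: exchange two components simultaneously.
    for i in range(len(sol)):
        for j in range(i + 1, len(sol)):
            nb = sol[:]
            nb[i], nb[j] = nb[j], nb[i]
            yield nb

def steepest_descent(sol, cost):
    while True:
        best = min(neighbors(sol), key=cost)  # best move in the neighborhood
        if cost(best) >= cost(sol):
            return sol                        # local optimum reached
        sol = best

def iterated_local_search(sol, cost, restarts=10):
    best = steepest_descent(sol, cost)
    for _ in range(restarts):
        perturbed = best[:]
        random.shuffle(perturbed)             # crude perturbation scheme
        candidate = steepest_descent(perturbed, cost)
        if cost(candidate) < cost(best):
            best = candidate
    return best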


Starting in the 1970s (see Lin and Kernighan [66]), variable ways of handling neighborhoods have been, and still are, a topic within local search. Consider an arbitrary neighborhood structure N, which defines for any solution s a set of neighbor solutions N_1(s) as a neighborhood of depth d = 1. In a straightforward way, a neighborhood N_{d+1}(s) of depth d + 1 is defined as the set N_d(s) ∪ {s′ | there exists s″ ∈ N_d(s) such that s′ ∈ N_1(s″)}. In general, a large d might be unreasonable, as the neighborhood size may grow exponentially. However, depths of two or three may be appropriate. Furthermore, temporarily increasing the neighborhood depth has been found to be a reasonable mechanism to overcome basins of attraction, e.g., when a large number of neighbors of equal quality exist. Large-scale neighborhoods have become an important topic (see, e.g., [5] for a survey), especially when efficient ways of exploring them are at hand. Related research can also be found under various names; see, e.g., [75] for the idea of ejection chains. Stochastic local search is essentially local search enhanced by randomized choices: a stochastic local search algorithm is a local search algorithm making use of randomized choices in generating or selecting candidate solutions for given instances of optimization problems. Randomness may be used for search initialization as well as for the computation of search steps. A comprehensive treatment of stochastic local search is given in [58].
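The depth recursion above can be enumerated by a breadth-first sweep, as in the following sketch; solutions are assumed hashable (e.g., tuples), and n1 is an assumed hook returning N_1(s).

def depth_neighborhood(s, n1, d):
    # Returns N_d(s) = all solutions reachable from s in 1..d moves.
    # Note the exponential growth in d discussed above.
    frontier = {s}
    seen = set()
    for _ in range(d):
        frontier = {sp for sq in frontier for sp in n1(sq)} - seen
        seen |= frontier
    return seen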

Metaheuristics, Figure 1 Simplified metaheuristics inheritance tree


Metaheuristics
The formal definition of metaheuristics rests on a variety of definitions from different authors, tracing back to [39]. Basically, a metaheuristic is a top-level strategy that guides an underlying heuristic in solving a given problem. In that sense we distinguish between a guiding process and an application process. The guiding process decides upon possible (local) moves and forwards its decision to the application process, which then executes the chosen move. In addition, the application process provides information for the guiding process (depending on the requirements of the respective metaheuristic), such as the recomputed set of possible moves. According to [43], "metaheuristics in their modern forms are based on a variety of interpretations of what constitutes intelligent search", where the term "intelligent search" was made prominent by Pearl [73] (regarding heuristics in an artificial intelligence context; see also [92] regarding an operations research context). In that sense we may also consider the following definition: "A metaheuristic is an iterative generation process which guides a subordinate heuristic by combining intelligently different concepts for exploring and exploiting the search spaces using learning strategies to structure information in order to find efficiently near-optimal solutions" [72]. To summarize, the following definition seems to be most appropriate:


"A metaheuristic is an iterative master process that guides and modifies the operations of subordinate heuristics to efficiently produce high-quality solutions. It may manipulate a complete (or incomplete) single solution or a collection of solutions at each iteration. The subordinate heuristics may be high (or low) level procedures, or a simple local search, or just a construction method. The family of metaheuristics includes, but is not limited to, adaptive memory procedures, tabu search, ant systems, greedy randomized adaptive search, variable neighborhood search, evolutionary methods, genetic algorithms, scatter search, neural networks, simulated annealing, and their hybrids" (p. ix in [97]).
We describe the ingredients and basic concepts of various metaheuristic strategies like tabu search (TS), simulated annealing (SA), and scatter search. This is based on a simplified view of a possible inheritance tree for heuristic search methods, illustrating the relationships between some of the most important methods discussed below, as shown in Fig. 1. We also emphasize advances, including the important incorporation of exact methods into intelligent search. Furthermore, general frames are sketched that may subsume various approaches within the metaheuristics field.

Metaheuristic Methods
We survey the basic concepts of some of the most important metaheuristics. We shall see that adaptive processes originating from different settings such as psychology ("learning"), biology ("evolution"), physics ("annealing"), and neurology ("nerve impulses") have served as interesting starting points.

Simple Local Search Based Metaheuristics
To improve the efficiency of greedy heuristics, one may apply generic strategies that can be used alone or in combination with each other, namely changing the definition of alternative choices, look-ahead evaluation, candidate lists, and randomized selection criteria bound up with repetition, as well as combinations with local search or other methods.

Greedy Randomized Adaptive Search
Replacing the deterministic greedy choice criterion by a random strategy, one can run the algorithm several times and obtain a large number of different solutions. A combination of best and random choice seems to be appropriate: we define a candidate list consisting of a number of best alternatives (i.e., first best, second best, third best, etc.), out of which one alternative is chosen randomly. The length of the candidate list is given either as an absolute value, as a percentage of all feasible alternatives, or implicitly by defining an allowed quality gap to the best alternative, which again may be an absolute value or a percentage. Replicating such a search procedure to determine a local optimum multiple times with different starting points has been given the acronym GRASP (greedy randomized adaptive search procedure) and investigated with respect to different applications. A comprehensive survey of GRASP and its applications is given in [32]. It should be noted that GRASP goes back to older approaches [52], which is frequently overlooked in many applications. The different initial solutions or starting points are found by a greedy procedure incorporating a probabilistic component: given a candidate list to choose from, GRASP randomly chooses one of the best candidates from this list in a greedy fashion, but not necessarily the best possible choice. The underlying principle is to investigate many good starting points through the greedy procedure and thereby to increase the possibility of finding a good local optimum on at least one replication. The method is called adaptive because the greedy function takes into account previous decisions when performing the next choice.
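A compact sketch of this scheme follows; the hooks score, extend, local_search, and cost are hypothetical problem-specific functions introduced here only for illustration, and the fixed-length candidate list is one of the list definitions mentioned above.

import random

def grasp_construct(candidates, score, extend, alpha=3):
    solution = []
    remaining = list(candidates)
    while remaining:
        # Adaptive: the measure depends on the decisions made so far.
        ranked = sorted(remaining, key=lambda c: score(solution, c))
        rcl = ranked[:alpha]             # restricted candidate list of the alpha best
        choice = random.choice(rcl)      # randomized greedy choice
        solution = extend(solution, choice)
        remaining.remove(choice)
    return solution

def grasp(candidates, score, extend, local_search, cost, replications=100):
    best = None
    for _ in range(replications):
        sol = local_search(grasp_construct(candidates, score, extend))
        if best is None or cost(sol) < cost(best):
            best = sol
    return best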

The Pilot Method
Building on a simple greedy algorithm such as a construction heuristic, the pilot method [29,30] is a metaheuristic not necessarily based on local search combined with an improvement procedure. It primarily looks ahead for each possible local choice (by computing a so-called "pilot" solution), memorizes the best result, and performs the respective move. (Very similar ideas have been investigated under the name rollout method [13].) One may apply this strategy by successively performing a greedy heuristic for all possible local steps (i.e., starting with all incomplete solutions resulting from adding some not yet included element at some position to the current incomplete solution). The look-ahead mechanism of the pilot method is related to increased neighborhood depths, as the pilot method exploits the evaluation of neighbors at larger depths to guide the neighbor selection at depth one.


In most applications, it is reasonable to restrict the pilot process to some evaluation depth. That is, the method is performed up to an incomplete solution (e.g., a partial assignment) based on this evaluation depth and is then completed by continuing with a conventional heuristic. For a recent study applying the pilot method to several combinatorial optimization problems and obtaining very good results see [96]. Additional applications can be found, e.g., in [18,68].
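The following sketch shows the full-depth look-ahead step of the pilot method; all hooks (extend, greedy_complete, cost) are illustrative assumptions, with greedy_complete standing for the conventional heuristic that turns a partial solution into a pilot solution.

def pilot_method(elements, extend, greedy_complete, cost):
    partial, remaining = [], set(elements)
    while remaining:
        # Evaluate each local choice by the cost of its completed pilot solution,
        # then commit the move whose pilot evaluates best.
        best_e = min(remaining,
                     key=lambda e: cost(greedy_complete(extend(partial, e),
                                                        remaining - {e})))
        partial = extend(partial, best_e)
        remaining.remove(best_e)
    return partial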

Variable Neighborhood Search
The basic idea of variable neighborhood search (VNS) is to change the neighborhood during the search in a systematic way. VNS usually explores increasingly distant neighborhoods of a given solution and jumps from this solution to a new one if and only if an improvement has been made. In this way favorable characteristics of incumbent solutions, e.g., that many variables are already at their appropriate value, are often kept and used to obtain promising neighboring solutions. Moreover, a local search routine is applied repeatedly to get from these neighboring solutions to local optima. This routine may itself use several neighborhoods. Therefore, to construct different neighborhood structures and to perform a systematic search, one needs a way of measuring the distance between any two solutions, i.e., one needs to supply the solution space with some metric (or quasi-metric) and then induce neighborhoods from it. For an excellent treatment of various aspects of VNS see [51].
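A minimal sketch of the basic VNS loop follows; shake(s, k) is an assumed hook returning a random solution at distance k from s in the induced metric, and local_search is an assumed descent routine.

def vns(s, cost, shake, local_search, k_max=5, iterations=100):
    for _ in range(iterations):
        k = 1
        while k <= k_max:
            candidate = local_search(shake(s, k))
            if cost(candidate) < cost(s):
                s, k = candidate, 1   # improvement: recentre and restart at k = 1
            else:
                k += 1                # no improvement: try a more distant neighborhood
    return s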

Simulated Annealing
Simulated annealing (SA) extends basic local search by allowing moves to inferior solutions [26,64]. A basic SA algorithm may be described as follows: successively, a candidate move is randomly selected; this move is accepted if it leads to a solution with an improved objective function value compared to the current solution; otherwise, the move is accepted with a probability that depends on the deterioration Δ of the objective function value. The acceptance probability is computed as e^(−Δ/T), using a temperature T as a control parameter. Usually, T is reduced over time, for diversification at an earlier stage of the search and for intensification later. Various authors have described robust concretizations of this general SA approach [60,62]. An interesting variant of SA is to strategically reheat the process, i.e., to perform a nonmonotonic acceptance function. Threshold accepting [28] is a modification (or simplification) of SA that accepts every move leading to a new solution which is "not much worse" than the old one, i.e., which deteriorates the objective by no more than a certain threshold that is reduced along with the temperature.
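The acceptance rule above translates directly into code; the geometric cooling schedule in this sketch is an illustrative choice, and random_neighbor is an assumed hook.

import math
import random

def simulated_annealing(s, cost, random_neighbor, t0=10.0, cooling=0.95, steps=10000):
    best, t = s, t0
    for _ in range(steps):
        candidate = random_neighbor(s)
        delta = cost(candidate) - cost(s)
        # Improving moves are always accepted; a deterioration delta is
        # accepted with probability exp(-delta/T).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            s = candidate
            if cost(s) < cost(best):
                best = s
        t *= cooling  # lowering T shifts the search from diversification to intensification
    return best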

Tabu Search
The basic paradigm of tabu search (TS) is to use information about the search history to guide local search approaches to overcome local optimality (see [43] for a survey on TS). In general, this is done by a dynamic transformation of the local neighborhood. Based on some sort of memory, certain moves may be forbidden; we say they are set tabu. As with SA, the search may perform deteriorating moves when no improving moves exist or when all improving moves of the current neighborhood are set tabu. At each iteration a best admissible neighbor may be selected. A neighbor, or a corresponding move, is called admissible if it is not tabu or if an aspiration criterion is fulfilled. An aspiration criterion is a rule to eventually override a possibly unreasonable tabu status of a move. For example, a move that leads to a neighbor with a better objective function value than any encountered so far should be considered admissible. We briefly describe some TS methods that differ especially in the way in which the tabu criteria are defined, taking into consideration the information about the search history (performed moves, traversed solutions). The most commonly used TS method is based on a recency-based memory that stores moves, or attributes characterizing respective moves, of the recent past (static TS). The basic idea of such approaches is to prohibit an appropriately defined inversion of performed moves for a given period. For example, one may store the solution attributes that have been created by a performed move in a tabu list. To obtain the current tabu status of a move to a neighbor, one may check whether (or how many of) the solution attributes that would be destroyed by this move are contained in the tabu list. Strict TS embodies the idea of preventing cycling back to formerly traversed solutions. The goal is to provide


necessity and sufficiency with respect to the idea of not revisiting any solution. Accordingly, a move is classified as tabu if and only if it leads to a neighbor that has already been visited during the previous search. There are two primary mechanisms to accomplish this tabu criterion: first, we may exploit logical interdependencies between the sequence of moves performed throughout the search process, as realized by, e.g., the reverse elimination method and the cancellation sequence method [40,94]; second, we may store information about all solutions visited so far. This may be carried out either exactly or, for reasons of efficiency, approximately (e.g., by using hash codes). Reactive TS aims at the automatic adaptation of the tabu list length of static TS [12]. The idea is to increase the tabu list length when the tabu memory indicates that the search is revisiting formerly traversed solutions. A possible specification can be described as follows: starting with a tabu list length l of 1, the length is increased every time a solution has been repeated; if there has been no repetition for some iterations, it is decreased appropriately. To detect the repetition of a solution, one may apply a trajectory-based memory using hash codes as for strict TS. For reactive TS [12], it is appropriate to include means for diversifying moves whenever the tabu memory indicates that the search is trapped in a certain region of the search space. As a trigger mechanism one may use, e.g., the combination of at least two solutions each having been traversed three times. A very simple escape strategy is to perform a number of random moves (depending on the average number of iterations between solution repetitions); more advanced strategies may take into account some long-term memory information (like the frequencies of appearance of specific solution attributes in the search history). Of course there is a great variety of additional ingredients that may make TS work successfully, e.g., restricting the number of neighbor solutions to be evaluated (using candidate list strategies).
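A sketch of static (recency-based) TS follows; moves(s) is an assumed hook yielding (neighbor, attribute) pairs, attributes of performed moves stay tabu for tenure iterations, and the aspiration criterion admits tabu moves that beat the best solution found so far.

from collections import deque

def tabu_search(s, cost, moves, tenure=7, iterations=200):
    best = s
    tabu = deque(maxlen=tenure)  # the oldest attribute drops out automatically
    for _ in range(iterations):
        candidates = [(nb, attr) for nb, attr in moves(s)
                      if attr not in tabu or cost(nb) < cost(best)]  # admissibility
        if not candidates:
            break
        s, attr = min(candidates, key=lambda c: cost(c[0]))  # best admissible neighbor
        tabu.append(attr)        # inverting this move is forbidden for a while
        if cost(s) < cost(best):
            best = s
    return best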

Evolutionary Algorithms
Evolutionary algorithms comprise a great variety of different concepts and paradigms, including genetic algorithms (GAs) [45,56], evolutionary strategies [55,83], evolutionary programs [36], scatter search [38,41], and memetic algorithms [71]. For surveys and references on evolutionary algorithms see also [9,37,69,78]. GAs are a class of adaptive search procedures based on principles derived from the dynamics of natural population genetics. One of the most crucial ideas for a successful implementation of a GA is the representation of the underlying problem by a suitable scheme. A GA starts, e.g., with a randomly created initial population of artificial creatures (strings), a set of solutions. These strings, in whole and in part, are the base set for all subsequent populations. They are copied, and information is exchanged between the strings, in order to find new solutions of the underlying problem. The mechanisms of a simple GA essentially consist of copying strings and exchanging partial strings. A simple GA uses three operators which are named according to the corresponding biological mechanisms: reproduction, crossover, and mutation. Performing an operator may depend on a fitness function or its value (fitness). As some sort of heuristic measure, this function defines a means of measuring the profit or the quality of the coded solution for the underlying problem and often depends on the objective function of the given problem. GAs are closely related to evolutionary strategies. Whereas the mutation operator in a GA serves to protect the search from premature loss of information, evolution strategies may incorporate some sort of local search procedure (such as SD) with self-adapting parameters involved in the procedure. On a simplified scale many algorithms may be called evolutionary once they are reduced to the following frame [54]:
1. Generate an initial population of individuals.
2. While no stopping condition is met do:
3. Co-operation.
4. Self-adaptation.
Self-adaptation refers to the fact that individuals (solutions) evolve independently, while co-operation refers to an information exchange among individuals.
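The simple GA cycle named above (reproduction, crossover, mutation) can be sketched on fixed-length bit strings as follows; fitness-proportional selection and one-point crossover are illustrative textbook choices, not the only possibilities.

import random

def simple_ga(fitness, length=20, pop_size=30, generations=100, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(ind) for ind in pop]   # fitness guides reproduction
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = random.choices(pop, weights=weights, k=2)  # reproduction
            cut = random.randrange(1, length)                   # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation protects the search from premature loss of information.
            child = [b ^ 1 if random.random() < p_mut else b for b in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# e.g., maximizing the number of ones in the string: simple_ga(sum)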


Scatter search ideas established a link between early ideas from various sides – evolutionary strategies, TS, and GAs. As an evolutionary approach, scatter search originated from strategies for creating composite decision rules and surrogate constraints [38]. Scatter search is designed to operate on a set of points, called reference points, that constitute good solutions obtained from previous solution efforts. The approach systematically generates linear combinations of the reference points to create new points, each of which is mapped into an associated point that yields integer values for discrete variables. Scatter search contrasts with other evolutionary procedures, such as GAs, by providing unifying principles for joining solutions based on generalized path constructions in Euclidean space and by utilizing strategic designs where other approaches resort to randomization. For a very comprehensive treatment of scatter search see [65].

Swarm Intelligence
Swarm intelligence is a relatively novel discipline concerned with the study of self-organizing processes in nature and in human artifacts [15,63]. While researchers in ethology and animal behavior have proposed many models to explain various aspects of social insect behavior such as self-organization and shape formation, algorithms inspired by these models have been proposed to solve optimization problems. Successful examples are the so-called ant system or ant colony paradigm, the bee system, and swarm robotics, where the focus is on applying swarm intelligence techniques to the control of large groups of cooperating autonomous robots. The ant system is a dynamic optimization process reflecting the natural interaction between ants searching for food [23]. The ants' ways are influenced by two different kinds of search criteria. The first one is the local visibility of food, i.e., the attractiveness of food in each ant's neighborhood. Additionally, each ant's way through its food space is affected by the other ants' trails as indicators of possibly good directions. The intensity of the trails itself is time-dependent: with time passing, parts of the trails diminish, while the intensity may increase through new and fresh trails. With the quantities of these trails changing dynamically, an autocatalytic optimization process is started, forcing the ants' search into the most promising regions. This process of interactive learning can easily be modeled for most kinds of optimization problems by using simultaneously and interactively processed search trajectories. A comprehensive treatment of the ant system paradigm can be found in [24]. To achieve enhanced performance of the ant system it is useful to hybridize it at least with a local search component.
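The two search criteria just described can be sketched as follows; the weighting trail^alpha * visibility^beta and the evaporation-plus-deposit update are standard ant system ingredients, while the data structures here (dictionaries keyed by solution components) are illustrative assumptions.

import random

def choose_next(options, trail, visibility, alpha=1.0, beta=2.0):
    # Each ant favors components with strong trails and high local visibility.
    weights = [(trail[o] ** alpha) * (visibility[o] ** beta) for o in options]
    return random.choices(options, weights=weights, k=1)[0]

def update_trails(trail, solutions, cost, evaporation=0.1):
    for o in trail:
        trail[o] *= (1.0 - evaporation)   # trails diminish as time passes
    for sol in solutions:
        deposit = 1.0 / cost(sol)         # better solutions reinforce more strongly
        for o in sol:
            trail[o] += deposit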


Miscellaneous
Target analysis may be viewed as a general learning approach. Given a problem, we first explore a set of sample instances, and an extensive effort is made to obtain solutions which are optimal or close to optimality. The best solutions obtained provide targets to be sought within the next part of the approach. For instance, a TS algorithm may re-solve the problems with the aim of finding out which choices lead to the already known solution (or as close to it as possible). This may give some information on how to choose parameters for other problem instances. A different method in this context is path relinking (PR), which provides a useful means of intensification and diversification. Here new solutions are generated by exploring search trajectories that combine elite solutions, i.e., solutions that have proven to be better than others throughout the search. For references on target analysis and PR see [43]. Recalling local search based on data perturbation, the noising method may be related to the following approach too: given an initial feasible solution, the method performs some data perturbation [87] in order to change the values taken by the objective function of the respective problem to be solved. On the perturbed data a local search may be performed (e.g., following an SD approach). The amount of data perturbation (the noise added) is successively reduced until it reaches zero. The noising method is applied, e.g., in [19] to the clique partitioning problem; a minimal sketch follows this paragraph. The key issue in designing parallel algorithms is to decompose the execution of the various ingredients of a procedure into processes executable by parallel processors. In contrast to ant systems or GAs, metaheuristics like TS or SA at first glance have an intrinsically sequential nature, owing to the idea of performing the neighborhood search from one solution to the next. However, some effort has been undertaken to define templates for parallel local search [20,90,91,93]. A comprehensive treatment with successful applications is provided in [6]. The discussion of parallel metaheuristics has also led to interesting hybrids, such as the combination of a population of individual processes (agents) in a cooperative and competitive nature (see, e.g., the discussion of memetic algorithms in [71]) with TS.
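The promised noising sketch: local search is run on objective values perturbed by uniform noise whose amplitude shrinks to zero; local_search(s, cost) is an assumed descent routine, and the uniform noise model and decay factor are illustrative choices.

import random

def noising_method(s, cost, local_search, noise0=1.0, decay=0.8, rounds=20):
    noise = noise0
    for _ in range(rounds):
        noisy_cost = lambda x, r=noise: cost(x) + random.uniform(-r, r)
        s = local_search(s, noisy_cost)
        noise *= decay                 # successively reduce the added noise
    return local_search(s, cost)       # final descent on the unperturbed data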


Neural networks may themselves be considered as metaheuristics, although we have not treated them here; see [85] for a comprehensive survey of these techniques for combinatorial optimization. Conversely, one may use metaheuristics to speed up the learning process for artificial neural networks; see [7] for a comprehensive consideration. Furthermore, recent efforts on problems with multiple objectives and corresponding metaheuristic approaches can be found in [61]. See [82] for some ideas regarding GAs and fuzzy multiobjective optimization.

General Frames
An important avenue of metaheuristics research concerns general frames (to explain the behavior and the relationships between various methods) as well as the development of software systems incorporating metaheuristics (possibly in combination with other methods). Among other aspects, this takes into consideration that in metaheuristics it has very often proven appropriate to incorporate a certain means of diversification versus intensification to lead the search into new regions of the search space. This requires a meaningful mechanism for detecting situations when the search might be trapped in a certain area of the solution space. Therefore, within intelligent search the exploration of memory plays a most important role.

Adaptive Memory Programming
Adaptive memory programming (AMP) coins a general approach (or even way of thinking) within heuristic search focusing on exploiting a collection of memory components [42,89]. An AMP process iteratively constructs (new) solutions based on the exploitation of some memory, especially when combined with learning mechanisms supporting the collection and use of the memory. Based on the idea of initializing the memory and then iteratively generating new solutions (utilizing the given memory) while updating the memory based on the search, we may subsume various of the above-described metaheuristics as AMP approaches. This also includes exploiting provisional solutions that are improved by a local search approach. The performance as well as the efficiency of a heuristic scheme strongly depends on its ability to use AMP techniques providing flexible and variable strategies for types of problems (or special instances of a given problem type) where standard methods fail. Such AMP techniques could be, e.g., dynamic handling of operational restrictions, dynamic move selection formulas, and flexible function evaluations. Consider, as an example, adaptive memory within TS concepts. How AMP principles are realized depends on the specific TS application. For example, the reverse elimination method observes logical interdependencies between moves and infers corresponding tabu restrictions, and therefore makes fuller use of AMP than simple static approaches do. To discuss the use of AMP in intelligent agent systems, one may use the simple model of ant systems as an illustrative starting point. Ant systems are based on combining local search criteria with information derived from the trails. This follows the AMP requirement of using flexible (dynamic) move selection rules (formulas). However, the basic ant system exhibits some structural inefficiencies when viewed from the perspective of general intelligent agent systems, as no distinction is made between successful and less successful agents, no time-dependent distinction is made, and there is no explicit handling of restrictions providing protection against cycling and duplication. Furthermore, there are possible conflicts between the pieces of information held in the adaptive memory (diverging trails).

A Pool Template
In [48] a pool template (PT) is proposed, as can be seen in Fig. 2. The following notation is used: a pool of p ≥ 1 solutions is denoted by P; its input and output transfer is managed by two functions called IF and OF, respectively; S is a set of solutions with cardinality s ≥ 1; a solution combination method (procedure SCM) constructs a solution from a given set S; and IM is an improvement method. Depending on the method used, in step 1 a pool is either completely (or partially) built by a (randomized) diversification generator or filled with a single solution which has been provided, e.g., by a simple greedy approach. Note that a crucial parameter deserving careful elaboration is the cardinality p of the pool. The main loop, executed until a termination criterion holds, consists of steps 2–5.


1. Initialize P by an external procedure
WHILE termination = FALSE DO
BEGIN
2. S := OF(P)
3. IF s > 1 THEN S′ := SCM(S) ELSE S′ := S
4. S″ := IM(S′)
5. P := IF(S″)
END
6. Apply a post-optimizing procedure to P

Metaheuristics, Figure 2 Pool template

Step 2 is the call of the output function, which selects a set of solutions, S, from the pool. Depending on the kind of method represented in the PT, these solutions may be assembled (step 3) into a working solution S′, which is the starting point for the improvement phase of step 4. The outcome of the improvement phase, S″, is then evaluated by means of the input function, which possibly feeds the new solution into the pool. Note that the post-optimization procedure in step 6 is for facultative use; it may be a straightforward greedy improvement procedure if used for single-solution heuristics, or a pool method in its own right. As an example we quote a sequential pool method, the TS with PR in [11]. Here a PR phase is added after the pool has been initialized by a TS. A parallel pool method, on the other hand, uses a pool of solutions while it is constructed by the guiding process (e.g., a GA or scatter search). Several heuristic and metaheuristic paradigms, whether they are obviously pool-oriented or not, can be summarized under the common PT frame. We provide the following examples:
a) Local search/SD: PT with p = s = 1.
b) SA: p = 2, s = 1, incorporating its probabilistic acceptance criterion in IM. (It should be noted that p = 2 and s = 1 may seem unusual at first glance. For SA we always have a current solution in the pool for which one or more neighbors are evaluated, and eventually a neighbor is found which replaces the current solution. Furthermore, at all iterations throughout the search the best solution found so far is stored too, even if no real interaction between those two stored solutions takes place. The same is also valid for a simple TS. Since for local search the current solution corresponds to the best solution of the specific search, we have p = 1 there.)
c) Standard TS: p = 2, s = 1, incorporating adaptive memory in IM.
d) GAs: p > 1 and s > 1 with the population mechanisms (crossover, reproduction, and mutation) in SCM of step 3 and without the use of step 4.
e) Scatter search: p > 1 and s > 1 with subset generation in OF of step 2, linear combination of elite solutions by means of SCM in step 3, e.g., a TS for procedure IM, and a reference set update method in IF of step 5.
f) PR (as a parallel pool method): p > 1 and s = 2 with a PR neighborhood in SCM. Facultative use of step 4.
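The PT of Fig. 2 can be transcribed almost literally into code; in this sketch all concrete behavior lives in the hooks passed in, exactly as in the examples a)–f) above, and the hook signatures are our assumptions.

def pool_template(initialize, OF, SCM, IM, IF, terminated, post_optimize):
    P = initialize()                        # step 1: external pool initialization
    while not terminated(P):
        S = OF(P)                           # step 2: select s solutions from the pool
        S1 = SCM(S) if len(S) > 1 else S[0] # step 3: combine only if s > 1
        S2 = IM(S1)                         # step 4: improvement method
        P = IF(P, S2)                       # step 5: input function may admit S2
    return post_optimize(P)                 # step 6: facultative post-optimization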

Partial Optimization Metaheuristic Under Special Intensification Conditions
A natural way to solve large optimization problems is to decompose them into independent subproblems that are solved with an appropriate procedure. However, such approaches may lead to solutions of moderate quality, since the subproblems might have been created in a somewhat arbitrary fashion, and it is not easy to find an appropriate way to decompose a problem a priori. The basic idea of POPMUSIC is to locally optimize subparts of a solution, a posteriori, once a solution to the problem is available. These local optimizations are repeated until a local optimum is found. Therefore, POPMUSIC may be viewed as a local search working with a special, large neighborhood. While the acronym POPMUSIC (partial optimization metaheuristic under special intensification conditions) was coined by Taillard and Voß [88], other metaheuristics may be incorporated into the same framework too [84]. For large optimization problems, it is often possible to see the solutions as composed of parts (or chunks [102]; cf. the term "vocabulary building"). Considering the vehicle routing problem, a part may be a tour (or even a customer). Suppose that a solution can be represented as a set of parts. Moreover, some parts are more in relation with some other parts, so a corresponding heuristic measure can be defined between two parts. The central idea of POPMUSIC is to select a so-called seed part and a set P of parts that are most related to the seed part to form a subproblem. Then it is possible to state a local search optimization frame that consists of trying to improve all subproblems that can be defined, until the solution does not contain a subproblem that can be improved.


In the POPMUSIC frame of [88], P corresponds precisely to the seed parts that have been used to define subproblems that were optimized without success. Once P contains all the parts of the complete solution, all subproblems have been examined without success and the process stops. Basically, the technique is a gradient method that starts from a given initial solution and stops in a local optimum relative to a large neighborhood structure. To summarize, both POPMUSIC and AMP may serve as general frames encompassing various other approaches.
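The frame just summarized may be sketched as follows; related(solution, seed, r) is an assumed hook returning the r parts most related to the seed, and optimize_subproblem is an assumed hook returning a flag and the (possibly improved) solution.

def popmusic(solution, parts, related, optimize_subproblem, r=5):
    finished = set()               # seeds whose subproblem resisted improvement
    while len(finished) < len(parts):
        seed = next(p for p in parts if p not in finished)
        subproblem = related(solution, seed, r)
        improved, solution = optimize_subproblem(solution, subproblem)
        if improved:
            finished.clear()       # an improvement may reopen other subproblems
        else:
            finished.add(seed)     # examined without success
    return solution                # local optimum w.r.t. the large neighborhood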

Hybrids with Exact Methods
Often a new idea or a new paradigm in metaheuristics is claimed by its inventor to be the essential idea, while others initially see it as useless; however, once it has been hybridized, things begin to fly. Especially in population-based metaheuristics, many researchers have followed this trend. That is, we now see many hybrid approaches in which the successful ingredients of various metaheuristics have been combined. The term "hybridization", however, goes further, as it also refers to combining metaheuristics with exact methods. Traditionally, the structure of neighborhoods is determined by local transformations or moves. This usually refers to relatively small homogeneous neighborhoods. Different types of moves have been used in the construction of very large and diverse neighborhoods. In contrast, as a hybrid one may deploy neighborhoods that are method-based. By this we mean that the basic structure of a neighborhood is determined by the needs and requirements of a given (say, exact) optimization method used to search the neighborhood. That is, given an incumbent solution, one may define the neighborhood so that an exact method can be efficiently used, rather than defining a neighborhood and then trying to find an appropriate method to explore it. This approach was called the corridor method by Sniedovich and Voß [86], as it literally defines a neighborhood as a sufficiently sized corridor around a given solution so that a given exact method behaves well. Iteratively, the corridor is moved through the search space for exploration.
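A sketch of this iteration scheme, under the assumption that build_corridor constructs the method-based neighborhood around the incumbent and solve_exact_within wraps the exact method:

def corridor_method(incumbent, cost, build_corridor, solve_exact_within, iterations=50):
    for _ in range(iterations):
        corridor = build_corridor(incumbent)      # restricted subproblem around incumbent
        candidate = solve_exact_within(corridor)  # exact search of the corridor
        if cost(candidate) >= cost(incumbent):
            break                                 # no progress: stop (or widen the corridor)
        incumbent = candidate                     # move the corridor through the search space
    return incumbent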

Constraint programming (CP) is a paradigm for representing and solving a wide variety of problems expressed by means of variables, their domains, and constraints on the variables. Usually CP models are solved using depth-first search and branch and bound. Naturally, these concepts can be complemented by local search concepts and metaheuristics. This idea has been followed by several authors; see [21] for TS and guided local search hybrids. Commonalities with the POPMUSIC approach can be deduced from [74]. Of course, the treatment of this topic is by no means complete, and various further ideas have been developed. One idea is to transform a greedy heuristic into a search algorithm by branching only in a few (i.e., a limited number of) cases, namely when the choice criterion of the heuristic observes some borderline case or where the choice is least compelling. This approach may be called limited discrepancy search [17,53]. Independently of the CP concept, one may investigate hybrids of branch and bound and metaheuristics, e.g., for deciding upon branching variables or search paths to be followed within a branch and bound tree (see [103] for an application of reactive TS). Here we may also use the term "cooperative solver". Somewhat related is the local branching concept for solving mixed integer programs (MIPs), which explores neighborhoods defined through (invalid) linear cuts; the neighborhoods are searched by means of a general purpose MIP solver [35]. Correspondingly, one of the current research issues refers to exploiting mathematical programming (MP) techniques in a (meta)heuristic framework or, conversely, granting to MP approaches the cross-problem robustness and computation time effectiveness which characterize metaheuristics. The discriminating landmark is some form of exploitation of an MP formulation, e.g., by means of a MIP. In this respect various efforts have been made towards developing strategies for making a heuristic sequence of roundings to obtain feasible solutions for problems represented by means of an appropriate MIP [3,34].
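For 0-1 programs, the local branching cut of [35] restricts the search to solutions within Hamming distance k of the incumbent xbar via the linear constraint sum over j with xbar_j = 1 of (1 − x_j) plus sum over j with xbar_j = 0 of x_j ≤ k. The sketch below only evaluates this cut as a membership test for a candidate vector; embedding it in an actual MIP solver is beyond the scope of this illustration.

def within_local_branch(x, xbar, k):
    # Hamming distance of x from the incumbent xbar, written as the cut's left-hand side.
    distance = sum(1 - xj if xbarj == 1 else xj for xj, xbarj in zip(x, xbar))
    return distance <= k

print(within_local_branch([1, 0, 1], [1, 1, 1], k=1))  # prints True (one flip away)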


Optimization Software Libraries
Besides some well-known approaches for reusable software in the field of exact optimization (e.g., CPLEX or ABACUS; see http://www.ilog.com and http://www.informatik.uni-koeln.de/abacus), some ready-to-use and well-documented component libraries in the field of local search based heuristics and metaheuristics have been developed; see especially the contributions in [98]. The most successful approaches documented in the literature are the heuristic optimization framework HOTFRAME of [33] and EASYLOCAL++ of [22]. HOTFRAME, as an example, is implemented in C++ and provides adaptable components incorporating different metaheuristics, together with an architectural description of the collaboration among these components and problem-specific complements. Typical application-specific concepts are treated as objects or classes: problems, solutions, neighbors, solution attributes, and move attributes. On the other hand, metaheuristic concepts, such as the different methods described above and their building blocks (tabu criteria, diversification strategies, and the like), are also treated as objects. HOTFRAME uses genericity as the primary mechanism to make these objects adaptable: common behavior of metaheuristics is factored out and grouped in generic classes, applying static type variation. Metaheuristics template classes are parameterized by aspects such as solution spaces and neighborhood structures.

Applications
Applications of metaheuristics are almost uncountable and appear in various journals (e.g., the Journal of Heuristics), books, and technical reports every day. A helpful source for a subset of successful applications may be special issues of journals or compilations such as [25,77,79,97], to mention just a few. Specialized conferences like the Metaheuristics International Conference are devoted to the topic [25,59,72,80,81,97], and even more general conferences reveal that metaheuristics have become part of the necessary prerequisites for successfully solving optimization problems [46]. Moreover, ready-to-use systems such as class libraries and frameworks have been developed, although they are usually restricted to use by the knowledgeable user. Specialized applications also reveal research needs, e.g., in dynamic environments. One example refers to the application of metaheuristics for online optimization [49].


Conclusions
Over the last few decades metaheuristics have become a substantial part of the optimization stockroom, with various applications in science and, even more important, in practice. Metaheuristics have entered textbooks, e.g., in operations research, and a wealth of monographs [27,43,70,92] is available. Most important, in our view, are general frames. AMP, an intelligent interplay of intensification and diversification (such as the ideas behind POPMUSIC), and the connection to powerful exact algorithms as subroutines for tractable subproblems are the avenues to be followed. From a theoretical point of view, the use of most metaheuristics has not yet been fully justified. While convergence results regarding solution quality exist for most metaheuristics once appropriate probabilistic assumptions are made [8,31,50], these results turn out not to be very helpful in practice, as a disproportionate computation time is usually required to achieve them (usually convergence is achieved for a computation time tending to infinity, with a few exceptions, e.g., for the reverse elimination method within TS or the pilot method, where optimality can be achieved with a finite, but in the worst case exponential, number of steps). Furthermore, we have to admit that, theoretically, one may argue that none of the metaheuristics described is on average better than any other; there is no free lunch [101]. Basically this leaves the choice of the best possible heuristic, or of related ingredients, to the ingenuity of the user/researcher. Some researchers relate the term "hyperheuristics" to the question of which (heuristic) method among a given set of methods to choose for a given problem [16]. Moreover, despite the widespread success of various metaheuristics, researchers occasionally still have a poor understanding of many key theoretical aspects of these algorithms, including models of the high-level run-time dynamics and the identification of search space features that influence problem difficulty [99]. Fitness landscape evaluations are considered to be in their infancy too. From an empirical standpoint it would be most interesting to know which algorithms perform best under various criteria for different classes of problems. Unfortunately, this theme is out of reach as long as we do not have any well-accepted standards regarding the testing and comparison of different methods.


While most papers on metaheuristics claim to provide "high-quality" results based on some sort of measure, we still believe that there is a great deal of room for improvement in testing existing as well as new approaches from an empirical point of view [10,57,67]. In a dynamic research process, numerical results provide the basis for systematically developing efficient algorithms. The essential conclusions of finished research and development processes should always be substantiated (i.e., empirically and, if necessary, statistically proven) by numerical results based on an appropriate empirical test cycle. Furthermore, even when excellent numerical results are obtained, it may still be possible to compare with a simple random restart procedure and obtain better results in some cases [47]; however, this comparison is usually neglected. The ways of preparing, performing, and presenting experiments and their results usually differ significantly. The lack of a generally accepted standard for testing and for reporting on the testing, or at least of a corresponding guideline for designing experiments, unfortunately implies the following observation: some results can be used only in a restricted way, e.g., because relevant data are missing, wrong environmental settings are used, or results are simply glossed over. In the worst case, insufficiently prepared experiments provide results that are unfit for further use, i.e., any generalized conclusion is out of reach. Future algorithm research needs to provide effective methods for analyzing the performance of, e.g., heuristics in a more scientifically founded way (see [4,100] for some steps in this direction). A final aspect that deserves special consideration is the investigation of the use of information within different metaheuristics. While the AMP frame provides a very good entry into this area, it remains an interesting opportunity to link artificial intelligence with operations research concepts.

References
1. Aarts EHL, Lenstra JK (eds) (1997) Local Search in Combinatorial Optimization. Wiley, Chichester
2. Aarts EHL, Verhoeven M (1997) Local search. In: Dell'Amico M, Maffioli F, Martello S (eds) Annotated Bibliographies in Combinatorial Optimization. Wiley, Chichester, pp 163–180
3. Achterberg T, Berthold T (2007) Improving the feasibility pump. Discret Optim 4:77–86
4. Adenso-Diaz B, Laguna M (2006) Fine-tuning of algorithms using fractional experimental designs and local search. Oper Res 54:99–114
5. Ahuja RK, Ergun O, Orlin JB, Punnen AB (2002) A survey of very large-scale neighborhood search techniques. Discret Appl Math 123:75–102
6. Alba E (ed) (2005) Parallel Metaheuristics. Wiley, Hoboken
7. Alba E, Marti R (eds) (2006) Metaheuristic Procedures for Training Neural Networks. Springer, New York
8. Althöfer I, Koschnick KU (1991) On the convergence of 'threshold accepting'. Appl Math Optim 24:183–195
9. Bäck T, Fogel DB, Michalewicz Z (eds) (1997) Handbook of Evolutionary Computation. Institute of Physics Publishing, Bristol
10. Barr RS, Golden BL, Kelly JP, Resende MGC, Stewart WR (1995) Designing and reporting on computational experiments with heuristic methods. J Heuristics 1:9–32
11. Bastos MB, Ribeiro CC (2002) Reactive tabu search with path relinking for the Steiner problem in graphs. In: Ribeiro CC, Hansen P (eds) Essays and Surveys in Metaheuristics. Kluwer, Boston, pp 39–58
12. Battiti R, Tecchiolli G (1994) The reactive tabu search. ORSA J Comput 6:126–140
13. Bertsekas DP, Tsitsiklis JN, Wu C (1997) Rollout algorithms for combinatorial optimization. J Heuristics 3:245–262
14. Blum C, Roli A (2003) Metaheuristics in combinatorial optimization: Overview and conceptual comparison. ACM Comput Surv 35:268–308
15. Bonabeau E, Dorigo M, Theraulaz G (eds) (1999) Swarm Intelligence – From Natural to Artificial Systems. Oxford University Press, New York
16. Burke EK, Kendall G, Newall J, Hart E, Ross P, Schulenburg S (2003) Hyper-heuristics: An emerging direction in modern search technology. In: Glover FW, Kochenberger GA (eds) Handbook of Metaheuristics. Kluwer, Boston, pp 457–474
17. Caseau Y, Laburthe F, Silverstein G (1999) A meta-heuristic factory for vehicle routing problems. Lect Notes Comput Sci 1713:144–158
18. Cerulli R, Fink A, Gentili M, Voß S (2006) Extensions of the minimum labelling spanning tree problem. J Telecommun Inf Technol 4/2006:39–45
19. Charon I, Hudry O (1993) The noising method: A new method for combinatorial optimization. Oper Res Lett 14:133–137
20. Crainic TG, Toulouse M, Gendreau M (1997) Toward a taxonomy of parallel tabu search heuristics. INFORMS J Comput 9:61–72
21. de Backer B, Furnon V, Shaw P, Kilby P, Prosser P (2000) Solving vehicle routing problems using constraint programming and metaheuristics. J Heuristics 6:501–523


22. Di Gaspero L, Schaerf A (2003) EASYLOCAL++: An object-oriented framework for the flexible design of local-search algorithms. Softw Pract Exper 33:733–765
23. Dorigo M, Maniezzo V, Colorni A (1996) Ant system: Optimization by a colony of cooperating agents. IEEE Trans Syst Man Cybern B 26:29–41
24. Dorigo M, Stützle T (2004) Ant Colony Optimization. MIT Press, Cambridge
25. Dörner KF, Gendreau M, Greistorfer P, Gutjahr WJ, Hartl RF, Reimann M (eds) (2007) Metaheuristics: Progress in Complex Systems Optimization. Springer, New York
26. Dowsland KA (1993) Simulated annealing. In: Reeves C (ed) Modern Heuristic Techniques for Combinatorial Problems. Halsted, Blackwell, pp 20–69
27. Dreo J, Petrowski A, Siarry P, Taillard E (2006) Metaheuristics for Hard Optimization. Springer, Berlin
28. Dueck G, Scheuer T (1990) Threshold accepting: A general purpose optimization algorithm appearing superior to simulated annealing. J Comput Phys 90:161–175
29. Duin CW, Voß S (1994) Steiner tree heuristics – a survey. In: Dyckhoff H, Derigs U, Salomon M, Tijms HC (eds) Operations Research Proceedings. Springer, Berlin, pp 485–496
30. Duin CW, Voß S (1999) The pilot method: A strategy for heuristic repetition with application to the Steiner problem in graphs. Netw 34:181–191
31. Faigle U, Kern W (1992) Some convergence results for probabilistic tabu search. ORSA J Comput 4:32–37
32. Festa P, Resende MGC (2004) An annotated bibliography of GRASP. Technical report, AT&T Labs Research, Florham Park
33. Fink A, Voß S (2002) HotFrame: A heuristic optimization framework. In: Voß S, Woodruff DL (eds) Optimization Software Class Libraries. Kluwer, Boston, pp 81–154
34. Fischetti M, Glover F, Lodi A (2005) The feasibility pump. Math Program A 104:91–104
35. Fischetti M, Lodi A (2003) Local branching. Math Program B 98:23–47
36. Fogel DB (1993) On the philosophical differences between evolutionary algorithms and genetic algorithms. In: Fogel DB, Atmar W (eds) Proceedings of the Second Annual Conference on Evolutionary Programming. Evolutionary Programming Society, La Jolla, pp 23–29
37. Fogel DB (1995) Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, New York
38. Glover F (1977) Heuristics for integer programming using surrogate constraints. Decis Sci 8:156–166
39. Glover F (1986) Future paths for integer programming and links to artificial intelligence. Comput Oper Res 13:533–549
40. Glover F (1990) Tabu search – Part II. ORSA J Comput 2:4–32
41. Glover F (1995) Scatter search and star-paths: Beyond the genetic metaphor. OR Spektrum 17:125–137


42. Glover F (1997) Tabu search and adaptive memory programming – Advances, applications and challenges. In: Barr RS, Helgason RV, Kennington JL (eds) Interfaces in Computer Science and Operations Research: Advances in Metaheuristics, Optimization and Stochastic Modeling Technologies. Kluwer, Boston, pp 1–75
43. Glover F, Laguna M (1997) Tabu Search. Kluwer, Boston
44. Glover FW, Kochenberger GA (eds) (2003) Handbook of Metaheuristics. Kluwer, Boston
45. Goldberg DE (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading
46. Golden BL, Raghavan S, Wasil EA (eds) (2005) The Next Wave in Computing, Optimization, and Decision Technologies. Kluwer, Boston
47. Gomes AM, Oliveira JF (2006) Solving irregular strip packing problems by hybridising simulated annealing and linear programming. Eur J Oper Res 171:811–829
48. Greistorfer P, Voß S (2005) Controlled pool maintenance for meta-heuristics. In: Rego C, Alidaee B (eds) Metaheuristic Optimization via Memory and Evolution. Kluwer, Boston, pp 387–424
49. Gutenschwager K, Niklaus C, Voß S (2004) Dispatching of an electric monorail system: Applying meta-heuristics to an online pickup and delivery problem. Transp Sci 38:434–446
50. Hajek B (1988) Cooling schedules for optimal annealing. Math Oper Res 13:311–329
51. Hansen P, Mladenović N (1999) An introduction to variable neighborhood search. In: Voß S, Martello S, Osman IH, Roucairol C (eds) Meta-Heuristics: Advances and Trends in Local Search Paradigms for Optimization. Kluwer, Boston, pp 433–458
52. Hart JP, Shogan AW (1987) Semi-greedy heuristics: An empirical study. Oper Res Lett 6:107–114
53. Harvey W, Ginsberg M (1995) Limited discrepancy search. In: Proceedings of the 14th IJCAI. Morgan Kaufmann, San Mateo, pp 607–615
54. Hertz A, Kobler D (2000) A framework for the description of evolutionary algorithms. Eur J Oper Res 126:1–12
55. Hoffmeister F, Bäck T (1991) Genetic algorithms and evolution strategies: Similarities and differences. Lect Notes Comput Sci 496:455–469
56. Holland JH (1975) Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor
57. Hooker JN (1995) Testing heuristics: We have it all wrong. J Heuristics 1:33–42
58. Hoos HH, Stützle T (2005) Stochastic Local Search – Foundations and Applications. Elsevier, Amsterdam
59. Ibaraki T, Nonobe K, Yagiura M (eds) (2005) Metaheuristics: Progress as Real Problem Solvers. Springer, New York
60. Ingber L (1996) Adaptive simulated annealing (ASA): Lessons learned. Control Cybern 25:33–54
61. Jaszkiewicz A (2004) A comparative study of multiple-objective metaheuristics on the bi-objective set covering problem and the Pareto memetic algorithm. Ann Oper Res 131:215–235
62. Johnson DS, Aragon CR, McGeoch LA, Schevon C (1989) Optimization by simulated annealing: An experimental evaluation; Part I, graph partitioning. Oper Res 37:865–892
63. Kennedy J, Eberhart RC (2001) Swarm Intelligence. Elsevier, Amsterdam
64. Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680
65. Laguna M, Martí R (2003) Scatter Search. Kluwer, Boston
66. Lin S, Kernighan BW (1973) An effective heuristic algorithm for the traveling-salesman problem. Oper Res 21:498–516
67. McGeoch C (1996) Toward an experimental method for algorithm simulation. INFORMS J Comput 8:1–15
68. Meloni C, Pacciarelli D, Pranzo M (2004) A rollout metaheuristic for job shop scheduling problems. Ann Oper Res 131:215–235
69. Michalewicz Z (1999) Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn. Springer, Berlin
70. Michalewicz Z, Fogel DB (2004) How to Solve It: Modern Heuristics, 2nd edn. Springer, Berlin
71. Moscato P (1993) An introduction to population approaches for optimization and hierarchical objective functions: A discussion on the role of tabu search. Ann Oper Res 41:85–121
72. Osman IH, Kelly JP (eds) (1996) Meta-Heuristics: Theory and Applications. Kluwer, Boston
73. Pearl J (1984) Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading
74. Pesant G, Gendreau M (1999) A constraint programming framework for local search methods. J Heuristics 5:255–279
75. Pesch E, Glover F (1997) TSP ejection chains. Discret Appl Math 76:165–182
76. Polya G (1945) How to Solve It. Princeton University Press, Princeton
77. Rayward-Smith VJ, Osman IH, Reeves CR, Smith GD (eds) (1996) Modern Heuristic Search Methods. Wiley, Chichester
78. Reeves CR, Rowe JE (2002) Genetic Algorithms: Principles and Perspectives. Kluwer, Boston
79. Rego C, Alidaee B (eds) (2005) Metaheuristic Optimization via Memory and Evolution. Kluwer, Boston
80. Resende MGC, de Sousa JP (eds) (2004) Metaheuristics: Computer Decision-Making. Kluwer, Boston
81. Ribeiro CC, Hansen P (eds) (2002) Essays and Surveys in Metaheuristics. Kluwer, Boston
82. Sakawa M (2001) Genetic Algorithms and Fuzzy Multiobjective Optimization. Kluwer, Boston
83. Schwefel HP, Bäck T (1998) Artificial evolution: How and why? In: Quagliarella D, Périaux J, Poloni C, Winter G (eds) Genetic Algorithms and Evolution Strategy in Engineering and Computer Science: Recent Advances and Industrial Applications. Wiley, Chichester, pp 1–19
84. Shaw P (1998) Using constraint programming and local search methods to solve vehicle routing problems. Working paper, ILOG SA, Gentilly
85. Smith K (1999) Neural networks for combinatorial optimisation: A review of more than a decade of research. INFORMS J Comput 11:15–34
86. Sniedovich M, Voß S (2006) The corridor method: A dynamic programming inspired metaheuristic. Control Cybern 35:551–578
87. Storer RH, Wu SD, Vaccari R (1995) Problem and heuristic space search strategies for job shop scheduling. ORSA J Comput 7:453–467
88. Taillard E, Voß S (2002) POPMUSIC – Partial optimization metaheuristic under special intensification conditions. In: Ribeiro CC, Hansen P (eds) Essays and Surveys in Metaheuristics. Kluwer, Boston, pp 613–629
89. Taillard ÉD, Gambardella LM, Gendreau M, Potvin JY (2001) Adaptive memory programming: A unified view of meta-heuristics. Eur J Oper Res 135:1–16
90. Vaessens RJM, Aarts EHL, Lenstra JK (1998) A local search template. Comput Oper Res 25:969–979
91. Verhoeven MGA, Aarts EHL (1995) Parallel local search techniques. J Heuristics 1:43–65
92. Voß S (1993) Intelligent Search. Manuscript, TU Darmstadt
93. Voß S (1993) Tabu search: Applications and prospects. In: Du DZ, Pardalos P (eds) Network Optimization Problems. World Scientific, Singapore, pp 333–353
94. Voß S (1996) Observing logical interdependencies in tabu search: Methods and results. In: Rayward-Smith VJ, Osman IH, Reeves CR, Smith GD (eds) Modern Heuristic Search Methods. Wiley, Chichester, pp 41–59
95. Voß S (2001) Meta-heuristics: The state of the art. Lect Notes Artif Intell 2148:1–23
96. Voß S, Fink A, Duin C (2004) Looking ahead with the pilot method. Ann Oper Res 136:285–302
97. Voß S, Martello S, Osman IH, Roucairol C (eds) (1999) Meta-Heuristics: Advances and Trends in Local Search Paradigms for Optimization. Kluwer, Boston
98. Voß S, Woodruff DL (eds) (2002) Optimization Software Class Libraries. Kluwer, Boston
99. Watson JP, Whitley LD, Howe AE (2005) Linking search space structure, run-time dynamics, and problem difficulty: A step toward demystifying tabu search. J Artif Intell Res 24:221–261
100. Whitley D, Rana S, Dzubera J, Mathias KE (1996) Evaluating evolutionary algorithms. Artif Intell 85:245–276
101. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1:67–82
102. Woodruff DL (1998) Proposals for chunking and tabu search. Eur J Oper Res 106:585–598
103. Woodruff DL (1999) A chunking based selection strategy for integrating meta-heuristics with branch and bound. In: Voß S, Martello S, Osman IH, Roucairol C (eds) Meta-Heuristics: Advances and Trends in Local Search Paradigms for Optimization. Kluwer, Boston, pp 499–511

Metropolis, Nicholas Constantine

PANOS M. PARDALOS
Center for Applied Optimization, Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA

MSC2000: 90C05, 90C25

Article Outline
Keywords
References

Keywords
Metropolis; Simulated annealing; Monte-Carlo method

Nicholas Constantine Metropolis was born in Chicago on June 11, 1915 and died on October 17, 1999 in Los Alamos. At Los Alamos, Metropolis was the main driving force behind the development of the MANIAC series of electronic computers. He was the first to code a problem for the ENIAC in 1945–1946 (together with S. Frankel), a task which consumed approximately 1,000,000 IBM punched cards.

Metropolis received his PhD in physics from the University of Chicago in 1941. He went to Los Alamos in 1943 as a member of the initial staff of fifty scientists of the Manhattan Project. He spent his entire career at Los Alamos, except for two periods (1946–1948 and 1957–1965), during which he was professor of Physics at the University of Chicago.

Metropolis is best known for the development (joint with S. Ulam and J. von Neumann) of the Monte-Carlo method. The Monte-Carlo method provides approximate solutions to a variety of mathematical problems by performing statistical sampling experiments on a computer. However, the real use of Monte-Carlo methods as a research tool stems from work on the atomic bomb during the Second World War. This work involved a direct simulation of the probabilistic problems concerned with random neutron diffusion in fissile material. Metropolis and his collaborators obtained Monte-Carlo estimates for the eigenvalues of the Schrödinger equation.

In 1953, Metropolis co-authored the first paper on the technique that came to be known as simulated annealing [3,8]. Simulated annealing is a method for solving optimization problems. The name of the algorithm derives from an analogy with the annealing of solids: annealing refers to a process of cooling a material slowly until it reaches a stable state.

Metropolis also made several early contributions to the use of computers in the exploration of nonlinear dynamics. In the Sixties and Seventies he collaborated with G.-C. Rota and others on significance arithmetic. Another contribution of Metropolis to numerical analysis is an early paper on the use of Chebyshev's iterative method for solving large-scale linear systems [1].
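The acceptance rule introduced in the 1953 paper [8] is the kernel of simulated annealing: a move that improves the objective is always accepted, and a worsening move is accepted with probability exp(-Δ/T). The following minimal Python sketch illustrates the rule; the objective function, neighborhood and cooling schedule are arbitrary choices made here for illustration and are not taken from the sources cited below.

import math, random

def metropolis_accept(delta, temperature):
    # Downhill moves (delta <= 0) are always accepted; uphill moves
    # are accepted with probability exp(-delta / temperature).
    return delta <= 0 or random.random() < math.exp(-delta / temperature)

def simulated_annealing(obj, x, temperature=1.0, cooling=0.995, steps=10000):
    best = x
    for _ in range(steps):
        candidate = x + random.uniform(-1.0, 1.0)   # a random neighbor
        if metropolis_accept(obj(candidate) - obj(x), temperature):
            x = candidate
            if obj(x) < obj(best):
                best = x
        temperature *= cooling                      # slow 'cooling'
    return best

# Example: minimize a simple multimodal function of one variable.
print(simulated_annealing(lambda t: t**2 + 10*math.sin(t), 5.0))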

References
1. Blair A, Metropolis N, von Neumann J, Taub AH, Tsingou M (1959) A study of a numerical solution to a two-dimensional hydrodynamical problem. Math Tables and Other Aids to Computation 13(67):145–184
2. Harlow F, Metropolis N (1983) Computing and computers: Weapons simulation leads to the computer era. Los Alamos Sci 7:132–141
3. Kirkpatrick S, Gelatt CD Jr, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
4. Metropolis N (1987) The beginning of the Monte Carlo method. Los Alamos Sci 15:125–130
5. Metropolis N (1992) The age of computing: A personal memoir. Daedalus 121(1):87–103
6. Metropolis N, Howlett J, Rota G-C (eds) (1980) A history of computing in the twentieth century. Acad. Press, New York
7. Metropolis N, Nelson EC (1982) Early computing at Los Alamos. Ann Hist Comput 4(4):348–357
8. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092

Minimax: Directional Differentiability

VLADIMIR F. DEMYANOV
St. Petersburg State University, St. Petersburg, Russia

MSC2000: 90C30, 65K05

Article Outline
Keywords
A Max-Function
A Maximum Function with Dependent Constraints
A Maxmin Function
Higher-Order Directional Derivatives
Hypodifferentiability of a Max Function
See also
References

Keywords
Minimax problem; Max-function; Maxmin function; Directional derivative; Higher-order derivatives; Hypodifferentiability; Support function

Minimax is a principle of optimal choice (of some parameters or functions). If applied, this principle requires finding extremal values of some max-type function. Since the operation of taking the pointwise maximum (of a finite or infinite number of functions) generates, in general, a nonsmooth function, it is important to study the properties of such functions. Fortunately, though a max-function is not differentiable, in many cases it is still directionally differentiable. Directional differentiability provides a tool for formulating necessary (and sometimes sufficient) conditions for a minimum or maximum and for constructing numerical algorithms.

Recall that a function f: R^n → R is called Hadamard directionally differentiable (H.d.d.) at a point x ∈ R^n if for any g ∈ R^n there exists the finite limit

  f'_H(x; g) = \lim_{[\alpha, g'] \to [+0, g]} \frac{f(x + \alpha g') - f(x)}{\alpha}.

A function f: R^n → R is called Dini directionally differentiable (D.d.d.) at a point x ∈ R^n if for any g ∈ R^n there exists the finite limit

  f'_D(x; g) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha g) - f(x)}{\alpha}.

If f is H.d.d., then it is D.d.d. as well and f'_H(x, g) = f'_D(x, g).

Let Ω ⊂ R^n be a convex compact set and x ∈ Ω. The cone

  N_x(\Omega) = \{v \in R^n : (v, x) = \rho_\Omega(v)\}

is called normal to Ω at x. Here

  \rho_\Omega(v) = \max_{y \in \Omega} (v, y)

is the support function of Ω.

A Max-Function

Let

  f(x) = \max_{y \in G} \varphi(x, y),   (1)

where φ: S × G → R is continuous jointly in x, y on S × G and continuously differentiable in x there, S ⊂ R^n is an open set, and G is a compact set of some space. Under the conditions stated, the function f is continuous on S.

Proposition 1 The function f is H.d.d. at any point x ∈ S and

  f'_H(x; g) = \max_{y \in R(x)} (\varphi'_x(x, y), g) = \max_{v \in \partial f(x)} (v, g),   (2)

where R(x) = {y ∈ G : f(x) = φ(x, y)}, φ'_x(x, y) is the gradient of φ with respect to x for a fixed y, (a, b) is the scalar product of vectors a and b, and

  \partial f(x) = \mathrm{co}\,\{\varphi'_x(x, y) : y \in R(x)\} \subset R^n.

The set ∂f(x) is called the subdifferential of f at x. It is convex and compact. The mapping ∂f is, in general, discontinuous.

Remark 2 It turns out that a convex function can also be represented in the form (1) with φ being affine in x. For this special (convex) case the set ∂f(x) is

  \partial f(x) = \{v \in R^n : f(z) - f(x) \ge (v, z - x), \forall z \in S\}.

The discovery of the directional differentiability of max-functions ([1,2,6]) and convex functions [10] was a breakthrough and led to the development of minimax theory and convex analysis ([4,9,10]).
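Formula (2) is easy to check numerically on a small example (a minimal Python sketch; the pair of smooth functions, the point x and the direction g are arbitrary illustrative choices). At a point where both functions are active, the directional derivative is the larger of the two directional derivatives of the active pieces:

import numpy as np

# f(x) = max(f1(x), f2(x)) with f1(x) = x1^2 + x2 and f2(x) = x1 + x2^2.
# At x = (1, 1) both pieces are active (f1 = f2 = 2), so R(x) = {1, 2} and
# formula (2) gives f'_H(x; g) = max((grad f1(x), g), (grad f2(x), g)).
f1 = lambda x: x[0]**2 + x[1]
f2 = lambda x: x[0] + x[1]**2
f = lambda x: max(f1(x), f2(x))

x = np.array([1.0, 1.0])
g = np.array([1.0, -1.0])
grads = [np.array([2*x[0], 1.0]), np.array([1.0, 2*x[1]])]   # grad f1, grad f2

formula = max(gr @ g for gr in grads)        # right-hand side of (2): equals 1
alpha = 1e-7
numeric = (f(x + alpha*g) - f(x)) / alpha    # difference quotient: close to 1
print(formula, numeric)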

A Maximum Function with Dependent Constraints

Let X ⊂ R^n, Y ⊂ R^m be open sets and let

  f(x) = \max_{y \in a(x)} \varphi(x, y),   (3)

where a(x) is a multivalued mapping with compact images and φ: X × Y → R is Hadamard differentiable as a function of two variables, i.e. there exists the limit

  \varphi'_H([x, y]; [g, v]) = \lim_{[\alpha, g', v'] \to [+0, g, v]} \frac{1}{\alpha} \left[ \varphi(x + \alpha g', y + \alpha v') - \varphi(x, y) \right].

Then φ is continuous and φ'_H is continuous as a function of the direction [g, v]. The function f is called a maximum function with dependent constraints. Such functions are of great importance and have been widely studied (see [3,5,7,8]). To illustrate the results, let us formulate one of them [5, Thm. I.6.3].

Proposition 3 Let the mapping a be closed and bounded, let its images be convex and compact, and let the support function ρ_a(x, l) = max_{v ∈ a(x)} (v, l) be uniformly differentiable with respect to the parameter l. Let, further, x ∈ X and let the function φ be concave in some convex neighborhood of the set {[x, y] : y ∈ R(x)} (where R(x) = {y ∈ a(x) : φ(x, y) = f(x)}). Then f (see (3)) is H.d.d. and

  f'(x; g) = \sup_{y \in R(x)} \min_{[l_1, l_2] \in V(x, y)} \left[ (l_1, g) + \rho'_a(x, l_2; g) \right],   (4)

where

  V(x, y) = \{ l = [l_1, l_2] \in \bar{\partial}\varphi(x, y) : l_2 \in N_{x,y} \},

∂̄φ(x, y) is the superdifferential of φ at the point [x, y], and N_{x,y} is the cone normal to a(x) at y. Recall that if a function F: R^s → R is concave, Z ⊂ R^s is open and z ∈ Z, then the set

  \bar{\partial} F(z) = \{ v \in R^s : F(z') - F(z) \le (v, z' - z), \forall z' \in Z \}

is called the superdifferential of F at z ∈ Z. It is convex and compact.

A Maxmin Function

Let φ(x, y, z): S × G1 × G2 → R be continuous jointly in all variables, let S ⊂ R^n be an open set, and let G1 ⊂ R^m and G2 ⊂ R^p be compact. Put

  f(x) = \max_{y \in G_1} \min_{z \in G_2} \varphi(x, y, z).   (5)

The function f is continuous on S. Let

  \Phi(x, y) = \min_{z \in G_2} \varphi(x, y, z),
  R(x) = \{ y \in G_1 : \Phi(x, y) = f(x) \},
  Q(x, y) = \{ z \in G_2 : \varphi(x, y, z) = \Phi(x, y) \}.

Fix x ∈ S and let D_ε (ε > 0) be an ε-neighborhood of the set {x} × R(x) × ⋃_{y ∈ R(x)} Q(x, y). Assume that the derivatives

  \frac{\partial\varphi}{\partial x}, \frac{\partial\varphi}{\partial y}, \frac{\partial^2\varphi}{\partial x^2}, \frac{\partial^2\varphi}{\partial x \partial y}, \frac{\partial^2\varphi}{\partial y^2}

exist and are continuous jointly in all variables on D_ε(x) and that

  \left( \frac{\partial^2 \varphi(x, y, z)}{\partial y^2} v, v \right) \le 0, \quad \forall [x, y, z] \in D_\varepsilon(x), \ v \in R^m.

Assume also that G1 is convex. Let y ∈ G1. Put

  \gamma(y) = \{ v = \lambda (y' - y) : \lambda > 0, y' \in G_1 \}, \quad \Gamma(y) = \mathrm{cl}\,\gamma(y).

Proposition 4 [3, Thm. 5.2] Under the above assumptions the function f (see (5)) is Hadamard directionally differentiable and

  f'_H(x; g) = \sup_{y \in R(x)} \sup_{v \in \Gamma(y)} \min_{z \in Q(x, y)} \left[ \left( \frac{\partial\varphi(x, y, z)}{\partial y}, v \right) + \left( \frac{\partial\varphi(x, y, z)}{\partial x}, g \right) \right].

Remark 5 More sophisticated results on the directional differentiability of max- and maxmin functions can be found, e.g., in [8].

Higher-Order Directional Derivatives

The results above are related to the first order directional derivatives. Using these derivatives, it is possible to construct the following first order expansion:

  f(x + \alpha g) = f(x) + \alpha f'(x; g) + o_{x,g}(\alpha),   (6)

where f' is either f'_H or f'_D. In some cases it is possible to get 'higher-order' expansions. Let

  f(x) = \max_{i \in I} f_i(x),   (7)

where I = 1 : N, x = (x_1, …, x_n) ∈ R^n, and the f_i's are continuous and continuously differentiable up to the lth order on an open set S ⊂ R^n. Fix x ∈ S. Then for sufficiently small α > 0

  f_i(x + \alpha g) = f_i(x) + \sum_{k=1}^{l} \frac{\alpha^k}{k!} f_i^{(k)}(x; g) + o_i(g; \alpha^l),   (8)

where

  f_i^{(k)}(x; g) = \sum_{j_1, \dots, j_k = 1}^{n} \frac{\partial^k f_i(x)}{\partial x_{j_1} \cdots \partial x_{j_k}} g_{j_1} \cdots g_{j_k}, \quad k \in 1, \dots, l,   (9)

and o_i(g; α^l)/α^l → 0 as α ↓ 0, uniformly with respect to g, ‖g‖ = 1. Let us use the following notation:

  R_0(x; g) = I, \quad f_i^{(0)}(x; g) = f_i(x), \quad \forall i \in I,

  R_k(x; g) = \left\{ i \in R_{k-1}(x; g) : f_i^{(k-1)}(x; g) = \max_{j \in R_{k-1}(x; g)} f_j^{(k-1)}(x; g) \right\}, \quad k \in 1, \dots, l.

Clearly R_0(x; g) ⊃ R_1(x; g) ⊃ R_2(x; g) ⊃ ⋯. Note that R_0(x, g) does not depend on x and g, and R_1(x, g) does not depend on g.

Proposition 6 [3, Thm. 9.1] The following expansion holds:

  f(x + \alpha g) = f(x) + \sum_{k=1}^{l} \frac{\alpha^k}{k!} f^{(k)}(x; g) + o(g; \alpha^l), \quad \forall g \in R^n,   (10)

where

  f^{(k)}(x; g) = \max_{i \in R_k(x; g)} f_i^{(k)}(x; g),   (11)

and o(g; α^l)/α^l → 0 as α ↓ 0, uniformly with respect to g, ‖g‖ = 1. The value ∂^k f(x)/∂g^k = f^{(k)}(x, g) is called the kth derivative of f at x in the direction g.

Remark 7 The mapping R_1(x, g) is not continuous in x, while the mappings R_k(x, g) (k ≥ 2) are not continuous in x as well as in g. Therefore the functions f^{(k)}(x, g) in (11) are not continuous in x and (if k ≥ 2) in g and, as a result, expansion (6) is also not 'stable' in x.
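Proposition 6 can be checked numerically on a small example (an illustrative one-dimensional sketch; the two cubic polynomials are arbitrary choices). For f(t) = max(t^2, t^3 + t^2) at x = 0, both functions are active and have equal first and second derivatives, so R_0 = R_1 = R_2 = {1, 2} and (11) gives f^{(1)}(0; g) = 0, f^{(2)}(0; g) = 2g^2 and f^{(3)}(0; g) = max(0, 6g^3):

import math

f = lambda t: max(t**2, t**3 + t**2)

def expansion(alpha, g, l=3):
    # Truncated expansion (10) with the derivatives computed from (11).
    d = {1: 0.0, 2: 2*g**2, 3: max(0.0, 6*g**3)}
    return sum(alpha**k / math.factorial(k) * d[k] for k in range(1, l + 1))

for g in (1.0, -1.0):
    alpha = 0.1
    print(f(alpha*g), expansion(alpha, g))   # the values agree for both directions

Note that the maximizing index at third order switches with the sign of g, which is the kind of directional dependence behind Remark 7.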

To overcome this difficulty we shall employ another tool.

Hypodifferentiability of a Max Function

Let us again consider the case where f is defined by (7). It follows from (8) that, for Δ = (Δ_1, …, Δ_n) ∈ R^n,

  f(x + \Delta) = \max_{i \in I} \left[ f_i(x) + \sum_{k=1}^{l} \frac{1}{k!} f_i^{(k)}(x; \Delta) \right] + o(\|\Delta\|^l).   (12)

Let us use the notation (see (9))

  f_i^{(k)}(x; \Delta) = A_{ik} \Delta^k.

The function f_i^{(k)}(x, Δ) is a kth order form of the coordinates Δ_1, …, Δ_n, A_{ik} being the set of coefficients of this form. Then (12) can be rewritten as

  f(x + \Delta) = \max_{i \in I} \left[ f_i(x) + \sum_{k=1}^{l} \frac{1}{k!} A_{ik} \Delta^k \right] + o(\|\Delta\|^l)
               = f(x) + \max_{A \in d^l f(x)} \left[ \sum_{k=0}^{l} \frac{1}{k!} A_k \Delta^k \right] + o(\|\Delta\|^l),   (13)

where

  d^l f(x) = \mathrm{co}\,\{ A(i) = (A_{i0}, \dots, A_{il}) : i \in I \}, \quad A_{i0} = f_i(x) - f(x),

  A = (A_0, \dots, A_l), \quad A_0 \in R, \ A_1 \in R^n, \ A_2 \in R^{n \times n}, \ \dots, \ A_k \in R^{n \times \cdots \times n} \ (k \text{ times}).

Here R^{n×⋯×n} (k times) is the space of kth order real forms; e.g., R^{n×n} is the space of real (n × n)-matrices. The set d^l f(x) is called the lth order hypodifferential of f at x. It is an element of the space R × R^n × ⋯ × R^{n×⋯×n} (l times). The mapping d^l f is continuous in x.
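Continuing the one-dimensional sketch used above for Proposition 6 (an arbitrary illustration; with n = 1 each A_k is a scalar): for f(t) = max(t^2, t^3 + t^2) at x = 0 and l = 3, the extreme points of d^3 f(0) are A(1) = (0, 0, 2, 0) and A(2) = (0, 0, 2, 6), and the maximum in (13) can be evaluated at these extreme points, since an affine function of A attains its maximum over a convex hull at an extreme point:

import math

extreme_points = [(0.0, 0.0, 2.0, 0.0), (0.0, 0.0, 2.0, 6.0)]   # A(1), A(2)

def model(delta):
    # The max over d^3 f(0) appearing in (13), evaluated at extreme points.
    return max(sum(A[k] * delta**k / math.factorial(k) for k in range(4))
               for A in extreme_points)

f = lambda t: max(t**2, t**3 + t**2)
for delta in (0.1, -0.1):
    print(f(delta), f(0.0) + model(delta))   # identical: f is itself a max of cubics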

Remark 8 Expansion (13) can be extended to the case where f is given by (1) and φ is l times continuously differentiable in x. Max functions represent a special case of the class of quasidifferentiable functions (see [5]).

See also
- Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs
- Bilevel Optimization: Feasibility Test and Flexibility Index
- Minimax Theorems
- Nondifferentiable Optimization: Minimax Problems
- Stochastic Programming: Minimax Approach
- Stochastic Quasigradient Methods in Minimax Problems

References
1. Danskin JM (1967) The theory of max-min and its application to weapons allocation problems. Springer, Berlin
2. Demyanov VF (1966) On minimizing the maximal deviation. Vestn Leningrad Univ 7:21–28
3. Demyanov VF (1974) Minimax: Directional differentiability. Leningrad Univ. Press, Leningrad
4. Demyanov VF, Malozemov VN (1974) Introduction to minimax. Wiley, New York. Second edition: Dover, New York (1990)
5. Demyanov VF, Rubinov AM (1995) Constructive nonsmooth analysis. P. Lang, Frankfurt am Main
6. Girsanov IV (1965) Differentiability of solutions of the mathematical programming problems. Abstracts Conf. Applications of Functional Analysis Methods to Solving Nonlinear Problems, pp 43–45
7. Levitin ES (1994) Perturbation theory in mathematical programming and its applications. Wiley, New York
8. Minchenko LI, Borisenko OF (1992) Differential properties of marginal functions and their applications to optimization problems. Nauka i Techn., Minsk
9. Pschenichny BN (1980) Convex analysis and extremal problems. Nauka, Moscow
10. Rockafellar RT (1970) Convex analysis. Princeton Univ. Press, Princeton

Minimax Game Tree Searching

CLAUDE G. DIDERICH 1, MARC GENGLER 2
1 Computer Sci. Department, Swiss Federal Institute Technology-Lausanne, Lausanne, Switzerland
2 Ecole Sup. d'Ingénieurs de Luminy, Université Méditerrannée, Marseille, France

MSC2000: 49J35, 49K35, 62C20, 91A05, 91A40

Article Outline
Keywords
Minimax Trees
Sequential Minimax Game Tree Algorithms
  Minimax Algorithm
  Alpha-Beta Algorithm
  Optimal State Space Search Algorithm SSS*
  SCOUT: Minimax Algorithm of Theoretical Interest
  GSEARCH: Generalized Game Tree Search Algorithm
  SSS*-2: Recursive State Space Search Algorithm
  Some Variations on the Subject
Parallel Minimax Tree Algorithms
  A Simple Way to Parallelize the Exploration of Minimax Trees
  A Mandatory Work First Algorithm
  Aspiration Search
  Tree-Splitting Algorithm
  PVSPLIT: Principal Variation Splitting Algorithm
  Synchronized Distributed State Space Search
  Distributed Game Tree Search Algorithm
  Parallel Minimax Algorithm with Linear Speedup
See also
References

Keywords
Algorithms; Games; Minimax; Searching

With the introduction of computers came an interest in having machines play games. Programming a computer such that it could play, for example, chess was seen as giving it some kind of intelligence. Starting in the mid fifties, a theory on how to play two-player zero-sum perfect-information games, like chess or go, was developed. This theory is essentially based on traversing a tree called a minimax or game tree. An edge in the tree represents a move by either of the players, and a node a configuration of the game. Two major algorithms have emerged to compute the best sequence of moves in such a minimax tree. On one hand, there is the alpha-beta algorithm, suggested around 1956 by J. McCarthy and first published in [27]. On the other hand, G.C. Stockman [29] introduced the SSS* algorithm. Both methods try to minimize the number of nodes explored in the game tree using special traversal strategies and cut conditions.

Minimax Trees

A two-player zero-sum perfect-information game, also called a minimax game, is a game which involves exactly two players who alternately make moves. No information is hidden from the adversary. No coins are tossed, that is, the game is completely deterministic, and there is perfect symmetry in the quality of the moves allowed. Go, checkers and chess are such minimax games, whereas backgammon (the outcome of a die determines the moves available) or card games (cards are hidden from the adversary) are not. A minimax tree or game tree is a tree where each node represents a state of the game and each edge a possible move. Nodes are alternately labeled 'max' and 'min', representing either player's turn. A node having no descendants represents a final outcome of the game. The goal of a game is to find a winning sequence of moves, given that the opponent always plays his best move. The quality of a node t in the minimax game tree, representing a configuration, is given by its value e(t). The value e(t), also called the minimax value, is defined recursively as

  e(t) = \begin{cases} f(t) & \text{if } t \text{ is a leaf node}, \\ \max_{s \in \mathrm{sons}(t)} e(s) & \text{if } t \text{ is labeled 'max'}, \\ \min_{s \in \mathrm{sons}(t)} e(s) & \text{if } t \text{ is labeled 'min'}. \end{cases}
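This recursive definition translates directly into code. The sketch below is an illustrative Python toy (the hand-built tree and its leaf values are arbitrary): it evaluates e(t) by plain minimax and, equivalently, with alpha-beta cuts:

import math

def minimax(node, maximizing):
    # Plain evaluation of e(t) from its recursive definition.
    if isinstance(node, (int, float)):   # leaf: e(t) = f(t)
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    # Same value as minimax, but prunes subtrees that cannot
    # influence the result (the alpha-beta cut condition).
    if isinstance(node, (int, float)):
        return node
    for child in node:
        v = alphabeta(child, not maximizing, alpha, beta)
        if maximizing:
            alpha = max(alpha, v)
        else:
            beta = min(beta, v)
        if alpha >= beta:
            break                         # cut: the remaining sons are irrelevant
    return alpha if maximizing else beta

tree = [[3, 5], [6, [9, 2]], [1, 2]]      # nested lists encode the game tree
print(minimax(tree, True), alphabeta(tree, True))   # both print 6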

If the considered minimax tree represents a complete game, that is, all possible board configurations, the function f may be defined as follows:

      IF v_i > α THEN
        α ← v_i
        ⟨Update the bounds according to α on all slaves⟩
      END IF
      IF α > β THEN
        ⟨Terminate all slave processors⟩
        RETURN α
      END IF
    END LOOP
    RETURN α
  END PVSplit

Pseudocode for the PVSPLIT algorithm

Synchronized Distributed State Space Search

A completely different approach to parallelizing the SSS* algorithm has been taken by C.G. Diderich and M. Gengler [10]. The algorithm proposed is called synchronized distributed state space search (SDSSS). It is an alternation of computation and synchronization phases. The algorithm has been designed for a distributed memory multiprocessor machine. Each processor manages its own local 'open' list of unvisited nodes.

The synchronization phase may be subdivided into three major parts. First, the processors exchange information about which nodes can be removed from the local 'open' lists. This corresponds to each processor sending the nodes for which the 'purge' operation may be applied by all the other processors. Next, all the processors agree on the globally lowest upper bound m for which nodes exist in some of the 'open' lists. Finally, all the nodes having the same upper bound m are evenly distributed among all the processors. This operation concludes the synchronization phase. The computation phase of the SDSSS algorithm may be described by the following pseudocode.

  ⟨Computation phase⟩
  WHILE ⟨there exists a node in the open list having an upper bound of m⟩ LOOP
    (s, t, m) ← remove(open)
    IF s = root AND t = solved THEN
      BROADCAST 'the solution has been found'
      RETURN m
    END IF
    ⟨Apply the Γ operator to node s⟩
  END LOOP

Pseudocode for the computation phase of the SDSSS algorithm

Experiments executing the SDSSS algorithm on an Intel iPSC/2 parallel machine have been conducted. Speedups of up to 11.4 have been measured for 32 processors.

Distributed Game Tree Search Algorithm

R. Feldmann [12] parallelized the alpha-beta algorithm for massively parallel distributed memory machines. Different subtrees are searched in parallel by different processors. The allocation of processors to trees is done by imposing certain conditions on the nodes which are to be selectable. They introduce the concept of younger brother waits. This concept essentially says that if the subtree rooted at s1, the first son node of a node n, is not yet evaluated, then the other sons s2, …, sb of node n are not selectable. Younger brothers may only be considered after their elder brothers, which has as a consequence that the value of the elder brothers may be used to give a tight search window to the younger brothers.

This concept is nevertheless not sufficient to achieve the same good search window as the sequential alpha-beta algorithm achieves. Indeed, when node s1 has been computed, the younger brothers may all be explored in parallel using the value of node s1. Thus the node s2 has the same search window as it would have in the sequential alpha-beta algorithm, but this is not true anymore for si with i ≥ 3. Indeed, if nodes s2 and s3 are processed in parallel, they only know the value of node s1, while in the sequential alpha-beta algorithm node s3 would have known the value of both s1 and s2. This fact forces the parallel algorithm to provide an information dissemination protocol. In case the nodes s2 and s3 are evaluated on processors P and P', and processor P finishes its work before P', producing a better value than node s1 did, then processor P will inform processor P' of this value, allowing it to continue with better information on the rest of its subtree, or to terminate its work if the new value allows P' to conclude that its computation has become useless. The load distribution is realized by means of a dynamic load balancing scheme, where idle processors ask other processors for work. Speedups as high as 100 have been obtained on a 256 processor machine. In [13], a speedup of 344 on a 1024 transputer network interconnected as a grid and a speedup of 142 on a 256 processor de Bruijn-interconnected transputer network have been shown.

Parallel Minimax Algorithm with Linear Speedup

In 1988, Althöfer [4] proved that it is possible to develop a parallel minimax algorithm which achieves linear speedup in the average case. With the assumption that all minimax trees are binary win-loss trees, he exhibited such a parallel minimax algorithm. M. Böhm and E. Speckenmeyer [8] also suggested an algorithm which uses the same basic ideas as Althöfer.

Their algorithm is more general in the sense that it needs only to know the distribution of the leaf values and is independent of the branching of the tree explored. In 1989, R.M. Karp and Y. Zhang [17] proved that it is possible to obtain linear speedup on every instance of a random uniform minimax tree if the number of processors is close to the height of the tree.

See also
- Bottleneck Steiner Tree Problems
- Directed Tree Networks
- Shortest Path Tree Algorithms

References
1. Akl SG, Barnard DT, Doran RJ (1979) Searching game trees in parallel. In: Proc. 3rd Biennial Conf. Canad. Soc. Computation Studies of Intelligence, pp 224–231
2. Akl SG, Barnard DT, Doran RJ (1982) Design, analysis, and implementation of a parallel tree search algorithm. IEEE Trans Pattern Anal Machine Intell PAMI-4(2):192–203
3. Almquist K, McKenzie N, Sloan K (1988) An inquiry into parallel algorithms for searching game trees. Techn. Report Univ. Washington, Seattle, WA 12(3)
4. Althöfer I (1988) On the complexity of searching game trees and other recursion trees. J Algorithms 9:538–567
5. Althöfer I (1990) An incremental negamax algorithm. Artif Intell 43:57–65
6. Ballard BW (1983) The *-minimax search procedure for trees containing chance nodes. Artif Intell 21:327–350
7. Baudet GM (1978) The design and analysis of algorithms for asynchronous multiprocessors. PhD Thesis, Carnegie-Mellon Univ., Pittsburgh, PA, CMU-CS-78-116
8. Böhm M, Speckenmeyer E (1989) A dynamic processor tree for solving game trees in parallel. Proc. SOR'89
9. Cung V-D, Roucairol C (1991) Parallel minimax tree searching. Res Report INRIA, vol 1549
10. Diderich CG (1992) Evaluation des performances de l'algorithme SSS* avec phases de synchronisation sur une machine parallèle à mémoires distribuées. Techn. Report Computer Sci. Dept. Swiss Federal Inst. Techn. Lausanne, Switzerland, LiTH-99 (in French)
11. Feigenbaum EA, Feldman J (1963) Computers and thought. McGraw-Hill, New York
12. Feldmann R, Monien B, Mysliwietz P, Vornberger O (1989) Distributed game tree search. ICCA J 12(2):65–73
13. Feldmann R, Mysliwietz P, Monien B (1994) Game tree search on a massively parallel system. In: van den Herik HJ, Herschberg IS, Uiterwijk JWHM (eds) Advances in Computer Chess, vol 7. Univ. Limburg, Maastricht, pp 203–218
14. Finkel RA, Fishburn JP (1982) Parallelism in alpha-beta search. Artif Intell 19:89–106
15. Hewett R, Krishnamurthy G (1992) Consistent linear speedup in parallel alpha-beta search. Proc. ICCI'92, Computing and Information. IEEE Computer Soc Press, New York, pp 237–240
16. Ibaraki T (1986) Generalization of alpha-beta and SSS* search procedures. Artif Intell 29:73–117
17. Karp RM, Zhang Y (1989) On parallel evaluation of game trees. In: ACM Annual Symp. Parallel Algorithms and Architectures (SPAA'89). ACM, New York, pp 409–420
18. Knuth DE, Moore RW (1975) An analysis of alpha-beta pruning. Artif Intell 6(4):293–326
19. Marsland TA, Campbell MS (1982) Parallel search of strongly ordered game trees. ACM Computing Surveys 14(4):533–551
20. Marsland TA, Popowich F (1985) Parallel game-tree search. IEEE Trans Pattern Anal Machine Intell PAMI-7(4):442–452
21. Marsland TA, Reinefeld A, Schaeffer J (1987) Low overhead alternatives to SSS*. Artif Intell 31:185–199
22. McAllester DA (1988) Conspiracy numbers for min-max searching. Artif Intell 35:287–310
23. Pearl J (1980) Asymptotical properties of minimax trees and game searching procedures. Artif Intell 14(2):113–138
24. Pijls W, de Bruin A (1990) Another view of the SSS* algorithm. In: Proc. Internat. Symp. (SIGAL'90)
25. Rivest RL (1987) Game tree searching by min/max approximation. Artif Intell 34(1):77–96
26. Roizen I, Pearl J (1983) A minimax algorithm better than alpha-beta? Yes and no. Artif Intell 21:199–230
27. Slagle JH, Dixon JK (1969) Experiments with some programs that search game trees. J ACM 16(2):189–207
28. Steinberg IR, Solomon M (1990) Searching game trees in parallel. In: Proc. IEEE Internat. Conf. Parallel Processing, vol III, pp III-9–III-17
29. Stockman GC (1979) A minimax algorithm better than alpha-beta? Artif Intell 12(2):179–196

Minimax Theorems

STEPHEN SIMONS
Department Math., University California, Santa Barbara, USA

MSC2000: 46A22, 49J35, 49J40, 54D05, 54H25, 55M20, 91A05

Article Outline
Keywords
Von Neumann's Results
Infinite-Dimensional Results for Convex Sets
Functional-Analytic Minimax Theorems
Minimax Theorems that Depend on Connectedness
Mixed Minimax Theorems
A Metaminimax Theorem
Minimax Theorems and Weak Compactness
Minimax Inequalities for Two or More Functions
Coincidence Theorems
See also
References

Keywords
Minimax theorem; Fixed point theorem; Hahn–Banach theorem; Connectedness

We suppose that X and Y are nonempty sets and f: X × Y → R. A minimax theorem is a theorem that asserts that, under certain conditions,

  \inf_Y \sup_X f = \sup_X \inf_Y f,

that is to say,

  \inf_{y \in Y} \sup_{x \in X} f(x, y) = \sup_{x \in X} \inf_{y \in Y} f(x, y).

The purpose of this article is to give the reader the flavor of the different kinds of minimax theorems, and of the techniques that have been used to prove them. This is a very large area, and it would be impossible to touch on all the work that has been done in it in the space that we have at our disposal. The choice that we have made is to give the historical roots of the subject, and then go directly to the most recent results. The reader who is interested in a more complete narrative can refer to the 1974 survey article [35] by E.B. Yanovskaya, the 1981 survey article [8] by A. Irle and the 1995 survey article [31] by S. Simons.

Von Neumann's Results

In his investigation of games of strategy, J. von Neumann realized that, even though a two-person zero-sum game did not necessarily have a solution in pure strategies, it did have to have one in mixed strategies. Here is a statement of that seminal result ([19], translated into English in [21]):

Theorem 1 (1928) Let A be an m × n matrix, and X and Y be the sets of nonnegative row and column vectors with unit sum. Then

  \min_{y \in Y} \max_{x \in X} x A y = \max_{x \in X} \min_{y \in Y} x A y.
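Theorem 1 also lends itself to computation: the value of the matrix game and an optimal mixed strategy are obtained by solving a linear program. The sketch below is a minimal Python illustration using scipy.optimize.linprog (the 3 × 3 payoff matrix is an arbitrary choice):

import numpy as np
from scipy.optimize import linprog

# Value of the matrix game max_x min_y xAy over mixed strategies.
A = np.array([[1.0, -1.0, 0.0],
              [-1.0, 1.0, -1.0],
              [0.0, -1.0, 1.0]])
m, n = A.shape

# Variables (x_1, ..., x_m, v); maximize v subject to
#   (x^T A)_j >= v for every column j,  sum_i x_i = 1,  x >= 0.
c = np.zeros(m + 1); c[-1] = -1.0                  # minimize -v
A_ub = np.hstack([-A.T, np.ones((n, 1))])          # v - (x^T A)_j <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
b_eq = [1.0]
bounds = [(0, None)] * m + [(None, None)]          # v is a free variable

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("optimal row strategy:", res.x[:m], "game value:", -res.fun)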

Despite the fact that the statement of Theorem 1 is quite elementary, the proof was quite sophisticated, and depended on an extremely ingenious induction argument. Nine years later, in [20], von Neumann showed that the bilinear character of Theorem 1 was not needed when he extended it as follows, using Brouwer's fixed point theorem:

Theorem 2 (1937) Let X and Y be nonempty compact, convex subsets of Euclidean spaces, and f: X × Y → R be jointly continuous. Suppose that f is quasiconcave on X and quasiconvex on Y (see below). Then

  \min_Y \max_X f = \max_X \min_Y f.

When we say that f is quasiconcave on X, we mean that
- for all y ∈ Y and λ ∈ R, GT(λ, y) is convex,
and when we say that f is quasiconvex on Y, we mean that
- for all x ∈ X and λ ∈ R, LE(x, λ) is convex.
Here, GT(λ, y) and LE(x, λ) are 'level sets' associated with the function f. Specifically,

  GT(\lambda, y) := \{x \in X : f(x, y) > \lambda\} \quad \text{and} \quad LE(x, \lambda) := \{y \in Y : f(x, y) \le \lambda\}.

In 1941, S. Kakutani [10] analyzed von Neumann's proof and, as a result, discovered the fixed point theorem that bears his name.

Infinite-Dimensional Results for Convex Sets

The first infinite-dimensional minimax theorem was proved in 1952 by K. Fan ([1]), who generalized Theorem 2 to the case when X and Y are compact, convex subsets of infinite-dimensional locally convex spaces, and the quasiconcave and quasiconvex conditions are somewhat relaxed. The result in this general line that has the simplest statement is that of M. Sion, who proved the following ([33]):


Theorem 3 (1958) Let X be a convex subset of a linear topological space, Y be a compact convex subset of a linear topological space, and f: X × Y → R be upper semicontinuous on X and lower semicontinuous on Y. Suppose that f is quasiconcave on X and quasiconvex on Y. Then

  \min_Y \sup_X f = \sup_X \min_Y f.

When we say that f is 'upper semicontinuous on X' and 'lower semicontinuous on Y' we mean that, for all y ∈ Y, the map x ↦ f(x, y) is upper semicontinuous and, for all x ∈ X, the map y ↦ f(x, y) is lower semicontinuous. The importance of Sion's weakening of continuity to semicontinuity was that it indicated that many kinds of minimax problems have equivalent formulations in terms of subsets of X and Y, and led to Fan's 1972 work ([4]) on sets with convex sections and minimax inequalities, which has since found many applications in economic theory. Like Theorem 2, all these results relied ultimately on Brouwer's fixed point theorem (or the related Knaster–Kuratowski–Mazurkiewicz lemma (KKM lemma) on closed subsets of a finite-dimensional simplex).

Functional-Analytic Minimax Theorems

The first person to take minimax theorems out of the context of convex subsets of vector spaces, and their proofs (other than that of the matrix case discussed in Theorem 1) out of the context of fixed point theorems, was Fan in 1953 ([2]). We present here a generalization of Fan's result due to H. König ([15]). König's proof depended on the Mazur–Orlicz version of the Hahn–Banach theorem (see Theorem 5 below).

Theorem 4 (1968) Let X be a nonempty set and Y be a nonempty compact topological space. Let f: X × Y → R be lower semicontinuous on Y. Suppose that:
- for all x1, x2 ∈ X, there exists x3 ∈ X such that

  f(x_3, \cdot) \ge \frac{f(x_1, \cdot) + f(x_2, \cdot)}{2} \quad \text{on } Y;

- for all y1, y2 ∈ Y, there exists y3 ∈ Y such that

  f(\cdot, y_3) \le \frac{f(\cdot, y_1) + f(\cdot, y_2)}{2} \quad \text{on } X.

Then

  \min_Y \sup_X f = \sup_X \min_Y f.

We give here the statement of the Mazur–Orlicz version of the Hahn–Banach theorem, since it is a very useful result and it is not as well known as it deserves to be.

Theorem 5 (Mazur–Orlicz theorem) Let S be a sublinear functional on a real vector space E, and C be a nonempty convex subset of E. Then there exists a linear functional L on E such that L ≤ S on E and inf_C L = inf_C S.

See [16,22] and [23] for applications of the Mazur–Orlicz theorem and the related 'sandwich theorem' to measure theory, Hardy algebra theory and the theory of flows in infinite networks. The kind of minimax theorem discussed in this section (where X is not topologized) has turned out to be extremely useful in functional analysis, in particular in convex analysis and also in the theory of monotone operators on a Banach space. (See [32] for more details of these kinds of applications.)

Minimax Theorems that Depend on Connectedness

It was believed for some time that proofs of minimax theorems required either the fixed point machinery of algebraic topology, or the functional-analytic machinery of convexity. However, in 1959, W.-T. Wu proved the first minimax theorem in which the conditions of convexity were totally replaced by conditions related to connectedness. This line of research was continued by H. Tuy, L.L. Stachó, M.A. Geraghty with B.-L. Lin, and J. Kindler with R. Trost, whose results were all subsumed by a family of general topological minimax theorems established by König in [17]. Here is a typical result from [17]. In order to simplify the statements of this and some of our later results, we shall write f* := sup_X inf_Y f; f* is the 'lower value' of f. If λ ∈ R, V ⊂ Y and W ⊂ X, we write GT(λ, V) := ⋂_{y ∈ V} GT(λ, y) and LE(W, λ) := ⋂_{x ∈ W} LE(x, λ).

Theorem 6 (1992) Let X be a connected topological space, Y be a compact topological space, and f: X × Y → R be upper semicontinuous on X and lower semicontinuous on Y. Let Λ be a nonempty subset of (f*, ∞)


such that inf Λ = f*, and suppose that, for all λ ∈ Λ, for all nonempty subsets V of Y, and for all nonempty finite subsets W of X,

  GT(λ, V) is connected in X, and LE(W, λ) is connected in Y.

Then

  \min_Y \sup_X f = \sup_X \min_Y f.

Mixed Minimax Theorems

In [34], F. Terkelsen proved the first mixed minimax theorem. We describe Terkelsen's result as 'mixed' since one of the conditions in it is taken from Theorem 4, and the other from Theorem 6:

Theorem 7 (1972) Let X be a nonempty set and Y be a nonempty compact topological space. Let f: X × Y → R be lower semicontinuous on Y. Suppose that, for all x1, x2 ∈ X, there exists x3 ∈ X such that

  f(x_3, \cdot) \ge \frac{f(x_1, \cdot) + f(x_2, \cdot)}{2} \quad \text{on } Y.

Suppose also that, for all λ ∈ R and for all nonempty finite subsets W of X, LE(W, λ) is connected in Y. Then

  \min_Y \sup_X f = \sup_X \min_Y f.

A Metaminimax Theorem

It was believed for some time that Brouwer's fixed point theorem or the Knaster–Kuratowski–Mazurkiewicz lemma was required in order to prove Sion's theorem, Theorem 3. However, in 1966, M.A. Ghouila-Houri ([7]) proved Theorem 3 using a simple combinatorial property of convex sets in finite-dimensional space. This was probably the first indication of the breakdown of the classification of minimax theorems as either of 'topological' or of 'functional-analytic' type. Further indication of this breakdown was provided by Terkelsen's result, Theorem 7, and the subsequent 1982 results of I. Joó and Stachó ([9]), the 1985 and 1986 results of Geraghty and Lin ([5] and [6]), and the 1989 results of H. Komiya ([18]). Kindler ([11]) was the first to realize (in 1990) that some abstract concept akin to connectedness might be involved in minimax theorems, even when the topological condition of connectedness was not explicitly assumed. This idea was pursued by Simons with the introduction in 1992 of the concept of pseudoconnectedness, which we will now describe.

We say that sets H0 and H1 are joined by a set H if

  H \subset H_0 \cup H_1, \quad H \cap H_0 \ne \emptyset \quad \text{and} \quad H \cap H_1 \ne \emptyset.

We say that a family 𝓗 of sets is pseudoconnected if

  H_0, H_1, H \in 𝓗 \text{ and } H_0 \text{ and } H_1 \text{ joined by } H \implies H_0 \cap H_1 \ne \emptyset.

Any family of closed connected subsets of a topological space is pseudoconnected. So also is any family of open connected subsets. However, pseudoconnectedness can be defined in the absence of any topological structure and, as we shall see in Theorem 8, is closely related to minimax theorems. Theorem 8 is the improvement of the result of [29] due to König (see [30]). We shall say that a subset W of X is good if
- W is finite; and
- for all x ∈ X, LE(x, f*) ∩ LE(W, f*) ≠ ∅.

Theorem 8 (1995) Let Y be a topological space, and Λ be a nonempty subset of R such that inf Λ = f*. Suppose that, for all λ ∈ Λ and for all good subsets W of X,
- for all x ∈ X, LE(x, λ) is closed and compact;
- {LE(x, λ) ∩ LE(W, λ)}_{x ∈ X} is pseudoconnected; and
- for all x0, x1 ∈ X, there exists x ∈ X such that LE(x0, λ) and LE(x1, λ) are joined by LE(x, λ) ∩ LE(W, λ).
Then

  \min_Y \sup_X f = \sup_X \min_Y f.

Y

M

Minimax Theorems

Theorem 8 is proved by induction on the cardinality of the good subsets of W. Given the obvious topological motivation behind the concept of pseudoconnectedness, it is hardly surprising that Theorem 8 implies Theorem 6. What is more unexpected is that Theorem 8 implies Theorems 4 and 7 also. We prefer to describe Theorem 8 as a metaminimax theorem rather than a minimax theorem, since it is frequently harder to prove that the conditions of Theorem 8 are satisfied in any particular case that it is to prove Theorem 8 itself. So Theorem 8 is really a device for obtaining minimax theorems rather than a minimax theorem in its own right. More recent work by Kindler ([12,13] and [14]) on abstract intersection theorems has been at the interface between minimax theory and abstract set theory. Minimax Theorems and Weak Compactness There are close connections between minimax theorems and weak compactness. The following ‘converse minimax theorem’ was proved by Simons in [25]; this result also shows that there are limitations on the extent to which one can totally remove the assumption of compactness from minimax theorems. Theorem 9 (1971) Suppose that X is a nonempty bounded, convex, complete subset of a locally convex space E with dual space E , and inf sup hx; yi D sup inf hx; yi

y2Y x2X

Theorem 11 (James sup theorem) If C is a nonempty bounded closed convex subset of E, then C is w(E, E )compact if and only if, for all x 2 E , there exists x 2 C such that hx, x i = maxC x . James’s theorem is not easy - the standard proof can be found in the paper [24] by J.D. Pryce. See [31] for more details of the connections between minimax theorems and weak compactness.

Minimax Inequalities for Two or More Functions Motivated by Nash equilibrium and the theory of noncooperative games, Fan generalized Theorem 2 to the case of more than one function. In particular, he proved in [3] the following two-function minimax inequality (since the compactness of X is not needed, this result can in fact be strengthened to include Sion’s theorem, Theorem 3, by taking g = f ): Theorem 12 (1964) Let X and Y be nonempty compact, convex subsets of topological vector spaces and f , g: X × Y ! R. Suppose that f is lower semicontinuous on Y and quasiconcave on X, g is upper semicontinuous on X and quasiconvex on Y, and f g

on X  Y:

x2X y2Y

whenever Y is a nonempty convex, equicontinuous, subset of E . Then X is weakly compact. No compactness is assumed in the following, much harder, result (see [26]): Theorem 10 (1972) If X is a nonempty bounded, convex subset of a locally convex space E such that every element of the dual space E attains its supremum on X, and Y is any nonempty convex equicontinuous subset of E , then inf sup hx; yi D sup inf hx; yi :

y2Y x2X

James, one of the most beautiful results in functional analysis:

x2X y2Y

If one now combines the results of Theorems 9 and 10, one can obtain a proof of the ‘sup theorem’ of R.C.

Then min sup f  sup inf g: Y

X

X

Y

Fan (unpublished) and Simons (see [27]) generalized König’s theorem, Theorem 4, with the following twofunction minimax inequality: Theorem 13 (1981) Let X be a nonempty set, Y be a compact topological space and f , g: X × Y ! R. Suppose that f is lower semicontinuous on Y, and  for all y1 , y2 2 Y there exists y3 2 Y such that f (; y3 ) 

f (; y1 ) C f (; y2 ) 2

on X;

2091

2092

M

Minimax Theorems

 for all x1 , x2 2 X there exists x3 2 X such that g(x3 ; ) 

g(x1 ; ) C g(x2 ; ) 2

on Y;

and  f  g on X × Y. Then X

X

Y

Theorems 12 and 13 both unify the theory of minimax theorems and the theory of variational inequalities. The curious feature about these two results is that they have ‘opposite geometric pictures’. This question is discussed in [27] and [28]. The relationship between Theorem 12 and Brouwer’s fixed point theorem is quite interesting. As we have already pointed out, Sion’s theorem, Theorem 3, can be proved in an elementary fashion without recourse to fixed point related concepts. On the other hand, Theorem 12 can, in fact, be used to prove Tychonoff’s fixed point theorem, which is itself a generalization of Brouwer’s fixed point theorem. (See [3] for more details of this.) A number of authors have proved minimax inequalities for more than two functions. See [31] for more details of these results. Coincidence Theorems A coincidence theorem is a theorem that asserts that if S: X ! 2Y and T: Y ! 2X have nonempty values and satisfy certain other conditions, then there exist x0 2 X and y0 2 Y such that y0 2 Sx0 and x0 2 Ty0 . The connection with minimax theorems is as follows: Suppose that infY supX f 6D supX infY f . Then there exists  2 R such that sup inf f <  < inf sup f : X

Y

f (x0 ; y0 ) < 

and

f (x0 ; y0 ) > ;

which is clearly impossible. Thus this coincidence theorem would imply that

min sup f  sup inf g: Y

If S and T were to satisfy a coincidence theorem, then we would have x0 2 X and y0 2 Y such that

Y

X

Hence,  for all x 2 X there exists y 2 Y such that f (x, y) < ; and  for all y 2 Y there exists x 2 X such that f (x, y) > . Define S: X ! 2Y and T: Y ! 2X by Sx :D fy 2 Y : f (x; y) < g ¤ ; and Tx :D fx 2 X : f (x; y) > g ¤ ;:

inf sup f D sup inf f : Y

X

X

Y

The coincidence theorems known in algebraic topology consequently give rise to corresponding minimax theorems. There is a very extensive literature about coincidence theorems. See [31] for more details about this. See also  Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs  Bilevel Optimization: Feasibility Test and Flexibility Index  Minimax: Directional Differentiability  Nondifferentiable Optimization: Minimax Problems  Stochastic Programming: Minimax Approach  Stochastic Quasigradient Methods in Minimax Problems References 1. Fan K (1952) Fixed-point and minimax theorems in locally convex topological linear spaces. Proc Nat Acad Sci USA 38:121–126 2. Fan K (1953) Minimax theorems. Proc Nat Acad Sci USA 39:42–47 3. Fan K (1964) Sur un théorème minimax. CR Acad Sci Paris 259:3925–3928 4. Fan K (1972) A minimax inequality and its applications. In: Shisha O (ed) Inequalities, vol III. Acad. Press, New York, pp 103–113 5. Geraghty MA, Lin B-L (1985) Minimax theorems without linear structure. Linear Multilinear Algebra 17:171–180 6. Geraghty MA, Lin B-L (1986) Minimax theorems without convexity. Contemp Math 52:102–108 7. Ghouila-Houri MA (1966) Le théorème minimax de Sion. In: Theory of games. English Univ. Press, London, pp 123–129 8. Irle A (1981) Minimax theorems in convex situations. In: Moeschlin O, Pallaschke D (eds) Game Theory and Mathematical Economics. North-Holland, Amsterdam, pp 321– 331 9. Joó I, Stachó LL (1982) A note on Ky Fan’s minimax theorem. Acta Math Acad Sci Hung 39:401–407 10. Kakutani S (1941) A generalization of Brouwer’s fixed-point theorem. Duke Math J 8:457–459

Minimum Concave Transportation Problems

11. Kindler J (1990) On a minimax theorem of Terkelsen’s. Arch Math 55:573–583 12. Kindler J (1993) Intersection theorems and minimax theorems based on connectedness. J Math Anal Appl 178:529– 546 13. Kindler J (1994) Intersecting sets in midset spaces. I. Arch Math 62:49–57 14. Kindler J (1994) Intersecting sets in midset spaces. II. Arch Math 62:168–176 15. König H (1968) Über das Von Neumannsche MinimaxTheorem. Arch Math 19:482–487 16. König H (1970) On certain applications of the Hahn-Banach and minimax theorems. Arch Math 21:583–591 17. König H (1992) A general minimax theorem based on connectedness. Arch Math 59:55–64 18. Komiya H (1989) On minimax theorems. Bull Inst Math Acad Sinica 17:171–178 19. Neumann Jvon (1928) Zur Theorie der Gesellschaftspiele. MATH-A 100:295–320 20. Neumann Jvon (1937) Ueber ein ökonomisches Gleichungssystem und eine Verallgemeinerung des Brouwerschen Fixpunktsatzes. Ergebn Math Kolloq Wien 8:73–83 21. Neumann Jvon (1959) On the theory of games of strategy. In: Tucker AW, Luce RD (eds) Contributions to the Theory of Games, vol 4, Princeton Univ. Press, Princeton, pp 13–42 22. Neumann M (1989) Some unexpected applications of the sandwich theorem. In: Proc. Conf. Optimization and Convex Analysis, Univ. Mississippi 23. Neumann M (1991) Generalized convexity and the Mazur– Orlicz theorem. In: Proc. Orlicz Memorial Conf., Univ. Mississippi 24. Pryce JD (1966) Weak compactness in locally convex spaces. Proc Amer Math Soc 17:148–155 25. Simons S (1970/1) Critères de faible compacité en termes du théorème de minimax. Sém. Choquet 23:8 26. Simons S (1972) Maximinimax: minimax, and antiminimax theorems and a result of R.C. James. Pacific J Math 40:709– 718 27. Simons S (1981) Minimax and variational inequalities: Are they or fixed point or Hahn–Banach type? In: Moeschlin O, Pallaschke D (eds) Game Theory and Mathematical Economics. North-Holland, Amsterdam, pp 379–388 28. Simons S (1986) Two-function minimax theorems and variational inequalities for functions on compact and noncompact sets with some comments on fixed-points theorems. Proc Symp Pure Math 45:377–392 29. Simons S (1994) A flexible minimax theorem. Acta Math Hungarica 63:119–132 30. Simons S (1995) Addendum to: A flexible minimax theorem. Acta Math Hungarica 69:359–360 31. Simons S (1995) Minimax theorems and their proofs. In: Du DZ, Pardalos PM (eds) Minimax and Applications. Kluwer, Dordrecht, pp 1–23 32. Simons S (1998) Minimax and monotonicity. Lecture Notes Math, vol 1693. Springer, Berlin

M

33. Sion M (1958) On general minimax theorems. Pacific J Math 8:171–176 34. Terkelsen F (1972) Some minimax theorems. Math Scand 31:405–413 35. Yanovskaya EB (1974) Infinite zero-sum two-person games. J Soviet Math 2:520–541

Minimum Concave Transportation Problems
MCTP

BRUCE W. LAMAR
Economic and Decision Analysis Center, The MITRE Corp., Bedford, USA

MSC2000: 90C26, 90C35, 90B06, 90B10

Article Outline
Keywords
See also
References

Keywords
Flows in networks; Global optimization; Nonconvex programming; Fixed charge transportation problem

The minimum concave transportation problem MCTP concerns the least cost method of carrying flow on a bipartite network in which the marginal cost for an arc is a nonincreasing function of the flow on that arc. A bipartite network contains source nodes and sink nodes, but no transshipment (i.e., intermediate) nodes. The MCTP can be formulated as

  \min \sum_{(i,j) \in A} \phi_{ij}(x_{ij})   (1)

subject to:

  \sum_{j \in N} x_{ij} = s_i, \quad \forall i \in M,   (2)

  \sum_{i \in M} x_{ij} = d_j, \quad \forall j \in N,   (3)

  x_{ij} \ge 0, \quad \forall (i,j) \in A,   (4)

where M is the set of source nodes; N is the set of sink nodes; s_i is the supply at source node i; d_j is the demand at sink node j; A = {(i, j) : i ∈ M, j ∈ N} is the (directed) arc set; x_ij is the flow carried on arc (i, j); and φ_ij(x_ij) is the concave cost function for arc (i, j). Objective function (1) minimizes total costs; constraints (2) balance flow at the source nodes; and constraints (3) balance flow at the sink nodes. If Σ_{i ∈ M} s_i is less (greater) than Σ_{j ∈ N} d_j, then a dummy source (sink) node can be added to set M (N).

MCTPs arise naturally in distribution problems involving shipments sent directly from supply points to demand points in which the transportation costs exhibit economies of scale [21]. However, the MCTP is not limited to this class of problems. Specifically, any network flow problem with arc cost functions that are not concave can be converted to a network flow problem on an expanded network whose arc cost functions are all concave [16]. Then, the expanded network can be converted to a bipartite network by replacing each transshipment node with a source node and a sink node. Arc flow capacities can be removed by adding additional source nodes, one for each capacitated arc [19,23].

The fixed charge transportation problem FCTP is a type of MCTP in which the cost function φ_ij(x_ij) for each arc (i, j) ∈ A is of the form

  \phi_{ij}(x_{ij}) = \begin{cases} 0 & \text{if } x_{ij} = 0, \\ f_{ij} + g_{ij} x_{ij} & \text{if } x_{ij} > 0, \end{cases}   (5)

where f_ij and g_ij are coefficients with f_ij ≥ 0. FCTPs are commonly used to model network flow problems involving setup costs [9]. Furthermore, a variety of combinatorial problems can be converted to FCTPs. For instance, consider the 0–1 knapsack problem KP. The KP is formulated as

  \max \sum_{k=1}^{n} c_k y_k   (6)

subject to:

  \sum_{k=1}^{n} a_k y_k \le b,   (7)

  y_k \in \{0, 1\}, \quad k = 1, \dots, n,   (8)

with a_k ≥ 0 and c_k ≥ 0 for k = 1, …, n. The KP can be converted to a FCTP with two source nodes and n + 1 sink nodes. Define a_{n+1} = b and c_{n+1} = 0. Then the network is specified as M = {1, 2}, N = {1, …, n+1}, s_1 = b, s_2 = Σ_{k=1}^{n} a_k, and d_j = a_j for j = 1, …, n+1; and the cost function is of the form of (5) where, for each arc (i, j) ∈ A, the coefficients f_ij and g_ij are given by

  f_{ij} = \begin{cases} \sum_{k=1}^{n} c_k & \text{if } j = 1, \dots, n, \\ 0 & \text{if } j = n + 1, \end{cases}   (9)

  g_{ij} = \begin{cases} -c_j / a_j & \text{if } i = 1, \\ 0 & \text{if } i = 2. \end{cases}   (10)

For j = 1, …, n, sink node j has two incoming arcs, exactly one of which will have nonzero flow in the optimal solution to the FCTP. If x_{1j} > 0 in the FCTP, then y_j = 1 in the KP. If x_{2j} > 0 in the FCTP, then y_j = 0 in the KP. One consequence of this result is that any integer programming problem with integer coefficients can (in principle) be formulated and solved as a FCTP by first converting the integer program to a KP [10].

Exact solution methods for the MCTP are predominantly branch and bound enumeration procedures [2,3,4,6,8,11,12,15]. Binary partitioning is used for the FCTP, and interval partitioning is used for the MCTP with arbitrary concave arc cost functions. Finite convergence of the method was shown by R.M. Soland [22]. The convex envelope of the cost function φ_ij(x_ij) is an affine function. Hence, a subproblem in the branch and bound procedure can be solved efficiently as a linear transportation problem (LTP) [1]. Fathoming techniques (such as 'up and down penalties' and 'capacity improvement') based on post-optimality analysis of the LTP facilitate the branch and bound procedure for the MCTP [2,3,18,20]. The LTP is also used in approximate solution methods for the MCTP, which rely on successive linearizations of the concave cost function φ_ij(x_ij) [5,13,14]. Test problems for the MCTP are given in [7,8,12,17,20].
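The KP-to-FCTP construction in (9)–(10) is mechanical and easily scripted. The sketch below is a minimal Python illustration (the 3-item knapsack data are an arbitrary choice) that builds the network data for a given instance:

# Build the FCTP data corresponding to a 0-1 knapsack instance,
# following (9)-(10).  Illustrative data: a 3-item knapsack.
a, c, b = [2, 3, 4], [3, 4, 5], 5
n = len(a)
a, c = a + [b], c + [0]          # a_{n+1} = b and c_{n+1} = 0

M = [1, 2]                       # two source nodes
N = list(range(1, n + 2))        # n + 1 sink nodes
supply = {1: b, 2: sum(a[:n])}   # s_1 = b, s_2 = sum of the a_k
demand = {j: a[j - 1] for j in N}

f = {(i, j): (sum(c[:n]) if j <= n else 0) for i in M for j in N}              # (9)
g = {(i, j): (-c[j - 1] / a[j - 1] if i == 1 else 0.0) for i in M for j in N}  # (10)

print(supply, demand)            # total supply and total demand both equal 14
print(f[1, 1], g[1, 1])          # arc (1,1): fixed charge 12, unit cost -1.5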

See also
- Bilevel Linear Programming: Complexity, Equivalence to Minmax, Concave Programs
- Concave Programming
- Motzkin Transposition Theorem
- Multi-index Transportation Problems
- Stochastic Transportation and Location Problems

References
1. Balinski ML (1961) Fixed-cost transportation problems. Naval Res Logist 8:41–54
2. Barr RS, Glover F, Klingman D (1981) A new optimization method for large scale fixed charge transportation problems. Oper Res 29:448–463
3. Bell GB, Lamar BW (1997) Solution methods for nonconvex network problems. In: Pardalos PM, Hearn DW, Hager WW (eds) Network Optimization. Lecture Notes Economics and Math Systems. Springer, Berlin, pp 32–50
4. Cabot AV, Erenguc SS (1984) Some branch-and-bound procedures for fixed-cost transportation problems. Naval Res Logist 31:145–154
5. Diaby M (1991) Successive linear approximation procedure for generalized fixed-charge transportation problems. J Oper Res Soc 42:991–1001
6. Florian M, Robilland P (1971) An implicit enumeration algorithm for the concave cost network flow problem. Managem Sci 18:184–193
7. Floudas CA, Pardalos PM (1990) A collection of test problems for constrained global optimization algorithms. Lecture Notes Computer Sci, vol 455. Springer, Berlin
8. Gray P (1971) Exact solution of the fixed-charge transportation problem. Oper Res 19:1529–1538
9. Guisewite GM, Pardalos PM (1990) Minimum concave-cost network flow problems: Applications, complexity, and algorithms. Ann Oper Res 25:75–100
10. Kendall KE, Zionts S (1977) Solving integer programming problems by aggregating constraints. Oper Res 25:346–351
11. Kennington J (1976) The fixed-charge transportation problem: A computational study with a branch-and-bound code. AIIE Trans 8:241–247
12. Kennington J, Unger VE (1976) A new branch-and-bound algorithm for the fixed charge transportation problem. Managem Sci 22:1116–1126
13. Khang DB, Fujiwara O (1991) Approximate solutions of capacitated fixed-charge minimum cost network flow problems. Networks 21:689–704
14. Kim D, Pardalos PM (1999) A solution approach to the fixed charge network flow problem using a dynamic slope scaling procedure. Oper Res Lett 24:195–203
15. Lamar BW (1993) An improved branch and bound algorithm for minimum concave cost network flow problems. J Global Optim 3:261–287
16. Lamar BW (1993) A method for solving network flow problems with general nonlinear arc costs. In: Du D-Z, Pardalos PM (eds) Network Optimization Problems: Algorithms, Applications, and Complexity. World Sci., Singapore, pp 147–167
17. Lamar BW, Wallace CA (1996) A comparison of conditional penalties for the fixed charge transportation problem. Techn. Report Dept. Management Univ. Canterbury
18. Lamar BW, Wallace CA (1997) Revised-modified penalties for fixed charge transportation problems. Managem Sci 43:1431–1436
19. Lawler EL (1976) Combinatorial optimization: Networks and matroids. Holt, Rinehart and Winston, New York
20. Palekar US, Karwan MH, Zionts S (1990) A branch-and-bound method for the fixed charge transportation problem. Managem Sci 36:1092–1105
21. Rech P, Barton LG (1970) A non-convex transportation algorithm. In: Beale EML (ed) Applications of Mathematical Programming Techniques. English Univ. Press, London
22. Soland RM (1974) Optimal facility location with concave costs. Oper Res 22:373–382
23. Wagner HM (1959) On a class of capacitated transportation problems. Managem Sci 5:304–318

Minimum Cost Flow Problem

RAVINDRA K. AHUJA 1, THOMAS L. MAGNANTI 2, JAMES B. ORLIN 3
1 Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA
2 Sloan School of Management and Department of Electrical Engineering and Computer Sci., Massachusetts Institute Technol., Cambridge, USA
3 Sloan School of Management, Massachusetts Institute Technol., Cambridge, USA

MSC2000: 90C35

Article Outline
Keywords
Applications
  Distribution Problems
  Airplane Hopping Problem
  Directed Chinese Postman Problem
Preliminaries
  Assumptions
  Graph Notation
  Residual Network
  Order Notation
Cycle-Canceling Algorithm
Successive Shortest Path Algorithm
Network Simplex Algorithm
See also
References

Keywords
Network; Minimum cost flow problem; Cycle-canceling algorithm; Successive shortest path algorithm; Network simplex algorithm

The minimum cost flow problem seeks a least cost shipment of a commodity through a network to satisfy demands at certain nodes from available supplies at other nodes. This problem has many varied applications: the distribution of a product from manufacturing plants to warehouses, or from warehouses to retailers; the flow of raw material and intermediate goods through various machining stations in a production line; the routing of automobiles through an urban street network; and the routing of calls through the telephone system. The minimum cost flow problem also has many less direct applications. In this article, we briefly introduce the theory, algorithms and applications of the minimum cost flow problem. [1] contains much additional material on this topic.

Let G = (N, A) be a directed network defined by a set N of n nodes and a set A of m directed arcs. Each arc (i, j) ∈ A has an associated cost c_ij that denotes the cost per unit flow on that arc. We assume that the flow cost varies linearly with the amount of flow. Each arc (i, j) ∈ A has an associated capacity u_ij denoting the maximum amount that can flow on this arc, and a lower bound l_ij that denotes the minimum amount that must flow on the arc. We assume that the capacity and flow lower bound for each arc (i, j) are integers. We associate with each node i ∈ N an integer b(i) representing its supply/demand. If b(i) > 0, node i is a supply node; if b(i) < 0, then node i is a demand node with a demand of -b(i); and if b(i) = 0, then node i is a transshipment node. We assume that Σ_{i ∈ N} b(i) = 0. The decision variables x_ij are arc flows defined for each arc (i, j) ∈ A. The minimum cost flow problem is an optimization model formulated as follows:

  \text{Minimize} \sum_{(i,j) \in A} c_{ij} x_{ij}   (1)

subject to

  \sum_{\{j : (i,j) \in A\}} x_{ij} - \sum_{\{j : (j,i) \in A\}} x_{ji} = b(i) \quad \text{for all } i \in N,   (2)

  l_{ij} \le x_{ij} \le u_{ij} \quad \text{for all } (i,j) \in A.   (3)

We refer to the constraints (2) as the mass balance constraints. For a fixed node i, the first term in the constraint (2) represents the total outflow of node i and the second term represents the total inflow of node i. The mass balance constraints state that outflow minus inflow must equal the supply/demand of each node. The flow must also satisfy the lower bound and capacity constraints (3), which we refer to as flow bound constraints. This article is organized as follows. To help in understanding the applicability of the minimum cost flow problem, we begin in Section 2 by describing several applications. In Section 3, we present preliminary material needed in the subsequent sections. We next discuss algorithms for the minimum cost flow problem, describing the cycle-canceling algorithm in Section 4 and the successive shortest path algorithm in Section 5. The cycle-canceling algorithm identifies negative cost cycles in the network and augments flows along them. The successive shortest path algorithm augments flow along shortest cost augmenting paths from the supply nodes to the demand nodes. In Section 6, we describe the network simplex algorithm. Applications Minimum cost flow problems arise in almost all industries, including agriculture, communications, defense, education, energy, health care, manufacturing, medicine, retailing, and transportation. Indeed, minimum cost flow problems are pervasive in practice. In this section, by considering a few selected applications that arise in distribution systems planning, capacity planning, and vehicle routing, we give a passing glimpse of these applications. Distribution Problems A large class of network flow problems center around distribution applications. One core model is often described in terms of shipments from plants to warehouses (or, alternatively, from warehouses to retailers). Suppose a firm has p plants with known supplies and q warehouses with known demands. It wishes to identify a flow that satisfies the demands at the warehouses from the available supplies at the plants and that minimizes

Minimum Cost Flow Problem

its shipping costs. This problem is a well-known special case of the minimum cost flow problem, known as the transportation problem. We next describe in more detail a slight generalization of this model that also incorporates manufacturing costs at the plants. A car manufacturer has several manufacturing plants and produces several car models at each plant that it then ships to geographically dispersed retail centers throughout the country. Each retail center requests a specific number of cars of each model. The firm must determine the production plan of each model at each plant and a shipping pattern that satisfies the demand of each retail center while minimizing the overall cost of production and transportation. We describe this formulation through an example. Figure 1 illustrates a situation with two manufacturing plants, two retailers, and three car models. This model has four types of nodes: i) plant nodes, representing various plants; ii) plant/model nodes, corresponding to each model made at a plant; iii) retailer/model nodes, corresponding to the models required by each retailer; and iv) retailer nodes corresponding to each retailer. The network contains three types of arcs: i) production arcs; ii) transportation arcs; and iii) demand arcs. The production arcs connect a plant node to a plant/ model node; the cost of this arc is the cost of producing the model at that plant. We might place lower and upper bounds on production arcs to control for the minimum and maximum production of each particular car model at the plants. Transportation arcs connect plant/model nodes to retailer/model nodes; the cost of any such arc is the total cost of shipping one car from the manufacturing plant to the retail center. The transportation arcs might have lower or upper bounds imposed upon their flows to model contractual agreements with shippers or capacities imposed upon any distribution channel. Finally, demand arcs connect retailer/model nodes to the retailer nodes. These arcs have zero costs and positive lower bounds that equal the demand of that model at that retail center. The production and shipping schedules for the automobile company correspond in a one-to-one fashion with the feasible flows in this network model. Conse-

M

quently, a minimum cost flow provides an optimal production and shipping schedule. Airplane Hopping Problem A small commuter airline uses a plane, with a capacity to carry at most p passengers, on a ‘hopping flight’ as shown in Fig. 2a). The hopping flight visits the cities 1, . . . , n, in a fixed sequence. The plane can pick up passengers at any node and drop them off at any other node. Let bij denote the number of passengers available at node i who want to go to node j, and let f ij denote the fare per passenger from node i to node j. The airline would like to determine the number of passengers that the plane should carry between the various origins to destinations in order to maximize the total fare per trip while never exceeding the plane’s capacity. Figure 2b) shows a minimum cost flow formulation of this hopping plane flight problem. The network contains data for only those arcs with nonzero costs and with finite capacities: any arc listed without an associated cost has a zero cost; any arc listed without an associated capacity has an infinite capacity. Consider, for example, node 1. Three types of passengers are available at node 1: those whose destination is node 2, node 3 or node 4. We represent these three types of passengers in a new derived network by the nodes 1 – 2, 1 – 3 and 1 – 4 with supplies b12 , b13 and b14 . A passenger available at any such node, say 1 – 3, could board the plane at its origin node represented by flowing through the arc (1 – 3, 1) and incurring a cost of  f 13 units (or profit of f 13 units). Or, the passenger might never board the plane, which we represent by the flow through the arc (1 – 3, 3). It is easy to establish a one-to-one correspondence between feasible flows in Fig. 2b) and feasible loading of the plane with passengers. Consequently, a minimum cost flow in Fig. 2b) will prescribe a most profitable loading of the plane. Directed Chinese Postman Problem The directed Chinese postman problem is a generic routing problem that can be stated as follows. In a directed network G = (N, A) in which each arc (i, j) has an associated cost cij , we wish to identify a walk of minimum cost that starts at some node (the post office), visits each arc of the network at least once, and returns to the starting point (see the next Section for the def-

2097

2098

M

Minimum Cost Flow Problem

p1

p1 /m1

r1 /m 1

p1 /m2

r1 /m2 r1

p2 /m1

p2 /m2

r1/m3

r2 /m1

p2

Plant nodes

p2 /m3

r2 /m2

Plant /model nodes

Retailer/model nodes

r2

Retailer nodes

Minimum Cost Flow Problem, Figure 1 Formulating the production-distribution problem

inition of a walk). This problem has become known as the Chinese postman problem because a Chinese mathematician, K. Mei-Ko, first discussed it. The Chinese postman problem arises in other settings as well; for instance, patrolling streets by police, routing street sweepers and household refuse collection vehicles, fuel oil delivery to households, and spraying roads with sand during snowstorms. The directed Chinese postman problem assumes that all arcs are directed, that is, the postal carrier can traverse an arc in only one direction (like one-way streets). In the directed Chinese postman problem, we are interested in a closed (directed) walk that traverses each arc of the network at least once. The network might not contain any such walk. It is easy to show that a network contains a desired walk if and only if the net-

work is strongly connected, that is, every node in the network is reachable from every other node via a directed path. Simple graph search algorithms are able to determine whether the network is strongly connected, and we shall therefore assume that the network is strongly connected. In an optimal walk, a postal carrier might traverse arcs more than once. The minimum length walk minimizes the sum of lengths of the repeated arcs. Let xij denote the number of times the postal carrier traverses arc (i, j) in a walk. Any carrier walk must satisfy the following conditions: X X xi j  x ji D 0 for all i 2 N; (4) f j : (i; j)2Ag f j : ( j;i)2Ag xi j  1

for all (i; j) 2 A:

(5)

Minimum Cost Flow Problem

M

Minimum Cost Flow Problem, Figure 2 Formulation of the hopping plane flight problem as a minimum cost flow problem

The constraints (4) state that the carrier enters a node the same number of times that he or she leaves it. The constraints (5) state that the carrier must visit each arc at least once. Any solution x satisfying the system (4)–(5) defines a carrier’s walk. We can construct a walk in the following manner. Given a flow xij , we replace each arc (i, j) with xij copies of the arc, each arc carrying a unit flow. In the resulting network, say G0 = (N, A0 ), each node has the same number of outgoing arcs as it has the incoming arcs. It is possible to decompose this network into at most m/2 arc-disjoint directed cycles (by walking along an arc (i, j) from some node i with xij > 0, leaving an node each time we enter it until we repeat a node). We can connect these cycles together to form a closed walk of the carrier. The preceding discussion shows that the solution x defined by a feasible walk for the carrier satisfies conditions (4)–(5), and, conversely, every feasible solution of system (4)–(5) defines a walk of the postman. The length of a walk defined by the solution x equals P (i, j) 2 A cij xij . This problem is an instance of the minimum cost flow problem.

Preliminaries In this Section, we discuss some preliminary material required in the following sections.

Assumptions We consider the minimum cost flow problem subject to the following six assumptions: 1) lij = 0 for each (i, j) 2 A; 2) all data (cost, supply/demand, and capacity) are integral; 3) all arc costs are nonnegative; 4) for any pair of nodes i and j, the network does not contain both the arcs (i, j) and (j, i); 5) the minimum cost flow problem has a feasible solution; and 6) the network contains a directed path of sufficiently large capacity between every pair of nodes. It is possible to show that none of these assumptions, except 2), restricts the generality of our development. We impose them just to simply our discussion.

2099

2100

M

Minimum Cost Flow Problem

Graph Notation We use standard graph notation. A directed graph G = (N, A) consists of a set N of nodes and a set A of arcs. A directed arc (i, j) has two endpoints, i and j. An arc (i, j) is incident to nodes i and j. The arc (i, j) is an outgoing arc of node i and an incoming arc of node j. A walk in a directed graph G = (N, A) is a sequence of nodes and arcs i1 , a1 , i2 , a2 , . . . , ir satisfying the property that for all 1  k  r 1, either ak = (ik , ik + 1 ) 2 A or ak = (ik + 1 , ik ) 2 A. We sometimes refer to a walk as a sequence of arcs (or nodes) without any explicit mention of the nodes (or arcs). A directed walk is an oriented version of the walk in the sense that for any two consecutive nodes ik and ik + 1 on the walk, ak = (ik , ik + 1 ) 2 A. A path is a walk without any repetition of nodes, and a directed path is a directed walk without any repetition of nodes. A cycle is a path i1 , i2 , . . . , ir together with the arc (ir , i1 ) or (i1 , ir ). A directed cycle is a directed path i1 , i2 , . . . , ir together with the arc (ir , i1 ). A spanning tree of a directed graph G is a subgraph G0 = (N, A0 ) with A0 A that is connected (that is, contains a path between every pair of nodes) and contains no cycle. Residual Network The algorithms described in this article rely on the concept of a residual network G(x) corresponding to a flow x. For each arc (i, j) 2 A, the residual network contains two arcs (i, j) and (j, i). The arc (i, j) has cost cij and residual capacity rij = uij  xij , and the arc (j, i) has cost cji =  cij and residual capacity rji = xij . The residual network consists of arcs with positive residual capacity. If (i, j) 2 A, then sending flow on arc (j, i) in G(x) corresponds to decreasing flow on arc (i, j); for this reason, the cost of arc (j, i) is the negative of the cost of arc (i, j). These conventions show how to determine the residual network G(x) corresponding to any flow x. We can also determine a flow x from the residual network G(x) as follows. If rij > 0, then using the definition of residual capacities and Assumption 4), we set xij = uij  rij if (i, j) 2 A, and xji = rij otherwise. We define the cost of a directed cycle W in the residual network G(x) P as (i, j) 2 W cij . Order Notation In our discussion, we will use some well-known notation from the field of complexity theory. We say that

an algorithm for a problem P is an O(n3 ) algorithm, or has a worst-case complexity of O(n3 ), if it is possible to solve any instance of P using a number of computations that is asymptotically bounded by some constant times the term n3 . We refer to an algorithm as a polynomial time algorithm if its worst-case running time is bounded by a polynomial function of the input size parameters, which for a minimum cost flow problem, are n, m, log C (the number of bits needed to specify the largest arc cost), and log U (the number of bits needed to specify the largest arc capacity). A polynomial time algorithm is either a strongly polynomial time algorithm (when the complexity terms involves only n and m, but not log C or log U), or is a weakly polynomial time algorithm (when the complexity terms include log C or log U or both). We say that an algorithm is a pseudopolynomial time algorithm if its worst-case running time is bounded by a polynomial function of n, m and U. For example, an algorithm with worst-case complexity of O(nm2 log n) is a strongly polynomial time algorithm, an algorithm with worst-case complexity O(nm2 log U) is a weakly polynomial time algorithm, and an algorithm with worst-case complexity of O(n2 mU) is a pseudopolynomial time algorithm. Cycle-Canceling Algorithm In this Section, we describe the cycle-canceling algorithm, one of the more popular algorithms for solving the minimum cost flow problem. The algorithm sends flows (called augmenting flows) along directed cycles with negative cost (called negative cycles). The algorithm rests upon the following negative cycle optimality condition stated as follows. Theorem 1 (Negative cycle optimality condition) A feasible solution x is an optimal solution of the minimum cost flow problem if and only if the residual network G(x ) contains no negative cost (directed) cycle. It is easy to see the necessity of these conditions. If the residual network G(x ) contains a negative cycle (that is, a negative cost directed cycle), then by augmenting positive flow along this cycle, we can decrease the cost of the flow. Conversely, it is possible to show that if the residual network G(x ) does not contain any negative cost cycle, then x must be an optimal flow. The negative cycle optimality condition suggests one simple algorithmic approach for solving the min-

Minimum Cost Flow Problem

Minimum Cost Flow Problem, Figure 3 Cycle-canceling algorithm

imum cost flow problem, which we call the cyclecanceling algorithm. This algorithm maintains a feasible solution and at every iteration improves the objective function value. The algorithm first establishes a feasible flow x in the network by solving a related (and easily solved) problem known as the maximum flow problem. Then it iteratively finds negative cycles in the residual network and augments flows on these cycles. The algorithm terminates when the residual network contains no negative cost directed cycle. Theorem 1 implies that when the algorithm terminates, it has found a minimum cost flow. Figure 3a specifies this generic version of the cycle-canceling algorithm. The numerical example shown in Fig. 4a) illustrates the cycle-canceling algorithm. This figure shows the arc

M

costs and the starting feasible flow in the network. Each arc in the network has a capacity of 2 units. Figure 4b) shows the residual network corresponding to the initial flow. We do not show the residual capacities of the arcs in Fig. 4b) since they are implicit in the network structure. If the residual network contains both arcs (i, j) and (j, i) for any pair i and j of nodes, then both have residual capacity equal to 1; and if the residual network contains only one arc, then its capacity is 2 (this observation uses the fact that each arc capacity equals 2). The residual network shown in Fig. 4b) contains a negative cycle 1 – 3 – 2 – 1 with cost – 3. By augmenting a unit flow along this cycle, we obtain the residual network shown in Fig. 4c). The residual network shown in Fig. 4c) contains a negative cycle 6 – 4 – 5 – 6 with cost – 4. We augment unit flow along this cycle, producing the residual network shown in Fig. 4d), which contain no negative cycle. Given the optimal residual network, we can determine optimal flow using the method described in the previous Section. A byproduct of the cycle-canceling algorithm is the following important result. Theorem 2 (Integrality property) If all arc capacities and supply/demands of nodes are integer, then the minimum cost flow problem always has an integer minimum cost flow.

Minimum Cost Flow Problem, Figure 4 Illustration of the cycle-canceling algorithm. a) the original network with flow x and arc costs; b) the residual network G(x); c) the residual network after augmenting a unit of flow along the cycle 2 – 1 – 3 – 2; d) the residual network after augmenting a unit of flow along the cycle 4 – 5 – 6 – 4

2101

2102

M

Minimum Cost Flow Problem

This result follows from the fact that for problems with integer arc capacities and integer node supplies/demand, the cycle-canceling algorithm starts with an integer solution (which is provided by the maximum flow algorithm used to obtain the initial feasible flow) and at each iteration augments flow by an integral amount. What is the worst-case computational requirement (complexity) of the cycle-canceling algorithm? The algorithm must repeatedly identify negative cycles in the residual network. We can identify a negative cycle in the residual network in O(nm) time using a shortest path label-correcting algorithm [1]. How many times must the generic cycle-canceling algorithm perform this computation? For the minimum cost flow problem, mCU is an upper bound on the initial flow cost (since cij  C and xij  U for all (i, j) 2 A) and  mCU is a lower bound on the optimal flow cost (since cij   C and xij  U for all (i, j) 2 A). Any iteration of the cycle-canceling algorithm changes the objective P function value by an amount (i, j) 2 W ci, j ) ı, which is strictly negative. Since we have assumed that the problem has integral data, the algorithm terminates within O(mCU) iterations and runs in O(nm2 CU) time, which is a pseudopolynomial running time. The generic version of the cycle-canceling algorithm does not specify the order for selecting negative cycles from the network. Different rules for selecting negative cycles produce different versions of the algorithm, each with different worst-case and theoretical behavior. Two versions of the cycle-canceling algorithm are polynomial time implementations: i) a version that augments flow in arc-disjoint negative cycles with the maximum improvement [2]; and ii) a version that augments flow along a negative cycle with minimum mean cost, that is, the average cost per arc in the cycle [4]). Successive Shortest Path Algorithm The cycle-canceling algorithm maintains feasibility of the solution at every step and attempts to achieve optimality. In contrast, the successive shortest path algorithm maintains optimality of the solution at every step (that is, the condition that the residual network G(x) contains no negative cost cycle) and strives to attain feasibility. It maintains a solution x, called a pseudoflow

(see below), that is nonnegative and satisfies the arcs’ flow capacity restrictions, but violates the mass balance constraints of the nodes. At each step, the algorithm selects a node k with excess supply (i. e., supply not yet sent to some demand node), a node l with unfulfilled demand, and sends flow from node k to node l along a shortest path in the residual network. The algorithm terminates when the current solution satisfies all the mass balance constraints. To be more precise, a pseudoflow is a vector x satisfying only the capacity and nonnegativity constraints; it need not satisfy the mass balance constraints. For any pseudoflow x, we define the imbalance of node i as e(i) D b(i) C

X f j;i)2Ag

x ji 

X

xi j

f(i; j)2Ag

for all i 2 N: (6) If e(i) > 0 for some node i, then we refer to e(i) as the excess of node i; if e(i) < 0, then we refer to  e(i) as the node’s deficit. We refer to a node i with e(i) = 0 as balanced. Let E and D denote the sets of excess and deficit P P nodes in the network. Notice that i 2 N e(i) = i 2 N P P b(i) = 0, which implies that i 2 E e(i) =  i 2 D e(i). Consequently, if the network contains an excess node, then it must also contain a deficit node. The residual network corresponding to a pseudoflow is defined in the same way that we define the residual network for a flow. The successive shortest path algorithm uses the following result. Theorem 3 (Shortest augmenting path theorem) Suppose a pseudoflow (or a flow) x satisfies the optimality conditions and we obtain x0 from x by sending flow along a shortest path from node k to some other node l in the residual network, then x0 also satisfies the optimality conditions. To prove this Theorem, we would show that if the residual network G(x) contain no negative cycle, then augmenting flow along any shortest path does not introduce any negative cycle (we will not establish this result in this discussion). Figure 5 gives a formal description of the successive shortest path algorithm. The numerical example shown in Fig. 6a) illustrates the successive shortest path algorithm. The algorithm starts with x = 0, and at this value of flow, the residual network is identical to the starting network. Just as we

Minimum Cost Flow Problem

BEGIN x := 0; e(i) = b(i) for all i 2 N; initialize the sets E and D; WHILE E ¤ ; DO BEGIN select a node k 2 E and a node l 2 D; identify a shortest path P in G(x) from node k to node l; ı := min[e(s); e(t); minfr i j : (i; j) 2 Pg]; augment ı units of flow along the path P and update x and G(x); END END Minimum Cost Flow Problem, Figure 5 Successive shortest path algorithm

observed in Fig. 4, whenever the residual network contains both the arcs (i, j) and (j, i), the residual capacity of each arc is 1. If the residual network contains only one arc, (i, j) or (j, i), then its residual capacity is 2 units. For this problem, E = {1} and D = {6}. In the residual network shown in Fig. 6a), the shortest path from node 1 to node 6 is 1 – 2 – 4 – 6 with cost equal to 9. The residual capacity of this path equals 2. Augmenting two units of flow along this path produces the residual network shown in Fig. 6b), and the next shortest path from

M

node 1 to node 6 is 1 – 3 – 5 – 6 with cost equal to 10. The residual capacity of this path is 2 and we augment two unit of flow on it. At this point, the sets E = D = ;, and the current solution solves the minimum cost flow problem. To show that the algorithm correctly solves the minimum cost flow problem, we argue as follows. The algorithm starts with a flow x = 0 and the residual network G(x) is identical to the original network. Assumption 3) implies that all arc costs are nonnegative. Consequently, the residual network G(x) contains no negative cycle and so the flow vector x satisfies the negative cycle optimality conditions. Since the algorithm augments flow along a shortest path from excess nodes to deficit nodes, Theorem 3 implies that the pseudoflow maintained by the algorithm always satisfies the optimality conditions. Eventually, node excesses and deficits become zero; at this point, the solution maintained by the algorithm is an optimal flow. What is the worst-case complexity of this algorithm? In each iteration, the algorithm reduces the excess of some node. Consequently, if U is an upper bound on the largest supply of any node, then the algorithm would terminate in at most nU iterations. We can determine a shortest path in G(x) in O(nm) time using a label-correcting shortest path algorithm [1]. Consequently, the running time of the successive shortest path algorithm is n2 mU.

Minimum Cost Flow Problem, Figure 6 Illustration of the successive shortest path algorithm. a) the residual network corresponding to x = 0; b) the residual network after augmenting 2 units of flow along the path 1 – 2 – 4 – 6; c) the residual network after augmenting 2 units of flow along the path 1 – 3 – 5 – 6

2103

2104

M

Minimum Cost Flow Problem

Minimum Cost Flow Problem, Figure 7 Computing flows for a spanning tree

The successive shortest path algorithm requires pseudopolynomial time to solve the minimum cost flow problem since it is polynomial in n, m and the largest supply U. This algorithm is, however, polynomial time for some special cases of the minimum cost flow problem (such as the assignment problem for which U = 1). Researchers have developed weakly polynomial time and strongly polynomial time versions of the successive shortest path algorithm; some notable implementations are due to [3] and [5].

Network Simplex Algorithm The network simplex algorithm for solving the minimum cost flow problem is an adaptation of the wellknown simplex method for general linear programs. Because the minimum cost flow problem is a highly structured linear programming problem, when applied to it, the computations of the simplex method become considerably streamlined. In fact, we need not explicitly maintain the matrix representation (known as the simplex tableau) of the linear program and can perform all of the computations directly on the network. Rather than presenting the network simplex algorithm as a special case of the linear programming simplex method, we will develop it as a special case of the cyclecanceling algorithm described above. The primary advantage of our approach is that it permits the network simplex algorithm to be understood without relying on linear programming theory. The network simplex algorithm maintains solutions called spanning tree solutions. A spanning tree solution partitions the arc set A into three subsets: i) T, the arcs in the spanning tree; ii) L, the nontree arcs whose flows are restricted to value zero;

iii) U, the nontree arcs whose flow values are restricted in value to the arcs’ flow capacities. We refer to the triple (T, L, U) as a spanning tree structure. Each spanning tree structure (T, L, U) has a unique solution that satisfies the mass balance constraints (2). To determine this solution, we set xij = 0 for all arcs (i, j) 2 L, xij = uij for all arcs (i, j) 2 U, and then solve the mass balance equations (2) to determine the flow values for arcs in T. To show that the flows on spanning tree arcs are unique, we use a numerical example. Consider the spanning tree T shown in Fig. 7a). Assume that U = ', that is, all nontree arcs are at their lower bounds. Consider the leaf node 4 (a leaf node is a node with exactly one arc incident to it). Node 4 has a supply of 5 units and has only one arc (4, 2) incident to it. Consequently, arc (4, 2) must carry 5 units of flow. So we set x42 = 5, add 5 units to b(2) (because it receives 5 units of flow sent from node 4), and delete arc (4, 2) from the tree. We now have a tree with one fewer node and next select another leaf node, node 5 with the supply of 5 units and the single arc (5, 2) incident to it. We set x52 = 5, again add 5 units to b(2), and delete the arc (5, 2) from the tree. Now node 2 becomes a leaf node with modified supply/demand of b(5) = 10, implying that node 5 has an unfulfilled demand of 10 units. Node 2 has exactly one incoming arc (1, 2) and to meet the demand of 10 units of node 2, we must send 10 units of flow on this arc. We set x12 = 10, subtract 10 units from b(1) (since node 1 sends 10 units), and delete the arc (1, 2) from the tree. We repeat this process until we have identified flow on all arcs in the tree. Figure 7b) shows the corresponding flow. Our discussion assumed that U is empty. If U were nonempty, we would first set xij = uij , add uij to b(j), and subtract uij from b(i) for each arc (i, j) 2 U, and then apply the preceding method.

Minimum Cost Flow Problem

M

Minimum Cost Flow Problem, Figure 8 Computing node potentials for a spanning tree

We say a spanning tree structure is feasible if its associated spanning tree solution satisfies all of the arcs’ flow bounds. We refer to a spanning tree structure as optimal if its associated spanning tree solution is an optimal solution of the minimum cost flow problem. We will now derive the optimality conditions for a spanning tree structure (T, L, U). The network simplex algorithm augments flow along negative cycles. To identify negative cycles quickly, we use the concept of node potentials. We define node potentials (i) so that the reduced cost for any arc in the spanning tree T is zero. That is, that is, ci j = cij  (i) + (j) = 0 for each (i, j) 2 T. With the help of an example, we show how to compute the vector of node potentials. Consider the spanning tree shown in Fig. 8a) with arc costs as shown. The vector has n variables and must satisfy n  1 equations, one for each arc in the spanning tree. Therefore, we can assign one potential value arbitrary. We assume that (1) = 0. Consider arc (1, 2) incident to node 1. The condition c12 = c12  (1) + (2) = 0 yields (2) =  5. We next consider arcs incident to node 2. Using the condition c52 = c52  (5)+ (2) = 0, we see that (5) =  3, and the condition c32 = c32  (3) + (2) = 0 shows that (3) =  2. We repeat this process until we have identified potentials of all nodes in the tree T. Figure 8b) shows the corresponding node potentials. Consider any nontree arc (k, l). Adding this arc to the tree T creates a unique cycle, which we denote as W kl . We refer to W kl as the fundamental cycle induced by the nontree arc (k, l). If (k, l) 2 L, then we define the orientation of the fundamental cycle as in the direction of (k, l), and if (k, l) 2 U, then we define the orienta-

tion opposite to that of (k, l). In other words, we define the orientation of the cycle in the direction of flow change permitted by the arc (k, l). We let c(W kl ) denote the change in the cost if we send one unit of flow on the cycle W kl along its orientation. (Notice that because of flow bounds, we might not always be able to send flow along the cycle W kl .) Let W k l denote the set of forward arcs in W kl (that is, those with the same orientation as (k, l)), and let W k l denote the set of backward arcs in W kl (that is, those with an opposite the orientation to arc (k, l)). Then, if we send one unit of flow along W kl , then the flow on arcs in W k l increases by one unit and the flow on arcs in W k l decreases by one unit. Therefore, X X ci j  ci j : c(Wk l ) D (i; j)2Wk l

(i; j)2W k l

Let c (W kl ) denote the change in the reduced costs if we send one unit of flow in the cycle W kl along its orientation, that is, X X c i j  c i j : c  (Wk l ) D (i; j)2W k l

(i; j)2W k l

It is easy to show that c (W kl ) = c(W kl ). This result follows from the fact that when we substitute ck l = cij  (i) + (j) and add the reduced costs around any cycle, then the node potentials (i) cancel one another. Next notice that the manner we defined node potentials ensures that each arc in the fundamental cycle W kl except the arc (k, l) has zero reduced cost. Consequently, if arc (k, l) 2 L, then c(Wk l ) D c  (Wk l ) D c k l ;

2105

2106

M

Minimum Cost Flow Problem

and if arc (k, l) 2 U, then 

c(Wk l ) D c (Wk l ) D

c k l :

This observation and the negative cycle optimality condition (Theorem 1) implies that for a spanning tree solution to be optimal, it must satisfy the following necessary conditions: c k l  0 for every arc (i; j) 2 L;

(7)

c k l  0 for every arc (i; j) 2 U:

(8)

It is possible to show that these conditions are also sufficient for optimality; that is, if any spanning tree solution satisfies the conditions (7)–(8), then it solves the minimum cost flow problem. We now have all the necessary ingredients to describe the network simplex algorithm. The algorithm maintains a feasible spanning tree structure at each iteration, which it successively transforms it into an improved spanning tree structure until the solution becomes optimal. The algorithm first obtains an initial spanning tree structure. If an initial spanning tree structure is not easily available, then we could use the following method to construct one: for each node i with b(i)  0, we connect node i to node 1 with an (artificial) arc of sufficiently large cost and large capacity; and for each node i with b(i) < 0, we connect node 1 to node i with an (artificial) arc of sufficiently large cost and capacity. These arcs define the initial tree T, all arcs in A define the set L, and U = ;. Since these artificial arcs have large costs, subsequent iterations will drive the flow on these arcs to zero. Given a spanning tree structure (T, L, U), we first check whether it satisfies the optimality conditions (7) and (8). If yes, we stop; otherwise, we select an arc (k, l) 2 L or (k, l) 2 U violating its optimality condition as an entering arc to be added to the tree T, obtain the fundamental cycle W kl induced by this arc, and augment the maximum possible flow in the cycle W kl without violating the flow bounds of the tree arcs. At this value of augmentation, the flow on some tree arc, say arc (p, q), reaches its lower or upper bound; we select this arc as an arc to leave the spanning tree T, adding it added to L or U depending upon its flow value. We next add arc (k, l) to T, giving us a new spanning tree structure. We repeat this process until the spanning tree structure

BEGIN determine an initial feasible tree structure (T; L; U); let x be the flow and let  be the corresponding node potentials; WHILE (some nontree arc violates its optimality condition) DO BEGIN select an entering arc (k; l) violating the optimality conditions; add arc (k; l) to the spanning tree T, thus forming a unique cycle Wk l ; augment the maximum possible flow ı in the cycle Wk l and identify a leaving arc (p; q) that reaches its lower or upper flow bound; update the flow x, the spanning tree structure (T; L; U) and the potentials ; END; END Minimum Cost Flow Problem, Figure 9 The network simplex algorithm

satisfies the optimality conditions. Figure 9 specifies the essential steps of the algorithm. To illustrate the network simplex algorithm, we use the numerical example shown in Fig. 10a). Figure 10b) shows a feasible spanning tree solution for the problem. For this solution, T = {(1, 2), (1, 3), (2, 4), (2, 5), (5, 6)}, L = {(2, 3), (5, 4)} and U = {(3, 5), (4, 6)}. We next compute c35 = 1. We introduce the arc (3, 5) into the tree, creating a cycle. Since (3, 5) is at its upper bound, the orientation of the cycle is opposite to that of (3, 5). The arcs (1, 2) and (2, 5) are forward arcs in the cycle and arcs (3, 5) and (1, 3) are backward arcs. The maximum increase in flow permitted by the arcs (3, 5), (1, 3), (1, 2), and (2, 5) without violating their upper and lower bounds is, respectively, 3, 3, 2, and 1 units. Thus, we augment 1 unit of flow along the cycle. The augmentation increases the flow on arcs (1, 2) and (2, 5) by one unit and decreases the flow on arcs (1, 3) and (3, 5) by one unit. Arc (2, 5) reaches its upper bound and we select it as the leaving arc. We update the spanning tree structure; Fig. 10c) shows the new spanning tree T and the new node potentials. The sets L and U become L = {(2, 3), (5, 4)} and U = {(2, 5), (4, 6)}. In the next iter-

Minimum Cost Flow Problem

M

Minimum Cost Flow Problem, Figure 10 Numerical example for the network simplex algorithm

ation, we select arc (4, 6) since this arc violates the arc optimality condition. We augment one unit flow along the cycle 6 – 4 – 2 – 1 – 3 – 5 – 6 and arc (3, 5) leaves the spanning tree. Figure 10d) shows the next spanning tree and the updated node potentials. All nontree arcs satisfy the optimality conditions and the algorithm terminates with an optimal solution of the minimum cost flow problem. The network simplex algorithm can select any nontree arc that violates its optimality condition as an entering arc. Many different rules, called pivot rules, are possible for choosing the entering arc, and these rules have different empirical and theoretical behavior. [1] describes some popular pivot rules. We call the process of moving from one spanning tree structure to another as a pivot operation. By choosing the right data structures for representing the tree T, it is possible to perform a pivot operation in O(m) time. To determine the number of iterations performed by the network simplex algorithm, we distinguish two cases. We refer to a pivot operation as nondegenerate if it augments a positive amount of flow in the cycle W kl (that is, ı > 0), and degenerate otherwise (that is, ı = 0). During a degenerate pivot, the cost of the spanning tree solution decreases by |ck l |ı. When combined with the integrality of data assumption (Assumption 2)

above), this result yields a pseudopolynomial bound on the number of nondegenerate iterations. However, degenerate pivots do not decrease the cost of flow and so are difficult to bound. There are methods to bound the number of degenerate pivots. Obtaining a polynomial bound on the number of iterations remained an open problem for quite some time; [6] suggested an implementation of the network simplex algorithm that runs in polynomial time. In any event, the empirical performance of the network simplex algorithm is very attractive. Empirically, it is one of the fastest known algorithms for solving the minimum cost flow problem. See also  Auction Algorithms  Communication Network Assignment Problem  Directed Tree Networks  Dynamic Traffic Networks  Equilibrium Networks  Evacuation Networks  Generalized Networks  Maximum Flow Problem  Multicommodity Flow Problems  Network Design Problems

2107

2108

M

MINLP: Application in Facility Location-allocation

 Network Location: Covering Problems  Nonconvex Network Flow Problems  Nonoriented Multicommodity Flow Problems  Piecewise Linear Network Flow Problems  Shortest Path Tree Algorithms  Steiner Tree Problems  Stochastic Network Problems: Massively Parallel Solution  Survivable Networks  Traffic Network Equilibrium References 1. Ahuja RK, Magnanti TL, Orlin JB (1993) Network flows: Theory, algorithms, and applications. Prentice-Hall, Englewood Cliffs, NJ 2. Barahona F, Tardos E (1989) Note of Weintraub’s minimum cost circulation algorithm. SIAM J Comput 18:579–583 3. Edmonds J, Karp RM (1972) Theoretical improvements in algorithmic efficiency for network flow problems. J ACM 19:248–264 4. Goldberg AV, Tarjan RE (1988) Finding minimum-cost circulations by canceling negative cycles. Proc. 20th ACM Symposium on the Theory of Computing, pp 388–397. Full paper: J ACM (1989) 36:873–886 5. Orlin JB (1988) A faster strongly polynomial minimum cost flow algorithm. Proc. 20th ACM Symp. Theory of Computing, pp 377–387. Full paper: Oper Res (1989) 41:338–350 6. Orlin JB (1997) A polynomial time primal network simplex algorithm for minimum cost flows. Math Program 78B:109– 129

MINLP: Application in Facility Location-allocation MARIANTHI IERAPETRITOU1 , CHRISTODOULOS A. FLOUDAS2 1 Department Chemical and Biochemical Engineering, Rutgers University, Piscataway, USA 2 Department Chemical Engineering, Princeton University, Princeton, USA MSC2000: 90C26 Article Outline Keywords Location-allocation Models Solution Approaches

Application: Development of Offshore Oil Fields See also References

Keywords MINLP; Facility location-allocation The location-allocation problem may be stated in the following general way: Given the location or distribution of a set of customers which could be probabilistic and their associated demands for a given product or service, determine the optimal locations for a number of service facilities and the allocation of their products or services to the costumers, so as to minimize total (expected) location and transportation costs. This problem finds a variety of applications involving the location of warehouses, distribution centers, service and production facilities and emergency service facilities. In the last section we are going to consider the development of an offshore oil field as a real-world application of the location-allocation problem. This problem involves the location of the oil platforms and the allocation of the oil wells to platforms. It was shown in [25] that the joint locationallocation problem is NP-hard even with all the demand points located along a straight line. In the next section alternative location-allocation models will be presented based on different objectives and the incorporation of consumer behavior, price elasticity and system dynamics within the location-allocation decision framework. Location-allocation Models In developing location-allocation models different objectives alternatives are examined. One possibility is to follow the approach in [5], to minimize the number of centers required to serve the population. This objective is appropriate when the demand is exogenously fixed. A more general objective is to maximize demand by optimally locating the centers as proposed in [10]. The demand maximization requires the incorporation of price elasticity representing the dependence of the costumer preference to the distance from the center. The cost of establishing the centers can also be incorporated in the

MINLP: Application in Facility Location-allocation

model as proposed in [13]. An alternative objective towards the implementation of costumer preference towards the nearest center is the minimization of an aggregated weighted distance which is called the median location-allocation problem. The simplest type of location-allocation problem is the Weber problem, as posed in [9], which involves locating a production center so as to minimize aggregate weighted distance from the different raw material sources. The extension of the Weber problem is the p-median location-allocation problem, which involves the optimal location of a set of p uncapacitated centers to minimize the total weighted distance between them and n demand locations. Here, each source is assumed to have infinite capacity. In continuous space, the p-median problem can be formulated as follows: 8 ˆ ˆ ˆ min ˆ ˆ ˆ ˆ < ˆ s.t. ˆ ˆ ˆ ˆ ˆ ˆ :

CD

p n X X

O i i j c i j

iD1 jD1 p X

 i j D 1;

i D 1; : : : ; n;

jD1

 i j D 0; 1;

i D 1; : : : ; n; j D 1; : : : ; p;

where Oi is the quantity demanded at location i whose coordinates are (xi , yi ); and ij is the binary variables that is assigned the value of 1 if demand point i is located to center j and zero otherwise. The above formulation allocate the consumers to their nearest center while ensuring that only one center will serve each customer. This however, can lead to disproportionally sized facilities. In the more realistic situation where the capacities of the facilities are limited to supplies of s1 , . . . , sn for i = 1, . . . , n facilities then the location-allocation problem takes the following form [24]: 8 ˆ ˆ ˆ min ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ

1 dz1 C  2 dz1 ; ˆ ˆ :0 D  > df1 C  > df2 : 1 dz2

(7)

2 dz2

This is a set of DAEs where the solutions for df1 /d˙z1 , df1 /dz1 , df2 /dz1 , df1 /dz2 , and df2 /dz2 are known functions of time obtained from the solution of the primal problem. The variables  1 (t) and  2 (t) are the adjoint variables and the solution of this problem is a backward integration in time with the following final time conditions: dJ dh0 dg0 df1 C 0 C 0 C > D 0: 1 dz1 dz1 dz1 d˙z1 Thus, the Lagrange multipliers for the end-time constraints are used as the final time conditions for the adjoint problem and are not included in the master problem formulation. The master problem is formulated using the solution of the primal problem, xk and zk (t), along with

MINLP: Applications in the Interaction of Design and Control

MINLP: Applications in the Interaction of Design and Control, Figure 3 Superstructure for reactor-separator-recycle system

MINLP: Applications in the Interaction of Design and Control, Figure 4 Noninferior solution set for the reactor-separator-recycle system

M

2129

2130

M

MINLP: Applications in the Interaction of Design and Control

MINLP: Applications in the Interaction of Design and Control, Figure 5 Dynamic responses of product compositions for three designs

the dual information, 00k , 00k , and  k (t). The master problem has the following form: 8 ˆ min b ˆ ˆ y; b ˆ ˆ ˆ ˆ ˆ s.t. b  J(x k ; y) ˆ ˆ Z tN ˆ ˆ ˆ ˆ 1k (t)f1 (˙z1k (t); z1k (t); z2k (t); x k ; y; t) dt C ˆ ˆ ˆ t ˆ Z 0t N ˆ ˆ ˆ ˆ ˆ C 2k (t) ˆ ˆ ˆ t0 ˆ ˆ ˆ ˆ f2 (z1k (t); z2k (t); x k ; y; t) dt ˆ ˆ ˆ ˆ ˆ C00k g00 (x k ; y) C 00k h00 (x k ; y); < k2K ; ˆ Z t N feas ˆ ˆ ˆ ˆ ˆ 1k (t) 0 ˆ ˆ ˆ t 0 ˆ ˆ ˆ ˆ f1 (˙z1k (t); z1k (t); z2k (t); x k ; y; t) dt ˆ ˆ Z tN ˆ ˆ ˆ ˆ C 2k (t)f2 (z1k (t); z2k (t); x k ; y; t) dt ˆ ˆ ˆ t0 ˆ ˆ ˆ ˆ C00k g00 (x k ; y) C 00k h00 (x k ; y); ˆ ˆ ˆ ˆ ˆ ˆ k 2 Kinfeas ; ˆ ˆ ˆ : y 2 f0; 1gq : (8)

The integral term can be evaluated since the profiles for zk (t) and  k (t) both are fixed and known. Note that this formulation has no restrictions on whether or not y variables participate in the the DAE system. Example: Reactor-Separator-Recycle System The example problem considered here is the design of a process involving a reaction step, a separation step, and a recycle loop. Fresh feed containing A and B flow into a an isothermal reactor where the first order irreversible reaction A ! B takes place. The product from the reactor is sent to a distillation column where the unreacted A is separated from the product B and sent back to the reactor. The superstructure is shown in Fig. 3. The model equations for the reactor (CSTR) and the separator (ideal binary distillation column) can be found in [12]. The specific problem design follows the work in [10]. For this problem, the single output is the product composition. The bottoms (product) composition is controlled by the vapor boil-up and the distillate composition is controlled by the reflux rate. Since only the product composition is specified, the distillate compo-

MINLP: Applications in the Interaction of Design and Control

sition set-point is free and left to be determined through the optimization. The cost function includes column and reactor capital and utility costs. costreactor D 17639Dr1:066(2Dr )0:802 ; (2:4N t )0:802 costcolumn D 6802D1:066 c C 548:8D1:55 c Nt ; costexchangers D 193023Vss0:65 ; costutilities D 72420Vss ; costreactor C costcolumn C costexchangers costtotal D ˇpay C ˇtax [costutilities]: The controllability measure is the time weighted ISE for the product composition: d D t(x B  x B )2 : dt The noninferior solution set is shown in Fig. 4, and Table 2 lists the solution information for three of the designs in the noninferior solution set. The dynamic profile for these three designs are shown in Fig. 5. All of the designs in the noninferior solution set are strippers. Since the feed enters at the top of the column, there is no reflux and thus no control loop for the distillate composition. The controllability of the process is increased by increasing the size of the reactor and decreasing the size of the column. The most controllable design has a large reactor and a single flash unit. MINLP: Applications in the Interaction of Design and Control, Table 2 Solution information for three designs

Solution Cost($) Capital($) Utility($) ISE Trays Feed Vr (kmol) V(kmol/hr) KV V (hr)

A 489; 000 321; 000 168; 000 0:0160 19 19 2057:9 138:94 90:94 0:295

B 534; 000 364; 000 170; 000 0:00379 8 8 3601:2 141:25 80:68 0:0898

C 736; 000 726; 000 10; 000 0:0011 1 1 15000 85:473 87:40 0:0156

M

See also  Chemical Process Planning  Control Vector Iteration  Duality in Optimal Control with First Order Differential Equations  Dynamic Programming: Continuous-time Optimal Control  Dynamic Programming and Newton’s Method in Unconstrained Optimal Control  Dynamic Programming: Optimal Control Applications  Extended Cutting Plane Algorithm  Generalized Benders Decomposition  Generalized Outer Approximation  Hamilton–Jacobi–Bellman Equation  Infinite Horizon Control and Dynamic Games  MINLP: Application in Facility Location-allocation  MINLP: Applications in Blending and Pooling Problems  MINLP: Branch and Bound Global Optimization Algorithm  MINLP: Branch and Bound Methods  MINLP: Design and Scheduling of Batch Processes  MINLP: Generalized Cross Decomposition  MINLP: Global Optimization with ˛BB  MINLP: Heat Exchanger Network Synthesis  MINLP: Logic-based Methods  MINLP: Outer Approximation Algorithm  MINLP: Reactive Distillation Column Synthesis  Mixed Integer Linear Programming: Mass and Heat Exchanger Networks  Mixed Integer Nonlinear Programming  Multi-objective Optimization: Interaction of Design and Control  Optimal Control of a Flexible Arm  Robust Control  Robust Control: Schur Stability of Polytopes of Polynomials  Semi-infinite Programming and Control Problems  Sequential Quadratic Programming: Interior Point Methods for Distributed Optimal Control Problems  Suboptimal Control References 1. Bahri PA, Bandoni JA, Romagnoli JA (1996) Effect of disturbances in optimizing control: Steady-state open-loop backoff problem. AIChE J 42(4):983–994

2131

2132

M

MINLP: Branch and Bound Global Optimization Algorithm

2. Brengel DD, Seider WD (1992) Coordinated design and control optimization of nonlinear processes. Comput Chem Eng 16(9):861–886 3. Duran MA, Grossmann IE (1986) An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Math Program 36:307–339 4. Elliott TR, Luyben WL (1995) Capacity-based approach for the quantitative assessment of process controllability during the conceptual design stage. Industr Eng Chem Res 34:3907–3915 5. Figueroa JL, Bahri PA, Bandoni JA, Romagnoli JA (1996) Economic impact of disturbances and uncertain parameters in chemical processes – A dynamic back-off analysis. Comput Chem Eng 20(4):453–461 6. Floudas CA (1995) Nonlinear and mixed integer optimization: Fundamentals and applications. Oxford Univ. Press, Oxford 7. Geoffrion AM (1972) Generalized Benders decomposition. J Optim Th Appl 10(4):237–260 8. Kocis GR, Grossmann IE (1987) Relaxation strategy for the structural optimization of process flow sheets. Industr Eng Chem Res 26(9):1869 9. Luyben ML, Floudas CA (1994) Analyzing the interaction of design and control–1. A multiobjective framework and application to binary distillation synthesis. Comput Chem Eng 18(10):933–969 10. Luyben ML, Floudas CA (1994) Analyzing the interaction of design and control–2. Reactor-separator-recycle system. Comput Chem Eng 18(10):971–994 11. Luyben ML, Luyben WL (1997) Essentials of process control. McGraw-Hill, New York 12. Luyben WL (1990) Process modeling, simulation, and control for chemical engineers, 2nd edn. McGraw-Hill, New York 13. Mohideen MJ, Perkins JD, Pistikopoulos EN (1996) Optimal design of dynamic systems under uncertainty. AIChE J 42(8):2251–2272 14. Morari M, Perkins J (1994) Design for operations. FOCAPD Conf Proc 15. Morari M, Zafiriou E (1989) Robust process control. Prentice-Hall, Englewood Cliffs, NJ 16. Narraway LT, Perkins JD (1993) Selection of control structure based on economics. Comput Chem Eng 18:S511–515 17. Narraway LT, Perkins JD (1993) Selection of process control structure based on linear dynamic economics. Industr Eng Chem Res 32:2681–2692 18. Narraway LT, Perkins JD, Barton GW (1991) Interaction between process design and process control: Economic analysis of process dynamics. J Process Control 1:243–250 19. Paules IV GE, Floudas CA (1989) APROS: Algorithmic development methodology for discrete-continuous optimization problems. Oper Res 37(6):902–915 20. Schweiger CA, Floudas CA (1997) Interaction of design and control: Optimization with dynamic models. In: Hager WW,

Pardalos PM (eds) Optimal Control: Theory, Algorithms, and Applications. Kluwer, Dordrecht, pp 388–435 21. Vassiliadis VS, Sargent RWH, Pantelides CC (1994) Solution of a class of multistage dynamic optimization problems 1. Problems without path constraints. Industr Eng Chem Res 33:2111–2122 22. Viswanathan J, Grossmann IE (1990) A combined penalty function and outer approximation method for MINLP optimization. Comput Chem Eng 14(7):769–782 23. Walsh S, Perkins JD (1996) Operability and control in process synthesis and design. In: Anderson JL (eds) Adv Chem Engin, vol 23. Acad. Press, New York, pp 301–402

MINLP: Branch and Bound Global Optimization Algorithm IOANNIS P. ANDROULAKIS Department of Biomedical Engineering, Rutgers University, Piscataway, USA MSC2000: 90C10, 90C26 Article Outline Keywords See also References

Keywords Mixed integer nonlinear programming; Global optimization; Branch and bound algorithms A wide range of nonlinear optimization problems involve integer or discrete variables in addition to continuous ones. These problem are denoted as mixed integer nonlinear programming (MINLP) problems. Integer variables correspond to logical decision describing whether certain actions do or do not take place, or modeling the sequence according to which those decisions take place. The nonlinear nature of the MINLP models may arise from:  nonlinear relations in the integer domain only  nonlinear relations in the continuous domain only  nonlinear relations in the joint domain, i. e., products of continuous and binary/integer variables.

MINLP: Branch and Bound Global Optimization Algorithm

The general mathematical formulation of the MINLP problems can be stated as follows: 8 min ˆ ˆ x;y ˆ ˆ ˆ ˆ ˆ x Li , then x Li =  i . b) If xi  x Li = 0 at the solution of the convex NLP and  i = x Li + (U  L)/i is such that  i < xUi , then xUi =  i . If neither bound constraint is active at the solution of the convex NLP for some variable xj , the problem can be solved by setting xj = xUj or xj = x Lj . Tests similar to those presented above are then used to update the bounds on xj . 2) Feasibility based range reduction tests: In addition to ensuring that tight bounds are available for the variables, the constraint underestimators are used to generate new constraints for the problem. Consider the constraint g i (x, y)  0. If its underestimating function g (x; y) D 0 at the solution of the coni vex NLP and its multiplier is i > 0, the constraint g (x; y)   i

UL i

A branch and bound global optimization algorithm has been proposed in [20]. It can be applied to problems in which the objective and constraints are functions involving any combination of binary arithmetic operations (addition, subtraction, multiplication and division) and functions that are either concave over the entire solution space (such as ln) or convex over this domain (such as exp). The algorithm starts with an automatic reformulation of the original nonlinear problem into a problem that involves only linear, bilinear, linear fractional, simple exponentiation, univariate concave and univariate convex terms. This is achieved through the introduction of new constraints and variables. The reformulated problem is then solved to global optimality using a branch and bound approach. Its special structure allows the construction of a convex relaxation at each node of the tree. The integer variables can be handled in two ways during the generation of the convex lower bounding problem. The integrality condition on the variables can be relaxed to yield a convex NLP which can then be solved globally. Alternatively, the integer variables can be treated directly and the convex lower bounding MINLP can be solved using a branch and bound algorithm as described earlier. This second approach is more computationally intensive but is likely to result in tighter lower bounds on the global optimum solution. In order to obtain an upper bound for the optimum solution, several methods have been suggested. The MINLP can be transformed to an equivalent nonconvex NLP by relaxing the integer variables. For example, a variable y ∈ {0, 1} can be replaced by a continuous variable z ∈ [0, 1] by including the constraint z − z · z = 0. The nonconvex NLP is then solved locally to provide an upper bound. Alternatively, the discrete variables could be fixed to some arbitrary value and the nonconvex NLP solved locally.

In [1] the SMIN-αBB algorithm was proposed, which is designed to address the following class of problems to global optimality:

min  f(x) + x^T A_0 y + c_0^T y
s.t. h(x) + x^T A_1 y + c_1^T y = 0
     g(x) + x^T A_2 y + c_2^T y ≤ 0
     x ∈ X ⊆ R^n
     y ∈ Y (integer),

where c_0, c_1 and c_2 are constant vectors, A_0, A_1 and A_2 are constant matrices, and f(x), h(x) and g(x) are functions with continuous second order derivatives. The solution strategy is an extension of the αBB algorithm for twice-differentiable NLPs [4,5,7]. It is based on the generation of two converging sequences of upper and lower bounds on the global optimum solution. A rigorous underestimation and convexification strategy for functions with continuous second order derivatives allows the construction of a lower bounding MINLP problem with convex functions in the continuous variables. If no mixed-bilinear terms are present (A_i = 0 for all i), the resulting MINLP can be solved to global optimality using the outer approximation algorithm (OA) [8]. Otherwise, the generalized Benders decomposition (GBD) [10] can be used, or the Glover transformations [11] can be applied to remove these bilinearities and permit the use of the OA algorithm. This convex MINLP provides a valid lower bound on the original MINLP. An upper bound on the problem can be obtained by applying the OA algorithm or the GBD to find a local solution. This bound generation strategy is incorporated within a branch and bound scheme: a lower and upper bound on the global solution are first obtained for the entire solution space. Subsequently, the domain is subdivided by branching on a binary or a continuous variable, thus creating new nodes for which upper and lower bounds can be computed. At each iteration, the node with the lowest lower bound is selected for branching. If the lower bounding MINLP for a node is infeasible, or if its lower bound is greater than the best upper bound, this node is fathomed. The algorithm is terminated when the best lower and upper bound are within a pre-specified tolerance of each other.

Before presenting the algorithmic procedure, an overview of the underestimation and convexification strategy is given, and some of the options available within the algorithm are discussed. In order to transform the MINLP problem of the form just described into a convex problem which can be solved to global optimality with the OA or GBD algorithm, the functions f(x), h(x) and g(x) must be convexified. The underestimation and convexification strategy used in the αBB algorithm has previously been described in detail [3,4,5]. Its main features are summarized here. In order to construct as tight an underestimator as possible, the nonconvex functions are decomposed into a sum of convex, bilinear, univariate concave and general nonconvex terms. The overall function underestimator can then be built by summing up the convex underestimators for all terms, according to their type. In particular, a new variable is introduced to replace each bilinear term, and is bounded by its convex envelope. The univariate concave terms are linearized. For each general nonconvex term nt(x) with Hessian matrix H_nt(x), a convex underestimator L(x) is defined as

      L(x) = nt(x) − Σ_i α_i (x_i^U − x_i)(x_i − x_i^L),        (1)

where x_i^U and x_i^L are the upper and lower bounds on variable x_i, respectively, and the α parameters are nonnegative scalars such that H_nt(x) + 2 diag(α_i) is positive semidefinite over the domain [x^L, x^U]. The rigorous computation of the α parameters using interval Hessian matrices is described in [3,4,5]. The underestimators are updated at each node of the branch and bound tree, as their quality strongly depends on the bounds on the variables.

An unusual feature of the SMIN-αBB algorithm is the strategy used to select branching variables. It follows a hybrid approach where branching may occur both on the integer and the continuous variables in order to fully exploit the structure of the problem being solved. After the node with the lowest lower bound has been identified for branching, the type of branching variable must be determined according to one of the following two criteria:
1) Branch on the binary variables first.
2) Solve a continuous relaxation of the nonconvex MINLP locally. Branch on a binary variable with a low degree of fractionality at the solution. If there is no such variable, branch on a continuous variable.
The first criterion results in the creation of an integer tree for the first q levels of the branch and bound tree, where q is the number of binary variables. At the lowest level of this integer tree, each node corresponds to a nonconvex NLP, and the lower and upper bounding problems at subsequent levels of the tree are NLP problems. The efficiency of this strategy lies in the minimization of the number of MINLPs that need to be solved. The combinatorial nature of the problem and its nonconvexities are handled sequentially. If branching occurs on a binary variable, the selection of that variable can be done randomly or by solving a relaxation of the nonconvex MINLP and choosing the most fractional variable at the solution. The second criterion selects a binary variable for branching only if it appears that the two newly created nodes will have significantly different lower bounds. Thus, if a variable is close to integrality at the solution of the relaxed problem, forcing it to take on a fixed value may lead to the infeasibility of one of the nodes or the generation of a high value for a lower bound, and therefore the fathoming of a branch of the tree. If no binary variable is close to integrality, a continuous variable is selected for branching. A number of rules have been developed for the selection of a continuous branching variable. Their aim is to determine which variable is responsible for the largest separation distances between the convex underestimating functions and the original nonconvex functions. These rules are presented in [2]. Variable bound updates performed before the generation of the convex MINLP have been found to greatly enhance the speed of convergence of the αBB algorithm for continuous problems [2]. For continuous variables, the variable bounds are updated by minimizing or maximizing the chosen variable subject to the convexified constraints being satisfied. In spite of its computational cost, this procedure often leads to significant improvements in the quality of the underestimators and hence a noticeable reduction in the number of iterations required.
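To illustrate the construction of the underestimator (1), the following simplified sketch computes a single uniform α from the eigenvalues of a constant Hessian; the rigorous algorithm derives the α parameters from interval Hessian matrices [3,4,5], so this is only an idealized special case (all names are our own):

import numpy as np

def alpha_underestimator(nt, hessian, xL, xU):
    """Build L(x) = nt(x) - sum_i alpha * (xU_i - x_i)(x_i - xL_i).

    A single uniform alpha = max(0, -0.5 * smallest eigenvalue) is used;
    for a constant Hessian H this guarantees H + 2*diag(alpha) >= 0."""
    alpha = max(0.0, -0.5 * np.linalg.eigvalsh(hessian).min())
    def L(x):
        x = np.asarray(x, dtype=float)
        return nt(x) - np.sum(alpha * (xU - x) * (x - xL))
    return alpha, L

# Example: nonconvex term nt(x) = x1*x2 on [0,1]^2, Hessian [[0,1],[1,0]]
alpha, L = alpha_underestimator(lambda x: x[0] * x[1],
                                np.array([[0.0, 1.0], [1.0, 0.0]]),
                                np.zeros(2), np.ones(2))
print(alpha, L([0.5, 0.5]))   # alpha = 0.5; L lies below nt over the box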

In addition to the update of continuous variable bounds, the SMIN-αBB algorithm also relies on binary variable bound updates. Through simple computations, an entire branch of the branch and bound tree may be eliminated when a binary variable is found to be restricted to 0 or 1. The bound update procedure for a given binary variable is as follows:
1) Set the variable to be updated to one of its bounds, y = yB.
2) Perform interval evaluations of all the constraints in the nonconvex MINLP, using the bounds on the solution space for the current node.
3) If any of the constraints are found infeasible, fix the variable to y = 1 − yB.
4) If both bounds have been tested, repeat this procedure for the next variable to be updated. Otherwise, try the second bound.

In [1] the GMIN-αBB algorithm, which operates within a classical branch and bound framework, was also proposed. The main difference with similar branch and bound algorithms [12,17] is its ability to identify the global optimum solution of a much larger class of problems, of the form

min_{x,y}  f(x, y)
s.t.       h(x, y) = 0
           g(x, y) ≤ 0
           x ∈ X ⊆ R^n,  y ∈ N^q.
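The classical branch and bound framework within which such algorithms operate can be sketched generically as follows; the relaxation solver, upper bounding heuristic and branching rule are problem-specific callables, so this is a sketch under our own naming, not the published implementation:

import heapq
import itertools

def branch_and_bound(root, lower_bound, upper_bound_heuristic, branch, tol=1e-6):
    """Best-first branch and bound skeleton for minimization.

    lower_bound(node)           -> value of a (convex) relaxation at the node
    upper_bound_heuristic(node) -> (value, solution) of a feasible point
    branch(node)                -> iterable of subnodes (empty when converged)
    """
    tie = itertools.count()                     # tie-breaker for the heap
    best_val, best_sol = float("inf"), None
    heap = [(lower_bound(root), next(tie), root)]
    while heap:
        lb, _, node = heapq.heappop(heap)
        if lb >= best_val - tol:                # fathom: node cannot improve
            continue
        val, sol = upper_bound_heuristic(node)
        if val < best_val:                      # new incumbent
            best_val, best_sol = val, sol
        for child in branch(node):              # subdivide the domain
            clb = lower_bound(child)
            if clb < best_val - tol:
                heapq.heappush(heap, (clb, next(tie), child))
    return best_val, best_sol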

Model Based Control for Drug Delivery Systems

When the blood glucose concentration is too high (> 120 mg/dl), the condition is known as hyperglycemia. Both hypo- and hyperglycemia can be harmful to an individual's health. Hence, it is very important to control the level of blood glucose in the body to within a reasonable range [9,12]. In the following sections, advanced model based controllers for regulating the blood glucose concentration for type 1 diabetes are presented.

The Bergman model [1] is used in this study, which presents a 'minimal' model comprising three equations to describe the dynamics of the system. The schematic representation of the model is shown in Fig. 4. The modeling equations are:

      dG/dt = −P1·G − X(G + Gb) + D(t)        (4)
      dI/dt = −n(I + Ib) + U(t)/V1            (5)
      dX/dt = −P2·X + P3·I                    (6)

The states in this model are: G, the plasma glucose concentration (mg/dl) relative to the basal value; I, the plasma insulin concentration (mU/l) relative to the basal value; and X, which is proportional to I in a remote compartment (min−1). The inputs are: D(t), the meal glucose disturbance (mg/dl/min); U(t), the manipulated insulin infusion rate (mU/min); and Gb, Ib, the nominal values of glucose and insulin concentration (81 mg/dl; 15 mU/l). The parameter values for type 1 diabetes are: P1 = 0 min−1, P2 = 0.025 min−1, P3 = 0.000013 l/(mU min2), V1 = 12 l and n = 5/54 min−1 [4]. The model (4)–(6) is linearized about the steady-state values Gb = 81 mg/dl, Ib = 15 mU/l, Xb = 0 and Ub = 16.66667 mU/min to obtain a state space model of the form x_{t+1} = A x_t + B u_t + Bd d_t, where the term d_t represents the glucose meal input disturbance. The sampling time considered is 5 minutes, which is reasonable for current glucose sensor technology.
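A minimal simulation of the nonlinear model (4)–(6) with the parameter values above can be sketched as follows; the meal disturbance and infusion profiles chosen here are illustrative assumptions only:

import numpy as np

# Bergman 'minimal' model parameters for type 1 diabetes (see text)
P1, P2, P3 = 0.0, 0.025, 0.000013
n, V1 = 5.0 / 54.0, 12.0
Gb, Ib = 81.0, 15.0

def simulate(D, U, T=300.0, dt=0.1):
    """Euler integration of (4)-(6); G, I, X are deviations from basal."""
    G = I = X = 0.0
    t, traj = 0.0, []
    while t < T:
        dG = -P1 * G - X * (G + Gb) + D(t)
        dI = -n * (I + Ib) + U(t) / V1
        dX = -P2 * X + P3 * I
        G, I, X = G + dt * dG, I + dt * dI, X + dt * dX
        traj.append((t, G + Gb))
        t += dt
    return traj

# Illustrative decaying meal disturbance with basal insulin infusion;
# at U = Ub = 16.66667 mU/min the insulin equation is at steady state.
traj = simulate(D=lambda t: 3.0 * np.exp(-0.05 * t), U=lambda t: 16.66667)
print(traj[-1])   # plasma glucose (mg/dl) at the end of the horizon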


The discrete state-space matrices A, B, C and Bd are as follows:

      A = [ 1   −0.000604    −21.1506
            0    0.6294        0
            0    0.00004875    0.8825 ]

      B = [ −0.000088
             0.3335
             0.0000112 ]

      C = [ 1  0  0 ],   Bd = [ 5
                                0
                                0 ]

The constraints imposed are 60 ≤ G + Gb ≤ 180 and 0 ≤ U + Ub ≤ 100.

Parametric Controller

Parametric programming can be used in the MPC framework to obtain U as a function of x_t by treating U as the optimization variables and x_t as parameters, as described next [3,13]. For simplicity of presentation it is assumed that Ny = Nu = Nc; the theory presented is, however, valid for the case when Ny, Nu and Nc are not equal. The equalities in formulation (3) are eliminated by making the substitution

      x_{t+k|t} = A^k x_t + Σ_{j=0}^{k−1} A^j B u_{t+k−1−j}        (7)

to obtain the following quadratic program (QP):

      min_U  (1/2) U^T H U + x_t^T F U + (1/2) x_t^T Y x_t
      s.t.   G U ≤ W + E x_t,                                      (8)

where U = [u_t^T, ..., u_{t+Nu−1}^T]^T ∈ R^s is the vector of optimization variables, s = m·Nu, H is a constant, symmetric and positive definite matrix, and H, F, Y, G, W, E are obtained from Q, R and (1) and (2). The QP problem in (8) can now be reformulated as a multi-parametric quadratic program (mp-QP):

      V_z(x) = min_z  (1/2) z^T H z
      s.t.    G z ≤ W + S x_t,                                     (9)

where z = U + H^{−1} F^T x_t, z ∈ R^s, and S = E + G H^{−1} F^T. This mp-QP is solved by treating z as the vector of optimization variables and x_t as the vector of parameters to obtain z as an explicit function of x_t. U is then obtained as an explicit function of x_t by using U = z − H^{−1} F^T x_t.

Results

A prediction horizon Ny = 5 and a Q/R ratio of 1000 are considered for deriving the control law; this results in a partitioning of the state space into 31 polyhedral regions. These regions are known as critical regions (CR). Associated with each CR is a control law that is an affine function of the state of the patient. For example, one of the CRs is given by the following state inequalities:

      −5 ≤ I ≤ 25
      0.0478972·G − 0.0002712·I − X ≤ 0.104055
      0.0261386·G − 0.0004641·I − X ≤ 0.0576751
      −0.00808846·G + 0.00119685·I + X ≤ 0                          (10)
      −0.00660123·G + 0.00130239·I + X ≤ 0
      0.00609435·G − 0.00134362·I − X ≤ 0

where the insulin infusion rate as a function of the state variables for the next five time intervals is given as follows:

      U(1) = 30.139·G − 0.44597·I − 3726.2·X
      U(2) = 24.874·G − 0.40326·I − 3280.4·X
      U(3) = 20.16·G − 0.35946·I − 2842.8·X                         (11)
      U(4) = 16.002·G − 0.31571·I − 2424.1·X
      U(5) = 0

The complete partitioning of the state space for G = 80 mg/dl into CRs is shown in Fig. 5. The performance of the parametric controller for a 50 mg meal disturbance [8] is shown in Figs. 6 and 7. The corresponding trajectory of the state variables is also shown in Fig. 5. The model based parametric controller of the form given in (10) and (11) can be stored and implemented on simple computational hardware and can therefore provide effective therapy at low on-line computational cost.
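The on-line evaluation of such a parametric controller reduces to locating the critical region that contains the current state and applying the associated affine law. A minimal sketch, storing only the region (10) with the first interval of law (11) as reconstructed above:

import numpy as np

def explicit_mpc(x, regions):
    """regions is a list of (A, b, K) where A x <= b defines a critical
    region and u = K x is its affine control law (offset-free form)."""
    for A, b, K in regions:
        if np.all(A @ x <= b + 1e-9):   # point-location test
            return K @ x
    raise ValueError("state outside the explored state space")

A = np.array([[0.0, -1.0, 0.0],          # -I <= 5
              [0.0,  1.0, 0.0],          #  I <= 25
              [0.0478972, -0.0002712, -1.0],
              [0.0261386, -0.0004641, -1.0],
              [-0.00808846, 0.00119685, 1.0],
              [-0.00660123, 0.00130239, 1.0],
              [0.00609435, -0.00134362, -1.0]])
b = np.array([5.0, 25.0, 0.104055, 0.0576751, 0.0, 0.0, 0.0])
K = np.array([30.139, -0.44597, -3726.2])

# State (G, I, X) chosen to lie inside the region above
u = explicit_mpc(np.array([2.0, 1.0, 0.011]), [(A, b, K)])
print(u)   # insulin infusion adjustment for this state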


Model Based Control for Drug Delivery Systems, Figure 5 Critical regions for type 1 diabetes


Model Based Control for Drug Delivery Systems, Figure 7 Insulin infusion vs. time

Model Based Control for Drug Delivery Systems, Figure 6 Glucose concentration vs. time

Concluding Remarks

Automation of drug delivery systems aims at reducing patient inconvenience by providing better and personalized healthcare. The automation can be achieved by developing detailed models and by deriving advanced controllers that take into account the model as well as the constraints on the state and control variables. In this chapter, a compartmental model incorporating pharmacokinetic and pharmacodynamic aspects for the delivery of anesthetic agents has been presented. This model was then used for the derivation of a model predictive controller. For type 1 diabetes, the implementation of advanced model based controllers on simple computational hardware was demonstrated by deriving the insulin delivery rate as an explicit function of the state of the patient. The developments presented in this chapter highlight the importance of modeling and control techniques for biomedical systems.

See also

▶ Nondifferentiable Optimization: Parametric Programming

References

1. Bergman RN, Phillips LS, Cobelli C (1981) Physiologic evaluation of factors controlling glucose tolerance in man. J Clin Invest 68:1456–1467
2. DCCT – The Diabetes Control and Complications Trial Research Group (1993) The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. New Engl J Med 329:977–986
3. Dua P, Doyle FJ III, Pistikopoulos EN (2006) Model based glucose control for type 1 diabetes via parametric programming. IEEE Trans Biomed Eng 53:1478–1491
4. Fisher ME (1991) A semiclosed loop algorithm for the control of blood glucose levels in diabetics. IEEE Trans Biomed Eng 38:57–61
5. Garcia CE, Prett DM, Morari M (1989) Model predictive control: theory and practice – a survey. Automatica 25:335–348
6. Gentilini A, Frei CW, Glattfelder AH, Morari M, Sieber TJ, Wymann R, Schnider TW, Zbinden AM (2001) Multitasked closed-loop control in anesthesia. IEEE Eng Med Biol 20:39–53


7. gPROMS (2003) Introductory user guide, Release 2.2. Process Systems Enterprise Ltd, London
8. Lehmann ED, Deutsch T (1992) A physiological model of glucose-insulin interaction in type 1 diabetes mellitus. J Biomed Eng 14:235–242
9. Lynch SM, Bequette BW (2002) Model predictive control of blood glucose in type I diabetics using subcutaneous glucose measurements. In: Proc. American Control Conf., Anchorage, AK, pp 4039–4043
10. Mahfouf M, Asbury AJ, Linkens DA (2003) Unconstrained and constrained generalized predictive control of depth of anaesthesia during surgery. Control Eng Pract 11:1501–1515
11. MATLAB (1998) MPC Toolbox manual. The MathWorks, Natick
12. Parker RS, Doyle FJ III, Peppas NA (2001) The intravenous route to blood glucose control. IEEE Eng Med Biol 20:65–73
13. Pistikopoulos EN, Dua V, Bozinis NA, Bemporad A, Morari M (2002) On line optimization via off-line parametric optimization tools. Comput Chem Eng 26:175–185
14. Rao RR, Palerm CC, Aufderhiede B, Bequette BW (2001) Automated regulation of hemodynamic variables. IEEE Eng Med Biol 20:24–38
15. Yasuda N, Lockhart SH, Eger EI, Weiskopf RB, Laster M, Taheri S, Peterson NA (1991) Comparison of kinetics of sevoflurane and isoflurane in humans. Anesthesia Analg 72:316–324
16. Yu C, Roy RJ, Kaufman H (1990) A circulatory model for combined nitroprusside-dopamine therapy in acute heart failure. Med Prog Technol 16:77–88
17. Zwart AN, Smith NT, Beneken JEW (1972) Multiple model approach to uptake and distribution of halothane: the use of an analog computer. Comput Biomed Res 5:228–238

Modeling Difficult Optimization Problems

JOSEF KALLRATH 1,2
1 GVC/S (Scientific Computing) - B009, BASF Aktiengesellschaft, Ludwigshafen, Germany
2 Astronomy Department, University of Florida, Gainesville, USA

MSC2000: 90C06, 90C10, 90C11, 90C30, 90C57, 90C90

Article Outline

Introduction
Models and the Art of Modeling

Tricks of the Trade for Monolithic Models

Decomposition Techniques
  Column Generation
  Column Enumeration
  Branch-and-Price
  Rolling Time Decomposition

An Exhaustion Method
  Indices and Sets
  Variables
  The Idea of the Exhaustion Method
  Computing Lower Bounds

Primal Feasible Solutions and Hybrid Methods
Summary
References

Introduction

We define difficult optimization problems as problems that cannot be solved to optimality or to any guaranteed bound by any standard solver within a reasonable time limit. The problem class we have in mind are mixed-integer programming (MIP) problems. Optimization, and especially MIP, is often appropriate and frequently used to model real-world optimization problems. Since its beginnings in the 1950s, models have become larger and more complicated. A reasonably general framework is mixed-integer nonlinear programming (MINLP) problems. They are specified by the augmented vector x_⊕^T = x^T ⊕ y^T established by the vectors x^T = (x_1, ..., x_nc) and y^T = (y_1, ..., y_nd) of nc continuous and nd discrete variables, an objective function f(x, y), ne equality constraints h(x, y), and ni inequality constraints g(x, y). The problem

      min { f(x, y) | h(x, y) = 0, g(x, y) ≥ 0, x ∈ X ⊆ R^nc, y ∈ Y ⊆ Z^nd }        (1)

is called a mixed-integer nonlinear programming (MINLP) problem if at least one of the functions f(x, y), g(x, y), or h(x, y) is nonlinear. The vector inequality g(x, y) ≥ 0 is to be read componentwise. Any vector x_⊕^T satisfying the constraints of (1) is called a feasible point of (1). Any feasible point whose objective function value is less than or equal to that of all other feasible points is called an optimal solution. From this definition it follows that the problem might not have a unique optimal solution. Depending on the functions f(x, y), g(x, y), and h(x, y) in (1) we get the following structured problems:

Acronym | Type of optimization  | f(x, y)                | h(x, y)  | g(x, y) | nd
LP      | Linear programming    | c^T x                  | Ax − b   | x       | 0
QP      | Quadratic programming | x^T Qx + c^T x         | Ax − b   | x       | 0
NLP     | Nonlinear programming |                        |          |         | 0
MILP    | Mixed-integer LP      | c^T x_⊕                | Ax_⊕ − b | x_⊕     | ≥ 1
MIQP    | Mixed-integer QP      | x_⊕^T Qx_⊕ + c^T x_⊕   | Ax_⊕ − b | x_⊕     | ≥ 1
MINLP   | Mixed-integer NLP     |                        |          |         | ≥ 1

with a matrix A of m rows and n columns, i. e., A ∈ M(m × n, R), b ∈ R^m, c ∈ R^n, and n = nc + nd. Real-world problems lead much more frequently to LP and MILP than to NLP or MINLP problems. QP refers to quadratic programming problems; they have a quadratic objective function but only linear constraints. QP and MIQP problems often occur in applications of the financial services industry. While LP problems as described in [31] or [1] can be solved relatively easily (the number of iterations, and thus the effort to solve LP problems with m constraints, grows approximately linearly in m), the computational complexity of MILP and MINLP grows exponentially with nd but depends strongly on the structure of the problem. Numerical methods to solve NLP problems work iteratively, and the computational problems are related to questions of convergence, getting stuck in bad local optima, and the availability of good initial solutions. Global optimization techniques can be applied to both NLP and MINLP problems, and their complexity increases exponentially in the number of all variables entering nonlinearly into the model.

While the word optimization, in nontechnical or colloquial language, is often used in the sense of improving, the mathematical optimization community sticks to the original meaning of the word related to finding the best value either globally or at least in a local neighborhood. For an algorithm to be considered a (mathematical, strict, or exact) optimization algorithm in the mathematical optimization community, there is consensus that such an algorithm computes feasible points proven globally (or locally) optimal for linear (nonlinear) optimization problems. Note that this is a definition of a mathematical optimization algorithm and not a statement saying that computing a local optimum is sufficient for nonlinear optimization problems. In the context of mixed-integer linear problems an optimization algorithm [12,13] is expected to compute a proven optimal solution or to generate feasible points and, for a maximization problem, to derive a reasonably tight, nontrivial upper bound. The quality of such bounds is quantified by the integrality gap – the difference between the upper and lower bound. What one considers to be a good-quality solution depends on the problem, the purpose of the model, and the accuracy of the data. A few percent, say 2 to 3%, might be acceptable for the example discussed by Kallrath (2007, Encyclopedia: Planning). However, discussions based on percentage gaps become complicated when the objective function includes penalty terms containing coefficients without a strict economic interpretation. In such cases scaling is problematic. Goal programming as discussed in ([23], p. 294) might help in such situations to avoid penalty terms in the model. The problem is first solved with respect to the highest-priority goal, then one is concerned with the next-level goal, and so on. For practical purposes it is also relevant to observe that solving mixed-integer linear problems and the problem of finding appropriate bounds is often NP-complete, which makes these problems hard to solve. A consequence of this structural property is that these problems scale badly: if the problem can be solved to optimality for a given instance, this might not be so if the size is increased slightly. While tailor-made optimization algorithms such as column generation and branch-and-price techniques can often cope with this situation for individual problems, it is very difficult for standard software.

We define difficult optimization problems as problems that cannot be solved to optimality or within a reasonable integrality gap by any standard MIP solver within a reasonable time limit. Problem structure, size, or both could lead to such behavior. However, in many cases these problems (typically, MIP or nonconvex optimization problems fall into this class) can be solved if they are individually treated, and we resort to the art of modeling. The art of modeling includes choosing the right level of detail implemented in the model. On the one hand, this needs to satisfy the expectations of the owner of the real-world problem. On the other hand, we are limited by the available computational resources. We give reasons why strict optimality or at least safe bounds are essential when dealing with real-world problems and why we do not accept methods that do not generate both upper and lower bounds. Mapping the reality also forces us to discuss whether deterministic optimization is sufficient or whether we need to resort to optimization under uncertainty. Another issue is to check whether one objective function suffices or whether multiple-criterion optimization techniques need to be applied. Instead of solving such difficult problems directly as, for example, a standalone MILP problem, we discuss how problems can be solved equivalently by solving a sequence of models. Efficient approaches are as follows:
• column generation with a master and subproblem structure,
• branch-and-price,
• exploiting a decomposition structure with a rolling time horizon,
• exploiting auxiliary problems to generate safe bounds for the original problem, which then makes the original problems more tractable,
• exhaustion approaches,
• hybrid methods, i. e., constructive heuristics and local search on subsets of the difficult discrete variables, leaving the remaining variables and constraints in tractable MILP or MINLP problems that can be solved.
We illustrate various ideas using real-world planning, scheduling, and cutting-stock problems.
Models and the Art of Modeling We are here concerned with two aspects of modeling and models. The first one is to obtain a reasonable representation of the reality and mapping it onto a mathematical model, i. e., an optimization problem in the form of (1). The second one is to reformulate the model or problem in such equivalent forms that is is numerically tractable. Models The terms modeling and model building are derived from the word model. Its etymological roots are the Latin word modellus (scale, [diminutive of modus, measure]) and what was to be in the 16th century the new word modello. Nowadays, in a scientific context the term is used to refer to a simplified, abstract, or well-structured part of the reality one is interested in. The idea itself and the associated concept is, however, much older. Classical geometry, and especially Pythagoras around 600 B.C., distinguish between wheel and circle and field and rectangle. Around A.D. 1100 a wooden model of the later Speyer cathedral was produced; the model served to build the real cathedral. Astrolabs and celestial globes have been used as models to visualize the movement of the moon, planets, and stars on the celestial sphere and to compute the times of rises and settings. Until the 19th century mechanical models were understood as pictures of reality. Following the principles of classical mechanics the key idea was to reduce all phenomena to the movement of small particles. Nowadays, in physics and other mathematical sciences one will talk about models if  For reasons of simplification, one restricts oneself to certain aspects of the problem (example: if we consider the movement of the planets, in a first approximation the planets are treated as point masses);  For reasons of didactic presentation, one develops a simplified picture for more complicated reality (example: the planetary model is used to explain the situation inside atoms);  One uses the properties in one area to study the situation in an analogous problem. A model is referred to as a mathematical model of a process or a problem if it contains typical mathematical objects (variables, terms, relations). Thus, a (mathematical) model represents a real-world problem in

Modeling Difficult Optimization Problems

the language of mathematics using mathematical symbols, variables, equations, inequalities, and other relations. It is very important when building a model to define and state precisely the purpose of the model. In science, we often encounter epistemological arguments. In engineering, a model might be used to construct some machines. In operations research and optimization, models are often used to support strategic or operative decisions. All models enable us to  Learn and understand situations that do not allow easy access (very slow or fast processes, processes involving a very small or very large region);  Avoid difficult, expensive, or dangerous experiments; and  Analyze case studies and what-if-when scenarios. Tailored optimization models can be used to support decisions (that is, the overall purpose of the model). It is essential to have a clear objective describing what a good decision is. The optimization model should produce, for instance, optimal solutions in the following sense:  To avoid unwanted byproducts as much as possible,  To minimize costs, or  to maximize profit, earnings before interest and taxes (EBIT), or contribution margin. The purpose of a model may change over time. To solve a real-world problem by mathematical optimization, at first we need to represent our problem by a mathematical model, that is, a set of mathematical relationships (e. g., equalities, inequalities, logical conditions) representing an abstraction of our real-world problem. This translation is part of the model-building phase (which is part of the whole modeling process) and is not trivial at all because there is nothing we could consider an exact model. Each model is an acceptable candidate as long as it fulfills its purpose and approximates the real world accurately enough. Usually, a model in mathematical optimization consists of four key objects:  Data, also called the constants of a model;  Variables (continuous, semicontinuous, binary, integer), also called decision variables;  Constraints (equalities, inequalities), also called restrictions; and  Objective function (sometimes even several of them).

M

The data may represent costs or demands, fixed operation conditions of a reactor, capacities of plants, and so on. The variables represent the degrees of freedom, i. e., what we want to decide: how much of a certain product is to be produced, whether a depot is closed or not, or how much material we will store in the inventory for later use. Classical optimization (calculus, variational calculus, optimal control) treats those cases in which the variables represent continuous degrees of freedom, e. g., the temperature in a chemical reactor or the amount of a product to be produced. Mixed-integer optimization involves variables restricted to integer values, for example counts (numbers of containers, ships), decisions (yes-no), or logical relations (if product A is produced, then product B also needs to be produced). The constraints can be a wide range of mathematical relationships: algebraic, analytic, differential, or integral. They may represent mass balances, quality relations, capacity limits, and so on. The objective function expresses our goal: minimize costs, maximize utilization rate, minimize waste, and so on. Mathematical models for optimization usually lead to structured problems such as:  Linear programming (LP) problems,  Mixed-integer linear programming (MILP) problems,  Quadratic (QP) and mixed-integer quadratic programming (MIQP),  Nonlinear programming (NLP) problems, and  Mixed-integer nonlinear programming (MINLP) problems. The Art of Modeling How do we get from a given problem to its mathematical representation? This is a difficult, nonunique process. It is a compromise between the degree of detail required to model a problem and the complexity, which is tractable. However, simplifications should not only be seen as an unavoidable evil. They could be useful for developing understanding or serve as a platform with the client, as the following three examples show. 1. At the beginning of the modeling process it can be useful to start with a “down-scaled” version to develop a feeling for the structure and dependencies of the model. This enable a constructive dialog between the modeler and the client. A vehicle fleet with 100 vehicles and 12 depots could be analyzed with

2287

2288

M

Modeling Difficult Optimization Problems

only 10 vehicles and 2 depots to let the model world and the real world find each other in a sequence of discussions. 2. In partial or submodels the modeler can develop a deep understanding of certain aspects of the problem which can be relevant to solve the whole problem. 3. Some aspects of the real world problem could be too complicated to model them complete or exactly. During the modeling process it can be clarified, using a smaller version, whether partial aspects of the model could be neglected or whether they are essential. In any case it is essential that the simplifications be well understood and documented.

on binary variables ı i jk . They can be replaced by the bounds ˇ ı i jk D 0 ; 8f(i; j; k) ˇA i jk < B i jk g or, if one does not trust the < in a modeling language, the bounds ˇ ı i jk D 0 ; 8f(i; j; k) ˇA i jk  B i jk  " g

 

Tricks of the Trade for Monolithic Models Using state-of-the-art commercial solvers, e. g., XPressMP [XPressMP is by Dash Optimization, http://www.dashoptimization.com] or CPLEX [CPLEX is by ILOG, http://www.ilog.com], MILP problems can be solved quite efficiently. In the case of MINLP and using global optimization techniques, the solution efficiency depends strongly on the individual problem and the model formulation. However, as stressed in [21] for both MILP and MINLP problem, it is recommended that the full mathematical structure of a problem be exploited, that appropriate reformulations of models be made, and that problem-specific valid inequalities or cuts be used. Software packages may also differ with respect to the ability of presolving techniques, default strategies for the branch-and-bound algorithm, cut generation within the branch-and-cut algorithm, and, last but not least, diagnosing and tracing infeasibilities, which is an important issue in practice. Here we collect a list of recommendation tricks that help to improve the solution procedure of monolithic MIP problems, i. e., standalone models that are solved by one call to a MILP or MINLP solver. Among them are:  Use bounds instead of constraints if the dual values are not necessarily required.  Apply one’s own presolving techniques. Consider, for instance, a set of inequalities B i jk ı i jk  A i jk ;

8fi; j; kg

(2)









where " > 0 is a small number, say, of the order of 106 . If A i jk  B i jk , then (2) is redundant. Note that, due to the fact that we have three indices, the number of inequalities can be very large. Exploit the presolving techniques embedded in the solver; cf. [28]. Exploit or eliminate symmetry: sometimes, symmetry can lead to degenerate scenarios. There are situations, for instance, in scheduling where orders can be allocated to identical production units. Another example is the capacity design problem of a set of production units to be added to a production network. In that case, symmetry can be broken by requesting that the capacities of the units be sorted in descending order, i. e., cu  cuC1 . [29] exploit symmetry in order allocation for stock cutting in the paper industry; this is a very enjoyable paper to read. Use special types of variables for which tailor-made branching rules exist (this applies to semicontinuous and partial-integer variables as well as special ordered sets). Experiment with the various strategies offered by the commercial branch-and-bound solvers for the branch-and-bound algorithm. Experiment with the cut generation within the commercial branch-and-cut algorithm, among them Gomory cuts, knapsack cuts, or flow cuts; cf. [28]. Construct one’s own valid inequalities for certain substructures of problems at hand. Those inequalities may be added a priori to a model, and in the extreme case they would describe the complete convex hull. As an example we consider the mixed-integer inequality x  C ;

0 x  X;

x 2 IRC 0 ;

 2 IN (3)

which has the valid inequality x  X  G(K  ) where   X K :D and G :D X  C (K  1) : C

(4)

Modeling Difficult Optimization Problems

This valid inequality (4) is the more useful, the more K and X/C deviate. A special case arising is often the situation  2 f0; 1g. Another example, taken from [39], p. 129 is A1 ˛1 C A2 ˛2  B C x

x 2 IRC 0

˛1 ; ˛2 2 IN (5)

which for B … IN leads to the valid inequality   x f2  f  bBcC (6) bA1 c ˛1 C bA2 c ˛2 C 1 f 1 f where the following abbreviations are used: f :D B  bBc ; f1 :D A1  bA1 c ;

f 2 :D A2  bA2 c : (7)

The dynamic counterpart of valid inequalities added a priori to a model leads to cutting-plane algorithms that avoid adding a large number of inequalities a priori to the model (note, this can be equivalent to finding the complete convex hull). Instead, only those useful in the vicinity of the optimal solution are added dynamically. For the topics of valid inequalities and cutting-plane algorithms the reader is referred to books by Nemhauser and Wolsey [30], Wolsey [39], and Pochet and Wolsey [32].  Try disaggregation in MINLP problems. Global optimization techniques are often based on convex underestimators. Univariate functions can be treated easier than multivariate terms. Therefore, it helps to represent bilinear or multilinear terms by their disaggregated equivalences. As an example we consider C x1 x2 with given lower and upper bounds X  i and X i for x i ; i D 1; 2. Wherever we encounter x1 x2 in our model we can replace it by x1 x2 D

This formulation has another advantage. It allows us to construct easily a relaxed problem which can be used to derive a useful lower bound. Imagine a problem P with the inequality x1 x2  A :

(8)

Then 2 x12  X1 x1  X2 x2  2A

(9)

is a relaxation of P as each point (x1 ; x2 ) satisfying (8) also fulfills (9). Note that an alternative disaggregation avoiding an additional variable is given by x1 x2 D

1 4

  (x1 C x2 )2  (x1  x2 )2 :

However, all of the creative attempts listed above may not suffice to solve the MIP using one monolithic model. That is when we should start looking at solving the problem by a sequence of problems. We have to keep in mind that to solve a MIP problem we need to derive tight lower and upper bounds with the gap between them approaching zero. Decomposition Techniques Decomposition techniques decompose a problem into a set of smaller problems that can be solved in sequence or in any combination. Ideally, the approach can still compute the global optimum. There are standardized techniques such as Benders Decomposition [cf. Floudas ([9], Chap. 6). But often one should exploit the structure of an optimization to construct tailor-made decompositions. This is outlined in the following subsections. Column Generation

1 2 (x  x12  x22 ) 2 12

and x12 D x1 C x2 : The auxiliary variable is subject to the bounds  :D X1 C X2 and X12 C  X12  x12  X12 ;  :D X1 C X2 ; X12

M

C X12 :D X1C C X2C :

In linear programming parlance, the term column usually refers to variables. In the context of columngeneration techniques it has wider meaning and stands for any kind of objects involved in an optimization problem. In vehicle routing problems a column might, for instance, represent a subset of orders assigned to a vehicle. In network flow problems a column might represent a feasible path through the network. Finally, in cutting-stock problems [10,11] a column represents a pattern to be cut.

2289

2290

M

Modeling Difficult Optimization Problems

The basic idea of column generation is to decompose a given problem into a master and subproblem. Problems that might otherwise be nonlinear can be completely solved by solving only linear problems. The critical issue is to generate master and subproblems that can both be solved efficiently. One of the most famous examples is the elegant column-generation approach of Gilmore and Gomory [10] for computing the minimal number of rolls to satisfy a requested demand for smaller sized rolls. This problem, if formulated as one monolithic problem, leads to a MINLP problem with a large number of integer variables. In simple cases, such as those described by Schrage ([35], Sect. 11.7), it is possible to generate all columns explicitly, even within a modeling language. Often the decomposition has a natural interpretation. If not all columns can be generated, the columns are added dynamically to the problem. Barnhart et al. [2] give a good overview on such techniques. A more recent review focusing on selected topics of column generation is [25]. In the context of vehicle routing problems, feasible tours contain additional columns as needed by solving a shortestpath problem with time windows and capacity constraints using dynamic programming [7]. More generally, column-generation techniques are used to solve well-structured MILP problems involving a huge number, say, several hundred thousand or millions, of variables, i. e., columns. Such problems lead to large LP problems if the integrality constraints of the integer variables are relaxed. If the LP problem contains so many variables (columns) that it cannot be solved with a direct LP solver (revised simplex, interior point method), one starts solving this so-called master problem with a small subset of variables yielding the restricted master problem. After the restricted master problem has been solved, a pricing problem is solved to identify new variables. This step corresponds to the identification of a nonbasic variable to be taken into the basis of the simplex algorithm and the term column generation. The restricted master problem is solved with the new number of variables. The method terminates when the pricing problems cannot identify any new variables. The simplest version of column generation is found in the Dantzig–Wolfe decomposition [6]. Gilmore and Gomory [10,11] were the first to generalize the idea of dynamic column generation to an integer programming (IP) problem: the cutting-stock

problem. In this case, the pricing problem, i. e., the subproblem, is an IP problem itself – and one refers to this as a column-generationalgorithm. This problem is special as the columns generated when solving the relaxed master problem are sufficient to get the optimal integer feasible solution of the overall problem. In general this is not so. If not only the subproblem, but also the master problem involves integer variables, then the columngeneration part is embedded into a branch-and-bound method; this is called branch-and-price. Thus, branchand-price is integer programming with column generation. Note that during the branching process new columns are generated; therefore the name branch-andprice. Column Generation in cutting-stock Problems This section describes the mathematical model for minimizing the number of roles or trimloss and illustrates the idea of column generation. Indices

used in this model:

p 2 P :D fp1 ; : : : ; p N P g for cutting patterns (formats). Either the patterns are directly generated according to a complete enumeration or they are generated by column generation. i 2 I :D fi1 ; : : : ; i N I g given orders or widths. Input Data here:

We arrange the relevant input data size

B [L] width of the rolls (raw material roles) D i [-] number of orders for the width i Wi [L] width of order type i Integer Variables

used in the different model variants:

 p 2 IN0 :D f0; 1; 2; 3; : : :g [] indicates how often pattern p is used. If cutting pattern p is not used, then we have  p D 0. ˛ i p 2 IN0 [] indicates how often width i is contained in pattern p. This variable can take values between 0 and D i depending on the order situation.

Modeling Difficult Optimization Problems

Model

The model contains a suitable object function

min f (˛ i p ;  p ) ; as well as the boundary condition (fulfillment of the demand) X ˛ i p  p D D i ; 8i (10)

M

(12) and ˛ i is an integer variable specifying how often width i occurs in the new pattern. We add the knapsack constraint with respect to the width of the rolls X Wi ˛ i  B ; 8i (14) i

and the integrality constraints

p

˛ i 2 IN0 ;

and the integrality constraints ˛ i p 2 IN0 ;

8fi pg ;

 p 2 IN0 ;

8fpg :

(11)

General Structure of the Problem In this form it is a mixed-integer nonlinear optimization problem (MINLP). This problem class is difficult in itself. More serious is the fact that we may easily encounter several million variables ˛ i p . Therefore the problem cannot be solved in this form. Solution Method The idea of dynamic column generation is based on the fact that one must decide in a master problem for a predefined set of patterns how often every pattern must be used as well as calculate suitable input data for a subproblem. In this subproblem new patterns are calculated. The master problem solves for the multiplicities of existing patterns and has the shape X p ; min p

with the demand-fulfill inequality (note that it is allowed to produce more than requested) X N i p  p  D i ; 8i (12) i

and the integrality constraints  p 2 IN0 ;

8fpg :

(13)

The subproblem generates new patterns. Structurally it is a knapsack problem with object function X Pi ˛ i ; min 1  ˛i

p

where Pi are the dual values (pricing information) of the master problem (pricing problem) associated with

8fig :

(15)

In some cases, ˛ i could be additionally bounded by the number, K, of knives. Implementation Issues The critical issues in this method, in which we alternate in solving the master problem and the subproblem, are the initialization of the procedure (a feasible starting point is to have one requested width in each initial pattern, but this is not necessarily a good one), excluding the generation of the existing pattern by applying integer cuts, and the termination. Column Enumeration Column enumeration is a special variant of column generation and is applicable when a small number of columns is sufficient. This is, for instance, the case in real-world cutting-stock problems when it is known that the optimal solution has only a small amount of trimloss. This usually eliminates most of the pattern. Column enumeration naturally leads to a type of selecting columns or partitioning models. A collection of illustrative examples contained in ([35], Sect. 11.7) covers several problems of grouping, matching, covering, partitioning, and packing in which a set of given objects has to be grouped into subsets to maximize or minimize some objective function. Despite the limitations with respect to the number of columns, column enumeration has some advantages:  No pricing problem,  Easily applied to MIP problems,  Column enumeration is much easier to implement. In the online version of the vehicle routing problem described in [22] it is possible to generate the complete set, Cr , of all columns, i. e., subsets of orders i 2 O ; r D jOj, assigned to a fleet of n vehicles, v 2 V . Let Cr be the union of the sets, Crv , i. e., Cr D [vD1:::n Crv with C r D jCr j D 2r n, where Crv

2291

2292

M

Modeling Difficult Optimization Problems

contains the subsets of orders assigned to vehicle v. Note that Crv contains all subsets containing 1, 2, or r orders assigned to vehicle v. The relevant steps of the algorithm are: 1. Explicitly generate all columns Crv , followed by a simple feasibility test w.r.t. the availability of the cars. 2. Solve the routing-scheduling problem for all columns Crv using a tailor-made branch-and-bound approach (the optimal objective function values, Z(c ) or Z(cv ), respectively, and the associated routing-scheduling plan are stored). 3. Solve the partitioning model: V

min

C rv X N X

 cv

Z(cv )cv ;

(16)

cD1 vD1

and the inequality C rv X

cv  1 ;

8v 2 V ;

cD1

where V  V is a subset of the set V of all vehicles. Alternatively, if it is not prespecified which vehicles should be used but it is only required that not more than NV vehicles be used, then the inequality V

Cr N X X

cv  NV

(20)

cD1 vD1jv2V

is imposed. 4. Reconstruct the complete solution and extract the complete solution from the stored optimal solutions for the individual columns.

s.t. Branch-and-Price

V

Cr X N X

I i (cv )cv D 1 ;

8i D 1; : : : ; r

(17)

cD1 vD1

ensures that each order is contained exactly once, the inequality Cr X

cv  1 ;

8v 2 V ;

(18)

cD1

ensuring that at most one column can exist for each vehicle, and the integrality conditions cv 2 f0; 1g ;

8c D 1; : : : ; Cr :

(19)

Note that not all combinations of index pairs fc; vg exist; each c corresponds to exactly one v, and vice versa. This formulation allows us to find optimal solutions with the defined columns for a smaller number of vehicles. The objective function and the partitioning constraints are just modified by substituting V

V

N X

!

vD1jv2V

N X

;

vD1jv2V

the equations C rv X

V

N X

cD1 vD1jv2V

I i (cv )cv D 1 ;

8i D 1; : : : ; r ;

Branch-and-price (often coupled with branch-and-cut) refers to a tailor-made algorithm exploiting the decomposition structure of the problem to be solved. This efficient method for solving MIP problems with column generation has been well described by Barnhart et al. [2] and has been covered by Savelsbergh [34] in the first edition of the Encyclopedia of Optimization. Here, we give a list of more recent successful applications in various fields.  Cutting stock: [3,38]  Engine routing and industrial in-plant railroads: [26]  Network design: [16]  Lot sizing: [38]  Scheduling (staff planning): [8]  Scheduling of switching engines: [24]  Supply chain optimization (pulp industry): [5]  Vehicle routing: [7,15] Rolling Time Decomposition The overall methodology for solving the mediumrange production scheduling problem is to decompose the large and complex problem into smaller short-term scheduling subproblems in successive time horizons, i. e., we decompose according to time. Large-scale industrial problems have been solved by Janak et al. [18,19]. A decomposition model is formulated and solved to determine the current horizon and

Modeling Difficult Optimization Problems

corresponding products that should be included in the current subproblem. According to the solution of the decomposition model, a short-term scheduling model is formulated using the information on customer orders, inventory levels, and processing recipes. The resulting MILP problem is a large-scale complex problem that requires a large computational effort for its solution. When a satisfactory solution is determined, the relevant data are output and the next time horizon is considered. The above procedure is applied iteratively in an automatic fashion until the whole scheduling period under consideration is finished. Note that the decomposition model determines automatically how many days and products to consider in the small scheduling horizon subject to an upper limit on the complexity of the resulting mathematical model. An Exhaustion Method This method combines aspects of a constructive heuristics and of exact model solving. We illustrate the exhausting method by the cutting-stock problem described in Sect. “Column Generation in cutting-stock Problems”; assigning orders in a scheduling problem would be another example. The elegant column generation approach by Gilmore and Gomory [10] is known for producing minimal trimloss solutions with many patterns. Often this corresponds to setup changes on the machine and therefore is not desirable. A solution with a minimal number of patterns minimizes the machine setup costs of the cutter. Minimizing simultaneously trimloss and the number of patterns is possible for a small case of a few orders only exploiting the MILP model by Johnston and Salinlija [20]. It contains two conflicting objective functions. Therefore one could resort to goal programming. Alternatively, we could produce several parameterized solutions leading to different numbers of rolls to be used and patterns to be cut from which the user would extract the one he likes best. As the table above indicates, we compute tight lower bounds on both trimloss and the number of patterns. Even for up to 50 feasible orders, near-optimal solutions are constructed in less than a minute. Note that it would be possible to use the branchand-price algorithm described in [38] or [3] to solve the one-dimensional cutting-stock problem with minimal numbers of patterns. However, these methods are

M

not easy to implement. Therefore, we use the following approaches, which are much easier to program:  V1: Direct usage of the model by Johnston and Salinlija [20] for a small number, say, N I  14, of orders and Dmax  10. In a preprocessing step we compute valid inequalities as well as tight lower and upper bounds on the variables.  V2: Exhaustion procedure in which we generate successively new patterns with maximal multiplicities. This method is parameterized by the permissible percentage waste Wmax , 1  Wmax  99. After a few patterns have been generated with this parameterization, it could happen that is is not possible to generate any more patterns with waste restriction. In this case the remaining unsatisfied orders are generated by V1 without the Wmax restriction. Indices and Sets In this model we use the indices listed in Johnston and Salinlija [20]: i 2 I :D fi1 ; : : : ; i N I g denotes the sets of width. j 2 | :D f j1 ; : : : ; j N P g denotes the pattern; N J  N I . The patterns are generated by V1, or dynamically by maximizing the multiplicities of a used pattern. k 2 K :D fk1 ; : : : ; k N P g denotes the multiplicity index to indicate how often a width is used in a pattern. The multiplicity index can be restricted by the ratio of the width of the orders and the width of the given rolls. Variables The following integer or binary variables are used: a i jk 2 IN [] specifies the multiplicity of pattern j. The multiplicity can vary between 0 and Dmax :D maxfD i g. If pattern j is not used, we have r j D p j D 0. p j 2 f0; 1g [] indicates whether pattern j is used at all. r j 2 IN [] specifies how often pattern j is used. The multiplicity can vary between 0 and Dmax :D maxfD i g. If pattern j is not used, we have r j D p j D 0. ˛ i p 2 IN [] specifies how often width i occurs in pattern p.

2293

2294

M

Modeling Difficult Optimization Problems

# of # output file flag Wmax comment rolls pat -------------------------------------------------------------------0 5 8 99 lower bound: minimal # of patterns 30 10 pat00.out 9 99 lower bound: minimal # of rolls 34 7 pat01.out 0 20 31 9 pat02.out 1 15 30 8 pat03.out 0 10 minimal number of rolls 32 9 pat04.out 1 8 30 8 pat05.out 0 6 minimal number of rolls 31 8 pat06.out 1 4 The best solution found contains 7 patterns! The solution with minimal trimloss contain 30 rolls! Improvement in the lower bound of pattern: 6! Solutions with 6 patterns are minimal w.r.t. to the number of patterns. A new solution was found with only 6 patterns and 36 rolls: patnew.out 36 6 patnew.out 0 99

This width-multiplicity variable can take all values between 0 and D i . x i jk 2 f0; 1g [] indicates whether width i appears in pattern j at level k. Note that x i jk D 0 implies a i jk D 0.

where u specifies the maximal number of patterns ( u could be taken from the solution of the columngeneration approach, for instance), or minimizing the number of patterns generated min

max

u X jD1

rj ;

pj :

jD1

The Idea of the Exhaustion Method In each iteration we generate m at most two or three new patterns by maximizing the multiplicities of these patterns, allowing no more than a maximum waste, Wmax . The solution generated in iteration m is preserved in iteration m C 1 by fixing the appropriate variables. If the problem turns out to be infeasible (this may happen if Wmax turns out to be restrictive), then we switch to a model variant in which we minimize the number of patterns subject to satisfying the remaining unsatisfied orders. The model is based on the inequalities (2,3,5,6,7,8,9) in [20], but we add a few more additional ones or modify the existing ones. We exploit two objective functions: maximizing the multiplicities of the patterns generated

u X

The model is completed by the integrality conditions r j ; a i jk 2 f0; 1; 2; 3; : : :g

(21)

p j ; x i jk ; y jk 2 f0; 1g :

(22)

˜ i, The model is applied several times with a i jk  D ˜ i is the number of remaining orders of width where D i. In particular, the model has to fulfill the relationships ˜i ka i jk > D

H)

a i jk D 0

;

x i jk D 0

and a i jk 

˜  Di k

or

a i jk 

 ˜ Di C Si ; k

where S i denotes the permissible overproduction. The constructive method described so far provides an improved upper bound, u0 , on the number of pattern.

Modeling Difficult Optimization Problems

Computing Lower Bounds
To compute a lower bound we apply two methods. The first method is to solve a bin-packing problem, which is equivalent to minimizing the number of rolls in the original cutting-stock problem described in the Sect. “Column Generation in Cutting-Stock Problems” for equal demands $D_i = 1$. If solved with the column-generation approach, this method is fast and cheap, but the lower bound, $l_0$, is often weak. The second method is to exploit the upper bound, $u_0$, on the number of patterns obtained and to call the exact model as in V1. It is impressive how quickly the commercial solvers CPLEX and XpressMP improve the lower bound, yielding $l'$. For most examples with up to 50 orders we obtain $u_0 - l_0 \le 2$, but in many cases $u_0 - l_0 = 1$ or even $u_0 = l'$.

Primal Feasible Solutions and Hybrid Methods
We define hybrid methods as methods based on any combination of exact MIP methods with constructive heuristics, local search, metaheuristics, or constraint programming that produces primal feasible solutions. Dive-and-fix, near-integer-fix, and relax-and-fix are such hybrid methods. They are user-developed heuristics exploiting the problem structure. In their kernel they use a declarative model solved, for instance, by CPLEX or XpressMP.

In constructive heuristics we exploit the structure of the problem and compute a feasible point. Once we have a feasible point, we can derive safe bounds on the optimum and assign initial values to the critical discrete variables, which can be exploited by the GAMS/CPLEX mipstart option. Feasible points can sometimes be generated by appropriate sequences of relaxed models. For instance, in a scheduling problem P with due times one might relax these due times, obtaining the relaxed model R. The optimal solution, or even any feasible point, of R is a feasible point of P if the due times are modeled with appropriate unbounded slack variables.

Constructive heuristics can also be established by systematic approaches of fixing critical discrete variables. Such approaches are dive-and-fix and relax-and-fix. In dive-and-fix the LP relaxation of an integer problem is solved, followed by fixing a subset of fractional variables to suitable bounds. Near-integer-fix is a variant of dive-and-fix that fixes variables with fractional values to the nearest integer point. Note that these heuristics are subject to the risk of becoming infeasible. Infeasibility is less likely in relax-and-fix.

In relax-and-fix, following Pochet and Wolsey ([32], p. 109), we suppose that the binary variables $\delta$ of a MIP problem P can be partitioned into R disjoint sets $S^1, \ldots, S^R$ of decreasing importance. In addition, subsets $U^r \subseteq \bigcup_{u=r+1}^{R} S^u$ for $r = 1, \ldots, R-1$ can be chosen to allow for somewhat more generality. Based on these partitions, R MIP problems, denoted $P^r$ with $1 \le r \le R$, are solved to find a heuristic solution to P. For instance, in a production planning problem, $S^1$ might be all the $\delta$ variables associated with time periods in $\{1, \ldots, t_1\}$, $S^u$ those associated with periods in $\{t_u + 1, \ldots, t_{u+1}\}$, whereas $U^r$ would be the $\delta$ variables associated with the periods in some set $\{t_r + 1, \ldots, u_r\}$. In the first problem, $P^1$, one imposes the integrality of only the important variables in $S^1 \cup U^1$ and relaxes the integrality of all the other variables in S. As $P^1$ is a relaxation of P, for a minimization problem the solution of $P^1$ provides a lower bound on P. The solution values, $\delta^1$, of the discrete variables are kept fixed in the subsequent problems. This continues, and in the subsequent $P^r$, for $2 \le r \le R$, we additionally fix the values of the $\delta$ variables with index in $S^{r-1}$ at their optimal values from $P^{r-1}$ and add the integrality restriction for the variables in $S^r \cup U^r$. Either $P^r$ is infeasible for some $r \in \{1, \ldots, R\}$, and the heuristic has failed, or else $(x^R, \delta^R)$ is a relax-and-fix solution. To avoid infeasibilities one might apply a smoothed form of this heuristic that allows for some overlap of $U^{r-1}$ and $U^r$. Additional free binary variables in horizon $r-1$ allow one to link the current horizon r with the previous one. Usually this suffices to ensure feasibility. Relax-and-fix comes in various flavors exploiting time-decomposition or time-partitioning structures. Other decompositions, for instance by plants, products, or customers, are possible as well.

A local search can be used to improve the solution obtained by the relax-and-fix heuristic. The main idea is to solve repeatedly the subproblem on a small number of binary variables, reoptimizing, for instance, the production of some products. The binary variables for resolving could be chosen randomly or by a metaheuristic such as simulated annealing. All binary variables related to them are released; the others are fixed to the previous best values. A sketch of the relax-and-fix scheme follows.
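The control flow of relax-and-fix is simple enough to sketch in a few lines. In the sketch below the solver is abstracted away: solve_subproblem is an assumed callable standing in for a declarative model solved by, e.g., CPLEX or XpressMP; the names are illustrative, not part of any solver API.

def relax_and_fix(partition, solve_subproblem, overlap=None):
    """Heuristically solve a minimization MIP by relax-and-fix.

    partition        -- disjoint sets S^1..S^R of binary variable names,
                        ordered by decreasing importance
    solve_subproblem -- callable (integral, fixed) -> (feasible, values);
                        it must solve the MIP with integrality imposed
                        only on `integral`, the variables in `fixed`
                        pinned to their values, and all other binaries
                        relaxed to [0, 1]
    overlap          -- optional extra sets U^1..U^{R-1}
    """
    R, fixed, values = len(partition), {}, None
    for r in range(R):
        integral = set(partition[r])
        if overlap is not None and r < R - 1:
            integral |= set(overlap[r])          # impose U^r as well
        feasible, values = solve_subproblem(integral, dict(fixed))
        if not feasible:
            return None                          # the heuristic has failed
        for v in partition[r]:                   # keep delta^r fixed from now on
            fixed[v] = values[v]
    return values                                # the relax-and-fix solution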


Another class of MIP hybrid methods is established by algorithms that combine a MIP solver with another algorithmic method. A hybrid method obtained by the combination of mixed-integer and constraint logic programming strategies has been developed and applied by Harjunkoski et al. [14] as well as Jain and Grossmann [17] for solving scheduling and combinatorial optimization problems. Timpe [37] solved mixed planning and scheduling problems with mixed MILP branch and bound and constraint programming. Maravelias and Grossmann [27] proposed a hybrid/decomposition algorithm for the short-term scheduling of batch plants, and Roe et al. [33] presented a hybrid MILP/CLP algorithm for multipurpose batch process scheduling in which MILP is used to solve an aggregated planning problem while CP is used to solve a sequencing problem. Other hybrid algorithms combine evolutionary and mathematical programming methods; see, for instance, the heuristics by Till et al. [36] for stochastic scheduling problems and by Borisovsky et al. [4] for supply management problems.

Finally, one should not forget to add an algorithmic component that, for the minimization problem at hand, generates reasonable bounds to complement the hybrid method. The hybrid methods discussed above provide upper bounds by constructing feasible points. In favorable cases, the MIP part of the hybrid solver provides lower bounds. In other cases, lower bounds can be derived from auxiliary problems, which are relaxations of the original problem and which are easier to solve.

Summary
If a given MIP problem cannot be solved by an available MIP solver exploiting all its internal presolving techniques, one might reformulate the problem and obtain an equivalent or closely related representation of reality. Another approach is to construct MIP solutions and bounds by solving a sequence of models. Alternatively, individual tailor-made exact decomposition techniques could help, as could primal heuristics such as relax-and-fix or local search techniques on top of a MIP model.

References
1. Anstreicher KM (2001) Linear Programming: Interior Point Methods. In: Floudas CA, Pardalos P (eds) Encyclopedia of Optimization, vol 3. Kluwer, Dordrecht, pp 189–191
2. Barnhart C, Johnson EL, Nemhauser GL, Savelsbergh MWP, Vance PH (1998) Branch-and-price: column generation for solving huge integer programs. Oper Res 46(3):316–329
3. Belov G, Scheithauer G (2006) A Branch-and-Price Algorithm for One-Dimensional Stock Cutting and Two-Dimensional Two-Stage Cutting. Eur J Oper Res 171:85–106
4. Borisovsky P, Dolgui A, Eremeev A (2006) Genetic Algorithms for Supply Management Problem with Lower-Bounded Demands. In: Dolgui A, Morel G, Pereira C (eds) Information Control Problems in Manufacturing 2006: A Proceedings Volume from the 12th IFAC International Symposium, vol 3, St Etienne, France, 17–19 May 2006. North-Holland, Dordrecht, pp 521–526
5. Bredström D, Lundgren JT, Rönnqvist M, Carlsson D, Mason A (2004) Supply Chain Optimization in the Pulp Mill Industry – IP Models, Column Generation and Novel Constraint Branches. Eur J Oper Res 156:2–22
6. Dantzig GB, Wolfe P (1960) The decomposition algorithm for linear programming. Oper Res 8:101–111
7. Desrochers M, Desrosiers J, Solomon MM (1992) A New Optimization Algorithm for the Vehicle Routing Problem with Time Windows. Oper Res 40(2):342–354
8. Eveborn P, Rönnqvist M (2004) Scheduler – A System for Staff Planning. Ann Oper Res 128:21–45
9. Floudas CA (1995) Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford University Press, Oxford
10. Gilmore PC, Gomory RE (1961) A Linear Programming Approach to the Cutting Stock Problem. Oper Res 9:849–859
11. Gilmore PC, Gomory RE (1963) A Linear Programming Approach to the Cutting Stock Problem, Part II. Oper Res 11:863–888
12. Grötschel M (2004) Mathematische Optimierung im industriellen Einsatz. Lecture at Siemens AG, Munich, Germany, 7 Dec 2004
13. Grötschel M (2005) Private communication
14. Harjunkoski I, Jain V, Grossmann IE (2000) Hybrid Mixed-Integer Constraint Logic Programming Strategies for Solving Scheduling and Combinatorial Optimization Problems. Comput Chem Eng 24:337–343
15. Irnich S (2000) A Multi-Depot Pickup and Delivery Problem with a Single Hub and Heterogeneous Vehicles. Eur J Oper Res 122:310–328
16. Irnich S (2002) Netzwerk-Design für zweistufige Transportsysteme und ein Branch-and-Price-Verfahren für das gemischte Direkt- und Hubflugproblem. Dissertation, Fakultät für Wirtschaftswissenschaften, RWTH Aachen, Aachen, Germany


17. Jain V, Grossmann IE (2001) Algorithms for hybrid MILP/CP models for a class of optimization problems. INFORMS J Comput 13:258–276
18. Janak SL, Floudas CA, Kallrath J, Vormbrock N (2006) Production Scheduling of a Large-Scale Industrial Batch Plant: I. Short-Term and Medium-Term Scheduling. Ind Eng Chem Res 45:8234–8252
19. Janak SL, Floudas CA, Kallrath J, Vormbrock N (2006) Production Scheduling of a Large-Scale Industrial Batch Plant: II. Reactive Scheduling. Ind Eng Chem Res 45:8253–8269
20. Johnston RE, Sadinlija E (2004) A New Model for Complete Solutions to One-Dimensional Cutting Stock Problems. Eur J Oper Res 153:176–183
21. Kallrath J (2000) Mixed Integer Optimization in the Chemical Process Industry: Experience, Potential and Future Perspectives. Chem Eng Res Des 78(6):809–822
22. Kallrath J (2004) Online Storage Systems and Transportation Problems with Applications: Optimization Models and Mathematical Solutions. Applied Optimization, vol 91. Kluwer, Dordrecht
23. Kallrath J, Maindl TI (2006) Real Optimization with SAP-APO. Springer, Berlin
24. Lübbecke M, Zimmermann U (2003) Computer aided scheduling of switching engines. In: Jäger W, Krebs HJ (eds) Mathematics – Key Technology for the Future: Joint Projects Between Universities and Industry. Springer, Berlin, pp 690–702
25. Lübbecke ME, Desrosiers J (2005) Selected topics in column generation. Oper Res 53(6):1007–1023
26. Lübbecke ME, Zimmermann UT (2003) Engine routing and scheduling at industrial in-plant railroads. Transp Sci 37(2):183–197
27. Maravelias CT, Grossmann IE (2004) A Hybrid MILP/CP Decomposition Approach for the Continuous Time Scheduling of Multipurpose Batch Plants. Comput Chem Eng 28:1921–1949
28. Martin A (2001) General Mixed Integer Programming: Computational Issues for Branch-and-Cut Algorithms. In: Naddef D, Juenger M (eds) Computational Combinatorial Optimization. Springer, Berlin, pp 1–25
29. Menon S, Schrage L (2002) Order Allocation for Stock Cutting in the Paper Industry. Oper Res 50(2):324–332
30. Nemhauser GL, Wolsey LA (1988) Integer and Combinatorial Optimization. Wiley, New York
31. Pardalos PM (2001) Linear Programming. In: Floudas CA, Pardalos P (eds) Encyclopedia of Optimization, vol 3. Kluwer, Dordrecht, pp 186–188
32. Pochet Y, Wolsey LA (2006) Production Planning by Mixed Integer Programming. Springer, Berlin
33. Roe B, Papageorgiou LG, Shah N (2005) A hybrid MILP/CLP algorithm for multipurpose batch process scheduling. Comput Chem Eng 29:1277–1291
34. Savelsbergh MWP (2001) Branch-and-Price: Integer Programming with Column Generation. In: Floudas CA, Pardalos P (eds) Encyclopedia of Optimization. Kluwer, Dordrecht, pp 218–221
35. Schrage L (2006) Optimization Modeling with LINGO. LINDO Systems, Chicago
36. Till J, Sand G, Engell S, Emmerich M, Schönemann L (2005) A New Hybrid Algorithm for Solving Two-Stage Stochastic Problems by Combining Evolutionary and Mathematical Programming Methods. In: Puigjaner L, Espuña A (eds) Proc European Symposium on Computer Aided Process Engineering (ESCAPE) – 15. North-Holland, Dordrecht, pp 187–192
37. Timpe C (2002) Solving mixed planning and scheduling problems with mixed branch and bound and constraint programming. OR Spectrum 24:431–448
38. Vanderbeck F (2000) Exact algorithm for minimising the number of setups in the one-dimensional cutting stock problem. Oper Res 48(5):915–926
39. Wolsey LA (1998) Integer Programming. Wiley, New York

Modeling Languages in Optimization: A New Paradigm
TONY HÜRLIMANN
Institute of Informatics, University of Fribourg, Fribourg, Switzerland

MSC2000: 90C10, 90C30

Article Outline
Keywords
Why Declarative Representation
Algebraic Modeling Languages
Second Generation Modeling Languages
Modeling Language and Constraint Logic Programming
Modeling Examples
  Sorting
  The n-Queens Problem
  A Two-Person Game
  Equal Circles in a Square
  The (Fractional) Cutting-Stock Problem

Conclusion
See also
References

Keywords
Algorithmic language; Declarative language; Modeling language; Solver


In this paper, modeling languages are identified as a new computer language paradigm, and their application to representing optimization problems is illustrated by examples.

Programming languages can be classified into three paradigms: imperative, functional, and logic programming [14]. The imperative programming paradigm is closely related to the way the (von Neumann) computer physically works: given a set of memory locations, a program is a sequence of well-defined instructions for retrieving, storing, and transforming the content of these locations. The functional paradigm of computation is based on the evaluation of functions. Every program can be viewed as a function which translates an input into a unique output. Functions are first-class values, that is, they must be viewed as values themselves. The computational model is based on the λ-calculus invented by A. Church (1936) as a mathematical formalism for expressing the concept of a computation. The paradigm of logic programming is based on the insight that a computation can be viewed as a kind of (constructive) proof. Hence, a program is a notation for writing logical statements together with specified algorithms for implementing inference rules.

All three programming paradigms concentrate on representing a problem as a computation, that is, the problem is stated in a way that describes the process of solving it. The computation of how to solve a problem ‘is’ its representation. One may call such a notational system an algorithmic language.

Definition 1 An algorithmic language describes (explicitly or implicitly) the computation of solving a problem, that is, ‘how’ a problem can be processed using a machine. The computation consists of a sequence of well-defined instructions which can be executed in finite time by a Turing machine. The information about a problem which is captured by an algorithmic language is called algorithmic knowledge of the problem.

Algorithmic knowledge for describing a problem is so common in our everyday life – one only needs to look at cookery books or technical maintenance manuals – that one may ask whether the human brain is ‘predisposed’ to present a problem preferably by describing its solution recipe.

However, there exists at least one different way to capture knowledge about a problem: the method which describes ‘what’ the problem is by defining its properties, rather than saying ‘how’ to solve it. Mathematically, this can be expressed by a set $\{x \in X : R(x)\}$, where X is a continuous or discrete state space and R(x) is a Boolean relation defining the properties or the constraints of the problem; x is called the variable(s). A notational system that represents a problem in this way is called a declarative language.

Definition 2 A declarative language describes the problem as a set using mathematical variables and constraints defined over a given state space. This space can be finite or infinite, countable or uncountable. The information about a problem which is captured by a declarative language is called declarative knowledge of the problem.

The declarative representation, in general, does not give any indication of how to solve the problem. It only states what the problem is. Of course, there exists a trivial algorithm to solve a declaratively stated problem, which is to enumerate the state space and to check whether a given $x \in X$ violates the constraint R(x). The algorithm breaks down, however, whenever the state space is infinite. But even if the state space is finite, it is – for most nontrivial problems – so large that a full enumeration is practically impossible.

Algorithmic and declarative representations are two fundamentally different kinds of modeling and representing knowledge. Declarative knowledge answers the question ‘what is?’, whereas algorithmic knowledge answers ‘how to?’ [4]. An algorithm gives an exact recipe for how to solve a problem. A mathematical model, i.e. its declarative representation, on the other hand, (only) defines the problem as a subspace of the state space. No algorithm is given to find all or a single element of the feasible subspace.

Why Declarative Representation
The question arises, therefore, why present a problem in a declarative way, since one must solve it anyway and, hence, represent it as an algorithm? The reasons are, first of all, conciseness, insight, and documentation. Many problems can be represented declaratively in a very concise way, while the representation of their computation is long and complex. Concise formulations also favor insight into a problem.


Furthermore, in many scientific papers a problem is stated in a declarative way, using mathematical equations and inequalities, for documentation purposes. This gives a clear statement of the problem and is an efficient way to communicate it to other scientists. However, documentation is by no means limited to human beings. One can imagine declarative languages implemented on a computer like algorithmic languages, which are parsed and interpreted by a compiler. In this way, an interpretative system can analyse the structure of a declarative program, pretty-print it on a printer or a screen, classify it, or symbolically transform it in order to view it as a diagram or in another textual form.

Of course, the most interesting question is whether the declarative way of representing a problem could be of any help in solving the problem. Indeed, for certain classes of problems the computation can be obtained directly from a declarative formulation. This is true for all recursive definitions. A classical example is the algorithm of Euclid to find the greatest common divisor (gcd) of two integers. One can prove that
$$\gcd(a, b) = \begin{cases} \gcd(b,\ a \bmod b), & b > 0 \\ a, & b = 0, \end{cases}$$
which is clearly a declarative statement of the problem. In Scheme, a functional language, this formula can be implemented directly as a function in the following way:

(define (gcd a b)
  (if (= b 0)
      a
      (gcd b (remainder a b))))

Similar formulations can be given in any other language which includes recursion as a basic control structure. This class of problems is surprisingly rich. The whole paradigm of dynamic programming can be subsumed under this class.

A class of problems of a very different kind are linear programs, which can be represented declaratively in the following way:
$$\{\min\, cx : Ax \le b\} \,.$$
From this formulation – in contrast to the class of recursive definitions – nothing can be deduced that would be useful in solving the problem.


However, there exist well-known methods, for example the simplex method, which solve almost all instances in a very efficient way. Hence, to make the declarative formulation of a linear program useful for solving it, one only needs to translate it into a form the simplex algorithm accepts as input. The translation from the declarative formulation $\{\min\, cx : Ax \le b\}$ to such an input form can be automated; a small illustration is given at the end of this section. This concept can be extended to nonlinear and discrete problems.

Algebraic Modeling Languages
The idea of stating the mathematical problem in a declarative way and translating it into an ‘algorithmic’ form by a standard procedure led to a new language paradigm which emerged basically in the community of operations research at the end of the 1980s: the algebraic modeling languages (AIMMS [1], AMPL [7], GAMS [2], LINGO [18], LPL [12], and others). These languages are becoming increasingly popular even outside the community of operations research. Algebraic modeling languages represent a problem in a purely declarative way, although most of them include computational facilities to manipulate the data as well as certain control structures. One of their strengths is the complete separation of the problem formulation as a declarative model from finding a solution, which is supposed to be computed by an external program called a solver. This allows the modeler not only to separate the two main tasks of model formulation and model solution, but also to switch easily between several solvers. This is an invaluable benefit for many difficult problems, since it is not uncommon that one model instance can be solved using one method, while another instance is solvable only using another method. Another advantage of such languages is the clear separation between the model structure, which only contains parameters (place-holders for data) but no data, and the model instance, in which the parameters are replaced by a specific data set. This leads to a natural separation between model formulation and the data gathering stored in databases. Hence, the main features of these algebraic modeling languages are:
- a purely declarative representation of the problem;
- a clear separation between formulation and solution;
- a clear separation between model structure and model data.
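As a toy illustration of the translation step mentioned above (a sketch added here, not part of the original article), the declarative statement $\{\min\, cx : Ax \le b\}$ reduces to the data (c, A, b), which can be handed mechanically to an LP solver; SciPy's linprog plays the role of the external solver.

from scipy.optimize import linprog

c = [1.0, 2.0]                   # objective coefficients
A = [[-1.0, 1.0], [3.0, 2.0]]    # constraint matrix of Ax <= b
b = [1.0, 12.0]                  # right-hand side

# The "model" is just (c, A, b); passing it to the solver is mechanical.
result = linprog(c, A_ub=A, b_ub=b)   # x >= 0 by default
print(result.x, result.fun)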


It is, however, naive to think that one only needs to formulate a problem in a concise declarative form and to link it somehow to a solver in order to solve it. First of all, the ‘linking process’ is not as straightforward as it seems initially. Second, a solver may not exist which could solve the problem at hand in an efficient way. One only needs to look at Fermat’s last conjecture, which can be stated in a declarative way as $\{a, b, c, n \in \mathbb{N}^+ : a^n + b^n = c^n,\ a, b, c \ge 1,\ n > 2\}$, to convince oneself of this fact. Even worse, one can state a problem declaratively for which no solver can exist. This is true already for the rather limited declarative language of first-order logic, for which no algorithm exists that decides whether a formula is true or false in general (see [5]).

In this sense, efforts are actually under way in the design of such languages which focus on flexibly linking the declarative formulation to a specific solver in order to make this paradigm of purely declarative formulation more powerful. This language–solver interface problem has different aspects, and research goes in many directions. A main effort is to integrate symbolic model transformation rules into the declarative language in order to generate formulations which are more useful for a solver. AMPL, for example, automatically detects partially separable structure and computes second derivatives [8]. This information is also handed over to a nonlinear solver. LPL, to cite a very different undertaking, has integrated a set of rules to symbolically translate logical constraints into 0–1 constraints [11]. To do this in an intelligent way is anything but easy, because the resulting 0–1 formulation should be as sharp as possible. This translation is useful for large mathematical models which must be extended by a few logical conditions. For many applications the original model remains straightforward while the transformed one is complicated but still relatively easy to solve (examples were given in [11]). Even if the resulting formulation is not solvable efficiently, the modeler can gain more insight into the structure of the model from such a symbolic translation procedure, and eventually modify the original formulation.

Second Generation Modeling Languages
Another research activity, actually under way, goes in the direction of extending the algebraic modeling languages in order to also express algorithmic knowledge.

This is necessary because, even if one could link a purely declarative language to any solver, it remains doubtful whether this can be done efficiently in all cases. Furthermore, for many problems it is not useful to formulate them in a declarative way: the algorithmic way is more straightforward and easier to understand. For still other problems a mixture of declarative and algorithmic knowledge leads to a superior formulation in terms of understandability as well as in terms of efficiency (examples are given below to confirm these findings). Therefore, AIMMS integrates control structures and procedure definitions. GAMS, AMPL, and LPL also allow the modeler to write algorithms powerful enough to solve models repeatedly. A theoretical effort was undertaken in [10] to specify a modeling language which allows the modeler (or the programmer) to combine algorithmic and declarative knowledge within the same language framework without intermingling them. The overall syntax structure of a model (or a program) in this framework is as follows:

MODEL ModelName
  ⟨declarative part of the model⟩
BEGIN
  ⟨algorithmic part of the model⟩
END ModelName.

Declarative and algorithmic knowledge are clearly separated. Either part can be empty, meaning that the problem is represented in a purely declarative or in a purely algorithmic form. The declarative part consists of the basic building blocks of declarative knowledge: variables, parameters, constraints, model checking facilities, and sets (that is, a way to ‘multiply’ basic building blocks). This part may also contain ‘ordinary declarations’ of an algorithmic language (e.g., type and function declarations). Furthermore, one can declare whole models within this part, leading to nested model structures, which is very useful in decomposing a complex problem into smaller parts. The algorithmic part, on the other hand, consists of all control structures which make the language Turing complete. One may imagine his or her favorite programming language being implemented in this part.


A language which combines declarative and algorithmic knowledge in this way is called a modeling language.

Definition 3 A modeling language is a notational system which allows one to combine (not to merge) declarative and algorithmic knowledge in the same language framework. The content captured by such a notation is called a model.

Such a language framework is very flexible. Purely declarative models are linked to external solvers to be solved; purely algorithmic models are programs, that is, algorithms + data structures, in the ordinary sense.

Modeling Language and Constraint Logic Programming
Merging declarative and algorithmic knowledge is not new, although it is not very common in language design. The only existing language paradigm doing so is constraint logic programming (CLP), a refinement of logic programming [13]. There are, however, important differences between the CLP paradigm and the paradigm of modeling languages as defined above.
1) In CLP the algorithmic part – normally a search mechanism – is behind the scenes, and the computation is intrinsically coupled with the declarative language itself. This could be a strength, because the programmer does not have to be aware of how the computation takes place; he or she only writes the rules in a descriptive, that is, declarative way and triggers the computation by a request. In reality, however, it is an important drawback, because for most nontrivial problems the programmer ‘must’ be aware of how the computation takes place. Therefore, to guide the computation in CLP, the declarative program is interspersed with additional rules which have nothing to do with the description of the original problem. In a modeling language, the user either links the declarative part to an external solver or writes the solver within the language. In either case, both parts are strictly separated. Why is this separation so important? Because it allows the modeler to ‘plug in’ different solvers without touching the overall model formulation.
2) The second difference is that the modeling language paradigm leads automatically to modular design.


This is probably the hottest topic in software engineering: building components. Software engineering teaches us that a complex structure can only be managed efficiently by breaking it down into many relatively independent components. The CLP approach leads more easily to programs that are difficult to survey and hard to debug and maintain, because such considerations are entirely absent within the CLP paradigm.
3) On the other hand, the CLP community has developed methods to solve specific classes of combinatorial problems which seem to be superior to other methods. This is because they rely on propagation, simplification of constraints, and various consistency techniques. In this sense, CLP solvers could be used and linked with modeling languages. Such a project is actually under way between the AMPL language and the ILOG solver [6,17]. Hence, while the representation of models is probably best done in the language framework of modeling languages, the solution process can, for certain problems, take place in a CLP solver.

Modeling Examples
Five modeling examples, chosen from very different problem domains, illustrate the highlights of the presented paradigm of modeling languages. The first two examples show that certain problems are best formulated using algorithmic knowledge, the next two examples show the power of a declarative formulation, and the last example indicates that mixing both paradigms is sometimes more advantageous.

Sorting
Sorting is a problem which is preferably expressed in an algorithmic way. Declaratively, the problem could be formulated as follows: find a permutation such that $A_i \le A_{i+1}$ for all $i \in \{1, \ldots, n-1\}$, where $A_1, \ldots, A_n$ is an array of objects on which an order is defined. It is difficult to imagine a ‘solver’ that could solve this problem as efficiently as the best known sorting algorithms, such as Quicksort, whose implementation is straightforward. The reason why the sorting problem is best formulated as an algorithm is probably that the state space is exponential in the number of items, whereas the best algorithms only have complexity O(n log n).


The n-Queens Problem
The n-queens problem is to place n queens on a chessboard of dimension n × n in such a way that they cannot beat each other. This problem can be formulated declaratively as follows: $\{x_i, x_j \in \{1, \ldots, n\} : x_i \ne x_j,\ x_i + i \ne x_j + j,\ x_i - i \ne x_j - j\}$, where $x_i$ is the column position of the ith queen (i.e. the queen in row i). Using the LPL [12] formulation:

MODEL nQueens;
  PARAMETER n;
  SET i ALIAS j ::= {1,...,n};
  DISTINCT VARIABLE x{i} [1,...,n];
  CONSTRAINT S{i,j: i<j}:
    x[i]+i ≠ x[j]+j AND x[i]-i ≠ x[j]-j;
END

the author was able to solve problems for n ≤ 8 using a general MIP solver. The problem is automatically translated into a 0–1 problem by LPL. Replacing the MIP solver by a tabu search heuristic, problems with n ≤ 50 were solvable within the LPL framework. Using the constraint language OZ [19], problems with n ≤ 200 are efficiently solvable using techniques of propagation and variable domain reductions. However, the success of all these methods seems to be limited compared to the best we can attain. In [20,21], R. Sosic and J. Gu presented a polynomial-time local heuristic that can solve problems with n ≤ 3,000,000 in less than one minute. The presented algorithm is very simple. The conclusion for the n-queens problem seems to be that an algorithmic formulation is advantageous.

A Two-Person Game
Two players choose at random a positive number and note it on a piece of paper. They then compare them. If both numbers are equal, then neither player gets a payoff. If the difference between the two numbers is one, then the player who has chosen the higher number obtains the sum of both; otherwise the player who has chosen the smaller number obtains the sum of both. What is the optimal strategy for a player, i.e. which numbers should be chosen with what frequencies to get the maximal payoff?

This problem was presented in [9] and is a typical two-person zero-sum game. In LPL, it can be formulated as follows:

MODEL Game ‘finite two-person zero-sum game’;
  SET i ALIAS j := /1:50/;
  PARAMETER p{i,j} := IF(j > i; IF(j = i+1; -i-j; MIN(i,j)); IF(j < i; -p[j,i]; 0));
  VARIABLE x{i};
  CONSTRAINT R: SUM{i} x[i] = 1;
  MAXIMIZE gain: MIN{j} (SUM{i} p[j,i] x[i]);
END Game.

This is a very compact way to formulate the problem declaratively, and it is difficult to imagine how this could be achieved using algorithmic knowledge alone. It is also an efficient way to state the problem, because large instances can be solved by a linear programming solver. LPL automatically transforms it into a linear program. (By the way, the problem has an interesting solution: each player should only choose numbers smaller than six.)

Equal Circles in a Square
The problem is to find the maximum diameter of n equal mutually disjoint circles packed inside a unit square. In LPL, this problem can be compactly formulated as follows:

MODEL circles ‘pack equal circles in a square’;
  PARAMETER n ‘number of circles’;
  SET i ALIAS j := 1,...,n;
  VARIABLE t ‘diameter of the circles’;
    x{i} [0,1] ‘x-position of the center’;
    y{i} [0,1] ‘y-position of the center’;
  CONSTRAINT R{i,j: i<j} ‘circles must be disjoint’:
    (x[i]-x[j])^2 + (y[i]-y[j])^2 ≥ t;
  MAXIMIZE obj ‘maximize diameter’: t;
END

C.D. Maranas et al. [15] obtained the best known solutions for all n ≤ 30 and, for n = 15, an even better one,


using an equivalent formulation in GAMS and linking it to MINOS [16], a well-known nonlinear solver.
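For readers who want to experiment, the same model can be handed to any general-purpose nonlinear solver. The following sketch (an illustration under assumptions, not the authors' GAMS/MINOS setup) transcribes the LPL model above, with SciPy's SLSQP standing in for MINOS; note that in the model t is the squared distance between circle centers, and that a serious attempt would add many random restarts.

import numpy as np
from scipy.optimize import minimize

n = 5                                       # number of circles

def neg_t(z):                               # z = [t, x_1..x_n, y_1..y_n]
    return -z[0]                            # maximize t <=> minimize -t

constraints = []
for i in range(n):
    for j in range(i + 1, n):
        def disjoint(z, i=i, j=j):          # (x_i-x_j)^2 + (y_i-y_j)^2 - t >= 0
            x, y = z[1:n+1], z[n+1:]
            return (x[i]-x[j])**2 + (y[i]-y[j])**2 - z[0]
        constraints.append({"type": "ineq", "fun": disjoint})

bounds = [(0, None)] + [(0, 1)] * (2 * n)   # t >= 0, centers in the unit square
z0 = np.concatenate(([0.0], np.random.rand(2 * n)))
res = minimize(neg_t, z0, bounds=bounds, constraints=constraints,
               method="SLSQP")
print("best squared center distance t:", res.x[0])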

The (Fractional) Cutting-Stock Problem
Paper is manufactured in rolls of width B. A set of customers W orders $d_w$ rolls of width $b_w$ (with $w \in W$). Rolls can be cut in many ways: every subset $P' \subseteq W$ such that $\sum_{i \in P'} y_i b_i \le B$ is a possible cut pattern, where $y_i$ is a positive integer. The question is how the initial roll of width B should be cut, that is, which patterns should be used, in order to minimize the overall paper waste. A straightforward formulation of this problem is to enumerate all patterns, each giving a variable, and then to minimize the number of used patterns while fulfilling the demands. The resulting model is a very large linear program which cannot be solved directly. A well-known method in operations research to solve such kinds of problems is column generation (see [3] for details): a small instance with only a few patterns is solved, and a rewarding column, i.e. a pattern, is repeatedly added to the problem. The new problem is then solved again. This process is repeated until no pattern can be added. To find a rewarding pattern, another problem, a knapsack problem, must be solved. The overall problem can thus be formulated partially by algorithmic and partially by declarative knowledge. It consists of two declaratively formulated problems (a linear program and a knapsack problem), which are both repeatedly solved. In pseudocode one could formulate the algorithmic knowledge as follows:

SOLVE the small cutting-stock problem
SOLVE the knapsack problem
WHILE a rewarding pattern was found DO
  add pattern to the cutting-stock problem
  SOLVE the cutting-stock problem again
  SOLVE the knapsack problem again
ENDWHILE
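The loop above can be made concrete in ordinary code. The following sketch (an illustration only, not the article's LPL program) assumes integer roll widths, uses SciPy's HiGHS-based linprog for the LP master problem, reads the dual prices from res.ineqlin.marginals, and prices new patterns with a small dynamic-programming knapsack.

import numpy as np
from scipy.optimize import linprog

def knapsack(values, widths, capacity):
    # Unbounded integer knapsack by dynamic programming; returns the
    # best value and the corresponding cut pattern (multiplicities).
    m = len(widths)
    best = [0.0] * (capacity + 1)
    item = [-1] * (capacity + 1)              # -1: leave this width unused
    for c in range(1, capacity + 1):
        best[c], item[c] = best[c - 1], -1    # unused capacity = waste
        for i in range(m):
            if widths[i] <= c and best[c - widths[i]] + values[i] > best[c]:
                best[c], item[c] = best[c - widths[i]] + values[i], i
    pattern, c = [0] * m, capacity
    while c > 0:                              # backtrack to read the pattern
        if item[c] < 0:
            c -= 1
        else:
            pattern[item[c]] += 1
            c -= widths[item[c]]
    return best[capacity], pattern

def cutting_stock(widths, demands, B, tol=1e-9):
    m = len(widths)
    patterns = [[B // widths[i] if j == i else 0 for j in range(m)]
                for i in range(m)]            # trivial one-width start patterns
    while True:
        A = np.array(patterns, dtype=float).T
        res = linprog(np.ones(len(patterns)),             # minimize rolls used
                      A_ub=-A, b_ub=-np.array(demands, dtype=float),
                      method="highs")
        prices = -res.ineqlin.marginals                   # duals of the demands
        value, pattern = knapsack(prices, widths, B)      # pricing problem
        if value <= 1 + tol:                              # no rewarding pattern
            return res.fun, patterns
        patterns.append(pattern)

This returns the fractional (LP) optimum, matching the qualifier in the section title; whole-roll solutions would require a final rounding or MIP step.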

The two models (the cutting-stock problem and the knapsack problem) can be formulated declaratively. In the proposed framework of modeling language, the complete problem can now be expressed as in the program below.

M

MODEL CuttingStock;
  MODEL Knapsack(i, w, p, K, x, obj);
    SET i;
    PARAMETER w{i}; p{i}; K;
    INTEGER VARIABLE x{i};
    CONSTRAINT R: SUM{i} w x ≤ K;
    MAXIMIZE obj: SUM{i} p x;
  END Knapsack.
  SET w ‘rolls ordered’; p ‘possible patterns’;
  PARAMETER a{w,p} ‘pattern table’;
    d{w} ‘demands’;
    b{w} ‘widths of ordered rolls’;
    B ‘initial width’;
    INTEGER y{w} ‘new added pattern’;
    C ‘contribution of a cut’;
  VARIABLE X{p} ‘number of rolls cut according to p’;
  CONSTRAINT Dem{w}: SUM{p} a X ≥ d;
  MINIMIZE obj: SUM{p} X;
BEGIN
  SOLVE;
  SOLVE Knapsack(w, b, Dem.dual, B, y, C);
  WHILE (C > 1) DO
    p := p + {‘pattern_’ + str(#p)};
    a{w, #p} := y[w];
    SOLVE;
    SOLVE Knapsack(w, b, Dem.dual, B, y, C);
  END;
END CuttingStock.

This formulation has several remarkable properties:
1) It is short and readable. The declarative part consists of the (small) linear cutting-stock problem; it also contains, as a submodel, a knapsack problem. The algorithmic part implements the column generation method. Both parts are entirely separated.
2) It is a complete formulation, apart from the data. No other code is needed; both models can be solved using a standard MIP solver (since the knapsack problem is small in general).
3) It has a modular structure. The knapsack problem is an independent component with its own name space; there is no interference with the surrounding model. It could even be declared outside the cutting-stock problem.


4) The cutting-stock problem is only one problem of a large class of relevant problems which are solved using column generation or, alternatively, row-cut generation.

Conclusion
It has been shown that certain problems are best formulated as algorithms, others in a declarative way, and still others need both paradigms to be stated concisely. Computer science has made available many algorithmic languages; they can be contrasted with the algebraic modeling languages, which are purely declarative. A language, called a modeling language, which combines both paradigms was defined in this paper, and examples were given showing clear advantages of doing so. It is more powerful than either paradigm on its own. However, the integration of algorithmic and declarative knowledge cannot be done in an arbitrary way. The language design must follow certain criteria well known in computer science. The main criteria are reliability and transparency. Reliability can be achieved by a unique notation to code models, that is, by a modeling language, and by various checking mechanisms (type checking, unit checking, data integrity checking, and others). Transparency can be obtained by flexible decomposition techniques, like modular structures as well as access and protection mechanisms for these structures, which are well-known techniques in language design and software engineering. Solving relevant optimization problems efficiently on present desktop machines asks not only for fast machines and sophisticated solvers, but also for formulation techniques that allow the modeler to communicate the model easily and to build it in a readable and maintainable way.

See also
Continuous Global Optimization: Models, Algorithms and Software
Large Scale Unconstrained Optimization
Optimization Software

References
1. Bisschop J (1998) AIMMS, the modeling system. Paragon Decision Techn, Haarlem, www.paragon.nl

2. Brooke A, Kendrick D, Meeraus A (1988) GAMS. A user’s guide. Sci Press, Marrickville
3. Chvátal V (1983) Linear programming. Freeman, New York
4. Feigenbaum EA (1996) How the ‘what’ becomes the ‘how’. Comm ACM 39(5):97–104
5. Floyd RW, Beigel R (1994) The language of machines: an introduction to computability and formal languages. Computer Sci Press, Rockville
6. Fourer R (1998) Extending a general-purpose algebraic modeling language to combinatorial optimization: A logic programming approach. In: Woodruff DL (ed) Advances in Computational and Stochastic Optimization, Logic Programming, and Heuristic Search: Interfaces in Computer Science and Operations Research. Kluwer, Dordrecht, pp 31–74
7. Fourer R, Gay DM, Kernighan BW (1993) AMPL, a modeling language for mathematical programming. Sci Press, Marrickville
8. Gay DM (1996) Automatically finding and exploiting partially separable structure in nonlinear programming problems. AT&T Bell Lab, Murray Hill
9. Hofstadter DR (1988) Metamagicum, Fragen nach der Essenz von Geist und Struktur. Klett-Cotta, Stuttgart
10. Hürlimann T (1997) Computer-based mathematical modeling. Habilitation Script. Fac Economic and Social Sci, Inst Informatics, Univ Fribourg
11. Hürlimann T (1998) An efficient logic-to-IP translation procedure. Working Paper, Inst Informatics, Univ Fribourg, ftp://ftp-iiuf.unifr.ch/pub/lpl/doc/APMOD1.pdf
12. Hürlimann T (1998) Reference manual for the LPL modeling language, version 4.30. Working Paper, Inst Informatics, Univ Fribourg, ftp://ftp-iiuf.unifr.ch/pub/lpl/doc/Manual.ps
13. Jaffar J, Maher MJ (1995) Constraint logic programming: A survey. Handbook of Artificial Intelligence and Logic Programming. Oxford Univ Press, Oxford
14. Louden KC (1993) Programming languages – Principles and practice. PWS/Kent, Boston
15. Maranas CD, Floudas CA, Pardalos PM (1993) New results in the packing of equal circles in a square. Dept Chemical Engin, Princeton Univ, Princeton
16. Murtagh BA, Saunders MA (1987) MINOS 5.0 user’s guide. Systems Optim Lab, Dept Oper Res, Stanford Univ, Stanford
17. ILOG SA (1997) ILOG Solver 4.0 user’s manual; ILOG Solver 4.0 reference manual. ILOG, Mountain View
18. Schrage L (1998) Optimization modeling with LINGO. LINDO Systems, Chicago, www.lindo.com
19. Smolka G (1995) The Oz programming model. In: van Leeuwen J (ed) Computer Science Today. Lecture Notes in Computer Science, vol 1000. Springer, Berlin, pp 324–343
20. Sosic R, Gu J (1990) A polynomial time algorithm for the n-queens problem. SIGART Bull 1(3):7–11
21. Sosic R, Gu J (1991) 3,000,000 queens in less than one minute. SIGART Bull 2(1):22–24


Molecular Distance Geometry Problem
CARLILE LAVOR1, LEO LIBERTI2, NELSON MACULAN3
1 State University of Campinas (IMECC-UNICAMP), Campinas, Brazil
2 École Polytechnique, LIX, Palaiseau, France
3 Federal University of Rio de Janeiro (COPPE-UFRJ), Rio de Janeiro, Brazil

MSC2000: 46N60

Article Outline
Introduction
ABBIE Algorithm
Global Continuation Algorithm
D.C. Optimization Algorithms
Geometric Build-up Algorithm
BP Algorithm
Conclusion
Acknowledgements
References

Introduction
This article presents a general overview of some of the most recent approaches for solving the molecular distance geometry problem, namely, the ABBIE algorithm, the Global Continuation Algorithm, d.c. optimization algorithms, the geometric build-up algorithm, and the BP algorithm.

The determination of the three-dimensional structure of a molecule, especially in the protein folding framework, is one of the most important problems in computational biology. That structure is very important because it is associated with the chemical and biological properties of the molecule [7,11,46]. Basically, this problem can be tackled in two ways: experimentally, via nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography [8], or theoretically, through potential energy minimization [19].

The Molecular Distance Geometry Problem (MDGP) arises in NMR analysis. This experimental technique provides a set of inter-atomic distances $d_{ij}$ for certain pairs of atoms (i, j) of a given protein [23,24,33,56,57]. The MDGP can be formulated as follows:

Given a set S of atom pairs (i, j) on a set of m atoms and distances $d_{ij}$ defined over S, find positions $x_1, \ldots, x_m \in \mathbb{R}^3$ of the atoms in the molecule such that
$$\|x_i - x_j\| = d_{ij} \quad \forall (i, j) \in S \,. \qquad (1)$$

When the distances between all pairs of atoms of a molecule are given, a unique three-dimensional structure can be determined by a linear-time algorithm [16]. However, because of errors in the given distances, a solution may not exist or may not be unique. In addition to this, because of the large scale of problems that arise in practice, the MDGP becomes very hard to solve in general. Saxe [51] showed that the MDGP is NP-complete even in one spatial dimension.

The exact MDGP can be naturally formulated as a nonlinear global minimization problem, where the objective function is given by
$$f(x_1, \ldots, x_m) = \sum_{(i,j) \in S} \left( \|x_i - x_j\|^2 - d_{ij}^2 \right)^2 \,. \qquad (2)$$

This function is everywhere infinitely differentiable and has an exponential number of local minimizers. Assuming that all the distances are correctly given, $x \in \mathbb{R}^{3m}$ solves the problem if and only if $f(x) = 0$. Formulations (1) and (2) correspond to the exact MDGP. Since experimental errors may prevent solution existence (e.g. when the triangle inequality $d_{ij} \le d_{ik} + d_{kj}$ is violated for atoms i, j, k), we sometimes consider an $\varepsilon$-optimum solution of (1), i.e. a solution $x_1, \ldots, x_m$ satisfying
$$\left|\, \|x_i - x_j\| - d_{ij} \,\right| \le \varepsilon \quad \forall (i, j) \in S \,. \qquad (3)$$
Moré and Wu [41] showed that even obtaining such an $\varepsilon$-optimum solution is NP-hard for $\varepsilon$ small enough. In practice, it is often just possible to obtain lower and upper bounds on the distances [4]. Hence a more practical definition of the MDGP is to find positions $x_1, \ldots, x_m \in \mathbb{R}^3$ such that
$$l_{ij} \le \|x_i - x_j\| \le u_{ij} \quad \forall (i, j) \in S \,, \qquad (4)$$
where $l_{ij}$ and $u_{ij}$ are lower and upper bounds on the distance constraints, respectively.
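The quartic penalty (2) is straightforward to evaluate; the following small function (a sketch, where the container format chosen for S is an assumption made here for illustration) transcribes it directly.

import numpy as np

def mdgp_objective(x, S):
    # f(x) = sum over (i,j) in S of (||x_i - x_j||^2 - d_ij^2)^2,
    # with x an m-by-3 array and S a dict mapping (i, j) -> d_ij
    return sum((np.sum((x[i] - x[j]) ** 2) - d ** 2) ** 2
               for (i, j), d in S.items())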


The MDGP is a particular case of a more general problem, called the distance geometry problem [6,13,14,15], which is intimately related to the Euclidean distance matrix completion problem [1,28,38]. Several methods have been developed to solve the MDGP, including the EMBED algorithm by Crippen and Havel [12,25], the alternating projection algorithm by Glunt et al. [20], spectral gradient methods by Glunt et al. [21,22], the multi-scaling algorithm by Trosset et al. [29,52], a stochastic/perturbation algorithm by Zou, Byrd, and Schnabel [58], variable neighborhood search-based algorithms by Liberti, Lavor, and Maculan [35,39], the ABBIE algorithm by Hendrickson [26,27], the Global Continuation Algorithm by Moré and Wu [41,42,43,44,45], the d.c. optimization algorithms by An and Tao [2,3], the geometric build-up algorithm by Dong, Wu, and Wu [16,17,54], and the BP algorithm by Lavor, Liberti, and Maculan [37]. Two completely different approaches for solving the MDGP are given in [34] (based on quantum computation) and [53] (based on algebraic geometry).

The wireless sensor network positioning problem is closely related to the MDGP, the main difference being the presence of fixed anchor points with known positions; results derived for this problem can often be applied to the MDGP. Amongst the most notable, [18] shows that the MDGP associated with a trilateration graph (a graph with an order on the vertices such that each vertex is adjacent to the preceding 4 vertices) can be solved in polynomial time; [40] provides a detailed study of Semidefinite Programming (SDP) relaxations applied to distance geometry problems.

ABBIE Algorithm
In [26,27], Hendrickson describes an approach to the exact MDGP that replaces a large optimization problem, given by (2), by a sequence of smaller ones. He exploits some combinatorial structure inherent in the MDGP, which allows him to develop a divide-and-conquer algorithm based on a graph-theoretic viewpoint. If the atoms and the distances are considered as nodes and edges of a graph, respectively, the MDGP can be described by a distance graph, and the solution to the problem is an embedding of the distance graph in a Euclidean space.

When some of the atoms can be moved without violating any distance constraints, there may be many embeddings; the graph is then called flexible, and otherwise rigid. If the graph is rigid or does not have partial reflections, for example, then the graph has a unique embedding. These necessary conditions can be used to find subgraphs that have unique embeddings. The problem can then be solved by decomposing the graph into such subgraphs, in which the minimization problems associated with the function (2) are solved. The solutions found for the subgraphs can then be combined into a solution for the whole graph.

This approach to the MDGP has been implemented in a code named ABBIE and tested on simulated data provided by the bovine pancreatic ribonuclease A, a typical small protein consisting of 124 amino acids, whose three-dimensional structure is known [47]. The data set consists of all distances between pairs of atoms in the same amino acid, along with 1167 additional distances corresponding to pairs of hydrogen atoms that were within 3.5 Å of each other. Fragments of the protein consisting of the first 20, 40, 60, 80, and 100 amino acids were used, as well as the full protein, with two sets of distance constraints for each size corresponding to the largest unique subgraphs and the reduced graphs. These problems have from 63 up to 777 atoms.

Global Continuation Algorithm
In [43], Moré and Wu formulated the exact MDGP in terms of finding the global minimum of a function similar to (2),
$$f(x_1, \ldots, x_m) = \sum_{(i,j) \in S} w_{ij} \left( \|x_i - x_j\|^2 - d_{ij}^2 \right)^2 \,, \qquad (5)$$

where $w_{ij}$ are positive weights (in the numerical results $w_{ij} = 1$ was used). Following the ideas described in [55], Moré and Wu proposed an algorithm, called the Global Continuation Algorithm, based on a continuation approach to global optimization. The idea is to gradually transform the function (5) into a smoother function with fewer local minimizers, to apply an optimization algorithm to the transformed function, and to trace its minimizers back to the original function. For other works based on the continuation approach, see [9,10,30,31,32,49].


The transformed function $\langle f \rangle_\lambda$, called the Gaussian transform, of a function $f : \mathbb{R}^n \to \mathbb{R}$ is defined by
$$\langle f \rangle_\lambda (x) = \frac{1}{\pi^{n/2} \lambda^n} \int_{\mathbb{R}^n} f(y) \exp\left( -\frac{\|y - x\|^2}{\lambda^2} \right) \mathrm{d}y \,, \qquad (6)$$

where the parameter $\lambda$ controls the degree of smoothing. The value $\langle f \rangle_\lambda (x)$ is a weighted average of f(x) in a neighborhood of x, where the size of the neighborhood decreases as $\lambda$ decreases: as $\lambda \to 0$, the average is carried out on the singleton set $\{x\}$, thus recovering the original function in the limit. Smoother functions are obtained as $\lambda$ increases.

This approach to the MDGP has been implemented and tested on two artificial models of problems, where the molecule has $m = s^3$ atoms located in the three-dimensional lattice $\{(i_1, i_2, i_3) : 0 \le i_1 < s,\ 0 \le i_2 < s,\ 0 \le i_3 < s\}$ for an integer $s \ge 1$. In the numerical results, $m = 27, 64, 125, 216$ was considered. In the first model, the ordering of the atoms is specified by letting i be the atom at position $(i_1, i_2, i_3)$, with $i = 1 + i_1 + s i_2 + s^2 i_3$, and the set of atom pairs whose distances are known, S, is given by
$$S = \{(i, j) : |i - j| \le r\} \,, \qquad (7)$$

where $r = s^2$. In the second model, the set S is specified by
$$S = \{(i, j) : \|x_i - x_j\| \le \sqrt{r}\} \,, \qquad (8)$$
where $x_i = (i_1, i_2, i_3)$ and $r = s^2$. For both models, s is considered in the interval $3 \le s \le 6$. In (7), S includes all nearby atoms, while in (8), S includes some nearby atoms and some relatively distant atoms.

It was shown that the Global Continuation Algorithm usually finds a solution from any given starting point, whereas the local minimization algorithm used in the multistart methods is unreliable as a method for determining global solutions. It was also shown that the continuation approach determines a global solution with less computational effort than is required by the multistart approach.
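As a quick numerical aside (an illustration added here, not part of the original survey), the smoothing effect of the Gaussian transform (6) can be observed in one dimension by quadrature: larger λ averages f over a wider neighborhood and progressively washes out local minima.

import numpy as np

def gaussian_transform_1d(f, x, lam, grid):
    # Approximate <f>_lambda(x) for n = 1 by trapezoidal quadrature:
    # (1 / (sqrt(pi) * lam)) * integral of f(y) exp(-(y-x)^2 / lam^2) dy
    w = np.exp(-((grid - x) ** 2) / lam**2) / (np.sqrt(np.pi) * lam)
    return np.trapz(w * f(grid), grid)

f = lambda y: (y**2 - 1.0) ** 2 + 0.3 * np.sin(12 * y)   # wiggly double well
grid = np.linspace(-5, 5, 20001)
for lam in (0.01, 0.3, 1.0):
    print(lam, gaussian_transform_1d(f, 0.5, lam, grid))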

D.C. Optimization Algorithms
In [2,3], An and Tao proposed an approach for solving the exact MDGP based on d.c. (difference of convex functions) optimization algorithms. They worked in $M_{m,3}(\mathbb{R})$, the space of real matrices of order $m \times 3$, where for $X \in M_{m,3}(\mathbb{R})$, $X_i$ (resp., $X^i$) is its ith row (resp., ith column). By identifying a set of positions of atoms $x_1, \ldots, x_m$ with the matrix X, $X_i^T = x_i$ for $i = 1, \ldots, m$, they expressed the MDGP as
$$0 = \min \left\{ \sigma(X) := \frac{1}{2} \sum_{(i,j) \in S,\, i < j} w_{ij}\, \theta_{ij}(X) \;:\; X \in M_{m,3}(\mathbb{R}) \right\} \,, \qquad (9)$$

where $w_{ij} > 0$ for $i \ne j$ and $w_{ii} = 0$ for all i. The pairwise potential $\theta_{ij} : M_{m,3}(\mathbb{R}) \to \mathbb{R}$ is defined for problem (1) by either
$$\theta_{ij}(X) = \left( d_{ij}^2 - \|X_i^T - X_j^T\|^2 \right)^2 \qquad (10)$$
or
$$\theta_{ij}(X) = \left( d_{ij} - \|X_i^T - X_j^T\| \right)^2 \,, \qquad (11)$$
and for problem (4) by
$$\theta_{ij}(X) = \min^2\!\left\{ \frac{\|X_i^T - X_j^T\|^2 - l_{ij}^2}{l_{ij}^2},\; 0 \right\} + \max^2\!\left\{ \frac{\|X_i^T - X_j^T\|^2 - u_{ij}^2}{u_{ij}^2},\; 0 \right\} \,. \qquad (12)$$
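In code, the bound-constrained potential (12) is a short expression per atom pair; the sketch below (with illustrative names only) evaluates it for a single pair of positions.

import numpy as np

def theta_bounds(xi, xj, lij, uij):
    # theta_ij from (12): penalize squared distances outside [l_ij^2, u_ij^2]
    s = float(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))
    lo = min((s - lij**2) / lij**2, 0.0)   # negative iff below the lower bound
    hi = max((s - uij**2) / uij**2, 0.0)   # positive iff above the upper bound
    return lo**2 + hi**2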

Similarly to (2), X is a solution if and only if it is a global minimizer of problem (9) and $\sigma(X) = 0$. While problem (9) with $\theta_{ij}$ given by (11) or (12) is a nondifferentiable optimization problem, it is a d.c. optimization problem. An and Tao demonstrated that d.c. algorithms can be adapted to develop efficient algorithms for solving large-scale exact MDGPs. They proposed various versions of d.c. algorithms based on different formulations of the problem. Due to their local character, global optimality cannot be guaranteed for a general d.c. problem.


However, the fact that global optimality can be obtained with a suitable starting point motivated them to investigate a technique for computing good starting points for the d.c. algorithms in the solution of (9), with $\theta_{ij}$ defined by (11). The algorithms have been tested on three sets of data: the artificial data from Moré and Wu [43] (with up to 4096 atoms), 16 proteins in the PDB [5] (from 146 up to 4189 atoms), and the data from Hendrickson [27] (from 63 up to 777 atoms). Using these data, they showed that the d.c. algorithms can efficiently solve large-scale exact MDGPs.

Geometric Build-up Algorithm
In [17], Dong and Wu proposed to solve the exact MDGP by an algorithm, called the geometric build-up algorithm, based on a geometric relationship between the coordinates and the distances associated with the atoms of a molecule. It is assumed that it is possible to determine the coordinates of at least four atoms, which are marked as fixed; the remaining ones are non-fixed. The coordinates of a non-fixed atom a can be calculated by using the coordinates of four non-coplanar fixed atoms such that the distances between any of these four atoms and the atom a are known. If such four atoms are found, the atom a changes its status to fixed.

More specifically, let $b_1, b_2, b_3, b_4$ be the four fixed atoms whose Cartesian coordinates are already known. Now suppose that the Euclidean distances between the atom a and the atoms $b_1, b_2, b_3, b_4$, namely $d_{a,b_i}$ for $i = 1, 2, 3, 4$, are known. That is,
$$\|a - b_i\| = d_{a,b_i} \,, \quad i = 1, 2, 3, 4 \,.$$
Squaring both sides of these equations, we have
$$\|a\|^2 - 2 a^T b_i + \|b_i\|^2 = d_{a,b_i}^2 \,, \quad i = 1, 2, 3, 4 \,.$$
By subtracting one of these equations from the others, a linear system is obtained that can be used to determine the coordinates of the atom a. For example, subtracting the first equation from the others, we obtain
$$A x = b \,, \qquad (13)$$

where
$$A = -2 \begin{bmatrix} (b_1 - b_2)^T \\ (b_1 - b_3)^T \\ (b_1 - b_4)^T \end{bmatrix}, \qquad x = a \,,$$
and
$$b = \begin{bmatrix} \left( d_{a,b_1}^2 - d_{a,b_2}^2 \right) - \left( \|b_1\|^2 - \|b_2\|^2 \right) \\ \left( d_{a,b_1}^2 - d_{a,b_3}^2 \right) - \left( \|b_1\|^2 - \|b_3\|^2 \right) \\ \left( d_{a,b_1}^2 - d_{a,b_4}^2 \right) - \left( \|b_1\|^2 - \|b_4\|^2 \right) \end{bmatrix} .$$

Since $b_1, b_2, b_3, b_4$ are non-coplanar atoms, the system (13) has a unique solution. If the exact distances between all pairs of atoms are given, this approach can determine the coordinates of all atoms of the molecule in linear time [16]. Dong and Wu implemented such an algorithm, but they verified that it is very sensitive to the numerical errors introduced in calculating the coordinates of the atoms. In [54], Wu and Wu proposed the updated geometric build-up algorithm, showing that in this algorithm the accumulation of the errors in calculating the coordinates of the atoms can be controlled and prevented. They tested the algorithm on a set of problems generated using the known structures of 10 proteins downloaded from the PDB data bank [5], with problems from 404 up to 4201 atoms.
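A single build-up step amounts to solving the 3 × 3 system (13); the sketch below (illustrative names, not the authors' code) recovers an atom position from four fixed atoms and checks itself on synthetic data.

import numpy as np

def place_atom(B, d):
    """B: 4x3 array of fixed atom positions b1..b4; d: their distances to a."""
    B = np.asarray(B, dtype=float)
    d = np.asarray(d, dtype=float)
    A = -2.0 * (B[0] - B[1:])                 # rows: -2 (b1 - bj)^T, j = 2,3,4
    rhs = (d[0]**2 - d[1:]**2) - (B[0] @ B[0] - np.sum(B[1:]**2, axis=1))
    return np.linalg.solve(A, rhs)

# quick self-check with a synthetic atom
B = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
a_true = np.array([0.3, 0.4, 0.5])
d = np.linalg.norm(a_true - B, axis=1)
print(place_atom(B, d))                        # ~ [0.3, 0.4, 0.5]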


BP Algorithm
In [37], Lavor, Liberti, and Maculan propose an algorithm, called branch-and-prune (BP), based on a discrete formulation of the exact MDGP. They observe that the particular structure of proteins makes it possible to formulate the MDGP applied to protein backbones as a discrete search problem. They formalize this by introducing the discretizable molecular distance geometry problem (DMDGP), which consists of a certain subset of MDGP instances (to which most protein backbones belong) for which a discrete formulation can be supplied. This approach requires that the bond lengths and angles, as well as the distances between atoms separated by three consecutive bond lengths, are known.

In order to describe the backbone of a protein with m atoms, in addition to the bond lengths $d_{i-1,i}$, for $i = 2, \ldots, m$, and the bond angles $\theta_{i-2,i}$, for $i = 3, \ldots, m$, it is necessary to consider the torsion angles $\omega_{i-3,i}$, for $i = 4, \ldots, m$, which are the angles between the normals through the planes defined by the atoms $i-3, i-2, i-1$ and $i-2, i-1, i$. It is known [48] that, given all the bond lengths $d_{1,2}, \ldots, d_{m-1,m}$, bond angles $\theta_{1,3}, \ldots, \theta_{m-2,m}$, and torsion angles $\omega_{1,4}, \ldots, \omega_{m-3,m}$ of a molecule with m atoms, the Cartesian coordinates $(x_{i1}, x_{i2}, x_{i3})$ of each atom i in the molecule can be obtained using the following formulae:
$$\begin{bmatrix} x_{i1} \\ x_{i2} \\ x_{i3} \\ 1 \end{bmatrix} = B_1 B_2 \cdots B_i \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \,, \quad \forall i = 1, \ldots, m \,,$$
where
$$B_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} , \qquad B_2 = \begin{bmatrix} -1 & 0 & 0 & -d_{1,2} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} ,$$
$$B_3 = \begin{bmatrix} -\cos\theta_{1,3} & -\sin\theta_{1,3} & 0 & -d_{2,3}\cos\theta_{1,3} \\ \sin\theta_{1,3} & -\cos\theta_{1,3} & 0 & d_{2,3}\sin\theta_{1,3} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} ,$$
and
$$B_i = \begin{bmatrix} -\cos\theta_{i-2,i} & -\sin\theta_{i-2,i} & 0 & -d_{i-1,i}\cos\theta_{i-2,i} \\ \sin\theta_{i-2,i}\cos\omega_{i-3,i} & -\cos\theta_{i-2,i}\cos\omega_{i-3,i} & -\sin\omega_{i-3,i} & d_{i-1,i}\sin\theta_{i-2,i}\cos\omega_{i-3,i} \\ \sin\theta_{i-2,i}\sin\omega_{i-3,i} & -\cos\theta_{i-2,i}\sin\omega_{i-3,i} & \cos\omega_{i-3,i} & d_{i-1,i}\sin\theta_{i-2,i}\sin\omega_{i-3,i} \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

for $i = 4, \ldots, m$. Since all the bond lengths and bond angles are assumed to be given in the instance, the Cartesian coordinates of all atoms of a molecule can be completely determined by using the values of $\cos\omega_{i-3,i}$ and $\sin\omega_{i-3,i}$, for $i = 4, \ldots, m$.
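The product formula translates directly into code; the sketch below (a minimal illustration, not the BP implementation) builds the matrices $B_i$ for $i \ge 4$ and accumulates the products to obtain backbone coordinates.

import numpy as np

def torsion_matrix(d, theta, omega):
    """B_i for i >= 4, from bond length d, bond angle theta, torsion omega."""
    ct, st = np.cos(theta), np.sin(theta)
    cw, sw = np.cos(omega), np.sin(omega)
    return np.array([
        [-ct,      -st,       0.0, -d * ct],
        [st * cw,  -ct * cw,  -sw,  d * st * cw],
        [st * sw,  -ct * sw,   cw,  d * st * sw],
        [0.0,       0.0,      0.0,  1.0]])

def backbone_coordinates(B_list):
    """Positions x_1..x_m from the cumulative products B_1 B_2 ... B_i."""
    coords, prod = [], np.eye(4)
    for B in B_list:
        prod = prod @ B
        coords.append(prod @ np.array([0.0, 0.0, 0.0, 1.0]))
    return [c[:3] for c in coords]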

M

For instances of the DMDGP class, for all i D 4; : : : ; m, the value of cos ! i3;i can be computed by the formula cos ! i3;i D a/b where

a D d 2i3;i2 C d 2i2;i  2d i3;i2 d i2;i  cos  i2;i cos  i1;iC1  d 2i3;i

and

b D 2d i3;i2 d i2;i sin  i2;i sin  i1;iC1 ; (14)

which is just a rearrangement of the cosine law for torsion angles [50] (p. 278), and all the values in the expression (14) are given in the instance. This allows to express the position of the i-th atom in terms of the preceding three, giving 2m3 possible conformations, which characterizes the discretization of the problem. The idea of the BP algorithm is that at each step the ith atom can be placed in two possible positions. However, either of both of these positions may be infeasible with respect to some constraints. The search is branched on all atomic positions which are feasible with respect to all constraints; by contrast, if a position is not feasible the search scope is pruned. The algorithm has been tested on the artificial data from Moré and Wu [43] (with up to 216 atoms) and on the artificial data from Lavor [36] (a selection from 10 up to 100 atoms). Conclusion This paper surveys some of the methods to solve the Molecular Distance Geometry Problem, with particular reference to five existing algorithms: ABBIE algorithm, global continuation algorithm, d.c. optimization algorithms, the geometric build-up algorithm and the BP algorithm. Acknowledgements The authors would like to thank CNPq, FAPESP and FAPERJ for their financial support. References 1. Alfakih AY, Khandani A, Wolkowicz H (1999) Solving Euclidean distance matrix completion problems via semidefinite programming. Comput Optim Appl 12:13–30 2. An LTH (2003) Solving large-scale molecular distance geometry problems by a smoothing technique via the Gaussian transform and d.c. programming. J Global Optim 27:375–397

2309

2310

M

Molecular Distance Geometry Problem

3. An LTH, Tao PD (2003) Large-scale molecular optimization from distance matrices by a d.c. optimization approach. SIAM J Optim 14:77–114 4. Berger B, Kleinberg J, Leighton T (1999) Reconstructing a three-dimensional model with arbitrary errors. J ACM 46:212–235 5. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucl Acids Res 28:235–242 6. Blumenthal LM (1953) Theory and Applications of Distance Geometry. Oxford University Press, London 7. Brooks III CL, Karplus M, Pettitt BM (1988) Proteins: a theoretical perspective of dynamics, structure, and thermodynamics. Wiley, New York 8. Brünger AT, Nilges M (1993) Computational challenges for macromolecular structure determination by X-ray crystallography and solution NMR-spectroscopy. Q Rev Biophys 26:49–125 9. Coleman TF, Shalloway D, Wu Z (1993) Isotropic effective energy simulated annealing searches for low energy molecular cluster states. Comput Optim Appl 2:145–170 10. Coleman TF, Shalloway D, Wu Z (1994) A parallel buildup algorithm for global energy minimizations of molecular clusters using effective energy simulated annealing. J Global Optim 4:171–185 11. Creighton TE (1993) Proteins: structures and molecular properties. Freeman and Company, New York 12. Crippen GM, Havel TF (1988) Distance geometry and molecular conformation. Wiley, New York 13. Dattorro J (2005) Convex optimization and euclidean distance geometry. Meboo Publishing USA, Palo Alto 14. De Leeuw J (1977) Applications of convex analysis to multidimensional scaling. In: Barra JR, Brodeau F, Romier G, van Cutsem B (eds) Recent developments in statistics. NorthHolland, Amsterdam, pp 133–145 15. De Leeuw J (1988) Convergence of the majorization method for multidimensional scaling. J Classif 5:163–180 16. Dong Q, Wu Z (2002) A linear-time algorithm for solving the molecular distance geometry problem with exact inter-atomic distances. J Global Optim 22:365–375 17. Dong Q, Wu Z (2003) A geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data. J Global Optim 26:321–333 18. Eren T, Goldenberg DK, Whiteley W, Yang YR, Morse AS, Anderson BDO, Belhumeur PN (2004) Rigidity, computation, and randomization in network localization. In: Proc IEEE Infocom 2673–2684, Hong Kong 19. Floudas CA, Pardalos PM (eds)(2000) Optimization in computational chemistry and molecular biology. Nonconvex optimization and its applications, vol 40. Kluwer, The Netherlands 20. Glunt W, Hayden TL, Hong S, Wells J (1990) An alternating projection algorithm for computing the nearest euclidean distance matrix. SIAM J Matrix Anal Appl 11:589–600

21. Glunt W, Hayden TL, Raydan M (1993) Molecular conformations from distance matrices. J Comput Chem 14:114–120 22. Glunt W, Hayden TL, Raydan M (1994) Preconditioners for distance matrix algorithms. J Comput Chem 15:227–232 23. Gunther H (1995) NMR Spectroscopy: basic principles, concepts, and applications in chemistry. Wiley, New York 24. Havel TF (1991) An evaluation of computational strategies for use in the determination of protein structure from distance geometry constraints obtained by nuclear magnetic resonance. Prog Biophys Mol Biol 56:43–78 25. Havel TF (1995) Distance geometry. In: Grant DM, Harris RK (eds) Encyclopedia of nuclear magnetic resonance. Wiley, New York, pp 1701–1710 26. Hendrickson BA (1991) The molecule problem: determining conformation from pairwise distances. Ph.D. thesis. Cornell University, Ithaca 27. Hendrickson BA (1995) The molecule problem: exploiting structure in global optimization. SIAM J Optim 5:835–857 28. Huang HX, Liang ZA (2003) Pardalos PM Some properties for the euclidean distance matrix and positive semidefinite matrix completion problems. J Global Optim 25:3–21 29. Kearsley AJ, Tapia RA, Trosset MW (1998) The solution of the metric STRESS and SSTRESS problems in multidimensional scaling by Newton’s method. Comput Stat 13:369–396 30. Kostrowicki J, Piela L (1991) Diffusion equation method of global minimization: performance for standard functions. J Optim Theor Appl 69:269–284 31. Kostrowicki J, Piela L, Cherayil BJ, Scheraga HA (1991) Performance of the diffusion equation method in searches for optimum structures of clusters of Lennard-Jones atoms. J Phys Chem 95:4113–4119 32. Kostrowicki J, Scheraga HA (1992) Application of the diffusion equation method for global optimization to oligopeptides. J Phys Chem 96:7442–7449 33. Kuntz ID, Thomason JF, Oshiro CM (1993) Distance geometry. In: Oppenheimer NJ, James TL (eds) Methods in Enzymology, vol 177. Academic Press, New York, pp 159–204 34. Lavor C, Liberti L, Maculan N (2005) Grover’s algorithm applied to the molecular distance geometry problem. In: Proc. of VII Brazilian Congress of Neural Networks, Natal, Brazil 35. Lavor C, Liberti L, Maculan N (2006) Computational experience with the molecular distance geometry problem. In: Pintér J (ed) Global optimization: scientific and engineering case studies. Springer, New York, pp 213–225 36. Lavor C (2006) On generating instances for the molecular distance geometry problem. In: Liberti L, Maculan N (eds) Global optimization: from theory to implementation. Springer, Berlin, pp 405–414 37. Lavor C, Liberti L, Maculan N (2006) The discretizable molecular distance geometry problem. arXiv:qbio/0608012

Molecular Structure Determination: Convex Global Underestimation

38. Laurent M (1997) Cuts, matrix completions and a graph rigidity. Math Program 79:255–283 39. Liberti L, Lavor C, Maculan N (2005) Double VNS for the molecular distance geometry problem. In: Proc. of MECVNS Conference, Puerto de la Cruz, Spain 40. Man-Cho So A, Ye Y (2007) Theory of semidefinite programming for sensor network localization. Math Program 109:367–384 41. Moré JJ, Wu Z (1996) -Optimal solutions to distance geometry problems via global continuation. In: Pardalos PM, Shalloway D, Xue G (eds) Global minimization of non-convex energy functions: molecular conformation and protein folding. American Mathematical Society, Providence, IR, pp 151–168 42. Moré JJ, Wu Z (1996) Smoothing techniques for macromolecular global optimization. In: Di Pillo G, Gianessi F (eds) Nonlinear Optimization and Applications. Plenum Press, New York, pp 297–312 43. Moré JJ, Wu Z (1997) Global continuation for distance geometry problems. SIAM J Optim 7:814–836 44. Moré JJ, Wu Z (1997) Issues in large scale global molecular optimization. In: Biegler L, Coleman T, Conn A, Santosa F (eds) Large scale optimization with applications. Springer, Berlin, pp 99–122 45. Moré JJ, Wu Z (1999) Distance geometry optimization for protein structures. J Global Optim 15:219–234 46. Neumaier A (1997) Molecular modeling of proteins and mathematical prediction of protein structure. SIAM Rev 39:407–460 47. Palmer KA, Scheraga HA (1992) Standard-geometry chains fitted to X-ray derived structures: validation of the rigidgeometry approximation. II. Systematic searches for short loops in proteins: applications to bovine pancreatic ribonuclease A and human lysozyme. J Comput Chem 13:329–350 48. Phillips AT, Rosen JB, Walke VH (1996) Molecular structure determination by convex underestimation of local energy minima. In: Pardalos PM, Shalloway D, Xue G (eds) Global minimization of non-convex energy functions: molecular conformation and protein folding. American Mathematical Society, Providence, IR, pp 181–198 49. Piela L, Kostrowicki J, Scheraga HA (1989) The multipleminima problem in the conformational analysis of molecules: deformation of the protein energy hypersurface by the diffusion equation method. J Phys Chem 93:3339–3346 50. Pogorelov A (1987) Geometry. Mir Publishers, Moscow 51. Saxe JB (1979) Embeddability of weighted graphs in kspace is strongly NP-hard. In: Proc. of 17th Allerton Conference in Communications, Control, and Computing, 480– 489, Allerton, USA 52. Trosset M (1998) Applications of multidimensional scaling to molecular conformation. Comput Sci Stat 29:148–152 53. Wang L, Mettu RR, Donald BR (2005) An algebraic geometry approach to protein structure determination from NMR

54.

55.

56.

57.

58.

M

data. In: Proc. of the 2005 IEEE Computational Systems Bioinformatics Conference, Stanford, USA Wu D, Wu Z (2007) An updated geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data. J Global Optim 37:661–673 Wu Z (1996) The effective energy transformation scheme as a special continuation approach to global optimization with application to molecular conformation. SIAM J Optim 6:748–768 Wütrich K (1989) The development of nuclear magnetic resonance spectroscopy as a technique for protein structure determination. Acc Chem Res 22:36–44 Wütrich K (1989) Protein structure determination in solution by nuclear magnetic resonance spectroscopy. Science 243:45–50 Zou Z, Byrd RH, Schnabel RB (1997) A stochastic/ perturbation global optimization algorithm for distance geometry problems. J Global Optim 11:91–105

Molecular Structure Determination: Convex Global Underestimation ANDREW T. PHILLIPS Computer Sci. Department, University Wisconsin–Eau Claire, Eau Claire, USA MSC2000: 65K05, 90C26 Article Outline Keywords Molecular Model The Convex Global Underestimator The CGU Algorithm See also References Keywords Protein folding; Molecular structure determination; Convex global underestimation An important class of difficult global minimization problems arise as an essential feature of molecular structure calculations. The determination of a stable molecular structure can often be formulated in terms of calculating the global (or approximate global) minimum of a potential energy function (see [6]). Computing the global minimum of this function is very difficult because it typically has a very large number of local

2311

2312

M

Molecular Structure Determination: Convex Global Underestimation

minima which may grow exponentially with molecule size. One such application is the well known protein folding problem. It is widely accepted that the folded state of a protein is completely dependent on the onedimensional linear sequence (i. e., ‘primary’ sequence) of amino acids from which the protein is constructed: external factors, such as enzymes, present at the time of folding have no effect on the final, or native, state of the protein. This led to the formulation of the protein folding problem: given a known primary sequence of amino acids, what would be its native, or folded, state in threedimensional space. Several successful predictions of folded protein structures have been made and announced before the experimental structures were known (see [3,9]). While most of these have been made with a blend of a human expert’s abilities and computer assistance, fully automated methods have shown promise for producing previously unattainable accuracy [2]. These machine based prediction strategies attempt to lessen the reliance on experts by developing a completely computational method. Such approaches are generally based on two assumptions. First, that there exists a potential energy function for the protein; and second that the folded state corresponds to the structure with the lowest potential energy (minimum of the potential energy function) and is thus in a state of thermodynamic equilibrium. This view is supported by in vitro observations that proteins can successfully refold from a variety of denatured states. Evolutionary theory also supports a folded state at a global energy minimum. Protein sequences have evolved under pressure to perform certain functions, which for most known occurrences requires a stable, unique, and compact structure. Unless specifically required for a certain function, there was no biochemical need for proteins to hide their global minimum behind a large kinetic energy barrier. While kinetic blocks may occur, they should be limited to special proteins developed for certain functions (see [1]). Molecular Model Unfortunately, finding the ‘true’ energy function of a molecular structure, if one even exists, is virtually impossible. For example, with proteins ranging in size

up to 1, 053 amino acids (a collagen found in tendons), exhaustive conformational searches will never be tractable. Practical search strategies for the protein folding problem currently require a simplified, yet sufficiently realistic, molecular model with an associated potential energy function representing the dominant forces involved in protein folding [4]. In a one such simplified model, each residue in the primary sequence of a protein is characterized by its backbone components NH  C˛ H  C0 O and one of 20 possible amino acid sidechains attached to the central C˛ atom. The three-dimensional structure of the chain is determined by internal molecular coordinates consisting of bond lengths l, bond angles , sidechain torsion angles , and the backbone dihedral angles , , and !. Fortunately, these 10r  6 parameters (for an r-residue structure) do not all vary independently. Some of these (7r  4 of them) are regarded as fixed since they are found to vary within only a very small neighborhood of an experimentally determined value. Among these are the 3r  1 backbone bond lengths l, the 3r  2 backbone bond angles , and the r  1 peptide bond dihedral angles ! (fixed in the trans conformation). This leaves only the r sidechain torsion angles , and the r  1 backbone dihedral angle pairs (, ). In the reduced representation model presented here, the sidechain angles are also fixed since sidechains are treated as united atoms (see below) with their respective torsion angles fixed at an ‘average’ value taken from the Brookhaven Protein Databank. Remaining are the r  1 backbone dihedral angles pairs. These also are not completely independent; they are severely constrained by known chemical data (the Ramachandran plot) for each of the 20 amino acid residues. Furthermore, since the atoms from one C˛ to the next C˛ along the backbone can be grouped into rigid planar peptide units, there are no extra parameters required to express the three-dimensional position of the attached O and H peptide atoms. Hence, these bond lengths and bond angles are also known and fixed. Another key element of this simplified polypeptide model is that each sidechain is classified as either hydrophobic or polar, and is represented by only a single ‘virtual’ center of mass atom. Since each sidechain is represented by only the single center of mass ‘virtual atom’ Cs , no extra parameters are needed to define the position of each sidechain with respect to the backbone

Molecular Structure Determination: Convex Global Underestimation

mainchain. The twenty amino acids are thus classified into two groups, hydrophobic and polar, according to the scale given by S. Miyazawa and R.L. Jernigan [7]. Corresponding to this simplified polypeptide model is a simple energy function. This function includes four components: a contact energy term favoring pairwise hydrophobic residues, a second contact term favoring hydrogen bond formation between donor NH and acceptor C0 = O pairs, a steric repulsive term which rejects any conformation that would permit unreasonably small interatomic distances, and a main chain torsional term that allows only certain preset values for the backbone dihedral angle pairs (, ). Since the residues in this model come in only two forms, hydrophobic and polar, where the hydrophobic monomers exhibit a strong pairwise attraction, the lowest free energy state involves those conformations with the greatest number of hydrophobic ‘contacts’ [4] and intrastrand hydrogen bonds. Simplified potential functions have been successful in [10,11], and [12]. Here we use a simple modification of the energy function from [11].

One practical means for finding the global minimum of the polypeptide’s potential energy function is to use a convex global underestimator to localize the search in the region of the global minimum. The idea is to fit all known local minima with a convex function which underestimates all of them, but which differs from them by the minimum possible amount in the discrete L1 norm. The minimum of this underestimator is used to predict the global minimum for the function, allowing a more localized conformer search to be performed based on the predicted minimum. More precisely, given an r-residue structure with n = 2r  2 backbone dihedral angles, denote a conformation of this simplified model by  2 Rn , and the corresponding simplified potential energy function value by F(). Then, assuming that k  2n + 1 local minimum conformations  (j) , for j = 1, . . . , k, have been computed, a convex quadratic underestimating function U() is fitted to these local minima so that it underestimates all the local minima, and normally interpolates F( (j) ) at 2n + 1 points. This is accomplished by determining the coefficients in the function U() so that ı j D F( ( j) )  U( ( j) )  0

P for j = 1, . . . , k, and where njD1 ı j is minimized. That is, the difference between F() and U() is minimized in the discrete L1 norm over the set of k local minima  (j) , j = 1, . . . , k. Of course, this ‘underestimator’ only underestimates known local minima. The specific underestimating function U() used in this convex global underestimator (CGU) method is given by

U() D c0 C

(1)

n  X iD1

 1 2 c i i C di i : 2

(2)

Note that ci and di appear linearly in the constraints of (1) for each local minimum  (j) . Convexity of this quadratic function is guaranteed by requiring that di  0 for i = 1, . . . , n. Other linear combinations of convex functions could also be used, but this quadratic function is the simplest. Additionally, in order to guarantee that U() attains its global minimum U min in the hyperrectangle H D f i : 0   i   i   i  2 g, an additional set of constraints are imposed on the coefficients of U(): (

The Convex Global Underestimator

M

c i C  i d i  0; c i C  i d i  0;

i D 1; : : : ; n:

(3)

Note that the satisfaction of (3) implies that ci  0 and di  0 for i = 1, . . . , n. The unknown coefficients ci , i = 0, . . . , n, and di , i = 1, . . . , n, can be determined by a linear program which may be considered to be in the dual form. For reasons of efficiency, the equivalent primal of this problem is actually solved, as described below. The solution to this primal linear program provides an optimal dual vector, which immediately gives the underestimating function coefficients ci and di . Since the convex quadratic function U() gives a global approximation to the local minima of F(), then its easily computed global minimum function value U min is a good candidate for an approximation to the global minimum of the correct energy function F(). An efficient linear programming formulation and solution satisfying (1)–(3) will now be summarized. Let f (j) = F( (j) ), for j = 1, . . . , k, and let f 2 Rk be the vector with elements f (j) . Also let ! (j) 2 Rn be the vector with ( j) elements 12 ( i )2 , i = 1, . . . , n, and let ek 2 Rk be the vector of ones. Now define the following two matrices

2313

2314

M

Molecular Structure Determination: Convex Global Underestimation

˚ 2 R(n+1)×k and ˝ 2 Rn×k : 8 ! ˆ e> ˆ k

k ı  such that 0 1 1 0 > f ˚ ˝> 0 0 1 c B f C B˚ > ˝ > I k C C @d A  B C ; B (5) @ 0 A @ I0 D 0 A n ı 0 D 0 I 0n where D D diag( 1 ; : : : ;  n ), D D diag( 1 ; : : : ;  n ), I k is the identity matrix of order k, and I 0 n is the n × (n + 1) ‘augmented’ matrix (0 : I n where I n is the identity matrix of order n. Since the matrix in (5) has more rows than columns (2(k + n) rows and k + 2n + 1 columns, where k  2n + 1), it is computationally more efficient to consider it as a dual problem, and to solve the equivalent primal. After some simple transformations, this primal problem reduces to: 8 ˆ min f > y1  f > e k ˆ ˆ 0 1 ˆ ˆ ˆ ! ! y ˆ < 0> 0> B 1 C ˚ I n I n B C ˚ ek s.t. (6) y2 D ˆ ˝ D D @ A ˝ ek ˆ ˆ ˆ y3 ˆ ˆ ˆ : y1 ; y2 ; y3  0 which has only 2n + 1 rows and k + 2n  4n + 1 columns, and the obvious initial feasible solution y1 = ek and y2 = y3 = 0. Furthermore, since the first of the 2n + 1 constraints in (6) in fact requires that e> k y1 = 1, then the function f | y1  f | ek is also bounded below, and so this primal linear program always has an optimal solution. This optimal solution gives the values of c, d, and ı via the dual vectors, and also determines which values of f (j) are interpolated by the potential function U(). That is, the basic columns in the optimal solution to (6) correspond to the conformations  (j) for which F( (j) ) = U( (j) ).

Note that once an optimal solution to (6) has been obtained, the addition of new local minima is very easy. It is done by simply adding new columns to ˚ and ˝, and therefore to the constraint matrix in (6). The number of primal rows remains fixed at 2n + 1, independent of the number k of local minima. The convex quadratic underestimating function U() determined by the values c 2 Rn+1 and d 2 Rn now provides a global approximation to the local minima of F(), and its easily computed global minimum point  min is given by ( min )i =  ci /di , i = 1, . . . , n, with corresponding function value U min given by U min = c0 P  niD1 c2i /di . The value U min is a good candidate for an approximation to the global minimum of the correct energy function F(), and so  min can be used as an initial starting point around which additional configurations (i. e., local minima) should be generated. These local minima are added to the constraint matrix in (6) and the process is repeated. Before each iteration of this process, it is necessary to reduce the volume of the hyperrectangle H  over which the new configurations are produced so that a tighter fit of U() to the local minima ‘near’  min is constructed. The rate and method by which the hyperrectangle size is decreased, and the number of additional local minima computed at each iteration must be determined by computational testing. But clearly the method depends most heavily on computing local minima quickly and on solving the resulting linear program efficiently to determine the approximating function U() over the current hyperrectangle. If Ec is a cutoff energy, then one means for decreasing the size of the hyperrectangle H at any step is to let H = {: U()  Ec }. To get the bounds of H, consider U()  Ec where U() satisfies (2). Then limiting  i requires that n  X iD1

1 c i  i C d i  i2 2

  E c  c0 :

(7)

As before, the minimum value of U() is attained when  i = ci /di , i = 1, . . . , n. Assigning this minimum value to each  i , except  k , then results in 1 1 X c 2i ˇk : c k  k C d k  k2  E c  c0 C 2 2 di i¤k

(8)

Molecular Structure Determination: Convex Global Underestimation

The lower and upper bounds on  k , k = 1, . . . , n, are given by the roots of the quadratic equation 1 c k  k C d k  k2 D ˇ k : 2

(9)

Hence, these bounds can be used to define the new hyperrectangle H in which to generate new configurations. Clearly, if Ec is reduced, the size of H is also reduced. At every iteration the predicted global minimum value U min satisfies U min  F(  ), where   is the smallest known local minimum conformation. Therefore, Ec = F(  ) is often a good choice. If at least one improved point , with F() < F(  ), is obtained in each iteration, then the search domain H will strictly decrease at each iteration, and may decrease substantially in some iterations.

M

new local minimum conformations as columns to the matrices ˚ and ˝. 7) Return to step 2. The number of new local minima to be generated in step 6 is unspecified since there is currently no theory to guide this choice. In general, a value exceeding 2n + 1 would be required for the construction of another convex quadratic underestimator in the next iteration (step 2). In addition, the means by which the volume of the hyperrectangle H is reduced in step 5 may vary. One could use the two roots of (7) to define the new bounds of H. Another method would be simply to use H = { i : ( min )i  ı i   i  ( min )i + ı i } where ı i = |( min)i  (  )i |, i = 1, . . . , n. For complete details of the CGU method and its computational results, see [5,8]. See also

The CGU Algorithm Based on the preceding description, a general method for computing the global, or near global, energy minimum of the potential energy function F() can now be described. 1) Compute k  2n + 1 distinct local minima  (j) , for j = 1, . . . , k, of the function F(). 2) Compute the convex quadratic underestimator function given in (2) by solving the linear program given in (6). The optimal solution to this linear program gives the values of c and d via the dual vectors. 3) Compute the predicted global minimum point  min given by ( min )i = ci /di , i = 1, . . . , n, with corresponding function value U min given by U min = c0  Pn 2 iD1 c i /(2di ). 4) If  min =   , where   = argmin{F( (j) ): j = 1, 2, . . . } is the best local minimum found so far, then stop and report   as the approximate global minimum conformation. 5) Reduce the volume of the hyperrectangle H over which the new configurations will be produced, and remove all columns from ˚ and ˝ which correspond to the conformations which are excluded from H. 6) Use  min as an initial starting point around which additional local minima  (j) of F() (restricted to the hyperrectangle H) are generated. Add these

 Adaptive Simulated Annealing and its Application to Protein Folding  Genetic Algorithms  Global Optimization in Lennard–Jones and Morse Clusters  Global Optimization in Protein Folding  Monte-Carlo Simulated Annealing in Protein Folding  Multiple Minima Problem in Protein Folding: ˛BB Global Optimization Approach  Packet Annealing  Phase Problem in X-ray Crystallography: Shake and Bake Approach  Protein Folding: Generalized-ensemble Algorithms  Simulated Annealing  Simulated Annealing Methods in Protein Folding References 1. Abagyan RA (1993) Towards protein folding by global energy optimization. Federation of Europ Biochemical Soc: Lett 325:17–22 2. Androulakis IR, Maranas CD, Floudas CA (1997) Prediction of oligopeptide conformations via deterministic global optimization. J Global Optim 11:1–34 3. Benner SA, Gerloff DL (1993) Predicting the conformation of proteins: man versus machine. Federation of Europ Biochemical Soc: Lett 325:29–33 4. Dill KA (1990) Dominant forces in protein folding. Biochemistry 29(31):7133–7155

2315

2316

M

Monotonic Optimization

5. Dill KA, Phillips AT, Rosen JB (1997) Protein structure and energy landscape dependence on sequence using a continuous energy function. J Comput Biol 4(3):227–239 6. Merz K, Grand S Le (1994) The protein folding problem and tertiary structure prediction. Birkhäuser, Basel 7. Miyazawa S, Jernigan RL (1993) A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng 6:267–278 8. Phillips AT, Rosen JB, Walke VH (1995) Molecular structure determination by global optimization. In: Pardalos PM, Xue GL, Shalloway D (eds) DIMACS. Amer Math Soc, Providence, pp 181–198 9. Richards FM (1991) The protein folding problem. Scientif Amer:54–63 10. Srinivasan R, Rose GD (1995) LINUS: A hierarchic procedure to predict the fold of a protein. PROTEINS: Struct Funct Genet 22:81–99 11. Sun S, Thomas PD, Dill KA (1995) A simple protein folding algorithm using binary code and secondary structure constraints. Protein Eng 8(8):769–778 12. Yue K, Dill KA (1996) Folding proteins with a simple energy function and extensive conformational searching. Protein Sci 5:254–261

Monotonic Optimization SAED ALIZAMIR Department of Industrial and Systems Engineering, University of Florida, Gainesville, USA MSC2000: 90C26, 65K05, 90C30 Article Outline Introduction Normal Sets and Polyblocks Normal Sets Polyblocks

Solution Method Generalizations Optimization of the Difference of Monotonic Functions Discrete Monotonic Optimization

Applications Conclusions References Introduction The role of convexity in optimization theory has increased significantly over the last few decades. Despite this fact, a wide variety of global optimization problems

are usually encountered in applications in which nonconvex models need to be tackled. For this reason, developing solution methods for specially structured nonconvex problems has become one of the most active areas in recent years. Although these problems are difficult by their nature, promising progress is achieved for some special mathematical structures. Among the solution methods developed for these special structures, monotonic optimization, first proposed by Tuy [9], is presented in this study. Problems of optimizing monotonic functions of n variables under monotonic constraints arise in the mathematical modeling of a broad range of real-world systems, including in economics and engineering. The original difficulties of these problems can be reduced by a number of principles derived from their monotonicity properties. For example, in nonconvex problems in general, a solution which is known to be feasible or even locally optimal, does not provide any information about global optimality and the search should be continued on the entire feasible space, while for an increasing objective function, a feasible solution like z, would exclude n from the search procedure (for a minthe cone z C RC imization objective function). In a similar way, if g(x) in a constraint like g(x)  0 is increasing, then by knowing that z is infeasible for this constraint, the whole cone n can be discarded from further consideration. z C RC This kind of information would obviously restrict the search space and may result in more efficient solution methods. To formally present the general framework of the monotonic optimization problem, consider two vectors x; x 0 2 R n . We say x 0  x (x 0 dominates x) if x 0i  x i 8i D 1; : : : ; n. We say x 0 > x (x 0 strictly dominates x) if x 0i > x i 8i D 1; : : : ; n. Let n n D fx 2 R n jx  0g and RCC D fx 2 R n jx > 0g. If RC n a; b 2 R and a  b, we define the box a; b as the n set  of all x 2 R such that a  x b. Similarly, let a; b) D fxja  x < bg and (a; b D fxja < x  bg. A f : R n ! R is called increasing on a box  function  n a; b 2 R if f (x)  f (x 0 ) for a  x  x 0  b. A function f is called decreasing if –f is increasing. Any increasing or decreasing function is referred to as monotonic. It can be easily shown that the pointwise supremum of a bounded-above family of increasing functions and the pointwise infimum of a boundedbelow family of increasing functions are increasing.

Monotonic Optimization

M

˚ n jg(x)  1 . It can be shown that as the set G D x 2 RC the level set of an increasing function is a normal set and it is closed if the function is lower semicontinuous. Maximize (minimize) f (x) n jx 0i > Define I(x) D fijx i D 0g, ˚K x D fx 0 2 R C n subject to g i (x)  1 8i D 1; : : : ; m1 ; x i 8i … I(x)g, and clK x D x 0 2 RC jx 0  x . Then n h j (x)  1 8 j D 1; : : : ; m2 ; a point y 2 RC is called an upper boundary point of n nG. a bounded normal set G if y 2 clG while K y  RC n ; x 2 RC The set of upper boundary points of G is called the up(1) per boundary of G and is denoted by @+ G. For a compact normal set G  [0; b] with in which f (x), g i (x), and hj (x) are increasing functions n n n f0g, the nonempty interior and for every point z 2 RC on R . A more general definition of this problem is pre+ G at a unique point half line from 0 through z meets @ sented in Sect. “Normal Sets and Polyblocks”. Heuristi(z), which is defined as

(z) D z, denoted by

G G cally, f (x) may be a cost function (profit function for the maximize problem), g i (x) may express some resource  D max f˛ > 0j˛z 2 Gg. n is called a reverse normal set (also A set H  RC availability constraints, while hj (x) may be a family of known as conormal) if x 0  x and x 2 H implies utility functions which have to take a value at least as 0 x 2 H. A reverse normal set in a box [0, b] is defined as big as a goal. n 0 The remainder of this article is organized as follows. a set like H 2 RC for which 0  x  x  b and x 2 H 0 We first describe the theory of normal sets and poly- implies x 2 H. As before, rN[D] is the smallest reverse n blocks in Sect. “Normal Sets and Polyblocks”. Mono- normal set containing D  RC and˚ is called a reverse n tonic optimization algorithms are presented in Sect. normal hull of set D. Define H D x 2 RC jh(x)  1 “Solution Method”. Section “Generalizations” contains for the increasing function h(x). Then it can be shown two generalizations of monotonic optimization. Differ- that H is reverse normal and it is closed if h(x) is upper ent class of applications for which monotonic optimiza- semicontinuous. n is called a lower boundary point of A point y 2 RC tion is adapted are discussed in Sect. 5 and finally cona reverse normal set H if y 2clH and x … H 8x < y. clusions are made in Sect. “Conclusions”. The set of lower boundary points of H is called the lower boundary of H and is denoted by @ H. Normal Sets and Polyblocks For the closed reverse normal set H and b 2 The theory of normal sets and polyblocks is the under- intH and every point z 2 [0; b]nH, the half line lying principle for monotonic optimization. In this sec- from b through z meets @ H at a unique point tion, the definitions are presented as well as the main  (z), which is defined as  (z) D b C (z  b), H H concepts and properties to help the reader to under-  D max > 0jb C ˛(z  b) 2 Hg. f˛ stand the upcoming algorithms. For more details and Now consider the set of constraints imposed by proofs see [5,9,10]. increasing functions g i (x) and hj (x) in problem (1). The feasible space characterized by these sets of conNormal Sets straints can properly be presented by normal sets and n n is called normal if for any two points reverse˚ normal sets. Define the sets G; H  RC as A set G  RC n n 0 0 0 x; x 2 RC such that x  x > x 2 G implies x 2 G. ˚G D x 2 RC jg i (x)  1 8i D 1; : : : ; m1 and H D n n x 2 RC , the set N[D], which is called jh j (x)  1 8i D 1; : : : ; m2 . Then by the baGiven any set D  RC the normal hull of D, is the smallest normal set contain- sic properties of normal and reverse normal sets which ing D. In other words, N[D] can be interpreted as the were described above, G is the intersection of a finite intersection of all normal sets that contain D. The in- number of normal sets which is normal. In a similar tersection and the union of a family of normal sets are way, H is the intersection of a finite number of reverse n we normal sets which is reverse normal. Now we can redenormal. If the normal set contains a point u 2 RCC say it has a nonempty interior. Suppose that g(x) is an fine the fundamental problem of monotonic optimizaincreasing function over Rn+ . Define the level set of g(x) tion, also called the canonical monotonic optimization In monotonic optimization, the following problem is considered:

2317

2318

M

Monotonic Optimization

problem, as optimizing a monotonic function on the intersection of a family of normal and reverse normal sets as follows: Maximize (minimize) subject to

f (x) x 2G\H;

(2)

n in which G  [0; b]  RC is a compact normal set, H is a close reverse normal set, and f (x) is an increasing function on [0, b]. Tuy [9] proved that if G has a nonempty interior (if b 2 intH), then the maximum (minimum) of f (x) over G \ H, if it exists, is attained on @C G \ H (G \ @ H). On the basis of this essential result, it can be shown that for every n , max f f (x)jx 2 Dg D arbitrary compact set D  RC max f f (x)jx 2 N[D]g. Analogously, for the minimization version of the objective function, for any arn , we have min f f (x)jx 2 Eg D bitrary set E  RC min f f (x)jx 2 rN[E]g. It is worth mentioning that the minimization problem can be converted to the maximization case by making a simple set of transformations. So it can be either transformed to the maximization problem or treated separately.

Polyblocks The role of polyblocks in monotonic optimization is the same as that of the polytope in convex optimization. As the polytope is the convex hull of finitely many points in Rn , a polyblock is the normal hull of finitely n is a polyblock in many points in Rn + . A set P  RC n [a; b]  RC if it is the union of a finite number of boxes [a, z], z 2 T  [a; b]. The set T is called the vertex set of the polyblock. We call the vertex z 2 T a proper vertex if z … [0; z0 ] 8z0 2 Tn fzg, i. e., by removing the vetex z from T, the new polyblock created by T is not equivalent to P. A vertex which is not proper is called an improper vertex. A polyblock can be defined by the set of its proper vertices. A polyblock is a closed normal set and the intersection of a set of polyblocks is again a polyblock. Now suppose that x 2 [a; b] and consider the set P D [a; b]n(x; b]. Then it is easy to verify that P is a polyblock with vertices z i D b C (x  b)e i ; 8i D 1; : : : ; n in which ei is the ith unit vector. Using this property, we can approximate an arbitrary compact n (with any desired accuracy) by normal set ˝  RC

a nested sequence of polyblock approximation. At each iteration, a point x … ˝ is found and a new polyblock is constructed based on that which is a subset of the previous polyblock but still contains the set ˝. To present the main idea of the polyblock approximation method in monotonic optimization, we need one more result on optimizing an increasing function over a polyblock. Tuy [9] proved that the increasing function f (x) achieves its maximum over a polyblock at a proper vertex. Now consider the problem of maximizing the increasing function f (x) over the arbitrary compact n . As mentioned before, we can substiset ˝  RC tute ˝ by its normal hull. So without loss of generality, we assume that ˝ is normal. The idea is to construct a nested sequence of polyblock outer approximation P1  P2  : : :  ˝ in such a way that maxf f (x)jx 2 Pk g & maxf f (x)jx 2 ˝g. At iteration k, assume zk is the proper vertex of Pk which maximizes f (x), i. e., z k D arg maxf f (z)jz 2 Tk g, where T k is the set of proper vertices of Pk . Then if zk is feasible in ˝, the initial feasible space, it also solves the problem. Otherwise, we are interested in a new polyblock PkC1  Pk nfz k g which still contains ˝ as a subset. To obtain Pk+1 from Pk , the box [0; z k ] is replaced by [0; z k ]nK x k , in which xk is defined as (zk ). MathS ematically, PkC1 D ([0; z k ]nK x k ) z2Tk nfz k g [0; z], which clearly satisfies the desired property of ˝  PkC1  Pk nfz k g. The vertex set of the established polyblock Pk+1 , denoted by V k+1 , contains the proper vertices of Pk excluding zk and a set of n new vertices, z k; 1 ; z k; 2 ; : : : ; z k; n , defined as z k; i D z k C (x ik  z ik )e i . This result is directly followed by the earliermentioned property of polyblocks about the vertices of [a; b]n(x; b]. Finally, the proper vertex set of Pk+1 , T k+1 , is obtained from V k+1 by removing its improper vertices [9,10]. n is called a reverse polyblock in A set P  RC [0, b] if it is the union of a finite number of boxes [z; b]; z 2 T; T  [0; b]. The set T is called the vertex set of the reverse polyblock. As before, z is a proper vertex if by removing it from T, the new reverse polyblock created by T is not equivalent to P. A reverse polyblock can be defined by the set of its proper vertices. An increasing function f (x) achieves its minimum over a re-

Monotonic Optimization

verse polyblock at a proper vertex. Similar results to what we had for polyblocks can be developed for reverse polyblocks in the very same way. For more details see [9,10]. Solution Method Consider problem (2) (in the maximization form) as discussed in Sect. “Normal Sets and Polyblocks” with the additional assumptions that f (x) is semicontinn . The latter assumpuous on H and G \ H  RCC tion implies the existence of a vector a such that 0 < a  x; 8x 2 G \ H. Let H a D fx 2 Hjx  ag. For  0 as a given tolerance, the solution x 0 is called -optimal if f (x 0 )  f (x)  ; 8x 2 G \ H. We attempt to design an algorithm which is capable of finding an -optimal solution for any given . Obviously, b 2 H because otherwise the problem is infeasible. Let P1 D [0; b] be the initial polyblock and T1 D fbg its corresponding proper vertex set. If we apply the polyblock approximation method described in Sect. “Normal Sets and Polyblocks” to this problem, at each iteration k, Pk and its proper vertex set, T k , are obtained from the last iteration. We should notice that every vertex z 2 Tk nH a can be removed since they do not belong to the initial feasible space. Also suppose that f (xk ) is the best value found for the objective function so far. Then any vertex z for which f (z)  f (x k ) C is discarded because no -optimal solution happens to be in box [0, z]. These two rules can be applied at each iteration to refine the proper vertex set T k and delete some of the vertices from further consideration. If Tk D ; in some iteration k, it means there is no solution x for which f (x) > f (x k ) C . So, xk , the best solution found so far, is -optimal and the procedure terminates. Otherwise, let z k D arg max f f (z)jz 2 Tk g. If zk is feasible in G \ H, it solves the problem. Since z k 2 H is always true, it is feasible if it belongs to G and infeasible otherwise. In the case of infeasibility, we find x k D G (z k ) and construct the polyblock Pk+1 as described in Sect. “Normal Sets and Polyblocks” which excludes zk while still containing a global optimal solution of the problem. This procedure is repeated until the termination criteria are satisfied or the problem is known to be infeasible. This procedure, first proposed by Tuy [9], is called the polyblock algorithm. Tuy [9] discussed the convergence of this method and showed that

M

as k ! 1, the sequence xk converges to a global optimal solution of the problem. Now consider the minimization case of problem (2) in Sect. “Normal Sets and Polyblocks” with additional assumptions that f (x) is semicontinuous on G and there exists a vector c such that 0 < c < b and 0  x  c; 8x 2 G \ H. A nested sequence of reverse polyblock outer approximation of G \ H (or a subset of G \ H in which the existence of at least one optimal solution is guaranteed) is called the reverse polyblock algorithm (copolyblock algorithm) which is devised to solve this problem [9]. The polyblock approximation algorithm works properly for relatively small dimension n, typically n D 10. However, the algorithm converges slowly as it gets closer to the global optimal solution and needs a large number of iterations even for a value of n as small as 5. Tuy et al. [12] presented two main reasons for this drawback of the algorithm. First, the speed of convergence depends on the way in which we construct the current polyblock from the previous one. Obviously, we prefer to remove a larger portion of the previous polyblock to have a smaller search space and a higher speed of convergence. This goal is achieved by employing more complex rules of constructing the polyblocks, which imposes some additional computational effort. The second source of the slowness of the algorithm is how it selects the solution xk in each iteration. These solutions are basically derived from the monotonicity properties of the problem, while sometimes there may exist some amount of convexity which can be used to speed up the algorithm. Tuy and Al-Khayyal [11] introduced the concept of reduced box and reduced polyblock. It involves tightening the box in which we are interested to find the upper bound of f (x), in such a way that the reduced box still contains an optimal solution of the problem. Then based on that, a new procedure is developed to produce tighter polyblocks. They also redefined the proper vertex set of polyblocks in the algorithm and suggested that instead of selecting xk as the last point of G on the halfline from a through zk , as the original algorithm does, a more complex way can be implemented by incorporating some of the convexity properties of the problem. This˚is by solving the convex relaxation of the problem max f (x)jx 2 G \ H; x 2 [a; z k ] which gives us an upper bound of f (x) over the feasible solu-

2319

2320

M

Monotonic Optimization

tion x in box [a, xk ]. Similar ideas were applied to the reverse polyblock algorithm as well. Using these two new modifications and improvements, they developed new algorithms and discussed their convergence properties, namely, the revised polyblock algorithm and the revised reverse polyblock (copolyblock) algorithm. Most of the outer approximation procedures, including the polyblock algorithm, encounter storage and numerical problems while solving problems in high dimensions. By using branch-and-bound strategies, one can tackle these difficulties. Bounding is performed on the basis of the polyblock approximation. As before, monotonicity cuts and convex relaxation can be combined to enhance the quality of the bounds in the corresponding portion of the feasible space. In this branch-and-bound approach, branching is performed as partitioning the feasible space into cones pairwise having no common interior point. The logic behind using conical partitioning instead of rectangular partitioning is the fact that the optimal solution of the monotonic optimization problem, as discussed before, is always achieved on the upper boundary of the feasible normal set. Using conical partitioning is more efficient and less expensive in terms of the computational time. The algorithm starts with initial cone Rn+ and partitions it into subcones. For each of these subcones, an upper bound for the value of the objective function over the feasible solutions contained in it is derived. Those cones which are known to not contain an optimal solution are fathomed and the remaining ones are subdivided again and the procedure is repeated until the termination criteria are satisfied. Among the remaining cones, the one having the maximal bound is the first candidate for branching. This algorithm, suggested by Tuy and Al-Khayyal [11], is called the conical algorithm. For those problems having partial monotonicity and partial convexity, this branch-and-bound scheme can be extended to devise a more general method. In this method, branching is performed on the nonconvex variables and bounds are computed by Lagrangian or convex relaxation [6]. To further exploit the monotonic structure of the problem, reduction cuts are combined with original monotonicity cuts and a more efficient method is developed [13]. This method creates branch-and-cut algorithms to solve monotonic optimization problems by systematic use of these cuts.

Finally, it is worth mentioning that a new concept of the essential -optimal solution can be applied to monotonic optimization problems. The advantage of the method developed on the basis of this concept is the finding of an approximate optimal solution which is more appropriate and more stable than that which is found by the -optimal method. For details see [8]. Generalizations The essential approach used in monotonic optimization can be further generalized to cover a wider class of non-convex general optimization problems. Among these generalizations, optimization of the difference of monotonic functions and discrete monotonic optimization are presented here. Optimization of the Difference of Monotonic Functions The underlying idea of monotonic optimization can be extended to deal with problems including the differn ! R is ence of monotonic functions. A function f : RC said to be a difference of monotonic functions if it is representable as the difference of two increasing funcn n ! R and f 2 : RC ! R. Similar to functions: f1 : RC tions presented as the difference of convex functions, the class of difference of monotonic functions is a linear space. The pointwise minimum and pointwise maximum of a family of difference of monotonic functions (difference of convex functions) is still a difference of monotonic functions (difference of convex functions). The linear combination of a set of difference of monotonic functions is a difference of monotonic functions. Obviously, any polynomial function can be presented as the difference of two increasing functions, the first one includes all terms having positive coefficients and the second one includes all terms having negative coefficients. Consider the problem: Maximize (minimize)

f (x)  g(x)

subject to

x 2G\H;

(3)

in which G and H are as before and f (x) and g(x) are increasing functions on [0, b]. Tuy [9] extended the original polyblock algorithm to solve this problem. By introducing t as the difference between g(b)

Monotonic Optimization

and g(x) for x 2 [0; b] and regarding the fact that t is always positive owing the function g(x) being increasing, we rewrite the model as (maximization case) maxf f (x) C t  g(b)jx 2 G \ H; t D g(b)  g(x)g. Now g(b) is a constant and can be removed from the objective function. In the resulting problem, max f f (x) C tjx 2 G \ H; 0  t  g(b)  g(x)g, consider the set of constraints. By incrementing the dimension of the problem by one, the feasible space can be presented as D \ E such that D D f(x; t)jx 2 G; t C g(x)  g(b); 0  t  g(b)  g(0)g and E D f(x; t)jx 2 H; 0  t  g(b)  g(0)g. It is easy to verify that D is a normal set and H is a reverse normal set in the box [0; b]  [0; g(b)  g(0)]. Also the function F(x; t) D f (x) C t is an increasing function on [0; b]  [0; g(b)  g(0)]. So problem (3) is reduced to problem (2) in Sect. “Normal Sets and Polyblocks” and can be treated by the original polyblock algorithm. The additional cost that the presence of difference of monotonic functions has incurred is the dimension of the problem incremented by one. For the minimization case of problem (3), a similar transformation can be applied to convert this problem to the minimization case of problem (2). To make the problem even more general, suppose that all constraints are also difference of monotonic functions. Specifically, consider the problem: Maximize (minimize) subject to

f1 (x)  f 2 (x) g i (x)  h i (x)  0 8i D 1; : : : ; m ;

(4)

n ; x 2 ˝  [0; b]  RC

in which f 1 (x), f 2 (x), g i (x), and hi (x) are increasing functions and ˝ is a normal set. By the above argument, first we can make a proper transformation and convert the objective function to an increasing function. So without loss of generality, let us assume that f 2 (x) D 0. Now consider the set of m constraints. This set of constraints can be rewritten as max i fg i (x)  h i (x)g  0. Since the pointwise maximum of a family of difference of monotonic functions is still a difference of monotonic functions, we can represent the space imposed by these constraints by g(x)  h(x)  0, where both g(x) and h(x) are increasing. By introducing the new variable t  0 and assuming g(b)  0 (this assumption is not restrictive), the set

M

of the following two constraints fully defines the space mentioned: g(x) C t  g(b), h(x) C t  g(b). The first constraint gives us the upper bound of g(b)  g(0) for t. Finally the problem reduces to (maximization case): maxf f 1(x)jg(x) C t  g(b); h(x) C t  g(b); x 2 ˝; 0  t  g(b)  g(0)g. This problem is the same as problem (2) by defining G D f(x; t)j x 2 ˝; g(x) C t  g(b); 0  t  g(b)  g(0)g, which is a subset of the box [0; b]  [0; g(b)  g(0)] nC1 . and H D f(x; t)jh(x) C t  g(b)g is defined in RC Increasing the dimension of the problem is the main drawback of the above mentioned approach. Tuy and Al-Khayyal [11] presented a direct approach for the difference of monotonic functions optimization problem requiring no additional dimension. This method is referred to as the branch-reduce-and-bound (BRB) algorithm. As the name of the algorithm suggests, it contains three main steps, which are branching upon nonconvex variables, reducing any partition set before bounding, and bounding over each partition set. The branching phase is performed by rectangular subdivision. Every box is divided into two subboxes by a hyperplane. The reduction phase contains a set of operations by which the box [p, q] is tightened without losing any feasible solution. This is called a proper reduction of [p,q]. This approach takes advantage of the monotonicity properties of the problem and increases the rate of convergence in the algorithm. In the bounding phase, for a properly reduced box [p; q], an upper bound like ˇ is obtained such that ˇ  maxf f1 (x)  f 2 (x)jg i (x)  h i (x)  0; 8i D 1; : : : ; m; x 2 [p; q]g. As mentioned before, stronger bounds are obtained by a sequence of polyblock approximations or by combining monotonicity with convexity present in the problem. Furthermore, more complex methods can be applied to improve the quality of the bounds in the bounding phase. Discrete Monotonic Optimization A class of monotonic optimization problems containing the additional discrete constraints are called discrete monotonic optimization problems. Specifically, given a finite set S of points in the box [a,b], the constraint x 2 S is added to the model. So the problem can be represented as max f f (x)jx 2 G \ H \ Sg (all the assumptions are as in problem (2).

2321

2322

M

Monotonic Optimization

The original polyblock algorithm is not practical for these problems. Since the polyblock algorithm is an iterative procedure, it does not have the capability to produce the optimal solution in a finite number of iterations. However, by making suitable modifications, one can use this algorithm to obtain the exact optimal solution of the problem in a finite number of steps [1,14]. In the new method, monotonicity cuts are adjusted on the basis of a special procedure to cope with discrete requirements. This adjustment consists in updating the vertex of the monotonicity cut by pushing it deeper inside the polyblock to obtain a tighter space while keeping all discrete points which are not proven to be nonoptimal, unaffected. The algorithm first constructs the normal hull of ˜ and then tries to solve the probG \ S, denoted by G, ˚ lem max f (x)jx 2 G˜ \ H in continuous space. This method is called the discrete polyblock algorithm. For large-scale instances, a similar BRB algorithm was developed by Tuy et al. [14]. Applications Although monotonic optimization is a new approach in global optimization and there is not a broad literature on its applications, it can be applied to numerous problems. In most of these applications, first some transformations are performed and the problems are reformulated in the proper way. Then monotonic optimization is applied and other approaches are employed to enhance the quality of the bounds. Some of these applications are briefly introduced in this section. Polynomial programming: The problem of minimizing or maximizing a polynomial function under a set of polynomial constraints, which is encountered in a multitude of applications, is called polynomial programming. Tuy [9] reformulated this problem as a difference of monotonic functions problem which can be solved by the methods described before. Tuy [7] proposed a robust solution approach for polynomial programming based on a monotonic optimization scheme. He developed a BRB procedure to tackle the polynomial optimization problems of higher dimensions. Polynomial optimization contains nonconvex quadratic programming as a special case. So every polynomial optimization method can be applied to solve this important class of problems [4,16].

Fractional programming: In fractional programming, we deal with functions represented as ratios of other functions. Phuong and Tuy [3] considered a generalized linear fractional programming problem in which the objective function is an arbitrary continuous increasing function of m linear fractional functions and the feasible set is a polytope D. Linear fractional functions are defined as ratios of affine functions. They proposed a new unified approach that reformulates the problem and solves it as a monotonic optimization problem. Tuy [17] considered a more general class of fractional programming problems, namely optimizing a polynomial fractional function (the ratio of two polynomial functions) under polynomial constraints. His method is again based on reformulating the problem as a monotonic optimization problem, and a branch-and-bound scheme was presented for problems of higher rank. Clearly, polynomial programming is a special case of this class of problems.

Multiplicative programming: Multiplicative programming problems are optimization problems containing products of a number of convex or concave functions in the objective function or the constraints. Tuy [9] showed that these classes of problems are essentially monotonic optimization problems. Tuy and Nghia [15] devised a new approach based on the reverse polyblock approximation method for a broad class of problems including generalized linear multiplicative and linear fractional programming as special cases. For more applications, including Lipschitz optimization, optimization under network constraints, the Fekete points problem, and the Lennard-Jones potential energy function, see [9].

Conclusions

We have discussed the recently developed theory of monotonic optimization as well as its generalizations and applications. This novel scheme, which is capable of solving a wide range of nonconvex problems, is based on a polyblock outer approximation procedure. The approach that monotonic optimization takes to optimization problems is analogous to that of convex optimization in several respects.


Just as we approximate convex sets by polyhedra, normal sets, defined as the level sets of increasing functions, can be approximated by polyblocks in monotonic optimization. Just as the difference of convex functions plays an essential role in convex analysis (for instance, any twice continuously differentiable function can be represented as the difference of two convex functions), optimization problems representable as differences of monotonic functions can be treated by monotonic optimization. The performance of the method can be improved significantly by incorporating other techniques, such as convex relaxation, to exploit further properties present in the problem. In high dimensions, branch-and-bound or branch-and-cut extensions of the algorithm can be applied to overcome storage difficulties and increase the convergence speed.

References
1. Minoux M, Tuy H (2001) Discrete Monotonic Global Optimization. Preprint, Institute of Mathematics, Hanoi
2. Pardalos PM, Romeijn HE, Tuy H (2000) Recent developments and trends in global optimization. J Comput Appl Math 124:209–228
3. Phuong NTH, Tuy H (2003) A Unified Monotonic Approach to Generalized Linear Fractional Programming. J Global Optim 26:229–259
4. Phuong NTH, Tuy H (2002) A Monotonicity Based Approach to Nonconvex Quadratic Minimization. Vietnam J Math 30:373–393
5. Rubinov A, Tuy H, Mays H (2001) An Algorithm for Monotonic Global Optimization Problems. Optimization 49:205–221
6. Tuy H (2005) Partly Convex and Convex-Monotonic Optimization Problems. Preprint, Institute of Mathematics, Hanoi
7. Tuy H (2005) Polynomial Optimization: A Robust Approach. Preprint, Institute of Mathematics, Hanoi
8. Tuy H (2005) Robust Solution of Nonconvex Global Optimization Problems. J Global Optim 32:307–323
9. Tuy H (2000) Monotonic Optimization: Problems and Solution Approaches. SIAM J Optim 11:464–494
10. Tuy H (1999) Normal Sets, Polyblocks, and Monotonic Optimization. Vietnam J Math 27:277–300
11. Tuy H, Al-Khayyal F (2003) Monotonic Optimization Revisited. Preprint, Institute of Mathematics, Hanoi
12. Tuy H, Al-Khayyal F, Ahmed S (2001) Polyblock Algorithms Revisited. Preprint, Institute of Mathematics, Hanoi
13. Tuy H, Al-Khayyal F, Thach PT (2005) Monotonic Optimization: Branch and Cut Methods. In: Audet C, Hansen P, Savard G (eds) Essays and Surveys in Global Optimization. Springer US, pp 39–78


14. Tuy H, Minoux M, Phuong NTH (2006) Discrete Monotonic Optimization with Application to a Discrete Location Problem. SIAM J Optim 17:78–97
15. Tuy H, Nghia ND (2001) Reverse Polyblock Approximation for Generalized Multiplicative/Fractional Programming. Preprint, Institute of Mathematics, Hanoi
16. Tuy H, Phuong NTH (2007) A robust algorithm for quadratic optimization under quadratic constraints. J Global Optim 37:557–569
17. Tuy H, Thach PT, Konno H (2004) Optimization of Polynomial Fractional Functions. J Global Optim 29:19–44

Monte-Carlo Simulated Annealing in Protein Folding

YUKO OKAMOTO
Department of Theoretical Studies, Institute for Molecular Science, and Department of Functional Molecular Science, Graduate University for Advanced Studies, Okazaki, Japan

MSC2000: 92C40

Article Outline

Keywords
Introduction
Energy Functions of Protein Systems
Methods
Results
Conclusions
See also
References

Keywords

Simulated annealing; Protein folding; Tertiary structure prediction; α-helix; β-sheet

We review uses of Monte-Carlo simulated annealing in the protein folding problem and discuss a strategy for tackling it based on all-atom models. Our approach consists of two elements: the inclusion of accurate solvent effects, and the development of powerful simulation algorithms that can avoid becoming trapped in local-minimum energy states. For the former, we discuss several models varying in nature from crude (a distance-dependent dielectric function) to rigorous (the reference interaction site model).


For the latter, we show the effectiveness of Monte-Carlo simulated annealing.

Introduction

Proteins under their native physiological conditions spontaneously fold into unique three-dimensional structures (tertiary structures) on a time scale of milliseconds to minutes. Although protein structures appear to depend on various environmental factors within the cell where they are synthesized, experiments 'in vitro' have shown that the three-dimensional structure of a protein is determined solely by its amino-acid sequence [12]. Hence, it has been hoped that once the correct Hamiltonian of the system is given, one can predict the native protein tertiary structure from first principles by computer simulations. However, this has yet to be accomplished. There are two reasons for the difficulty. One is that the inclusion of accurate solvent effects is nontrivial, because the number of solvent molecules that have to be considered is very large. The other comes from the fact that the number of possible conformations of a protein is astronomically large [30,60]. Simulations by conventional methods, such as Monte-Carlo or molecular-dynamics algorithms in the canonical ensemble, will necessarily be trapped in one of the many local-minimum states of the energy function. In this article, I discuss a possible strategy to alleviate these difficulties. The outline of the article is as follows. In Sect. "Energy Functions of Protein Systems" we summarize the energy functions of the protein systems used in our simulations. In Sect. "Methods" we briefly review our simulation methods. In Sect. "Results" we present the results of our protein folding simulations. Section "Conclusions" is devoted to conclusions.

Energy Functions of Protein Systems

The energy function for the protein systems is given by the sum of two terms: the conformational energy EP of the protein molecule itself and the solvation free energy ES for the interaction of the protein with the surrounding solvent. The conformational energy function EP (in kcal/mol) that we used is one of the standard ones.

Namely, it is given by the sum of the electrostatic term EC, the 12-6 Lennard–Jones term ELJ, and the hydrogen-bond term EHB over all pairs of atoms in the molecule, together with the torsion term Etor over all torsion angles:

E_P = E_C + E_LJ + E_HB + E_tor,
E_C = \sum_{(i,j)} 332 q_i q_j / (\varepsilon r_{ij}),
E_LJ = \sum_{(i,j)} ( A_{ij}/r_{ij}^{12} - B_{ij}/r_{ij}^{6} ),

where the sums run over atom pairs (i, j), r_ij is the interatomic distance (in Å), q_i are the partial charges (in units of the electronic charge), ε is the dielectric constant, and A_ij and B_ij are the Lennard–Jones parameters; the factor 332 makes E_C come out in kcal/mol.
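As a concrete reading of the two pairwise terms written out above, the following Python sketch evaluates EC and ELJ for a set of atoms; the array names and the default ε = 2 (the gas-phase value used later in this article) are our own choices, and the hydrogen-bond and torsion terms are omitted.

```python
import numpy as np

def pairwise_energy(coords, q, A, B, eps=2.0):
    """Electrostatic and 12-6 Lennard-Jones terms of the conformational
    energy, in kcal/mol: E_C = sum 332*q_i*q_j/(eps*r_ij) and
    E_LJ = sum (A_ij/r_ij**12 - B_ij/r_ij**6) over all atom pairs i < j.

    coords: (N, 3) positions in Angstroms; q: (N,) partial charges in
    electronic units; A, B: (N, N) Lennard-Jones parameter matrices.
    """
    e_c = e_lj = 0.0
    n = len(q)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            e_c += 332.0 * q[i] * q[j] / (eps * r)
            e_lj += A[i, j] / r**12 - B[i, j] / r**6
    return e_c, e_lj
```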

Results

From the numbers of helical conformations obtained in the 20 Monte-Carlo simulated annealing runs for each homo-oligomer (Table 1), the helix-forming tendency of the seven amino acids follows the rank order

Met > Ala > Leu > Phe > Val > Ile > Gly.   (10)


Monte-Carlo Simulated Annealing in Protein Folding, Table 1 α-Helix formation in homo-oligomers from 20 Monte-Carlo simulated annealing runs

Helix length ℓ  (Met)10  (Ala)10  (Leu)10  (Phe)10  (Val)10  (Ile)10  (Gly)10
3               1        0        4        1        0        2        1
4               2        0        2        2        2        0        0
5               0        1        1        1        0        0        0
6               2        3        2        1        0        0        0
7               2        1        0        0        0        0        0
8               7        4        0        0        0        0        0
9               1        0        0        0        0        0        0
10              0        0        0        0        0        0        0
Total           15/20    9/20     9/20     5/20     2/20     2/20     1/20

This can be compared with the experimentally determined helix propensities [6,8]. Our rank order (10) is in good agreement with the experimental data. We then analyzed the relation between helix-forming tendency and energy. We found that the difference ΔE = E_NH − E_H between the minimum energies of nonhelical (NH) and helical (H) conformations is large for homo-oligomers with high helix-forming tendency (9.7, 10.2, and 21.5 kcal/mol for (Met)10, (Ala)10, and (Leu)10, respectively) and small for those with low helix-forming tendency (0.5, 1.6, and 3.2 kcal/mol for (Val)10, (Ile)10, and (Gly)10, respectively). Moreover, we found that the large ΔE for the former homo-oligomers is caused by the Lennard–Jones term ELJ (13.3, 8.0, and 17.5 kcal/mol for (Met)10, (Ala)10, and (Leu)10, respectively). Hence, we conjecture that the differences in helix-forming tendencies are determined by the following factors [44]. A helical conformation is energetically favored in general because of the Lennard–Jones term ELJ. For amino acids with low helix-forming tendency, except Gly, however, the steric hindrance of the side chains raises ELJ of helical conformations, so that the difference ΔELJ between nonhelical and helical conformations is reduced significantly. The small ΔELJ for these amino acids is easily overcome by entropic effects, and their helix-forming tendencies are therefore small. Note that such amino acids (Val and Ile here) have two large side-chain branches at Cβ, while helix-forming amino acids such as Met and Leu have only one branch at Cβ, and Ala has a small side chain. We now study the β-strand-forming tendencies of these seven homo-oligomers.

In Table 2 we summarize the β-strand formation observed in the 20 Monte-Carlo simulated annealing runs [44]. The implications of the results are not as obvious as in the α-helix case, presumably because a short, isolated β-strand is not very stable by itself: hydrogen bonds between β-strands are needed to stabilize it. However, we can still give a rough estimate of the rank order of strand-forming tendency for the seven amino acids [44]:

Val > Ile > Phe > Leu > Ala > Met > Gly.   (11)


Here, we considered Val to be more strand-forming than Ile, since the longer a strand segment is, the harder it is to form in a simulation. Our rank order (11) is again in good agreement with the experimental data [8]. Comparing (11) with (10), we find that the helix-forming group is the strand-breaking group and vice versa, except for Gly; Gly is both helix- and strand-breaking. This reflects the fact that Gly, having no side chain, has a much larger (backbone) conformational space than the other amino acids. The helix-coil transitions of homo-oligomer systems were further analyzed by multicanonical algorithms [3] in [47,48]. The results gave quantitative support to those obtained by the Monte-Carlo simulated annealing described above [44]. We have so far studied peptides of nonpolar amino acids, each of which is electrically neutral as a whole. We now discuss the helix-forming tendencies of peptides with polar amino acids, whose side chains are charged by protonation or deprotonation. One example is the C-peptide, residues 1–13 of ribonuclease A.


Monte-Carlo Simulated Annealing in Protein Folding, Table 2 β-Strand formation in homo-oligomers from 20 Monte-Carlo simulated annealing runs

Strand length m  (Met)10  (Ala)10  (Leu)10  (Phe)10  (Val)10  (Ile)10  (Gly)10
3                0        0        2        5        1        7        0
4                0        0        0        1        0        4        0
5                0        0        0        0        2        1        0
6                0        0        0        0        1        0        0
7                0        0        0        0        0        0        0
8                0        0        0        0        1        0        0
9                0        0        0        0        0        0        0
10               0        0        0        0        0        0        0
Total            0/20     0/20     2/20     6/20     5/20     12/20    0/20

It is known from the X-ray diffraction data of the whole enzyme that the segment from Ala-4 to Gln-11 exhibits a nearly three-turn α-helix [58,64]. It was also found by CD [56] and NMR [53] experiments that the isolated C-peptide has significant α-helix formation in aqueous solution at temperatures near 0 °C. Furthermore, the CD experiment on the isolated C-peptide showed that the side-chain charges of residues Glu-2− and His-12+ enhance the stability of the α-helix, while the remaining side-chain charges do not [56]. The NMR experiment [53] on the isolated C-peptide further observed the formation of the characteristic salt bridge between Glu-2− and Arg-10+ that exists in the native structure determined by the X-ray experiments on the whole protein [58,64]. In order to test whether our simulations can reproduce these experimental results, we made 20 Monte-Carlo simulated annealing runs of 10,000 MC sweeps for several C-peptide analogues [23,46]. The amino-acid sequences of four of the analogues are listed in Table 3. The simulations were performed in gas phase (ε = 2). The temperature was decreased exponentially from 1000 K to 250 K in each run. As usual, all simulations were started from random conformations. In Table 4 we summarize the helix formation over all runs [46]. Here, the numbers of conformations with helical segments of length ℓ ≥ 3 are given, using Definition I of the α-helix state. From this table one sees that an α-helix was hardly ever formed for Peptide IV, in which Glu-2 and His-12 are neutral, while many helical conformations were obtained for the other peptides. This is in accord with the experimental result that the charges of Glu-2− and His-12+ are necessary for α-helix stability [56].
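The annealing protocol just described, Metropolis sweeps with the temperature lowered exponentially from 1000 K to 250 K, can be stated compactly in code. The following Python sketch is a generic illustration of such a schedule; the placeholder energy function, the move width, and all names are our own choices, not the article's actual force field or update scheme.

```python
import math
import random

KB = 0.001987  # Boltzmann constant in kcal/(mol K)

def simulated_annealing(energy, n_angles, n_sweeps=10000,
                        t_start=1000.0, t_end=250.0, step=math.pi / 6):
    """Metropolis Monte-Carlo simulated annealing over a vector of torsion
    angles, with the temperature lowered exponentially from t_start to t_end.
    `energy(angles)` must return an energy in kcal/mol.
    """
    angles = [random.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    e = energy(angles)
    best, best_e = list(angles), e
    for sweep in range(n_sweeps):
        # Exponential cooling schedule: T_k = T0 * (Tf/T0)**(k/(n-1)).
        t = t_start * (t_end / t_start) ** (sweep / (n_sweeps - 1))
        for i in range(n_angles):              # one sweep = one trial per angle
            old = angles[i]
            angles[i] = old + random.uniform(-step, step)
            e_new = energy(angles)
            if e_new <= e or random.random() < math.exp(-(e_new - e) / (KB * t)):
                e = e_new                      # accept the move
                if e < best_e:
                    best, best_e = list(angles), e
            else:
                angles[i] = old                # reject: restore the angle
    return best, best_e

# Toy usage with a placeholder energy (not a protein force field):
# best, e = simulated_annealing(lambda a: sum(math.cos(3 * x) for x in a), n_angles=10)
```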

Monte-Carlo Simulated Annealing in Protein Folding, Table 3 Amino-acid sequences of the peptide analogues of C-peptide studied by Monte-Carlo simulated annealing

Peptide  Sequence (residues 1–13)
I        Lys+ Glu− Thr Ala Ala Ala Lys+ Phe Glu− Arg+ Gln His+ Met
II       Same as I, except that Glu-9 is neutral (Glu)
III      Same as I, except that Glu-9 is replaced by Leu
IV       Same as I, except that Glu-2 and His-12 are neutral (Glu, His)

Peptides II and III had conformations with the longest α-helix (ℓ = 7). These conformations turned out to have the lowest energy among the 20 simulation runs for each peptide. They both exhibit an α-helix from Ala-5 to Gln-11, while the structure from the X-ray data has an α-helix from Ala-4 to Gln-11. These three conformations are compared in Fig. 4. As mentioned above, the agreement of the backbone structures is conspicuous, but the side-chain structures are not quite similar.


Monte-Carlo Simulated Annealing in Protein Folding, Table 4 α-Helix formation in C-peptide analogues from 20 Monte-Carlo simulated annealing runs

Helix length ℓ  I      II     III    IV
3               4      2      3      1
4               3      2      3      0
5               1      1      0      0
6               0      1      0      0
7               0      1      1      0
Total           8/20   7/20   7/20   1/20

Monte-Carlo Simulated Annealing in Protein Folding, Figure 4 The lowest-energy conformations of Peptide II (a) and Peptide III (b) of the C-peptide analogues obtained from 20 Monte-Carlo simulated annealing runs in gas phase, and the corresponding X-ray structure (c)

In particular, while the X-ray [58,64] and NMR [53] experiments imply the formation of the salt bridge between the side chains of Glu-2− and Arg-10+, the lowest-energy conformations of Peptides II and III obtained from the simulations do not have this salt bridge. The disagreement is presumably caused by the lack of solvent in our simulations. We therefore made multicanonical Monte-Carlo simulations of Peptide II with solvent effects included through the distance-dependent dielectric function (see (2)) [18,19]. It was found that the lowest-energy conformation obtained has an α-helix from Ala-4 to Gln-11 and does have the characteristic salt bridge between Glu-2− and Arg-10+ [18,19]. A similar dependence of α-helix stability on side-chain charges was observed in Monte-Carlo simulated annealing runs of a 17-residue synthetic peptide [43]. The pH difference in the experimental conditions was represented by the corresponding difference in the charge assignment of the side chains, and agreement with the experimental results (stable α-helix formation at low pH and low helix content at high pH) was observed in the simulations by Monte-Carlo simulated annealing with the distance-dependent dielectric function [43]. Considering our simulation results on homo-oligomers of nonpolar amino acids, the C-peptide, and the synthetic peptide, we conjecture that the helix-forming tendencies of oligopeptide systems are controlled by the following factors [43]. An α-helix structure is generally favored energetically (especially through the Lennard–Jones term). When the side chains are uncharged, the steric hindrance of the side chains is the key factor for the difference in helix-forming tendency. When some of the side chains are charged, however, these charges play an important role in helix stability in addition to the factor above: some charges enhance helix stability, while others reduce it.


We have up to now discussed α-helix formation in our simulations of oligopeptide systems. We have also studied β-sheet formation by Monte-Carlo simulated annealing [38,39,51]. The peptide that we studied is the fragment corresponding to residues 16–36 of bovine pancreatic trypsin inhibitor (BPTI), with the amino-acid sequence Ala16-Arg+-Ile-Ile-Arg+-Tyr-Phe-Tyr-Asn-Ala-Lys+-Ala-Gly-Leu-Cys-Gln-Thr-Phe-Val-Tyr-Gly36. An antiparallel β-sheet structure at residues 18–35 is observed in the X-ray crystallographic data of the whole protein [10]. We first performed 20 Monte-Carlo simulated annealing runs of 10,000 MC sweeps in gas phase (ε = 2) with the same protocol as in the previous simulations [38]. Namely, the temperature was decreased exponentially from 1000 K to 250 K in each run, and all simulations were started from random conformations. The present simulation differs from the previous ones only in the amino-acid sequence. The most notable feature of the results is that α-helices, which were the dominant motif in the previous simulations of C-peptide and other peptides, are absent here. Most of the conformations obtained consist of stretched strands and a 'turn' connecting them. The lowest-energy structure indeed exhibits an antiparallel β-sheet [38]. We next made 10 Monte-Carlo simulated annealing runs of 100,000 MC sweeps for BPTI(16–36) with two dielectric functions: ε = 2 and the sigmoidal, distance-dependent dielectric function of (2) [39]. The results with ε = 2 reproduced our previous findings: most of the obtained conformations have β-strand structures, and no extended α-helix is observed. Those with the sigmoidal dielectric function, on the other hand, indicated the formation of α-helices. One of the low-energy conformations, for instance, exhibited an approximately four-turn α-helix from Ala-16 to Gly-28 [39]. This is an example of a peptide with a single amino-acid sequence that can form both α-helix and β-sheet structures, depending on its electrostatic environment. NMR experiments suggest that this peptide actually forms a β-sheet structure [40].

Monte-Carlo Simulated Annealing in Protein Folding, Figure 5 The structure of BPTI(16–36) deduced from X-ray experiments (a) and the lowest-energy conformation of BPTI(16–36) obtained from 20 Monte-Carlo simulated annealing runs in aqueous solution represented by solvent-accessible surface area (b)

The representation of the solvent by the sigmoidal dielectric function (which gave α-helices instead) is therefore not sufficient. Hence, the same peptide fragment, BPTI(16–36), was further studied by Monte-Carlo simulated annealing in aqueous solution represented by the solvent-accessible surface area terms of (3) [51]. Twenty simulation runs of 100,000 MC sweeps were made. It was indeed found that the lowest-energy structure obtained has a β-sheet structure (more precisely, a type II′ β-turn) at the very location suggested by the NMR experiments [40]. This structure and that deduced from the X-ray experiments [10] are compared in Fig. 5. The figures were created with Molscript [29] and Raster3D [2,35]. Although both conformations are β-sheet structures, there are important differences between the two: the positions and types of the turns are different.


Since the X-ray structure is taken from experiments on the whole BPTI molecule, it does not have to agree with that of the isolated BPTI(16–36) fragment. It was found [51] that the simulated results in Fig. 5b are in remarkable agreement with the NMR experiments on the isolated fragment [40]. We have so far dealt with peptides with small numbers of amino acids (up to 21) and simple secondary-structure elements: a single α-helix or β-sheet. Native proteins usually have more than one secondary-structure element. We now discuss our attempts at first-principles tertiary-structure prediction for larger and more complicated systems. The first example is the fragment corresponding to residues 1–34 of human parathyroid hormone (PTH). An NMR experiment on PTH(1–34) suggested the existence of two α-helices, around residues Ser-3 to His-9 and Ser-17 to Leu-28 [28]. Another NMR experiment, on a slightly longer fragment, PTH(1–37), in aqueous solution also suggested the existence of the two helices [32]. One of the determined structures, for instance, has α-helices at residues Gln-6 to His-9 and Ser-17 to Lys-27 [32]. For PTH(1–34) we performed 20 Monte-Carlo simulated annealing runs of 10,000 MC sweeps in gas phase (ε = 2) with the same protocol as in the previous simulations [50]. Many of the 20 final conformations exhibited α-helix structures (especially in the N-terminal region). In Fig. 6 we show the lowest-energy conformation of PTH(1–34) [50]. This conformation indeed has two α-helices, around residues Val-2 to Asn-10 (Helix 1) and Met-18 to Glu-22 (Helix 2), which are precisely the locations suggested by experiment [28], although Helix 2 is somewhat shorter (5 residues) than the corresponding one (12 residues) in the experimental data. A slightly larger peptide fragment, PTH(1–37), was also studied by Monte-Carlo simulated annealing [34] for comparison with the results of the recent NMR experiment in aqueous solution [32]. Ten simulation runs of 100,000 MC sweeps were made in gas phase (ε = 2) and in aqueous solution represented by terms proportional to the solvent-accessible surface area (see (3)). Although the results are preliminary, the simulations in gas phase did not produce two helices this time, in contrast to the previous work [50], where a short second helix had been observed.


Monte-Carlo Simulated Annealing in Protein Folding, Figure 6 Lowest-energy conformation of PTH(1–34) obtained from 20 Monte-Carlo simulated annealing runs in gas phase

The lowest-energy conformation has an α-helix from Val-2 to Asn-10. The simulations in aqueous solution, on the other hand, did produce the two α-helices. The lowest-energy conformation obtained has α-helices from Gln-6 to His-9 and from Gly-12 to Glu-22. Note that the second helix is now more extended than the first, in agreement with experiment. This structure, together with one of the NMR structures [32], is shown in Fig. 7. The figures were again created with Molscript [29] and Raster3D [2,35]. Generalized-ensemble simulations of PTH(1–37) are now in progress in order to obtain more quantitative information, such as the average helicity as a function of residue number. The second example of a more complicated system is the immunoglobulin-binding domain of streptococcal protein G. This protein is composed of 56 amino acids, and the structure determined by an NMR experiment [14] and an X-ray diffraction experiment [1] has an α-helix and a β-sheet. The α-helix extends from residue Ala-23 to residue Asp-36. The β-sheet is made of four β-strands: Met-1 to Gly-9, Leu-12 to Ala-20, Glu-42 to Asp-46, and Lys-50 to Glu-56.


Monte-Carlo Simulated Annealing in Protein Folding, Figure 7 A structure of PTH(1–37) deduced from NMR experiments (a) and the lowest-energy conformation of PTH(1–37) obtained from 10 Monte-Carlo simulated annealing runs in aqueous solution represented by solvent-accessible surface area (b)

This structure is shown in Fig. 8a. The figures in Fig. 8 were again created with Molscript [29] and Raster3D [2,35]. We have performed eight Monte-Carlo simulated annealing runs of 50,000 to 400,000 MC sweeps with the sigmoidal, distance-dependent dielectric function of (2). The lowest-energy conformation obtained so far has four α-helices and no β-sheet, in disagreement with the X-ray structure. This structure is shown in Fig. 8b. The disagreement of the lowest-energy structure (Fig. 8b) with the X-ray structure (Fig. 8a) is presumably caused by the poor representation of the solvent effects. As can be seen in Fig. 8a, the X-ray structure has both an interior, where a well-defined hydrophobic core is formed, and an exterior, which is exposed to the solvent.

Monte-Carlo Simulated Annealing in Protein Folding, Figure 8 A structure of protein G deduced from an X-ray experiment (a) and the lowest-energy conformation of protein G obtained from Monte-Carlo simulated annealing runs with the distance-dependent dielectric function (b)

The distance-dependent dielectric function, which mimics the solvent effects only in electrostatic interactions, is therefore not sufficient to represent the effects of the solvent here.
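For concreteness, sigmoidal distance-dependent dielectric functions of the kind referred to as (2) typically follow the saturating form of Hingerty et al. [20], sketched below in Python; the constants used here (ε rising from about 1 at contact toward the bulk-water value of 78, with a 2.5 Å length scale) are illustrative assumptions rather than the exact parameters of (2).

```python
import math

def sigmoidal_dielectric(r, eps_bulk=78.0, eps_short=1.0, lam=2.5):
    """Distance-dependent dielectric 'constant' eps(r) that rises
    sigmoidally from eps_short at contact to eps_bulk (water) at large
    separation, after Hingerty et al. [20]; r and lam in Angstroms.
    """
    if r <= 0.0:
        return eps_short
    s = r / lam
    # (s**2 * e**s) / (e**s - 1)**2 goes from 1 at s -> 0 to 0 as s -> inf.
    damp = (s * s * math.exp(s)) / (math.exp(s) - 1.0) ** 2
    return eps_bulk - (eps_bulk - eps_short) * damp

# With these constants, eps(1.0) is about 2, eps(10.0) is about 55,
# and eps(r) approaches 78 as r grows.
```

Because such a function rescales only the Coulomb term, it cannot distinguish a buried hydrophobic core from a solvent-exposed surface, which is exactly the limitation noted above for protein G.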

Conclusions

In this article we have reviewed theoretical aspects of the protein folding problem. Our strategy for tackling this problem consists of two elements: 1) the inclusion of accurate solvent effects, and 2) the development of powerful simulation algorithms that can avoid becoming trapped in local-minimum energy states. We have demonstrated the effectiveness of Monte-Carlo simulated annealing by showing that the direct folding of α-helix and β-sheet structures from randomly generated initial conformations is possible.


As for the solvent effects, we considered several methods: a distance-dependent dielectric function, a term proportional to the solvent-accessible surface area, and the reference interaction site model (RISM). These methods vary in nature from crude but computationally inexpensive (the distance-dependent dielectric function) to accurate but computationally demanding (RISM theory). In the present article, we have shown that the inclusion of some solvent effects is very important for successful prediction of the tertiary structures of small peptides and proteins.

See also

▶ Adaptive Simulated Annealing and its Application to Protein Folding
▶ Bayesian Global Optimization
▶ Genetic Algorithms
▶ Genetic Algorithms for Protein Structure Prediction
▶ Global Optimization Based on Statistical Models
▶ Global Optimization in Lennard–Jones and Morse Clusters
▶ Global Optimization in Protein Folding
▶ Molecular Structure Determination: Convex Global Underestimation
▶ Monte-Carlo Simulations for Stochastic Optimization
▶ Multiple Minima Problem in Protein Folding: αBB Global Optimization Approach
▶ Packet Annealing
▶ Phase Problem in X-ray Crystallography: Shake and Bake Approach
▶ Protein Folding: Generalized-ensemble Algorithms
▶ Random Search Methods
▶ Simulated Annealing
▶ Simulated Annealing Methods in Protein Folding
▶ Stochastic Global Optimization: Stopping Rules
▶ Stochastic Global Optimization: Two-phase Methods

References
1. Achari A, Hale SP, Howard AJ, Clore GM, Gronenborn AM, Hardman KD, Whitlow M (1992) 1.67-Å X-ray structure of the B2 immunoglobulin-binding domain of streptococcal protein G and comparison to the NMR structure of the B1 domain. Biochemistry 31:10449–10457


2. Bacon D, Anderson WF (1988) A fast algorithm for rendering space-filling molecular pictures. J Mol Graphics 6:219–220
3. Berg BA, Neuhaus T (1991) Multicanonical algorithms for first order phase transitions. Phys Lett B267:249–253
4. Brooks III CL (1998) Simulations of protein folding and unfolding. Curr Opin Struct Biol 8:222–226
5. Brünger AT (1988) Crystallographic refinement by simulated annealing: Application to a 2.8 Å resolution structure of aspartate aminotransferase. J Mol Biol 203:803–816
6. Chakrabartty A, Kortemme T, Baldwin RL (1994) Helix propensities of the amino acids measured in alanine-based peptides without helix-stabilizing side-chain interactions. Protein Sci 3:843–852
7. Chandler D, Andersen HC (1972) Optimized cluster expansions for classical fluids. Theory of molecular liquids. J Chem Phys 57:1930–1937
8. Chou PY, Fasman GD (1974) Prediction of protein conformation. Biochemistry 13:222–245
9. Daggett V, Kollman PA, Kuntz ID (1991) Molecular dynamics simulations of small peptides: dependence on dielectric model and pH. Biopolymers 31:285–304
10. Deisenhofer J, Steigemann W (1975) Crystallographic refinement of the structure of bovine pancreatic trypsin inhibitor at 1.5 Å resolution. Acta Crystallogr B31:238–250
11. Dill K (1990) The meaning of hydrophobicity. Science 250:297
12. Epstein CJ, Goldberger RF, Anfinsen CB (1963) The genetic control of tertiary protein structure: studies with model systems. Cold Spring Harbor Symp Quant Biol 28:439–449
13. Graham WH, Carter ES II, Hicks RP (1992) Conformational analysis of Met-enkephalin in both aqueous solution and in the presence of sodium dodecyl sulfate micelles using multidimensional NMR and molecular modeling. Biopolymers 32:1755–1764
14. Gronenborn AM, Filpula DR, Essig NZ, Achari A, Whitlow M, Wingfield PT, Clore GM (1991) A novel, highly stable fold of the immunoglobulin binding domain of streptococcal protein G. Science 253:657–661
15. Hansmann UHE, Okamoto Y (1993) Prediction of peptide conformation by multicanonical algorithm: new approach to the multiple-minima problem. J Comput Chem 14:1333–1338
16. Hansmann UHE, Okamoto Y (1994) Comparative study of multicanonical and simulated annealing algorithms in the protein folding problem. Phys A 212:415–437
17. Hansmann UHE, Okamoto Y (1994) Sampling ground-state configurations of a peptide by multicanonical annealing. J Phys Soc Japan 63:3945–3949
18. Hansmann UHE, Okamoto Y (1998) Tertiary structure prediction of C-peptide of ribonuclease A by multicanonical algorithm. J Phys Chem B 102:653–656
19. Hansmann UHE, Okamoto Y (1999) Effects of side-chain charges on α-helix stability in C-peptide of ribonuclease A studied by multicanonical algorithm. J Phys Chem B 103:1595–1604


20. Hingerty BE, Ritchie RH, Ferrell T, Turner JE (1985) Dielectric effects in biopolymers: the theory of ionic saturation revisited. Biopolymers 24:427–439
21. Hirata F, Rossky PJ (1981) An extended RISM equation for molecular polar fluids. Chem Phys Lett 83:329–334
22. Kawai H, Kikuchi T, Okamoto Y (1989) A prediction of tertiary structures of peptide by the Monte Carlo simulated annealing method. Protein Eng 3:85–94
23. Kawai H, Okamoto Y, Fukugita M, Nakazawa T, Kikuchi T (1991) Prediction of α-helix folding of isolated C-peptide of ribonuclease A by Monte Carlo simulated annealing