Volume 18, Number 3, August 2023 
IEEE Computational Intelligence Magazine


Volume 18 Number 3 ❏  August 2023 www.ieee-cis.org

Features

14 How Good is Neural Combinatorial Optimization? A Systematic Evaluation on the Traveling Salesman Problem by Shengcai Liu, Yu Zhang, Ke Tang, and Xin Yao

29 Jack and Masters of all Trades: One-Pass Learning Sets of Model Sets From Large Pre-Trained Models by Han Xiang Choong, Yew-Soon Ong, Abhishek Gupta, Caishun Chen, and Ray Lim

41 A Multi-Factorial Evolutionary Algorithm With Asynchronous Optimization Processes for Solving the Robust Influence Maximization Problem by Shuai Wang, Beichen Ding, and Yaochu Jin

On the Cover: ©SHUTTERSTOCK.COM/PRO500

IEEE Computational Intelligence Magazine (ISSN 1556603X) is published quarterly by The Institute of Electrical and Electronics Engineers, Inc. Headquarters: 3 Park Avenue, 17th Floor, New York, NY 10016-5997, U.S.A. +1 212 419 7900. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society, or its members. The magazine is a membership benefit of the IEEE Computational Intelligence Society, and subscriptions are included in Society fee. Replacement copies for members are available for US$20 (one copy only). Nonmembers can purchase individual copies for US$220.00. Nonmember subscription prices are available on request. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of the U.S. Copyright law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01970, U.S.A.; and 2) pre-1978 articles without fee. For other copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Service Center, 445 Hoes Lane, Piscataway NJ 08854 U.S.A. Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to IEEE Computational Intelligence Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854-1331 U.S.A. PRINTED IN U.S.A. Canadian GST #125634188.

Digital Object Identifier 10.1109/MCI.2023.3282533

Columns

54 AI-eXplained: Multi-Magnification Attention Convolutional Neural Networks by Chia-Wei Chao, Daniel Winden Hwang, Hung-Wen Tsai, Shih-Hsuan Lin, Wei-Li Chen, Chun-Rong Huang, and Pau-Choo Chung

56 Research Frontier: Contribution-Based Cooperative Co-Evolution With Adaptive Population Diversity for Large-Scale Global Optimization by Ming Yang, Jie Gao, Aimin Zhou, Changhe Li, and Xin Yao

69 A Macro-Micro Population-Based Co-Evolutionary Multi-Objective Algorithm for Community Detection in Complex Networks by Lei Zhang, Haipeng Yang, Shangshang Yang, and Xingyi Zhang

Departments

2 Editor’s Remarks: Takeaways From CI Leading Researchers by Chuan-Kang Ting

4 President’s Message: Chat, Anyone? by Jim Keller

6 Conference Reports: Conference Report on 2022 IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2022) by Ah-Hwee Tan, Dipti Srinivasan, and Chunyan Miao

10 Publication Spotlight: CIS Publication Spotlight by Yongduan Song, Dongrui Wu, Carlos A. Coello Coello, Georgios N. Yannakakis, Huajin Tang, Yiu-Ming Cheung, and Hussein Abbass

88 Conference Calendar by Marley Vellasco and Liyan Song


CIM Editorial Board

Editor-in-Chief
Chuan-Kang Ting, National Tsing Hua University, Department of Computer Science, No. 101, Section 2, Kuang-Fu Road, Hsinchu 300044, TAIWAN (Phone) +886-3-5742795 (Email) [email protected]

Founding Editor-in-Chief
Gary G. Yen, Oklahoma State University, USA

Past Editors-in-Chief
Kay Chen Tan, Hong Kong Polytechnic University, HONG KONG
Hisao Ishibuchi, Southern University of Science and Technology, CHINA

Editors-At-Large
Piero P. Bonissone, Piero P. Bonissone Analytics, USA
David B. Fogel, Natural Selection, Inc., USA
Vincenzo Piuri, University of Milan, ITALY
Marios M. Polycarpou, University of Cyprus, CYPRUS
Jacek M. Zurada, University of Louisville, USA

Associate Editors
Jose M. Alonso-Moral, Universidade de Santiago de Compostela, SPAIN
Sansanee Auephanwiriyakul, Chiang Mai University, THAILAND
Ying-ping Chen, National Yang Ming Chiao Tung University, TAIWAN
Keeley Crockett, Manchester Metropolitan University, UK
Liang Feng, Chongqing University, CHINA
Jen-Wei Huang, National Cheng Kung University, TAIWAN
Eyke Hüllermeier, University of Munich, GERMANY
Min Jiang, Xiamen University, CHINA
Sheng Li, University of Virginia, USA
Hongfu Liu, Brandeis University, USA
Zhen Ni, Florida Atlantic University, USA
Nelishia Pillay, University of Pretoria, SOUTH AFRICA
Danil Prokhorov, Toyota R&D, USA
Kai Qin, Swinburne University of Technology, AUSTRALIA
Rong Qu, University of Nottingham, UK
Manuel Roveri, Politecnico di Milano, ITALY
Gonzalo A. Ruz, Universidad Adolfo Ibáñez, CHILE
Ming Shao, University of Massachusetts Dartmouth, USA
Ah-Hwee Tan, Singapore Management University, SINGAPORE
Vincent S. Tseng, National Yang Ming Chiao Tung University, TAIWAN
Handing Wang, Xidian University, CHINA
Dongbin Zhao, Chinese Academy of Sciences, CHINA

IEEE Periodicals/Magazines Department
Journals Production Manager: Eileen McGuinness
Senior Manager, Journals Production: Patrick Kempf
Senior Art Director: Janet Dudar
Associate Art Director: Gail A. Schnitzer
Production Coordinator: Theresa L. Smith
Director, Business Development – Media & Advertising: Mark David
Advertising Production Manager: Felicia Spagnoli
Production Director: Peter M. Tuohy
Editorial Services Director: Kevin Lisankie
Senior Director, Publishing Operations: Dawn Melley

IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

Digital Object Identifier 10.1109/MCI.2023.3278543


Chuan-Kang Ting National Tsing Hua University, TAIWAN

Editor's Remarks

Takeaways From CI Leading Researchers

With the pandemic subsiding, academic activities are reviving rapidly around the world. In May 2023, I organized the AI Forum 2023 and had the honor to invite Yaochu Jin, Kay Chen Tan, and Yew Soon Ong to deliver keynote speeches about cutting-edge CI technologies and, in particular, to share their valuable research experience in the panel discussion. One thing these former Editors-in-Chief of IEEE Transactions and outstanding researchers all agreed on: you have to love what you are doing. Without passion, the ambition in research will soon wither into a compulsion to grind through experiments and a source of unwelcome pressure. Furthermore, it does not matter if your topic steers away from the mainstream. “Don’t be afraid to be different; be afraid to be the same,” Yew Soon stressed. Working hard on your innovative ideas with a never-give-up mentality can bring rewards and achievements beyond your expectations. Apart from these positive attitudes, they pointed out a practical guideline for publication: checking the interests and narrative style of the target journal/conference and presenting your research work accordingly is always a must-do.

This issue includes three Features articles. The first article introduces neural combinatorial optimization (NCO), which uses deep learning techniques to solve combinatorial optimization problems, and further presents a comparative study of NCO solvers on the traveling salesman problem. The second article proposes using many smaller and specialized models, rather than one single large model, as Masters of All Trades to deal with different task settings. In the third article, a multi-factorial evolutionary algorithm is developed to solve the robust influence maximization problem.

(Right to left) Yaochu Jin, Yew Soon Ong, Kay Chen Tan, and Chuan-Kang Ting in the panel session of AI Forum 2023 at National Tsing Hua University.

Digital Object Identifier 10.1109/MCI.2023.3278544 Date of current version: 13 July 2023



In the Columns, the AI-X article presents a multi-magnification attention convolutional neural network that can identify the importance of features at different magnification levels. The interactive contents in this immersive article demonstrate the effectiveness of the proposed method on cell segmentation in liver histopathology images. The second article embeds an adaptive mechanism in cooperative co-evolution to enhance population diversity and, in turn, performance. The third article proposes employing two populations, i.e., macro- and micro-populations, to balance the exploration and exploitation of multi-objective evolutionary algorithms for community detection in complex networks.

In the Society Briefs, the General Co-Chairs present a report on IEEE SSCI 2022.

We cordially invite readers to enjoy the articles in this issue and hope you will find them informative, inspiring, and engaging. As always, we welcome your feedback and suggestions regarding the content of this magazine. Please do not hesitate to contact me at [email protected].


CIS Society Officers

President – Jim Keller, University of Missouri, USA
President-Elect – Yaochu Jin, Bielefeld University, GERMANY
Vice President-Conferences – Marley M. B. R. Vellasco, Pontifical Catholic University of Rio de Janeiro, BRAZIL
Vice President-Education – Pau-Choo (Julia) Chung, National Cheng Kung University, TAIWAN
Vice President-Finances – Pablo A. Estevez, University of Chile, CHILE
Vice President-Industrial and Governmental Activities – Piero P. Bonissone, Piero P. Bonissone Analytics, USA
Vice President-Members Activities – Sanaz Mostaghim, Otto von Guericke University of Magdeburg, GERMANY
Vice President-Publications – Kay Chen Tan, Hong Kong Polytechnic University, HONG KONG
Vice President-Technical Activities – Luis Magdalena, Universidad Politecnica de Madrid, SPAIN

Publication Editors
IEEE Transactions on Neural Networks and Learning Systems: Yongduan Song, Chongqing University, CHINA
IEEE Transactions on Fuzzy Systems: Dongrui Wu, Huazhong University of Science and Technology, CHINA
IEEE Transactions on Evolutionary Computation: Carlos A. Coello Coello, CINVESTAV-IPN, MEXICO
IEEE Transactions on Games: Georgios N. Yannakakis, University of Malta, MALTA
IEEE Transactions on Cognitive and Developmental Systems: Huajin Tang, Zhejiang University, CHINA
IEEE Transactions on Emerging Topics in Computational Intelligence: Yiu-ming Cheung, Hong Kong Baptist University, HONG KONG
IEEE Transactions on Artificial Intelligence: Hussein Abbass, University of New South Wales, AUSTRALIA

Administrative Committee
Term ending in 2023:
Oscar Cordón, University of Granada, SPAIN
Guilherme DeSouza, University of Missouri, USA
Pauline Haddow, Norwegian University of Science and Technology, NORWAY
Haibo He, University of Rhode Island, USA
Hisao Ishibuchi, Southern University of Science and Technology, CHINA
Term ending in 2024:
Sansanee Auephanwiriyakul, Chiang Mai University, THAILAND
Jonathan Garibaldi, University of Nottingham, UK
Janusz Kacprzyk, Polish Academy of Sciences, POLAND
Derong Liu, Guangdong University of Technology, CHINA
Ana Madureira, Polytechnic of Porto, PORTUGAL
Term ending in 2025:
Keeley Crockett, Manchester Metropolitan University, UK
Jose Lozano, University of the Basque Country UPV/EHU, SPAIN
Alice E. Smith, Auburn University, USA
Christian Wagner, University of Nottingham, UK
Gary G. Yen, Oklahoma State University, USA

Digital Object Identifier 10.1109/MCI.2023.3279704


Jim Keller University of Missouri, USA

President's Message

Chat, Anyone?

Hi all,

Let’s talk about the elephant in the room. ChatGPT and other Large Language Models (LLMs), along with their AI cousins focused on images, graphics, math, coding, and audio, are dominating our world of generating reports and manuscripts. They can be immensely fun: I asked ChatGPT to write me song lyrics about fuzzy logic in various genres. I wasn’t crazy about the folk song, but I really liked ChatGPT’s blues lyrics and its bluegrass rendition. And who hasn’t chuckled over clever images synthesized by DALL-E? However, these are serious applications with huge potential. Can we get too much of a good thing?

We are experiencing, if not the peak, at least the wave of hype, both positive and negative. Many people, including many IEEE members, have signed a letter calling for a six-month “AI Pause,” worried about the potential for moral and ethical misuse of AI. I didn’t sign the letter for several reasons, but it is important to have these conversations to focus on what and how to regulate, as well as our own personal interaction with AI programs. From an IEEE perspective, what are the guidelines and rules to govern our use of AI programs in the creation and reviewing of manuscripts in IEEE publications including, of course, conference proceedings? I chair an Ad Hoc Committee of the IEEE Publication Services and Products Board (yea, that’s a mouthful) on AI in the publication domain: opportunities and threats. One of our charters is to articulate the principles and guidelines for the use of AI in publications. So we’ve been collecting a lot of information and discussing the Good, the Bad, and the Ugly of AI in publications.

There is already a considerable effort to add AI and AI-like (maybe call this “little ai”) capabilities into publications. The IEEE Author Portal is one good example of little ai at the moment, though plans are in the works for moving closer to Big AI. Dawn Melley, Senior Director of IEEE Publishing Operations, summarizes: “The IEEE Author Portal is an article submission system that works seamlessly with ScholarOne Manuscripts to allow authors to submit more quickly and efficiently. It uses metadata extraction to take what is provided in the author’s files and automatically populate many of the submission fields, which saves the authors a lot of time and effort during the submission process.” One feature involves machine learning algorithms that are used to extract key submission information from manuscript files so that authors do not have to rekey or copy/paste the information (definitely something not fun about manuscript submission). Enhancements and additions to these features, think grammar checking, etc., are clearly to me part of the Good.

Approaches to aid in the review process are being researched and developed. Suggesting relevant referees is a great low-hanging fruit. This helps the beleaguered AEs and conference program chairs, and aids authors by minimizing referees who don’t have the background to successfully interpret a paper.

Digital Object Identifier 10.1109/MCI.2023.3278545 Date of current version: 13 July 2023



I reported in a previous President’s Message about a CIS initiative to research ways that AI can assist in the review process. Many potential good uses are being studied.

It will take a while for IEEE to iterate on an official policy for Big AI. In the meantime, I’m going to give my opinion. First of all, don’t include ChatGPT or other programs as a co-author; that makes no sense. The main guideline right now is to put some details of how you used AI in your manuscript in an acknowledgement section. For example, if you use an LLM to act like a human copyeditor, just acknowledge it along with something about the extent of its use. It’s still your responsibility as the author to be sure the copyediting doesn’t change the meaning of your text. Or, if you ask for help identifying previous work in an area, be upfront about it, and carefully check those references – we’ve all now heard stories of false information being generated by LLMs. Don’t use AI programs to “create” new ideas to pass along as yours. My bottom line is that if you are uncomfortable about divulging how you used an AI program, you probably shouldn’t use it. For me, this boils down to a question of personal ethics.

If you are asked to referee a paper, you should not ask an LLM to do your review. That amounts to a breach of confidentiality, since most LLMs will integrate that input into their training corpus. I’d sure hate to see my new good idea given to the next person who asks ChatGPT. Wouldn’t you?

The genie is out of the bottle. AI capabilities will continue to provide useful tools for authors, referees, editors, and publishers. Let’s use them to make our jobs easier but avoid ethical pitfalls. As always, feel free to contact me at [email protected] with your thoughts, suggestions, questions, and innovative ideas, particularly with thoughts on the use of AI in publishing.

Please stay safe and healthy. I hope to see many of you at the CIS meetings over the rest of the year.

PS: I asked the public version of ChatGPT on May 9, 2023 to rewrite the opening paragraph above. Here is what it gave me. While it is smoother and contains most of my thoughts, I don’t think it “sounds” like me. What do you think? “Let’s address the elephant in the room: Large Language Models (LLMs) like ChatGPT and other AI technologies specializing in graphics, mathematics, coding, and audio are dominating the production of reports and manuscripts. While they can be a lot of fun, such as when I asked ChatGPT to create song lyrics about fuzzy logic in different genres (although I wasn’t fond of the folk song, I enjoyed the blues lyrics and bluegrass rendition it generated), we must acknowledge their significant potential for serious applications. However, there is a growing concern about whether we can have too much of a good thing, particularly given the wave of both positive and negative hype surrounding AI. Many IEEE members and other individuals have signed a letter calling for a six-month “AI Pause,” worried about the potential ethical and moral misuse of AI. While I did not sign the letter for various reasons, it is crucial to have conversations about regulating AI and our own interactions with AI programs.”


Conference Reports

Ah-Hwee Tan, Singapore Management University, SINGAPORE
Dipti Srinivasan, National University of Singapore, SINGAPORE
Chunyan Miao, Nanyang Technological University, SINGAPORE
IEEE SSCI 2022 General Co-Chairs

Conference Report on 2022 IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2022)

On behalf of the organizing committee, we are delighted to deliver this conference report for the 2022 IEEE Symposium Series on Computational Intelligence (SSCI 2022), which was held in Singapore from 4th to 7th December 2022. IEEE SSCI is an established flagship annual international series of symposia on computational intelligence (CI) sponsored by the IEEE Computational Intelligence Society (CIS) to promote and stimulate discussions on the latest theory, algorithms, applications, and emerging topics in computational intelligence. After two years of virtual conferences due to the global pandemic, IEEE SSCI returned as an in-person meeting with online elements in 2022. One unique feature of IEEE SSCI is to co-locate multiple symposia and special sessions under one roof, each dedicated to a specific topic in the CI domain, thereby encouraging cross-fertilization of ideas and providing a unique platform for researchers, practitioners, and students from all around the world to discuss and present their findings.

Digital Object Identifier 10.1109/MCI.2023.3278554 Date of current version: 13 July 2023



The call for symposia and special sessions of SSCI 2022 attracted proposals for 41 symposia and 11 special sessions. The subsequent call for papers received a total of 379 submissions by 672 authors across 57 countries. While the individual symposium and special session chairs handled the reviewer assignment and acceptance/rejection recommendation for each paper, the acceptance/rejection decision was finalized by the program chairs. For papers contributed by the symposium and special session chairs, a separate Conflict of Interest (COI) track was set up to handle the paper submission, reviewer assignment, and acceptance/rejection recommendation. In addition, a separate submission site was created to cater for papers submitted by the conference and program chairs. Following a rigorous review process, 230 regular papers were selected for oral presentation at the conference and included in the conference proceedings published in IEEE Xplore.


After merging some small symposia and special sessions, the regular papers were organized into 32 symposia and 6 special sessions presented in six parallel tracks over three days. In addition to the accepted regular papers, 20 late-breaking papers were selected and featured as poster presentations at the conference.

To recognize outstanding original contributions to the field of computational intelligence, SSCI 2022 made a special effort to provide two best paper awards with external sponsorship. The IEEE SSCI 2022 Best Paper Award, received by Siying Zhang, Andi Han, and Junbin Gao for their paper entitled “Robust Denoising In Graph Neural Networks,” was sponsored by the AI, Analytics & Informatics (AI3) Horizontal Technology Programming Office of the Agency for Science, Technology and Research (A*STAR).


The IEEE Brain Best Paper Award, received by Shuailei Zhang, Dezhi Zheng, Ning Tang, Effie Chew, Rosary Yuting Lim, Kai Keng Ang, and Cuntai Guan for their paper on “Online Adaptive CNN: A Session-to-Session Transfer Learning Approach for Non-Stationary EEG,” was sponsored by IEEE Brain to recognize outstanding original contributions to areas related to neurotechnology and brain research.

Jointly organized by Singapore Management University (SMU), Nanyang Technological University (NTU), and National University of Singapore (NUS), IEEE SSCI 2022 was held on the SMU campus in downtown Singapore and attended by more than 300 delegates from 42 countries. Surrounded by many interesting places and attractions, including museums, theatres, cafes, shops, and education institutes, SMU offered a modern and well-connected city venue with state-of-the-art conference facilities.

Besides the symposia and special sessions, IEEE SSCI 2022 featured five plenary talks given by world-renowned researchers in the field of computational intelligence: Witold Pedrycz, Mengjie Zhang, Chuan-Kang Ting, Shuicheng Yan, and Donald C. Wunsch II. In addition, six symposium keynotes were presented by leading scholars, including Daniel Polani, Ruhul Sarker, Yubiao Sun, Kumar Venayagamoorthy, Luca Oneto, and Donald C. Wunsch II. These plenary and keynote speeches presented the audience with an opportunity to listen firsthand and engage directly with world-leading experts on cutting-edge technology development and insights in the field of computational intelligence.

For knowledge dissemination, eleven technical tutorials were offered by experts in the field on the pre-conference day and were free to attend for all participants.

The very well-attended tutorials covered a diverse range of specialized topics, ranging from Brain-Inspired Computation Based on Spiking Neural Networks, Evolving Fuzzy Systems, Prototype-Based Deep Learning, Hyperplane-Based Autonomous Machine Learning, Decomposition Multi-Objective Optimization, Genetic Programming and Machine Learning for Job Shop Scheduling, Unraveling Data Correlations with Self-Supervised Learning, Evolutionary Computation for Automating Deep Neural Architecture Design, Randomization-Based Deep and Shallow Learning Methods, and Machine Learning Applications in Adversarial Attacks and Mitigation Strategies, to AI for Manufacturing Data Analysis in Industry 4.0, as well as Ethical Challenges and Opportunities with Computational Intelligence Systems.

FIGURE 1 On behalf of the three general Co-Chairs, Ah-Hwee Tan welcomed the conference delegates at the opening of IEEE SSCI 2022.

FIGURE 2 SSCI 2022 Guest of Honor Professor Jim Keller (President, IEEE Computational Intelligence Society) providing the opening welcome address.




FIGURE 3 Professor Witold Pedrycz, our first plenary speaker, delivering his speech in the SMU’s Ngee Ann Kongsi Auditorium.

FIGURE 4 Professor Witold Pedrycz, our plenary speaker, receiving the certificate of appreciation from Professor Marley Vellasco (IEEE CIS Vice President for Conferences).


As an integral part of the conference program, two engaging panel sessions were organized. The panel on “Women in Computational Intelligence Achieving IEEE Fellow Status - The Way Forward” was a joint initiative between IEEE CIS Women in Computational Intelligence (WCI) and IEEE Women in Engineering (WIE) to provide support and encouragement to women in computational intelligence to achieve IEEE Fellow status. The panel session on “Artificial and Computational Intelligence: Opportunities and Challenges,” organized and moderated by Professor Yaochu Jin, provided an interactive platform to discuss the opportunities and challenges in the field, and the efforts that need to be made to ensure healthy and sustainable growth.

Also co-located with SSCI 2022 were a number of CIS events, including the 2022 IEEE CIS Student Hackathon on Computational Intelligence in Biomedicine and Healthcare organized by the IEEE CIS and the Miin Wu School of Computing, National Cheng Kung University, Taiwan; the CIS Student Activities organized by the IEEE CIS Student Activities Sub-Committee and the IEEE Guangzhou Section YP Affinity Group; and the CIS Membership Round Table, all of which were open to the participants.

A conference is not complete without an accompanying social program. IEEE SSCI 2022 provided ample opportunities for delegates to interact and connect, including a pre-conference welcome reception on 4th December at the SMU Concourse T-Junction, daily luncheons and coffee breaks during the main conference days, and a conference banquet on 6th December. As a special highlight, our social events further included an off-site visit to the Gardens by the Bay, arguably Singapore’s crown jewel, followed by a cocktail reception and conference dinner at the Sands Expo and Convention Centre.

FIGURE 5 Professor Yaochu Jin leading an all-star expert panel session on Artificial and Computational Intelligence: Opportunities and challenges.

FIGURE 6 SSCI 2022 delegates visiting the Gardens by the Bay, Singapore.

FIGURE 7 SSCI 2022 delegates danced away with the cultural dance troupe at the conference banquet held at the Sands Expo and Convention Centre, Singapore.

Though SSCI 2022 was organized as an in-person event, the conference also provided online access for registrants who were unable to travel. In particular, the online meeting platform Zoom was used to support remote presentations by some of our tutorial speakers as well as paper authors. The conference also adopted and made extensive use of Whova, which served a wide range of functions, such as online event management, information dissemination, and social interaction.

The success of a conference cannot come without the support and hard work of many dedicated individuals. We have many to thank for their generous contributions, in particular the leadership of the IEEE Computational Intelligence Society, especially Professor Jim Keller and Professor Marley Vellasco, for their continual support and guidance. We also thank the Singapore Tourism Board for providing the conference grant, Singapore Management University for sponsoring the conference venue and facilities, and the Agency for Science, Technology and Research (A*STAR) and IEEE Brain for sponsoring the Best Paper Awards. We also wish to acknowledge all our speakers for contributing their professional expertise, and all the symposium/special session chairs and members of our conference technical and organizing committees, who worked relentlessly to take care of all the necessary administrative, publicity, and logistics arrangements for our technical as well as social programs. Finally, we would like to mention the incredible support provided by our PCO, J. Jayes Pte Ltd, who handled various matters, including registration, finance, logistics, catering, and setting up the Whova content.

We trust that IEEE SSCI 2022 has been a memorable event for all who participated. Thank you to all who walked with us on this journey in co-creating this successful meeting. We look forward to the next edition, IEEE SSCI 2023 in Mexico.


Publication Spotlight

Yongduan Song, Chongqing University, CHINA
Dongrui Wu, Huazhong University of Science and Technology, CHINA
Carlos A. Coello Coello, CINVESTAV-IPN, MEXICO
Georgios N. Yannakakis, University of Malta, MALTA
Huajin Tang, Zhejiang University, CHINA
Yiu-Ming Cheung, Hong Kong Baptist University, HONG KONG
Hussein Abbass, University of New South Wales, AUSTRALIA

CIS Publication Spotlight

Digital Object Identifier 10.1109/MCI.2023.3278555 Date of current version: 13 July 2023

IEEE Transactions on Neural Networks and Learning Systems

A Survey on Evolutionary Neural Architecture Search, by Y. Liu, Y. Sun, B. Xue, M. Zhang, G. G. Yen, and K. C. Tan, IEEE Transactions on Neural Networks and Learning Systems, Vol. 34, No. 2, Feb. 2023, pp. 550–570. Digital Object Identifier: 10.1109/TNNLS.2021.3100554

“Deep neural networks (DNNs) have achieved great success in many applications. The architectures of DNNs play a crucial role in their performance, which is usually manually designed with rich expertise. However, such a design process is labor-intensive because of the trial-and-error process and also not easy to realize due to the rare expertise in practice. Neural architecture search (NAS) is a type of technology that can design the architectures automatically. Among different methods to realize NAS, the evolutionary computation (EC) methods have recently gained much attention and success. Unfortunately, there has not yet been a comprehensive summary of the EC-based NAS algorithms. This article reviews over 200 articles of most recent EC-based NAS methods in light of the core components, to systematically discuss their design principles and justifications on the design. Furthermore, current challenges and issues are also discussed to identify future research in this emerging field.”

IEEE Transactions on Fuzzy Systems

Fuzzy-Based Optimization and Control of a Soft Exosuit for Compliant Robot–Human–Environment Interaction, by Q. Li, W. Qi, Z. Li, H. Xia, Y. Kang, and L. Cheng, IEEE Transactions on Fuzzy Systems, Vol. 31, No. 1, Jan. 2023, pp. 241–253.

Digital Object Identifier: 10.1109/TFUZZ.2022.3185450

“Many previous studies of soft exosuits improved human locomotion performance. However, there is no example to control a soft exosuit using human ankle impedance adaption in assistance tasks compliantly. In this article, the human–environment interaction information is exploited into the exosuit control. A novel fuzzy-based optimization and control method of soft exosuit is proposed to provide plantarflexion assistance for human walking by changing the human–robot interaction. In particular, a fuzzy neurodynamics optimization is developed to learn the unknown human ankle impedance parameters automatically. A fuzzy approximation technique is applied to improve the control performance of the exosuit when a human is walking with unknown human–robot interaction model parameters. This control scheme guarantees that the human–robot dynamics follows a target human ankle impedance model to obtain the compliant interaction performance. Experiments on different participants verify the effectiveness of the control scheme. Results show that a compliant human–robot interaction is achieved by learning the human–environment interaction parameters, i.e., the human ankle parameters. It indicates that our proposed method can facilitate exosuit control to achieve compliant robot–human–environment interaction.”

Random Feature-Based Collaborative Kernel Fuzzy Clustering for Distributed Peer-to-Peer Networks, by Y. Wang, S. Han, J. Zhou, L. Chen, C. L. P. Chen, T. Zhang, Z. Liu, L. Wang, and Y. Chen, IEEE Transactions on Fuzzy Systems, Vol. 31, No. 2, Feb. 2023, pp. 692–706. Digital Object Identifier: 10.1109/TFUZZ.2022.3188363

“Kernel clustering has the ability to get the inherent nonlinear structure of the data. But the high computational complexity and the unknown representation of the kernel space make it unavailable for the data clustering in distributed peer-to-peer (P2P) networks. To solve this issue, we propose a new series of random feature-based collaborative kernel clustering algorithms in this article. In the most basic algorithm, each node in a distributed P2P network first maps its data into a low-dimensional random feature space with the approximation of the given kernel by using the random Fourier feature mapping method. Then, each node independently searches the clusters with its local data and the collaborative knowledge from its neighbor nodes, and the distributed clustering is performed among all network nodes until reaching the global consensus result, i.e., all nodes have the same cluster centers. In addition, an improved version is designed with assignment of feature weights, which is optimized by the maximum-entropy technique to extract important features for the cluster identification. What’s more, to relief the impact of different kernel functions and related parameters on clustering results, the combination of multiple kernels rather than a single kernel is adopted for the low-dimensional approximation, and the optimized weights are assigned to provide the guidance on the choice of the kernels and their parameters and discover significant features at the same time. Experiments on synthetic and real-world datasets show that the proposed methods achieve similar and even better results than the traditional kernel clustering methods on various performance metrics, including the average classification rate, the average normalized mutual information, and the average adjusted rand index. More importantly, the low-dimensional random features approximated to kernels and the distributed clustering mechanism adopted in these methods bring the greatly lower temporal complexity.”

IEEE Transactions on Evolutionary Computation

Promoting Transfer of Robot Neuro-Motion-Controllers by Many-Objective Topology and Weight Evolutions, by A. Salih and A. Moshaiov, IEEE Transactions on Evolutionary Computation, Vol. 27, No. 2, Apr. 2023, pp. 385–395. Digital Object Identifier: 10.1109/TEVC.2022.3172294

“The ability of robot motion controllers to quickly adapt to new environments is expected to extend the applications of mobile robots. Using the concept of transfer optimization, this study investigates the capabilities of neuro-motion-controllers, which were obtained by simultaneously solving several source problems, to adapt to target problems. In particular, the adaptation comparison is carried out between specialized controllers, which are optimal for a single source motion problem, and nonspecialized controllers that can solve several source motion problems. The compared types of controllers were simultaneously obtained by a many-objective evolution search that is tailored for the optimization of the topology and weights of neural networks. Based on the examined problems, it appears that nonspecialized solutions, which are “good enough” in all the source motion problems, show significantly better transfer capabilities as compared with solutions that were optimized for a single source motion problem. The proposed approach opens up new opportunities to develop controllers that have good enough performances in various environments while also exhibiting efficient adaptation capabilities to changes in the environments.”

IEEE Transactions on Games

Using Evolutionary Algorithms to Target Complexity Levels in Game Economies, by K. Rogers, V. L. Claire, J. Frommel, R. Mandryk, and L. E. Nacke, IEEE Transactions on Games, Vol. 15, No. 1, Mar. 2023, pp. 56–66. Digital Object Identifier: 10.1109/TG.2023.3238163

“Game economies (GEs) describe how resources in games are created, transformed, or exchanged: They underpin most games and exist in different complexities. Their complexity may directly impact player difficulty. Nevertheless, neither difficulty nor complexity adjustment has been explored for GEs. Moreover, there is a lack of knowledge about complexity in GEs, how to define or assess it, and how it can be employed by automated adjustment approaches in game development to target specific complexity. We present a proof-of-concept for using evolutionary algorithms to craft targeted complexity graphs to model GEs. In a technical evaluation, we tested our first working definition of complexity in GEs. We then evaluated player-perceived complexity in a city-building game prototype through a user study and confirmed the generated GEs’ complexity in an online survey. Our approach toward reliably creating GEs of specific complexity can facilitate game development and player testing but also inform and ground research on player perception of GE complexity.”

IEEE Transactions on Cognitive and Developmental Systems

Collision-Free Navigation in Human-Following Task Using a Cognitive Robotic System on Differential Drive Vehicles, by C. V. Dang, H. Ahn, J.-W. Kim, and S. C. Lee, IEEE Transactions on Cognitive and Developmental Systems, Vol. 15, No. 1, Mar. 2023, pp. 78–87. Digital Object Identifier: 10.1109/TCDS.2022.3145915

“As human–robot collaboration increases tremendously in real-world applications, a fully autonomous and reliable mobile robot for the collaboration has been a central research topic and investigated extensively in a large number of studies. One of the most pressing issues in such topic is the collision-free navigation that has a moving goal and unknown obstacles under the unstructured environment. In this article, a cognitive robotic system (CRS) is proposed for the robot to navigate itself to the moving target person without obstacle collision. This CRS consists of a cognitive agent, which is created based on the Soar cognitive architecture to reason its current situation and make action decision for the robot to avoid obstacles and reach the target position, and a speed planning module, which is based on dynamic window approach (DWA) to generate appropriate linear and angular velocities for driving the robot’s motors. For the implementation of the proposed system, we use a differential drive wheel robot equipped with two ultrawideband (UWB) sensors and a color depth camera as the experimental platform. Finally, to evaluate the performance of our system in actual operating conditions, we conduct experiments with a scenario that includes main tasks: avoiding consecutive unknown obstacles and turning at corner while the robot follows continuously human user along the corridor.”

IEEE Transactions on Emerging Topics in Computational Intelligence

ADAST: Attentive Cross-Domain EEG-Based Sleep Staging Framework With Iterative Self-Training, by E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan, IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 7, No. 1, Feb. 2023, pp. 210–221. Digital Object Identifier: 10.1109/TETCI.2022.3189695

“Sleep staging is of great importance in the diagnosis and treatment of sleep disorders. Recently, numerous data-driven deep learning models have been proposed for automatic sleep staging. They mainly train the model on a large public labeled sleep dataset and test it on a smaller one with subjects of interest. However, they usually assume that the train and test data are drawn from the same distribution, which may not hold in real-world scenarios. Unsupervised domain adaption (UDA) has been recently developed to handle this domain shift problem. However, previous UDA methods applied for sleep staging have two main limitations. First, they rely on a totally shared model for the domain alignment, which may lose the domain-specific information during feature extraction. Second, they only align the source and target distributions globally without considering the class information in the target domain, which hinders the classification performance of the model while testing. In this work, we propose a novel adversarial learning framework called ADAST to tackle the domain shift problem in the unlabeled target domain. First, we develop an unshared attention mechanism to preserve the domain-specific features in both domains. Second, we design an iterative self-training strategy to improve the classification performance on the target domain via target domain pseudo labels. We also propose dual distinct classifiers to increase the robustness and quality of the pseudo labels. The experimental results on six cross-domain scenarios validate the efficacy of our proposed framework and its advantage over state-of-the-art UDA methods.”

IEEE Transactions on Artificial Intelligence

Ensemble Image Explainable AI (XAI) Algorithm for Severe Community-Acquired Pneumonia and COVID-19 Respiratory Infections, by L. Zou, H. L. Goh, C. J. Y. Liew, J. L. Quah, G. T. Gu, J. J. Chew, M. P. Kumar, C. G. L. Ang, and A. Ta, IEEE Transactions on Artificial Intelligence, Vol. 4, No. 2, Apr. 2023, pp. 242–254. Digital Object Identifier: 10.1109/TAI.2022.3153754

“Since the onset of the COVID-19 pandemic in 2019, many clinical prognostic scoring tools have been proposed or developed to aid clinicians in the disposition and severity assessment of pneumonia. However, there is limited work that focuses on explaining techniques that are best suited for clinicians in their decision making. In this article, we present a new image explainability method named ensemble AI explainability (XAI), which is based on the SHAP and Grad-CAM++ methods. It provides a visual explanation for a deep learning prognostic model that predicts the mortality risk of community-acquired pneumonia and COVID-19 respiratory infected patients. In addition, we surveyed the existing literature and compiled prevailing quantitative and qualitative metrics to systematically review the efficacy of ensemble XAI, and to make comparisons with several state-of-the-art explainability methods (LIME, SHAP, (continued on page 86)

Share Your Preprint Research with the World! TechRxiv is a free preprint server for unpublished research in electrical engineering, computer science, and related technology. Powered by IEEE, TechRxiv provides researchers across a broad range of fields the opportunity to share early results of their work ahead of formal peer review and publication.

BENEFITS: • Rapidly disseminate your research findings • Gather feedback from fellow researchers • Find potential collaborators in the scientific community • Establish the precedence of a discovery • Document research results in advance of publication

Upload your unpublished research today!

Follow @TechRxiv_org Learn more techrxiv.org

Powered by IEEE

©SHUTTERSTOCK.COM/METHU DAS

How Good is Neural Combinatorial Optimization? A Systematic Evaluation on the Traveling Salesman Problem

Abstract—Traditional solvers for tackling combinatorial optimization (CO) problems are usually designed by human experts. Recently, there has been a surge of interest in utilizing deep learning, especially deep reinforcement learning, to automatically learn effective solvers for CO. The resultant new paradigm is termed neural combinatorial optimization (NCO). However, the advantages and disadvantages of NCO relative to other approaches have not been empirically or theoretically well studied. This work presents a comprehensive comparative study of NCO solvers and alternative solvers. Specifically, taking the traveling salesman problem as the testbed problem, the performance of the solvers is assessed in five aspects, i.e., effectiveness, efficiency, stability, scalability, and generalization ability. Our results show that the solvers learned by NCO approaches, in general, still fall short of traditional solvers in nearly all these aspects. A potential benefit of NCO solvers would be their superior time and energy efficiency for small-size problem instances when sufficient training instances are available. Hopefully, this work would help with a better understanding of the strengths and weaknesses of NCO and provide a comprehensive evaluation protocol for further benchmarking NCO approaches in comparison to other approaches.

Digital Object Identifier 10.1109/MCI.2023.3277768 Date of current version: 13 July 2023

Corresponding author: Ke Tang (e-mail: [email protected]).

Shengcai Liu, Yu Zhang, Ke Tang, and Xin Yao
Southern University of Science and Technology, CHINA





I. Introduction

Combinatorial optimization (CO) concerns optimizing an objective function by selecting a solution from a finite solution set, with the latter encoding constraints on the solution space. It has been involved in numerous real-world applications in logistics, supply chains, and energy [1]. From the perspective of computational complexity, many CO problems are NP-hard due to their discrete and nonconvex nature [2]. In recent decades, methods for solving CO problems have been extensively developed and can be broadly categorized into exact and approximate/heuristic/meta-heuristic methods [3]. The former methods are guaranteed to optimally solve CO problems but suffer from an exponential time complexity. In contrast, the latter methods seek to find good (but not necessarily optimal) solutions within reasonable computation time, i.e., they trade optimality for computational efficiency.

In general, most (if not all) of the above methods are manually designed. By analyzing the structure of the CO problem of interest, domain experts would leverage the algorithmic techniques that most effectively exploit this structure (e.g., proposed in the literature) and then continuously refine these methods (e.g., introducing new algorithmic techniques). Such a design process heavily depends on domain expertise and could be extremely expensive in terms of human time. For example, although the well-known traveling salesman problem (TSP) [4] has been studied for approximately 70 years, its methods [5], [6], [7], [8], [9] are still being actively and relentlessly updated.

Inspired by the success of deep learning (DL) in fields such as image classification [10], machine translation [11], and board games [12], recently there has been a surge of research interest in utilizing DL, especially deep reinforcement learning (DRL), to automatically learn effective methods for CO problems [13]. The resultant new paradigm is termed neural combinatorial optimization (NCO) [14], [15]. For the sake of clarity, henceforth, the optimization methods (either hand-engineered or automatically learned) are called solvers and the ways to design solvers are called design approaches.

Compared to the traditional manual approach, NCO exhibits a significant paradigm shift in solver design. As illustrated in Figure 1, the traditional solver design process is human-centered, while NCO is a learning-centered paradigm that develops a solver by training. The training process of NCO essentially calibrates the parameters of the solver (model). Although this approach induces a greater offline computational cost, the training process allows solver design to be conducted in an automated manner and thus involves much less human effort.¹

¹ It is noted that NCO still requires human time and expertise to carefully construct the training set, which should sufficiently represent the target use cases of the learned solver. However, this is not an easy task. This point will be further discussed in Section V.
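For reference, the TSP used as the testbed throughout this article can be stated in its standard form (the notation below is generic and assumed for illustration, not taken from the article): given nodes x_1, ..., x_n with a pairwise distance function d(., .), find a permutation pi of the nodes that minimizes the length of the closed tour,

\min_{\pi} \; \sum_{i=1}^{n-1} d\big(x_{\pi(i)}, x_{\pi(i+1)}\big) \;+\; d\big(x_{\pi(n)}, x_{\pi(1)}\big).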

Despite the appealing features NCO might bring, its advantages and disadvantages relative to other approaches have not been clearly specified. More specifically, although numerous computational experiments comparing NCO solvers with other solvers have been conducted in NCO works, they are generally nonconclusive for several reasons. First, it is often the case that the state-of-the-art traditional solvers are missing in the comparison, which would distort the conclusion and undermine the whole validation process. For example, the Google OR-Tools [16] is widely considered by NCO works [17], [18], [19], [20], [21] to be the baseline traditional solver for the vehicle routing problems (VRPs); however, it performs far worse than the state-of-the-art solvers for VRPs [22]. Second, for traditional solvers, their default configurations (parameter values) are used when comparing them with NCO solvers learned from training sets. Such an approach neglects the fact that, when a training set is available, the performance of traditional solvers could also be significantly enhanced by tuning their parameters [23], [24]. In practice, it is always desirable to make full use of the available technologies to achieve the best possible performance. In fact, with the help of existing open-source algorithm configuration tools [24], [25], [26], the tuning processes of traditional solvers can be easily automated with little human effort involved. Third, the benchmark instances used in the comparative studies are often quite limited in terms of problem types and sizes, making it difficult to gain insights into how these approaches would perform on problem instances with different characteristics. For example, for TSP, the main testbed problem in NCO, most works have only reported results obtained on randomly generated instances with up to 100 nodes [18], [20], [27], [28], [29], [30]. In comparison, traditional TSP solvers are generally tested on problem instances collected from distinct applications, with up to tens of thousands of nodes [5], [6], [7], [8], [9].

To better understand the benefits and limitations of NCO, this work presents a more comprehensive empirical study. Specifically, TSP is employed as the testbed problem, since it is the problem for which many widely used NCO architectures were originally designed, and thus the conclusions drawn from it could have strong implications for other problems. Three recently developed NCO approaches and three state-of-the-art traditional TSP solvers are involved in the experiments. These solvers are compared on five problem types with node numbers ranging from 50 to 10,000. The performance of the solvers is compared in five aspects that are critical in practice, i.e., effectiveness, efficiency, stability, scalability, and generalization ability. In particular, the energy efficiency (in terms of electric power consumption) of the solvers is also investigated, since energy consumption is increasingly recognized as an important factor in solver selection as the applications of solvers continue to grow.

To the best of our knowledge, this is the first comparative study of NCO approaches and traditional solvers on TSPs that 1) considers five different problem types, 2) involves problem instances with up to 10,000 nodes, 3) includes tuned traditional solvers in the comparison, and 4) investigates five different performance aspects including the energy consumption of the solvers.
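As a rough illustration of the kind of automated tuning referred to above, the sketch below runs a plain random search over a solver's configuration space and keeps the configuration with the best average tour length on a set of training instances. The solver interface, parameter names, and value ranges are hypothetical placeholders for illustration only; they are not the configuration tools or solvers used in this study.

import random

# Hypothetical parameter space of a traditional TSP solver (placeholder names).
PARAM_SPACE = {
    "perturbation_strength": (1, 20),   # integer-valued parameter
    "candidate_list_size": (5, 40),     # integer-valued parameter
    "restart_probability": (0.0, 0.5),  # real-valued parameter
}

def sample_config():
    """Draw one random configuration from the parameter space."""
    config = {}
    for name, (low, high) in PARAM_SPACE.items():
        if isinstance(low, int):
            config[name] = random.randint(low, high)
        else:
            config[name] = random.uniform(low, high)
    return config

def tune(run_solver, training_instances, budget=100):
    """Random-search tuning: run_solver(instance, config) returns a tour length.
    Returns the configuration with the lowest average tour length on the
    training instances."""
    best_config, best_score = None, float("inf")
    for _ in range(budget):
        config = sample_config()
        score = sum(run_solver(inst, config) for inst in training_instances)
        score /= len(training_instances)
        if score < best_score:
            best_config, best_score = config, score
    return best_config

Dedicated configurators replace the random sampling above with model-based or racing strategies, but the overall loop (sample a configuration, evaluate it on training instances, keep the best) is the same.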



FIGURE 1 Illustrations of the two solver design paradigms. (a) Human-centered traditional paradigm. (b) Learning-centered NCO.

The presented comprehensive empirical study has led to several interesting findings. First, traditional solvers still significantly outperform NCO solvers in finding high-quality solutions regardless of problem types and sizes. In particular, current NCO approaches are not adept at handling large-size problem instances and structured problem instances (e.g., clustered TSP instances). Second, a potential benefit of NCO solvers might be their efficiency (both in time and energy). For example, to achieve the same solution quality on small-size randomly generated problem instances, NCO solvers consume at most one-tenth of the resources consumed by traditional solvers. Third, when the training instances cannot sufficiently represent the target cases of the problem, both NCO solvers and tuned traditional solvers exhibit performance degradation, although the degradation is more dramatic for the former.

The remainder of the article is organized as follows. Section II briefly reviews the literature on NCO. Section III explains the design of the comparative study. Section IV presents the experimental results and analysis. Finally, concluding remarks are given in Section V.

II. Review of Neural Combinatorial Optimization

Before reviewing NCO, it is useful to first quickly recap typical CO solvers. In general, CO solvers include exact ones and approximate ones. Typical exact solvers are based on branch-and-bound techniques that explore the solution space by branching into sub-problems and then filtering the set of possible solutions based on the upper and lower estimated bounds of the optimal solution. Typical approximate CO solvers are heuristics, which can be further roughly classified into constructive heuristics and improvement heuristics. The former incrementally build a solution to a CO problem by adding one element at a time until a complete solution is obtained. In contrast, the latter improve upon a given solution by iteratively modifying it. The way of modifying a given solution is called a move operator, and many move operators have been proposed for different CO problems in recent decades. For a comprehensive overview of CO, interested readers are referred to [1].

It is worth noting that the application of neural networks to solve CO problems is actually not new. Earlier works [31] from the 1980s focused on using Hopfield neural networks (HNNs) to solve small-size TSP instances, which were later extended to other problems [32]. The main limitation of HNN-based approaches is that they need to use a separate network to solve each problem instance. The term NCO refers to the series of works that utilize DL to learn a solver (model) to solve a set of different problem instances.
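To make the constructive/improvement distinction recapped above concrete, the sketch below implements the classic 2-opt move operator for the TSP: it repeatedly reverses a segment of a given tour whenever doing so shortens the tour. This is a generic textbook illustration, not one of the solvers evaluated in this article.

import math

def tour_length(tour, coords):
    """Total length of the closed tour over 2-D coordinates."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, coords):
    """Improvement heuristic: apply the 2-opt move operator (reverse the
    segment between positions i and j) as long as it reduces the tour length."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, coords) < tour_length(tour, coords):
                    tour, improved = candidate, True
    return tour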



According to the types of the learned solvers, the existing NCO approaches can be categorized into learning constructive heuristics (LCH), learning improvement heuristics (LIH), and learning hybrid solvers (LHS). As the names suggest, the solvers learned by LCH approaches and LIH approaches are constructive heuristics and improvement heuristics, respectively. Compared to traditional heuristics, their main difference is that the heuristic rules are no longer manually designed but are instead automatically learned. For example, the well-known greedy constructive heuristic for TSPs always selects the closest point for insertion, while LCH approaches learn a deep neural network (DNN) to score each point and finally select the point with the highest score for insertion. Compared to the manually designed greedy heuristic rule, the DNN model is trained with data and does not necessarily exhibit greedy behavior. Finally, LHS approaches seek to learn solvers that are hybrids of learning models and traditional solvers. The following sections introduce these NCO approaches, mainly focusing on the key works. For a comprehensive survey of this area, interested readers are referred to [13], [33].
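The contrast between a hand-designed greedy rule and a learned scoring rule can be sketched as follows. The construction loop is identical in both cases; only the scoring function changes. The code is a simplified illustration under assumed interfaces (in particular, the learned scorer is only indicated as a stub); it is not the architecture of any specific LCH approach.

import math

def construct_tour(coords, score_fn):
    """Constructive heuristic: start at node 0 and repeatedly append the
    unvisited node with the highest score under score_fn."""
    unvisited = set(range(1, len(coords)))
    tour = [0]
    while unvisited:
        current = tour[-1]
        nxt = max(unvisited, key=lambda j: score_fn(current, j, coords))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Hand-designed greedy rule: prefer the closest unvisited node.
def greedy_score(current, j, coords):
    return -math.dist(coords[current], coords[j])

# An LCH approach would instead plug in a trained model, e.g. (hypothetical):
#   score_fn = lambda current, j, coords: dnn(coords[current], coords[j])

tour = construct_tour([(0, 0), (1, 0), (2, 1), (0, 2)], greedy_score)  # -> [0, 1, 2, 3]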

A. Learning Constructive Heuristics

1) Pointer Network-Based Approaches

In the seminal work, Vinyals et al. [27] introduced a sequence-to-sequence model, dubbed pointer network (Pr-Net), for solving TSPs defined on the two-dimensional plane. Specifically, Pr-Net is composed of an encoder and a decoder, both of which are recurrent neural networks. Given a TSP instance, the encoder processes all the nodes in it and outputs an embedding (a real-valued vector) for each of them. Then, the decoder repeatedly uses an attention mechanism, which has been successfully applied to machine translation [11], to output a probability distribution over the previously encoded nodes, eventually obtaining a permutation over all the nodes, i.e., a solution to the input TSP instance. This approach allows the network to be used for problem instances with different sizes. However, Pr-Net is trained by supervised learning (SL) with precomputed near-optimal TSP solutions as labels. This could be a limiting factor, since in real-world applications such solutions of CO problems might be difficult to obtain. To overcome the above limitation, Bello et al. [28] proposed training Pr-Net with reinforcement learning (RL). In their implementation, the tour length of the constructed TSP solution is used as the reward signal. Another limitation of Pr-Net is that it treats the input as a sequence, while many CO problems have no natural internal

ordering. For example, the order of the nodes in a TSP instance is actually meaningless; the instance is better viewed as a set of nodes. To address this issue, Nazari et al. [17] replaced the recurrent neural network in the encoder of Pr-Net, which was supposed to capture the sequential information contained in the input, with a simple embedding layer that was invariant to the input order. In addition, the authors extended Pr-Net to solve VRPs, which differ from TSPs in that VRPs involve dynamically changing properties (e.g., the demands of nodes) during the solution construction process. Specifically, the proposed model takes both the static properties (coordinates) and the dynamic properties (demands) of the nodes as input and outputs an embedding for each of them. Then, at each decoding step, the decoder produces a probability distribution over all nodes while masking the serviced nodes and the nodes with demands larger than the remaining vehicle load.

2) Transformer-Based Approaches

In addition to the sequence-to-sequence model, the well-known transformer architecture [34] has also been applied to solve CO problems. In particular, the transformer also follows the encoder-decoder framework but involves a so-called multi-head attention mechanism to extract deep features from the input. Such a mechanism has been used by Deudon et al. [29] to encode the nodes of TSP instances. Moreover, for sequentially decoding nodes, the authors of [29] proposed using only the last three decoding steps (i.e., the last three selected nodes) to obtain the reference vector for the attention mechanism, thus reducing the computational complexity. A similar model to that of [29] was implemented by Kool et al. [18]. Notably, the authors adjusted the model for many different CO problems, including TSPs, prize-collecting TSPs, VRPs, and orienteering problems, to accommodate their special characteristics. Additionally, the authors proposed an enhanced RL training procedure that used a simple rollout baseline and exhibited superior performance over [17], [29]. Based on the model proposed in [18], many subsequent works have improved it to achieve better solution quality or extended it to solve other VRP variants. For example, Peng et al. [35] adapted the model to re-encode the nodes during the solution construction process, obtaining better solution quality than the original model. A similar idea was implemented by Xin et al. [36], where the authors proposed changing the attention weights of the visited nodes in the encoder instead of completely re-encoding them. Another interesting work was completed by Li et al. [37], where the authors considered multi-objective TSPs. They first decomposed the multi-objective problem into a series of single-objective sub-problems and then used a Pr-Net to sequentially solve each sub-problem, where the network weights were shared among neighboring sub-problems. Finally, motivated by the fact that an optimal solution to a VRP instance, in general, has many different representations, Kwon et al. [20] introduced a modified RL training procedure to force diverse rollouts toward optimal solutions. The resultant approach, called POMO, is currently one of the strongest NCO approaches for learning constructive heuristics for TSPs and VRPs.
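The core decoding mechanism shared by these constructive models is masked, step-by-step node selection. The sketch below is a schematic NumPy illustration of that loop under simplifying assumptions (a fixed embedding matrix and a dot-product scoring rule standing in for the learned attention); it is not the actual Pr-Net or POMO implementation. For VRPs, the mask would additionally exclude nodes whose demand exceeds the remaining vehicle load.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the allowed entries only; masked-out entries get probability 0."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max()
    exp = np.exp(scores)
    return exp / exp.sum()

def decode_tour(node_embeddings, query_fn, rng):
    """Sequentially sample a tour: at each step, score all nodes against the
    current decoding context and mask the already-visited nodes."""
    n = len(node_embeddings)
    visited = np.zeros(n, dtype=bool)
    tour = []
    for _ in range(n):
        context = node_embeddings[tour[-1]] if tour else node_embeddings.mean(axis=0)
        scores = node_embeddings @ query_fn(context)   # attention-style compatibilities
        probs = masked_softmax(scores, ~visited)       # forbid re-visiting nodes
        nxt = rng.choice(n, p=probs)
        tour.append(nxt)
        visited[nxt] = True
    return tour

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 16))                        # stand-in for learned embeddings
print(decode_tour(emb, query_fn=lambda c: c, rng=rng))
```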

3) Graph Neural Network-Based Approaches

Another line of work leverages graph neural networks (GNNs) [38] to address the aforementioned need for order-invariant input handling. Specifically, GNNs take graphs as inputs without imposing any order on the nodes. Khalil et al. [39] introduced a GNN model for solving several graph CO problems, including maximum cut problems, minimum vertex cover problems, and TSPs. The model, trained with RL, learns the embeddings of the nodes in the input problem instances, and then greedily selects nodes to construct a complete solution. It is also possible to integrate the node embeddings learned by GNNs into Pr-Net, as shown by Ma et al. [40]. Based on [39], subsequent works have extended GNNs to solve many other CO problems defined on graphs. For example, Li et al. [41] utilized GNNs to solve maximal independent set problems and maximal clique problems. Unlike TSPs, for these problems the goal is not to find a permutation of nodes but a subset of nodes. Hence, instead of sequentially extending a solution, the authors used SL to train a graph convolutional network (GCN) to directly output an estimate of the probability of selecting each node and then utilized a guided tree search to construct a feasible solution based on these estimates. A similar approach was proposed by Joshi et al. [30], where the authors trained a GNN by SL to predict the probability of an edge being in the final TSP solution and then constructed a feasible tour by beam search.

4) Discussion

Owing to their sequential encoding-decoding frameworks, Pr-Net-based approaches and transformer-based approaches are intrinsically suitable for handling permutation-based problems (e.g., TSPs and VRPs), where the order of node selection forms the problem solution. Between these two types of approaches, transformer-based approaches achieve better performance, mainly due to their advanced multi-head attention mechanism. In contrast, GNN-based approaches are suitable for handling CO problems defined on graphs and place no requirement on the sequential characteristics of the problems. Overall, compared to LIH and LHS, LCH requires the least expert knowledge about the problems to be solved and therefore has the most potential to become a domain-independent solver design framework. However, with respect to obtaining high-quality solutions, current solvers learned by LCH approaches still perform worse than those learned by LIH approaches [21], [42] and LHS approaches [43].

B. Learning Improvement Heuristics

Unlike LCH approaches that learn models to sequentially extend a partial solution for a given problem instance, LIH approaches seek to learn a policy that manipulates local search operators to improve a given solution. Nonetheless, the encoder/decoder models used by LIH approaches are still similar to those used by LCH approaches. In an early work, Chen and Tian [19] proposed learning two models to control the 2-opt operator, which is a


conventional move operator for VRPs. Specifically, the region-picking model selects a fragment of a solution to be improved and the rule-picking model selects a rewriting rule to be applied to that region. Both models are trained by RL and the solution is improved continuously until it converges. Many subsequent works have improved upon [19] to achieve better solution quality. For example, da Costa et al. [44] proposed learning separate embeddings for nodes and edges in the solution; Wu et al. [42] simplified the approach by learning only one model to select the node pairs to which the 2-opt move operator is applied. Another notable work is that of Lu et al. [45]. Unlike previous approaches, the authors of [45] proposed learning a model to control several different move operators and applied a random permutation operator to the solution if the quality improvement did not reach a threshold. Finally, motivated by the circularity and symmetry of VRP solutions (i.e., cyclic sequences), Ma et al. [21] proposed a cyclic positional encoding mechanism to learn embeddings for positional features, which are independent of the node embeddings. The decoder and the employed move operators are similar to those of [42]. The resultant LIH approach, called DACT, has achieved superior performance in solving VRPs compared with other LCH and LIH approaches. Overall, compared to LCH, LIH integrates more expert knowledge about the problems (move operators) and can achieve better solution quality than the former. On the other hand, the application scope of LIH approaches is inevitably limited by the operators they integrate. For example, the abovementioned approaches cannot be applied to CO problems without sequential characteristics (e.g., maximum cut problems and minimum vertex cover problems) because their move operators are inapplicable to these problems. Moreover, because the solvers learned by LIH approaches employ an iterative local search procedure, they consume more computation time than the solvers learned by LCH approaches [21].
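For readers unfamiliar with the 2-opt operator referenced throughout this subsection, the following short Python sketch (illustrative, with names of our own choosing) shows the move itself and a plain hand-crafted improvement loop around it; an LIH approach such as [42] instead learns which node pair (i, k) to try at each step rather than scanning all pairs.

```python
import numpy as np

def tour_length(coords, tour):
    return sum(np.linalg.norm(coords[tour[i]] - coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt_move(tour, i, k):
    """2-opt move: reverse the tour segment between positions i and k."""
    return tour[:i] + tour[i:k + 1][::-1] + tour[k + 1:]

def two_opt_improve(coords, tour):
    """Hand-crafted improvement heuristic: keep applying improving 2-opt moves
    (greedy acceptance) until no improving move is found."""
    best, best_len = list(tour), tour_length(coords, tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for k in range(i + 1, len(best)):
                cand = two_opt_move(best, i, k)
                cand_len = tour_length(coords, cand)
                if cand_len < best_len - 1e-12:
                    best, best_len, improved = cand, cand_len, True
    return best

coords = np.random.rand(30, 2)
print(tour_length(coords, two_opt_improve(coords, list(range(30)))))
```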

C. Learning Hybrid Solvers

As aforementioned, LHS approaches seek to learn solvers that are hybrids of learning models and traditional solvers. It is worth mentioning that the integration of learning models (such as neural networks) into solvers is a long-standing research topic; for example, many studies have been conducted on integrating HNNs into evolutionary algorithms (EAs) [46], [47]. Below, the line of research on using DL and deep RL (DRL) to train such hybrid solvers is reviewed. One early example is the neural large neighborhood search (NLNS) of Hottung and Tierney [48], which integrates a learning model into the well-known large neighborhood search (LNS) algorithm. The use of extended/large neighborhood structures has been widely shown to be effective for obtaining high-quality solutions to CO problems [49], [50]. LNS [51] is a typical algorithmic framework that follows this idea: it explores the solution space by iteratively applying destroy-and-repair operators to a starting solution, and it has exhibited strong performance on a number of VRP variants.
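As a point of reference for the destroy-and-repair loop described above, here is a small self-contained Python sketch of plain LNS on a TSP tour (our own illustrative implementation, not the NLNS code): the destroy step removes random nodes and the repair step re-inserts them greedily at the cheapest position. NLNS-style hybrids replace the hand-crafted repair (or, in the variants of [52], [53], the destroy) operator with a learned model.

```python
import numpy as np

def tour_cost(coords, tour):
    return sum(np.linalg.norm(coords[tour[i]] - coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def destroy(tour, k, rng):
    """Destroy operator: remove k randomly chosen nodes from the tour."""
    removed = set(rng.choice(tour, size=k, replace=False).tolist())
    kept = [v for v in tour if v not in removed]
    return kept, list(removed)

def repair(coords, kept, removed):
    """Repair operator: greedily insert each removed node at its cheapest position."""
    tour = list(kept)
    for v in removed:
        best_pos, best_delta = 1, float("inf")
        for i in range(len(tour)):
            a, b = tour[i], tour[(i + 1) % len(tour)]
            delta = (np.linalg.norm(coords[a] - coords[v])
                     + np.linalg.norm(coords[v] - coords[b])
                     - np.linalg.norm(coords[a] - coords[b]))
            if delta < best_delta:
                best_pos, best_delta = i + 1, delta
        tour.insert(best_pos, v)
    return tour

def lns(coords, tour, iters=200, k=5, seed=0):
    """Plain LNS: keep the destroyed-and-repaired solution whenever it improves."""
    rng = np.random.default_rng(seed)
    best, best_cost = list(tour), tour_cost(coords, tour)
    for _ in range(iters):
        cand = repair(coords, *destroy(best, k, rng))
        cand_cost = tour_cost(coords, cand)
        if cand_cost < best_cost:
            best, best_cost = cand, cand_cost
    return best

coords = np.random.rand(50, 2)
print(tour_cost(coords, lns(coords, list(range(50)))))
```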


NLNS uses an attention-based model trained with RL as the repair operator for LNS. Later, Chen et al. [52] and Gao et al. [53] introduced two different variants of NLNS. The former trains a hierarchical recursive GCN as the destroy operator, while the latter uses an element-wise GNN with edge embeddings as the destroy operator. Both approaches adopt a fixed repair operator that simply inserts the removed nodes into the solution according to the minimum-cost principle. In addition to LNS, another notable example is the Lin-Kernighan-Helsgaun (LKH) algorithm [5], which is widely recognized as a strong solver for TSPs. During the solution process, LKH iteratively searches for λ-opt moves based on a small candidate edge set to improve the existing solution. Zheng et al. [54] proposed training a policy that helps LKH select edges from the generated candidate set; however, the policy is trained for each instance instead of a set of instances. Later, Xin et al. [43] proposed training a GNN with SL to predict edge scores, based on which LKH can create the candidate edge set and transform edge distances to guide the search process. The resultant LHS approach, called NeuroLKH, has learned solvers that remarkably outperform the original LKH algorithm in obtaining high-quality solutions when solving TSPs and VRPs. Compared to LCH and LIH, LHS integrates the most expert knowledge (traditional solvers) about the problems and can obtain the best solution quality [43]. However, since LHS relies on existing solvers, its application scope is limited to problems for which strong solvers exist. Moreover, an LHS approach is generally tailored to a specific solver (e.g., NeuroLKH is tailored for LKH) and is difficult to extend to other solvers/problems.
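To give a feel for the interface such hybrids use, the sketch below (our own illustrative code, not NeuroLKH itself) turns a matrix of model-predicted edge scores into per-node candidate edge sets by keeping the top-k highest-scoring incident edges; restricting an LKH-style local search to such a candidate set is, at a high level, how the learned model steers the traditional solver.

```python
import numpy as np

def candidate_edge_sets(edge_scores, k=5):
    """Given a (symmetric) matrix of predicted edge scores, keep for each node the
    k highest-scoring incident edges as its candidate set."""
    scores = edge_scores.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)           # never propose self-loops
    return {i: list(np.argsort(-scores[i])[:k]) for i in range(scores.shape[0])}

rng = np.random.default_rng(0)
raw = rng.random((10, 10))
sym = (raw + raw.T) / 2                         # stand-in for a model's edge scores
print(candidate_edge_sets(sym, k=3))
```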

III. Comparative Studies

This section first explains the design principle and the overall framework of the comparative study, then elaborates on the details, and finally summarizes the main differences between this study and previous ones. Specifically, the whole study is designed to simulate two typical scenarios that arise in practice when a practitioner is faced with a CO problem to solve. In the first scenario, one is aware of the target problem instances that the solver is expected to solve and can collect sufficient training instances to represent them. As an illustrative example, consider a delivery company that needs to solve TSP instances for the same city on a daily basis, with only slight travel-time differences across the instances due to varying traffic conditions. In this example, one can use the accumulated instances to sufficiently represent the target use cases of the TSP solver. Now suppose that the company expands its delivery business to another city, which differs from the first in size, traffic conditions, and customer distribution. The decision maker of the company then faces the second scenario, in which information about the target use cases of the solver is unavailable; thus, the decision maker expects the solver to handle problem instances with a broad range of problem characteristics.

From the perspective of a computational study, the first scenario corresponds to the setting where the training instances and the testing instances have the same problem characteristics. NCO approaches are intrinsically appropriate for learning solvers in this scenario. On the other hand, traditional solvers can be directly applied to the testing instances or can first be tuned with the training instances and then tested. Furthermore, in this scenario practitioners are often concerned about the following aspects of solver performance.
1) Effectiveness: the extent to which the solver can solve the problem instances, generally measured by solution quality.
2) Efficiency: the computational resources (energy and computation time) consumed by the solver.
3) Stability: the extent to which the output of the solver is affected by its internal randomness.
4) Scalability: the problem sizes that the solver can handle. This is a natural performance consideration for traditional solvers. For NCO, it can easily be mixed up with generalization (see its definition below); more specifically, the scalability of an NCO approach refers to its ability to learn solvers as the problem size grows.
Unlike the first scenario, the second scenario corresponds to the experimental setting where the testing instances significantly differ from the training instances. In this scenario, practitioners expect the solvers to generalize well from training instances to unseen testing instances.
5) Generalization: how the learned solver performs on instances whose characteristics (e.g., problem sizes) differ from those of the training instances.
In the comparative study, NCO solvers and traditional solvers are evaluated in the above two scenarios. In particular, this work takes the TSP as the testbed problem to elaborate on the design of the experiments. As a conventional CO problem, the TSP has been studied for many years, and a number of strong traditional solvers have been developed for it [5], [6], [7], [8], [9]. More importantly, the TSP has been the testbed problem for nearly all leading architectures in NCO [18], [20], [21], [27], [28], [36], [43]; thus, the most recently proposed NCO approaches that have achieved strong performance can be included in the experiments, and the conclusions drawn from this study may also have strong implications for other CO problems. Since most NCO solvers are trained to handle EUC2D TSP instances, where the nodes are defined on a two-dimensional plane and the distance between two nodes is the same in both directions, this study also considers EUC2D TSP instances.

A. Overall Framework

The experiments aim to answer the following five research questions.
1) Q1: In the first scenario, how would the solvers perform on small-size problem instances?
2) Q2: In the first scenario, how would the solvers perform on medium/large-size problem instances?

3) Q3: In the second scenario, how would the solvers generalize over different problem types (i.e., characterized by node distributions)?
4) Q4: In the second scenario, how would the solvers generalize over different problem sizes?
5) Q5: In the second scenario, how would the solvers generalize over different problem types and sizes?
Specifically, the first two questions are concerned with the effectiveness, efficiency, stability, and scalability of the solvers in the first scenario, where the training instances and the testing instances have the same problem characteristics. The other three questions are concerned with the generalization ability of the solvers in the second scenario, where the training instances and the testing instances have different problem characteristics. Each of the above questions is investigated in a separate group of experiments, denoted as Exp_1/2/3/4/5. Note that throughout the experiments, training instances are only used for learning NCO solvers or tuning traditional solvers, and all the solvers are tested on the testing instances. The training/testing sets in each group of experiments, the compared methods, and the evaluation metrics are elaborated below.

B. Benchmark Instances

Two different sources of TSP instances were considered: data generation and existing benchmark sets. For data generation, the portgen generator, which has been used to create testbeds for the 8th DIMACS Implementation Challenge [4], and the ClusteredNetwork generator from the netgen R package [55] were used; an illustrative sketch of both kinds of generator is given at the end of this subsection.
1) The portgen generator generates a TSP instance (called a rue instance) by uniformly and randomly placing points on a two-dimensional plane.
2) The ClusteredNetwork generator generates a TSP instance (called a clu instance) by randomly placing points around different central points.
Three benchmark sets, i.e., TSPlib [56], VLSI, and National, were used (all three are available at http://www.math.uwaterloo.ca/tsp/data):
1) TSPlib: a widely used benchmark set of instances drawn from industrial applications and geographic problems featuring the locations of cities (nodes) on maps.
2) VLSI: a benchmark set of instances extracted from the very-large-scale integration design data of the Bonn Institute.
3) National: a benchmark set of instances extracted from the maps of different countries.
For all the generated instances, Concorde [57], an exact TSP solver (available at https://www.math.uwaterloo.ca/tsp/concorde.html), was used to obtain their optimal solutions. For the instances belonging to the existing benchmark sets, their optimal solutions or best-known solutions (in case the optimal solutions are unknown) were collected and used. Based on the above data generation/collection procedure, the training/testing sets in each of the five groups of experiments were constructed as follows (also summarized in Table I):



TABLE I The training sets and the testing sets (separated by "|") in the five groups of experiments. Each entry reads Training Set | Testing Set. "Exist. Bench." refers to the testing set containing 30 instances selected from the existing benchmark sets.

SCENARIO 1 (the training instances and the testing instances have the same problem characteristics):
  EXP_1: rue-50 | rue-50;  clu-50 | clu-50;  rue-100 | rue-100;  clu-100 | clu-100
  EXP_2: rue-500 | rue-500;  clu-500 | clu-500;  rue-1000 | rue-1000;  clu-1000 | clu-1000

SCENARIO 2 (the training instances and the testing instances differ in problem types, problem sizes, or both):
  EXP_3: rue-100, mix-100 | clu-100;  clu-100, mix-100 | rue-100;  rue-1000, mix-1000 | clu-1000;  clu-1000, mix-1000 | rue-1000
  EXP_4: rue-50 | rue-100;  clu-50 | clu-100
  EXP_5: rue-1000 | Exist. Bench.;  clu-1000 | Exist. Bench.;  mix-1000 | Exist. Bench.

1) Exp_1: Two problem sizes (50 and 100) and two problem types (rue and clu) were considered, producing four combinations, denoted as rue/clu-50/100. For each of them, following common practice in NCO [18], [20], [21], one million training instances and 10000 testing instances were generated. For the clu instances, the number of clusters was set to n/10, where n is the problem size.
2) Exp_2: The procedure for constructing the training/testing sets was exactly the same as in Exp_1, except that the considered problem sizes were 500/1000 and the testing set size was 1000. Besides, for the clu instances, the number of clusters was set to n/100.
3) Exp_3: Unlike Exp_1 and Exp_2, in this experiment the testing instances differed from the training instances in problem type. Specifically, two problem sizes (100 and 1000) and two problem types (rue and clu) were considered. The solvers learned on the training set of rue-100/1000 instances were tested on the testing set of clu-100/1000 instances, and vice versa. Moreover, in addition to the rue and clu training sets, another training set called mix was also used, which contained half rue instances and half clu instances.
4) Exp_4: In this experiment, the testing instances differed from the training instances in problem size. Specifically, two problem sizes (50 and 100) and two problem types (rue and clu) were considered. The solvers learned on the rue-50 training instances and the clu-50 training instances were tested on the rue-100 testing instances and the clu-100 testing instances, respectively.
5) Exp_5: In this experiment, problem instances selected from the TSPlib, VLSI, and National benchmark sets were


used as the testing instances. Specifically, 10 instances were selected from each of these three sets, with problem sizes distributed between 1000 and 10000. In addition, three training sets were considered in this experiment, i.e., rue-1000, clu-1000, and mix-1000.
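The portgen and ClusteredNetwork generators themselves were used to build the benchmark sets above; the following NumPy sketch only mimics their two sampling schemes (uniform versus clustered points in the unit square) as a quick, self-contained stand-in. The function names and the spread parameter are our own illustrative choices, not the actual generator code.

```python
import numpy as np

def generate_rue(n, rng):
    """Uniform random points in the unit square (portgen-style 'rue' stand-in)."""
    return rng.random((n, 2))

def generate_clu(n, n_clusters, rng, spread=0.02):
    """Clustered points around random centres (ClusteredNetwork-style 'clu' stand-in)."""
    centres = rng.random((n_clusters, 2))
    labels = rng.integers(0, n_clusters, size=n)
    return np.clip(centres[labels] + rng.normal(scale=spread, size=(n, 2)), 0.0, 1.0)

rng = np.random.default_rng(42)
rue_100 = generate_rue(100, rng)
clu_100 = generate_clu(100, n_clusters=100 // 10, rng=rng)   # n/10 clusters, as in Exp_1
```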

C. Compared Methods

Table II lists all the competitors in the experiments. For each of the three types of NCO approaches, a recently proposed approach that has achieved strong performance was considered. Specifically, POMO [20], DACT [21], and NeuroLKH [43] were the considered approaches for LCH, LIH, and LHS, respectively. According to the results reported in [20] and [21], POMO and DACT achieve their best performance with an extra instance augmentation mechanism; thus, these variants of POMO and DACT were also considered in the experiments. All the hyper-parameters of these approaches were set as reported in their original papers, except that in the experiments their batch sizes were always tuned to fully utilize the GPU memory. Regarding traditional solvers, in addition to LKH (version 3.0) [5], which is widely adopted in NCO works, this study included two other meta-heuristic solvers, EAX [7] and MAOS [58], in the experiments. EAX is a genetic algorithm equipped with a powerful edge assembly crossover; it has been shown to outperform LKH in solving a broad range of TSP instances [7]. MAOS is a strong swarm intelligence-based TSP solver that does not contain any explicit local search heuristic. The parameters of LKH, EAX, and MAOS were kept at their default values in the experiments. Moreover, LKH has many parameters whose values may significantly affect its performance; it is thus possible to tune these parameters on a training set to achieve better performance. Hence, in the experiments, tuned variants of LKH obtained by using the general-purpose automatic algorithm configuration tool SMAC [24] were also considered. Generally, the computation time needed by SMAC to tune LKH was much shorter than that needed by the NCO approaches to train their solvers.
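SMAC itself is a model-based configurator; as a loose, illustrative stand-in for what "tuning LKH on a training set" means, the sketch below runs a plain random search over a hypothetical parameter space and keeps the configuration with the best score. The parameter names and the placeholder evaluation function are our own assumptions, not SMAC's API or LKH's actual configuration space.

```python
import random

def random_search_tune(param_space, evaluate, budget=50, seed=0):
    """Toy configurator: sample random configurations and keep the best one
    according to `evaluate` (e.g., mean optimum gap on the training instances)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical parameter space and a placeholder evaluator; a real setup would run
# the configured solver on the training instances and average the optimum gaps.
space = {"MAX_TRIALS": [100, 1000, 10000], "MOVE_TYPE": [3, 5], "RUNS": [1, 5, 10]}
best_cfg, best_score = random_search_tune(space, evaluate=lambda cfg: random.random())
print(best_cfg, best_score)
```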

TABLE II The competitors in the experiments.

  METHOD          TYPE
  POMO [20]       Learning Constructive Heuristics (LCH)
  DACT [21]       Learning Improvement Heuristics (LIH)
  NeuroLKH [43]   Learning Hybrid Solvers (LHS)
  LKH [5]         Traditional Solver
  EAX [7]         Traditional Solver
  MAOS [58]       Traditional Solver
  LKH (tuned)     Tuned Traditional Solver

Note that Concorde was not included in the comparison because it needs to run for prohibitively long periods of time to solve the very-large-size problem instances (e.g., those larger than 5000). Besides, LKH, EAX, and MAOS could achieve solution quality very close to that of Concorde, while consuming much less computation time.

D. Evaluation Metrics

The testing results of the solvers are reported in terms of three metrics, i.e., optimum gap, computation time, and energy. For all three metrics, smaller values are better. Specifically, the optimum gap is defined as (Q - Q*)/Q*, where Q is the length of the tour found by the solver and Q* is the optimal tour length. The computation time of a solver is the time it takes to solve all the instances in the testing set. Note that NCO solvers naturally benefit from running on massively parallel hardware architectures, i.e., GPUs, while in previous comparative studies [17], [18], [20], [21], [28] traditional solvers were generally run on CPUs using a single thread. To conduct fair comparisons, in the experiments the traditional solvers and their tuned variants were also run on k CPU threads to solve k problem instances in parallel (k = 32 on our reference machine). Nevertheless, it is noted that the different programming languages adopted by NCO solvers and traditional solvers also affect their runtime, and this cannot be avoided in our experiments. Specifically, NCO solvers are usually implemented in Python, which mixes inefficient interpreted code with efficient DL libraries (e.g., those utilizing GPUs), whereas traditional solvers are typically implemented in highly efficient languages such as C/C++/Java. Currently, how to remove the influence of different programming languages when comparing NCO solvers and traditional solvers is still an open question. Finally, the energy metric is the electrical energy consumed by a solver to solve all the instances in the testing set, which is particularly useful in resource-limited settings such as embedded devices. In the experiments, the open-source PowerJoular tool (available at https://github.com/joular/powerjoular) was used to record the power consumption of the solvers. When testing the solvers, to prevent them from running for prohibitively long periods of time, the maximum runtime for solving a testing instance was set to 3600 seconds. If a solver exhausted its time budget, it was terminated immediately and the best solution found so far was returned. Note that some tested solvers (e.g., EAX and LKH) involve randomized components. In the experiments, these solvers were applied to each testing instance for 10 runs. Then, the mean value as well as the standard deviation of the optimum gaps over the 10 runs were recorded, and these were further averaged over all the testing instances to obtain the average optimum gap and the average standard deviation on the whole testing set. All the experiments were conducted on a server with an Intel Xeon Gold 6240 CPU (2.60 GHz, 24.75 MB of cache), an NVIDIA TITAN RTX GPU (24 GB of video memory), and 377 GB of RAM, running Ubuntu 18.04.
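As a small worked example of the reporting protocol just described (our own illustrative code; the numbers are made up), the snippet below computes per-run optimum gaps and then the per-instance mean and standard deviation, which are finally averaged over the testing set.

```python
import numpy as np

def optimum_gap(tour_len, opt_len):
    """Optimum gap (Q - Q*) / Q*, expressed as a percentage."""
    return 100.0 * (tour_len - opt_len) / opt_len

def aggregate(per_instance_runs, optima):
    """per_instance_runs[i] holds the tour lengths of the repeated runs on instance i;
    optima[i] is its optimal tour length. Returns the average optimum gap and the
    average standard deviation over the testing set."""
    gaps = [np.array([optimum_gap(l, opt) for l in runs])
            for runs, opt in zip(per_instance_runs, optima)]
    return float(np.mean([g.mean() for g in gaps])), float(np.mean([g.std() for g in gaps]))

# Two toy instances, three runs each (the study uses 10 runs per instance).
runs = [[101.2, 100.8, 101.0], [100.1, 100.3, 100.2]]
print(aggregate(runs, optima=[100.0, 100.0]))
```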


The complete experimental results, benchmark instances, NCO solvers, traditional solvers and their tuned variants, and the code for training/tuning the solvers are available at https://github.com/yzhanggh/benchmarking-tsp.

E. Main Differences From Previous Comparative Studies

In general, the experimental protocol established above could be used as a standard protocol for benchmarking NCO approaches. More specifically, our comparative study differs from the studies presented in previous NCO works [17], [18], [20], [21], [28], [43] in the following aspects.
1) Regarding benchmark instances, in the NCO literature it is common to use the rue type of instances as both training and testing instances, and some studies used TSPlib to assess the generalization ability of their learned solvers. Compared to them, this study used three more types of problem instances (clu, National, and VLSI) in the experiments, leading to testbed problems with much more diverse characteristics. Moreover, the considered problem sizes ranged from 50 to 10000, much larger than those used in previous NCO works.
2) Regarding traditional solvers, in addition to the LKH solver widely adopted in NCO works, this study included two other strong solvers (EAX and MAOS) in the comparison to fully represent the state-of-the-art TSP solvers. To assess the potential of traditional solvers, this study also considered tuning their parameters, which to the best of our knowledge has never been considered in existing NCO works.
3) This study investigated five different performance aspects and introduced a new efficiency metric, i.e., electric energy consumption, which could be particularly useful in energy-limited environments. Besides, to conduct fair comparisons in terms of time efficiency, all the NCO solvers and traditional solvers (and their tuned variants) were tested in parallel mode to make full use of our reference machine. In comparison, previous comparative studies often tested traditional solvers on CPUs using a single thread.

IV. Experimental Results and Analysis

This section first presents the main findings drawn from the experiments and then analyzes the results of each group of experiments in detail.

A. Main Findings

Overall, four main findings can be drawn from the experimental results. First, for all the TSP problem sizes and types considered in the experiments, traditional solvers still significantly outperformed NCO solvers in finding high-quality solutions (Sections IV-B, IV-C, and IV-D). Among NCO solvers, the hybrid solvers trained by LHS approaches could find much better solutions than the learned constructive and improvement heuristics. In other words, the more expert knowledge was integrated into an NCO solver, the better it solved the problems. Hence, it appears that the state of research in this area has not yet reached that of domains such as vision, speech, and natural language processing, where DL can learn strong models from scratch.


Second, due to their simple solving strategy (i.e., sequentially constructing a solution) and massively parallel computing mode, a major potential benefit of NCO solvers (i.e., the constructive heuristics learned by LCH approaches) is their superior efficiency (in terms of both time and energy). In particular, on small-size randomly generated problem instances, the computational resources consumed by the learned heuristics were usually at most one-tenth of the resources consumed by traditional solvers (Section IV-B), where the latter were terminated once they achieved the same solution quality as that of the former.

Third, current LCH and LIH approaches are not suitable for handling large-size problem instances (Section IV-C) and structural problem instances (Sections IV-B and IV-C), e.g., the clu type of TSP instances. Fourth, parameter tuning can significantly boost the performance of traditional solvers in terms of solution quality while maintaining efficiency (Section IV-D). However, when the training instances had different problem characteristics (problem types and sizes) from those of the testing instances, both NCO solvers and tuned traditional solvers exhibited performance degradation (Sections IV-D

TABLE III Testing results of Exp_1/2. For each metric, the best performance is indicated in gray. LKH* and EAX* refer to the variants of LKH and EAX, respectively, which were terminated once they achieved the same solution quality as that of POMO solver.


and IV-E), and NCO solvers suffered from far more severe performance degradation.

B. Exp_1: Small-Size Testing Instances With the Same Problem Characteristics as Training Instances

The testing results of Exp_1 in terms of average optimum gap (Gap), standard deviation (std), total computation time, and energy are reported in Table III. In Exp_1, the results of DACT with the instance augmentation mechanism (denoted by aug.) on the rue-100 and clu-100 testing instances are missing because it would have run for a prohibitively long time on these instances. Besides, the tuned variant of LKH, i.e., LKH (tuned), was not included in Exp_1 because the original LKH already achieves nearly optimal solution quality. In addition, the medians and variances of the optimum gaps across all the testing instances are visualized by box plots in Figure 2. For brevity, the name of an NCO approach is used to denote the solvers learned by it. The first observation from these results is that traditional solvers still achieved much better solution quality than the learned solvers. Notably, EAX and MAOS solved all the testing instances to optimality. Among all the randomized solvers, i.e., DACT, NeuroLKH, LKH, EAX, and MAOS, EAX and MAOS also exhibited the best stability: they achieved the smallest standard deviation over the 10 repeated runs. The second observation is that, after EAX and MAOS, NeuroLKH was the third best-performing solver. Compared to LKH, NeuroLKH reduced the average optimum gap by one order of magnitude on three out of the four testing sets. Based on Figure 2, one can also observe that NeuroLKH achieved more stable performance than LKH across the testing instances. Compared to the other two NCO solvers, POMO and DACT,

the performance advantages of NeuroLKH in terms of solution quality were much more significant. In general, NeuroLKH could reduce the average optimum gaps by at least two orders of magnitude on all four testing sets. Although the performance of POMO and DACT could be improved when equipped with the instance augmentation mechanism, they still performed worse than NeuroLKH. The third observation is that NCO solvers generally achieved better solution quality on the rue instances than on the clu instances. For example, the average optimum gap achieved by POMO on the clu-50 testing instances was 14.18% greater than that obtained on the rue-50 testing instances, and the corresponding numbers for DACT and NeuroLKH were 9.60 times and 33.33%, respectively. Moreover, as the problem size grew, such performance gaps became larger. These results show that current NCO approaches are less adept at learning solvers for structural problem instances (i.e., clustered TSP instances) than for uniformly and randomly generated instances, indicating that the learning models adopted by them may have limitations in handling structural data. This could be an important direction for improving NCO approaches. The fourth observation is that, in Figure 2, as the problem size grew, the performance of POMO and DACT significantly deteriorated, while the performance of NeuroLKH remained stable. These results indicate that the scalability of LCH and LIH approaches is currently still quite limited. The last observation is that, regarding efficiency, POMO exhibited excellent performance in terms of both runtime and energy. Notably, it usually consumed at most one-tenth of the resources consumed by the other solvers, which could be very useful in resource-limited environments. This is also true when EAX and LKH were terminated at the solution quality achieved by the POMO solver (marked by LKH* and EAX* in

FIGURE 2 Visual comparison in box plots of the optimum gaps achieved by the tested solvers in Exp_1.

FIGURE 3 Visual comparison in box plots of the optimum gaps achieved by the tested solvers in Exp_2.


FIGURE 4 Visual comparison in box plots of the optimum gaps achieved by the tested solvers in Exp_3. Each learned/tuned solver is marked with the corresponding problem type of the training instances.

Table III). On the other hand, considering the poor scalability of POMO, its high efficiency is still limited to small-size problem instances. It is also found that another NCO approach, DACT, performed poorly in terms of efficiency, especially when equipped with the instance augmentation mechanism. Finally, NeuroLKH could improve the efficiency of the original LKH in terms of both runtime and energy, and, in general, the efficiency of EAX was slightly worse than that of the LKH-family solvers.

C. Exp_2: Medium/Large-Size Testing Instances With the Same Problem Characteristics as Training Instances

Similar to Exp_1, the testing results of Exp_2 are reported in Table III and illustrated in Figure 3. The main difference between Exp_1 and Exp_2 is that the latter considered much larger problem sizes. In Exp_2, POMO and DACT were not tested due to their poor scalability. The first observation from these results is that, overall, EAX is still the best-performing solver in terms of solution quality. Nevertheless, the tuned variant of LKH outperformed EAX on the rue-500 testing instances. Moreover, it outperformed MAOS on three out of the four testing sets, i.e., rue/clu-500 and rue-1000, while the original LKH fell behind MAOS on all four testing sets. Compared to the original LKH, the tuned variant of LKH could reduce the average optimum gaps by two orders of magnitude on three testing sets and by at least 50% on the remaining set. Based on Figure 3, it can also be observed that the tuned variant of LKH performed much more stably across the testing instances than LKH. It is worth mentioning that such performance improvement did not come at the cost of degraded efficiency: overall, the tuned variant of LKH and the original LKH performed competitively in terms of time efficiency and energy efficiency. These results indicate that traditional solvers can benefit greatly from parameter tuning, and this should be utilized when a sufficient training set is available. The second observation is that although NeuroLKH could also achieve better solution quality than LKH, the former consumed much more computation time and energy than the latter. This is particularly evident on the clu testing instances. Such a phenomenon once again implies that current NCO approaches may have limitations in handling structural data. Taking a closer look at Figure 3, one can observe that on the clu testing instances, for LKH, parameter tuning could achieve greater performance improvement than NCO (i.e., NeuroLKH). This may be because parameter tuning can change the behavior of LKH to a greater extent than NeuroLKH can (the latter only modifies the candidate edge set in LKH), eventually leading to a better fit to the specific instance distribution. Such results also suggest an important future research direction of combining parameter tuning and NCO to achieve more comprehensive control over the behavior of traditional solvers.

D. Exp_3: Testing Instances With Different Node Distributions From Training Instances

The medians and variances of the optimum gaps across all the testing instances in Exp_3 are illustrated by box plots in Figure 4. Note that DACT was unable to converge on the mix training set, and thus the corresponding testing performance is not reported. Recalling that Exp_3 was designed to assess the generalization ability of the learned/tuned solvers over different problem types, the first observation from Figure 4 is that when a learned solver was applied to testing instances of a different type from its training instances, its performance degraded. For example, on the rue-100 testing instances, the POMO solver trained on the rue-100 training instances performed better than the one trained on the clu-100 training instances, and the results were the opposite on the clu-100 testing instances. Based on the last two plots in Figure 4, one can observe that this was also true for NeuroLKH and the tuned variants of LKH. The second observation is that when the mix training set was used, the learned/tuned solver still could not obtain the best possible performance. For example, on the rue-100 testing instances, the POMO solver trained on the mix-100 instances obtained an average optimum gap of 0.2420%, which is better than the one trained on the clu-100 instances (0.9234%) but still worse than the one trained on the rue-100 instances (0.1278%). These results indicate that the adopted learning models may not have sufficient capacity to simultaneously handle different problem types, which could be a direction for improving NCO approaches.


E. Exp_4: Testing Instances With Different Problem Sizes From Training Instances

Table IV presents the testing results of Exp_4 in terms of average optimum gap. Recalling that Exp_4 was designed to assess the generalization ability of the learned solvers over problem sizes, in Table IV each learned solver is marked with the corresponding problem size of its training instances.

TABLE IV Testing results of Exp_4. Each learned solver is marked with the corresponding training set.

Testing set: rue-100
  METHOD (TRAINING SET)    GAP (%) ± STD (%)
  POMO (rue-50)            0.6703 ± 0.0000
  POMO (rue-100)           0.1278 ± 0.0000
  DACT (rue-50)            27.5437 ± 31.4449
  DACT (rue-100)           0.6596 ± 0.5216

Testing set: clu-100
  METHOD (TRAINING SET)    GAP (%) ± STD (%)
  POMO (clu-50)            0.6829 ± 0.0000
  POMO (clu-100)           0.1405 ± 0.0000
  DACT (clu-50)            21.4630 ± 5.3292
  DACT (clu-100)           1.2220 ± 0.4773

The main observation is that when the solvers learned by POMO and DACT were applied to testing instances of larger sizes than their training instances, their performance seriously degraded. For example, on the rue-100 testing instances, the DACT solver trained on the rue-50 training instances could only obtain an average optimum gap of about 27.5%, which is generally an unacceptable level of solution quality for the TSP. Such results demonstrate that although the learning models adopted by POMO and DACT can process variable-length inputs, this does not mean that the solvers trained by them naturally generalize to larger problem sizes.

F. Exp_5: Testing Instances With Different Node Distributions and Problem Sizes From Training Instances

Recall that Exp_5 was designed to assess the ability of the learned/tuned solvers to generalize from generated instances to real-world instances, where the latter differed from the former in both problem size and problem type. Specifically, only the LKH-family solvers and EAX were tested in Exp_5 due to the large problem sizes. Three training sets were used in Exp_5, i.e., rue-1000, clu-1000, and mix-1000. Based on each training set, a tuned variant of LKH and a NeuroLKH solver were obtained. Then, for each testing instance, each solver was applied for 10 runs. Table V presents the testing results in terms of the number of times the optimal solution was found among the 10 runs, the average optimum gap, and the average computation time. The first observation from Table V is that the best-performing solver is EAX. It achieved the highest number of successes on 22 out of the 30 testing instances, far more than the second best-performing solver. In particular, on the testing instances belonging to the National benchmark set, the performance gap between the LKH-family solvers and EAX is significant, indicating that LKH may be intrinsically limited in solving this type of instance. The second observation is that among all the LKH-family solvers, the solver learned by NeuroLKH on the mix-1000 training instances succeeded more times than the tuned variants of LKH and the original LKH. This may be because the mixed training set covers more cases of possible TSP instances than the pure rue or clu training sets, finally leading to better generalization. On the other hand, based on the same training set, the NeuroLKH solver consistently performed

better than the tuned variant of LKH. For example, the variant of LKH tuned on the mix-1000 training instances obtained fewer successes than the solver learned by NeuroLKH on the mix-1000 training instances, and was in fact the LKH-family solver with the fewest successes overall. These results indicate that the parameter tuning process may overfit LKH to the training instances more easily than the NCO approach NeuroLKH does. In summary, for real-world TSP instances that are unknown in advance, EAX is an appropriate default solver to use; to learn/train a TSP solver for such instances, NeuroLKH seems to be a better option than parameter tuning.

G. Learning Curves of NCO Solvers

It is meaningful to investigate the training phases of NCO solvers, since they have a significant impact on the solvers' performance. Figure 5 illustrates the learning curves of POMO solvers for rue/clu-50/100. After each training epoch, the POMO solver was evaluated on a validation set of 10000 problem instances, and the average tour length of the obtained TSP solutions is plotted. Moreover, the average tour length of the optimal solutions and the GPU hours consumed for training the POMO solvers are also shown in Figure 5. Figure 6 illustrates the learning curves of NeuroLKH for rue/clu-500. Note that for NeuroLKH, the trained model is used to generate a candidate edge set for LKH, not to directly solve the problem instances; thus, in Figure 6 the training loss is plotted. From these results, three observations can be made. First, the validation performance of POMO solvers gradually improved as the training epochs increased and eventually converged. Second, as the problem size increased from 50 to 100, the learning curves of POMO solvers converged more slowly, and the final optimum gaps became larger. This echoes the previous finding that the learning capability of POMO solvers is not sufficient for handling large-size problems. Finally, the training time of NCO solvers varied from several GPU hours to several GPU days, which is generally acceptable.

V. Conclusion

The applications of neural networks to solve CO problems have been studied for decades (starting from HNN-based works [31], [32]), and recently a subfield known as neural combinatorial optimization (NCO) has emerged rapidly. This work highlighted several issues exhibited by the comparative studies in the existing NCO works and presented an in-depth


TABLE V Testing results of Exp_5. Each cell contains three values, i.e., number of successes, average optimum gap (%), and computation time (s). For each testing instance, the highest number of successes is indicated in gray, and the highest number of successes achieved among all LKH variants (including NeuroLKH and the tuned variants of LKH) is boxed.


FIGURE 5 Learning curves of POMO solvers for rue-50/100 and clu-50/100.

comparative study of traditional solvers and NCO solvers on TSPs. An evaluation protocol driven by five research questions was established, which could be used as a basis for benchmarking NCO approaches against others on more CO problems. Specifically, two practical scenarios, categorized by whether one could collect sufficient training instances to represent the target cases of the problem, were considered. Then, the performance of the solvers was compared in terms of five critical aspects in these scenarios, i.e., effectiveness, efficiency, stability, scalability, and generalization ability. Five different problem types with node numbers ranging from 50 to 10000 were used as the benchmark instances in the experiments. Based on the experimental results, it is found that, in general, NCO solvers were still dominated by traditional solvers in nearly all performance aspects. A potential benefit of NCO solvers might be their high efficiency (in terms of both time and energy) on small-size problem instances. It is also found that, for NCO approaches, a crucial assumption is that the training instances should sufficiently represent the target cases of the problem; otherwise, the trained solvers would exhibit severe performance degradation on the testing instances.

However, in many real-world applications, one can only collect a limited number of problem instances [59], [60], or the accumulated instances are outdated and cannot effectively reflect the current properties of the problem [61], [62], [63]. In these cases, collecting a good training instance set can take a significant amount of time and may even be impossible, which might reduce the potential advantage of NCO approaches. As shown in the experiments, NCO faces several challenges that need to be addressed in the future, and several potential research directions are suggested.
1) Development of novel architectures or training algorithms to better handle structural problem instances.
2) Enhancement of current NCO approaches to learn solvers that perform well on large-size problem instances and on multiple (not just one) problem types.
3) Hybridization of parameter tuning and NCO to achieve more comprehensive control over the behavior of traditional solvers, hopefully leading to even better performance.
Finally, it is worth mentioning that the merits of a CO solver can always be considered from two different perspectives. The first is its strength, i.e., how well it can solve a particular CO problem, which is exactly the perspective adopted by this work. The second is its generality, i.e., how many different CO problems it can be used to solve. Recent studies have extended NCO with unified DNN models to many different CO problems [39], [41], for which there might be no specialized solvers such as LKH and EAX. Hence, NCO is a potential route toward general-purpose CO solvers, and a systematic evaluation of the generality of NCO approaches is a promising direction for future research.

Acknowledgment

FIGURE 6 Learning curves of NeuroLKH for rue-500 and clu-500 in terms of training loss. The GPU hours consumed for training are also illustrated.

This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFA1004102, in part by the National Natural Science Foundation of China under Grant 62250710682, and in part by the National Natural Science Foundation of China under Grant 62272210.


References

[1] B. H. Korte and J. Vygen, Combinatorial Optimization. Berlin, Germany: Springer, 2011.
[2] R. M. Karp, "Reducibility among combinatorial problems," in Proc. Symp. Complexity Comput. Comput., 1972, pp. 85–103.
[3] J. Puchinger and G. R. Raidl, "Combining metaheuristics and exact algorithms in combinatorial optimization: A survey and classification," in Proc. Artif. Intell. Knowl. Eng. Appl.: Bioinspired Approach, 1st Int. Work-Conf. Interplay Between Natural Artif. Comput., 2005, pp. 41–53.
[4] G. Gutin and A. P. Punnen, The Traveling Salesman Problem and Its Variations. Berlin, Germany: Springer, 2006.
[5] K. Helsgaun, "An effective implementation of the Lin–Kernighan traveling salesman heuristic," Eur. J. Oper. Res., vol. 126, no. 1, pp. 106–130, 2000.
[6] K. Helsgaun, "General k-opt submoves for the Lin–Kernighan TSP heuristic," Math. Program. Comput., vol. 1, no. 2–3, pp. 119–163, 2009.
[7] Y. Nagata and S. Kobayashi, "A powerful genetic algorithm using edge assembly crossover for the traveling salesman problem," INFORMS J. Comput., vol. 25, no. 2, pp. 346–363, 2013.
[8] Y. Nagata, "Population diversity measures based on variable-order Markov models for the traveling salesman problem," in Proc. 14th Int. Conf. Parallel Problem Solving Nature, 2016, pp. 973–983.
[9] É. D. Taillard and K. Helsgaun, "POPMUSIC for the travelling salesman problem," Eur. J. Oper. Res., vol. 272, no. 2, pp. 420–429, 2019.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1106–1114.
[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. Int. Conf. Learn. Representations, 2015.
[12] D. Silver et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[13] Y. Bengio, A. Lodi, and A. Prouvost, "Machine learning for combinatorial optimization: A methodological tour d'horizon," Eur. J. Oper. Res., vol. 290, no. 2, pp. 405–421, 2021.
[14] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," in Workshop Track Proc. ICLR, 2017.
[15] Y. Kwon, J. Choo, I. Yoon, M. Park, D. Park, and Y. Gwon, "Matrix encoding networks for neural combinatorial optimization," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 5138–5149.
[16] L. Perron and V. Furnon, "OR-Tools," Version v9.6, 2019. [Online]. Available: https://developers.google.com/optimization/
[17] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takac, "Reinforcement learning for solving the vehicle routing problem," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 9839–9849.
[18] W. Kool, H. van Hoof, and M. Welling, "Attention, learn to solve routing problems!," in Proc. Int. Conf. Learn. Representations, 2019.
[19] X. Chen and Y. Tian, "Learning to perform local rewriting for combinatorial optimization," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 6281–6292.
[20] Y. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, and S. Min, "POMO: Policy optimization with multiple optima for reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 21188–21198.
[21] Y. Ma et al., "Learning to iteratively solve routing problems with dual-aspect collaborative transformer," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 11096–11107.
[22] L. Accorsi, A. Lodi, and D. Vigo, "Guidelines for the computational testing of machine learning approaches to vehicle routing problems," Oper. Res. Lett., vol. 50, no. 2, pp. 229–234, 2022.
[23] S. Liu, K. Tang, and X. Yao, "Automatic construction of parallel portfolios via explicit instance grouping," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 1560–1567.
[24] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in Proc. Int. Conf. Learn. Intell. Optim., 2011, pp. 507–523.
[25] C. Ansótegui, M. Sellmann, and K. Tierney, "A gender-based genetic algorithm for the automatic configuration of algorithms," in Proc. 15th Int. Conf. Princ. Pract. Constraint Program., 2009, pp. 142–157.
[26] M. López-Ibáñez, J. Dubois-Lacoste, L. P. Cáceres, M. Birattari, and T. Stützle, "The irace package: Iterated racing for automatic algorithm configuration," Oper. Res. Perspectives, vol. 3, pp. 43–58, 2016.
[27] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2692–2700.
[28] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2017.
[29] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau, "Learning heuristics for the TSP by policy gradient," in Proc. 15th Int. Conf. Integration Constraint Program., Artif. Intell., Oper. Res., 2018, pp. 170–181.
[30] C. K. Joshi, T. Laurent, and X. Bresson, "An efficient graph convolutional network technique for the travelling salesman problem," 2019, arXiv:1906.01227.
[31] J. J. Hopfield and D. W. Tank, "'Neural' computation of decisions in optimization problems," Biol. Cybern., vol. 52, no. 3, pp. 141–152, 1985.
[32] K. A. Smith, "Neural networks for combinatorial optimization: A review of more than a decade of research," INFORMS J. Comput., vol. 11, no. 1, pp. 15–34, 1999.
[33] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev, "Reinforcement learning for combinatorial optimization: A survey," Comput. Oper. Res., vol. 134, 2021, Art. no. 105400.
[34] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[35] B. Peng, J. Wang, and Z. Zhang, "A deep reinforcement learning algorithm using dynamic attention model for vehicle routing problems," in Proc. 11th Int. Symp. Artif. Intell. Algorithms Appl., 2019, pp. 636–650.
[36] L. Xin, W. Song, Z. Cao, and J. Zhang, "Step-wise deep learning models for solving routing problems," IEEE Trans. Ind. Informat., vol. 17, no. 7, pp. 4861–4871, Jul. 2021.
[37] K. Li, T. Zhang, and R. Wang, "Deep reinforcement learning for multiobjective optimization," IEEE Trans. Cybern., vol. 51, no. 6, pp. 3103–3114, Jun. 2021.
[38] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, Jan. 2021.
[39] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song, "Learning combinatorial optimization algorithms over graphs," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6348–6358.
[40] Q. Ma, S. Ge, D. He, D. Thaker, and I. Drori, "Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning," 2019, arXiv:1911.04936.
[41] Z. Li, Q. Chen, and V. Koltun, "Combinatorial optimization with graph convolutional networks and guided tree search," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 539–548.
[42] Y. Wu, W. Song, Z. Cao, J. Zhang, and A. Lim, "Learning improvement heuristics for solving routing problems," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 5057–5069, Sep. 2022.
[43] L. Xin, W. Song, Z. Cao, and J. Zhang, "NeuroLKH: Combining deep learning model with Lin-Kernighan-Helsgaun heuristic for solving the traveling salesman problem," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 7472–7483.
[44] P. R. d. O. da Costa, J. Rhuggenaath, Y. Zhang, and A. Akcay, "Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning," in Proc. Asian Conf. Mach. Learn., 2020, pp. 465–480.
[45] H. Lu, X. Zhang, and S. Yang, "A learning-based iterative method for solving vehicle routing problems," in Proc. Int. Conf. Learn. Representations, 2019.
[46] S. Salcedo-Sanz and X. Yao, "Assignment of cells to switches in a cellular mobile network using a hybrid Hopfield network-genetic algorithm approach," Appl. Soft Comput., vol. 8, no. 1, pp. 216–224, 2008.
[47] S. Salcedo-Sanz and X. Yao, "A hybrid Hopfield network-genetic algorithm approach for the terminal assignment problem," IEEE Trans. Syst. Man Cybern., Part B, vol. 34, no. 6, pp. 2343–2353, Dec. 2004.
[48] A. Hottung and K. Tierney, "Neural large neighborhood search for the capacitated vehicle routing problem," in Proc. ECAI, 2019, pp. 443–450.
[49] K. Tang, Y. Mei, and X. Yao, "Memetic algorithm with extended neighborhood search for capacitated arc routing problems," IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 1151–1166, Oct. 2009.
[50] X. Yao, "Simulated annealing with extended neighbourhood," Int. J. Comput. Math., vol. 40, no. 3-4, pp. 169–189, 1991.
[51] P. Shaw, "A new local search algorithm providing high quality solutions to vehicle routing problems," APES Group, Dept. Comput. Sci., vol. 46, 1997.
[52] M. Chen, L. Gao, Q. Chen, and Z. Liu, "Dynamic partial removal: A neural network heuristic for large neighborhood search," 2020, arXiv:2005.09330.
[53] L. Gao, M. Chen, Q. Chen, G. Luo, N. Zhu, and Z. Liu, "Learn to design the heuristics for vehicle routing problem," 2020, arXiv:2002.08539.
[54] J. Zheng, K. He, J. Zhou, Y. Jin, and C.-M. Li, "Combining reinforcement learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 12445–12452.
[55] J. Bossek, "netgen: Network generator for combinatorial graph problems," R package version, 2015.
[56] G. Reinelt, "TSPLIB–A traveling salesman problem library," ORSA J. Comput., vol. 3, no. 4, pp. 376–384, 1991.
[57] D. Applegate, R. Bixby, V. Chvatal, and W. Cook, "Concorde TSP solver," Version v03.12.19, 2006. [Online]. Available: https://www.math.uwaterloo.ca/tsp/concorde.html
[58] X.-F. Xie and J. Liu, "Multiagent optimization system for solving the traveling salesman problem (TSP)," IEEE Trans. Syst. Man Cybern., Part B, vol. 39, no. 2, pp. 489–502, Apr. 2009.
[59] S. Liu, K. Tang, and X. Yao, "Generative adversarial construction of parallel portfolios," IEEE Trans. Cybern., vol. 52, no. 2, pp. 784–795, Feb. 2022.
[60] K. Tang, S. Liu, P. Yang, and X. Yao, "Few-shots parallel algorithm portfolio construction via co-evolution," IEEE Trans. Evol. Comput., vol. 25, no. 3, pp. 595–607, Jun. 2021.
[61] C. H. Reilly, "Synthetic optimization problem generation: Show us the correlations!," INFORMS J. Comput., vol. 21, no. 3, pp. 458–467, 2009.
[62] K. Smith-Miles and S. Bowly, "Generating new test instances by evolving in instance space," Comput. Oper. Res., vol. 63, pp. 102–113, 2015.
[63] K. Tang, J. Wang, X. Li, and X. Yao, "A scalable approach to capacitated arc routing problems based on hierarchical decomposition," IEEE Trans. Cybern., vol. 47, no. 11, pp. 3928–3940, Nov. 2017.

©SHUTTERSTOCK.COM/WHYIMAGE

Jack and Masters of all Trades: One-Pass Learning Sets of Model Sets From Large Pre-Trained Models

Abstract—For deep learning, size is power. Massive neural nets trained on broad data for a spectrum of tasks are at the forefront of artificial intelligence. These large pre-trained models or “Jacks of All Trades” (JATs), when fine-tuned for downstream tasks, are gaining importance in driving deep learning advancements. However, environments with tight resource constraints, changing objectives and intentions, or varied task requirements, could limit the real-world utility of a singular JAT. Hence, in tandem with current trends towards building increasingly large JATs, this paper conducts an initial exploration into concepts underlying the creation of a diverse set of compact machine learning model sets. Composed of many smaller and specialized models, the Set of Sets is formulated to simultaneously fulfil many task settings and environmental conditions. A means to arrive at such a set tractably in one pass of a neuroevolutionary multitasking algorithm is presented for the first time, bringing us closer to models that are collectively “Masters of All Trades”.

Digital Object Identifier 10.1109/MCI.2023.3277769 Date of current version: 13 July 2023

Corresponding author: Abhishek Gupta (e-mail: [email protected]).

Han Xiang Choong

Nanyang Technological University, SINGAPORE

Yew-Soon Ong

and Abhishek Gupta

Agency for Science, Technology and Research (A*STAR) and Nanyang Technological University, SINGAPORE

Caishun Chen

Agency for Science, Technology and Research (A*STAR), SINGAPORE

Ray Lim

Nanyang Technological University, SINGAPORE


I. Introduction

Buoyed by bountiful computing power and data, Deep Neural Networks (DNNs) currently enjoy primacy in Machine Learning (ML) and Artificial Intelligence (AI). From early DNNs succeeding at highly specific and narrow tasks [1], the field has progressed gradually towards increasingly general models. Consequently, the size and complexity of DNNs have skyrocketed [2]. Consider the original LeNet (65 K parameters) [3], as well as AlphaGo (4.6 M parameters) [4]. In performing numeral recognition and in playing Go, these models are specialized for a single task, and may be called Masters of One Trade (MOTs). In contrast, ViT-G/14 [5] (1.8B parameters) and GPT-3 (175B parameters) [6] are capable of a virtually unlimited range of tasks within Computer Vision (CV) and Natural Language Processing (NLP). These latter examples have come to be regarded as foundation models [7], approaching our notion of Jacks of All Trades (JATs) in this paper. Being trained on broad data at scale, these models play a key supporting role in ML development through downstream transfer learning [8].

The archetypal JAT (Figure 1) is massive and highly expressive [9], allowing the learning of multiple tasks simultaneously [10]. Such training exploits similarities between tasks, promotes the acquisition of generalizable knowledge, and thus improves performance [11]. However, this multifaceted learning sometimes results in a trade-off of errors between tasks, as a natural consequence of capacity limits in practice. This limitation could hinder performance on individual tasks [12], with network over-sharing causing negative effects of over-generalization in DNNs trained on multiple tasks [13]. Depending on the complexity, size, and number of tasks to be learned, JATs with finite information encoding capacity often cannot afford to allocate each task a specialized internal substructure. Conflicts between task-specific training signals could therefore occur, causing JATs to be weaker at performing individual tasks compared to specialists.

In addition, despite the flexibility conferred by greater generality, JATs are often too large to be deployed in many situations. Resource-constrained computing requirements are growing in prevalence [14], [15], [16] owing to the surge in popularity of deep learning across industries and fields. As a result, the problem of inaccessibility, i.e., the lack of resources to utilize such state-of-the-art tools, is recognized as being of critical importance to the majority of the community [17], [18].

Nevertheless, the pursuit of greater generality continues unabated, with the size and complexity of models ever-increasing. The culmination of this pursuit is envisioned to be a model that exhibits broad, human-level intelligence (or beyond) across a spectrum of tasks, often claimed to constitute AI's Holy Grail [19]. It is conjectured that at least two associated pathways could lead to this vision. The first path is clearly marked by current trends towards the building of titanic DNNs, brute-forcing the mastery of many tasks using sheer size. These effectively unconstrained models would be equipped with a very large number of task-specialized components (when needed), would not suffer from any capacity-related conflicts, and could therefore be classified as Masters of All Trades (MATs). However, this direction is encumbered by both extreme computational expense that is inaccessible to most, and diminishing returns [20]. What's more, unfettered generality is often not required in practice. To illustrate, a classical opera may be in need of a violinist, but is far less likely to specifically need a violinist who also plays the guitar and the trombone.

FIGURE 1 An illustrative application in multilingual translation. The JAT is a large pre-trained model that performs many translation tasks. Each compact MOT specializes for a single translation task. By originating from the JAT, MOTs shed conflicting information and gain generalizable knowledge. In this paper, it is investigated whether the collective of MOTs could surpass the JAT in performance and computational efficiency, gradually approximating a Master of All Trades (MAT).


This paper presents a first exploration of a complementary pathway. The archetypal MOT is a task specialist; it is compact, efficient, and focused on a given task. The less complex nature of its structure and training suggests reduced internal conflict relative to JATs. This theoretically allows MOTs to achieve stronger performance in their specific domains. If an MOT can sufficiently surpass the performance of a JAT on relatively narrow tasks, then a collective of such MOTs, over many tasks, starts to functionally approach an MAT. A collective is modular and can easily grow and shrink, whilst it caters to a priori unknown objectives, intentions, and constraints of human end-users. Given current methods and technologies, such a collective could be produced economically, with sizeable resource efficiency advantages in the long-term. Importantly, MOTs, JATs, and MATs are conceived as being interdependent rather than separate. The creation of many task-specialized DNNs is proposed by taking into consideration a number of computational factors. One of these is the considerable energy cost incurred by a single development cycle in deep learning [21]. Another factor is the potential to harness valuable information contained within the parameters of pre-trained JATs. When trained from scratch for narrow tasks, an MOT does not benefit from a broad base of fundamental and generalizable knowledge as a JAT would. On the other hand, by learning MOTs from JATs, e.g., taking a large pre-trained model and then cutting it down small [22], these concerns can be alleviated. An abstraction of this concept is depicted in Figure 1, where a massive multilingual translation model is compressed into sets of small but specialized monolingual translators. The collective of many such MOTs, each suited for a different environment and task, constitutes what is referred to as the set of ML model sets (or Set of Sets). The creation and deployment of a Set of Sets promises attractive benefits, especially in resource-constrained environments. It is then necessary to devise methods that could generate such a set in a tractable manner. Going forward, our aim is to provide a novel means to arrive at optimized sets of ML models in just a single algorithmic pass. To this end, in Section II, the notions of different task settings and environmental conditions are first formalized, and the Set of Sets is defined in this context. In Section III, neuroevolutionary multitasking [23], [24], [25] is introduced as the engine for creating the Set of Sets practically and near-optimally by approaching the problem for the first time as one of multi-objective, multi-task optimization. Different from recent advances in evolutionary neural architecture search—where the goal is typically to substitute hand-crafted DNN architectures with those evolved from the ground up [26], [27], [28]—our method enables one-pass learning of multiple compact MOTs from pre-trained DNNs, collectively specializing to various tasks and environments. In Section IV, details of our experimental study are provided. The experimental results and a summary of additional insights gleaned are reported in Section V. Finally, in Section VI, the paper is concluded with a summary of our contributions and inspiring directions for future research.

FIGURE 2 Isolating a compact, task-specialized subnetwork from the large JAT results in a model which could be better at the task and more suitable for deployment outside the cloud. Highlighted connections are those that are retained in the subnetwork.

II. Introducing the Set of Sets

Our problem setting revolves around the existence of (a) multiple tasks and (b) multiple environments. It is necessary then to describe each of those components in detail, and what specializing for them entails. JATs, MOTs, and MATs are formally introduced in this regard. Finally, all of these elements are tied together to formalize the Set of Sets concept.

A. Multiple Task Settings

Assimilating fundamental knowledge from multiple tasks can improve predictive performance and model generalization. Hence, after training on a broad spectrum of tasks, JATs can become particularly suitable for few-shot and transfer learning [29], making them powerful foundation models which support ML work and research. However, finite-sized JATs still encounter inter-task interferences, e.g., conflicting gradients [12], which could obstruct performance on individual specializations [11]. Moreover, inference costs with JATs could turn out to be a bottleneck, exceeding what is actually needed for the specific objectives and intentions that an end-user may have at any given time.

In a real-life scenario, for example, consider an English-speaking tourist in Germany, and suppose there is a multilingual translation model trained with many northern and central European languages. These languages possess high lexical similarity due to common ancestry, resulting in improved translation quality after training. However, the tourist only requires German-to-English translation, not Czech, Swedish or Danish. If this German-to-English competence is prioritized above all, then the relevant faculty could be extracted in relative isolation (see illustration in Figure 2). The main benefits are improved performance and greater efficiency after fine-tuning, assisted by a strong initialization stemming from the model's existing training. With this background, one notion of tasks is delineated in what follows.

AUGUST 2023 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE

31

FIGURE 3 A variety of resource-constrained environmental conditions encountered in practice. Each device demands a particular capability of the original model, but can only support a more restricted model. A large-scale JAT may be infeasible for many such environments.

Consider $K$ supervised machine learning problems, each associated with a labeled dataset $D_i = \{(x_n^i, y_n^i)\}_{n=1}^{N_i}$, $i = 1, 2, 3, \ldots, K$. It is assumed that all inputs (i.e., $x^i$'s $\forall i$) are embedded in a common vector representation space $\mathcal{X}$, within which the marginal probability distribution $P(x^i)$ may vary across datasets. Here $D_1$ may pertain to German-to-English translation, $D_2$ may pertain to Czech-to-English translation, and so on. A task is thus associated with training an ML model that takes a set of parameters and input data, and outputs a prediction $\hat{y}$ such that $\hat{y} = F(x; Q)$. Hereinafter, the predictive function $F$ is assumed to take the form of an arbitrary DNN parameterized by $Q$; this could either be a convolutional neural network, an attention-based model, or another neural architecture. The goal is for $F$ to accurately capture the underlying conditional probability distribution in output space $\mathcal{Y}$, i.e., we want $F(x; Q) \approx P(y|x)$. In order to optimize $Q$, tasks are typically assigned a loss function $L_i$ that encapsulates not only the error between predictions ($\hat{y}$) and ground truth labels ($y^i$), such as cross-entropy loss [30] in the case of classification, focal loss [31] in the case of image segmentation, etc., but also user-supplied priors and preferences, which could themselves change in time. Thus, a task can be defined as consisting of the following 3-tuple:

$$T_i \triangleq \{D_i, F(x^i; Q), L_i(\hat{y}, y^i)\}. \tag{1}$$
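To make the 3-tuple concrete, a task could be packaged as a simple record, as in the following minimal Python sketch; the class and field names (Task, dataset, model_fn, loss_fn) are illustrative choices, not notation from the paper.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    # A task T_i = {D_i, F(x; Q), L_i}: data, predictive model, and loss.
    dataset: Any            # D_i, e.g., German-to-English sentence pairs
    model_fn: Callable      # F(x; Q), the parameterized predictor
    loss_fn: Callable       # L_i(y_hat, y), e.g., a cross-entropy loss

# Hypothetical instantiation with placeholder callables standing in for real objects.
t_de_en = Task(dataset="WMT-19 de-en pairs",
               model_fn=lambda x, params: x,
               loss_fn=lambda y_hat, y: float(y_hat != y))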

Specializing for the $i$th task setting, therefore, entails specializing in a single set of these components by minimizing the loss function $L_i$ over dataset $D_i$.

B. Multiple Environmental Conditions

As more tasks are introduced, performance accuracy is not the only cost incurred. Large-scale models of increasing expressive capacity are required to achieve good average performance across combinations of data distributions. The increasing scale of the best-performing ImageNet models is testament to this phenomenon [32]. On returning to the example of our English-speaking tourist in Germany, the original multilingual


model entails significantly greater computational and memory demands compared to a compact monolingual translator. A model this large would typically be housed on a cloud computing cluster, and would depend on real-time connections to provide on-demand service [33]. If our hypothetical tourist is expecting unreliable connections and is operating a typical smartphone dependent on mobile data, then long transmission latencies, connection instability, and high inference times become challenging constraints and factors of concern [34]. All these concerns can be overcome or alleviated by direct deployment of models to mobile devices, as illustrated in Figure 3. However, directly deploying large-scale cloud-based models on the edge is difficult and often infeasible [17], [18]. This is primarily due to the existence of a very large number of parameters, incurring considerable computational cost and occupying sizeable memory bandwidth. This limitation is acknowledged by a growing interest in DNN architectures oriented towards deployment on resource-constrained hardware [35], [36].

Let the structure of a generic DNN, denoted as $G$, comprise $L$ layers: $\{G_1, G_2, G_3, \ldots, G_L\}$. Each layer $G_l$ is associated with a parameter vector $u_l$ of dimensionality $r_l$, where $l$ is the layer index. Since the computing and memory cost of DNNs is strongly dependent on model size [37], [38], one may choose to represent a smaller subnetwork of the full DNN by a binary vector mask $B = [b_1^1, b_1^2, b_1^3, \ldots, b_1^{r_1}, b_2^1, b_2^2, b_2^3, \ldots, b_L^{r_L}]$, such that $b_l^k \in \{0, 1\}$. Here, $b_l^k = 1$ implies that the $k$th parameter of the $l$th layer, i.e., $u_l^k$, is retained in the subnetwork, whereas $b_l^k = 0$ represents that the parameter is pruned away (as shown in Figure 2). This operation is denoted hereafter as $B \odot G$. The $L_1$-norm $\|B\|_1$ serves as an approximate representation of the size of a pruned model in terms of the number of parameters it contains. Optimization processes minimizing this norm can perform DNN model compression effectively [39], [40]. This offers a relatively straightforward means of specializing to resource-constrained environments.
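As a minimal PyTorch-style sketch of this mask formalism (an illustration under our own naming, not the authors' implementation), a flat binary vector B can be applied to a network G by zeroing pruned parameters, while its L1-norm counts the retained ones:

import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def apply_mask(model: torch.nn.Module, mask: torch.Tensor) -> None:
    # Realize B (*) G: parameters with mask bit 0 are pruned (set to zero).
    vec = parameters_to_vector(model.parameters())
    vector_to_parameters(vec * mask, model.parameters())

def mask_size(mask: torch.Tensor) -> int:
    # ||B||_1 of a binary mask equals the number of retained parameters.
    return int(mask.sum().item())

model = torch.nn.Linear(8, 4)                       # a stand-in for G
n_params = sum(p.numel() for p in model.parameters())
mask = (torch.rand(n_params) > 0.5).float()         # random binary mask B
apply_mask(model, mask)
print("retained parameters:", mask_size(mask), "of", n_params)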

C. Defining MOTs, JATs, and MATs

Having described the notions of tasks and environments, the meaning of the terms MOT, JAT, and MAT may now be defined in the context of the optimization problem formulation each is associated with. First, the creation of MOT$_i$, specialized for the $i$th task, can be expressed as follows:

$$\min_{Q^i} \ \frac{1}{N_i} \sum_{n=1}^{N_i} L_i\big(F(x_n^i; Q^i), y_n^i\big), \quad y^i \in \mathcal{Y}^i, \tag{2}$$

where $Q^i$ is the set of model parameters tuned for task $T_i$, and $\mathcal{Y}^i$ is the output label space. By contrast, a JAT deals with minimizing the average loss over a set of $K$ distinct tasks. As such, it contains a subset of parameters $Q_{sh}$ that is strictly shared by all tasks, in addition to possessing $K$ sets of task-specialized parameters $Q^i$. The JAT is then formulated as:

$$\min_{Q_{sh}, Q^1, \ldots, Q^K} \ \frac{1}{K} \sum_{i=1}^{K} \frac{1}{N_i} \sum_{n=1}^{N_i} L_i\big(F(x_n^i; Q_{sh}, Q^i), y_n^i\big). \tag{3}$$

The averaging of the loss function over a task set is what gives rise to inter-task interactions. While the resultant inductive transfer can sometimes improve generalization by using information from related tasks, the interference may often hamper individual task performance under the finite information encoding capacity of JATs in practice [12]. Theoretically, an MAT possesses sufficient capacity to reduce the need for parameter sharing between tasks. Thus, it is posited that the MAT could assume at least two different forms. The first is a singular model with sufficient capacity to allocate a dedicated internal substructure of its architecture to each task when needed. The second form, that of the Set of Sets, is composed of a plurality of compact MOTs. The MAT could then be regarded as the assimilation of all specialized models, denoted as $Q_{MAT} = \bigcup_{i=1}^{K} Q^i$, obtained for all tasks:

$$\forall T_i, \ i \in \{1, 2, 3, \ldots, K\}: \quad Q^i = \arg\min \ \frac{1}{N_i} \sum_{n=1}^{N_i} L_i\big(F(x_n^i; Q^i), y_n^i\big). \tag{4}$$

This formulation frames a set of MOTs as functionally approximating an MAT.

D. Formalizing the Set of Sets

The Set of Sets is a collective of models which approximates an MAT while simultaneously catering to multiple resource-constrained environments. Ideally, it offers a Pareto-optimal model specialized for every combination of task setting and environmental condition. It is thus best to define the Set of Sets, denoted $S$, in terms of Pareto-optimality [41]. To that end, a model $Q_a$ is said to be a member of the Set of Sets if there exists some task $T_i$ for which there is no other model $Q_b$ that dominates $Q_a$ in terms of size and performance accuracy. This leads to the following formal definition of membership in $S$ in the particular context of models produced by isolating compact subnetworks from a generic DNN $G$ (i.e., from a large-scale JAT).

Definition 1 (Membership in the Set of Sets). Let model $Q'$ be produced from $G$ by the operation $B' \odot G$. Then, $Q_a \in S$ if and only if $\exists T_i$ for which $\nexists Q_b$ such that $\|B_a\|_1 \geq \|B_b\|_1 \wedge L_i(F(x^i; Q_a)) \geq L_i(F(x^i; Q_b))$, with $\|B_a\|_1 > \|B_b\|_1 \vee L_i(F(x^i; Q_a)) > L_i(F(x^i; Q_b))$.
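A minimal sketch of Definition 1 (our own illustration, not the authors' code): a candidate, scored by the two minimization objectives (size ||B||_1 and task loss L_i), belongs to the task's Pareto set only if no other candidate dominates it.

from typing import List, Tuple

def dominates(b: Tuple[float, float], a: Tuple[float, float]) -> bool:
    # True if candidate b dominates candidate a (both objectives minimized).
    no_worse = b[0] <= a[0] and b[1] <= a[1]
    strictly_better = b[0] < a[0] or b[1] < a[1]
    return no_worse and strictly_better

def is_member(a: Tuple[float, float], pool: List[Tuple[float, float]]) -> bool:
    # Candidate a enters the task's Pareto set if no pool member dominates it.
    return not any(dominates(b, a) for b in pool if b is not a)

# Each tuple is (size ||B||_1, task loss L_i) for one pruned model.
pool = [(1200, 0.31), (800, 0.35), (800, 0.29), (500, 0.42)]
pareto_set = [a for a in pool if is_member(a, pool)]
print(pareto_set)   # [(800, 0.29), (500, 0.42)]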

Producing such a plurality of $Q_a$'s is clearly a non-trivial challenge that goes beyond conventional multi-objective optimization problem formulations. The most straightforward solution is to train specialized models individually from scratch. However, this would incur considerable expense in engineering labor and energy [21] due to the sheer diversity of possible devices, each with unique resource budgets and requirements [42], [43], [44]. This is compounded by each device potentially needing to fulfil a range of task settings. Clearly, a more general, resource-efficient approach is desirable.

III. Learning the Set of Sets From JATs

In this section, a means to arrive at the set of ML model sets tractably within just one run of an evolutionary multitasking algorithm is offered. The approach first evolves the binary masks given a generic DNN, and then applies gradient-based fine-tuning to achieve task-specialized MOTs.

A. Multi-Objective, Multi-Task Optimization Formulation

When trained from scratch for a single task, an MOT does not benefit from a broad base of generalizable knowledge as a foundation model exposed to multifaceted data would. Compressing JATs into MOTs avoids wastage and makes full use of the JAT as a large pre-trained model supporting downstream transfer learning. Taking these observations and Definition 1 as the basis of our approach, a formulation that could lead to an efficient generation of the Set of Sets by isolating compact subnetworks from a JAT G is considered. Our formulation leverages the dual concepts of multiple tasks and multiple objectives. Referring to Section II-A, let there be a series of K tasks with associated loss functions to be minimized. This allows us to specialize for multiple task settings concurrently. However, performance is not our sole objective. Referring to Section II-B, it is also of interest to optimize for multiple environmental conditions within the context of each task. Hence, the L1 -norm jjBjj1 and the loss function L are to be jointly minimized. This necessitates the acceptance of trade-offs between performance and size, since larger models would generally be expected to perform better but applied to only a small range of high-resourced environments. Thus, the Set of Sets solves the following bi-objective optimization problem given a series of tasks:

$$\forall T_i, \ i \in \{1, 2, 3, \ldots, K\}: \quad \underset{B^i}{\text{minimize}} \ \left( \|B^i\|_1, \ \frac{1}{N_i} \sum_{n=1}^{N_i} L_i\big(F(x_n^i; B^i \odot G), y_n^i\big) \right). \tag{5}$$
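For intuition, the two objectives in (5) could be evaluated for a single candidate mask roughly as follows; this is a sketch assuming a PyTorch model and data loader, and evaluate_candidate is an illustrative helper rather than the authors' API.

import copy
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def evaluate_candidate(mask, jat, loader, loss_fn):
    # Returns (f1, f2) = (||B||_1, average task loss over D_i) for one binary mask.
    f1 = float(mask.sum())                                   # size objective
    subnet = copy.deepcopy(jat)                              # do not disturb the JAT itself
    vec = parameters_to_vector(subnet.parameters())
    vector_to_parameters(vec * mask, subnet.parameters())    # B (*) G
    subnet.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(subnet(x), y).item() * len(y)
            count += len(y)
    return f1, total / count                                 # loss objective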

This results in sets $PS^i$ of Pareto-optimal models $Q^i$ for all $T_i$, which provide optimized trade-offs of both objective functions. The Set of Sets $S$ is then the union of all $PS^i$'s:


$$Q^i = B^i \odot G, \quad PS^i = \{Q_1^i, Q_2^i, Q_3^i, \ldots\}, \quad S = \bigcup_{i=1}^{K} PS^i. \tag{6}$$

Building a Set of Sets is thus framed as a multi-objective, multi-task optimization problem. By this definition, the Set of Sets is a collective of compressed and specialized DNN models, which offers Pareto-optimal options for a range of environmental conditions and tasks.

Algorithm 1. Pseudocode for Evolving the Set of Sets

1:  Input: JAT G, K training datasets [D_1, D_2, D_3, ..., D_K], population size G, total number of generations V
2:  Output: Set of Sets S
3:  Randomly generate a population of G candidate solutions for each of K tasks
4:  for each population P^i, i in {1, 2, 3, ..., K} do
5:      for each candidate B_j^i, j in {1, 2, 3, ..., G} do
6:          Evaluate B_j^i in terms of f_1^i = ||B_j^i||_1 and f_2^i = (1/N_i) * sum_{n=1}^{N_i} L_i(F(x_n^i; B_j^i (*) G), y_n^i) using dataset D_i
7:  CurrentGen = 0
8:  while CurrentGen < V do
9:      Select parent population P'^i from P^i, for all i
10:     Assign inter-task crossover probability p_kl for all task pairs k, l
11:     Generate K empty sets of offspring O^i, i in {1, 2, 3, ..., K}
12:     for j in {1, 2, 3, ..., (G x K)/2} do
13:         Randomly select two parents B_a^k, B_b^l from the union of all P'^i, i = 1, ..., K
14:         Uniformly sample a number rand in [0, 1]
15:         if k == l then
16:             Crossover and mutate B_a^k, B_b^l to produce two offspring
17:             Evaluate both offspring on f_1^k and f_2^k, and append to O^k
18:         else if k != l and rand < p_kl then
19:             Crossover and mutate B_a^k, B_b^l to produce two offspring
20:             Randomly evaluate one offspring on f_1^k and f_2^k, and append to O^k
21:             Evaluate the other on f_1^l and f_2^l, and append to O^l
22:         else
23:             Randomly select two additional parents B_{a+1}^k, B_{b+1}^l from P^k and P^l, respectively
24:             Crossover and mutate B_a^k with B_{a+1}^k, evaluate offspring on f_1^k and f_2^k, and append to O^k
25:             Crossover and mutate B_b^l with B_{b+1}^l, evaluate offspring on f_1^l and f_2^l, and append to O^l
26:     for i in {1, 2, 3, ..., K} do
27:         Rank members of the population O^i ∪ P^i
28:         Select the G fittest members from O^i ∪ P^i to form the next P^i
29:     CurrentGen = CurrentGen + 1
30: for i in {1, 2, 3, ..., K} do
31:     Acquire the set of approximated Pareto solutions in P^i
32:     Fine-tune the acquired models by gradient descent on f_2^i to get PS^i
33: Return ∪_{i=1}^{K} PS^i

B. Other Practical Constraints

1) Mask Dimensionality: Given the large number of parameters in a modern JAT, the dimensionality of the binary mask $B$ could become a bottleneck, as assigning a binary variable to every parameter can quickly make optimization intractable. One way to overcome this challenge is to reduce the effective dimensionality $r_l$ of any layer $l$ by a parameter grouping mechanism. One such implementation, with masks placed over entire convolutional filters, was proposed in [39]. Consider ResNet-18 with its 11 M parameters. When each binary variable $b_l^k$ corresponds to an individual parameter, the dimensionality is 11,173,962. Alternatively, if each $b_l^k$ groups together approximately half the parameters in layer $l$, then the overall dimensionality of the binary mask could be vastly reduced to 36. Note that a group is either wholly retained, if $b_l^k = 1$, or wholly pruned away, if $b_l^k = 0$. The mask granularity and grouping mechanism can be freely defined prior to the optimization run, depending on the problem at hand and compute resource availability. An advantage of this approach, which targets parameters directly, is that it can be applied to different types of DNNs with little customization needed. This model-agnosticism shall be shown in Section V with application to convolutional networks, recurrent networks, and transformers. A grouping scheme of this kind is sketched after this subsection.

2) Layer Compatibility: It is important to ensure that no layer of the JAT is removed entirely by $B$, since doing so can collapse performance by cutting off gradient flow. Hence, it is necessary to make sure that a certain number of binary variables $b_l^k$ remain active in each layer. This defines a hard limit on the extent of compression. Additionally, it is necessary to ensure that redundancies are avoided. For example, it is plausible that all of the input parameters to a node are pruned away, rendering all of that node's output parameters redundant. Constraints and checks to avoid such scenarios, or to prune away affected parameters in subsequent layers, should be implemented.
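To illustrate the grouping idea (a sketch under our own assumptions about group boundaries; expand_grouped_mask is a hypothetical helper), each bit of a low-dimensional mask can govern a contiguous block of parameters within a layer, so that, for example, 18 layers with 2 groups each yield a 36-dimensional search space:

import numpy as np

def expand_grouped_mask(group_bits, layer_sizes, groups_per_layer):
    # Expand one bit per parameter group into a full per-parameter binary mask.
    full, idx = [], 0
    for size in layer_sizes:
        bounds = np.linspace(0, size, groups_per_layer + 1, dtype=int)
        layer_mask = np.zeros(size)
        for g in range(groups_per_layer):
            layer_mask[bounds[g]:bounds[g + 1]] = group_bits[idx]   # group kept or pruned wholesale
            idx += 1
        # Layer compatibility: keep at least one group active in every layer.
        if layer_mask.sum() == 0:
            layer_mask[bounds[0]:bounds[1]] = 1
        full.append(layer_mask)
    return np.concatenate(full)

rng = np.random.default_rng(0)
layer_sizes = [620_000] * 18                   # illustrative per-layer parameter counts
group_bits = rng.integers(0, 2, size=36)       # the 36-dimensional evolved representation
mask = expand_grouped_mask(group_bits, layer_sizes, groups_per_layer=2)
print(mask.size, int(mask.sum()), "parameters retained")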

C. A One-Pass Neuroevolutionary Multitasking Algorithm

Multi-objective optimization problems are known to be efficiently handled using Evolutionary Algorithms (EAs) [45], [46]. EAs are crafted to generate and update a population of solutions. When multiple objectives are considered, the population can be structured to simultaneously approximate Pareto-optimal tradeoffs between the objectives [47], [48], essentially creating a set of solutions that fulfil different environmental conditions or user preferences. However, conventional single- and multi-objective EAs are typically oriented towards solving just a single target task at a time. This tends to limit the implicit parallelism and hence the convergence rate of population-based search [49], as skills evolved for other tasks are not readily transferred and reused. Evolutionary multitasking offers a natural means to overcome this limitation by tackling multiple optimizations jointly [50].

Given $K$ tasks, the time complexity of one iteration in evolutionary multitasking is comparable to the summed complexity per iteration over $K$ independent baseline EAs (assuming equal population size in corresponding tasks). Nevertheless, the former is able to uniquely build on related tasks by harnessing reusable building-blocks of knowledge contained in its population [51], [52]. It simplifies the search by transferring this information encoded in evolved solutions across joint optimization processes, thus speeding up convergence rates (i.e., reducing the number of iterations taken) in discovering near-optimal solutions [53], [54]. Any binary-coded multi-task algorithm can in principle be extended to address all $T_1, T_2, \ldots, T_K$ in (5) in a single run, thus making possible one-pass neuroevolution of the Set of Sets from pre-trained models. Since a singular JAT $G$ is utilized by all the tasks, a unified solution representation space (i.e., the space of binary masks) is naturally available. Hence, solutions evolved for separate tasks can be easily recombined via known operators. In producing the Set of Sets, the inter-task crossover operation is what enables implicit transfer and reuse of beneficial neural substructures across multiple tasks. Consequently, the wastage of computational resources in re-exploring overlapping solution spaces is reduced, enhancing the efficiency of the overall optimization cycle [55], [56], [57].

A pseudocode of our approach is given in Algorithm 1. The approach combines the diversity of trade-offs achieved by multi-objective optimization with the improved search generalization of multi-task optimization. In our experimental study, the Multiobjective Multifactorial Evolutionary Algorithm (MO-MFEA) [24], [58] is employed as an instantiation of a neuroevolutionary multitasking engine acting on the underlying JAT. The distinctive steps that induce implicit transfer occur in Lines 18–21 of Algorithm 1, where inter-task crossovers are performed with some probability (fixed at 0.4 in the experiments unless otherwise mentioned). In particular, a uniform crossover-like variable swap operation (in Line 19) leads to the exchange of schema between solutions excelling at different tasks. As a result, if a neural substructure evolved for one task turns out to be useful for another, then the mutually beneficial information could be exploited without having to rediscover it from scratch. To facilitate exploration, crossovers are accompanied by bitwise mutation, with the probability of mutating a bit given by 1/size(B). The MO-MFEA is therefore able to address the multi-objective, multi-task formulation in (5) in a natural way; it is simple to implement without requiring ad hoc transfer mechanisms [50]. The literature [59], [60] provides reviews of alternative algorithmic techniques that could also be considered. After the evolutionary process terminates, the obtained approximate Pareto-optimal subnetworks are fine-tuned via gradient descent on task-specific loss functions to return the final set of MOTs (Line 32 of Algorithm 1).
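The transfer-inducing variation steps (Lines 18–21 of Algorithm 1) might be sketched as follows, assuming masks encoded as NumPy arrays; the 0.4 inter-task crossover probability and the 1/size(B) mutation rate mirror the settings stated above, while the function names themselves are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def uniform_crossover(parent_a: np.ndarray, parent_b: np.ndarray):
    # Swap bits between two parent masks with probability 0.5 per position.
    swap = rng.random(parent_a.size) < 0.5
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[swap], child_b[swap] = parent_b[swap], parent_a[swap]
    return child_a, child_b

def bitwise_mutation(mask: np.ndarray) -> np.ndarray:
    # Flip each bit with probability 1 / size(B).
    flip = rng.random(mask.size) < 1.0 / mask.size
    return np.where(flip, 1 - mask, mask)

# Inter-task crossover: parents evolved for tasks k != l recombine with probability 0.4.
p_inter = 0.4
parent_k = rng.integers(0, 2, size=64)   # mask evolved for task k
parent_l = rng.integers(0, 2, size=64)   # mask evolved for task l
if rng.random() < p_inter:
    child_k, child_l = uniform_crossover(parent_k, parent_l)
    child_k, child_l = bitwise_mutation(child_k), bitwise_mutation(child_l)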

IV. Experimental Overview

In addition to showcasing the effectiveness and efficiency of the problem formulation and approach, the experiments are intended to highlight the generality of the Set of Sets concept across a variety of datasets. The experimental study thus makes use of four datasets, in the domains of multilingual translation, regression, and image classification. A large-scale DNN, representing a JAT, was pre-trained on each dataset in full. Each dataset was then reconfigured into a number of tasks, each of which represents a specialization for categories of data, and is therefore of narrower scope. Specialists are then evolved for all tasks simultaneously. Fitness is measured using two objective functions: size and performance measure. The final evolved populations are fine-tuned (by gradient descent) on task-specific training data, and then evaluated on held-out test data. All experiments were conducted using an AMD Threadripper 3990X and a single Nvidia 3090 RTX.

A. Multilingual Translation: WMT19

A unique facet of this work is the application to domains beyond image classification.

1) Dataset: With our example of an English-speaking tourist in Germany (see Section II-A), multilingual translation is a domain where the Set of Sets has practical utility. Languages with common origins tend to share loan-words and similar lexical structures, with a wealth of transferable and mutually beneficial information. Specialization is conceived as producing master monolingual translators from a large multilingual model. For this purpose, the dataset used is the WMT-19 [61] multilingual dataset, focusing on Czech-to-English and German-to-English translation (see Figure 4).

2) Experimental Details: In keeping with the premise of a JAT being a large, generalist DNN, Facebook's M2M100-418M model [62] is used with pre-trained weights. This is acquired from the HuggingFace transformers and datasets libraries [63]. The pre-trained M2M100-418M is tuned for both translation tasks for 65 epochs, with a batch size of 1920, until no further loss improvement is observed. The AdamW optimizer [64] is used with a learning rate of 0.0005 and a multiplicative decay of 0.99. The dimensionality of the binary mask is 67,584. It is applied solely to attention layers, not to vocabulary or position embeddings or to the prediction head. The model size is approximately 1.8 GB. A population of 20 specialist models is evolved for each language simultaneously using MO-MFEA, for 120 generations. During evolution, model fitness is assessed using negative log likelihood loss on training data. After termination, the obtained solutions are fine-tuned for 200 epochs or until no further improvement is observed, using the AdaFactor optimizer [65]. After fine-tuning, models are evaluated for BLEU score [66] on the test data.
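As a hedged starting point for reproducing the general setup (the model identifier and optimizer settings follow the text; the data pipeline and exact training loop are not shown, and this is not the authors' code), the JAT and its optimizer could be instantiated via the HuggingFace transformers library:

import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Pre-trained JAT: the 418M-parameter multilingual translation model.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Optimizer settings reported in the text: AdamW, lr 0.0005, multiplicative decay 0.99.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lambda epoch: 0.99)

# Example: German-to-English inference with the untouched JAT.
tokenizer.src_lang = "de"
batch = tokenizer("Wo ist der Bahnhof?", return_tensors="pt")
generated = model.generate(**batch, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))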

B. Time-Series Regression: Beijing Air Quality

1) Dataset: Our second domain is time-series regression. In these experiments, the Beijing Multi-Site Air-Quality dataset [67] from the UCI ML repository is used. This dataset consists of air-pollutant readings from 12 monitoring sites situated around Beijing. The objective is to predict PM2.5 levels [68]. Specialization is for specific geographic locations (see Figure 4). Measurements from four monitoring stations in different geographical areas are selected.


FIGURE 4 Overview of the tasks used in each experiment. NLP tasks specialize for a single language. Time-series regression tasks specialize for a geographical area. Image classification tasks specialize for a subset of classes.

Datapoints are measured hourly, from March 1st, 2013 to February 28th, 2017.

2) Experimental Details: A bidirectional LSTM with two hidden layers of 128 nodes and one fully-connected output layer is pre-trained on the full dataset [69]. A time window of 10 hours is used. The dimensionality of the binary mask is 4096, and the model is approximately 2.12 MB in size. A population of 60 specialists is evolved corresponding to each area in the MO-MFEA, for 120 generations. After evolution, the obtained solution candidates are fine-tuned for 15 epochs using the AdamW optimizer [64] with a learning rate of 0.001 and a multiplicative decay of 0.99. The root mean squared error (RMSE) loss is used for evolution, fine-tuning, and final evaluation on test data.
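A regression model of the kind described could be defined as in the sketch below (assuming PyTorch; the hidden sizes and 10-hour window follow the text, while the input feature count of 11 and the class name AirQualityLSTM are illustrative assumptions).

import torch
import torch.nn as nn

class AirQualityLSTM(nn.Module):
    # Bidirectional LSTM: two hidden layers of 128 units, one fully-connected output.
    def __init__(self, num_features: int = 11, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_features, hidden_size=hidden,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # predict PM2.5 at the next step

    def forward(self, x):                      # x: (batch, 10, num_features), 10-hour window
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])        # regression from the last time step

model = AirQualityLSTM()
dummy = torch.randn(4, 10, 11)                 # batch of 4 ten-hour windows
print(model(dummy).shape)                      # torch.Size([4, 1])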

C. Image Classification: MNIST and CIFAR-10

1) Dataset: MNIST [70] and CIFAR-10 [71] are ubiquitous benchmarks in the computer vision community. For these datasets, specialized tasks consist of subsets of the labels. MNIST is synthetically split into even and odd numbers, while CIFAR-10 is split into animals and vehicles (see Figure 4).

2) Experimental Details: As ImageNet [72] models constitute an important group of extant pre-trained models, we use ResNet-18 [73]. This model is approximately 44 MB in size. Additionally, the datatype of parameters is changed from 32-bit floats to 16-bit half floats at inference time. This reduces model size to 22 MB. The dimensionality of the binary mask is 18,944. Like the regression experiment, a population of 60 specialists is evolved for each task, for 120 generations. After evolution, the obtained candidates are fine-tuned using the Adam optimizer [74] with a learning rate of 0.0005 and a multiplicative decay of 0.95. MNIST models are fine-tuned for 10 epochs, while CIFAR models are fine-tuned for 25 epochs. The cross-entropy loss is used for evolution and fine-tuning, and the final population is evaluated for classification accuracy on test data.
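The half-precision step can be illustrated with torchvision (a sketch assuming a recent torchvision API, not the authors' pipeline): casting a pre-trained ResNet-18 to 16-bit floats roughly halves its parameter memory, from about 44 MB to about 22 MB.

import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()

def size_mb(m: torch.nn.Module) -> float:
    # Approximate parameter memory in mebibytes.
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 2**20

print(f"fp32 parameters: {size_mb(model):.1f} MB")    # roughly 44 MB
model = model.half()                                  # cast all parameters to 16-bit half floats
print(f"fp16 parameters: {size_mb(model):.1f} MB")    # roughly 22 MB

if torch.cuda.is_available():                         # fp16 inference is typically run on GPU
    model = model.cuda()
    with torch.no_grad():
        logits = model(torch.randn(1, 3, 224, 224, device="cuda").half())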

V. Experimental Results

A. Diverse and Task-Specialized Set of Sets

Our first experimental results evaluate the Set of Sets against the original JAT. Each evolved population approximates a part of the Pareto front (see Figures 5–7), showcasing diverse trade-offs between model size and performance. In most cases, specialist models which exceed or match the JAT in performance on their assigned tasks are observed. Additionally, significant reductions in model size are achieved, while still maintaining acceptable standards of performance, in all tasks considered. Hence, in each domain, the obtained Set of Sets is deemed to collectively approximate an MAT, guaranteeing deployability across a range of resource-constrained environments.

The most striking examples of performance improvement are observed in the regression tasks (Figure 6). The best performing regression model, specialized for the extreme North-North-West of Beijing (Dingling), achieves an RMSE reduction of 7.19, a performance improvement of 18.5% over the JAT. Note that performance (accuracy) improvement was not witnessed in the CIFAR-10 Vehicles task (Figure 7). However, the best performing specialist came within 0.12% accuracy of the JAT. It is likely that further improvement in this case is extremely difficult, given that the underlying JAT appears to have achieved close to 100% accuracy on this task.

The clearest examples of successful model compression while maintaining acceptable performance levels are seen in image classification, with the smallest specialist shown for the CIFAR-10 Animals task achieving a size reduction of 95.5%. In a less extreme example, the smallest specialist in the Dingling regression task achieves a size reduction of 82%. It is also contended that size reductions of the most accurate models may be inversely correlated with task difficulty. For example, the NLP tasks are the most difficult and computationally demanding out of those considered, with the specialist with the highest BLEU score in the German-to-English task achieving a relatively moderate size reduction of 9.5% (see Figure 5).

FIGURE 5 Images in objective space of the evolved Set of Sets for multilingual translation experiments (WMT-19). Each plot displays an evolved population of translators specialized for either Czech or German. Circles denote populations of MOTs, plotted by size and BLEU score for each task. Stars denote the JAT. MOTs with higher BLEU scores than the JAT are achieved, indicating superior performance on the test set.

B. Productivity Gains of One-Pass Learning

Having demonstrated model diversity, the benefit of neuroevolutionary multitasking for arriving at the Set of Sets in a single optimization run is established next. Conceivably, a comparable Set of Sets could be acquired by repetitively running a single-task multiobjective EA, given sufficient time and resources. As discussed previously, the primary benefit of multi-task optimization lies in the transfer of beneficial neural substructures between tasks (without having to re-explore overlapping solution spaces), boosting convergence to near-optimal solutions. As a result, given the same computational budget, a neuroevolutionary multitasking algorithm is expected to obtain better solutions than its single-task counterpart. For the single-task setting, the mechanism of inter-task crossover for information transfer was disabled, rendering the MO-MFEA functionally equivalent to the NSGA-II [47]. The hypervolume indicator [76] was employed to assess the quality of populations at each generation. Hypervolume trends were averaged over 10 independent runs of the EAs, providing an indicator of convergence behavior. Figure 8 shows that the hypervolume increases significantly more rapidly in the multi-task optimization setting than in the single-task setting. The smallest improvement was seen in the MNIST Even Numbers task, with multitasking achieving a final hypervolume of 0.7 whereas single-tasking reached a hypervolume of 0.648 (an 8% improvement). The largest improvement was seen in the CIFAR-10 Vehicles task, with MO-MFEA achieving a hypervolume of 0.755 compared to 0.673 in the single-task case (a 12.2% improvement). This provides indirect evidence that neuroevolutionary multitasking is able to make use of structural patterns which are commonly beneficial across tasks, transferring them across optimization processes to accelerate convergence to positive effect. To summarize, it is likely that the single-task approach, given more generations to evolve, would achieve hypervolume scores that are comparable to the MO-MFEA. However,

under limited and equal computational budget, the results show that multitasking accelerates convergence to a significant degree, which translates to substantial efficiency (and hence productivity) gains in practice.
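For reference, the hypervolume of a population in a normalized bi-objective space can be computed with an off-the-shelf indicator; the sketch below uses the pymoo library and an arbitrary reference point, and is not the authors' evaluation script.

import numpy as np
from pymoo.indicators.hv import HV

# Each row is one candidate: (normalized size, normalized error), both minimized.
population = np.array([
    [0.20, 0.55],
    [0.45, 0.30],
    [0.80, 0.15],
])
hv = HV(ref_point=np.array([1.0, 1.0]))    # reference point bounding the region of interest
print(f"hypervolume = {hv(population):.3f}")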

C. Comparison to the State-of-the-art

Here, the competitiveness of our multi-task algorithm is established against the state-of-the-art in evolutionary multiobjective model compression. To this end, the recent EMOMC [75], a single-task, bi-objective evolutionary compression method utilizing both pruning and quantization of DNNs, is considered. Note that EMOMC, or other comparable techniques for evolutionary model compression [40], do not simultaneously cater to both task settings and environmental conditions, and hence must be adapted for meaningful comparison. EMOMC starts with JATs trained on the full MNIST or CIFAR-10 datasets, and subsequently produces a library of pre-pruned models. It then performs evolutionary optimization (selecting pre-pruned models from the library and quantizing their parameters) over the same task-specific datasets as in our approach, hence allowing the generated models to specialize like MOTs. As with our approach, the EMOMC was run with a population of 60 candidates per task, over 120 generations.

1) Results: As Figure 9 illustrates, the Set of Sets approach generates a greater spread of possible solutions in many cases. In the MNIST tasks, it is observed that the average accuracy of the final population is comparable for both approaches. EMOMC achieves an average accuracy increase of 0.293% over our approach in the Even Numbers task, while our approach increases the average accuracy by 0.164% in the Odd Numbers task. On the other hand, in both the CIFAR-10 tasks (Animals and Vehicles), our approach achieves average accuracy increases of 6.83% and 2.65%, respectively. In addition, the Set of Sets exhibits major improvements in terms of average model size. Striking examples include the MNIST Odd Numbers task and the CIFAR-10 Animals task, where our approach achieves an average model size 73.1% and 38.9% smaller than EMOMC, respectively.


FIGURE 6 Images in objective space of the evolved Set of Sets for regression (Beijing Air Quality). Each plot displays an evolved population of timeseries regression models specialized for air quality prediction in a specific area of Beijing. Circles denote populations of MOTs, plotted by size and RMSE Loss for each task. A large proportion of MOTs achieves lower RMSE Loss than the JAT, potentially making them more useful for predictions in their specific areas.

FIGURE 7 Images in objective space of the evolved Set of Sets acquired for image classification experiments (MNIST and CIFAR-10). Each plot displays an evolved population of classifiers specialized for a subset of classes in their respective datasets. Circles denote populations of MOTs, plotted by size and classification accuracy for each task. In most cases, the presence of MOTs that are significantly more accurate than the JAT is less pronounced than in the previous experiments. However, considerable reduction in model size is achieved.

Since EMOMC is a single-task algorithm, these results provide further support to our intuition regarding the benefits of neuroevolutionary multitasking. From a computational standpoint, the proposed multi-task approach is expected to be more efficient; while it takes around 24 hours for pre-processing alone in EMOMC [75], the total wall-clock time consumed by our method is 5.5 hours on MNIST and 6.5 hours on CIFAR-10. This includes the time taken for evolution of the binary masks, followed by fine-tuning of model parameters of the compact MOTs on task-specific loss functions.

D. Summarizing Discussions

The Set of Sets concept was investigated in three diverse and popular domains within deep learning, using a range of datasets, performance metrics, and neural architectures. It was found that on specialized tasks, performance improvements over large-scale models could be achieved simultaneously with compression. In almost all experiments, fine-tuning resulted in specialized models which outperformed the JAT. This indicates that inter-task interference and the trading-off of errors plays a role in constraining model performance. It was also found that evolutionary optimization of the binary masks without fine-tuning is often insufficient to achieve such results, unless the extent of pruning is constrained.


Hence, it is posited that evolution is able to uncover compact model initializations which respond well to further training. The ensuing fine-tuning then enables the broad and generalizable knowledge contained in the JAT to be efficiently transferred to all MOTs. This makes the Set of Sets approach with neuroevolutionary multitasking practically advantageous, especially when compared to manually training models from scratch for each potential environmental condition and task setting.

VI. Conclusion and Future Research

This paper marks a first study on the concept and in silico evolution of a set of ML model sets from large pre-trained models. A multi-objective, multi-task problem formulation, and an algorithmic means of arriving at the set in a computationally tractable manner, are presented. The experiments confirmed that the removal of inter-task interferences from a JAT could indeed prime its compressed counterparts to achieve stronger performance on specialized tasks, with significant reductions in model size. The collection of all such compact models approximates what is regarded as a Master of All Trades/Tasks. Additionally, it is shown that the developed techniques can be effectively applied in a diverse range of settings with different neural architectures (spanning transformers for multilingual translation, LSTMs for time-series regression, and convolutional networks for image classification), with multitasking greatly improving efficiency in comparison to the associated state-of-the-art in evolutionary model compression.

FIGURE 8 Hypervolume trend comparisons between the multi-task and single-task settings. The former results in significantly improved efficiency, requiring less time to converge to better solutions. Multitasking thus forms a crucial element in practical usage of the Set of Sets. Note that this experiment made use of the MO-MFEA with online learning of inter-task crossover probabilities [58] in Line 10 of Algorithm 1. The binary mask dimensionality was therefore reduced to 1920 to facilitate the online learning.

FIGURE 9 Images in objective space of populations produced by our approach (after fine-tuning) and by EMOMC [75]. Our approach results in a greater spread of solutions in many cases, with noticeably greater size reductions.

The research presented herein illuminates many new directions for future exploration. From an algorithmic perspective, the analysis of task-specific subnetworks (their commonalities, distinctions, topologies, and information content), as they are jointly evolved from a singular JAT, represents an important line of inquiry for the design of more efficient and scalable multitasking engines. Differentiable architecture search techniques may be studied and synergized with EAs in this regard, leveraging the best of both worlds. In terms of practical application, evolved DNNs of varying size and task-specialization can be viewed as artificial ecosystems with wide-ranging utility from the cloud to personal edge devices. Future work could therefore encompass predictive as well as generative artificial intelligence, giving rise to compact models that can produce diverse digital artifacts with potential applications to engineering design, personalized drug discovery, digital art, to name just a few.

Acknowledgment

This work was supported in part by the Data Science and Artificial Intelligence Research Center (DSAIR), School of Computer Science and Engineering, Nanyang Technological University, in part

by the A*STAR Center for Frontier AI Research, and in part by the A*STAR AI3 Seed Grant C211118016.

References [1] G.-W. Ng and W. Leung, “Strong artificial intelligence and consciousness,” J. Artif. Intell. Consciousness, vol. 07, pp. 63–72, Mar. 2020. [2] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green AI,” Commun. ACM, vol. 63, no. 12, pp. 54–63, Nov. 2020. [3] Y. LeCun Jackel et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989. [4] M. Igami, “Artificial intelligence as structural estimation: Deep blue, bonanza, and alphago*,” Econometrics J., vol. 23, no. 3, pp. S1–S24, Sep. 2020. [5] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12104–12113. [6] T. Brown et al., “Language models are few-shot learners,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 1877–1901. [7] L. Orr, K. Goel, and C. Re, “Data management opportunities for foundation models,” in Proc. 12th Annu. Conf. Innov. Data Syst. Res., 2021. [8] R. Bommasani et al., “On the opportunities and risks of foundation models,” 2021, arXiv:2108.07258. [9] P. Baldi and R. Vershynin, “The capacity of feedforward neural networks,” Neural Netw.: Official J. Int. Neural Netw. Soc., vol. 116, pp. 288–311, 2019. [10] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, “On the expressive power of deep neural networks,” in Proc. 34th Int. Conf. Mach. Learn. Res., 2017, vol. 70, pp. 2847–2854. [11] S. Wu, H. R. Zhang, and C. Re, “Understanding and improving information transfer in multi-task learning,” in Proc. 8th Int. Conf. Learn. Representations, 2020. [12] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 5824–5836. [13] P. Guo, C.-Y. Lee, and D. Ulbricht, “Learning to branch for multi-task learning,” in Proc. 37th Int. Conf. Mach. Learn., 2020, vol. 119, pp. 3854–3863.






A Multi-Factorial Evolutionary Algorithm With Asynchronous Optimization Processes for Solving the Robust Influence Maximization Problem

Shuai Wang and Beichen Ding, Sun Yat-sen University, CHINA
Yaochu Jin, Bielefeld University, GERMANY

Corresponding author: Yaochu Jin (e-mail: [email protected]).

Digital Object Identifier 10.1109/MCI.2023.3277770
Date of current version: 13 July 2023
1556-603X © 2023 IEEE

Abstract—Complex networks have attracted increasing attention and have proven effective in modeling a wide variety of systems. Focusing on the selection of members with strong spreading ability, the influence maximization problem is of great significance in network-based information diffusion tasks. Much attention has been paid to simulating the diffusion process and choosing influential seeds. However, errors and attacks routinely threaten the normal function of networked systems, and few studies have considered the influence maximization problem under structural failures. Therefore, a quantitative measure with a changeable parameter is first developed in this paper to handle unpredictable destruction percentages on networks. Further, limitations of the existing methods are demonstrated experimentally. To

address these limitations, the evolutionary multitasking paradigm is employed and several problem-specific operators are developed. On top of these developments, a multi-factorial evolutionary algorithm, termed MFEARIM, is devised to find seeds with robust influence ability, in which genetic information from both the local (myopic) and the holistic view is exploited to improve the search ability. Additionally, an asynchronous strategy is designed to efficiently handle tasks with distinct evaluation costs, thereby accelerating the convergence of the search process. Experiments on several synthetic and real-world networks validate the competitive performance of MFEARIM over existing methods.

I. Introduction

Networked systems, which widely exist in nature and human society, have attracted increasing attention in recent decades. Modern systems often present complex structures; examples include the "robust yet fragile" feature of the Internet [1] and the community structure of social networks [2]. Generally, the structures of networks are closely correlated with their functions [3]; therefore, a direct and powerful way to analyze the dynamics of networks is to dissect their structural properties. A number of network properties have been discovered, including the degree distribution [4] and the clustering coefficient [5]. These properties support further investigations into the in-depth mechanisms of networked systems.

With the rapid growth of information technologies, the past decade has witnessed an explosion of social network services. Diverse platforms such as TikTok, Facebook, WeChat, and Microblog have been founded and are available to users worldwide. These online social networks provide efficient media for diffusing knowledge and information, and this remarkable advantage boosts applications in online information propagation and viral marketing [6]. Related studies indicate that a few effective influencers can generate considerable influence on networks [7]. Selecting such influencers, known as the influence maximization problem, is NP-hard [8], and the existing studies can be roughly divided into two categories. The first is to simulate the information diffusion process on networks, for which several models have been proposed [8], [9]. The second is to detect influential seeds. Methods such as heuristic-based optimization algorithms [10], strategies considering network properties [11], and population-based search techniques [12] have been verified to be effective in selecting influential seeds from nodal members. These studies lay a solid basis for analyzing the information diffusion process on social networks and for generating possible solutions to marketing problems.

Meanwhile, networked systems are often exposed to complicated environments, where destructions may cause systematic malfunctions or even great losses for human beings. Therefore, the invulnerability of networks against attacks and errors, i.e., their robustness, is imperative. Several existing studies provide applicable solutions [13], [14]. Enhancing network robustness is challenging due to its high complexity. Progress has been made in tackling this knotty optimization problem,


including greedy search in the local area of randomly generated candidates [15] and population-based topological rewiring or construction [16]. Decision makers benefit from the optimized networks when designing robust systems and improving infrastructures.

In addition to the structure of networks, the robustness of the information diffusion process is worthy of attention. Pilot studies indicate that robust solutions are in high demand when solving the influence maximization problem. In [17], the uncertainty of parameter inputs may hinder the performance of selected seeds. Analogously, the impact of noise during the diffusion process, including multiple diffusion settings and the instability of spreading activities, has been touched upon in [18]. These studies highlight the significance of the robust influence maximization (RIM) problem when finding influential seeds. However, the existing work only considers disturbances from the diffusion model and may lack generality. When designing diffusion models, the parameters and spreading activities were determined empirically from a series of real applications [8], [9], so a certain tolerance against uncertain factors is already maintained. From this perspective, studying the RIM problem only through uncertainties and noises may be lopsided and insufficient for the complicated environments faced by networks. On the other hand, structural failures are common in applications and pose a predominant threat to the normal function of networked systems [4], yet little attention has been paid to the diffusion process in the presence of structural failures. Regarding the information diffusion process, how to evaluate the influence loss caused by structural damage and how to select seeds with robust performance remain to be addressed.

Focusing on these deficiencies, this work intends to solve the RIM problem under structural failures. The link-based attack, which has been demonstrated to be destructive to the information diffusion process [19], [20], is considered in the evaluation and optimization process. Based on the existing studies on the robust influence maximization problem [21], a performance measure is first developed to numerically estimate the influence ability of seeds under attacks. Guided by this measure, an intuitive solution is then provided to highlight the shortcomings of current selection strategies when finding robust seeds. Since it is hard to predict the exact damage percentage in applications, a solution that can simultaneously deal with different damage situations is desirable. To address this difficulty and improve the performance, a problem-oriented multifactorial evolutionary algorithm (MFEA), termed MFEARIM, is developed to solve the RIM problem on networks. Several operators are designed to transfer genetic information between different tasks (i.e., different degrees of damage) and to search in the local area of current candidates. Experiments on both synthetic and real network data reveal the competitive performance of the algorithm, together with better computational efficiency.

The contributions of this work are summarized as follows. First, in terms of the complex network field, sabotages of the network topology are considered in the RIM problem, and an effective performance metric is developed for the evaluation and optimization processes. Second, to address the challenges of selecting seeds with robust influence ability,

evolutionary multitasking is introduced to solve this network-related optimization task, and several problem-directed operators are devised to exploit the optimal information. In the proposed algorithm, a distance-aware crossover operator is embedded to exchange information between individuals aiming at different tasks, and a three-stage local search operator is maintained to improve the overall performance of the entire population. The efficacy of these operators is empirically validated, and encouraging results have been obtained. In addition, a significant computational discrepancy can be found between different tasks when tackling large-scale networks, which may cause delays in the optimization process. An asynchronous strategy based on transfer learning [26] is developed to fully utilize the information transferred from low-cost tasks. This strategy is optional and contributes to an efficient solution for large-scale networks.

The rest of this paper is organized as follows: Section II reviews the related work on influence maximization, network robustness and its optimization, and evolutionary multi-factorial optimization. Section III proposes a robustness measure for the RIM problem and its direct solution. Section IV gives the details of MFEARIM. The experimental results are reported in Section V. Finally, Section VI summarizes the work in this paper.

II. Related Work

A. The Influence Maximization Problem

The essence of the influence maximization problem is to select K influential seeds from a given network G consisting of N nodes. Here, G is a network structure whose nodes represent members in the system and whose links represent the interactive relations between nodes. A popular representation is the connection matrix G = (V, E), where V = {1, 2, ..., N} contains the N nodes and E = {e_ij | i, j ∈ V} contains the M links. If G is undirected, the connections are symmetric; otherwise, they are asymmetric. Unweighted networks are considered in this work, and e_ij is 1 if nodes i and j are connected. The selected K influential seeds constitute the seed set S, and its influence ability is denoted as σ(S). Several models have been proposed to simulate the information diffusion process on networks triggered by S. For example, in the independent cascading (IC) model [8], each existing seed attempts to activate each of its neighbors with a predefined probability p in each spreading round, and the successfully activated nodes are able to spread influence in the next round. Details of diffusion models can be found in Section I of the supplementary materials. To perform an efficient performance evaluation, a fast approximation method is proposed in [22], which calculates the influence made within the 2-hop range of the seeds. σ(S) is the direct calculation of the influence range of S on a specific network, and σ̂(S) is the estimation of σ(S) based on the connectivity information in the network, defined as

  \hat{\sigma}(S) = \sum_{s \in S} \hat{\sigma}_{\{s\}} - \sum_{s \in S} \sum_{c \in C_s \cap S} p(s,c)\bigl(\sigma^{1}_{c} - p(c,s)\bigr) - \chi    (1)

where C_s denotes the connected nodes of seed s (i.e., its 1-hop neighbors), p is the propagation probability between seeds and inactive nodes, and \sigma^{1}_{c} is the 1-hop influence range of node c. \chi denotes the duplicated influence made by active nodes on the initial seeds, defined as

  \chi = \sum_{s \in S} \sum_{c \in C_s \setminus S} \sum_{d \in C_c \cap S \setminus \{s\}} p(s,c)\, p(c,d).

Briefly, σ̂(S) accumulates the influence made by the selected seeds in S within their 2-hop range and deducts the duplicated influence spread from one seed to another. An extended version is given in [23] to deal with the seed selection task in multiplex networks. With the help of such performance approximation techniques, several optimization algorithms, including memetic algorithms [23] and particle swarm optimization algorithms [24], are applicable for finding influential seeds in networks. However, none of these studies touch upon the robustness of seeds in the influence diffusion process, which has been proven to be significant in applications [18], [21]. The definition of seeds' influential robustness and the determination of robust seeds remain open to further examination.
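For concreteness, the following Python sketch shows one way the 2-hop estimate in (1) could be computed for a networkx graph. It is illustrative only: the helper names (two_hop_estimate, one_hop_sigma), the uniform propagation probability p, and the internal single-seed estimate are assumptions rather than the authors' implementation.

import networkx as nx

def one_hop_sigma(G, c, p):
    # 1-hop influence of node c: itself plus expected activations of its neighbors.
    return 1.0 + p * G.degree(c)

def two_hop_estimate(G, S, p=0.1):
    # Illustrative 2-hop approximation in the spirit of Eq. (1):
    # sum of single-seed estimates, minus overlap between neighboring seeds,
    # minus duplicated influence chi flowing back onto seeds.
    S = set(S)
    total = 0.0
    for s in S:
        # single-seed 2-hop estimate: 1-hop neighbors plus their own neighbors.
        total += 1.0 + sum(p * (1.0 + p * (G.degree(c) - 1)) for c in G.neighbors(s))
    # deduct direct seed-to-seed overlap, as in the double sum of Eq. (1).
    for s in S:
        for c in set(G.neighbors(s)) & S:
            total -= p * (one_hop_sigma(G, c, p) - p)
    # deduct duplicated influence chi (seed -> non-seed -> another seed).
    chi = 0.0
    for s in S:
        for c in set(G.neighbors(s)) - S:
            for d in (set(G.neighbors(c)) & S) - {s}:
                chi += p * p
    return total - chi

# usage: two_hop_estimate(nx.barabasi_albert_graph(500, 2), [0, 1, 2], p=0.1)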

B. Network Robustness and Its Optimization

Generally, the robustness of networks is defined as their invulnerability or resistance against destructions [1], [4]. Human sabotage [13], natural disasters [14], senescent devices [20], and many other intrinsic and extrinsic factors may threaten networked systems and cause serious losses. Robust networks are therefore highly desirable in real-world applications, which has boosted investigations on the definition and optimization of network robustness. The topology of a network has been shown to closely correlate with its performance [3], and a common way to utilize the topology information is via the integrity of the network under structural failures. As indicated in [14], a typical scenario is that a specific network is under attack but not totally collapsed. An effective solution is to record the fraction of the largest connected component during all possible damage processes, yielding a numerical measure R defined as follows,

  R = \frac{1}{N} \sum_{q=1}^{N} s(q)    (2)

where N is the number of nodes in the network and 1/N is a normalization factor that guarantees a fair evaluation. s(q) is the fraction of the largest connected component after q nodes and their attached links are removed. Networks with larger R tend to be more robust, and networks of different scales can be compared thanks to the normalized accumulation. R focuses on nodal members, and an extended version, named R_l, has been proposed in [17] to evaluate a network's invulnerability to link losses. These measures provide criteria to judge whether a certain network is reliable and guide optimization processes that search for better candidates.

Two kinds of methods have been introduced in the related studies on network optimization. The first is to improve invulnerability by protecting key components; for example, in [19] an information disturbance strategy is


developed to hide part of the structural information. The second, guided by measures like R and R_l, is to optimize the topology directly. Heuristic-based algorithms [14] and population-based search techniques [16] are effective in enhancing network robustness. These topology-rewiring methods change network structures to improve performance and can deal with a wide range of networked systems, but the associated search processes depend on repeated evaluations of the guidance measures, leading to a prohibitive cost, especially on large-scale networks. Intuitive examples of R and R_l can be found in Fig. S2 of the supplementary materials.
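As a rough illustration of how R in (2) can be evaluated, the sketch below removes nodes one by one and accumulates the fraction of the largest connected component. The descending-degree removal order is one common attack model and is an assumption here, since (2) itself does not fix the order; the function name robustness_R is likewise hypothetical.

import networkx as nx

def robustness_R(G):
    # Accumulate the largest-connected-component fraction over the whole removal process.
    H = G.copy()
    N = G.number_of_nodes()
    # attack order: highest-degree nodes first (one common choice).
    order = sorted(G.nodes(), key=lambda v: G.degree(v), reverse=True)
    total = 0.0
    for v in order:
        H.remove_node(v)
        if H.number_of_nodes() > 0:
            largest = max((len(c) for c in nx.connected_components(H)), default=0)
        else:
            largest = 0
        total += largest / N   # s(q) after q removals
    return total / N

# usage: robustness_R(nx.erdos_renyi_graph(500, 0.008))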

C. Multi-Factorial Evolutionary Optimization

Multi-factorial optimization aims at exploiting knowledge transfer between different targets (tasks) during evolutionary optimization [25]. Traditionally, tasks are solved separately, which is reliable but does not exploit the possible benefit of sharing common knowledge among them. Enlightened by the model of multi-factorial inheritance, multiple blocks of cultural bias can coexist, and an interplay can be achieved between tasks [26]. An implicit assumption is that the tasks share some common properties, so that transferable knowledge exists between them. Based on this idea, the multi-factorial evolutionary algorithm (MFEA) has been developed [25], [27]. Cross-domain multitasking optimization, even across tasks with different data shapes, is one of the pursuits of MFEA. Empirical results reveal that functions with intersecting and separated optima can be exploited simultaneously, and the implicit gene transfer operation accelerates the convergence of the evolutionary procedure [28]; the search ability has been verified in [29]. This paradigm introduces a novel framework to solve optimization problems in parallel by harnessing latent genetic complementarities between problems, which has induced rapid growth of research interest on this topic. Strategies including resource allocation [30], encoders on decision variables [31], coordinate approaches [32], and decomposition [33] have been developed to further improve the efficiency in different scenarios. In addition, the multitasking approach has attracted great attention in engineering fields, such as the beneficiation process [34], computationally expensive problems [35], and engineering design [36]. These applications boost in-depth investigations on multitasking optimization techniques. Meanwhile, some pilot studies have validated the effectiveness of MFEA in solving optimization problems on multilayer networks [37], and a case study on the RIM problem for single-layer networks is given in this work.

III. A Robustness Measure for the RIM Problem and Its Direct Solution

Most existing studies on the influence maximization problem intend to determine influential seeds from network data. Little attention has been paid to potential changes of the network structure and the possible impact of such changes on the information diffusion process. For seeds, activating other nodes relies heavily on the connections between them; therefore, the integrity of a network is crucial to the information diffusion process. Previous work indicates that structural failures are destructive to the connectivity of networks and to the functional components attached to them [14], [21]. This work investigates the effect of structural failures on the information diffusion process, i.e., the RIM problem.

A. A Robustness Measure

The first step is to evaluate the performance of seeds under threats. As mentioned above, the connections, or links, of a network are significant to the RIM problem. Meanwhile, links are vulnerable to attacks and errors, and the loss of important links can lead to serious malfunctions, as shown in [14]. A natural way to evaluate the robustness of seeds is therefore to study the change of their influence ability under link-based attacks. In the destruction process, pivotal links dominate the connectivity and are likely to be removed first. Several metrics are available to evaluate the importance of links; the link degree [15] is widely used due to its low cost and good efficacy. For a link e_ij, its degree is defined as the square root of the product of the degrees of the two connected nodes i and j, i.e., deg(e_ij) = sqrt(deg(i) · deg(j)). A high-degree link means that the two connected nodes have many neighbors, so this link tends to be important in maintaining connectivity. Following the degree rank of links, attackers can gradually destroy the network topology; in the extreme case, all links are removed and all nodes become isolated. However, each removal operation still has a cost, and a more practical situation is that only a fraction Per (Per ≤ 1) of the links can be removed due to limited resources. Under link-based attacks, the robustness of seeds when spreading information is thus the focal point.

Assuming that a percentage Per of the links are removed gradually according to their importance, a series of damaged networks is obtained. Robust seeds are expected to maintain good influence ability consistently over all of these networks. The design of network robustness measures such as R and R_l can inform the design of a measure for seeds: the evaluation on each damaged network is independent, and a normalized accumulation yields the final numerical result. Given a seed set S, the measure of its robustness in the RIM problem is defined as follows,

  R_S = \frac{1}{M \cdot Per} \sum_{P=1}^{M \cdot Per} \hat{\sigma}(S \mid P)    (3)

where M is the number of links in the network and Per is the damage percentage. Similar to (1), σ̂(S | P) represents the estimated influence range of the seed set S in the damaged network after losing P links. R_S introduces a parameter Per, and only ⌊M · Per⌋ links are removed in the evaluation process. Like R and R_l, R_S yields a numerical evaluation result, and seeds with robust influence ability tend to obtain larger R_S values.

This work concentrates on the robustness of seeds in the information diffusion process, which differs from the network robustness problem in several respects. First, the optimization targets are different.

TABLE I RS values of seeds selected by GA on SF networks. The first column shows the Per value used when conducting GA, and the first row shows the Per value used when testing the performance. For each testing scenario (each column except the first one), the best performance reached is marked with an asterisk. Results are averaged over five independent realizations.

FIGURE 1 An intuitive description of the evaluation process of RS. Two cases of different parametric configurations are compared.

The network robustness problem primarily aims at searching for better topologies to resist structural perturbations, while the RIM problem intends to select robustly influential nodes as seeds and does not attempt to change the topology. Second, although the decision variables of both problems are network-related discrete data, their evaluation processes differ. When enhancing the robustness of a network, the decision variable is the network structure itself, and the robustness evaluation can be completed given a specific structure. In contrast, the decision variable of the RIM problem is a set of K node labels; the influence ability cannot be estimated from these labels alone, as the specific network must be presented concurrently. These distinct divergences prevent a direct application of existing optimization methods such as those in [19], [20] to the RIM problem, and the efficient selection of seeds remains an open question. An intuitive example of R_S is given in Figure 1, and an illustrative evaluation sketch is provided below.
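The following sketch outlines how R_S in (3) could be evaluated under a link-degree attack: links are ranked by deg(e_ij) = sqrt(deg(i)·deg(j)), removed one at a time up to the fraction Per, and the 2-hop estimate is averaged over the damaged networks. It is a minimal illustration that assumes the two_hop_estimate helper sketched in Section II-A; the attack order is computed once on the intact network for simplicity, and none of the names reflect the authors' code.

import math

def link_degree_order(G):
    # rank links by deg(e_ij) = sqrt(deg(i) * deg(j)), most important first.
    return sorted(G.edges(),
                  key=lambda e: math.sqrt(G.degree(e[0]) * G.degree(e[1])),
                  reverse=True)

def robustness_RS(G, S, per=0.3, p=0.1):
    # Average the 2-hop influence estimate over gradually damaged copies of G (Eq. (3)).
    H = G.copy()
    n_remove = int(G.number_of_edges() * per)
    order = link_degree_order(G)[:n_remove]
    total = 0.0
    for (u, v) in order:
        H.remove_edge(u, v)
        total += two_hop_estimate(H, S, p)   # sigma_hat(S | P) on the damaged network
    return total / max(n_remove, 1)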

B. An Intuitive Solution of the RIM Problem

To solve an optimization problem, an explicit target is necessary when conducting search operations. R_S contains a changeable parameter Per that determines the specific damage degree on the network, which is another feature of this measure. As shown in [10], [38], traditional solutions for finding influential seeds are single-objective, with measures like σ(S) taken as the performance indicator. Similarly, an intuitive solution for the RIM problem is to determine the parameter Per first and then complete the seed selection task; a pilot study can be found in [21]. Here, Per introduces a distinct preference into the optimization process: a larger Per means that the network suffers serious damage, and vice versa. Noticeable differences arise in the damaged networks when different Per values are chosen, which may further affect the determined seeds.

Assume that Per lies in the range {0.1, 0.3, 0.5, 0.7, 0.9}, and a plain genetic algorithm (GA) is implemented to find the corresponding seeds guided by R_S with a specific Per. The IC model is adopted to simulate the information diffusion process with an activation probability of 0.1. Experiments are conducted on two synthetic network families: scale-free (SF) networks with a power-law degree distribution [38] and Erdős–Rényi (ER) networks [39] with a Poisson degree distribution. All networks have 500 nodes and an average degree of 4, and the size of the seed set K is set to 10.

GA Per \ Test Per      0.1       0.3       0.5       0.7       0.9
0.1                   12.05*    11.35     10.89     10.63     10.49
0.3                   11.98     11.37*    10.92     10.66     10.51
0.5                   11.66     11.29     10.96*    10.70*    10.54
0.7                   11.65     11.28     10.95     10.69     10.54
0.9                   11.52     11.21     10.94     10.69     10.55*

For the SF network, the initial fully connected network consists of four nodes, and two links are attached for each newly added node [39]. For the ER network, links are placed between two independent nodes with a probability of 0.05 until the number of links reaches the predefined criterion [40]. For a specific value of Per, a seed set St is provided by GA, and St is then tested by R_S with the other values of Per. If St reaches good results when R_S is set with different Per values, St can be considered a robust solution. Numerical R_S results of the selected seeds on SF networks are given in Table I.

As depicted in Table I, the setting of Per impacts the seed selection process. When GA is guided by R_S with a small Per, the obtained seeds are likely to perform well on networks suffering from mild attacks but poorly on networks suffering from serious attacks, and vice versa. This phenomenon is consistent with the intuitive assumption. As indicated in [21], R_S with a rational Per is expected to yield seeds that perform robustly in different scenarios; in other words, the R_S value should be close to the achievable best result in every scenario. Therefore, based on the results in Table I, it can be evaluated numerically under which Per the obtained seeds reach better results across the other Per settings. For each Per, the differences between the obtained results and the best searched ones are collected and shown in Table II.

The results in Table II reveal that when Per is set to 0.3, the seeds obtained by GA reach a relatively robust influence ability over all tested damage scenarios. Meanwhile, as Per increases, the robustness of the obtained seeds deteriorates, mainly because those seeds perform extremely poorly when the network loses only a small part of its links. Following the same calculation method, results on ER networks are shown in Table III, from which a similar conclusion can be drawn: the obtained seeds are more robust when Per is set to 0.3. The experiments on both popular synthetic network families thus indicate that a suitable setting of Per contributes to a higher performance of the selected seeds. Considering the universality of SF and ER networks in the real world [3], [38], 0.3 seems to be applicable when solving the RIM problem, and the corresponding seeds are considered robust for information diffusion. Further details can be found in [21]. Briefly, the introduced intuitive solution provides a reasonable value for the changeable parameter in R_S based on experimental validation; R_S is thereby transformed into a deterministic measure, similar to σ̂(S) in (1).
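A minimal sketch of how comparable synthetic benchmarks could be generated with networkx is given below; the exact generator calls and parameter choices (Barabási–Albert with m = 2, and a 500-node, 1000-link ER graph so that the average degree is 4) are assumptions that only approximate the construction described above.

import networkx as nx

# SF benchmark: Barabasi-Albert growth, each new node attaches 2 links,
# giving roughly 500 nodes with average degree 4.
sf_net = nx.barabasi_albert_graph(n=500, m=2, seed=1)

# ER benchmark: fix the number of links at N * <k> / 2 = 1000 so that <k> = 4.
er_net = nx.gnm_random_graph(n=500, m=1000, seed=1)

print(sf_net.number_of_nodes(), sf_net.number_of_edges())
print(er_net.number_of_nodes(), er_net.number_of_edges())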


TABLE II The summary of difference values of RS between the obtained results and the corresponding best ones in Table I (SF networks).

Per           0.1     0.3     0.5     0.7     0.9
Difference    0.22    0.19    0.48    0.52    0.72

TABLE III The summary of difference values of RS tested on ER networks.

Per           0.1     0.3     0.5     0.7     0.9
Difference    0.36    0.31    0.39    0.35    0.37

Nevertheless, some key questions remain for this intuitive solution. First of all, setting Per to 0.3 may not be effective for other kinds of network data or other experimental setups [21]. A possible but time-consuming option is to find a suitable Per for each network, which is inefficient and causes additional computational costs. Moreover, a single solution seems inadequate to fully solve the RIM problem: a clear performance degeneration can be detected in Table I whenever the parametric settings of the optimizing and testing processes differ. This phenomenon reveals that a series of solutions with different emphases should be considered for the RIM problem. Since the optimization targets defined by R_S with different Per values are correlated rather than conflicting, multi-objective techniques are not applicable here. To address these questions, the multi-factorial optimization paradigm is employed to solve the RIM problem, and the designed algorithm is presented in the next section.

IV. MFEARIM

Similar to traditional EAs, an MFEA performs the search with a single population of P individuals, while L optimization tasks {T1, T2, ..., TL} are tackled simultaneously. Here each task Tl corresponds to a specific optimization target, namely R_S with a certain Per. Two definitions are requisite for the multitasking optimization process [25]. 1) Skill factor: the skill factor τ_i of individual p_i is the one task, among all L tasks, on which p_i is the most effective. 2) Scalar fitness: the scalar fitness of p_i is given by the fitness rank of this individual in the multitasking environment, i.e., φ_i = 1 / r_i^{τ_i}, where r_i^{τ_i} is the rank of p_i on task τ_i. With the skill factor, an L-factorial environment is constructed to leverage the implicit parallelism between different tasks. All individuals are encoded in a unified space X, and complementary genetic material is transferred between individuals with disparate optimization tasks. The transfer is commonly performed via crossover operations, with a random mating probability acting as the criterion; this algorithmic parameter is prescribed in [25], and an improved version learns it through a probabilistic model [27]. Most existing work focuses on solving optimization problems with numerical decision variables, whereas the RIM problem involves discrete data, namely the labels of seeds selected from the nodes. This distinct divergence hinders a direct application of current MFEAs to find seeds with robust influence ability.
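To make the two definitions concrete, the sketch below assigns skill factors and scalar fitness values from a matrix of per-task objective values (larger R_S is better); the array layout and the function name are illustrative assumptions.

import numpy as np

def skill_and_scalar_fitness(obj):
    # obj[i, l] = R_S value of individual i on task l (larger is better).
    P, L = obj.shape
    # rank of each individual on each task: 1 = best.
    ranks = np.empty_like(obj, dtype=int)
    for l in range(L):
        order = np.argsort(-obj[:, l])          # descending objective value
        ranks[order, l] = np.arange(1, P + 1)
    scalar_all = 1.0 / ranks                    # phi_{i,l} = 1 / r_i^{T_l}
    skill = np.argmin(ranks, axis=1)            # tau_i: task with the best rank
    scalar = scalar_all[np.arange(P), skill]    # phi_i = 1 / r_i^{tau_i}
    return skill, scalar

# usage:
# obj = np.random.rand(6, 3)
# tau, phi = skill_and_scalar_fitness(obj)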

A. The Framework

The intuitive solution in Section III-B suffers from low efficiency, and MFEARIM is proposed to tackle this deficiency by finding the best seeds under different damage percentages simultaneously. The tasks for the RIM problem can be represented as {R_S1, R_S2, ..., R_SL}, where each task R_Sl uses a specific Per setting in the evaluation process. A series of seed sets is expected to be the final output of MFEARIM when solving these tasks. An intuitive description of the algorithm is given in Fig. S3 of the supplementary materials.

Algorithm 1. MFEARIM

Input: G0: the input network; P0: the initial population size; MaxGen: the maximum number of genetic iterations; K: the size of the seed set;
Output: {S1, S2, ..., SL}: the best solution found for each task;
______________________________________________________
Conduct the initialization operator (G0, P0, K) to generate Pop0, set g = 0;
while g < MaxGen do:
    Conduct the crossover operator on Pop0 to generate Pop (shown in Algorithm S1);
    Conduct the mutation operator on Pop;
    Conduct the local search operator on Pop (shown in Algorithm S2);
    Conduct the selection operation on Pop to generate Pop0;
    Update the best solution (S1, S2, ..., SL) for each task;
    g = g + 1;
end while;

First, P0 individuals are generated to initialize the population Pop0, each encoding K seeds. Both random selection and degree-related selection are adopted in this seed selection operation. The performance of all individuals is evaluated on the L tasks (using the corresponding measures R_Sl) to obtain the rank of each individual on every task. Each individual p_i thus has L scalar fitness values φ_{i,j} = 1 / r_i^{T_j}, where j ∈ [1, L] and r_i^{T_j} is the rank of p_i on task T_j; the task T_j for which φ_{i,j} is the largest is selected as the skill factor τ_i.

Conducted on the initial population, the crossover operator exchanges genetic information between two selected individuals (p1 and p2) and generates more potential candidates. If p1 and p2 have the same τ, a uniform crossover operation is implemented; otherwise, a seed c in p2 is selected based on the distance between the seeds in p2 and those in p1, and a seed in p1 is replaced by c. In this way, several new candidates are generated and included in the whole population. The subsequent mutation operator randomly mutates part of the genetic information of individuals with a low probability. Also, a three-stage local search

operator is designed to utilize the micro- and macro-level structural information of the selected seeds, so that individuals can learn from those with good performance and a performance improvement is achieved across the entire population. To conclude each genetic iteration, the selection operator first updates the rank and scalar fitness information of all individuals, and then adopts an elitism-preserving strategy to keep the better individuals in the next generation. The solutions that perform best on each task are tracked and updated, and they constitute the output of MFEARIM. The framework of MFEARIM is summarized in Algorithm 1.

B. Genetic Operators

First, two strategies are adopted in the initialization operator. For the first half of the population, K seeds are randomly selected from the N nodes in G0. For the second half, K nodes with larger degrees (e.g., the top two percent) are determined and saved in a group Top; the first seed of each individual is randomly selected from Top, and the remaining K - 1 seeds are selected at random. Duplicate seeds must be avoided within a seed set, so repetitive ones are replaced by randomly selected nodes from G0. For individuals in Pop0 generated via inter-task transfer, their performance on all L tasks is evaluated, together with the corresponding skill factor and scalar fitness; for individuals generated via intra-task crossover, only the performance on that task is evaluated. The shortest path distance between each node pair in G0 is also computed and saved in Dis. Each individual p_i encodes a seed set p_i = {s1, s2, ..., sK}, with s_j ∈ {1, ..., N}; each seed is selected from the N nodes in G0 to work as an influence spreader.

Given Pop0, the crossover operator generates further candidates to form the whole population Pop, where both intra-task and inter-task genetic information exchanges are considered. A total of P - P0 individuals are generated. For each generated individual pt, two parents, p1 and p2, are randomly selected from Pop0. A uniform crossover operation is implemented if the skill factor τ1 of p1 is the same as that of p2 (τ2); in this case, each seed of pt is selected from the corresponding position in p1 or p2 with equal probability. If τ1 ≠ τ2, the distance information Dis is considered when selecting the crossover position. This distance information reflects the structure of the input network G0, where the shortest paths between all node pairs have been evaluated in advance. For each seed j in p2, the distance between j and all seeds in p1 is evaluated and summarized as dis_j; a small dis_j indicates that j is located close to the seeds in p1, so that less disturbance is introduced if j replaces a seed in p1. Additionally, the genetic iteration information is considered: assuming the current generation is gen, the distance information for node j is normalized as dis_j = gen / dis_j. The procedures are summarized in Algorithm S1 of the supplementary material, and an illustrative sketch is given below.

Based on Pop, the mutation operator makes subtle perturbations of the genetic material with a probability of pm. Specifically, a seed in the current individual is replaced by another random node from G0, and its performance is updated. This random operator helps escape from local optima.
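The sketch below illustrates the distance-aware inter-task crossover and the random mutation described above; the tie-breaking choices, the replacement rule, and the helper names are assumptions made for illustration and do not reproduce Algorithm S1.

import random
import networkx as nx

def distance_aware_crossover(G, dis, p1, p2, tau1, tau2, gen):
    # dis[u][v]: precomputed shortest-path length between nodes u and v.
    if tau1 == tau2:
        # intra-task: uniform crossover, position-wise.
        child = [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]
    else:
        # inter-task: pick the seed of p2 closest (after gen-scaling) to the seeds of p1.
        child = list(p1)
        scores = {}
        for j in set(p2) - set(p1):
            d = sum(dis[j].get(s, len(G)) for s in p1)
            scores[j] = gen / max(d, 1)            # dis_j = gen / dis_j
        if scores:
            c = max(scores, key=scores.get)        # most promising migrant seed
            child[random.randrange(len(child))] = c
    # repair duplicates so the child remains a valid seed set.
    pool = [v for v in G.nodes() if v not in child]
    seen = set()
    for k, s in enumerate(child):
        if s in seen:
            child[k] = pool.pop(random.randrange(len(pool)))
        seen.add(child[k])
    return child

def mutate(G, ind, pm=0.3):
    # replace one random seed with a random non-seed node, with probability pm.
    ind = list(ind)
    if random.random() < pm:
        candidates = [v for v in G.nodes() if v not in ind]
        ind[random.randrange(len(ind))] = random.choice(candidates)
    return ind

# dis could be built once via: dis = dict(nx.all_pairs_shortest_path_length(G))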

Further, the local search operator is conducted to improve the performance of all individuals in the current population. Similar operations can be found in [16], [21] and have shown effectiveness in searching for seeds with good performance. In this work, several tasks are considered simultaneously, and a three-stage procedure is developed in this operator; an optional asynchronous strategy is also provided. For a specific individual p = {s1, s2, ..., sK}, the local area is exploited to improve its performance, and p is expected to be updated as p = {s1', s2, ..., sK'}. Details are given as follows.

1) Micro-view local search: The first stage searches the neighboring area of the seeds of the current individual with a probability of pl. For each seed s, every neighbor of s in G0 is checked to judge whether it is suitable to become a seed; the possible candidates are put into a temporary set Nei. Focusing on the skill factor of this individual, the performance of all candidate seed sets is evaluated, and the best one is identified to update the individual (an illustrative sketch is given after this description).

2) Macro-view local search: The second stage considers the hubs of G0 as seeds, which have been saved in the set Top during initialization; inapplicable ones are removed from Top first. The operation probability is defined as pl · (MaxGen - gen) / MaxGen, so more replacements are conducted in the early phase of the genetic process and fewer in the later phase. A roulette selection is implemented to choose a low-degree seed in the current individual, and this seed is replaced by the hub with the best performance in Top, where the performance is evaluated with respect to the skill factor. Such low-degree nodes tend to make limited influence [38], and replacing them with hubs may promote the influence ability; however, an excessive utilization of hubs is likely to cause local optima, so this operation is regulated by the decreasing probability.

3) Individual learning procedure: The whole population is divided into several sub-populations according to the skill factors of the individuals, and the last stage makes individuals learn from others, especially from those with good performance. For each sub-population (i.e., each task), an individual is randomly selected to act as the learner, and a task is randomly determined. If this task is the same as that of the learner, a tutor is chosen from the corresponding sub-population at random; otherwise, the tutor is chosen by roulette based on the individual fitness in that sub-population. One low-degree seed of the learner is selected by roulette and replaced by a high-degree seed selected from the tutor. The replacement is kept if it improves the scalar fitness.

4) An asynchronous strategy: Different tasks require different Per values, and a larger Per produces a higher latency when evaluating the measure. This non-uniform latency may cause delays when solving expensive optimization problems [40]. An additional parameter, LowS, is introduced to limit the computational resources spent on high-cost tasks: in generations earlier than LowS, only the first half of the tasks are addressed, and individuals whose skill factors belong to the second half randomly choose another task from the first half in the subsequent local search procedure. Since the optimization of R_S with different Per values is correlated, such an approximate, low-cost optimization can reduce the overall running time of the algorithm.


FIGURE 2 Results of compared algorithms on SF networks when considering three tasks against NOFE.

Details of the aforementioned strategy are given in Algorithm S2 of the supplementary material.
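As a minimal illustration of the first (micro-view) stage, the sketch below tries each neighbor of each seed as a replacement and keeps the best candidate according to the individual's skill-factor task; the acceptance rule and the fitness callback are assumptions, not a transcription of Algorithm S2.

import random

def micro_view_local_search(G, ind, fitness, pl=0.6):
    # fitness(seed_set) evaluates R_S for the task given by the individual's skill factor.
    best = list(ind)
    best_val = fitness(best)
    for k, s in enumerate(ind):
        if random.random() > pl:
            continue
        for nb in G.neighbors(s):
            if nb in best:
                continue
            candidate = list(best)
            candidate[k] = nb                 # try moving the seed to its neighbor
            val = fitness(candidate)
            if val > best_val:                # keep only improving moves
                best, best_val = candidate, val
    return best

# usage (with the helpers sketched earlier):
# improved = micro_view_local_search(G, seeds, lambda S: robustness_RS(G, S, per=0.3))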

V. Experimental Results

In this section, empirical validations are provided to test the performance of MFEARIM. The algorithmic parameters are set as follows: MaxGen = 200, P0 = 50, P = 70, pc = 0.6, pm = 0.3, and pl = 0.6. The maximum number of function evaluations (NOFE) is adopted as the stopping criterion to guarantee a fair comparison; the budget consumed by low-cost tasks is counted as 1 and that of high-cost tasks as 2, and a total budget of 2 × 10^4 is allowed. The IC model is taken as the influence diffusion model with an activation probability of 0.1. Since few methods have been designed for solving the RIM problem in a multitasking way, several improved versions of existing algorithms have been developed to fit the seed selection task. The original MFEA shows reliable performance in solving multi-tasking optimization problems, and its modified version with network-oriented genetic operators, termed MFEANet, is compared. SREMTO proposed in [30] considers a changeable intensity of cross-domain transfer, which is similar to the idea in this paper, and its network-directed version SREMTONet is implemented. GEMFDE in [32] is modified as GEMFDENet, and the theory of MFDE/MVD [33] is implemented as MFDE/MNet. In addition, a differential evolution (DE) based MFEA [41], termed MFDENet, is compared, with a population size of 100, a crossover probability of 0.6, a mutation probability of 0.3, and a random mating probability of 0.3. The operations in [24] are implemented to adapt DE to the seed selection task.
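For reference, a plain Monte-Carlo simulation of the independent cascading model used in the experiments (activation probability 0.1) could look like the sketch below; averaging over repeated runs approximates σ(S), whereas the paper relies on the cheaper 2-hop estimate σ̂(S) during optimization. The function name and the number of repetitions are illustrative.

import random
import networkx as nx

def ic_spread(G, seeds, p=0.1, runs=1000):
    # Monte-Carlo estimate of sigma(S) under the independent cascading model.
    total = 0
    for _ in range(runs):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            new_frontier = []
            for u in frontier:
                for v in G.neighbors(u):
                    # each newly activated node gets one chance to activate each neighbor.
                    if v not in active and random.random() < p:
                        active.add(v)
                        new_frontier.append(v)
            frontier = new_frontier
        total += len(active)
    return total / runs

# usage: ic_spread(nx.barabasi_albert_graph(500, 2), seeds=[0, 1, 2], p=0.1)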

A. Benefit of Using Multitasking Optimization

Synthetic SF and ER networks are generated to test the performance of the algorithms. These networks have N = 500 and an average degree of 4, and K is set to 10. Similar to the experimental setting in Section III-B, the Per values are chosen from {0.1, 0.3, 0.5, 0.7, 0.9}, so the MFEAs have at most five different tasks (L = 5), labeled T1–T5. The results in Tables II and III indicate that R_S with smaller Per promotes the robustness of selected seeds, and such tasks are considered with priority. In the experiment of Figure 2, the results of the MFEAs with L = 3 on SF networks are plotted, together with those of a single-objective memetic algorithm (MA), which is GA embedded with the first-stage local-search operator described in Section IV-B, using pc = 0.6, pm = 0.3, pl = 0.6, and a population size of 50.

As shown in Figure 2, distinct performance differences can be noticed. MFDE cannot solve this optimization problem. The other methods gradually approach similar optimal results when solving T1 and T2, while MFEARIM reaches a better result on T3. Since the decision variables of the RIM problem are highly discrete nodal labels, the DE-based optimization process is ill-suited to this problem, which may explain the inferior result of MFDE in Fig. 2. Meanwhile, MFEARIM shows superior convergence when searching for the diverse seeds required by R_S under a given Per setting, especially when Per is large. The three-stage local search operator embedded in the proposed algorithm is not considered in SREMTO, GEMFDE, or the plain MFEA, and MFDE/M lacks problem-directed search; the results here demonstrate the remarkable contribution of this operator to the search process. It should also be noted that the MA results for the three tasks are obtained in independent realizations, whereas the MFEA results for these tasks require only one realization, so a dramatic improvement in computational efficiency is achieved by the MFEAs. In general, the proposed MFEARIM shows competitive performance compared with the other tested methods on all three tasks in Figure 2.

The optimization results of the MFEAs with L = 5 and K = 10 on the two synthetic networks are listed in Table IV with a maximum NOFE of 2 × 10^4. It can be seen that MFEARIM achieves the best results in almost all tested cases, and the performance advantage becomes more marked as Per in R_S increases (tasks with larger labels). Compared with the results shown in Fig. 2, a performance degeneration can be found in MFEANet and the other approaches when these algorithms consider more tasks; it seems that multitasking optimization may hinder their exploitation ability. MFEARIM maintains a considerable performance even when tackling multiple tasks, which is preferable in applications.

TABLE IV The best-found results on SF and ER networks with five tasks. Wilcoxon rank-sum tests with a significance level of p = 0.05 are adopted to analyze the statistical difference of the results compared with MFEARIM: "-" means the algorithm is inferior to MFEARIM, "+" means the algorithm outperforms MFEARIM, and "≈" means they do not show a clear difference. The results are averaged over five independent realizations.

NETWORK  METHOD       T1           T2           T3           T4           T5
SF       MFEARIM      12.059       11.366       10.961       10.700       10.544
         SREMTONet    12.053 (-)   11.366 (≈)   10.955 (-)   10.699 (≈)   10.540 (-)
         GEMFDENet    12.055 (-)   11.366 (≈)   10.959 (-)   10.689 (-)   10.543 (≈)
         MFDE/MNet    12.058 (≈)   11.365 (≈)   10.957 (-)   10.698 (≈)   10.539 (-)
         MFEANet      12.050 (-)   11.365 (≈)   10.954 (-)   10.696 (-)   10.538 (-)
         MFDENet      11.983 (-)   11.335 (-)   10.874 (-)   10.559 (-)   10.525 (-)
         MA           12.059 (≈)   11.367 (≈)   10.945 (-)   10.685 (-)   10.537 (-)
ER       MFEARIM      10.731       10.597       10.512       10.438       10.367
         SREMTONet    10.727 (-)   10.592 (-)   10.508 (-)   10.426 (-)   10.360 (-)
         GEMFDENet    10.730 (≈)   10.593 (-)   10.505 (-)   10.433 (-)   10.361 (-)
         MFDE/MNet    10.729 (-)   10.590 (-)   10.507 (-)   10.437 (≈)   10.364 (-)
         MFEANet      10.722 (-)   10.591 (-)   10.504 (-)   10.425 (-)   10.368 (≈)
         MFDENet      10.601 (-)   10.416 (-)   10.389 (-)   10.325 (-)   10.229 (-)
         MA           10.730 (≈)   10.608 (+)   10.499 (-)   10.408 (-)   10.357 (-)

Focusing on MA, this method obtains competitive results when Per is small (tasks with smaller labels) but only moderate results when Per is large. MA maintains a single objective, so a series of realizations is required to tackle the various tasks, and it still fails to achieve good results on some of them. MA only considers the local area of the seeds in the search process; this operator guarantees that the selected seeds perform well under modest structural failures, but when the network loses significant connectivity these seeds are likely to show limited influence ability, as shown in Table I. The results in Figure 2 and Table IV confirm the competitive performance of MFEARIM in identifying robust seeds. In addition, a runtime comparison of the algorithms on the SF network is given in Table V, taking T5 as an example. The MFEANet variant without local search procedures requires the least computational budget, but the obtained result is not competitive. MFEARIM stays at a runtime level similar to that of the existing methods, while its performance demonstrates its superiority.
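The significance markers in Table IV come from Wilcoxon rank-sum tests at p = 0.05 over independent runs; a minimal sketch of such a comparison with SciPy is given below, where the sample arrays are placeholders rather than the reported data.

from scipy.stats import ranksums

def compare_to_baseline(baseline_runs, competitor_runs, alpha=0.05):
    # Wilcoxon rank-sum test between two sets of independent final R_S values.
    stat, p_value = ranksums(competitor_runs, baseline_runs)
    if p_value >= alpha:
        return "~"                      # no clear difference
    return "+" if sum(competitor_runs) > sum(baseline_runs) else "-"

# placeholder usage with made-up run results:
# mark = compare_to_baseline([10.70, 10.69, 10.71, 10.70, 10.70],
#                            [10.68, 10.69, 10.69, 10.70, 10.68])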

B. Tests on Multitasking Parameter Setting

TABLE V The running time comparison of different MFEA-based methods.

METHOD    MFEARIM   SREMTONet   GEMFDENet   MFDE/MNet   MFEANet
Time (h)  110       104         112         113         85

Taking task T3 as an example, the performance of MFEARIM and MFEANet with different L on the two synthetic networks is compared in Figure 3. As shown in the figure, performance changes can be detected when different numbers of tasks are considered. For MFEARIM, a larger L leads to an improvement in the robustness of the seeds compared with a smaller L. In Section IV, several operations have been designed to leverage the pool of genetic information in the multitask search process: in the crossover operator, the structural distance between seeds is taken as the criterion for inter-task genetic material exchange; in the local search operator, a learning procedure is adopted to extract potential knowledge from other individuals. These operations are directed at the RIM problem and consider the unique features of network data. With increasing L, richer information is provided to the optimization process, and the exploitation ability of MFEARIM is promoted. On the contrary, existing MFEAs like MFEANet do not maintain such problem-guided operations, so their search performance deteriorates when more tasks are considered. Without a powerful information transfer technique, a larger L may lead to a plethora of information and overly disperse the computational cost across tasks, which explains the performance degeneration observed in Table IV and Figure 3.

In essence, the objective of the RIM problem is to find a seed set that performs robustly over a wide range of damage situations. In terms of the algorithm design, however, individuals with good performance on a specific task tend to be kept in the population. Under this selection manner, some individuals with good overall performance but no prominence on any specific task may be omitted in the evolutionary process. Considering this deficiency, a rectified selection strategy is examined: in the algorithm termed MFEARIM-Overall, individuals with better overall performance, i.e., larger ΣR_S = Σ_{l=1}^{L} R_Sl, are selected instead of those with a better R_S on one specific task. The experimental results on the two synthetic networks are shown in Figure 4, where two seed set sizes (K = 10 or 20) are tested and the summed R_S values over the five tasks are compared with a maximum NOFE of 1.2 × 10^4.

As shown in Figure 4, the new selection strategy has a distinct effect on the search process on ER networks, while the results on SF networks fluctuate only within a narrow range. This outcome may be caused by the diverse structural features of the two network families. ER networks exhibit a Poisson degree distribution [40], which means that the degrees of all nodes are similar to each other, and peculiar nodes with extremely high or low degree are hard to find. SF networks, in contrast, exhibit a remarkable power-law degree distribution [3], [32], which indicates that only a few hubs exist in such networks and that they play crucial roles in maintaining connectivity.


FIGURE 3 Results of T3 obtained by two MFEAs with different L on SF and ER networks against NOFE.

Affected by these structural features, the seed selection process on ER networks relies less on structural information than that on SF networks; therefore, an intensive search operation may be necessary to find powerful seeds. An averaged nodal importance provides diversified solutions for reaching robust information diffusion, and remarkable differences tend to exist between solutions guided by different measures (i.e., RS with different Per). On the contrary, some nodes always show dominance in the diffusion process due to their structural significance, and these nodes cannot be neglected as seed candidates. In this manner, the summarized RS is insensitive to the different selection strategies on SF networks, but is dramatically affected on ER networks. Numerical comparisons of the performance of seeds under the two selection strategies are given in Table VI. MFEARIM-Overall and MFEARIM reach similar performance on SF networks, and a marginal decrease can be found in the σ̂(S) of MFEARIM-Overall. On ER networks, however, MFEARIM-Overall achieves a better SRS, and the two algorithms show superiority on different tasks. Also, Per = 0.3 may not always be the best choice for finding robust seeds, which reflects another deficiency of the intuitive solution introduced in Section III-B: a fixed Per seems inadequate for tackling seed selection tasks on different networks. The results in Figure 4 and Table VI validate the potential contribution of the selection strategy based on overall performance, which is likely to facilitate seed selection tasks on networks with a moderate degree distribution.
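To make the two selection rules concrete, the short Python sketch below contrasts per-task selection with the overall-performance selection used by MFEARIM-Overall; the dictionary layout, scores, and function names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the two selection rules discussed above, assuming each
# individual stores one robustness score RS per task (hypothetical structure).
from typing import List

def select_per_task(population: List[dict], task: int, size: int) -> List[dict]:
    """Keep the individuals with the largest RS on one specific task."""
    return sorted(population, key=lambda ind: ind["RS"][task], reverse=True)[:size]

def select_overall(population: List[dict], size: int) -> List[dict]:
    """MFEARIM-Overall style: rank by the summed RS over all L tasks."""
    return sorted(population, key=lambda ind: sum(ind["RS"]), reverse=True)[:size]

# Toy usage with three individuals and L = 3 tasks.
pop = [
    {"seeds": {1, 4, 7}, "RS": [0.9, 0.2, 0.3]},   # strong on task 0 only
    {"seeds": {2, 5, 8}, "RS": [0.6, 0.6, 0.6]},   # balanced overall performer
    {"seeds": {3, 6, 9}, "RS": [0.1, 0.8, 0.2]},
]
print(select_per_task(pop, task=0, size=1)[0]["seeds"])  # {1, 4, 7}
print(select_overall(pop, size=1)[0]["seeds"])           # {2, 5, 8}
```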

C. Algorithmic Sensitivity Analysis

In addition, the sensitivity of the proposed algorithm to its parameters is tested with the maximum NOFE set to 2 × 10^4. The results obtained by the algorithm with different pc, pm, and pl are drawn in Figure 5, from which the effect of each parameter on the performance can be observed.

FIGURE 4 The overall RS values of seeds under different selection strategies. Results on SF and ER networks with the desired K are shown.

The parameter pc controls the intensity of the genetic information exchange operations, including both the inter-task and intra-task ones. Fewer exchanges are conducted if pc is small, which may decrease the performance, as shown in Figure 5(a). Meanwhile, excessive exchanges (when pc is too large) are not recommended because they disturb the genetic material too frequently. The influence of pm is not remarkable, and a suitable setting can better overcome the side effects caused by local optima. The parameter pl decides the frequency of conducting the time-consuming local search procedures. A higher pl guarantees the quality of solutions but incurs an increasing computational cost, while small values of pl should be avoided because of their limited ability to reach considerable results. The rationality of the given parameter setting is verified in Figure 5, and a trade-off between the performance and the computational cost is attained. An additional parameter is LowS in the asynchronous strategy, and an SF network with 1000 nodes and ⟨k⟩ = 4 is generated to evaluate its effect. L = 5 and K = 10 are considered in the experiment. Detailed numerical results are listed in Table VII. A larger LowS contributes to reducing the computational budget, which is the core of this asynchronous strategy. A lower LowS guarantees more search procedures on all tasks, but the results are not competitive, especially on SRS and σ̂(S). A suitable LowS configuration can improve the performance with the help of positive genetic information transfers between different tasks. If a distinct difference exists in the computational cost required by the tasks, some low-cost tasks can be tackled first, and high-cost tasks are thus provided with valuable information to accelerate their convergence. The results validate the effectiveness of the proposed asynchronous strategy, and the benefit of inter-task transfer in promoting the search ability is also demonstrated.

TABLE VI  The overall best solution obtained by different selection strategies, taking K = 20 and L = 5 as examples. SRS is the summarized RS value, and the corresponding Per set is given. σ̂(S) is the influence range in the original network.

NETWORK  METHOD           SRS (PER)      T1      T2      T3      T4      T5      σ̂(S)
SF       MFEARIM          109.832 (0.3)  23.303  22.464  21.776  21.287  21.001  23.695
SF       MFEARIM-Overall  109.837 (0.5)  23.323  22.467  21.771  21.280  20.995  23.662
ER       MFEARIM          104.766 (0.3)  21.328  21.135  20.931  20.757  20.615  21.436
ER       MFEARIM-Overall  104.830 (0.3)  21.298  21.133  20.960  20.791  20.649  21.372


FIGURE 5 Sensitivity analysis of the proposed algorithm; results for T3 are shown.

D. Tests on Real-World Networks

The information diffusion process is important in many real systems. Compared with synthetic ones such as SF networks, real-world networks may not show a distinct degree distribution, which may cause difficulties in the seed selection task. The experiments here further demonstrate the effectiveness of the proposed algorithm. Two real-world networks, including a logistics network and a social network, are tested in this part to further validate the performance of MFEARIM. The Friedrichshain network (Fried) is a transportation network covering part of Berlin [42], which consists of 224 nodes and 376 links. The CS PhD network (CS) describes the relations between students in a university, as provided in [43], and consists of 1025 nodes and 1043 links. Taking experiments with L = 5 and K = 10 as an example, with the maximum NOFE set to 2 × 10^4, numerical results of the different algorithms on the two real networks are listed in Table VIII. As shown in Table VIII, MFEARIM also obtains competitive results on the two real networks, and it determines the best solution on almost all five tasks. For the Fried network, the selected seeds have a better ability to spread information, and these nodes are likely to become distribution centers that improve the efficiency of the whole system. For the CS network, the selected seeds are influential members of the network, and they may reach a larger influence range when spreading information or serving as management staff.

TABLE VII  The best-found performance of MFEARIM with different LowS on each target. The column labeled "Time" shows the running time required by the corresponding experiment. Results are averaged over five independent realizations.

LowS  SRS     T1      T2      T3      T4      T5      σ̂(S)    TIME (h)
10    57.034  12.919  11.722  11.134  10.818  10.636  13.757  283
20    57.060  12.883  11.728  11.134  10.817  10.635  13.766  277
30    57.103  12.926  11.741  11.134  10.817  10.635  13.806  271
40    56.980  12.872  11.712  11.134  10.817  10.635  13.705  270

TABLE VIII  Numerical results on the five tasks of the different methods. These results are averaged over five independent realizations.

NETWORK  METHOD      T1          T2          T3          T4          T5
Fried    MFEARIM     10.573      10.472      10.405      10.284      10.463
Fried    SREMTONet   10.565 (-)  10.472 (≈)  10.386 (-)  10.281 (≈)  10.462 (≈)
Fried    GEMFDENet   10.570 (-)  11.471 (≈)  10.403 (≈)  10.280 (-)  10.460 (-)
Fried    MFDE/MNet   10.572 (≈)  11.469 (-)  10.400 (-)  10.279 (≈)  10.459 (-)
Fried    MFEANet     10.559 (-)  10.471 (≈)  10.345 (-)  10.288 (+)  10.460 (-)
Fried    MFDENet     10.491 (-)  10.391 (-)  10.331 (-)  10.257 (-)  10.225 (-)
Fried    MA          10.572 (≈)  10.473 (≈)  10.403 (≈)  10.279 (-)  10.459 (-)
CS       MFEARIM     11.757      11.190      10.798      10.571      10.444
CS       SREMTONet   11.745 (-)  11.185 (-)  10.797 (≈)  10.566 (-)  10.440 (-)
CS       GEMFDENet   11.753 (-)  11.187 (≈)  10.796 (-)  10.568 (-)  10.441 (-)
CS       MFDE/MNet   11.757 (≈)  11.189 (≈)  10.797 (≈)  10.569 (-)  10.443 (≈)
CS       MFEANet     11.681 (-)  11.167 (-)  10.795 (≈)  10.568 (-)  10.442 (≈)
CS       MFDENet     11.554 (-)  11.002 (-)  10.651 (-)  10.257 (-)  10.243 (-)
CS       MA          11.758 (≈)  11.191 (≈)  10.790 (-)  10.541 (-)  10.399 (-)


FIGURE 6 Structures of the Fried network under different damage scenarios. The size of a node is proportional to its degree. Highlighted links are functional, while those in gray are removed. Seeds are marked in red, and plain nodes are shown in green.

Compared with the existing methods for determining seeds, as in [21], [24], MFEARIM gives diversified solutions that accommodate different damage situations (i.e., the tasks in the algorithm) in one run, and the computational efficiency is greatly improved by the multitasking optimization technique. Complex systems may be damaged to different extents, as indicated in [28], [44], and robust solutions are highly desirable in real-world applications. In the context of topological perturbations of networks, this work presents a promising approach to tackling the influential member detection problem. Taking the Fried network as an example, the distribution of the selected seeds in the topology is plotted in Figure 6, where three cases are considered: T1, T3, and T5. T1 represents a slight damage scenario, T5 a serious one, and T3 a balanced one. As shown in the figure, more links are removed as the damage percentage increases, which causes congestion in the information diffusion process. Consequently, the seed determination is affected by the lack of integrity. The difference can be noticed in the given topology, which reflects the effect of the changeable parameter Per in (3). A larger Per indicates more structural failures, and most nodes tend to be damaged to different degrees. In this context, hubs may not be suitable seeds when attacks happen. This conclusion coincides with that in [21], which verifies the significance of studying the robust influence maximization problem. The results also suggest to decision makers that the determination of seeds should be integrated with the structural features, and the multitasking optimization approach may provide useful candidates for tackling multiple practical scenarios.


Meanwhile, common candidates can be found among the seeds for T1, T3, and T5, which demonstrates the efficacy of the designed asynchronous strategy. Considering that the required computational budget increases as more links are disconnected, promising candidates can be obtained in the search towards T1 at a relatively low cost. Employing the multitasking optimization theory, low-cost tasks are solved first and provide valuable knowledge and alternatives for high-cost tasks, so the convergence of the algorithm is accelerated. As listed in Table VII, a reduced computational budget is achieved by MFEARIM. The results are competitive compared with the state-of-the-art approaches shown in Tables IV and VIII, and an efficient solver is thus provided for decision makers to resolve such diffusion dilemmas. In addition, the solution of the RIM problem can be regarded as a data mining task on networked systems. The results in Figure 6 and Fig. S4 of the supplementary material also validate the crucial role of structural information in this mining procedure. As an objective description of complicated systems, the network structure implies in-depth information about, and preferences in, the composition process. For example, the topology in Figure 6 exhibits a power-law degree distribution [42], and an aggregation tendency exists between key nodes. Formed by nodes and links, networks constitute discrete decision variables, which prevents the direct employment of many existing optimizers. In this work, elaborated operators are devoted to exploiting knowledge pertaining to the network and exploring the whole solution space to gradually promote the performance. Diversified information from the seeds' microscopic scope, the network's macroscopic scope, and stochastic factors is fully considered, and a competitive search ability is guaranteed. This work may shed some light on further studies towards solving network-related and discrete optimization problems, and the local search procedure and the transfer operation are applicable to handling other complex optimization problems.

VI. Conclusion and Discussions

Based on the existing studies on the influence maximization problem and the network robustness theory, the robustness of information diffusion processes on networks has been introduced in this paper. The influence ability of seeds in networks suffering from varying structural losses is comprehensively considered as different tasks. MFEARIM has been proposed to solve the seed determination problem, and makes up for the deficiency of existing solutions. Experimental results validate the effectiveness of the designed operators, and indicate the significance and effectiveness of the evolutionary multitasking theory on this network-based optimization problem. This work also highlights the remarkable impact caused by structural losses on the information diffusion process. The obtained seeds have a robust influence ability and can deal with different

damage scenarios, which helps decision makers achieve an efficient and reliable information propagation process. A new but rational perspective has been provided for studying the robustness of the influence maximization problem. Several challenging problems still remain to be addressed, including the effect of potentially different diffusion models [17], the existence of multiple attacks [45], and surrogate-assisted optimization for the RIM problem [46]. These questions are worthy of further investigation in the future.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grants 62203477 and 52105079, in part by Guangdong Basic and Applied Basic Research Foundation under Grant 2021A1515110543, and in part by Fundamental Research Funds for the Central Universities, Sun Yat-sen University under Grant 23qnpy72. The work of Y. Jin was funded by an Alexander von Humboldt Professorship for Artificial Intelligence endowed by the German Federal Ministry of Education and Research. This article has supplementary downloadable material available at https://doi.org/10.1109/MCI.2023.3277770, provided by the authors.

References
[1] J. Doyle et al., "The 'robust yet fragile' nature of the internet," Proc. Nat. Acad. Sci., vol. 102, no. 41, pp. 14497–14502, 2005.
[2] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proc. Nat. Acad. Sci., vol. 99, no. 12, pp. 7821–7826, 2002.
[3] M. E. J. Newman, Networks: An Introduction. London, U.K.: Oxford Univ. Press, 2010.
[4] R. Albert, H. Jeong, and A. L. Barabasi, "Error and attack tolerance of complex networks," Nature, vol. 406, pp. 378–382, 2000.
[5] D. J. Watts and S. H. Strogatz, "Collective dynamics of small-world networks," Nature, vol. 393, pp. 440–442, 1998.
[6] R. Bond et al., "A 61-million-person experiment in social influence and political mobilization," Nature, vol. 489, no. 7415, pp. 295–298, 2010.
[7] E. Cambria, M. Grassi, A. Hussain, and C. Havasi, "Sentic computing for social media marketing," Multimedia Tools Appl., vol. 59, pp. 557–577, 2012.
[8] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2003, pp. 137–146.
[9] W. Chen, Y. Wang, and S. Yang, "Efficient influence maximization in social networks," in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2009, pp. 199–208.
[10] A. Goyal, W. Lu, and L. Lakshmanan, "CELF++: Optimizing the greedy algorithm for influence maximization in social networks," in Proc. 20th ACM Int. Conf. Companion World Wide Web, 2011, pp. 47–48.
[11] K. Saito, M. Kimura, K. Ohara, and H. Motoda, "Super mediator: A new centrality measure of node importance for information diffusion over social network," Inf. Sci., vol. 329, pp. 985–1000, 2016.
[12] M. Gong, C. Song, C. Duan, L. Ma, and B. Shen, "An efficient memetic algorithm for influence maximization in social networks," IEEE Comput. Intell. Mag., vol. 11, no. 3, pp. 22–33, Aug. 2016.
[13] J. Wu, M. Barahona, Y.-J. Tan, and H.-Z. Deng, "Spectral measure of structural robustness in complex networks," IEEE Trans. Syst., Man Cybern. Part A-Syst. Hum., vol. 41, no. 6, pp. 1244–1252, Nov. 2011.
[14] C. M. Schneider, A. A. Moreira, J. S. Andrade, S. Havlin, and H. J. Herrmann, "Mitigation of malicious attacks on networks," Proc. Nat. Acad. Sci., vol. 108, no. 10, pp. 3838–3841, 2011.
[15] S. Wang and J. Liu, "Constructing robust cooperative networks using a multi-objective evolutionary algorithm," Sci. Rep., vol. 7, 2017, Art. no. 41600.
[16] S. Wang and J. Liu, "Designing comprehensively robust networks against intentional attacks and cascading failures," Inf. Sci., vol. 478, pp. 125–140, 2019.
[17] W. Chen, T. Lin, Z. Tan, M. Zhao, and X. Zhou, "Robust influence maximization," in Proc. KDD, 2016, pp. 795–804.
[18] X. He and D. Kempe, "Stability and robustness in influence maximization," ACM Trans. Knowl. Discov. Data, vol. 12, no. 6, 2018, Art. no. 66.

[19] S. Wang and J. Liu, "Constructing robust community structure against edge-based attacks," IEEE Syst. J., vol. 13, no. 1, pp. 582–592, Mar. 2019.
[20] S. Wang and J. Liu, "The effect of link-based topological changes and recoveries on the robustness of cooperation on scale-free networks," Eur. Phys. J. Plus, vol. 131, 2016, Art. no. 219.
[21] S. Wang and J. Liu, "A memetic algorithm for solving the robust influence maximization problem towards network structural perturbances," Chin. J. Comput., vol. 44, no. 6, pp. 1153–1167, 2021.
[22] J. Lee and C. Chung, "A fast approximation for influence maximization in large social networks," in Proc. 23rd ACM SIGKDD Int. Conf. Companion World Wide Web, 2014, pp. 1157–1162.
[23] S. Wang, J. Liu, and Y. Jin, "Finding influential nodes in multiplex networks using a memetic algorithm," IEEE Trans. Cybern., vol. 51, no. 2, pp. 900–912, Feb. 2021.
[24] M. Gong, J. Yan, B. Shen, L. Ma, and Q. Cai, "Influence maximization in social networks based on discrete particle swarm optimization," Inf. Sci., vol. 367, pp. 600–614, 2016.
[25] A. Gupta, Y.-S. Ong, and L. Feng, "Multifactorial evolution: Towards evolutionary multitasking," IEEE Trans. Evol. Comput., vol. 20, no. 3, pp. 343–357, Jun. 2016.
[26] X. Wang, Y. Jin, S. Schmitt, and M. Olhofer, "Transfer learning based co-surrogate assisted evolutionary bi-objective optimization for objectives with non-uniform evaluation times," Evol. Comput., vol. 30, no. 2, pp. 221–251, 2022.
[27] K. K. Bali, Y.-S. Ong, A. Gupta, and P. S. Tan, "Multifactorial evolutionary algorithm with online transfer parameter estimation: MFEA-II," IEEE Trans. Evol. Comput., vol. 24, no. 1, pp. 69–83, Feb. 2020.
[28] A. Gupta, L. Zhou, Y.-S. Ong, Z. Chen, and Y. Hou, "Half a dozen real-world applications of evolutionary multitasking, and more," IEEE Comput. Intell. Mag., vol. 17, no. 2, pp. 49–66, May 2022.
[29] L. Bai, W. Lin, A. Gupta, and Y.-S. Ong, "From multitask gradient descent to gradient-free evolutionary multitasking: A proof of faster convergence," IEEE Trans. Cybern., vol. 52, no. 8, pp. 8561–8573, Aug. 2022.
[30] X. Zheng, A. K. Qin, M. Gong, and D. Zhou, "Self-regulated evolutionary multitask optimization," IEEE Trans. Evol. Comput., vol. 24, no. 1, pp. 16–28, Feb. 2020.
[31] L. Feng et al., "Evolutionary multitasking via explicit autoencoding," IEEE Trans. Cybern., vol. 49, no. 9, pp. 3457–3470, Sep. 2019.
[32] Z. Tang, M. Gong, Y. Wu, A. K. Qin, and K. C. Tan, "A multifactorial optimization framework based on adaptive intertask coordinate system," IEEE Trans. Cybern., vol. 52, no. 7, pp. 6745–6758, Jul. 2022.
[33] X. Ma et al., "Enhanced multifactorial evolutionary algorithm with meme helper-tasks," IEEE Trans. Cybern., vol. 52, no. 8, pp. 7837–7851, Aug. 2022.
[34] C. Yang, J. Ding, Y. Jin, C. Wang, and T. Chai, "Multitasking multiobjective evolutionary operational indices optimization of beneficiation processes," IEEE Trans. Automat. Sci. Eng., vol. 16, no. 3, pp. 1046–1057, Jul. 2019.
[35] J. Ding, C. Yang, Y. Jin, and T. Chai, "Generalized multitasking for evolutionary optimization of expensive problems," IEEE Trans. Evol. Comput., vol. 23, no. 1, pp. 44–58, Feb. 2019.
[36] M. Cheng, A. Gupta, Y. S. Ong, and Z. Ni, "Coevolutionary multitasking for concurrent global optimization: With case studies in complex engineering design," Eng. Appl. Artif. Intell., vol. 64, pp. 13–24, 2017.
[37] S. Wang and X. Tan, "Determining seeds with robust influential ability from multi-layer networks: A multi-factorial evolutionary approach," Knowl.-Based Syst., vol. 246, 2022, Art. no. 108697.
[38] S. Wang and J. Liu, "Robustness of single and interdependent scale-free interaction networks with various parameters," Physica A, vol. 460, pp. 139–151, 2016.
[39] P. Erdős and A. Rényi, "On the evolution of random graphs," Pub. Math. Inst. Hung. Acad. Sci., vol. 5, pp. 17–61, 1960.
[40] T. Chugh, Y. Jin, K. Miettinen, J. Hakanen, and K. Sindhya, "A surrogate-assisted reference vector guided evolutionary algorithm for computationally expensive many-objective optimization," IEEE Trans. Evol. Comput., vol. 22, no. 1, pp. 129–142, Feb. 2018.
[41] M. Gong, Z. Tang, H. Li, and J. Zhang, "Evolutionary multitasking with dynamic resource allocating strategy," IEEE Trans. Evol. Comput., vol. 23, no. 5, pp. 858–869, Oct. 2019.
[42] A. Farid, "Symmetrica: Test case for transportation electrification research," Infrastructure Complexity, vol. 2, 2015, Art. no. 9.
[43] W. Nooy, A. Mrvar, and V. Batagelj, Exploratory Social Network Analysis with Pajek. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[44] A. Cully, J. Clune, D. Tarapore, and J. B. Mouret, "Robots that can adapt like animals," Nature, vol. 521, pp. 503–507, 2015.
[45] S. Wang, J. Liu, and Y. Jin, "A computationally efficient evolutionary algorithm for multiobjective network robustness optimization," IEEE Trans. Evol. Comput., vol. 25, no. 3, pp. 419–432, Jun. 2021.
[46] S. Wang, J. Liu, and Y. Jin, "Surrogate-assisted robust optimization of large-scale networks based on graph embedding," IEEE Trans. Evol. Comput., vol. 24, no. 4, pp. 735–749, Aug. 2020.


AI-eXplained

Chia-Wei Chao , Daniel Winden Hwang , Hung-Wen Tsai , Shih-Hsuan Lin , Wei-Li Chen , Chun-Rong Huang , and Pau-Choo Chung National Cheng Kung University, TAIWAN

Multi-Magnification Attention Convolutional Neural Networks

Abstract

To apply convolutional neural networks (CNNs) on high-resolution images, a common approach is to split the input image into smaller patches. However, the field-of-view is then restricted by the input size. To overcome this problem, a multi-magnification attention convolutional neural network (MMA-CNN) is proposed to analyze images based on both local and global features. Our approach focuses on identifying the importance of individual features at each magnification level and is applied to the segmentation of pathology whole slide images (WSIs) to show its effectiveness. Several interactive figures are also developed to enhance the reader's understanding of our research.

I. Introduction

Convolutional neural networks (CNNs) are nowadays used for many image analysis tasks. Even though CNNs perform well in many cases, they struggle to deal with applications involving large-scale, high-resolution images, which typically contain hundreds of thousands of pixels. The problem of large-scale input images can be addressed by splitting the input image into multiple small patches, which are then input into the model. The model subsequently performs its assigned task based on the input image patches. However, in many applications, it is necessary to scrutinize not only the local features of the input image patches but also the global features

The full article with interactive content is available at https://doi.org/10.1109/MCI.2023.3277771.

Digital Object Identifier 10.1109/MCI.2023.3277771 Date of current version: 13 July 2023


of the input image at lower magnifications. In other words, rather than the single-magnification view used in most CNN models, it is necessary to adopt a multi-magnification approach, in which the model accesses both the global and local features of the image for the final decision. Several methods have been proposed for imitating such multi-magnification observation processes [1], [2]. Although [2] enables the model to adaptively determine the general importance of all of the features extracted at each magnification, it is hard to ascertain the weights of the features at each magnification. Consequently, attention mechanisms [3], which allow the weights of the extracted features at each magnification to be adapted individually, have recently attracted a great deal of interest. From the discussion above, it is clear that multi-scale observations and attention mechanisms provide an ideal opportunity for furthering the applications of CNNs in many fields. In this article, a multi-magnification attention convolutional neural network (MMA-CNN) is proposed to solve the aforementioned problems and is applied to achieve patch-wise classification and approximated segmentation of high-resolution pathology whole slide images (WSIs). MMA-CNN combines multi-scale observations and attention mechanisms to improve the performance of CNNs, which otherwise consider only a single-magnification view of the target object.

II. MMA-CNN

The architecture of MMA-CNN is shown in Figure 1. In MMA-CNN, three modules, namely the feature extraction module (FEM), the feature integration and magnification attention module (FIMAM), and the classification module (CM), are proposed to solve the patch-wise classification problem.

Corresponding author: Pau-Choo Chung (e-mail: [email protected]).


The FEM contains two parallel CNNs to extract feature maps from image patches of the WSIs at the high and low magnifications, respectively. To extract larger field-of-view information from the image patches of the low magnification, an ASPP block [4] is appended after the CNN to learn atrous feature maps from these image patches. The ASPP block is composed of a 1 × 1 convolutional layer and an atrous convolution layer. Different from general convolutions, atrous convolutions expand the field-of-view without increasing the kernel sizes or the number of parameters. In this way, two different feature maps of the image patches of the WSIs, corresponding to the two different magnifications, are obtained. To discover the important information in these two feature maps, the FIMAM is proposed. It first concatenates the two feature maps to obtain the concatenated feature maps. Then, the magnification-attention block is applied to extract attention feature maps, which serve as the inputs of the CM to classify the labels of the image patches. The magnification-attention block is constructed based on the squeeze-and-excitation block [3]. The concatenated feature maps are squeezed to obtain global spatial information by using global average pooling. Then, the excitation process is applied to obtain the important information of each channel. Finally, a channel-wise multiplication is applied to obtain the attention feature maps.
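To make the FIMAM description concrete, the PyTorch-style sketch below applies squeeze-and-excitation channel attention to two concatenated feature maps; the channel counts, reduction ratio, and class name are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a magnification-attention block in the spirit of FIMAM: concatenate
# the high- and low-magnification feature maps, squeeze them with global average
# pooling, excite per-channel weights, and rescale the concatenated maps.
# Channel counts and the reduction ratio are assumptions made for illustration.
import torch
import torch.nn as nn

class MagnificationAttention(nn.Module):
    def __init__(self, channels_high: int, channels_low: int, reduction: int = 16):
        super().__init__()
        channels = channels_high + channels_low
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # global spatial squeeze
        self.excite = nn.Sequential(                      # per-channel excitation
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat_high: torch.Tensor, feat_low: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_high, feat_low], dim=1)       # concatenated feature maps
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))       # channel importance in [0, 1]
        return x * w.view(b, c, 1, 1)                     # channel-wise multiplication

# Toy usage: two 7x7 feature maps from the two magnifications.
attn = MagnificationAttention(channels_high=256, channels_low=256)
out = attn(torch.randn(1, 256, 7, 7), torch.randn(1, 256, 7, 7))
print(out.shape)  # torch.Size([1, 512, 7, 7])
```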


With the attention feature maps, the CM, which contains three convolutional layers, a max pooling layer, and three fully connected layers with a softmax function, is applied to classify the labels of the image patches of the different magnifications.

III. Experimental Results

For demonstration purposes, MMA-CNN is applied to solve the patch-wise pathology classification problem and to remap the classified image patches to the pathology WSI to obtain the approximated segmentation results shown in Figure 2. The stained liver WSIs containing tumor and necrosis tissues in this study were obtained from the Leica scanner at 40×, 10×, and 2.5× optical magnifications in National Cheng Kung University Hospital. In our experiments, 15 training WSIs, five validation WSIs, and seven testing WSIs were used. Our method is compared with methods that use only a single network to classify image patches of WSIs at a single magnification and with parallel networks that naively concatenate features of WSIs at different magnifications. Table I shows the quantitative results. Our method achieves the best sensitivity and IoU results because the ASPP and magnification-attention blocks help extract representative features.

IV. Conclusion

In this article, MMA-CNN, which leverages multi-magnification and attention mechanisms, is introduced to improve classification performance. By applying both global and local feature information, MMA-CNN achieves better segmentation and classification results on the stained liver WSIs compared with the competing methods using global or local features alone. In the future, MMA-CNN will be explored for potential applications beyond pathology image analysis.

FIGURE 1 The architecture of MMA-CNN.

FIGURE 2 Approximated segmentation results of the tumor (red), necrosis (blue), and normal (green) regions in a stained liver WSI.

TABLE I  The quantitative results of different methods.

                               TUMOR                   NECROSIS
NETWORK STRUCTURE  SCALE       SENSITIVITY   IoU       SENSITIVITY   IoU
Single Network     40×         0.890         0.846     0.935         0.926
Single Network     10×         0.923         0.887     0.964         0.951
Single Network     2.5×        0.938         0.894     0.959         0.947
Parallel Network   40× & 10×   0.968         0.936     0.967         0.950
Parallel Network   10× & 2.5×  0.985         0.913     0.973         0.954
MMA-CNN            40× & 10×   0.991         0.977     0.967         0.958
MMA-CNN            10× & 2.5×  0.976         0.937     0.977         0.955


Acknowledgment

This work was supported in part by the National Science and Technology Council, Taiwan under Grants NSTC 111-2634-F-006-012 and NSTC 111-2628-E-006-011-MY3. We would like to thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources.

References
[1] W.-C. Huang et al., "Automatic HCC detection using convolutional network with multi-magnification input images," in Proc. IEEE Int. Conf. Artif. Intell. Circuits Syst., 2019, pp. 194–198.
[2] H. Tokunaga, Y. Teramoto, A. Yoshizawa, and R. Bise, "Adaptive weighting multi-field-of-view CNN for semantic segmentation in pathology," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12589–12598.
[3] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.


Research Frontier

Ming Yang and Jie Gao China University of Geosciences, CHINA Aimin Zhou East China Normal University, CHINA Changhe Li China University of Geosciences, CHINA Xin Yao Southern University of Science and Technology, CHINA

Contribution-Based Cooperative Co-Evolution With Adaptive Population Diversity for Large-Scale Global Optimization

Abstract

Cooperative co-evolution (CC) is an evolutionary algorithm that adopts the divide-and-conquer strategy to solve large-scale optimization problems. It is difficult for CC to specify a suitable subpopulation size for solving different subproblems, and the population diversity may be insufficient to search for the global optimum during the subpopulations' evolution. In this paper, an adaptive method for enhancing population diversity is embedded in a contribution-based CC. In CC, there are two kinds of subpopulation: the convergent or stagnant subpopulations and the non-convergent and non-stagnant subpopulations. A method is proposed in this paper to evaluate the convergent and stagnant subpopulations' contributions to improving the best overall objective value, which differs from the contribution evaluation for the non-convergent and non-stagnant subpopulations. In each co-evolutionary cycle, the new CC adaptively selects, from the above two kinds of subpopulation, a subpopulation that can make a greater contribution to improving the best overall objective value to undergo evolution. When a convergent or stagnant subpopulation is selected to undergo evolution, the subpopulation is

Digital Object Identifier 10.1109/MCI.2023.3277772 Date of current version: 13 July 2023



re-diversified to enhance its global search capability. Our experimental results and analysis suggest that the new CC algorithm can improve the performance of CC and serves as a competitive solver for large-scale optimization problems.

I. Introduction

Large-scale optimization problems [1], [2] are difficult to solve because this kind of optimization problem involves at least thousands of variables. Cooperative co-evolution (CC) [3], [4], [5], [6], [7] decomposes an optimization problem into several subproblems. Compared with the original problem, each subproblem has a relatively smaller search space.

Corresponding author: Ming Yang (e-mail: [email protected]).

CC accordingly divides the population into several subpopulations; each subpopulation is responsible for optimizing the subcomponent of a subproblem’s variables. The above divide-and-conquer strategy decreases the difficulty in solving the large-scale optimization problems. The subproblems are very likely to have different effects on the overall objective value of the original problem. Different subpopulations’ evolution may make different contributions to the improvement of the best overall objective value [8]. To make good use of computational resources and accelerate the convergence of CC, the contribution-based CC algorithms [8], [9], [10], [11], [12] allocate computational resources among the subpopulations according to their contributions. More computational resources are allocated to the subpopulation with a greater contribution. In CC, including the contributionbased CC, different subproblems may have different characteristics (e.g., the dimensionality), which results in diverse difficulties in solving these subproblems. During the co-evolution of CC, some subpopulations may be trapped in local optima due to the loss of the population diversity, while others may not suffer from the same issue. The adaptive population diversity enhancement is essential for CC to solve different subproblems. A novel population diversity enhancement method named


AEPD [13] was proposed for differential evolution [14] to adapt the diversity of a single population. Compared with other approaches to adapting population diversity, AEPD can accurately identify the convergent or stagnant population. When the population is convergent or stagnant, AEPD re-diversifies it immediately.

Cooperative co-evolution is an efficient means of adopting the divide-and-conquer strategy to solve large-scale optimization problems. AEPD was proposed to enhance the population diversity of a single population. In CC, some subpopulations are convergent or stagnant, while others are in the opposite situation. Because these two kinds of subpopulation are in different evolutionary states, AEPD cannot be directly embedded in CC. How to adaptively enhance the subpopulations' diversity in CC is a new issue that has not yet been studied. To address this issue, a contribution-based CC with population diversity enhancement is proposed in this paper. The new CC can evaluate the contributions of the above two kinds of subpopulation to improving the best overall objective value. According to these contributions, the new CC adaptively determines whether to select a non-convergent and non-stagnant subpopulation to continue with its current evolution or a convergent or stagnant subpopulation to be re-diversified. The remainder of this paper is organized as follows. Section II presents the background of the research in this paper. Section III introduces the proposed CC with adaptive population diversity. Section IV presents the experimental studies and results. Finally, Section V concludes the paper.

II. Background

The contribution-based CC and the population diversity enhancement

method in [13] are reviewed in this section.

A. Cooperative Co-Evolution

Before the co-evolution of CC starts, users need to decompose a problem into several subproblems. In the original cooperative co-evolutionary genetic algorithm (CCGA) [3], a D-dimensional problem is decomposed into D one-dimensional subproblems, i.e., each subproblem consists of one variable. This decomposition strategy ignores the interrelationship between variables, which causes CCGA to perform poorly on nonseparable problems. The random grouping methods [15], [16], [17], [18], [19], [20] randomly decompose the variables into several subcomponents of variables, so interrelated variables may be grouped into the same subcomponent with a certain probability. For problems consisting of additively separable subproblems, the differential grouping methods [21], [22], [23], [24], [25], [26], [27], [28] can decompose them into separable subproblems with high accuracy. Recently, researchers proposed two decomposition strategies [29], [30] that can decompose problems consisting of both additively and non-additively separable subproblems. An improved grouping method [31] was proposed to decrease the scale of nonseparable problems with overlapping components of variables by breaking the interrelationship shared by the overlapping components. After the original optimization problem is decomposed into several subproblems, CC solves the subproblems separately. Suppose C = {C1, ..., CM} is a decomposition of the variables, where Ci is the component of the ith subproblem's variables. Suppose P = {x1, ..., xN} is a population; Pi is the subpopulation responsible for optimizing Ci, where Pi = {x_{t,j} | x_{t,j} ∈ P, t = 1, ..., N, j ∈ Ci}. x_best is the combination of the best known individual from each subpopulation. x_best^(P̄i) is x_best without the dimensions from Pi, i.e., x_best^(P̄i) consists of the x ∈ x_best with x ∉ Pi. For Pi, an overall solution to the original problem combines an individual from Pi and the dimensions from x_best^(P̄i), which

is denoted as (x, x_best^(P̄i)), where x ∈ Pi. Algorithm 1 illustrates the CCGA framework. Steps 2 to 12 form one co-evolutionary cycle. In each co-evolutionary cycle, the M subproblems are optimized one by one by an optimizer for GEs evolutionary generations. The subproblems are likely to have different effects on the best overall objective value; therefore, it may be an inefficient use of computational resources that CCGA allocates equal computational resources to solving the different subproblems.

Algorithm 1. CCGA [3]

Input: C, P, GEs, and f (f is the objective function of an optimization problem).
Output: x_best.

1:  while the termination criterion is not met do
2:    for i ← 1 to M do
3:      for GEs generations do
4:        P'_i ← {(x, x_best^(P̄i)) | x ∈ Pi};
5:        Evaluate P'_i;
6:        Apply genetic operators to Pi;
7:        P''_i ← {(x, x_best^(P̄i)) | x ∈ Pi};
8:        Evaluate P''_i;
9:        Pi ← Selection on {P'_i, P''_i};
10:       Update x_best;
11:     end for
12:   end for
13: end while
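For readers who prefer code, the following Python sketch mimics the round-robin co-evolutionary cycle of Algorithm 1 in simplified form (offspring directly compete against the best overall solution instead of the full parent/offspring selection); `evolve_one_generation` and `fitness` are hypothetical placeholders for the genetic operators and the objective f.

```python
# Minimal sketch of a CCGA-style co-evolutionary cycle: each subpopulation is
# evolved in turn for GEs generations, and candidate solutions are evaluated by
# splicing a subpopulation's genes into the best known overall solution.
import random

def ccga_cycle(subpops, groups, x_best, fitness, evolve_one_generation, GEs=10):
    for comp, pop in zip(groups, subpops):      # one pass over the M subproblems
        for _ in range(GEs):
            pop = evolve_one_generation(pop)    # apply genetic operators to P_i
            for ind in pop:                     # evaluate (x, x_best^(P_bar_i))
                trial = list(x_best)
                for j, gene in zip(comp, ind):
                    trial[j] = gene
                if fitness(trial) < fitness(x_best):   # minimization assumed
                    x_best = trial
    return x_best

# Toy usage on a separable sphere function with two 2-variable subproblems.
f = lambda x: sum(v * v for v in x)
groups = [[0, 1], [2, 3]]
subpops = [[[random.uniform(-5, 5) for _ in g] for _ in range(20)] for g in groups]
step = lambda pop: [[g + random.gauss(0, 0.1) for g in ind] for ind in pop]
print(f(ccga_cycle(subpops, groups, [5.0, 5.0, 5.0, 5.0], f, step)) < 100.0)
```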

According to the evolution status of CC and the contribution estimation, a fine-grained strategy [32] can allocate unequal computational resources to solve different subproblems. A computational resource allocation method was proposed in [33] to improve the frequency of resource allocation when solving nonseparable problems with overlapping subcomponents of variables. A dynamic CC framework [34] was also proposed to solve nonseparable problems, where the computational resources are allocated to a series of elitist subcomponents consisting of superior variables. To make potentially effective use of computational resources, a bilevel resource allocation mechanism [35] allocates computational resources to the subpopulations that perform poorly in the


current evolution but may make potentially great fitness improvements in the future. The multi-armed bandit based method [36] can also be applied to CC to allocate computational resources among subpopulations. The above CC algorithms allocate computational resources according to the subpopulations' contributions. The improvement of the best overall objective value is usually computed as the contribution of a subpopulation (Pi) [9]. An improved contribution-based CC named CBCC3 [10] adopts the real-time improvement of the best overall objective value in one co-evolutionary cycle as the contribution of Pi. The contribution computed based on such limited information may be inaccurate. A CC framework (CCFR) [8] computes the average of the historical and real-time improvements of the best overall objective value as the contribution of Pi. An improved CCFR named CCFR2 [11] pre-specifies unequal-sized subpopulations for solving different subproblems; a larger subpopulation is specified to solve a larger-dimensional subproblem. CCFR and CCFR2 update the contribution of Pi every GEs evolutionary generations. When Pi makes a stable contribution to improving the best overall objective value, a variant of CCFR named CCFR3 [37] terminates the evolution of Pi and computes the following real-time improvement of the best overall objective value as the contribution of Pi:

Zi = Δf / Δn,    (1)

where Δf = f(x̂_best) − f(x_best), and Δn is the number of fitness evaluations Pi spends in a co-evolutionary cycle. x̂_best and x_best are respectively the best overall solutions before and after Pi undergoes evolution in the co-evolutionary cycle, and f is the objective function of the optimization problem being solved.

B. Auto-Enhanced Population Diversity

A population diversity enhancement method named AEPD was proposed in [13] for the differential evolutionary


algorithm. The experimental study in [13] showed that the evolutionary progress of the population in different dimensions is asynchronous. AEPD measures the population diversity according to the mean and standard deviation of the individuals' variables in each dimension. Suppose that rj is a flag denoting whether the population is convergent in the jth dimension at a given generation. The value of rj is set as follows:

rj = 1 if std_j ≤ v_j,    (2a)
rj = 0 otherwise,         (2b)

where std_j is the standard deviation of the population in the jth dimension, and v_j is a threshold. The value of v_j is set as follows:

v_j = min(10^-3, |m_j − M_j| × 10^-3),    (3)

where m_j is the mean of the population in the jth dimension, and M_j is the value of m_j when the population was convergent last time. If the population converges at the same point (e.g., the global optimum) in successive diversity enhancement operations, the value of v_j decreases. This guarantees that the population can thoroughly converge at an optimum. If std_j ≤ v_j, rj is set to one to indicate that the population has converged in the jth dimension and should be re-diversified in that dimension. If the population is stagnant, std_j remains unchanged and the population may never converge at a point, i.e., rj is always equal to zero. AEPD can also identify population stagnation in the jth dimension according to m_j and std_j. r̂j is a flag denoting whether the population is stagnant in the jth dimension at a given generation. Its value is set as follows:

r̂j = 1 if the mean and standard deviation in the jth dimension have remained unchanged for at least UN successive generations,    (4a)
r̂j = 0 otherwise,    (4b)

where UN is a specified integer. If the population distribution (the mean and standard deviation) in the jth


dimension remains unchanged for several successive generations, it is deemed that the population is stagnant in that dimension, and r̂j is therefore set to one. The individuals' variables in a stagnant dimension should be regenerated to move the evolution forward. AEPD regenerates the individuals x_t, t ∈ [1, N], in a convergent or stagnant dimension (i.e., a dimension with rj = 1 or r̂j = 1) as follows:

x_{t,j} = low_j + (up_j − low_j) × randn_j,    (5)

where low_j and up_j are respectively the predefined lower and upper bounds in the jth dimension, and randn_j is a random number drawn from the standard normal distribution. The population re-diversification can enhance the global search capability of the population and make the population search for another optimum.
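A minimal NumPy sketch of the per-dimension checks in (2)-(5) is given below, assuming the population is stored as an N × D array; the function and variable names, and the simplified stagnation counter, are illustrative choices rather than the notation of [13].

```python
# Sketch of AEPD-style per-dimension checks (2)-(5): flag a dimension as
# convergent when its standard deviation falls below the threshold v_j, flag it
# as stagnant when mean and std stay unchanged for UN generations, and then
# resample the flagged dimensions from the original bounds.  Names are illustrative.
import numpy as np

def aepd_step(pop, low, up, M_last, counters, UN, prev_stats=None):
    mean, std = pop.mean(axis=0), pop.std(axis=0)
    v = np.minimum(1e-3, np.abs(mean - M_last) * 1e-3)       # threshold in (3)
    convergent = std <= v                                     # flag r_j in (2)
    if prev_stats is not None:                                # stagnation counter for (4)
        unchanged = np.isclose(mean, prev_stats[0]) & np.isclose(std, prev_stats[1])
        counters = np.where(unchanged, counters + 1, 0)
    stagnant = counters >= UN
    flagged = convergent | stagnant
    if flagged.any():                                         # re-diversify via (5)
        n = pop.shape[0]
        pop[:, flagged] = low[flagged] + (up[flagged] - low[flagged]) * \
            np.random.randn(n, int(flagged.sum()))
        M_last = np.where(convergent, mean, M_last)           # remember convergence point
        counters = np.where(flagged, 0, counters)
    return pop, M_last, counters, (mean, std)

# Toy usage: a nearly converged 2-D population gets re-diversified.
rng = np.random.default_rng(0)
P = np.full((30, 2), 1.0) + rng.normal(0, 1e-6, (30, 2))
P, M, c, stats = aepd_step(P, np.array([-5.0, -5.0]), np.array([5.0, 5.0]),
                           np.full(2, np.inf), np.zeros(2), UN=5)
print(P.std(axis=0) > 0.1)  # [ True  True ] after re-diversification
```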

III. CC With Auto-Enhanced Population Diversity

In this section, AEPD is embedded in CCFR2 [11], a contribution-based CC. A method is proposed to evaluate the convergent and stagnant subpopulations' contributions, and the convergent and stagnant subpopulations are re-diversified to enhance their global search capability.

A. Contribution Evaluation on Convergent and Stagnant Subpopulations

Suppose that hi is a flag denoting whether a subpopulation (Pi) is convergent or stagnant. If Pi is convergent [rj = 1, see (2)] or stagnant [r̂j = 1, see (4)] in all the dimensions of Pi, the value of hi is set to one; otherwise, it is set to zero. In (2), the value of v_j is set as follows, which is different from (3):

v_j = max(10^-10, min(min(10^-3, std̄_j × 10^-3), |m_j − M̄_j| × 10^-3)),    (6)

where std̄_j is the initial standard deviation of Pi after Pi is regenerated, and M̄_j is the value of m_j when Pi was convergent last time. If the search space is smaller than 10^-3 in a dimension, min(10^-3, std̄_j × 10^-3) guarantees that Pi can converge at an optimum. In practice, the population does not need to completely converge at an optimum; the optimization can stop once the accuracy of a solution meets the users' requirement. The lower bound of 10^-10 in (6) means that, when Pi converges at an optimum with an accuracy of 10^-10, the evolution of Pi is stopped; users can set this bound to another value according to their own requirements. The experimental results show that the subpopulation responsible for solving a larger-dimensional subproblem takes more generations to enter a stable stagnation state; therefore, the value of UN in (4) is set to ⌊√Di⌋ + 1, where Di is the ith subproblem's dimensionality. When hi = 1, the evolution of Pi terminates and the contribution of Pi is evaluated. For a convergent or stagnant subpopulation, the real-time improvement of the best overall objective value is relatively small, and the subpopulation may never be selected to undergo evolution according to the contribution computed by (1). Therefore, the contribution of a subpopulation with hi = 1 is computed as follows:

Zi = (Σ_{Δf ∈ S^i_Δf} Δf) / (Σ_{Δn ∈ S^i_Δn} Δn),    (7)

where S^i_Δf is the set of Δf values, S^i_Δn is the set of Δn values, and Δf and Δn have the same meanings as in (1). Here, Zi is the overall improvement of the best overall objective value per fitness evaluation since the last regeneration of Pi. That is, a convergent or stagnant subpopulation's contribution is the overall improvement of the best overall objective value over an evolutionary period rather than the real-time improvement.
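The two contribution measures can be summarized in a few lines of Python; the sketch below assumes minimization and keeps the Δf and Δn histories per subpopulation, with names chosen only for illustration.

```python
# Sketch of the two contribution measures: the real-time ratio in (1) for active
# subpopulations and the accumulated ratio in (7) for convergent or stagnant ones.
# The history lists play the role of S^i_Df and S^i_Dn; names are illustrative.

class ContributionTracker:
    def __init__(self):
        self.delta_f = []   # improvements of the best overall objective value
        self.delta_n = []   # fitness evaluations spent to obtain them

    def record(self, f_before: float, f_after: float, evals: int) -> None:
        self.delta_f.append(f_before - f_after)   # minimization: positive = improvement
        self.delta_n.append(evals)

    def realtime(self) -> float:
        """Contribution (1): last improvement per fitness evaluation."""
        return self.delta_f[-1] / self.delta_n[-1]

    def accumulated(self) -> float:
        """Contribution (7): overall improvement per evaluation since the last reset."""
        return sum(self.delta_f) / sum(self.delta_n)

    def reset(self) -> None:
        """Called after a subpopulation is re-diversified."""
        self.delta_f.clear()
        self.delta_n.clear()

# Toy usage: a subpopulation whose latest cycle improved little still shows a
# non-negligible accumulated contribution, so it remains a regeneration candidate.
t = ContributionTracker()
t.record(100.0, 60.0, evals=2000)
t.record(60.0, 59.9, evals=2000)
print(round(t.realtime(), 6), round(t.accumulated(), 6))  # 5e-05 0.010025
```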

B. Population Diversity Enhancement

In each co-evolutionary cycle, the subpopulation with the greatest contribution is selected to undergo evolution. If the selected subpopulation is convergent or stagnant, its population diversity will be enhanced by regenerating it; otherwise, it continues to evolve.

Algorithm 2. CCFR2-AEPD

Input: f (the objective function of an optimization problem) and a variable grouping method.
Output: x_best (the best overall solution to f).

1:  Decompose f as C = {C1, ..., CM} by the variable grouping method;
2:  Generate subpopulations Pi, i = 1, ..., M, using CMA-ES;
3:  Set x_best to a combination of a randomly chosen individual from each subpopulation;
4:  n ← 0; Zi ← 0, hi ← 0, S^i_Δf ← ∅, S^i_Δn ← ∅, i = 1, ..., M;
5:  ∀j ∈ Ci: set the stagnation counter of dimension j to 0, M̄^i_j ← ∅, std̄^i_j ← ∅, i = 1, ..., M;
6:  while the termination criterion is not met do
7:    Zmax = max{Zi | hi = 0};
8:    Set the value of p according to (8);
9:    if rnd < p then  /* rnd is a uniform random number */
10:     S ← {Pi | hi = 1 and Zi > Zmax};
11:   else
12:     S ← {Pi | hi = 0 and Zi = Zmax};
13:   end if
14:   for Pi ∈ S do
15:     x̂_best ← x_best, n̂ ← n;
16:     (hi, x_best, n) ← Optimizer();
17:     Δf ← f(x̂_best) − f(x_best), Δn ← n − n̂;
18:     S^i_Δf ← S^i_Δf ∪ {Δf}, S^i_Δn ← S^i_Δn ∪ {Δn};
19:     if hi = 1 then
20:       Set Zi according to (7);
21:       S^i_Δf ← ∅, S^i_Δn ← ∅;
22:       ∀j ∈ Ci: reset the stagnation counter of dimension j to 0, std̄^i_j ← ∅;
23:       Reinitialize the values of x^m_i, σ_i, and D_i in CMA-ES according to (10);
24:     else
25:       Set Zi according to (1);
26:     end if
27:   end for
28: end while

The best overall objective value may not be improved by regenerating a convergent or stagnant subpopulation, e.g., when the subpopulation has already converged at the global optimum. In this case, the regenerated subpopulation's evolution would waste computational resources. Therefore, the convergent and stagnant subpopulations are regenerated with a probability, which is defined as follows:

p = 1,           if hi = 1 for all i,    (8a)
p = |S1| / M,    otherwise,              (8b)

where M is the number of subpopulations and S1 = {Pi | hi = 1 and Zi > Zmax}. S1 consists of the convergent and stagnant subpopulations that make greater contributions than Zmax, where Zmax = max{Zi | hi = 0}. The subpopulations in S1 are the potential ones that may make a great contribution through population regeneration. If there are more such potential subpopulations, the best overall objective value can be improved by population regeneration with a larger probability. If all the subpopulations are convergent or stagnant, the value of p is set to one, i.e., all subpopulations will be regenerated. The subpopulations with hi = 1 are regenerated in descending order of Zi.
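A short Python sketch of this probabilistic choice, under the stated definitions of S1 and Zmax, is given below; the per-subpopulation records are hypothetical stand-ins for the data kept by Algorithm 2.

```python
# Sketch of the probabilistic choice in (8): with probability p, pick the flagged
# (convergent/stagnant) subpopulations whose accumulated contribution exceeds
# Z_max; otherwise pick the active subpopulation with the largest contribution.
# Each record {"h": ..., "Z": ...} is a hypothetical stand-in for a subpopulation.
import random

def choose_subpopulations(records):
    active = [r for r in records if r["h"] == 0]
    if not active:                         # all convergent/stagnant: p = 1 in (8a)
        return list(records)
    z_max = max(r["Z"] for r in active)
    s1 = [r for r in records if r["h"] == 1 and r["Z"] > z_max]
    p = len(s1) / len(records)             # p = |S1| / M in (8b)
    if random.random() < p:
        return sorted(s1, key=lambda r: r["Z"], reverse=True)   # regenerate, best first
    return [r for r in active if r["Z"] == z_max]               # continue evolving

subpops = [{"id": 0, "h": 1, "Z": 0.02}, {"id": 1, "h": 0, "Z": 0.01},
           {"id": 2, "h": 0, "Z": 0.005}]
print([r["id"] for r in choose_subpopulations(subpops)])  # [0] or [1], depending on p
```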


CCFR2 uses the covariance matrix adaptation evolution strategy (CMA-ES) [38] as the optimizer to make a subpopulation evolve in each co-evolutionary cycle. At each evolutionary generation, CMA-ES adopts the adaptation of the covariance matrix to generate new individuals. The tth individual of Pi is set as follows:

x_{i,t} = x^m_i + σ_i × (B_i D_i N),    (9)

where x^m_i is the mean value of the search distribution, σ_i is the step size of the adaptation, B_i is an orthogonal matrix, D_i is a diagonal matrix, and N is a matrix with the standard normal distribution. As the population converges, the values of σ_i and D_i decrease, which is unbeneficial for the population to jump out of a local optimum. If Pi is convergent or stagnant, Pi is regenerated by reinitializing the values of x^m_i, σ_i, and D_i as follows:

x^m_i = x_best^(Pi),               (10a)
σ_i = (up_i − low_i) × 0.1,        (10b)
D_i = I_{Di},                      (10c)

where x_best^(Pi) is the best overall solution so far in the dimensions of Pi, low_i and up_i are respectively the predefined lower and upper bounds in the dimensions of Pi, and I_{Di} is a Di × Di identity matrix. The new individuals x_{i,t} are generated close to the current best solution with a large probability but far away from it with a small probability [see (9)].
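The sampling rule (9) and the re-initialization (10) can be sketched with NumPy as follows; this is a didactic stand-in rather than a call into an actual CMA-ES library, and the scalar step size and the reset of B_i are simplifying assumptions beyond what (10) specifies.

```python
# Sketch of sampling rule (9) and re-initialization (10): after a subpopulation is
# flagged, its search distribution is re-centred on the best known solution in its
# dimensions, the step size is reset from the search range, and the diagonal
# scaling matrix becomes the identity.  Didactic only; not a real CMA-ES call.
import numpy as np

def sample_population(x_mean, sigma, B, D, pop_size):
    """Equation (9): x_{i,t} = x_m + sigma * (B D N) for each new individual."""
    dim = x_mean.size
    N = np.random.randn(dim, pop_size)                 # standard normal samples
    return (x_mean[:, None] + sigma * (B @ D @ N)).T   # one individual per row

def reinitialize(x_best_sub, low, up):
    """Equation (10): reset the mean, step size, and scaling of the distribution."""
    dim = x_best_sub.size
    x_mean = x_best_sub.copy()          # (10a) centre on the best known sub-solution
    sigma = (up - low).mean() * 0.1     # scalar approximation of (10b)
    D = np.eye(dim)                     # (10c) identity scaling
    B = np.eye(dim)                     # B is also reset here for simplicity
    return x_mean, sigma, B, D

x_mean, sigma, B, D = reinitialize(np.array([0.3, -1.2]), np.array([-5.0, -5.0]),
                                   np.array([5.0, 5.0]))
print(sample_population(x_mean, sigma, B, D, pop_size=4).shape)  # (4, 2)
```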

C. CCFR2-AEPD

CCFR2 with auto-enhanced population diversity, named CCFR2-AEPD, is introduced in this section. Similar to CCFR2, to make good use of computational resources, CCFR2-AEPD selects the subpopulation with the greatest contribution to undergo evolution in each co-evolutionary cycle. Unlike CCFR2, CCFR2-AEPD can identify the convergent and stagnant subpopulations and regenerate them. Algorithm 2 illustrates CCFR2-AEPD. Before the co-evolution starts, CCFR2-AEPD adopts a variable grouping method to decompose the optimization problem (see Step 1). Steps 2 to 3


initialize the subpopulations that solve the decomposed subproblems. For each subpopulation, Steps 4 to 5 initialize the values of the parameters that are used to evaluate its contribution to improving the best overall objective value and to check whether it is convergent or stagnant. Steps 7 to 13 select a subpopulation that can make a potentially great contribution to improving the best overall objective value to undergo evolution. If the convergent and stagnant subpopulations can make a potentially great contribution, these subpopulations are selected to undergo evolution (see Step 10); otherwise, only the non-convergent and non-stagnant subpopulation with the greatest contribution is selected (see Step 12). Steps 14 to 27 carry out the evolution of a selected subpopulation (Pi). CCFR2-AEPD makes Pi evolve using Algorithm 3 (see Step 16). After the evolution of Pi, CCFR2-AEPD records the

improvement of the best overall objective value and the number of spent fitness evaluations (see Steps 17 to 18). When Pi is convergent or stagnant (i.e., hi = 1), CCFR2-AEPD evaluates its contribution as the overall improvement of the best overall objective value since the last regeneration of Pi (see Step 20); otherwise, it evaluates the contribution as the real-time improvement of the best overall objective value (see Step 25). For a convergent or stagnant subpopulation, CCFR2-AEPD reinitializes the parameters that are used to evaluate its contribution and to check whether it is convergent or stagnant (see Steps 21 to 22), and re-diversifies the subpopulation by reinitializing the parameters of CMA-ES (see Step 23). Algorithm 3 illustrates the evolution of Pi. Steps 1 to 5 evaluate the constructed overall individuals. Pi evolves

Algorithm 3. Optimizer

Input: Pi (the ith subpopulation), Ci (the decision variables of the ith subproblem), and f (the objective function of the optimization problem).
Output: hi (a flag denoting whether Pi is convergent or stagnant), x_best (the best overall solution so far), and n (the number of fitness evaluations).

1:  for x ∈ Pi do
2:    if x_best is not the same as it was when the last evolution of Pi finished then
3:      Evaluate (x, x_best^(P̄i));
4:    end if
5:  end for
6:  x'_best ← x_best, n' ← n, S_ΔF ← ∅, s'_f ← 0;
7:  while true do
8:    Pi ← the evolution of Pi by CMA-ES for one generation;
9:    Update x_best and n;
10:   if Pi has been re-diversified and x_best ∉ Pi then
11:     Replace the worst individual of Pi with x_best;
12:   end if
13:   s ← 0;
14:   if x_best is different from x'_best in all the dimensions of Ci then
15:     ΔF ← (f(x'_best) − f(x_best)) / (n − n');
16:     S_ΔF ← S_ΔF ∪ {ΔF}, s_f ← the standard deviation of S_ΔF;
17:     if |S_ΔF| > 2 and s_f < s'_f then
18:       s ← 1;
19:     end if
20:     x'_best ← x_best, n' ← n, s'_f ← s_f;
21:   end if
22:   hi ← AEPD();
23:   if hi = 1 or s = 1 then
24:     return;
25:   end if
26: end while
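The stable-contribution test of Steps 13-21 can be illustrated with a few lines of Python; `history` plays the role of S_ΔF, and the handling of the previous standard deviation is simplified compared with the listing above.

```python
# Sketch of the stable-contribution test in Algorithm 3 (Steps 13-21): track the
# per-generation improvement rate DF, and report stability once more than two
# rates have been collected and their standard deviation stops growing.
import statistics

def stable_contribution(history, prev_sf, f_prev, f_curr, n_prev, n_curr):
    """Return (stable_flag, updated_history, updated_sf)."""
    delta_F = (f_prev - f_curr) / max(n_curr - n_prev, 1)    # improvement per evaluation
    history = history + [delta_F]
    sf = statistics.pstdev(history) if len(history) > 1 else 0.0
    stable = len(history) > 2 and sf < prev_sf               # as in Steps 17-18
    return stable, history, sf

# Toy usage: stability is reported once the spread of improvement rates shrinks.
hist, sf = [], 0.0
for f_prev, f_curr in [(100.0, 80.0), (80.0, 70.0), (70.0, 65.0), (65.0, 53.5)]:
    stable, hist, sf = stable_contribution(hist, sf, f_prev, f_curr, 0, 500)
    print(stable, round(sf, 5))
```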


until Pi makes a stable contribution to improving the best overall objective value or Pi needs to be re-diversified (see Steps 7 to 26). At each generation, Pi evolves using CMA-ES (see Steps 8 to 12). If Pi has been re-diversified, the worst individual of Pi is replaced with the best overall solution so far (i.e., xbest ) so that xbest can continue to be improved (see Step 11). Steps 13 to 21 check whether Pi makes a stable contribution to improving the best overall objective value. If the standard deviation of the contributions of Pi over several successive generations becomes smaller, it is deemed that Pi makes a stable contribution (i.e., s=1, see Step 18). It is reasonable to evaluate the contribution of Pi at this time. Step 22 invokes Algorithm 4 to check whether Pi is convergent or stagnant at each generation. If Pi is convergent or stagnant, hi =1; otherwise, hi =0. The convergent or stagnant subpopulation’s evolution would waste computational resources. When s=1 or hi =1, CCFR2-AEPD stops Algorithm 3 and returns to Algorithm 2 where the contribution of Pi is evaluated (see Steps 23 to 25).

CCFR2-AEPD enhances its global search capability by re-diversifying the convergent or stagnant subpopulations. Algorithm 4 illustrates the AEPD operation on Pi. Steps 1 to 18 check whether Pi is convergent or stagnant in all dimensions. Steps 3 to 11 check whether Pi is convergent in a dimension: if the standard deviation of Pi in a dimension is smaller than the threshold v_j, Pi is convergent in that dimension (i.e., rj = 1, see Step 11). If Pi converges at the same point in two successive convergences, the value of v_j decreases (see Step 8), which guarantees that the population can thoroughly converge at an optimum. Steps 12 to 17 check whether Pi is stagnant in a dimension: if the mean and standard deviation of Pi in a dimension remain unchanged for several successive generations, Pi is stagnant in that dimension (i.e., r̂j = 1, see Step 17). If Pi is convergent or stagnant in all dimensions, hi = 1 and the convergence point (i.e., M̄^i_j) is recorded; otherwise, hi = 0 (see Steps 19 to 24). A subpopulation with hi = 1 will be re-diversified in Algorithm 2. CCFR2-AEPD needs to check whether Pi is convergent or stagnant and to execute the population diversity enhancement on Pi; the time complexity of these two extra computations is O(Di), where Di is the dimensionality of Ci.

TABLE I Grouping results of ERDG on the CEC’2013 functions: used fitness evaluations (FEs), the number of decomposed groups, and grouping accuracy. r1, r2, and r3 are defined in [23].

F   | USED FEs | # OF GROUPS (SEP) | # OF GROUPS (NON-SEP) | ACCURACY r1 | r2     | r3
f1  | 2998     | 1000 | 0  | -      | 100.0% | 100.0%
f2  | 2998     | 1000 | 0  | -      | 100.0% | 100.0%
f3  | 3996     | 0    | 1  | -      | 0.0%   | 0.0%
f4  | 5326     | 700  | 7  | 100.0% | 100.0% | 100.0%
f5  | 5395     | 700  | 7  | 100.0% | 100.0% | 100.0%
f6  | 5905     | 0    | 7  | 100.0% | 46.6%  | 47.5%
f7  | 5554     | 700  | 7  | 100.0% | 100.0% | 100.0%
f8  | 8451     | 200  | 17 | 70.8%  | 99.9%  | 97.9%
f9  | 8812     | 0    | 20 | 100.0% | 100.0% | 100.0%
f10 | 8794     | 0    | 20 | 98.6%  | 99.5%  | 99.4%
f11 | 9212     | 0    | 20 | 100.0% | 100.0% | 100.0%
f12 | 26980    | 0    | 1  | 100.0% | 0.0%   | 0.2%
f13 | 7599     | 0    | 2  | 99.7%  | 53.1%  | 57.0%
f14 | 8420     | 0    | 1  | 100.0% | 0.0%   | 8.2%
f15 | 3996     | 0    | 1  | 100.0% | -      | 100.0%

unchanged for several successive generations, Pi is stagnant in that dimension (i.e., r̂j = 1, see Step 17). If Pi is convergent or stagnant in all dimensions, hi = 1 and the convergence point (i.e., M^i_j) is recorded; otherwise, hi = 0 (see Steps 19 to 24). The subpopulation with hi = 1 will be re-diversified in Algorithm 2. CCFR2-AEPD needs to check whether Pi is convergent or stagnant and to execute the population diversity enhancement on Pi. The time complexity of these two extra computations is O(Di), where Di is the dimensionality of Ci.
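For readers who want to prototype the check, the following sketch (an illustration under simplifying assumptions, not the authors' code) implements only the per-dimension convergence part of Algorithm 4 for a subpopulation stored as a NumPy array; the stagnation counter and the r̂j test of Steps 12 to 17 are omitted.

import numpy as np

def aepd_convergence_check(P, std_init, M_prev):
    # P: subpopulation of shape (individuals, D_i); std_init: per-dimension standard
    # deviation recorded when P was (re)generated; M_prev: per-dimension means at the
    # last convergence, with np.nan where no previous convergence exists.
    mean, std = P.mean(axis=0), P.std(axis=0)
    v = np.minimum(1e-3, std_init * 1e-3)                              # Step 6
    seen = ~np.isnan(M_prev)
    v[seen] = np.minimum(v[seen], np.abs(mean - M_prev)[seen] * 1e-3)  # Step 8
    v = np.maximum(v, 1e-10)                                           # Step 10
    if np.all(std < v):                                                # r_j = 1 in every dimension
        return 1, mean                                                 # h_i = 1; record M_j^i (Steps 19-21)
    return 0, M_prev                                                   # h_i = 0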

IV. Experimental Studies

In this section, the performance of CCFR2-AEPD is studied. Section IV-A presents the study on the population diversity enhancement proposed in Section III. Section IV-B presents the effect of population diversity enhancement on subpopulations' contribution evaluation. Section IV-C presents the comparison between CCFR2-AEPD and its competitors on optimality. A set of 15 1000-dimensional test instances proposed in the IEEE CEC'2013 special session on large-scale global optimization were used in the experimental studies. The detailed description of these test instances is given


in [39]. The CEC'2013 functions are classified into the following five categories:
1) Fully separable functions (f1–f3).
2) Partially separable functions with 700 separable variables and seven subcomponents of nonseparable variables (f4–f7).
3) Partially separable functions with 20 subcomponents of nonseparable variables (f8–f11).
4) Nonseparable functions with overlapping subcomponents (f12–f14).
5) Fully nonseparable function (f15).
The maximum number of fitness evaluations is set to 3 × 10^6 as the termination criterion, which is suggested in [39]. CCFR2-AEPD was compared with two contribution-based CC algorithms (CCFR [8] and CCFR2 [11]) and the original CC [3]. The parameters of CCFR, CCFR2, and the original CC were set according to their publications, e.g., the value of GEs was set to 100. All the CC algorithms adopted CMA-ES [38] as the optimizer, which is recommended in CCFR and CCFR2. The parameters of CMA-ES were set according to the recommendations in [38]. The settings of all the parameters of CMA-ES, including the subpopulation size, are adaptive to each subproblem. The CC algorithms used an


Algorithm 4. AEPD

Input: Pi (the ith subpopulation), Ci (the decision variables of the ith subproblem), std^i_j, ∀j ∈ Ci (the initial standard deviation of Pi after the generation of Pi), M^i_j, ∀j ∈ Ci (the mean of Pi when Pi was convergent or stagnant last time), and, for each j ∈ Ci, a counter of the number of successive generations where the mean and standard deviation of Pi remain unchanged.
Output: hi (a flag to denote whether Pi is convergent or stagnant).
1: for j ∈ Ci do
2:   mj ← the mean of Pi in the jth dimension, stdj ← the standard deviation of Pi in the jth dimension;
3:   if std^i_j = ∅ then
4:     std^i_j ← stdj;
5:   end if
6:   vj ← min(10^-3, std^i_j × 10^-3);
7:   if M^i_j ≠ ∅ then
8:     vj ← min(vj, |mj − M^i_j| × 10^-3);
9:   end if
10:  vj ← max(vj, 10^-10);
11:  Set the value of rj according to (2);
12:  if mj and stdj remain unchanged in two successive generations then
13:    increase the counter for the jth dimension by 1;
14:  else
15:    reset the counter for the jth dimension to 0;
16:  end if
17:  Set the value of r̂j according to (4), where UN = ⌊√Di⌋ + 1;
18: end for
19: if ∀j ∈ Ci, rj = 1 or r̂j = 1 then
20:   ∀j ∈ Ci, M^i_j ← mj;
21:   hi ← 1;
22: else
23:   hi ← 0;
24: end if

efficient recursive differential grouping (ERDG) method [28] to decompose the optimization problems. The experimental results in [28] showed that, compared with other grouping methods, ERDG can correctly group the interrelated variables together with much fewer fitness evaluations on almost all of the IEEE CEC'2013 functions. Table I summarizes the grouping results of ERDG on the CEC'2013 functions. The fitness evaluations spent by ERDG were counted as part of the computational budget.

A. Population Diversity Enhancement

f2 is a fully separable function with 1000 variables. The CC algorithms optimize each variable separately. Figure 1 shows the evolutionary information of CCFR2-AEPD and CCFR2 on f2 in a single run. Relative to the other variables, the variable in the 263rd dimension took population diversity enhancement more times in the

single run. Therefore, the evolution of P263, which is responsible for optimizing the variable in the 263rd dimension, was taken as an example to explain the effect of AEPD on CCFR2. During the evolution of P263 in CCFR2, the values of D and σ in (9) decreased as the evolution progressed (see Figure 1(a) and (b)). Because the value of σ_i(B_i D_i N) decreased, the difference between the individuals generated by CMA-ES shrank. After 1.3 × 10^6 fitness evaluations, the standard deviation of the generated individuals was zero (see Figure 1(c)), and the generated individuals were the same as the mean value of the search distribution, i.e., xm_263. The individuals were generated around 3.4743 but not near the global optimum, i.e., 2.6557 (see Figure 1(d)). The convergence graph shows that the subpopulation of CCFR2 was trapped in a local optimum on f2 (see Figure 1(e)). CCFR2


performed similarly in many other dimensions. If the absolute difference between the best solution found by an algorithm and the global optimum in a dimension is smaller than 10^-8, it is deemed that the algorithm has found the global optimum in that dimension. Figure 1(f) shows the number of dimensions where the global optimum had been found on f2. At the end of the single run, CCFR2 only found the global optimum in 114 out of 1000 dimensions.
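The per-dimension success criterion used for Figure 1(f) can be written in a few lines (a sketch only; x_opt denotes the known global optimum of the benchmark function):

import numpy as np

def solved_dimensions(x_best, x_opt, tol=1e-8):
    # A dimension counts as solved when |x_best_j - x_opt_j| < 1e-8.
    return int(np.sum(np.abs(np.asarray(x_best) - np.asarray(x_opt)) < tol))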

According to the evolutionary information of the subpopulations, CCFR2-AEPD automatically identifies the convergent subpopulations and enhances their population diversity, which makes the algorithm find another, better optimum. CCFR2-AEPD can identify and re-diversify the convergent subpopulations. As the subpopulation converged, the standard deviation of the individuals decreased (see Figure 1(c)). When the subpopulation was convergent, CCFR2-AEPD increased the values of D and σ in (9) (see Figure 1(a) and (b)), which resulted in the enlargement of the standard deviation of the generated individuals (see Figure 1(c)). The subpopulation diversity was thus enhanced. It can be seen in Figure 1(d) that the mean of the individuals changed constantly. This indicates that CCFR2-AEPD could jump out of the local optimum. Compared with CCFR2, CCFR2-AEPD found a better solution to f2 by enhancing the population diversity (see Figure 1(e)). At the end of the single run, CCFR2-AEPD found the global optimum in all 1000 dimensions (see Figure 1(f)). CCFR2-AEPD performed much better than CCFR2 on f2.

FIGURE 1 For the same initial population, the evolutionary information on f2 in a single run: (a) D_263 in (9), (b) σ_263 in (9), (c) std_263, (d) m_263, (e) the best fitness, (f) the number of dimensions where the global optimum has been found, (g) threshold v_263 in CCFR2-AEPD, and (h) M_263 in CCFR2-AEPD.

vj is a threshold that is used to check whether a subpopulation is convergent in the jth dimension [see (2)]. CCFR2-AEPD can adapt the value of vj to the information on the convergence of a subpopulation (see Step 8 in Algorithm 4). Figure 1(g) shows the value of v263 in the above single run. Figure 1(h) shows the mean of individuals when P263 was convergent, i.e., M263 in CCFR2-AEPD. When the current mean of individuals (m263 in Figure 1(d)) differed much from the mean at the last convergence of P263 (M263 in Figure 1(h)), the value of v263 increased. When m263 had a similar value to M263, the value of v263 shrank. A small

value of vj can make a subpopulation thoroughly converge to an optimum in the jth dimension. A large value of vj triggers the population re-diversification as early as possible, which is beneficial for finding another, better optimum.

B. Effect of Population Diversity Enhancement on Contribution Evaluation

f8 is a partially separable function with 20 subcomponents of nonseparable variables. ERDG identifies three subcomponents of nonseparable variables as 200 separable variables by mistake (see

Table I). The CC algorithms optimize the 200 variables separately. For each of these 200 variables, the global optimum may be false, because it probably becomes a local optimum when the values of its interrelated variables change. When a convergent subpopulation responsible for optimizing one of the 200 variables is trapped in a false global optimum, the best overall objective value cannot be improved by the immediate subpopulation re-diversification. The subpopulation's contribution will be set to zero, and the subpopulation is rarely selected to undergo evolution in the future. CCFR2-AEPD does not


FIGURE 2 For the same initial population, the evolutionary information on f8 in a single run: (a) the number of times ineffective co-evolution occurred and (b) the best fitness.

immediately re-diversify the convergent subpopulations, but does so with a probability [see (8)]. The subpopulations responsible for optimizing the interrelated variables can thus be re-diversified one by one. For a variable, when the values of its interrelated variables are modified by the subpopulation's evolution, CCFR2-AEPD can immediately turn a false global optimum into a local optimum. The best overall objective value can then be improved by the subpopulation re-diversification. The contribution made by the subpopulation re-diversification will not be zero. These subpopulations can be selected to undergo evolution in the future. Steps 7 to 13 in Algorithm 2 were replaced with S ← {Pi | i = arg max Zi}. The new algorithm with this replacement was named CCFR2-AEPD-Test, where the convergent or stagnant subpopulation with the greatest contribution is immediately re-diversified. If a convergent subpopulation, which is trapped in a false global optimum, is selected to undergo evolution in the following co-evolutionary cycle, it will be re-diversified and evolve. Because the subpopulation has been trapped in a false global optimum, it is ineffective for the subpopulation to spend computational resources on its re-diversification and co-evolution. The ineffective co-evolution cannot improve the best overall objective value, so the subpopulation's


contribution will be wrongly set to zero even though it has not found the real global optimum. Figure 2 shows the evolutionary information of CCFR2-AEPD and CCFR2-AEPD-Test on f8 in a single run. It can be seen in Figure 2(a) that CCFR2-AEPD-Test performed ineffective co-evolution far more often than CCFR2-AEPD as the evolution progressed. CCFR2-AEPD-Test did so a total of 156 times, while CCFR2-AEPD did so only six times. The experimental results show that the ineffective co-evolution resulted in wrong contribution evaluations for 131 subpopulations in CCFR2-AEPD-Test and for six subpopulations in CCFR2-AEPD. CCFR2-AEPD could make better use of computational resources than CCFR2-AEPD-Test. CCFR2-AEPD converged faster than CCFR2-AEPD-Test in the later period of the single run (see Figure 2(b)). Re-diversifying a subpopulation with a probability, rather than immediately, is beneficial for solving optimization problems where nonseparable variables are wrongly identified as separable variables. f11 is a partially separable function with 20 subcomponents of nonseparable variables. ERDG can correctly group the 20 subcomponents of nonseparable variables (see Table I). Figure 3 shows the evolutionary information on f11 in a single run. It can be seen in


Figure 3(a) that CCFR2 without population re-diversification obtained the global optimum in all dimensions at the end of the single run, while CCFR2-AEPD obtained the global optimum in 907 out of 1000 dimensions. The population re-diversification would disturb the normal evolutionary process of CCFR2, which would decrease the convergence speed of the algorithm. For optimization problems that CCFR2 can already solve efficiently, regenerating the convergent and stagnant subpopulations with a probability, rather than immediately, reduces the negative effect of the population re-diversification on the convergence speed. Figure 3(b) shows the convergence of CCFR2-AEPD on optimizing the 20th subcomponent of nonseparable variables in the single run. During the run, subpopulation P20, which is responsible for optimizing the 20th subcomponent of nonseparable variables, was re-diversified four times. The best overall objective value became better after 105, 227, and 314 evolutionary generations due to the first, second, and third population re-diversifications, respectively. For different subpopulation re-diversifications, P20 spent different numbers of evolutionary generations improving the best overall objective value. This phenomenon was also observed in other subpopulations' evolution. The experimental study indicated that it is necessary that

FIGURE 3 For the same initial population, the evolutionary information on f11 in a single run: (a) The number of dimensions where the global optimum has been found and (b) the convergence of CCFR2-AEPD on optimizing the 20th subcomponent of nonseparable variables.

CCFR2-AEPD adaptively determine when to evaluate a subpopulation's contribution (see Algorithm 3).

C. Comparison on Optimality

Table II summarizes the optimization results of the CC algorithms on the CEC’2013 functions over 25 independent runs. CCFR2-AEPD performs significantly better than its competitors on two out of three fully separable functions

f1–f3, especially on f2 where CCFR2-AEPD performs better than its competitors by several orders of magnitude. For f2, CCFR2-AEPD obtained the global optimum in all dimensions in 23 out of 25 runs, where the best overall objective value was smaller than 10^-15. In the other two runs, CCFR2-AEPD was trapped in a local optimum in only one dimension on f2. CCFR2-AEPD performs significantly better than its

competitors on five out of eight partially separable functions f4 –f11 . The performance of CCFR2-AEPD is better than that of its competitors on f4 , f7 , and f8 by several orders of magnitude. There is no significant difference between the performance of CCFR2-AEPD and its competitors on nonseparable functions f12 –f15 . CCFR2-AEPD significantly outperforms all its competitors on 7 out of 15 functions. The values of b/e/l

TABLE II Average fitness values ± standard deviations over 25 independent runs on the CEC’2013 functions. The significantly best results are in boldface (Wilcoxon test with Holm p-value correction, α = 0.05). b, e, and l respectively denote the number of functions where CCFR2-AEPD performs significantly better than, statistically equivalent to, and significantly worse than its competitors.

F     | CCFR2-AEPD          | CCFR2                 | CCFR                  | CC
f1    | 6.79e-16 ± 1.03e-16 | 5.26e-17 ± 6.06e-18 ↓ | 5.58e-17 ± 4.55e-18 ↓ | 1.64e-18 ± 5.04e-19 ↓
f2    | 7.96e-02 ± 2.75e-01 | 4.32e+02 ± 3.23e+01 ↑ | 4.32e+02 ± 3.23e+01 ↑ | 4.32e+02 ± 3.23e+01 ↑
f3    | 2.03e+01 ± 4.76e-02 | 2.04e+01 ± 4.79e-02 ↑ | 2.04e+01 ± 5.03e-02 ↑ | 2.04e+01 ± 5.76e-02 ↑
f4    | 2.13e-12 ± 6.94e-12 | 3.61e-04 ± 1.00e-04 ↑ | 3.73e-05 ± 3.96e-05 ↑ | 3.12e+08 ± 1.79e+08 ↑
f5    | 2.32e+06 ± 3.30e+05 | 2.39e+06 ± 4.82e+05   | 2.39e+06 ± 5.69e+05   | 2.42e+06 ± 4.95e+05
f6    | 9.96e+05 ± 2.17e-01 | 9.96e+05 ± 5.48e+01 ↑ | 9.96e+05 ± 2.19e-01 ↑ | 9.96e+05 ± 1.35e+02 ↑
f7    | 5.68e-15 ± 2.35e-15 | 2.79e-07 ± 1.19e-07 ↑ | 1.32e-08 ± 2.27e-08 ↑ | 1.03e+07 ± 5.48e+06 ↑
f8    | 3.77e+03 ± 6.87e+02 | 2.56e+04 ± 7.72e+03 ↑ | 2.62e+04 ± 1.06e+04 ↑ | 1.08e+09 ± 2.60e+08 ↑
f9    | 1.65e+08 ± 3.41e+07 | 1.54e+08 ± 2.91e+07   | 1.55e+08 ± 2.84e+07   | 1.61e+08 ± 3.24e+07
f10   | 9.05e+07 ± 1.23e+02 | 9.17e+07 ± 1.77e+06 ↑ | 9.07e+07 ± 2.03e+05 ↑ | 9.09e+07 ± 4.45e+05 ↑
f11   | 1.45e-04 ± 5.62e-04 | 5.25e-18 ± 2.81e-18 ↓ | 7.62e-09 ± 5.34e-09 ↓ | 4.03e+03 ± 5.07e+03 ↑
f12   | 9.93e+02 ± 4.89e+01 | 9.90e+02 ± 6.18e+01   | 9.90e+02 ± 6.18e+01   | 9.90e+02 ± 6.18e+01
f13   | 4.81e+05 ± 4.86e+04 | 4.92e+05 ± 7.13e+04   | 4.67e+05 ± 5.11e+04   | 8.47e+05 ± 1.44e+05 ↑
f14   | 2.62e+07 ± 2.01e+06 | 2.62e+07 ± 2.02e+06   | 2.62e+07 ± 2.03e+06   | 2.62e+07 ± 2.03e+06
f15   | 2.14e+06 ± 2.00e+05 | 2.17e+06 ± 2.17e+05   | 2.16e+06 ± 2.14e+05   | 2.16e+06 ± 2.14e+05
b/e/l |                     | 7/6/2                 | 7/6/2                 | 9/5/1

The symbols ↑ and ↓ respectively denote that CCFR2-AEPD performs significantly better than and worse than this algorithm by the Wilcoxon test at the significance level of 0.05.
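As a rough illustration of the statistical protocol used in Tables II, IV, and V (this is not the authors' evaluation script; it assumes one array of 25 final objective values per algorithm and applies the Holm correction across the competitors for a single function), the comparison could be scripted as follows:

import numpy as np
from scipy.stats import wilcoxon

def compare_on_one_function(a_runs, competitor_runs, alpha=0.05):
    # a_runs: 25 final objective values of CCFR2-AEPD (minimization).
    # competitor_runs: dict mapping a competitor's name to its 25 final values.
    names = list(competitor_runs)
    pvals = np.array([wilcoxon(a_runs, competitor_runs[n]).pvalue for n in names])
    order = np.argsort(pvals)                 # Holm: test the smallest p-value first
    verdict = {n: "equivalent" for n in names}
    for k, idx in enumerate(order):
        if pvals[idx] * (len(names) - k) >= alpha:
            break                             # retain this and all remaining hypotheses
        name = names[idx]
        a_better = np.median(competitor_runs[name]) > np.median(a_runs)
        verdict[name] = "better" if a_better else "worse"   # from CCFR2-AEPD's viewpoint
    return verdict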


indicate that the overall performance of CCFR2-AEPD is better than that of its competitors on the CEC'2013 functions. The results in Table II show that CCFR2 without AEPD obtains the global optima on f1, f4, f7, and f11. The subpopulation re-diversification would decrease the convergence speed of an algorithm on f1 and f11. Therefore, the result accuracy of CCFR2-AEPD is worse than that of CCFR2 on f1 and f11. There are 700 separable variables in f4 and f7. The CC algorithms optimize the separable variables separately. Our experimental results show that in the first co-evolutionary cycle, CCFR2-AEPD respectively spent about 2.6 × 10^4 and 2.9 × 10^4 fitness evaluations optimizing these separable variables, while CCFR2 spent about 2.8 × 10^5 fitness evaluations. This is because CCFR2 stops optimizing these separable variables when the subpopulations are stagnant, while CCFR2-AEPD stops optimizing the variables when the standard deviations of the subpopulations are smaller than threshold vj. CCFR2-AEPD could stop optimizing these separable variables earlier than CCFR2. Then CCFR2-AEPD could allocate computational resources

earlier than CCFR2 to optimize the seven subcomponents of nonseparable variables, which greatly affect the overall objective value. For f4 and f7, CCFR2-AEPD converges faster than CCFR2, and the result accuracy of CCFR2-AEPD is better than that of CCFR2. If the best overall solutions obtained at two successive subpopulation re-diversifications are different (i.e., the best overall solution becomes better), the second subpopulation re-diversification is deemed effective. Table III summarizes the average results of CCFR2-AEPD on the subpopulation re-diversification over 25 independent runs. Because there are fewer than two subpopulation re-diversifications on f3, f12, f13, f14, and f15, the values in the last two columns are empty for these functions. CCFR2 is trapped in a local optimum on the functions except for f1, f4, and f7 (see Table II). The subpopulation re-diversification may make an algorithm find another, better optimum. Therefore, CCFR2-AEPD outperforms CCFR2 on f2, f3, f5, f6, f8, and f10. In particular, CCFR2-AEPD performs more than 1000 effective subpopulation re-diversifications on f2 and f8, where the optimization results of CCFR2-AEPD are better than those of CCFR2 by several orders of

TABLE III Average values of CCFR2-AEPD on the CEC’2013 functions over 25 independent runs. n1 is the number of subpopulation re-diversifications, and n2 is the number of effective subpopulation re-diversifications.

F   | n1      | n2     | n2/n1 (%)
f1  | 11691   | 0      | 0.00%
f2  | 12504.6 | 1285.9 | 10.28%
f3  | 1       | -      | -
f4  | 5775.2  | 4.3    | 0.07%
f5  | 9720.6  | 477.3  | 4.91%
f6  | 6.1     | 1.1    | 18.03%
f7  | 7479.5  | 2.5    | 0.03%
f8  | 1699.8  | 1429.8 | 84.12%
f9  | 481.2   | 31.1   | 6.46%
f10 | 21.6    | 2.1    | 9.72%
f11 | 71.1    | 3.2    | 4.50%
f12 | 0       | -      | -
f13 | 0       | -      | -
f14 | 0       | -      | -
f15 | 0       | -      | -


magnitude. The population is hardly convergent or stagnant in all dimensions at the same time on the nonseparable functions f12–f15. Under the fixed computational budget (i.e., 3 × 10^6 fitness evaluations), no population diversity enhancement is executed on f12–f15. Therefore, there is no significant difference between the performance of CCFR2-AEPD and CCFR2 on f12–f15 (see Table II). ERDG decomposes f3, f6, f8, and f10 incorrectly (see Table I). Table IV summarizes the optimization results on the four functions with ideal grouping results. CCFR2 obtains the global optimum on f8. Because the subpopulation re-diversification decreases the convergence speed of an algorithm, the result accuracy of CCFR2-AEPD is worse than that of CCFR2 and CCFR on f8. The subpopulation re-diversification does not make CCFR2-AEPD jump out of the local optimum on f3. There is no difference between the performance of the CC algorithms on f3. CCFR2-AEPD outperforms its competitors on f6 and f10 due to the subpopulation re-diversification. CCFR2-AEPD was also compared with MMO-CC [6], CBCC-RDG3 [31], SHADE-ILS [40], MOS-CEC2013 [41], and MA-SW-Chains [42]. MMO-CC is a multimodal optimization enhanced CC algorithm. CBCC-RDG3, SHADE-ILS, MOS-CEC2013, and MA-SW-Chains were respectively the winners of the IEEE CEC'2019, CEC'2018, CEC'2013, and CEC'2010 competitions on large-scale global optimization. The parameters of the competitors of CCFR2-AEPD were set according to their publications. Table V summarizes the results of these algorithms. CCFR2-AEPD significantly outperforms CBCC-RDG3 on six functions (i.e., f2, f3, f4, f8, f10, and f14), especially on f2, f4, f8, and f14. The performance of CCFR2-AEPD is

TABLE IV Average fitness values ± standard deviations over 25 independent runs on f3, f6, f8, and f10 with ideal grouping results. The significantly best results are in boldface (Wilcoxon test with Holm p-value correction, α = 0.05). b, e, and l have similar meanings as in Table II.

F     | CCFR2-AEPD              | CCFR2                     | CCFR                      | CC
f3    | 2.0000e+01 ± 0.0000e+00 | 2.0000e+01 ± 0.0000e+00   | 2.0000e+01 ± 0.0000e+00   | 2.0000e+01 ± 0.0000e+00
f6    | 9.9596e+05 ± 2.2020e-04 | 9.9599e+05 ± 8.4330e+01 ↑ | 9.9597e+05 ± 5.1231e+01 ↑ | 1.0652e+06 ± 1.4445e+03 ↑
f8    | 3.0110e-02 ± 2.1603e-02 | 3.0047e-05 ± 1.8033e-05 ↓ | 8.0277e-03 ± 2.3646e-03 ↓ | 1.3087e+07 ± 5.4084e+06 ↑
f10   | 9.0544e+07 ± 2.1721e+02 | 9.1431e+07 ± 1.7171e+06 ↑ | 9.0706e+07 ± 7.7306e+05 ↑ | 9.1117e+07 ± 1.3728e+06 ↑
b/e/l |                         | 2/1/1                     | 2/1/1                     | 3/1/0

The symbols ↑ and ↓ have similar meanings as in Table II.

TABLE V Average fitness values ± standard deviations over 25 independent runs on the CEC’2013 functions. The significantly best results are in boldface (Wilcoxon test with Holm p-value correction, α = 0.05). b, e, and l have similar meanings as in Table II.

F     | CCFR2-AEPD          | MMO-CC                | CBCC-RDG3             | SHADE-ILS             | MOS-CEC2013           | MA-SW-CHAINS
f1    | 6.79e-16 ± 1.03e-16 | 4.83e-20 ± 9.45e-21 ↓ | 1.17e-18 ± 1.38e-19 ↓ | 2.69e-24 ± 1.35e-23 ↓ | 1.27e-22 ± 7.56e-23 ↓ | 8.49e-13 ± 1.11e-12 ↑
f2    | 7.96e-02 ± 2.75e-01 | 1.53e+03 ± 7.42e+01 ↑ | 2.33e+03 ± 1.03e+02 ↑ | 1.00e+03 ± 8.90e+01 ↑ | 8.32e+02 ± 4.57e+01 ↑ | 1.22e+03 ± 1.16e+02 ↑
f3    | 2.03e+01 ± 4.76e-02 | 2.01e+01 ± 1.31e-02 ↓ | 2.04e+01 ± 6.28e-02 ↑ | 2.01e+01 ± 1.12e-02 ↓ | 9.18e-13 ± 5.23e-14 ↓ | 2.14e+01 ± 5.73e-02 ↑
f4    | 2.13e-12 ± 6.94e-12 | 2.96e+11 ± 3.70e+11 ↑ | 2.10e+04 ± 2.20e+04 ↑ | 1.48e+08 ± 8.72e+07 ↑ | 1.74e+08 ± 8.03e+07 ↑ | 4.58e+09 ± 2.51e+09 ↑
f5    | 2.32e+06 ± 3.30e+05 | 2.80e+06 ± 1.70e+06   | 2.08e+06 ± 4.16e+05 ↓ | 1.39e+06 ± 2.03e+05 ↓ | 6.94e+06 ± 9.03e+05 ↑ | 1.87e+06 ± 3.13e+05 ↓
f6    | 9.96e+05 ± 2.17e-01 | 1.06e+06 ± 3.21e+03 ↑ | 9.96e+05 ± 1.16e+02   | 1.02e+06 ± 1.19e+04 ↑ | 1.48e+05 ± 6.56e+04 ↓ | 1.01e+06 ± 1.56e+04 ↑
f7    | 5.68e-15 ± 2.35e-15 | 7.41e+01 ± 5.46e+01 ↑ | 1.68e-21 ± 3.59e-22 ↓ | 1.62e+04 ± 9.29e+03 ↑ | 3.45e+06 ± 1.29e+06 ↑ | 1.44e+10 ± 1.27e+10 ↑
f8    | 3.77e+03 ± 6.87e+02 | 1.77e+14 ± 1.34e+14 ↑ | 1.05e+04 ± 7.60e+03 ↑ | 3.17e+11 ± 3.06e+11 ↑ | 8.00e+12 ± 3.14e+12 ↑ | 4.85e+13 ± 1.04e+13 ↑
f9    | 1.65e+08 ± 3.41e+07 | 1.66e+08 ± 2.94e+07   | 1.58e+08 ± 2.80e+07   | 1.64e+08 ± 1.57e+07   | 3.83e+08 ± 6.42e+07 ↑ | 1.07e+08 ± 1.71e+07 ↓
f10   | 9.05e+07 ± 1.23e+02 | 9.39e+07 ± 7.14e+05 ↑ | 9.09e+07 ± 1.17e+06 ↑ | 9.18e+07 ± 6.93e+05 ↑ | 9.02e+05 ± 5.17e+05 ↓ | 9.18e+07 ± 1.08e+06 ↑
f11   | 1.45e-04 ± 5.62e-04 | 2.72e+12 ± 2.47e+12 ↑ | 9.20e-11 ± 2.31e-10 ↓ | 5.11e+05 ± 2.25e+05 ↑ | 5.22e+07 ± 2.10e+07 ↑ | 2.19e+08 ± 3.04e+07 ↑
f12   | 9.93e+02 ± 4.89e+01 | 8.98e+10 ± 2.60e+11 ↑ | 7.12e+02 ± 1.19e+02 ↓ | 6.18e+01 ± 1.04e+02 ↓ | 2.47e+02 ± 2.59e+02 ↓ | 1.25e+03 ± 1.07e+02 ↑
f13   | 4.81e+05 ± 4.86e+04 | 1.76e+12 ± 1.44e+12 ↑ | 7.01e+04 ± 6.37e+04 ↓ | 1.00e+05 ± 7.19e+04 ↓ | 3.40e+06 ± 1.08e+06 ↑ | 1.98e+07 ± 1.86e+06 ↑
f14   | 2.62e+07 ± 2.01e+06 | 3.54e+11 ± 4.77e+11 ↑ | 1.38e+09 ± 1.26e+09 ↑ | 5.76e+06 ± 3.76e+05 ↓ | 2.56e+07 ± 8.11e+06   | 1.36e+08 ± 2.15e+07 ↑
f15   | 2.14e+06 ± 2.00e+05 | 4.31e+08 ± 2.06e+08 ↑ | 2.23e+06 ± 2.97e+05   | 6.25e+05 ± 2.40e+05 ↓ | 2.35e+06 ± 1.98e+05 ↑ | 5.71e+06 ± 7.73e+05 ↑
b/e/l |                     | 11/2/2                | 6/3/6                 | 7/1/7                 | 9/1/5                 | 13/0/2

The symbols ↑ and ↓ have similar meanings as in Table II.

significantly worse than that of CBCC-RDG3 on six functions (i.e., f1, f5, f7, f11, f12, and f13), especially on f13. CCFR2-AEPD and MOS-CEC2013 obtain the global optimum on two out of three fully separable functions f1–f3. CCFR2-AEPD outperforms its competitors on f2 by several orders of magnitude. CCFR2-AEPD significantly outperforms its competitors, except for CBCC-RDG3, on almost all of the eight partially separable functions f4–f11, especially on f4, f7, f8, and f11, where CCFR2-AEPD performs better by several orders of magnitude. The CC algorithms optimize the decision variables together on nonseparable functions. As mentioned previously, CCFR2-AEPD does not perform the population diversity

enhancement on f12–f15. On f12–f15, the only difference between the CC and non-CC algorithms therefore lies in the optimizer employed. SHADE-ILS performs best on f12–f15, which indicates that SHADE-ILS is superior to CMA-ES, the optimizer used in CCFR2-AEPD.

V. Conclusion

In this paper, the contribution-based CC with population diversity enhancement was studied, and a new CC named CCFR2-AEPD was proposed to solve large-scale optimization problems. At each evolutionary generation, CCFR2-AEPD checks whether a subpopulation is convergent or stagnant. CCFR2-AEPD regenerates the convergent and stagnant

subpopulations to enhance their population diversity. CCFR2-AEPD was tested on the IEEE CEC’2013 large-scale benchmark functions. From the experimental results in this paper, several conclusions can be drawn. Firstly, CCFR2-AEPD can identify the convergent and stagnant subpopulations. CCFR2-AEPD re-diversifies the convergent and stagnant subpopulations to enhance the global search capability of the algorithm. Secondly, by regenerating the subpopulations with a probability but not immediately, CCFR2-AEPD can decrease the negative effect of population re-diversification on the convergent and stagnant


subpopulations’ contribution evaluation. Finally, CCFR2-AEPD with ERDG, a method for grouping variables, significantly outperforms its competitors on the CEC’2013 large-scale optimization problems, especially on partially separable problems. CCFR2-AEPD is a competitive solver for large-scale optimization problems. In the future, the performance of CCFR2-AEPD in solving nonseparable problems will be improved by employing the algorithm portfolio [43], [44].

Online resources for the tutorial: The MATLAB source code implementation is provided at the following URL: https://github.com/ymzhongzhong/CCFR2-AEPD

Acknowledgment

This work was supported in part by the Open Research Project of the Hubei Key Laboratory of Intelligent Geo-Information Processing under Grant KLIGIP-2021B04, in part by the National Natural Science Foundation of China under Grant 61305086, and in part by Guangdong Provincial Key Laboratory under Grant 2020B121201001.

References

[1] M. N. Omidvar, X. Li, and X. Yao, “A review of population-based metaheuristics for large-scale black-box global optimization: Part I,” IEEE Trans. Evol. Comput., vol. 26, no. 5, pp. 802–822, Oct. 2022. [2] M. N. Omidvar, X. Li, and X. Yao, “A review of population-based metaheuristics for large-scale black-box global optimization: Part II,” IEEE Trans. Evol. Comput., vol. 26, no. 5, pp. 823–843, Oct. 2022. [3] M. A. Potter and K. A. D. Jong, “A cooperative coevolutionary approach to function optimization,” in Parallel Problem Solving From Nature. Heidelberg, Germany: Springer, 1994, pp. 249–257. [4] P. Yang, K. Tang, and X. Yao, “A parallel divide-and-conquer-based evolutionary algorithm for large-scale optimization,” IEEE Access, vol. 7, pp. 163105–163118, 2019. [5] J. Blanchard, C. Beauthier, and T. Carletti, “A surrogate-assisted cooperative co-evolutionary algorithm using recursive differential grouping as decomposition strategy,” in Proc. IEEE Congr. Evol. Comput., 2019, pp. 689–696. [6] X. Peng, Y. Jin, and H. Wang, “Multimodal optimization enhanced cooperative coevolution for large-scale optimization,” IEEE Trans. Cybern., vol. 49, no. 9, pp. 3507–3520, Sep. 2019. [7] Y. Wu, X. Peng, H. Wang, Y. Jin, and D. Xu, “Cooperative coevolutionary CMA-ES with landscape-


aware grouping in noisy environments,” IEEE Trans. Evol. Comput., vol. 27, no. 3, pp. 686-700, Jun. 2023. [8] M. Yang et al., “Efficient resource allocation in cooperative co-evolution for large-scale global optimization,” IEEE Trans. Evol. Comput., vol. 21, no. 4, pp. 493–505, Aug. 2017. [9] M. N. Omidvar, X. Li, and X. Yao, “Smart use of computational resources based on contribution for cooperative co-evolutionary algorithms,” in Proc. 13th Annu. Conf. Genet. Evol. Comput., 2011, pp. 1115–1122. [10] M. N. Omidvar, B. Kazimipour, X. Li, and X. Yao, “CBCC3–A contribution-based cooperative co-evolutionary algorithm with improved exploration/exploitation balance,” in Proc. IEEE Congr. Evol. Comput., 2016, pp. 3541–3548. [11] M. Yang, A. Zhou, C. Li, J. Guan, and X. Yan, “CCFR2: A more efficient cooperative co-evolutionary framework for large-scale global optimization,” Inf. Sci., vol. 512, pp. 64–79, 2020. [12] Y.-H. Jia, Y. Mei, and M. Zhang, “Contribution-based cooperative co-evolution for nonseparable large-scale problems with overlapping subcomponents,” IEEE Trans. Cybern., vol. 52, no. 6, pp. 4246–4259, Jun. 2022. [13] M. Yang, C. Li, Z. Cai, and J. Guan, “Differential evolution with auto-enhanced population diversity,” IEEE Trans. Cybern., vol. 45, no. 2, pp. 302–315, Feb. 2015. [14] K. Yu, J. Liang, B. Qu, Y. Luo, and C. Yue, “Dynamic selection preference-assisted constrained multiobjective differential evolution,” IEEE Trans. Syst., Man, Cybern. Syst., vol. 52, no. 5, pp. 2954–2965, May 2022. [15] Y. Shi, H. Teng, and Z. Li, “Cooperative coevolutionary differential evolution for function optimization,” in Proc. Adv. Natural Comput., Heidelberg, Germany, 2005, pp. 1080–1088. [16] F. V. d. Bergh and A. P. Engelbrecht, “A cooperative approach to particle swarm optimization,” IEEE Trans. Evol. Comput., vol. 8, no. 3, pp. 225–239, Jun. 2004. [17] Z. Yang, K. Tang, and X. Yao, “Large scale evolutionary optimization using cooperative coevolution,” Inf. Sci., vol. 178, no. 15, pp. 2985–2999, 2008. [18] M. Omidvar, X. Li, Z. Yang, and X. Yao, “Cooperative co-evolution for large scale optimization through more frequent random grouping,” in Proc. IEEE Congr. Evol. Comput., 2010, pp. 1–8. [19] Z. Yang, K. Tang, and X. Yao, “Multilevel cooperative coevolution for large scale optimization,” in Proc. IEEE Congr. Evol. Comput., 2008, pp. 1663–1670. [20] X. Li and X. Yao, “Cooperatively coevolving particle swarms for large scale optimization,” IEEE Trans. Evol. Comput., vol. 16, no. 2, pp. 210–224, Apr. 2012. [21] M. N. Omidvar, X. Li, Y. Mei, and X. Yao, “Cooperative co-evolution with differential grouping for large scale optimization,” IEEE Trans. Evol. Comput., vol. 18, no. 3, pp. 378–393, Jun. 2014. [22] Y. Sun, M. Kirley, and S. K. Halgamuge, “Extended differential grouping for large scale global optimization with direct and indirect variable interactions,” in Proc. Annu. Conf. Genet. Evol. Comput., 2015, pp. 313–320. [23] Y. Mei, M. Omidvar, X. Li, and X. Yao, “A competitive divide-and-conquer algorithm for unconstrained large-scale black-box optimization,” ACM Trans. Math. Softw., vol. 42, no. 2, pp. 13:1–13:24, 2016. [24] X. Hu, F. He, W. Chen, and J. Zhang, “Cooperation coevolution with fast interdependency identification for large scale optimization,” Inf. Sci., vol. 381, pp. 142–160, 2017. [25] M. N. Omidvar, M. Yang, Y. Mei, X. Li, and X. Yao, “DG2: A faster and more accurate differential grouping for large-scale black-box optimization,” IEEE Trans. Evol. Comput., vol. 21, no. 6, pp. 
929–942, Dec. 2017. [26] Y. Sun, M. Kirley, and S. K. Halgamuge, “A recursive decomposition method for large scale continuous optimization,” IEEE Trans. Evol. Comput., vol. 22, no. 5, pp. 647–661, Oct. 2018.


[27] Y. Sun, M. N. Omidvar, M. Kirley, and X. Li, “Adaptive threshold parameter estimation with recursive differential grouping for problem decomposition,” in Proc. Genet. Evol. Comput. Conf., 2018, pp. 889–896. [28] M. Yang, A. Zhou, C. Li, and X. Yao, “An efficient recursive differential grouping for large-scale continuous problems,” IEEE Trans. Evol. Comput., vol. 25, no. 1, pp. 159–171, Feb. 2021. [29] M. M. Komarnicki, M. W. Przewozniczek, H. Kwasnicka, and K. Walkowiak, “Incremental recursive ranking grouping for large scale global optimization,” IEEE Trans. Evol. Comput., early access, Oct. 25, 2022, doi: 10.1109/ TEVC.2022.3216968. [30] M. Chen, W. Du, Y. Tang, Y. Jin, and G. G. Yen, “A decomposition method for both additively and non-additively separable problems,” IEEE Trans. Evol. Comput., early access, Nov. 1, 2022, doi: 10.1109/TEVC.2022.3218375. [31] Y. Sun, X. Li, A. Ernst, and M. N. Omidvar, “Decomposition for large-scale optimization problems with overlapping components,” in Proc. IEEE Congr. Evol. Comput., 2019, pp. 326–333. [32] Z. Ren, Y. Liang, A. Zhang, Y. Yang, Z. Feng, and L. Wang, “Boosting cooperative coevolution for large scale optimization with a fine-grained computation resource allocation strategy,” IEEE Trans. Cybern., vol. 49, no. 12, pp. 4180–4193, Dec. 2019. [33] Y.-H. Jia et al., “Distributed cooperative co-evolution with adaptive computing resource allocation for large scale optimization,” IEEE Trans. Evol. Comput., vol. 23, no. 2, pp. 188–202, Apr. 2019. [34] X.-Y. Zhang, Y.-J. Gong, Y. Lin, J. Zhang, S. Kwong, and J. Zhang, “Dynamic cooperative coevolution for large scale optimization,” IEEE Trans. Evol. Comput., vol. 23, no. 6, pp. 935–948, Dec. 2019. [35] X.-F. Liu, J. Zhang, and J. Wang, “Cooperative particle swarm optimization with a bilevel resource allocation mechanism for large-scale dynamic optimization,” IEEE Trans. Cybern., vol. 53, no. 2, pp. 1000–1011, Feb. 2023. [36] K. S. Kim and Y. S. Choi, “Cooperative coevolutionary algorithm with resource allocation strategies to minimize unnecessary computations,” Appl. Soft Comput., vol. 113, 2021, Art. no. 108013. [37] M. Yang, A. Zhou, X. Lu, Z. Cai, C. Li, and J. Guan, “CCFR3: A cooperative co-evolution with efficient resource allocation for large-scale global optimization,” Expert Syst. Appl., vol. 203, 2022, Art. no. 117397. [38] N. Hansen, “The CMA evolution strategy: A tutorial,” 2016, arXiv:1604.00772. [39] X. Li, K. Tang, M. N. Omidvar, Z. Yang, and K. Qin, “Benchmark functions for the CEC’2013 special session and competition on large scale global optimization,” 2013. [Online]. Available: https://www. tflsgo.org/assets/cec2018/cec2013-lsgo-benchmarktech-report.pdf [40] D. Molina, A. LaTorre, and F. Herrera, “SHADE with iterative local search for large-scale global optimization,” in Proc. IEEE Congr. Evol. Comput., 2018, pp. 1–8. [41] A. LaTorre, S. Muelas, and J.-M. Pena, “Large scale global optimization: Experimental results with MOS-based hybrid algorithms,” in Proc. IEEE Congr. Evol. Comput., 2013, pp. 2742–2749. [42] D. Molina, M. Lozano, and F. Herrera, “MA-SW-Chains: Memetic algorithm based on local search chains for large scale continuous global optimization,” in Proc. IEEE Congr. Evol. Comput., 2010, pp. 1–8. [43] K. Tang, F. Peng, G. Chen, and X. Yao, “Population-based algorithm portfolios with automated constituent algorithms selection,” Inf. Sci., vol. 279, pp. 94–104, 2014. [44] K. Tang, S. Liu, P. Yang, and X. 
Yao, “Few-shots parallel algorithm portfolio construction via coevolution,” IEEE Trans. Evol. Comput., vol. 25, no. 3, pp. 595–607, Jun. 2021.

Lei Zhang, Haipeng Yang, Shangshang Yang, and Xingyi Zhang
Anhui University, CHINA

A Macro-Micro Population-Based Co-Evolutionary Multi-Objective Algorithm for Community Detection in Complex Networks

Abstract

Recently, multi-objective evolutionary algorithms (MOEAs) have shown promising performance in terms of community detection in complex networks. However, most studies have focused on designing different strategies to achieve good community detection performance based on a single population. Unlike these studies, this study proposes a macro-micro population-based co-evolutionary multi-objective algorithm called MMCoMO for community detection in complex networks to obtain a better trade-off between exploration and exploitation. This algorithm employs two populations, i.e., macro-population and micro-population, for co-evolution to obtain better community structures. In particular, the macro-population prefers exploration and is responsible for quickly determining approximate partitions of the network to obtain good rough community structures as early as possible, whereas the micro-population favors exploitation and is responsible for searching for good fine community structures through the local search process. Thus, these two populations can be used to improve each other through interactions in the co-evolutionary process. In particular, a guiding strategy is designed using the elite (non-dominated) solutions of the macro-population to guide the micro-population, and an influencing strategy is further

Digital Object Identifier 10.1109/MCI.2023.3277773 Date of current version: 13 July 2023


designed using the elite solutions of the micro-population to positively influence the macro-population. Experiments on synthetic networks and 14 real-world networks demonstrate the superiority of the proposed algorithm over several state-of-the-art community detection algorithms.

I. Introduction

Community detection is a crucial tool for revealing the functional and structural properties hidden in complex networks, such as social and biological networks [1]. Specifically, community detection involves dividing networks into communities based on the topology structure, in which the connections between nodes in the same community are dense (denser is better) and the connections between nodes in different communities are sparse (sparser is better) [2]. Owing to the fundamental importance of community detection in complex network analysis, many algorithms have been proposed for

Corresponding author: Xingyi Zhang (e-mail: [email protected]).

community detection, e.g., graph partitioning [3], hierarchical clustering [4], spectral clustering [5], modularity optimization-based [6], label propagation-based [7], community structure enhancement-based [8], and evolutionary algorithms (EAs) [9]. EA is a population-based metaheuristic optimization algorithm inspired by the natural evolution process. In the adoption of this algorithm, the problem is first formulated as an optimization problem. Then, the population comprising a set of individuals (solutions) evolves for several generations. In each generation, new individuals are produced by evolutionary operators, such as crossover and mutation, and good individuals are then selected by environmental selection for the next evolution. This process stops when a stop criterion is satisfied. Among the existing studies on community detection, EAs often perform well since they can automatically determine the number of communities, and domain-specific knowledge can be adopted within the method [9]. In addition, because of the conflict between the two optimization objectives, namely, maximizing the number of internal links in communities and minimizing the number of external links between communities, the task of community detection is typically formulated as a multi-objective problem, and multi-objective evolutionary algorithms (MOEAs) are designed to solve this problem. Note that MOEAs overcome some limitations of single-


objective EAs for community detection, such as the limited resolution of modularity, as well as provide a set of network divisions at different hierarchical levels [10], [11]. Owing to the effectiveness of MOEAs, numerous studies have been conducted to design various MOEAs for community detection in complex networks. In 2009, Pizzuti investigated the first multi-objective optimization-based community detection algorithm called MOGA-Net [12], and employed two objectives for optimization: the community score and fitness. The algorithm is based on the non-dominated sorting genetic algorithm, NSGA-II [13]. Experiments have shown that MOGA-Net can obtain a set of good partitions of the network at different hierarchical levels. Subsequently, several MOEA-based community detection methods were proposed [10], [11], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26]. For example, the multi-objective community detection algorithm called MOCD [10] is based on the Pareto envelope-based selection algorithm (PESA-II) [27], the multi-objective discrete particle swarm optimization algorithm called MODPSO [11] is based on the discrete particle swarm optimization algorithm (DPSO) [28], and the multi-objective ant colony optimization algorithm for community detection called MOACO [25] is based on ant colony optimization. The aforementioned MOEA-based community detection algorithms have demonstrated competitive performance on complex networks with different characteristics. However, most of these methods focus on improving community detection performance by designing different strategies based on a single population, which restricts the achievement of a good balance between exploration and exploitation in MOEAs. Exploration means that the algorithm searches the global solution space to determine diverse solutions, and exploitation means that the algorithm uses local information during the search process to generate better solutions near the current solutions. The balance between exploration and exploitation poses an important problem for metaheuristic


algorithms. Specifically, excessive exploitation leads the algorithm to quickly converge to a local optimum, whereas too much exploration slows the convergence of the algorithm [29]. Therefore, a macro-micro population-based co-evolutionary multi-objective algorithm called MMCoMO is proposed for community detection in complex networks to obtain a better trade-off between exploration and exploitation. MMCoMO considers community detection at both the macro and micro levels (the macro level focuses on exploration, and the micro level on exploitation), based on which a macro-population and a micro-population are suggested and co-evolved to obtain better performance. The main contributions of this study are summarized as follows.
❏ A macro-micro population-based co-evolution scheme is proposed to consider the community detection problem from two different levels (macro-level and micro-level). In particular, the macro-population used at the macro level is good at determining diverse, good rough community structures quickly, which prefers exploration. In addition, the micro-population used at the micro level performs well in searching for good, fine community structures through the local search process, which prefers exploitation. The main ideas of co-evolution are: 1) using the elite solutions of the macro-population to guide the micro-population and 2) applying the elite solutions of the micro-population to positively influence the macro-population. Therefore, the co-evolution of these two populations can combine their advantages to obtain better community structures.
❏ Based on the macro-micro population-based co-evolution scheme, a co-evolutionary multi-objective algorithm called MMCoMO is suggested for community detection. In MMCoMO, two individual representations are adopted to represent the two populations accordingly. In particular, the macro-population uses the binary medoid-based representation [9], which can obtain diverse,


good rough community structures quickly. The micro-population uses the vector-based representation [11], which can search for good, fine community structures through the local search process. In addition, two effective interaction strategies, namely the guidance and influence strategies, applied in the macro- and micro-populations are suggested to significantly improve the detection performance of MMCoMO.
❏ The effectiveness of the proposed MMCoMO algorithm is verified in terms of two well-known metrics, normalized mutual information (NMI) and modularity (Q), on 14 real-world networks and several Lancichinetti-Fortunato-Radicchi (LFR) synthetic networks [30] with different characteristics. The results indicate that the macro- and micro-populations can promote each other and that the proposed MMCoMO outperforms state-of-the-art non-EA- and EA-based community detection algorithms.
The remainder of this article is organized as follows. Section II introduces preliminaries of the community detection problem and related studies. Then, Section III presents the proposed MMCoMO algorithm with a detailed description of the macro-micro population-based scheme. Section IV presents the results and analysis of the experiments. Finally, Section V concludes the article and provides remarks concerning future directions.

II. Preliminaries and Related Work

This section provides preliminaries regarding the community detection problem and presents several related studies on EA-based community detection and co-evolutionary multi-objective algorithms. A. Community Detection Problem

This study focuses on the non-overlapping community detection problem, i.e., each node belongs to one and only one community, in undirected and unweighted networks. Non-overlapping community detection can often be modeled as a two-objective optimization problem by

maximizing the intra-link density in the same community and minimizing the inter-link density between different communities. This study adopted two objectives popularly used in recent works [11], [26], [31]: the first is kernel k-means (KKM) [32], for measuring the density of the intra-links, which is deduced from the objective function of k-means by replacing a kernel matrix; the second is the ratio cut (RC), for measuring the density of inter-links. Let A be the adjacency matrix and C = {C1, C2, ..., Ck} the set of communities in a solution s. The two objective functions are formally defined as follows:

    min  KKM = 2(n − k) − Σ_{i=1}^{k} L(Ci, Ci) / |Ci|
         RC  = Σ_{i=1}^{k} L(Ci, C̄i) / |Ci|,                    (1)

where n denotes the number of nodes, and k denotes the number of communities, which is automatically determined by decoding a solution and may differ for different solutions. L(Ci, Ci) = Σ_{i∈Ci, j∈Ci} A_{i,j} denotes the number of intra-links in community Ci, and L(Ci, C̄i) = Σ_{i∈Ci, j∈C̄i} A_{i,j} denotes the number of inter-links between Ci and the other communities. Note that the right operand of KKM measures the degree of the intra-link density, whereas that of RC measures the degree of the inter-link density. Evidently, minimizing the KKM and RC can ensure that the links between the nodes in the same community are dense while the links between the nodes in different communities are sparse.
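As a concrete reading of (1), the short function below (a sketch, not code from the paper) evaluates KKM and RC for a partition given as one community label per node:

import numpy as np

def kkm_rc(A, labels):
    # A: symmetric adjacency matrix; labels[i] is the community label of node i.
    # Follows the definitions in (1) literally, including the ordered-pair link counts.
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    n = A.shape[0]
    communities = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    kkm, rc = 2.0 * (n - len(communities)), 0.0
    for idx in communities:
        intra = A[np.ix_(idx, idx)].sum()     # L(C_i, C_i)
        inter = A[idx, :].sum() - intra       # L(C_i, C_i-bar)
        kkm -= intra / len(idx)
        rc += inter / len(idx)
    return kkm, rc

# Example: a 4-node path 0-1-2-3 split into {0,1} and {2,3}.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]])
print(kkm_rc(A, [0, 0, 1, 1]))   # (2.0, 1.0)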

B. Related Work

In recent decades, many EA-based community detection algorithms have been proposed. In the following, three directions of related work are reviewed: single-objective EA-based community detection, multi-objective EA-based community detection, and co-evolutionary multi-objective algorithms.

1) Single-Objective EA-Based Community Detection Algorithms
Many single-objective EA-based community detection algorithms have adopted different EA paradigms [33], [34], [35], [36], [37], [38], [39], [40]. For example,

Pizzuti [34] proposed a genetic algorithm called GA-Net for detecting communities in complex networks, where the community score was used as a fitness function, and the locus-based representation was used as the genetic representation. He et al. [36] suggested a cooperative co-evolutionary algorithm called CoCoMi for community identification with applications in cancer disease module discovery, and the network was divided into subnetworks and co-evolved with different subpopulations. In CoCoMi, modularity was adopted as a fitness function, and the subpopulations were all encoded by the vector-based representation. Based on the chemical reaction optimization algorithm (CRO), Chang et al. [38] suggested a dual-representation-based CRO called DCRO for community detection, in which modularity was adopted as a fitness function, and each solution in the population was encoded by two representations (i.e., locus- and vector-based). Experiments on synthetic and real-life networks demonstrated that DCRO can achieve better performance than other singleobjective EA-based community detection algorithms. These single-objective EAbased community detection algorithms show performance potential; however, most of them still face the challenge of a resolution limitation of modularity [10], [11]. That is, single-objective EA-based algorithms cannot detect communities smaller than a certain size by optimizing only modularity.
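For reference, the modularity Q that these single-objective algorithms optimize can be computed from the adjacency matrix as below (a generic sketch of Newman's formulation, not code from any of the cited papers):

import numpy as np

def modularity(A, labels):
    # Newman's modularity for an undirected network:
    # Q = (1/2m) * sum_ij (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()
    same_community = labels[:, None] == labels[None, :]
    return float(((A - np.outer(degrees, degrees) / two_m) * same_community).sum() / two_m)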

2) Multi-Objective EA-Based Community Detection Algorithms
Multi-objective optimization is an effective tool for overcoming the limited resolution of modularity. In addition, instead of a single optimal solution obtained using single-objective EA-based algorithms, MOEAs can obtain a set of optimal network partitions at different hierarchical levels. The first MOEA for complex network community detection, called MOGA-Net, was proposed by Pizzuti in 2009 [12]. In MOGA-Net, community detection was formulated as a multi-objective optimization problem by

simultaneously optimizing the community score and fitness, and the locus-based representation was adopted for encoding each solution in the population. MOGANet is based on the non-dominated sorting genetic algorithm (NSGA-II) [13]. MOGA-Net can achieve good performance by simultaneously optimizing multiple conflicting objectives, which can overcome some of the disadvantages of single-objective optimization, e.g., the resolution limitation of modularity. Since this preliminary work, based on different evolutionary paradigms, numerous MOEA-based methods have been proposed for detecting communities in complex networks [10], [11], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26]. For example, Shi et al. [10] proposed MOCD for community detection in complex networks, which is based on the PESA-II [27]. In MOCD, two objectives, the intra and inter objectives, were suggested for optimization, and each solution in the population was encoded by the locus-based representation. Experiments on both synthetic and real-world networks demonstrated the effectiveness of MOCD compared with several wellestablished single-objective EA-based community detection algorithms. Gong et al. [11] suggested MODPSO for detecting communities in complex networks, which was based on the DPSO [28]. In this algorithm, two objectives, KKM and RC, were suggested for optimization. In addition, a population with a special initialization method based on label propagation and a turbulence operator were designed to significantly improve performance. This algorithm adopts the vector-based representation because it is easy to decode and convenient for an intensification search. Experiments on both synthetic and real-world networks showed that MODPSO outperforms the existing single-objective EAand MOEA-based algorithms for community detection in complex networks. Attea et al. [41] modeled community detection with two designed objective functions, the intra- and inter-neighbor scores, for capturing the intra- and intercommunity structures. Afterwards, they proposed a multi-objective evolutionary


algorithm based on the framework of a multi-objective evolutionary algorithm with decomposition (MOEA/D) [42], for which a problem-specific heuristic perturbation operator was designed to improve the convergence velocity and convergence reliability of the algorithm. Experiments verified the effectiveness of the proposed community detection model and designed heuristic operator. Recently, a survey of community detection in [43] indicated that some search strategies, such as operators with domain knowledge, can be exploited to further improve the detection performance of EA-based algorithms. Based on the teaching-learning-based optimization algorithm (TLBO) [44], Chen et al. [19] proposed a multi-objective discrete TLBO called MODTLBO/D for complex network community detection. In MODTLBO/D, two objectives based on the negative ratio association (NRA) and RC were adopted for optimization, and the vector-based representation was adopted for encoding each solution in the population. Experiments on six real-world networks demonstrated the effectiveness of MODTLBO/D compared with the existing EA-based community detection algorithms (including single- and multiobjective algorithms). To solve the scalability of MOEAbased methods on large-scale complex networks, Zhang et al. [26] recently presented a network reduction-based multiobjective evolutionary algorithm called RMOEA for large-scale complex networks community detection, in which the size of the networks was recursively reduced as the evolution proceeds. In RMOEA, the two objectives suggested in [11] were adopted for optimization, and each solution in the population was encoded by the locus-based representation. Experiments on synthetic and realworld networks demonstrated the superiority of RMOEA compared with several state-of-the-art community detection algorithms for large-scale networks. In addition to the aforementioned MOEA-based algorithms focusing on non-overlapping community detection (each node in the network belonging to one and only one community), some interesting MOEA-based algorithms have


been proposed for overlapping community detection (each node in the network belongs to one or more communities) [31], [45], [46], [47], [48], [49], [50] or focused on other types of networks, such as signed [51], [52], multilayer [53], [54], [55], attributed [48], [56], [57], and dynamic networks [58], [59], [60]. The empirical results of the aforementioned algorithms have justified the superiority of MOEAs for solving community detection in complex networks over single-objective EAs and non-EAs. However, to improve the performance, most existing algorithms have employed different search strategies for a single population, which restricts the MOEAs in achieving a better balance between exploration and exploitation. Therefore, the macro-micro population-based co-evolutionary multi-objective algorithm called MMCoMO is proposed, which can further improve the quality of the detected network partitions. The proposed MMCoMO adopts the same objective functions (KKM and RC) as MODPSO [11] and RMOEA [26] to model the community detection problem. However, the proposed co-evolutionary framework differs from the evolutionary frameworks used in these algorithms. In MMCoMO, the co-evolutionary framework with macro-micro populations combines exploration and exploitation to achieve a good balance, and the micro-population adopts a local search strategy. This is similar to some memetic algorithms used for community detection [61], [62]. The main difference between MMCoMO and these memetic algorithms is that, in most memetic algorithms, only one population is considered, and designing the representation, operators, and strategies is difficult in terms of balancing the global evolution and local refinement.

3) Co-Evolutionary Multi-Objective Algorithms
To address large-scale problems and achieve better performance, co-evolutionary algorithms have been widely adopted for solving many multi-objective optimization problems [63]. Co-evolutionary multi-objective algorithms can be


divided into cooperative and competitive approaches. In cooperative approaches, the fitness of an individual is computed by collaborating with individuals of other species [64]. For example, Gong et al. [65] proposed a cooperative co-evolutionary algorithm for hyperspectral sparse unmixing problems, in which the decision variable vector was divided into subsets and assigned to different subpopulations for optimization. Typically, cooperative approaches adopt a divide-and-conquer strategy, and subpopulations are designed to solve the corresponding subproblems. In competitive approaches, the fitness of an individual is determined by encounters with individuals of other species [66]. In this type of approach, individuals of subpopulations typically compete for survival. For example, Silva et al. [67] suggested a predator-prey biogeography-based optimization (PPBBO), in which predators are modeled based on the individuals with the worst objective values and prey on the weakest individuals in the population. Unlike these co-evolutionary multi-objective algorithms, this study designed a macro-micro population-based co-evolution scheme. The proposed MMCoMO employs two populations with different representations that focus on different levels to balance the global and local searches (i.e., exploration and exploitation). The two populations improve each other during the co-evolutionary process, and no competitive interaction exists between them. Unlike CoCoMi [36], the two populations in the proposed MMCoMO were designed for the entire network and have different representations. The following section details the proposed algorithm.

III. Proposed MMCoMO Algorithm

This section describes the proposed MMCoMO algorithm for community detection. First, the general framework of the proposed algorithm is provided, including the initialization, co-evolutionary, and mergence phases. The proposed macro-micro population-based co-evolution scheme is then presented, which is a crucial component of MMCoMO. Subsequently, two individual representations

that can effectively encode the macro- and micro-populations are suggested. Finally, the two proposed interaction strategies between the macro- and micro-populations are described in detail.

Algorithm 1. General framework of MMCoMO.
Input: A: adjacency matrix of a network; gen: number of generations; pop: size of a population; pc: probability of crossover for the micro-population; pm: probability of mutation for the macro-population; gap: interval between interactions;
Output: PF: final non-dominated solutions;
Phase-1: Initialization
1: Pmi ← get the initial micro-population;
2: Pma ← get the initial macro-population;
3: SM ← get the initial similarity matrix based on A and the diffusion kernel similarity measure;
Phase-2: Co-evolution
4: for t = 1 to gen do
5:   P'ma ← get the offspring of the macro-population via uniform crossover and bitwise mutation with probability pm;
6:   Evaluate P'ma with KKM and RC;
7:   P'mi ← get the offspring of the micro-population via one-way crossover with probability pc and neighbor-based mutation;
8:   Evaluate P'mi with KKM and RC;
9:   if t % gap == 0 then
10:    Pmi ← Guidance(Pma, Pmi, P'mi, SM);
11:    Pmi ← LocalSearch(Pmi);
12:    Pma ← Influence(Pmi, Pma, P'ma, SM);
13:  else
14:    Pmi ← EnvironmentSelection(Pmi ∪ P'mi);
15:    Pma ← EnvironmentSelection(Pma ∪ P'ma);
16:  end if
17: end for
Phase-3: Mergence
18: PF ← NonDominatedSort(Pmi ∪ Pma);

A. Overall MMCoMO Procedure

MMCoMO is implemented based on the proposed macro-micro population-based co-evolution scheme introduced in Section III-B, and Algorithm 1 presents the general framework of MMCoMO. The inputs of the algorithm are defined as follows: A is the adjacency matrix of the given network, gen is the number of generations of the evolutionary process, pop is the size of a population (i.e., the number of individuals in a population), pc and pm are the probabilities of crossover for the micro-population and mutation for the macro-population, respectively, and gap is the number of generations between interactions. NonDominatedSort refers to non-dominated sorting, which is used to sort the individuals in the population and determine the solutions in the Pareto set. EnvironmentSelection refers to environment selection, which is used to select individuals for the next generation of the population. These two functions were adopted from [13]. MMCoMO consists of three phases: initialization (Lines 1-3), co-evolution (Lines 4-17), and mergence (Line 18).

In the first phase, the macro- and micro-populations are initialized. Specifically, for the macro-population, MMCoMO adopts an initialization strategy similar to that suggested in [46], with which half of the macro-population is initialized with the candidate central nodes of larger degree and the remaining half is initialized randomly. The initial macro-population is then obtained and denoted as Pma. For the micro-population, a simple strategy is adopted for each individual in the population: one neighbor of each node vi is randomly selected as the label of vi. The initial micro-population is then obtained and denoted as Pmi. In addition, based on the adjacency matrix A, the initial similarity matrix (denoted by SM) can be obtained with a size of n × n, where each element SM_{i,j} is the diffusion kernel similarity value [68] between nodes vi and vj.

In the second phase, based on the suggested co-evolution scheme, the macro- and micro-populations evolve individually and interact with each other. In particular, at each generation, the offspring of the macro-population (denoted as P'ma), which adopts the medoid-based representation [46], is generated via uniform crossover and bitwise mutation with probability pm [46] (Line 5). The offspring of the micro-population (denoted as P'mi), which adopts the vector-based representation [11], is generated via one-way crossover [33] with probability pc and neighbor-based mutation (i.e., the node's label is changed to that of a neighboring node) with probability 1/n, where n is the number of nodes in the network (Line 7). The two offspring populations are evaluated using the KKM and RC (Lines 6 and 8). Then, the interactions between the macro- and micro-populations are performed every gap generations (Lines 10-12). The proposed guidance strategy (Algorithm 2) uses the elite individuals of the macro-population to guide the micro-population. Subsequently, the local search suggested in [38] is applied to the non-dominated solutions in the micro-population to improve the modularity. Finally, the proposed influence strategy (Algorithm 3) uses elite individuals of the micro-population to positively influence the macro-population. This process is repeated in the second phase until the maximum generation gen is reached.

In the last phase, the final macro-population Pma and micro-population Pmi are obtained and merged into a single population. Subsequently, the non-dominated sorting method [13] is adopted to obtain the optimal solutions (i.e., PF) on the first front. Section II of the supplementary material presents the complexity analysis of the proposed algorithm. The following introduces the proposed macro-micro population-based co-evolution scheme, which is a key component of MMCoMO.
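As an informal illustration of the control flow described above, the following Python skeleton mirrors Algorithm 1. It is a minimal sketch, not an official implementation: all operators (initialization, variation, evaluation, guidance, local search, influence, environment selection, and non-dominated sorting) are assumed to be supplied as callables in ops, and none of the names or signatures below are taken from the authors' code.

def mmcomo(A, ops, gen=50, pop=100, pc=0.1, pm=0.1, gap=10):
    """Skeleton of Algorithm 1; `ops` bundles the operators described in the text
    (hypothetical callables supplied by the caller)."""
    SM = ops["similarity"](A)                      # Phase-1: diffusion kernel similarity
    P_ma = ops["init_macro"](A, pop)               # medoid-based individuals
    P_mi = ops["init_micro"](A, pop)               # vector-based individuals

    for t in range(1, gen + 1):                    # Phase-2: co-evolution
        off_ma = ops["vary_macro"](P_ma, pm)       # uniform crossover + bitwise mutation
        off_mi = ops["vary_micro"](P_mi, pc)       # one-way crossover + neighbor mutation
        ops["evaluate"](off_ma, A)                 # KKM and RC
        ops["evaluate"](off_mi, A)
        if t % gap == 0:                           # periodic macro-micro interaction
            P_mi = ops["guidance"](P_ma, P_mi, off_mi, SM)
            P_mi = ops["local_search"](P_mi, A)
            P_ma, SM = ops["influence"](P_mi, P_ma, off_ma, SM, t, gen)
        else:
            P_mi = ops["env_select"](P_mi + off_mi, pop)
            P_ma = ops["env_select"](P_ma + off_ma, pop)

    return ops["nd_sort"](P_mi + P_ma)[0]          # Phase-3: mergence, first Pareto front

The gap-controlled branch is the only coupling point between the two populations, which is what allows each population to keep its own representation and variation operators.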

B. Macro-Micro Population-Based Co-Evolution Scheme

1) Main Idea
To obtain a better balance between exploration and exploitation in MOEA-based community detection algorithms, a macro-micro population-based co-evolution scheme is suggested for the proposed MMCoMO, and Figure 1 shows the overall framework, in which the macro-population at the macro level prefers exploration and the micro-population at the micro level prefers exploitation. The elite solutions of the macro-population are used to guide the micro-population, and those of the micro-population are used to positively influence the macro-population during the co-evolution process. Specifically, the community detection problem in the proposed scheme is considered at two different levels (the macro and micro levels), on which two corresponding populations (i.e., the macro- and micro-populations) are based. In particular, the macro-population at the macro level should be good at quickly determining diverse, good rough community structures, and the micro-population at the micro level should perform well in searching for good, fine community structures through a local search process. That is, the macro-population prefers exploration, whereas the micro-population prefers exploitation. Thus, these two populations can co-evolve by interacting with each other to promote each other mutually. The main idea of co-evolution between the two populations is that the elite solutions of the macro-population are used to guide the micro-population, while those of the micro-population are used to positively influence the macro-population. Specifically, through periodic interactions between the two populations, some preliminary elite solutions, i.e., rough community structures, can be provided early by the macro-population and further adjusted by the micro-population. After the local search procedure, the elite solutions in the micro-population are used to positively influence the macro-population.

FIGURE 1 Overall framework of the proposed macro-micro population-based co-evolution scheme.

2) Key Challenges
As can be seen in the proposed co-evolution scheme, two key challenges must be solved. The first challenge is how to represent the macro- and micro-populations to meet the different requirements of the two populations (i.e., the macro-population prefers exploration and the micro-population prefers exploitation). That is, the key principle for choosing the representation of the macro-population is that a small change in the representation should cause a large change in the community structures, whereas that for the micro-population is that a small change in the representation should cause only a small change in the community structures. The second challenge is how to design specific strategies for the co-evolution between the macro- and micro-populations so that they promote each other. In particular, this challenge can be divided into two sub-challenges: how to design an effective guidance strategy using the elite solutions of the macro-population to guide the micro-population, and how to design an effective influence strategy using the elite solutions of the micro-population to positively influence the macro-population. To solve the first challenge, this article suggests two different representations in Section III-C to effectively encode the macro- and micro-populations. To solve the second challenge, this article proposes two specific strategies in Section III-D, i.e., a guidance and an influence strategy, for significantly improving the detection performance of MMCoMO.

C. Macro-Micro Population Representations

1) Macro-Population and Its Representation for Exploration
As mentioned above, the macro-population is responsible for determining diverse, good rough community structures quickly, which can be used to guide the search of the micro-population. To satisfy this requirement, a modified medoid-based representation used in [46] was adopted to encode this population, in which the centers of the communities are called "central nodes." Specifically, this representation is defined as a binary vector b:

$b = [b_1, b_2, \ldots, b_n], \quad b_i \in \{0, 1\}\ (i = 1, 2, \ldots, n),$  (2)

where b_i denotes whether node v_i is a central node; specifically, b_i = 1 indicates that v_i is a central node, and b_i = 0 indicates that it is not. According to this representation, the nodes in the network can be classified into two sets, the non-central node set NN and the central node set CN, which are formally defined as

$NN = \{NN_1, NN_2, \ldots, NN_r\}, \quad CN = \{CN_1, CN_2, \ldots, CN_s\},$  (3)

where NN_i is the i-th node in NN whose corresponding element in b is 0 (1 ≤ i ≤ r, r is the number of non-central nodes in the network), and CN_j is the j-th node in CN whose corresponding element in b is 1 (1 ≤ j ≤ s, s is the number of central nodes in the network, and r + s = n). Then, the similarity matrix SM = (SM_{i,j})_{n×n} can be obtained, in which each element SM_{i,j} is calculated based on a similarity measure called the diffusion kernel similarity [68]. From SM, the similarities between the non-central and central nodes are directly obtained, and each element in the membership matrix U = (U_{i,j})_{|NN|×|CN|} is defined as follows:

$U_{i,j} = \frac{SM_{NN_i, CN_j}}{\sum_{t=1}^{|CN|} SM_{NN_i, CN_t}},$  (4)

where U_{i,j} represents the membership degree between NN_i and CN_j, and the denominator is the normalization factor. Based on the above definitions, the non-overlapping communities can be easily obtained. In particular, each central node is considered as the center of one community, and each non-central node NN_i belongs to the community centered at the central node CN_j (denoted as label(NN_i)) if U_{i,j} in (4) is maximized in the i-th row of the matrix U, which can be defined as follows:

$\mathrm{label}(NN_i) = CN_j, \quad \text{where } j = \arg\max_{t = 1, \ldots, |CN|} U_{i,t}.$  (5)

A small real network, ENZYMESg124 [69] (containing 14 nodes and 31 edges), is considered as an example to illustrate the procedure of the medoid-based representation, as shown in Figure 2. Consider the two similar individuals Ind1 = [0,1,0,0,1,0,0,0,0,0,0,0,0,0] and Ind2 = [0,1,0,0,1,0,0,0,0,0,0,0,1,0], where only the value of node v13 is different. Note that Ind1 has two central nodes {v2, v5}, and Ind2 has three central nodes {v2, v5, v13}. Membership matrices U1 and U2 can be obtained using (4). Finally, the partition of the network for Ind1 is {{v1, v5, v6, v8, v10, v11, v12, v13, v14}, {v2, v3, v4, v7, v9}}, whereas that for Ind2 is {{v1, v11, v12, v13, v14}, {v5, v6, v8, v10}, {v2, v3, v4, v7, v9}}. The difference between Ind1 and Ind2 is small; however, the partitions decoded by these two individuals, Partition-1 and Partition-2, are completely different. Thus, a small change in the medoid-based representation can be observed to cause a large change in the community structures. The medoid-based representation can change the macroscopic structure of communities quickly and directly, thereby meeting the requirement of a macro-population that prefers exploration.

FIGURE 2 Illustrative example of the medoid-based representation scheme.
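To make the decoding defined by (3)-(5) concrete, the following is a minimal NumPy sketch; the function name decode_medoid and the dense similarity-matrix input are illustrative assumptions rather than part of the original description.

import numpy as np

def decode_medoid(b, SM):
    """Decode a binary medoid-based individual b into community labels.
    b  : 0/1 array of length n; b[i] == 1 marks node i as a central node.
    SM : n x n similarity matrix (e.g., diffusion kernel similarity).
    Returns label[i] = index of the central node whose community node i joins.
    (Sketch; assumes at least one central node and positive similarities.)"""
    b = np.asarray(b)
    CN = np.flatnonzero(b == 1)               # central nodes, Eq. (3)
    NN = np.flatnonzero(b == 0)               # non-central nodes, Eq. (3)
    label = np.empty(len(b), dtype=int)
    label[CN] = CN                            # each central node anchors its own community
    U = SM[np.ix_(NN, CN)]                    # membership degrees, Eq. (4)
    U = U / U.sum(axis=1, keepdims=True)
    label[NN] = CN[np.argmax(U, axis=1)]      # assign by largest membership, Eq. (5)
    return label

Applied to Ind2 = [0,1,0,0,1,0,0,0,0,0,0,0,1,0] from the example above, every node would be assigned to one of the three central nodes v2, v5, or v13 (0-based indices 1, 4, and 12), yielding a partition with at most three communities.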

2) Micro-Population and Its Representation for Exploitation
As mentioned above, the micro-population is responsible for searching for good, fine community structures through a local search process, which can be used to positively influence the search of the macro-population. To meet this requirement, the vector-based representation [11] was adopted to encode this population. In particular, this representation is defined as a label vector l:

$l = [l_1, l_2, \ldots, l_n], \quad l_i \in \{1, 2, \ldots, n\}\ (i = 1, 2, \ldots, n),$  (6)

where l_i denotes the community label to which node v_i belongs. Based on this representation, non-overlapping communities can be easily obtained, and nodes with the same label are in the same community. An obvious advantage of this representation is that it can adjust every node independently. That is, if the label of one node is modified, only one node is influenced, which is convenient for local searches [11], [38]. Figure 3 provides an illustrative example of the vector-based representation used to detect the communities of the small real network ENZYMESg124 [69]. Consider the two similar individuals Ind1 = [1,2,2,2,1,1,2,1,2,1,1,1,1,1] and Ind2 = [1,2,2,2,1,1,2,2,2,1,1,1,1,1], where only the label of node v8 is different. Then, the partition of the network for Ind1 is {{v1, v5, v6, v8, v10, v11, v12, v13, v14}, {v2, v3, v4, v7, v9}}, whereas that for Ind2 is {{v1, v5, v6, v10, v11, v12, v13, v14}, {v2, v3, v4, v7, v8, v9}}. Thus, a tiny change in the vector-based representation can be observed to cause a small change in the community structures. The difference between Ind1 and Ind2 is considerably small, and that between the partitions decoded by these two individuals is also small.



FIGURE 3 Illustrative example of the vector-based representation scheme.

Therefore, an individual in the micro-population can be adjusted microscopically to obtain better community structures, thereby meeting the requirement of a micro-population that prefers exploitation in the proposed co-evolution scheme. In summary, the macro-population, which adopts the medoid-based representation, is good at quickly finding diverse, good rough community structures. However, it is unsuitable for a local search. The micro-population, which uses the vector-based representation, is good at searching for good, fine community structures through a local search process. However, it cannot quickly obtain various good rough community structures. Based on these two representations, the two populations can be used to improve each other by interacting in the proposed co-evolution scheme; thus, a better balance between exploration and exploitation can be obtained. Section I of the supplementary material provides an experimental analysis of the medoid-based and vector-based representations.
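For completeness, a minimal Python sketch of the vector-based side is given below: decoding a label vector into communities and applying the neighbor-based mutation used for the micro-population offspring. The helper names and the adjacency-list input format are assumptions made for this example.

import random
from collections import defaultdict

def communities_from_labels(labels):
    """Group node indices by community label (vector-based decoding)."""
    groups = defaultdict(list)
    for node, lab in enumerate(labels):
        groups[lab].append(node)
    return list(groups.values())

def neighbor_based_mutation(labels, adjacency, prob=None):
    """Change each node's label to that of a random neighbor with probability 1/n
    (or a user-chosen probability). `adjacency[i]` is the list of neighbors of node i."""
    n = len(labels)
    prob = 1.0 / n if prob is None else prob
    mutated = list(labels)
    for node in range(n):
        if adjacency[node] and random.random() < prob:
            mutated[node] = labels[random.choice(adjacency[node])]
    return mutated

Because each mutation touches exactly one entry of the label vector, the resulting change in the community structure is local, which is precisely the exploitation behavior expected of the micro-population.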

D. Macro-Micro Population Interaction Strategies

1) Guidance Strategy
Individuals from the macro-population have a good rough community structure and diversity in the early stage, and can thus be used to guide the micro-population to skip the early search procedure. The main idea of the guidance strategy is to use elite individuals in the macro-population to guide the evolution of the micro-population.

Algorithm 2. Guidance(Pma, Pmi, P'mi, SM).
Input: Pma: macro-population; Pmi: micro-population; P'mi: offspring of the micro-population; SM: similarity matrix;
Output: P*mi: micro-population of the next generation;
1: PF ← NonDominatedSort(Pma);
2: PFvec ← ∅;
3: for each ind in PF do
4:   NN, CN ← get the non-central node set NN and central node set CN of ind according to (3);
5:   U_ind ← obtain the membership matrix of ind according to (4) and the similarity matrix SM;
6:   for each node vi in NN do
7:     ind_vec ← get the vector-based representation of ind by assigning vi the label of the central node vj in CN with the largest membership value (see (5));
8:   end for
9:   PFvec ← PFvec ∪ ind_vec;
10: end for
11: Evaluate PFvec using the KKM and RC;
12: P*mi ← EnvironmentSelection(PFvec ∪ Pmi ∪ P'mi);
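The following minimal sketch mirrors the steps of Algorithm 2, reusing the decode_medoid sketch given earlier; nd_sort, evaluate, and env_select stand for the non-dominated sorting, KKM/RC evaluation, and environment selection operators, and their names and signatures are assumptions made for illustration.

def guidance(P_ma, P_mi, off_mi, SM, nd_sort, evaluate, env_select):
    """Sketch of Algorithm 2: elite macro individuals guide the micro-population."""
    elites = nd_sort(P_ma)[0]                              # Line 1: first front of the macro-population
    PF_vec = [decode_medoid(ind, SM) for ind in elites]    # Lines 3-10: medoid -> label vector
    for ind_vec in PF_vec:                                 # Line 11: evaluate with KKM and RC
        evaluate(ind_vec)
    return env_select(PF_vec + P_mi + off_mi)              # Line 12: micro-population of the next generation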


Algorithm 2 presents the main procedure of the proposed guidance strategy, which consists of the following three steps. Let Pma and Pmi be the macro- and micro-populations, respectively, P'mi be the offspring of the micro-population, and SM be the similarity matrix. The first step obtains the elite individuals PF in Pma via non-dominated sorting (Line 1). The second step then transforms each ind with a medoid-based representation in PF into a vector-based representation (Lines 3-9). In particular, according to (3), the non-central node set NN and central node set CN of ind are first obtained (Line 4). Subsequently, based on the similarity matrix SM, the membership matrix U_ind of ind can be obtained according to (4) (Line 5). Finally, ind is transformed into an individual ind_vec with a vector-based representation by assigning each node vi in NN the label of the central node vj in CN with the largest membership value according to (5) (Lines 6-9). In the final step, the transformed individuals are evaluated based on the KKM and RC, the transformed elite individuals of the macro-population are merged into the micro-population, and the micro-population of the next generation is obtained by selecting individuals according to the non-dominated sorting and crowding distance (Lines 11-12).

Figure 4 presents an example to illustrate the main steps involved in transforming one individual in the macro-population into its vector-based representation. First, the nodes are divided into the central node set CN and the non-central node set NN according to the medoid-based individual. Then, each node in CN forms a community, and each node in NN is assigned to a community using the membership matrix U, which is calculated using (4) based on the similarity matrix SM. This figure presents one medoid-based individual: ind = [0,1,0,0,1,0,1,0,0,0,0,0,1,0]. According to (3), the non-central node set NN = {v1, v3, v4, v6, v8, v9, v10, v11, v12, v14} and the central node set CN = {v2, v5, v7, v13} are obtained. Subsequently, based on the similarity matrix SM (of size 14 × 14), the membership matrix U of ind can be obtained according to (4). Finally, ind can be transformed into a vector-based individual ind_vec by assigning each node vi in NN the label of the central node vj in CN with the largest membership value according to (5). For example, the label of node v1 is assigned 13 because the membership value between non-central node v1 and central node v13 is the largest in the first row of U. Similarly, the remaining non-central nodes {v3, v4, v6, v8, v9, v10, v11, v12, v14} are assigned labels [2, 2, 5, 7, 2, 5, 13, 13, 13]. Thus, the transformed vector-based individual is ind_vec = [13, 2, 2, 2, 5, 5, 7, 7, 2, 5, 13, 13, 13, 13].

FIGURE 4 The main steps of transforming one individual in the macro-population into its vector-based representation.

2) Influence Strategy
Note that the micro-population using the vector-based representation is good at searching for fine community structures through a local search process, which can be used to positively influence the macro-population to obtain better community structures. The main idea of the influence strategy is to use elite individuals in the micro-population to positively influence the evolution of the macro-population.

Specifically, two sub-strategies are considered. One involves using elite individuals in the micro-population to influence the macro-population indirectly by updating the similarity matrix SM. The other involves transforming the elite individuals in the micro-population into the medoid-based representation and directly placing them into the macro-population. Algorithm 3 presents the main procedure of the suggested influence strategy, which is performed as follows. Let Pma and Pmi be the macro- and micro-populations, respectively, P'ma be the offspring of the macro-population, and SM be the similarity matrix. First, the elite individuals PF in Pmi are obtained via non-dominated sorting (Line 1). Then, based on PF, a membership-voting matrix SM_v can be generated, in which SM_v[i,j] indicates the proportion of elite individuals that assign nodes vi and vj to the same community among all elite individuals (Lines 3-12). Subsequently, the similarity matrix SM* can be updated by combining the previous similarity matrix SM_p and the membership-voting matrix SM_v (Line 13), which is defined as follows:

$SM^{*} = (1 - r) \cdot SM_p + r \cdot SM_v, \quad r = 0.5 \cdot t / \mathrm{gen},$  (7)

where t and gen are the current and maximum generation numbers, respectively, and r is an adaptive weight parameter used to control the balance between SM_p and SM_v. Note that the elite individuals in the micro-population improve as the population evolves; thus, the importance of SM_v in updating the similarity matrix increases with each generation.

Algorithm 3. Influence(Pmi, Pma, P'ma, SM_p).
Input: Pmi: micro-population; Pma: macro-population; P'ma: offspring of the macro-population; SM_p: previous similarity matrix;
Output: P*ma: macro-population of the next generation;
1: PF ← NonDominatedSort(Pmi);
2: PFmed ← ∅;
3: SM_v ← the membership-voting matrix initialized with a zero matrix;
4: for each ind in PF do
5:   for i = 1 to |V| do
6:     for j = 1 to |V| do
7:       if nodes vi and vj are in the same community then
8:         SM_v[i][j] ← SM_v[i][j] + 1/|PF|;
9:       end if
10:    end for
11:  end for
12: end for
13: SM* ← update the similarity matrix according to (7);
14: for each ind in PF do
15:   C = {C1, C2, ..., Ck} ← decode ind as k communities;
16:   CN ← ∅;
17:   for i = 1 to k do
18:     vc ← get the central node in community Ci according to (8);
19:     CN ← CN ∪ vc;
20:   end for
21:   ind_med ← get the medoid-based representation of ind by setting each central node in CN to 1;
22:   PFmed ← PFmed ∪ ind_med;
23: end for
24: Evaluate PFmed using the KKM and RC;
25: P*ma ← EnvironmentSelection(PFmed ∪ Pma ∪ P'ma);
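The two matrix operations in Lines 3-13 of Algorithm 3, together with the update in (7), can be sketched in a few lines of NumPy; the function names below are illustrative assumptions.

import numpy as np

def membership_voting_matrix(elite_label_vectors, n):
    """SM_v[i, j] = fraction of elite micro individuals placing nodes i and j
    in the same community (Algorithm 3, Lines 3-12)."""
    SM_v = np.zeros((n, n))
    for labels in elite_label_vectors:
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :])   # co-membership indicator
        SM_v += same / len(elite_label_vectors)
    return SM_v

def update_similarity(SM_p, SM_v, t, gen):
    """Adaptive update of Eq. (7): SM* = (1 - r) * SM_p + r * SM_v, with r = 0.5 * t / gen."""
    r = 0.5 * t / gen
    return (1.0 - r) * SM_p + r * SM_v

As the design intends, the weight r grows linearly with the generation counter, so the voting information contributed by the (improving) micro-population elites gradually dominates the original diffusion kernel similarity.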



Next, the PF encoded by the vector-based representation is transformed into the medoid-based representation set PFmed (Lines 14-23). In particular, each ind in PF can be decoded as k communities {C1, C2, ..., Ck}. For each community Ci (i = 1, 2, ..., k), node vc in Ci is considered as the central node of this community if the sum of the diffusion kernel similarity values between vc and the other nodes in Ci is maximized, which is formally defined as

$v_c = \arg\max_{v_c \in C_i} \sum_{v_j \in C_i,\, v_j \neq v_c} SM_{c,j}.$  (8)

Based on this equation, for each ind in PF, the medoid-based representation ind_med can be obtained by setting all central nodes to 1. Subsequently, PFmed can be obtained (Line 22). Finally, the transformed elite individuals PFmed of the micro-population are evaluated using the KKM and RC and then incorporated into the macro-population, and the macro-population of the next generation can be obtained via environment selection (Lines 24-25).

Figure 5 presents an example of transforming one individual in the micro-population into the corresponding medoid-based representation. First, the similarity matrix is used to choose a central node in each community of the vector-based individual, and these central nodes are then used to construct a medoid-based individual using (8). In this figure, suppose that the vector-based individual is ind = [1,2,2,2,1,1,2,1,2,1,1,1,1,1]. The decoded communities for ind are Community-1 = {v1, v5, v6, v8, v10, v11, v12, v13, v14} and Community-2 = {v2, v3, v4, v7, v9}. According to (8), node v5 is considered as the central node of Community-1 because v5 has the largest similarity value with respect to the remaining nodes ({v1, v6, v8, v10, v11, v12, v13, v14}), and the largest value is 6.83 (SM_{5,1} + SM_{5,6} + SM_{5,8} + SM_{5,10} + SM_{5,11} + SM_{5,12} + SM_{5,13} + SM_{5,14}). Similarly, node v9 is considered as the central node of Community-2, with a largest value of 3.77. Thus, the transformed medoid-based individual is ind_med = [0,0,0,0,1,0,0,0,1,0,0,0,0,0].
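The reverse transformation of Lines 14-23, which selects one central node per community according to (8), can be sketched as follows; encode_medoid is an illustrative name, and the community assignment is assumed to be given as a label vector.

import numpy as np

def encode_medoid(labels, SM):
    """Sketch of Lines 14-23 of Algorithm 3: mark the central node of each community,
    chosen by Eq. (8) as the member with the largest total similarity to the others."""
    labels = np.asarray(labels)
    b = np.zeros(len(labels), dtype=int)
    for lab in np.unique(labels):
        members = np.flatnonzero(labels == lab)
        # Similarity of each member to the other members of the same community.
        totals = SM[np.ix_(members, members)].sum(axis=1) - np.diag(SM)[members]
        b[members[np.argmax(totals)]] = 1
    return b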

FIGURE 5 The main steps of transforming one individual in the micro-population into its medoid-based representation.

IV. Experiments and Results

This section verifies the performance of the proposed MMCoMO through a series of experiments on 14 real-world networks and several LFR synthetic networks, in comparison with ten representative baselines and three variants of MMCoMO.

A. Experiment Settings

1) Test Networks
In this study, 14 popular real-world networks with different characteristics were used to evaluate the performance of the proposed algorithm. Specifically, these networks included the Zachary karate club [70], the Dolphin social network [71], the American college football [72], the Books about US politics [72], the Yeast PPI dataset [73], the Blogs network [45], the PGP network [75], five collaboration networks (Ca-GrQc, Ca-HepTh1, Ca-HepTh2, Ca-AstroPh, and Ca-CondMat) [74],


Epinions [69], and Enron-large [69]. Table I presents the detailed characteristics of the 14 real-world networks. Note that the Karate, Dolphin, Polbooks, Football, and Yeast networks have a ground-truth community structure, whereas the true community structure of the remaining nine networks is unknown. The 14 networks have different scales, assortativity coefficients [76], and average clustering coefficients [77]. To fully demonstrate the performance of the proposed algorithm on different networks with ground truths, four groups of LFR synthetic networks were used in our experiments. The first group of LFR networks had a mixing parameter m ranging from 0.1 to 0.8 with an interval of 0.1, and the sizes of the networks were fixed at n = 1000. The second group of LFR networks had sizes ranging from 500 to 2500 with an interval of 500 and a mixing parameter m fixed at 0.6. The third group of LFR networks had a mixing

TABLE I Characteristics of 14 real-world networks. "AD" means the average degree of the nodes in the network, "RC" means the number of real communities in the network ("-" indicates the real community structure is unknown), "AC" means the assortativity coefficient, and "CC" means the average clustering coefficient.

REAL NETWORKS       #NODES   #LINKS     AD      AC     CC    RC
Karate [70]             34       78    4.59    0.48   0.57    2
Dolphin [71]            62      159    5.13    0.04   0.26    2
Polbooks [72]          105      441    8.4     0.13   0.49    3
Football [72]          115      613   10.66    0.16   0.40   12
Yeast [73]           2,361    7,182    6.08    0.08   0.13   13
Blogs [45]           3,982    6,803    3.42    0.13   0.28    -
Ca-GrQc [74]         5,242   14,496    5.53    0.66   0.53    -
Ca-HepTh1 [74]       9,877   25,998    5.26    0.27   0.47    -
PGP [75]            10,680   24,316    4.55    0.24   0.27    -
Ca-HepTh2 [74]      12,008  118,521   19.74    0.63   0.61    -
Ca-AstroPh [74]     18,772  198,110   21.11    0.21   0.63    -
Ca-CondMat [74]     23,133   93,497    8.08    0.14   0.63    -
Epinions [69]       26,588  100,120    7.53    0.06   0.14    -
Enron-large [69]    33,696  180,811   10.73    0.12   0.51    -

TABLE II Parameter settings of the EA-based comparison algorithms, where pop denotes the population size, gen is the maximum number of generations, pc is the crossover probability, pm is the mutation probability, ns is the neighborhood size, and gap is the interval between interactions.

ALGORITHM      pop   gen   pc    pm    ns   gap   REFERENCE
CoCoMi         100   100   0.9   0.1   -    -     [36]
DCRO           -     -     -     -     -    -     [38]
SOSFCD         100   100   -     -     -    -     [78]
MOCD           100   100   0.9   0.1   -    -     [10]
MOEA/D-PM      100   100   0.8   0.2   5    -     [41]
MODTLBO/D      100   100   0.9   0.1   40   -     [19]
MODPSO         100   100   -     0.1   40   -     [11]
RMOEA          100   100   0.9   0.1   40   -     [26]
MMCoMO         100   50    0.1   0.1   -    10    [ours]

parameter m ranging from 0.1 to 0.8 with an interval of 0.1, and the sizes of the networks were fixed at n = 5000. The fourth group of LFR networks had sizes ranging from 3000 to 5000 with an interval of 500 and the mixing parameter m fixed at 0.6. For the four groups of LFR networks, the remaining parameters were set as follows. The average degree d_ave was set to 20, the maximum degree d_max was set to 50, and the community size ranged from 20 to 100. In addition, the exponent of the power-law distribution of node degrees, τ1, was set to 2, and that of the community sizes, τ2, was set to 1.

2) Comparison Algorithms
In this study, ten community detection algorithms were adopted as baselines for comparison, including two non-EA-based algorithms and eight EA-based algorithms. The two non-EA-based algorithms were CSE [8] and SSCF [5]. CSE is a state-of-the-art community structure enhancement-based algorithm, and SSCF is a representative spectral-clustering-based algorithm. The eight representative EA-based algorithms included CoCoMi [36], DCRO [38], SOSFCD [78], MOCD [10], MODPSO [11], MOEA/D-PM [41] (the authors did not name the algorithm, and this study adopted the

name MOEA/D-PM used in [23]), MODTLBO/D [19], and RMOEA [26]. Among these EA-based algorithms, CoCoMi, DCRO, and SOSFCD are three single-objective EA-based algorithms, and the remaining five are multi-objective EA-based algorithms. The parameters of the two non-EA-based algorithms (CSE [8] and SSCF [5]) were set to those recommended in the literature. In addition, the parameters of all comparison EA-based algorithms were set to those recommended in [11], [26], which are listed in Table II. Because the parameters of DCRO are significantly different from those of the other EAs, this study adopted the parameters suggested in [38]. Noting that the proposed MMCoMO includes two populations, for a fair comparison the size of each population in MMCoMO was set to pop = 100, and the maximum number of generations was set to half that of the other EAs (i.e., gen = 50). In addition, the interval between interactions was set to gap = 10, the mutation probability for the macro-population to pm = 0.1, and the crossover probability for the micro-population to pc = 0.1. To obtain more diverse solutions, the probability of crossover pc in competing algorithms with only one population is typically set to a high value (e.g., 0.9). Unlike the competing algorithms, in MMCoMO the macro-population focuses on exploration, where uniform crossover is adopted to produce diverse solutions. Meanwhile, the micro-population focuses on exploitation, where the probability of one-way crossover should be set to a low value (e.g., 0.1) to obtain solutions with better convergence. The results of all comparison algorithms on all test networks were obtained by averaging over 20 independent runs. All experiments were conducted on computers with an Intel Core i7-8700K 3.70 GHz CPU, 32 GB of RAM, and the Windows 10 operating system.

3) Evaluation Metrics
To evaluate the quality of the solutions, two popular evaluation metrics were adopted. These include the modularity



TABLE III Comparison results of Q on the 14 real-world networks, where '+,' '-,' and '≈' indicate that the performance is significantly better, significantly worse, and statistically similar to that of MMCoMO, respectively. The first symbol is the result of the Wilcoxon rank sum test, and the second symbol is the result of pairwise t-tests. Note that '/' indicates that Q values are not provided because these results cannot be obtained within 12 h for one run, and 'O' indicates that the results are not provided because the algorithms run out of memory.

NETWORK

Karate

Dolphin

Polbooks

Football

Yeast

Blogs

Ca-GrQc

Ca-HepTh1

PGP

Ca-HepTh2

Ca-AstroPh

Ca-CondMat


MEASURE

CSE

SSCF

CoCoMi

DCRO

SOSFCD

MOCD

MODPSO

MOEA/D-PM

MODTLBO/D

RMOEA

MMCoMO

QMax

0.3826

0.3600

0:4198

0:4198

0:4198

0:4198

0:4198

0.4188

0:4198

0.3875

0:4198

QAvg

0.3826=

0.3600=

0:4198 =

0:4198 =

0.4124=

0.4129=

0.4121=

0.4172=

0:4198 =

0.3824=

0:4198

std

0

0

0

0

0.0092

0.0072

0.0078

0.0016

0

0.0067

0

runtimeðsÞ

0.2

0.1

4.9

3.5

2.5

19.2

5.5

49.1

31.8

6.4

1.0

QMax

0.3476

0.3787

0:5285

0:5285

0:5285

0.5265

0.5268

0.5166

0.5220

0.5268

0.5268

QAvg

0.3476=

0.3787=

0.5274þ=þ

0:5279þ=þ

0.5167=

0.5155=

0.5177=

0.5163=

0.5205=

0.5261þ=þ

0.5233 0.0029

std

0

0

0.0006

0.0007

0.0058

0.0089

0.0066

0.0008

0.0012

0.0009

runtimeðsÞ

0.2

0.1

11.7

6.4

4.5

18.4

5.8

93.5

55.3

8.6

1.5

QMax

0.5001

0.4860

0:5272

0:5272

0.5267

0.5268

0.5262

0.5262

0.5269

0.5268

0:5272

QAvg

0.5001=

0.4860=

0.5271þ=þ

0:5272þ=þ

0.5134=

0.5184=

0.5250=

0.5262=

0.5268 =

0.5239=

0.5269

std

0

0

0.0001

0.0001

0.0083

0.0102

0.0018

0

0.0002

0.0021

0.0003

runtimeðsÞ

0.9

0.1

29.3

11.5

9.398

19.4

9.0

34.3

94.3

9.7

1.9

QMax

0.5989

0.6005

0:6046

0:6046

0:6046

0.5822

0:6046

0.6043

0:6046

0:6046

0:6046

QAvg

0.5989=

0.5981=

0:6045þ=

0.5996=

0.5749=

0.5249=

0.6017=

0.6043 =

0.6044 =

0.6044 =

0.6039

std

0

0.0020

0.0001

0.0049

0.0125

0.0189

0.0029

0

0.0003

0.0001

0.0013

runtimeðsÞ

1.0

0.1

15.5

15.1

11.7

21.5

9.3

38.7

102.9

9.6

1.5

QMax

0.4985

0.4492

0.5689

0.5464

0.4856

0.4598

0.5447

0.5557

0.4315

0.5049

0:5818

QAvg

0.4927=

0.4384=

0.5659=

0.5419=

0.4843=

0.4453=

0.5213=

0.5442=

0.4232=

0.4926=

0:5732

std

0.0020

0.0052

0.0019

0.0025

0.0019

0.0085

0.0136

0.0083

0.0070

0.0063

0.0053

runtimeðsÞ

55.17

13.7

8901.3

711.4

27152.9

90.4

453.7

3947.9

21347.8

481.2

259.1

QMax

0.7735

0.8051

0.7646

0.7858

/

0.7467

0.8062

0.8075

/

0.7908

0:8379

QAvg

0.7695=

0.7976=

0.7562=

0.7768=

/

0.7367=

0.7946=

0.8020=

/

0.7844=

0:8336

std

0.0017

0.0039

0.0044

0.0042

/

0.0072

0.0069

0.0037

/

0.0031

0.0028

runtimeðsÞ

30.4

454.3

26663.7

1370.1

/

204.4

2156.6

7851.2

/

730.5

642.6

QMax

0.7570

0.7648

/

0.7868

/

0.7132

0.8090

0.8094

/

0.7959

0:8357

QAvg

0.7547=

0.7551=

/

0.7816=

/

0.6995=

0.7948=

0.8042=

/

0.7872=

0:8262

std

0.0010

0.0040

/

0.0020

/

0.0087

0.0114

0.0041

/

0.0037

0.0044

runtimeðsÞ

186.1

133.5

/

2905.4

/

400.3

3819.0

25590.7

/

1195.7

1040.8

QMax

0.6148

0.6547

/

0.6672

/

0.5502

0.7012

/

/

0.6500

0:7347

QAvg

0.6124=

0.6460=

/

0.6631=

/

0.5502=

0.6746=

/

/

0.6440=

0:7203

std

0.0010

0.0037

/

0.0018

/

0.0040

0.0108

/

/

0.0040

0.0081

runtimeðsÞ

240.9

675.7

/

7341.4

/

1149.1

11937.9

/

/

4062.1

4327.1

QMax

0.7261

0.8049

/

0.7882

/

0.6799

0.8198

/

/

0.8042

0:8618

QAvg

0.7207=

0.7952=

/

0.7828=

/

0.6676=

0.8045=

/

/

0.7963=

0:8548

std

0.0029

0.0069

/

0.0018

/

0.0086

0.0100

/

/

0.0027

0.0033

runtimeðsÞ

348.3

7097.6

/

7503.5

/

1341.4

13359.9

/

/

4291.4

5019.8

QMax

0.4585

0.4956

/

0.6166

/

0.4269

0.6064

/

/

0.5677

0:6200

QAvg

0.4543=

0.4832=

/

0.6113 =

/

0.4030=

0.5784=

/

/

0.5649=

0:6124 0.0059

std

0.0031

0.0072

/

0.0024

/

0.0121

0.0221

/

/

0.0011

runtimeðsÞ

33654.0

1106.5

/

19703.9

/

1600.7

13741.7

/

/

5235.6

6298.8

QMax

0.3830

O

/

/

/

0.2677

/

/

/

0.5345

0:6005

QAvg

0.3796=

O

/

/

/

0.2541=

/

/

/

0.5278=

0:5953

std

0.0016

O

/

/

/

0.0069

/

/

/

0.0026

0.0043

runtimeðsÞ

9160.8

O

/

/

/

4078.0

/

/

/

17710.5

18988.7

QMax

0.5474

O

/

/

/

0.4610

/

/

/

0.6283

0:6947

QAvg

0.5459=

O

/

/

/

0.4518=

/

/

/

0.6224=

0:6871

std

0.0011

O

/

/

/

0.0048

/

/

/

0.0038

0.0032

runtimeðsÞ

1652.5

O

/

/

/

6019.5

/

/

/

28168.0

22846.1


TABLE III (Continued ) NETWORK

MEASURE

CSE

SSCF

CoCoMi

DCRO

SOSFCD

MOCD

MODPSO

MOEA/D-PM

MODTLBO/D

RMOEA

QMax

/

O

/

/

/

0.3369

/

/

/

0.4112

0:5423

QAvg

/

O

/

/

/

0.3252=

/

/

/

0.3966=

0:5356

Epinions

MMCoMO

std

/

O

/

/

/

0.0060

/

/

/

0.0108

0.0034

runtimeðsÞ

/

O

/

/

/

6881.4

/

/

/

23223.9

23352.9

QMax

/

O

/

/

/

0.3077

/

/

/

0.5407

0:5956

QAvg

/

O

/

/

/

0.2918=

/

/

/

0.5357=

0:5847

std

/

O

/

/

/

0.0076

/

/

/

0.0023

0.0052

runtimeðsÞ

/

O

/

/

/

9836.8

/

/

/

23070.4

42614.8

rank sum test þ//

0/14/0

0/14/0

3/10/1

2/10/2

0/14/0

0/13/1

0/14/0

0/13/1

0/11/3

1/12/1

pairwise t-tests þ//

0/14/0

0/14/0

2/10/2

2/10/2

0/14/0

0/13/1

0/14/0

0/13/1

0/11/3

1/12/1

Enron-large

(Q) [79] and normalized mutual information (NMI) [80]. The modularity Q was adopted to measure the quality of the communities, which can be calculated without the real community labels of the network. Specifically, Q is formally defined as

$Q = \sum_{s=1}^{k} \left[ \frac{l_s}{M} - \left( \frac{d_s}{2M} \right)^2 \right],$

where k is the number of detected communities, M is the number of edges in the network, l_s is the number of edges connecting nodes in the same community s, and d_s is the sum of degrees of nodes in community s. A larger Q value indicates better community detection performance. The normalized mutual information NMI was adopted to measure the similarity between the detected and real communities. Specifically, NMI is formally defined as

$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{i,j} \log\!\left( C_{i,j}\, n / (C_{i\cdot} C_{\cdot j}) \right)}{\sum_{i=1}^{C_A} C_{i\cdot} \log(C_{i\cdot}/n) + \sum_{j=1}^{C_B} C_{\cdot j} \log(C_{\cdot j}/n)},$

where C_A (or C_B) is the number of communities in partition A (or B), C is the confusion matrix, and C_{i,j} denotes the number of shared nodes in community i of partition A and community j of partition B. C_{i·} (or C_{·j}) is the sum of the elements of C in row i (or column j), and n is the number of nodes in the network. NMI(A, B) = 1 means that partition A is the same as B, whereas NMI(A, B) = 0 means that partition A is completely different from B. A larger NMI value indicates better community detection performance.
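As a reference point, the two metrics can be computed with the short Python sketch below; the function names and the representation of a partition as a list of node sets are assumptions made for this example, and degenerate cases (e.g., a partition with a single community) are not handled.

import math
from itertools import product

def modularity_q(communities, edges):
    """Q = sum_s [ l_s/M - (d_s/(2M))^2 ]; `communities` is a list of node sets,
    `edges` a list of undirected (u, v) pairs, each edge listed once."""
    M = len(edges)
    q = 0.0
    for com in communities:
        l_s = sum(1 for u, v in edges if u in com and v in com)
        d_s = sum(1 for u, v in edges if u in com) + sum(1 for u, v in edges if v in com)
        q += l_s / M - (d_s / (2 * M)) ** 2
    return q

def nmi(part_a, part_b, n):
    """Normalized mutual information between two partitions (lists of node sets)."""
    conf = [[len(a & b) for b in part_b] for a in part_a]        # confusion matrix C
    row = [sum(r) for r in conf]
    col = [sum(conf[i][j] for i in range(len(part_a))) for j in range(len(part_b))]
    num = -2.0 * sum(conf[i][j] * math.log(conf[i][j] * n / (row[i] * col[j]))
                     for i, j in product(range(len(part_a)), range(len(part_b)))
                     if conf[i][j] > 0)
    den = sum(r * math.log(r / n) for r in row if r > 0) + \
          sum(c * math.log(c / n) for c in col if c > 0)
    return num / den

For two identical partitions the numerator and denominator coincide and the sketch returns 1, matching the behavior of NMI described above.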

In summary, Q can be computed on networks without true community structures, whereas NMI can only be used on networks with true community structures. For both Q and NMI, a larger value indicates a better detection performance. Thus, in our experiments, Q was used to measure all 14 real-world networks, and NMI was used only for the five real-world networks and the LFR networks with true community structures. To comprehensively compare the proposed MMCoMO with the baseline algorithms, the modularity was also reported for the networks with known true community partitions, as is done in most existing EA-based algorithms, e.g., [15], [19], [38].

B. Comparison Results Between MMCoMO and Baselines

This subsection first presents the comparison results in terms of Q on the 14 real-world networks and then compares the results in terms of NMI on the five real-world networks and LFR networks with true community structures. For each MOEA-based algorithm, the final non-dominated solution set was first obtained, and the solution with the best value of Q (or NMI) was chosen for comparison; this practice is widely adopted in existing MOEA-based community detection algorithms [11], [26], [31].

1) Results in Terms of Modularity Q
Table III lists the average Q values of the comparison algorithms (averaged over 20 runs) on the 14 real-world networks.

Note that the Wilcoxon rank-sum and pairwise t-tests were used to evaluate the statistical difference in the performance of the comparison algorithms, both at a significance level of 0.05. The first symbol is the result of the Wilcoxon rank-sum test, and the second is the result of the pairwise t-test; '+,' '-,' and '≈' indicate that the result is significantly better, significantly worse, and statistically similar to that obtained by MMCoMO, respectively. In addition, the average calculation time for the 20 runs is provided. Note that '/' indicates that the Q values are not provided because these results cannot be obtained within 12 h for one run, and 'O' indicates that the results are not provided because the algorithms run out of memory.

Based on the results in Table III, the proposed MMCoMO achieves the best performance on most real-world networks in terms of Q. The two non-EA-based algorithms (CSE and SSCF) are more efficient than most EA-based baselines because they do not evaluate the solutions many times. However, the performance of the two algorithms on most networks is worse than that of most EA-based algorithms. Note that SSCF requires a large amount of memory during the calculation process, and the results for some larger networks (e.g., Ca-AstroPh and Ca-CondMat) are not provided here because the algorithm runs out of memory. In addition, note that the two single-objective EA-based algorithms CoCoMi and DCRO can achieve better performance on small-scale networks with good structures (e.g., Karate, Dolphin, Polbooks, and Football) because multiple rounds of local search are used to improve Q in these algorithms. SOSFCD can achieve the best modularity on three small-scale networks with its search strategy based on node neighborhood information. However, when the networks become larger and more complicated, the influence of these search strategies becomes smaller and the time consumption increases; thus, the performance of these single-objective EA-based algorithms deteriorates significantly on large-scale networks. Compared with the five MOEA-based baseline algorithms, when the number of nodes in these real-world networks is greater than 2,000, the proposed MMCoMO achieves a much better performance in terms of Q. The superior performance of MMCoMO is attributed to the macro-micro population-based co-evolution scheme suggested in MMCoMO, which can significantly reduce the cost of blind search, particularly on large-scale networks. To verify the ability of MMCoMO to search for non-dominated solutions, Section V of the supplementary material presents the final non-dominated solutions obtained by MMCoMO and the five MOEA-based baseline algorithms.

In addition to the good performance of MMCoMO in terms of Q, MMCoMO has some advantages in efficiency. Table III indicates that it requires the least or second-least amount of time on most real-world networks. Note that the two single-objective EA-based algorithms, CoCoMi and DCRO, cannot obtain Q values within 12 h for one run on larger networks, such as Ca-AstroPh and Ca-CondMat, because a large number of local searches are used. Similar situations were observed for the two multi-objective EA-based algorithms MODPSO and MODTLBO/D, because MODPSO must compute the velocity for each particle (which is time-consuming), and the teaching and learning strategies used in MODTLBO/D require considerable time. The high efficiency of MMCoMO is attributed to the macro-population changing the rough structure of communities exceedingly quickly in the early evolution, which can guide the micro-population to avoid many local searches. Among the six multi-objective EA-based algorithms, RMOEA obtained results similar to those of MMCoMO because the sizes of the networks in RMOEA were recursively reduced as the evolution proceeded. The efficiency of MMCoMO on considerably large networks is worse than that of MOCD; however, the average Q of MMCoMO is significantly higher than that of MOCD.

FIGURE 6 NMI values of comparison algorithms averaged over 20 runs on the four groups of LFR networks.

2) Results in Terms of NMI
To further demonstrate the performance of the comparison algorithms, the performance metric NMI was adopted. However, NMI can be used only on networks with true community structures. In our experiments, five real-world networks had true community structures, i.e., Karate, Dolphin, Polbooks, Football, and Yeast. Table IV presents the average NMI values of the comparison algorithms (averaged over 20 runs) on the five real-world networks with ground-truth community structures. This table indicates that CSE and SSCF outperform the three single-objective EA-based algorithms, i.e., CoCoMi, DCRO, and SOSFCD, but perform worse than most MOEA-based algorithms on the five real-world networks. In addition, note that the results of MMCoMO are statistically better than those of CoCoMi, DCRO, SOSFCD, MOCD, and MOEA/D-PM, and comparable to those of MODPSO, MODTLBO/D, and RMOEA. In addition, Figure 6 presents the average NMI values (averaged over 20 runs) of the comparison algorithms on the four groups of LFR networks: (a) m ranging from 0.1 to 0.8 at a step of 0.1 (n = 1000), (b) n ranging from 500 to 2500 with an interval of 500 (m = 0.6), (c) m ranging from 0.1 to 0.8 at a step of 0.1 (n = 5000), and (d) n ranging from 3000 to 5000 with an interval of 500 (m = 0.6). The results of CoCoMi, MODTLBO/D, and SOSFCD on (c) and (d) are not provided here because they cannot be obtained within 12 h for

TABLE IV Comparison results of NMI on the five real-world networks with ground truths. NETWORK MEASURE NMIMax Karate

NMIAvg std NMIMax

Dolphin

NMIAvg std NMIMax

Polbooks

NMIAvg std NMIMax

Football

Yeast

NMIAvg

CSE

SSCF

CoCoMi

DCRO

SOSFCD

MOCD

MODPSO

0.8155

0.8365

0.6873

0.6873

0.8041

0.8372

1

0.8372=

1

1

1

0.9025=

0.8372=

1 =

0.9403=

1

0.1058

0

0

0.0886

0

0.8155= 0.8365= 0.6873= 0.6873= 0.6909= 0.8372= 0

0

0

0

0.0421

0

0.7224

0.8889

0.5932

0.5932

0.6169

0.9105

0.7224= 0.8889þ= 0.5846= 0.5872= 0.5159= 0.8524 =

0

0

0.0091

0.0040

0.0742

0.0567

0.4418

0.4744

0.5198

0.5198

0.5359

0.5520

0.4418= 0.4744= 0.5110= 0.5104= 0.4742= 0.5134= 0

0

0.0045

0.0040

0.0346

0.0245

0.9079

0.9308

0.8903

0.909

0.8903

0.8317

0.9079= 0.9282þ=þ 0.8783= 0.8463= 0.8167= 0.7928=

MOEA/D-PM MODTLBO/D

RMOEA

MMCoMO

1

1

1

1

0.8888

0.8570 =

1þ=þ

1þ=þ

0.9971þ=þ

0.8758

0.1433

0

0

0.0128

0.0327

0.5413

0.5412

0.5670

0.5512

0:5751

0.5130=

0.5160=

0.5363 =

0:5416 =

0.5374

0.0233

0.0142

0.0109

0.0069

0.0142

0.9269

0:9367

0:9367

0.9361

0.9269

0.9210 =

0:9317þ=þ

0.9272þ=þ

0.9312þ=þ

0.9180

std

0

0.0033

0.0168

0.0350

0.0448

0.0204

0.0073

0.0063

0.0043

0.0065

0.0094

NMIMax

0.2395

0.2896

0.2123

0.2133

0.1996

0.2276

0:3792

0.2325

0.2426

0.2667

0.3428

0.2299=

0.2340=

0.2571=

0.3290 0.0114

NMIAvg std

0.2387= 0.2854= 0.2025= 0.2033= 0.1951= 0.2090= 0:3759þ=þ 0.0004

0.0027

0.0069

0.0072

0.0030

0.0079

0.0023

0.0032

0.0035

0.0056

rank sum test þ//

0/5/0

2/3/0

0/5/0

0/5/0

0/5/0

0/4/1

1/2/2

2/3/0

2/2/1

2/2/1

pairwise t-test þ//

0/5/0

1/3/1

0/5/0

0/5/0

0/5/0

0/4/1

1/2/2

2/3/0

2/2/1

2/2/1

one run. Note that the community structure of the network becomes increasingly unclear as m increases. Figure 6(a) suggests that the proposed MMCoMO achieves the best performance among the comparison algorithms when m ≤ 0.6. However, on networks with m > 0.6, MMCoMO does not achieve a better NMI value than MODPSO but can still achieve the second-best value. Figure 6(c) suggests that when n = 5000, MMCoMO achieves the best performance among the comparison algorithms for m ≤ 0.4 and m = 0.7. For 0.5 ≤ m ≤ 0.6 and m = 0.8, MMCoMO achieves the second-best result, which is only slightly worse than the best result. Figure 6(b) and (d) indicate that the proposed MMCoMO performs well on all networks except n = 500. The improved performance of MMCoMO results from the proposed co-evolution scheme having an effective diversification search. The good performance of MODPSO on networks with large m is because the label propagation-based initialization used in MODPSO can determine solutions near the real labels when the community structures are considerably weak. Based on the empirical results in Tables III and IV and Figure 6, it can be concluded that the proposed MMCoMO performs competitively in terms of both Q and NMI.

C. Effectiveness of the Proposed Co-Evolution Scheme and Strategies in MMCoMO

In this section, three variants of the proposed MMCoMO, namely, MMCoMO-TP, MMCoMO-WG, and MMCoMO-WI, are compared to further verify the effectiveness of the proposed co-evolution framework and interaction strategies in MMCoMO. Specifically, MMCoMO-TP is a two-phase algorithm, which is the same as MMCoMO except that MMCoMO-TP uses the medoid-based representation to search macroscopically in the first phase and then uses the vector-based representation to adjust the structure microscopically in the second phase. MMCoMO-WG and MMCoMO-WI are the same as MMCoMO except that the proposed guidance strategy and influence strategy, respectively, are removed from MMCoMO. Note that "TP", "WG", and "WI" denote "two phase", "without guidance strategy", and "without influence strategy", respectively. Table V lists the comparison results of the proposed MMCoMO and its three variants, MMCoMO-TP, MMCoMO-WG, and MMCoMO-WI, on the 14 datasets. From this table, the average results of MMCoMO are observed to be superior to those of

MMCoMO-TP in almost all cases, particularly in large-scale networks. Specifically, in terms of the Wilcoxon rank sum tests, MMCoMO outperforms on 11, ties on 3, and underperforms on 0 of 14 comparisons with respect to the average Q; for the pairwise t-tests, MMCoMO outperforms on 12, ties on 2, and underperforms on 0. These results verify the effectiveness of the proposed co-evolution scheme. Similar results were observed when comparing MMCoMO with MMCoMO-WG and MMCoMO-WI; MMCoMO outperforms on 12, ties on 2, and underperforms on 0 of 14 comparisons for both MMCoMO-WG and MMCoMO-WI in terms of the Wilcoxon rank sum tests. In addition, MMCoMO-WI can achieve better performance than MMCoMO-WG in almost all cases, indicating that the guidance strategy is more effective than the influence strategy used in MMCoMO. To further verify the effectiveness of the proposed interaction strategies, Section III of the supplementary material presents the status of the two populations during the optimization process. In summary, the effectiveness of the proposed co-evolution scheme and the two interaction strategies used in MMCoMO was verified.



TABLE V Q values of MMCoMO and its three variants on the 14 real-world networks. NETWORK

Karate

MEASURE

MMCoMO-TP

MMCoMO-WG

MMCoMO-WI

QMax

0:4198

0:4198

0:4198

QAvg std

Dolphin

Polbooks

Football

Blogs

CA-GrQc

CA-HepTh1

CA-HepTh2

CA-AstroPh

Epinions

Enron-large

0:4198 0

0:4198 0

0:5268

0:5268

0:5268

0:5268

0.5223 =

0.5217 =

0:5233

std

0.0027

0.0036

0.0034

0.0029

QMax

0:5272

0:5272

0.5269

0:5272

QAvg

0.5266 =

0.5268=

0.5262=

0:5269

std

0.0006

0.0004

0.0013

0.0003

QMax

0:6046

0.6036

0:6046

0:6046

QAvg

=

0.6037

0.0016

0.5988

=

0.0053

0.6024

=

0.0029

0:6039 0.0013

QMax

0.5662

0.5414

0.5698

0:5818

QAvg

0.5602=

0.5338=

0.5628=

0:5732

std

0.0034

0.0042

0.0036

0.0053

QMax

0.8310

0.7792

0.8285

0:8379

QAvg

0.8252=

0.7690=

0.8235=

0:8336

std

0.0034

0.0057

0.0689

0.0028

QMax

0.8223

0.7780

0.8244

0:8357

QAvg

=

0.8159

0.7735

=

0.8146

=

0:8262

std

0.0041

0.0028

0.0044

0.0044

QMax

0.7182

0.6610

0.7217

0:7347

QAvg

=

0.7093

0.0040

0.6549

=

0.0033

0.7105

=

0.0054

0:7203 0.0081

QMax

0.8463

0.7949

0.8493

0:8618

QAvg

0.8400=

0.7793=

0.8408=

0:8548

std

0.0030

0.0057

0.0038

0.0033

QMax

0.6172

0.6136

0.6124

0:6200

QAvg

0.6025=

0.6039=

0.6039=

0:6124

std

0.0052

0.0047

0.0045

0.0059

QMax

0.5953

0.5736

0.5977

0:6005

QAvg

=

0.5889

0.0028

0.5622

=

0.0053

0.5901

=

0.0034

0:5953 0.0043

QMax

0.6757

0.6222

0.6789

0:6947

QAvg

0.6710=

0.6181=

0.6733=

0:6871

std

0.0041

0.0023

0.0030

0.0032

QMax

0.5387

0.5074

0.5371

0:5423

QAvg

0.5278=

0.5045=

0.5271=

0:5356

std

0.0045

0.0021

0.0045

0.0034

QMax

0.5816

0.5586

0.5848

0:5956

QAvg std


0.0029

0:4198

0.5217 =

std

CA-CondMat

0.4191

=

QAvg

std

PGP

0

=

QMax

std

Yeast

0:4198

=

MMCoMO

=

0.5718

0.5494

=

0.5692

=

0.0060

0.0050

rank sum þ//

0/11/3

0/12/2

0/12/2

pairwise t-tests þ//

0/12/2

0/12/2

0/10/4


0.0071

0:5847 0.0052

V. Conclusion and Future Work

This study considered the community detection problem from both macro and micro perspectives and proposed a macro-micro population-based co-evolutionary multi-objective algorithm called MMCoMO for community detection in complex networks, which can achieve a better trade-off between exploration and exploitation. In the proposed MMCoMO, the macro-population was used to quickly find diverse, good rough community structures, and the micro-population was used to search for good, fine community structures through a local search process. Thus, the two populations can be harnessed to improve each other through interactions during the co-evolutionary process. To combine the advantages of the two populations, two effective interaction strategies (a guidance and an influence strategy) were suggested. Experiments on 14 real-world and several synthetic LFR networks demonstrated the competitiveness of the proposed MMCoMO in terms of Q and NMI compared with ten representative baseline algorithms and three variants of MMCoMO. Nonetheless, several directions for future work on MMCoMO remain open. The experiments have demonstrated the effectiveness of the proposed macro-micro population-based co-evolution scheme for detecting non-overlapping communities (i.e., communities without common nodes) in complex networks. Thus, in the future, MMCoMO can be extended to handle overlapping community detection by explicitly considering overlapping nodes. Designing a co-evolution scheme for community detection in other types of large-scale complex networks, such as signed [52], multilayer [53], attributed [56], and dynamic networks [58], is another direction worth investigating.

Acknowledgment

This work was supported in part by the National Key Research and Development Project, Ministry of Science and Technology, China, under Grant 2018AAA0101302, in part by the

National Natural Science Foundation of China under Grants 61976001, 61876184, and 62076001, in part by the Natural Science Foundation of Anhui Province under Grant 2008085QF309, and in part by the Key Projects of University Excellent Talents Support Plan of Anhui Provincial Department of Education under Grant gxyqZD2021089. This article has supplementary downloadable material available at https://doi. org/10.1109/MCI.2023.3277773, provided by the authors. References [1] Q. Cai, L. Ma, M. Gong, and D. Tian, “A survey on network community detection based on evolutionary computation,” Int. J. Bio-Inspired Comput., vol. 8, no. 2, pp. 84–98, 2016. [2] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proc. Nat. Acad. Sci., vol. 99, no. 12, pp. 7821–7826, 2002. [3] E. R. Barnes, “An algorithm for partitioning the nodes of a graph,” SIAM J. Algebr. Discrete Methods, vol. 3, no. 4, pp. 541–550, 1982. [4] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, no. 2, 2004, Art. no. 026113. [5] A. Mahmood and M. Small, “Subspace based network community detection using sparse linear coding,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 3, pp. 801–812, Mar. 2016. [6] Z. Bu, C. Zhang, Z. Xia, and J. Wang, “A fast parallel modularity optimization algorithm (FPMQA) for community detection in online social network,” Knowl.-Based Syst., vol. 50, pp. 246–259, 2013. [7] Z. Lin, X. Zheng, N. Xin, and D. Chen, “CKLPA: Efficient community detection algorithm based on label propagation with community kernel,” Physica A: Stat. Mechanics Appl., vol. 416, pp. 386–399, 2014. [8] Y. Su, C. Liu, Y. Niu, F. Cheng, and X. Zhang, “A community structure enhancement-based community detection algorithm for complex networks,” IEEE Trans. Syst., Man, Cybern. Syst., vol. 51, no. 5, pp. 2833–2846, May 2021. [9] C. Pizzuti, “Evolutionary computation for community detection in networks: A review,” IEEE Trans. Evol. Comput., vol. 22, no. 3, pp. 464–483, Jun. 2018. [10] C. Shi, Z. Yan, Y. Cai, and B. Wu, “Multi-objective community detection in complex networks,” Appl. Soft Comput., vol. 12, no. 2, pp. 850–859, 2012. [11] M. Gong, Q. Cai, X. Chen, and L. Ma, “Complex network clustering by multiobjective discrete particle swarm optimization based on decomposition,” IEEE Trans. Evol. Comput., vol. 18, no. 1, pp. 82–97, Feb. 2014. [12] C. Pizzuti, “A multi-objective genetic algorithm for community detection in networks,” in Proc. IEEE 21st Int. Conf. Tools Artif. Intell., 2009, pp. 379–386. [13] K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002. [14] B. Amiri, L. Hossain, and J. W. Crawford, “An efficient multiobjective evolutionary algorithm for community detection in social networks,” in Proc. IEEE Congr. Evol. Comput., 2011, pp. 2193–2199. [15] C. Pizzuti, “A multiobjective genetic algorithm to find communities in complex networks,” IEEE Trans. Evol. Comput., vol. 16, no. 3, pp. 418–430, Jun. 2012. [16] M. Gong, X. Chen, L. Ma, Q. Zhang, and L. Jiao, “Identification of multi-resolution network structures with multi-objective immune algorithm,” Appl. Soft Comput., vol. 13, no. 4, pp. 1705–1717, 2013.

[17] B. Amiri, L. Hossain, J. W. Crawford, and R. T. Wigand, “Community detection in complex networks: Multi-objective enhanced firefly algorithm,” Knowl.-Based Syst., vol. 46, pp. 1–11, 2013.
[18] F. Zou, D. Chen, S. Li, R. Lu, and M. Lin, “Community detection in complex networks: Multiobjective discrete backtracking search optimization algorithm with decomposition,” Appl. Soft Comput., vol. 53, pp. 285–295, 2017.
[19] D. Chen, F. Zou, R. Lu, L. Yu, Z. Li, and J. Wang, “Multi-objective optimization of community detection using discrete teaching–learning-based optimization with decomposition,” Inf. Sci., vol. 369, pp. 402–418, 2016.
[20] K. R. Žalik and B. Žalik, “Multi-objective evolutionary algorithm using problem-specific genetic operators for community detection in networks,” Neural Comput. Appl., vol. 30, no. 9, pp. 2907–2920, 2018.
[21] F. Cheng, T. Cui, Y. Su, Y. Niu, and X. Zhang, “A local information based multi-objective evolutionary algorithm for community detection in complex networks,” Appl. Soft Comput., vol. 69, pp. 357–367, 2018.
[22] S. Rahimi, A. Abdollahpouri, and P. Moradi, “A multi-objective particle swarm optimization algorithm for community detection in complex networks,” Swarm Evol. Comput., vol. 39, pp. 297–309, 2018.
[23] F. Zou, D. Chen, D.-S. Huang, R. Lu, and X. Wang, “Inverse modelling-based multi-objective evolutionary algorithm with decomposition for community detection in complex networks,” Physica A: Stat. Mechanics Appl., vol. 513, pp. 662–674, 2019.
[24] C. Mu, J. Zhang, Y. Liu, R. Qu, and T. Huang, “Multi-objective ant colony optimization algorithm based on decomposition for community detection in complex networks,” Soft Comput., vol. 23, no. 23, pp. 12683–12709, 2019.
[25] N. S. Sani, M. Manthouri, and F. Farivar, “A multiobjective ant colony optimization algorithm for community detection in complex networks,” J. Ambient Intell. Humanized Comput., vol. 11, no. 1, pp. 5–21, 2020.
[26] X. Zhang, K. Zhou, H. Pan, L. Zhang, X. Zeng, and Y. Jin, “A network reduction-based multiobjective evolutionary algorithm for community detection in large-scale complex networks,” IEEE Trans. Cybern., vol. 50, no. 2, pp. 703–716, Feb. 2020.
[27] D. W. Corne, N. R. Jerram, J. D. Knowles, and M. J. Oates, “PESA-II: Region-based selection in evolutionary multiobjective optimization,” in Proc. 3rd Annu. Conf. Genet. Evol. Comput., 2001, pp. 283–290.
[28] J. Kennedy and R. C. Eberhart, “A discrete binary version of the particle swarm algorithm,” in Proc. IEEE Int. Conf. Syst., Man, Cybern. Comput. Cybern. Simul., 1997, pp. 4104–4108.
[29] M. P. Saka, O. Hasançebi, and Z. W. Geem, “Metaheuristics in structural optimization and discussions on harmony search algorithm,” Swarm Evol. Comput., vol. 28, pp. 88–97, 2016.
[30] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Phys. Rev. E, vol. 78, 2008, Art. no. 046110.
[31] X. Wen et al., “A maximal clique based multiobjective evolutionary algorithm for overlapping community detection,” IEEE Trans. Evol. Comput., vol. 21, no. 3, pp. 363–377, Jun. 2017.
[32] L. Angelini, S. Boccaletti, D. Marinazzo, M. Pellicoro, and S. Stramaglia, “Identification of network modules by optimization of ratio association,” Chaos: Interdiscipl. J. Nonlinear Sci., vol. 17, no. 2, 2007, Art. no. 023114.
[33] M. Tasgin, A. Herdagdelen, and H. Bingol, “Community detection in complex networks using genetic algorithms,” 2007, arXiv:0711.0491.
[34] C. Pizzuti, “GA-Net: A genetic algorithm for community detection in social networks,” in Proc. 10th Int. Conf. Parallel Problem Solving From Nature, 2008, pp. 1081–1090.
[35] C. Shi, Z. Yan, Y. Wang, Y. Cai, and B. Wu, “A genetic algorithm for detecting communities in large-scale complex networks,” Adv. Complex Syst., vol. 13, no. 1, pp. 3–17, 2010.
[36] S. He et al., “Cooperative co-evolutionary module identification with application to cancer disease module discovery,” IEEE Trans. Evol. Comput., vol. 20, no. 6, pp. 874–891, Dec. 2016.
[37] H. Sun, S. Ma, and Z. Wang, “A community detection algorithm using differential evolution,” in Proc. IEEE 3rd Int. Conf. Comput. Commun., 2017, pp. 1515–1519.
[38] H. Chang, Z. Feng, and Z. Ren, “Community detection using dual-representation chemical reaction optimization,” IEEE Trans. Cybern., vol. 47, no. 12, pp. 4328–4341, Dec. 2017.
[39] H. N. Mohammad, L. Mathieson, and P. Moscato, “A memetic algorithm for community detection by maximising the connected cohesion,” in Proc. IEEE Symp. Ser. Comput. Intell., 2017, pp. 1–8.
[40] M. Moradi and S. Parsa, “An evolutionary method for community detection using a novel local search strategy,” Physica A: Stat. Mechanics Appl., vol. 523, pp. 457–475, 2019.
[41] B. A. Attea, W. A. Hariz, and M. F. Abdulhalim, “Improving the performance of evolutionary multiobjective co-clustering models for community detection in complex social networks,” Swarm Evol. Comput., vol. 26, pp. 137–156, 2016.
[42] Q. Zhang and H. Li, “MOEA/D: A multiobjective evolutionary algorithm based on decomposition,” IEEE Trans. Evol. Comput., vol. 11, no. 6, pp. 712–731, Dec. 2007.
[43] B. A. Attea et al., “A review of heuristics and metaheuristics for community detection in complex networks: Current usage, emerging development and future directions,” Swarm Evol. Comput., vol. 63, 2021, Art. no. 100885.
[44] R. V. Rao, V. J. Savsani, and D. P. Vakharia, “Teaching-learning-based optimization: An optimization method for continuous non-linear large scale problems,” Inf. Sci., vol. 183, no. 1, pp. 1–15, 2012.
[45] L. Zhang, H. Pan, Y. Su, X. Zhang, and Y. Niu, “A mixed representation-based multiobjective evolutionary algorithm for overlapping community detection,” IEEE Trans. Cybern., vol. 47, no. 9, pp. 2703–2716, Sep. 2017.
[46] Y. Tian, S. Yang, and X. Zhang, “An evolutionary multiobjective optimization based fuzzy method for overlapping community detection,” IEEE Trans. Fuzzy Syst., vol. 28, no. 11, pp. 2841–2855, Nov. 2020.
[47] H. Ma, H. Yang, K. Zhou, L. Zhang, and X. Zhang, “A local-to-global scheme-based multi-objective evolutionary algorithm for overlapping community detection on large-scale complex networks,” Neural Comput. Appl., vol. 33, no. 10, pp. 5135–5149, 2021.
[48] W. Zheng, J. Sun, Q. Zhang, and Z. Xu, “Continuous encoding for overlapping community detection in attributed network,” IEEE Trans. Cybern., early access, Mar. 14, 2022, doi: 10.1109/TCYB.2022.3155646.
[49] R. Shang, K. Zhao, W. Zhang, J. Feng, Y. Li, and L. Jiao, “Evolutionary multiobjective overlapping community detection based on similarity matrix and node correction,” Appl. Soft Comput., vol. 127, 2022, Art. no. 109397.
[50] A. Ramesh and G. Srivatsun, “Evolutionary algorithm for overlapping community detection using a merged maximal cliques representation scheme,” Appl. Soft Comput., vol. 112, 2021, Art. no. 107746.
[51] A. Amelio and C. Pizzuti, “Community mining in signed networks: A multiobjective approach,” in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal., 2013, pp. 95–99.
[52] C. Liu, J. Liu, and Z. Jiang, “A multiobjective evolutionary algorithm based on similarity for community detection from signed social networks,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2274–2287, Dec. 2014.
[53] A. Amelio and C. Pizzuti, “Community detection in multidimensional networks,” in Proc. IEEE 26th Int. Conf. Tools Artif. Intell., 2014, pp. 352–359.
[54] Z. Yin, Y. Deng, F. Zhang, Z. Luo, P. Zhu, and C. Gao, “A semi-supervised multi-objective evolutionary algorithm for multi-layer network community detection,” in Proc. 14th Int. Conf. Knowl. Sci., Eng. Manage., 2021, pp. 179–190.
[55] Y. Chen and D. Mo, “Community detection for multilayer weighted networks,” Inf. Sci., vol. 595, pp. 119–141, 2022.
[56] X. Teng, J. Liu, and M. Li, “Overlapping community detection in directed and undirected attributed networks using a multiobjective evolutionary algorithm,” IEEE Trans. Cybern., vol. 51, no. 1, pp. 138–150, Jan. 2021.
[57] J. Sun, W. Zheng, Q. Zhang, and Z. Xu, “Graph neural network encoding for community detection in attribute networks,” IEEE Trans. Cybern., vol. 52, no. 8, pp. 7791–7804, Aug. 2022.
[58] X. Zeng, W. Wang, C. Chen, and G. G. Yen, “A consensus community-based particle swarm optimization for dynamic community detection,” IEEE Trans. Cybern., vol. 50, no. 6, pp. 2502–2513, Jun. 2020.
[59] Y. Yin, Y. Zhao, H. Li, and X. Dong, “Multiobjective evolutionary clustering for large-scale dynamic community detection,” Inf. Sci., vol. 549, pp. 269–287, 2021.
[60] W. Jiang et al., “Label entropy-based cooperative particle swarm optimization algorithm for dynamic overlapping community detection in complex networks,” Int. J. Intell. Syst., vol. 37, no. 2, pp. 1371–1407, 2022.
[61] M. Gong, B. Fu, L. Jiao, and H. Du, “Memetic algorithm for community detection in networks,” Phys. Rev. E, vol. 84, Nov. 2011, Art. no. 056101.
[62] K. R. Žalik and B. Žalik, “Memetic algorithm using node entropy and partition entropy for community detection in networks,” Inf. Sci., vol. 445/446, pp. 38–49, 2018.
[63] L. M. Antonio and C. A. C. Coello, “Coevolutionary multiobjective evolutionary algorithms: Survey of the state-of-the-art,” IEEE Trans. Evol. Comput., vol. 22, no. 6, pp. 851–865, Dec. 2018.

[64] M. A. Potter and K. A. D. Jong, “Cooperative coevolution: An architecture for evolving coadapted subcomponents,” Evol. Comput., vol. 8, no. 1, pp. 1–29, 2000.
[65] M. Gong, H. Li, E. Luo, J. Liu, and J. Liu, “A multiobjective cooperative coevolutionary algorithm for hyperspectral sparse unmixing,” IEEE Trans. Evol. Comput., vol. 21, no. 2, pp. 234–248, Apr. 2017.
[66] C. D. Rosin and R. K. Belew, “New methods for competitive coevolution,” Evol. Comput., vol. 5, no. 1, pp. 1–29, 1997.
[67] M. d. A. Costa e Silva, L. d. S. Coelho, and L. Lebensztajn, “Multiobjective biogeography-based optimization based on predator-prey approach,” IEEE Trans. Magn., vol. 48, no. 2, pp. 951–954, Feb. 2012.
[68] R. I. Kondor and J. Lafferty, “Diffusion kernels on graphs and other discrete structures,” in Proc. 19th Int. Conf. Mach. Learn., 2002, pp. 315–322.
[69] R. A. Rossi and N. K. Ahmed, “The network data repository with interactive graph analytics and visualization,” in Proc. AAAI Conf. Artif. Intell., 2015, pp. 4292–4293.
[70] W. W. Zachary, “An information flow model for conflict and fission in small groups,” J. Anthropological Res., vol. 33, no. 4, pp. 452–473, 1977.
[71] D. Lusseau, “The emergent properties of a dolphin social network,” Proc. Roy. Soc. London Ser. B: Biol. Sci., vol. 270, no. suppl_2, pp. S186–S188, 2003.

[72] M. E. Newman, “Modularity and community structure in networks,” Proc. Nat. Acad. Sci., vol. 103, no. 23, pp. 8577–8582, 2006.
[73] D. Bu et al., “Topological structure analysis of the protein–protein interaction network in budding yeast,” Nucleic Acids Res., vol. 31, no. 9, pp. 2443–2450, 2003.
[74] J. Leskovec and A. Krevl, “SNAP datasets: Stanford large network dataset collection,” 2014. [Online]. Available: https://snap.stanford.edu/data/
[75] X. Guardiola, R. Guimera, A. Arenas, A. Diaz-Guilera, D. Streib, and L. Amaral, “Macro- and micro-structure of trust networks,” 2002, arXiv:cond-mat/0206240.
[76] M. E. J. Newman, “Mixing patterns in networks,” Phys. Rev. E, vol. 67, Feb. 2003, Art. no. 026126.
[77] M. Kaiser, “Mean clustering coefficients: The role of isolated nodes and leafs on clustering measures for small-world networks,” New J. Phys., vol. 10, no. 8, 2008, Art. no. 083042.
[78] J. Xiao, Y. Wang, and X. K. Xu, “Fuzzy community detection based on elite symbiotic organisms search and node neighborhood information,” IEEE Trans. Fuzzy Syst., vol. 30, no. 7, pp. 2500–2514, Jul. 2022.
[79] M. E. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, no. 6, 2004, Art. no. 066133.
[80] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure identification,” J. Stat. Mechanics: Theory Experiment, vol. 2005, no. 09, 2005, Art. no. P09008.

CIS Publication Spotlight (continued from page 12)

saliency map, Grad-CAM, Grad-CAM++). Our quantitative experimental results have shown that ensemble XAI has a comparable absence impact (decision impact: 0.72, confident impact: 0.24). Our qualitative experiment, in which a panel of three radiologists were involved to evaluate the degree of concordance and trust in the algorithms, has shown that ensemble XAI has localization effectiveness (mean set accordance precision: 0.52, mean set accordance recall: 0.57, mean set F1: 0.50, mean set IOU: 0.36) and is the most trusted method by the panel of radiologists (mean vote: 70.2%). Finally, the deep learning interpretation dashboard used for the radiologist panel voting will be made available to the community. Our code is available at https://github.com/IHISHealthInsights/Interpretation-MethodsVoting-dashboard.”
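The spotlight above reports set-level agreement metrics between saliency regions and radiologist annotations. As a rough illustration of how such agreement can be quantified, the sketch below computes precision, recall, F1, and IoU between a thresholded saliency map and a binary annotation mask using NumPy; the threshold value and array names are hypothetical, and the spotlighted paper's exact "set accordance" protocol may differ.

```python
# Illustrative only: standard pixel-set agreement metrics between a saliency
# map and an expert annotation mask. The threshold and variable names are
# hypothetical stand-ins, not the spotlighted paper's exact protocol.
import numpy as np

def agreement_metrics(saliency, annotation, threshold=0.5):
    """Precision, recall, F1, and IoU of the thresholded saliency region
    against a binary annotation mask of the same shape."""
    pred = saliency >= threshold          # predicted salient pixels
    truth = annotation.astype(bool)       # expert-marked pixels
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    saliency = rng.random((64, 64))       # stand-in for a Grad-CAM heatmap
    annotation = np.zeros((64, 64), int)
    annotation[20:40, 20:40] = 1          # stand-in for a radiologist's marked region
    print(agreement_metrics(saliency, annotation))
```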

Introducing IEEE Collabratec™ The premier networking and collaboration site for technology professionals around the world.

IEEE Collabratec is a new, integrated online community where IEEE members, researchers, authors, and technology professionals with similar fields of interest can network and collaborate, as well as create and manage content. Featuring a suite of powerful online networking and collaboration tools, IEEE Collabratec allows you to connect according to geographic location, technical interests, or career pursuits. You can also create and share a professional identity that showcases key accomplishments and participate in groups focused around mutual interests, actively learning from and contributing to knowledgeable communities. All in one place!

Learn about IEEE Collabratec at ieee-collabratec.ieee.org

Network. Collaborate. Create.

Conference Calendar

Marley Vellasco, Pontifícia Universidade Católica do Rio de Janeiro, BRAZIL
Liyan Song, Southern University of Science and Technology, CHINA



Denotes a CIS-Sponsored Conference

D Denotes a CIS Technical Co-Sponsored Conference 

2023 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2023) August 13–17, 2023 Place: Incheon, Korea General Chairs: Frank Chung-Hoon Rhee, and Byung-Jae Choi Website: http://fuzz-ieee.org/



2023 IEEE Conference on Games (IEEE CoG 2023) August 21–24, 2023 Place: Boston, USA General Chairs: Casper Harteveld and Jialin Liu Website: https://2023.ieee-cog.org/



2023 IEEE Smart World Congress (IEEE SWC 2023) August 25–28, 2023 Place: Portsmouth, U.K. General Chairs: Hui Yu and Amir Hussain Website: https://ieee-smart-worldcongress.org/



IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2023) August 29–31, 2023 Place: Eindhoven, Netherlands General Chair: Marco S. Nobile Website: https://cmte.ieee.org/cis-bbtc/cibcb2023/

D 18th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP 2023) September 25–26, 2023 Place: Limassol, Cyprus General Chairs: Nicolas Tsapatsoulis and Jahna Otterbacher Website: https://cyprusconferences.org/smap2023/

2023 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA 2023) October 9–13, 2023 Place: Thessaloniki, Greece General Chairs: Yannis Manolopoulos and Zhihua Zhou Website: https://conferences.sigappfr.org/dsaa2023/



IEEE Latin American Conference on Computational Intelligence (IEEE LA-CCI 2023) October 29–November 1, 2023 Place: Porto de Galinhas, Brazil General Chairs: Diego Pinheiro and Rodrigo Monteiro Website: https://netuh.github.io/la-ccihome/

D International Conference on Behavioural and Social Computing (BESC 2023) October 30–November 1, 2023 Place: Larnaca, Cyprus (University of Cyprus) General Chair: George A. Papadopoulos Website: http://besc-conf.org/2023/



2023 IEEE International Conference on Development and Learning (ICDL 2023) November 9–11, 2023 Place: Macau, China General Chair: Zhijun Li Website: http://www.icdl-2023.org/



2023 IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2023) December 6–8, 2023 Place: Mexico City, Mexico

General Chair: Wen Yu Website: https://attend.ieee.org/ssci-2023/ 

2024 IEEE International Conference on Development and Learning (IEEE ICDL 2024) May 20–23, 2024 Place: Austin, TX, USA General Chair: Chen Yu Website: TBA



2024 IEEE Evolving and Adaptive Intelligent Systems Conference (IEEE EAIS 2024) May 23–24, 2024 Place: Madrid, Spain General Chairs: Jose Iglesias and Rashmi Baruah Website: TBA



2024 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (IEEE CIVEMSA 2024) June 15–15, 2024 Place: Xi’an, China General Chairs: Yong Hu, Xiaodong Zhang, and Yi Zhang Website: TBA



2024 IEEE International Conference on Artificial Intelligence (IEEE CAI 2024) June 25–27, 2024 Place: Singapore General Chairs: Ivor Tsang, Yew Soon Ong, and Hussein Abbass Website: TBA



2024 IEEE World Congress on Computational Intelligence (IEEE WCCI 2024) June 30–July 5, 2024 Place: Yokohama, Japan General Chairs: Akira Hirose and Hisao Ishibuchi Website: https://wcci2024.org/

Digital Object Identifier 10.1109/MCI.2023.3278575 Date of current version: 13 July 2023


IEEE WCCI 2024, Yokohama, Japan

CALL FOR PAPERS

IMPORTANT DATES
15 November 2023: Special Session & Workshop Proposals Deadline
15 December 2023: Competition & Tutorial Proposals Deadline
15 January 2024: Paper Submission Deadline
15 March 2024: Paper Acceptance Notification
1 May 2024: Final Paper Submission & Early Registration Deadline
30 June – 5 July 2024: IEEE WCCI 2024, Yokohama, Japan

IEEE WCCI is the world’s largest technical event on computational intelligence, featuring the three flagship conferences of the IEEE Computational Intelligence Society (CIS) under one roof: the International Joint Conference on Neural Networks (IJCNN), the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), and the IEEE Congress on Evolutionary Computation (IEEE CEC). IEEE WCCI 2024 will be held in Yokohama, Japan, a city that inspires academic fusion and multidisciplinary and industrial collaboration. The Yokohama area boasts a number of universities, institutes, and companies in advanced information technology, electronics, robotics, mobility, medicine, and foods. Holding IEEE WCCI 2024 in this area will inspire attendees to imagine next-generation science and technology as a fusion of AI, physiology, and psychology, as well as cooperation with intelligence-related industries.

IJCNN 2024
The International Joint Conference on Neural Networks (IJCNN) covers a wide range of topics in the field of neural networks, from biological neural networks to artificial neural computation.

IEEE CEC 2024
The IEEE Congress on Evolutionary Computation (IEEE CEC) covers all topics in evolutionary computation from theory to real-world applications.

FUZZ-IEEE 2024
The IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) covers all topics in fuzzy systems from theory to real-world applications.

CALL FOR PAPERS
Papers for IEEE WCCI 2024 should be submitted electronically through the Congress website at wcci2024.org, and will be refereed by experts in the fields and ranked based on the criteria of originality, significance, quality and clarity.

CALL FOR TUTORIALS
IEEE WCCI 2024 will feature pre-Congress tutorials, covering fundamental and advanced topics in computational intelligence. A tutorial proposal should include title, outline, expected enrollment, and presenter/organizer biography. Inquiries regarding tutorials should be addressed to the Tutorials Chairs.

CALL FOR SPECIAL SESSION PROPOSALS
IEEE WCCI 2024 solicits proposals for special sessions within the technical scope of the three conferences. Special sessions, to be organized by internationally recognized experts, aim to bring together researchers in special focused topics. Cross-fertilization of the three technical disciplines and newly emerging research areas are strongly encouraged. Inquiries regarding special sessions and proposals should be addressed to the Special Sessions Chairs.

CALL FOR COMPETITION PROPOSALS
IEEE WCCI 2024 will host competitions to stimulate research in computational intelligence. A competition proposal should include descriptions of the problem(s) addressed, evaluation procedures, and a biography of the organizers. Inquiries regarding competitions should be addressed to the Competitions Chair.

2024 IEEE Conference on Artificial Intelligence Sands Expo and Convention Centre Marina Bay Sands, Singapore 25-27 June 2024

CALL FOR PAPERS PAPERS | WORKSHOPS | PANEL PROPOSALS

IEEE CAI is a new conference and exhibition with an emphasis on the applications of AI and key AI verticals that impact industrial technology applications and innovations.

IMPORTANT DATES
Workshop & Panels proposal: 20 Nov 2023
Abstract submission: 13 Dec 2023
Full paper submission: 20 Dec 2023
Acceptance notifications & reviewers’ comments: 25 Mar 2024
Final reviewed submission: 25 Apr 2024

IEEE CAI seeks original, high-quality proposals describing the research and results that contribute to advancements in the following AI applications and verticals:

AI in Education
Involves the creation of adaptive learning systems, personalised content delivery, and administrative automation. It also utilises predictive analytics to monitor student progress and identify learning gaps.

Healthcare and Life Science
Explores the need for improved decision-making to assist medical practitioners as well as additional medical issues including personnel allocation and scheduling, automated sensing, improved medical devices and manufacturing processes, and supply chain optimisation.

Industrial AI
Enhance the aerospace, transportation, and maritime sectors by optimising system design, autonomous navigation, and management logistics. It emphasises robust cybersecurity, efficient Digital Twins usage, and comprehensive asset health management.

Societal Implication of AI
Explores the impact of artificial intelligence on society, including issues related to ethics, privacy, and equity. It examines how AI influences job markets, decision-making processes, and personal privacy, while also considering the importance of fairness, transparency, and accountability in AI systems.

Resilient and Safe AI
Develop AI systems that are reliable, secure, and able to withstand unexpected situations or cyberattacks. It emphasises the importance of creating AI technologies that function correctly and safely, even under adverse conditions, while also maintaining data privacy and system integrity.

Sustainable AI
Employ AI to optimise resource usage, reduce waste, and support renewable energy initiatives. The ultimate aim is to leverage AI's problem-solving capabilities to address environmental challenges and contribute to a sustainable future.

Access all submission details: https://ieeecai.org/2024/authors Papers accepted by IEEE CAI will be submitted to the IEEE Xplore® Digital Library. Selected high-quality papers will be further invited for submission to a journal special issue.

CO-SPONSORS:

COMMITTEE INFO:
General Co-Chairs: Prof Ivor Tsang, Prof Yew Soon Ong, and Prof Hussein Abbass