Computer and Information Science 9783031080203, 9783031121265, 9783031121272

English Pages [224] Year 2023

Table of contents :
Foreword
Contents
Contributors
symKrypt: A Lightweight Symmetric-Key Cryptography for Diverse Applications
1 Introduction
2 Proposed Systems
2.1 Preliminary
2.2 Description
2.3 Encryption Process
2.4 Decryption Process
2.5 Sequence of Messages' Blocks
2.6 Encryption of the Entire Message as a Single Block
2.7 True-Random Number Generator
2.8 Pseudo-random Number Generator
3 Analysis
3.1 Time Complexity
3.2 Correctness of SymKrypt
3.3 Brute-Force Attacks
3.4 Birthday Attacks
3.5 Cryptanalysis Attacks
3.6 Attacks Analysis
3.7 Dictionary Attacks
3.8 Attacks Analysis
4 Experimental Results
4.1 Cryptography Testing
4.2 Randomness Testing
4.3 Pseudo-random Number Generator
5 Discussion
6 Conclusion
References
PK-BERT: Knowledge Enhanced Pre-trained Models with Prompt for Few-Shot Learning
1 Introduction
2 Related Work
2.1 Pre-trained Language Model for Few-Shot Learning
2.2 Pre-trained Language Model with Prompt
2.3 Knowledge Enhanced Pre-trained Language Model
3 Methodology
3.1 Notation
3.2 Dataset and Benchmark
3.3 Prompt
3.4 Representation
3.5 Masked Language Modelling for Few-Shot Learning
4 Experiments
4.1 Experimental Principle
4.2 Experimental Parameters
4.3 Experimental Result
4.4 Confusion Matrix
4.5 Sample Setting and Baseline Models
4.6 Comparison of Results
5 Discussion
6 Conclusions
References
Typhoon Track Prediction Based on TimeForce CNN-LSTM Hybrid Model
1 Introduction
2 TF-CNN-LSTM Model
2.1 CNN Model
2.2 TimeForce Module
2.3 LSTM Model
3 Experiment
3.1 Experimental Data
3.2 Accuracy Evaluation Index
3.3 Analysis of Experimental Results
3.4 Case Verification
4 Discussion
5 Conclusion
References
The Novel Characterizing Method of Collective Behavior Pattern in PSO
1 Introduction
2 Related Work
2.1 The Original PSO Algorithm
2.2 Swarm State Division Based on the Evolutionary Factor
2.3 Swarm State Division Based on Order Parameter
3 Our Work
3.1 Visualization of Swarm Behavior Pattern
3.2 Swarm Trend Factor of PSO
4 Experimental Results
4.1 Description of Algorithm Performance by Swarm Trend Factor
4.2 Comparison with Evolutionary Factor
5 Conclusions
References
Research on Box Office Prediction of Commercial Films Based on Internet Search Index and Multilayer Perceptron
1 Introduction
2 Data Collection and Forecasting Methods
2.1 Baidu Index and Google Trends
2.2 Data Collection
2.3 Forecasting Methods
3 Results and Findings
3.1 Results
3.2 Findings
4 Conclusion
References
A DCRC Model for Text Classification
1 Introduction
2 Related Work
2.1 CNN Model
2.2 BiLSTM Model
2.3 BiGRU Model
3 Problem Definition and Proposed Work
3.1 BiGRU-CNN Based Text Feature Acquisition and Context Dependent Learning
3.2 Mitigation of Feature Loss Problems Based on BiLSTM
3.3 Overview of the DCRC Model
4 Experiments
4.1 Datasets
4.2 Evaluation
4.3 Experiment Results
5 Conclusion
References
Hierarchical Medical Classification Based on DLCF
1 Introduction
2 Background and Related Work
2.1 CNN Model
2.2 LSTM
2.3 Random Forest
3 Problem Definition and Proposed Work
3.1 Introduction to the DLC Layer
3.2 Introduction to R Layer
3.3 Overview of DLCR Model
4 Experiments
4.1 Datasets
4.2 Baselines and Evaluation
5 Conclusion and Future Work
References
Noise Detection and Classification in Chagasic ECG Signals Based on One-Dimensional Convolutional Neural Networks
1 Introduction
2 Related Work
3 Materials and Methods
3.1 Dataset and Pre-processing
3.2 1D CNN Design
4 Experimental Results
5 Conclusion
References
Based on the Analysis of Interrelation Between Parallel Distributed Computer System and Network
1 Introduction
2 Combined with the Interconnection of Network Resources, the Time-Sharing System of Computer Is Analyzed
3 Analyze the Processor Inside the Computer, Extending to Distributed Network Analysis
3.1 Analysis of Distributed Treatment
3.2 Peripheral Multi-machine System
3.3 Relationship Between Distribution Processing and Network
4 Analysis of Computer System Parallelism, Extending to Distributed Network Systems
4.1 Analysis of Computer Space Structure
4.2 Computer Structure Combined with the Role of Processor Implementation
4.3 Structure Analysis of Distributed Computer System
4.4 The Role of Computer Systems in the Form of Network Distribution
5 Conclusion
References
Improvement of DGA Long Tail Problem Based on Transfer Learning
1 Introduction
2 Related Work
2.1 Long Tail
2.2 DGA Detection
3 Methodology
3.1 Long Tail
3.2 Data Pre-processing
3.3 Transfer Learning
3.4 Data Balanced Review
4 Experiments
4.1 Dataset
4.2 Experimental Setup
5 Experimental Results
6 Conclusions
References
A Phonetics and Semantics-Based Chinese Short Text Fusion Algorithm
1 Introduction
2 Method
2.1 Background
2.2 Feature Construction
2.3 Training
2.4 Similarity Calculation and Features Fusion
3 Experiments and Result
3.1 Evaluation Setting
3.2 Comparison Result
3.3 Component Analysis
4 Conclusion
References
Feature Extension for Chinese Short Text Based on Tongyici Cilin
1 Introduction
2 Method
2.1 Compute Surface Similarity
2.2 Segment Words and Extract Major Differences in Sentences
2.3 Similarity Calculation of Major Difference Components
2.4 Similarity Calculation of Major Difference Components
3 Experiment and Analysis
3.1 Dataset
3.2 Evaluation Setting
3.3 Criteria
3.4 Comparison Result
4 Conclusion
References
Task-Level Consistency Semi-supervised Based Domain Adaptation for Lung Nodules Segmentation
1 Introduction
2 Related Work
2.1 nnU-Net
2.2 Dual-Task Consistency (DTC)
3 Method
3.1 Dataset
3.2 DTC Semi-supervised Based Domain Adaptation
3.3 Dual-Task Deep Supervision
4 Evaluation Metrics
4.1 Dice Similarity Coefficient
4.2 Precision and Recall
5 Experiments and Results
5.1 Experimental Platform
5.2 Applying nnU-Net
5.3 Result
6 Conclusion
References
Malaria Blood Smears Object Detection Based on Convolutional DCGAN and CNN Deep Learning Architectures
1 Introduction
2 Literature Review
3 Materials and Methods
3.1 Convolutional Neural Networks
3.2 Generative Adversarial Networks
3.3 The Dataset
3.4 Proposed Method
4 Results and Discussion
4.1 DCGAN Results
4.2 Hyper Parameters Tuning
5 Conclusion
References
Author Index

Studies in Computational Intelligence 1055

Roger Lee   Editor

Computer and Information Science

Studies in Computational Intelligence Volume 1055

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. This series also publishes Open Access books. A recent example is the book Swan, Nivel, Kant, Hedges, Atkinson, Steunebrink: The Road to General Intelligence, https://link.springer.com/book/10.1007/978-3-031-08020-3. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Editor Roger Lee ACIS Headquarters Mount Pleasant, MI, USA

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-031-12126-5 ISBN 978-3-031-12127-2 (eBook) https://doi.org/10.1007/978-3-031-12127-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

The purpose of the 22nd IEEE/ACIS International Conference on Computer and Information Science (ICIS 2022) held on June 26–28, 2022, in Zhuhai, China, was to bring together researchers, scientists, engineers, industry practitioners, and students to discuss, encourage, and exchange new ideas, research results, and experiences on all aspects of computer and information science, and to discuss the practical challenges encountered along the way and the solutions adopted to solve them. The conference organizers have selected the best 14 papers from those papers accepted for presentation at the conference in order to publish them in this volume. The papers were chosen based on review scores submitted by members of the program committee and underwent further rigorous rounds of review. In Chapter “symKrypt: A Lightweight Symmetric-Key Cryptography for Diverse Applications,” Ripon Patgiri proposed an algorithm which uses multiple private keys to encrypt a single block of a message. To generate the private keys, they proposed a true-random number generator, called Grando, and a pseudorandom number generator, called Prando. In Chapter “PK-BERT: Knowledge-Enhanced Pre-trained Models with Prompt for Few-Shot Learning,” Han Ma, Benjamin K. Ng, and Chan-Tong Lam proposed PK-BERT—knowledge-enhanced pre-trained models with prompt for few-shot learning. In Chapter “Typhoon Track Prediction Based on TimeForce CNN-LSTM Hybrid Model,” Jiadong Lu, Meixuan Jiang, Yuchen Zhang, Wei Lv, and Tongfei Li proposed a model mixing mechanism based on TF-CNN-LSTM and applied it to typhoon trajectory prediction. In Chapter “The Novel Characterizing Method of Collective Behavior Pattern in PSO,” Xuan Wu, Jiaqi Huang, Xiaoyang Fu, You Zhou, Yanchun Liang, and Chunguo Wu proposed a visualization method of collective behavior patterns based on the velocity field, discovered various collective behavior patterns, and proposed a discriminate index named swarm trend factor. 
In Chapter “Research on Box Office Prediction of Commercial Films Based on Internet Search Index and Multilayer Perceptron,” Te Guo, Chiawei Chu, Junhan Gao, Mengyao Wang, and Wei Lu studied box office prediction for commercial films based on the Internet search index and a multilayer perceptron. In Chapter “A DCRC Model for Text Classification,” Zhaoquan Hao, Jiangyong Jin, Shengbin Liang, Suying Cheng, and Yanqing Shen proposed a BiGRU-CNN-BiLSTM model (DCRC model) based on CNN, GRU, and LSTM, which is trained and validated on the THUCNews and Toutiao News datasets. In Chapter “Hierarchical Medical Classification Based on DLCF,” Mingyuan Yao, Haoran Sun, Shengbin Liang, Yanqing Shen, and Niki Yukie tried to solve the problems of medical classification by using LSTM-CNN. In Chapter “Noise Detection and Classification in Chagasic ECG Signals Based on One-Dimensional Convolutional Neural Networks,” Weslley Lioba Caldas, João Paulo do Vale Madeiro, Roberto Coury Pedrosa, João Paulo Pordeus Gomes, Wencai Du, and João Alexandre Lobo Marques discussed noise detection and classification in chagasic ECG signals based on one-dimensional convolutional neural networks. In Chapter “Based on the Analysis of Interrelation Between Parallel Distributed Computer System and Network,” Tingrui Yang analyzed the time-sharing system and its network connections and, by examining the processors inside the computer and the principles of parallel computer systems, extended the analysis to distributed network systems. In Chapter “Improvement of DGA Long Tail Problem Based on Transfer Learning,” Baoyu Fan, Yue Liu, and Laurie Cuthbert proposed an effective knowledge transfer DGA detection model that transfers the knowledge learned in the previous stage of training to the next stage and reduces the impact of the long tail problem on the detection model.
In Chapter “A Phonetics and Semantics-Based Chinese Short-Text Fusion Algorithm,” Yuchao Jiang, Xinru Li, Chuying Huang, Wei Lv, and Minghe Xu conducted an in-depth study on the unsupervised Chinese short-text similarity algorithm and proposed a fusion algorithm based on phonetics and semantics. In Chapter “Feature Extension for Chinese Short Text Based on Tongyici Cilin,” Chuying Huang, Xinru Li, Yuchao Jiang, Wei Lv, and Minghe Xu adopted a feature extension algorithm for short texts based on an external thesaurus, Tongyici Cilin (extended). The purpose is to solve the feature sparseness problem of Chinese short-text feature vectors. In Chapter “Task-Level Consistency Semi-supervised Based Domain Adaptation for Lung Nodules Segmentation,” Yifan Zeng, Aohui Pang, Wei Lv, and Xiaolin Zhu proposed an out-of-the-box semi-supervised domain adaptation framework, DTC-nnU-Net, which uses dual-task consistency. In Chapter “Malaria Blood Smears Object Detection Based on Convolutional DCGAN and CNN Deep Learning Architectures,” Francisco Nauber Bernardo Gois, João Alexandre Lobo Marques, Allberson Bruno de Oliveira Dantas, Márcio Costa Santos, José Valdir Santiago Neto, José Antônio Fernandes de Macêdo, Wencai Du, and Ye Li proposed automating the diagnosis process with an intelligent system capable of recognizing malaria parasites, which could aid in the early treatment of malaria.

It is our sincere hope that this volume provides stimulation and inspiration and that it will be used as a foundation for works to come.

June 2022

Prof. Jixin Ma, University of Greenwich, London, UK
Prof. Wencai Du, University of Saint Joseph, Macau, China
Prof. Wei Lu, Zhuhai College of Science and Technology, Zhuhai, China

Contents

symKrypt: A Lightweight Symmetric-Key Cryptography for Diverse Applications
Ripon Patgiri

PK-BERT: Knowledge Enhanced Pre-trained Models with Prompt for Few-Shot Learning
Han Ma, Benjamin K. Ng, and Chan-Tong Lam

Typhoon Track Prediction Based on TimeForce CNN-LSTM Hybrid Model
Jiadong Lu, Meixuan Jiang, Yuchen Zhang, Wei Lv, and Tongfei Li

The Novel Characterizing Method of Collective Behavior Pattern in PSO
Xuan Wu, Jiaqi Huang, Xiaoyang Fu, You Zhou, Yanchun Liang, and Chunguo Wu

Research on Box Office Prediction of Commercial Films Based on Internet Search Index and Multilayer Perceptron
Te Guo, Chiawei Chu, Junhan Gao, Mengyao Wang, and Wei Lu

A DCRC Model for Text Classification
Zhaoquan Hao, Jiangyong Jin, Shengbin Liang, Suying Cheng, and Yanqing Shen

Hierarchical Medical Classification Based on DLCF
Mingyuan Yao, Haoran Sun, Shengbin Liang, Yanqing Shen, and Niki Yukie

Noise Detection and Classification in Chagasic ECG Signals Based on One-Dimensional Convolutional Neural Networks
Weslley Lioba Caldas, João Paulo do Vale Madeiro, Roberto Coury Pedrosa, João Paulo Pordeus Gomes, Wencai Du, and João Alexandre Lobo Marques

Based on the Analysis of Interrelation Between Parallel Distributed Computer System and Network
Tingrui Yang

Improvement of DGA Long Tail Problem Based on Transfer Learning
Baoyu Fan, Yue Liu, and Laurie Cuthbert

A Phonetics and Semantics-Based Chinese Short Text Fusion Algorithm
Yuchao Jiang, Xinru Li, Chuying Huang, Wei Lu, and Minghe Xu

Feature Extension for Chinese Short Text Based on Tongyici Cilin
Chuying Huang, Xinru Li, Yuchao Jiang, Wei Lv, and Minghe Xu

Task-Level Consistency Semi-supervised Based Domain Adaptation for Lung Nodules Segmentation
Yifan Zeng, Aohui Pang, Wei Lv, and Xiaolin Zhu

Malaria Blood Smears Object Detection Based on Convolutional DCGAN and CNN Deep Learning Architectures
Francisco Nauber Bernardo Gois, João Alexandre Lobo Marques, Allberson Bruno de Oliveira Dantas, Márcio Costa Santos, José Valdir Santiago Neto, José Antônio Fernandes de Macêdo, Wencai Du, and Ye Li

Author Index

Contributors

Caldas Weslley Lioba, Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
Cheng Suying, 3rd Branch, China Petroleum Pipeline Engineering Co. Ltd., Zhengzhou, China
Chu Chiawei, Faculty of Data Science, City University of Macau, Macao, China
Cuthbert Laurie, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
de Macêdo José Antônio Fernandes, Science Center, Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
de Oliveira Dantas Allberson Bruno, Shenzhen Institutes of Advanced Technology/Chinese Academy of Sciences, Shenzhen, China
do Vale Madeiro João Paulo, Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
Du Wencai, Institute of Data Engineering and Sciences, University of Saint Joseph, Macau SAR, China
Fan Baoyu, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
Fu Xiaoyang, School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
Gao Junhan, School of Aliyun Big Data Applications, Zhuhai College of Science & Technology, Zhuhai, China
Gois Francisco Nauber Bernardo, University of Saint Joseph, Macau SAR, China; Controllership and General Ombudsman of the State of Ceara, Fortaleza, Ceara, Brazil
Gomes João Paulo Pordeus, Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil; Institute of Data Engineering and Sciences, University of Saint Joseph, Macau SAR, China; Laboratory of Applied Neurosciences, University of Saint Joseph, Macau SAR, China
Guo Te, School of Aliyun Big Data Applications, Zhuhai College of Science & Technology, Zhuhai, China
Hao Zhaoquan, School of Software, Henan University, Zhengzhou, China
Huang Chuying, School of Aliyun Big Data Applications, Zhuhai College of Science and Technology, Zhuhai, China
Huang Jiaqi, School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
Jiang Meixuan, Alibaba Cloud Big Data Application College, Zhuhai College of Science and Technology, Zhuhai, China
Jiang Yuchao, School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
Jin Jiangyong, School of Software, Henan University, Zhengzhou, China
Lam Chan-Tong, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
Liang Shengbin, Institute for Data Engineering and Science, University of Saint Joseph, Macau, China
Liang Yanchun, School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
Li Tongfei, Alibaba Cloud Big Data Application College, Zhuhai College of Science and Technology, Zhuhai, China
Li Xinru, Meituan Select, Beijing Sankuai Technology Co., Ltd., Beijing, China; Department of Meituan Select, Beijing Sankuai Technology Co., Ltd, Beijing, China
Li Ye, Shenzhen Institutes of Advanced Technology/Chinese Academy of Sciences, Shenzhen, China
Liu Yue, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
Lu Jiadong, Alibaba Cloud Big Data Application College, Zhuhai College of Science and Technology, Zhuhai, China
Lu Wei, School of Aliyun Big Data Applications, Zhuhai College of Science & Technology, Zhuhai, China
Lv Wei, School of Alibaba Cloud Big Data Application, Zhuhai College of Science and Technology, Zhuhai, China; Alibaba Cloud Big Data Application College, Zhuhai College of Science and Technology, Zhuhai, China
Ma Han, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
Marques João Alexandre Lobo, Laboratory of Applied Neurosciences, University of Saint Joseph, Macau SAR, China
Neto José Valdir Santiago, Repair Pricer, Austin, TX, USA
Ng Benjamin K., Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
Pang Aohui, Department of Alibaba Cloud Big Data Application, Zhuhai College of Science and Technology, Zhuhai, China
Patgiri Ripon, National Institute of Technology Silchar, Silchar, India
Pedrosa Roberto Coury, Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
Santos Márcio Costa, Department of Computer Science, Federal University of Ceará, Russas, Brazil
Shen Yanqing, School of Software, Henan University, Zhengzhou, China; Zhongyuan Wuhu Research Institute, Henu University, Kaifeng, China
Sun Haoran, School of Software, Henan University, Kaifeng, China
Wang Mengyao, School of Aliyun Big Data Applications, Zhuhai College of Science & Technology, Zhuhai, China
Wu Xuan, College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
Wu Chunguo, College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
Xu Minghe, School of Aliyun Big Data Applications, Zhuhai College of Science and Technology, Zhuhai, China
Yang Tingrui, School of Software, Henan University, Kaifeng, China
Yao Mingyuan, School of Software, Henan University, Kaifeng, China
Yukie Niki, School of Foreign Languages, Henan University, Kaifeng, China
Zeng Yifan, Department of Alibaba Cloud Big Data Application, Zhuhai College of Science and Technology, Zhuhai, China
Zhang Yuchen, Alibaba Cloud Big Data Application College, Zhuhai College of Science and Technology, Zhuhai, China
Zhou You, College of Computer Science and Technology, Jilin University, Changchun, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
Zhu Xiaolin, Faculty of Data Science, City University of Macau, Macau, China

symKrypt: A Lightweight Symmetric-Key Cryptography for Diverse Applications

Ripon Patgiri

Abstract Symmetric-key cryptography is used widely due to its capability to provide a strong defense against diverse attacks; however, it is prone to cryptanalysis attacks. Therefore, we propose a novel and highly secure symmetric-key cryptography, symKrypt for short, to defend against diverse attacks and provide tighter security than conventional cryptography. Our proposed algorithm uses multiple private keys to encrypt a single block of a message. To generate the private keys, we also propose a true-random number generator, called Grando, and a pseudo-random number generator, called Prando. Moreover, symKrypt keeps secret how the bits of the original message are mixed with the private keys, as well as the number of private keys. In addition, the private keys are generated dynamically from the initial inputs using a pseudo-random number generator that is highly unpredictable and secure. In this paper, we theoretically analyze the capabilities of symKrypt and provide an experimental demonstration using millions of private keys to show its correctness. Furthermore, we evaluate the proposed pseudo-random number generator experimentally with the NIST SP 800-22 statistical test suite. Our proposed random number generators, Grando and Prando, pass all 15 tests in the NIST SP 800-22 test suite. To the best of our knowledge, symKrypt is the first model to use multiple private keys in encryption while remaining lightweight and powerful.

Keywords Encryption · Cryptography · Symmetric cryptography · Elliptic-curve Diffie-Hellman cryptography · True random number generator · Pseudo-random number generator · Security · Security protocol · Computer networking

R. Patgiri (B), National Institute of Technology Silchar, Silchar, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_1

1 Introduction

Symmetric-key cryptography is the most secure cryptography protocol. Therefore, there are diverse variants of symmetric-key cryptography, particularly Twofish [1], Serpent, AES (Rijndael) [2], Salsa20 [3], ChaCha20 [3], Blowfish, Kuznyechik, DES, 3DES, Skipjack, and IDEA [4]. Diverse new variants are suggested by many researchers [5–10]. Also, diverse platforms are available, which requires modification of symmetric-key cryptography techniques [11–13]. Recent analysis of symmetric-key cryptography suggests many possible attacks [14–18]. Because there are diverse attacks on symmetric-key cryptography, an algorithm that shows strong resistance against these attacks is needed. A general-purpose symmetric-key cryptography that can be applied in diverse domains, for instance, the Internet of Medical Things, is also needed. Moreover, most modern devices require lightweight symmetric-key cryptography, particularly for securing small IoT devices [19]. Edge Computing is also emerging, and it brings diverse cryptographic requirements. For instance, Edge Nodes and Cloud Computing can communicate using cryptography with large key sizes, but Edge Devices range from low-powered through mid-powered to high-powered, and a block cipher with a large key size becomes expensive for low-powered devices. The key size requirements therefore range from 16 bits to 2048 bits or even more. For instance, smartphones are highly capable devices, but wearable devices are not. Consequently, a general-purpose symmetric-key cryptography is needed that provides tighter security than conventional symmetric-key cryptography and satisfies all of these requirements. Apart from the issues and demands cited above, there are various attacks on symmetric-key cryptography due to single-keyed encryption, for instance, cryptanalysis attacks. We therefore propose a general-purpose symmetric-key cryptography algorithm called symKrypt, which is a straightforward and lightweight yet very powerful solution to the above issues.
symKrypt relies on any one of the following for key sharing: the Diffie-Hellman algorithm [20], Elliptic-curve cryptography [21, 22], or the Elliptic-curve Diffie-Hellman (ECDH) algorithm [23], through which symKrypt computes two secret keys, namely, the shared secret key and the shared secret seed value. Once the secret keys are computed securely, symKrypt does not require a key exchange algorithm for the rest of the session. The secret keys are used only to generate the private keys; they are not used directly in encryption. symKrypt uses multiple private keys in each round of encryption of a block of a message. In other words, it uses multiple private keys in the encryption of each block of a message, as demonstrated in Fig. 1 using color. The colors are the private keys, and the large glass is the mixer of the original message and the private keys. The private keys (the colors in Fig. 1) are computed separately by the sender and the receiver, but both have to maintain the same order of the private keys for each message. The private keys are therefore not shared with anyone. Private keys play a vital role in defending against diverse attacks in symmetric-key cryptography. These private keys are generated dynamically using a pseudo-random number generator algorithm that is highly unpredictable for adversaries. Also, symKrypt performs a rotation operation, and it never shares the number of rotations r, the type of rotation (circular shift left or right), or the total number of private keys t. Therefore, symKrypt creates a strong deterrence against attackers. The values of t and r are computed dynamically, and the value of r changes in each iteration. symKrypt also protects the information about whether a rotation is to the left or to the right, and thus the adversaries do not have any clue about the types of the rotations.

Fig. 1 Principle of the symKrypt encryption algorithm. The colored glasses are the generated private keys, and the large water container glass is the original message transforming into ciphertext

Our key contributions are outlined below:
• We propose a novel and highly secure symmetric-key cryptography algorithm, called symKrypt for short, based on dynamic private keys.
• symKrypt changes its private key in each iteration of a block of a message. Also, it changes the private keys in each block of a message.
• To generate private keys, it requires a true-random number generator (TRNG) algorithm for the initial key agreement protocol, such as ECDH [23]. We propose a TRNG algorithm, called Grando, which is derived from Rando [24]. Grando provides higher performance than Rando in all aspects. Moreover, there are two variants of Grando, namely, GrandoM and GrandoX. Our experimental results show the superiority of GrandoX over GrandoM. Therefore, symKrypt uses GrandoX for its encryption and decryption.
• The private keys are generated by a pseudo-random number generator which is highly unpredictable. The shared secret key and shared secret seed value are replaced with the private keys.
• The total number of private keys is kept secret. Moreover, the rotation information is kept secret, and it is dynamically generated.
• We propose a pseudo-random number generator to generate the private keys, called Prando, which relies on non-cryptographic string hash functions. Prando uses two different hash functions, namely, Murmur2 [25] and xxHash [26], to take advantage of both.
• symKrypt demonstrates strong resistance against many attacks, including cryptanalysis attacks.
• symKrypt does not impose a restriction on the number of bits, and therefore, it can be used for diverse applications, for instance, IoT, Edge Computing, and conventional communications.

This article demonstrates the capabilities of symKrypt theoretically and experimentally.
To the best of our knowledge, symKrypt is the first of its kind to use multiple keys in encryption. The changes of the private keys create a resistance


R. Patgiri

against cryptanalysis attacks. This article also demonstrates how symKrypt changes its private keys and generates the dynamic private keys in each round. Moreover, we show how keeping the rotation information and the total number of private keys secret helps in defending against many attacks. The adversaries have no clue for gaining this information. Therefore, symKrypt can provide a strong deterrence to possible attacks on symmetric-key cryptography while remaining lightweight. This article is organized as follows. Section 2 describes the proposed system and elaborates the proposed algorithms in detail. Section 3 analyzes the proposed system theoretically and examines the resistance of symKrypt against attacks. Section 4 evaluates the proposed system experimentally. Section 5 discusses the proposed system. Finally, Sect. 6 concludes the article.

2 Proposed Systems

We propose a novel and highly secure SYMmetric-Key cRYPTography algorithm, symKrypt for short. The key objective of our proposed system is to provide strong resistance against the possible attacks on symmetric-key cryptography. To this end, we make a few assumptions, which are outlined below:
• At the given time, the sender and the receiver must be active.
• Our proposed algorithm relies on secure key agreement protocols, for instance, the Elliptic-curve Diffie-Hellman protocol. Thus, we omit a detailed analysis of the same. Also, we assume that the symmetric-key exchange protocol is secure enough to protect against any attacks on key sharing.
• We assume that the sender and the receiver are valid. Therefore, the man-in-the-middle attack is out of scope. Moreover, our proposed system does not deal with DDoS attacks or DoS attacks (Table 2).

2.1 Preliminary

Table 1 shows the essential parameters of our proposed algorithm and their states. Most of the parameters are kept secret. The shared secret key and the shared secret seed value are kept secret and are used only once, to generate the first bit of the first private key. The key exchange takes place only once. The private keys are changed and generated dynamically, and these private keys are kept secret. Similarly, r and t are kept secret and generated dynamically. Moreover, the rotation decision is also made dynamically.

2.2 Description

Figure 2 demonstrates the abstract architecture of our proposed algorithm, symKrypt. The key idea of the encryption process is to mix the original message with several


Table 1 Parameters and their states of symKrypt
Parameter                                   State
Shared secret key SK                        Secret and static
Shared secret seed value S                  Secret and static
Private key P                               Secret and dynamic
Total number of private keys t              Secret and dynamic
Number of circular shift rotation r         Secret and dynamic
Left or right shift rotation                Secret and dynamic
Bit size of the key β                       Public
Bit size of the message L                   Public
Grando with Murmur hash function            GrandoM
Grando with xxHash hash function            GrandoX
Rando with Murmur hash function             RandoM
Rando with xxHash hash function             RandoX

Table 2 Operator notations table
Operators    Operator description
⊕            Exclusive-OR operator
≪            Bit-wise shift left operator
≫            Bit-wise shift right operator
mod          Modulus operator

Fig. 2 Abstract architecture of symKrypt. The message m is mixed with t externally generated different private keys (P(1≤i≤t) ) to produce ciphertext

randomly generated private keys. We generate a total of t private keys to mix with the original message. Let the original message be the water and the different private keys be the different colors, as shown in Fig. 1. A color is generated in each iteration, poured into the water container, and mixed with it. The process repeats t times. After t iterations, the water container holds differently colored water, that is, the ciphertext. The rotation, rotation types, XORing, and merging operations are discussed in later sections. Let A and B be the sender and receiver, respectively. Let A and B mutually agree on a shared secret key SK and a shared secret seed value S securely using the Diffie-Hellman algorithm or the Elliptic-curve Diffie-Hellman (ECDH) algorithm. The sender A divides the message m into several blocks; let the block size be L bits. The blocks are encrypted and sent to B. SK and S are used to generate several private keys for encryption. Let the private keys be P = {P1, P2, P3, ..., Pt} and the bit size of each private key be β, where β > L. Each Pi is generated using a pseudo-random number generator (see Algorithm 8), where 1 ≤ i ≤ t. The figure demonstrates the encryption process of a particular block of a message. The process, using all these randomly generated private keys, is further described in Eq. (1).

\[
\begin{aligned}
\zeta_1 &= P_1 \oplus m \\
&\quad \mathrm{rotate}(\zeta_1, (P_1 \bmod \beta)) \\
\zeta_2 &= P_2 \oplus \zeta_1 \\
&\quad \mathrm{rotate}(\zeta_2, (P_2 \bmod \beta)) \\
\zeta_3 &= P_3 \oplus \zeta_2 \\
&\quad \mathrm{rotate}(\zeta_3, (P_3 \bmod \beta)) \\
&\;\;\vdots \\
\zeta_t &= P_t \oplus \zeta_{(t-1)} \\
&\quad \mathrm{rotate}(\zeta_t, (P_t \bmod \beta))
\end{aligned}
\tag{1}
\]

ζ is rotated using a circular shift left/right operation depending on the last bit of the private key Pi. The total number of rotations is calculated using the modulus operation, as shown in Eq. (1). The direction of the circular shift and the total number of rotations are kept secret, i.e., the rotation type and the number of rotations are not known to adversaries because these values are calculated dynamically. In the decryption process, the receiver B receives the ciphertext ζ and decrypts it using the secret private keys P, as given in Eq. (2). The sender's private keys must be the same as the receiver's; otherwise, the decryption process fails. Firstly, B derives the total number of rotations and applies to the ciphertext exactly the opposite rotation of the encryption process. Secondly, the rotated ciphertext is decrypted using the private keys, as shown in Eq. (2).

\[
\begin{aligned}
&\quad \mathrm{rotate}(\zeta_t, (P_t \bmod \beta)) \\
\zeta_{(t-1)} &= P_t \oplus \zeta_t \\
&\quad \mathrm{rotate}(\zeta_{(t-1)}, (P_{(t-1)} \bmod \beta)) \\
\zeta_{(t-2)} &= P_{(t-1)} \oplus \zeta_{(t-1)} \\
&\quad \mathrm{rotate}(\zeta_{(t-2)}, (P_{(t-2)} \bmod \beta)) \\
\zeta_{(t-3)} &= P_{(t-2)} \oplus \zeta_{(t-2)} \\
&\;\;\vdots \\
&\quad \mathrm{rotate}(\zeta_1, (P_1 \bmod \beta)) \\
m &= P_1 \oplus \zeta_1
\end{aligned}
\tag{2}
\]
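The rounds of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names `rotl`, `rotr`, `encrypt_block`, and `decrypt_block` are hypothetical, the private keys are passed in as a ready-made list of integers smaller than 2^β, and the rotation direction is taken from the last bit of each key as described in the text.

```python
def rotl(x, r, bits):
    """Circular left shift of a bits-wide unsigned integer by r positions."""
    r %= bits
    mask = (1 << bits) - 1
    return ((x << r) | (x >> (bits - r))) & mask


def rotr(x, r, bits):
    """Circular right shift of a bits-wide unsigned integer by r positions."""
    r %= bits
    mask = (1 << bits) - 1
    return ((x >> r) | (x << (bits - r))) & mask


def encrypt_block(m, keys, bits):
    """Eq. (1): XOR with each key, then rotate by (key mod bits)."""
    z = m
    for p in keys:
        z ^= p
        r = p % bits
        # Odd key: rotate left; even key: rotate right.
        z = rotl(z, r, bits) if p & 1 else rotr(z, r, bits)
    return z


def decrypt_block(z, keys, bits):
    """Eq. (2): undo the rounds in reverse order, rotation first, then XOR."""
    for p in reversed(keys):
        r = p % bits
        z = rotr(z, r, bits) if p & 1 else rotl(z, r, bits)
        z ^= p
    return z
```

Undoing the rotation before the XOR in each decryption round is what makes the reverse pass an exact inverse of the forward pass.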


The decryption process is the opposite of the encryption process. Therefore, the private keys are XORed with the ciphertext in descending order.

Algorithm 1 Encryption algorithm for message m. K and κ are the secret keys, and β is the bit size of each private key. K is input into Prando() as the string value and κ is input into Prando() as the seed value.
procedure symEnc(m, K, κ, β)
    t = (K mod β) + c                    ▷ Calculation of the total number of iterations for the algorithm
    ζ = m
    L = Length(m)
    P1 = Prando(K, κ, β)
    P1 = Merge(m, L, P1, β)              ▷ Condition: β > L
    i = 2
    while i ≤ t do
        Pi = Prando(P(i−1), κ, β)        ▷ Generating private key
        κ = Prando(Pi, κ, β)
        r = Pi mod β
        ζ = ζ ⊕ Pi
        if Pi ∧ 1 = 1 then
            ζ = CircularRotateLeft(ζ, r)
        else
            ζ = CircularRotateRight(ζ, r)
        end if
        i = i + 1
    end while
    Ciphertext ζ generated successfully
    return ζ
end procedure

2.3 Encryption Process

The sender A wants to send a message to the receiver B, and therefore, both parties agree on the shared secret key SK and the shared secret seed value S, where SK ≠ S. symKrypt computes the two secret keys through a key agreement protocol, either the ECDH [23] or the ECC [21] protocol. These key agreement protocols require random numbers generated by a true random number generator such that these numbers are highly unpredictable and secure. We propose a TRNG based on Rando [24], called Grando, which enhances the performance and randomness of the Rando algorithm. Our proposed algorithm, Grando (see Algorithm 7), is highly secure and unpredictable, which is supported by a series of experiments. Grando is an incremental enhancement of the Rando algorithm [24]. Algorithm 1 requires the original message m, the secret key K and the secret seed value κ, which are generated using SK and S, and the bit size of encryption β as the input parameters. The necessary condition is the bit size of β must be equal to or


greater than m, and β can be of any size (β ≥ m, or 128 bits by default; the size of β can be 16 bits, 32 bits, 64 bits, 128 bits, 256 bits, 512 bits, and so on, as per the requirement of the user’s application). There are no restrictions on bit sizes. The bit size β is public. Let t and r be the total number of iterations and the total number of bit rotations, respectively. Algorithm 1 calculates t = (K mod β) + c and r = P mod β, which are not known to adversaries. The constant c = 2 means that at least 2 rounds of XOR and rotation operations need to be performed, and it is made public. Also, the value of c can be adjusted; for instance, if c = 5, then at least five rounds of XOR operations need to be performed between the original message and the five private keys. Alternatively, t represents the total number of private keys to be used in encryption or decryption. The private keys are generated using a pseudo-random number generator (PRNG) algorithm. Moreover, the seed value is changed to the generated private key in each iteration. The generated private key P is XORed with the original message m to form the ciphertext ζ. The ciphertext ζ is rotated by a circular shift left/right r times in each iteration. If the generated private key is odd, a circular shift left operation is performed; otherwise, a circular shift right operation is performed. The adversary is unable to know whether to perform a circular shift right or left. On the successful completion of encryption, the sender A sends the ciphertext ζ to B over an insecure channel.

Algorithm 2 Initial merging of the original message m with the dynamically generated key P.
procedure Merge(m, L, P, β)              ▷ Condition: β > L
    l = P mod (β − L)
    m = m ≪ l
    P = m ⊕ P
    return P
end procedure
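The round parameters described in this section can be illustrated with a short sketch. The helper name `round_parameters` is hypothetical, and K and P are assumed to be already available as non-negative integers; only the formulas t = (K mod β) + c and r = P mod β, and the odd/even rule for the rotation direction, come from the text.

```python
def round_parameters(K, P, beta, c=2):
    """Return (t, r, direction) for a secret key K and a private key P."""
    t = (K % beta) + c          # total number of rounds / private keys
    r = P % beta                # number of positions to rotate in this round
    direction = "left" if P & 1 else "right"   # odd key: left, even key: right
    return t, r, direction
```

For example, with β = 128 and c = 2, a key K = 1000 gives t = (1000 mod 128) + 2 = 106 rounds.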

2.4 Decryption Process

The original message disappears from the encrypted message. The message is altered by several pseudo-random numbers or private keys. Therefore, the encrypted message does not contain the original message. Consequently, no permutation or combination can reproduce the original message. In the decryption process, the original message is reconstructed using the same private keys. Thus, a wrong sequence of private keys cannot reconstruct the original message. Moreover, the private keys are generated using a pseudo-random number generator. The pseudo-random number generator algorithm requires the shared secret key SK and the shared secret seed value S to regenerate the set of private keys.


Algorithm 3 Decryption algorithm for ciphertext ζ. It is a reversal process of Algorithm 1.
procedure symDec(ζ, K, κ, β)
    t = (K mod β) + c
    m = ζ
    i = 1
    while i ≤ t do
        Pi = Prando(K, κ, β)
        κ = Pi
        i = i + 1
    end while
    i = t
    while i ≥ 1 do
        r = Pi mod β
        if Pi ∧ 1 = 1 then
            ζ = CircularRotateRight(ζ, r)
        else
            ζ = CircularRotateLeft(ζ, r)
        end if
        m = m ⊕ Pi
        i = i − 1
    end while
    m = UnMerge(ζ, L, β)
    Write message m
    return m
end procedure

The receiver B receives the ciphertext ζ and decrypts it using Algorithm 3. Algorithm 3 is similar to Algorithm 1 except for the rotation operation and its order. In the encryption process, the circular shift rotation is performed in each iteration after the XOR operation. In contrast, in the decryption process, the circular shift rotation is performed in each iteration before the XOR operation. Moreover, the rotation directions must be opposite to each other; for instance, if encryption performs a circular shift rotate left, then decryption has to perform a circular shift rotate right, depending on the private key. In short, the decryption operation has to perform the encryption operations in reverse order.

Algorithm 4 Reversal of the initial merging of message m with the dynamically generated key P.
procedure UnMerge(ζ, L, β)
    l = P mod (β − L)
    ζ = ζ ≫ l
    return ζ
end procedure
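The merging step of Algorithm 2 and its reversal can be sketched as below. The names `merge` and `unmerge` are hypothetical, and the sketch makes one assumption that differs from the listing of Algorithm 4: the XOR with P is undone inside `unmerge` itself, so that the pair is a self-contained round trip, whereas in the listing the XOR is left to the decryption loop. The receiver is assumed to regenerate the same P.

```python
def merge(m, L, P, beta):
    """Shift the L-bit message left by l = P mod (beta - L), then XOR with P."""
    if beta <= L:
        raise ValueError("beta must be greater than L")
    l = P % (beta - L)
    return (m << l) ^ P


def unmerge(merged, L, P, beta):
    """Inverse of merge: remove P, then shift right by the same l."""
    if beta <= L:
        raise ValueError("beta must be greater than L")
    l = P % (beta - L)
    return (merged ^ P) >> l
```

Because l is always smaller than β − L, the shifted message still fits in β bits, which is why the condition β > L is required.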


2.5 Sequence of Messages’ Blocks

Algorithm 5 Encryption of the blocks of messages by symKrypt. SK and S are the shared secret keys. The message m[ψ] is a set of messages to be encrypted.
procedure symKryptEnc(m[ψ], SK, S, β)
    i = 1
    K = SK
    κ = S
    while i ≤ ψ do
        l = Length(K)
        K = Prando(K, l, κ)
        κ = Prando(K, l, κ)              ▷ The values of K and κ are altered in each iteration
        ζi = symEnc(mi, K, κ, β)
        i = i + 1
    end while
end procedure

Algorithm 6 Decryption of the blocks of ciphertexts by symKrypt. The ciphertext ζ[ψ] is a set of ciphertexts to be decrypted.
procedure symKryptDec(ζ[ψ], SK, S, β)
    i = 1
    K = SK
    κ = S
    while i ≤ ψ do
        l = Length(K)
        K = Prando(K, l, κ)
        κ = Prando(K, l, κ)              ▷ The values of K and κ are altered in each iteration
        mi = symDec(ζi, K, κ, β)
        i = i + 1
    end while
end procedure

Algorithms 1 and 3 demonstrate the encryption and decryption of a single block, respectively. Initially, the shared secret key and shared secret seed value are used to generate the first private key. Also, the shared secret key is assigned to K and the shared seed value is assigned to κ. Later, the values of K and κ are altered using Prando() in each iteration. It returns a private key, which is required to encrypt or decrypt the next blocks of messages. Let the blocks of the message to be encrypted be m[ψ] = m1, m2, m3, ..., mψ. Algorithms 5 and 6 demonstrate the encryption and decryption of entire blocks of messages in a communication between A and B, respectively. In encryption or decryption, the private keys are changed in each round or iteration. Moreover, the shared secret key SK and the shared secret seed value S are used only once to generate the first bit of the first private key. Later, the shared


secret key and shared secret seed value are replaced. Moreover, the value of t and r change in each block’s encryption or decryption.
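The per-block update of K and κ in Algorithms 5 and 6 can be sketched as follows. This is only an illustration of the chaining idea: `derive` is a hypothetical stand-in for Prando built on keyed BLAKE2b from the Python standard library, which is not the generator proposed in this article, and `block_keys` is likewise a hypothetical helper name.

```python
import hashlib


def derive(value: bytes, seed: bytes) -> bytes:
    """Stand-in for Prando (an assumption): keyed BLAKE2b, 16-byte output."""
    return hashlib.blake2b(value, key=seed, digest_size=16).digest()


def block_keys(SK: bytes, S: bytes, n_blocks: int):
    """Return the (K, kappa) pair used for each of n_blocks blocks."""
    K, kappa = SK, S
    pairs = []
    for _ in range(n_blocks):
        K = derive(K, kappa)        # new key derived from the previous key
        kappa = derive(K, kappa)    # new seed derived from the new key
        pairs.append((K, kappa))
    return pairs
```

Since both parties start from the same SK and S and apply the same deterministic updates, the sender and the receiver obtain the same sequence of per-block values without exchanging anything further.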

2.6 Encryption of the Entire Message as a Single Block

Apart from chunking the original message, the entire message can be treated as a single block. In that case, ψ is set to 1, i.e., ψ = 1 in both Algorithms 5 and 6. For instance, this is required when encrypting data for storage. Moreover, it can also be used for transmitting the encrypted data to the destination.

2.7 True-Random Number Generator

Algorithm 7 Grando algorithm. A true random number generator using a string hash function.
procedure Grando(m)
    i ← 1, β ← getCPUClock()
    while i ≤ m do
        α ← getCPUClock()
        K ← ConvertToString(α), β ← β ⊕ α
        l ← Length(K)
        δ ← StringHashFunction(K, l, β)
        B[i] ← (δ ∧ 1), i ← i + 1
    end while
end procedure

We enhance the execution performance of Rando [24] by reducing the number of CPU clock readings, and term the result Grando. Algorithm 7 demonstrates the modification of the Rando algorithm. Grando uses a single CPU clock reading operation inside the loop, which reduces the execution time significantly. Algorithm 7 is used to generate the shared secret key SK and the shared secret seed value S. The randomness of Algorithm 7 is evaluated experimentally using the NIST SP 800-22 statistical test suite [27, 28].
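The structure of Algorithm 7 can be sketched in a few lines. This is an illustration under explicit assumptions, not a vetted random number generator: `time.perf_counter_ns()` stands in for the CPU clock reading, and `fnv1a_seeded` is a hypothetical seeded FNV-1a hash used in place of the Murmur2 or xxHash string hash functions, which are not part of the Python standard library.

```python
import time


def fnv1a_seeded(data: bytes, seed: int) -> int:
    """Seeded 32-bit FNV-1a; a stand-in for the string hash function."""
    h = (0x811C9DC5 ^ seed) & 0xFFFFFFFF
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def grando_sketch(m: int) -> list:
    """Collect m bits, one clock reading per iteration as in Algorithm 7."""
    beta = time.perf_counter_ns()          # initial seed from the clock
    bits = []
    for _ in range(m):
        alpha = time.perf_counter_ns()     # single clock reading in the loop
        K = str(alpha).encode()            # clock value as a string
        beta ^= alpha                      # fold the reading into the seed
        delta = fnv1a_seeded(K, beta)
        bits.append(delta & 1)             # keep only the least significant bit
    return bits
```

The quality of the output depends entirely on the timer resolution and the hash, so any such sketch would need to be checked with a statistical test suite before use.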

2.8 Pseudo-random Number Generator

The necessary conditions of the pseudo-random number generator for symKrypt are outlined below:


Algorithm 8 Algorithm for pseudo-random number generation (private key).
procedure Prando(K, S, β)
    i = 0
    l = Length(K)
    b = β
    while β ≥ 1 do
        d = Murmur2(K, l, S)
        S = d
        e = xxHash(K, l, S)
        S = e ⊕ d
        bin[i] = d ∧ 1
        i = i + 1
        β = β − 1
    end while
    return bin
end procedure

• The PRNG must be able to produce a highly random, unpredictable, and cryptographically secure key.
• The PRNG must pass all the 15 tests of NIST SP 800-22.
• The generated random key must be reproducible for the correct inputs.
Prando must satisfy the above conditions to decrypt the encrypted message correctly; otherwise, symKrypt fails. symKrypt heavily depends on a PRNG with the above-cited conditions. Our proposed PRNG, Prando, depends on non-cryptographic string hash functions; the string hash function mixes the bits and produces unpredictable least significant bits (LSBs). Prando uses two different non-cryptographic string hash functions, namely, Murmur2 [25] and xxHash [26]. The LSB is extracted to form a private key. Algorithm 8 iterates β times and forms a private key of β bits. Algorithm 8 takes the shared secret key, the shared secret seed value, and the bit size as the initial inputs. It produces the same output for a given input set at any given time. Nevertheless, the output passes the randomness tests of the NIST SP 800-22 statistical test suite [27, 28].
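The loop of Algorithm 8 can be sketched as below to show the reproducibility requirement. The sketch makes loud assumptions: Murmur2 and xxHash are not in the Python standard library, so a hypothetical seeded FNV-1a (`fnv1a_seeded`) and `zlib.crc32` with a starting value are swapped in as the two hash functions. The output of `prando_sketch` therefore does not match Prando and carries none of its measured randomness properties.

```python
import zlib


def fnv1a_seeded(data: bytes, seed: int) -> int:
    """Seeded 32-bit FNV-1a; stand-in for Murmur2 (an assumption)."""
    h = (0x811C9DC5 ^ seed) & 0xFFFFFFFF
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def prando_sketch(K: bytes, S: int, beta: int) -> int:
    """Build a beta-bit key from the least significant bit of each hash."""
    key = 0
    for i in range(beta):
        d = fnv1a_seeded(K, S)                  # first hash, seeded with S
        S = d
        e = zlib.crc32(K, S & 0xFFFFFFFF)       # second hash (stand-in for xxHash)
        S = e ^ d                               # next seed mixes both outputs
        key |= (d & 1) << i                     # keep the LSB of the first hash
    return key
```

Calling the function twice with the same K, S, and β returns the same key, which is the property the receiver relies on to regenerate the private keys.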

3 Analysis

symKrypt is a symmetric-key cryptography algorithm that depends on symmetric-key exchange algorithms. There are a few symmetric-key exchange algorithms, namely, Diffie-Hellman [20], Elliptic-curve cryptography [21, 22] and ECDH [23]. We choose one of the most secure symmetric-key exchange algorithms, the ECDH algorithm. The security of symKrypt depends on a pseudo-random number generator that generates private keys. Many parameters of symKrypt are kept secret and a few parameters


are made public, which protects against diverse attacks from adversaries. Moreover, Table 1 shows that a minimal number of parameters are made public, preventing the attackers from gaining information on the ciphertext.

3.1 Time Complexity

The time complexity of processing a block of a message in symKrypt is approximately constant. The time complexity of Algorithm 1 depends on the time complexity of Algorithm 8. The time complexity of Algorithm 8 depends on the hash function and the required bit size. The hash function’s time complexity depends on the input string length, for instance, l. Algorithm 8 iterates β times, and hence, the total time complexity of Algorithm 8 is O(β × l). In practical applications, the bit size ranges from 16 to 2048 bits, which is quite small. Also, the string length is similar to the bit size. Therefore, treating β and l as bounded constants, we can rewrite the time complexity of Algorithm 8 as O(1). The total time complexity of Algorithm 1 is O(t × β × l). In a practical scenario, the value of t ranges from 10 to 100. Therefore, the total time complexity of Algorithms 1 and 3 is O(1). The time complexity of symKrypt depends on Algorithms 5 and 6, which depend on the number of blocks. The total number of blocks is ψ. Therefore, the total time complexity is O(ψβ × l) ≈ O(ψ).

3.2 Correctness of SymKrypt

We exploit the XOR property to perform encryption. The plaintext m is XORed with several private keys, and several rotation operations are also performed. The XOR operation produces zero if two bits are equal; otherwise, it produces one. Similarly, the XOR operation produces zero for two identical keys. Therefore, symKrypt can produce zero in the encryption process to transmit to the receiver. The zero value is correct, and it can be sent to the receiver. The receiver can retrieve the original message from the zero value if the shared secret key and shared secret seed value are correct. It is also possible that symKrypt produces ‘1’ in all bit fields. Moreover, symKrypt can also produce a single-digit or two-digit output in the encryption process. symKrypt can produce any output in encryption, and the original message can be decrypted from the ciphertext in any of these conditions. The message m is XORed with the set of keys P. For example, Eq. (3) encrypts the message m.

\[
\zeta = m \oplus P_1 \tag{3}
\]

Equation (4) decrypts the message encrypted by Eq. (3).

\[
m = \zeta \oplus P_1 \tag{4}
\]


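Let us assume that the encryption process uses multiple keys as shown in Eq. (5).

\[
\begin{aligned}
\zeta &= m \oplus P_1 \oplus P_2 \oplus P_3 \oplus \ldots \oplus P_t \\
\zeta &= m \oplus (P_1 \oplus P_2 \oplus P_3 \oplus \ldots \oplus P_t) \\
\zeta &= m \oplus (\text{equivalent to a single key})
\end{aligned}
\tag{5}
\]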

Equation (5) shows that encrypting a message with multiple keys cannot protect against attackers if we do not mix the message with the private keys properly. A ciphertext produced as in Eq. (5) is easy to decrypt even for a novice attacker. Therefore, we address this issue by a rotation in each iteration. The rotation information is kept secret, and therefore, the adversaries have no way to trace back the original message from the ciphertext. Only the intended user can decrypt the original message, even if the encryption process produces a single or double-digit number. Moreover, the correctness of symKrypt also depends on the rotation. Let m be rotated by a circular shift left r times. It requires a circular shift right by r times to produce m correctly. The value of r is dynamic and changes in each iteration. In addition, the left/right rotation is also dynamic and depends on the private keys. The exact reverse order of the encryption process decrypts the original message from the ciphertext. Therefore, only the intended user can decrypt the original message. The rotation process creates a strong resistance against attacks. Moreover, the value of t also provides a good defense against attacks.
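The collapse described by Eq. (5) is easy to check numerically. The sketch below is only an illustration with arbitrary small example values; the helper name `xor_only_encrypt` is hypothetical.

```python
from functools import reduce
from operator import xor


def xor_only_encrypt(m, keys):
    """XOR the message with every key in turn, with no rotation."""
    return reduce(xor, keys, m)


# Arbitrary example values.
keys = [0x3C, 0xA7, 0x51]
m = 0x6B

# All keys fold into one equivalent key, as in the last line of Eq. (5).
single_key = reduce(xor, keys)
ciphertext = xor_only_encrypt(m, keys)
```

Here `ciphertext` equals `m ^ single_key`, so an attacker only has to recover one effective key; interleaving a key-dependent rotation between the XOR steps removes this simple algebraic shortcut.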

3.3 Brute-Force Attacks

A brute-force attack is the most famous attack; however, many symmetric-key cryptographic algorithms have already taken preventive measures to secure communication. Similarly, symKrypt provides a strong resistance to brute-force attacks. The brute-force attackers exhaustively search for the key such that Dec(k, m) = Dec(k′, m). In symKrypt, multiple keys are used to encrypt, and therefore, Dec(k, m) ≠ Dec(k′, m). Let us assume that c = 0 and SK mod β = 0; then the value of t is zero in t = (SK mod β) + c and there is no encryption. A raw message is sent to the receiver. In this case, symKrypt fails. As another instance, if c = 1 and t = 0, then symKrypt also fails. Therefore, we suggest a value of c between 10 and 100 for practical scenarios. In the worst-case scenario, the value of t is 227, where SK mod β = 127 for a 128-bit key size and c = 100. It implies that symKrypt has to perform encryption or decryption using 227 private keys; however, an adversary is unable to know the total number of private keys. In addition, the private keys are highly unpredictable, which also provides a strong deterrence against brute-force attacks. Let us assume that the adversary knows the value of t, which is 10. Let us also assume that the same adversary knows the shared secret key SK. Theoretically, the adversary has most of the information of the communication; however, the adversary


does not know the seed value S. The probability of gaining the correct information about the seed value is $\frac{1}{2^{\beta}}$, where β can be $2^{6}, 2^{7}, 2^{8}, \ldots$ bits and it is made public. Our assumption was that the adversary knows the shared secret key SK. If the adversary does not know SK, then the probability of getting the correct SK is $\frac{1}{2^{\beta}}$. Therefore, the total probability of breaking both pieces of secret information using a brute-force attack (BF) is given in Eq. (6).

\[
\begin{aligned}
\Pr(BF) &= \Pr(SK) \cap \Pr(S) \\
&= \frac{1}{2^{\beta}} \times \frac{1}{2^{\beta}} \\
&= \frac{1}{4^{\beta}}
\end{aligned}
\tag{6}
\]

Pr(BF) is the probability of breaking security using the brute-force method, and the two pieces of secret information are independent of each other. Therefore, the probability of not getting the secret information is $(1 - \frac{1}{4^{\beta}})$. If the adversary knows both pieces of secret information, it is easy to decrypt an encrypted message. We also assumed that the adversary knows the value of t. If the adversary does not know the value of t, this also provides a strong deterrence against attacks. Let us assume that the attacker can match the secret information due to a collision, but the attacker is unable to produce the correct value of t. Let us assume that the value of t ranges from 10 to 137 for c = 10 and β = 128. Thus, the probability to get the correct t value is $\frac{1}{127}$ because c and β are public. Alternatively, the probability is $\frac{1}{\beta - 1}$. Also, the value of r is private, and therefore, the probability to know the total value of r is $\frac{1}{2^{(\beta-1)}}$, where the maximum value of r is (β − 1). Thus, the total probability of breaking symKrypt using brute force is given in Eq. (7).

\[
\begin{aligned}
\Pr(BF) &= \Pr(SK) \cap \Pr(S) \cap \Pr(t) \cap \Pr(r) \\
&= \frac{1}{2^{2\beta}} \times \frac{1}{(\beta - 1)} \times \frac{1}{2^{(\beta-1)}} \\
&\approx \frac{1}{8^{\beta} \times (\beta - 1)}
\end{aligned}
\tag{7}
\]

Therefore, the probability of not being able to break symKrypt is $(1 - \frac{1}{8^{\beta} \times (\beta - 1)})$. Let us assume that the attacker is not interested in attacking the shared secret keys. Let us also assume that the adversary knows the value of t. Therefore, there are t private keys used to encrypt. The probability of knowing a private key is $\frac{1}{2^{\beta}}$. There are t private keys; thus, the total probability of breaking a single block of communication is $\frac{1}{2^{\beta t}}$. Moreover, there are ψ blocks in a message; thus, the total probability of capturing the entire message is $\frac{1}{2^{(\beta t \psi)}}$, which is almost zero. If the adversary does not know the values of t and r, then the total probability of capturing the entire message is $\frac{1}{2^{(\beta t \psi)} \times (\beta - 1) \times 2^{(\beta-1)}}$. Moreover, the values of t and r change in each communication of each block. The


adversary does not know whether to rotate left or right, or how many times to rotate, as shown in Table 1. Therefore, it is easier to attack the Diffie-Hellman algorithm than symKrypt. Hence, symKrypt assumes that the Diffie-Hellman algorithm can provide strong security. Thus, our proposed system is able to provide strong security against attacks because little information about symKrypt is public, as shown in Table 1. Most of the parameters are dynamically generated and kept secret.
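To give a sense of scale for the brute-force estimate in this section, the product of the four factors can be evaluated exactly. This is a worked-arithmetic sketch only; the function names are hypothetical, and the factors are the ones stated in the text: 1/2^β each for SK and S, 1/(β − 1) for t, and 1/2^(β−1) for r.

```python
from fractions import Fraction
import math


def brute_force_probability(beta: int) -> Fraction:
    """Exact product of the four factors for a given bit size beta."""
    p_sk = Fraction(1, 2**beta)          # guessing the shared secret key
    p_s = Fraction(1, 2**beta)           # guessing the shared secret seed
    p_t = Fraction(1, beta - 1)          # guessing the number of keys t
    p_r = Fraction(1, 2**(beta - 1))     # guessing the rotation information
    return p_sk * p_s * p_t * p_r


def log2_of_approximation(beta: int) -> float:
    """log2 of the approximation 1 / (8^beta * (beta - 1))."""
    return -(3 * beta + math.log2(beta - 1))
```

For β = 128, the exact product is 2/(8^128 × 127), which is twice the approximate value, and the approximation itself is roughly 2^(−391).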

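3.4 Birthday Attacks

The birthday attack is used to find a collision in an encrypted message or to find a hash collision. If η items are hashed to find a collision, the collision probability is given using the birthday paradox in Eq. (8).

\[
\rho = 1 - \frac{2^{\beta}!}{2^{\eta\beta}\,(2^{\beta} - \eta)!} \tag{8}
\]

Solving Eq. (8), we get Eq. (9).

\[
\begin{aligned}
\rho &= 1 - e^{-\frac{\eta^{2}}{2^{\beta+1}}} \\
1 - \rho &= e^{-\frac{\eta^{2}}{2^{\beta+1}}} \\
\ln(1 - \rho) &= -\frac{\eta^{2}}{2^{\beta+1}} \\
\eta^{2} &= -2^{\beta+1} \ln(1 - \rho) \\
\eta &= 2^{\frac{\beta+1}{2}} \sqrt{-\ln(1 - \rho)}
\end{aligned}
\tag{9}
\]

In Eq. (9), we approximate ln(1 − ρ) = −ρ, then we get Eq. (10).

\[
\eta = 2^{\frac{\beta+1}{2}} \sqrt{\rho} \tag{10}
\]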

Equation (10) relates the number of hashed items η to the collision probability ρ for any secure hash function. η becomes enormous for 256 bits and onward. Equation (10) thus shows that it is hard to create a collision for large bit sizes. symKrypt uses two secret keys: the shared secret key and the shared secret seed value. Therefore, the number of combinations of the two keys is $\binom{\eta}{2} = \frac{\eta(\eta-1)}{2}$. The probability of picking a correct pair is $\frac{2}{\eta(\eta-1)}$. The probability of not picking a correct pair is $(1 - \frac{2}{\eta(\eta-1)})$. η is large, and thereupon, we approximate the probability $\frac{2}{\eta(\eta-1)} \approx 0$; thus, the probability of not picking a correct pair is 1. With this approximation, we can rewrite Eq. (8), and thus the probability of collision becomes almost 0, which is given in Eq. (11).

\[
\begin{aligned}
\rho &\approx 1 - \frac{2^{\beta}!}{2^{\beta}\,(2^{\beta} - 1)!} \\
\rho &\approx 1 - \frac{2^{\beta}}{2^{\beta}} \\
\rho &\approx 0
\end{aligned}
\tag{11}
\]

Significantly, Eq. (11) is an approximation of the probability, and it shows the difficulties in getting collision attacks.
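The last line of Eq. (9) and its approximation in Eq. (10) can be evaluated numerically to see how many items η are needed for a given collision probability ρ. The sketch works with log2(η) to avoid huge numbers; the function names are hypothetical.

```python
import math


def log2_items_exact(beta: int, rho: float) -> float:
    """log2 of eta from Eq. (9): eta = 2^((beta+1)/2) * sqrt(-ln(1 - rho))."""
    return (beta + 1) / 2 + 0.5 * math.log2(-math.log(1 - rho))


def log2_items_approx(beta: int, rho: float) -> float:
    """log2 of eta from Eq. (10): eta = 2^((beta+1)/2) * sqrt(rho)."""
    return (beta + 1) / 2 + 0.5 * math.log2(rho)
```

For β = 128 and ρ = 0.5, Eq. (10) gives exactly η = 2^64, while Eq. (9) gives about 2^64.24; the two expressions agree closely only for small ρ, where ln(1 − ρ) ≈ −ρ holds.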

3.5 Cryptanalysis Attacks

A cryptanalysis attack analyzes the ciphertext to discover the original text. An attacker collects many ciphertexts and performs analysis on the collected database. Such analysis includes ciphertext-only, known-plaintext, chosen-plaintext/ciphertext, adaptive chosen-plaintext/ciphertext, related-key, and differential attacks. These types of attacks can be applied to single-keyed symmetric ciphertexts. symKrypt uses t random keys generated by a pseudo-random number generator. The private keys change in the encryption or decryption of each block of messages.

3.5.1 Chosen-Ciphertext And/Or Ciphertext Attacks

We assume that the ciphertexts are made public to all, including all adversaries. The attacker becomes successful when it can decode the ciphertext into plaintext. Chosen-ciphertext and/or ciphertext-only attackers study the statistical distribution of the characters in the plaintext or ciphertext, which makes it easy to decode the plaintext from the ciphertext. In a conventional system, the ciphertext is created using a single secret key; thereby, it is prone to cryptanalysis attacks. On the contrary, symKrypt uses multiple random numbers as secret keys to encrypt the message. Moreover, the secret keys are never repeated under any circumstances in symKrypt. Therefore, statistical analysis does not reveal the plaintext or the secret keys. For instance, Dec(k, m1) = Dec(k′, m1) is found and then applied to Dec(k, m2) = Dec(k′, m2). Dec(k, m2) = Dec(k′, m2) is possible if k = k′. In contrast, symKrypt uses several private keys to encrypt m1, i.e., Enc(k[], m1). Therefore, Dec(k[], m1) = Dec(k′[], m1) is not possible, and k[] ≠ k′[], where k[] is an array of private keys. Even if it is possible, Dec(k[], m2) = Dec(k′[], m2) is not possible.


3.6 Attacks Analysis

Chosen-plaintext and/or known-plaintext attacks are used to discover the keys for further decryption of ciphertexts. For instance, if Dec(k, m1) = Dec(k′, m′), then it is applied to perform Dec(k, m2) = Dec(k′, m2), and it is possible only when k[] = k′[]. In contrast, symKrypt uses a total of i secret keys to encrypt m1, i.e., Enc(ki, m1). Therefore, Dec(k[], m1) = Dec(k′, m′) is not possible, and k[] ≠ k′. Assuming that k[] ≠ k′ is possible, Dec(k[], m2) = Dec(k′, m2) is still not possible. Therefore, an adaptive chosen-plaintext attack or its variants are not possible in symKrypt. Similarly, other cryptanalysis techniques also do not apply, including related-key attacks, differential cryptanalysis, mod-n cryptanalysis, integral cryptanalysis, linear cryptanalysis, etc.

3.7 Dictionary Attacks

The dictionary attack is an attack that creates a dictionary by collecting several ciphertexts. A dictionary attack is dangerous for password-based systems when a large dictionary is created. Moreover, there is also a birthday attack based on collision probability; however, this kind of attack does not apply to symKrypt due to encryption using several private keys. Also, symKrypt provides strong resistance against a preimage attack.

3.8 Attacks Analysis

Let us assume that an adversary is able to break symKrypt with a probability of $\frac{1}{8^{\beta} \times (\beta - 1)}$. The adversary may use any technique to break the security of symKrypt, for instance, a fault attack [14]. In this case, the adversary can break a particular block of the message with the probability of $\frac{1}{8^{\beta} \times (\beta - 1)}$. But several blocks of the message are still secured even if one block of the message is compromised. This provides a strong deterrence against possible attacks.

4 Experimental Results

We have conducted a series of rigorous tests to verify the correctness of our proposed system. This experimentation is two-fold: first, we test the encryption and decryption, and second, we test the pseudo-random number generator on the NIST SP 800-22 statistical test suite. Our experimental environment is as follows: (a)


the CPU is an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz, (b) the RAM size is 8 GB and the HDD size is 1 TB, (c) the operating system is Ubuntu 18.04.5 LTS, and (d) the compiler is GCC version 7.5.0.

4.1 Cryptography Testing

Figure 3 demonstrates the time taken to encrypt and decrypt by symKrypt for various round settings. The round t is set for single message encryption with 1–5 M private keys. Algorithm 1 takes approximately the same time as Algorithm 3. Algorithms 1 and 3 can perform 476399.83 and 479518.57 rounds per second, respectively. It implies that symKrypt can perform the XOR operation between the original message and 476399.83 and 479518.57 private keys per second, respectively. Therefore, it is quite fast to encrypt or decrypt a message using symKrypt. Figure 4 depicts the total time taken to both encrypt and decrypt a single message for various t value settings. The t value represents the total number of private keys, ranging from 1 to 5 M in encryption and decryption each. symKrypt takes 4.21 and 20.85 s in total for 1 and 5 M private keys, respectively. Figure 5 shows the time taken to encrypt and decrypt 1–5 M blocks of messages at t = 10. Here, we use ten private keys to encrypt or decrypt. symKrypt takes 20.92 and 20.99 s to encrypt and decrypt 1 M blocks, respectively. Similarly, symKrypt takes 104.92 and 104.97 s to encrypt and decrypt 5 M blocks, respectively.

Fig. 3 Comparison between symEnc and symDec in various values of t (the t is the total rounds of encryption). Lower is better

Fig. 4 The total times taken for encryption and decryption by symKrypt in t rounds (the t is the total rounds of encryption)


R. Patgiri

Fig. 5 Time taken to encrypt or decrypt several millions of blocks in seconds at the settings of t = 10. Lower is better

Fig. 6 Comparison between symEnc and symDec for time as the total number of blocks per second in various values of t. Higher is better

Figure 6 shows the number of blocks processed per second for various values of t. The t value ranges from 10 to 50, which means that 10–50 private keys are used for encryption or decryption. At t = 10, symKrypt processes 47639.98 and 47551.85 blocks per second for encryption and decryption, respectively. Similarly, symKrypt processes 9527.99 and 9510.37 blocks per second at t = 50, respectively.
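The throughput figures above are related by simple arithmetic: a block encrypted with t private keys costs t rounds, so the block rate is the round rate divided by t. A minimal Python sketch, assuming this linear relationship holds (the function name `blocks_per_second` is illustrative and does not come from the paper):

```python
def blocks_per_second(rounds_per_second: float, t: int) -> float:
    """Blocks processed per second when each block needs t rounds.

    Assumes the cost of a block is exactly t rounds (one XOR round per
    private key), which is an interpretation of the reported numbers.
    """
    if t <= 0:
        raise ValueError("t must be a positive number of rounds")
    return rounds_per_second / t


if __name__ == "__main__":
    encryption_rate = 476399.83  # rounds per second reported for Algorithm 1
    for t in (10, 50):
        print(t, blocks_per_second(encryption_rate, t))
```

With the reported encryption rate of 476399.83 rounds per second, this gives roughly 47640 blocks per second at t = 10 and roughly 9528 at t = 50, in line with the encryption values quoted for Fig. 6.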

4.2 Randomness Testing

4.2.1 Randomness Testing of Grando

Table 3 shows the randomness results of GrandoM and GrandoX for 64-bit and 128-bit streams. The minimum P-values of GrandoM are 0.122325 for 64-bit and 0.051391 for 128-bit streams. The minimum P-values of GrandoX are 0.100508 for 64-bit and 0.046169 for 128-bit streams. The maximum P-values of GrandoM are 0.985035 for 64-bit and 0.976060 for 128-bit streams. Similarly, the maximum P-values of GrandoX are 0.985035 for 64-bit and 0.980883 for 128-bit streams. The minimum success rates of GrandoM are 0.96875 for 64-bit and 0.9765625 for 128-bit streams. Likewise, the minimum success rates of GrandoX are 0.984375 for 64-bit and 0.9921875 for 128-bit streams. Therefore, GrandoX produces higher quality randomness than GrandoM in the best and worst cases. As with Crando, the overall randomness quality of GrandoM and GrandoX is approximately the same. Notably, GrandoX produces better P-values and has higher success rates.
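For context, NIST SP 800-22 also describes an acceptance range for the proportion of sequences that pass a test, namely (1 − α) ± 3·sqrt(α(1 − α)/m) for m sequences at significance level α. The sketch below computes the lower bound of that range. It assumes α = 0.01 and that the denominators in Table 3 (64 and 128) are the numbers of tested sequences; `min_pass_proportion` is an illustrative name, not part of the test suite.

```python
import math


def min_pass_proportion(m: int, alpha: float = 0.01) -> float:
    """Lower bound of the acceptable pass proportion for m sequences.

    Implements (1 - alpha) - 3 * sqrt(alpha * (1 - alpha) / m), the
    proportion check described in NIST SP 800-22.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    p_hat = 1.0 - alpha
    return p_hat - 3.0 * math.sqrt(p_hat * alpha / m)


if __name__ == "__main__":
    for m in (64, 128):
        print(m, round(min_pass_proportion(m), 4))
```

Under these assumptions, the minimum success rates quoted above (0.96875 for 64 sequences and 0.9765625 for 128 sequences) lie above the respective lower bounds.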


Table 3 Comparison of Grando algorithms for 64 and 128 bits in NIST SP 800-22

| Test name | GrandoM 64 bits P-value | GrandoM 64 bits pass rate | GrandoX 64 bits P-value | GrandoX 64 bits pass rate | GrandoM 128 bits P-value | GrandoM 128 bits pass rate | GrandoX 128 bits P-value | GrandoX 128 bits pass rate |
|---|---|---|---|---|---|---|---|---|
| Approximate entropy | 0.637119 | 64/64 | 0.299251 | 64/64 | 0.070445 | 126/128 | 0.275709 | 127/128 |
| Frequency | 0.378138 | 64/64 | 0.299251 | 64/64 | 0.772760 | 127/128 | 0.253551 | 128/128 |
| Block frequency | 0.253551 | 62/64 | 0.134686 | 64/64 | 0.468595 | 127/128 | 0.046169 | 128/128 |
| Cumulative sums | 0.985035 | 64/64 | 0.468595 | 64/64 | 0.242986 | 127/128 | 0.706149 | 128/128 |
| Runs | 0.275709 | 64/64 | 0.911413 | 62/64 | 0.911413 | 127/128 | 0.568055 | 128/128 |
| Longest runs | 0.232760 | 64/64 | 0.888137 | 64/64 | 0.051391 | 128/128 | 0.848588 | 127/128 |
| Rank | 0.122325 | 63/64 | 0.985035 | 64/64 | 0.862344 | 127/128 | 0.086458 | 127/128 |
| FFT | 0.162606 | 63/64 | 0.110952 | 63/64 | 0.213309 | 128/128 | 0.931952 | 128/128 |
| Non-overlapping template | 0.964295 | 64/64 | 0.985035 | 64/64 | 0.976060 | 128/128 | 0.980883 | 128/128 |
| Overlapping template | 0.911413 | 64/64 | 0.100508 | 63/64 | 0.602458 | 125/128 | 0.311542 | 127/128 |
| Random excursions | 0.213309 | 16/16 | 0.350485 | 18/18 | 0.834308 | 11/11 | 0.213309 | 18/18 |
| Random excursions variant | 0.213309 | 16/16 | 0.213309 | 18/18 | 0.834308 | 11/11 | 0.213309 | 18/18 |
| Serial | 0.500934 | 64/64 | 0.407091 | 64/64 | 0.407091 | 127/128 | 0.378138 | 128/128 |
| Linear complexity | 0.437274 | 63/64 | 0.568055 | 61/64 | 0.086458 | 125/128 | 0.980883 | 127/128 |
| Universal | 0.568055 | 64/64 | 0.074177 | 64/64 | 0.517442 | 127/128 | 0.311542 | 128/128 |

4.2.2 Randomness Comparison Between Grando and Rando Algorithm

Table 4 compares the Grando and Rando algorithms. Grando is an incrementally enhanced version of Rando, and Table 4 shows the difference in randomness quality. The experimental analysis compares GrandoM, GrandoX, RandoM, and RandoX. The highest P-values of GrandoM, GrandoX, RandoM, and RandoX are 0.976060, 0.980883, 0.991468, and 0.931952, respectively. Although RandoM has the single highest P-value, GrandoM and GrandoX perform better in P-values overall. The lowest P-values of GrandoM, GrandoX, RandoM, and RandoX are 0.051391, 0.046169, 0.003363, and 0.001313, respectively. Here, GrandoM and GrandoX clearly outperform RandoM and RandoX. The lowest success rates of GrandoM, GrandoX, RandoM, and RandoX are 0.9765625, 0.9921875, 0.96875, and 0.9609375, respectively. Therefore, Grando outperforms Rando.
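The lowest success rates quoted above follow directly from the pass counts in Table 4. The following sketch recomputes them, assuming the counts as listed in the table for the tests run on 128 sequences; the two random excursions rows are left out because they use different denominators. The dictionary `PASS_COUNTS` and the function name are illustrative.

```python
from fractions import Fraction

# Pass counts out of 128, in the row order of Table 4 (approximate entropy,
# frequency, block frequency, cumulative sums, runs, longest runs, ranks, FFT,
# non-overlapping template, overlapping template, serial, linear complexity,
# universal). Random excursions rows are omitted on purpose.
PASS_COUNTS = {
    "GrandoM": [126, 127, 127, 127, 127, 128, 127, 128, 128, 125, 127, 125, 127],
    "GrandoX": [127, 128, 128, 128, 128, 127, 127, 128, 128, 127, 128, 127, 128],
    "RandoM": [127, 128, 127, 127, 128, 128, 128, 124, 128, 126, 127, 125, 125],
    "RandoX": [124, 126, 126, 126, 128, 125, 127, 125, 128, 123, 127, 123, 126],
}


def lowest_success_rate(pass_counts, total=128):
    """Smallest pass proportion over all tests, as a float."""
    return float(min(Fraction(count, total) for count in pass_counts))


if __name__ == "__main__":
    for name, counts in PASS_COUNTS.items():
        print(name, lowest_success_rate(counts))
```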

4.2.3 Comparison of Grando with Other Algorithms

Tables 5, 6 and 7 compare the Grando algorithm with other state-of-the-art TRNG algorithms. Table 5 compares Grando with Erozan et al. [29] and Koyuncu et al. [30]. Like Brando and Crando, Grando outperforms Erozan et al. [29] in all aspects. Koyuncu et al. [30] outperform Grando in the ranks, overlapping template, and serial tests, but Grando performs better in the remaining test cases. Grando outperforms the algorithm of Koyuncu et al. [30] in almost all the P-values. Overall, Grando exhibits


Table 4 Comparison of Grando with Rando algorithms in successful randomness testing in NIST SP 800-22

| Test name | GrandoM P-value | GrandoM pass rate | GrandoX P-value | GrandoX pass rate | RandoM P-value | RandoM pass rate | RandoX P-value | RandoX pass rate |
|---|---|---|---|---|---|---|---|---|
| Approximate entropy | 0.070445 | 126/128 | 0.275709 | 127/128 | 0.003363 | 127/128 | 0.437274 | 124/128 |
| Frequency | 0.772760 | 127/128 | 0.253551 | 128/128 | 0.081277 | 128/128 | 0.350485 | 126/128 |
| Block frequency | 0.468595 | 127/128 | 0.046169 | 128/128 | 0.931952 | 127/128 | 0.422034 | 126/128 |
| Cumulative sums | 0.242986 | 127/128 | 0.706149 | 128/128 | 0.311542 | 127/128 | 0.862344 | 126/128 |
| Runs | 0.911413 | 127/128 | 0.568055 | 128/128 | 0.141256 | 128/128 | 0.222869 | 128/128 |
| Longest runs | 0.051391 | 128/128 | 0.848588 | 127/128 | 0.015065 | 128/128 | 0.051391 | 125/128 |
| Ranks | 0.862344 | 127/128 | 0.086458 | 127/128 | 0.364146 | 128/128 | 0.001313 | 127/128 |
| FFT | 0.213309 | 128/128 | 0.931952 | 128/128 | 0.016911 | 124/128 | 0.900104 | 125/128 |
| Non-overlapping template | 0.976060 | 128/128 | 0.980883 | 128/128 | 0.788728 | 128/128 | 0.911413 | 128/128 |
| Overlapping template | 0.602458 | 125/128 | 0.311542 | 127/128 | 0.568055 | 126/128 | 0.602458 | 123/128 |
| Random excursions | 0.834308 | 11/11 | 0.213309 | 18/18 | 0.991468 | 10/10 | – | 8/8 |
| Random excursions variant | 0.834308 | 11/11 | 0.213309 | 18/18 | 0.911413 | 10/10 | – | 8/8 |
| Serial | 0.407091 | 127/128 | 0.378138 | 128/128 | 0.2873036 | 127/128 | 0.350485 | 127/128 |
| Linear complexity | 0.086458 | 125/128 | 0.980883 | 127/128 | 0.110952 | 125/128 | 0.922036 | 123/128 |
| Universal | 0.517442 | 127/128 | 0.311542 | 128/128 | 0.941144 | 125/128 | 0.931952 | 126/128 |

Table 5 Comparison of Grando with other algorithms in successful randomness testing in NIST SP 800-22

| Test name | GrandoM P-value | GrandoM pass rate | GrandoX P-value | GrandoX pass rate | Erozan et al. [29] P-value | Erozan et al. [29] pass rate | Koyuncu et al. [30] P-value | Koyuncu et al. [30] pass rate |
|---|---|---|---|---|---|---|---|---|
| Approximate entropy | 0.070445 | 126/128 | 0.275709 | 127/128 | – | – | 0.15224 | Successful |
| Frequency | 0.772760 | 127/128 | 0.253551 | 128/128 | 0.202268 | 96/100 | 0.72184 | Successful |
| Block frequency | 0.468595 | 127/128 | 0.046169 | 128/128 | 0.213309 | 100/100 | 0.06380 | Successful |
| Cumulative sums | 0.242986 | 127/128 | 0.706149 | 128/128 | 0.428568 | 96/100 | 0.56254 | Successful |
| Runs | 0.911413 | 127/128 | 0.568055 | 128/128 | 0.171867 | 99/100 | 0.06380 | Successful |
| Longest runs | 0.051391 | 128/128 | 0.848588 | 127/128 | – | – | 0.19640 | Successful |
| Ranks | 0.862344 | 127/128 | 0.086458 | 127/128 | – | – | 0.99834 | Successful |
| FFT | 0.213309 | 128/128 | 0.931952 | 128/128 | 0.474986 | 98/100 | 0.12786 | Successful |
| Non-overlapping template | 0.976060 | 128/128 | 0.980883 | 128/128 | – | – | 0.69314 | Successful |
| Overlapping template | 0.602458 | 125/128 | 0.311542 | 127/128 | 0.055361 | 99/100 | 0.90598 | Successful |
| Random excursions | 0.834308 | 11/11 | 0.213309 | 18/18 | – | – | 0.86541 | Successful |
| Random excursions variant | 0.834308 | 11/11 | 0.213309 | 18/18 | – | – | 0.35789 | Successful |
| Serial | 0.407091 | 127/128 | 0.378138 | 128/128 | 0.494555 | 100/100 | 0.87105 | Successful |
| Linear complexity | 0.086458 | 125/128 | 0.980883 | 127/128 | 0.249284 | 97/100 | 0.01101 | Successful |
| Universal | 0.517442 | 127/128 | 0.311542 | 128/128 | – | – | 0.02262 | Successful |

Table 6 Comparison of Grando with other algorithms in successful randomness testing in NIST SP 800-22

| Test name | GrandoM P-value | GrandoM pass rate | GrandoX P-value | GrandoX pass rate | Jiang et al. [31] P-value | Jiang et al. [31] pass rate | Johnson et al. [32] P-value | Johnson et al. [32] pass rate |
|---|---|---|---|---|---|---|---|---|
| Approximate entropy | 0.070445 | 126/128 | 0.275709 | 127/128 | 0.00983 | 75/76 | 0.7399 | 1.0 |
| Frequency | 0.772760 | 127/128 | 0.253551 | 128/128 | 0.477737 | 74/76 | 0.9114 | 1.0 |
| Block frequency | 0.468595 | 127/128 | 0.046169 | 128/128 | 0.768138 | 75/76 | 0.9114 | 1.0 |
| Cumulative sums | 0.242986 | 127/128 | 0.706149 | 128/128 | 0.426525 | 74/76 | 0.3505 | 1.0 |
| Runs | 0.911413 | 127/128 | 0.568055 | 128/128 | 0.042413 | 75/76 | 0.0089 | 0.92 |
| Longest runs | 0.051391 | 128/128 | 0.848588 | 127/128 | 0.042413 | 76/76 | 0.7400 | 1.0 |
| Rank | 0.862344 | 127/128 | 0.086458 | 127/128 | 0.094936 | 76/76 | 0.0043 | 1.0 |
| FFT | 0.213309 | 128/128 | 0.931952 | 128/128 | 0.739918 | 75/76 | 0.0089 | 1.0 |
| Non-overlapping template | 0.976060 | 128/128 | 0.980883 | 128/128 | – | 11052/11248 | 0.0043 | 1.0 |
| Overlapping template | 0.602458 | 125/128 | 0.311542 | 127/128 | 0.5929591 | 75/76 | 0.0213 | 0.8 |
| Random excursions | 0.834308 | 11/11 | 0.213309 | 18/18 | – | 360/368 | – | – |
| Random excursions variant | 0.834308 | 11/11 | 0.213309 | 18/18 | – | 818/828 | – | – |
| Serial | 0.407091 | 127/128 | 0.378138 | 128/128 | 0.795464 | 76/76 | 0.5341 | 1.0 |
| Linear complexity | 0.086458 | 125/128 | 0.980883 | 127/128 | 0.350485 | 76/76 | 0.9114 | 1.0 |
| Universal | 0.517442 | 127/128 | 0.311542 | 128/128 | 0.000954 | 76/76 | – | – |


Table 7 Comparison of Grando with other algorithms in successful randomness testing in NIST SP 800-22

| Test name | GrandoM P-value | GrandoM pass rate | GrandoX P-value | GrandoX pass rate | Wieczorek and Golofit [33] P-value | Wieczorek and Golofit [33] pass rate | Yeoh et al. [34] P-value | Yeoh et al. [34] pass rate |
|---|---|---|---|---|---|---|---|---|
| Approximate entropy | 0.070445 | 126/128 | 0.275709 | 127/128 | 0.49 | 0.98 | 0.647530 | 0.995 |
| Frequency | 0.772760 | 127/128 | 0.253551 | 128/128 | 0.55 | 0.99 | 0.516113 | 0.988 |
| Block frequency | 0.468595 | 127/128 | 0.046169 | 128/128 | 0.08 | 0.99 | 0.928857 | 0.993 |
| Cumulative sums | 0.242986 | 127/128 | 0.706149 | 128/128 | 0.27 | 0.98 | 0.572847 | 0.989 |
| Runs | 0.911413 | 127/128 | 0.568055 | 128/128 | 0.92 | 0.99 | 0.122325 | 0.991 |
| Longest runs | 0.051391 | 128/128 | 0.848588 | 127/128 | 0.34 | 0.99 | 0.291091 | 0.985 |
| Rank | 0.862344 | 127/128 | 0.086458 | 127/128 | 0.68 | 0.99 | 0.530120 | 0.995 |
| FFT | 0.213309 | 128/128 | 0.931952 | 128/128 | 0.82 | 0.99 | 0.858002 | 0.990 |
| Non-overlapping template | 0.976060 | 128/128 | 0.980883 | 128/128 | 0.81 | 0.99 | 0.743915 | 0.99 |
| Overlapping template | 0.602458 | 125/128 | 0.311542 | 127/128 | 0.21 | 0.99 | 0.502247 | 0.984 |
| Random excursions | 0.834308 | 11/11 | 0.213309 | 18/18 | 0.48 | 0.98 | 0.292960 | 0.9863 |
| Random excursions variant | 0.834308 | 11/11 | 0.213309 | 18/18 | 0.34 | 0.99 | 0.9966685 | 0.9893 |
| Serial | 0.407091 | 127/128 | 0.378138 | 128/128 | 0.79 | 0.99 | 0.402962 | 0.988 |
| Linear complexity | 0.086458 | 125/128 | 0.980883 | 127/128 | 0.44 | 0.99 | 0.433590 | 0.984 |
| Universal | 0.517442 | 127/128 | 0.311542 | 128/128 | 0.72 | 0.99 | 0.373625 | 0.984 |


good P-values compared with the other two algorithms, Erozan et al. [29] and Koyuncu et al. [30]. Similarly, Table 6 compares Grando with Jiang et al. [31] and Johnson et al. [32]. Grando outperforms Jiang et al. [31] and Johnson et al. [32] in almost all test cases. However, Jiang et al. [31] outperform Grando in block frequency and serial. Similarly, Johnson et al. [32] outperform Grando in approximate entropy, frequency, and block frequency. Likewise, Table 7 compares Grando with Wieczorek and Golofit [33] and Yeoh et al. [34]. Wieczorek and Golofit [33] outperform Grando in approximate entropy, runs, and serial. Yeoh et al. [34] outperform Grando in approximate entropy, block frequency, and random excursions variant. Overall, Grando outperforms the other TRNG algorithms in P-values for most tests.

4.2.4 Performance Comparison Between Grando and Rando Algorithm

We refer to Grando and Rando with the Murmur hash function as GrandoM and RandoM, respectively. Similarly, Grando and Rando with the xxHash hash function are referred to as GrandoX and RandoX, respectively. Figure 7 shows the time taken to produce random bit strings of 10, 20, 30, 40, and 50 M bits. GrandoM is the fastest; it takes 5.14 s to generate a 10 M bit string. In contrast, RandoX is the slowest; it takes 9.78 s to produce a 10 M bit string. In ascending order of running time, the algorithms are GrandoM, GrandoX, RandoM, and RandoX. From Fig. 7, we can conclude that the xxHash hash function is slower than the Murmur hash function, but the differences are insignificant (Fig. 8). Million operations per second (MOPS) is defined as $\frac{m}{t \times 1{,}000{,}000}$ M, where m is the number of bits, t is the time in seconds, and M denotes millions. Each operation produces a single bit. GrandoM, GrandoX, RandoM, and RandoX produce 1.94, 1.88, 1.05, and 1.02 M bits per second on average, respectively. Thus, GrandoM produces the most bits per second, and RandoX produces the fewest.

Fig. 7 Running time of GrandoM, GrandoX, RandoM, and RandoX in seconds. Lower is better

Fig. 8 Million operations per second (MOPS) of GrandoM, GrandoX, RandoM, and RandoX. Higher is better
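The MOPS definition can be checked against the reported timings with a few lines of Python. This is only a sketch of the arithmetic, assuming one operation per generated bit as stated in the text; the function name `mops` is illustrative.

```python
def mops(num_bits: int, seconds: float) -> float:
    """Million operations per second: num_bits / (seconds * 1,000,000)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bits / (seconds * 1_000_000)


if __name__ == "__main__":
    # Timings for generating 10 M bits, taken from the text above.
    print("GrandoM:", mops(10_000_000, 5.14))
    print("RandoX:", mops(10_000_000, 9.78))
```

For the 10 M bit runs this gives about 1.95 MOPS for GrandoM and about 1.02 MOPS for RandoX, close to the reported averages of 1.94 and 1.02.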

4.3 Pseudo-random Number Generator

Algorithm 8 uses the Murmur2 [25] and xxHash [26] hash functions to generate private keys. The TRNG experiments show that the Murmur hash function yields steady P-values, whereas the xxHash hash function yields higher P-values but exhibits more fluctuation. Therefore, we combine the advantages of both hash functions to produce private keys. Prando (Algorithm 8) is tested for randomness with the NIST SP 800-22 statistical test suite [27, 28]. This experimental evaluation shows the randomness of the generated private keys. Table 8 reports the P-values and pass rates of the randomness tests in the NIST SP 800-22 statistical test suite. NIST SP 800-22 provides the approximate entropy, frequency, block frequency, cumulative sums, runs, longest runs, rank, FFT, non-overlapping template, overlapping template, random excursions, random excursions variant, serial, linear complexity, and universal statistical tests for a given input. We generated 10 M random bits and fed them into the test suite. The 32-bit, 64-bit, and 128-bit streams are tested in the default configuration of the NIST SP 800-22 test suite. Table 8 indicates that the generated private keys are highly unpredictable and random. Therefore, it is difficult for adversaries to guess the private keys. The P-value threshold (≥ 0.01) is important in deciding the randomness and the pass rate. Table 8 shows that the P-values are greater than the minimum P-value (0.01). The maximum P-values of the 32-bit, 64-bit, and 128-bit streams are 0.991468, 0.976060, and 0.941144, respectively. The minimum P-values of the 32-bit, 64-bit, and 128-bit streams are 0.100508, 0.016990, and 0.028181, respectively. The maximum pass rate is 1 for all. The minimum pass rates of the 32-bit, 64-bit, and 128-bit streams are 0.9375, 0.96875, and 0.96875, respectively. Thus, Algorithm 8 is capable of generating numbers that pass the statistical randomness tests and can be used to generate the private keys for symKrypt.
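To make the role of the P-value threshold concrete, the sketch below implements the simplest test of the suite, the frequency (monobit) test, as it is commonly described for NIST SP 800-22: the bits are mapped to ±1, summed, normalised by sqrt(n), and converted to a P-value with the complementary error function. It is an illustration only and is not the reference implementation used in the experiments; the function names and the string input format are assumptions.

```python
import math

ALPHA = 0.01  # significance level quoted in the text


def monobit_p_value(bits: str) -> float:
    """P-value of the frequency (monobit) test for a string of '0'/'1'."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map '1' to +1 and any other character (expected '0') to -1, then sum.
    s_n = sum(1 if b == "1" else -1 for b in bits)
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes(bits: str, alpha: float = ALPHA) -> bool:
    """A sequence passes when its P-value is at least alpha."""
    return monobit_p_value(bits) >= alpha


if __name__ == "__main__":
    sample = "1011010101"
    print(monobit_p_value(sample), passes(sample))
```

A balanced sequence yields a P-value near 1, while a heavily biased one (for example, all ones) yields a P-value far below 0.01 and fails.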

Table 8 P-values and success rates of Prando for 32, 64 and 128 bits in NIST SP 800-22

| Test name | 32 bits P-value | 32 bits pass rate | 64 bits P-value | 64 bits pass rate | 128 bits P-value | 128 bits pass rate |
|---|---|---|---|---|---|---|
| Approximate entropy | 0.299251 | 32/32 | 0.350485 | 62/64 | 0.378138 | 128/128 |
| Frequency | 0.299251 | 32/32 | 0.253551 | 63/64 | 0.025193 | 127/128 |
| Block frequency | 0.299251 | 32/32 | 0.534146 | 64/64 | 0.170294 | 127/128 |
| Cumulative sums | 0.299251 | 32/32 | 0.500934 | 63/64 | 0.028181 | 127/128 |
| Runs | 0.739918 | 31/32 | 0.637119 | 63/64 | 0.148094 | 124/128 |
| Longest runs | 0.100508 | 32/32 | 0.568055 | 64/64 | 0.723129 | 126/128 |
| Rank | 0.534146 | 31/32 | 0.949602 | 61/64 | 0.155209 | 127/128 |
| FFT | 0.407091 | 30/32 | 0.016990 | 64/64 | 0.671779 | 125/128 |
| Non-overlapping template | 0.991468 | 32/32 | 0.976060 | 64/64 | 0.941144 | 128/128 |
| Overlapping template | 0.213309 | 32/32 | 0.407091 | 64/64 | 0.095617 | 128/128 |
| Random excursions | 0.637119 | 13/13 | 0.122325 | 18/18 | 0.534146 | 12/12 |
| Random excursions variant | 0.637119 | 13/13 | 0.213309 | 18/18 | 0.911413 | 12/12 |
| Serial | 0.602458 | 32/32 | 0.862344 | 62/64 | 0.834308 | 127/128 |
| Linear complexity | 0.299251 | 31/32 | 0.862344 | 62/64 | 0.213309 | 123/128 |
| Universal | 0.350485 | 31/32 | 0.772760 | 63/64 | 0.706149 | 127/128 |

5 Discussion

symKrypt uses the ECDH algorithm, which requires a TRNG algorithm. We therefore propose a TRNG, called Grando, for computing the random numbers used in the computation of the shared secret keys. Our work depends on the existing key agreement protocol, and therefore we omit its detailed description. The proposed work uses several private keys to encrypt or decrypt, and thus it requires a PRNG that can regenerate the same private keys for decryption given the correct parameters. The salient feature of symKrypt is the mixing of the original message with several newly generated random private keys. These pseudo-random private keys are generated by the proposed pseudo-random algorithm, and the proposed PRNG is tested with the NIST SP 800-22 statistical test suite. Our proposed PRNG passes all the randomness test cases. Moreover, symKrypt depends on many dynamic parameters that are not shared with anyone. For instance, the total number of private keys t and the rotation information r are kept secret. These are calculated dynamically. In addition, the direction of rotation (left or right) is kept secret and is also computed dynamically. Thus, symKrypt has two pieces of public information: the bit size β and the maximum/minimum private key ranges. We have experimentally demonstrated values of t ≥ 10, ranging from 1 to 5 M, and validated correctness. This shows that our proposed algorithm works correctly on a very large set of private keys. The performance of encryption and decryption is quite fast, as shown in the experimental section. Moreover, the bit size can be defined by the user, and it can be any size required by the user's application. However, the condition is β ≥ m, where m is the size of a message block. There is no restriction on bit size, unlike conventional symmetric-key cryptography. We also analyzed the time complexity, which is O(m), where m is the total number of message blocks. Moreover, we discuss the correctness of our proposed algorithm theoretically and experimentally.
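The idea that both parties regenerate the same private keys from shared secrets, instead of exchanging them, can be illustrated with a toy round trip. The sketch below is not the paper's Algorithm 1, 2 or 8: it assumes a single shared seed, uses Python's `random.Random` as a stand-in for Prando (it is not cryptographically secure), and applies a fixed left rotation `r` after each XOR. All function names and parameters are hypothetical.

```python
import random


def derive_keys(seed: int, t: int, beta: int) -> list:
    """Regenerate t pseudo-random beta-bit keys from a shared seed.

    random.Random is a stand-in PRNG for illustration only.
    """
    rng = random.Random(seed)
    return [rng.getrandbits(beta) for _ in range(t)]


def rotl(x: int, r: int, beta: int) -> int:
    """Rotate a beta-bit integer left by r positions."""
    r %= beta
    mask = (1 << beta) - 1
    return ((x << r) | (x >> (beta - r))) & mask


def rotr(x: int, r: int, beta: int) -> int:
    """Rotate a beta-bit integer right by r positions."""
    return rotl(x, beta - (r % beta), beta)


def encrypt_block(block: int, seed: int, t: int, r: int, beta: int = 64) -> int:
    """XOR the block with each key in turn, rotating after every round."""
    for key in derive_keys(seed, t, beta):
        block = rotl(block ^ key, r, beta)
    return block


def decrypt_block(block: int, seed: int, t: int, r: int, beta: int = 64) -> int:
    """Undo the rounds in reverse order using the regenerated keys."""
    for key in reversed(derive_keys(seed, t, beta)):
        block = rotr(block, r, beta) ^ key
    return block


if __name__ == "__main__":
    message = 0x0123456789ABCDEF
    cipher = encrypt_block(message, seed=42, t=10, r=7)
    print(hex(cipher), decrypt_block(cipher, seed=42, t=10, r=7) == message)
```

Because the receiver derives the key sequence from the same seed, t, and r, no private key ever needs to be transmitted; a wrong seed, t, or r yields a different key sequence and decryption fails.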


6 Conclusion

This article presents our proposed symmetric-key cryptography algorithm, symKrypt, which is the first of its kind. Our proposed algorithm is simple and straightforward yet powerful. It can be used on any platform to secure symmetric communication. symKrypt depends on multiple private keys, which are generated dynamically and kept secret. Our experimental results show that the private keys generated by the proposed pseudo-random number generator are unpredictable and secure. The generator is tested with the NIST SP 800-22 statistical test suite. Moreover, symKrypt uses two shared secrets, namely the shared secret key and the shared secret seed value, which are computed by the ECDH algorithm. These two secrets are used to generate the private keys but are not used to encrypt the messages. In addition, symKrypt changes its private key in each iteration of the encryption or decryption of a message block, and it also changes the private keys for each block of a message. To the best of our knowledge, symKrypt is the first variant to use multiple private keys without extra communication for the private keys. The sender and receiver do not exchange the private keys but compute them independently. Moreover, the original message vanishes in the encryption process, and therefore the decryption process requires reconstruction of the original message. symKrypt allows block-level encryption as well as encryption of the entire message (in particular, the entire data). symKrypt provides strong resistance against attacks other than DDoS and MITM attacks, and we illustrate this resistance in our analysis. The probability of successfully attacking symKrypt is almost negligible. Our proposed algorithm is able to defend against possible symmetric-key cryptography attacks for several reasons, in particular (a) minimal public information, (b) secret information that is dynamic in nature, and (c) multiple private keys.

References

1. N. Ferguson, Impossible differentials in twofish (1999). Accessed on April 2021 from https://www.schneier.com/wp-content/uploads/2016/02/paper-twofish-impossible.pdf
2. Specification for the advanced encryption standard (aes). Federal Information Processing Standards Publication 197 (2001). http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
3. D.J. Bernstein, The Salsa20 Family of Stream Ciphers (Springer Berlin Heidelberg, Berlin, Heidelberg, 2008), pp. 84–97. https://doi.org/10.1007/978-3-540-68351-3_8
4. D. Khovratovich, G. Leurent, C. Rechberger, in Advances in Cryptology – EUROCRYPT 2012, ed. by D. Pointcheval, T. Johansson (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012), pp. 392–410
5. A. Aly, T. Ashur, E. Ben-Sasson, S. Dhooghe, A. Szepieniec, IACR Trans. Symmetric Cryptol. (3), 1 (2020). https://doi.org/10.13154/tosc.v2020.i3.1-45
6. S. Agrawal, P. Mohassel, P. Mukherjee, P. Rindal, in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (Association for Computing Machinery, New York, NY, USA, 2018), CCS '18, pp. 1993–2010. https://doi.org/10.1145/3243734.3243774
7. A. Boldyreva, J.P. Degabriele, K.G. Paterson, M. Stam, in Proceedings of the 31st Annual International Conference on Theory and Applications of Cryptographic Techniques (Springer, Berlin, Heidelberg, 2012), EUROCRYPT'12, pp. 682–699. https://doi.org/10.1007/978-3-642-29011-4_40
8. G. Samid, FAMILY KEY CRYPTOGRAPHY: interchangeable symmetric keys – a different cryptographic paradigm. Cryptology ePrint Archive, Report 2021/458 (2021). https://eprint.iacr.org/2021/458
9. R. Kumar, K.K. Mishra, A. Tripathi, A. Tomar, S. Singh, Msea: modified symmetric encryption algorithm. Cryptology ePrint Archive, Report 2014/280 (2014). https://eprint.iacr.org/2014/280
10. M. Islam, M. Shah, Z. Khan, T. Mahmood, M.J. Khan, in 2015 13th International Conference on Frontiers of Information Technology (FIT) (2015), pp. 1–5. https://doi.org/10.1109/FIT.2015.12
11. X. Ge, J. Yu, H. Zhang, C. Hu, Z. Li, Z. Qin, R. Hao, IEEE Trans. Depend. Secure Comput. 18(1), 490 (2021). https://doi.org/10.1109/TDSC.2019.2896258
12. K. McCusker, N.E. O'Connor, IEEE Trans. Depend. Secure Comput. 8(3), 363 (2011). https://doi.org/10.1109/TDSC.2010.73
13. S. Raza, L. Seitz, D. Sitenkov, G. Selander, IEEE Trans. Autom. Sci. Eng. 13(3), 1270 (2016). https://doi.org/10.1109/TASE.2015.2511301
14. A. Baksi, S. Bhasin, J. Breier, D. Jap, D. Saha, Fault attacks in symmetric key cryptosystems. Cryptology ePrint Archive, Report 2020/1267 (2020). https://eprint.iacr.org/2020/1267
15. P. Lorek, F. Zagórski, M. Kulis, IEEE Trans. Depend. Secure Comput. 16(5), 805 (2019). https://doi.org/10.1109/TDSC.2017.2751475
16. L. Guan, J. Lin, Z. Ma, B. Luo, L. Xia, J. Jing, IEEE Trans. Depend. Secure Comput. 15(5), 742 (2018). https://doi.org/10.1109/TDSC.2016.2631548
17. S. Ahmadi, M.R. Aref, IEEE Access 8, 2284 (2020). https://doi.org/10.1109/ACCESS.2019.2962101
18. M. Alioto, M. Poli, S. Rocchi, IEEE Trans. Depend. Secure Comput. 7(3), 226 (2010). https://doi.org/10.1109/TDSC.2009.1
19. A. Biryukov, L. Perrin, IACR Cryptol. ePrint Arch. 511 (2017). http://eprint.iacr.org/2017/511
20. W. Diffie, M. Hellman, IEEE Trans. Inform. Theory 22(6), 644 (1976). https://doi.org/10.1109/TIT.1976.1055638
21. V.S. Miller, in Advances in Cryptology – CRYPTO '85 Proceedings, ed. by H.C. Williams (Springer, Berlin, Heidelberg, 1986), pp. 417–426
22. N. Koblitz, Math. Comput. 48(177), 203 (1987)
23. E. Barker, L. Chen, A. Roginsky, M. Smid, Recommendation for pair-wise key establishment schemes using discrete logarithm cryptography (2007). Accessed on January 2021 from https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-56ar.pdf
24. A. Author, in To be appeared in The 20th IEEE International Conference on Trust, Security, and Privacy in Computing and Communications (TrustCom 2021), 20–22 October 2021 (Shenyang, China, 2021), pp. 107–113. https://doi.org/10.1109/TrustCom53373.2021.00032
25. A. Appleby, Murmurhash. Retrieved on December 2020 from https://sites.google.com/site/murmurhash/ (2008)
26. Y. Collet, xxhash. Retrieved on December 2020 from https://create.stephan-brumme.com/xxhash/ (2004)
27. A. Rukhin, J. Soto, J. Nechvatal, M. Smid, E. Barker, A statistical test suite for random and pseudorandom number generators for cryptographic applications. Tech. rep., Booz-allen and hamilton inc mclean va (2001). https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-22r1a.pdf
28. L.E. Bassham III, A.L. Rukhin, J. Soto, J.R. Nechvatal, M.E. Smid, E.B. Barker, S.D. Leigh, M. Levenson, M. Vangel, D.L. Banks, et al., SP 800-22 rev. 1a. a statistical test suite for random and pseudorandom number generators for cryptographic applications (National Institute of Standards & Technology, 2010). https://csrc.nist.gov/publications/detail/sp/800-22/rev-1a/final
29. A.T. Erozan, G.Y. Wang, R. Bishnoi, J. Aghassi-Hagmann, M.B. Tahoori, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 28(6), 1485 (2020). https://doi.org/10.1109/TVLSI.2020.2975876
30. İ. Koyuncu, M. Tuna, İ. Pehlivan, C.B. Fidan, M. Alçın, Analog Integr. Circuits Signal Process. 102(2), 445 (2020). https://doi.org/10.1007/s10470-019-01568-x
31. H. Jiang, D. Belkin, S.E. Savel'ev, S. Lin, Z. Wang, Y. Li, S. Joshi, R. Midya, C. Li, M. Rao, M. Barnell, Q. Wu, J.J. Yang, Q. Xia, Nature Commun. 8(1), 1 (2017). https://doi.org/10.1038/s41467-017-00869-x
32. A.P. Johnson, R.S. Chakraborty, D. Mukhopadyay, IEEE Trans. Circuits Syst. II Express Briefs 64(4), 452 (2017). https://doi.org/10.1109/TCSII.2016.2566262
33. P.Z. Wieczorek, K. Gołofit, IEEE Trans. Circuits Syst. I Regular Papers 65(4), 1279 (2018). https://doi.org/10.1109/TCSI.2017.2751144
34. W.Z. Yeoh, J.S. Teh, H.R. Chern, Multimed. Tools Appl. 78(12), 15929 (2019)

PK-BERT: Knowledge Enhanced Pre-trained Models with Prompt for Few-Shot Learning

Han Ma, Benjamin K. Ng, and Chan-Tong Lam

Abstract The amount of data in some fields is scarce because the data are difficult or expensive to obtain. The general practice is to pre-train a model on similar data sets and fine-tune it on downstream tasks by transfer learning. Pre-trained models can learn general language representations from large-scale corpora, but their downstream tasks may differ from the pre-trained tasks in form and type. They also lack related semantic knowledge. Therefore, we propose PK-BERT: Knowledge Enhanced Pre-trained Models with Prompt for Few-shot Learning. It (1) achieves few-shot learning by using small samples with pre-trained models; (2) constructs a prefix that contains the masked label to shorten the gap between the downstream task and the pre-trained task; (3) uses an explicit representation to inject knowledge graph triples into the text to enhance the sentence information; and (4) uses the masked language modelling (MLM) head to convert the classification task into a generation task. The experiments show that our proposed model PK-BERT achieves better results.

Keywords Few-shot learning · Pre-trained models · Knowledge graph · Prompt · Masked language modelling

This work is funded by Macao Polytechnic University (File no. RP/ESCA-02/2021).

H. Ma (B) · B. K. Ng · C.-T. Lam, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_2


1 Introduction

Recently, with the development of big data and high-performance hardware, large-scale pre-trained models have injected new vitality into the development of artificial intelligence and created a new paradigm. Fine-tuning large models on downstream tasks has become common practice. This approach transfers the knowledge learned in the pre-trained task to the downstream task by transfer learning and helps the downstream task make better predictions. However, this method does not fully consider the similarity of input forms between the pre-trained task and the downstream task. As a result, some tasks cannot make good use of the knowledge from the pre-trained task, and their metrics are usually not ideal. How to shorten the distance between the downstream task and the pre-trained task is an important problem.

Brown et al. [1] propose GPT-3, which uses prompts to cast the downstream task in the form of the pre-trained task and adds context to the text. This helps the downstream task make better use of the knowledge from the pre-training phase. Liu et al. [4] propose K-BERT. By explicitly injecting knowledge triples into the text with an attention mask matrix, the text is enriched and the model can learn text-related knowledge from the knowledge graph. Schick et al. [8] propose PET, an MLM-based method that transforms the text classification task into a cloze task, which shortens the distance between the downstream task and the pre-trained task. Schick et al. [9] propose a method based on PET [8] that addresses the small sample problem using the MLM head of BERT [2]. Xu et al. [15] propose the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive few-shot evaluation benchmark in Chinese. We use this benchmark data set and compare our results with it.

In this work, we describe an approach, PK-BERT, to improve the small sample problem based on prompt, knowledge graph, pre-trained models, and MLM head, as shown in Fig. 1. The original inputs are complete sentences. However, the pre-trained task is to predict words that have been masked. The prompt therefore constructs the input with a context that includes the masked label words, which makes better use of the pre-trained knowledge. Another consideration is the information in the input sentences. Each sentence is short, and the information in each one is limited. Injecting knowledge graph triples into the sentence therefore helps to enrich the semantic information of the input sentences. Another way to improve the input sentence information is to use a pre-trained model. The pre-trained task of the model uses a large corpus for training, which means that the model has seen many words, sentences, and even articles. This helps the model better understand the semantic information of the sentences through transfer learning. The pre-trained task


Fig. 1 Model structure. The prompt and the knowledge graph give the sentences more information. The sentences are then input to the model, which reuses the embedding and encoder layers of the pre-trained model and uses the MLM head as the output layer. Finally, the output is used for the downstream task

of the models is to fill in the masked positions. However, for other types of tasks, such as classification, the pre-trained models may not perform well because they have not seen that task form before. To achieve a better result, the task form is therefore converted by the MLM head, which turns the task into filling in masked positions. This narrows the distance between the downstream task and the pre-trained task.
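The knowledge injection step described above can be sketched in a few lines. The example below follows the general K-BERT idea of appending triples after the entity they describe and restricting attention with a visibility matrix, but it is a simplified illustration rather than the implementation used in this work: entities are matched as single tokens, all triples of one entity share a branch, and the toy knowledge graph `kg` is hypothetical.

```python
def inject_triples(tokens, kg):
    """Append (relation, object) pairs after each token found in kg.

    Returns the expanded token list and, for each position, the index of
    the anchoring entity (None for tokens of the original sentence).
    """
    expanded, anchor = [], []
    for tok in tokens:
        expanded.append(tok)
        anchor.append(None)
        entity_pos = len(expanded) - 1
        for relation, obj in kg.get(tok, []):
            for injected in (relation, obj):
                expanded.append(injected)
                anchor.append(entity_pos)
    return expanded, anchor


def visible_matrix(anchor):
    """Boolean matrix: vis[i][j] is True when token i may attend to token j."""
    n = len(anchor)
    vis = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if anchor[i] is None and anchor[j] is None:
                vis[i][j] = True  # sentence tokens see each other
            elif anchor[i] is not None and anchor[j] is not None:
                vis[i][j] = anchor[i] == anchor[j]  # same entity's branch
            else:
                branch, trunk = (i, j) if anchor[i] is not None else (j, i)
                vis[i][j] = anchor[branch] == trunk  # injected token and its entity
    return vis


if __name__ == "__main__":
    kg = {"Beijing": [("capital_of", "China")]}  # hypothetical toy graph
    tokens, anchor = inject_triples(["Cook", "visits", "Beijing"], kg)
    print(tokens)
    print(visible_matrix(anchor)[0])
```

The visibility matrix keeps injected knowledge from changing the meaning of unrelated words: in the toy example, "Cook" cannot attend to "capital_of" or "China", while "Beijing" can.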

2 Related Work

2.1 Pre-trained Language Model for Few-Shot Learning

GPT-3 [1] is a unidirectional language model whose parameter count reaches 175 billion. It can generalize to downstream tasks from a few samples or even zero samples, but its stability and predictability still need improvement. The parameter scale of GPT-3 is comparable to the number of neurons in the human brain, which suggests a strong representation ability, but the model still lacks common sense. Schick et al. [8] propose PET, which transforms the classification task into a cloze task to narrow the gap between the downstream task and the pre-training task. A follow-up work [9] builds on PET [8] and shows that the MLM model of BERT can also solve the small sample problem. Inspired by these works, our method uses the pre-trained model as a feature extractor that draws on its hidden knowledge to map sentences into the feature space. We also use the MLM head to keep the task form consistent with the pre-training phase.


H. Ma et al.

2.2 Pre-trained Language Model with Prompt

Shin et al. [10] propose AutoPrompt, which automatically designs prompts for diverse tasks using gradient-guided search. Although AutoPrompt works well, its gradient-based search requires a large number of samples, its prompts are hard to interpret, and its relatively simple search method is not suitable for small samples. Gao et al. [3] propose LM-BFF, which automatically builds prompt templates for template-based fine-tuning and dynamically selects samples as input context. Although the automatic prompt construction in LM-BFF is effective, expanding the search space remains a great challenge in practical applications, and the number of labels cannot be too large. Liu et al. [6] propose P-Tuning, which turns template construction into a continuous parameter optimization problem. Liu et al. [5] propose P-Tuning v2, which adapts the prefix-tuning technique from text generation to natural language understanding (NLU) tasks. It addresses the limited gains of prompt tuning on small models and extends prompt tuning to named entity recognition (NER) and other sequence labelling tasks. Wang et al. [14] propose EFL, which turns label descriptions into input sentences and reformulates the original task as an entailment task. Our method uses prompts containing masked labels as a prefix, a simple and effective way to construct context for the downstream task and to draw out, through transfer learning, the knowledge acquired during pre-training.

2.3 Knowledge Enhanced Pre-trained Language Model

Zhang et al. [16] extract informative entities from the knowledge graph and enhance the corresponding representations in the text through a special semantic fusion module. This work depends on the accuracy of named entity recognition, and the model structure is complex. Sun et al. [12] propose ERNIE 1.0, which uses the mask mechanism to mask entity semantics from the knowledge graph and applies three levels of masking strategy to learn language models at different granularities. Sun et al. [13] propose ERNIE 2.0, which introduces continual multi-task learning and adds a series of tasks (which can be regarded as different loss functions) from coarse to fine granularity to achieve state-of-the-art (SOTA) results. Sun et al. [11] propose ERNIE 3.0, which adopts the common practice in multi-task training of dividing the feature layers into a universal representation and task-specific representations. This line of work from Baidu has shown the importance and effectiveness of constructing reasonable pre-training tasks and of using data to build unsupervised training sets. By modifying the attention mechanism in the Transformer, Liu et al. [4] use a special mask method that takes the relevant edges of the knowledge graph into account during encoding to enhance the pre-trained model. Since this strategy only changes the masking, it supports a range of pre-trained models such as BERT and RoBERTa [7]. The method improves results on 8 open-domain tasks and 4 specific-domain tasks. However, the model still uses the fine-tuning paradigm, and the differences between the downstream task and the pre-training task may weaken transfer learning. Our focus is on using the knowledge graph as an external semantic base that complements the input sentences with more information.

3 Methodology

3.1 Notation

In this work, we apply a pre-trained language model $M$ to the downstream task dataset $D$. In the few-shot learning setting, we sample only $K$ training sentences per category. The label set contains $N$ categories, so the training set size is $n = K \times N$. $D$ consists of two parts: the sentences $s = \{s_1, s_2, s_3, \ldots, s_n\}$ and the labels $l = \{l_1, l_2, l_3, \ldots, l_n\}$. Each sentence $s_i$ has exactly one label $l_i$ in the label space $L$ and a token sequence $s_i = \{w_1, w_2, w_3, \ldots, w_m\}$, where each token $w_i$ is in the vocabulary $V$, $w_i \in V$. A prompt $p$ is added to each sentence $s$, which gives $s_p$. The knowledge graph $KG = \{h, r, t\}$ is a set of triples composed of a head entity $h$, a relation $r$, and a tail entity $t$. When a token $w$ matches a head entity $h$, the knowledge representation $w_r = (w_i, r_j, t_k)$ is injected at that token, and the sentence with knowledge becomes $s_k$. A sentence with both prompt and knowledge is written $s_{pk}$. So the dataset is $D = \{(s_{pk_i}, l_i)\}_{i=1}^{n}$.

3.2 Dataset and Benchmark

Tnews is a Chinese news text classification dataset. It is divided into 15 categories, including tourism, education, finance, military, and so on. The complete dataset contains 73,360 examples, each with three attributes: classification ID, classification name, and news title. In this work, we use the same data as the FewCLUE [15] baseline, which is restricted to a 15-way 16-shot setting for few-shot learning. In the FewCLUE [15] benchmark, the whole dataset is randomly shuffled and split into a small dataset, which is divided into a training set, a development set, and a test set. The training set consists of 15 categories with 16 samples each, which is why it is called 15-way 16-shot.
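The N-way K-shot construction can be illustrated with a short sketch. It is not the FewCLUE splitting script: the function name `sample_k_shot`, the fixed seed, and the toy data are assumptions made only to show how $n = K \times N$ training examples arise.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed=0):
    """Draw k examples per label from a list of (sentence, label) pairs."""
    by_label = defaultdict(list)
    for sentence, label in examples:
        by_label[label].append((sentence, label))
    rng = random.Random(seed)  # fixed seed so the toy split is reproducible
    train = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < k:
            raise ValueError(f"label {label!r} has only {len(pool)} examples")
        train.extend(rng.sample(pool, k))
    return train

# Toy stand-in for the full dataset: 15 labels with 20 titles each.
toy = [(f"title {i} of class {c}", c) for c in range(15) for i in range(20)]
train_set = sample_k_shot(toy, k=16)
print(len(train_set))  # n = K * N = 16 * 15 = 240
```

With 15 categories and 16 samples per category, the few-shot training set holds 240 sentences.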


3.3 Prompt

A prompt is an input template designed for the downstream task to imitate the input form of the pre-training task. It provides input context for the language model and facilitates knowledge transfer. For example, we can formulate an input sentence $s$ using a prompt as

$s_p$ = It is [MASK][MASK] news, $s$ [SEP]

and Function 1 transforms $s$ into $s_p$:

$$s_p = f_{prompt}(s) \tag{1}$$

The template is constructed manually according to the text content and common natural-language patterns. We place the prefix template at the beginning of the text so that the form of the input is close to that of the pre-training task. The template contains the label, which is masked before the text enters the model, so that the model predicts the label at the masked positions. This gives the text a contextual sentence pattern that includes the label and keeps the semantics of the input fluent.
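As a minimal sketch, Function 1 can be written as plain string construction. The parameter names (`label_len`, `mask`, `sep`) are illustrative assumptions; in practice the special tokens would come from the tokenizer of the chosen pre-trained model.

```python
def f_prompt(sentence, label_len=2, mask="[MASK]", sep="[SEP]"):
    """Prepend a prefix template whose label words are replaced by mask tokens."""
    prefix = "It is " + mask * label_len + " news, "
    return prefix + sentence + " " + sep

s_p = f_prompt("s")
print(s_p)  # It is [MASK][MASK] news, s [SEP]
```

The number of mask tokens is fixed here to two, matching the example template above, where each label is written with two tokens.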

3.4 Representation

Knowledge graph representations can be divided into explicit and implicit representations. An explicit representation inserts knowledge graph triples directly into the text as knowledge. An implicit representation maps the triples into a vector space in some way and then integrates them into the text and the model. We use the explicit representation: the knowledge graph is explicitly injected into the input text following K-BERT [4]. Specifically, this method inserts the relation and tail entity after the head entity in the text, and an attention mask matrix ensures that the inserted parts can be seen only by the head entity, while the other parts of the original sentence cannot see them. For example, consider an input sentence $s$ with tokens $t_1$, $t_2$, $t_3$, and $t_4$, where token $t_3$ is the head $h$ of a knowledge graph triple $(h, r, t)$. The input sentence then becomes

$s_k = t_1 \; t_2 \; h(t_3) \; r \; t \; t_4$

and Function 2 converts $s$ into the sentence with knowledge $s_k$:

$$s_k = f_{representation}(s) \tag{2}$$


The knowledge graph used in this work is CN-DBpedia, a large-scale, open-domain, structured Chinese encyclopedia knowledge graph developed and maintained by the Knowledge Works Laboratory of Fudan University. It covers tens of millions of entities and hundreds of millions of relations.
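The explicit injection and the visibility rule can be sketched as follows. This is an illustration of the idea described in this section, not the K-BERT implementation: the function `inject_triples`, the dictionary-based toy knowledge graph, and the exact-match entity lookup are all assumptions.

```python
def inject_triples(tokens, kg, max_entities=2):
    """Insert (relation, tail) after each matched head token.

    Returns the new token list and a matrix where visible[i][j] is True
    when token i is allowed to attend to token j.
    """
    new_tokens, branch = [], []
    for tok in tokens:
        head_pos = len(new_tokens)
        new_tokens.append(tok)
        branch.append(None)  # None marks a token of the original sentence
        for idx, (relation, tail) in enumerate(kg.get(tok, [])[:max_entities]):
            for injected in (relation, tail):
                new_tokens.append(injected)
                branch.append((head_pos, idx))  # triple idx attached to this head
    n = len(new_tokens)
    visible = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            bi, bj = branch[i], branch[j]
            if bi is None and bj is None:
                visible[i][j] = True          # original tokens see each other
            elif bi is None:
                visible[i][j] = bj[0] == i    # a head sees its own injected tokens
            elif bj is None:
                visible[i][j] = bi[0] == j    # injected tokens see only their head
            else:
                visible[i][j] = bi == bj      # tokens of one triple see each other
    return new_tokens, visible

kg = {"t3": [("r", "t")]}  # toy knowledge graph: head -> list of (relation, tail)
tokens, visible = inject_triples(["t1", "t2", "t3", "t4"], kg)
print(tokens)  # ['t1', 't2', 't3', 'r', 't', 't4']
```

In this toy run, `t3` can attend to the injected `r` and `t`, while `t1`, `t2`, and `t4` cannot, so the inserted knowledge does not alter how the rest of the sentence is read. The `max_entities` argument mirrors the cap on injected triples used in the experiments.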

3.5 Masked Language Modelling for Few-Shot Learning

Following PET [8] and its subsequent work [9], this work makes the model predict the masked words. PK-BERT transforms the classification task into a generation task and uses small samples to fine-tune the pre-trained model. Existing pre-trained models such as BERT and RoBERTa can be used. The output layer is an MLM head, which converts the classification task into a generation task. Finally, the joint probability of the label words is computed, and the label words $\hat{l}$ with the largest probability are taken as the output. With the input form constructed through prompt and representation, Function 3 gives the prediction:

$$\hat{l} = f_{MLM}(s_{pk}) \tag{3}$$

The prefix template containing the label masks and the text are fed to the model together to predict the label. In this experiment, the classification task is transformed into a generation task through the MLM head. The probability of predicting label $l$ is then calculated as shown in Eq. 4:

$$p(l \mid s_{pk}) = \frac{\exp(w_{V(l)} \cdot h_{[MASK]})}{\sum_{l' \in L} \exp(w_{V(l')} \cdot h_{[MASK]})} \tag{4}$$

Here $s_{pk}$ is the MLM input, which contains the [MASK] token, $l$ is the label of input $s_{pk}$, $h_{[MASK]}$ is the hidden vector of [MASK], $V$ maps the label space to the vocabulary, and $w$ denotes the pre-trained weights. This approach introduces no new parameters: it reuses the original MLM head weights of the pre-trained model. Through the cross-entropy loss, we measure the difference between the predicted probability distribution and the true label and use it to update the model weights during fine-tuning. This reduces the gap between the pre-training task and the downstream task and improves downstream performance in the small sample setting.
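A small numerical sketch of Eq. 4 may help. It assumes the scores $w_v \cdot h_{[MASK]}$ have already been computed for a few vocabulary words; the dictionary `mask_logits`, the `verbalizer` mapping (playing the role of $V$), and the toy numbers are hypothetical.

```python
import math

def label_probabilities(mask_logits, verbalizer):
    """Softmax restricted to the label words, as in Eq. 4.

    mask_logits: dict word -> score w_v . h_[MASK] from the MLM head
    verbalizer:  dict label -> word, the mapping V from labels to vocabulary
    """
    scores = {label: mask_logits[word] for label, word in verbalizer.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def cross_entropy(probs, true_label):
    """Loss for one example: negative log-probability of the true label."""
    return -math.log(probs[true_label])

mask_logits = {"sports": 2.0, "finance": 1.0, "travel": 0.0, "the": 5.0}
verbalizer = {0: "sports", 1: "finance", 2: "travel"}
probs = label_probabilities(mask_logits, verbalizer)
predicted = max(probs, key=probs.get)
loss = cross_entropy(probs, true_label=0)
```

Note that the word "the" has the highest raw score but is ignored, because the normalization in Eq. 4 runs only over words that correspond to labels in $L$.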

4 Experiments

4.1 Experimental Principle

The complete experimental process is shown in Fig. 2.


Fig. 2 Experimental flow chart

Table 1 Prompt prefix list

First, we add the prompt containing the label to the input text and then mask the label words to construct a format similar to the pre-training task. We test several prefixes and symbols; the results are shown in Tables 1 and 2. We then choose the best template and symbol for the following experiments.

Table 2 Prompt symbol list
No.  Symbol   Acc.
1    ,        0.54975
2    ◦        0.54925
3    :        0.54129
4    ;        0.54428
5    —        0.54378
6    –        0.53781
7    (space)  0.54776

Table 3 Experimental parameters
Parameter      Value
Max length     64
Max entities   2
Batch size     16
Learning rate  1e-5
Warmup         0.1

Next, knowledge graph triples are added to the input text. Specifically, head entities are identified in the text, and each one is matched to its corresponding relation and tail entity in the knowledge graph. The relation and tail entity are then explicitly injected right after the head entity, so that they are included in the text. Finally, the text with prompt and representation is fed to the model, which extracts features and is followed by the generation head. The model gives a probability for each masked position over the vocabulary and finds the most likely label category through the joint probability to complete the classification task. The final input sentence looks like:

$s_{pk}$ = [CLS] It is [MASK][MASK] news, the 2022 Winter Olympics (Venue, snow and ice) was successfully held in Beijing (capital, China). [SEP]
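Because the template contains two [MASK] positions, each label is scored from two predictions. The sketch below shows one plausible reading of "joint probability": it assumes the two positions are scored independently and their log-probabilities are summed. That independence assumption, the function name `predict_label`, and the toy probabilities are illustrative and are not stated in the text.

```python
import math

def predict_label(mask_log_probs, label_words):
    """Pick the label whose tokens have the highest joint log-probability.

    mask_log_probs: one dict per [MASK] position, token -> log-probability
    label_words:    dict label -> sequence of tokens, one per [MASK]
    """
    scores = {}
    for label, chars in label_words.items():
        if len(chars) != len(mask_log_probs):
            raise ValueError("one label token is needed per [MASK] position")
        # Sum of logs equals the log of the product of per-position probabilities.
        scores[label] = sum(pos[ch] for pos, ch in zip(mask_log_probs, chars))
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical two-character label words for two news categories.
mask_log_probs = [
    {"体": math.log(0.6), "财": math.log(0.3)},
    {"育": math.log(0.7), "经": math.log(0.2)},
]
label_words = {"sports": "体育", "finance": "财经"}
best, scores = predict_label(mask_log_probs, label_words)
print(best)  # sports
```

Here the joint probability of "sports" is 0.6 × 0.7 = 0.42, against 0.3 × 0.2 = 0.06 for "finance", so "sports" is returned.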

4.2 Experimental Parameters

The specific parameters of the experiment are shown in Table 3. Max length is the maximum length of each sentence. Max entities is the number of knowledge graph triples injected into the text. A head entity may have multiple relations, each with its corresponding tail entity. This parameter caps the number of injected triples, which prevents large changes to the semantics and form of the text. Batch size is the number of examples fed to the model at a time. A larger value is preferred, but it is restricted by the memory of the graphics card: not all the data can be fed to the model at once, so in practice this parameter is set as large as the hardware allows.

Table 4 Classification result
Label No.   Precision   Recall   f1
Label 0     0.440       0.463    0.451
Label 1     0.526       0.604    0.562
Label 2     0.691       0.716    0.703
Label 3     0.466       0.358    0.405
Label 4     0.429       0.358    0.390
Label 5     0.525       0.627    0.571
Label 6     0.467       0.373    0.415
Label 7     0.659       0.664    0.662
Label 8     0.703       0.478    0.569
Label 9     0.491       0.619    0.548
Label 10    0.510       0.731    0.601
Label 11    0.568       0.313    0.404
Label 12    0.653       0.731    0.690
Label 13    0.611       0.597    0.604
Label 14    0.743       0.843    0.790
Acc. (Correct/Total): 0.5652

The learning rate plays a particularly important role in gradient descent optimization. The higher the learning rate, the faster the weights are updated. Although a large learning rate changes the model quickly, it can also make training oscillate, so the optimal solution may be missed. Therefore, we generally want the learning rate to be small at the beginning and to increase gradually over the first iterations, which alleviates early overfitting. This is the purpose of warmup: at the beginning of training the weight updates are unstable and the gradients are large, so a learning rate that is set too high may destabilize the weights. Warmup helps to reduce overfitting in the initial stage.
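The warmup behaviour described above can be sketched as a simple schedule. Two points are assumptions, since the text does not specify them: the warmup value 0.1 from Table 3 is read as the fraction of training steps used for warmup, and the rate is held constant afterwards (many fine-tuning setups decay it instead).

```python
def warmup_lr(step, total_steps, base_lr=1e-5, warmup=0.1):
    """Linear warmup: ramp from a small rate up to base_lr, then hold it."""
    warmup_steps = max(1, int(total_steps * warmup))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # assumption: constant after warmup

# Print the rate at a few steps of a hypothetical 100-step run.
for step in (0, 4, 9, 50):
    print(step, warmup_lr(step, total_steps=100))
```

With 100 steps in total, the rate rises linearly over the first 10 steps and stays at the configured learning rate of 1e-5 from then on.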

4.3 Experimental Result

According to the experimental results in Table 4, 10 of the 15 categories reach a precision above 0.5. Among them, label 14 obtains the best precision of 0.743. We speculate that the model can better identify the characteristics of this category. However, the precision of some categories is below 0.5; label 4 in particular reaches only 0.429, which means that label 4 is easily confused with other categories.


Fig. 3 Result heatmap

4.4 Confusion Matrix

According to the confusion matrix in Fig. 3, the precision of labels 8 and 14 is greater than or equal to 0.7. The precision of labels 2, 7, 12, and 13 is at least 0.6 and less than 0.7. The precision of labels 1, 5, 10, and 11 is at least 0.5 and less than 0.6. However, the precision of labels 0, 3, 4, 6, and 9 is less than 0.5. Among them, label 4 is easily confused with labels 0, 7, and 14.
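The per-label scores in Table 4 follow from the confusion matrix by the standard definitions. The sketch below uses a made-up 2 × 2 matrix, since the actual counts behind Fig. 3 are not given in the text, and assumes rows are true labels and columns are predicted labels.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples with true label i predicted as j."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    correct = sum(confusion[c][c] for c in range(n))
    total = sum(sum(row) for row in confusion)
    return metrics, correct / total

toy_confusion = [[3, 1],
                 [2, 4]]
metrics, accuracy = per_class_metrics(toy_confusion)

# Consistency check against Table 4: F1 of label 0 from its precision and recall.
f1_label0 = 2 * 0.440 * 0.463 / (0.440 + 0.463)
print(round(f1_label0, 3))  # 0.451
```

The last lines recompute the F1 score of label 0 from the reported precision and recall and recover the value listed in Table 4.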

4.5 Sample Setting and Baseline Models

Fine-tuning uses downstream data to update the model weights through transfer learning. Zero-shot applies the pre-trained model directly to the test data without training on the downstream data. GPT [1] reduces the model's over-dependence on labelled in-domain data and its tendency to overfit the in-domain data distribution. The model can use less in-domain data and complete downstream tasks without fine-tuning.


PET [8] constructs a prefix containing the label, replaces each word of the label with [MASK] when the prefix is applied to the text, and then completes the downstream task. LM-BFF [3] builds prompts automatically for template-based fine-tuning and dynamically selects examples as input context. P-tuning [6] uses unused tokens in the model's vocabulary to build templates automatically. When labelled data are scarce, only the template parameters need to be updated, which effectively alleviates overfitting on small samples. EFL [14] pairs the input text with a label description, constructs a general entailment task, and thereby transforms the problem into binary classification. K-BERT [4] injects knowledge graph triples into the input text and modifies the mask mechanism so that the injected knowledge does not change the semantics of the original text. PK-BERT uses a prefix, representation, and an MLM head: the label in the prefix is masked, and knowledge is explicitly represented in the text. The downstream task is transformed into a generation task through the MLM head, so that the model has both prompt and knowledge, and the distance between the downstream task and the pre-training task is shortened.

4.6 Comparison of Results

Based on our experimental results and FewCLUE, we make the comparison shown in Table 5. PK-BERT exceeds the other models in prediction accuracy, reaching 56.5%. Under the 15-way 16-shot setting, PK-BERT exceeds the PET baseline by 2.0 percentage points. Compared with the other models, PK-BERT combines pre-trained models with prompts, knowledge graph triples, and an MLM head to deal with the small sample problem, which is a simple but effective way to achieve better results.

5 Discussion

PK-BERT has both advantages and disadvantages. The input prompt prefix is constructed manually, which is simple and effective compared with generation-based construction methods. Injecting the knowledge graph directly into the text is intuitive and clear, and the results are logical and interpretable. The model is convenient to use: only the mask mechanism is changed, so many pre-trained models can be used directly. However, the manually built prompt may not be the optimal one. The efficiency of knowledge graph injection is also not high; it depends mainly on the capacity and relevance of the knowledge graph and on the performance of the entity matching method. How to improve these aspects is a problem worthy of study.

Table 5 Results comparison
Method                  Accuracy
Zero-shot (RoBERTa)     25.3
Zero-shot (GPT)         37.0
Fine-tuning (RoBERTa)   49.0
EFL                     52.1
LM-BFF                  53.0
P-tuning                54.2
PET (baseline)          54.5
K-BERT                  55.8
PK-BERT                 56.5

6 Conclusions

In this paper, we presented PK-BERT, an algorithm that combines a series of techniques to address the small sample problem. PK-BERT (1) achieves few-shot learning by fine-tuning pre-trained models on small samples; (2) constructs a prefix that contains the masked label to shorten the gap between the downstream task and the pre-training task; (3) uses explicit representation to inject knowledge graph triples into the text and enrich its information; and (4) uses the masked language modelling head to convert the classification task into a generation task. The experiments show that PK-BERT achieves better results than the compared baselines. We also discussed its limitations and suggested directions for future research.

References

1. T.B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners. arXiv:2005.14165 (2020)
2. J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018)
3. T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners. arXiv:2012.15723 (2020)
4. W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, P. Wang, K-BERT: enabling language representation with knowledge graph, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34 (2020), pp. 2901–2908
5. X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, J. Tang, P-tuning v2: prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv:2110.07602 (2021)
6. X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, GPT understands, too. arXiv:2103.10385 (2021)
7. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019)
8. T. Schick, H. Schütze, Exploiting cloze questions for few shot text classification and natural language inference. arXiv:2001.07676 (2020)
9. T. Schick, H. Schütze, It's not just size that matters: small language models are also few-shot learners. arXiv:2009.07118 (2020)
10. T. Shin, Y. Razeghi, R.L. Logan, E. Wallace, S. Singh, AutoPrompt: eliciting knowledge from language models with automatically generated prompts. arXiv:2010.15980 (2020)
11. Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, et al., ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. arXiv:2107.02137 (2021)
12. Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, H. Wu, ERNIE: enhanced representation through knowledge integration. arXiv:1904.09223 (2019)
13. Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, H. Wang, ERNIE 2.0: a continual pre-training framework for language understanding. arXiv:1907.12412 (2019)
14. S. Wang, H. Fang, M. Khabsa, H. Mao, H. Ma, Entailment as few-shot learner. arXiv:2104.14690 (2021)
15. L. Xu, X. Lu, C. Yuan, X. Zhang, H. Xu, H. Yuan, G. Wei, X. Pan, X. Tian, L. Qin, et al., FewCLUE: a Chinese few-shot learning evaluation benchmark. arXiv:2107.07498 (2021)
16. Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, Q. Liu, ERNIE: enhanced language representation with informative entities. arXiv:1905.07129 (2019)

Typhoon Track Prediction Based on TimeForce CNN-LSTM Hybrid Model

Jiadong Lu, Meixuan Jiang, Yuchen Zhang, Wei Lv, and Tongfei Li

Abstract Accurate prediction of typhoon trajectories can greatly reduce the loss of life and property and is of great significance for typhoon disaster reduction and risk assessment. With the growth of computing power and the development of deep learning, deep neural networks are gradually being applied in meteorology. This paper proposes a hybrid model mechanism, TF-CNN-LSTM, and applies it to typhoon trajectory prediction. First, a convolutional neural network extracts the data features of the typhoon trajectory; then the TimeForce module is added for time series enhancement; finally, a long short-term memory neural network outputs the prediction results. In the experiments, using the best-track typhoon dataset of the China Meteorological Administration, the TF-CNN-LSTM model is compared with a standalone LSTM model on predicting typhoon trajectories at 12 h, 24 h, and 48 h. The results show that the TF-CNN-LSTM hybrid model is significantly better than the single LSTM model in terms of root mean square error and actual error length. That is, the model further improves prediction accuracy over the LSTM typhoon trajectory prediction model.

Keywords Convolutional neural network · Long short-term memory network · Hybrid neural network · Typhoon track prediction

J. Lu · M. Jiang · Y. Zhang · W. Lv · T. Li (B)
Alibaba Cloud Big Data Application College, Zhuhai College of Science and Technology, Zhuhai, China
e-mail: [email protected]
W. Lv
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_3


J. Lu et al.

1 Introduction

A typhoon is a typical tropical weather system and one of the important forms of ocean–atmosphere interaction [1]. China is located in the Northwest Pacific region, where typhoons occur frequently. Every year in the summer and autumn typhoon season, China's coastal areas suffer economic losses and casualties of varying degrees from typhoon disasters. Timely prediction of typhoon trajectories can therefore provide effective information support for disaster prevention departments and so reduce casualties and economic losses. However, many factors affect a typhoon's trajectory, such as typhoon thermodynamics, typhoon dynamics, and the environmental field during the typhoon. After the typhoon makes landfall, its track is also affected by the land topography and the water depth along the coastline [2]. Typhoon trajectory prediction is thus a very important and challenging research topic. Early typhoon trajectory prediction methods mainly relied on thermodynamics and aerodynamics [3] to analyze the typhoon environmental field, combined with an analysis of how the complex coastline and land topography influence the typhoon's landing point, to establish rules of thumb specific to typhoon trajectory prediction. However, relying on experience in this way is inefficient and requires a lot of manpower and material resources, and the accuracy and timeliness of the predictions rarely meet the demand.
With the wide application of neural networks, and because typhoons have an obviously nonlinear structure and large volumes of related image data [4, 5], multi-layer deep neural networks (DNN), from convolutional neural networks (CNN) [6] to recurrent neural networks (RNN) [7], have gradually been used in typhoon prediction models in the hope of obtaining more convenient and accurate predictions. The CNN model is usually used to process image features and has powerful feature extraction and mining capabilities. Many researchers use typhoon satellite cloud images to extract the features of the spiral part in order to improve prediction accuracy. However, the features are extracted by reading the satellite cloud image data directly, so a low-resolution image greatly affects the CNN model's predictions. In the study of Gao Shan et al., the accuracy of the CNN model's image-based prediction dropped sharply because of the complex atmospheric factors in typhoon formation and the indistinct spiral radius in the cloud image [8]. This shows that the CNN model has certain defects in extracting features from typhoon cloud images and so cannot, on its own, deliver high-precision typhoon trajectory prediction.

The RNN [9–11] is a kind of neural network suited to time series data and has been widely used in many fields, but it suffers from the vanishing gradient problem in later iterations. Hochreiter et al. first proposed the long short-term memory network (LSTM) [12, 13]. By adding input gates, forget gates, and output gates, the weight of the network's self-loop can change, which avoids vanishing gradients. LSTM is suitable for processing and forecasting events with long delays in a time series. Much research has shown that LSTM achieves good results when typhoon data are treated as time series data. Gaoyang et al. [14] applied an LSTM network to typhoon trajectory prediction, combined with the dynamic time warping algorithm, and achieved good results in predicting 6-h typhoon trajectories.

With the rapid development of neural networks, the amount of data a network model considers has grown exponentially, the data have become more strongly interconnected, and different data have different impact factors. In this situation, the variable feature set constructed by traditional methods cannot fully reflect the connections among discontinuous features in high-dimensional space. Moreover, traditional LSTM algorithms cannot fully handle the classification or prediction of data that are closely interrelated and subject to many influencing factors [15]. Previous studies show that convolutional neural networks have significant advantages in data feature extraction and dimensionality reduction [16, 17]. Building on previous studies [18–20], this paper proposes a TimeForce dual-model hybrid network mechanism (TF-CNN-LSTM) based on the convolutional neural network (CNN) and the long short-term memory network (LSTM). It first uses typhoon data to construct a matrix of typhoon trajectories, then uses convolution to extract the correlation between the dimensions of the time series, adds the TimeForce module, and finally uses the LSTM network to map the output to the prediction result.

2 TF-CNN-LSTM Model

The typhoon dataset is used to generate the time series matrix of typhoon tracks, $M = \{M_1, M_2, \ldots, M_F\}$, where $F$ is the length of time of the sequence matrix and $M_i = D_1 \cdot H$ is the $i$th ($1 \le i \le F$, $i \in Z$) input sample of the typhoon time series. The dataset is also used to obtain the matrix of typhoon track characteristics, $M^C = \{M_1^C, M_2^C, \ldots, M_F^C\}$ with $M_i^C = D_2 \cdot H$ ($1 \le i \le F$, $i \in Z$), which corresponds one-to-one to the time series matrix. The two matrices are combined to obtain the input matrix of the model, $M^{input} = \{M_1^{input}, M_2^{input}, \ldots, M_F^{input}\}$, where $M_i^{input} = D_3 \cdot H$ ($1 \le i \le F$, $i \in Z$) is one sample of model input. To fully consider the correlation between time and characteristics in typhoon trajectory prediction, as well as the influence of previous trajectory data, this paper proposes the basic structure of the TF-CNN-LSTM hybrid neural network. It combines the advantages of the CNN and LSTM networks and also incorporates the TimeForce module. Applied to typhoon trajectory prediction, it obtains good results.


Fig. 1 TF-CNN-LSTM hybrid network structure

Figure 1 shows the flow chart of the entire model. The two purple matrices represent, from top to bottom, the matrix of typhoon trajectory characteristics and the matrix of the typhoon trajectory time series, and the gray matrix represents the fused typhoon trajectory matrix. The yellow and green matrices represent feature matrices in the CNN model, the blue part is the variable feature stretch, the red part is the TimeForce module, and the orange part is the LSTM network layer. $D_1$, $D_2$, and $D_3$ are the widths of the matrices, that is, the numbers of features of the typhoon sequence data, and $H$ is the height of the matrices. The specific steps of the TF-CNN-LSTM model are as follows.

Step 1: The data of the China Typhoon Network are processed into a matrix of typhoon characteristics and a matrix of typhoon sequences, and the two are arranged in order of the time of the typhoon trajectory to form a matrix sequence, which is used as the input of the CNN model.
Step 2: Use the CNN for feature extraction. Convolutional neural networks can effectively extract short-term correlations between time series dimensions and are suitable for multi-dimensional time series prediction.
Step 3: Use the TimeForce module to redistribute the weights of the hidden state features according to the time step, and then multiply each weight by the corresponding feature to obtain the weighted hidden state features.
Step 4: Input the time-dependent sequence data into the LSTM in time order to obtain the prediction result.
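Step 3 can be illustrated with a small sketch of per-time-step weighting. How the TimeForce module actually computes its weights is not specified in this part of the text, so the softmax over hypothetical per-step scores below is an assumption, as are the names `time_weighted` and `step_scores`.

```python
import math

def time_weighted(hidden_states, step_scores):
    """Weight each time step's feature vector by a normalized score.

    hidden_states: list of T feature vectors (one per time step)
    step_scores:   list of T raw scores (hypothetical learned values)
    """
    top = max(step_scores)
    exps = [math.exp(s - top) for s in step_scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # weights over time steps sum to 1
    weighted = [[w * x for x in h] for w, h in zip(weights, hidden_states)]
    return weights, weighted

hidden = [[1.0, 2.0], [3.0, 4.0]]  # two time steps, two features each
weights, weighted = time_weighted(hidden, [0.0, 0.0])
print(weights)   # [0.5, 0.5]
print(weighted)  # [[0.5, 1.0], [1.5, 2.0]]
```

With equal scores every step receives the same weight; raising the score of a step increases its share of the weighted features passed on to the LSTM.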

2.1 CNN Model

The convolutional neural network (CNN) is widely used in deep learning and has achieved remarkable results in image recognition, speech recognition, text classification and other fields. CNNs are mostly used for image processing tasks, but studies have shown that they are also well suited to prediction tasks. The principle is that the convolution kernel captures the behavior of the data over a period of time, and predictions are made from the behavior of the data in the preceding period. A CNN is composed of an input layer, convolution layers, pooling layers, fully connected layers and an output layer. Convolutional and pooling layers are used for feature extraction, and fully connected layers are used for feature weighting. This structure reduces the number of connections and improves the ability of the model to extract abstract features, and to a certain extent it alleviates the slow training and the tendency to overfit of a fully connected network. The structure of the CNN network is shown in Fig. 2.

Fig. 2 CNN network structure

In this paper, the typhoon track data is fed into the CNN to process the time series data. The specific operations are as follows: a one-dimensional convolution (Conv1d) with 64 convolution kernels extracts features, followed by ReLU activation, max-pooling, and dropout to prevent overfitting. The feature matrix is obtained from these convolution and pooling operations. Finally, it is stretched into a vector (the variable feature stretch), which is used as the input of the next module.
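The convolution, activation and pooling sequence described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it uses two hand-written kernels instead of 64 learned ones, a single input channel, and omits dropout (which only acts during training). The names `conv1d`, `relu`, `max_pool1d` and `cnn_block` are hypothetical helpers introduced for illustration.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation) of a list with one kernel."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def cnn_block(sequence, kernels):
    """Conv1d -> ReLU -> max-pool for each kernel, then stretch into one vector."""
    features = []
    for kernel in kernels:
        features.extend(max_pool1d(relu(conv1d(sequence, kernel))))
    return features


# Illustrative latitudes of six consecutive track points (made-up values).
latitudes = [18.0, 18.5, 19.2, 20.1, 21.3, 22.0]
# Assumed kernels: a first difference and a two-point moving average.
kernels = [[-1.0, 1.0], [0.5, 0.5]]
print(cnn_block(latitudes, kernels))
```

Each kernel yields one feature map, and concatenating the pooled maps plays the role of the stretch step that produces the input vector of the next module.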

2.2 TimeForce Module

The TimeForce module of this paper is shown in Fig. 3. After the input data passes through the CNN network, a series of hidden states is obtained. To improve prediction accuracy, these hidden states are weighted according to their time steps, which improves the overall performance of the hybrid model.
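The text does not specify how the per-time-step weights are computed, so the following is only one possible reading, offered as a hedged sketch: each time step has a score (assumed here to be a learned parameter, supplied by hand), the scores are normalized with a softmax, and every hidden state is multiplied by the weight of its time step. The softmax normalization and the names `softmax` and `weight_hidden_states` are assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_hidden_states(hidden_states, step_scores):
    """Multiply the hidden state of each time step by its normalized weight."""
    weights = softmax(step_scores)
    return [[w * h for h in state] for w, state in zip(weights, hidden_states)]


# Three time steps with two hidden features each (illustrative values).
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Assumed scores: later time steps receive larger weights.
step_scores = [0.0, 0.5, 1.0]
print(weight_hidden_states(hidden_states, step_scores))
```

Whatever the exact weighting rule, the output keeps the shape of the input, so it can be passed on as a time-ordered sequence.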


Fig. 3 TimeForce module flowchart

2.3 LSTM Model

In this paper, we use an LSTM (long short-term memory) network to predict the typhoon trajectory from the time series. The structure of the LSTM unit is shown in Fig. 4, and each LSTM unit records the state $S_t$ at time $t$. An LSTM cell has three gates: the forget gate $f_t$, the input gate $i_t$ and the output gate $o_t$. The input gate $i_t$ controls how the input $x_t$ of the current time step enters the current memory cell and is recorded. It is calculated by formulas (1) and (2):

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{1}$$

$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{2}$$

Fig. 4 LSTM neural unit structure

In the formulas, $W_i$ is the weight of the input gate; $h_{t-1}$ is the hidden state of the previous time step; $x_t$ is the input data; $b_i$ is the bias of the input gate; $W_c$ and $b_c$ are the weight and bias used to compute the candidate memory cell $\tilde{C}_t$. The forget gate $f_t$ controls whether the information in the memory cell of the previous time step is passed on to the current time step. It is calculated by formula (3):

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{3}$$

In the formula, $W_f$ is the weight of the forget gate and $b_f$ is its bias. When the output gate $o_t$ is approximately 0, the information of the current memory cell is not passed on. It is calculated by formula (4):

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{4}$$

In the formula, $W_o$ is the weight of the output gate and $b_o$ is its bias. The memory cell and the hidden state are updated by formulas (5) and (6):

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{5}$$

$$h_t = o_t * \tanh(C_t) \tag{6}$$

Compared with the traditional recurrent neural network, the cell state of the LSTM can accumulate activity over a period of time, which alleviates the vanishing and exploding gradient problems and better captures the dependencies in a time series. The LSTM can retain time series features over long periods, so sequence learning on the historical data of typhoon tracks helps to improve prediction accuracy. The LSTM network in the TF-CNN-LSTM hybrid neural network proposed in this paper has 3 hidden layers, and its structure is shown in Fig. 5.
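As an illustration of formulas (1) to (6), the sketch below performs one LSTM time step in pure Python. It is a minimal example under stated assumptions: weights are plain lists of rows, the parameter dictionary and the names `affine` and `lstm_step` are hypothetical, and the numeric values are made up rather than trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W · vector + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following formulas (1)-(6)."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]        # (1)
    c_tilde = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], concat)]  # (2)
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]        # (3)
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]        # (4)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]          # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                              # (6)
    return h_t, c_t


# One hidden unit and a two-feature input (e.g. standardized longitude, latitude).
params = {
    "W_i": [[0.1, 0.2, -0.1]], "b_i": [0.0],
    "W_c": [[0.3, -0.2, 0.4]], "b_c": [0.0],
    "W_f": [[0.0, 0.1, 0.1]], "b_f": [1.0],
    "W_o": [[0.2, 0.0, -0.3]], "b_o": [0.0],
}
h_t, c_t = lstm_step(x_t=[0.3, -0.2], h_prev=[0.0], c_prev=[0.0], params=params)
print(h_t, c_t)
```

Calling `lstm_step` repeatedly, feeding each returned `h_t` and `c_t` into the next call, processes a whole sequence in time order.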

Fig. 5 LSTM network structure


3 Experiment

3.1 Experimental Data

A Python crawler was used to collect typhoon data from the China Weather and Typhoon Network (http://typhoon.weather.com.cn). The best-track data of about 200 typhoons from 1949 to 2020 were obtained, including longitude, latitude, air pressure and wind speed. Typhoon sequences lasting less than 48 h were excluded from the experiment, and the data were further filtered because the quality of the early typhoon records is uneven; 11,466 typhoon records were finally retained. The remaining typhoon sequence data is divided into a training set and a validation set at a ratio of 4:1, and the original input features are longitude and latitude. Because the typhoon track is represented by two-dimensional coordinates, this paper samples the track points of each typhoon, and each track point $P$ can be represented as a two-element list: $P = [x, y]$ ($x$ is longitude, $y$ is latitude). For typhoon sequence data with $t$ time steps, this paper uses $Tra = ([x_1, y_1], [x_2, y_2], \ldots, [x_t, y_t])$ to denote a trajectory segment, which describes the geographic information generated by any typhoon during its movement. The training data needs to be standardized before training. This paper uses the z-score method for standardization:

$$X_{trans} = \frac{x_{old} - mean}{\sqrt{var}} \tag{7}$$
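Formula (7) can be sketched in a few lines of Python. The example assumes the population variance (dividing by the number of samples) and a non-constant feature; whether the paper uses the population or the sample variance is not stated. The function name `z_score` and the sample longitudes are illustrative.

```python
import math


def z_score(values):
    """Standardize a list of values to zero mean and unit variance, as in formula (7)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance (assumption)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]


longitudes = [120.5, 121.0, 122.3, 123.1]  # made-up sample values
print(z_score(longitudes))
```

In practice the mean and variance would normally be computed on the training set only and reused for the validation set, so that both are scaled consistently.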

In practice, to predict the position $m$ time steps ahead from $n$ consecutive track points, the inputs are the sliding windows $([x_1, y_1], [x_2, y_2], \ldots, [x_n, y_n])$, $([x_2, y_2], [x_3, y_3], \ldots, [x_{n+1}, y_{n+1}])$, $\ldots$, $([x_{t-n-m+1}, y_{t-n-m+1}], [x_{t-n-m+2}, y_{t-n-m+2}], \ldots, [x_{t-m}, y_{t-m}])$, and the corresponding outputs are $[x_{n+m}, y_{n+m}]$, $[x_{n+m+1}, y_{n+m+1}]$, $\ldots$, $[x_t, y_t]$.
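The input and output construction above is a sliding window over the track. A minimal sketch, with the hypothetical helper `make_windows` and a made-up six-point track, could look as follows; `n` is the window length and `m` the number of steps ahead.

```python
def make_windows(track, n, m):
    """Pair each window of n consecutive points with the point m steps after its end."""
    samples = []
    for start in range(len(track) - n - m + 1):
        window = track[start:start + n]   # points start+1 .. start+n (1-based)
        target = track[start + n + m - 1]  # point start+n+m (1-based)
        samples.append((window, target))
    return samples


# Illustrative track of t = 6 points [longitude, latitude].
track = [[120.0, 15.0], [120.4, 15.6], [120.9, 16.1],
         [121.5, 16.9], [122.0, 17.4], [122.8, 18.2]]
for window, target in make_windows(track, n=3, m=2):
    print(window, "->", target)
```

A track of length $t$ yields $t - n - m + 1$ samples, which matches the first and last windows written out in the text.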

3.2 Accuracy Evaluation Index

When a model is used for forecasting, the error is calculated from the difference between the predicted and actual coordinates. The actual error distance (ErrDis, in km) is the distance on the earth's surface between the predicted coordinate point and the actual coordinate point. The root mean square error (RMSE) reflects the offset of the coordinate points. Using these two accuracy indicators to evaluate the error of the TF-CNN-LSTM model for typhoon prediction therefore reflects the reliability of the model more intuitively. The formulas for RMSE and ErrDis are as follows:


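$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( (x_t - \hat{x}_t)^2 + (y_t - \hat{y}_t)^2 \right)} \tag{8}$$

$$\begin{cases} a = y_t - \hat{y}_t \\ b = x_t - \hat{x}_t \\ \mathrm{ErrDis} = 2 \arcsin \sqrt{\sin^2 \dfrac{a}{2} + \cos y_t \cdot \cos \hat{y}_t \cdot \sin^2 \dfrac{b}{2}} \times R \end{cases} \tag{9}$$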

Here $y_t$ and $x_t$ are the real values of latitude and longitude, $\hat{y}_t$ and $\hat{x}_t$ are the corresponding predicted values, and $t$ indexes the $t$-th prediction. $a$ is the difference between the real and predicted latitudes, and $b$ is the difference between the real and predicted longitudes. $R$ is the radius of the earth, taken as 6378.137 km. For both RMSE and ErrDis, smaller values are better, and the model that minimizes them in training is taken as the optimal model. During training, the appropriate model structure and model parameters are determined by observing the changes in RMSE and ErrDis.
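Both indicators can be sketched with the Python standard library. The distance below uses the standard haversine form, in which the cosine terms are taken at the two latitudes, and it assumes coordinates given in degrees as (longitude, latitude) pairs; the function names `rmse` and `err_dis` are illustrative.

```python
import math

EARTH_RADIUS_KM = 6378.137  # radius R used in the text


def rmse(actual, predicted):
    """Root mean square error over pairs of (x, y) points, as in formula (8)."""
    total = sum(
        (x - xh) ** 2 + (y - yh) ** 2
        for (x, y), (xh, yh) in zip(actual, predicted)
    )
    return math.sqrt(total / len(actual))


def err_dis(actual, predicted):
    """Haversine distance in km between two (longitude, latitude) points in degrees."""
    (x, y), (xh, yh) = actual, predicted
    a = math.radians(y - yh)  # latitude difference
    b = math.radians(x - xh)  # longitude difference
    h = (math.sin(a / 2) ** 2
         + math.cos(math.radians(y)) * math.cos(math.radians(yh)) * math.sin(b / 2) ** 2)
    return 2 * math.asin(math.sqrt(h)) * EARTH_RADIUS_KM


observed = [(125.0, 18.0), (124.2, 18.6)]   # made-up observed points
forecast = [(125.3, 18.2), (124.0, 19.0)]   # made-up predicted points
print(rmse(observed, forecast))
print([err_dis(o, f) for o, f in zip(observed, forecast)])
```

Note that RMSE here is expressed in degrees of longitude and latitude, whereas ErrDis is a physical distance, which is why the two indicators in Table 1 have such different magnitudes.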

3.3 Analysis of Experimental Results

Since typhoon prediction must be timely, the training time step selected for the experiment is one day (24 h) of data. All the typhoon track data is then divided: 75% of the data is used as the training set to train the parameters of the CNN-LSTM prediction model, and the remaining 25% is used as the validation set to verify what the model has learned. To verify the performance of the TF-CNN-LSTM model for typhoon track prediction, this paper compares it with a standalone LSTM model and with a CNN-LSTM hybrid model without the time series enhancement module, predicting the typhoon track coordinates 12 h, 24 h and 48 h ahead. The results in Table 1 show that adding a CNN to the LSTM model reduces both RMSE and ErrDis, which confirms the ability of the CNN to extract correlations in the data. The TF-CNN-LSTM model, which adds the TimeForce mechanism, obtains lower RMSE and ErrDis values than the other models, and the reduction is larger for longer prediction horizons. Although LSTM can also strengthen time series information and has long memory, under the CNN-LSTM structure long sequences reduce the prediction accuracy of the model. The TimeForce mechanism addresses this problem by redistributing the weights of the time series information. Therefore, in the 12 h, 24 h and 48 h typhoon trajectory predictions, the TF-CNN-LSTM model reduces the prediction error compared with the other deep learning prediction models, indicating that it has better prediction performance.

Table 1 Performance comparison of different models

| | LSTM RMSE | LSTM ErrDis/km | CNN-LSTM RMSE | CNN-LSTM ErrDis/km | TF-CNN-LSTM RMSE | TF-CNN-LSTM ErrDis/km |
| --- | --- | --- | --- | --- | --- | --- |
| Predict 12 h | 1.92 | 173.01 | 1.71 | 150.30 | 1.59 | 146.23 |
| Predict 24 h | 2.31 | 281.03 | 2.01 | 207.32 | 1.81 | 199.76 |
| Predict 48 h | 5.03 | 459.62 | 3.78 | 328.91 | 3.02 | 285.14 |
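The size of the improvement can be read off Table 1 with a short calculation. The sketch below computes the relative reduction of TF-CNN-LSTM against the standalone LSTM for each horizon; the dictionary layout and the helper name `reduction` are illustrative, and the numbers are copied from Table 1.

```python
# (RMSE, ErrDis in km) per model and prediction horizon, taken from Table 1.
table_1 = {
    "12 h": {"LSTM": (1.92, 173.01), "CNN-LSTM": (1.71, 150.30), "TF-CNN-LSTM": (1.59, 146.23)},
    "24 h": {"LSTM": (2.31, 281.03), "CNN-LSTM": (2.01, 207.32), "TF-CNN-LSTM": (1.81, 199.76)},
    "48 h": {"LSTM": (5.03, 459.62), "CNN-LSTM": (3.78, 328.91), "TF-CNN-LSTM": (3.02, 285.14)},
}


def reduction(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


for horizon, row in table_1.items():
    base_rmse, base_dis = row["LSTM"]
    tf_rmse, tf_dis = row["TF-CNN-LSTM"]
    print(f"{horizon}: RMSE reduced by {reduction(base_rmse, tf_rmse):.1f}%, "
          f"ErrDis reduced by {reduction(base_dis, tf_dis):.1f}%")
```

Relative to the LSTM baseline, the ErrDis reduction grows from roughly 15% at 12 h to roughly 29% at 24 h and 38% at 48 h, which is consistent with the observation that the gain is larger for longer horizons.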

3.4 Case Verification

This paper selects Typhoon Kompasu, a high-impact typhoon of 2021, for verification. The comparison between the model predictions and the actual observations is shown in Fig. 6. The green track represents the observed typhoon track, the light yellow track represents the values predicted by the TF-CNN-LSTM hybrid model, and the dark red track represents the values predicted by the LSTM model alone. The figure shows that with the LSTM model alone the predicted typhoon trajectory fluctuates greatly and the trajectory points are too scattered. In contrast, the TF-CNN-LSTM hybrid model proposed in this paper has better prediction stability and higher accuracy.

Fig. 6 Comparison between the actual observed and predicted values of Typhoon Kompasu from October 8 to 14, 2021

4 Discussion

In this paper, we propose a deep learning-based typhoon track prediction model. We verified the effectiveness of the TF-CNN-LSTM model for typhoon path prediction, and our experiments show that its root mean square error and actual error distance are better than those of the other models compared. Deep learning has strong predictive ability in typhoon path prediction, but its application to this task is still at an exploratory stage. Although many scholars have done a lot of work, no model has yet been put forward that handles typhoon path prediction well, especially long-term prediction. The TF-CNN-LSTM model proposed in this paper achieved considerable results for typhoon track prediction and, to our surprise, unexpectedly good results in the long-term prediction of typhoon tracks. Building on our study of the time series as a characteristic variable, we propose the time series enhancement (TimeForce) mechanism. Incorporating this mechanism into a multimodal model amplifies the time series dimension so that the model pays more attention to it, which explains how the proposed method achieves its results in long-term forecasting of typhoon tracks. We demonstrate that the TF-CNN-LSTM model meets or even exceeds the current state of the art in long-term prediction of typhoon trajectories.

While these initial results are encouraging, many challenges remain. The typhoon track is affected by many factors, such as wind speed, pressure and rainfall. In Xu Guangning's research on short-term typhoon track prediction, the model fully exploits the high-dimensional characteristics of typhoons and achieves good results [21]. We therefore believe that in short-term typhoon track prediction the TF-CNN-LSTM model, by focusing on the time series, reduces the influence of the other feature dimensions on the predicted track. In other words, integrating the time series enhancement mechanism affects the model's ability to understand other dimensional features, which weakens its performance in short-term typhoon track prediction. In follow-up research, we will fully consider the impact of other related factors on the typhoon trajectory, design a deeper model structure, and increase the model complexity. Nevertheless, the model proposed in this paper has demonstrated good prediction ability for long-term time series. It achieves considerable results in the long-term prediction of typhoon paths, and it can also serve as a first-choice neural network model for other similar long-term prediction tasks.

5 Conclusion

The TF-CNN-LSTM hybrid model proposed in this paper has research significance for typhoon trajectory prediction. It mines both the features and the time information of typhoon track data: the CNN extracts the correlations among time series features from the input data, and the resulting hidden states are weighted to obtain a sequence with enhanced time dependence. Finally, the constructed deep neural network is used to analyze and predict the typhoon trajectory. The experimental results show that the TF-CNN-LSTM model has a strong ability to predict the typhoon trajectory sequence, and that integrating the time series enhancement strategy into the hybrid model makes the improvement in prediction accuracy more pronounced. In the next step, we intend to proceed along three lines in order to further improve the accuracy of typhoon trajectory prediction: (1) use element correlation or association rules to select the elements that have a greater impact on typhoon tracks, so as to integrate multi-dimensional data features; (2) take into account the three-dimensional structure of the typhoon and climate factors such as the wind belt in which the typhoon is located and warm and cold currents, and construct a three-dimensional time series structure of the typhoon and its surroundings, so as to model the typhoon path accurately; (3) establish a multi-modal fusion model with multiple structures and an attention mechanism to improve the understanding and generalization capabilities of the model.

References

1. Z. Zhiwei, A case study of the upper ocean response to typhoons in the Northwest Pacific Ocean. Ocean Bulletin 38(5), 562–568 (2019)
2. Y. Jinhua, T. Jiaxiang, D. Yuhan et al., Operational forecast errors and causes of typhoon tracks in my country. Meteorology 38(6), 695–700 (2012)
3. X.Y. Huang, L. Jin, An artificial intelligence prediction model based on principal component analysis for typhoon tracks. Chinese J. Atmospheric Sci. 37(5), 1154–1164 (2013)
4. P. Liang, in Research on Media Early Warning of Typhoon Disasters (2001–2010) Under the Path of Big Data (Central China Normal University, 2014)
5. W. Xuyang, Review of traditional models and neural networks for predicting typhoon tracks. Scientific Consult. (Science and Technology Management) 01, 62–65 (2020)
6. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks (2013)
7. J.L. Elman, Finding structure in time. Cognitive Sci. 14(2), 179–211 (1990)
8. G. Shan, Research on typhoon intensity prediction based on deep learning. Guangxi University (2021). https://doi.org/10.27034/d.cnki.ggxiu.2021.000721
9. N. Tokgza, in A RNN based time series approach for forecasting Turkish electricity load. 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4 (2018)
10. L. Yang, W. Yuqian, W. Junli, L. Yili, Review of recurrent neural network research. Computer Appl. 38(S2), 1–32 (2018)
11. X. Yulu, A review of the development of recurrent neural networks. Computer Knowl. Tech. 15(21), 182–184 (2019). https://doi.org/10.14004/j.cnki.ckt.2019.2379
12. R.C. Staudemeyer, E.R. Morris, Understanding LSTM - a tutorial into long short-term memory recurrent neural networks. arXiv:1909.09586 (2019)
13. G. Chengliang, Research on the Representation Method of Text Context-Dependent Features Based on LSTM (Hebei University of Science and Technology, 2019)
14. X. Gaoyang, L. Yao, Application of LSTM network in typhoon track prediction. Comp. Modern. 285(5), 68–72 (2019)
15. H. Qi, L. Dongxu, S. Wei, H. Dongmei, D. Yanling, Typhoon track prediction model based on dual attention mechanism. Ocean Bulletin 40(4), 387–395 (2021)
16. L. Zhishuai, L. Yisheng, X. Gang, Short-term traffic flow prediction based on graph convolutional neural network and attention mechanism. Traffic Eng. 19(4), 15–47. https://doi.org/10.13986/j.cnki.jote.2019.04.003
17. Z. Xue, Movie Box Office Prediction Based on Deep Learning Convolutional Neural Network (Capital University of Economics and Business, 2017)
18. H. Jie, Z. Feng, D. Zhenhong, L. Renyi, C. Xiaopei, PM_(2.5) hourly concentration prediction based on RNN-CNN integrated deep learning model. J. Zhejiang University (Science Edition) 46(3), 370–379 (2019)
19. Z. Hongrui, X. Lei, Research on stock prediction based on LSTM-CNN-CBAM model. Comput. Eng. Appl. 57(03), 203–207 (2021)
20. D. Min, in Design and Implementation of Video Semantic Analysis System Based on CNN and LSTM (Nanjing University of Posts and Telecommunications, 2018)
21. X. Guangning, in Research on Typhoon Track and Intensity Prediction Method Based on Deep Learning (Harbin Institute of Technology, 2020)

The Novel Characterizing Method of Collective Behavior Pattern in PSO

Xuan Wu, Jiaqi Huang, Xiaoyang Fu, You Zhou, Yanchun Liang, and Chunguo Wu

Abstract Although swarm intelligence algorithms have attracted extensive attention, there is little research on their collective dynamics. Inspired by the moving patterns of fish schools, we propose a method for visualizing collective behavior patterns based on the velocity field, discover various collective behavior patterns, and propose a discriminative index named the swarm trend factor. On the basis of the swarm trend factor, this paper also proposes a novel method for dividing swarm states. In the experiments, we demonstrate that the swarm trend factor reflects the performance of PSO, and we compare it with another swarm state division method.

Xuan Wu and Jiaqi Huang contribute equally.

X. Wu · Y. Zhou · C. Wu (B)
College of Computer Science and Technology, Jilin University, Changchun, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China

J. Huang · X. Fu (B) · Y. Liang
School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_4


1 Introduction

In recent years, inspired by natural swarm behavior [1, 2], many swarm intelligence algorithms have been proposed. Traditional algorithms include particle swarm optimization (PSO) [3] and the ant colony algorithm [4]; more recent ones include the intelligent water drop algorithm [5], the gray wolf algorithm [6], and the salp swarm algorithm [7]. Because their interaction mechanisms differ, the swarm states of these algorithms differ significantly. In addition, almost all swarm intelligence algorithms are stochastic. Research on their collective dynamics is therefore difficult. Most algorithms exhibit two kinds of collective behavior, exploration and exploitation, and are designed to strike a good balance between them. Exploration searches a wider region of the search space, whereas exploitation performs a fine search in the regions of the currently best solutions. However, there are few clear definitions that distinguish these two behaviors. To solve this problem, Zhan et al. [8] divided the swarm into convergence, exploitation, exploration, and jumping-out states according to an evolutionary state defined by the Euclidean distance between particles (see Sect. 2.2 for more details). To improve the performance of PSO, Aoun et al. [9] used a hidden Markov model to predict the swarm state proposed in [8] and balanced exploration and exploitation better by adjusting the parameters of PSO. Outside computer science, researchers have done a lot of work on the swarm behavior of organisms [10–14]. Kolbjørn et al. [14] modeled a fish school to demonstrate that swarm behavior can be mapped to a set of order parameters, and on that basis defined three swarm states: Swarm, Milling and Polarized.

In this paper, we introduce the concept of the velocity field to PSO and find some regular collective behaviors. In addition, to distinguish the collective behavior patterns of exploration and exploitation, this paper designs a quantitative index based on the velocity information of particles, named the swarm trend factor $r$. Specifically, we first define the behavior angle $\alpha_i$, which is the angle between the particle velocity $v_i$ and the reference direction $\theta_i$. We then define the swarm trend factor $r$ as the ratio of the number of particles whose behavior angle lies in a certain interval to the total number of particles. Finally, to better define the exploration and exploitation states, we introduce membership functions from fuzzy mathematics. In the experiments, we first demonstrate that the swarm trend factor $r$ reflects the performance of PSO, and then compare it with another swarm state division method [8]. The experimental results show that the swarm trend factor and the evolutionary factor are highly similar in the exploitation state.

The rest of this paper is arranged as follows. In Sect. 2, we introduce the original PSO and review the work on swarm state division in PSO and in biology. In Sect. 3, we discover various collective behavior patterns based on the velocity field and propose a method for dividing swarm states. In Sect. 4, we analyze the relationship between the swarm trend factor and algorithm performance and compare the swarm trend factor with the evolutionary factor [8]. In Sect. 5, we summarize the paper.

2 Related Work

In this section, we introduce in turn the original PSO algorithm, swarm state division based on the evolutionary factor, and swarm state division based on order parameters.

2.1 The Original PSO Algorithm

In PSO, each particle is associated with two attributes, namely velocity $v$ and position $x$, whose update formulas are as follows:

$$v_{i,d} = \omega v_{i,d} + c_1 rand1_d \, (pbest_{i,d} - x_{i,d}) + c_2 rand2_d \, (gbest_d - x_{i,d}), \tag{1}$$

$$x_{i,d} = x_{i,d} + v_{i,d}, \tag{2}$$

where, for the $d$th dimension, $v_{i,d}$ and $x_{i,d}$ denote the velocity and position of particle $i$, respectively, and $rand1_d$ and $rand2_d$ are two numbers drawn uniformly at random from the range $[0, 1]$. $pbest_i$ denotes the historical best position of particle $i$, $gbest$ denotes the historical best position in the population, $\omega$ is the inertia weight, and $c_1$ and $c_2$ are the acceleration coefficients.
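Update formulas (1) and (2) translate directly into code. The sketch below applies one iteration in place; the default values of `w`, `c1` and `c2` are illustrative assumptions rather than settings taken from this paper, and updating `pbest` and `gbest` after the move is left out for brevity.

```python
import random


def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply the velocity update (1) and the position update (2) to every particle."""
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            r1, r2 = rng.random(), rng.random()  # fresh random numbers per dimension
            velocities[i][d] = (
                w * velocities[i][d]
                + c1 * r1 * (pbest[i][d] - positions[i][d])
                + c2 * r2 * (gbest[d] - positions[i][d])
            )
            positions[i][d] += velocities[i][d]


# Two particles in two dimensions (made-up values).
positions = [[-4.5, -4.2], [-4.8, -4.9]]
velocities = [[0.1, -0.1], [0.0, 0.2]]
pbest = [p[:] for p in positions]  # each particle starts at its own best position
gbest = [-4.5, -4.2]
pso_step(positions, velocities, pbest, gbest)
print(positions, velocities)
```

A full optimizer would evaluate the objective after each step and refresh `pbest` and `gbest` before the next call.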

2.2 Swarm State Division Based on the Evolutionary Factor

Zhan et al. [8] defined the average Euclidean distance from particle $i$ to the other particles as follows:

$$d_i = \frac{1}{N-1} \sum_{j=1, j \ne i}^{N} \sqrt{\sum_{k=1}^{D} (x_{i,k} - x_{j,k})^2}, \tag{3}$$

where $N$ denotes the population size and $D$ denotes the number of dimensions. Subsequently, they proposed the evolutionary factor $f$, based on the average Euclidean distance, to divide swarm states:

$$f = \frac{d_g - d_{min}}{d_{max} - d_{min}}, \tag{4}$$

where $d_g$ denotes the average Euclidean distance from gbest to the other particles, and $d_{max}$ and $d_{min}$ denote the maximum and minimum distances over all particles in the swarm, respectively. Through experiments, Zhan et al. divided the swarm into four states: the Convergence state when $f \in [0, 0.3]$, the Exploitation state when $f \in [0.2, 0.6]$, the Exploration state when $f \in [0.4, 0.8]$, and the Jumping-out state when $f \in [0.7, 1]$. Because the boundaries between the four states are fuzzy, Zhan et al. also introduced membership functions from fuzzy mathematics to describe the fuzzy relationship between the swarm states, as shown in Fig. 1.

Fig. 1 Fuzzy membership functions of four swarm states [8]
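Formulas (3) and (4) can be sketched as follows. The example assumes that gbest coincides with the position of one particle in the swarm, identified by `gbest_index`, and returns 0 when all mean distances are equal; both choices, and the function names, are assumptions made for illustration. `math.dist` requires Python 3.8 or later.

```python
import math


def mean_distances(positions):
    """Average Euclidean distance from each particle to all others, formula (3)."""
    n = len(positions)
    return [
        sum(math.dist(positions[i], positions[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def evolutionary_factor(positions, gbest_index):
    """Evolutionary factor f of formula (4); gbest is assumed to be a swarm member."""
    d = mean_distances(positions)
    d_min, d_max = min(d), max(d)
    if d_max == d_min:  # degenerate swarm: all mean distances are equal
        return 0.0
    return (d[gbest_index] - d_min) / (d_max - d_min)


positions = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]  # made-up swarm
print(mean_distances(positions))
print(evolutionary_factor(positions, gbest_index=0))
```

A small $f$ means that gbest sits in the densest part of the swarm, while a value near 1 means that gbest is far from most particles.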

2.3 Swarm State Division Based on Order Parameter

Kolbjørn et al. [14] modeled the fish shoal system and observed three kinds of collective behaviors, as shown in Fig. 2. To measure the consistency of individual velocity directions in the swarm, Kolbjørn et al. introduced the polarization order parameter $O_p$ proposed in [15, 16], which is defined as follows:

$$O_p = \frac{1}{N} \left| \sum_{i=1}^{N} u_i \right|, \tag{5}$$

where $u_i$ denotes the unit direction of fish $i$. The value range of $O_p$ is $[0, 1]$: 0 means that the movement directions of the individuals in the swarm are disordered, and 1 means that all individuals in the swarm move in the same direction. To describe the degree of rotation of the fish around the center of mass, Kolbjørn et al. also introduced the rotation order parameter $O_r$, which is defined as follows:

$$O_r = \frac{1}{N} \left| \sum_{i=1}^{N} u_i \times r_i \right|, \tag{6}$$

where $r_i$ denotes the unit vector pointing from the centroid of the fish school to fish $i$. The value range of $O_r$ is $[0, 1]$: 0 means that the movement of the individuals in the swarm shows no rotation, and 1 means that all individuals in the swarm move in a spiral around the centroid.

Fig. 2 Swarm, polarized and milling swarm states [14]

Subsequently, Kolbjørn et al. defined the numerical characteristics of these three swarm states through experiments. When $O_p > 0.65$ and $O_r < 0.35$, the fish school is in the Polarized state, which is characterized by uniform movement of the individuals. When $O_p < 0.35$ and $O_r > 0.65$, the fish school is in the Milling state, in which the swarm presents a spiral movement trend. When $O_p < 0.35$ and $O_r < 0.35$, the fish school is in the Swarm state, which is characterized by slow movement of the individuals and a relatively dense and disordered structure. States outside these ranges are transitional states.
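For a two-dimensional swarm, both order parameters reduce to a few sums. The sketch below treats the cross product $u_i \times r_i$ as its scalar 2-D form and assumes that no individual has zero velocity or sits exactly at the centroid; the helper names `unit` and `order_parameters` are illustrative.

```python
import math


def unit(vector):
    """Scale a 2-D vector to unit length (the vector must be non-zero)."""
    norm = math.hypot(vector[0], vector[1])
    return [vector[0] / norm, vector[1] / norm]


def order_parameters(positions, velocities):
    """Return (O_p, O_r) of formulas (5) and (6) for a 2-D swarm."""
    n = len(positions)
    centroid = [sum(p[k] for p in positions) / n for k in range(2)]
    sum_u = [0.0, 0.0]
    sum_cross = 0.0
    for p, v in zip(positions, velocities):
        u = unit(v)
        r = unit([p[0] - centroid[0], p[1] - centroid[1]])
        sum_u[0] += u[0]
        sum_u[1] += u[1]
        sum_cross += u[0] * r[1] - u[1] * r[0]  # scalar 2-D cross product u x r
    return math.hypot(sum_u[0], sum_u[1]) / n, abs(sum_cross) / n


# Four individuals on a circle, all moving tangentially (a milling pattern).
positions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
velocities = [[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 0.0]]
print(order_parameters(positions, velocities))
```

Replacing the velocities with four identical vectors turns the same swarm into a polarized one, with high $O_p$ and low $O_r$.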

3 Our Work

Inspired by the study of biological collective behavior patterns, this paper introduces the concept of the velocity field into PSO, discovers collective behavior patterns through the velocity field, and constructs a quantitative method for dividing swarm states.

3.1 Visualization of Swarm Behavior Pattern

The velocity field is a vector field in physics that describes the velocity distribution at each point in a given region of space. The evolutionary sequence generated by PSO in the time period $[0, T]$ is denoted as $X_t$ $(0 \le t \le T)$:

$$X_t = \left(x_1^t, x_2^t, \ldots, x_m^t\right), \tag{7}$$

where $x_i^t \in D \subseteq \mathbb{R}^n$ denotes the position of particle $i$ in the $t$th iteration.


To discover collective behavior patterns, this paper takes the minimization of the two-dimensional Sphere function as an example, whose formula is $f(x) = \sum_{i=1}^{D} (x_i + 2)^2$. By this definition, the function has a unique global optimal solution $(-2, -2)$. To better observe the collective behavior of particles in PSO, the initial swarm is limited to the interval $[-5, -4]$. Because PSO is random and complex, this paper analyzes the results of multiple experiments together to discover the collective behavior patterns. To show the dynamic process of swarm evolution, we randomly select one of the experimental results to visualize the collective behavior patterns of PSO. As shown in Fig. 3a, b, in the initial iterations, under the guidance of gbest, the velocity directions of the particles change from disorder to order. In this iteration, the swarm has the strongest trend of orderly movement, and most particles move towards gbest, showing a significant Polarized state similar to that of a fish shoal. As shown in Fig. 3c, the Euclidean distance between gbest and the global optimal solution is quite small, namely 0.055. As shown in Fig. 3d, most particles return to the region where the global optimal solution is located because of the attraction of gbest. In Fig. 3e, most of the particles exploit the global optimal region: the range of swarm activity gradually decreases, and most of the particles search the region around gbest. To verify the response ability of the velocity field, after the 20th iteration the objective function is changed to $f(x) = \sum_{i=1}^{D} (x_i - 2)^2$, so that the global optimal solution changes from $(-2, -2)$ to $(2, 2)$. As shown in Fig. 4, we found a similar phenomenon in the subsequent iterations. These experimental results show that PSO has collective behaviors similar to those of a fish shoal and, specifically, presents the Swarm and Polarized behavior patterns.

3.2 Swarm Trend Factor of PSO

To describe the collective behavior of PSO, we design a swarm state division method in this sub-section.

Definition 1 The angle between the particle velocity $v_i$ and the reference direction $\theta_i = gbest - x_i$ is called the behavior angle of particle $i$, denoted as $\alpha_i$.

To quantify the behavior state of particles, the cosine of $\alpha_i$ can be computed as follows:

$$\cos(\alpha_i) = \frac{dot(\theta_i, v_i)}{\sqrt{dot(\theta_i, \theta_i) \, dot(v_i, v_i)}}, \tag{8}$$

where $dot(\cdot, \cdot)$ denotes the dot product. Subsequently, we define the swarm trend factor $r$ according to the proportion of particles in a certain interval. Figure 5 presents the proportion of the cosine values in each interval. It can be clearly observed that during the evolution process of the swarm, the angle of

Fig. 3 Particle swarm evolution behavior pattern of the Sphere function $f(x) = \sum_{i=1}^{D} (x_i + 2)^2$, shown as velocity vector diagrams for the 1st, 2nd, 5th, 6th and 20th iterations (the current gbest and the global optimal solution are represented by red and black asterisks, respectively)


Fig. 4 Particle swarm evolution behavior pattern of the Sphere function $f(x) = \sum_{i=1}^{D} (x_i - 2)^2$ (The current gbest and global optimal solution are represented by red and black asterisks, respectively.)


Fig. 5 Proportion of particles in each interval in the evolution process of typical test function (yellow indicates that the proportion of particles in this region is large, and blue indicates that the proportion of particles in this region is small.)


Fig. 6 Fuzzy membership functions for the swarm states

behavior is mostly in the interval $\left[-\frac{5\pi}{12}, -\frac{\pi}{6}\right] \cup \left[\frac{\pi}{6}, \frac{5\pi}{12}\right]$. Subsequently, we conducted a large number of experiments and give the following definition:
Definition 2 The ratio of the number of particles whose behavior angle lies in the interval $\left[-\frac{\pi}{3}, \frac{\pi}{3}\right]$ to the total number of particles is named the swarm trend factor, denoted as r.
Due to the randomness of PSO, we introduce membership functions from fuzzy mathematics to better define the exploration and exploitation states. We collected the samples obtained from 100 experiments for each of four functions, and determined the corresponding membership degrees according to the proportion of each swarm state, as follows and as shown in Fig. 6.
Definition 3 Exploration: small values of r correspond to the state $M_1$, whose membership function is defined as:
$$M_1(r) = \begin{cases} 0, & 0.8 \le r \le 1.0, \\ -5 \times r + 4, & 0.6 < r \le 0.8, \\ 1, & 0 < r \le 0.6. \end{cases}$$
Definition 4 Exploitation: large values of r correspond to the state $M_2$, whose membership function is defined as:
$$M_2(r) = \begin{cases} 0, & 0 \le r \le 0.6, \\ 5 \times r - 3, & 0.6 < r \le 0.8, \\ 1, & 0.8 < r \le 1.0. \end{cases}$$
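Definitions 1 to 4 can be combined into a short computation. The sketch below is illustrative only and makes two assumptions that the text does not state: a behavior angle within $[-\pi/3, \pi/3]$ is tested as $\cos(\alpha_i) \ge 0.5$, and a particle with zero velocity or sitting exactly on gbest (undefined angle) is counted as outside the interval. All function names are hypothetical.

```python
import math

COS_LIMIT = 0.5  # cos(pi/3): angles within [-pi/3, pi/3] have cosine >= 0.5


def cos_behavior_angle(x, v, gbest):
    """Cosine of the angle between velocity v and the direction gbest - x (Eq. 8)."""
    theta = [g - xi for g, xi in zip(gbest, x)]
    dot_tv = sum(t * vi for t, vi in zip(theta, v))
    norm = math.sqrt(sum(t * t for t in theta) * sum(vi * vi for vi in v))
    # Assumption: the angle is undefined for a zero vector, so return None.
    return dot_tv / norm if norm > 0.0 else None


def swarm_trend_factor(positions, velocities, gbest):
    """Fraction r of particles whose behavior angle lies in [-pi/3, pi/3]."""
    inside = 0
    for x, v in zip(positions, velocities):
        c = cos_behavior_angle(x, v, gbest)
        if c is not None and c >= COS_LIMIT:
            inside += 1
    return inside / len(positions)


def exploration_membership(r):
    """M1(r) from Definition 3: 1 for small r, falling linearly to 0 at r = 0.8."""
    if r <= 0.6:
        return 1.0
    if r <= 0.8:
        return -5.0 * r + 4.0
    return 0.0


def exploitation_membership(r):
    """M2(r) from Definition 4: 0 for small r, rising linearly to 1 at r = 0.8."""
    if r <= 0.6:
        return 0.0
    if r <= 0.8:
        return 5.0 * r - 3.0
    return 1.0
```

With these definitions the two memberships sum to 1 for every r in [0, 1], so a swarm with r between 0.6 and 0.8 is treated as partly exploring and partly exploiting.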


4 Experimental Results
In this section, we first demonstrate that the swarm trend factor r can reflect the performance of PSO, and then compare the swarm trend factor r with the evolutionary factor f. We used 5 unimodal functions and 11 multimodal functions for testing: the unimodal Sphere, NoisyQuartic, Schwefel 2.22, Schwefel 1.2 and Rosenbrock functions, and the multimodal Schwefel, Rastrigin, Noncontinuous Rastrigin, Ackley, Griewank, Penalized1, Penalized2, Dminima, Rastrigin10, Rastrigin100 and Weierstrass functions. The problem dimension, number of fitness evaluations and population size are set to 30, 300,000 and 100, respectively. Each test function is run independently 30 times.

4.1 Description of Algorithm Performance by Swarm Trend Factor
In this sub-section, to verify that the performance of PSO is related to the value of the swarm trend factor r, we conduct a series of experiments. Specifically, we first measure the r value of the original PSO, which is about 0.8. Then, we adjust the parameters of PSO, including the acceleration coefficients and inertia weight, to obtain different r values of about 0.4 and 0.6. In Table 1, we list the Wilcoxon signed rank test [17] results of PSO under different r values. The symbols +, −, ≈ indicate that the former performs significantly better (+), significantly worse (−), or not significantly differently (≈) compared to the latter. As shown in Table 1, when we change the parameters of PSO, the r value reaches different levels. When r ≈ 0.6, the performance of PSO is significantly improved, and when r ≈ 0.4, the performance decreases slightly. This supports the hypothesis that the value of the swarm trend factor r can reflect the performance of the algorithm. It also shows that the relationship between the swarm trend factor r and algorithm performance is nonlinear. Other specific influencing factors and mechanisms need further theoretical analysis and experimental verification.

Table 1 Wilcoxon signed rank test of PSO performance with different r values
PSO(r ≈ 0.8) vs. PSO(r ≈ 0.4) and PSO(r ≈ 0.6)
Symbol + 2 8 14 8 − 0 0 ≈ p-value 0.0083 0.5695


4.2 Comparison with Evolutionary Factor
In [8], the evolutionary factor f is used to describe the swarm state from the perspective of the positional relationship of particles in the solution space; it focuses on the static state of the swarm. The swarm trend factor r, in contrast, focuses on the dynamic evolution process of the swarm: it describes the state of the swarm from the distribution of the velocity field and reflects the evolution trend over the next several iterations. Because the swarm trend factor r has no Convergence or Jumping-out state, to facilitate the comparison between r and f, the Exploitation and Convergence states in [8] are merged into an exploitation state, and the Exploration and Jumping-out states are merged into an exploration state. We take f = 0.5 and r = 0.7 as the thresholds for dividing the exploration and exploitation states, respectively. That is, when f ≤ 0.5, the swarm is in exploration state. When f ≥ 0.5, the swarm is in exploitation state. To compare the swarm trend factor r and the evolutionary factor f, we present the results on 8 functions in Fig. 7. In Fig. 7, the circle represents r and the square represents f. When r and f both judge that an iteration is in the exploitation state, we mark the circle and square in red, and mark the corresponding iteration with a red dotted line. When r and f both identify an iteration as being in the exploration state, we mark the circle and square in blue, and mark the corresponding iteration with a blue dotted line. In other cases, we mark the circle and square in black. The black broken line in the figure represents the change of the fitness function value, and the green broken line represents the change of the radius of the swarm activity range. As shown in Fig. 7, r and f agree well on Sphere, Rosenbrock, Ackley, Griewank and Weierstrass. On Ackley in particular, after 50 iterations both the swarm trend factor r and the evolutionary factor f judge that the swarm is in the exploration state for a long time. r and f differ greatly in judging the swarm state on Schwefel and Rastrigin. On Schwefel in particular, the value of the swarm trend factor r is much lower than 0.7, and the value of f is also lower than 0.1 for a long time, which is quite different from the other test functions. Finally, we draw the following conclusions: the behavior of both the swarm trend factor r and the evolutionary factor f depends on the test function, and the two factors have similar definitions of the exploitation state.

5 Conclusions
In this paper, we propose a new method to define the exploration and exploitation states. In the experiments, we demonstrate that the swarm trend factor can reflect the performance of PSO. Finally, we compare the swarm trend factor with the evolutionary factor, and show the similarities and differences between the two methods of dividing the population state.


Fig. 7 Comparison of swarm trend factor r and evolutionary factor f


Acknowledgements This work is supported by the National Natural Science Foundation of China (61876069, 61972174 and 61972175), the Jilin Natural Science Foundation (20200201163JC), the key scientific research platforms and scientific research projects of Guangdong Provincial Department of Education (2020KTSCX192), Guangdong Science and Technology Planning (2020A0505100018), Universities’ Innovation Team (2021KCXTD015) and Key Disciplines (2021ZDJS138) Projects.

References
1. I.D. Couzin, Collective cognition in animal groups. Trends Cognitive Sci. 13(1), 36–43 (2009)
2. T.S. Deisboeck, I.D. Couzin, Collective behavior in cancer cell populations. Bioessays 31(2), 190–197 (2009)
3. R. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science (IEEE, 1995), pp. 39–43
4. M. Dorigo, V. Maniezzo, A. Colorni, Ant system: optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 26(1), 29–41 (1996)
5. H.S. Hosseini, Problem solving by intelligent water drops, in 2007 IEEE Congress on Evolutionary Computation (IEEE, 2007), pp. 3226–3231
6. S. Mirjalili, S.M. Mirjalili, A. Lewis, Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014)
7. S. Mirjalili, A.H. Gandomi, S.Z. Mirjalili, S. Saremi, H. Faris, S.M. Mirjalili, Salp swarm algorithm: a bio-inspired optimizer for engineering design problems. Adv. Eng. Softw. 114, 163–191 (2017)
8. Z.-H. Zhan, J. Zhang, Y. Li, H.S.-H. Chung, Adaptive particle swarm optimization. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 39(6), 1362–1381 (2009)
9. O. Aoun, M. Sarhani, A. El Afia, Hidden Markov model classifier for the adaptive particle swarm optimization, in Recent Developments in Metaheuristics (Springer, 2018), pp. 1–15
10. J. Lorenz, H. Rauhut, F. Schweitzer, D. Helbing, How social influence can undermine the wisdom of crowd effect. Proc. Nat. Acad. Sci. 108(22), 9020–9025 (2011)
11. L. Conradt, T.J. Roper, Deciding group movements: where and when to go. Behav. Process. 84(3), 675–677 (2010)
12. J.W. Jolles, N.J. Boogert, V.H. Sridhar, I.D. Couzin, A. Manica, Consistent individual differences drive collective behavior and group functioning of schooling fish. Current Biol. 27(18), 2862–2868 (2017)
13. A.J. Ward, D.J. Sumpter, I.D. Couzin, P.J. Hart, J. Krause, Quorum decision-making facilitates information transfer in fish shoals. Proc. Nat. Acad. Sci. 105(19), 6948–6953 (2008)
14. K. Tunstrøm, Y. Katz, C.C. Ioannou, C. Huepe, M.J. Lutz, I.D. Couzin, Collective states, multistability and transitional behavior in schooling fish. PLoS Computat. Biol. 9(2), e1002915 (2013)
15. A. Kolpas, J. Moehlis, I.G. Kevrekidis, Coarse-grained analysis of stochasticity-induced switching between collective motion states. Proc. Nat. Acad. Sci. 104(14), 5931–5935 (2007)
16. I.D. Couzin, J. Krause, R. James, G.D. Ruxton, N.R. Franks, Collective memory and spatial sorting in animal groups. J. Theoret. Biol. 218(1), 1–11 (2002)
17. J. Derrac, S. Garcia, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evolut. Comput. 1(1), 3–18 (2011)

Research on Box Office Prediction of Commercial Films Based on Internet Search Index and Multilayer Perceptron
Te Guo, Chiawei Chu, Junhan Gao, Mengyao Wang, and Wei Lu

Abstract The film industry has grown rapidly in recent years, and the box office is one of the important indicators of a movie's success. Compared with traditional film marketing, box office prediction based on search trends can make marketing more targeted and confident. The innovation of this paper is that search trends, namely Baidu Index and Google Trends, are used as an independent variable in linear regression, the moving average method and the multilayer perceptron (MLP) for box office prediction. The research data are collected from IMDb for the North American film market, and from Endata.com and Maoyan.com for the mainland China film market. Box office prediction performance is compared across the forecasting methods, with and without search trends as the independent variable. The MLP model based on the Baidu and Google indexes has the best prediction performance on the movie box office, with an accuracy of 81.11%. The prediction error is evaluated using the mean absolute percentage error (MAPE): the result is 18.89% for MLP with search trends, an improvement of 9.53 percentage points over MLP without search trends. Finally, using Internet search trends as an independent variable offers consistently better performance. Hence, film companies can pay more attention to Internet marketing to optimize their marketing strategy.
Keywords Box office · Multilayer perceptron · Google trends · Baidu index

T. Guo · J. Gao · M. Wang · W. Lu (corresponding author)
School of Aliyun Big Data Applications, Zhuhai College of Science & Technology, Zhuhai, China
C. Chu
Faculty of Data Science, City University of Macau, Macao, China
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_5


1 Introduction
As an essential part of cultural life, movies enrich people's spare time and inner world. Their visual nature makes them popular. Film is now not only an essential way to spread the local customs and cultural consciousness of a country or region, but also a vital medium for promoting cultural exchange and expanding cultural and financial markets in the course of social development [1]. In addition to this cultural attribute, another essential element of film is its commodity attribute. With the continuous spread of films all over the world, the economy of the movie market is developing rapidly. The movie industry has large business potential and added value [2]. The revenue sources of movie enterprises are diversified. In addition to the box office, the primary source, they include the sale of broadcasting rights, advertising, the development of derivative products, etc. [3]. The movie industry carries huge economic value, which in-depth research on the economics behind the industry can help to increase [4]. Therefore, it is of great value to predict the film box office and to further explore marketing strategies in accordance with the results. At present, research on movie box office prediction largely focuses on capturing the sentiment of the target audience via search engines to obtain primary data, and then performs model analysis on this basis [5–7]. Focusing on commercial movie box office, this paper compares the effects of different box office forecasting methods to determine the one with the least error. Predicting the box office using the method chosen in this research can provide movie distributors with good strategic suggestions, allowing them to allocate resources well. It can also help investors and production companies arrange capital, personnel and venues reasonably, so as to avoid losses from unreasonable investment, poor distribution of capital and waste of resources. Further, it will help with making reasonable promotion strategies to increase ticket sales (Fig. 1).

2 Data Collection and Forecasting Methods
2.1 Baidu Index and Google Trends
Baidu Index gives a vivid picture of the changing trend of user attention to, and media exposure of, keywords over the past 30 days. It is a free data analysis service based on Baidu news and web search, which can reflect the “media attention” and “user attention” of different keywords over a past period of time. In China, Baidu Index is an important data analysis platform with a huge amount of netizen behavior data, and it has an irreplaceable position as a statistical analysis platform on the Internet. The Baidu Index mainly reflects users' attention to a keyword on the Internet and the change in that attention over time. Based on


Fig. 1 Research process

the Baidu search volume data of netizens, it takes the keywords searched in Baidu web search as its object and analyzes them statistically. Therefore, the Baidu Index is a reflection of absolute search volume. Baidu Index can be used to explore and share valuable information and data on the Internet, and it gives an objective and direct reflection of the needs, interests and social hotspots of netizens. In academic research, the Baidu Index is usually added to a time series forecasting model to form a new model, and the prediction performance of the new and old models is compared. The research results show that the predictions of the model with the Baidu Index are more accurate, which shows that network search data plays an important role in prediction. Google Trends is a service developed by Google that analyzes keywords that users have searched for in Google and shows the attention each keyword receives. It also lets users download the Search Volume Index (SVI) of keywords. The service can be queried by keyword, within a user-defined time range, or by country. In academic research, a Google keyword index has been added to a general SARIMA model to form a new model, whose predictions were compared with those of the general SARIMA model. The results show that the SARIMA model with Google keywords predicts better than the general SARIMA model. With the increasing popularity of the Internet and the development of information technology, many studies have analyzed and predicted various socio-economic behaviors with the help of Baidu Index or Google Trends. Research shows that online search information can partially reflect the


characteristics of socio-economic behavior, and that search engines such as Baidu publish the accumulated network search information after processing it. In recent years, thanks to search engine data such as Baidu Index and Google Trends, online search data has been increasingly used in various socio-economic forecasting fields, such as unemployment rates, public health, tourism demand, epidemic detection, movie box office and finance. In tourism, for example, network search data reflects the subjective intentions of tourists; it is sensitive to tourist behavior, easy to access and close to real time, so it is an effective representation of tourists' potential demand and contains important information closely related to tourism demand. Whether for predicting socio-economic behavior at the macro level or at the micro level, such data has been applied effectively and has been shown to significantly improve the prediction accuracy of models.

2.2 Data Collection
For the accuracy of the prediction results, the top 100 films for the Chinese film market were picked from Endata.com and Maoyan.com, two websites that provide ranking and box office information. Another top 100 films for the North American film market were selected from the Internet Movie Database (IMDb), which contains detailed information on all released movies. In this research, action, comedy and drama films were picked from the top 100 films in each film market, because these three genres account for 80% of the top 100 films. Moreover, 10 films of each genre in each film market were selected as the target samples for box office prediction. At the same time, the online popularity of each of these 60 films was collected from the Baidu and Google search engines, and search indexes were used to calculate search trends [8]. The subjects are audiences in North America and mainland China. People in North America commonly use Google as a search tool, whereas in the Chinese movie market Baidu is the primary search tool. Therefore, the Internet search information used in this research comes from Google Trends and the official release platform of Baidu Index. The literature offers varied ways of selecting keywords for web search trends. In this study, considering users' habit of searching for a specific movie, only the title of the movie was used as the keyword. The titles of these 60 films were used as keywords in Google Trends and Baidu Index; the search index for each keyword represents the search volume over time, and the time period runs from 4 weeks before to the first week after the film's release. This time period is mainly based on the discussion in the literature.


2.3 Forecasting Methods
There are many methods for forecasting movie box office. After a comprehensive assessment of forecasting methods, this paper selected linear regression, the moving average method and the multilayer perceptron (MLP) method, which are the most commonly used in academic research [9–12].

2.3.1 Linear Regression
Regression models can be divided into linear and nonlinear regression models according to the form of the model, and both are commonly used. However, univariate linear regression is typically used in research and analysis to facilitate data processing. Therefore, it is employed for the data analysis in this research. The form of univariate linear regression is:
$$Y = aX + b \tag{1}$$
where Y is the dependent variable, X is the independent variable, a is the regression coefficient (slope), and b is the intercept. The model parameters are estimated by the least squares method.
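The least squares estimates for Eq. (1) have a closed form, which the following sketch computes with the standard library only. The function names are hypothetical and the data in the usage lines are made up for illustration; they are not values from the study.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for Y = a*X + b."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # regression coefficient (slope)
    b = mean_y - a * mean_x    # intercept
    return a, b


def predict(a, b, x):
    """Predicted Y for a new X under the fitted line."""
    return a * x + b


if __name__ == "__main__":
    # Illustrative numbers only: X could be a daily search index, Y a daily box office.
    a, b = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
    print(a, b, predict(a, b, 5.0))
```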

2.3.2 Moving Average Method
The moving average method is a time series method. It calculates the average of a selected range of data to identify a long-term trend in these data [13]. The simple moving average method was used in this research. It takes the N most recent observed values and uses their arithmetic average as the forecast for the subsequent period. The formula is as follows:
$$M_t^{(1)} = \frac{x_t + x_{t-1} + \cdots + x_{t-N+1}}{N} \tag{2}$$
where N is the span, that is, the number of observations covered by each average, and $M_t^{(1)}$ is the forecast for period t + 1.
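Equation (2) can be sketched in a few lines. The function names are hypothetical and the series in the usage lines is invented for illustration.

```python
def moving_average_forecast(series, span):
    """Forecast for the next period: mean of the last `span` observations (Eq. 2)."""
    if span <= 0 or span > len(series):
        raise ValueError("span must be between 1 and len(series)")
    return sum(series[-span:]) / span


def rolling_forecasts(series, span):
    """One forecast per window: element k is the forecast made after period span - 1 + k."""
    if span <= 0 or span > len(series):
        raise ValueError("span must be between 1 and len(series)")
    return [
        sum(series[t - span + 1 : t + 1]) / span
        for t in range(span - 1, len(series))
    ]


if __name__ == "__main__":
    daily = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up daily values
    print(moving_average_forecast(daily, span=3))
    print(rolling_forecasts(daily, span=3))
```

A larger span smooths more but reacts more slowly to changes, which is why the span is usually chosen by comparing forecasts against held-out actual values.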

2.3.3 Multi-Layer Perceptron Method
The Multi-Layer Perceptron (MLP) is a feedforward artificial neural network model composed of multiple layers of nodes in a directed graph. Each layer is fully connected to the next layer. The nodes of the hidden layer and the output layer are referred to as neurons; each neuron is a processing unit. The MLP model used here consists of an input layer, an output layer and a single hidden layer. MLP can use a nonlinear activation function in its hidden layer to solve complicated nonlinear problems. In regression, the sum of squared errors over the entire training set is:
$$E(W, v \mid X) = \sum_{t=1}^{T} \left(r^t - y^t\right)^2 \tag{3}$$
where W and v are the sets of weights of the first layer and the second layer, respectively, T is the number of samples in the training set, $r^t$ is the actual value of sample t, and $y^t$ is the predicted output of the network, calculated as:
$$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0 \tag{4}$$
where H is the number of hidden nodes, $v_h$ is the weight between hidden node h and the output, $z_h^t$ is the value of hidden node h for sample t, and $v_0$ is the bias term. In a two-class classification problem, the output $y^t$ is calculated through a sigmoid function.
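Equations (3) and (4) describe the forward pass and training error of a single-hidden-layer regression network. The sketch below evaluates both for given weights; it does not train the network. It assumes sigmoid hidden units, which the text does not specify, and the weight layout (bias stored first in each row) and all names are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic function, used here as the hidden-layer activation (an assumption)."""
    return 1.0 / (1.0 + math.exp(-a))


def mlp_output(x, W, v):
    """Output y of Eq. (4) for one input vector x.

    W[h] = [w_h0, w_h1, ..., w_hd] holds the bias and input weights of hidden node h.
    v    = [v_0, v_1, ..., v_H]    holds the output bias and hidden-to-output weights.
    """
    z = [
        sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
        for w in W
    ]
    return v[0] + sum(vh * zh for vh, zh in zip(v[1:], z))


def sum_squared_error(X, r, W, v):
    """Training error E(W, v | X) of Eq. (3): sum over samples of (r^t - y^t)^2."""
    return sum((rt - mlp_output(xt, W, v)) ** 2 for xt, rt in zip(X, r))


if __name__ == "__main__":
    # With all first-layer weights at zero every hidden value is sigmoid(0) = 0.5.
    W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    v = [1.0, 2.0, 4.0]
    print(mlp_output([3.0, 7.0], W, v))
```

Training would adjust W and v to reduce this error, typically by gradient descent with backpropagation; that step is omitted here.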

2.3.4 Accuracy Evaluation of Forecasting Methods
The mean absolute percentage error (MAPE) can be used to measure the accuracy of a prediction. It expresses the deviation of the predictions from the actual data as a percentage [13]. The formula is as follows:
$$\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{\text{Actual value}_t - \text{Predicted value}_t}{\text{Actual value}_t} \right| \times 100\% \tag{5}$$
MAPE is used here to measure accuracy and obtain comparable evaluation results. To ensure the accuracy of the results, after reviewing the relevant literature, specifically work associated with MAPE [14, 15], this research adopts reference criteria for prediction accuracy based on the percentage deviation. The criteria are as follows (Table 1).


Table 1 Reference criteria for MAPE

| MAPE value | Evaluation |
| --- | --- |
| MAPE < 10% | Excellent |
| 10% < MAPE < 20% | Good |
| 20% < MAPE < 50% | Acceptable |
| 50% < MAPE | Incorrect |
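Equation (5) and the criteria of Table 1 can be combined as in the sketch below. Table 1 uses strict inequalities and so leaves the values 10%, 20% and 50% unassigned; placing each boundary in the higher-error category is an assumption made here, and the function names are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent (Eq. 5). Actual values must be nonzero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def grade(mape_percent):
    """Evaluation label from Table 1 (boundary handling is an assumption)."""
    if mape_percent < 10.0:
        return "Excellent"
    if mape_percent < 20.0:
        return "Good"
    if mape_percent < 50.0:
        return "Acceptable"
    return "Incorrect"


if __name__ == "__main__":
    # Made-up values for illustration.
    error = mape([100.0, 200.0], [90.0, 220.0])
    print(error, grade(error))
```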

3 Results and Findings
3.1 Results
We first take the first-week box office of a total of 60 films in three categories in the mainland rankings as the independent variable and the second-week box office as the dependent variable, obtain predicted values through the three prediction methods, and then compare the predicted values with the actual second-week box office. The absolute percentage error was calculated for each film, and the average over the 60 films was obtained. The specific results are as follows (Table 2).

Table 2 Prediction error based on box office

| | MA error | Regression error | MLP error |
| --- | --- | --- | --- |
| MAPE | 29.68% | 28.42% | 27.69% |

Three correlation coefficients are generally used in testing: the Pearson correlation coefficient [16], the Spearman correlation coefficient [17] and the Kendall correlation coefficient [18]. Among them, Spearman and Kendall are rank correlation coefficients, statistics that reflect the degree of correlation between ranks. Pearson is a statistic that reflects the degree of linear correlation between two variables, and it can be used in machine learning to measure the relationship between features and categories, that is, to determine whether the extracted features and categories are positively correlated, negatively correlated, or uncorrelated. To decrease the error, this research first used the Pearson correlation test to investigate the correlation between the Baidu and Google search indexes and the movie box office, so as to show that there is a close correlation between the online search index during a certain week before movies are released and the actual movie box office. The calculation formula is as follows:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left((X - \mu_X)(Y - \mu_Y)\right)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}} \tag{6}$$
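The last form of Eq. (6) translates directly into code when each expectation is replaced by a sample mean. This is a minimal sketch with a hypothetical function name; the usage data are invented and are not values from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation using the E(XY) - E(X)E(Y) form of Eq. (6)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    ex = sum(xs) / n
    ey = sum(ys) / n
    exy = sum(x * y for x, y in zip(xs, ys)) / n
    ex2 = sum(x * x for x in xs) / n
    ey2 = sum(y * y for y in ys) / n
    var_x = ex2 - ex * ex
    var_y = ey2 - ey * ey
    if var_x <= 0.0 or var_y <= 0.0:
        raise ValueError("correlation is undefined when a variable is constant")
    return (exy - ex * ey) / (math.sqrt(var_x) * math.sqrt(var_y))


if __name__ == "__main__":
    search_index = [1.0, 2.0, 3.0, 4.0]   # made-up values
    box_office = [2.0, 4.0, 6.0, 8.0]     # made-up values
    print(pearson(search_index, box_office))
```

This form is convenient but can lose precision when the means are large relative to the spread; the centered form with $(X - \mu_X)(Y - \mu_Y)$ is numerically safer for such data.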


The greater the absolute value of the Pearson correlation coefficient, the stronger the correlation: the closer the coefficient is to 1 or −1, the stronger the correlation, and the closer it is to 0, the weaker the correlation. The correlation test shows that, for the 60 films of three types in the Chinese box office list, the first-week search index is correlated with the box office in the second week. After establishing this association, this research made a specific prediction analysis of the box office of the 60 movies using three methods: linear regression, the moving average method and the MLP method.

In the linear regression, the daily search index in the seven days after release is the independent variable, and the daily box office in the seven days after release is the dependent variable. From the fitted regression equation, the expected daily box office in the second week of release was calculated and compared with the actual value to obtain the mean absolute percentage error (MAPE).

In the moving average method, the daily box office in the first week of release of each of the 60 movies was taken as the time series to be analyzed. After comparing the forecasts with the actual daily box office for Ma = 2, 3, 4, 5, 6 and 7, Ma = 3 was found to perform best. Therefore, Ma = 3 was chosen to process the collected data and obtain the predicted box office from the 4th day to the 14th day after release. The absolute percentage error between the predicted values and the actual box office in the second week after release was then calculated for each movie, and the moving average error of the 60 movies was obtained by averaging these errors.

The MLP model, composed of an input layer, an output layer and a single hidden layer, was used for calculation. 30 action, comedy and drama films were chosen from the top 100 films at each of the mainland China and North America box offices as the samples t. The daily search index is taken as the input variable, and the daily box office from the first day to the last day of the first week of release (7 days in total) is taken as the actual value $r^t$. Weka software was used, with the daily box office of the second week of release (days 8–14) as the predicted value $y^t$. The model's built-in (black-box) algorithm was used for automatic calculation. The mean absolute percentage error (MAPE) is calculated from the predicted and actual daily box office in the second week of release (days 8–14) (Fig. 2).

The predictions of linear regression, the moving average method and the MLP method were compared with the actual movie box office to calculate the error values. The overall errors are summarized as follows (Tables 3 and 4).


Fig. 2 MLP model calculation

Table 3 Prediction error based on Baidu Index

| Title | Regression error (%) | MLP error (%) |
| --- | --- | --- |
| Captain America: Civil War | 17.00 | 11.27 |
| The Lost World: Jurassic Park | 35.00 | 11.28 |
| The Mermaid | 37.57 | 10.69 |
| Lost In Thailand | 10.45 | 10.06 |
| Breakup Buddies | 65.50 | 11.08 |
| The Island | 17.50 | 13.15 |
| Duckweed | 11.57 | 13.43 |
| Mr. Six | 11.57 | 13.43 |
| The Wandering Earth | 22.59 | 10.61 |
| Looking Up | 18.42 | 15.66 |

3.2 Findings
By processing and analyzing the relevant data of 60 movies, this study compares the strengths and weaknesses of three methods, namely the moving average method, linear regression and the MLP, in predicting the box office of the 60 movies. We also compared the average prediction errors over the 60 films with and without the search index. Because the moving average method cannot incorporate the index, only the linear regression and MLP methods were compared after adding the index (Table 5). The results show that adding the search index increases the accuracy of the predictions. The three methods, the moving average method, linear regression and the MLP, are compared side by side.

Table 4 Prediction error based on Google Index

| Title | Regression error (%) | MLP error (%) |
| --- | --- | --- |
| Carcharocles megalodon | 14.79 | 19.71 |
| Wonder Woman | 11.00 | 19.81 |
| Justice League | 33.79 | 18.11 |
| Jurassic World | 11.07 | 18.71 |
| Finding Dory | 11.43 | 17.07 |
| Sing | 12.42 | 21.16 |
| Bad Boys for Life | 48.79 | 10.75 |
| The Lion King | 17.64 | 18.06 |
| Downton Abbey | 14.36 | 17.38 |
| Little Women | 14.04 | 20.29 |

Table 5 Prediction error comparative results

| | Regression error (%) | Moving average error (%) | MLP error (%) |
| --- | --- | --- | --- |
| Without index | 29.68 | 27.69 | 28.42 |
| With index | 22.21 | / | 18.89 |

Adding the search index when predicting the second-week box office improves the prediction accuracy of both the linear regression and MLP methods. The prediction error (MAPE) of linear regression fell from 29.68% to 22.21%, and that of the MLP method fell from 28.42% to 18.89%, a substantial improvement in prediction accuracy. In the MLP prediction results, except for a single data point that deviates seriously, the predictions are relatively accurate. Although linear regression performs well with the Google index, it is not stable enough, and the errors at the two extremes differ widely. Therefore, on a comprehensive evaluation, the MLP is better than the other two prediction methods and has the highest prediction accuracy.
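The headline figures can be checked with simple arithmetic on the MAPE values reported in Table 5. In the sketch below, "accuracy" is taken to be 100% minus MAPE; this reading is inferred from the abstract, where 81.11% and 18.89% sum to 100%, and is not stated as a definition in the text. The function name is hypothetical.

```python
def summarize(mape_without_index, mape_with_index):
    """Return (error reduction in percentage points, accuracy with index in percent).

    Accuracy is computed as 100 - MAPE, an inferred convention.
    """
    reduction = round(mape_without_index - mape_with_index, 2)
    accuracy = round(100.0 - mape_with_index, 2)
    return reduction, accuracy


if __name__ == "__main__":
    # MAPE values (%) as reported in Table 5.
    reported = {
        "linear regression": (29.68, 22.21),
        "MLP": (28.42, 18.89),
    }
    for method, (without_index, with_index) in reported.items():
        reduction, accuracy = summarize(without_index, with_index)
        print(f"{method}: error down {reduction} points, accuracy {accuracy}%")
```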

4 Conclusion

After comparing the prediction performance of linear regression, the moving average method and the MLP, we found that the MLP has the best accuracy. Based on the MLP predictions, we suggest the following promotion and advertising strategies. First, distributors and promotion teams should expand the promotion of comedy films during the first week of release, with particular emphasis on online promotion. Creating hot topics, for example, increases public exposure and the online search index in the first week, which in turn improves ticket sales in the second week. Second, the advertising, marketing and promotion of Chinese films can refer comprehensively to the predictions of the MLP model. The second-week marketing strategy for comedy and drama films can refer to the predictions of the MLP model incorporating the online search index, whereas for action movies it is more appropriate to refer to linear regression incorporating the index. These models can be used to optimize the marketing strategy. Finally, for cinemas in North America, the marketing strategy for action movies can refer to the results of the moving average method, while comedy and drama films should rely primarily on the MLP incorporating the Google index. The second-week marketing strategy can then be adjusted according to the relevant predictions.


A DCRC Model for Text Classification

Zhaoquan Hao, Jiangyong Jin, Shengbin Liang, Suying Cheng, and Yanqing Shen

Abstract Traditional text classification models have drawbacks, such as the inability to focus on the important parts of a text's contextual information. To address this problem, we fuse a bidirectional gated recurrent unit network (BiGRU) with a convolutional neural network: the combined network receives the text sequence, reduces the dimensionality of the input sequence, and limits the loss of text features caused by the length and context dependency of the input. To extract the important features of the text, we use a bidirectional long short-term memory network (BiLSTM) to capture the main features and thus further reduce feature loss. Finally, we propose a BiGRU-CNN-BiLSTM model (DCRC model) based on CNN, GRU and LSTM, which is trained and validated on the THUCNews and Toutiao News datasets. In experimental comparison, the model outperformed the traditional models in terms of accuracy, recall and F1 score.

Keywords CNN · BiGRU · BiLSTM · Text classification

Z. Hao · J. Jin · Y. Shen (B)
School of Software, Henan University, Zhengzhou, China

S. Liang
Institute for Data Engineering and Science, University of Saint Joseph, Macau, China

S. Cheng
3rd Branch, China Petroleum Pipeline Engineering Co. Ltd., Zhengzhou, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_6


1 Introduction

In recent years, with the development of the Internet industry, the Internet has entered the era of big data. With the emergence of various social media and news services, data resources are growing constantly and the volume of text information is expanding massively. Text classification has emerged as a tool for managing, organising and applying large amounts of Internet information, but traditional feature selection methods have many shortcomings, such as insufficient feature extraction and feature loss during extraction. As a result, text classification models that incorporate deep learning are emerging.

For text classification based on deep learning, Kalchbrenner et al. [1] proposed the Dynamic Convolutional Neural Network (DCNN), which uses wide convolution and k-max pooling to construct a parse-tree-like structure that can extract information over long distances. Liu et al. [2] proposed an adversarial multi-task classification framework, MT-LSTM, which effectively reduces the contamination of extracted shared features by noise from other tasks and improves the accuracy of text classification. Conneau et al. [3] proposed an architecture for text processing (VD-CNN) that operates directly at the character level and uses only small convolutions and pooling operations. The performance of this model increases with depth, and it has been applied to text processing with good results. Leontjeva et al. [4] took long short-term memory networks (LSTM) and hidden Markov models (HMM) as examples, combining extracted information features with static features, and successfully improved the classification performance of the model. Liu et al. [5] proposed a character-level text classification model that combines bidirectional gated recurrent units (BiGRU) and CNN to extract the global and local semantics of text. Cheng et al. [6] applied the attention mechanism to the extraction of emotion features from text, based on multi-channel CNN and BiGRU, to accomplish text emotion classification. Schuster et al. [7] extended the ordinary recurrent neural network (RNN) to a bidirectional recurrent neural network (BRNN), which alleviated the model's limitation on input information and obtained better results than similar models in regression and classification experiments.

The above studies have made progress in resolving the dependency of contextual feature sequences and improving the accuracy of text classification. However, in terms of text feature extraction, problems remain, such as inaccurate contextual feature extraction and insufficient local feature extraction ability. We therefore propose a text classification model, BiGRU-CNN-BiLSTM (DCRC model), that combines RNN and CNN and draws on the advantages of CNN, GRU and LSTM. The main work of this paper is as follows.

• A text classification model is proposed using the BiGRU-CNN model. The contextual features of the input text are first extracted by BiGRU, which overcomes the limitation that a one-way GRU can only capture information up to the current moment. The precise features of the text are then extracted by convolution operations. The BiLSTM model is also used to pass the vector of the input text to each LSTM, adding the result of the previous moment's operation to the current operation in order to extract contextual relations and mitigate the vanishing and exploding gradient problems.
• The two models are stitched together in parallel to minimise the loss of text features during model training and to guarantee the accuracy of text classification, with significant improvements in accuracy, recall and F1 score.

The rest of the paper is structured as follows: Sect. 2 presents related work and background; Sect. 3 first defines the problem and then describes the proposed DCRC model; Sect. 4 presents the experiments and analyses the results; and Sect. 5 summarises the model and outlines future work.

2 Related Work

CNN and RNN are two classical deep neural network models that are widely used in image processing and natural language processing, and many improved models based on them have emerged. Among them, the LSTM model, an RNN-based network proposed by Hochreiter and Schmidhuber [8] in 1997, has advantages in sequence modelling problems: it has both long-term and short-term memory and is relatively simple to implement. Research on CNN and RNN models is still ongoing, and many related models have been derived.

2.1 CNN Model

In the 1960s, Hubel and Wiesel [9], while studying the local sensitivity and direction selectivity of neurons in the cat's cerebral cortex, found that its unique network structure could effectively reduce the complexity of feedback neural networks, which later led to the proposal of convolutional neural networks. When an input text sequence is processed, a sentence consisting of multiple words is represented as a feature matrix that serves as the input to the CNN model. The network structure is shown in Fig. 1. Kim [10] applied CNN models to text classification and proposed the TextCNN model, which performed well on text classification problems. However, it is not very interpretable and takes little account of contextual information, which makes it difficult to tune according to the training results. In contrast, Wang et al. [11] proposed a dense CNN structure with multi-size feature attention that generates n-gram features and uses them for classification, with good results. In addition, Bengong and Mengdi [12] used CNN and BiGRU for text semantic extraction and applied the attention mechanism at the word level, which similarly improved text classification.


Fig. 1 Application of CNN in NLP

2.2 BiLSTM Model

The LSTM was proposed by Hochreiter and Schmidhuber [8] in 1997 to improve the traditional recurrent neural network model. Like other RNNs, it learns from sequential data through a chain of repeating neural network modules. The BiLSTM model is based on the LSTM and combines information from the input sequence in both the forward and backward directions. Its structure is shown in Fig. 2, where $x_0$, $x_1$ and $x_2$ represent the input at moments 0, 1 and 2 respectively, $h_*$ and $k_*$ represent the outputs of the two different hidden states at a given moment, and the cat operation refers to the splicing (concatenation) of vectors:

$$H_{t-1} = [h_{t-1}, k_{t-1}] \tag{1}$$

$H_{t-1}$ represents the spliced output of the hidden states of the two LSTM layers at moment $t-1$, and $h_{t-1}$ and $k_{t-1}$ are the outputs of the two hidden states at moment $t-1$.

Fig. 2 Structure of BiLSTM

Lai et al. [13] proposed an approach to classifying text using recurrent convolutional neural networks, a model that combines an RNN with a maximum pooling layer and thus draws on the advantages of both RNN and CNN to achieve better results in text classification. Zhou et al. [14] combined the advantages of CNN and LSTM, taking the semantic information of the context into account, and chose to stack the LSTM neural network on the CNN. Huang et al. [15] chose a bidirectional LSTM and combined it with CNN to propose a short text classification model based on BiLSTM-CNN, which effectively improves the accuracy of acquiring important information in text. Liu et al. [16] combined BiLSTM with CNN and the attention mechanism for text classification and also achieved good results. Liu et al. [17] proposed the MT-LSTM model (Multi-Timescale Long Short-Term Memory) to capture valuable information at different time scales.
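To make the splicing in Eq. (1) concrete, the following minimal sketch runs a recurrent cell over a sequence in both directions and concatenates the two hidden states at every time step. The cell `simple_cell` and its weights `w_x` and `w_h` are hypothetical stand-ins chosen for brevity, not an LSTM cell; only the bidirectional wiring and the concatenation are the point here.

```python
import math


def simple_cell(x, h_prev, w_x=0.5, w_h=0.3):
    """Toy scalar recurrent cell (an assumption standing in for an LSTM cell)."""
    return math.tanh(w_x * x + w_h * h_prev)


def run_direction(inputs, reverse=False):
    """Run the cell over the sequence; return one state per time step,
    stored at the index of the time step it belongs to."""
    order = reversed(range(len(inputs))) if reverse else range(len(inputs))
    states = [0.0] * len(inputs)
    h = 0.0
    for t in order:
        h = simple_cell(inputs[t], h)
        states[t] = h
    return states


def bidirectional(inputs):
    """Concatenate forward state h_t and backward state k_t: H_t = [h_t, k_t]."""
    forward = run_direction(inputs)
    backward = run_direction(inputs, reverse=True)
    return [[h, k] for h, k in zip(forward, backward)]


x = [1.0, -0.5, 2.0]
H = bidirectional(x)
for t, H_t in enumerate(H):
    print(t, H_t)
```

Each `H[t]` has twice the dimensionality of a single direction: its first component has only seen inputs up to moment t, while its second component has only seen inputs from moment t onwards.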

2.3 BiGRU Model

The GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN). Cho et al. [18] proposed it in 2014; it simplifies the LSTM structure and reduces the number of training parameters, while its accuracy remains comparable to that of the LSTM. A single GRU unit consists of an update gate and a reset gate, giving it a simpler structure and fewer model parameters than the LSTM. The BiGRU consists of two GRU units with opposite orientations; its structure is shown in Fig. 3. The bidirectional GRU uses two parallel channels: one GRU models the text semantically from the beginning to the end of the sentence, and the other represents the text from the end to the beginning, so that the preceding and following contexts are considered simultaneously. The specific representation is as follows.

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}(w_t), \quad t \in [1, n] \tag{2}$$

$$\overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}(w_t), \quad t \in [1, n] \tag{3}$$

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}], \quad t \in [1, n] \tag{4}$$

where $\overrightarrow{\mathrm{GRU}}$ denotes the output state of the forward GRU at moment $t$, $\overleftarrow{\mathrm{GRU}}$ denotes the output state of the reverse GRU at moment $t$, $w_t$ denotes the corresponding weight at moment $t$, $\overrightarrow{h_t}$ denotes the state of the forward hidden layer at moment $t$, and $\overleftarrow{h_t}$ denotes the state of the reverse hidden layer at moment $t$. In this way, the bidirectional GRU can learn from both the preceding and the following context.

Fig. 3 Structure of BiGRU

Joulin et al. [19] implemented a fast text classifier, fastText, which is comparable to deep learning classifiers in terms of accuracy and many orders of magnitude faster in training and evaluation. Zhao et al. [20] proposed AD-CharCGNN, based on CharCNN (Character Convolutional Neural Network) and GRU (Gated Recurrent Unit), which combines the temporal and spatial domains to classify text. Sachin et al. [21] used the LSTM, GRU and BiLSTM models for feature extraction and analysis of users' online comments to achieve sentiment classification of text. Yan et al. [22] combined the respective advantages of CNN and BiGRU in extracting textual information, introduced the attention mechanism, and proposed a multi-channel CNN-BiGRU model (MC-AttCNN-AttBiGRU) based on the attention mechanism for extracting textual sentiment feature information.

The regression method used in our work is in fact a variation of hedonic regression, except that we did not consider external factors in our modeling (the data set does not include such information). We did, however, consider different combinations of first-order and second-order attributes in the regression model. The attributes are given in Table 1, where Value is the dependent variable to be predicted, and the others are predictors including 11 first-order and 4 second-order variables. The given data set contains 81 homes.

Table 1 Overview of datasets

Dataset        Number of training sets   Number of test sets   Categories
Toutiao News   188,014                   80,578                15
THUCNews       180,000                   10,000                10

3 Problem Definition and Proposed Work

Traditional CNN and RNN networks each have their own advantages and disadvantages. For example, a CNN is better at extracting local spatial or short-term structural relationships but is less capable of extracting features from sequential data, whereas an RNN is better at processing sequential data but has difficulty extracting features. The DCRC model proposed in this paper consists mainly of a BiGRU layer, a CNN layer and a BiLSTM layer, and is divided into two parts: BiGRU-CNN and BiLSTM. The flow chart of the DCRC model is shown in Fig. 4.

3.1 BiGRU-CNN Based Text Feature Acquisition and Context Dependent Learning

As described in the above subsections, the GRU is good at capturing long-range dependencies and order information in text sequences, while the CNN is good at capturing location-variant features, so connecting the two is much more effective. For the BiGRU-CNN model, bidirectional recurrent networks (BiGRU, BiLSTM) and convolutional neural networks (CNN) can currently be connected in series or in parallel. If the series structure is chosen and the input sequence is passed through the convolutional layer first, information is compressed and lost during convolution. The bidirectional recurrent network then receives features that are already incomplete and loses some of the time-series features, so the advantages of the BiGRU or BiLSTM are not exploited. In addition, BiGRU and BiLSTM still have some drawbacks: the high-dimensional input space common in text processing applications increases the complexity of the model and makes it difficult to optimise, and the model cannot focus on the important parts of the contextual information and still suffers from feature loss.

Fig. 4 Flowchart of the DCRC model

In our model, we therefore first pass the text input sequence through the BiGRU model to extract the contextual information of the text, and then use these event-related features, represented by two hidden state vectors carrying past and future information, as input to the CNN model, which further extracts important local features and reduces the dimensionality of the input data. In addition, maximum pooling is stacked at the output of the convolutional layer, making the feature matrix more sensitive to changes in features and thus improving model accuracy. In the BiGRU model, a single gating unit updates its output as follows.

$$h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \left(1 - u_i^{(t-1)}\right) \sigma\left(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)}\right) \tag{5}$$

where $h_i^{(t)}$ and $h_i^{(t-1)}$ represent the updated output of the $i$th feature at times $t$ and $t-1$, respectively; $u$ represents the update gate and $r$ the reset gate; and $U$, $W$ and $b$ represent the input weights, recurrent weights and biases, respectively. The updated results are fed into the CNN model for the convolution operation, which leads to Eq. (6).

$$c_i = f\left(W \cdot H_{h_i:h_{i+k-1}} + b\right) \tag{6}$$

where $c_i$ represents the $i$th feature value, $W$ represents the convolution kernel of size $k$, $f$ is the activation function, $b$ is the bias, and $H_{h_i:h_{i+k-1}}$ is the convolution window, whose input feature values are the updated outputs of the BiGRU model from the $i$th value $h_i$ to the $(i+k-1)$th value $h_{i+k-1}$. The convolution kernel $W$ is applied to each window of the feature matrix to produce the final convolutional output of the BiGRU-CNN model:

$$C_{GRU} = \left[c_{GRU}^{1}, c_{GRU}^{2}, \ldots, c_{GRU}^{n-k+1}\right] \tag{7}$$

In this way, the BiGRU model avoids receiving incomplete text features after compression, and the CNN model learns the important features within the context dependencies, compensating for the shortcomings of the BiGRU model. The structure of the BiGRU-CNN model is shown in Fig. 5.
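The gating-unit update of Eq. (5) can be written out directly. The sketch below is illustrative only: the function name `gru_state_update` and all numeric values are hypothetical, and the gate values `u` and `r` are passed in as given rather than computed, because the equation itself takes them as inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_state_update(h_prev, x, u, r, U, W, b):
    """One hidden-state update following Eq. (5).

    h_prev: previous hidden state h^(t-1); x: current input x^(t);
    u, r: update and reset gate values in [0, 1] (taken as given here);
    U, W: input and recurrent weight matrices; b: bias vector.
    """
    n = len(h_prev)
    h_new = []
    for i in range(n):
        input_term = sum(U[i][j] * x[j] for j in range(len(x)))
        recurrent_term = sum(W[i][j] * r[j] * h_prev[j] for j in range(n))
        candidate = sigmoid(b[i] + input_term + recurrent_term)
        # The update gate interpolates between the old state and the candidate.
        h_new.append(u[i] * h_prev[i] + (1.0 - u[i]) * candidate)
    return h_new


# Hypothetical values: 2 hidden units, 3 input features.
h_prev = [0.2, -0.4]
x = [1.0, 0.5, -1.0]
U = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
W = [[0.5, -0.5], [0.3, 0.2]]
b = [0.0, 0.1]
r = [1.0, 0.5]
u = [0.9, 0.1]

h_new = gru_state_update(h_prev, x, u, r, U, W, b)
print(h_new)
```

With `u = [1, 1]` the unit copies its previous state unchanged, and with `u = [0, 0]` it replaces the state entirely with the candidate value, which is how the update gate controls how much history is kept.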


Fig. 5 Structure of BiGRU-CNN model
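The window-wise convolution of Eqs. (6) and (7), followed by maximum pooling, can be sketched as below. This is a simplified illustration under stated assumptions: the matrix `H` stands in for the BiGRU outputs, the kernels are all-ones matrices, and ReLU is used for the activation f; none of these values come from the paper apart from the kernel sizes 3, 4 and 5 mentioned in Sect. 3.3.

```python
def relu(z):
    return max(0.0, z)


def conv1d_windows(H, kernel, bias=0.0, activation=relu):
    """Apply one kernel to every window of k consecutive rows of H.

    H: n rows (one hidden vector per time step), each of dimension d.
    kernel: k rows of dimension d.
    Returns the n - k + 1 feature values c_1 ... c_{n-k+1} of Eq. (7).
    """
    n, k = len(H), len(kernel)
    features = []
    for i in range(n - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * h for w, h in zip(kernel[offset], H[i + offset]))
        features.append(activation(total))
    return features


def ones_kernel(k, d):
    return [[1.0] * d for _ in range(k)]


# Hypothetical BiGRU output: n = 6 time steps, d = 2 features per step.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0], [1.0, 0.0]]

feature_maps = {k: conv1d_windows(H, ones_kernel(k, 2)) for k in (3, 4, 5)}
pooled = [max(feature_maps[k]) for k in (3, 4, 5)]  # max pooling per kernel size
print(feature_maps)
print(pooled)
```

Each kernel of size k yields n − k + 1 values, and max pooling reduces every feature map to a single value, so the pooled vector has one entry per kernel regardless of the sequence length.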

3.2 Mitigation of Feature Loss Problems Based on BiLSTM

When the BiGRU model and the CNN model are connected in series, the time-series features that the BiGRU learns from the text input sequence are fed into the convolutional neural network, and the convolution still causes some feature loss through compression. Therefore, we feed the text input sequence not only into the BiGRU-CNN model but also into the BiLSTM model, which compensates for the feature loss. In our model, instead of attaching the BiLSTM model directly to the output of the BiGRU-CNN model, we use it to receive the input sequence and capture long-term dependencies, which allows the sequence data to be learned again; a Dropout layer is placed after the BiLSTM layer to prevent overfitting. The BiLSTM and BiGRU-CNN models are then stitched together in parallel. After the input sequence passes through the BiLSTM model, the model has learned the time-series features and context dependencies of the text input sequence. Its output is then stitched in parallel with the output of the BiGRU-CNN model, which extracts the main features again, alleviates the feature loss in the output of the CNN model, and effectively reduces the mutual contamination of shared features extracted by the BiGRU-CNN tandem structure. This is our DCRC model. In this way, compared with general traditional models, the model can extract contextual dependencies again from data whose output dimensionality has already been reduced, enabling more accurate identification of important information in context and hence more accurate text classification.

3.3 Overview of the DCRC Model

The structure of the DCRC model is shown in Fig. 6. To solve the above problems, we first use BiGRU to extract contextual features; we then use CNN to extract accurate features, with convolutional kernels of size 3, 4 and 5; immediately afterwards, we perform pooling to extract the important features of the text; we then use the BiLSTM layer to extract the main features; finally, the features extracted by the BiGRU-CNN model are stitched in parallel with those of the BiLSTM layer to achieve more accurate text classification and reduce gradient dispersion.

Fig. 6 Structure of DCRC model
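The parallel stitching step amounts to concatenating the feature vectors of the two branches before classification. The sketch below illustrates this under explicit assumptions: the final linear layer with softmax, the function name `stitch_and_classify`, and all numeric values are hypothetical, since the text does not specify the output layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stitch_and_classify(cnn_features, lstm_features, weights, biases):
    """Concatenate the two branch outputs, then apply a linear layer and softmax.

    weights: one row per class, each of length
             len(cnn_features) + len(lstm_features).
    """
    fused = list(cnn_features) + list(lstm_features)
    scores = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return fused, softmax(scores)


# Hypothetical branch outputs: 3 pooled CNN features, 2 BiLSTM features.
cnn_features = [0.5, 0.6, 0.7]
lstm_features = [0.1, -0.2]
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],   # class 0 looks at the first CNN feature
           [0.0, 0.0, 0.0, 0.0, 1.0]]   # class 1 looks at the last BiLSTM feature
biases = [0.0, 0.0]

fused, probs = stitch_and_classify(cnn_features, lstm_features, weights, biases)
print(fused, probs)
```

Because the classifier sees both branches side by side, features lost in the convolution and pooling path can still contribute through the BiLSTM part of the fused vector.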


4 Experiments

4.1 Datasets

To validate the model, we used the Toutiao News dataset and the THUCNews dataset. The Toutiao dataset is a selection of Chinese news from Today’s Headlines (Toutiao) collected from June to September 2018; it covers 15 news categories and contains 260,000 news items, with "!" separating the news tags from the news body. The THUCNews dataset is divided into 10 categories and has fewer noise features; because of its large number of samples, 180,000 samples were randomly selected for model training. We pre-processed these data so that each text has a clear category label. Details of the datasets are shown in Table 1.
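The train/test proportions implied by Table 1 can be computed directly. This is a small arithmetic sketch; the counts are copied from Table 1 and the helper `split_summary` is a hypothetical name.

```python
# Sample counts as listed in Table 1.
datasets = {
    "Toutiao News": {"train": 188_014, "test": 80_578, "categories": 15},
    "THUCNews": {"train": 180_000, "test": 10_000, "categories": 10},
}


def split_summary(train, test):
    """Return (total samples, training share, test share)."""
    total = train + test
    return total, train / total, test / total


for name, info in datasets.items():
    total, train_share, test_share = split_summary(info["train"], info["test"])
    print(f"{name}: total={total}, train={train_share:.1%}, test={test_share:.1%}")
```

The Toutiao News split is roughly 70/30, whereas THUCNews holds out only about 5% of its samples for testing, which is worth keeping in mind when comparing results across the two datasets.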

4.2 Evaluation

Classification tasks are typically evaluated using accuracy, recall and F1 score. Here TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. Accuracy is calculated as follows.

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

Recall is calculated using the following formula.

$$Rec = \frac{TP}{TP + FN} \tag{9}$$

The F1 score is calculated using the following formula, where $Pre = TP / (TP + FP)$ is the precision.

$$F1 = \frac{2 \times Pre \times Rec}{Pre + Rec} \tag{10}$$
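As a worked illustration, the metrics can be computed from the four confusion counts using the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), recall = TP / (TP + FN), precision = TP / (TP + FP), and F1 as their harmonic mean. The counts below are invented for the example, and the sketch covers a single binary (one-vs-rest) case; how the per-class values are averaged over the 10 or 15 categories is not stated in the text and is left out here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F1 from confusion counts.

    Assumes non-zero denominators; no zero-division handling is included.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical counts for one class treated as "positive".
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that accuracy counts both kinds of correct prediction (TP and TN) in the numerator, whereas recall and precision only consider the positive class.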

4.3 Experiment Results

To validate the performance of our model, this section compares it under identical experimental conditions with the following models.

• BiLSTM-CNN [23], which first extracts contextual semantic information using BiLSTM and then uses CNN to extract local key information from the LSTM output, thus better capturing the important features of the text.
• BiGRU-CNN [24], which first extracts contextual information with BiGRU, which is more time-efficient to train than BiLSTM. The CNN layer is then used to extract local information between texts, and finally the two are stitched together in parallel to obtain the important information of the text.
• TextCNN [10], which passes the text sequence through an Embedding layer and then applies convolution operations, and captures the information of the text very well.
• RCNN [25], which applies a recurrent structure that captures as much contextual information as possible when learning word representations and introduces considerably less noise than traditional window-based neural networks, providing good text classification results.
• DCRC, which first passes the input through the BiGRU layer and then sends the output to the CNN layer for convolution and pooling; the BiLSTM layer extracts the contextual semantic relationships again, and finally the BiGRU-CNN and BiLSTM outputs are stitched together in parallel to further reduce the loss of important features and achieve more accurate text classification.

The experimental results of the above models on the two datasets are shown in Tables 2 and 3. The tables show that DCRC performs best on both datasets, substantially improving text classification. This is because our model runs BiLSTM in parallel with BiGRU-CNN, which extracts the main features of the text again and reduces the loss of important text features to a certain extent. Its recall and F1 score are also ahead of those of the other models, demonstrating the superior performance of the DCRC model on text classification tasks. The accuracy and loss comparison line plots at training time are shown in Figs. 7 and 8. In general, the DCRC model proposed in this paper improves on many evaluation indicators.

Table 2 Comparison of experimental results for the Toutiao News dataset

Model         Accuracy   Recall   F1 Score
BiGRU-CNN     88.61      82.32    82.27
BiLSTM-CNN    89.92      83.65    83.93
TextCNN       97.30      90.83    90.89
RCNN          97.03      92.38    93.45
DCRC          98.99      97.78    97.86

Table 3 Comparison of experimental results for the THUCNews dataset

Model         Accuracy   Recall   F1 Score
BiGRU-CNN     92.16      92.17    92.19
BiLSTM-CNN    92.84      92.84    92.83
RCNN          96.66      96.63    96.67
TextCNN       97.80      96.89    97.91
DCRC          98.61      98.47    98.37

Fig. 7 Comparison of models for Toutiao News datasets

Fig. 8 Comparison of models for THUCNews datasets

[Figs. 7 and 8: Accuracy, Recall and F1 Score (%) for BiGRU-CNN, BiLSTM-CNN, TextCNN, RCNN and DCRC]

5 Conclusion

In this paper, we observe that bidirectional recurrent neural networks can capture long-distance dependencies in sequences but cannot capture feature information effectively when performing text classification, and that convolutional neural networks lose some features after information compression. Taking these into account, the DCRC model proposed in this paper connects a bidirectional gated recurrent network and a convolutional neural network in a suitable way to form a BiGRU-CNN model in which the two compensate for each other's shortcomings, and combines its output with that of the BiLSTM model to form the final DCRC model. This ensures that textual context dependencies are captured effectively while important missing features are recaptured. The experimental results of the DCRC model on two publicly available datasets show that it achieves an average improvement of 3.79% in accuracy and 3.57% in F1 score over the best-performing baseline model, indicating that the DCRC model can indeed perform the text classification task effectively.

References

1. N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences. arXiv:1404.2188 (2014)
2. P. Liu, X. Qiu, X. Huang, Adversarial multi-task learning for text classification. arXiv:1704.05742 (2017)
3. A. Conneau et al., Very deep convolutional networks for text classification. arXiv:1606.01781 (2016)
4. A. Leontjeva, I. Kuzovkin, Combining static and dynamic features for multivariate sequence classification, in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE (2016)
5. B. Liu, Y. Zhou, W. Sun, Character-level text classification via convolutional neural network and gated recurrent unit. Int. J. Mach. Learn. Cybern. 11(8), 1939–1949 (2020)
6. Y. Cheng et al., Text sentiment orientation analysis based on multi-channel CNN and bidirectional GRU with attention mechanism. IEEE Access 8, 134964–134975 (2020)
7. M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
8. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
9. D.H. Hubel, T.N. Wiesel, Receptive fields of single neurones in the cat’s striate cortex. J. Physiol. 148(3), 574 (1959)
10. Y. Kim, Convolutional neural networks for sentence classification. arXiv preprint (2014)
11. S. Wang, M. Huang, Z. Deng, Densely connected CNN with multi-scale feature attention for text classification. IJCAI (2018)
12. Y. Bengong, Z. Mengdi, Question classification based on bidirectional GRU with hierarchical attention and multi-channel convolution. Data Analysis Knowl. Disc. 4(8), 50–62 (2020)
13. S. Lai et al., Recurrent convolutional neural networks for text classification, in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
14. C. Zhou et al., A C-LSTM neural network for text classification. arXiv:1511.08630 (2015)
15. J.J. Huang, J.Q. Lin, Y.J. He et al., Chinese short text classification algorithm based on local semantics and context. Comp. Eng. Appl. 57(6), 94–100 (2021)
16. G. Liu, J. Guo, Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 337, 325–338 (2019)
17. P. Liu et al., Multi-timescale long short-term memory neural network for modelling sentences and documents, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)
18. K. Cho et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078 (2014)
19. A. Joulin et al., Bag of tricks for efficient text classification. arXiv:1607.01759 (2016)
20. W. Zhao et al., The study on the text classification for financial news based on partial information. IEEE Access 8, 100426–100437 (2020)
21. S. Sachin et al., Sentiment analysis using gated recurrent neural networks. SN Comp. Sci. 1(2), 1–13 (2020)
22. C. Yan et al., Text sentiment orientation analysis of multi-channels CNN and BiGRU based on attention mechanism. J. Comp. Res. Develop. 57(12), 2583 (2020)
23. Y. Li, X. Wang, X. Pengjian, Chinese text classification model based on deep learning. Future Internet 10(11), 113 (2018)
24. H.T. Tran, H.H.P. Vo, S.T. Luu, Predicting job titles from job descriptions with multi-label text classification. arXiv:2112.11052 (2021)
25. B. Cheng et al., Revisiting RCNN: On awakening the classification power of faster RCNN, in Proceedings of the European Conference on Computer Vision (ECCV) (2018)

Hierarchical Medical Classification Based on DLCF

Mingyuan Yao, Haoran Sun, Shengbin Liang, Yanqing Shen, and Niki Yukie

Abstract Medical text classification is affected by many factors, and traditional approaches are usually restricted by overly long texts and a large number of categories. To address these problems, this paper uses character vectors and word vectors to mine the text deeply. Because the key features of medical text are scattered, a long short-term memory (LSTM) network is introduced to retain historical information in long text sequences, a CNN structure is used to extract local features of the text, and an attention mechanism is used to obtain the key features. Because there are many diseases, hierarchical classification is used to stratify them. Combining these ideas, a deep DLCF model suitable for long text and multi-class classification is designed. The model outperforms the baseline models in accuracy, recall and other indicators on CMDD and other datasets.

Keywords Medical classification · Hierarchical classification · Dual channel · LSTM-CNN · RF

M. Yao · H. Sun
School of Software, Henan University, Kaifeng, China
e-mail: [email protected]

H. Sun
e-mail: [email protected]

S. Liang
Institute for Data Engineering and Science, University of Saint Joseph, Macau, China
e-mail: [email protected]

Y. Shen (B)
Zhongyuan Wuhu Research Institute, Henan University, Kaifeng, China
e-mail: [email protected]

N. Yukie
School of Foreign Languages, Henan University, Kaifeng, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_7

M. Yao et al.

1 Introduction

With the rapid development of Internet technology, the amount of text information is growing exponentially. How to organize and manage this text is an urgent problem. Text classification is an effective approach and is widely used in sentiment analysis [1], spam detection [2], topic classification [3] and other fields. Following the great success of deep learning in vision and speech recognition, many deep architectures are now widely used in NLP. Popular approaches include, but are not limited to, learning word vectors through neural language models [4] and using recurrent neural networks (RNN) [5] and convolutional neural networks (CNN) [6] for text classification.

Zhang and You [7] propose a Chinese short text classification model based on TextCNN that takes the length of short text data into account. The model uses back-translation to expand the data and make up for the lack of training data. Mao et al. [8] used word embedding to obtain word vector representations and encoded them with an LSTM network to extract text features, achieving good results. Chen et al. [9] achieved higher performance on breast cancer classification using a syntax-aware representation of medical reports encoded hierarchically by hierarchical attention bidirectional recurrent neural networks (HA-BiRNNs). Qiao et al. [10] propose a word-character attention model (WCAM) for Chinese text classification. WCAM integrates two levels of attention: one at the word level and one at the character level. The method also introduces a word-character constraint model and character alignment to ensure that the selected characters are highly representative and to enhance their discriminative ability. Li et al. [11] use the BLSTM-C model (BLSTM for bidirectional long short-term memory and C for CNN), in which the BLSTM produces a sequence output based on past and future contexts that is then fed into a convolution layer to extract features. The results show that the model performs remarkably well in text classification. Tao et al. [12] proposed a radical-aware attention-based four-granularity (RAFG) model, which makes full use of Chinese characters, words, character-level radicals and word-level radicals to model the character-sharing property of Chinese characters and the sequence features of text by capturing long-range information. Li and Ning [13] combined the advantages of CNN and LSTM to construct an LSTM-CNN hybrid model for Chinese news text classification. Liu et al. [14] proposed a new classification model, the hierarchical comprehensive context modeling network (HCCMN), which can extract a more comprehensive context. To overcome the limitation of unidirectional language model prediction, Devlin et al. [15] proposed BERT, which pre-trains deep bidirectional representations by jointly conditioning on both left and right context in all layers.

The models mentioned above are effective for short texts and for data with few categories, but they often struggle with long texts and multi-label data. This is especially true of medical classification, where there are many departments and the wide coverage of diseases cannot be handled effectively. The main contributions of this paper are as follows:

• A DLCR text classification model is proposed, which adopts a dual-channel mechanism. The two channels receive character-level and word-level embeddings at the same time, and the two sequences are fed into LSTM models to extract character feature vectors and word feature vectors respectively. This method can extract word-level semantic features and semantic pools, reduce the dimension of the input data, and reduce the risk of over-fitting.
• Two identical channel models are adopted, namely the LSTM-CNN-Attention model. The LSTM retains the features of historical information in long text sequences, the CNN structure extracts the local features of the text, and the attention mechanism then obtains the key features.
• Considering the long medical texts, the many categories and other factors, a hierarchical classification model is used to assign multi-level medical text categories and further improve the prediction accuracy of the model.

2 Background and Related Work

Traditional text classification methods suffer from low classification accuracy and poor reliability. Deep learning models such as LSTM [16] and CNN [17] now offer certain advantages over them.

2.1 CNN Model

A CNN is a feedforward neural network with convolution operations and a deep structure. Its artificial neurons respond to the surrounding units within a limited receptive field. The structure of a CNN can be divided into three kinds of layers: the convolution layer is mainly used to extract features, the pooling layer is mainly used for downsampling without damaging the recognition results, and the fully connected layer classifies the final results. The first CNN was the time-delay neural network (TDNN) for speech recognition proposed by Waibel et al. [18] in 1987. In recent years, many scholars have applied convolutional neural networks to natural language processing. Collobert et al. [19] first applied the convolutional neural network to NLP, and Zhang and Wallace [6] first applied it to the text classification task, proposing a classical convolutional text classification model: a convolution layer (using convolution kernels of different sizes), then a max pooling layer, and finally a fully connected classifier with Dropout, which achieves very good results on multiple datasets.
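The classical architecture described above (parallel convolutions with kernels of different sizes, followed by max pooling) can be sketched in plain Python. This is only a minimal illustration and not the implementation of [6]: the function names `conv1d_valid` and `text_cnn_features`, the toy embeddings and the filter weights are all assumptions made for the example, and the Dropout and fully connected classifier are omitted.

```python
def conv1d_valid(seq, kernel):
    """Slide one filter over a sequence of embedding vectors (no padding).

    seq:    list of T vectors, each of dimension d
    kernel: list of k vectors, each of dimension d (one filter of width k)
    Returns T - k + 1 activations after a ReLU.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = 0.0
        for offset in range(k):
            for a, b in zip(seq[start + offset], kernel[offset]):
                total += a * b
        out.append(max(0.0, total))  # ReLU
    return out


def text_cnn_features(seq, kernels):
    """Max-over-time pooling: keep the strongest response of each filter."""
    return [max(conv1d_valid(seq, kernel)) for kernel in kernels]


# Toy input: 4 tokens with 2-dimensional embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hypothetical filters of different widths (2 and 3).
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
features = text_cnn_features(tokens, filters)
```

Each filter contributes exactly one pooled value, so the feature vector has a fixed length regardless of the length of the text, which is what allows a fully connected layer to follow.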

2.2 LSTM

The LSTM, proposed by Hochreiter and Schmidhuber [20] in 1997, is a recurrent neural network designed to solve the problems of long-term information preservation and vanishing gradients encountered by traditional RNNs (recurrent neural networks) [21]. An LSTM contains an input gate that determines what information to store in the cell state, as in Formula 1:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{1}$$

The parameters are $W_i$ and $b_i$. It also contains a forget gate that determines what information to discard from the cell state, as in Formula 2:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{2}$$

The parameters are $W_f$ and $b_f$. Next, a tanh layer creates a candidate vector $\tilde{C}_t$, which will be added to the cell state, as in Formula 3. In the following step, the input gate and the candidate vector are combined to update the cell state.

$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3}$$

The parameters here are $W_c$ and $b_c$. Finally, we decide what to output. The output is based on the cell state and is controlled by an output gate, as in Formula 4:

$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t * \tanh(C_t) \tag{4}$$

The parameters here are $W_o$ and $b_o$. Owing to its unique design, the LSTM is suitable for processing and predicting important events separated by very long intervals and delays in a time series. As a nonlinear model, the LSTM can also be used as a complex nonlinear unit for constructing larger deep neural networks. The LSTM model diagram is shown in Fig. 1.

Based on this, Liu and Guo [22] proposed a new unified architecture that includes a bidirectional LSTM (BiLSTM), an attention mechanism and a convolution layer. This structure is called the attention-based bidirectional long short-term memory with convolution layer (AC-BiLSTM). Du et al. [23] adopted a flat neural network called the broad learning system (BLS) and proposed two new text classification learning methods: recurrent BLS (R-BLS) and a long short-term memory (LSTM)-like structure, gated BLS (G-BLS). Chen et al. [24] proposed an attention-based bidirectional long short-term memory (Att-BiLSTM) model for service robots, which can classify outpatient categories according to text content.
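Formulas 1 to 4 can be traced in a few lines of plain Python. The sketch below is illustrative only: the helper names (`sigmoid`, `affine`, `lstm_step`) and the parameter layout are assumptions for this example. The text does not spell out the cell-state update, so the sketch uses the standard form, in which the new cell state is the forget gate times the previous cell state plus the input gate times the candidate vector.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Formulas 1-4.

    params maps each gate name ("i", "f", "c", "o") to a (W, b) pair, where
    W acts on the concatenation [h_{t-1}, x_t].
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]        # Formula 1
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]        # Formula 2
    c_tilde = [math.tanh(a) for a in affine(*params["c"], z)]  # Formula 3
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]        # Formula 4
    # Standard cell-state update (assumed; not written out in the text).
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]         # Formula 4
    return h_t, c_t


# Hidden size 1, input size 1, all-zero weights: every gate equals 0.5.
zero = ([[0.0, 0.0]], [0.0])
params = {name: zero for name in "ifco"}
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
```

With all-zero weights the forget gate simply halves the previous cell state, which makes the role of each gate easy to check by hand.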

Fig. 1 LSTM model diagram

2.3 Random Forest

Random forest is an ensemble decision-tree algorithm proposed by Breiman [25] in 2001. It uses randomly selected training samples and random subsets of variables to generate multiple decision trees. As a well-known ensemble learning algorithm, it is not prone to over-fitting or collinearity problems and offers good computing speed and efficiency. It also performs very well on multi-class problems and has been widely used in many fields.

Salles et al. [26] use the random forest algorithm to handle high-dimensional noisy data. The authors propose a lazy version of the traditional random forest classifier, also known as lazy NN_RF. Sun et al. [27] propose a weighted voting mechanism to improve the quality of the decision trees and achieve very good results in the classification of multi-type data. Islam et al. [28] propose a semantics-aware random forest (SARF) classifier. SARF extracts the features that the trees use to generate predictions and selects the subset of features related to the predicted class. Al Amrani et al. [29] propose a method for sentiment classification using a support vector machine, a random forest and an RFSVM-based hybrid method. Kukkar et al. [30] proposed a new deep learning model for multi-class bug severity classification, which addresses these challenges by using convolutional neural networks and random forest with boosting (BCR).
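The three ingredients named above (random training samples, random variable subsets and a vote over many trees) can be sketched as follows. This is a hedged illustration rather than Breiman's full algorithm: tree construction is left out, and the helper names and the toy symptom and department labels are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) training rows with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree to consider."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Combine the per-tree class predictions into the forest's answer."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
# Hypothetical (text, label) training rows.
rows = [("fever cough", "respiratory"), ("chest pain", "cardiology"),
        ("rash itch", "dermatology")]
sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(n_features=10, k=3, rng=rng)
vote = majority_vote(["cardiology", "respiratory", "cardiology"])
```

Because every tree sees a different sample and a different feature subset, the trees make partly independent errors, and the vote averages those errors out.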

3 Problem Definition and Proposed Work

Text classification is a basic problem of natural language processing and is roughly divided into two steps: text feature extraction and classification. To solve the problems described above, this paper proposes a DLCR model for accurate disease classification, which is introduced as follows.

3.1 Introduction to the DLC Layer

To mine the text deeply, we propose an improved dual-channel (DC) mechanism. In this mechanism, the two channels receive character-level and word-level embeddings at the same time, and the two sequences are fed into LSTM models to extract character feature vectors and word feature vectors respectively. This method can extract word-level semantic features and semantic pools, reduce the dimension of the input data, and reduce the risk of over-fitting. To keep the feature dimensions consistent, we adopt two identical channel models, namely the LSTM-CNN-Attention model.

The gating mechanisms of the LSTM help it track long-term dependencies and address the inherent problem of vanishing gradients, especially with longer input sequences. However, the existing LSTM cannot extract context information from future tokens and lacks the ability to extract local context information. Its accuracy is further hindered by its inability to recognize the different relationships between the various parts of a document. At the same time, because a CNN tends to converge to a local minimum rather than the global minimum during training, we propose a hybrid model of LSTM, CNN and Attention. The LSTM retains the features of historical information in long text sequences, the CNN structure extracts local features of the text, and the attention mechanism then obtains the key features to improve the accuracy of the model. The model can extract contextual semantic information, ignore secondary information, assign initial weights to each input, and update these weights during training according to the correlation between each input and the final prediction, which supports the accuracy of the classification task. At the same time, generalization is improved and over-fitting is reduced.

The LSTM part corresponds to Formula 5:

$$y_k, (h_t, c_t)_k = \mathrm{LSTM}(X_k) \tag{5}$$

where $X_k$ represents the k-th feature vector of the input, $y_k$ represents the output of the k-th LSTM, $h_t$ represents the hidden state at the last time step, and $c_t$ represents the cell state at the last time step. The CNN part corresponds to Formula 6:

$$s_k(i, j) = \mathrm{conv}(X_k * W_k) + b \tag{6}$$

To prevent the model input from becoming too large, max pooling is applied after the CNN to reduce the size of the model, and the final six outputs are divided into two groups. High-level feature vectors are then extracted again through the attention mechanism, with the score calculated as in Formula 7:

$$f(O_i) = v_a^{T} \tanh(W_a O_i + b) \tag{7}$$

where $O_i$ represents the output of the pooling layers and $W_a$ represents the weights updated during training. The resulting attention weight is given by Formula 8:

$$\omega_i = \frac{\exp(f(O_i))}{\sum_{k=1}^{T_x} \exp(f(O_k))} \tag{8}$$

where $T_x$ is the length of the sequence. The output vector $c_i$ is obtained by weighting with these dynamic adaptive weights, as in Formula 9:

$$c_i = \sum_{j=1}^{T_x} \omega_j \cdot h_j \tag{9}$$

Finally, the outputs of the two channels are passed through Dropout, and the first-level classification results are produced by the Softmax function. The model is shown in Fig. 2.
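The attention step of Formulas 7 to 9 can be sketched as score, softmax and weighted sum. This is a minimal illustration under stated assumptions: the function name `attention_pool` is hypothetical, the parameters are toy values, and the sketch weights the same vectors that it scores, whereas the text writes the scored vectors as pooling outputs and the weighted vectors as hidden states.

```python
import math


def attention_pool(outputs, W_a, b, v_a):
    """Score each vector (Formula 7), normalise the scores with a softmax
    (Formula 8) and return the weights and their weighted sum (Formula 9)."""
    scores = []
    for o in outputs:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, o)) + b_j)
                  for row, b_j in zip(W_a, b)]
        scores.append(sum(v * h for v, h in zip(v_a, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(outputs[0])
    context = [sum(w * o[d] for w, o in zip(weights, outputs))
               for d in range(dim)]
    return weights, context


# With all-zero attention parameters every position gets the same weight.
outputs = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_pool(
    outputs, W_a=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0], v_a=[1.0, 1.0])
```

During training the parameters move away from zero, so positions whose content correlates with the label receive larger weights and dominate the pooled vector.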

3.2 Introduction to R Layer

Because there are many types of diseases and the similarity between texts is relatively high, we integrate the two-tier model by combining the advantages of TF-IDF and random forest. The model is implemented as follows. First, the prediction of the upper layer is compared with the true label; stop words are then removed from the training text, and the data are segmented to build a corpus. High-frequency features are then selected from the corpus as the random forest input. Second, TF-IDF is used to calculate the weight of each feature in each sentence and to pick out the most representative feature words, as in Formula 10:

$$tf(d, \omega) = \frac{N_{d,\omega}}{\sum_{k} N_{k,\omega}}, \qquad idf(\omega) = \log\frac{N + 1}{N(\omega)}, \qquad \text{tf-idf}(\omega) = tf(d, \omega) * idf(\omega) \tag{10}$$
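A small worked example may help with Formula 10. The sketch below assumes the usual reading of the symbols, which the text does not define: the term frequency is the count of a term in a sentence divided by the length of that sentence, N is the number of sentences, and N(ω) is the number of sentences containing the term. The function name `tf_idf` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log((n_docs + 1) / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Two toy sentences, already segmented and with stop words removed.
docs = [["fever", "cough", "fever"], ["cough", "rash"]]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
```

In the first sentence, "fever" outranks "cough" because it is both more frequent locally and rarer across the corpus, which is the behaviour wanted when selecting representative feature words.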

Fig. 2 DLC Layer model diagram

Finally, the obtained feature matrix is input into the random forest, and the final classification result is obtained through model parameter adjustment and feature-value truncation, as in Formula 11:

$$\text{output} = RF(\text{tf-idf}(\omega)) \tag{11}$$

The structure of this layer is shown in Fig. 3.

Fig. 3 R-layer model

3.3 Overview of DLCR Model

Traditional CNN and RNN models are built on a single channel of either character vectors or word vectors. For a CNN, when the network is too deep, back-propagation makes the parameters near the input layer change slowly; for an LSTM, when the network is too deep, long-distance dependence problems arise easily. In the input layer of the DLCR model, character embedding and word embedding are both adopted in order to mine the text deeply. The corresponding feature vectors are input into the DLC model for high-level classification, the prediction results are then input into the random forest for low-level feature training, and finally the disease information is obtained. The specific multi-layer, multi-channel DLCR model is shown in Fig. 4.
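The two-stage flow described above (a high-level prediction that then selects the low-level classifier) can be sketched as a simple routing function. This is an assumption-laden illustration: the text does not state how the first-level result is passed to the random forest, so the sketch assumes one second-level model per first-level class, and the toy keyword rules stand in for the trained DLC layer and random forests.

```python
def hierarchical_predict(text, coarse_model, fine_models):
    """Predict the first-level class, then route to its second-level model."""
    department = coarse_model(text)
    disease = fine_models[department](text)
    return department, disease


# Hypothetical stand-in for the trained first-level (DLC) classifier.
def toy_coarse(text):
    return "cardiology" if "chest" in text else "dermatology"


# Hypothetical stand-ins for the second-level (random forest) classifiers.
toy_fine = {
    "cardiology": lambda text: "angina" if "pain" in text else "arrhythmia",
    "dermatology": lambda text: "eczema" if "itch" in text else "acne",
}

result = hierarchical_predict("chest pain after exercise", toy_coarse, toy_fine)
```

Routing in this way means each second-level model only has to separate the diseases of one first-level class, which is one common motivation for hierarchical classification when there are hundreds of fine-grained labels.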

4 Experiments

Fig. 4 DLCR model diagrams

4.1 Datasets

Two datasets are used in the experiments: the CMDD dataset and the CCNC dataset. After data processing, the statistics of the two datasets are shown in Table 1.

Table 1 Data statistics after cleaning

Data set   First-level classification   Second-level classification   Amount of data
CMDD       6                            263                           788,988
CCNC       24                           253                           54,192

4.2 Baselines and Evaluation

4.2.1 Evaluation

This experiment uses two datasets, with one experimental group and four control groups. The experimental group follows the disease classification model based on the DLCR structure. To judge the correctness of the final classification, we use four popular evaluation indicators: Precision, Recall, Accuracy and F1-Score. Precision is the proportion of correctly classified positive samples among all samples classified as positive, and measures the exactness of the text classification system, as in Formula 12:

$$Pre = TP/(TP + FP) \tag{12}$$

Recall is the proportion of all positive samples that are classified correctly, and measures the completeness of the text classification system, as in Formula 13:

$$Rec = TP/(TP + FN) \tag{13}$$

Accuracy is the ratio of predicted samples that match the label to the total number of samples, and measures the overall accuracy of the text classification system, as in Formula 14:

$$Acc = (TP + TN)/(TP + FP + TN + FN) \tag{14}$$

Recall and precision usually both need to be taken into account in text classification, so the F1-Score is used to evaluate the two indicators together, as in Formula 15:

$$F1Score = 2 * Pre * Rec/(Pre + Rec) \tag{15}$$

where TP stands for true positives, FN for false negatives, FP for false positives, and TN for true negatives.
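Formulas 12 to 15 translate directly into code. The sketch below is a minimal illustration for a single class; the function name `classification_metrics` and the counts are hypothetical, and it assumes the denominators are non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Formulas 12-15 from the four confusion-matrix counts."""
    precision = tp / (tp + fp)                            # Formula 12
    recall = tp / (tp + fn)                               # Formula 13
    accuracy = (tp + tn) / (tp + fp + tn + fn)            # Formula 14
    f1 = 2 * precision * recall / (precision + recall)    # Formula 15
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for one class out of 100 test samples.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With these counts the accuracy is high because true negatives dominate, while recall is noticeably lower, which illustrates why accuracy alone can be misleading on unevenly distributed classes.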

4.2.2 Hyperparameter Settings

The quality of model training largely depends on the hyperparameter settings. After fine-tuning the model, the value of each parameter is as shown in Tables 2 and 3.

Table 2 DLC parameter setting table

Experimental parameters   Value
batch_size                128
max_len                   400
embedding_dims            299
nb_filter                 150
hidden_dims               100
optimizer                 adam
loss                      binary_crossentropy

Table 3 RF parameter setting table

Experimental parameters   Value
n_estimators              2000
random_state              42
min_samples_split         2
max_features              Auto
bootstrap                 False

4.2.3 Experimental Results

In this comparative experiment, we compare the new two-layer classification model with popular single-layer classification models (the Transformer model and the TextCNN model) and with multi-layer classification models (the double-layer random forest (DRF) model and the double-layer logistic regression (DLogistic) model). With the same experimental tools and the same experimental data, the results are shown in Table 4.

Table 4 Model data comparison

Model         Dataset   Acc (%)   Pre (%)   Rec (%)   F1 score (%)
DLC-RF        CMDD      81.3      81.2      81.3      80.5
DLC-RF        CCNC      80.4      81.1      80.4      79.2
Transformer   CMDD      76.7      76.8      76.7      76.0
Transformer   CCNC      41.9      43.5      41.9      39.4
TextCNN       CMDD      75.7      75.4      75.7      74.7
TextCNN       CCNC      67.1      66.9      67.1      65.5
DRF           CMDD      80.5      79.5      80.2      79.8
DRF           CCNC      68.5      67.5      68.4      64.3
DLogistic     CMDD      78.5      73.2      74.5      73.8
DLogistic     CCNC      76.6      66.2      70.4      76.6

Fig. 5 Comparison of models in CMDD dataset

The comparison in Table 4 shows that the DLC-RF model scores highest on all four indicators on both datasets. On CMDD its accuracy is 81.3%, against 80.5% for DRF, the strongest baseline; on CCNC it is 80.4%, against 76.6% for DLogistic. Overall, the experiments suggest that the DLCR model achieves very good results on multi-layer text classification. The results are shown in Figs. 5 and 6.

5 Conclusion and Future Work

Although the DLCR model proposed in this paper brings a clear improvement, it still has several shortcomings:

• The dataset used in this experiment is relatively small, and the data types are not evenly distributed.
• Online model training is insufficient: owing to the limitations of the classification platform, the proportion of online training used for the model is low.
• The universality of the algorithm needs to be extended. At present, the algorithm is applied only to medical text classification; further research is expected on multi-layer classification of news text and of movie reviews.

Fig. 6 Comparison of models in CCNC dataset

References

1. B. Liu, Sentiment Analysis: Mining Opinions, Sentiments, and Emotions (Cambridge University Press, 2020)
2. A. Heydari, M.A. Tavakoli, N. Salim, Z. Heydari, Detection of review spam: a survey. Expert Syst. Appl. 42(7), 3634–3642 (2015)
3. X. Chen, Y. Zhang, J. Xu, C. Xing, H. Chen, Deep learning based topic identification and categorization: mining diabetes-related topics on Chinese health websites, in International Conference on Database Systems for Advanced Applications (Springer, 2016), pp. 481–500
4. Y. Bengio, R. Ducharme, P. Vincent, A neural probabilistic language model. Adv. Neural Inf. Process. Syst. 13 (2000)
5. P. Liu, X. Qiu, X. Huang, Recurrent neural network for text classification with multi-task learning (2016)
6. Y. Zhang, B. Wallace, A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification (2015)
7. T. Zhang, F. You, Research on short text classification based on TextCNN. J. Phys.: Conf. Ser. 1757(1), 012092 (2021)
8. S. Mao, L.-L. Zhang, Z.-G. Guan, An LSTM & topic-CNN model for classification of online Chinese medical questions. IEEE Access 9, 52580–52589 (2021)
9. D. Chen, M. Huang, W. Li, Knowledge-powered deep breast tumor classification with multiple medical reports. IEEE/ACM Trans. Comput. Biol. Bioinform. 18(3), 891–901 (2019)
10. X. Qiao, C. Peng, Z. Liu, Y. Hu, Word-character attention model for Chinese text classification. Int. J. Mach. Learn. Cybern. 10(12), 3521–3537 (2019)
11. Y. Li, X. Wang, P. Xu, Chinese text classification model based on deep learning. Future Internet 10(11), 113 (2018)
12. H. Tao, S. Tong, H. Zhao, T. Xu, B. Jin, Q. Liu, A radical-aware attention-based model for Chinese text classification. Proc. AAAI Conf. Artif. Intell. 33(01), 5125–5132 (2019)
13. X. Li, H. Ning, Chinese text classification based on hybrid model of CNN and LSTM, in Proceedings of the 3rd International Conference on Data Science and Information Technology (2020), pp. 129–134
14. J. Liu, C. Xia, H. Yan, Z. Xie, J. Sun, Hierarchical comprehensive context modeling for Chinese text classification. IEEE Access 7, 154546–154559 (2019)
15. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
16. S. Pouyanfar et al., A survey on deep learning: algorithms, techniques, and applications. ACM Comput. Surv. 51(5), 1–36 (2018)
17. M. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine learning techniques. WSEAS Trans. Comput. 4(8), 966–974 (2005)
18. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K.J. Lang, Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)
19. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
20. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
21. Y. Zhang, J. Zheng, Y. Jiang, G. Huang, R. Chen, A text sentiment classification modeling method based on coordinated CNN-LSTM-attention model. Chin. J. Electron. 28(1), 120–126 (2019)
22. G. Liu, J. Guo, Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 337, 325–338 (2019)
23. J. Du, C.-M. Vong, C.P. Chen, Novel efficient RNN and LSTM-like architectures: recurrent and gated broad learning systems and their applications for text classification. IEEE Trans. Cybern. 51(3), 1586–1597 (2020)
24. C.-W. Chen, S.-P. Tseng, T.-W. Kuan, J.-F. Wang, Outpatient text classification using attention-based bidirectional LSTM for robot-assisted servicing in hospital. Information 11(2), 106 (2020)
25. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
26. T. Salles, M. Gonçalves, V. Rodrigues, L. Rocha, Improving random forests by neighborhood projection for effective text classification. Inf. Syst. 77, 1–21 (2018)
27. Y. Sun, Y. Li, Q. Zeng, Y. Bian, Application research of text classification based on random forest algorithm, in 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE) (IEEE, 2020), pp. 370–374
28. M.Z. Islam, J. Liu, J. Li, L. Liu, W. Kang, A semantics aware random forest for text classification, in Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019), pp. 1061–1070
29. Y. Al Amrani, M. Lazaar, K.E. El Kadiri, Random forest and support vector machine based hybrid approach to sentiment analysis. Procedia Comput. Sci. 127, 511–520 (2018)
30. A. Kukkar, R. Mohana, A. Nayyar, J. Kim, B.-G. Kang, N. Chilamkurti, A novel deep-learning-based bug severity classification technique using convolutional neural networks and random forest with boosting. Sensors 19(13), 2964 (2019)

Noise Detection and Classification in Chagasic ECG Signals Based on One-Dimensional Convolutional Neural Networks

Weslley Lioba Caldas, João Paulo do Vale Madeiro, Roberto Coury Pedrosa, João Paulo Pordeus Gomes, Wencai Du, and João Alexandre Lobo Marques

Abstract Continuous cardiac monitoring has been increasingly adopted to prevent heart diseases, especially in the case of Chagas disease, a chronic condition that can degrade the heart, leading to sudden cardiac death. Unfortunately, a common challenge for these systems is the low quality and high level of noise in ECG signal collection. Also, generic techniques to assess ECG quality can discard useful information in these so-called chagasic ECG signals. To mitigate this issue, this work proposes a 1D CNN to assess the quality of the ECG signal for chagasic patients and compares it to state-of-the-art techniques. Segments of 10 s were extracted from 200 1-lead ECG Holter signals. Different feature extractions were considered, such as morphological fiducial points, interval duration, and statistical features, aiming to classify 400 segments into four signal quality types: Acceptable ECG, Non-ECG, Wandering Baseline (WB), and AC Interference (ACI) segments. The proposed CNN architecture achieves a 0.90 ± 0.02 accuracy in the multi-classification experiment and 0.94 ± 0.01 when considering only acceptable ECG against the other three classes. We also present a complementary experiment showing that, after removing noisy segments, morphological recognition (based on the QRS wave) improved by 33% over the entire ECG data. The proposed noise detector may be applied as a useful tool for pre-processing chagasic ECG signals.

Keywords ECG quality assessment · Chagas disease · Deep learning

W. L. Caldas (B) · J. P. do Vale Madeiro · R. C. Pedrosa · J. P. P. Gomes
Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
e-mail: [email protected]

J. P. do Vale Madeiro
e-mail: [email protected]

R. C. Pedrosa
e-mail: [email protected]

J. P. P. Gomes · W. Du
Institute of Data Engineering and Sciences, University of Saint Joseph, Macau SAR, China
e-mail: [email protected]

J. P. P. Gomes · J. A. L. Marques
Laboratory of Applied Neurosciences, University of Saint Joseph, Macau SAR, China
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_8

W. L. Caldas et al.

1 Introduction

American trypanosomiasis, also referred to as Chagas disease, affects more than 21 countries, most commonly in Latin America (Brazil and Mexico) [1, 2], but also countries in Europe, Asia (Japan) and Oceania (Australia and New Zealand) [3–5]. Around 6–7 million people are infected [6]. Patients normally become infected when they come into contact with the feces of Trypanosoma cruzi vectors, insects belonging to the Triatominae subfamily. The infection has two main phases: acute and chronic. The acute phase lasts for about two months after infection, but no symptoms are present in almost all cases [7, 8]. In contrast, according to the World Health Organization (WHO), during the chronic phase 30% of chagasic patients develop heart arrhythmias (typically ventricular tachycardia and ventricular fibrillation) or progressive heart failure, which can lead to sudden cardiac death (SCD) [9–11].

Monitoring the heart continuously is one of the most effective ways to prevent the progression of heart diseases [12, 13]. Many approaches to preventing SCD also benefit from continuous monitoring [14–18]. Unfortunately, the majority of existing ECG analysis systems are designed to handle relatively noise-free ECG signals. Furthermore, ECG signals from continuous monitoring are often corrupted by various noises and artifacts, such as Baseline Wandering (BW), AC Interference (ACI), muscle artifacts, unplugged electrodes and others, making it nearly impossible to perform a morphological and RR interval analysis [19].

Different solutions have been presented to avoid noisy ECG segments. SQA (Signal Quality Assessment) uses fiducial points, maxima/minima amplitudes and statistical features to verify the quality of an ECG time series, and low-quality signals are rejected [20, 21]. Other methodologies extract morphological features to classify or cluster ECG segments into several kinds of noise as well as non-ECG signals. Besides that, a Deep Neural Network was successfully applied to classify good/bad ECG signals in a way similar to SQA methods [22].

Despite the promising results of these methodologies, none of them was developed specifically on a database of chagasic patients, which poses different challenges. These patients have their own peculiarities, and traditional SQA methods as well as other noise detection approaches can end up discarding signals that contain useful information or inaccurately classifying noisy signals as acceptable. To solve this issue, the main objective of this work is to develop a deep learning CNN architecture that identifies acceptable chagasic ECG signals among three types of degraded signals (wandering baseline, AC interference and non-ECG signals). Another very common type of noise, muscle artifacts (MA), is left out of this work for now, as it is more volatile and difficult to identify [23]. We emphasize that, to our knowledge, this is the first deep neural architecture for noise signal classification built specifically on a database of chagasic patients.

Noise Detection and Classification in Chagasic ECG Signals …

The remainder of this paper is organized as follows. Section 2 reviews the related work in the literature. Section 3.1 presents the methodology and describes the training/test datasets used in this work. Section 3.2 presents the proposed approach based on a 1D convolutional neural network. Section 4 describes the experiments and the results obtained with the proposed labels. Finally, Sect. 5 presents our conclusions and comments on future work.

2 Related Work

Recently, modern tools, techniques, and algorithms have been developed to evaluate the quality of electrocardiograms (ECG). Some of these techniques belong to signal quality assessment (SQA), a preprocessing step that discards low-quality pieces of signal in order to reduce false alarms. SQA-based techniques include amplitude-and-slope criteria, threshold methods, Inverse Fourier Transformation, spectrum-based matrix, and multi-feature fusion methods [21].

One of the most straightforward ways to verify the quality of an ECG relies on examining the QRS complexes (the most significant wave in a heartbeat). In [20], the average QRS index computes the distance of each QRS complex to the average QRS length. Unfortunately, if most samples are corrupted, the average QRS size will be affected. A more robust SQA method was proposed in [24], which consists of extracting four signal quality indexes (SQIs): QRS wave power spectrum distribution (pSQI), kurtosis (kSQI), relative baseline power (basSQI), and R peak detection match (qSQI). The original work proposes that a fuzzy system weights the feature set by [0.4, 0.4, 0.1, 0.1] to form an outcome between 0 and 1. The main drawback of this approach is the need to detect the R-peaks (the most significant peak of a heartbeat), whose precision can be affected by noise.

Other approaches rely on morphological features, that is, descriptions of a signal based on its basic structure and shape. The premise is that different structures are more common in certain types of noise. The work in [25] proposed a set of 5 features to cluster noisy and noiseless signals: Standard Deviation (STD), Zero Crossing Rate (ZCR), Peaks Rate (PR), Peak Distance (PD), and the Amplitude Difference (AD) between successive maxima and minima. Although initially proposed for clustering, this feature set could easily be adapted to classify signals.
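To make two of the simpler features listed above concrete, the following is a minimal sketch of STD and ZCR for one segment. The function names, the mean-removal step before counting crossings, the 250 Hz sampling rate and the synthetic signals are assumptions made for illustration; they are not taken from [25].

```python
import math
import random
import statistics


def std_feature(segment):
    """Population standard deviation of the samples (STD)."""
    return statistics.pstdev(segment)


def zero_crossing_rate(segment):
    """Fraction of consecutive sample pairs whose signs differ (ZCR).

    The mean is removed first (an assumption of this sketch) so that a
    constant offset does not hide the crossings.
    """
    mean = statistics.fmean(segment)
    centered = [x - mean for x in segment]
    crossings = sum(
        1 for a, b in zip(centered, centered[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(centered) - 1)


# Synthetic 10-s signals at a hypothetical 250 Hz sampling rate.
fs = 250
t = [n / fs for n in range(10 * fs)]
slow = [math.sin(2 * math.pi * 1.0 * x) for x in t]    # smooth 1 Hz wave
rng = random.Random(0)
noisy = [s + rng.gauss(0.0, 1.0) for s in slow]        # same wave plus noise

print(round(std_feature(slow), 3), round(zero_crossing_rate(slow), 4))
print(round(std_feature(noisy), 3), round(zero_crossing_rate(noisy), 4))
```

In this toy case both features grow when broadband noise is added, which is the kind of separation a clustering or classification step could exploit.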
Similar to [24], computing the PR/PD features may be infeasible if the noise level is too high. More recently, Deep Neural Networks have been used to classify good/bad quality heartbeats. The authors in [22] proposed an anomaly detection algorithm based on an Autoencoder, an artificial neural network architecture developed to learn efficient representations of unlabeled data. First, they trained the neural network using only good-quality examples. After training, the Autoencoder reconstructs clean ECG signals more accurately than noisy examples; the difference between the reconstructed and the raw ECG was used to distinguish the two classes. The

W. L. Caldas et al.

disadvantage of this strategy is that it cannot be adapted to multi-class classification in order to identify multiple types of noise. Each of the techniques above has its drawbacks. To address those limitations, we propose in this work a neural network able to classify the ECG signal into 4 types (Acceptable ECG, Wandering Baseline ECG, AC Interference ECG and Non-ECG).

3 Materials and Methods

3.1 Dataset and Pre-processing

Clinical and laboratory data were collected from 314 patients at the University Hospital Clementino Fraga Filho (HUCFF) of the Federal University of Rio de Janeiro over 26 years, between 1990 and 2016. About 160 patients had two or more 24-h records. Thus, the database contains 550 samples, of which 232 are from male and 318 from female patients. The protocol used to obtain the database was approved by the Ethics Committee HUCFF-UFRJ, which waived the need for written consent under 45360915.1.1001.5262, in accordance with the current standards of the National Research Ethics Committee (Conep) and the principles described in the Declaration of Helsinki.

The scope of this work is limited to the classification of acceptable ECG time series, ECG with Wandering Baseline (WB), ECG with AC Interference (ACI) and non-ECG time series. We adopted the following definitions:

1. Acceptable ECG: clean segments with minor or imperceptible noise that does not interfere with feature extraction or segmentation techniques.
2. Non-ECG: clearly non-ECG records caused by an unplugged electrode or equipment malfunction.
3. ECG with Wandering Baseline (WB): noisy ECG segments with a strong presence of Wandering Baseline (WB).
4. ECG with AC Interference (ACI): noisy ECG segments with a strong presence of AC Interference (ACI).

To compose our dataset, we first selected the most recent record of each patient, giving a total of 314 unique patients. Secondly, we removed 114 signals due to the low quality of ECG extraction (i.e., the entire record was compromised and we could not extract any useful information). After that, we randomly extracted 30 slices of 10 s (five minutes overall) from the last four hours of each of the 200 remaining signals, forming a total of 6000 10-s segments that were pre-classified into four categories by a group of non-clinical specialists. These categories were: Acceptable ECG time series, Non-ECG time series, ECG with the presence of Wandering Baseline, and ECG with the presence of AC Interference. Finally, 100 segments of each category were verified by clinical specialists, composing a dataset of 400 segments. Figure 1 shows some examples of each category.
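The slicing step described above can be sketched as follows. This is a hedged illustration, not the authors' code: the sampling rate `FS`, the helper name `sample_segment_starts`, and the choice to draw non-overlapping slices are assumptions, since the text only states that 30 random 10-s slices were taken from the last four hours of each record.

```python
import random

FS = 250                    # hypothetical sampling rate in Hz (assumption)
SEGMENT_SECONDS = 10
SEGMENTS_PER_RECORD = 30
WINDOW_SECONDS = 4 * 3600   # last four hours of the record


def sample_segment_starts(record_length, rng):
    """Return sorted start indices of 30 non-overlapping 10-s slices.

    Slices are drawn from a grid of 10-s slots covering the last four
    hours, which guarantees that no two slices overlap (an assumption).
    """
    seg_len = SEGMENT_SECONDS * FS
    window_start = max(0, record_length - WINDOW_SECONDS * FS)
    n_slots = (record_length - window_start) // seg_len
    slots = rng.sample(range(n_slots), SEGMENTS_PER_RECORD)
    return sorted(window_start + s * seg_len for s in slots)


rng = random.Random(42)
record_length = 24 * 3600 * FS           # one 24-h record, in samples
starts = sample_segment_starts(record_length, rng)

print(len(starts), len(starts) * SEGMENT_SECONDS / 60)  # 30 slices, 5.0 min
print(200 * len(starts))                                # 6000 segments
```

The arithmetic matches the text: 30 slices of 10 s give five minutes per record, and 200 records give 6000 segments.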

Fig. 1 Different kinds of noise, which are common in ECG signals

3.2 1D CNN Design

Many factors are involved in designing a deep neural network architecture, including performance metrics, loss functions, optimization algorithms, and hyperparameter settings [26]. Several architectures have been proposed to solve computer vision problems on 2D images. These architectures vary in how many units they have and how the units are connected. Common hyperparameters are the number of convolutional, fully connected and pooling layers, the learning rate, the batch size, the optimization algorithm, and the activation functions. Network architectures such as LeNet, AlexNet, VGG, Inception and ResNet have achieved excellent results in classifying 2D images

Table 1 A proposed 1D convolutional neural network (CNN) with the parameters for each layer

Layers       Parameters                 Activation
Conv1D       Filter = 256/kernel = 5    Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Conv1D       Filter = 128/kernel = 5    Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Conv1D       Filter = 128/kernel = 5    Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Conv1D       Filter = 64/kernel = 5     Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Conv1D       Filter = 64/kernel = 5     Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Conv1D       Filter = 32/kernel = 5     Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Conv1D       Filter = 32/kernel = 5     Relu
Maxpooling   Pool size = 2
Dropout      Rate = 0.1
Flatten
Dense        32                         Relu
Dense        4                          Softmax

among several fields [27]. Unfortunately, as far as we know, there are few reported network architectures for one-dimensional signals. For this reason, we decided to develop one from scratch. The proposed architecture was developed using a grid search procedure to find the best hyperparameters based on 10-fold cross-validation. According to Table 1, the best configuration has 7 convolutional blocks, 2 fully-connected layers, and one softmax layer as the output prediction. Each convolutional block is composed of a convolutional layer with the Rectified Linear Unit (ReLU) activation function, followed by a Maxpooling layer and then a Dropout layer. Convolutional layers contain a set of filters whose parameters are learned throughout the training process. The Maxpooling layers take the maximum value in each patch of each feature map, downsampling the feature maps while highlighting the most prominent feature in each patch. Finally, the Dropout layer has the effect of shrinking

Fig. 2 The proposed 1D convolutional neural network (CNN) architecture

the weights of the previous layers, which helps prevent overfitting. An overview of the final architecture can be seen in Fig. 2.
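As a rough check on the layer stack of Table 1, the sketch below traces how the signal length shrinks through the seven blocks and counts trainable parameters. It assumes stride-1 convolutions with no padding, non-overlapping pooling, a single input channel and a 10-s input at a hypothetical 250 Hz; none of these details are stated in the text, so the printed numbers are illustrative only.

```python
FILTERS = [256, 128, 128, 64, 64, 32, 32]   # Conv1D filters from Table 1
KERNEL = 5                                  # kernel size from Table 1
POOL = 2                                    # pool size from Table 1


def trace_shapes(input_length):
    """Return (length, channels) after each convolutional block.

    Assumes stride-1 convolutions without padding and non-overlapping
    pooling, which Table 1 does not specify.
    """
    shapes = []
    length = input_length
    for filters in FILTERS:
        length = length - KERNEL + 1      # Conv1D without padding
        length = length // POOL           # Maxpooling (Dropout keeps shape)
        if length < 1:
            raise ValueError("input too short for seven blocks")
        shapes.append((length, filters))
    return shapes


def count_parameters(input_length, channels=1, hidden=32, classes=4):
    """Trainable weights and biases under the same assumptions."""
    total = 0
    in_ch = channels
    for filters in FILTERS:
        total += (KERNEL * in_ch + 1) * filters
        in_ch = filters
    last_len, last_ch = trace_shapes(input_length)[-1]
    flat = last_len * last_ch             # size after Flatten
    total += (flat + 1) * hidden          # Dense 32, ReLU
    total += (hidden + 1) * classes       # Dense 4, softmax
    return total


shapes = trace_shapes(2500)               # 10 s at a hypothetical 250 Hz
for i, (length, ch) in enumerate(shapes, start=1):
    print(f"block {i}: length={length}, channels={ch}")
print("flattened size:", shapes[-1][0] * shapes[-1][1])
print("parameters:", count_parameters(2500))
```

Under these assumptions the seven pooling stages reduce 2500 samples to 15 time steps, so the flattened vector fed to the dense layers is small compared with the input.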

4 Experimental Results

The literature proposes two types of approaches to evaluating ECG quality: a binary assessment that rejects low-quality ECG, and a multi-class assessment that not only identifies low-quality ECG but also determines which type of noise is present so that noise reduction can be applied. Accordingly, we conducted two experiments to compare the proposed approach with the state-of-the-art literature: a binary classification of acceptable/not-acceptable ECG and a multi-class classification of Acceptable ECG, Non-ECG, ECG with Wandering Baseline, and ECG with AC Interference.

In the first experiment, we performed a binary classification to separate acceptable from not-acceptable ECG time series. To form the not-acceptable set, we grouped the labeled examples of Non-ECG, Wandering Baseline, and AC Interference into one class. We randomly selected two thirds of the instances for training and one third for testing, ensuring that time-series segments from the same ECG exam appear in only one of the two sets. The proposed work uses a 1D CNN for feature extraction, and by design neural networks do not need an external algorithm for classification. For the other methods, we chose the well-known Support Vector Machine (SVM) as the learning model, with hyperparameters tuned using 5-fold cross-validation in a grid-search procedure. For the hyperparameter search space, we chose $C \in \{2^{-1}, 2^{-1}, 2^{-1}, \ldots, 2^{12}, 2^{15}\}$ and

Table 2 Comparison of all methods for binary classification experiment

Method          Accuracy          F1-score          Precision         Recall
Proposed        0.9000 ± 0.0283   0.8993 ± 0.0289   0.8998 ± 0.0285   0.9047 ± 0.0265
[25] and [24]   0.8647 ± 0.0445   0.8638 ± 0.0459   0.8727 ± 0.0442   0.8643 ± 0.0448
[24]            0.6126 ± 0.0470   0.6081 ± 0.0460   0.6204 ± 0.0490   0.6104 ± 0.0467
[25]            0.7297 ± 0.0723   0.7305 ± 0.0700   0.7490 ± 0.0706   0.7294 ± 0.0719

$\gamma \in \{2^{-15}, 2^{-13}, 2^{-11}, \ldots, 2^{1}, 2^{13}\}$ for the Gaussian kernel, and $C \in \{2^{-1}, 2^{-1}, 2^{-1}, \ldots, 2^{12}, 2^{15}\}$ for the linear kernel. We repeated the experiment 10 times.

Where necessary, we adapted each approach to promote a fair comparison. The original work [24] combined the different SQIs with the weight array (0.4, 0.4, 0.1, 0.1). Instead of using these values, we used the four SQI indexes as input features of an SVM classifier. The method in [25] used the morphological feature set to cluster noise-free and noisy ECG segments. Here we use the same feature set in an SVM classifier. Since [25] and [24] both use ad-hoc features extracted from ECG segments, we also combined both feature sets in a single classifier. The approach in [22] was originally proposed to extract 10 s of an ECG segment, divide it into 2.5-s sections, and then extract 0.8-s fragments positioned around the largest peak. The original idea was to classify approximately a single heartbeat. Since all other methodologies are based on 10-s segments, we used a plain autoencoder network on 10-s inputs, with the number of neurons in the latent representation (64, 128, 256, 512) tuned in the training phase.

It is interesting to note that, according to Table 2, all methods achieve good performance except [22]. The 10-s ECG was not well reconstructed, leading to poor results; its authors found similar results, which supports the assumption that this approach is suitable only for single heartbeats and not for medium-length ECG time series. Another interesting fact is that [25] and [24] achieved a good accuracy of 0.8998 when combined, demonstrating that the combination outperformed each of the two methods evaluated in isolation.
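The constraint that segments from the same ECG exam must not appear in both training and test sets can be sketched as a split at the exam level. The helper name `split_by_exam`, the exam identifiers and the toy data are hypothetical, and splitting two thirds of the exams (rather than exactly two thirds of the segments) is an assumption of this sketch.

```python
import random


def split_by_exam(exam_ids, train_fraction=2 / 3, seed=0):
    """Split segment indices so that each exam lands in exactly one set."""
    exams = sorted(set(exam_ids))
    rng = random.Random(seed)
    rng.shuffle(exams)
    n_train = round(len(exams) * train_fraction)
    train_exams = set(exams[:n_train])
    train = [i for i, e in enumerate(exam_ids) if e in train_exams]
    test = [i for i, e in enumerate(exam_ids) if e not in train_exams]
    return train, test


# Toy data: 12 hypothetical exams with 2 labeled segments each.
exam_ids = [f"exam{k:02d}" for k in range(12) for _ in range(2)]
train_idx, test_idx = split_by_exam(exam_ids)

train_exams = {exam_ids[i] for i in train_idx}
test_exams = {exam_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))
print(train_exams.isdisjoint(test_exams))
```

Repeating the split with different seeds would give the kind of repeated runs over which a mean and standard deviation can be reported.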
Finally, the proposed technique outperformed all compared approaches; as expected, the supervised feature extraction of a 1D CNN achieved better results than the unsupervised feature extraction performed by the autoencoder. One hypothesis for why the CNN outperformed [24, 25] is that the peak detection needed in both methods can be negatively affected by very noisy ECG examples.

The possibility of reducing or removing the noise present in ECG signals makes it interesting to have methodologies that separate ECG from non-ECG/noisy ECG and also detect which kind of noise is present. With this in mind, we set up the second experiment to identify different types of noise (Wandering Baseline and AC Interference) alongside noiseless ECG and Non-ECG signals. We used the same training methodology as in the previous experiment but with four classes (Acceptable ECG, Non-ECG, ECG with Wandering Baseline, ECG with AC Interference). Unfortunately, it was not possible to compare our results with [22], since this methodology was designed only for binary classification. Table 3 summarizes the results for 10 runs of the experiments.

Table 3 Comparison of all methods for multi-class experiment

Method          Accuracy          F1-score          Precision         Recall
[22]            0.6078 ± 0.0397   0.6045 ± 0.0204   0.6002 ± 0.0183   0.6100 ± 0.0931
Proposed        0.9424 ± 0.0129   0.9224 ± 0.0206   0.9250 ± 0.0405   0.9252 ± 0.0061
[25] and [24]   0.8998 ± 0.0409   0.8703 ± 0.0526   0.8635 ± 0.0515   0.8832 ± 0.0580
[24]            0.7575 ± 0.0214   0.5571 ± 0.0195   0.6861 ± 0.0809   0.5616 ± 0.0161
[25]            0.8325 ± 0.0298   0.7692 ± 0.0402   0.7903 ± 0.0484   0.7649 ± 0.0467

Table 4 Confusion matrix for multi classification problem (each column sums to 100, the number of labeled segments per class)

(a) [24]
             Acceptable   Non-ECG   ACI   WB
Acceptable   47           20        7     29
Non-ECG      1            78        0     6
ACI          0            0         58    3
WB           52           2         35    62

(b) [25]
             Acceptable   Non-ECG   ACI   WB
Acceptable   78           13        2     11
Non-ECG      14           78        4     0
ACI          1            5         77    30
WB           7            4         17    59

(c) [25] and [24]
             Acceptable   Non-ECG   ACI   WB
Acceptable   92           10        3     7
Non-ECG      2            88        2     1
ACI          1            1         79    5
WB           5            1         16    87

(d) Proposed work
             Acceptable   Non-ECG   ACI   WB
Acceptable   89           6         0     6
Non-ECG      7            91        0     0
ACI          1            0         95    9
WB           3            3         5     85

According to Table 3, accuracy decreases for all methodologies; this is expected when the binary classification is reformulated as a multi-class problem. Despite this, the results of experiment two are similar to those of experiment one. Table 4 shows the confusion matrices of all methods in the second experiment; here it is possible to verify that [24] did not distinguish well between acceptable ECG and Wandering Baseline ECG segments (Table 4a). A possible explanation is that in WB segments the peak values vary with the power level of the signal, which can mislead features such as the kurtosis of the signal. In contrast, [25] was not able to correctly identify AC Interference ECG and Wandering Baseline ECG segments (Table 4b), probably because morphological characteristics (P-QRS-T complex) are more susceptible to high-frequency noise. The hypothesis that combining [24, 25] could improve the results was again confirmed: Table 4c shows the improvement in distinguishing ACI from WB and WB from Acceptable ECG. Still, the proposed work again outperformed [24, 25], reaching 0.9 ± 0.0283 of accuracy against 0.8647 ± 0.0445 presented by

Table 5 Comparison of the matched samples concerning QRS complexes from the 4-h records and QRS complexes from the labeled set

Method          Absolute number of signals   Percent
Unfiltered      150                          75
Proposed work   200                          100
[24, 25]        183                          91.5
[24]            171                          85.5
[25]            176                          88

[24] + [25]. This supports our assumption that a neural network architecture developed specifically for a Chagas disease database can exploit better features than generic approaches.

We also present a complementary analysis to illustrate the effect of noise rejection on the whole database. According to [28], an electrocardiogram can be used as a biometric identifier. Even though heartbeats vary, differences larger than 10 ms are not expected for the QRS complex (the most significant wave in a heartbeat) [29]. Based on that fact, we applied the Pan-Tompkins algorithm to delineate the QRS segments of each 4-h ECG record. Next, we assume that the QRS duration of a single ECG record without noise can be modeled as a Gaussian random variable (GRV) $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ with mean $\mu_i$ and variance $\sigma_i^2$, where $i$ is the signal index. We also assume that the full ECG Holter contains pieces of signal corrupted by noise, and we model the corrupted portion as a GRV $C_i \sim \mathcal{N}(\mu_{c_i}, \sigma_{c_i}^2)$. Consequently, each 4-h ECG Holter with index $i$ can be written as $X_i + C_i$. After that, we applied the Pan-Tompkins algorithm to delineate the QRS complexes of the 300 labeled 10-s segments (the 100 non-ECG segments were excluded), where each signal has approximately 10–20 labeled seconds (10–15 heartbeats per signal). Here, we assume a GRV $Y_i \sim \mathcal{N}(\mu_{l_i}, \sigma_{l_i}^2)$, with mean $\mu_{l_i}$ and variance $\sigma_{l_i}^2$, to estimate the QRS durations of each group of labeled samples. For segments with baseline wandering we applied a simple high-pass filter [30], and for segments with AC interference we applied a notch filter with a 60 Hz cut-off frequency. Subsequently, we compared the QRS Gaussian random variable of each full 4-h record against the respective QRS durations extracted from the labeled data set.
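The text does not specify the design of the 60 Hz notch filter. As one possible illustration, the sketch below uses a standard second-order IIR (biquad) notch; the quality factor `q = 30`, the 250 Hz sampling rate and the synthetic signals (a 1 Hz sinusoid standing in for the ECG) are assumptions of this example, not values from the paper.

```python
import math


def notch_filter(signal, fs, f0=60.0, q=30.0):
    """Second-order IIR notch (biquad) centred on f0 Hz."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b0, b1, b2 = 1.0, -2.0 * math.cos(w0), 1.0
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        # Direct-form difference equation, normalised by a0.
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def rms(values):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


fs = 250  # hypothetical sampling rate in Hz
t = [n / fs for n in range(10 * fs)]
clean = [math.sin(2 * math.pi * 1.0 * x) for x in t]          # stand-in signal
mains = [0.5 * math.sin(2 * math.pi * 60.0 * x) for x in t]   # AC interference
noisy = [c + m for c, m in zip(clean, mains)]
filtered = notch_filter(noisy, fs)

half = len(t) // 2  # skip the start-up transient of the filter
err_before = rms([a - b for a, b in zip(noisy[half:], clean[half:])])
err_after = rms([a - b for a, b in zip(filtered[half:], clean[half:])])
print(err_before, err_after)
```

A narrow notch such as this removes the 60 Hz component while leaving low-frequency content almost untouched, at the cost of a short start-up transient.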
Our hypothesis is that, since each labeled sample of the signal with index $i$ was originally extracted from the full 4-h ECG Holter, if the error $C_i$ attributed to noise is insignificant, both random variables $X_i$ and $Y_i$ will follow the same distribution. Therefore, to compare the two distributions, we performed the Kolmogorov-Smirnov (KS) test. Under the KS test, the null hypothesis (the two samples come from the same distribution) cannot be rejected if the p-value is greater than 0.05; in that case we treat the two random variables as sampled from the same distribution. Table 5 summarizes the result of the KS test for each of the 4-h signals against their respective labeled cleaned segments in 5 categories: without any method

to reject noise segments; rejecting segments according to [25]; according to [24]; according to [25] + [24]; and according to the proposed method. A baseline of 75% of matched samples (both the 4-h record and the labeled set are accepted as coming from the same distribution) can be observed, which is expected since the signals do not necessarily contain noise. In addition, the number of matched samples improved after rejecting noisy segments with every approach, supporting our hypothesis that reducing the noise of each signal $i$ brings the variables $X_i$ and $Y_i$ closer. Another important fact is that the proposed method achieves 100% of matched samples, an improvement of 33% over the baseline. Based on these results, we conjecture that the proposed CNN architecture is a valuable option for noise detection/rejection in ECG records of chagasic patients.
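The matching criterion used in this analysis can be sketched with a small two-sample KS test. This is an illustrative implementation under stated assumptions: the p-value uses the common asymptotic Kolmogorov series with a small-sample correction rather than an exact computation, and the QRS durations (mean 100 ms, standard deviation 5 ms, sample sizes) are synthetic values invented for the example.

```python
import math
import random


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        # Largest gap between the two empirical distribution functions.
        d = max(d, abs(i / n - j / m))
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return d, 1.0  # the series is numerically 1 in this range
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


rng = random.Random(1)
full_record = [rng.gauss(100.0, 5.0) for _ in range(2000)]   # QRS durations, ms
clean_labeled = [rng.gauss(100.0, 5.0) for _ in range(150)]  # same distribution
shifted = [rng.gauss(110.0, 5.0) for _ in range(150)]        # distorted by noise

d_same, p_same = ks_two_sample(full_record, clean_labeled)
d_diff, p_diff = ks_two_sample(full_record, shifted)
print(f"same distribution: D={d_same:.3f}, p={p_same:.3f}")
print(f"shifted by 10 ms:  D={d_diff:.3f}, p={p_diff:.3g}")
```

A record would count as "matched" in the sense of Table 5 when the p-value exceeds 0.05; the shifted sample in this toy case is rejected.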

5 Conclusion

Continuous cardiac monitoring (CCM) is a relevant field in medicine and is extremely necessary in Chagas disease. Current approaches have many drawbacks, such as noise sensitivity or poor generalization on Chagas data. To overcome those issues, we proposed a 1D neural network architecture, designed specifically on a Chagas database, to classify four types of ECG signals: acceptable ECG time series, ECG with Wandering Baseline (WB), ECG with AC Interference (ACI), and non-ECG time series. Based on the results of computational experiments, we verified that the proposed architecture achieved better accuracy than other well-known techniques for noise classification. Additionally, we proposed a set of labels for noise classification in an ECG database of chagasic patients. Future work may include the identification of muscle artifacts, among other types of noise, and the use of transfer learning to improve the current results.

Acknowledgements This work was supported by Coordination for the Improvement of Higher Education Personnel (CAPES), Brazilian Research Council, CNPq (Grant n. 426002/2016-4), and Ceara State Foundation for the Support of Scientific and Technological Development (BP3-013900284.01.00/18 and PS1-0186-00439.01.00/21).

References

1. WHO, Chagas disease (American trypanosomiasis), https://www.who.int/health-topics/chagas-disease (2021)
2. J. Borges-Pereira, J.R. Coura, P.L. Zauza, C. Pirmez, S.S. Xavier, Chagas disease in Virgem da Lapa, Minas Gerais, Brazil: left ventricle aneurysm and the risk of death in the 24-year interval, Memórias do Instituto Oswaldo Cruz, vol. 115 (2020)
3. K. Imai, K. Misawa, M. Osa, N. Tarumoto, J. Sakai, K. Mikita, Y. Sayama, Y. Fujikura, A. Kawana, T. Murakami et al., Chagas disease: a report of 17 suspected cases in Japan, 2012–2017. Tropical Med. Health 47(1), 1–5 (2019)
4. N. Klein, I. Hurwitz, R. Durvasula, Globalization of Chagas disease: a growing concern in nonendemic countries, Epidemiology Research International, vol. 2012 (2012)
5. K.C.F. Lidani, F.A. Andrade, L. Bavia, F.S. Damasceno, M.H. Beltrame, I.J. Messias-Reason, T.L. Sandri, Chagas disease: from discovery to a worldwide health problem. Front. Publ. Health 7, 166 (2019)
6. L.E.V. Silva, H.T. Moreira, M.M.M. Bernardo, A. Schmidt, M.M.D. Romano, H.C. Salgado, R. Fazan Jr., R. Tinós, J.A. Marin-Neto, Prediction of echocardiographic parameters in Chagas disease using heart rate variability and machine learning. Biomed. Signal Process. Control 67, 102513 (2021)
7. P.-J. Li, T. Jin, D.-H. Luo, T. Shen, D.-M. Mai, W.-H. Hu, H.-Y. Mo, Effect of prolonged radiotherapy treatment time on survival outcomes after intensity-modulated radiation therapy in nasopharyngeal carcinoma. PloS One 10(10), e0141332 (2015)
8. L. Capuani, A.L. Bierrenbach, A. Pereira Alencar, A. Mendrone, J.E. Ferreira, B. Custer, A.L.P. Ribeiro, E. Cerdeira Sabino, Mortality among blood donors seropositive and seronegative for Chagas disease (1996–2000) in São Paulo, Brazil: a death certificate linkage study, PLoS Neglected Tropical Diseases, vol. 11, no. 5, p. e0005542 (2017)
9. M.C.P. Nunes, A. Beaton, H. Acquatella, C. Bern, A.F. Bolger, L.E. Echeverria, W.O. Dutra, J. Gascon, C.A. Morillo, J. Oliveira-Filho et al., Chagas cardiomyopathy: an update of current clinical knowledge and management: a scientific statement from the American Heart Association. Circulation 138(12), e169–e209 (2018)
10. J.A. Marin-Neto, E. Cunha-Neto, B.C. Maciel, M.V. Simões, Pathogenesis of chronic Chagas heart disease. Circulation 115(9), 1109–1123 (2007)
11. M.V. Simões, M.M.D. Romano, A. Schmidt, K.S.M. Martins, J.A. Marin-Neto, Chagas disease cardiomyopathy. Int. J. Cardiovascular Sci. 31, 173–189 (2018)
12. U. Satija, B. Ramkumar, M.S. Manikandan, A new automated signal quality-aware ECG beat classification method for unsupervised ECG diagnosis environments. IEEE Sens. J. 19(1), 277–286 (2018)
13. X. Liu, Y. Zheng, M.W. Phyu, B. Zhao, M. Je, X. Yuan, Multiple functional ECG signal is processing for wearable applications of long-term cardiac monitoring. IEEE Trans. Biomed. Eng. 58(2), 380–389 (2010)
14. A. Rassi Jr., A. Rassi, W.C. Little, S.S. Xavier, S.G. Rassi, A.G. Rassi, G.G. Rassi, A. Hasslocher-Moreno, A.S. Sousa, M.I. Scanavacca, Development and validation of a risk score for predicting death in Chagas' heart disease. New England J. Med. 355(8), 799–808 (2006)
15. A.C.J. de Souza, G. Salles, A.M. Hasslocher-Moreno, A.S. de Sousa, P.E.A.A. do Brasil, R.M. Saraiva, S.S. Xavier, Development of a risk score to predict sudden death in patients with Chaga's heart disease. Int. J. Cardiol. 187, 700–704 (2015)
16. A.C. Alberto, G.A. Limeira, R.C. Pedrosa, V. Zarzoso, J. Nadal, ECG-based predictors of sudden cardiac death in Chagas' disease. Comput. Cardiol. (CinC). IEEE, 1–4 (2017)
17. A.C. Alberto, R.C. Pedrosa, V. Zarzoso, J. Nadal, Association between circadian Holter ECG changes and sudden cardiac death in patients with Chagas heart disease. Physiol. Meas. 41(2), 025006 (2020)
18. P.E. Primo, W.L. Caldas, G.S. Almeida, L.P. Brasil, C.H. Cavalcante, J.P. Madeiro, D.G. Gomes, R.C. Pedrosa, Auxílio ao diagnóstico para predição de morte súbita em pacientes chagásicos a partir de dados clínicos: uma abordagem baseada em aprendizagem de máquina, in Anais do XXI Simpósio Brasileiro de Computação Aplicada à Saúde (SBC, 2021), pp. 335–345
19. U. Satija, B. Ramkumar, M.S. Manikandan, A review of signal processing techniques for electrocardiogram signal quality assessment. IEEE Rev. Biomed. Eng. 11, 36–52 (2018)
20. D. Makowski, T. Pham, Z.J. Lau, J.C. Brammer, F. Lespinasse, H. Pham, C. Schölzel, S. Chen, Neurokit2: a Python toolbox for neurophysiological signal processing. Beh. Res. Methods 53(4), 1689–1696 (2021)
21. J. Xie, L. Peng, L. Wei, Y. Gong, F. Zuo, J. Wang, C. Yin, Y. Li, A signal quality assessment-based ECG waveform delineation method used for wearable monitoring systems. Med. Biol. Eng. Comput. 59(10), 2073–2084 (2021)
22. J. Garus, M. Pabian, M. Wisniewski, B. Sniezynski, Electrocardiogram quality assessment with autoencoder, in International Conference on Computational Science (Springer, 2021), pp. 693–706
23. X. Chen, X. Xu, A. Liu, S. Lee, X. Chen, X. Zhang, M.J. McKeown, Z.J. Wang, Removal of muscle artifacts from the EEG: a review and recommendations. IEEE Sens. J. 19(14), 5353–5368 (2019)
24. Z. Zhao, Y. Zhang, SQI quality evaluation mechanism of single-lead ECG signal based on simple heuristic fusion and fuzzy comprehensive evaluation. Frontiers Physiol. 9, 727 (2018)
25. J. Rodrigues, D. Belo, H. Gamboa, Noise detection on ECG based on agglomerative clustering of morphological features. Comput. Biol. Med. 87, 322–334 (2017)
26. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015)
27. C.-H. Hsieh, Y.-S. Li, B.-J. Hwang, C.-H. Hsiao, Detection of atrial fibrillation using 1D convolutional neural network. Sensors 20(7), 2136 (2020)
28. M. Ingale, R. Cordeiro, S. Thentu, Y. Park, N. Karimian, ECG biometric authentication: a comparative analysis. IEEE Access 8, 117853–117866 (2020)
29. V. Vančura, D. Wichterle, I. Ulč, J. Šmíd, M. Brabec, M. Zárybnická, R. Rokyta, The variability of automated QRS duration measurement. Europace 19(4), 636–643 (2017)
30. L. Sörnmo, P. Laguna, Bioelectrical Signal Processing in Cardiac and Neurological Applications (Academic Press, 2005), vol. 8

Based on the Analysis of Interrelation Between Parallel Distributed Computer System and Network

Tingrui Yang

Abstract As the number of Internet users grows, research on parallel and distributed computer systems has become a main research topic. This paper analyzes the time-sharing system and its connection to the network. Starting from the processors inside a computer, it extends the analysis to distributed networks, and it combines the principles of parallel computer systems with the analysis of distributed network systems.

Keywords Parallel distributed computer system · Network connection · Architecture analysis

1 Introduction

Computer network technology is developing rapidly. It incorporates a great deal of computer theory and has developed through constant technological innovation. The fundamental meaning of distribution is parallelism, and realizing parallelism requires distribution as the basis of control; computer research has therefore always attached great importance to distributed parallel computing. The current network system should thus be fully used to share information, so that the structure of service networks tends toward parallelism and develops into a significant area of computer network system research.

T. Yang (B) School of Software, Henan University, Kaifeng, China e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_9

T. Yang

2 Combined with the Interconnection of Network Resources, the Time-Sharing System of Computer Is Analyzed

Time-sharing operation is embodied in the time-sharing computer, whose main operating principle is time-sharing control. Users obtain information processing services from a single CPU that is shared among them. If the CPU is fast enough relative to the speed at which users operate, each user has the impression of having a computer of their own. This resembles time division in network communication, where time-division control splits one circuit into many channels. The same principle applies to a time-sharing operating system: time-sharing control logically turns one physical computer into many virtual computers, so that the system can deliver the required information to many users, even at the same time. Such a system not only lets a limited number of computers serve multiple users, but also promotes the full use of computer resources. In a computer network, the time-sharing function of the computer system therefore provides the technical support for resource sharing among many users and allows information to be accessed by multiple users at once. Time-sharing and remote terminal functions can be applied in various computer systems. An important requirement is to implement time-sharing control for a large number of users in a multi-task operating system, because no network resource can become a single user's own; this is the benefit of applying a time-sharing operating system.
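The time-sharing principle described above can be illustrated with a small round-robin simulation, in which one CPU serves several users in fixed time slices. The function name `round_robin`, the user names and the job lengths are hypothetical, and round-robin is only one simple scheduling policy chosen for this sketch.

```python
from collections import deque


def round_robin(jobs, quantum):
    """Simulate one CPU shared by several users in fixed time slices.

    jobs maps a user name to the CPU time that user's task needs.
    Returns the sequence of (user, slice_length) grants and the time
    at which each user's task finishes.
    """
    queue = deque(jobs.items())
    clock = 0
    timeline = []
    finished = {}
    while queue:
        user, remaining = queue.popleft()
        used = min(quantum, remaining)
        clock += used
        timeline.append((user, used))
        if remaining > used:
            queue.append((user, remaining - used))  # back of the queue
        else:
            finished[user] = clock
    return timeline, finished


timeline, finished = round_robin({"alice": 5, "bob": 2, "carol": 4}, quantum=2)
print(timeline)
print(finished)
```

Because every user receives a slice at short intervals, each one sees steady progress, which is the impression of "having their own computer" that the text describes.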

3 Analyze the Processor Inside the Computer, Extending to Distributed Network Analysis

3.1 Analysis of Distributed Treatment

Computer applications make use of distributed processing. It relies on a large number of processors and divides the information processing of an application among them, so that multiple tasks can be completed in less time. These processors can be of the same type or of different types. Usually, a task is divided into different kinds of small tasks, which different kinds of processors or computers cooperate to complete. Besides this kind of division, a task can also be split into phases, with different processors or computers carrying out the phases in a certain order. The aim of all these practices is to make the whole system as effective as possible in practice. Distributed processing has a complex structure. Analyzing the processors inside the computer shows how development has moved from the single machine toward the multi-machine system of the computer network, in which the distribution and the environment are more complex. At the core of distributed processing inside the computer is the intelligent peripheral controller, which contains a dedicated processor whose main role is to assist the main CPU and so continually reduce the burden of peripheral control (Fig. 1).

Fig. 1 Distributed architecture diagram

The main CPU is a computing tool that processes user tasks at high speed, whereas the design of the processor in the intelligent peripheral controller focuses on control. The relationship between the main CPU and the controller processor is relatively complex: they are not connected directly but communicate through interrupts, DMA and similar mechanisms. On this basis, the tasks submitted by users can be completed with high quality. This arrangement is in effect a form of network communication between processors. However, the internal distributed structure is not one of equals: the CPU is the master and the other processors are subordinate to it. Usually a controller processor handles one peripheral access task and passes the result to the CPU when processing is complete. Distributed processing across multiple processors is thus a fine-grained division of labor unified by the system, in which the main CPU concentrates on user tasks and can therefore serve more users and resources. Applying the time-sharing structure effectively and combining it fully with the distributed processing structure promotes the development of computer networks [1].
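The division of one task into subtasks that are processed cooperatively and then combined can be sketched as follows. Threads stand in for the separate processors here, and the names `split`, `worker` and `distributed_sum_of_squares`, as well as the sum-of-squares workload, are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Divide data into n_parts nearly equal contiguous chunks."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for k in range(n_parts):
        end = start + size + (1 if k < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def worker(chunk):
    """Subtask handled by one subordinate processor: sum of squares."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, n_workers=4):
    """Coordinator: hand out chunks, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, split(data, n_workers)))
    return sum(partials)


data = list(range(1, 101))
print(distributed_sum_of_squares(data))  # 338350
```

The coordinator plays the role of the main CPU: it only distributes work and merges results, while the workers carry out the subtasks.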


T. Yang

3.2 Peripheral Multi-machine System

Next we analyze the peripheral multi-machine system, whose key idea is the word "peripheral". It uses one main computer and, according to the required configuration, sets up a group of computers on its periphery that work together with the host. It mainly applies distributed processing and is a kind of multi-machine system. In such a system each computer has its own character: it runs its own operating system and manages its own information and control separately. Each computer is complete in form and has its own functions, so it can serve as a general-purpose computer or be used independently for specialized applications that exploit its particular features. The machines in the system are divided into primary and secondary, with the peripheral computers assisting the main computer. They are usually placed close together, at distances ranging from a few meters to some tens of meters, and communication between the main computer and the peripheral machines uses high-speed multi-bit parallel transmission. A peripheral multi-machine system differs from a computer network system: it cannot connect to the Internet at any time to communicate, yet, like the multiple processors inside one machine, its computers divide work among themselves in a distributed manner. It is therefore an intermediate structure in the transition from the computer system to the computer network system [2].

3.3 Relationship Between Distribution Processing and Network

In the early stage of computer network design, the goal was to share resources through the network. As autonomous computer systems developed, it was found that once they were connected they could also be used for distributed processing: a group of computers, even if physically dispersed, can divide tasks among themselves and process them jointly as long as they can communicate with one another. The capacity for distribution in a computer network system is qualitatively different from that of other systems, because network nodes can be spread very widely. This has in turn driven the development of distributed processing: computer nodes can sit in any department and can even be spread across the world, so that departments and units scattered irregularly in geographical terms can be connected effectively through information processing. Here the computer network is used to divide work among distributed computers so that they jointly process and serve the various tasks people submit. Distributed processing of this nature is social distributed processing. Many modern computer networks now carry comprehensive information systems. Multi-machine systems, by comparison, mainly use a computer system to complete certain prescribed tasks, and their distribution is relatively limited, confined to the framework of the computer itself. They are also often disconnected from other units and departments. Metaphorically speaking, they are islands in terms of geographical location and closed systems in terms of computing, which explains the difference between computer systems and networks.

4 Analysis of Computer System Parallelism, Extending to Distributed Network Systems

4.1 Analysis of Computer Space Structure

Parallel and distributed computer systems continue to advance and are becoming more efficient at information processing. Research shows that there is a limit to how far the processing speed of a single CPU can be raised, so work should focus on computer structure as the means of speeding up information processing. Structures that use space to shorten time usually fall into three classes. The first uses multiple functional units for arithmetic operations, such as multipliers and adders, placed side by side so that they can work at the same time; the operations of different instructions run in parallel, which is parallelism at the instruction level. The second is the pipeline, in which a functional unit processes instructions at a fixed tempo so that multiple instructions are in process at the same time. The third is the array machine: it contains many processors arranged as an array under a single control unit that issues instructions to all of them, so every processor executes the same instruction at the same time. Although the array elements hold different data, the same operation is applied to all of them [3].

4.2 Computer Structure Combined with the Role of Processor Implementation

The array structure described above uses single-instruction, multiple-data control. In this kind of computer structure the distributed processors work in parallel efficiently, but they all follow the same instructions together, so the structure suits tasks with a high degree of parallelism. For such tasks it fully exploits the parallelism of the hardware and completes the information processing in less time. For relatively small tasks this approach is not appropriate, so further analysis is needed in order to arrive at a distributed computer application system. A distributed computer system is still mainly a processing system and in practice still operates in parallel, so it can be called a distributed parallel computer system.

Fig. 2 Parallel distributed flat architecture

In the distributed form, the system controls the overall operation and drives the individual distributed processors. Each processor controls itself and can operate on its own: the overall task is divided among the processors and each handles its share independently, which is what makes the system distributed. With multiple instruction streams, each processor runs its own program on its own data to complete the corresponding operation, so that computing capacity is brought into play at different levels. This structure is the multiple-program, multiple-data form. Fig. 2 shows the parallel distributed flat architecture.

4.3 Structure Analysis of Distributed Computer System

Distributed computer systems take two structural forms. The first is shared memory, in which multiple processors are arranged symmetrically around a common memory. This makes it easy to exchange information and is convenient for the user. It is practical when the number of processors is small, but as the number of processors grows it shows its limits: memory access slows down, so the whole process takes longer and processing slows down [4]. The second is distributed memory, which is relatively large in scale and is a parallel processing structure. It is built mainly from high-performance microprocessors closely linked by a dedicated network; each processor is also a complete unit that can operate independently and has enough memory of its own. Because the processors make little use of common memory, speed is only slightly affected.

It should be emphasized that a parallel computer system with multiple processors focuses on parallel operation and makes numerical calculation more efficient, whereas a distributed processing system emphasizes dividing tasks so that machines work together and in some special cases may not use parallel operation at all. Because a parallel system uses many processors, its communication must be fast, and the processors are linked by an interconnection network. A distributed processing system, by contrast, uses fewer processors, so its communication may be relatively slow.

Over a long period of development, the overall structure of distributed computer systems has achieved a breakthrough: the machines are gathered into one large framework, so that using the system feels just like using a single computer. The user perceives only one computer. Once an instruction is entered the program runs, but the user does not know which computer is actually doing the work. The operating system makes the assignment on the user's behalf, selecting the best-matched computer to run the program according to the instruction and delivering the result to the right place once it is available. All of this is done autonomously by the computer system.

A general computer network is significantly different. Users need to know on which computer a program will run, and when logging in they must specify the location of the computer to be used. Only then can the network be used properly: the program is transmitted to that computer, and the result is finally produced, according to the user's instructions, by the computer that was designated.

4.4 The Role of Computer Systems in the Form of Network Distribution

A network-distributed computer system can also be built on a network platform from conventional computers in one area. It is designed around the networked computers: a distributed operating system is installed on each of them, which makes the system easy to operate, and a distributed-system interface is added so that tasks submitted by network users are received promptly. Using the network's communication functions, the system keeps track of the state of each computer's CPU and of any other resource constraints. It divides a user task into several parts, schedules their execution sensibly according to how busy each CPU is, assembles the partial results, and passes the processed result back to the user. A network-distributed computer system is a high-speed computing system in the form of a local area network, but it differs from the original form of local area network: its main function is parallel computing, so that tasks take less time to process, whereas the main purpose of the original form is the sharing of resources rather than distributed processing. In a network-distributed computer system, users do not need to care where the computation takes place, and a global view of the system is added that assigns tasks in a consistent, centralized way. In an ordinary local area network, by contrast, the user still needs to know the connection point and the specific location; the computer systems inside it are separate individuals that work independently, and the network serves mainly to coordinate them.
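The idea of dividing a user task into parts and placing each part according to the working condition of the CPUs can be illustrated with a small greedy dispatcher. This is a sketch under assumptions, not the scheduling policy of any particular distributed operating system: the cost of each sub-task and the current load of each CPU are assumed to be known numbers, and the name `dispatch` is hypothetical.

```python
import heapq


def dispatch(subtasks, cpu_loads):
    """Assign each (name, cost) sub-task to the currently least-loaded CPU.

    `cpu_loads` maps a CPU name to its current load; both the loads and the
    costs are illustrative values in the same arbitrary unit.
    """
    heap = [(load, cpu) for cpu, load in cpu_loads.items()]
    heapq.heapify(heap)
    plan = {cpu: [] for cpu in cpu_loads}
    # Place the most expensive sub-tasks first so the loads even out.
    for name, cost in sorted(subtasks, key=lambda t: -t[1]):
        load, cpu = heapq.heappop(heap)
        plan[cpu].append(name)
        heapq.heappush(heap, (load + cost, cpu))
    return plan
```

With sub-tasks `[("a", 5), ("b", 3), ("c", 2)]` and loads `{"cpu0": 0, "cpu1": 4}`, the idle `cpu0` receives the largest part first and the remaining parts follow whichever CPU is lighter at that moment.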

5 Conclusion

The analysis shows that parallel and distributed modes are used mainly so that the network can process information quickly. Parallelism is what speeds up the computation itself, while network distribution can be understood as many computers cooperating so that the system as a whole runs fast. As these systems advance, network systems continue to progress with them.

References

1. Z. Qihua, in Architecture Design and Performance Analysis of Distributed Machine Learning System for Data Center Network (Nanjing University of Posts and Telecommunications, 2021)
2. W.A.N. Zhiyu, Research on Parallel Strategy of Distributed Training in Deep Learning (Xidian University, 2021)
3. S. Zhenlong, L. Xiaofang, X.I.E Xuchao, Comp. Eng. Sci. 42(10), 1765–1773 (2020)
4. X. Congcong, K. Feng, Modern Elect. Tech. 43(15), 143–147 (2020)

Improvement of DGA Long Tail Problem Based on Transfer Learning

Baoyu Fan, Yue Liu, and Laurie Cuthbert

Abstract As the number of classes increases in traditional multi-class classification and recognition tasks, a long tail problem often appears: the sample data is concentrated in a few classes. In the detection of malware-generated domain names (DGA, domain generation algorithm), the number of DGA classes keeps increasing because of the variability of malware, and the class distribution shows a long tail. However, previous DGA detection research focused on the classes with a large amount of data and did not address the long tail characteristics. We propose an effective knowledge transfer DGA detection model that transfers the knowledge learned in one stage of training to the next stage and reduces the impact of the long tail problem on the detection model. To preserve the continuity of the model, we propose a data balanced review method, which alleviates the catastrophic forgetting problem of transfer learning and detects new classes without retraining the whole model. The macro average F1 score of our model is 76.6%, which is 8.74 percentage points higher than ATT_BiLSTM and 6.34 percentage points higher than ATT_CNN_BiLSTM. Our model thus mitigates the long tail problem and better predicts all classes.

Keywords Transfer learning · DGA · Data balanced review · Deep learning · Long tail problem

B. Fan · Y. Liu · L. Cuthbert
Faculty of Applied Sciences, Macao Polytechnic University, Macao, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_10

1 Introduction

With the popularity of big data and artificial intelligence, machine learning has been applied to intelligent classification tasks in many fields, for example computer vision recognition [11], speech recognition [20] and recommendation systems [10]. However, many of these areas also suffer from unbalanced samples. Sorting the classes by sample frequency from high to low often reveals a "long tail" in the data distribution: this is the long tail effect, as shown in Fig. 1. The data is divided into "head" and "tail" categories, with a large amount of data in the head and a small amount in the tail. In a multi-class classification task, the classifier tends to perform well on categories with more samples and poorly on categories with fewer samples. Simply feeding samples with unbalanced categories to the model is therefore not a good approach: the model learns better from the major classes because it sees far more samples of head classes than of tail classes.

Fig. 1 Long tail problem of knowledge transfer optimization. We evenly divide the classes into Head, Tail 1, Tail 2 and Tail 3, and use transfer learning to continuously migrate the head knowledge F0 to the tail F1, F2 and F3. We use data balanced review to alleviate data imbalance and catastrophic forgetting at each stage. The figure shows more and more classes in each stage from head to tail, while the color of the classes carried over from earlier stages becomes lighter and lighter, indicating that the number of samples of each such class decreases in the next stage. F3 is the resultant task

Long-tail data arises because specific categories of data are difficult to collect. DGA (Domain Generation Algorithm) detection in intrusion detection is one example: there are many kinds of DGA domains, and they are updated very quickly. Malware threat actors use a command and control (C2) environment to spread and manage attacks. In complex attacks, threat actors usually use a DGA to cycle the network location from which the malware communicates with the C2. Network security controls (such as blacklists, DNS-based controls, or firewall rules) are critical to an organization's security posture. However, these methods are usually ineffective against DGAs, so dedicated DGA detection is also necessary.


2 Related Work

2.1 Long Tail

The earliest attempts to solve the long tail problem use resampling, either oversampling the classes with few samples [13] or undersampling the classes with many samples [5]. However, oversampling easily overfits the minor classes, fails to learn robust, generalizable features, and often performs worse on very unbalanced data, while undersampling discards much of the information in the major classes and leads to underfitting. Some approaches synthesize samples instead. SMOTE [2], for example, takes an arbitrarily selected minority sample, uses k-nearest neighbors to select similar samples, and obtains new samples by linear interpolation. Another common method is re-weighting [16], which assigns different weights to different categories (or even to different samples). It has many variants, including the simplest weighting by the reciprocal of the number of samples in each category [7], weighting by the number of "effective" samples [3], and loss weighting by sample count to optimize the classification margin [1]. In recent years, transfer learning has been used to solve the long tail problem in the image field and in recommendation systems [11, 21].
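Two of the ideas surveyed above, reciprocal-frequency re-weighting and synthetic samples obtained by linear interpolation towards a near neighbor, can be sketched in a few lines. This is a simplified illustration in the spirit of those methods and not a reimplementation of the cited papers: the function names are hypothetical, samples are assumed to be plain lists of floats, and neighbors are found by brute-force Euclidean distance.

```python
import math
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by the reciprocal of its sample count."""
    counts = Counter(labels)
    return {cls: 1.0 / n for cls, n in counts.items()}


def k_nearest(sample, others, k):
    """Return the k samples in `others` closest to `sample`."""
    return sorted(others, key=lambda o: math.dist(sample, o))[:k]


def synthesize(minority, k=2, rng=random):
    """Create one synthetic minority sample by linear interpolation
    between a randomly chosen sample and one of its k nearest neighbors."""
    base = rng.choice(minority)
    others = [m for m in minority if m is not base]
    neighbour = rng.choice(k_nearest(base, others, k))
    lam = rng.random()  # position on the segment base -> neighbour
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]
```

Because the new point always lies on the segment between two existing minority samples, it stays inside the region the minority class already occupies, which is the property the interpolation step relies on.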

2.2 DGA Detection

In the field of DGA detection, the focus has been on using traditional algorithms or fusing algorithms to improve detection accuracy. The authors of [9] compared CNNs with many other neural-network-based architectures and concluded that a combination of a Convolutional Neural Network (CNN) and a Long Short Term Memory (LSTM) model provides good accuracy. The models were trained and evaluated on a large number of domains. The performance of Multilayer Perceptron (MLP) algorithms was also compared with that of a Random Forest (RF) model using the same features, and 98% accuracy was achieved with the deep learning algorithms. A transfer learning technique was proposed in [15] that combines a CNN with a machine learning algorithm such as a Naive Bayes classifier for detecting and classifying DGA-generated domains. However, all these approaches share a common problem: accuracy on small-sample categories is relatively low and classification of the latest DGAs is inaccurate.


3 Methodology

3.1 Long Tail

The long tail problem refers to a data set with a long tail class distribution. After the categories are sorted in descending order of sample count, a small number of classes have a large number of samples (the head), while most of the remaining classes have only a few samples (the tail). Deep learning is used to train a deep neural network model from such a long-tailed data set. Let

\[ \{x_i, y_i\}_{i=1}^{n} \tag{1} \]

be the long-tailed training set, where $x_i$ is a sample and $y_i$ is the class label of that sample. The training set contains $K$ classes, and its total number of samples is

\[ n = \sum_{k=1}^{K} n_k \tag{2} \]

where $n_k$ denotes the number of samples in class $k$. Let $\pi$ denote the vector of label frequencies, where $\pi_k = n_k / n$ is the label frequency of class $k$. In long tail learning it is generally assumed that the classes have been sorted in descending order of sample count, that is, $n_1 > n_K$, and the imbalance ratio is defined as $n_1 / n_K$. The challenges of this task are: (i) because of the imbalance in data volume between head and tail classes, the deep learning model is biased towards the head classes and performs poorly on the tail classes; (ii) the shortage of tail samples makes the classification task challenging. In addition to this classification task, the long tail problem also occurs in visual recognition tasks such as image classification [8, 11], detection [4, 17] and segmentation [6, 18, 19].
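As a small worked illustration of these definitions, the label frequencies and the imbalance ratio can be computed directly from a list of labels. The function name `long_tail_stats` and the toy labels are assumptions made for this sketch only.

```python
from collections import Counter


def long_tail_stats(labels):
    """Return (n, label frequencies, imbalance ratio) for a label list.

    Classes are ordered by descending sample count, so the first entry is
    the largest class and the last entry is the smallest class.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    ordered = counts.most_common()  # [(class, count), ...] in descending order
    frequencies = {cls: n_k / n for cls, n_k in ordered}
    imbalance_ratio = ordered[0][1] / ordered[-1][1]
    return n, frequencies, imbalance_ratio
```

For six samples of class "a", three of class "b" and one of class "c", the total is 10, the label frequency of "a" is 0.6 and the imbalance ratio is 6.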

3.2 Data Pre-processing

The main purpose of data preprocessing is to alleviate the data set imbalance caused by the long tail problem. Dividing the data set into multiple stages makes the categories within each stage relatively balanced. The division proceeds as follows: (i) sort all classes in descending order of sample count to obtain a sorted data set W, as shown in Fig. 2; (ii) divide W into mutually exclusive sub-datasets Wi, i = 0, 1, 2, ..., K. The first n classes, which account for about 80% of the samples in the data set, form W0, and the remaining classes are divided evenly, so that every sub-dataset Wi contains the same number of classes; (iii) the result is the data set W = {W0, W1, W2, ..., WK}.
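The division criteria can be sketched as a short routine. It is an illustration under assumptions rather than the authors' code: the name `split_into_stages` is hypothetical, class sizes are given as a dictionary of counts, and each tail stage is given as many classes as the head stage, which matches the eight-classes-per-stage choice described in Sect. 3.3.

```python
def split_into_stages(class_counts, head_share=0.8):
    """Split classes into a head stage W0 and equally sized tail stages.

    `class_counts` maps a class name to its number of samples. The head
    takes the largest classes until they cover `head_share` of all samples.
    """
    ordered = sorted(class_counts, key=class_counts.get, reverse=True)
    total = sum(class_counts.values())
    head, covered = [], 0
    for cls in ordered:
        head.append(cls)
        covered += class_counts[cls]
        if covered >= head_share * total:
            break
    tail = ordered[len(head):]
    size = len(head)  # assumption: tail stages reuse the head's class count
    return [head] + [tail[i:i + size] for i in range(0, len(tail), size)]
```

With counts of 50, 32, 8, 5, 3 and 2 samples for classes "a" to "f", the first two classes cover 82% of the data and form the head, and the remaining four classes form two tail stages of two classes each.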


3.3 Transfer Learning

Following the data preprocessing, the experiment divides the prediction task into multiple stages, and each stage inherits the task of the previous stage. As shown in Fig. 1, if the overall task is F, the prediction tasks are F0, F1, F2, ..., FT, with FT as the resultant task. So that each stage can build on the knowledge of the previous one, transfer learning is used to carry the knowledge of the previous task over to the next task. As training moves towards the tail, there is less and less new data, and the old data is forgotten. This would lead to an incomplete FT prediction model with low prediction accuracy, so a data balanced review method is proposed to solve the problem; it is introduced in detail in Sect. 3.4.

Transfer learning requires a pre-trained model before knowledge can be transferred. Here the pre-trained model is a DGA prediction model trained separately. It is a basic DGA prediction model that performs better than the comparison model and combines a CNN with a BiLSTM (Bi-directional Long Short-Term Memory) and an added attention mechanism [12]. The model uses an embedding to preprocess the DGA domain name before feeding it into the deep learning model. Embeddings are popular in NLP (Natural Language Processing) [22], but a DGA domain name is not regular natural language, so character-level embedding is used. The deep learning model itself is the CNN + BiLSTM model with attention. At the output layer, softmax is used as the activation function to map the output values into the range 0–1.

In staged transfer learning, because 80% of the DGA data is concentrated in the first eight classes of the head, each stage consists of eight classes. The first stage F0 is used to train the base model. After training, the knowledge and weights carried by the model are transferred to the second stage F1, and then successively onward until the last stage FT. When FT is reached, there is therefore not only a complete DGA prediction model, but the model at stage FT is also a balanced one, which reduces the impact of the long-tail problem on the DGA prediction model.

This approach is also a form of continual learning: DGA data is updated in real time, and new classes appear over time. The knowledge of FT can be transferred directly to FT+1 to predict the new classes. Unlike a traditional model, it is not necessary to retrain on all the data; only the knowledge of FT needs to be transferred to the new classes. When predicting new classes, the model acts as a form of few-shot learning. For example, in DGA classes 23 to 31 each category has fewer than 100 samples, and because new families have existed only briefly and their samples are hard to collect, the number of samples is generally small. At stage FT+1, this method can use a small number of samples to produce a complete prediction model.
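The two ends of the model described in this section, character-level encoding of a domain name on the input side and softmax on the output side, can be illustrated without a deep learning framework. The vocabulary, the padding index 0, the maximum length and the function names below are all assumptions for this sketch; the paper does not specify them, and the CNN + BiLSTM + attention layers in between are not shown.

```python
import math
import string

# Hypothetical character vocabulary; index 0 is reserved for padding and
# for characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in
         enumerate(string.ascii_lowercase + string.digits + "-._")}


def encode_domain(domain, max_len=16):
    """Map a domain name to a fixed-length list of character indices,
    ready to be looked up in a character-level embedding table."""
    ids = [VOCAB.get(ch, 0) for ch in domain.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def softmax(logits):
    """Map raw output scores to values in the range 0-1 that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Encoding works on single characters rather than words because an algorithmically generated name such as a random letter string has no word-level structure to exploit.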


3.4 Data Balanced Review

The data is divided into multiple stages. The amount of data differs considerably between stages, but within a single stage the differences between classes are smaller. To obtain a complete prediction model, data from the previous stage must be carried into the next stage, which raises a problem: if all samples of all previous-stage classes are put into the next stage, the data becomes imbalanced again. For example, the first stage of the DGA data consists of classes 0–7, which contain 80% of the DGA data set; moving all stage 1 data into stage 2 would make the stage 2 data volume expand hugely. If, on the other hand, no data is inherited from the previous stage, catastrophic forgetting occurs: the classes of the previous stage are simply forgotten while training the next stage, which leads to low prediction accuracy. The approach here therefore reviews the earlier data and proposes a segmentation strategy called data balanced review. Its working principle is shown in Fig. 2. In the list below, count(the k-th class) is the number of samples in the k-th class.

• The first stage: all samples of classes 0–7, which form the data head.
• The second stage: all data of classes 8–15 + count(the 8th class) randomly sampled samples from each class of the first stage.
• The third stage: all data of classes 16–23 + count(the 16th class) randomly sampled samples from each class of the first stage + count(the 16th class) randomly sampled samples from each class of the second stage.
• The fourth stage: all data of classes 24–31 + count(the 24th class) randomly sampled samples from each class of the first stage + count(the 24th class) randomly sampled samples from each class of the second stage + count(the 24th class) randomly sampled samples from each class of the third stage.
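The staged sampling rule listed above can be sketched as follows. This is one reading of the rule and not the authors' implementation: it assumes that the review size for a stage is the number of samples in the first (largest) class of that stage, and the name `build_stage_data` and the data layout are hypothetical.

```python
import random


def build_stage_data(samples_by_class, stages, stage_idx, rng=random):
    """Assemble the training set for one stage.

    `samples_by_class` maps a class to its list of samples and `stages` is a
    list of class lists ordered from head to tail. The result contains all
    samples of the current stage plus a balanced review of earlier classes.
    """
    current = stages[stage_idx]
    data = [(x, cls) for cls in current for x in samples_by_class[cls]]
    if stage_idx == 0:
        return data
    # Assumed review size: the sample count of the stage's first class.
    review_size = len(samples_by_class[current[0]])
    for earlier in stages[:stage_idx]:
        for cls in earlier:
            pool = samples_by_class[cls]
            k = min(review_size, len(pool))
            data.extend((x, cls) for x in rng.sample(pool, k))
    return data
```

Capping every earlier class at the same review size keeps the head classes visible during later stages without letting them dominate the much smaller tail classes.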

Fig. 2 Data balanced review


Fig. 3 DGA long tail data

4 Experiments

4.1 Dataset

The data set in the experiment has two parts: whitelist domain names and DGA domain names. The whitelist comes from the top 1 million websites ranked by Alexa (footnote 1). Alexa ranks websites by page views and visitors, so these websites are in active operation and use and are not DGA domain names. The DGA domain names come from 360netlab (footnote 2), 1 million in total, covering 31 DGA families. The data set is constantly updated and new classes appear over time, so the DGA detection model needs to learn continuously. Combining Alexa with the 31 classes from 360netlab gives the data set shown in Fig. 3, which clearly has a typical long tail problem.

Footnote 1: Alexa, the web information company, http://www.alexa.com/.
Footnote 2: 360netlab, https://data.netlab.360.com/dga/.

4.2 Experimental Setup

To verify our model fairly, we use the hyperparameters selected for the comparison model, which are the best values found through repeated experiments, as shown in Table 1. Our experiment uses two basic models as comparison models: (i) Attention_BiLSTM [14] and (ii) Attention_CNN_BiLSTM [12]. To compare these models with ours, we use the well-known indicators accuracy, precision, recall, F1 score and macro average as the evaluation indicators for malicious domain name detection.

Table 1 Hyperparameters
Parameter        Setting
Epochs           10
Batch size       64
Learning rate    1e-4
Optimizer        Adam

\[ \mathrm{precision} = \frac{TP}{TP + FP} \tag{3} \]

\[ \mathrm{recall} = \frac{TP}{TP + FN} \tag{4} \]

\[ F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{5} \]

TP is the number of true positives, FP the number of false positives and FN the number of false negatives: a true positive is a positive prediction that is correct, and a false positive is a positive prediction that is incorrect. Precision is the proportion of examples classified as positive that are actually positive. Recall is a measure of coverage: the proportion of actual positive examples that are classified as positive. Recall is therefore the same as sensitivity. Precision and recall sometimes conflict, so they need to be considered together. F1, the harmonic mean of precision and recall, is the most commonly used combined measure; a high F1 indicates that the method is more effective. F1 is therefore used as the index of prediction results, and the confusion matrix is also used to compare the ordinary DGA prediction model with our model intuitively. Because DGA data has a long-tail problem, a weighted macro average (calculated by weighting the score of each class label by its number of true instances) is used when calculating the average.
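The per-class indicators and the two ways of averaging them can be made concrete with a short sketch. The function names are hypothetical and the code is a plain illustration of the definitions in Eqs. (3) to (5), not the evaluation script used in the experiments.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} for every class."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = (precision, recall, f1, tp + fn)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(s[2] for s in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of the per-class F1 values weighted by true instances."""
    total = sum(s[3] for s in scores.values())
    return sum(s[2] * s[3] for s in scores.values()) / total
```

The unweighted mean gives every class the same influence, so a tail class with an F1 of 0 pulls it down sharply, whereas the support-weighted mean is dominated by the large head classes.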


5 Experimental Results

We compared our Knowledge Transfer model with prediction classifiers that combine traditional models and have shown good DGA results in recent years, namely the Attention_BiLSTM model and the Attention_CNN_BiLSTM model. Table 2 compares them by precision, recall, F1 and macro average. As Table 2 shows, the amount of data used in the final training of our model is much less than that of the other two fusion models. Attention_BiLSTM and Attention_CNN_BiLSTM predict well for the head classes with a large amount of data, but their F1 for tail classes with little data is mostly below 0.5, and in a prediction model a classification is only valuable when it exceeds 0.5. Our Knowledge Transfer model shows good results on the tail classes, where its F1 values are generally much higher than those of the Attention_BiLSTM and Attention_CNN_BiLSTM models. For the class "matsnu" in particular, the F1 value of the Attention_BiLSTM and Attention_CNN_BiLSTM models is 0, while the Knowledge Transfer model classifies it with an F1 value of 0.9032. In addition, our Knowledge Transfer model raises the F1 value of multiple tail classes above 0.5. The macro average F1 of the Attention_BiLSTM model is 0.6786 and that of the Attention_CNN_BiLSTM model is 0.7026, while that of the Knowledge Transfer model is 0.7660, which is 0.0874 higher than the Attention_BiLSTM model and 0.0634 higher than the Attention_CNN_BiLSTM model. We interpret this result as showing that, compared with other traditional deep learning models, our Knowledge Transfer model predicts all classes better. To view and compare the models more intuitively, we use confusion matrices. Figure 4 shows the confusion matrix of the Attention_CNN_BiLSTM model, and Fig. 5 shows the confusion matrix of the Knowledge Transfer model. Rows and columns correspond to specific class names. The diagonal of the confusion matrix gives, for each class, the proportion of samples whose predicted value matches the actual value. The value range is [0, 1], and the closer the result is to 1, the darker the color of the block on the diagonal. The confusion matrices therefore show that the Knowledge Transfer model predicts all classes more accurately. For example, the false negative rate of the "matsnu" class in the Attention_CNN_BiLSTM model is as high as 1.0, meaning that the model never correctly predicts the "matsnu" class, whereas in the Knowledge Transfer model the "matsnu" class appears on the diagonal of the confusion matrix with a dark color. Because "pykspa_v2_real" is similar to "pykspa_v2_fake" and "pykspa_v2_fake" has a large amount of data, "pykspa_v2_real" is often misclassified as "pykspa_v2_fake". The characteristics of "tempedrev" and "pykspa_v2_fake" are also similar: both use "net, org, info, com" as suffixes and have domains of similar length. Therefore, this category is also easily confused with "pykspa_v2_fake", and our knowledge transfer model emphasizes the tail to learn the

148

B. Fan et al.

Table 2 Comparison of model results ATT_BiLSTM F1

P

R

F1

Knowledge transfer

P

Simda

0.9499 1.0000 0.9743 0.9866 0.9966 0.9916 2368

0.9231 1.0000 0.9600 72

Emotet

0.9997 1.0000 0.9999 0.9997 1.0000 0.9999 28533

0.9429 1.0000 0.9706 66

Rovnix

0.9990 0.9994 0.9992 0.9988 0.9994 0.9991 18091

1.0000 1.0000 1.0000 77

Pykspa_v1

0.9770 0.9993 0.9880 0.9891 0.9991 0.9941 4371

0.9600 0.9231 0.9412 78

Alexa

0.9907 0.9921 0.9914 0.9944 0.9929 0.9936 100119 0.6709 0.7361 0.7020 72

Gameover

1.0000 0.9968 0.9984 0.9904 0.9976 0.9940 1247

1.0000 1.0000 1.0000 75

Tinba

0.9768 0.9967 0.9867 0.9797 0.9990 0.9893 9393

0.6262 0.9571 0.7571 70

Banjori

0.9995 1.0000 0.9998 0.9997 1.0000 0.9998 45230

0.9101 1.0000 0.9529 81

Ramnit

0.8133 0.8229 0.8181 0.8283 0.8518 0.8399 1869

0.5044 0.6951 0.5846 82

Ranbyus

0.9086 0.8877 0.8980 0.8926 0.8917 0.8921 997

0.9000 0.7778 0.8344 81

Virut

0.6934 0.8004 0.7431 0.6824 0.8755 0.7670 972

0.9451 0.9885 0.9663 87

Murofet

0.8989 0.8934 0.8961 0.9245 0.8848 0.9042 816

0.8684 0.8919 0.8800 74

Necurs

0.9233 0.7353 0.8187 0.8973 0.7425 0.8126 835

0.8533 0.7356 0.7901 87

Symmi

0.9758 1.0000 0.9878 0.9780 1.0000 0.9889 444

1.0000 1.0000 1.0000 84

Shifu

0.9183 0.9365 0.9273 0.8711 0.9921 0.9276 252

0.9535 0.9880 0.9704 83

Suppobox

0.7527 0.2745 0.4023 0.7547 0.9529 0.8423 255

0.9506 1.0000 0.9747 77

Qadars

0.9839 0.9839 0.9839 0.9893 0.9946 0.9920 186

0.9487 1.0000 0.9737 74

Locky

0.7241 0.3925 0.5091 0.7872 0.3458 0.4805 107

0.7083 0.4474 0.5484 76

Cryptolocker

0.6154 0.2857 0.3902 0.6364 0.3125 0.4192 112

0.8136 0.6761 0.7385 71

Chinad

0.9717 0.9450 0.9581 0.9817 0.9817 0.9817 109

1.0000 1.0000 1.0000 69

Dyre

1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 110

1.0000 1.0000 1.0000 86

Vawtrak

0.9388 0.4894 0.6434 0.9423 0.5213 0.6712 94

0.9500 0.8382 0.8906 68

Pykspa_v2_fake 0.4800 0.4186 0.4472 0.4396 0.4651 0.4520 86

0.4583 0.7051 0.5556 78

Dircrypt

0.7619 0.2540 0.3810 0.8667 0.2063 0.3333 63

0.6071 0.5667 0.5862 60

Conficker

0.1818 0.0444 0.0714 0.2308 0.0667 0.1034 45

0.5600 0.2500 0.3457 56

Matsnu

0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 58

0.8889 0.9180 0.9032 61

Nymaim

0.4286 0.0698 0.1200 1.0000 0.0233 0.0455 43

0.3333 0.1346 0.1918 52

Fobber_v2

0.6154 0.2759 0.3810 0.7500 0.3103 0.4390 29

0.5405 0.8000 0.6452 25

Fobber_v1

1.0000 0.3333 0.5000 1.0000 0.3333 0.5000 24

0.7857 1.0000 0.8800 22

Tempedreve

0.0000 0.0000 0.0000 1.0000 0.0952 0.1739 21

0.0000 0.0000 0.0000 19

Pykspa_v2_real 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 22

0.0000 0.0000 0.0000 15

Padcrypt

1.0000 0.9375 0.9677 16

Macro-average

R

ATT_CNN_BiLSTM

DGA family

Support P

0.9474 0.8571 0.9000 0.9130 1.0000 0.9545 21 0.7633 0.6464

R

F1

Support

0.6786 0.8220 0.6823 0.7026 216922 0.7688 0.7802 0.7660 2094

knowledge of the head, so the characteristics of a large number of “pykspa_v2_fake” are more likely to be confused with “tempedrev”. But we can see that the long tail problem of DGA has been well optimized by using the Knowledge Transfer model.
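As a reading aid for Table 2, the sketch below recomputes F1 from precision and recall for three Knowledge transfer rows and then takes their unweighted mean. It assumes that the macro average is the plain mean of per-class scores (the usual definition, which the chapter does not spell out); the helper names `f1` and `macro_average` are illustrative only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean: every class counts equally, whatever its support."""
    values = list(values)
    return sum(values) / len(values)


# (precision, recall) of three Knowledge transfer rows taken from Table 2.
rows = {
    "Simda": (0.9231, 1.0000),
    "Matsnu": (0.8889, 0.9180),
    "Conficker": (0.5600, 0.2500),
}

per_class_f1 = {name: f1(p, r) for name, (p, r) in rows.items()}
macro_f1 = macro_average(per_class_f1.values())

for name, score in per_class_f1.items():
    print(f"{name}: F1 = {score:.4f}")
print(f"macro-average F1 over these three rows = {macro_f1:.4f}")
```

Because every class has the same weight, a tail class with a few dozen samples moves the macro average as much as a head class with tens of thousands, which is why this measure is sensitive to the long tail.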


Fig. 4 ATT_CNN_BiLSTM confusion matrices

Fig. 5 Knowledge transfer confusion matrices

6 Conclusions

To address the long tail problem of DGA, this study proposes an effective Knowledge Transfer model for multi-class DGA prediction. The model uses transfer learning to continuously transfer the knowledge learned by the head model to the new knowledge of the tail, so that knowledge learned in an earlier stage can also be used in later stages. To solve the problem of data imbalance in the training stage, an effective DGA data balance review method is also proposed. This method segments the data of the previous stage, randomly samples from it, and combines the samples with all the data of the current stage to form the training data of the current stage. It enables the later stages of training to still predict all classes of the previous stages, so that the Knowledge Transfer model can predict all classes. The experimental results show the effectiveness of the Knowledge Transfer model: it reaches a macro-average F1 score of 0.766, higher than that of the traditional deep learning models, and it predicts all classes better. For the long tail problem with a heavy head and a light tail, the Knowledge Transfer model is not limited to good prediction results on head classes. For tail classes with little data, it can also produce valuable classifications, which increases the number of categories that the DGA prediction model can classify. Moreover, our Knowledge Transfer model is persistent. When a new DGA class must be predicted, training on the old classes does not need to be repeated; the model can learn to predict the new DGA class with less training time, given balanced data and small data samples. It is therefore sustainable for DGA data sets that change in real time. In the future, we will improve the Knowledge Transfer model and focus on solving the catastrophic forgetting problem caused by multiple rounds of head data transfer. We are currently preparing to add distillation to the Knowledge Transfer model to reduce catastrophic forgetting, so that the model both retains the old classes and achieves good prediction results.
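The data balance review method summarized in the conclusions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_stage_data`, the `per_class` cap and the fixed seed are hypothetical, since the text only states that data from the previous stage are randomly sampled and combined with all data of the current stage.

```python
import random
from collections import defaultdict


def build_stage_data(previous, current, per_class, seed=0):
    """Combine a random per-class sample of earlier-stage data with all
    current-stage data. Both inputs are lists of (domain, label) pairs."""
    rng = random.Random(seed)

    # Group the earlier-stage samples by class label.
    by_label = defaultdict(list)
    for domain, label in previous:
        by_label[label].append((domain, label))

    # Review set: at most `per_class` randomly chosen samples per old class.
    review = []
    for label in sorted(by_label):
        items = by_label[label]
        review.extend(rng.sample(items, min(per_class, len(items))))

    # Training data of the current stage: review samples plus all new data.
    stage = review + list(current)
    rng.shuffle(stage)
    return stage


head = [(f"a{i}.com", "alexa") for i in range(100)]
head += [(f"e{i}.com", "emotet") for i in range(50)]
tail = [(f"m{i}.com", "matsnu") for i in range(5)]
stage_data = build_stage_data(head, tail, per_class=10, seed=1)
print(len(stage_data))
```

Capping the review sample per class keeps the head classes from dominating the later stage while still exposing the model to every earlier class.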


References

1. K. Cao, C. Wei, A. Gaidon, N. Arechiga, T. Ma, Learning imbalanced datasets with label-distribution-aware margin loss. arXiv:1906.07413 (2019)
2. N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
3. Y. Cui, M. Jia, T.Y. Lin, Y. Song, S. Belongie, Class-balanced loss based on effective number of samples, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
4. C. Feng, Y. Zhong, W. Huang, Exploring classification equilibrium in long-tailed object detection, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3417–3426 (2021)
5. H. He, E.A. Garcia, Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
6. R. He, J. Yang, X. Qi, Re-distributing biased pseudo labels for semi-supervised semantic segmentation: a baseline investigation, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6930–6940 (2021)
7. C. Huang, Y. Li, C.C. Loy, X. Tang, Deep imbalanced learning for face recognition and attribute prediction. IEEE Trans. Pattern Anal. Mach. Intell. 42(11), 2781–2794 (2019)
8. B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, Y. Kalantidis, Decoupling representation and classifier for long-tailed recognition. arXiv:1910.09217 (2019)
9. P. Karunakaran, Deep learning approach to DGA classification for effective cyber security. J. Ubiquitous Comput. Commun. Technol. (UCCT) 2(04), 203–213 (2020)
10. S. Liu, Y. Zheng, Long-tail session-based recommendation, in Fourteenth ACM Conference on Recommender Systems, pp. 509–514 (2020)
11. Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, S.X. Yu, Large-scale long-tailed recognition in an open world, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537–2546 (2019)
12. J. Namgung, S. Son, Y.S. Moon, Efficient deep learning models for DGA domain detection. Secur. Commun. Netw. 2021 (2021)
13. S. Pouyanfar, Y. Tao, A. Mohan, H. Tian, A.S. Kaseb, K. Gauen, R. Dailey, S. Aghajanzadeh, Y.H. Lu, S.C. Chen, et al., Dynamic sampling in convolutional neural networks for imbalanced data classification, in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (IEEE, 2018), pp. 112–117
14. Y. Qiao, B. Zhang, W. Zhang, A.K. Sangaiah, H. Wu, DGA domain name classification method based on long short-term memory with attention mechanism. Appl. Sci. 9(20), 4205 (2019)
15. R. Rajalakshmi, S. Ramraj, R.R. Kannan, Transfer learning approach for identification of malicious domain names, in International Symposium on Security in Computing and Communication (Springer, 2018), pp. 656–666
16. J. Ren, C. Yu, S. Sheng, X. Ma, H. Zhao, S. Yi, H. Li, Balanced meta-softmax for long-tailed visual recognition. arXiv:2007.10740 (2020)
17. J. Tan, C. Wang, B. Li, Q. Li, W. Ouyang, C. Yin, J. Yan, Equalization loss for long-tailed object recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11662–11671 (2020)
18. T. Wang, Y. Li, B. Kang, J. Li, J. Liew, S. Tang, S. Hoi, J. Feng, The devil is in classification: a simple framework for long-tail instance segmentation, in European Conference on Computer Vision (Springer, 2020), pp. 728–744
19. Z. Weng, M.G. Ogut, S. Limonchik, S. Yeung, Unsupervised discovery of the long-tail in instance segmentation using hierarchical self-supervision, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2603–2612 (2021)
20. G.I. Winata, G. Wang, C. Xiong, S. Hoi, Adapt-and-adjust: overcoming the long-tail problem of multilingual speech recognition. arXiv:2012.01687 (2020)
21. X. Yin, X. Yu, K. Sohn, X. Liu, M. Chandraker, Feature transfer learning for face recognition with under-represented data, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5704–5713 (2019)
22. M. Zago, M.G. Pérez, G.M. Pérez, UMUDGA: a dataset for profiling DGA-based botnet. Comput. Secur. 92, 101719 (2020)

A Phonetics and Semantics-Based Chinese Short Text Fusion Algorithm

Yuchao Jiang, Xinru Li, Chuying Huang, Wei Lu, and Minghe Xu

Abstract With the rapid development of the Internet, short texts ranging from a few words to dozens of words, such as chats, text messages and comments, have proliferated online. Extracting and analyzing such short texts relies on accurate text similarity calculation. Because short texts are ambiguous and their data are sparse, improving the accuracy of short text similarity calculation remains an important and challenging task. To address this problem, this paper presents an in-depth study of unsupervised Chinese short text similarity algorithms and proposes a fusion algorithm based on phonetics and semantics. The algorithm takes the sound, characters and word meaning of a short text as features, constructs a feature vector from each, and calculates a similarity for each vector; a fusion algorithm then combines these into a comprehensive similarity that integrates the phonetic, character and semantic features. The algorithm was verified experimentally on LCQMC, a Large-scale Chinese Question Matching Corpus, and its accuracy improved by up to 29.2% compared with traditional text similarity algorithms.

Y. Jiang
School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China

X. Li
Meituan Select, Beijing Sankuai Technology Co., Ltd., Beijing, China

C. Huang · W. Lu · M. Xu (B)
School of Aliyun Big Data Applications, Zhuhai College of Science and Technology, Zhuhai, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_11


Keywords Chinese short text · Unsupervised · Text similarity · Phonetics · Feature fusion

1 Introduction

With the vigorous development of China’s Internet technology, social applications that carry Chinese short text, such as WeChat and Weibo, have proliferated in recent years, and how to accurately extract and rationally use such texts and the information they contain has become a problem in the field of natural language processing. Before this type of text can be used and analyzed in depth, the original text must first be processed and its similarity calculated. Text similarity methods fall into two categories: unsupervised and supervised text similarity algorithms. The classic unsupervised text similarity algorithms based on n-grams [1] and bag-of-words (BoW) [2] can be used directly to calculate the similarity between texts, and both their construction cost and their computational complexity are low. Later, Devlin, J. et al. used a supervised text similarity algorithm combined with a neural network to calculate text similarity [3–7], showing that neural networks adapt well to models in different task scenarios. However, neural network models need large manually labeled corpora for construction and training, which often leads to poor generalization of the model and high complexity when performing similarity calculations. Therefore, unsupervised text similarity algorithms, which can use large amounts of data for feature engineering, are better suited to short text similarity calculation across diverse usage scenarios.

For any language, documentation and dissemination depend on characters. Thus, the classic unsupervised text similarity algorithms [8, 9] derive the similarity of two texts by comparing the similarity of their characters. This approach is simple and efficient, but because it considers only character features, inaccurate text segmentation of colloquial Chinese short text causes ambiguity in the understanding of the text content. Therefore, to improve the accuracy of text similarity calculation, the associations between different words across texts and their semantic information must first be understood correctly. To solve this problem, later scholars constructed Chinese semantic ontology knowledge bases such as HowNet [10] and Tongyici Cilin [11]. Through knowledge graphs [12], these describe the similarity relations between entities and concepts as well as their hypernym, hyponym and coordinate relations. Drawing on the knowledge and rules in structured or semi-structured data, they allow the original corpus to be labeled semi-automatically and text content to be understood effectively at the semantic level. However, compared with long written texts, short texts contain less valid information and a single content and topic. They therefore tend to be sparsely represented in vector space models and are more susceptible to noise. In [13], Xu et al. constructed an extension algorithm based on Tongyici Cilin, which significantly improved the accuracy of short text similarity calculation by optimizing the matching method in dictionaries. However, because short text is highly sparse and such algorithms depend heavily on the semantic knowledge base that has been built, they are not well suited to short text similarity calculation.

In [14], Doval et al. normalized nonstandard terms in English short texts, aiming to output as few normalization candidates as possible, which inspired us to use phonetic features to improve the accuracy of short text similarity calculation. We also found advantages in using word sounds as features for Chinese. First, phonetic features mitigate the effect of high sparsity on text similarity calculation: because the number of pinyin syllables is limited, the pinyin of the words built from them is limited as well. Second, phonetic features to some extent offset the insufficient training of word vectors for low-frequency words, although the number of word vectors characterized by sound is far smaller than the number characterized by characters. This also shows that phonetic features do not fully capture what character features express, so phonetic features are not suitable for use alone. In the similarity calculation, we therefore use the phonetic features of words as auxiliary features alongside characters and word meanings.

In this work, we found that providing more external features helps to improve the accuracy of short text similarity calculation: the more external features are used, the higher the accuracy, and the fewer are used, the lower the accuracy. This observation guided our approach. We process the short text data using three external features (characters, word sounds and word meanings), calculate the similarity of the two texts under each feature, assign a weight to the similarity of each feature, and use linear regression to obtain the fused text similarity. The contribution of this paper is twofold. First, we demonstrate the rationality of introducing word sounds as auxiliary features in Chinese short text similarity calculation. Second, we evaluate the algorithm on the large-scale Chinese question matching corpus LCQMC [15]. Through many experiments, we verify the superiority of the proposed method in short text similarity calculation.

2 Method

In this section, we describe the methods used in this paper from three aspects: feature construction, feature training, and feature fusion with similarity calculation. In the feature construction stage, we use Jieba, pypinyin and other tools to construct character, phonetic and word-meaning feature vectors, respectively. In the feature training stage, we train the model on the three features (character, word sound and word meaning) produced by feature construction and report the parameter settings used for training. In the feature fusion and similarity calculation stage, we describe in detail how we perform both feature fusion and similarity calculation. The architecture of our algorithm is shown in Fig. 1.

Fig. 1 Algorithm architecture


2.1 Background

Jieba
Text segmentation is the most basic and also the most important step in text preprocessing, and its quality directly affects the accuracy of short text similarity calculation [16]. Jieba is a Python-based Chinese word segmentation component that accurately divides text into phrases and uses dictionaries to label the resulting phrases automatically; it is arguably the best Chinese word segmentation component currently available in Python. Jieba is simple, effective and mature, has been widely used in natural language processing tasks, and has achieved good results. In addition, Jieba is well suited to Chinese text segmentation [17], so it was selected in this article to segment the original corpus in the dataset.

pypinyin
For colloquial short texts, the character similarity is much smaller than the phonetic similarity. In this case, converting the character features into pinyin features can effectively improve the accuracy of text similarity calculation. pypinyin is a Chinese-to-pinyin framework for Python that combines machine learning models with dictionary-based rules to match the most suitable pinyin according to phrases. The framework also supports the recognition of polyphonic characters and can be used for batch phonetic annotation (zhuyin) of Chinese characters, text sorting, text retrieval by pinyin and other scenarios. Therefore, this article uses the pypinyin framework to construct word vectors with phonetic characteristics.

TF-IDF (Term Frequency–Inverse Document Frequency)
TF-IDF is a statistical keyword model that evaluates the importance of a word in a sentence, article, or corpus. In a piece of text, some words characterize the whole sentence more strongly than others; such words are called the central words of the sentence. They should be given higher weight, while words with weaker representational ability should be given lower weight [18]. This paper uses the TF-IDF algorithm, which combines term frequency (TF) and inverse document frequency (IDF), to measure the characterization ability of each phrase and assign it a corresponding weight; the word vectors are then fused with these weights to obtain a better feature vector. The TF-IDF weight and the sentence vector are calculated as follows, where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$ and $|D|$ is the number of documents in the corpus:

$$\mathrm{TF\text{-}IDF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \cdot \log \frac{|D|}{1 + |\{ j : t_i \in d_j \}|} \tag{1}$$

$$\mathrm{vec}(d) = \sum_{t \in d} W_t * \mathrm{TF\text{-}IDF}(t, d) \tag{2}$$
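Equations (1) and (2) can be illustrated with the short sketch below. It is a minimal reading of the formulas rather than the authors' code: it assumes that $W_t$ denotes the word vector of term $t$ (the text does not define it), that the sum in Eq. (2) runs over the distinct terms of $d$, and that documents are given as token lists; the toy corpus and vectors are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Eq. (1): term frequency times smoothed inverse document frequency."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())
    # Number of documents in the corpus that contain the term.
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / (1 + df))


def sentence_vector(doc, corpus, word_vectors):
    """Eq. (2): sum of word vectors, each scaled by its TF-IDF weight."""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for term in set(doc):  # assumption: sum over distinct terms
        weight = tf_idf(term, doc, corpus)
        for k, value in enumerate(word_vectors[term]):
            vec[k] += weight * value
    return vec


# Toy corpus of tokenized documents and toy 2-dimensional word vectors.
corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"], ["d", "e"]]
word_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

weight_a = tf_idf("a", corpus[0], corpus)
vector = sentence_vector(corpus[0], corpus, word_vectors)
print(weight_a, vector)
```

Note that with the `1 +` smoothing in the denominator, a term that occurs in every document receives a slightly negative weight, because the logarithm's argument falls below 1.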


2.2 Feature Construction

Character Feature
Computers cannot process text data directly, so text must be converted into numerical data that computers can handle. Character features are fundamental features found in all text and were the first text features used in natural language processing [19]. The first step in constructing character features is text segmentation, which divides the original corpus into a series of meaningful phrases. The second step removes phrases that are unrelated to character features, which reduces data redundancy and feature dimensionality and thus improves computational efficiency. The third step labels each phrase with its part of speech (POS). POS labeling determines the grammatical category of each word in a text and annotates the word accordingly. POS annotations allow phrases to be divided in finer detail, laying the foundation for constructing the subsequent features. This article uses the Jieba Chinese word segmentation tool to segment the two texts, remove punctuation marks and annotate parts of speech, obtaining text $A^*$ and text $B^*$.

Semantics Feature
Based on TF-IDF, we propose a semantic feature sentence vector based on part of speech and word order. The aim is to integrate part of speech and word order into the sentence vector, so that the constructed vector carries part-of-speech and word-order characteristics along with keyword information. In modern texts, the specific characteristics of a text are usually carried by four types of words: nouns, verbs, adjectives and adverbs, which occupy an important position in syntactic relations [20]. This is especially true in short texts, where these four types of words determine the meaning of the text, while words of other parts of speech carry little practical meaning and introduce noise into the construction of sentence vectors. In the sentence vectors constructed for parts of speech, this article therefore assigns larger weights to these four parts of speech and smaller weights to the others. In addition, text $A^*$ and text $B^*$ undergo a feature extension based on Tongyici Cilin to obtain the expanded text $\tilde{A}$ and text $\tilde{B}$. The part-of-speech weights are listed in Table 1.


Table 1 POS (part-of-speech) weight

POS       Noun    Verb    Adjective    Adverb    Other
Weight    0.9     0.8     0.65         0.5       0.05
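The weights of Table 1 can be applied with a simple lookup, sketched below. The mapping from tagger output to the four categories is an assumption: the sketch keys on the first letter of ICTCLAS-style tags (`n`, `v`, `a`, `d`), which is the tag family that Jieba's part-of-speech tagger emits, so that for example `vn` counts as a verb. The function name and the sample tagged phrase are hypothetical.

```python
# Table 1 weights, keyed by the first letter of an ICTCLAS-style POS tag
# (assumption: n = noun, v = verb, a = adjective, d = adverb).
POS_WEIGHTS = {"n": 0.9, "v": 0.8, "a": 0.65, "d": 0.5}
OTHER_WEIGHT = 0.05


def pos_weight(flag: str) -> float:
    """Map a POS tag to its Table 1 weight; unknown tags count as 'Other'."""
    return POS_WEIGHTS.get(flag[:1], OTHER_WEIGHT)


# Hypothetical tagged phrase: (word, tag) pairs as a POS tagger might emit.
tagged = [("天气", "n"), ("很", "d"), ("好", "a"), ("吗", "y")]
weights = [(word, pos_weight(flag)) for word, flag in tagged]
print(weights)
```

In a full pipeline, each weight would presumably scale the corresponding word's contribution to the sentence vector, alongside the TF-IDF weight of Sect. 2.1.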

Phonetics Feature
Converting Chinese characters to pinyin relies on the encoding of Chinese characters. GB2312 is a coded character set for Chinese characters. In GB2312, the included characters are partitioned into areas of 94 characters each. Each Chinese character is represented by a high byte and a low byte: the high byte is encoded in the range 0xA1 to 0xF7, and the low byte in the range 0xA1 to 0xFE. With these two bytes, every Chinese character is located in a two-dimensional space, and each coordinate in this space corresponds to one character, so a two-dimensional table that maps Chinese characters to their pinyin can be established. Because the number of pinyin syllables is small, the pinyin can be stored in a one-dimensional vector, with the two-dimensional array storing the index of the pinyin for each Chinese character. To convert a Chinese character into pinyin, its index is looked up in the array and the corresponding pinyin is read from the vector. This article uses the pypinyin framework to convert the feature-expanded text $\tilde{A}$ and text $\tilde{B}$ to pinyin text $A_p$ and pinyin text $B_p$.
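The table lookup described above can be made concrete with Python's built-in `gb2312` codec. The sketch below only illustrates the principle and is not how pypinyin works internally: the two-entry index and the pinyin vector are toy data invented for the example, and the function names are hypothetical.

```python
def gb2312_position(char: str):
    """Return the (row, column) of a Chinese character in the GB2312 grid."""
    high, low = char.encode("gb2312")  # two bytes for a Chinese character
    if not (0xA1 <= high <= 0xF7 and 0xA1 <= low <= 0xFE):
        raise ValueError(f"{char!r} is outside the GB2312 two-byte range")
    # Subtracting 0xA0 turns each byte into a 1-based grid coordinate.
    return high - 0xA0, low - 0xA0


# Toy data (hypothetical): a one-dimensional pinyin vector and a table that
# maps grid coordinates to an index into that vector.
PINYIN = ["zhong", "wen"]
INDEX = {(54, 48): 0, (46, 36): 1}


def to_pinyin(text: str):
    """Convert each character to pinyin through its GB2312 coordinates."""
    return [PINYIN[INDEX[gb2312_position(ch)]] for ch in text]


print(gb2312_position("中"), to_pinyin("中文"))
```

A real table would cover every character of the set and would still need phrase-level context to resolve polyphonic characters, which is the part that a framework such as pypinyin handles.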

2.3 Training

First, we use the PV-DBOW model [21] (a lightweight doc2vec model based on neural networks, whose feature vectors inherit the information of word vectors and thus make the relevance of words clearer) to train the character feature vectors of text $A^*$ and text $B^*$ in character units for 60 epochs, obtaining the sentence vectors with character features $S_A = [Sa_1, Sa_2, Sa_3, \ldots, Sa_n]$ and $S_B = [Sb_1, Sb_2, Sb_3, \ldots, Sb_n]$. The character feature vectors are trained with a window size of 10 and 3 negative samples. Second, we use the PV-DBOW model to train the phonetic feature vectors of text $A_p$ and text $B_p$ in single-pinyin units for 60 epochs, obtaining the sentence vectors with phonetic features $S_{A_p} = [Sa_{p,1}, Sa_{p,2}, Sa_{p,3}, \ldots, Sa_{p,n}]$ and $S_{B_p} = [Sb_{p,1}, Sb_{p,2}, Sb_{p,3}, \ldots, Sb_{p,n}]$. The phonetic feature vectors are trained with a window size of 5 and 2 negative samples. Third, we use the Skip-Gram model [22] (a word2vec model based on neural networks that learns continuous vectors by predicting the context from the central word) to train the character feature vector with a window size of 15 and 5 negative samples, and then combine the word-meaning features to obtain the sentence vectors $S_{A^*}$ and $S_{B^*}$. Finally, the different features are weighted to obtain the fused feature sentence vectors.
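For reference, the three training configurations stated above can be collected in one place, as in the sketch below. The `TrainConfig` class and `as_keyword_arguments` helper are hypothetical. The keyword names follow the gensim library (`dm=0` selects PV-DBOW in `Doc2Vec`, `sg=1` selects skip-gram in `Word2Vec`), which is an assumption, since the text does not name the toolkit that was used. Values that the text does not give, such as the vector size or the number of Skip-Gram epochs, are deliberately left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    model: str              # "PV-DBOW" or "Skip-Gram", as named in the text
    window: int             # context window size
    negative: int           # number of negative samples
    epochs: Optional[int]   # None where the text gives no value


CONFIGS = {
    "character": TrainConfig("PV-DBOW", window=10, negative=3, epochs=60),
    "phonetic": TrainConfig("PV-DBOW", window=5, negative=2, epochs=60),
    "semantic": TrainConfig("Skip-Gram", window=15, negative=5, epochs=None),
}


def as_keyword_arguments(cfg: TrainConfig) -> dict:
    """Translate a config into gensim-style keyword arguments (assumed names)."""
    kwargs = {"window": cfg.window, "negative": cfg.negative}
    if cfg.epochs is not None:
        kwargs["epochs"] = cfg.epochs
    # dm=0 is PV-DBOW for Doc2Vec; sg=1 is skip-gram for Word2Vec.
    kwargs["dm" if cfg.model == "PV-DBOW" else "sg"] = (
        0 if cfg.model == "PV-DBOW" else 1
    )
    return kwargs


for name, cfg in CONFIGS.items():
    print(name, as_keyword_arguments(cfg))
```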


2.4 Similarity Calculation and Features Fusion

A similarity algorithm compares text A with text B, and the similarity is usually obtained by calculating the “distance” between samples. The Pearson correlation coefficient improves on cosine similarity when dimension values are missing: its key step is to center the two groups of data, which corrects for missing dimensions, and it remains robust even under small deviations in nonlinear conditions [23]. Therefore, we chose the Pearson correlation coefficient for the short text similarity calculation. We pair the phonetic vectors $S_{A_p}$ and $S_{B_p}$, the text vectors $S_A$ and $S_B$, and the semantic vectors $S_{A^*}$ and $S_{B^*}$, and calculate $\mathrm{sim}(S_{A_p}, S_{B_p})$, $\mathrm{sim}(S_A, S_B)$ and $\mathrm{sim}(S_{A^*}, S_{B^*})$. Pearson’s correlation coefficient is calculated as in Eq. (3).

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}} \tag{3}$$
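Equation (3) translates directly into a few lines of code. The sketch below uses only the standard library; returning 0.0 for a constant vector, where the coefficient is undefined, is a choice made for this illustration and is not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors, following Eq. (3)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Center both vectors; this is the step that separates Pearson
    # correlation from plain cosine similarity.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(
        sum(b * b for b in dy)
    )
    if denominator == 0:
        return 0.0  # constant vector: correlation undefined, sketch returns 0
    return numerator / denominator


print(pearson([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
```

Because of the centering, adding a constant offset to every dimension of a vector leaves the result unchanged, which is not the case for cosine similarity.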

In many works, feature fusion is an important way of improving the accuracy of text similarity calculation. Integrating multiple features efficiently, keeping what is useful in each and discarding the rest, is the key to improving the accuracy of the model. Therefore, the fusion algorithm uses linear regression to fuse the text similarities of the three features: characters, phonetics and semantics. The fusion is given by Eq. (4), and the constraint on $a_1$, $a_2$ and $a_3$ by Eq. (5).

$$\mathrm{sim}(A, B) = a_1\,\mathrm{sim}(S_{A_p}, S_{B_p}) + a_2\,\mathrm{sim}(S_A, S_B) + a_3\,\mathrm{sim}(S_{A^*}, S_{B^*}) \tag{4}$$

$$a_1 + a_2 + a_3 = 1 \tag{5}$$
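A minimal sketch of Eqs. (4) and (5) follows. It applies fixed weights and does not reproduce the linear regression used to obtain them; the default weights are the values reported in Sect. 3.2 ($a_1 = 0.1$, $a_2 = 0.4$, $a_3 = 0.5$), and the function name and the input similarities are illustrative.

```python
def fuse(sim_phonetic, sim_character, sim_semantic, weights=(0.1, 0.4, 0.5)):
    """Weighted fusion of the three similarities, following Eq. (4)."""
    a1, a2, a3 = weights
    # Eq. (5): the three weights must sum to 1.
    if abs(a1 + a2 + a3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return a1 * sim_phonetic + a2 * sim_character + a3 * sim_semantic


fused = fuse(0.8, 0.6, 0.9)
print(round(fused, 4))
```

Since the weights sum to 1, the fused score stays within the range spanned by the three input similarities.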

3 Experiments and Result

In this section, we carry out a comprehensive comparison of the similarity calculated by the semantic algorithms, the similarity calculated by the spatial vector algorithm, and the similarity calculated by the sentence vector based on TF-IDF [24] and the weighted sentence vector algorithm based on SIF [25], and use extensive experiments to demonstrate the advantages of the algorithm proposed in this paper.


3.1 Evaluation Setting

Dataset
This paper uses the well-known public semantic matching dataset LCQMC, a Chinese semantic matching dataset released by Harbin Institute of Technology at COLING 2018. The dataset uses Baidu Zhidao as its data source to obtain large-scale sentence pairs and provides a manually annotated corpus of 260,068 Chinese question pairs. It covers all areas of everyday life and is therefore broad and general. In addition, because the data come from Baidu Zhidao, the dataset is essentially question-oriented and its sentence structure is colloquial, which makes it an excellent benchmark for this article.

Criteria
We used four indicators, accuracy, precision, recall and F1-score, to test the proposed algorithm quantitatively. Precision is the fraction of all samples predicted as positive that are actually positive. Recall is computed only over the positive class: it is the fraction of all positive samples in the test set that are predicted correctly. In semantic similarity matching, this criterion concerns only the correct rate on similar sentences. The F1-score combines precision and recall and is often regarded as the more authoritative index for evaluating semantic similarity. Its value is biased toward the smaller of precision and recall, so the F1-score is large only when the two are both high and close to each other; it does not consider the correct rate on dissimilar sentences.
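The four criteria can be computed from the binary labels as sketched below, with 1 marking a similar pair and 0 a dissimilar pair. The function name and the toy labels are illustrative; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = similar)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean, so it is pulled toward the smaller value.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

True negatives enter only the accuracy, which is why precision, recall and F1 say nothing about how well dissimilar pairs are recognized.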

3.2 Comparison Result

Knowledge Graph-Based Text Similarity Algorithm
In this section, the similarity algorithm based on HowNet, the similarity algorithm based on Tongyici Cilin, and the comprehensive similarity algorithm based on HowNet and Tongyici Cilin are selected for comparative analysis; the results are shown in Fig. 2, which plots the accuracy against the similarity threshold. The highest accuracy obtained by the semantic similarity algorithm based on HowNet is 0.556, the highest accuracy obtained by the algorithm based on Tongyici Cilin is 0.544, and the highest accuracy obtained by the algorithm based on both HowNet and Tongyici Cilin is 0.548. These results show that semantic-based similarity algorithms are not very accurate on the short text LCQMC dataset, which also indicates that similarity algorithms based on an external semantic library are not suitable for colloquial short text.
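The "highest accuracy" values above correspond to the best point of an accuracy-versus-threshold curve. A minimal way to locate that point is sketched below, assuming that a pair is predicted similar when its score is at or above the threshold and that thresholds are scanned on a uniform grid over [0, 1]. The grid, the function name and the toy scores are assumptions made for illustration.

```python
def best_threshold(similarities, labels, steps=100):
    """Return the (threshold, accuracy) pair that maximizes accuracy."""
    best = (0.0, -1.0)
    for k in range(steps + 1):
        threshold = k / steps
        # A pair is predicted similar when its score reaches the threshold.
        correct = sum(
            1 for s, y in zip(similarities, labels)
            if (s >= threshold) == bool(y)
        )
        accuracy = correct / len(labels)
        if accuracy > best[1]:  # keep the first threshold that is best
            best = (threshold, accuracy)
    return best


scores = [0.9, 0.8, 0.35, 0.2]
gold = [1, 1, 0, 0]
threshold, accuracy = best_threshold(scores, gold)
print(threshold, accuracy)
```

In a careful evaluation the threshold would be selected on held-out development data rather than on the test set, so that the reported accuracy is not optimistically biased.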


Fig. 2 Knowledge Graph-Based text similarity calculation result

Fig. 3 Spatial vectors-based text similarity calculation result

Spatial Vectors-Based Text Similarity Algorithm A comparative experiment was conducted on the traditional word vector model and the weighted sentence vector models, and the results are shown in Fig. 3. Figure 3 shows that the spatial vector-based text similarity algorithms work well for short text similarity: the original skip-gram model has an accuracy of 0.704 and an F1 score of 0.693, the sentence vector model based on TF-IDF weights achieves an accuracy of 0.771 and an F1 score of 0.760, and the sentence vector model based on SIF weights achieves an accuracy of 0.783 and an F1 score of 0.775.

Single Feature Extension Text Similarity Algorithm (our) The phonetic similarity algorithm, the character similarity algorithm and the semantic similarity algorithm are selected for experiments, and the results are shown in Fig. 4. Figure 4 shows the three parts of the fusion algorithm based on word sound and word meaning proposed in this paper. After feature expansion, the word-sound-based similarity algorithm and the character-based similarity algorithm reach accuracies of 0.784 and 0.792 and F1 scores of 0.816 and 0.794, respectively. The similarity algorithm based on word meaning reaches an accuracy of 0.803 and an F1 score of 0.823. In the figure, the abscissas of the extreme points of the three algorithms are quite different, indicating that the short text similarity results calculated by the three algorithms differ from each other and that each has its own advantages.

Fig. 4 Single feature extension (our) text similarity calculation result

Phonetics and Semantics-Based Fusion Algorithm (our) We use the proposed algorithm to fuse the three features. Through many experiments on the importance of the three similarity algorithms, and by measuring the effect of the short text similarity algorithm in actual application scenarios, we set the three parameters a1, a2 and a3 to the values in Eq. (6).

$$a_1 = 0.1,\quad a_2 = 0.4,\quad a_3 = 0.5 \tag{6}$$
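As a rough illustration of how the weights in Eq. (6) could be applied, the sketch below assumes that the fusion is a plain weighted sum of three per-feature similarity scores. Both the linear form and the order of the scores are assumptions made for this example (the fusion rule itself is defined earlier in the paper), and the function names are hypothetical.

```python
from typing import Sequence

# Weights taken from Eq. (6). Which feature each weight belongs to is an
# assumption of this sketch, not something restated in this section.
WEIGHTS = (0.1, 0.4, 0.5)


def fuse_similarities(scores: Sequence[float],
                      weights: Sequence[float] = WEIGHTS) -> float:
    """Combine per-feature similarity scores with a weighted sum (assumed rule)."""
    if len(scores) != len(weights):
        raise ValueError("one score per weight is required")
    return sum(w * s for w, s in zip(weights, scores))


def is_similar(scores: Sequence[float], threshold: float) -> bool:
    """Label a sentence pair as similar when the fused score reaches the threshold."""
    return fuse_similarities(scores) >= threshold


if __name__ == "__main__":
    # Three hypothetical similarity scores for one sentence pair.
    print(fuse_similarities([1.0, 0.5, 0.8]))
```

Because the weights sum to 1, the fused value stays in the same range as the individual scores, which keeps a single decision threshold meaningful.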

As shown in Fig. 5, the fusion algorithm based on word sound and word meaning proposed in this paper yields an accuracy of 0.836 and an F1 score of 0.858, and it obtains the best results on all four evaluation indicators.

3.3 Component Analysis
The experimental results of all compared algorithms are shown in Table 2. In accuracy, the proposed fusion algorithm is 13.2 percentage points higher than Skip-gram, 6.5 points higher than the TF-IDF weighted sentence vector and 5.3 points higher than the SIF-based sentence vector. In F1-score, it is 16.5 points higher than Skip-gram, 9.8 points higher than the TF-IDF weighted sentence vector and 8.3 points higher than the SIF-based sentence vector. This shows that the proposed algorithm can effectively improve the accuracy of short text similarity calculation.
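The improvements quoted here are differences between the scores reported in Table 2, expressed in percentage points. The short sketch below reproduces that arithmetic; the dictionary layout and function name are only illustrative.

```python
# (accuracy, F1-score) pairs copied from Table 2.
RESULTS = {
    "Skip-gram": (0.704, 0.693),
    "Skip-gram + TF-IDF": (0.771, 0.760),
    "Skip-gram + SIF": (0.783, 0.775),
    "Fusion": (0.836, 0.858),
}

ACCURACY, F1 = 0, 1


def gain_in_points(baseline: str, metric: int) -> float:
    """Improvement of the fusion algorithm over a baseline, in percentage points."""
    fused = RESULTS["Fusion"][metric]
    return round((fused - RESULTS[baseline][metric]) * 100, 1)


if __name__ == "__main__":
    for name in ("Skip-gram", "Skip-gram + TF-IDF", "Skip-gram + SIF"):
        print(name, gain_in_points(name, ACCURACY), gain_in_points(name, F1))
```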

4 Conclusion This paper proposes a Chinese short text fusion algorithm based on semantics and phonetics, which creatively integrates the pinyin of the text, the part of speech of the word and the word order as features into the calculation of similarity and this


Fig. 5 Fusion algorithm (our) text similarity calculation result

Table 2 Comparison results for algorithms
Algorithm                    Accuracy  Precision  Recall  F1-score
HowNet                       0.556     0.694      0.189   0.300
Cilin                        0.544     0.655      0.186   0.290
HowNet + Cilin               0.548     0.656      0.186   0.290
Skip-gram                    0.704     0.718      0.670   0.693
Skip-gram T                  0.771     0.800      0.728   0.760
Skip-gram S                  0.783     0.804      0.748   0.775
Similar algorithm CB (our)   0.792     0.787      0.801   0.794
Similar algorithm SB (our)   0.784     0.800      0.836   0.816
Similar algorithm PB (our)   0.803     0.835      0.810   0.823
Fusion algorithm (our)       0.836     0.848      0.869   0.858
T: +TF-IDF; S: +SIF; CB: character-based; SB: semantics-based; PB: phonetics-based

multi-feature fusion algorithm not only improves the accuracy of text similarity calculation but also, to a certain extent, solves the problem of wrongly written words that often occur in colloquial short texts. The algorithm is superior to the semantic models and sentence vector models that are commonly used at present and is optimal on multiple indicators. For unsupervised short text similarity algorithms, there is still room for simplification at every stage, from feature extraction to model construction to the spatial distance algorithm. How to abandon the complicated steps and further simplify the algorithm is the main research topic for the future. In addition, the algorithm can currently be applied only to Chinese short text; how to break through this language limitation and further promote and apply the algorithm is also a direction worth exploring.

Acknowledgements This work is supported by Guangdong Basic and Applied Basic Research Foundation 2021A1515310003.


Feature Extension for Chinese Short Text Based on Tongyici Cilin
Chuying Huang, Xinru Li, Yuchao Jiang, Wei Lv, and Minghe Xu

Abstract Short texts have sparse features, which makes calculating their similarity a considerable challenge. However, there has been little research on feature extension for Chinese short texts in similarity calculation. To gain a deeper understanding of how feature extension can be used in the similarity calculation of Chinese short texts, this paper adopts a feature extension algorithm for short texts based on the external thesaurus Tongyici Cilin (extended). The purpose is to solve the feature sparseness problem of Chinese short text feature vectors. First, for short texts with high surface similarity, the words are segmented according to certain rules and the major difference components of the texts are extracted. Then, the similarity of the major difference components between the two short texts is calculated based on Cilin. Finally, feature extension is performed in the corresponding short text according to the similarity results. A variety of unsupervised models are tested on the large-scale Chinese question matching corpus LCQMC. The experimental results show that the proposed method improves various spatial vector similarity algorithms, raising accuracy and F1-score by about 3%.

Keywords Chinese short text · Tongyici Cilin · Text similarity · Feature extension · Unsupervised

C. Huang · W. Lv · M. Xu (corresponding author)
School of Alibaba Cloud Big Data Application, Zhuhai College of Science and Technology, Zhuhai, China

X. Li
Department of Meituan Select, Beijing Sankuai Technology Co., Ltd, Beijing, China

Y. Jiang
School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_12


C. Huang et al.

1 Introduction
With the rapid growth of network communication, comments on Weibo, Q&A in forums, bullet screens on online videos, and other forms of short text have become the mainstream of information exchange. Because short texts are large in quantity, highly immediate, poorly standardized, short in content, and dense in the information they express, it is difficult to extract their meaning and make reasonable use of the information in them. Short text similarity has therefore become a major problem in the field of natural language processing.

In recent years, researchers have done a lot of work on text similarity matching. Word vectors allow computers to better understand human language. In [9], Mikolov et al. proposed two novel model architectures for computing continuous vector representations of words from very large data sets. However, these models struggle to reflect the relationships between words when the corpus is insufficient. To optimize them, Mnih and Kavukcuoglu proposed a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation in [10]. After that, Le and Mikolov [5] proposed Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. Devlin et al. created BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [2].

Since Chinese contains words that are similar in meaning but written completely differently, semantic similarity cannot be ignored when solving such problems. Calculating the semantic similarity of Chinese requires the support of a large Chinese thesaurus, of which HowNet and Tongyici Cilin are typical representatives. Building on Tongyici Cilin similarity calculation, Mao et al. [8] presented a method to calculate semantic similarity with Tongyici Cilin and Word2vec. The method makes full use of the semantic information of words in the knowledge base and the corpus. Tohdi et al. considered text similarity based not only on word features but also on emotional tendency [12].

Feature extension algorithms can alleviate the sparsity of short text features; they are widely used in the field of short text classification and have achieved good results. In [14], Yang et al. extended features through external reference documents, which improved the performance of text classification. Li et al. [6] expanded the text based on the granularity of keywords and domain characteristics, which also achieved good results. Xi-Wei [13] proposed two feature extension methods based on the co-occurrence relationship; the improved methods give the short text classification system higher accuracy.

However, current work on feature extension mainly targets short text classification and is rarely used in semantic similarity calculation. Traditional models such as word2vec depend heavily on word frequency and on the co-occurrence relationship between words when calculating text similarity, and the existing semantic knowledge bases are inflexible and difficult to keep up to date. These problems are even more pronounced when calculating short text similarity.


To further improve the effectiveness of unsupervised similarity algorithms for Chinese short texts, this paper designs a feature extension algorithm based on Tongyici Cilin. It performs word segmentation on two short texts with high surface similarity and extracts their major difference components. The similarity of these components is then calculated based on Cilin, and feature extension is performed in the corresponding short text according to the result. This alleviates the feature sparsity of short texts and improves the accuracy of short text similarity calculation. The following sections describe the feature extension algorithm based on Tongyici Cilin in detail.

2 Method
In experiments that combine a traditional unsupervised word vector with a distance-based similarity algorithm, it is difficult to distinguish two short texts with very similar content but different meanings. For example, for the sentences “what is the name of my mailbox?” and “what name can I give to my mailbox?”, the similarity value calculated on the Chinese text is very high: combining word vectors with cosine similarity gives a value as high as 0.988. After feature extension, the similarity value is effectively reduced, so that the two texts become easier to tell apart.

The algorithm proceeds as follows. First, we calculate the surface similarity of two short texts and select the pairs with high surface similarity. Next, the selected short texts are segmented and their major difference components are extracted. After that, the similarity of the major difference components between the two short texts is calculated based on Cilin. Finally, we perform feature extension in the corresponding short text according to the similarity results. The feature extension process is shown in Fig. 1.
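To make the baseline concrete, the sketch below shows one common way of combining word vectors with cosine similarity. It assumes that a sentence vector is the plain average of its word vectors, which this section does not specify, and the two-dimensional embeddings are invented toy values rather than trained vectors.

```python
import math
from typing import Dict, List, Sequence


def sentence_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known tokens (an assumed pooling choice)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical toy embeddings, for illustration only.
    toy = {"my": [1.0, 0.0], "mailbox": [0.0, 1.0], "name": [1.0, 1.0],
           "what": [0.5, 0.0], "give": [0.0, 0.5]}
    a = sentence_vector(["what", "name", "my", "mailbox"], toy)
    b = sentence_vector(["give", "name", "my", "mailbox"], toy)
    print(cosine_similarity(a, b))  # high, because most tokens are shared
```

With averaging, the shared tokens dominate both sentence vectors, which is why pairs that differ in a single key word can still score close to 1.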

2.1 Compute Surface Similarity
Surface similarity refers to the degree of morphological similarity between two sentences, measured by the number of identical words contained in the two sentences. When calculating short text semantic similarity, text pairs can be divided into four classes as follows:
• Text has high literal similarity and the same meaning
• Text has high literal similarity but different meanings
• Text has low literal similarity but the same meaning
• Text has low literal similarity and different meanings.


Fig. 1 The flowchart of the feature extension algorithm

Because targeted feature extension improves accuracy more significantly, this paper applies the feature extension algorithm only to texts with high literal similarity, that is, high surface similarity. The surface similarity is calculated with the Jaccard algorithm [3], which simply compares the number of repeated units in two sentences. It is often used to check exams for plagiarism or papers for duplication; here it is used to compare the surface similarity of two texts. Suppose there are two sentences S and T. Their similarity is given by Eq. 1:

$$\mathrm{JaccardSimilarity}(S, T) = \frac{|S \cap T|}{|S \cup T|} \tag{1}$$

The similarity threshold is denoted α. When JaccardSimilarity ≥ α, the two short texts are considered to have high surface similarity; otherwise, they have low surface similarity.
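A minimal sketch of Eq. 1 follows, treating each sentence as a set of tokens. The function names are hypothetical, and the default threshold uses the value of α reported in the evaluation setting (0.3).

```python
from typing import Iterable

ALPHA = 0.3  # surface-similarity threshold reported in the evaluation setting


def jaccard_similarity(s: Iterable[str], t: Iterable[str]) -> float:
    """|S ∩ T| / |S ∪ T| over the token sets of two sentences (Eq. 1)."""
    set_s, set_t = set(s), set(t)
    union = set_s | set_t
    if not union:
        return 0.0  # two empty sentences: treated as having no overlap
    return len(set_s & set_t) / len(union)


def has_high_surface_similarity(s: Iterable[str], t: Iterable[str],
                                alpha: float = ALPHA) -> bool:
    """True when the pair qualifies for feature extension."""
    return jaccard_similarity(s, t) >= alpha


if __name__ == "__main__":
    print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))  # 2 shared / 4 total
```

Working on sets means repeated tokens count once, so the score depends only on which units the two sentences share.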

2.2 Segment Words and Extract Major Differences in Sentences
The feature extension algorithm does not extend all the words in a sentence, only the keywords that play a major role in the two short texts and differ between them. By observing a large number of corpus sentences, we found that nouns, verbs, and adjectives best reflect the central idea of a sentence. So, after annotating the text, this paper extracts nouns, verbs, and adjectives as the major components of the sentence and performs feature extension on this basis.

Chinese part-of-speech labeling involves great uncertainty; for example, homophones often exhibit different grammatical properties and different meanings in different scenarios. If each word were simply labeled with its highest-frequency part of speech, as in traditional labeling, a certain amount of error would result. Therefore, this paper uses the jieba word segmentation tool for word segmentation and part-of-speech labeling. jieba uses a rule-based and statistical part-of-speech labeling algorithm that determines the most probable combination of tags according to the specific meaning of the word and then labels each word with the part of speech from that combination. For example, the most frequent word collocations are combinations of nouns and verbs, so once the former word is determined to be a verb, the probability that the latter is a noun becomes very high.
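The extraction step can be sketched as a filter over part-of-speech tagged tokens. In practice the `(word, flag)` pairs could come from `jieba.posseg.cut`, whose items expose `.word` and `.flag`; to stay self-contained, the sketch below takes already tagged tuples. Matching tags by their first letter (`n`, `v`, `a`) is an assumption of this example, and the tags in the sample sentences are illustrative.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech flag)

# Assumption: noun, verb and adjective tags start with "n", "v" and "a".
MAJOR_POS_PREFIXES = ("n", "v", "a")


def major_components(tagged: Iterable[Token]) -> List[Token]:
    """Keep only the nouns, verbs and adjectives of a tagged sentence."""
    return [(w, f) for w, f in tagged if f.startswith(MAJOR_POS_PREFIXES)]


def major_differences(tagged_a: Iterable[Token],
                      tagged_b: Iterable[Token]) -> List[Token]:
    """Major components of sentence A that do not occur in sentence B."""
    words_b = {w for w, _ in tagged_b}
    return [(w, f) for w, f in major_components(tagged_a) if w not in words_b]


if __name__ == "__main__":
    a = [("我", "r"), ("的", "uj"), ("邮箱", "n"), ("叫", "v"), ("什么", "r")]
    b = [("我", "r"), ("的", "uj"), ("邮箱", "n"), ("取", "v"), ("什么", "r")]
    print(major_differences(a, b))
```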

2.3 Similarity Calculation of Major Difference Components
The nouns, verbs, and adjectives in one sentence are compared with all the words of the same part of speech in the other sentence; we call the words to be compared the major difference components of the short text. Based on the extraction result of the previous step, this paper calculates the similarity of short texts using Tongyici Cilin. The thesaurus has five levels: large class, medium class, small class, word group, and atomic word group. These five levels are represented by seven code characters, so a code determines a unique atomic word group. The eighth character, which follows the seven-character code, is a status mark with three possible values: “=”, “#”, and “@”. “=” indicates that the words in the atomic word group are synonymous with each other. “#” indicates that the words in the atomic word group are related. “@” indicates that the atomic word group contains only one word, which has neither synonyms nor related words. The semantic item encoding of Tongyici Cilin is shown in Table 1.

Table 1 Cilin code sample
Code position   1       2        3–4     5            6–7                 8
Code sample     A       a        01      A            01                  = / # / @
Code level      Big     Medium   Small   Word group   Atomic word group
Layer           First   Second   Third   Fourth       Fifth
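The code layout in Table 1 can be illustrated with a small parser. This is a sketch based only on the structure described in the text; the type and function names are hypothetical, and the status mark is treated as optional so that codes written without it (such as "Hg10A02") also parse.

```python
import re
from typing import NamedTuple, Optional


class CilinCode(NamedTuple):
    big: str             # level 1: one uppercase letter, A to L
    medium: str          # level 2: one lowercase letter
    small: str           # level 3: two digits
    word_group: str      # level 4: one uppercase letter
    atomic_group: str    # level 5: two digits
    flag: Optional[str]  # status mark "=", "#" or "@", if present


_PATTERN = re.compile(r"^([A-L])([a-z])(\d{2})([A-Z])(\d{2})([=#@])?$")


def parse_cilin_code(code: str) -> CilinCode:
    """Split a Cilin code into its five levels and optional status mark."""
    match = _PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a valid Cilin code: {code!r}")
    return CilinCode(*match.groups())


def shared_levels(code_a: str, code_b: str) -> int:
    """Number of leading levels (0 to 5) on which two codes agree."""
    count = 0
    for x, y in zip(parse_cilin_code(code_a)[:5], parse_cilin_code(code_b)[:5]):
        if x != y:
            break
        count += 1
    return count


if __name__ == "__main__":
    print(parse_cilin_code("Hg10A02="))
    print(shared_levels("Hg10A02", "Hg10A01"))  # same word group, different atomic group
```

Two codes that agree on the first four levels lie in the same word group, which is the condition later used for verbs and adjectives.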


Fig. 2 Cilin classes meaning

Fig. 3 Number of word groups in each first-layer class

The first level has a total of 12 major classes, whose meanings are shown in Fig. 2. Among the 12 classes, nouns fall into classes A, B, and C. Class D contains various abstract nouns and nouns for related concepts, as well as some numerals and quantifiers. Classes F to J are verb classes, K contains particles, and L contains honorifics. The distribution of the number of word groups is shown in Fig. 3. The figure shows that the largest branches are B and C, with 4568 and 4097 word groups respectively. Almost all the words in these two branches are nouns, describing the names of all things in the world and some abstract concepts defined by humans. The branch with the third most atomic word groups is H, with 3420 word groups; it contains many verbs describing actions in daily life. The honorifics are mainly found in branch L, which has the fewest atomic word groups. The distribution and quantity of words in Tongyici Cilin show that the thesaurus covers exactly what is involved in our daily conversations, and the large number of short texts on the Internet are generated from daily oral communication. Therefore, Tongyici Cilin is a good choice for extending colloquial short texts with synonyms.

When the compared word is a noun, the similarity between words is calculated only under the four branches A, B, C, and D of Tongyici Cilin. If the compared word is a verb, it is calculated under branches F, G, H, I, and J. If it is an adjective, it is calculated under branch E. Even within these branches, however, a word may be polysemous. When a word is polysemous, this paper adopts the nearest-node principle: among the multiple leaf nodes where the two words appear, the pair of nodes closest to each other is selected as the calculation benchmark. For example, take “默读” as the target word and “读书” as the comparison word. “默读” is located in the atomic word group “Hg10A02”, but because of its polysemy “读书” appears in both “Hg08A01” and “Hg10A01”, as shown in Fig. 4.

Fig. 4 Nearest principle

Based on Tongyici Cilin, the distance used for calculating the similarity of words is:

$$\mathrm{Distance}(A, B) = |\mathrm{posi}(A) - \mathrm{posi}(B)| \tag{2}$$

When calculating the similarity between “读书” and “默读”, A represents the former and B the latter. The function Distance() gives the number of words separating the two words, and the function posi() gives the position of a word in the thesaurus. The distance between the target word and the comparison word is calculated for each atomic word group in which the comparison word appears. The smaller the distance, the closer the two words are, and the atomic word group containing the closest occurrence is selected as the calculation benchmark. In this example, the “读书” in the atomic word group “Hg10A01” is closer to the target word, so that atomic word group is used as the calculation benchmark.

Different parts of speech behave differently across the levels of Cilin. For example, in Fig. 5, the words in class D are nouns, the words in class E are adjectives, and the words in class H are verbs, and the atomic word groups shown for each class lie in the same word group. From these atomic word groups, it can be seen that synonyms of nouns appear only in the same atomic word group, whereas nouns in the same word group but in different atomic word groups are strongly related but not synonymous, such as “三角形” and “菱形”. Synonyms of verbs and adjectives, by contrast, can appear in different atomic word groups, such as “旅行” and “出行”. We therefore calculate the similarity of different parts of speech with different methods.

Fig. 5 Differences between nouns and verbs and adjectives

For nouns, we only check whether the two words are in the same atomic word group: if so, they are similar; otherwise, they are not. For verbs and adjectives, two factors are considered. The first is whether the two words are under the same word group, that is, whether they share the same fourth-level parent node. If they do, the density between the two words is calculated according to Eq. 3:

$$\mathrm{Dens}(A, B) = 1 - \frac{\mathrm{Diff}(A, B)}{\mathrm{Sum}(A, B)} \tag{3}$$

where A and B represent two different words, Diff(A, B) represents the difference in position between the two words, and Sum(A, B) represents the total number of words in the word group containing the two words. After several experiments, this paper sets the density threshold as β. When Dens(A, B) ≥ β, the words are considered similar; otherwise, they are not.
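The nearest-node principle (Eq. 2) and the density test (Eq. 3) can be sketched together on a toy thesaurus. Several things here are assumptions made for illustration: the word lists in `TOY_CILIN` are invented (only the group codes follow the example in the text), posi() is taken to be a word's running index when the atomic groups are listed in code order, and Sum is taken to be the number of words in the shared word group. The default β is the value reported in the evaluation setting (0.5).

```python
from typing import Dict, List, Optional, Tuple

Sense = Tuple[str, int]  # (atomic-group code, position in the thesaurus)

# Toy excerpt: atomic-group code -> ordered words. The word lists are invented.
TOY_CILIN: Dict[str, List[str]] = {
    "Hg08A01": ["读书", "上学"],
    "Hg10A01": ["读书", "念书", "阅读"],
    "Hg10A02": ["默读", "默念"],
}


def build_index(cilin: Dict[str, List[str]]) -> Dict[str, List[Sense]]:
    """Map each word to all of its senses, numbering words in code order."""
    index: Dict[str, List[Sense]] = {}
    position = 0
    for code in sorted(cilin):
        for word in cilin[code]:
            index.setdefault(word, []).append((code, position))
            position += 1
    return index


def nearest_senses(index: Dict[str, List[Sense]], a: str,
                   b: str) -> Optional[Tuple[Sense, Sense]]:
    """Nearest-node principle: the sense pair with the smallest Eq. 2 distance."""
    if a not in index or b not in index:
        return None
    pairs = ((sa, sb) for sa in index[a] for sb in index[b])
    return min(pairs, key=lambda pair: abs(pair[0][1] - pair[1][1]))


def density(cilin: Dict[str, List[str]], sense_a: Sense, sense_b: Sense) -> float:
    """Eq. 3, with Sum taken as the word count of the shared word group."""
    group = sense_a[0][:5]  # first five characters identify the word group
    total = sum(len(words) for code, words in cilin.items() if code.startswith(group))
    return 1 - abs(sense_a[1] - sense_b[1]) / total


def words_similar(cilin: Dict[str, List[str]], index: Dict[str, List[Sense]],
                  a: str, b: str, pos: str, beta: float = 0.5) -> bool:
    """Nouns need the same atomic group; verbs/adjectives need Dens >= beta."""
    pair = nearest_senses(index, a, b)
    if pair is None:
        return False
    (code_a, _), (code_b, _) = pair
    if pos == "noun":
        return code_a == code_b
    if code_a[:5] != code_b[:5]:
        return False
    return density(cilin, pair[0], pair[1]) >= beta
```

On this toy data, the sense of “读书” selected for comparison with “默读” is the one in “Hg10A01”, in line with the example in the text.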

2.4 Similarity Calculation of Major Difference Components
If the major difference components of two short texts are similar, each sentence is extended with the similar word from the other sentence. If they are not similar, the first word in the same atomic word group of Cilin is selected as the feature extension word for each component. If a word is not found in the thesaurus, or its atomic word group contains no other word, no feature extension is performed for it. The details of the algorithm are described below.


Algorithm: Feature Extension
Input: Feature set X = {x1, x2, x3, ..., xm}; feature set Y = {y1, y2, y3, ..., yn}
Output: New feature sets Xnew, Ynew
Initialize core difference sets W, V
for xi in X and not in Y:
    if xi.flag ∈ Noun or xi.flag ∈ Verb or xi.flag ∈ Adj:
        W ← xi
end for
for yi in Y and not in X:
    if yi.flag ∈ Noun or yi.flag ∈ Verb or yi.flag ∈ Adj:
        V ← yi
end for
for wi in W:
    initialize d
    for vi in V:
        d ← calculate distance(wi, vi)
        if d ≥ D:
            Xnew ← cilin(wi) + X
            Ynew ← cilin(vi) + Y
        else:
            Xnew ← vi + X
            Ynew ← wi + Y
    end for
end for
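The listing above can be turned into a short runnable sketch. It is one possible reading of the pseudocode, not the authors' implementation: the distance test `d ≥ D` is replaced by a caller-supplied `is_similar` predicate, `first_synonym` stands in for the Cilin lookup `cilin(·)`, extension words are appended rather than prepended, and duplicates are skipped. All names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech flag)

# Assumption: noun, verb and adjective tags start with "n", "v" and "a".
MAJOR_POS_PREFIXES = ("n", "v", "a")


def core_difference(source: Sequence[Token], other: Sequence[Token]) -> List[str]:
    """Nouns, verbs and adjectives of `source` that are absent from `other`."""
    other_words = {w for w, _ in other}
    return [w for w, flag in source
            if w not in other_words and flag.startswith(MAJOR_POS_PREFIXES)]


def _add(features: List[str], word: Optional[str]) -> None:
    """Append an extension word unless it is missing or already present."""
    if word is not None and word not in features:
        features.append(word)


def extend_features(x: Sequence[Token], y: Sequence[Token],
                    is_similar: Callable[[str, str], bool],
                    first_synonym: Callable[[str], Optional[str]],
                    ) -> Tuple[List[str], List[str]]:
    """Return the extended feature lists (Xnew, Ynew) for two tagged sentences."""
    x_new = [w for w, _ in x]
    y_new = [w for w, _ in y]
    for w in core_difference(x, y):
        for v in core_difference(y, x):
            if is_similar(w, v):
                # Similar components: each sentence gains the other's word.
                _add(x_new, v)
                _add(y_new, w)
            else:
                # Dissimilar components: each gains a word from its own group.
                _add(x_new, first_synonym(w))
                _add(y_new, first_synonym(v))
    return x_new, y_new
```

For example, with `x = [("我", "r"), ("喜欢", "v"), ("旅行", "v")]`, `y = [("我", "r"), ("喜欢", "v"), ("出行", "v")]` and a predicate that treats “旅行” and “出行” as similar, each sentence gains the other's differing verb, so the two feature lists end up containing the same words.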

3 Experiment and Analysis
3.1 Dataset
This paper uses the well-known public semantic matching dataset LCQMC [7], a Chinese semantic matching dataset constructed by Harbin Institute of Technology and published at COLING 2018, a top international conference on natural language processing. It is extensive and universal, since it covers all fields of daily life. In addition, because its data come from Baidu Knows, the world’s leading Chinese Q&A interactive platform, the sentence structure of the dataset is extremely colloquial, which makes it an excellent experimental dataset for unsupervised short text similarity.

3.2 Evaluation Setting When implementing the feature extension step, we set the threshold α in Eq. 1 to 0.3, and the density threshold β in Eq. 3 to 0.5.


3.3 Criteria
From the perspective of evaluation indicators, the semantic similarity task can be regarded as a binary classification task. For reasons such as the distribution of the data classes, accuracy alone often cannot evaluate an algorithm correctly, so it cannot be used as the only index of algorithm quality. This paper uses Accuracy, Precision, Recall, and F1-score as the evaluation indicators. Precision is the proportion of short text pairs predicted to be similar that are actually similar. Recall is an accuracy computed only for the truly similar class: of all the positive pairs in the test set, how many are correctly predicted as similar. In semantic similarity matching, it therefore focuses only on the accuracy for similar sentences. F1-score combines Precision and Recall. Its value is biased towards the smaller of the two, and the closer Precision and Recall are to each other, the larger the F1-score.
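The four indicators can be computed from binary labels as in the sketch below, where 1 marks a similar pair and 0 a dissimilar pair. The function names are illustrative; `f1_score` is separated out so that the harmonic mean of a reported Precision and Recall can be checked directly.

```python
from typing import Dict, Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, Precision, Recall and F1-score for labels in {0, 1}."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
    # Consistency check against a reported pair of scores, e.g. skip-gram.
    print(round(f1_score(0.718, 0.670), 3))
```

Because the harmonic mean is pulled towards the smaller operand, a model with very low Recall keeps a low F1-score even when its Precision is high.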

3.4 Comparison Result
To verify the effectiveness of the feature extension algorithm, this paper applies the feature extension algorithm based on Tongyici Cilin to five unsupervised text vector models and compares the experimental results before and after feature extension. The models selected for this experiment are the skip-gram [9], PV-DBOW [5], FastText [4], Glove [11] and BERT [2] pre-trained models. The similarity of the test data is then calculated from the feature-extended sentences with the Pearson correlation coefficient [1], given in Eq. 4, where E is the mathematical expectation and cov() is the covariance.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{n} (X_i - \mu_X)^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \mu_Y)^2}} \tag{4}$$
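Eq. 4 can be computed for two sentence vectors as in the following sketch, which uses the sample form with sums over the vector components. Returning 0.0 for a constant vector is a convention chosen here, since the coefficient is undefined in that case.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors (Eq. 4)."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # convention for constant vectors, where Eq. 4 is undefined
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    print(pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The coefficient equals the cosine similarity of the two vectors after each has been centred on its own mean.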

In Fig. 6, it can be seen from that when the accuracy of the model reaches the highest point, the corresponding abscissa values are very close to 1, which indicates that the similarity value of dissimilar texts is high, requiring a larger threshold to distinguish similar texts from dissimilar texts. The highest accuracy obtained by different models is not much different, almost in the range of 0.70–0.80. The model

Feature Extension for Chinese Short Text …

177

Fig. 6 Compared with different unsupervised models before feature extension

Fig. 7 Compared with different unsupervised models after feature extension

with the highest accuracy of 0.770 is PV-DBOW, and the model with the lowest accuracy of 0.704 is skip-gram. When we put Figs. 6 and 7 together, comparing the Accuracy figure obtained by the model for feature extension before and after, it can be found that the shapes of the figure are similar. While the accuracy rate has increased to a certain extent, the threshold for judging the similarity between short texts has decreased. It shows that the algorithm can make short texts with very high literal similarity generate a certain degree of distinction. At the same time, making the spatial similarity values of similar texts and dissimilar texts move toward opposite directions. This allows better

178

C. Huang et al.

Table 2 Comparison of different unsupervised models before and after feature extension

| Algorithm | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Skip-gram | 0.704 | 0.718 | 0.670 | 0.693 |
| Skip-gram F | 0.740 | 0.753 | 0.713 | 0.733 |
| FastText | 0.746 | 0.769 | 0.703 | 0.735 |
| FastText F | 0.777 | 0.803 | 0.736 | 0.768 |
| PV-DBOW | 0.770 | 0.782 | 0.750 | 0.765 |
| PV-DBOW F | 0.792 | 0.787 | 0.801 | 0.794 |
| BERT | 0.740 | 0.744 | 0.728 | 0.736 |
| BERT F | 0.762 | 0.808 | 0.685 | 0.741 |
| Glove | 0.764 | 0.782 | 0.794 | 0.788 |
| Glove F | 0.785 | 0.816 | 0.806 | 0.811 |

F: after feature extension

identification of whether short texts are actually similar, rather than relying only on the surface similarity of the texts.

We mainly use Accuracy and F1-score as the criteria for judging the quality of a model: Accuracy reflects the recognition rate of the model, and the F1-score combines Recall and Precision. The higher these values, the better and more stable the model (Table 2).

Before the feature extension algorithm is applied, the Glove model performs best among the five models. Its Recall and F1-score are 0.794 and 0.788, which are 0.044 and 0.023 higher than those of the second-best model, PV-DBOW. In Precision, Glove and PV-DBOW share the top score. In Accuracy, Glove is 0.006 lower than PV-DBOW but still scores higher than the other models. Skip-gram performs worst and ranks last under every evaluation metric. This suggests that the characteristics of short texts are less favorable to training the traditional word2vec model than to training the other models.

After feature extension, almost all the models improve. The improvement is most obvious for skip-gram, which had the worst results before feature extension: all of its evaluation indicators rise by about 4 percentage points. In Recall, the model with the largest improvement, PV-DBOW, is 5.1 percentage points higher than before; its Accuracy and F1-score reach 79.2% and 79.4%, which are 2.2 and 2.9 percentage points higher than before. In Precision, the model with the largest improvement, BERT, improves by 6.4 percentage points; its Accuracy and F1-score reach 76.2% and 74.1%, which are 2.2 and 0.5 percentage points higher than before feature extension. FastText's Accuracy and F1-score reach 77.7% and 76.8%, which are 3.1 and 3.3 percentage points higher than before feature extension. Glove's Accuracy and F1-score reach 78.5% and 81.1%, which are 2.1 and 2.3 percentage points higher than before feature extension.


From the results before and after feature extension, it can be seen that the feature extension algorithm based on Tongyici Cilin improves the Accuracy and F1-score of the various vector space models by about 3%. The improvements for skip-gram and FastText are relatively large, while those for the BERT pre-trained model, the PV-DBOW model, and the Glove model are small. The results show that the feature extension algorithm based on Cilin improves each feature vector algorithm to some degree, and that the extension effect on word vectors is better than that on character vectors.

4 Conclusion

Aiming at the problem of sparse short text features, this paper proposes a feature extension algorithm and applies it to the skip-gram, PV-DBOW, FastText, Glove, and BERT pre-trained models. Comparing the models before and after feature extension shows that feature extension improves Accuracy and F1-score by about 3% while reducing the similarity threshold to a certain extent. The results therefore show that the feature extension algorithm based on Tongyici Cilin is an effective feature extension algorithm for short text similarity. For unsupervised short text similarity algorithms, there is still considerable room for improvement across the whole process, from feature extraction to model construction to the spatial distance algorithm, which is worth exploring. The following directions can be studied:

1. There is some room for simplifying the algorithm in this paper. How to simplify the algorithm is the main topic for future study.
2. The feature extension algorithm based on Tongyici Cilin proposed in this paper achieves a certain improvement in Accuracy, but it mainly focuses on everyday texts. How to perform feature extension for short texts in various professional fields is a direction for continued exploration.

References

1. J. Benesty, J. Chen, Y. Huang, I. Cohen, Pearson correlation coefficient, in Noise Reduction in Speech Processing (Springer, 2009), pp. 1–4
2. J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805
3. P. Jaccard, The distribution of the flora in the alpine zone. 1. New Phytol. 11(2), 37–50 (1912)
4. A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification (2016). arXiv:1607.01759
5. Q. Le, T. Mikolov, Distributed representations of sentences and documents, in International Conference on Machine Learning, PMLR (2014), pp. 1188–1196
6. X. Li, F. Gao, C. Ding, The research of Chinese short-text classification based on domain keyword set extension and HowNet, in International Conference on Intelligent Control and Computer Application (Atlantis Press, 2016)
7. X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, B. Tang, LCQMC: a large-scale Chinese question matching corpus, in Proceedings of the 27th International Conference on Computational Linguistics (2018), pp. 1952–1962
8. Y. Mao, G. Zhang, S. Zhang, Word semantic similarity based on Cilin and word2vec, in 2020 International Conference on Culture-oriented Science & Technology (ICCST) (IEEE, 2020), pp. 304–307
9. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space (2013). arXiv:1301.3781
10. A. Mnih, K. Kavukcuoglu, Learning word embeddings efficiently with noise-contrastive estimation. Adv. Neural Inf. Process. Syst. 26 (2013)
11. J. Pennington, R. Socher, C.D. Manning, Glove: global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1532–1543
12. T. Tohti, S. Li, A. Hamdulla, A text similarity measurement employs semantic dictionary-based sentiment analysis, in 2021 International Conference on Asian Language Processing (IALP) (IEEE, 2021), pp. 358–362
13. Y. Xi-Wei, Feature extension for short text, in Proceedings of the Third International Symposium on Computer Science and Computational Technology (Citeseer, 2010), pp. 338–341
14. Z. Yang, K. Fan, Y. Lai, K. Gao, Y. Wang, Short texts classification through reference document expansion. Chin. J. Electron. 23(2), 315–321 (2014)

Task-Level Consistency Semi-supervised Based Domain Adaptation for Lung Nodules Segmentation

Yifan Zeng, Aohui Pang, Wei Lv, and Xiaolin Zhu

Abstract Pixel-level segmentation labels for volumetric images are expensive and time-consuming to obtain, and deploying a model in a new environment without labeled data is a common case. Domain adaptation can tackle this issue by learning from unlabeled data from the target domain to improve target testing performance. In this paper, we propose an out-of-the-box semi-supervised domain adaptation framework, DTCnnU-Net, which uses dual-task consistency. Our framework is based on nnU-Net, which performs primarily data-driven automatic machine learning across different datasets. We improve the way the level set ground truth is generated so that it handles small objects, and we redesign the loss function to better weight each loss term. Furthermore, we propose dual-task deep supervision to tackle the problem that small objects are invisible in the downsampled ground truth used by the previous deep supervision method. Experiments show that DTCnnU-Net is superior to the state-of-the-art nnU-Net supervised learning framework in domain adaptation of lung nodule segmentation. Our framework improves the target testing Dice by 3.87% compared to the nnU-Net baseline. The source code is available at: https://github.com/XHMY/nnUNet.

Keywords Medical image segmentation · Lung nodule · Domain adaptation · Semi-supervised learning · Task-level consistency

Y. Zeng · A. Pang · W. Lv
Department of Alibaba Cloud Big Data Application, Zhuhai College of Science and Technology, Zhuhai, China

X. Zhu (B)
Faculty of Data Science, City University of Macau, Macau, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_13


1 Introduction

Recently, a considerable body of literature has grown up around the theme of domain adaptation for image segmentation tasks [8, 12, 13]. One such task is lung nodule segmentation, in which each voxel of a scanned volumetric lung image is assigned to nodule or background. This allows quantitative analysis of clinical parameters related to early cancer diagnosis. Deep learning algorithms, which require a large amount of labeled data, have been used extensively for this task [17]. However, pixel-level annotated data is hard to acquire, especially for 3D volumetric images [16]. The lack of labeled data motivates approaches such as semi-supervised learning. In the semi-supervised learning scenario, a dataset D contains two sets of samples: labeled samples $D_s$ and unlabeled samples $D_u$. The goal is to use $D_u$ to improve a classifier f that is constructed using only $D_s$ [9]. In the domain adaptation problem we study, $D_s$ is the training dataset (source domain) and $D_u$ is the test dataset (target domain), where $D_u$ comes from another data distribution.

The most popular convolutional neural network segmentation algorithm in medical image analysis is U-Net [11]. U-Net has two main components: a contracting path (encoder) to capture context and a symmetric expanding path (decoder) to enable precise localization. U-Net combines these two paths using so-called skip connections. A similar approach called 3D U-Net was proposed for volumetric segmentation [14]. Volumetric data is abundant in clinical applications, and the clinical thoracic computed tomography (CT) scans used in this research are also 3D images. In the semi-supervised learning context, the network architecture mostly remains the same as in supervised learning. There are two typical ways to utilize the unlabeled data: the task-agnostic approach and the task-specific approach. In the first approach, unlabeled data is leveraged through unsupervised or self-supervised pretraining. In the second approach, labeled and unlabeled data are leveraged jointly by enforcing a form of regularization [4]. In this work, we use the second approach, in the form of dual-task consistency (DTC) [16], because the first approach requires unlabeled data from the target domain for initial training and a huge amount of unlabeled data to feed the pretraining model, which is not suitable for domain adaptation.

Most current semi-supervised learning work focuses on making use of large amounts of unlabeled data, which is cheaper than labeled data. Our idea is more task-specific: we focus on improving model performance on a new scanner by using unlabeled data from that scanner. In real-world applications, running inference on data from a new environment is a common scenario, for example when a model is trained on a specific dataset and deployed in a different hospital. We cannot expect abundant labeled data from the new environment in which we are going to perform inference, but unlabeled data is relatively easy to obtain. Hence, we can train the existing model with unlabeled data from the target environment. The model can then learn the features of the target environment and improve its segmentation performance on images from this environment.


To this end, we propose an out-of-the-box semi-supervised learning segmentation method. Our method is based on the nnU-Net [6] framework, which adapts easily to various kinds of lung image data and achieves strong segmentation results. In the semi-supervised learning part, we make use of dual-task consistency [1, 16] and separately train on the two sets of data from the different scanners. Our main contributions are:

1. Using our domain adaptation framework DTCnnU-Net, we improve the target domain testing Dice score by 3.87% compared to the nnU-Net baseline.
2. We propose an out-of-the-box semi-supervised domain adaptation segmentation framework that works for lung nodule segmentation tasks.
3. We propose dual-task deep supervision to deal with the problem that small targets are invisible in the downsampled ground truth.
4. We improve the implementation of the level set function so that it handles small objects.
5. We redesign the loss function of the DTC method to better weight each loss term.
6. Our semi-supervised segmentation model is trained on the LIDC-IDRI [10] dataset, which is the largest public lung nodule segmentation dataset. This dataset contains various types of pulmonary nodule lesions and CT images from different manufacturers and hospitals, which makes the segmentation task highly complex. Using this dataset helps us train and test the model on data from different distributions.

2 Related Work

2.1 nnU-Net

nnU-Net is an out-of-the-box framework for semantic segmentation of medical images [6]. When performing semantic segmentation, it automatically completes data preprocessing and model hyperparameter adjustment according to the properties of the dataset. nnU-Net thus avoids the poor detection performance that an imperfect, manually tuned configuration can cause. These properties also give the framework greater generalization ability and robustness, so it copes better with the lung nodule detection scenario in this study. It can dynamically adjust the model according to the available computing resources, which makes actual deployment in hospital scenarios more feasible: the model can adapt to computing equipment of different specifications in different hospitals and make full use of computing resources on both cheap and high-end equipment. When determining the configuration, the nnU-Net framework replaces the traditional, primarily expert-driven method with a primarily data-driven automatic machine learning (AutoML) approach and divides the parameters that need to be configured into the following three categories:

• fixed-parameters: configuration that is fixed across different datasets, such as the model structure, optimizer, loss function, etc.


• rule-based parameters: data preprocessing and model hyperparameters that are adjusted according to the dataset fingerprint, such as resampling strategy, spacing, patch size, batch size, etc.
• empirical parameters: postprocessing and model ensemble selection

In the convolutional neural network (CNN) part of the nnU-Net framework, the U-Net [11] structure suitable for three-dimensional data is adopted. Deep supervision is used during training; that is, for each deep supervision feature map, a correspondingly downsampled ground truth segmentation mask is used for loss computation. According to previous work [7, 19], this method enables shallow layers to be trained more fully and avoids vanishing gradients. The loss function of the nnU-Net framework for the segmentation task is the sum of binary cross-entropy and Dice loss [15, 18]. Empirically, combining the Dice loss with a binary cross-entropy loss improved training stability and segmentation accuracy. The binary cross-entropy is defined as:

$$L_{BCE} = -\sum_i \left[ y_i \cdot \log(P_i) + (1 - y_i) \cdot \log(1 - P_i) \right] \tag{1}$$

The Dice loss is defined as:

$$L_{Dice} = -\frac{2 \sum_i P_i \cdot y_i}{\sum_i P_i + \sum_i y_i} \tag{2}$$
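The two loss terms of Eqs. 1 and 2 can be sketched in a few lines of plain Python. This is an illustration rather than the nnU-Net implementation: it assumes that `p` holds per-voxel foreground probabilities $P_i$ and `y` holds binary labels $y_i$, and the small constant `eps` is an added safeguard against `log(0)` and division by zero that does not appear in the equations.

```python
import math


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy of Eq. 1, summed over all voxels."""
    total = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to keep log() finite (assumption)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total


def dice_loss(p, y, eps=1e-7):
    """Soft Dice loss of Eq. 2; it approaches -1 for a perfect prediction."""
    intersection = sum(pi * yi for pi, yi in zip(p, y))
    return -2.0 * intersection / (sum(p) + sum(y) + eps)


def seg_loss(p, y):
    """Segmentation loss as the sum of the two terms."""
    return bce_loss(p, y) + dice_loss(p, y)


labels = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # mostly correct probabilities
bad = [0.1, 0.9, 0.2, 0.8]    # mostly wrong probabilities

loss_good = seg_loss(good, labels)
loss_bad = seg_loss(bad, labels)
```

With these toy values the loss of the mostly correct prediction is lower than that of the mostly wrong one, as expected from both terms.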

2.2 Dual-Task Consistency (DTC)

Dual-Task Consistency (DTC) is a semi-supervised method for image segmentation [1, 16]. Typical semi-supervised methods use the consistency of the output to update the model: they encourage the model to produce smooth outputs when the same input is perturbed, which is a kind of data-level consistency. The DTC method we adopted instead encourages consistency between the outputs of the two tasks we define, which is a kind of task-level consistency. Compared with data-level consistency, task-level consistency enables the model to learn representations at various levels and improves the semi-supervised learning effect of the model. The two tasks defined in the DTC method are the segmentation task and the level set regression task. The segmentation task is lung nodule segmentation. The level set regression task predicts the level set value of the segmentation. This method does not require additional manual labeling of level set values. It produces the level set ground truth from a level set function T(x) defined as:

$$T(x) = \begin{cases} -\inf_{y \in \partial S} \|x - y\|_2, & x \in S_{in} \\ 0, & x \in \partial S \\ +\inf_{y \in \partial S} \|x - y\|_2, & x \in S_{out} \end{cases} \tag{3}$$
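A brute-force sketch of Eq. 3 on a small 2D binary mask may help make the definition concrete. It rests on two assumptions that the equation leaves open: the contour ∂S is taken to be the set of foreground pixels with at least one 4-connected background (or out-of-image) neighbour, and no normalization of the distances is applied. The function name `level_set` is hypothetical.

```python
import math


def level_set(mask):
    """Signed distance T(x) of Eq. 3 for a 2D binary mask (list of rows of 0/1)."""
    h, w = len(mask), len(mask[0])

    def on_contour(r, c):
        # Assumption: a foreground pixel with a background or out-of-image
        # 4-neighbour belongs to the contour.
        if not mask[r][c]:
            return False
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                return True
        return False

    contour = [(r, c) for r in range(h) for c in range(w) if on_contour(r, c)]
    if not contour:
        raise ValueError("mask has no foreground, so the contour is empty")

    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Euclidean distance to the nearest contour pixel.
            d = min(math.hypot(r - br, c - bc) for br, bc in contour)
            if d > 0:
                out[r][c] = -d if mask[r][c] else d  # negative inside, positive outside
    return out


square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
t_square = level_set(square)
```

In this toy example only the centre pixel lies strictly inside the object and receives a negative value, the eight surrounding foreground pixels form the zero level set, and background pixels receive positive distances.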

When training on unlabeled data, we need $T^{-1}(x)$ to generate pseudo labels for the segmentation task from the prediction of the level set regression task. However, it is impractical to integrate the exact inverse transform of T(x) into training because it is non-differentiable. Hence, a smooth approximation to the inverse transform of the level set function is used as $T^{-1}$, which is defined as:

$$T^{-1}(z) = \frac{1}{1 + e^{-k \cdot z}} = \sigma(k \cdot z) \tag{4}$$
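The steep sigmoid of Eq. 4 can be sketched as below. With a steepness as large as k = 1500, evaluating `math.exp(-k * z)` directly overflows double precision for z near −1, so this illustration uses the usual two-branch form of the sigmoid; that rewrite is an implementation detail assumed here, not something stated in the text. Note also that, as written, σ(k·z) tends to 0 for negative z, so which region ends up labeled 1 depends on the sign convention chosen for the level set.

```python
import math

K = 1500  # steepness value quoted in the text


def inverse_level_set(z, k=K):
    """Smooth approximation sigma(k * z) of Eq. 4, evaluated without overflow."""
    x = k * z
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)           # x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)


# Level set predictions in [-1, 1] are mapped to values that are almost binary.
pseudo_labels = [inverse_level_set(z) for z in (-1.0, -0.01, 0.0, 0.01, 1.0)]
```

Because the function is differentiable everywhere, gradients can flow from the consistency loss back into the regression head.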

We set k = 1500, as in the original work. When training the model, conditional DTC performs fully supervised or semi-supervised training according to whether the current batch is labeled:

1. Without ground truth (semi-supervised): $T^{-1}$ is used to compute pseudo labels for the segmentation task from the level set prediction. The pseudo labels and the segmentation prediction are then used to compute the consistency loss $L_{DTC}$, which is the batch loss.
2. With ground truth (supervised): The segmentation ground truth and the level set ground truth are used to compute the losses of the two task predictions. The consistency loss $L_{DTC}$ is then obtained by following the step above. The batch loss is defined as:

$$L_{total} = L_{Seg} + L_{LSF} + \lambda_d L_{DTC} \tag{5}$$

Here $L_{Seg} + L_{LSF}$ is the loss of the two task predictions and $\lambda_d$ is a time-dependent Gaussian warming-up function [3, 20]. In short, because there is no ground truth for computing the task losses in semi-supervised learning, we compute $L_{DTC}$ based on task consistency; if the ground truth exists, we compute the losses of the two tasks and add $L_{DTC}$.
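The branching described above can be sketched as follows. The text does not give the exact form of the Gaussian warming-up function, so `gaussian_rampup` uses exp(−5(1 − t)²), a form that is common in the consistency-regularization literature and is assumed here for illustration only. The function names and arguments are hypothetical, and the individual loss values are taken as already computed.

```python
import math


def gaussian_rampup(step, rampup_steps, max_weight=1.0):
    """Assumed Gaussian warm-up for lambda_d: rises smoothly from ~0 to max_weight."""
    if rampup_steps <= 0:
        return max_weight
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)


def batch_loss(l_dtc, lambda_d, l_seg=None, l_lsf=None):
    """Eq. 5 for a labeled batch; the consistency loss alone for an unlabeled one."""
    labeled = l_seg is not None and l_lsf is not None
    if labeled:
        return l_seg + l_lsf + lambda_d * l_dtc
    return l_dtc


lam = gaussian_rampup(step=50, rampup_steps=100)
loss_labeled = batch_loss(l_dtc=0.4, lambda_d=0.5, l_seg=1.0, l_lsf=2.0)
loss_unlabeled = batch_loss(l_dtc=0.4, lambda_d=0.5)
```

Keeping the weight small early in training limits the influence of the consistency term while the two task heads are still unreliable.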

3 Method

3.1 Dataset

The dataset used in this paper is the Lung Image Database Consortium image collection (LIDC-IDRI). The LIDC-IDRI dataset consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. Since we need to study the domain adaptation of the model under different distributions, we extract two types of data from the LIDC-IDRI dataset: scans acquired with SIEMENS devices and scans acquired with GE MEDICAL SYSTEMS devices. Since the scanning


Fig. 1 Lung images and their corresponding annotated slices from two different scanning devices in the LIDC-IDRI dataset

Table 1 Number of samples

| Manufacturer | Total number | Available number* | Training set | Testing set |
|---|---|---|---|---|
| GE MEDICAL SYSTEMS | 557 | 350 | 315 | 35 |
| SIEMENS | 187 | 134 | 120 | 14 |

*Samples in which the number of nodules with diameter ≥ 3 mm is more than 1

devices are different, we assume that the data come from different distributions. Figure 1 shows CT image slices produced by the two scanning devices. The upper row of Fig. 1 shows 2D slices of the lung images, and the lower row shows the corresponding nodule annotations. The white area of an annotation represents the nodule, and the black area represents the background without nodules. The two slice samples on the left of Fig. 1 were generated by SIEMENS scanning equipment, and the two on the right by GE MEDICAL SYSTEMS scanning equipment. Compared to other medical image segmentation tasks, we consider the segmentation of lung nodules to be more difficult. As Fig. 1 shows, a nodule occupies only a small area of the entire image, and nodules have irregular shapes with blurred boundaries. It is difficult for people without domain knowledge to identify the location of lung nodules in the image. There are 1018 cases in the LIDC-IDRI dataset. We only study the samples whose modality is CT in this paper. The numbers of samples from the two scanner manufacturers are shown in Table 1. Nodules with a diameter ≥ 3 mm are more meaningful for clinical application, so we only used the samples that had more than one nodule with a diameter ≥ 3 mm. We divided our dataset into two parts: samples whose scanner manufacturer is GE MEDICAL SYSTEMS ($D_{GE}$) and samples whose scanner manufacturer is SIEMENS ($D_{SI}$). We studied the domain adaptation problem of semi-supervised training of our


model on $D_{SI}$ with labels and on $D_{GE}$ without labels to improve the testing performance on $D_{GE}$. $D_{SI}$ is the source domain, and $D_{GE}$ is the target domain. The labels of $D_{GE}$ are unseen by the network during training. As shown in Table 1, we split off 10% of the samples from each of $D_{GE}$ and $D_{SI}$ as the testing set. The remaining 90% of the samples formed the training set, which was used in training with 5-fold cross-validation.
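The split described above can be sketched as follows. The helper `split_cases`, the shuffling seed, and the rounding rule are assumptions made for illustration; rounding the test size up is chosen only because it reproduces the counts in Table 1 (35 of 350 and 14 of 134).

```python
import random


def split_cases(case_ids, test_percent=10, n_folds=5, seed=0):
    """Hold out a test set, then partition the rest into cross-validation folds."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split (assumption)
    # Integer ceiling of len(ids) * test_percent / 100 (assumed rounding rule).
    n_test = (len(ids) * test_percent + 99) // 100
    test, train = ids[:n_test], ids[n_test:]
    # Fold i holds every n_folds-th training case and serves once as validation data.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return train, test, folds


train_ge, test_ge, folds_ge = split_cases(range(350))   # available GE cases
train_si, test_si, folds_si = split_cases(range(134))   # available SIEMENS cases
```

Each fold in turn would serve as the validation set while the model is trained on the other four.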

3.2 DTC Semi-supervised Based Domain Adaptation

To allow the model to perform semi-supervised domain adaptation, we combined the DTC semi-supervised mechanism with the nnU-Net framework [16], as shown in Fig. 2. A model with the DTC semi-supervised mechanism predicts the segmentation and the level set value at the same time, whereas the U-Net module of the nnU-Net framework can only predict segmentation results. We therefore add a regression head to the U-Net for the level set regression task, as shown in Fig. 2. The regression head contains only one 1 × 1 × 1 convolutional layer. This design adds only a small number of learnable parameters to the network, which helps prevent overfitting. We want to force the network to learn the level set regression task mainly through its shared

Fig. 2 Overview of the DTCnnU-Net semi-supervised training process. The far left of the figure is the lung CT image preprocessed by nnU-Net. During training, it is first passed through the 3D U-Net in nnU-Net. We added regression heads to the network to perform the level set regression task. The newly added regression head is placed in parallel with the segmentation head of the original network. The regression head contains a 1 × 1 × 1 convolutional layer (the orange rectangle in the figure) followed by a tanh layer to match the output range of the level set regression task. The different outputs of the two heads correspond to different tasks. Two kinds of task loss and the consistency loss are calculated as shown in the yellow rectangle. The thin pink line represents the semi-supervised learning process of DTC. The yellow rectangle represents the step of calculating the loss, and the gray rectangle represents a transformation through a function without learnable parameters


Fig. 3 The level set results of inner and outer boundary drawing methods

parameters with the segmentation task. The regression head also contains a tanh function after the convolutional layer. This constrains the output to the range [−1, 1], which matches the output range of the level set function. We also redesigned the DTC loss function by adding weight parameters to $L_{Seg}$, $L_{LSF}$, and $L_{DTC}$ as follows:

$$L_{total} = (1 - \beta) \left( (1 - \alpha) L_{Seg} + \alpha L_{LSF} \right) + \beta \lambda_d L_{DTC} \tag{6}$$

Here α and β control the weights of the level set regression task and the consistency loss. By comparing multiple settings of α and β, we found that the Dice score estimated over the first 50 epochs reached its highest value when α = 0.3 and β = 0.2. In the level set function T(x), ∂S is the zero level set and represents the contour of the target object. The original DTC method implemented T(x) by drawing the boundary inside the nodule. We found that this boundary drawing method resulted in poor segmentation performance because it does not take small objects into account: the zero-valued boundary overwrites the level set values inside the nodule. When a nodule is small, the entire nodule is filled by the zero-valued boundary, so the minimum level set value in this nodule is zero, which does not satisfy the condition that the minimum level set value must be −1. We therefore changed the boundary drawing method to draw the boundary outside the nodule, as shown in Fig. 3c. This preserves the level set values inside small nodules.
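The effect of the two boundary drawing methods on a small object can be reproduced with a one-dimensional toy example. This is a sketch under stated assumptions, not the authors' implementation: the "inner" contour is taken as the foreground positions that touch background, the "outer" contour as the background positions that touch foreground, and distances are normalized separately inside and outside the object so that the values lie in [−1, 1].

```python
def level_set_1d(mask, boundary="outer"):
    """Normalized signed distance for a 1D binary mask, negative inside the object."""
    n = len(mask)

    def touches(i, value):
        return any(0 <= j < n and mask[j] == value for j in (i - 1, i + 1))

    if boundary == "inner":
        contour = [i for i in range(n) if mask[i] == 1 and touches(i, 0)]
    elif boundary == "outer":
        contour = [i for i in range(n) if mask[i] == 0 and touches(i, 1)]
    else:
        raise ValueError("boundary must be 'inner' or 'outer'")
    if not contour:
        raise ValueError("mask needs both foreground and background")

    dist = [min(abs(i - b) for b in contour) for i in range(n)]
    max_in = max((d for d, m in zip(dist, mask) if m == 1), default=0)
    max_out = max((d for d, m in zip(dist, mask) if m == 0), default=0)

    values = []
    for d, m in zip(dist, mask):
        if d == 0:
            values.append(0.0)            # on the contour
        elif m == 1:
            values.append(-d / max_in)    # inside, scaled to [-1, 0)
        else:
            values.append(d / max_out)    # outside, scaled to (0, 1]
    return values


small_nodule = [0, 0, 0, 1, 1, 0, 0, 0]   # a nodule that is two voxels wide
inner = level_set_1d(small_nodule, "inner")
outer = level_set_1d(small_nodule, "outer")
```

With the inner contour, both nodule voxels lie on the contour and the minimum value is 0, which is the failure case described above; with the outer contour, the nodule voxels keep the value −1.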

3.3 Dual-Task Deep Supervision

To train deeper networks, auxiliary supervision branches can be added after certain intermediate layers during training, which is called deep supervision [7, 19]. In addition to the deep supervision training of the segmentation task, we also used deep supervision to train the level set regression task, as shown in Fig. 4. We think this approach can guide the model to learn small objects such as lung nodules. We produced four downsampled versions of the original ground truth when performing deep supervision training. The downsampled versions have 1/2, 1/4, 1/8, 1/16, and 1/32 of the resolution of the original ground truth. The size of the downsampled ground truth


Fig. 4 Dual-task deep supervision. The DTCnnU-Net network is on the left. In addition to the final output, the network also outputs four downsampled predictions at different scales from intermediate layers. Four downsampled versions of the ground truth are produced from the original ground truth during data preprocessing, as shown on the right. The size of each ground truth matches the corresponding output size of the network. The loss of each output is computed against the corresponding ground truth, and the weighted sum is the final loss

will match the intermediate-layer outputs of the network in the nnU-Net framework. This makes it possible to compute the output loss of certain intermediate layers during training. However, small nodules disappear in the downsampled segmentation ground truth. We can see two nodules in the 192 × 192 segmentation ground truth of Fig. 4. When it is downsampled to 48 × 48, only the bigger nodule remains, and both nodules have disappeared when it is downsampled to 24 × 24. Most of the nodules in our dataset were smaller than 16 voxels, which means they would not appear in the last two segmentation ground truths. This would confuse the network about where the nodule is. We propose dual-task deep supervision to tackle this issue: we add the level set regression task to the deep supervision process. The level set ground truth always keeps the information about the existing nodule in every downsampled version, as shown in Fig. 4. Even when downsampled to 12 × 12, the nodules are still visible in the level set ground truth. This can guide the lower layers of the network to recognize the position of the nodules even when a nodule is invisible at the current scale.
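A one-dimensional toy example illustrates why the level set ground truth survives downsampling while the segmentation mask does not. The strided sampling below is a simple stand-in for nearest-neighbour resizing and is an assumption of this sketch; the helper names are hypothetical, and the signed distance uses the outer contour without normalization.

```python
def signed_distance(mask):
    """Signed distance to the contour of a 1D binary mask, negative inside the object."""
    n = len(mask)
    # Assumption: the contour consists of background positions that touch the object.
    contour = [
        i for i in range(n)
        if not mask[i] and any(0 <= j < n and mask[j] for j in (i - 1, i + 1))
    ]
    if not contour:
        raise ValueError("mask needs both foreground and background")
    return [(-1 if mask[i] else 1) * min(abs(i - c) for c in contour) for i in range(n)]


def downsample(values, stride):
    """Keep every stride-th sample (a stand-in for nearest-neighbour resizing)."""
    return values[::stride]


mask = [0] * 16
mask[5] = mask[6] = 1                      # a nodule that is two voxels wide

mask_ds = downsample(mask, 4)              # keeps positions 0, 4, 8, 12
lsf_ds = downsample(signed_distance(mask), 4)
```

After downsampling, the mask contains no foreground at all, whereas the downsampled level set still reaches its minimum next to the nodule and therefore still indicates where the object lies.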


4 Evaluation Metrics

4.1 Dice Similarity Coefficient

The Dice Similarity Coefficient (Dice) measures the overlap between segmentation results and ground truths. It is computed as follows:

$$Dice = \frac{2 |X \cap Y|}{|X| + |Y|} \tag{7}$$

where |X| and |Y| are the numbers of elements in each set. In a segmentation context, X and Y denote the sets of foreground voxels in the annotation and in the segmentation result, respectively.

4.2 Precision and Recall

Precision is the proportion of positive identifications that were correct. It is computed as follows:

$$Precision = \frac{TP}{TP + FP} \tag{8}$$

Recall is the proportion of actual positives that were identified correctly. It is computed as follows:

$$Recall = \frac{TP}{TP + FN} \tag{9}$$

A false positive (FP) is an outcome where the model incorrectly predicts the positive class. A false negative (FN) is an outcome where the model incorrectly predicts the negative class. A true positive (TP) is an outcome where the model correctly predicts the positive class.
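Eqs. 7 to 9 can be computed from voxel-wise counts as sketched below. The masks are flattened lists of 0/1 values; returning 1.0 for Dice when both masks are empty and 0.0 for an undefined precision or recall are conventions assumed here, since the text does not specify these edge cases.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives voxel by voxel."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def dice(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn          # equals |X| + |Y|
    return 2 * tp / denom if denom else 1.0   # assumed convention for empty masks


def precision(pred, truth):
    tp, fp, _ = confusion_counts(pred, truth)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(pred, truth):
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if tp + fn else 0.0


truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]       # TP = 2, FP = 1, FN = 1
scores = (dice(pred, truth), precision(pred, truth), recall(pred, truth))
```

For this toy case all three metrics equal 2/3, because the prediction has one missed voxel and one spurious voxel.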

5 Experiments and Results

5.1 Experimental Platform

We conducted exploratory experiments on a computing platform with 4 Tesla V100 SXM2 GPUs. In addition, we used another 14 RTX 3090 GPUs through cloud computing to speed up the training of the 5-fold experiment. These GPUs support float16 half-precision operations, so automated mixed-precision


training can be used to speed up model training. Our experiments took about 32 days of single-GPU time in total. Because of the added level set regression task, the CPU load in the data loading stage is significantly higher than in the original nnU-Net. Our CPUs were Intel(R) Xeon(R) E5-2698 v4 and Intel(R) Xeon(R) Gold 6330. We wanted to ensure that the data loading stage would not become a performance bottleneck. In the experiments, we used Weights and Biases [2] to track experiments. It made it convenient to visualize the loss and results during training, record the parameter configuration, and back up the model files.

5.2 Applying nnU-Net

We adapted the nnU-Net framework [6] to our task. With the help of nnU-Net's primarily data-driven AutoML approach, we made configurations for the LIDC-IDRI dataset.

5.2.1 Fixed-Parameters

The model used the U-Net architecture suitable for 3D data. To perform dual-task consistency in the semi-supervised stage, we added a regression head to the model and added the $L_{DTC}$ term to the original loss function. We used a poly learning rate schedule (initial learning rate 0.01) and an SGD optimizer with Nesterov momentum (μ = 0.95). Data augmentation mainly included rotations, scaling, and gamma correction. The number of training epochs is 1000, and the number of mini-batches per epoch is fixed at 250.
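A poly learning rate schedule of the kind mentioned above can be sketched as follows. The initial rate of 0.01 and the 1000 epochs come from the text; the exponent of 0.9 is an assumption (a commonly used value for this schedule) because the text does not state it.

```python
def poly_lr(epoch, max_epochs=1000, initial_lr=0.01, exponent=0.9):
    """Polynomial decay: initial_lr * (1 - epoch / max_epochs) ** exponent."""
    if not 0 <= epoch <= max_epochs:
        raise ValueError("epoch must lie in [0, max_epochs]")
    return initial_lr * (1 - epoch / max_epochs) ** exponent


# Learning rates at a few points of the 1000-epoch schedule.
schedule = [poly_lr(e) for e in (0, 250, 500, 750, 1000)]
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final epoch.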

5.2.2 Rule-Based Parameters

The nnU-Net framework automatically sets preprocessing methods, model hyperparameters, and network topology according to the characteristics of the dataset. For the fairness of the experiment, we ensured that the model configured these parameters using only $D_{SI}$. Some of the parameters that are automatically configured during this process are shown in Table 2. nnU-Net automatically adjusts the configuration according to the available GPU memory when setting the batch size and patch size. We use a configuration scheme based on 8 GB of GPU memory, which means that the model consumes up to 8 GB of GPU memory while it is working. Although our hardware would have allowed a configuration scheme using 32 GB of GPU memory, according to the official nnU-Net recommendation, doing so has not been shown to benefit model performance consistently. A smaller GPU memory configuration scheme also speeds up model training, so we use the 8 GB configuration scheme.


Fig. 5 One of the patches in nnU-Net after a 192 × 192 patch is applied to a 512 × 512 image slice

The nnU-Net does not input the entire image into the network during training. It splits the image into many patches as input of the network (shown in Fig. 5). The purpose of patching is to reduce the consumption of GPU resources during training. In this study, nnU-Net automatically set the patch size to 96 × 192 × 192 through the AutoML mechanism. The figure shows an example with a Patch Size of 192 × 192 on a 2D slice of a lung image. The input Patch for actual network training will be a 3D image with a size of 96 × 192 × 192.
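To give a feel for how a fixed-size patch covers a larger image axis, the sketch below computes evenly spaced patch origins along one axis so that every voxel is covered. It is a hypothetical helper written for this illustration, not nnU-Net's own patch sampling code, and the `max_step_fraction` parameter (the largest allowed step as a fraction of the patch size) is an assumption.

```python
import math


def patch_starts(image_size, patch_size, max_step_fraction=1.0):
    """Evenly spaced patch origins along one axis that cover the whole axis."""
    if patch_size <= 0 or max_step_fraction <= 0:
        raise ValueError("patch_size and max_step_fraction must be positive")
    if patch_size >= image_size:
        return [0]
    target_step = patch_size * max_step_fraction
    # Smallest number of patches whose spacing does not exceed the target step.
    n = math.ceil((image_size - patch_size) / target_step) + 1
    step = (image_size - patch_size) / (n - 1)
    return [round(i * step) for i in range(n)]


# A 192-wide patch on a 512-wide slice, without and with 50% overlap.
starts_tiled = patch_starts(512, 192)
starts_overlap = patch_starts(512, 192, max_step_fraction=0.5)
```

Under these assumptions three patches per axis are enough to cover a 512-voxel axis, and the last patch ends exactly at the image border.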

5.2.3 Empirical Parameters

We used the low-resolution configuration of the 3D U-Net model in our experiments. Coarse segmentation maps are learned by a 3D U-Net that operates on low-resolution data. We adopted the low-resolution 3D U-Net to simplify the domain adaptation we study. The low-resolution configuration can significantly speed up training compared with full resolution. By comparing the experimental results of these two resolution configurations on the same set of data on the 3D U-Net model, we found that the metrics of the two models differed only slightly. We believe that the low-resolution 3D U-Net can fairly reflect the performance of the model.

Table 2 Number of samples

Names                   Configurations
Batch size              2
Patch size              96 × 192 × 192
num_pool_per_axis       [4, 5, 5]
pool_op_kernel_sizes    [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]]
conv_kernel_sizes       [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
conv_per_stage          2

Fig. 6 3D visualization of the segmentation result comparison between baseline and our DTCnnU-Net

5.3 Result

Figure 6 shows the 3D visualization of nodule segmentation results after applying our DTCnnU-Net framework. We chose a nodule sample from the inference results on the $D_{GE}$ testing set. Figure 6a is the ground truth, Fig. 6b is predicted by the nnU-Net baseline, and Fig. 6c is predicted by our DTCnnU-Net. Figure 6c shows a significant improvement over Fig. 6b and is more similar to Fig. 6a. We only took the segmentation task into account when studying the experimental results. The level set regression task was used as an intermediate process and is not presented in the results.

We present the experimental results in Table 3. The experiments in the table were designed to verify that our domain adaptation method works on this lung nodule segmentation task. We compared the metrics of our DTCnnU-Net with those of the nnU-Net baseline and three other popular segmentation models [5, 11, 14]. The metrics used for comparison are Dice, Precision, and Recall. We calculated the metrics on the testing sets of $D_{SI}$ and $D_{GE}$. Both the images and the labels of the testing sets are unseen during the training process. Comparing the results in Table 3, we found that our DTCnnU-Net improved the Dice score by 3.87% compared to the nnU-Net baseline on the target domain $D_{GE}$.

Table 3 Evaluation results

                          D_SI Testing                           D_GE Testing (Target)
Model                     Dice (%)  Recall (%)  Precision (%)    Dice (%)  Recall (%)  Precision (%)
2D U-Net [11]             14.75     16.75       21.36            1.96      1.11        13.82
3D U-Net [14]             47.57     62.79       43.85            53.50     73.21       47.62
DeepLabv3 [5]             38.08     41.48       45.00            33.53     27.95       61.82
nnU-Net (baseline) [6]    65.26     60.87       81.14            56.40     51.24       73.98
DTCnnU-Net (ours)         67.78     63.39       83.91            60.27     54.34       75.33
Improvements              2.52      2.52        2.77             3.87      3.10        1.35

We recorded the segmentation task loss, the level set regression task loss, and the consistency loss of each epoch during training to study the convergence of the model, as shown in Fig. 7. Since we used 5-fold cross-validation, we averaged the 5 losses in each epoch and plotted them on the graph; the shading under each loss curve represents the maximum and minimum loss of the 5 folds. The solid line in the loss curve represents the training loss, and the dotted line represents the validation loss. The blue line in the figure represents the segmentation task loss, which is the sum of the Dice loss and the cross-entropy loss. The other losses are mean squared error losses, so they have different ranges. The segmentation task loss continued to decline throughout the process. The orange level set regression task loss dropped less during training.

Fig. 7 DTCnnU-Net training loss in different terms
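For reference, the Dice, precision, and recall values reported in Table 3 follow the usual overlap definitions for binary masks. The sketch below computes them on flattened 0/1 masks; the function name and the conventions chosen for empty masks are assumptions of this example, not part of the evaluation code used in the experiments.

```python
def segmentation_metrics(pred, target):
    """Dice, precision and recall for two flat binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    # Assumed conventions for empty masks: Dice is 1.0 when both masks are
    # empty, precision and recall are 0.0 when their denominator is zero.
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return dice, precision, recall


if __name__ == "__main__":
    prediction = [1, 1, 0, 0, 1]
    ground_truth = [1, 0, 0, 1, 1]
    print(segmentation_metrics(prediction, ground_truth))
```

Because small nodules occupy very few voxels, a handful of false positives or false negatives changes these scores noticeably, which is one reason all three metrics are reported side by side.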

6 Conclusion

In this paper, we have presented an out-of-the-box, task-level consistency based semi-supervised framework for domain adaptation. We improved the DTC mechanism and implemented it within the nnU-Net framework; we call the result DTCnnU-Net. The DTCnnU-Net can simultaneously predict a segmentation result and a level set representation of the segmentation. The consistency of the two predictions can be used to leverage the unlabeled data from the target domain. We adapted our method for small segmentation objects such as lung nodules. The dual-task deep supervision was used to tackle the problem that the previous deep supervision method makes small objects invisible in the downsampled ground truth. Our method can guide the model to learn small objects by passing the global-level shape and geometric information into the intermediate layers. We also improved the way of generating the level set ground truth by preventing the boundary from overriding the level set value inside small objects. To build a semi-supervised training framework, we added regression heads to the network. We also redesigned the loss function to better control the weights of each term. We conducted experiments on the LIDC-IDRI dataset, which is a dataset for lung nodule segmentation, and achieved a 3.87% Dice score improvement compared to the nnU-Net baseline.

References

1. S.G. Armato III, G. McLennan, L. Bidaut, M.F. McNitt-Gray, C.R. Meyer, A.P. Reeves, B. Zhao, D.R. Aberle, C.I. Henschke, E.A. Hoffman, The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Med. Phys. 38(2), 915–931 (2011)
2. L. Biewald, Experiment tracking with weights and biases (2020). https://www.wandb.com/, software available from wandb.com
3. L.C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation (2017). arXiv:1706.05587
4. Q.Q. Chen, Z.H. Sun, C.F. Wei, E.Q. Wu, D. Ming, Semi-supervised 3d medical image segmentation based on dual-task consistent joint learning and task-level regularization. IEEE/ACM Trans. Comput. Biol. Bioinform. (2022)
5. V. Cheplygina, M. de Bruijne, J.P.W. Pluim, Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med. Image Anal. 54, 280–296 (2019). https://doi.org/10.1016/j.media.2019.03.009
6. Ö. Çiçek, A. Abdulkadir, S.S. Lienkamp, T. Brox, O. Ronneberger, 3d u-net: learning dense volumetric segmentation from sparse annotation, in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2016), pp. 424–432
7. M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, C. Pal, The importance of skip connections in biomedical image segmentation, in Deep Learning and Data Labeling for Medical Applications (Springer, 2016), pp. 179–187
8. H. Guan, M. Liu, Domain adaptation for medical image analysis: a survey. IEEE Trans. Biomed. Eng. 69(3), 1173–1185 (2022). https://doi.org/10.1109/tbme.2021.3117407
9. F. Isensee, P.F. Jaeger, S.A.A. Kohl, J. Petersen, K.H. Maier-Hein, nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021). https://doi.org/10.1038/s41592-020-01008-z
10. C.Y. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, in Artificial Intelligence and Statistics, PMLR (2015), pp. 562–570
11. Y. Li, L. Yuan, N. Vasconcelos, I.C. Soc, Bidirectional learning for domain adaptation of semantic segmentation, in 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 6929–6938. https://doi.org/10.1109/cvpr.2019.00710
12. Z. Li, R. Togo, T. Ogawa, M. Haseyama, Unsupervised domain adaptation for semantic segmentation with symmetric adaptation consistency, in IEEE International Conference on Acoustics, Speech, and Signal Processing, International Conference on Acoustics Speech and Signal Processing ICASSP (IEEE, 2020), pp. 2263–2267
13. G. Litjens, T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A.W.M. van der Laak, B. van Ginneken, C.I. Sánchez, A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017). https://doi.org/10.1016/j.media.2017.07.005
14. X. Luo, J. Chen, T. Song, G. Wang, Semi-supervised medical image segmentation through dual-task consistency. Proc. AAAI Conf. Artif. Intel. 35, 8801–8809 (2021)
15. F. Milletari, N. Navab, S.A. Ahmadi, V-net: fully convolutional neural networks for volumetric medical image segmentation, in 2016 Fourth International Conference on 3D Vision (3DV) (IEEE, 2016), pp. 565–571
16. J. Peng, Y. Wang, Medical image segmentation with limited supervision: a review of deep network models. IEEE Access 9, 36,827–36,851 (2021). https://doi.org/10.1109/ACCESS.2021.3062380
17. O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation, in Springer International Publishing, Medical Image Computing and Computer-Assisted Intervention, MICCAI (2015), pp. 234–241
18. A. Tarvainen, H. Valpola, Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 30 (2017)
19. L. Wang, C.Y. Lee, Z. Tu, S. Lazebnik, Training deeper convolutional networks with deep supervision (2015). arXiv:1505.02496
20. L. Yu, S. Wang, X. Li, C.W. Fu, P.A. Heng, Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2019), pp. 605–613

Malaria Blood Smears Object Detection Based on Convolutional DCGAN and CNN Deep Learning Architectures

Francisco Nauber Bernardo Gois, João Alexandre Lobo Marques, Allberson Bruno de Oliveira Dantas, Márcio Costa Santos, José Valdir Santiago Neto, José Antônio Fernandes de Macêdo, Wencai Du, and Ye Li

Abstract Fast and efficient malaria diagnostics are essential in efforts to detect and treat the disease in a timely manner. The standard approach to diagnosing malaria is a microscope exam, which is subject to subjective interpretation. Thus, automating the diagnosis process with an intelligent system capable of recognizing malaria parasites could aid in the early treatment of the disease. Usually, laboratories capture a minimum set of low-quality images using a system of microscopes based on mobile devices. Due to the poor quality of such data, conventional algorithms do not process those images properly. This paper presents the application of deep learning techniques to improve the accuracy of malaria plasmodium detection in this context. In order to increase the amount of training data, deep convolutional generative adversarial networks (DCGAN) were used to generate reliable training data that were introduced in our deep learning model to improve accuracy. A total of 6 experiments were performed, and a synthesized dataset of 2,200 images was generated by the DCGAN for the training phase. For a real image database with 600 blood smears with malaria plasmodium, the proposed deep learning architecture obtained an accuracy of 100% for plasmodium detection. The results are promising, and the solution could be employed to support a mass medical diagnosis system.

F. N. B. Gois · J. A. L. Marques (B) · W. Du
University of Saint Joseph, Macau SAR, China
F. N. B. Gois
Controllership and General Ombudsman of the State of Ceara, Fortaleza, Ceara, Brazil
J. A. L. Marques
Institute of Distance Education, University for the International Integration of the Afro-Brazilian Lusophony, Liberdade Campus, Redenção, Brazil
A. B. de Oliveira Dantas
Shenzhen Institutes of Advanced Technology/Chinese Academy of Sciences, Shenzhen, China
M. C. Santos
Department of Computer Science, Federal University of Ceará, Russas Campus, Russas, Brazil
J. V. S. Neto
Repair Pricer, Austin, TX, USA
J. A. F. de Macêdo
Science Center, Department of Computer Science, Federal University of Ceará, Fortaleza, Brazil
Y. Li
Shenzhen Institutes of Advanced Technology/Chinese Academy of Sciences, Shenzhen, China

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2_14

1 Introduction

The global burden of malaria is enormous. In 2012, the World Health Organization (WHO) estimated that at least 247 million people worldwide suffered from malaria and that more than two billion people, or 42% of the world population, were at risk of contracting malaria because they live in malaria-endemic areas, with 627,000 cases resulting in death among African children. In the Philippines, for example, malaria is considered to be the 9th leading cause of morbidity, with 58 out of the 81 provinces being malaria-endemic.

Malaria is a disease caused by a protozoan parasite of the genus plasmodium that infects the erythrocytes of patients. One of the plasmodium species that infects humans is plasmodium falciparum. This species is the most severe because in a short time it can invade a large number of erythrocytes and cause several complications in the body's organs, and even death. The patients, in this case, are characterized by a variety of organ dysfunctions [1].

Microscope analysis of blood smear images plays a very important role in the characterization of erythrocytes across the spectrum of malaria parasites, since the characteristics of erythrocyte alterations vary according to the malaria parasite responsible for the infection. The microscopic features of the erythrocyte include morphology, intensity and texture [2].

Among the major obstacles to malaria eradication are the remote location of the majority of malaria cases and the lack of trained individuals to analyze blood samples using a microscope. The gold standard test for malaria is to prepare a blood smear on a glass slide, stain it, and examine it under a microscope. While several fast diagnostic tests are also currently available, they still have disadvantages compared to microscope analysis [3–5]. The microscopic analysis of the blood smear by a specialist is a time-consuming process and depends on the expertise of a specialist in the pathology, who is rarely available in remote locations. Because of that, the visual analysis is performed by technicians, and this procedure is error-prone due to the subjectivity of the visual analysis of blood smears. In addition, the large number of simultaneous analyses contributes to the high rates of false positives and false negatives in conventional malaria diagnosis.


Even with the use of digital images, another issue is that several laboratories capture blood smear images using low-cost microscope systems, which produce low-quality images. Due to the poor quality of the images yielded by such systems in comparison to traditional light-emitting microscopes, conventional algorithms do not adequately process these images [6]. To handle such drawbacks, automated diagnosis systems could be a viable solution. Digital photographs of the blood smear samples could be analyzed by computerized systems, and the early diagnosis of malaria cases in remote locations could be reliably attained, especially if such systems become publicly available to trained laboratory technicians and doctors [4, 5].

One example of such an approach is the use of deep learning systems. In the last decade, deep learning methods have shown successful outcomes in different applications, including signal and image processing, object recognition, and natural language processing. Deep learning can be seen as an extension of the well-known multilayer neural network classifiers trained with backpropagation. A deep neural network may have several different types of layers, which are used to represent linear or non-linear relations between the input and the output of the network. Due to the use of many neural layers and, in some cases, complex activation functions, deep learning uses massive amounts of computational power and other computational resources, such as memory. Although they are time and resource consuming, deep learning methods have proven to be the most accurate and reliable methods for several classification problems.

There are different kinds of deep neural networks (neural networks that use a deep learning architecture), each one more successful in a specific sort of problem. Concerning image detection and image recognition, the most suitable deep learning architectures are the convolutional neural network (CNN) based algorithms. Those algorithms are well suited to image-related tasks since images have highly correlated intensities in local regions and some local signals or statistics are invariant to location [7]. The idea of CNNs is to apply small convolutional kernels (or filters) in combination with a deep network architecture to capture as many discernible image features as possible.

In recent years, deep learning techniques have boosted the performance of many systems in several areas. Although deep learning is a successful technique, it has its shortcomings. The major issue is the requirement of large datasets for the training phase. This is why medical applications have been among the latest to embrace deep learning, as images are particularly difficult to obtain due to the need for trained experts and to privacy issues [8].

One way to handle the need of deep neural networks for large datasets is through data generation techniques. Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitted one against the other. GANs learn to mimic a distribution of data, creating new samples in a similar domain. One neural network, called the generator, generates new data instances, while the other, the discriminator, evaluates them for authenticity; i.e., the discriminator decides whether each datum it reviews belongs to the actual training dataset or not [9]. Radford et al. proposed a model called DCGAN (Deep Convolutional GAN) that uses convolutional layers in a GAN [10].

There are some deep learning methods for malaria detection and classification. The majority of such methods work in a two-step process [11]. First, they separate the blood cells and other objects from the background image, and then they perform the plasmodium recognition.

This work proposes the development of a blood corpuscle detection system based on deep neural network architectures to be used as the first two modules of a comprehensive automatic malaria detection system. Since the databases are usually small, for the training phase a dataset of malaria-infected samples is submitted to a DCGAN to generate new samples of blood smear objects and support the object detection procedure. After that, a CNN architecture is proposed for the effective detection of blood corpuscles, with extensive network tuning based on a hyperparameter optimization process.

The most significant contributions of this work are listed below:

• Automates the diagnosis process of recognizing malaria parasites with the use of an intelligent system;
• Presents the application of deep learning techniques to improve the accuracy of malaria plasmodium detection;
• Employs deep convolutional generative adversarial networks (DCGAN) to generate reliable training data that were introduced in our deep learning model to improve accuracy.

The remainder of this paper is organized as follows. In Sect. 2, we present a brief review of the literature concerning the use of DCGANs and CNNs in general and in disease detection, focusing on malaria detection. In Sect. 3, we present the proposed DCGAN and CNN architectures and discuss the reasons for selecting such models. In Sect. 4, the results of different experiments are presented, concerning the improvement of accuracy in the CNN through the use of data generated by a DCGAN. Finally, in Sect. 5, we draw some conclusions from the computational results we obtained and lay the ground for future work and improvement of the proposed method.

2 Literature Review

The initial studies on the subject do not discuss the necessity of differentiating parasite and non-parasite stained objects, which leads to rudimentary solutions. Some studies have addressed the need for parasite detection in order to obtain methods with better accuracy. Linder et al. [12] proposed a malaria diagnostic tool for plasmodium falciparum detection. In [13] the authors proposed a method based on color histograms, and in [14] the authors proposed a method to classify malaria-infected blood smear images. Rajpurkar et al. developed a reinforcement learning (RL) agent that can predict the probability of an individual testing positive for malaria by asking questions about their household [15].

There are many different approaches to the problem of building an automatic system to diagnose malaria. Since the focus of this work is the use of DCGANs and CNNs, a succinct summary of the use of such approaches in the diagnosis of malaria is presented. This is by no means a complete description of the state of the art in automatic diagnosis, but it should give the reader an overview of relevant research in the field.

Convolutional Neural Networks (CNNs) have been widely applied to many problems in machine learning and computer vision in recent years [16, 17]. Moreover, many techniques have been proposed to enhance the performance or ease the training of CNNs [18, 19]. Their popularity began in 2012, following the proposal of the AlexNet network, which outperformed all other methods for visual classification and thus attracted great attention from the research community [20]. In the following years, increasing adoption was observed, mainly due to their ability to scale through parallel processing on Graphical Processing Units (GPUs), even in ordinary desktop computers, since they are basically implemented through matrix multiplications that are easily parallelizable. CNNs have also been successfully employed in a vast domain of applications, such as video object detection, speech recognition and language translation [21].

Recent studies on deep learning architectures have shown that pattern classification models based on the deep learning paradigm can significantly outperform models learned with conventional classifiers [22, 23]. Hung et al. applied the Faster R-CNN object detection model to identify cells and recognize their stages in bright-field microscopy images of malaria-infected blood [24]. Poostchi et al.
presented a comprehensive systematic review with a set of approaches for automatic malaria diagnosis [25]. Zhang et al. presented a two-step approach for detecting infected and uninfected cells. The first step applies an object-detection structure with a trained classifier to detect all red blood cells in a blood drop image. The second step classifies each segmented region as an infected or uninfected cell by considering its morphological characteristics [11]. Liang et al. employed a convolutional neural network to discriminate infected and uninfected cells in thin blood smears after applying a conventional approach for cell segmentation [26].

Other authors who have applied deep learning to cell segmentation by means of convolutional neural networks are Dong et al. and Gopakumar et al. Dong et al. used whole images of thin blood slides to compile a dataset of red blood cells infected with malaria and uninfected cells, as labeled by a group of four pathologists. The simulation results showed that all these deep convolutional neural networks achieved classification accuracies of more than 95%, greater than the accuracy of about 92% achievable using the support vector machine (SVM) method. In addition, deep learning methods have the advantage of being able to automatically learn the characteristics of the incoming data, thus requiring less input from human experts for automated malaria diagnosis [1, 8]. Gopakumar et al., in turn, proposed an image stacking-based approach for automated quantitative malaria detection. The cell counting problem was addressed with a 2-level segmentation strategy, and the use of a CNN not only improved the detection accuracy but also favored processing on cell patches and avoided the need for hand-engineered features. Slide images were acquired with a custom-built portable slide scanner made from low-cost, off-the-shelf components [27].

Bibin et al. used deep belief networks, and more recently Hung et al. presented an end-to-end structure using the Faster R-CNN model [24, 28]. Premaratne et al. worked with digital images of oil immersion views from microscopic slides captured through a capture card. The images were preprocessed by segmentation and grayscale conversion to reduce their dimensionality and later fed into a feedforward backpropagation neural network for training [4].

Generative adversarial networks (GANs) are deep neural net architectures composed of two networks that are pitted against each other. GANs were introduced in 2014 [29]. GANs have a huge potential to increase the accuracy of the whole method because they can learn to mimic any distribution of data. To the best of our knowledge, generative adversarial networks have not been applied in the context of malaria detection, which is a relevant contribution of this work.

3 Materials and Methods

3.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are able to map complex, high-dimensional image data into a much lower dimensional space of finite distinct categories, composed of hundreds or thousands of object classes. Their architecture consists basically of a stack of three types of layers, namely convolutional layers, pooling layers, and fully-connected layers. A typical CNN is depicted in Fig. 1.

Fig. 1 General architecture of a convolutional classification neural network. The datum enters the CNN through a convolutional layer; afterwards, a pooling layer reduces the dimension, a fully-connected layer learns the features of the model and, finally, a fully-connected layer produces the final answer

A convolutional layer determines the output of neurons associated with local regions of the input by means of the scalar product between their weights and the region representing the input volume. The rectified linear unit (ReLU) applies an element-wise activation function, max(0, x), to the output generated by the previous layer. A pooling layer, in turn, downsamples along the spatial dimensions of the input, thus reducing the number of parameters in the current activation. Finally, a fully-connected layer is responsible for producing class scores from the activations, for classification purposes. To improve performance, ReLU is also commonly employed between these layers.

The CNN architecture proposed in this paper has eight layers, mixing different kinds of layers and several parameters in each kind of layer. The eight layers are as follows: the first layer is a convolutional 2D layer with 32 entries; the second layer is a convolutional 2D layer with 32 entries; the third layer is a pooling 2D layer; the fourth layer is a convolutional 2D layer with 64 entries; the fifth layer is a convolutional 2D layer with 64 entries; the sixth layer is a pooling 2D layer; the seventh layer is a fully-connected layer; and the eighth layer is a fully-connected layer. All the activation functions are ReLU functions, except for the last layer, where we use a simple max function.

Although the development of a neural network is, at its core, a trial and error procedure that depends on several external factors, from overall computational power to idiosyncrasies of the problem samples, we can present some intuition to support the choice of the proposed model, besides an initial set of experiments. First, the number of layers as well as the type of each layer was chosen in order to reduce the overall elapsed training time of the method. Several works in the field also use a reduced number of layers; for example, [30] uses an architecture similar to the one proposed in this paper. Second, the set of parameters used in each layer is determined by the resolution of the images together with the overall structure of the CNN.
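To make the eight-layer stack more concrete, the sketch below traces the tensor shape and the parameter count through the layers. It is only an illustration under explicit assumptions that the text does not state: "32 entries" and "64 entries" are read as 32 and 64 filters, the input is a 50 × 50 RGB image, convolutions use unpadded 3 × 3 kernels with stride 1, pooling is 2 × 2, the hidden fully-connected layer has 512 units, and there are 2 output classes. All function names are hypothetical.

```python
def conv2d(shape, filters, kernel=3):
    """Output shape and parameter count of an unpadded, stride-1 convolution."""
    height, width, channels = shape
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    params = (kernel * kernel * channels + 1) * filters  # weights + biases
    return out_shape, params


def pool2d(shape, size=2):
    """Output shape of non-overlapping pooling; pooling has no parameters."""
    height, width, channels = shape
    return (height // size, width // size, channels), 0


def dense(shape, units):
    """Output shape and parameter count of a fully-connected layer."""
    inputs = 1
    for dim in shape:
        inputs *= dim  # flatten whatever comes in
    return (units,), (inputs + 1) * units


def trace_network(input_shape=(50, 50, 3), hidden_units=512, classes=2):
    """Trace the eight-layer stack; all sizes except 32/64 are assumptions."""
    layers = [
        ("conv2d, 32", lambda s: conv2d(s, 32)),
        ("conv2d, 32", lambda s: conv2d(s, 32)),
        ("pool2d", pool2d),
        ("conv2d, 64", lambda s: conv2d(s, 64)),
        ("conv2d, 64", lambda s: conv2d(s, 64)),
        ("pool2d", pool2d),
        ("dense", lambda s: dense(s, hidden_units)),
        ("dense", lambda s: dense(s, classes)),
    ]
    shape, summary = input_shape, []
    for name, layer in layers:
        shape, params = layer(shape)
        summary.append((name, shape, params))
    return summary


if __name__ == "__main__":
    for name, shape, params in trace_network():
        print(f"{name:12s} -> {shape}, {params} parameters")
```

Under these assumptions almost all parameters sit in the first fully-connected layer, which illustrates why the image resolution largely determines the size of the model.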

3.2 Generative Adversarial Networks

Learning reusable feature representations from large datasets has been an active research area. One way to build good image representations is through Generative Adversarial Networks (GANs). GANs learn to synthesize elements of a target distribution using two competing neural networks. Those networks can produce compelling images that are sharper than those produced by autoencoders using pixel losses. The generator network (G) takes an n-dimensional random sample from a predefined distribution, conventionally called the latent space, and attempts to create examples of the target distribution. The discriminator network (D) takes a generated or real example as input and has to make the binary decision whether the input is real or generated. This competition process is expressed as a zero-sum game in the following loss term.

Let $x$ be a natural image taken from a distribution $p_X$ and $z \in \mathbb{R}^d$ be a random vector. Consider that $z$ has a uniform distribution $p_z$ with support $[-1, 1]^d$, and let $g$ and $f$ be the generator and discriminative models, respectively. Denote the distribution of $g(z)$ by $p_G$. The discriminative model estimates the probability that an input image was drawn from $p_X$. Ideally, $f(x) = 1$ if $x \sim p_X$ and $f(x) = 0$ if $x \sim p_G$. A GAN corresponds to the generator and discriminative models trained according to the equation [31]:

$$\max_{g} \min_{f} V(f, g) = \mathbb{E}_{x \sim p_X}\left[-\log f(x)\right] + \mathbb{E}_{z \sim p_z}\left[-\log\left(1 - f(g(z))\right)\right]$$

The equation above is solved by alternating two gradient steps, a descent step for $f$ and an ascent step for $g$:

$$\theta_f^{t+1} = \theta_f^{t} - \lambda_t \nabla_{\theta_f} V(f^{t}, g^{t})$$

$$\theta_g^{t+1} = \theta_g^{t} + \lambda_t \nabla_{\theta_g} V(f^{t+1}, g^{t})$$

Here $\theta_f$ and $\theta_g$ are the parameters of $f$ and $g$, $\lambda_t$ is the learning rate, and $t$ is the iteration number. Goodfellow et al. show that, given sufficient capacity for $f$ and $g$ and enough training iterations, the distribution $p_G$ converges to $p_X$. In other words, from a random vector $z$, the network $g$ can synthesize an image $g(z)$ that resembles one drawn from the true distribution $p_X$ [9].
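The value function V(f, g) can be estimated on a batch directly from the discriminator outputs. The sketch below is an illustrative helper, not part of the original method: `f_real` stands for the values f(x) on real images, `f_fake` for the values f(g(z)) on generated images, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def gan_value(f_real, f_fake, eps=1e-12):
    """Batch estimate of V(f, g) with the negative-log terms used in the text.

    f_real: discriminator outputs f(x) on real samples, each in [0, 1].
    f_fake: discriminator outputs f(g(z)) on generated samples, each in [0, 1].
    """
    real_term = sum(-math.log(max(p, eps)) for p in f_real) / len(f_real)
    fake_term = sum(-math.log(max(1.0 - p, eps)) for p in f_fake) / len(f_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # An undecided discriminator outputs 0.5 everywhere: V = 2 * log(2).
    print(gan_value([0.5, 0.5], [0.5, 0.5]))
    # A confident, correct discriminator drives V towards 0.
    print(gan_value([0.99, 0.98], [0.01, 0.02]))
```

With this sign convention the discriminator lowers V by separating real from generated samples, while the generator raises V by producing samples that the discriminator scores close to 1.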

3.3 The Dataset

The dataset used as the substrate for the generative adversarial network built in this work is the result of the work of the AI and Data Science research group of Makerere University, Uganda.¹ Despite a certain presence of microscopes in that country, the small number of laboratory technicians limits the quality of exam diagnoses for the population. Since the acquisition of mobile devices is a growing worldwide trend, the cited research group has begun capturing blood images through such devices as a way to address the problem. The diagnostic challenges they focus on are malaria (in blood samples), tuberculosis (in sputum samples) and intestinal parasites (in stool samples). The annotated malaria dataset is composed of 1182 thick blood smear images with bounding boxes of 7245 parasites. From each image in the dataset, we cut out the square images surrounding the parasites and then passed them on to our GAN for the production of new samples.
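Cutting a square image around an annotated parasite can be sketched as below. The text does not say how the squares were computed, so the details here are assumptions: the bounding box is given as (x_min, y_min, x_max, y_max) in pixels, the square side is the longer box side plus an optional margin, and the square is shifted to stay inside the image. The function name `square_crop_box` is hypothetical.

```python
def square_crop_box(bbox, image_size, margin=0):
    """Square (left, top, right, bottom) region around a bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels.
    image_size: (width, height) of the blood smear image.
    margin: hypothetical number of extra pixels kept on each side.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    side = max(x_max - x_min, y_max - y_min) + 2 * margin
    side = min(side, width, height)  # the square cannot exceed the image
    center_x = (x_min + x_max) // 2
    center_y = (y_min + y_max) // 2
    # Clamp the top-left corner so that the whole square lies inside the image.
    left = min(max(center_x - side // 2, 0), width - side)
    top = min(max(center_y - side // 2, 0), height - side)
    return left, top, left + side, top + side


if __name__ == "__main__":
    print(square_crop_box((100, 120, 140, 150), (1024, 768)))
    print(square_crop_box((1000, 750, 1024, 768), (1024, 768)))
```

Keeping the crops square means every parasite image can later be resized to the same input resolution without distorting its aspect ratio.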

1 http://air.ug/microscopy/

Malaria Blood Smears Object Detection …


Fig. 2 Process for object detection in malaria blood smears using DCGAN networks in four steps. 1—Sample generation with GAN networks; 2—CNN architecture trained with the generated samples; 3—application of image filtering and processing for classification and 4—CNN used for plasmodium detection

3.4 Proposed Method

In this section, we describe the proposed process for object detection in malaria blood smears using CNNs, enhanced by GANs that generate additional samples and improve classifier accuracy. The process consists of the following five steps:

1. blood smear image acquisition;
2. image generation with GANs (Fig. 2, ➊);
3. training of a convolutional neural network (Fig. 2, ➋);
4. application of an adaptive threshold filter (Fig. 2, ➌); and
5. classification of objects with the trained convolutional network (Fig. 2, ➍).

The first step is perhaps the simplest of the proposed method: the real blood smear images were acquired from the repository presented by Quinn et al. [3]. The second step is image generation by the GAN. We use the deep convolutional generative adversarial network (DCGAN) proposed by Radford et al. [32]. The generator receives as input a noise vector, which passes through convolutional, normalization, upsampling and activation layers. Batch normalization normalizes activations throughout the network; it prevents small changes to the parameters from amplifying into larger, suboptimal changes in activations and gradients and, for instance, keeps training from getting stuck in the saturated regimes of nonlinearities. The ReLU activation is used in the generator, with the exception of the output layer, which uses the Tanh function.
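As a rough illustration of the normalization and activation layers mentioned above, the sketch below applies batch normalization to a single feature across a batch, followed by ReLU (hidden layers) or Tanh (output layer). It is a simplification under stated assumptions: the learnable scale and shift of batch normalization are omitted, and the function names and sample values are hypothetical.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize one feature across a batch to zero mean and unit variance.
    The learnable scale and shift are omitted in this sketch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    return [max(0.0, v) for v in values]

def tanh(values):
    return [math.tanh(v) for v in values]

# Illustrative pre-activation values of one feature over a batch of four.
batch = [10.0, 12.0, 9.0, 30.0]
normed = batch_norm(batch)
hidden = relu(normed)    # activation used inside the generator
output = tanh(normed)    # activation of the generator output layer
```

Tanh bounds the generator output to the interval (-1, 1), which is why training images are usually rescaled to that range when this output activation is used.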


The DCGAN was trained with 1800 real plasmodium images cut from the real blood smear images provided by the Makerere University dataset. After the images are generated by the DCGAN, we use data augmentation techniques to train a CNN. We follow a simple data augmentation scheme for training: pixels are padded on each side, and a 50 × 50 crop is randomly sampled from the padded image or its horizontal flip. This allows the network to learn invariance to such deformations. Data augmentation is essential to teach the network the desired invariance and robustness properties when only a few training samples are available and realistic deformations can be simulated efficiently.

After training the CNN with data augmentation, we apply an adaptive threshold filter to the blood smear images. Thresholding is an image segmentation method in which each pixel is replaced with a black pixel if the image intensity $I_{i,j}$ is less than some fixed constant $T$ (that is, $I_{i,j} < T$), or with a white pixel if the image intensity is greater than that constant. Adaptive thresholding is a way to extract useful information encoded in pixels while minimizing background noise. After this procedure, image segments are cut from the blood smear image and classified by the CNN.

Data augmentation is also used to create new images without plasmodium: half of the images created with data augmentation contain no plasmodium. Finally, a CNN was trained without the data generated by the DCGAN to be used as a benchmark.
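The augmentation and thresholding steps can be sketched in plain Python. Several details are assumptions, since the text does not fix them: the padding width (`pad=4` here), zero-valued padding, a local-mean rule for the adaptive threshold, and the treatment of pixels exactly equal to the threshold as white. Function names are hypothetical.

```python
import random

def pad_crop_flip(image, pad, crop_size, rng):
    """Zero-pad the image, flip it horizontally with probability 0.5,
    then sample a random crop_size x crop_size crop."""
    width = len(image[0])
    padded = [[0] * (width + 2 * pad) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * (width + 2 * pad) for _ in range(pad)]
    if rng.random() < 0.5:
        padded = [row[::-1] for row in padded]
    top = rng.randint(0, len(padded) - crop_size)
    left = rng.randint(0, len(padded[0]) - crop_size)
    return [row[left:left + crop_size] for row in padded[top:top + crop_size]]

def global_threshold(image, t):
    """Black (0) where the intensity is below t, white (255) otherwise."""
    return [[0 if v < t else 255 for v in row] for row in image]

def adaptive_mean_threshold(image, radius, offset=0):
    """Threshold each pixel against the mean of its local neighbourhood."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [image[a][b]
                      for a in range(max(0, i - radius), min(h, i + radius + 1))
                      for b in range(max(0, j - radius), min(w, j + radius + 1))]
            local_t = sum(window) / len(window) - offset
            row.append(0 if image[i][j] < local_t else 255)
        out.append(row)
    return out

rng = random.Random(0)
patch = [[(r * 50 + c) % 256 for c in range(50)] for r in range(50)]
augmented = pad_crop_flip(patch, pad=4, crop_size=50, rng=rng)

binary = global_threshold([[10, 200], [120, 90]], t=100)
spot = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
adaptive = adaptive_mean_threshold(spot, radius=1)
```

The local-mean variant adapts the constant $T$ to each neighbourhood, so a bright object is separated from its surroundings even when overall illumination varies across the smear.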

4 Results and Discussion

As presented above, automatic malaria diagnosis involves several steps. In this section, the main results achieved by the proposed solution are presented and discussed. The first group of results concerns the impact on classification accuracy of adopting the DCGAN architecture to increase the number of training samples for the CNN, compared with the results obtained by a CNN with the same architecture trained without the DCGAN-generated data, which serves as a benchmark. The second group concerns the CNN architecture and hyper parameter tuning, focused on separating objects from the background of the blood smears.

4.1 DCGAN Results

Figures 3, 4 and 5 present samples of images generated by the DCGAN after 50, 250 and 30,410 epochs, respectively. As can be seen in these figures, the generated images improve in quality and fidelity as training proceeds when compared to real blood smear samples.


Fig. 3 Generated images after 50 epochs

Fig. 4 Generated images after 250 epochs


Fig. 5 Generated images after 30,410 epochs

To provide empirical evidence of the effectiveness of the generated outputs, this work uses only the images generated by the GAN to train the CNN classifier. A group of 18 experiments was performed to evaluate the effectiveness and the impact of the number of DCGAN-generated images available for the training phase. A learning rate of 0.1, a batch size of 10 and a total of 30 epochs were used for the evaluation tests. The results are presented in Table 1. The number of generated images used for training varies from 500 to 12,000. For the test phase, 1920 real blood smear images were considered. A total of 56,590 real negative images were used. Precision, recall and f1-score are evaluated to assess the improvements in both false positives and false negatives.

It is important to note that, clinically, false negatives have the greater impact at first sight, since object detection is the first step of malaria diagnosis. Nevertheless, false positives also lead to misdiagnosis and erroneous medical prescriptions that may cause other health issues, especially for children. According to the results, precision, i.e., the performance related to the occurrence of false positives, is high from the beginning of the tests. This leads us to the conclusion that the proposed CNN architecture is robust against false positives. On the other hand, recall, which is related to the occurrence of false negatives, ranges from 0.70 with only 500 training images to 0.99 with 12,000 generated images. Different groups of 12,000 images were used to train the network.
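For reference, the three indexes follow the standard definitions, sketched below. The confusion counts in the example are hypothetical and only chosen to give a recall of 0.70; the f1-score check uses the precision and recall reported for 500 generated images (0.99 and 0.70).

```python
def precision(tp, fp):
    """Fraction of detections that are true parasites (penalizes false positives)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true parasites that are detected (penalizes false negatives)."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 70, 1, 30
p = precision(tp, fp)
r = recall(tp, fn)

# Precision 0.99 and recall 0.70 give an f1-score of about 0.82.
f1_500 = round(f1_score(0.99, 0.70), 2)
```

Because the f1-score is a harmonic mean, it is pulled towards the lower of the two values, so a low recall cannot be hidden by a high precision.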


Table 1 Analysis of the effect of increasing the number of GAN outputs on classifier performance, from 500 to 12,000 generated images

N. images   Precision   Recall   f1-score
500         0.99        0.70     0.82
700         0.96        0.78     0.81
900         0.98        0.60     0.74
5,000       0.97        0.88     0.92
7,000       0.98        0.96     0.97
9,000       0.97        0.88     0.91
12,000      0.99        0.99     0.99

Table 2 Results obtained by the CNN without the images generated by the DCGAN

N. images   Precision   Recall   f1-score
500         0.95        0.42     0.56
700         0.97        0.76     0.84
900         0.97        0.89     0.90
1450        0.97        0.92     0.94

The results obtained by the CNN without the images generated by the DCGAN are presented in Table 2. Comparing Table 2 with Table 1, the use of DCGAN-generated images improves classifier performance: with 12,000 generated images, recall and f1-score both reach 0.99, compared with 0.92 and 0.94 for the CNN trained on 1450 real images.

4.2 Hyper Parameters Tuning

The second part of the results focuses on tuning the CNN hyper parameters, aiming at the configuration that yields the highest Precision and Recall and also ensures network convergence and reliability for the future classifier application. For this analysis, 12,000 images generated by the proposed GAN were used during the training phase, while 1920 real blood smear images were considered for testing. A total of 56,590 real negative images were used. Table 3 presents the results for a group of hyper parameter adjustments. The first tuning followed the classic adjustment of the CNN learning rate (LR). Varying it from 0.1 to 0.001, the best result was achieved for LR = 0.001, with a Recall of 1.00, meaning a very low number of false negatives. After defining the LR, the dropout percentages for the flat and dense (i.e., fully connected) layers were considered. Dropout is usually applied to keep the CNN from being trapped in a local minimum or biased by the training data.


Table 3 Hyper parameter tuning of the CNN for blood smear object detection, considering 50 epochs and 12,000 images generated by the DCGAN

Learning rate   Dropout Conv2D layer (%)   Dropout dense layer (%)   Precision   Recall
0.1             0                          0                         0.99        1.00
0.01            0                          0                         0.98        0.99
0.001           0                          0                         0.99        0.99
0.001           50                         50                        0.95        1.00
0.001           50                         30                        1.00        1.00

First, a dropout of 50% was applied to a Conv2D hidden layer and to the dense layer. With this approach, the Recall was satisfactory but the classifier Precision dropped to 95%, which was not acceptable given the previous results. The two dropout percentages were then adjusted. The best Precision and Recall were obtained with a dropout of 30% for the dense layer, keeping the hidden layer at 50%. For this set of parameters, the CNN obtained Precision, Recall and F1-score equal to 1 and 100% accuracy on the real-image test dataset.
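The effect of the two dropout rates can be sketched as follows. The text does not state which dropout variant was used; this sketch assumes inverted dropout, in which surviving activations are rescaled during training and the layer is an identity at test time. The function name and the activation values are hypothetical.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` and
    rescale the survivors so that the expected value is unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
features = [1.0] * 1000   # illustrative activations

conv_out = dropout(features, rate=0.5, rng=rng)    # 50% on the Conv2D hidden layer
dense_out = dropout(features, rate=0.3, rng=rng)   # 30% on the dense layer
test_out = dropout(features, rate=0.3, rng=rng, training=False)
```

A lower rate on the dense layer discards fewer of the features feeding the final decision, which is one plausible reading of why the 50%/30% configuration recovered Precision in Table 3.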

5 Conclusion

Malaria is still a serious health problem in many areas of the world, and its diagnosis is a key aspect of combating the disease. Microscopic analysis of blood samples is still the preferred method: an expert must examine several blood smears looking for parasites to declare a patient infected or not. The first results of this work support the use of DCGANs as a reliable solution for generating training samples for object detection in malaria blood smears. A group of experiments showed that the images generated by the DCGAN can be used to train a CNN to detect objects, improving classifier Precision and Recall. A minimum of 12,000 generated images was taken as the reference result for the next step, the fine tuning of the hyper parameters. For this application, the absolute numbers of false negatives and false positives are clinically relevant, since an accurate diagnosis may save a life. Because of that, this work focused on tuning for maximum Precision and Recall (equal to or approximately 1). For an LR of 0.001 and dropouts of 50% and 30% for the Conv2D and Dense layers, respectively, the Precision, Recall, F1-score and Accuracy were all approximately equal to 1.


In future work, other sophisticated object detection techniques, such as YOLO and Region-Based CNN, shall be considered for a full malaria classification system. The objects detected by the presented process can then be classified, allowing a new malaria diagnosis approach.

References

1. Y. Dong, Z. Jiang, H. Shen, W.D. Pan, Classification accuracies of malaria infected cells using deep convolutional neural networks based on decompressed images. SoutheastCon 2017, 1–6 (2017)
2. S. Shuleenda Devi, S. Alam Sheikh, R. Hussain Laskar, Erythrocyte features for malaria parasite detection in microscopic images of thin blood smear: a review. Int. J. Interact. Multimed. Artif. Intel. 4(2), 34 (2016). Available http://www.ijimai.org/journal/node/1442
3. J.A. Quinn, R. Nakasi, P.K.B. Mugagga, P. Byanyima, W. Lubega, A. Andama, Deep convolutional neural networks for microscopy-based point of care diagnostics, in Machine Learning and Healthcare Conference (MLHC 2016), vol. 56 (2016). Available http://arxiv.org/abs/1608.02989
4. S.P. Premaratne, N.D. Karunaweera, S. Fernando, A neural network architecture for automated recognition of intracellular malaria parasites in stained blood films (2006), pp. 4–7
5. K.E.D. Peñas, P.T. Rivera, P.C. Naval, Malaria parasite detection and species identification on thin blood smears using a convolutional neural network, in 2017 IEEE/ACM International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) (2017), pp. 1–6
6. R. Sorgedrager, Automated malaria diagnosis using convolutional neural networks in an on-field setting: the analysis of low quality smartphone based microscope images, Ph.D. dissertation (2018)
7. Z. Yan, Y. Zhan, S. Zhang, D. Metaxas, X.S. Zhou, Multi-Instance Multi-Stage Deep Learning for Medical Image Recognition, 1st ed. (Elsevier Inc., 2017). Available http://dx.doi.org/10.1016/B978-0-12-810408-8.00006-7
8. Y. Dong, Z. Jiang, H. Shen, W.D. Pan, Classification accuracies of malaria infected cells using deep convolutional neural networks based on decompressed images. SoutheastCon 2017, 1–6 (2017)
9. I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative Adversarial Networks (2014), pp. 1–9. Available http://arxiv.org/abs/1406.2661
10. A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (2015), pp. 1–16. Available http://arxiv.org/abs/1511.06434
11. Z. Zhang, L.S. Ong, K. Fang, A. Matthew, J. Dauwels, M. Dao, H. Asada, Image classification of unlabeled malaria parasites in red blood cells, in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2016), pp. 3981–3984. Available http://ieeexplore.ieee.org/document/7591599/
12. N. Linder, R. Turkki, M. Walliander, A. Mårtensson, V. Diwan, E. Rahtu, M. Pietikäinen, M. Lundin, J. Lundin, A malaria diagnostic tool based on computer vision screening and visualization of plasmodium falciparum candidate areas in digitized blood smears. PLoS One 9(8), e104855 (2014)
13. F.B. Tek, A.G. Dempster, I. Kale, Malaria parasite detection in peripheral blood images, in BMVC (2006), pp. 347–356
14. G. Díaz, F.A. González, E. Romero, A semi-automatic method for quantification and classification of erythrocytes infected with malaria parasites in microscopic images. J. Biomed. Inf. 42(2), 296–307 (2009). Available http://dx.doi.org/10.1016/j.jbi.2008.11.005
15. P. Rajpurkar, V. Polamreddi, A. Balakrishnan, Malaria likelihood prediction by effectively surveying households using deep reinforcement learning, no. Nips (2017). Available http://arxiv.org/abs/1711.09223
16. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., 1097–1105 (2012)
17. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1–9
18. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556 (2014)
19. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
20. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, ser. NIPS'12 (Curran Associates Inc., USA, 2012), pp. 1097–1105. Available http://dl.acm.org/citation.cfm?id=2999134.2999257
21. W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26 (2017)
22. G.E. Hinton, Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
23. V. Nair, G.E. Hinton, 3d object recognition with deep belief nets. Adv. Neural Inf. Process. Syst. (2009), pp. 1339–1347
24. J. Hung, A. Carpenter, Applying faster r-cnn for object detection on malaria images, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE, 2017), pp. 808–813
25. M. Poostchi, K. Silamut, R.J. Maude, S. Jaeger, G. Thoma, Image analysis and machine learning for detecting malaria. Transl. Res. 194, 36–55 (2018). Available https://doi.org/10.1016/j.trsl.2017.12.004
26. Z. Liang, A. Powell, I. Ersoy, M. Poostchi, K. Silamut, K. Palaniappan, P. Guo, M. Hossain, A. Sameer, R. Maude, J. Huang, S. Jaeger, G. Thoma, CNN-based image analysis for malaria diagnosis, in Proceedings, 2016 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2016 (2017), pp. 493–496
27. G.P. Gopakumar, M. Swetha, G. Sai Siva, G.R. Sai Subrahmanyam, Convolutional neural network-based malaria diagnosis from focus stack of blood smear images acquired using custom-built slide scanner. J. Biophoton. 11(3) (2018)
28. D. Bibin, M.S. Nair, P. Punitha, Malaria parasite detection from peripheral blood smear images using deep belief networks. IEEE Access 5, 9099–9108 (2017)
29. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. Adv. Neural Inf. Process. Syst. (2014), pp. 2672–2680
30. A. Vijayalakshmi, B. Rajesh Khanna, Deep learning approach to detect malaria from microscopic images, in Multimedia Tools and Applications (2019). Available https://doi.org/10.1007/s11042-019-7162-y
31. M.-Y. Liu, O. Tuzel, Coupled generative adversarial networks, in NIPS, no. Nips (2016), pp. 469–477. Available http://arxiv.org/abs/1606.07536
32. A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (2015), pp. 1–16. Available http://arxiv.org/abs/1511.06434

Author Index

C
Cheng, Suying, 85
Chu, Chiawei, 73
Cuthbert, Laurie, 139

D
de Macêdo, José Antônio Fernandes, 197
de Oliveira Dantas, Allberson Bruno, 197
do Vale Madeiro, João Paulo, 117
Du, Wencai, 117, 197

F
Fan, Baoyu, 139
Fu, Xiaoyang, 59

G
Gao, Junhan, 73
Gois, Francisco Nauber Bernardo, 197
Gomes, João Paulo Pordeus, 117
Guo, Te, 73

H
Hao, Zhaoquan, 85
Huang, Chuying, 153, 167
Huang, Jiaqi, 59

J
Jiang, Meixuan, 45
Jiang, Yuchao, 153, 167
Jin, Jiangyong, 85

L
Lam, Chan-Tong, 31
Liang, Shengbin, 85, 101
Liang, Yanchun, 59
Lioba Caldas, Weslley, 117
Li, Tongfei, 45
Liu, Yue, 139
Li, Xinru, 153, 167
Li, Ye, 197
Lu, Jiadong, 45
Lu, Wei, 73, 153
Lv, Wei, 45, 167, 181

M
Ma, Han, 31
Marques, João Alexandre Lobo, 117, 197

N
Neto, José Valdir Santiago, 197
Ng, Benjamin K., 31

P
Pang, Aohui, 181
Pedrosa, Roberto Coury, 117

R
Ripon Patgiri, 1

S
Santos, Márcio Costa, 197
Shen, Yanqing, 85, 101
Sun, Haoran, 101

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Computer and Information Science, Studies in Computational Intelligence 1055, https://doi.org/10.1007/978-3-031-12127-2


W
Wang, Mengyao, 73
Wu, Chunguo, 59
Wu, Xuan, 59

X
Xu, Minghe, 153

Y
Yang, Tingrui, 131
Yao, Mingyuan, 101
Yukie, Niki, 101

Z
Zeng, Yifan, 181
Zhang, Yuchen, 45
Zhou, You, 59
Zhu, Xiaolin, 181