Current Trends in Computer Science and Mechanical Automation Vol.1: Selected Papers from CSMA2016 9783110584974, 9783110584967


English Pages 660 Year 2018



Author / Uploaded
Shawn X. Wang (editor)

Table of contents :
Contents
Preface
Introduction of keynote speakers
Part I: Computer Science and Information Technology I
Research and Development of Upper Limb Rehabilitation Training Robot
K-mer Similarity: a Rapid Similarity Search Algorithm for Probe Design
Research and Prospects of Large-scale Online Education Pattern
New LebiD2 for Cold-Start
An Efficient Recovery Method of Encrypted Word Document
A Short Reads Alignment Algorithm Oriented to Massive Data
Based on the Mobile Terminal and Clearing System for Real-time Monitoring of the AD Exposure
Star-shaped SPARQL Query Optimization on Column-family Overlapping Storage
Assembly Variation Analysis based on Deviation Matrix
Design and Realization of Undergraduate Teaching Workload Calculation System Based on LabVIEW
Smooth Test for Multivariate Normality of Innovations in the Vector Autoregressive Model
Research on the Privacy-Preserved Mechanism of Supercomputer Systems
Reliability Evaluation Model for China’s Geolocation Databases
Using Local Library Function in Binary Translation
Type Recognition of Small Size Aircrafts in Remote Sensing Images based on Weight Optimization of Feature Fusion and Voting Decision of Multiple Classifiers
Research of User Credit Rating Analysis Technology based on CART Algorithm
Research of the Influence of 3D Printing Speed on Printing Dimension
Research on Docker-based Message Platform in IoT
Research on the Recommendation of Micro-blog Network Advertisement based on Hybrid Recommendation Algorithm
Fuzz Testing based on Sulley Framework
A Risk Assessment Strategy of Distribution Network Based on Random Set Theory
A Software Homology Detection based on BP Neural Network
Image Small Target Detection based on Deep Learning with SNR Controlled Sample Generation
Agricultural Product Price Forecast based on Short-term Time Series Analysis Techniques
The Correction of Software Network Model based on Node Copy
A Linguistic Multi-criteria Decision Making Method based on the Attribute Discrimination Ability
Part II: Computer Science and Information Technology II
The Reverse Position Fingerprint Recognition Algorithm
Global Mean Square Exponential Stability of Memristor-Based Stochastic Neural Networks with Time-Varying Delays
Research on Classified Check of Metadata of Digital Image based on Fuzzy String Matching
Quantitative Analysis of C2 Organization Collaborative Performance based on System Dynamics
Comprehensive Evaluation and Countermeasures of Rural Information Service System Construction in Hengyang
Image Retrieval Algorithm based on Convolutional Neural Network
A Cross-domain Optimal Path Computation
Collaborative Filtering Recommendation Algorithm based on Item Similarity Learning
The Graph Merge-Clustering Method based on Link Density
Person Name Disambiguation by Distinguishing the Importance of Features based on Topological Distance
A Security Technology Solution for Power Interactive Software Based on WeChat
Two-microphones Speech Separation Using Generalized Gaussian Mixture Model
A Common Algorithm of Construction a New Quantum Logic Gate for Exact Minimization of Quantum Circuits
A Post-Processing Software Tool for the Hybrid Atomisitc-Continuum Coupling Simulation
Achieve High Availability about Failover in Virtual Machine Cluster
Automatic Segmentation of Thorax CT Images with Fully Convolutional Networks
Application of CS-SVM Algorithm based on Principal Component Analysis in Music Classification
A New Particle Swarm Optimization Algorithm Using Short-Time Fourier Transform Filtering
RHOBBS: An Enhanced Hybrid Storage Providing Block Storage for Virtual Machines
Part III: Sensors, Instrument and Measurement I
AIS Characteristic Information Preprocessing & Differential Encoding based on BeiDou Transmission
Modeling of Ship Deformation Measurement based on Single-axis Rotation INS
Research on Fault Diagnosis of Satellite Attitude Control System based on the Dedicated Observers
Data Advance Based on Industrial 4.0 Manufacturing System
High Performance PLL base on Nonlinear Phase Frequency Detector and Optimized Charge Pump
Accelerated ICP based on linear extrapolation
Studies of falls detection algorithm based on support vector machine
Research and Development of Indoor Positioning Geographic Information System based on Web
Harmonic Distribution Optimization of Surface Rotor Parameters for High-Speed Brushless DC Motor
Vehicle Motion Detection Algorithm based on Novel Convolution Neural Networks
The Bank Line Detection in Field Environment Based on Wavelet Decomposition
Research on Measurement Method of Harmonics Bilateral Energy
Traffic Supervision System Using Unmanned Aerial Vehicle based on Image Recognition Algorithm
Improving the Durability of Micro ER Valves for Braille Displays Using an Elongational Flow Field
Aircraft Target Detection in Remote Sensing Images towards Air-to-Ground Dynamic Imaging
A New Third-order Explicit Symplectic Scheme for Hamiltonian Systems
Research on Terahertz Scattering Characteristics of the Precession Cone
Application Development of 3D Gesture Recognition and Tracking Based on the Intel Real Sense Technology Combing with Unity3D and WPF

Shawn X. Wang (Ed.) Current Trends in Computer Science and Mechanical Automation Selected Papers from CSMA2016 - Volume 1

ISBN: 978-3-11-058496-7 e-ISBN: 978-3-11-058497-4

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. For details go to http://creativecommons.org/licenses/by-nc-nd/3.0/. Library of Congress Cataloging-in-Publication Data A CIP catalog record for this book has been applied for at the Library of Congress. © 2017 Shawn X. Wang (Ed.) and chapters contributors Published by De Gruyter Open Ltd, Warsaw/Berlin Part of Walter de Gruyter GmbH, Berlin/Boston The book is published with open access at www.degruyter.com.

www.degruyteropen.com Cover illustration: © cosmin4000 / iStock.com

Contents
Preface
Introduction of keynote speakers

Part I: Computer Science and Information Technology I
Yan-zhao Chen, Yu-wei Zhang. Research and Development of Upper Limb Rehabilitation Training Robot
Xiao-fei CUI, Ya-dong WANG, Guang-ri QUAN, Yong-dong XU. K-mer Similarity: a Rapid Similarity Search Algorithm for Probe Design
Hong ZHANG, Ling-ling ZHANG. Research and Prospects of Large-scale Online Education Pattern
Lebi Jean Marc Dali, Zhi-guang QIN. New LebiD2 for Cold-Start
Li-jun ZHANG, Fei YU, Qing-bing JI. An Efficient Recovery Method of Encrypted Word Document
Gao-yang LI, Kai WANG, Yu-kun ZENG, Guang-ri QUAN. A Short Reads Alignment Algorithm Oriented to Massive Data
Yan-nan SONG, Shi LIU, Chun-yan ZHANG, Wei JI, Ji QI. Based on the Mobile Terminal and Clearing System for Real-time Monitoring of the AD Exposure
Li-ming LIN, Guang-cao LIU, Yan WANG, Wei LU. Star-shaped SPARQL Query Optimization on Column-family Overlapping Storage
Zhen-yu LV, Xu ZHANG, Wei MING, Peng LI. Assembly Variation Analysis based on Deviation Matrix
Yan XU, Rui CHANG, Ya-fei WANG. Design and Realization of Undergraduate Teaching Workload Calculation System Based on LabVIEW
Yan SU, Yi-Shu ZHONG. Smooth Test for Multivariate Normality of Innovations in the Vector Autoregressive Model
Bi-kuan YANG, Guang-ming LIU. Research on the Privacy-Preserved Mechanism of Supercomputer Systems
Jiu-chuan LIN, Yong-jian WANG, Rong-rong XI, Lei CUI, Zhi-yu HAO. Reliability Evaluation Model for China’s Geolocation Databases
Jie TAN, Jian-min PANG, Shuai-bing LU. Using Local Library Function in Binary Translation
Kai CHENG, Fei SONG, Shiyin QIN. Type Recognition of Small Size Aircrafts in Remote Sensing Images based on Weight Optimization of Feature Fusion and Voting Decision of Multiple Classifiers
Bo HU, Yu-kun JIN, Wan-jiang GU, Jun LIU, Hua-qin QIN, Chong CHEN, Ying-yu WANG. Research of User Credit Rating Analysis Technology based on CART Algorithm
Cong-ping CHEN, Yan-hua RAN, Jie-guang HUANG, Qiong HU, Xiao-yun WANG. Research of the Influence of 3D Printing Speed on Printing Dimension
Fan ZHANG, Fan YANG. Research on Docker-based Message Platform in IoT
Yan-xia YANG. Research on the Recommendation of Micro-blog Network Advertisement based on Hybrid Recommendation Algorithm
Zhong GUO, Nan LI. Fuzz Testing based on Sulley Framework
Jun ZHANG, Shuang ZHANG, Jian LIANG, Bei TIAN, Zan HOU, Bao-zhu LIU. A Risk Assessment Strategy of Distribution Network Based on Random Set Theory
Rui LIU, Bo-wen SUN, Bin TIAN, Qi LI. A Software Homology Detection based on BP Neural Network
Ming LIU, Hao-yuan DU, Yue-jin ZHAO, Li-quan DONG, Mei HUI. Image Small Target Detection based on Deep Learning with SNR Controlled Sample Generation
Yi-xin ZHANG, Wen-sheng SUN. Agricultural Product Price Forecast based on Short-term Time Series Analysis Techniques
Xiao-lin ZHAO, Jing-feng XUE, Qi ZHANG, Zi-yang WANG. The Correction of Software Network Model based on Node Copy
Wei-ping WANG, Fang LIU. A Linguistic Multi-criteria Decision Making Method based on the Attribute Discrimination Ability

Part II: Computer Science and Information Technology II
Gui-fu YANG, Xiao-yu XU, Wei-shuo LIU, Cheng-lin PU, Lu YAO, Jing-bo ZHANG, Zhen-bang LIU. The Reverse Position Fingerprint Recognition Algorithm
Xiao-Lin XU. Global Mean Square Exponential Stability of Memristor-Based Stochastic Neural Networks with Time-Varying Delays
Zhen-hong XIE. Research on Classified Check of Metadata of Digital Image based on Fuzzy String Matching
Shao-nan DUAN, Yan-jie NIU, Yao SHI, Xiao-dong MU. Quantitative Analysis of C2 Organization Collaborative Performance based on System Dynamics
Kuai-kuai ZHOU, Zheng CHEN. Comprehensive Evaluation and Countermeasures of Rural Information Service System Construction in Hengyang
Wen-qing HUANG, Qiang WU. Image Retrieval Algorithm based on Convolutional Neural Network
Bing ZHOU, Juan DENG. A Cross-domain Optimal Path Computation
Feng LIU, Huan LI, Zhu-juan MA, Er-zhou ZHU. Collaborative Filtering Recommendation Algorithm based on Item Similarity Learning
Huo-wen JIANG, Hai-ying MA, Xin-ai XU. The Graph Merge-Clustering Method based on Link Density
Qing-yun QIU, Jun-yong LUO, Mei-juan YIN. Person Name Disambiguation by Distinguishing the Importance of Features based on Topological Distance
Bo HU, Yu-kun JIN, Jun LIU, Ai-jun FAN, Hong-bo MA, Chong CHEN. A Security Technology Solution for Power Interactive Software Based on WeChat
Miao FAN, Jia-min MAO, Jao-gui DING, Wei-feng LI. Two-microphones Speech Separation Using Generalized Gaussian Mixture Model
Zhi-qiang LI, Sai CHEN, Wei ZHU, Han-wu CHEN. A Common Algorithm of Construction a New Quantum Logic Gate for Exact Minimization of Quantum Circuits
Qian WANG, Xiao-guang REN, Li-yang XU, Wen-jing YANG. A Post-Processing Software Tool for the Hybrid Atomisitc-Continuum Coupling Simulation
Jun XU, Xiao-yong LI. Achieve High Availability about Failover in Virtual Machine Cluster
Kai-peng MAO, Shi-peng XIE, Wen-ze SHAO. Automatic Segmentation of Thorax CT Images with Fully Convolutional Networks
Yong-jie WANG, Yi-bo WANG, Dun-wei DU, Yan-ping BAI. Application of CS-SVM Algorithm based on Principal Component Analysis in Music Classification
Si-wen GUO, Yu ZUO, Tao YAN, Zuo-cai WANG. A New Particle Swarm Optimization Algorithm Using Short-Time Fourier Transform Filtering
Zhen WANG, Hao-peng CHEN, Fei HU. RHOBBS: An Enhanced Hybrid Storage Providing Block Storage for Virtual Machines

Part III: Sensors, Instrument and Measurement I
Shang-yue Zhang, Yu-ming Wang, Zheng-guo Yu. AIS Characteristic Information Preprocessing & Differential Encoding based on BeiDou Transmission
You LI, Xing-shu WANG, Hao XIONG. Modeling of Ship Deformation Measurement based on Single-axis Rotation INS
Mei-ling WANG, Hua SONG, Chun-ling WEI. Research on Fault Diagnosis of Satellite Attitude Control System based on the Dedicated Observers
Ming-hui YAN, Yao-he LIU, Ning GUO, Hua-cheng TANG. Data Advance Based on Industrial 4.0 Manufacturing System
Hai-tao ZHAI, Wen-shen MAO, Wen-song LIU, Ya-Di LU, Lu TANG. High Performance PLL base on Nonlinear Phase Frequency Detector and Optimized Charge Pump
Fang-yan LUO. Accelerated ICP based on linear extrapolation
Li-ran PEI, Ping-ping JIANG, Guo-zheng YAN. Studies of falls detection algorithm based on support vector machine
Ting-ting GUO, Feng QIAO, Ming-zhe LIU, Ai-dong XU, Jun-nan SUN. Research and Development of Indoor Positioning Geographic Information System based on Web
Chun FANG, Man-feng DOU, Bo TAN, Quan-wu LI. Harmonic Distribution Optimization of Surface Rotor Parameters for High-Speed Brushless DC Motor
Sheng-yang GAO, Xian-yang JIANG, Xiang-hong TANG. Vehicle Motion Detection Algorithm based on Novel Convolution Neural Networks
Yong LI, En-de WANG, Zhi-gang DUAN, Hui CAO, Xun-qian LIU. The Bank Line Detection in Field Environment Based on Wavelet Decomposition
Xiao-ming LI, Jia-yue YIN, Hao-jun XU, Chengqiong BI, Li ZHU, Xiao-dong DENG, Lei-nan MA. Research on Measurement Method of Harmonics Bilateral Energy
Long-fei WANG, Wei ZHANG, Xiang-dong CHEN. Traffic Supervision System Using Unmanned Aerial Vehicle based on Image Recognition Algorithm
Yu-fei LI, Ya-yong LIU, Lu-ning XU, Li HAN, Rong SHEN, Kun-quan LU. Improving the Durability of Micro ER Valves for Braille Displays Using an Elongational Flow Field
Jinxia WU, Fei SONG, Shiyin QIN. Aircraft Target Detection in Remote Sensing Images towards Air-to-Ground Dynamic Imaging
Xiao-mei LIU, Shuai ZHU. A New Third-order Explicit Symplectic Scheme for Hamiltonian Systems
Qi YANG, Yu-liang QIN, Bin DENG, Hong-qiang WANG. Research on Terahertz Scattering Characteristics of the Precession Cone
Chun-yu CHENG, Meng-lin SHENG, Zong-min YU, Wen-xuan ZHANG, An-qi LI, Kai-yu WANG. Application Development of 3D Gesture Recognition and Tracking Based on the Intel Real Sense Technology Combing with Unity3D and WPF

Preface

The 2nd International Conference on Computer Science and Mechanical Automation built on the success of last year's conference and received overwhelming support from the research community, as evidenced by the number of high-quality submissions. The conference accepted articles through a rigorous peer review process. We are grateful for the contributions of all the authors. To those whose papers appear in this collection, we thank you for the great effort that made this conference a success and this volume of the proceedings worth reading. To those whose papers were not accepted, we assure you that your support is very much appreciated.

The papers in these proceedings represent a broad spectrum of research topics and reveal some cutting-edge developments. Chapters 1 and 2 contain articles in the areas of computer science and information technology. The articles in Chapter 1 focus on algorithm and system development in big data, data mining, machine learning, cloud computing, security, robotics, Internet of Things, and computer science education. The articles in Chapter 2 cover image processing, speech recognition, sound event recognition, music classification, collaborative learning, e-government, as well as a variety of emerging areas of application. Some of these papers are especially eye-opening and worth reading.

Chapters 3 and 4 contain papers in the areas of sensors, instruments and measurement. The articles in Chapter 3 mostly cover navigation systems, unmanned air vehicles, satellites, geographic information systems, and all kinds of sensors related to location, position, and other geographic information. The articles in Chapter 4 are about sensors and instruments used in areas such as temperature and humidity monitoring, medical instruments, biometric sensors, and other sensors for security applications. Some of these papers concern highly critical systems such as nuclear environmental monitoring and object tracking for satellite videos.

Chapters 5 and 6 contain papers in the areas of mechatronics and electrical engineering. The articles in Chapter 5 mostly cover mechanical design for a variety of equipment, such as space release devices, box girders, shovel loading machines, suspension cables, grinding and polishing machines, gantry milling machines, clip-type passive manipulators, hot runner systems, water hydraulic pumps/motors, and turbofan engines. The articles in Chapter 6 focus on mechanical and automation devices in power systems as well as automobiles and motorcycles.

This collection of research papers showcases the accomplishments of the authors. At the same time, the papers once again show that the International Conference on Computer Science and Mechanical Automation is a highly valuable platform for the research community to share ideas and knowledge.

Organizing an international conference is a huge endeavor that demands teamwork. We very much appreciate everyone involved in the organization, especially the reviewers. We are looking forward to another successful conference next year.

Shawn X. Wang
CSMA2016 Conference Chair

Introduction of keynote speakers

Professor Lazim Abdullah
School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu, Malaysia

Lazim Abdullah is a professor of computational mathematics at the School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu. He received a B.Sc. (Hons) in Mathematics from the University of Malaya, Kuala Lumpur, in June 1984 and an M.Ed. in Mathematics Education from Universiti Sains Malaysia, Penang, in 1999. He received his Ph.D. in Information Technology Development from Universiti Malaysia Terengganu in 2004. His research focuses on the mathematical theory of fuzzy sets and its applications to social ecology, environmental sciences, health sciences, and manufacturing engineering. His research findings have appeared in over two hundred and fifty publications, including refereed journals, conference proceedings, chapters in books, and research books. He is currently Director of Academic Planning, Development and Quality at his university and a member of the editorial boards of several international journals related to computing and applied mathematics. He is also a regular reviewer for a number of local and international impact-factor journals and a member of the scientific committees of several symposia and conferences at national and international levels. Dr. Abdullah is an associate member of the IEEE Computational Intelligence Society, a member of the Malaysian Mathematical Society, and a member of the International Society on Multiple Criteria Decision Making.

Professor Jun-hui Hu
State Key Lab of Mechanics and Control of Mechanical Structures, Nanjing University of Aeronautics and Astronautics, China

Dr. Junhui Hu is a Chang-Jiang Distinguished Professor, China, the director of the Precision Driving Lab at Nanjing University of Aeronautics and Astronautics, and the deputy director of the State Key Laboratory of Mechanics and Control of Mechanical Structures, China. He received his Ph.D. degree from Tokyo Institute of Technology, Tokyo, Japan, in 1997, and B.E. and M.E. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 1986 and 1989, respectively. He was an assistant and associate professor at Nanyang Technological University, Singapore, from 2001 to 2010. His research interest is in piezoelectric/ultrasonic actuating technology. He is the author or co-author of about 250 papers and disclosed patents, including more than 80 full SCI journal papers and one editorial review for an international journal, and the sole author of the monograph “Ultrasonic Micro/Nano Manipulations” (2014, World Scientific, Singapore). He is an editorial board member of two international journals. Dr. Hu won the Paper Prize from the Institute of Electronics, Information and Communication Engineers (Japan) as the first author in 1998, and was awarded the title of valued reviewer by Sensors and Actuators A: Physical and by Ultrasonics. His research work has been highlighted by seven international scientific media outlets.

Professor James Daniel Turner
College of Engineering, Aerospace Engineering, Texas A&M University (TAMU), USA

Dr. James Daniel Turner has been a research professor in the College of Engineering, Texas A&M University (TAMU), since 2006. He received his B.S. degree in Engineering Physics from George Mason University in 1974, his M.E. degree in Engineering Physics from the University of Virginia in 1976, and his Ph.D. degree in Engineering Science and Mechanics from Virginia Tech in 1980. He has broad experience in public, private, and academic settings working with advanced engineering and scientific concepts that are developed from first principles; modeled and simulated to understand the limits of performance; developed as hardware prototypes; tested in operationally relevant environments; and transitioned, through partnering with industry and government, to missions of critical national interest. Dr. Turner is engaged in exploratory research whose goal is to transition aerospace analysis tools to bioinformatics. This research consists of applying multibody dynamics to drug design problems in computational chemistry and, most recently, working with the immunological group at Mayo Clinic to explore the development of generalized predator-prey models for analyzing melanoma cancer in human cell behavior.

Professor Rong-jong Wai
Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taiwan

Rong-jong Wai was born in Tainan, Taiwan, in 1974. He received the B.S. degree in electrical engineering and the Ph.D. degree in electronic engineering from Chung Yuan Christian University, Chung Li, Taiwan, in 1996 and 1999, respectively. From August 1998 to July 2015, he was with Yuan Ze University, Chung Li, Taiwan, where he was the Dean of the General Affairs Office from August 2008 to July 2013 and the Chairman of the Department of Electrical Engineering from August 2014 to July 2015. Since August 2015, he has been with National Taiwan University of Science and Technology, Taipei, Taiwan, where he is currently a full Professor and the Director of the Energy Technology and Mechatronics Laboratory. He has authored more than 150 conference papers, nearly 170 international journal papers, and 51 inventive patents. He is a fellow of the Institution of Engineering and Technology (U.K.) and a senior member of the Institute of Electrical and Electronics Engineers (U.S.A.).

Professor Zhen-guo Gao
Dalian University of Technology, China

Zhen-guo Gao was a visiting professor at the University of Michigan, Dearborn, with full financial support from the China Scholarship Council. He was a visiting professor at the University of Illinois at Urbana-Champaign in 2010. He received his Ph.D. degree in Computer Architecture from Harbin Institute of Technology, Harbin, China, in 2006 and then joined Harbin Engineering University, Harbin, China. His research interests include wireless ad hoc networks, cognitive radio networks, network coding, and game theory applications in communication networks. He is a senior member of the China Computer Federation. He serves as a reviewer of project proposals for the National Science Foundation of China, the Ministry of Education of China, and the Science Foundation of HeiLongJiang Province, China, among others. He also serves as a reviewer for journals including IEEE Transactions on Mobile Computing, Wireless Networks and Mobile Computing, Journal of Parallel and Distributed Computing, Journal of Electronics (Chinese), and Journal of Computer (Chinese).

Professor Steven Guan (Sheng-Uei Guan)
Director, Research Institute of Big Data Analytics, Xi’an Jiaotong-Liverpool University, China

Steven Guan received his M.Sc. and Ph.D. from the University of North Carolina at Chapel Hill. Prof. Guan worked in a prestigious R&D organization for several years, serving as a design engineer, project leader, and department manager. After leaving industry, he joined Yuan-Ze University in Taiwan for three and a half years, where he served as deputy director of the Computing Center and chairman of the Department of Information & Communication Technology. Later he joined the Electrical & Computer Engineering Department at the National University of Singapore as an associate professor. He is currently a professor in the computer science and software engineering department at Xi’an Jiaotong-Liverpool University (XJTLU). He created the department from scratch and served as its head for 4.5 years. Before joining XJTLU, he was a tenured professor and chair in intelligent systems at Brunel University, UK. Prof. Guan’s research interests include machine learning, computational intelligence, e-commerce, modeling, security, networking, and random number generation. He has published extensively in these areas, with 130 journal papers and 170+ book chapters or conference papers. He has chaired and delivered keynote speeches for 20+ international conferences and served on 130+ international conference committees and 20+ editorial boards.

General Chair
Prof. Wen-Tsai Sung, National Chin-Yi University of Technology, Taichung, Taiwan
Prof. Shawn X. Wang, Department of Computer Science, California State University, Fullerton, United States

Editors
Prof. Wen-Tsai Sung, National Chin-Yi University of Technology, Taichung, Taiwan
Prof. Hong-zhi Wang, Department of Computer Science and Technology, Harbin Institute of Technology, China
Prof. Shawn X. Wang, Department of Computer Science, California State University, Fullerton, United States

Co-editor
Professor Cheng-Yuan Tang, Huafan University, New Taipei City, Taiwan

Prof. Wen-Tsai Sung, Department of Electrical Engineering, National Chin-Yi University of Technology, Taichung, Taiwan. His main research areas are Electrical Engineering and Wireless Sensor Networks.
Prof. Shawn X. Wang, Department of Computer Science, California State University, Fullerton, United States. His main research areas are Mathematics, Computer and Information Science.
Prof. Hong-zhi Wang, Department of Computer Science and Technology, Harbin Institute of Technology, China. His main research area is Big data.
Professor Cheng-Yuan Tang, Department of Information Management, Huafan University, New Taipei City, Taiwan, chengyuantang@outlook.com. His main research areas are Computer Science and Information Engineering.

Technical Program Committee
Prof. Zhihong Qian, Jilin University, Changchun, Jilin, China
Prof. Jibin Zhao, Shenyang Institute of Automation, Chinese Academy of Sciences, China
Prof. Lixin Gao, Wenzhou University, China
Prof. Hungchun Chien, Jinwen University of Science and Technology, New Taipei City, Taiwan
Prof. Huimi Hsu, National Ilan University, Yilan, Taiwan
Prof. Jiannshu Lee, Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan
Prof. Chengyuan Tang, Huafan University, New Taipei City, Taiwan

Prof. Mingchun Tang, Chongqing University, Microwave Electromagnetic and Automation, China
Dr. Jing Chen, Computer School of Wuhan University, China
Dr. Yinghua Zhang, Institute of Automation, Chinese Academy of Sciences, China
Dr. Lingtao Zhang, College of Computer Science and Information Technology, Central South University of Forestry and Technology, China
Dr. Kaiming Bi, School of Civil and Mechanical Engineering, Curtin University, Perth, Australia
Dr. Jingyu Yang, Faculty of Aerospace Engineering, Shenyang Aerospace University, Shenyang, China
Dr. Dong Wang, School of Information and Communication Engineering, Dalian University of Technology, Dalian, China
Dr. Kang An, College of Communications Engineering, PLA University of Science and Technology, Nanjing, China
Dr. Kaifeng Han, Department of Electrical and Electronic Engineering (EEE), The University of Hong Kong, Hong Kong
Dr. Sri Yulis Binti M Amin, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Dr. Longsheng Fu, Northwest A&F University, Yangling, China
Dr. Hui Yang, Beijing University of Posts and Telecommunications, Beijing, China
Dr. T. BhuvanEswari, Faculty of Engineering and Technology, Multimedia University, Melaka, Malaysia
Dr. Xiangjie Kong, School of Software, Dalian University of Technology, Dalian, China
Dr. Kai Tao, Nanyang Technological University, Singapore
Dr. Lainchyr Hwang, Dept. of Electrical Engineering, I-Shou University, Kaohsiung, Taiwan
Dr. Yilun Shang, Department of Mathematics, Shanghai, China
Dr. Thang Trung Nguyen, Ton Duc Thang University, Ho Chi Minh, Vietnam
Dr. Chichang Chen, Department of Information Engineering, I-Shou University, Kaohsiung, Taiwan
Dr. Tomasz Andrysiak, Technology and Science University, Bydgoszcz, Poland
Dr. Rayhwa Wong, Department of Mechanical Eng., Hwa-Hsia University of Technology, New Taipei City, Taiwan
Dr. Muhammad Naufal Bin Mansor, Kampus Uniciti Alam, Universiti Malaysia Perlis (UniMAP), Sungai Chuchuh, Malaysia
Dr. Michal Kuciej, Faculty of Mechanical Engineering, Bialystok University of Technology, Bialystok, Poland
Dr. Imran Memon, Zhejiang University, Hangzhou, China
Dr. Yosheng Lin, National Chi Nan University, Nantou, Taiwan
Dr. Zhiyu Jiang, University of Chinese Academy of Sciences, Beijing, China

Introduction of keynote speakers


Dr. WananXiong, School of Electronic Engineering, University of Electronic Science and Technology of China(UESTC), Chengdu, China Dr. Dandan Ma, University of Chinese Academy of Sciences, Beijing, China Dr. ChienhungYeh, Department of Photonics, Feng Chia University, Taichung, Taiwan Dr. Adam Głowacz, AGH University of Science and Technology, Cracow, Poland Dr. Osama Ahmed Khan, Lahore University of Management Sciences, Lahore, Pakistan Dr. Xia Peng, Microsoft, Boston, United States Dr. Andrzej Glowacz, AGH University of Science and Technology, Cracow, Poland Dr. Zhuo Liu, Computer Science and Software Engineering Department, Auburn University Auburn, United States Dr. ZuraidiSaad, Universiti of Teknologi MARA, Shah Alam, Malaysia Dr. Gopa Sen, Institute for Infocomm Research, Agency for Science Technology and Research (A*STAR), Singapore Dr. Minhthai Tran, Ho Chi Minh City University of Foreign Languages and Information Technology, Ho Chi Minh City, Vietnam Dr. FatihEmreBoran, Department of Industrial Engineering, Faculty of Engineering, Gazi University, Ankara, Turkey Prof. SerdarEthemHamamci, Electrical-Electronics Engineering Department, Inonu University, Malatya, Turkey Dr. Fuchien Kao, Da-Yeh University, Zhanghua County, Taiwan Dr. NoranAzizan Bin Cholan, Faculty of Electrical and Electronics Engineering, UniversitiTun Hussein Onn Malaysia, BATU PAHAT, Malaysia Dr. Krzysztof Gdawiec, Institute of Computer Science, University of Silesia, Sosnowiec, Poland Dr. Jianzhou Zhao, Cadence Design System, San Jose, United States Dr. Malka N. Halgamuge, Department of Electrical & Electronic Engineering, Melbourne School of Engineering, The University of Melbourne, Melbourne, Australia Dr. Muhammed EnesBayrakdar, Department of Computer Engineering, Duzce University, Duzce, Turkey Dr. DeepaliVora, Department of Information Technology, Vidyalankar Institute of Technology, Mumbai, India Dr. Xu Wang, Advanced Micro Devices (Shanghai), Co. Ltd, Shanghai, China Dr. 
Quanyi Liu, School of Aerospace Engineering, Tsinghua University, Beijing, China Dr. YiyouHou, Department of Electronic Engineering, Southern Taiwan University of Science and Technology, Tainan City, Taiwan Dr. Ahmet H. ERTAS, Biomedical Engineering Department, Karabuk University, Karabuk, Turkey Dr. Hui Li, School of Microelectronics and Solid-State Electronics, University of Electronic Science and Technology of China, UESTC, China Dr. Zhiqiang Cao, Institute of Automation, Chinese Academy of Sciences multi-robot systems, intelligent robot control, China


Dr. Hengkai Zhao, School of Communication And Information Engineering, Shanghai University, China Dr. Chen Wang, School of Electronic Information and Communications, Huazhong University of Science and Technology, China

Part I: Computer Science and Information Technology I

Yan-zhao Chen*, Yu-wei Zhang

Research and Development of Upper Limb Rehabilitation Training Robot

Abstract: This paper reviews the mechanisms of stroke rehabilitation and the shortcomings of traditional clinical rehabilitation for stroke patients. It then discusses recent progress and the current state of research on robot-aided upper limb rehabilitation. The review indicates that self-training has become a promising rehabilitation method, that sEMG-based upper limb motion recognition is becoming a key interaction technology, and that task-based training in an environment combining a robot with virtual reality (VR) is the likely direction of future development.

Keywords: rehabilitation training; robot; hemiplegia patient

1 Introduction

Hemiplegia is usually caused by stroke and other diseases, and it impairs the activities of daily living (ADL) of most patients [1-3]. The patient’s daily work and life are seriously affected, which places a burden on family and society. The incidence of stroke is high, and it has become one of the common causes of death [4-6]. Clinical studies have indicated that exercise training in the early stage of illness helps patients recover their activities of daily living [7-9]. The traditional clinical rehabilitation method is one-to-one training between a patient and a rehabilitation therapist. Because patients are numerous and rehabilitation therapists are few, the delivery, intensity, duration and accuracy of training cannot be guaranteed, and rehabilitation outcomes are poor. Rehabilitation robots emerged with the development of modern technologies such as robotics, signal processing, pattern recognition and clinical technology. Robot-aided rehabilitation training can overcome the shortcomings of traditional clinical methods, facilitate new rehabilitation models and ultimately improve rehabilitation [10,11].

*Corresponding author: Yan-zhao Chen, School of Mechanical and Automotive Engineering, Qilu University of Technology, Jinan, China
Yu-wei Zhang, School of Mechanical and Automotive Engineering, Qilu University of Technology, Jinan, China


Starting from the mechanisms of stroke rehabilitation and the clinical shortcomings of traditional rehabilitation methods, this article analyzes current research on rehabilitation robots that assist patients in rehabilitation training. It reviews the development status of upper limb rehabilitation robots and the corresponding changes in rehabilitation modes.

2 Changes in the Way of Rehabilitation and Rehabilitation Mechanism

Stroke is a disease of the central nervous system. It is generally caused by a sudden hemorrhage or ischemia in the brain that damages the cerebral cortex. The damage either disturbs the formation of control instructions in the central nervous system or blocks the pathways that carry them. As a result, the patient’s movement intent is formed incorrectly, or the nerve control instruction cannot reach the limb that should carry out the movement. The motor function of the body, especially of the upper limbs, is impaired. Medical studies have shown that the human nervous system has a certain degree of plasticity [12], as well as the ability to re-learn motor skills [13]. Practice shows that rehabilitation therapy in the early stages of illness is more conducive to the recovery of motor function [7], and that the most effective way to promote the reconstruction of motor function is repeated exercise training [8]. The rehabilitation training commonly used in clinics is still the traditional one-to-one training between a patient and a rehabilitation therapist, which has several drawbacks. First, it is restricted in place: rehabilitation is generally carried out only in a hospital or rehabilitation center, so it lacks flexibility. Second, because the incidence of stroke is high, patients are numerous while rehabilitation therapists are relatively few, so many patients do not receive timely and stable treatment, which affects their rehabilitation. Third, rehabilitation usually requires a long period of repeated training. The workload of rehabilitation therapists therefore increases and easily leads to fatigue, which can cause training errors and insufficient training intensity and in turn harm the patient’s rehabilitation.
The traditional clinical rehabilitation method is of limited effect, which hinders both clinical rehabilitation practice and research on the rehabilitation mechanisms of stroke patients. A new, modern rehabilitation method is therefore urgently needed as a complement to traditional methods, to promote the development of clinical rehabilitation theory and practice. In addition, clinical studies show that the patient’s active participation enhances the rehabilitation effect. In this context, independent rehabilitation by the patient, especially in the home environment, becomes a meaningful self-rehabilitation method, and research on self-rehabilitation based on rehabilitation robots has broad application prospects [14,15].


The upper limb plays an important role in daily life, so an upper limb rehabilitation robot that assists the upper limb in rehabilitation training to rebuild motor function is particularly important. An upper limb rehabilitation robot is a mechanical device that helps patients with limited upper limb motor function, such as hemiplegic patients, to recover their ability to perform activities of daily living. It is guided by rehabilitation medicine theory and integrates robotics, human anatomy, computer science and other disciplines. Upper limb rehabilitation robots do not fatigue and are highly controllable, so they can achieve high-precision control and operate safely and reliably. Using an upper limb rehabilitation robot for training can change the doctor-patient relationship and provide patients with a new form of rehabilitation training, and it has become a promising field of research and application.

3 Upper Limb Rehabilitation Robot

With the development of science and technology and of rehabilitation medicine theory, the concepts and methods of stroke rehabilitation have changed. Upper limb rehabilitation training has moved from the traditional manner to a robot-aided one, which has brought a series of changes ranging from the mechanism design of upper limb robots to the rehabilitation mode.

3.1 The Mechanism Design for Rehabilitation Training

Research on upper limb rehabilitation robots started early with the design of mechanical structures. Researchers have designed a variety of rehabilitation training devices, from early simple assistive tools with a single degree of freedom (DOF) to automated multi-DOF rehabilitation robots. One early device is the master-slave hand-object-hand system developed at the University of Pennsylvania in the United States [16], which assists a patient’s hand in simple movements through mirror training. Shortly afterwards, researchers at Stanford University designed a series of devices that assist the upper limbs in rehabilitation training, known as the Mirror-image Motion Enabler (MIME) [17]. In recent years, upper limb rehabilitation robots have tended towards more degrees of freedom, greater intelligence and portability. They are becoming more user-friendly, and wearable designs have become a research trend. Arizona State University developed an upper limb rehabilitation training device called the Robotic Upper Extremity Repetitive Therapy device (RUPERT) [18]. Researchers at the University of Washington studied a wearable neurorehabilitation exoskeleton robot called the Cable-actuated Dexterous Exoskeleton for Neurorehabilitation [19], among others.

3.2 Interactive Mode

A rehabilitation robot is a mechanical device that assists patient rehabilitation, and its interaction with the patient is an important aspect of that rehabilitation. A stroke usually removes voluntary movement on only one side of the body, so using the upper limb on the healthy side to guide the movement of the disabled upper limb has become a viable training method and a trend. The general process is as follows: first, the movements of the healthy arm are identified; then the recognition result is converted into motion control instructions that drive the robot; finally, the disabled upper limb moves with the aid of the robot in order to achieve rehabilitation. Recognizing the action of the healthy arm is therefore one of the core technologies. Since the 1980s, motion tracking for rehabilitation has been an active area of research. The motion recognition and tracking technologies currently available for rehabilitation fall into three classes. First, there are body motion tracking technologies based on contact physical sensors. These use physical sensors worn on the body to identify and track human movement. Commonly used sensors include gyroscopes, acceleration sensors and gravity sensors [20,21]. They measure physical parameters of upper limb movement directly, such as posture, degrees of freedom of movement, velocity and acceleration. However, contact sensors must be fixed to the human body, and because of the structure of the body they are not easy to mount, and their location and angle are difficult to fix. Moreover, because body movement is random, such sensors are prone to shift, delay, jitter and other problems.
Thus, human motion tracking based on contact sensors is not well suited to robot-assisted upper limb rehabilitation. Second, there are body motion tracking technologies based on non-contact physical sensors. Because these sensors do not require direct contact with the human body, problems such as sensor offset and installation do not arise. The most commonly used non-contact sensors are optical devices such as Kinect [21,22]. Such technology is more convenient, but during tracking and identification by an optical sensor the body must stay in a stable location, which means a lack of flexibility. There are also special requirements for the environment and the light level of the application area. In addition, body parts may overlap or be occluded, and with noisy backgrounds, such as the home environment of remote rehabilitation, motion recognition results rarely reach the desired level [23]. Moreover, 3D positioning requires high-precision mathematical calculations, which introduce delay and make real-time performance difficult to ensure. So this technology is not well suited to robot-assisted upper limb rehabilitation either. Third, there is motion pattern recognition based on physiological signals. In addition to physical sensors, physiological signals have emerged as a tool for human motion tracking, with pattern recognition as the core technology. The commonly used physiological signals are mainly EMG (electromyography) [24] and EEG (electroencephalography) [25], of which EMG is the most common. According to the acquisition method, EMG signals can be divided into needle electromyography signals and surface electromyography (sEMG) signals. Needle electromyography requires inserting needle electrodes into the muscle, which is inconvenient and causes trauma to patients. sEMG acquisition only requires attaching electrodes to the skin over the corresponding muscles, so it is non-invasive, works in real time and covers a wide collection area. sEMG is a weak potential-difference signal collected at the skin surface; it is rich in information on body movements and reflects human movement intent. Therefore, sEMG-based human motion tracking is more suitable for clinical application, which is why it has received widespread attention and study. Pattern recognition is usually used to establish the relation between sEMG and upper limb motions.
The process can be described as follows: first, the specified upper limb movement is performed while the sEMG is collected at the skin surface over the corresponding muscles; then the signal is preprocessed by filtering and amplification, and features are extracted; finally, a pattern classifier is trained and used to classify motions. The motion recognition results can serve as motion control instructions for the upper limb rehabilitation robot. In summary, all three techniques can be used to track motion, but EMG, particularly sEMG, better matches the patient’s physiological state and is more suitable for clinical application. It is thus becoming a promising field for rehabilitation medicine and technology research.
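The recognition process described above (windowed signal, feature extraction, classifier training, motion classification) can be illustrated with a minimal Python sketch. It is not the method of any system cited here: the three time-domain features (mean absolute value, root mean square, zero-crossing count), the nearest-centroid classifier and all function names are assumptions chosen for brevity, and the filtering and amplification stage is assumed to have been applied already.

```python
import math


def features(window):
    """Return (mean absolute value, RMS, zero-crossing count) of one window."""
    n = len(window)
    mav = sum(abs(x) for x in window) / n
    rms = math.sqrt(sum(x * x for x in window) / n)
    # A zero crossing is a sign change between two consecutive samples.
    zero_crossings = sum(1 for p, q in zip(window, window[1:]) if p * q < 0)
    return (mav, rms, zero_crossings)


def train_centroids(labelled_windows):
    """Average the feature vectors of each motion label (hypothetical classifier)."""
    sums, counts = {}, {}
    for label, window in labelled_windows:
        feats = features(window)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for idx, value in enumerate(feats):
            acc[idx] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(value / counts[label] for value in acc)
        for label, acc in sums.items()
    }


def classify(centroids, window):
    """Return the label whose centroid is closest to the window's features."""
    feats = features(window)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))


# Synthetic windows standing in for preprocessed sEMG (not real recordings).
training = [("rest", [0.01, -0.01] * 10), ("flex", [0.5, -0.5] * 10)]
centroids = train_centroids(training)
print(classify(centroids, [0.4, -0.4] * 10))
```

In a real system the predicted label would be mapped to a motion control instruction for the robot; the features here are unscaled, so a practical classifier would normally normalize them first.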

3.3 Rehabilitation Mode

The introduction of rehabilitation robots brings changes in rehabilitation methods. The purpose of rehabilitation is to restore the patient’s activities. Robot-supported rehabilitation training can promote the structural recovery of the nervous system, but it is not easy to turn this structural recovery into functional recovery [26]. Under normal circumstances, although the patient’s neural pathways are reopened to some extent, the patient is still unable to complete some functional movements independently, such as picking up a cup. This phenomenon is called “learned disuse” [27]. The reason may be that the previous training was purely mechanical, without functional objectives. Studies have shown that task-based training can induce the transition from structural to functional recovery of the nervous system, and thereby promote functional rehabilitation and rebuild the patient’s activities of daily living (ADL) [1]. Meanwhile, introducing virtual reality technology into rehabilitation medicine and letting patients perform task-based training in a realistic virtual scene can increase their motivation to train, which benefits functional recovery and the recovery of their activities of daily living. In particular, with technological advances, remote rehabilitation, especially in the home environment, has drawn increasing attention, and virtual reality based task training supports rehabilitation in this mode. Combining upper limb rehabilitation robotics, virtual reality and sEMG pattern recognition to rehabilitate patients with hemiplegia is becoming a viable approach. In 2011, researchers presented an ADL rehabilitation training system supported by the robot ARMin III [28] and designed a variety of tasks in virtual reality scenes for patients’ ADL training, such as cooking.

4 Conclusion

Stroke has become one of the major causes of death. It usually impairs the patient’s voluntary movement, especially of the upper limb, and impedes the activities of daily living. The traditional clinical rehabilitation method depends on the rehabilitation therapist and the hospital and has many drawbacks and restrictions. With the development of science and technology, robot-assisted rehabilitation has attracted growing research interest. Mechanism design has progressed from early single-DOF devices with simple functions to multi-DOF, portable, automated and intelligent robots. In terms of interaction mode, sEMG has come to the fore because of its various advantages. The rehabilitation method has changed from a traditional mode with fixed time, fixed location and fixed form to remote rehabilitation, home rehabilitation and other diversified methods. Task-based rehabilitation with the aid of a robot in a virtual reality environment is a promising direction for future development.

Acknowledgment: The authors thank the Higher Educational Science and Technology Program of Shandong Province, China (No. J15LB01) and the Natural Science Foundation of Shandong Province, China (ZR2014EEQ029, ZR2015FM021) for supporting this research.


References [1] Trotti, C., Menegoni, F., Baudo, S., Bigoni, M., Galli, M., and Mauro, A.: “Virtual reality for the upper limb rehabilitation in stroke: A case report”, Gait & Posture, 2009, 30, Supplement 1, (0), pp. S42. [2] La, C., Young, B.M., Garcia-Ramos, C., Nair, V.A., and Prabhakaran, V.: “Chapter Twenty Characterizing Recovery of the Human Brain following Stroke: Evidence from fMRI Studies”, in Seeman, P., and Madras, B. (Eds.): “Imaging of the Human Brain in Health and Disease” (Academic Press, 2014), pp. 485-506. [3] Sumida, M., Fujimoto, M., Tokuhiro, A., Tominaga, T., Magara, A., and Uchida, R.: “Early rehabilitation effect for traumatic spinal cord injury”, Archives of physical medicine and rehabilitation, 2001, 82, (3), pp. 391-395. [4] Kwakkel, G., Kollen, B.J., and Krebs, H.I.: “Effects of robot-assisted therapy on upper limb recovery after stroke: A systematic review”, Neurorehabil. Neural Repair, 2008, 22, (2), pp. 111-121. [5] Prange, G.B., Jannink, M.J., Groothuis-Oudshoorn, C.G., Hermens, H.J., and Ijzerman, M.J.: “Systematic review of the effect of robot-aided therapy on recovery of the hemiparetic arm after stroke”, Journal of rehabilitation research and development, 2006, 43, (2), pp. 171-184. [6] gang, S.X., li, W.Y., Zhang, N., te, W., hai, L.Y., Jin, X., juan, L.n., and Feng, J.: “Incidence and trends of stroke and its subtypes in Changsha, China from 2005 to 2011”, Journal of Clinical Neuroscience, 2014, 21, (3), pp. 436 – 440. [7] Kwakkel, G., Kollen, B., and Twisk, J.: “Impact of time on improvement of outcome after stroke”, Stroke, 2006, 37, (9), pp. 2348 – 2353. [8] Langhorne, P., Bernhardt, J., and Kwakkel, G.: “Stroke rehabilitation”, The Lancet, 377, (9778), pp. 1693-1702. [9] Patton, J., Stoykov, M., Kovic, M., and Mussa-Ivaldi, F.: “Evaluation of robotic training forces that either enhance or reduce error in chronic hemiparetic stroke survivors”, Exp. Brain Res., 2006, 168, (3), pp. 368-383. 
[10] An-Chih, T., Tsung-Han, H., Jer-Junn, L., and Te, L.T.: “A comparison of upper-limb motion pattern recognition using EMG signals during dynamic and isometric muscle contractions”, Biomedical Signal Processing and Control, 2014, 11, pp. 17 - 26. [11] Morris, J.H., and Wijck, F.V.: “Responses of the less affected arm to bilateral upper limb task training in early rehabilitation after stroke: a randomized controlled trial”, Archives of physical medicine and rehabilitation, 2012, 93, (7), pp. 1129 – 1137. [12] Howell, M.D., and Gottschall, P.E.: “Lectican proteoglycans, their cleaving metalloproteinases, and plasticity in the central nervous system extracellular microenvironment”, Neuroscience, 2012, 217, (0), pp. 6-18. [13] Carr, J.H., and Shepherd, R.B.: “A Motor Learning Model for Stroke Rehabilitation”, Physiotherapy, 1989, 75, (7), pp. 372-380. [14] Takahashi, C.D., Der-Yeghiaian, L., Le, V., Motiwala, R.R., and Cramer, S.C.: “Robot-based hand motor therapy after stroke”, Brain, 2008, 131, pp. 425-437. [15] Zollo, L., Rossini, L., Bravi, M., Magrone, G., Sterzi, S., and Guglielmelli, E.: “Quantitative evaluation of upper-limb motor control in robot-aided rehabilitation”, Medical & Biological Engineering & Computing, 2011, 49, (10), pp. 1131-1144. [16] Lum, S.P., Reinkensmeyer, D.J., and Lehman, S.L.: “Robotic assist devices for bimanual physical therapy: preliminary experiments”, IEEE Transactions on Rehabilitation Engineering, 1993, 1, (3), pp. 185 - 191.


[17] Lum, P.S., Burgar, C.G., and Shor, P.C.: “Evidence for improved muscle activation patterns after retraining of reaching movements with the MIME robotic system in subjects with post-stroke hemiparesis”, IEEE Trans. Neural Syst. Rehabil. Eng., 2004, 12, (2), pp. 186-194. [18] Sugar, T.G., ping, H.j., Koeneman, E.J., Koeneman, J.B., Herman, R., H, H., Schultz, R.S., Herring, D.E., Wanberg, J., Balasubramanian, S., Swenson, P., and Ward, J.A.: “Design and Control of RUPERT: A Device for Robotic Upper Extremity Repetitive Therapy”, IEEE Trans. Neural Syst. Rehabil. Eng., 2007, 15, (3), pp. 336-346. [19] Perry, J.C., Powell, J.M., and Rosen, J.: “Isotropy of an upper limb exoskeleton and the kinematics and dynamics of the human arm”, Applied Bionics and Biomechanics, 2009, 6, (2), pp. 175-191. [20] Pastor, I., Hayes, H.A., and Bamberg, S.J.M.: “A feasibility study of an upper limb rehabilitation system using kinect and computer games” (Institute of Electrical and Electronics Engineers Inc., 2012), pp. 1286-1289. [21] Chang, C.-Y., Lange, B., Zhang, M., Koenig, S., Requejo, P., Somboon, N., Sawchuk, A.A., and Rizzo, A.A.: “Towards pervasive physical rehabilitation using microsoft kinect” (IEEE Computer Society, 2012), pp. 159-162. [22] Pogrzeba, L., Wacker, M., and Jung, B.: “Potentials of a low-cost motion analysis system for exergames in rehabilitation and sports medicine” (Springer Verlag, 2012), pp. 125-133. [23] Sturman, D.J., and Zeltzer, D.: “A survey of glove-based input”, Computer Graphics and Applications, 1994, 14, (1), pp. 30-39.
[24] Kamavuako, E.N., Scheme, E.J., and Englehart, K.B.: “Combined surface and intramuscular EMG for improved real-time myoelectric control performance”, Biomedical Signal Processing and Control, 2014, 10, (3), pp. 102 – 107. [25] Dhiman, R., Saini, J.S., and Priyanka: “Genetic algorithms tuned expert model for detection of epileptic seizures from EEG signatures”, Applied Soft Computing, 2014, 19, pp. 8 – 17. [26] Y, B., Y, H., Y, W., Y, Z., Q, H., C, J., L, S., and W, F.: “A prospective, randomized, single-blinded trial on the effect of early rehabilitation on daily activities and motor function of patients with hemorrhagic stroke”, Journal of clinical neuroscience : official journal of the Neurosurgical Society of Australasia, 2012, 19, (10), pp. 1376-1379. [27] Peper, E., Harvey, R., and Takabayashi, N.: “Biofeedback an evidence based approach in clinical practice”, Japanese Journal of Biofeedback Research, 2009, 36, (1), pp. 3-10. [28] Guidali, M., Duschau-Wicke, A., Broggi, S., Klamroth-Marganska, V., Nef, T., and Riener, R.: “A robotic system to train activities of daily living in a virtual environment”, Medical and Biological Engineering and Computing, 2011, pp. 1-11.

Xiao-fei CUI, Ya-dong WANG, Guang-ri QUAN, Yong-dong XU*

K-mer Similarity: a Rapid Similarity Search Algorithm for Probe Design

Abstract: Actual hybridization takes place under a global identity scenario. However, searching a large data set for sequences similar to a given sequence is very challenging, especially with global alignment. In most cases a local alignment algorithm such as BLAST, or a semi-global algorithm such as Myers’ bit-vector algorithm, is used instead. This paper introduces a novel global alignment method. It computes the same alignment as a certain dynamic programming algorithm while running over 60 times faster on appropriate data. Its high accuracy and speed make it a better choice for alignment in probe design.

Keywords: component; probe design; global alignment; fast

1 Introduction

Sequence alignment algorithms are very important to bioinformatics applications. They can be divided into three categories: global, semi-global and local [1]. A general global alignment technique is the Needleman-Wunsch algorithm [2], which is based on dynamic programming. It produces the optimal alignment of two sequences, but its high time complexity makes it unsuitable for comparing huge data sets. The most widely used local alignment algorithm is BLAST [3]. Its emphasis on speed makes it practical for the huge genome databases currently available [4,5], but it cannot guarantee optimal alignments between the query and database sequences. Myers’ bit-vector algorithm is the fastest semi-global algorithm [6], but it has limitations too. Sequence alignment is a component of probe design tools. Most probe design tools calculate identities with a local alignment algorithm such as BLAST [7-10]. Some software [11] uses a semi-global algorithm such as the bit-vector algorithm. However, actual hybridization takes place under a global identity scenario [12]. In this paper we present a new global alignment method that finds, in a huge data set, the sequences most similar to the query and of the same length as the query.

*Corresponding author: Yong-dong XU, School of Computer Science & Technology, Harbin Institute of Technology at Weihai, Weihai, China
Xiao-fei CUI, Ya-dong WANG, Guang-ri QUAN, School of Computer Science & Technology, Harbin Institute of Technology at Weihai, Weihai, China


2 Method

Our goal is to find all subsequences in a huge data set that have the same length as the query sequence and whose identity to the query is greater than MI. To achieve this, we first split each sequence in the data set into L-mer fragments, where L is the length of the query sequence. The steps for comparing the query sequence with each L-mer are described below.

2.1 Preliminaries

Let A = a1a2…an and B = b1b2…bn be DNA sequences, and let MI ≥ 0 be a given threshold. The problem is to determine whether the identity of the two sequences is greater than MI. The identity between A and B is defined as I(A,B) = (the number of optimal matches between A and B) / (the sequence length L).

2.2 The Basic Algorithm

Our method has two parts. First, comparison pairs that cannot be more similar than MI are filtered out with a k-tuple method. Second, for the remaining pairs, a modified greedy algorithm makes the final determination. Neither the k-tuple method nor the greedy algorithm is completely new. Our main contribution lies in combining the two algorithms, estimating the filtering parameters of the k-tuple algorithm, and applying the method to the global alignment problem.

2.2.1 The look-up table

In the first step, comparison pairs are filtered according to the total number of exactly matching k-tuples. First, a lookup table [13] of A is constructed to locate identical k-tuple segments rapidly. Any k-tuple that consists of characters in alpha = {A,C,G,T,a,c,g,t} is converted to an integer between 0 and 4^k − 1, where A/a equals 0, C/c equals 1, G/g equals 2 and T/t equals 3. If a character of the k-tuple is not in alpha, the k-tuple is converted to −1 and is ignored in the later calculation. Then an array C of length 4^k is used, consisting of pointers initially set to nil. In a single pass through A, each position i is added to the list that the pointer C(ic) points to, where ic is the coded form of the k-tuple beginning at position i in A.

2.2.2 The filter step

Using the lookup table, the number of exactly matching k-tuples between A and B, denoted match, is counted. In a single pass through B, the k-tuple tj beginning at each position j is coded to jc. If the list C(jc), which was constructed from sequence A, contains an element pos such that |pos − j| < W, then match = match + 1. W is the window size. A and

K-mer Similarity: a Rapid Similarity Search Algorithm for Probe Design


B are reported to be no more similar than MI if match < MinNum. Otherwise, a modified greedy algorithm is used. The parameter MinNum is discussed in Section 2.3.3.

2.2.3 The Modified Greedy Algorithm

The greedy algorithm used for the final determination is especially suitable for aligning two highly similar sequences. It is significantly faster than the traditional dynamic programming algorithm. The original algorithm was first described in [14]. It uses a user-specified score X for pruning. In order to translate back and forth between an alignment’s score and its number of differences, the alignment scoring parameters are constrained by ind = mis – mat / 2, where mat > 0, mis < 0 and ind < 0 are the scores for a match, a mismatch and an insertion/deletion. We instead use the maximum number of differences D for pruning. A difference is defined as an insertion, a deletion or a mismatch. This is more useful in some cases such as probe design. The algorithm is shown in Figure 1. R(d, k) is the x-coordinate of the last position reached on diagonal k with d differences. d is the number of differences of the comparison. L and U are the lower and upper diagonal bounds used for pruning. N is the length of sequences A and B.

i ← 0
while i < N and a_{i+1} = b_{i+1} do i ← i + 1
R(0, 0) ← i
d ← L ← U ← 0
Dhalf ← ⌊D / 2⌋
repeat
    d ← d + 1
    if L < 1 – Dhalf then
        L ← 1 – Dhalf
    if U > Dhalf – 1 then
        U ← Dhalf – 1
    for k ← L – 1 to U + 1 do
        i ← max { R(d – 1, k – 1) + 1   if L < k,
                  R(d – 1, k) + 1       if L ≤ k ≤ U,
                  R(d – 1, k + 1)       if k < U }
        j ← i – k
        if i > –∞ then
            while i < N, j < N and a_{i+1} = b_{j+1} do
                i ← i + 1; j ← j + 1
            R(d, k) ← i
        else
            R(d, k) ← –∞
        if k = 0 and R(d, k) = N then
            report similar
    if d = D and R(d, 0) < N then
        report not similar
    L ← L – 1
    U ← U + 1

Figure 1. The modified greedy algorithm.
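The procedure in Figure 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' C++ implementation: the function name `greedy_similar`, the dictionary used to store R(d, k), the explicit bounds check on i and j, and the early return for identical sequences are all assumptions added for the sketch. The diagonal bound uses `max_diff // 2`, which assumes that Dhalf is D / 2 rounded down.

```python
NEG_INF = float("-inf")


def greedy_similar(a: str, b: str, max_diff: int) -> bool:
    """Return True if equal-length a and b align end to end on diagonal 0
    with at most max_diff differences (mismatches, insertions, deletions)."""
    n = len(a)
    if len(b) != n:
        raise ValueError("sequences must have equal length")

    # Slide along diagonal 0 while the sequences agree (0 differences).
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    if i == n:
        return True  # identical sequences; Figure 1 leaves this case implicit

    reach = {0: i}          # reach[k] plays the role of R(d - 1, k)
    half = max_diff // 2    # a path returning to diagonal 0 cannot stray further

    for d in range(1, max_diff + 1):
        new_reach = {}
        for k in range(max(-d, -half), min(d, half) + 1):
            # Furthest x-coordinate on diagonal k using exactly one more difference.
            i = max(
                reach.get(k - 1, NEG_INF) + 1,  # gap: step over from diagonal k - 1
                reach.get(k, NEG_INF) + 1,      # mismatch on diagonal k
                reach.get(k + 1, NEG_INF),      # gap: step over from diagonal k + 1
            )
            if i == NEG_INF:
                continue
            j = i - k
            if i > n or j > n:
                continue  # extra guard (assumption): the step left the matrix
            # Greedy extension along the diagonal while characters match.
            while i < n and j < n and a[i] == b[j]:
                i += 1
                j += 1
            new_reach[k] = i
            if k == 0 and i == n:
                return True   # "report similar"
        reach = new_reach
    return False              # "report not similar"
```

For example, `greedy_similar("ACGTACGTAC", "CGTACGTACA", 2)` accepts a pair that differs by one deletion and one insertion, while the same call with `max_diff=1` rejects it.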


2.3 Parameter Estimation

Two parameters, the sequence length L and the similarity threshold MI, must be specified by the user. All the other parameters can then be estimated automatically.

2.3.1 The Maximum Difference Parameter D

The condition that the identity between A and B exceeds MI is translated into a bound D on the number of differences between them. D is calculated with (1).

$$D = \lfloor (1 - \mathit{MI}) \times L \rfloor \tag{1}$$

2.3.2 The Parameter k

To obtain a result with no accuracy loss, the parameter k should be small enough. To achieve high speed, however, k should be as large as possible. We obtain k from (2), where L is the length of the sequences being compared and D is the maximum number of differences between them. This value is small enough to guarantee at least one exact k-tuple match between the two sequences: D differences split the L – D matching positions into at most D + 1 identical regions, so at least one region is no shorter than k. It is also the largest value that can be used with no accuracy loss.

$$k = \left\lceil \frac{L - D}{D + 1} \right\rceil \tag{2}$$
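As a worked illustration of (1) and (2), the following sketch computes D and k for the 50-mer probes and the threshold MI = 0.85 used in Section 3. The function name `estimate_parameters` and the choice to pass MI as a decimal string (so that exact rational arithmetic avoids floating-point rounding in the floor) are assumptions of this sketch, not part of the original program.

```python
from fractions import Fraction
from math import ceil, floor


def estimate_parameters(length, min_identity):
    """Return (D, k) from equations (1) and (2).

    min_identity is given as a decimal string such as "0.85" so that the
    floor in equation (1) is evaluated exactly.
    """
    mi = Fraction(min_identity)
    max_diff = floor((1 - mi) * length)                   # equation (1)
    k = ceil(Fraction(length - max_diff, max_diff + 1))   # equation (2)
    return max_diff, k


print(estimate_parameters(50, "0.85"))  # (7, 6)
```

With L = 50 and MI = 0.85 this gives D = 7 and k = 6, which agrees with the word size of 6 reported in Section 3.2.1.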

2.3.3 The Minimum Number of k-tuple Matches MinNum

The minimum number of k-tuple matches MinNum is given by Lemma 1.

LEMMA 1. Suppose two sequences of length L have D differences between them. Then they have at least MinNum = L – k × (D + 1) + 1 k-tuple matches.

PROOF. Split the two sequences into identical regions separated by the differences, as in Figure 2. In the figure, red represents the identical regions and yellow represents the differences. In the worst case, each identical region before the last one has length k – 1, so that these regions contain no identical k-tuple while using as few differences as possible.

Figure 2. The divided alignment.

Suppose that k-tuple matches occur only on diagonal 0. The D differences and the D regions of length k – 1 occupy k × D positions, so the last region has length L – k × D and contains L – k × D – (k – 1) k-tuples. The minimum number of k-tuple matches is therefore L – k × D – (k – 1), which equals L – k × (D + 1) + 1.
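The lookup table of Section 2.2.1, the filter step of Section 2.2.2 and the bound of Lemma 1 can be combined in a short sketch. The names `encode`, `build_lookup`, `count_matches` and `passes_filter` are hypothetical, and a Python dictionary of position lists stands in for the pointer array C of length 4^k described in the text. The window size W is left as a parameter because the text does not fix its value.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Code a k-tuple as a base-4 integer in [0, 4**k - 1], or -1 if it
    contains a character outside {A, C, G, T, a, c, g, t}."""
    value = 0
    for ch in kmer.upper():
        if ch not in CODE:
            return -1
        value = value * 4 + CODE[ch]
    return value


def build_lookup(a, k):
    """Single pass through A: map each k-tuple code to its start positions."""
    table = defaultdict(list)
    for i in range(len(a) - k + 1):
        code = encode(a[i:i + k])
        if code >= 0:
            table[code].append(i)
    return table


def count_matches(table, b, k, window):
    """Single pass through B: count positions j whose k-tuple also occurs
    in A at some position pos with |pos - j| < window."""
    match = 0
    for j in range(len(b) - k + 1):
        code = encode(b[j:j + k])
        if code < 0:
            continue
        if any(abs(pos - j) < window for pos in table.get(code, ())):
            match += 1
    return match


def passes_filter(a, b, k, max_diff, window):
    """Keep the pair for the greedy step only if match >= MinNum (Lemma 1)."""
    min_num = len(a) - k * (max_diff + 1) + 1
    return count_matches(build_lookup(a, k), b, k, window) >= min_num
```

For L = 50, k = 6 and D = 7, Lemma 1 gives MinNum = 50 – 6 × 8 + 1 = 3, so any pair with fewer than three k-tuple matches inside the window is discarded before the greedy algorithm runs.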


In probe design, the longest contiguous stretch of matches should be shorter than a threshold MCS (maximum contiguous stretch). Lemma 2 deals with this situation.

LEMMA 2. Suppose two sequences of length L have D differences between them and no contiguous stretch of MCS matches. Then they have at least MinNum = L – k × (D + 1) + 1 k-tuple matches.

PROOF. Consider the scenario above. Borrow x differences from the regions before the last one, as in Figure 3. In the figure, red represents the identical regions and yellow represents the differences. The last region must satisfy (x + 1) × (MCS – 1) ≥ L – k × D + (k – 1) × x, so x can be calculated by (3).

$$x = \left\lceil \frac{L - k \times D - \mathit{MCS} + 1}{\mathit{MCS} - k} \right\rceil \tag{3}$$

Then, the minimum number of k-tuple matches with no contiguous stretch of MCS matches can be calculated with (4).

$$\mathit{MinNum} = (\mathit{MCS} - 1 - (k - 1)) \times x + L - k \times D + (k - 1) \times x - (\mathit{MCS} - 1) \times x - (k - 1) = L - k \times (D + 1) + 1 \tag{4}$$

Figure 3. The divided alignment with contiguous stretches.

3 Results and Discussion

In this section, we first conduct experiments to select the best parameter K for KS (the K-mer Similarity algorithm). Then, a series of data sets is used to demonstrate the performance of KS. Finally, KS is compared with other alignment algorithms that are widely used in probe design.

3.1 Examination of the KS Algorithm

3.1.1 The Selection of Parameter K

We have implemented KS as a C++ program. On a data set of 100 sequences with 610 50-mer candidate probes remaining after the MCS=20 filter, there are 34874402 pairwise comparisons in total. Our program is used to select the pairs whose similarity is higher than


MI (MI=0.85). The experiment is run on a PC with a 2.60 GHz Intel i5-3230M CPU and 4.00 GB of memory. The execution times of KS with different values of K are shown in Figure 4. Clearly, the greater K is, the less time KS takes. To obtain better performance, K should therefore be as large as possible. However, the optimal alignment of the query and subject sequences cannot be guaranteed if K is too large. We use (2) to select the greatest K with no accuracy loss. It guarantees at least one exact k-tuple match for two sequences of length L with D differences.

Figure 4. The relationship between K and execution time.

3.1.2 Efficiency of Each Part

We have implemented a stand-alone modified greedy algorithm and a dynamic programming algorithm that perform the same task as KS. Their execution times are shown in Figure 5. As the figure shows, the greedy algorithm is almost 20 times faster than the dynamic programming algorithm. The execution times of the stand-alone greedy algorithm and the KS algorithm are shown in Figure 6. The execution time of the greedy algorithm is about 3 times that of KS, which indicates that filtering greatly improves the efficiency of the algorithm.


Figure 5. The execution time of greedy and DP algorithm.

Figure 6. The execution time of greedy and KS algorithm.

3.1.3 The Performance of KS with Different Data Sizes

We have tested the performance of KS on data sets of different sizes. The filter rate and execution time are shown in Table 1. The filter rate is stable and stays above 99%, which ensures that KS takes less execution time than the dynamic programming algorithm.


Table 1. The Filter Rate and Execution Time of KS on Different Data Sets.

| No. | Total pairs | Filtered pairs | Filter rate [%] | Execution time [s] |
|---|---|---|---|---|
| 1 | 13326339 | 13232473 | 99.296 | 11.194 |
| 2 | 16984412 | 16869880 | 99.326 | 14.225 |
| 3 | 26024135 | 25857935 | 99.361 | 21.873 |
| 4 | 38849145 | 38610875 | 99.387 | 32.744 |
| 5 | 248038281 | 246759999 | 99.485 | 211.987 |
| 6 | 1279717970 | 1269328593 | 99.188 | 1172.455 |
| 7 | 3.55891e+009 | 3.53145e+009 | 99.228 | 3053.496 |

The relationship between execution time and the data size is shown in Figure 7. The line indicates that the execution time of KS algorithm is proportional to the data size.

Figure 7. The relationship between KS’s execution time and data size.
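The proportionality claim can be checked directly against the numbers in Table 1. The sketch below recomputes the filter rate and derives a throughput in pairs per second for each row; the helper names `filter_rate` and `throughput` are introduced here for illustration only, and the rows are copied from Table 1.

```python
# Rows copied from Table 1: (total pairs, filtered pairs, execution time in s).
TABLE_1 = [
    (13326339, 13232473, 11.194),
    (16984412, 16869880, 14.225),
    (26024135, 25857935, 21.873),
    (38849145, 38610875, 32.744),
    (248038281, 246759999, 211.987),
    (1279717970, 1269328593, 1172.455),
    (3.55891e9, 3.53145e9, 3053.496),
]


def filter_rate(total, filtered):
    """Percentage of comparison pairs removed by the k-tuple filter."""
    return 100.0 * filtered / total


def throughput(total, seconds):
    """Comparison pairs processed per second."""
    return total / seconds


for total, filtered, seconds in TABLE_1:
    print(f"{total:>14.0f}  {filter_rate(total, filtered):7.3f} %  "
          f"{throughput(total, seconds) / 1e6:5.2f} M pairs/s")
```

Across all seven data sets the throughput stays at roughly 1.1 to 1.2 million pairs per second, which is what an execution time proportional to the number of pairs would look like.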

3.2 Comparison with Other Algorithms

In terms of accuracy and efficiency, the KS algorithm is compared with two algorithms commonly used in probe design, BLAST and Myers’ bit-vector algorithm. Given a set of candidate probes and a set of subject sequences, the three algorithms are used to select the final probes. A candidate probe can be a final probe when the similarity


between it and each subject sequence is less than MI=0.85. Two sets of candidate probes are used: one has 610 candidate probes and the other has 4383. The length of the candidate probes is 50. Five sets of subject sequences are used, containing 100, 200, 500, 1000 and 10000 sequences. Because actual hybridization corresponds to a global identity scenario [12], the results of the dynamic programming algorithm are used as the standard probes.

3.2.1 The Comparison of Accuracy

We designed 7 experiments. The word size of BLAST is set to the same value as in the KS algorithm, 6, to guarantee optimal alignments. The results are shown in Table 2. DP denotes the results of the dynamic programming algorithm, KS the results of the K-mer Similarity algorithm, and Myers the results of the bit-vector algorithm. The KS algorithm obtained the same result as DP in all 7 experiments. Myers lost some standard probes in the 6th and 7th experiments. BLAST selected more probes than DP in all 7 experiments and also lost some standard probes in the 6th and 7th experiments.

Table 2. The Probe Results of DP, BLAST, KS and Myers.

| No. | Subject sequences set | Candidate probes set | DP | BLAST | KS | Myers |
|---|---|---|---|---|---|---|
| 1 | 100 | 610 | 89 | 204 | 89 | 89 |
| 2 | 200 | 610 | 64 | 169 | 64 | 64 |
| 3 | 500 | 610 | 55 | 159 | 55 | 55 |
| 4 | 1000 | 610 | 55 | 125 | 55 | 55 |
| 5 | 10000 | 610 | 42 | 66 | 42 | 42 |
| 6 | 1000 | 4383 | 1810 | 2559 | 1810 | 1786 |
| 7 | 10000 | 4383 | 267 | 869 | 267 | 256 |

The results of the BLAST and DP algorithms, together with the number of probes they have in common, are shown in Table 3. The difference between the results is mainly due to the different similarity measures of the two algorithms. In the BLAST algorithm, the similarity between two sequences is defined as I(A,B) = (the number of optimal match bases) / (the length of the alignment result). In the alignment process, an insertion in the query sequence is equivalent to a deletion in the subject sequence at the same position. Similarly, an insertion in the subject sequence is equivalent to a deletion in the query sequence. When generating an alignment, BLAST uses insertions in place of deletions. As a result, an alignment with more optimal match bases can confusingly receive a lower similarity than one with fewer optimal match bases. As Figure 8 shows, the number of optimal match bases is 45 in (a) and its similarity is 90.0%. The number of optimal match bases is 46 in (b), while its similarity is 88.5%. Because of


this, the BLAST similarity of some highly similar pairs is lower than it really is, and some candidates are accepted as final probes when they should not be. In addition, the actual lengths of the query and subject sequences can differ in the final BLAST alignment. This can sometimes make the similarity between the query and subject sequences greater than that of the global alignment of sequences with the same length. In Figure 9, the similarity is greater than that of the global alignment between two sequences of the same length because the subject sequence actually aligned is shorter. This causes BLAST to lose some standard probes compared with the DP algorithm. Myers loses some probes too, for the same reason. Once deletions are introduced, there are multiple alignment results, because an insertion in one sequence is equivalent to a deletion in the other. Therefore, one may get different similarity values for the same pair.

Table 3. The Probe Results Comparison Between DP and BLAST.

| No. | Subject sequences set | Candidate probes set | DP | BLAST | Number of same results |
|---|---|---|---|---|---|
| 1 | 100 | 610 | 89 | 204 | 89 |
| 2 | 200 | 610 | 64 | 169 | 64 |
| 3 | 500 | 610 | 55 | 159 | 55 |
| 4 | 1000 | 610 | 55 | 125 | 55 |
| 5 | 10000 | 610 | 42 | 66 | 42 |
| 6 | 1000 | 4383 | 1810 | 2559 | 1806 |
| 7 | 10000 | 4383 | 267 | 869 | 261 |

Figure 8. The similarity confusion of BLAST.

Figure 9. A case in which BLAST reports a higher similarity than the dynamic programming algorithm.


In order to avoid the confusion described above, we restrict insertions and deletions to the subject sequence in the Myers, DP and KS algorithms. The similarity is calculated by the formula I(A,B) = (the number of optimal match bases) / (the length of the query sequence). In this way, we guarantee not only that there is a single similarity value for each comparison but also that more optimal matches always give a greater similarity.
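The effect of the two denominators can be illustrated with the numbers quoted for Figure 8. In this sketch the alignment lengths 50 and 52 are not stated in the text; they are inferred from the reported percentages (45/50 = 90.0% and 46/52 ≈ 88.5%), and the query length 50 is the probe length used in the experiments. The function names are hypothetical.

```python
def identity_by_alignment_length(matches, alignment_length):
    """BLAST-style identity: matches divided by the alignment length."""
    return matches / alignment_length


def identity_by_query_length(matches, query_length):
    """Identity used for Myers, DP and KS: matches divided by the query length."""
    return matches / query_length


QUERY_LENGTH = 50            # probe length used in the experiments
cases = {
    "(a)": (45, 50),         # (matches, inferred alignment length)
    "(b)": (46, 52),
}

for name, (matches, aln_len) in cases.items():
    print(name,
          f"alignment-based: {identity_by_alignment_length(matches, aln_len):.3f}",
          f"query-based: {identity_by_query_length(matches, QUERY_LENGTH):.3f}")
```

With the alignment length as denominator, case (b) scores lower than case (a) despite having one more matching base. With the fixed query length as denominator, the ordering follows the number of matches.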

3.2.2 The Comparison of Efficiency

The execution times of the different algorithms in the 7 experiments above are shown in Figure 10. The BLAST program used is the one in ncbi-blast-2.2.31+.

Figure 10. The execution time of Myers, BLAST, KS and DP on different data sets.

As the figure shows, the KS algorithm is faster than BLAST, and the execution time of the DP algorithm is about 64 times that of the KS algorithm. The Myers algorithm is the fastest, about 9 times faster than the KS algorithm.

4 Conclusion

The alignment problem in probe design is a global alignment problem. The well-known Needleman-Wunsch algorithm is not efficient enough for large data sets. The widely used local alignment algorithm BLAST and the fast Myers’ bit-vector algorithm may sometimes give wrong results. We have introduced a novel global alignment method called KS. It consists of two parts: the filter step and the greedy algorithm. With more than 99% of the low-similarity comparison pairs removed by the first step, KS runs about 3 times faster than the greedy algorithm alone. The modified greedy algorithm is very efficient for aligning highly similar pairs.


It is about 20 times faster than the traditional dynamic programming algorithm. With the two methods combined, our algorithm runs over 60 times faster on appropriate data. Its high accuracy and speed make KS a better choice for probe design.

Acknowledgment: This work is supported by the China National Natural Science Foundation (61172099).

References

[1] J. Daily, “Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments,” BMC Bioinformatics. vol. 16(1), pp. 81, 2016.
[2] S.B. Needleman, and C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology. vol. 48(3), pp. 443-453, 1970.
[3] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology. vol. 215(3), pp. 403-410, 1990.
[4] R. Ariyadasa, and N. Stein, “A sequence-ready physical map of barley anchored genetically by two million single-nucleotide polymorphisms,” Plant Physiology. vol. 164(1), pp. 412-423, 2013.
[5] M. Pfeifer, K.G. Kugler, S.R. Sandve, B. Zhan, H. Rudi, and T.R. Hvidsten, “Genome interplay in the grain transcriptome of hexaploid bread wheat,” Science. vol. 345(6194), 1250091, 2014.
[6] G. Myers, “A fast bit-vector algorithm for approximate string matching based on dynamic programming,” Journal of the ACM (JACM). vol. 46(3), pp. 395-415, 1999.
[7] S. Terrat, E. Peyretaillade, O. Goncalves, E. Dugat-Bony, F. Gravelat, A. Mone, et al., “Detecting variants with metabolic design, a new software tool to design probes for explorative functional DNA microarray development,” BMC Bioinformatics. vol. 11(3), pp. 2611-2619, 2010.
[8] E. Dugat-Bony, M. Missaoui, E. Peyretaillade, C. Biderrepetit, O. Bouzid, and C. Gouinaud, “HiSpOD: probe design for functional DNA microarrays,” Bioinformatics. vol. 27(5), pp. 641-648, 2011.
[9] X. Wang, and B. Seed, “Selection of oligonucleotide probes for protein coding sequences,” Bioinformatics. vol. 19(7), pp. 796-802, 2003.
[10] S.H. Chen, C.Z. Lo, S.Y. Su, B.H. Kuo, A.H. Chao, and C.Y. Lin, “UPS 2.0: unique probe selector for probe design and oligonucleotide microarrays at the pangenomic/genomic level,” BMC Genomics. vol. 11(Suppl 4), pp. 325, 2010.
[11] X. Li, Z. He, and J. Zhou, “Selection of optimal oligonucleotide probes for microarrays using multiple criteria, global alignment and parameter estimation,” Nucleic Acids Research. vol. 33(19), pp. 6114-6123, 2005.
[12] M.D. Kane, T.A. Jatkoe, C.R. Stumpf, J. Lu, J.D. Thomas, and S.J. Madore, “Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays,” Nucleic Acids Research. vol. 28(22), pp. 4552-4557, 2000.
[13] J.P. Dumas, and J. Ninio, “Efficient algorithms for folding and comparing nucleic acid sequences,” Nucleic Acids Research. vol. 10(1), pp. 197-206, 1982.
[14] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A greedy algorithm for aligning DNA sequences,” Journal of Computational Biology. vol. 7(1-2), pp. 203-214, 2000.

Hong ZHANG, Ling-ling ZHANG*

Research and Prospects of Large-scale Online Education Pattern

Abstract: MOOCs (Massive Open Online Courses) have been studied extensively by academia, industry and business since 2012. Despite their rapid development, many challenges remain to be solved. This paper first gives a definition and explanation of MOOCs based on current research. We then identify the problems hindering the development of MOOCs by analyzing their popularity. Finally, we offer our perspectives and advice on MOOCs through a study of these challenges and of their future. This paper can thus serve as a reference for the development of MOOCs.

Keywords: MOOC; Current Research; Operation mode; Challenges

1 Introduction

MOOC (Massive Open Online Course) is a term proposed by the Canadian scholar Dave Cormier and Bryan Alexander in 2008. It has developed into a new online education pattern. Although the term was proposed recently, the growth of MOOCs has been no less striking than the rise of traditional education. MOOCs became well known through their rise in 2012 and their global popularity in 2013, and they have become a beacon for the global development of Internet education. The reason for this rapid development is that MOOCs meet learners’ expectations; edX, a free learning platform, is one example. In general, people can learn what they want anywhere and at any time without great expense. Most MOOC courses are free and provide various learning resources for the learner. Online platforms give everyone a chance to reach famous teachers around the world. They also provide a platform for learners to communicate [1]. The development of MOOCs has a decisive effect on the reform of traditional education. This paper introduces the leading MOOCs around the world and compares their differences. We illustrate the advantages of MOOCs and discuss their specific operating process by examining two aspects. Finally, we analyze the challenges and opportunities of MOOCs.

*Corresponding author: Ling-ling ZHANG, Department of Computer Science and Technology, Beihua University, Jilin, China, E-mail: [email protected] Hong ZHANG, Academic Affair Department, Beihua University, Jilin, China


2 Definition of MOOC

There are many excellent MOOCs, such as edX, Coursera, Udacity, FutureLearn, Online Course, NetEase Cloud Course, Shell MOOC, etc. [2]. From the number of people registering every year, we can conclude that the foreign MOOCs (edX, Coursera and Udacity) are the pioneers in this area. These foreign MOOCs lead the development of global online education, including its operating pattern and teaching style. In our country, Tsinghua Online Course (Chinese MOOC), NetEase Cloud Course and Shell MOOC are the representative MOOCs. From the different characteristics and patterns of these MOOCs, researchers can understand the current status of large-scale online education and its future evolution.

2.1 Foreign MOOC

All three representative foreign MOOC platforms aim at sharing the best courses of the best universities with the whole world. Apart from this aim, they also provide high-quality course resources (such as video), set homework requirements and foster a good learning atmosphere, which are the reasons for their success. In April 2012, MIT and Harvard created edX, a non-profit platform that aims at improving teaching quality and spreading online education. MIT, Harvard and their cooperating enterprises fund this free platform. Coursera was founded by Daphne Koller and Andrew Ng, who are Stanford professors. It is a large-scale online education platform that combines profit with online education. So far, many famous universities work with Coursera, including Princeton and the University of Pennsylvania. In addition, Coursera has provided at least 1500 courses. In the beginning, Udacity was free. However, a reform in 2013 made Udacity promote specialized online training instead of free online education. This shows that Udacity has found a new profitable pattern that differs from the methods of edX and Coursera. In terms of teaching style, edX is elaborately produced, whereas Coursera offers a conventional form of education. Udacity provides a different method called ‘Exchange Identity’, in which teachers ask questions and students solve them [3]. Apart from the online study opportunities provided by MOOCs, the educational institution issues certificates to learners after they complete the corresponding courses [4], which can help them find a good job. Generally, these certificates are classified as free or paid, and the free certificates are of relatively little value. For example, edX provides normal certificates (free), authenticated certificates (paid) and XSeries certificates (paid and focused on a specific theme). Coursera provides free authentication and signature tracking (paid). However, among these three platforms, only Udacity can provide a large number of credits to students, which are accepted by most colleges. In addition, there are other MOOC platforms, such as Iversity in Germany and FutureLearn in England, which engage teachers from all over the world and aim at building a famous online education platform.


2.2 Chinese MOOC

Chinese MOOCs began in 2013. In May of that year, Tsinghua University and Peking University each joined edX. At the same time, Peking University established a collaborative relationship with Coursera. In October of the same year, Tsinghua University launched an online educational platform called ‘Online Course’, which is known as the Chinese edX. In addition, five universities on both sides of the Taiwan Straits jointly launched an online education site called ‘Ewant Open Education Platform’. After that, more and more universities joined online education platforms one after another. Various Internet companies also want a share of the MOOC market. With that ambition, they have launched their own online education platforms, such as Netease Cloud Course, Shell MOOC, YY Education and so on. Compared with the MOOCs created by universities, these MOOCs are more diversified and can satisfy more of people's needs. For example, in addition to course resources, Netease Cloud Course cooperates with institutes to engage online teachers who interact with students and solve their problems. It also coaches students according to their different characteristics, which is hard to achieve in open MOOCs. Other platforms place particular emphasis on high-quality courses to serve their customers. High quality here means that the courses are more lively, acceptable and understandable, instead of boring. However, no matter how different these Chinese MOOCs are, the final test and graduation authentication are necessary. Normally, the final test includes an average grade (a system test and peer rating) and an online attendance rate. This kind of test is similar to that of offline education, as in an ‘Online University’. The difference is whether the test platform is online or offline. Offline tests are the core method by which many traditional educational authorities assess their students. However, the pattern of Internet education drives the relevant authorities to combine the characteristics of online education with traditional education. The future of MOOCs is broad, and the development of Chinese MOOCs is promoted by the universities and their partners. Most Internet media companies and educational authorities aim at promoting MOOCs to earn profit. Even though there are many online education platforms, most of them just import a large number of teaching videos, which confuses students’ choices and wastes resources. In this situation, students also cannot obtain high-quality resources quickly. Beyond that, many online platforms lack follow-up feedback and supervision and have poor interaction. At present, technology is not the reason for these problems. In addition, personalized online education is always expensive, which hinders the development and promotion of Chinese MOOCs [5]. The MOOCs introduced above are just the tip of the iceberg of global online education. As shown in Table 1, in addition to the three biggest foreign MOOC platforms, there are also FutureLearn, Open2Study, Iversity, Khan Academy, OpenLearning, NovoEd, Canvas, Fun, Udemy, Openup Ed, IOC Athlete MOOC and so on. The development of Chinese MOOCs is not lagging behind. There are Geeks College,


MOOC of Chinese University, iMOOC, NTHU MOOCs, Class Network, Dauber MOOC, 51CTO and Chinese Ewant. More detailed information on MOOCs is given in references [6,7].

Table 1. Foreign MOOC and Chinese MOOC

| Group | MOOC platform | Brief |
|---|---|---|
| Foreign MOOCs | Coursera | Coursera is a for-profit educational technology company and currently offers more than 400 course themes and more than 1500 courses. |
| Foreign MOOCs | edX | EdX is a nonprofit platform created by Harvard and MIT in 2012. It offers more than 900 online courses. |
| Foreign MOOCs | Udacity | Udacity was founded in 2011 and cooperates directly with university professors. It mainly provides online computer courses. |
| Foreign MOOCs | FutureLearn | FutureLearn is a joint effort of 12 universities in the UK, launched in 2013. It currently provides more than 280 online courses as well as offline tests to obtain certificates. |
| Foreign MOOCs | Iversity | Iversity is a German MOOC platform. It cooperates with teachers to collect courses from around the world. It currently offers more than 60 courses. |
| Foreign MOOCs | OpenupEd | OpenupEd is supported by the European Commission. It currently covers more than 215 courses and 12 different languages. |
| Foreign MOOCs | Others | OpenLearning, NovoEd, Canvas, Google, FUN, IOC-Athlete-MOOC, World-Science-U, Udemy, stanford-open-edx, P2PU, Alison, Saylor, Allversity, Academic Earth, JANUX, Microsoft Virtual Academy (MVA), Khan Academy |
| Chinese MOOCs | Online Class | It is based on edX and was founded by Tsinghua University. Online Class is the largest MOOC platform in China and provides about 508 online courses. |
| Chinese MOOCs | Ewant | Ewant was jointly set up by Shanghai Jiaotong University, Xi’an Jiaotong University, Southwest Jiaotong University, Beijing Jiaotong University and National Chiao Tung University (Hsinchu, Taiwan). It offers 112 courses. |
| Chinese MOOCs | Netease Cloud Class | It is a platform offering online skills learning, with more than ten categories of online courses including computers, foreign languages, sports, etc. |
| Chinese MOOCs | Shell MOOC | Shell MOOC was founded by the Shell company. It covers most courses of the three major foreign MOOC platforms and displays these courses in Chinese. |
| Chinese MOOCs | Online University | It was created in April 2014. It is an educational platform organized by several universities and is characterized by openness, public welfare and so on. |
| Chinese MOOCs | YY Education | It is an interactive Internet education platform established in 2011. It takes advantage of the Internet to offer online interactive classes across regions and time zones. |
| Chinese MOOCs | Others | Wisdom Tree, Supernova MOOC, Geek College, Chinese MOOC, MOOCs@pku, iMOOC, NTHU MOOCS (Taiwan), Transmission Class, Daube, 51CTO, Chinese Ewant |


3 Operation Mode of MOOC

The business model of MOOCs is formally defined as a paradigm formed during the overall development of MOOCs. It determines a series of methods: how MOOCs choose their customers, how they provide course categories according to the current status of online education, how they balance the cost of online courses against their benefit, and so on. This section introduces the most important parts of the MOOC operating model, namely course operation and the profit model.

3.1 Course Operation

Normally, whether a company’s products are competitive directly determines the fate of the enterprise. Similarly, whether the curriculum products provided by MOOCs are competitive directly determines the success of MOOCs. Although there are countless domestic and overseas MOOCs, their curriculum operation basically includes three elements: the courses provided, the teaching method and the evaluation method [8]. Different MOOCs provide different courses. Some large-scale MOOCs (such as Coursera and edX) provide subject content from all walks of life and are able to satisfy the requirements of almost any learner. However, some MOOCs limit the courses they provide, probably because of their own capability or their specific cooperative relationships with certain universities. Learners can choose courses freely according to their own needs. At the same time, in order to optimize the learning experience, almost all MOOCs offer short videos (about 6-15 minutes) to the learner, so that the learner’s interest does not plummet after a long time spent learning. Even so, each online course lasts about 1-2 hours. So far, the teaching methods of MOOCs are divided into two types: synchronous and asynchronous (real-time and not real-time). Learners who wish to control their own learning time and progress are advised to choose asynchronous MOOCs. Many videos and notes in this kind of teaching are prepared in advance. Teachers are less likely to be online at any given time because the schedule is not fixed. Interaction in this teaching method is mainly based on a human-machine interface. In contrast, students who choose the synchronous mode have more opportunities to communicate with teachers or to discuss with other students who are studying together. This kind of communication is very important for learners to understand the relevant knowledge. However, synchronous MOOCs have a fixed commencement date. Sometimes a learner’s plan may be interrupted by missing the commencement date, which waits for no one [9]. Each of the two modes has its advantages and disadvantages, and students need to choose carefully according to their personal situation. However, the data show that asynchronous MOOCs are currently more popular. The most obvious phenomenon


is that although many people have chosen synchronous MOOCs, more than half of them cannot complete the course. For example, the lack of a sound attendance mechanism leads many students to skip classes. The assessment of online courses is similar to that of traditional offline education. According to the frequency of assessment, online assessments can be held once a week or once a month. Such an assessment is usually used to test the knowledge students have already learned. According to the time of feedback, tests can be divided into online tests and offline tests. In addition, according to the importance of the assessment, tests can be divided into a midterm test and a final test. People who pass the final test receive a corresponding certificate. Both Coursera and edX currently provide two kinds of certificates for learners: a paid certificate and a free certificate of honor. Naturally, companies prefer people who hold paid certificates. In order to meet the needs of specific learners, some platforms, such as Udacity, recommend the credit mechanisms of the corresponding universities to these learners. Credits gained from Udacity have been recognized by many universities in the US.

3.2 Profit Model The great success of MOOCs in a short time suggests how strongly they may influence future educational reform. The emergence, development and maturity of any new technology must be accompanied by a specific business model. The initial MOOCs (such as Udacity) were free of charge, but their development needs economic support. As a result, MOOCs have begun to charge fees and are evolving a business model of their own. Research shows that the current MOOC business model is still in the exploratory stage. The revenue of the mainstream MOOCs comes mainly from two sources: learners and other organizations. Any learner can get free access to the general learning materials and the corresponding course content on a MOOC, so the question is how MOOCs earn money from learners. In fact, the principle is very simple. Although learners can complete the course assessment for free, they need to pay a fee if they want to obtain the corresponding certificate. The free course has no supervision, whereas some MOOCs hold a strict, invigilated examination; after a learner passes the exam, the MOOC issues the corresponding certificate as proof of ability. In addition, some MOOCs offer advanced professional training courses (such as courses on the R language), which are certainly not free. Stable investment is very important for the sustainable development of MOOCs. Generally, such investment comes from corporate sponsorship and venture capital. However, only MOOCs with high quality, high social impact and high efficiency can attract this kind of investment. Therefore, most MOOCs must find other ways to survive, which usually means cooperating with enterprises. This cooperation includes not only establishing potential relationships between enterprises and learners, with the aim of recruiting personnel for the enterprise, but also sharing the profit from offering online courses. This model of enterprise cooperation, which builds on the broad coverage of MOOC services, has been adopted by most MOOCs and is gradually developing into a mature business model.

4 Meaning of MOOC The popularity of MOOCs has become an irresistible trend around the world. The question, however, is why we need MOOCs. The obvious reason is that, through MOOCs, many people can receive free education from the best universities in the world and learn the latest scientific knowledge taught there. MOOCs are also popular for several other reasons, summarized as follows.

4.1 Openness of MOOC All MOOCs are highly open. People anywhere with network access can log in to a MOOC and study what they want. Most importantly, unlike traditional education, MOOCs charge no tuition, so online education can reduce the burden on poor students. In addition, traditional education requires an appointed place and time: students must attend class on time. MOOCs, by contrast, offer a relatively free choice of time and place, which satisfies the requirements of different people. Besides time and place, MOOCs never limit the age of learners, which again differs from traditional education. Finally, as an open form of online education, MOOCs emphasize the sharing of knowledge, which gives people free access to most MOOC resources.

4.2 Scale of MOOC MOOC stands for massive open online course. The scale of a MOOC differs from that of traditional education, which is bounded by the size of the classroom. Technologically, the scale of a MOOC depends on the infrastructure of the Internet service provider. In general, an online classroom can hold thousands of people at the same time, so MOOCs have a great advantage in scale. At present, MOOC platforms such as NetEase Open Class provide a large quantity of course resources for learners to choose from. To make this easier, MOOCs also provide convenient and accurate search and sharing functions. Learners can match classes to their interests and hobbies and then choose courses precisely on the corresponding MOOC. Traditional education, by contrast, tends to leave students with a rigid and boring impression.

4.3 Autonomy of MOOCs At present, mainstream MOOCs offer a variety of ways to study a course. MOOCs differ from traditional education, which teaches students at a fixed time and place. MOOCs take learners' own work schedules into consideration and therefore provide a relatively free way of learning: students can pursue their online study in their own time. In this mode, most interaction takes place between learners and computers. In addition, discussion among people is not real-time, so feedback takes some time to arrive.

4.4 Consistency of MOOC In traditional education, a student cannot take two courses with two teachers at the same time. With MOOCs, however, one person can attend a math class on NetEase and a history class on Online Class simultaneously, using two different computers. Another advantage is that one student can take the same course from two different teachers. By comparing the two, learners can gain more knowledge and more food for thought.

4.5 Improvement of Course A MOOC is an open online class, which means that its learners may be students or teachers. Different teachers may have different opinions on the same course. Through the platform provided by a MOOC, they can communicate with each other to discuss teaching methods and how to revise their courses, which greatly improves the quality of online courses. In addition, because online course resources are shared, people can upload their best course resources, which also helps MOOCs improve course quality. These obvious advantages are enough to attract the attention of educational circles. Therefore, the development of MOOCs has become an inevitable trend. The key lies in how MOOCs get along with traditional education.

5 Challenges and Developing Direction Recently, MOOCs have become a very hot topic at home and abroad. They have also shaken traditional education, although traditional education will not be replaced by MOOCs in the short term. In this section, we therefore discuss the challenges facing MOOCs and their likely direction of development.

5.1 Challenges of MOOC The biggest challenge for MOOCs comes from learners. The flexibility of MOOCs described in the section above is also a disadvantage from another perspective. When learners are too young or too old, their enthusiasm for studying may not be as strong as that of learners in traditional education, which leads to truancy. By contrast, the curriculum of traditional education is compulsory, which prevents truancy, and traditional education charges tuition, which also discourages it. With these factors in place, few students in traditional education play truant. In MOOCs, many people register but only a few persist to the end. Educational specialist Phil Hill divides MOOC learners into five categories: active learners, passive learners, temporary learners, spectators, and learners who give up halfway. The statistics show that learners who give up halfway account for 47%, the highest proportion; temporary learners account for 7%, the lowest; active learners account for only 21%; passive learners account for 11%; and the remaining 14% are spectators. In addition, only 15% of learners pass the final test at the end of the course [10], and most of them are active learners. Learners can enter and leave a MOOC classroom freely because there are no mandatory requirements. Therefore, learners who want to complete a MOOC and achieve good grades must be conscientious; those who are not are better suited to traditional education. Although some MOOCs try to improve students' learning efficiency by raising tuition, doing so may drive away some students and make enrollment difficult.

In addition, learners usually choose MOOCs that cooperate with well-known universities, so relatively small MOOCs may collapse under the competition. This kind of competition is nevertheless very important for the development and innovation of MOOCs. Besides competition, recruiting students is also a problem for small MOOCs, and solving it is equally important for their development and innovation. Moreover, because MOOCs are online and open by design, there is hardly any strict, unified standard for them, so the quality of MOOCs remains a difficult problem. On the one hand, most MOOCs are taught in English, so learners with poor English find the course content hard to understand, which reduces their enthusiasm. Students' enthusiasm also declines continuously as the quality of online courses declines over time. On the other hand, the enthusiasm and participation of online teachers decline as well, because they face a recording camera for long periods. After all, the communication established by MOOCs exists only on the network. In real life, online teachers and learners seldom meet, which makes it difficult for teachers to understand the characteristics of each student and to teach or guide students in a targeted way. In contrast, teachers can actually interact with students in traditional education, and this interaction is visible and important. Besides the distant relationship between online teachers and students, the massive open model may also cause many excellent students to be lost, because there are too many learners and too few teachers.

At present, most MOOC teachers also teach in traditional classrooms, which means they have to spend double the time or more to prepare and teach their courses. Repeated work is boring and easily leads to fatigue, which may affect teachers' normal lives. How to coordinate online and offline educational work is a challenge not only for educators but also for MOOCs. It is obvious that the development of MOOCs needs the support of educators.

In addition to the obvious challenges above, we can foresee some others. As MOOCs develop and improve, some aspects of traditional education are bound to be replaced by them. If MOOC certificates are recognized by society, some small universities will be eliminated quickly. People will be able to receive education from the best universities and obtain the relevant certificates by studying on a MOOC, avoiding the stress of the college entrance examination. Most people hope to attend better universities, and so do teachers. Famous teachers will become more and more sought after, while other teachers will gradually be eliminated. This serious polarization will eventually affect society and threaten the interests of some people and authorities, and long-term polarization will bring trouble to the development of MOOCs. It is possible that MOOCs will disappear in the future because of these challenges. However, these challenges are our own deductions; they represent the worst case and may not happen.

5.2 Developing Direction of MOOC Although many challenges and problems hinder the development of MOOCs, they remain the most promising educational mode and will guide the development of global education. With this in mind, we make some assumptions about the future of MOOCs.

In the long run, a good business model is the driving force behind the development of MOOCs. It is critical for MOOCs to optimize their operation and develop a reasonable business model. At present, many MOOCs that want to make a profit have explored this area, including professional recommendation, certification, advertising, copyright and credits. Professional recommendation and certification have become common revenue models. However, most learners still choose free MOOCs. Therefore, it is very important for MOOCs to find a sustainable business model as they develop.

MOOCs are well known for their large scale and openness. So far, the platforms have accumulated many learning resources and a large quantity of learning data [11]. With suitable algorithms and mathematical models, we can mine these data. Data mining can help us find potential learning problems and further challenges for MOOCs, and it is also very important for evaluating, analyzing and improving the quality of learning. This approach can gradually develop into personalized, customized learning programs that serve new business models. In addition, data mining reduces the burden on teachers: instead of interacting with every student, teachers can learn the characteristics of all the students from big-data reports, which helps them design a syllabus. However, these capabilities of MOOCs are not yet stable. Therefore, data mining and analysis will be a direction of development for MOOCs.

Online education has inherent advantages over the traditional education model, such as distance learning, large scale and free access. However, traditional education makes up for many deficiencies of MOOCs; for example, MOOCs do not offer face-to-face teaching and the related practical activities. Therefore, we should not set the two against each other but consider their relationship as complementary. In addition, in the publication titled "Strengthen Management and Application of Open Online Course Construction of Colleges and Universities", the department of education stated that colleges and universities can select a suitable platform to undertake the corresponding educational work. Traditional teachers should draw on the strong points of MOOCs and promote a combination of traditional and online education. For example, teachers can teach online while students organize face-to-face activities by themselves. Students can also preview classes on a MOOC at a suitable time, which helps them gain a better grasp of the material. It is very important for teachers to know the characteristics of all the students through online analysis of big data and offline interaction, so that they can customize a more detailed educational plan. Therefore, the development of MOOCs can be promoted by combining MOOCs with traditional education, and together they move the education model in a better direction.

6 Conclusion The informatization of education has become a worldwide trend. As a new online educational mode, MOOCs already have an important impact on traditional education. Although MOOCs are developing faster than traditional education, they still cannot replace it, which shows that the coexistence of MOOCs and traditional education is inevitable. As Prof. Christophe has said, what MOOCs bring to traditional education is not a threat but an extension. Therefore, we believe that a new educational mode will emerge which combines MOOCs and traditional education. This combination takes advantage of both Internet technology and traditional education to form a more complete education mode, which can offer full-scale service to all learners.

Acknowledgment: This work is supported by the Education Department of Jilin Province for Higher Education Reform (Grant No. BHSY007).

References
[1] PAPPANO L, The Year of the MOOC, The New York Times, 2(12), 2012.
[2] Xiaoxia Dong, Jianwei Li, Research of the MOOC's Operating Model, China Educational Technology, 330(7): 34-39, 2014.
[3] Yulei Zhang, Yan Dang, Beverly Amer, A Large-Scale Blended and Flipped Class: Class Design and Investigation of Factors Influencing Students' Intention to Learn, IEEE Transactions on Education, 99(3): 1-11, 2016.
[4] Liangtao Yang, Dilemma and Development Strategy of MOOC Localization, 2015 7th International Conference on Information Technology in Medicine and Education, 2015: 439-442.
[5] Xiaohong Su, Tiantian Wang, Jing Qiu, Lingling Zhao, Motivating students with new mechanisms of online assignments and examination to meet the MOOC challenges for programming, IEEE Frontiers in Education Conference, 2015: 1-6.
[6] Zhuyun Yang, Qi Zhen, SPOC: Integrating Innovation of Combining with University Education, Tsinghua University Press, 33(2): 9-12, 2014.
[7] Arjit Sachdeva, Prashast Kumar Singh, Amit Sharma, MOOCs: A comprehensive study to highlight its strengths and weaknesses, 2015 IEEE 3rd International Conference on MOOCs, 2015: 365-370.
[8] Xiaohong Su, Tiantian Wang, Jing Qiu, Lingling Zhao, Motivating students with new mechanisms of online assignments and examination to meet the MOOC challenges for programming, IEEE Frontiers in Education Conference, 2015: 1-6.
[9] MOOC panel - Future educational challenges in computer engineering education: Will MOOCs be a threat or an opportunity?, Field-Programmable Technology, 2014.
[10] Mi Fei, Dit-Yan Yeung, Temporal Models for Predicting Student Dropout in Massive Open Online Courses, IEEE International Conference on Data Mining Workshop, 11(4): 256-263, 2015.
[11] Yanyan Zheng, Beibei Yin, Big Data Analytics in MOOCs, IEEE International Conference on Computer and Information Technology, 2015: 681-686.

Lebi Jean Marc Dali*, Zhi-guang QIN

New LebiD2 for Cold-Start

Abstract: As the cold-start problem has become a serious bottleneck for Internet companies, solving it efficiently has become a priority. Previous techniques fail to address this problem adequately. In this paper, we present LebiD2, a hybrid trust-based technique that solves the cold-start problem efficiently. The key to LebiD2 is that it does not need the active user's history, the dependence on which is exactly the downfall of other recommenders. We explain our method in detail. Keywords: Cold Start, Recommenders, model-based RS, Trust based algorithm, social network

1 Introduction The boom in online business is largely due to the advent of recommender system (RS) technology. RS are very helpful to users in making purchase decisions online. As the name indicates, an RS recommends products that a user is likely to be interested in, mainly on the basis of the user's history with the company. Recommendation systems are mainly divided into two groups: content-based recommendation systems and Collaborative Filtering recommendation systems. Content-based methods use semantics to predict the items a user will be interested in. They use information about the user, such as interests, occupation, age and favorite authors, and information about the item, such as its title and topic. Collaborative Filtering (CF), on the other hand, recommends items to users solely on the basis of the rating matrix in the company database. The latest RS technique is the trust-based CF method, which makes recommendations based on the user's social network information. Our method LebiD2 belongs to this category. Nowadays almost everyone can be found on social networks, and by considering social network information we can predict the behavior of a user toward a particular item. This approach is very effective in addressing the cold-start problem, which refers to predicting the behavior of a new user with no history. Here we describe an improved RS technique, LebiD2, which applies the model-based methodology to the trust-based technique. It performs better at solving the cold-start problem than our former technique LebiD1 [1], which will be briefly described here. The rest of this paper is organized as follows. In section 2, we discuss related work in this area of study. In section 3, we explain our method LebiD2 in detail. In section 4, we evaluate LebiD2 against other well-known methods for solving the cold-start problem. Finally, we conclude by showing the advantages of our method.

*Corresponding author: Lebi Jean Marc Dali, Department of Computer Science and Engineering, University of Electronic Science and Technology of China
Zhi-guang QIN, Department of Computer Science and Engineering, University of Electronic Science and Technology of China

2 Related Work The use of recommendation systems has had a considerable impact on the income of online businesses. The companies with the best RS have witnessed a surge in their client base. The first really popular RS technique was the memory-based RS [2], of which the nearest-neighbor method is a famous example. The memory-based RS is popular because it is relatively simple to use and effective. It analyzes the entire rating matrix to find the closest neighbors of the active user based on the user's history, and then predicts the behavior of the active user from their data. The technique is called memory-based because it uses the rating matrix residing in memory to compute the prediction. However, memory-based CF does not perform well in real-world settings, especially when the rating matrix is very large. It has serious issues with sparsity, scalability and shilling. Sparsity refers to the situation in which the matrix contains many empty cells, scalability to the situation in which the number of users and items grows very large, and shilling to spurious data inserted into the matrix. The next important RS method is the model-based CF [2]. Unlike memory-based CF, where the prediction is computed directly from the rating matrix, here the rating matrix is first used to learn a model, and the model is then used to perform the required predictions. One important characteristic is that once the model is learned, the dataset is no longer needed, whereas in memory-based CF the rating matrix is accessed whenever a prediction is needed. Model-based CF also makes use of matrix decomposition and matrix factorization to compute the model parameters, and matrix factorization is a process for which there is no guarantee of success. The newest CF method to appear in the RS universe is the trust-based system, which combines the company's rating matrix with social network information for its prediction. The TidalTrust recommendation system [3] is one such method. It performs a modified breadth-first search in the trust network and computes the trust value based on all the raters at the shortest distance from the target user. The trust between users u and v is given by:

$$t_{u,v} = \frac{\sum_{w \in N} t_{u,w} \, t_{w,v}}{\sum_{w \in N} t_{u,w}} \qquad (1)$$

where N denotes the set of the neighbors of u.
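
Formula (1) is a trust-weighted average, which can be sketched in a few lines of Python. The function name `inferred_trust` and the dictionary layout are assumptions made for this illustration and are not taken from the TidalTrust implementation; the sketch also assumes that neighbours with no stated trust in v are simply left out of both sums.

```python
def inferred_trust(trust_u, trust_to_v):
    """Formula (1): trust of u in v as a weighted average over u's neighbours.

    trust_u:    {w: t_uw}, direct trust of u in each neighbour w (hypothetical layout)
    trust_to_v: {w: t_wv}, trust of each neighbour w in the target user v
    """
    # Assumption: neighbours without an opinion about v do not contribute.
    common = [w for w in trust_u if w in trust_to_v]
    denominator = sum(trust_u[w] for w in common)
    if denominator == 0:
        return None  # no usable path from u to v
    numerator = sum(trust_u[w] * trust_to_v[w] for w in common)
    return numerator / denominator


# Toy values chosen for illustration only.
example = inferred_trust({"w1": 0.8, "w2": 0.2}, {"w1": 1.0, "w2": 0.5})
print(example)  # close to 0.9
```

A neighbour that u trusts more pulls the result toward its own opinion of v, which is why the highly trusted `w1` dominates in the toy example.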

Also the trust depends on all the connecting paths. The formula used for the prediction is:

$$r_{u,i} = \frac{\sum_{v \in \text{raters}} t_{u,v} \, r_{v,i}}{\sum_{v \in \text{raters}} t_{u,v}} \qquad (2)$$

where $r_{v,i}$ denotes the rating of user v for item i. This technique addresses the cold-start problem to some extent, and we will test our method against it. The next method in this category is the MoleTrust technique [4]. In operation it is very similar to TidalTrust, but the two techniques differ in how they select trusted users. TidalTrust uses only users at the shortest distance from the active user, while MoleTrust goes beyond that, up to a maximum depth d. The difficulty lies in the selection of d: if it is too high, the accuracy will be good but the processing time will also be high. Nevertheless, MoleTrust is reported to perform better than TidalTrust. MoleTrust will also be evaluated against our technique. We also have in this category the TrustWalker method [5], another trust-based RS. It behaves almost like MoleTrust; the difference is that it uses near friends who have rated similar items instead of far neighbors. The similarity between items i and j is given by:

$$sim(i,j) = \frac{corr(i,j)}{1 + e^{-\frac{|corr(i,j)|}{2}}} \qquad (3)$$

Finally, to close this section, we look at LebiD1, a special type of trust-based technique that combines the memory-based technique with social networking. When used with the QQ social network and the MovieLens database [6], it shows better performance than the previous methods. The LebiD1 prediction formula is given as:

$$r_{a,i} = \frac{\sum_{u \in F} r_{u,i}}{\mathrm{card}(F)}, \quad i \in I \qquad (4)$$

where F is the set of the closest social network friends of the new active user a who have rated item i, I is the set of all the items rated by the users in F, $r_{a,i}$ is the predicted rating of the new user a for item i, $r_{u,i}$ is the rating given by friend u to item i, and card(F) denotes the number of such friends.
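
The prediction formulas (2) and (4) differ only in their weights: (2) weights each rater by trust, while (4) takes a plain mean over the friends in F. The following minimal Python sketch shows both, together with the similarity of formula (3) as printed. All function names and the dictionary layout are assumptions made for this illustration, not the authors' code.

```python
import math


def predict_weighted(trust, ratings):
    """Formula (2): trust-weighted average of the raters' ratings for one item."""
    # Assumption: only raters with positive trust contribute.
    raters = [v for v in ratings if trust.get(v, 0.0) > 0.0]
    denominator = sum(trust[v] for v in raters)
    if denominator == 0:
        return None  # no trusted rater, so no prediction
    return sum(trust[v] * ratings[v] for v in raters) / denominator


def item_similarity(corr):
    """Formula (3) as printed: corr(i, j) divided by 1 + exp(-|corr(i, j)| / 2)."""
    return corr / (1.0 + math.exp(-abs(corr) / 2.0))


def predict_lebid1(friend_ratings):
    """Formula (4): mean rating of item i over the friends in F who rated it."""
    if not friend_ratings:
        return None  # F is empty, so no prediction
    return sum(friend_ratings.values()) / len(friend_ratings)


# Toy ratings for a single item, chosen for illustration only.
print(predict_weighted({"a": 1.0, "b": 3.0}, {"a": 2.0, "b": 4.0}))
print(predict_lebid1({"f1": 4, "f2": 5, "f3": 3}))
```

Because formula (4) needs only the friends' ratings and not the history of the active user, it can produce a prediction for a user who has never rated anything, which is the cold-start case.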

3 LebiD2 The cold-start problem is very troubling, especially for the e-commerce community, and the benefits of solving it are enormous. Our method LebiD2 addresses the cold-start problem effectively. We already addressed this problem in our previous method LebiD1 [1], in which we combined memory-based, trust-based CF with the QQ social network and the MovieLens database [6]. LebiD2 uses the same data, but it combines model-based CF with the trust-based technique instead of the memory-based approach of LebiD1 [1]. LebiD2 performs better than LebiD1, as will be shown in the next section. We use the same working conditions as for LebiD1: with the MovieLens [6] dataset, our task is to predict whether a particular user will like a particular movie. In the case of the LinkedIn dataset, we are interested in predicting the skill set of an employee or which companies will be interested in a particular employee. To sum up, our task here is to predict which movies a new user will like, or what ratings the user will give to particular movies, when the user has no history in our database. We can also sort the movies by computed rating, from the most liked (highest rating) to the least liked (lowest rating). In LebiD1, we used the rating matrix directly to compute the predictions. Here, we first learn the model using:

$$W = (X^{T} X)^{-1} X^{T} t \qquad (5)$$

where W denotes the required model parameters, which are computed using the LebiD2 decomposition, X denotes the user dataset and t denotes the movie database. LebiD2 is a congruent-like decomposition, except that it is more stable and less time consuming. The algorithm used in the LebiD2 preprocessing phase is given below, where N represents the number of friends on the social network who have rated item i:

For k = 1 to n
    For i = k to m
        S(k) = S(k)^2 + A(i, k)
        A(i, k) = A(i, k) / S(k)
    For j = k+1 to n
        For i = k to m
            A(i, j) = (A(i, j) + A(i, k)^2) / A(k, k)
            S(i) = A(i, i)
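
Equation (5) has the form of the ordinary least-squares normal equation. As a minimal sketch, the Python code below solves it for a single target column t by Gaussian elimination instead of forming the inverse explicitly. It does not reproduce the LebiD2 decomposition; the function name `normal_equation` and the toy data are assumptions made for this illustration.

```python
def normal_equation(X, t):
    """Return W with (X^T X) W = X^T t, i.e. equation (5) for one target column."""
    n = len(X[0])
    # Build A = X^T X (n x n) and b = X^T t (length n).
    A = [[sum(row[p] * row[q] for row in X) for q in range(n)] for p in range(n)]
    b = [sum(row[p] * ti for row, ti in zip(X, t)) for p in range(n)]
    # Gaussian elimination with partial pivoting on the system A W = b.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular, so W is not uniquely determined")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution on the upper-triangular system.
    W = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(A[r][c] * W[c] for c in range(r + 1, n))
        W[r] = (b[r] - known) / A[r][r]
    return W


# Hypothetical toy data: a constant column plus one feature, with t = 1 + 2 * feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [1.0, 3.0, 5.0, 7.0]
W = normal_equation(X, t)
print(W)  # close to [1.0, 2.0]
```

Once W is known, a prediction for a new row x is the dot product of x and W, so the training data is no longer needed at prediction time, which matches the model-based property described in section 2.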

Comparing LebiD2 with the MATLAB SVD, we obtain Figure 1, in which LebiD2 is shown in green. LebiD2 behaves very stably compared with the MATLAB SVD. Hence LebiD2 should perform better than the MATLAB SVD in a real-time environment: while being more stable, it requires less processing time, which is the prime concern of real applications. We should point out that the MATLAB SVD is more complex and hence more accurate in its operation than our technique. But users hate to wait for a web page to open, and if the page takes too long they simply go to a competitor, so we lose the client and the revenue. Low time complexity with acceptable accuracy is therefore very important, and it is the goal of LebiD2. Matrix factorization is the operation that consumes most of the time. So in LebiD2, instead of computing the model parameters directly from the entire rating matrix, we first identify the main eigenvalues and use them to derive the model, which is much faster than operating directly on the large database. An approximation of the model is sufficient here, so we do not need all the complex operations required by the MATLAB SVD. Faster processing increases our ability to retain online customers and to attract more through word of mouth. As we will see in the next section, the results in addressing the cold-start problem are very good.
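
The text does not say how the main eigenvalues are identified. As a stand-in, the sketch below uses power iteration, a standard way to approximate the dominant eigenvalue of a symmetric matrix such as $X^{T}X$ without a full decomposition. It is not claimed to be the LebiD2 procedure; the function name `dominant_eigenpair` and the example matrix are assumptions.

```python
def dominant_eigenpair(A, iterations=200):
    """Approximate the largest-magnitude eigenvalue of a symmetric matrix A
    and a unit-length eigenvector for it, using power iteration."""
    n = len(A)
    v = [1.0] * n  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            raise ValueError("the starting vector lies in the null space of A")
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v (v has unit length).
    Av = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
    eigenvalue = sum(v[r] * Av[r] for r in range(n))
    return eigenvalue, v


# Hypothetical 2 x 2 example of X^T X; its eigenvalues are 9 + sqrt(61) and 9 - sqrt(61).
A = [[4.0, 6.0], [6.0, 14.0]]
lam, vec = dominant_eigenpair(A)
print(lam)
```

Each iteration costs one matrix-vector product, so a few dominant eigenvalues can be estimated far more cheaply than a full SVD when only an approximate model is needed.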

Figure 1. Diagram comparing LebiD2 (green) and MATLAB SVD (blue)

4 Experimental Evaluation Our technique LebiD2 has been tested against the methods discussed above: MoleTrust, TidalTrust, TrustWalker and our former technique LebiD1. We used RMSE [2] as the evaluation metric. The results are given in Table 1.

Table 1. RMSE of LebiD2 compared with previous trust-based methods

Method        RMSE
MoleTrust     1.400
TidalTrust    1.200
TrustWalker   1.180
LebiD1        0.800
LebiD2        0.200

The diagrammatic representation is given in Figure 2.

Figure 2. Bar chart comparing LebiD2 with previous trust-based methods

It can be observed that our model-based and trust-based approach LebiD2 (the last bar in the figure) has a lower error than the other methods.

RMSE (Root Mean Squared Error) The RMSE is a metric used to test the accuracy of a recommender technique. Its formula is given by:

$$RMSE = \sqrt{\frac{1}{n} \sum_{\{i,j\}} \left(P_{i,j} - r_{i,j}\right)^{2}} \qquad (6)$$

where n represents the total number of ratings over all users, $P_{i,j}$ denotes the predicted rating for user i on item j, and $r_{i,j}$ stands for the actual rating. Because the errors are squared before they are averaged, RMSE amplifies the contribution of large errors between the predictions and the true values.
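
The RMSE of formula (6) can be computed directly from paired lists of predicted and actual ratings. The sketch below is a minimal Python version; the function name `rmse` and the toy ratings are assumptions made for this illustration and are unrelated to the values in Table 1.

```python
import math


def rmse(predicted, actual):
    """Formula (6): square root of the mean squared difference between
    predicted ratings P and actual ratings r."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy ratings: the errors are -1, 0 and 2, so the mean squared error is 5 / 3.
print(rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0]))
```

In the toy example the single error of 2 contributes four times as much to the sum as the error of 1, which illustrates how squaring emphasizes large mistakes.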

5 Conclusion To sum up, LebiD2 is an important step in addressing the cold-start problem. Bringing together the model-based approach and the trust-based technique has proven to be very effective in solving it, and the accuracy is improved, as the RMSE test shows. However, LebiD2 is time consuming, since it is based on the model-based architecture. We encourage researchers to focus on the relationship between the user and the social network [7] in order to design better recommenders that solve the cold-start problem efficiently (ideally with a 0% error rate) within an acceptable time delay. Social networks are the next big thing in the world of recommenders.

References
[1] L. J. M. Dali and Q. ZhiGuang, "Cold Start Mastered: LebiD1", International Conferences on Computational Science and Engineering (CSE2014), Chengdu, China, Dec 2014.
[2] X. Su and T. M. Khoshgoftaar, "A Survey of Collaborative Filtering Techniques", Advances in Artificial Intelligence, 2009:4:2-4:2, January 2009.
[3] J. Golbeck, "Computing and Applying Trust in Web-based Social Networks", PhD thesis, University of Maryland College Park, 2005.
[4] P. Massa, P. Avesani, "Trust-aware recommender systems", RecSys 2007.
[5] M. Jamali, M. Ester, "TrustWalker: A Random Walk Model for Combining Trust-based and Item-based Recommendation", KDD 2009.
[6] MovieLens data, http://www.grouplens.org/
[7] A. Sharma, D. Cosley, "Do social explanations work? Studying and modeling the effects of social explanations in recommender systems", WWW 2013.

Li-jun ZHANG*, Fei YU, Qing-bing JI

An Efficient Recovery Method of Encrypted Word Document

Abstract: The recovery of encrypted Word documents is of great practical importance, both for decrypting a user’s document whose password has been forgotten and for evidence acquisition in forensic investigations. In this paper, we study the file structure, the encryption principle and the decryption key derivation of a Word document, and then present an efficient method of decrypting this kind of file. In a practical test, our method recovers the original plaintext document rapidly (within an average time of 1.5 minutes), which almost meets the practical requirement of real-time decryption of Word documents.

Keywords: Word document; rainbow table attack; rapid decryption; data forensics

1 Introduction

Microsoft Word, part of the Microsoft Office suite, is one of the most widely used word processors, and protecting document content and privacy has become a basic demand of its users. Word uses encryption to control access to a document: a person can open and edit the content only after entering the correct password. This mechanism provides the necessary security guarantee for user data. However, with the extensive use of passwords in a variety of encryption applications, forgotten passwords are common. Once the password of an important Word document is forgotten, no one can open the document to view its content, which usually causes a great loss to the document owner. On the other hand, encrypted Word documents also make criminal investigations in civil national security departments more difficult. It is therefore of practical significance to study the recovery of encrypted Word documents.

The earliest method of cracking an encrypted Word document exploited a security vulnerability: in 2004 an anonymous researcher [1] presented an approach that modifies a document’s encryption protection in order to obtain access. Later, in 2005, Wu [2] pointed out the improper use of the core RC4 encryption algorithm in Word documents, i.e., it uses the same encryption

*Corresponding author: Li-jun ZHANG, Science and Technology on Communication Security Laboratory, Chengdu, 610041, China, E-mail: [email protected] Fei YU, Qing-bing JI, Science and Technology on Communication Security Laboratory, Chengdu, 610041, China


key stream for different versions of a Word file. This implementation makes it possible to decrypt the content using the much weaker exclusive-or operation [3]. In practice, however, it is quite difficult to decrypt document content by exploiting these vulnerabilities, since the attack needs different versions of the same file, a requirement that is usually extremely hard to satisfy. Existing tools for decrypting a Word file therefore rely on an exhaustive search for the correct password. Two representative tools are Advanced Office Password Recovery [4] released by Elcomsoft and Password Recovery Kit [5] from Passware. This brute-force approach is only effective for short passwords; it cannot recover slightly longer passwords within an acceptable time, because the space of candidate passwords becomes very large and the search becomes too time-consuming. It is therefore valuable to design a decryption approach that is independent of password length. Chen [6] proposed an algorithm based on a time-space tradeoff, but their result is limited to finding the internal encryption key and does not give the final recovery of the plaintext document.

In this paper, we first give a detailed analysis of the encryption principle and storage structure of a Word file, and then propose a decryption method for this kind of document based on the rainbow table attack technique. The advantage of this approach is that decryption completes within a bounded time. In an actual test on a computer with an i5 dual-core CPU and 4 GB of memory, it recovers the plaintext of an encrypted document in less than 2 minutes with a success rate exceeding 95%. Our result is useful for data forensics and forgotten-password retrieval.

2 The Analysis of File Structure and Encryption Principle

A Word file uses a special file format called the Microsoft compound document, a complex structured storage that contains a variety of metadata in different formats such as text, image, audio and video. We determined the explicit format of a Word document by analyzing the open source code of the “OpenOffice” software.

2.1 The Storage Structure of Word File

Logically, a compound document is a kind of file system composed of storages and streams, where storages are similar to directories and streams are similar to files in the Windows operating system. The root storage is thus equivalent to the root directory of the file system. Moreover, every stream is divided into several smaller data blocks called sectors, which hold the actual data. The logical storage structure is shown in Figure 1.


[Figure: tree diagram with the nodes root storage, storage 1, stream 11, storage 2, stream 1, stream 21, stream 22]

Figure 1. The logical storage structure of Word file

Specifically, a typical Word document with images has the following five kinds of streams. The “Data” stream stores the image data and exists only when the file contains images. The “1Table” stream stores the content of data tables, and “CompObj” stores the common object. “WordDocument” stores the text data, i.e., the actual text and format information, while “Summary Information” stores summary information about the whole document.

Physically, an entire Word file consists of a file header followed by sectors. All sectors have the same size, which is specified in the header. A sector’s index, starting from 0, is called its sector identifier (SID for short). The sectors belonging to one stream need not be stored in order, and the array of SIDs of a stream is called its sector chain (SID chain for short). The physical storage structure of a Word document is shown in Figure 2.

[Figure: file layout with a header followed by sector 1, sector 2, …, sector n, which hold the data blocks of all streams; example sector chain ID: 2, 4, 5, 3]

Figure 2. The physical storage structure of Word file

2.2 The Encryption Principle of Word File

Encryption Algorithm. Word documents of versions 97 to 2003 use the RC4 algorithm, which supports 40-bit, 64-bit and 128-bit encryption keys. RC4 is a byte-oriented stream cipher. It adopts an encryption key


of variable length to derive the initial state and then generates a pseudorandom key stream to produce the ciphertext. To maintain compatibility, the default key length is 40 bits, which is too weak against current computing power [7].

Encryption Process. The 40-bit RC4 encryption key of a Word document is generated from the user’s password and a salt value by computing several rounds of MD5 hashing. This process, called key derivation, is implemented by the function KDF(salt, password). After the encryption key is obtained, it is concatenated with the block number “bnum”, which is determined by the physical position of the data block in the Word document. This concatenated value is fed to the MD5 algorithm to derive the 16-byte initialization vector of the RC4 algorithm. Finally, RC4 produces the key stream, which is XORed with every data block to output the corresponding ciphertext. The entire encryption process is shown in Figure 3.

[Figure: flow diagram: Original Word File; salt, password → KDF(salt,password) → 40-bit key → MD5(key||bnum) → 128 bit enckey → RC4(enckey,block) → Encrypted Word File]

Figure 3. The encryption process of Word file
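The per-block step of this process can be sketched as follows. This is a minimal illustration, not the exact Word implementation: the KDF itself is not reproduced, the 40-bit key below is a hypothetical value, and encoding bnum as a 4-byte little-endian integer is an assumption. The function names rc4, block_key and crypt_block are chosen here for illustration.

```python
import hashlib

def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: key scheduling, then XOR the data with the key stream."""
    s = list(range(256))
    j = 0
    for i in range(256):                      # key-scheduling algorithm (KSA)
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:                         # pseudo-random generation (PRGA)
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

def block_key(key40: bytes, bnum: int) -> bytes:
    """enckey = MD5(key || bnum); the 4-byte little-endian bnum is an assumption."""
    if len(key40) != 5:
        raise ValueError("expected a 5-byte (40-bit) key")
    return hashlib.md5(key40 + bnum.to_bytes(4, "little")).digest()

def crypt_block(key40: bytes, bnum: int, block: bytes) -> bytes:
    """Encrypt or decrypt one data block (RC4 is its own inverse)."""
    return rc4(block_key(key40, bnum), block)

key40 = bytes.fromhex("0102030405")           # hypothetical output of KDF(salt, password)
cipher = crypt_block(key40, 1, b"hello world")
print(crypt_block(key40, 1, cipher))          # round trip back to the plaintext
```

Because only the 5-byte key and the block number enter the MD5 step, anyone who knows the 40-bit key can decrypt every block without knowing the password.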

Vulnerability of Encryption Mechanism. Existing Word document recovery methods try to find the user’s password. However, the password search space grows exponentially with the password length, so the cracking time is unacceptable when the document has a long and complicated password, for example a password of more than 8 characters that contains digits, uppercase and lowercase letters, and special characters. In this case, recovering the Word file by password cracking is almost impossible. But the encryption process shows that the security strength is


totally determined by the 5 bytes of data from which the initialization vector of the RC4 algorithm is derived. We can therefore attempt to recover this 40-bit key directly, which has the remarkable advantage of being independent of password length.
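A rough comparison makes the point concrete. The sketch below assumes passwords drawn from the 95 printable ASCII characters; that alphabet size is an assumption for illustration and is not stated in the paper.

```python
# Assumption: passwords are drawn from the 95 printable ASCII characters.
ALPHABET = 95
KEY_SPACE = 2 ** 40                 # all possible 40-bit keys

for length in (6, 8, 9, 12):
    passwords = ALPHABET ** length  # candidate passwords of exactly this length
    print(f"length {length:2d}: {passwords:.2e} passwords, "
          f"{passwords / KEY_SPACE:.2e} times the key space")
```

Under this assumption the password space grows with every extra character, while the 40-bit key space stays fixed at the same size for every document.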

3 The Basics of Rainbow Table Attack

3.1 The Principle of Rainbow Table Attack

A rainbow table attack is a kind of time-memory tradeoff algorithm [8]. It uses data stored during an offline precomputation phase to reduce the analysis time of the online attack phase. For an encryption algorithm E and a known plaintext $P_0$, the goal of the attack is, given the ciphertext $C_0$, to obtain the encryption key k satisfying $E_k(P_0) = C_0$. The time-memory tradeoff attack is divided into two phases:

(1) Precomputation Phase. Select m starting points $S_1, S_2, \ldots, S_m$ from the key space K and define a reduction function $R: C \to K$ that maps the ciphertext space C to the key space K. Let $f(k) = R(E_k(P_0))$; we apply this function repeatedly to compute m chains, one from each starting point $S_i$.

…

In fact, to improve the coverage of the key space when these chains are generated, a different reduction function R is used at every column of the chains, and so the function f also differs from column to column. If the function in every column is marked with a different color, the chains look like a rainbow; hence they are named rainbow chains. After the m chains have been computed, only the pairs of starting and end points $(S_i, E_i)$ are stored in the table.

…

(2) Online Attack Phase. Given the target ciphertext $C_0$, we first apply the reduction function R to $C_0$ to obtain a value $Y_1$, and then apply the function f iteratively from $Y_1$ until the result matches some end point $E_j$. Now we get a computation chain: … The matched chain is then rebuilt from the starting point $S_j$ until we find the desired key. In practice, the correct key $k = X_{j,t-s} = f^{t-s-1}(S_j)$ may not exist in the matched chain.


This happens when the chain generated from $Y_1$ coincides with a chain in the table, but the matched rainbow chain does not contain the correct key. This case is called a false alarm [9]. A rainbow table attack can be used to invert one-way functions such as hash functions and encryption functions. In practice, it is mainly applied to recovering the plaintext of hash values and to cracking unsalted encrypted passwords.
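The two phases can be sketched on a toy scale as follows. This is an illustrative model only: the 16-bit key space, the chain length, the number of chains, the choice of starting points and the MD5-based stand-in for E are all assumptions made so that the example runs in well under a second.

```python
import hashlib

KEY_BITS = 16            # toy key space of 2**16 keys (assumption)
N = 1 << KEY_BITS
T = 64                   # chain length t
M = 2048                 # number of chains m

def E(k: int) -> int:
    """Stand-in one-way function: a 64-bit value derived from the key."""
    digest = hashlib.md5(k.to_bytes(4, "little")).digest()
    return int.from_bytes(digest[:8], "little")

def R(x: int, col: int) -> int:
    """Column-dependent reduction from the output space back to the key space."""
    return (x + col) % N

def f(k: int, col: int) -> int:
    return R(E(k), col)

def build_table() -> dict:
    """Precomputation: store only (end point -> starting point) pairs."""
    table = {}
    for start in range(M):               # starting points 0..M-1 (assumption)
        k = start
        for col in range(T):
            k = f(k, col)
        table.setdefault(k, start)       # keep one chain per end point
    return table

def lookup(table: dict, c: int):
    """Online phase: return a key k with E(k) == c, or None if not covered."""
    for col in range(T - 1, -1, -1):     # guess the column that holds the key
        y = R(c, col)
        for j in range(col + 1, T):
            y = f(y, j)
        if y in table:
            k = table[y]                 # rebuild the chain from its start
            for j in range(col):
                k = f(k, j)
            if E(k) == c:
                return k                 # otherwise this match was a false alarm
    return None

table = build_table()
key = 0
for col in range(10):                    # a key known to lie on the first chain
    key = f(key, col)
print(lookup(table, E(key)) == key)
```

Keeping a single starting point per end point mirrors the idea of a table in which no two end points coincide, and the final check E(k) == c is what filters out false alarms.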

3.2 Rainbow Table Attack for Word document

The rainbow table used in a practical attack is almost always a perfect table [10], meaning that no two end points in the table are the same. Several parameters are important when constructing the tables: the number of rainbow tables n, the number of chains m per table, and the chain length t. Given the success rate p, the storage space M and the key space N, these parameters can be configured optimally by the following formulas: $n = -\ln(1-p)/2$, $m = M/n$, $t = -(N/M)\ln(1-p)$.

Constructing and searching rainbow tables involves two essential functions, namely the keyed encryption function E and the reduction function R. For the concrete case of Word file recovery, we first define these two functions. We denote by k the 40-bit decryption key, which determines the RC4 initialization vector. According to the Word encryption process and the characteristics of the RC4 stream cipher, the pseudorandom key stream ks is generated from k. Moreover, we find that the 8 consecutive plaintext bytes at offset 0x400 of a plaintext Word file are fixed to 0x00, and their block number is exactly 0x01. Since the ciphertext is c = p xor ks and all these plaintext bytes are 0x00, the ciphertext c equals the key stream ks. We therefore extract the 8 consecutive bytes at offset 0x400 as ks and establish the target one-way function $E_k(p) = c$, which maps the 40-bit key to the 64-bit pseudorandom key stream ks. The reduction function is designed as $R_i(x) = (x + t_i) \bmod 2^{40}$, where x is a 64-bit number and $t_i$ is the column position in a rainbow table chain.

In Word document recovery, the key space is $N = 2^{40}$. If the desired success rate p is 99%, we can calculate the relevant parameters by the preceding formulas for constructing perfect rainbow tables. Concretely, we need to generate n = 4 tables, each containing m = 54,000,000 chains with chain length t = 36000. The total storage space M of the 4 rainbow tables is about 3.2 GB.
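The two functions defined above might be written as follows. This is a sketch under stated assumptions: the byte order used to encode the key and the block number, and the choice of the column index itself as $t_i$, are assumptions; the function names are illustrative. The last lines check the storage figure under the further assumption that each chain is stored as two 8-byte values.

```python
import hashlib

KEY_BITS = 40

def rc4_keystream(key: bytes, n: int) -> bytes:
    """First n bytes of the RC4 key stream for the given key."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)

def E(k: int, bnum: int = 1) -> int:
    """Map a 40-bit key to the 64-bit key stream ks observed at offset 0x400."""
    key40 = k.to_bytes(5, "little")                    # byte order is an assumption
    enckey = hashlib.md5(key40 + bnum.to_bytes(4, "little")).digest()
    return int.from_bytes(rc4_keystream(enckey, 8), "little")

def R(x: int, column: int) -> int:
    """Reduction R_i(x) = (x + t_i) mod 2**40, taking t_i as the column index."""
    return (x + column) % (1 << KEY_BITS)

def f(k: int, column: int) -> int:
    """One chain step: key -> key stream -> next key."""
    return R(E(k), column)

# Storage estimate, assuming each chain is stored as two 8-byte values.
chains = 4 * 54_000_000
print(chains * 16 / 2 ** 30)                           # size in GiB under that assumption
```

Under the 16-bytes-per-chain assumption the estimate comes out close to the 3.2 GB quoted in the text.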


4 Word Document Decryption

4.1 Decryption Key Acquisition

To decrypt a Word document, we must first acquire the decryption key k. We read the 8 bytes of ciphertext at offset 0x400 of the Word file and use them as the input of the rainbow table attack. By searching the 4 rainbow tables, we can recover the correct 40-bit decryption key.

4.2 Recover the Word Document by Decryption Key

After obtaining the correct decryption key, we decrypt the encrypted part of the Word document and reconstruct the original plaintext file. By studying the file structure, we found that not all data blocks in the document are encrypted. Only the 1Table, Data and WordDocument streams, which hold text, images and so on, are encrypted. Therefore, to decrypt a Word document, we first list the sector numbers of the ciphertext data blocks according to the file structure, then extract and decrypt the corresponding data, and leave the unencrypted parts unchanged so that the whole document is recovered.

The sector numbers and sector chains can be obtained from the file header, which contains important parameters such as the file version, the sector size, the total number of sectors and the starting offset of every sector. This compound document header is located at the very beginning of the file and is 512 bytes long. The header parameters needed for decryption are shown in Table 1.

Table 1. The parameters for decryption in file header

Offset (bytes)  Size (bytes)  Description
0               8             fixed document file identifier (in hex): D0…E1
28              2             byte order identifier
30              2             size of a sector in power of 2
32              2             size of a short sector in power of 2
44              4             total sector number for sector allocation table
48              4             SID of first sector of directory stream
60              4             SID of first sector of short sector allocation table
64              4             total sector number for short sector allocation table
68              4             SID of first sector of master sector allocation table
72              4             total sector number for master sector allocation table
76              436           first part of master sector allocation table (109 SIDs)
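Reading these header fields could look like the sketch below. It assumes little-endian fields and signed 32-bit SIDs, neither of which is stated in Table 1; the function name parse_header and the dictionary keys are illustrative. Only the first and last bytes of the file identifier are checked, because only those two are given in the table.

```python
import struct

def parse_header(header: bytes) -> dict:
    """Extract the Table 1 fields from a 512-byte compound document header."""
    if len(header) < 512:
        raise ValueError("header must be at least 512 bytes")
    # Table 1 gives the identifier only as D0...E1, so check just those two bytes.
    if header[0] != 0xD0 or header[7] != 0xE1:
        raise ValueError("not a compound document")
    # Assumption: little-endian fields; SIDs read as signed 32-bit integers.
    sec_pow, short_pow = struct.unpack_from("<HH", header, 30)
    sat_count, dir_sid = struct.unpack_from("<Ii", header, 44)
    ssat_sid, ssat_count, msat_sid, msat_count = struct.unpack_from("<iIiI", header, 60)
    msat_head = struct.unpack_from("<109i", header, 76)
    return {
        "byte_order": header[28:30],
        "sector_size": 1 << sec_pow,
        "short_sector_size": 1 << short_pow,
        "sat_sector_count": sat_count,
        "dir_sid": dir_sid,
        "ssat_sid": ssat_sid,
        "ssat_sector_count": ssat_count,
        "msat_sid": msat_sid,
        "msat_sector_count": msat_count,
        "msat_head": list(msat_head),
    }

# Synthetic header for illustration only (not taken from a real document).
demo = bytearray(512)
demo[0], demo[7] = 0xD0, 0xE1
struct.pack_into("<HH", demo, 30, 9, 6)
struct.pack_into("<Ii", demo, 44, 1, 3)
info = parse_header(bytes(demo))
print(info["sector_size"], info["dir_sid"])
```

The offsets and sizes in the unpack calls follow Table 1 directly; the last field ends at byte 76 + 436 = 512, the end of the header.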


The decryption process of a Word document is as follows.
(1) Restore the sector allocation table (SAT) according to the master sector allocation table (MSAT).
(2) Read the starting sector number of every encrypted directory stream from the file header of the document, and denote it by DirSid.
(3) Starting from the starting sector number (SID) of every encrypted directory stream, recover the sector chain of every directory’s data. We denote it by CSID; the process is described in Figure 4.
(4) Extract every data block according to the sector chain CSID.

[Figure: flow diagram: read DirSid from the header, then parse the SAT to obtain CSID]

Figure 4. Recover sector chain of every directory

(5) Derive the decryption key of the RC4 stream cipher from the 40-bit key and decrypt every encrypted data block.
(6) Modify the encryption identifier in the file header and add the unencrypted parts to reconstruct the plaintext Word document. This process is shown in Figure 5.

[Figure: flow diagram: Encrypted Word File → read by CSID → RC4(deckey,block) decryption → Original Word File]

Figure 5. Reconstruct the plaintext Word document
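Steps (3) and (4) of the process above amount to following links in the SAT. A minimal sketch is given below, assuming that the SAT is available as a list in which entry i holds the SID of the sector following sector i, that the value -2 terminates a chain, and that sectors start directly after the 512-byte header. These conventions and the function names are assumptions for illustration.

```python
END_OF_CHAIN = -2    # assumption: SAT entry that terminates a chain
HEADER_SIZE = 512    # header length stated in the text

def sector_chain(sat, start_sid):
    """Follow SAT links from start_sid and return the sector chain (CSID)."""
    chain, seen = [], set()
    sid = start_sid
    while sid != END_OF_CHAIN:
        if sid < 0 or sid >= len(sat) or sid in seen:
            raise ValueError(f"corrupt sector chain at SID {sid}")
        seen.add(sid)
        chain.append(sid)
        sid = sat[sid]
    return chain

def read_stream(data: bytes, sat, start_sid, sector_size):
    """Concatenate the sectors of one stream in chain order."""
    parts = []
    for sid in sector_chain(sat, start_sid):
        offset = HEADER_SIZE + sid * sector_size   # assumption: sectors follow the header
        parts.append(data[offset:offset + sector_size])
    return b"".join(parts)

# Example SAT reproducing the chain 2, 4, 5, 3 shown in Figure 2.
sat = [-1, -1, 4, END_OF_CHAIN, 5, 3]
print(sector_chain(sat, 2))
```

The `seen` set guards against a damaged SAT that loops back on itself, which would otherwise make the extraction run forever.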

4.3 The Comparison of Attack Efficiency

In a general attack that searches a fully precomputed table, every key needs 5 bytes, so the whole $2^{40}$ key space takes at least 5 TB of storage, which is beyond an ordinary computer. For the brute-force attack method [9], an exhaustive search of the 40-bit key takes 40 days on average. For our rainbow table attack, we constructed the 4 rainbow tables specified above and implemented the key recovery algorithm in an experimental environment with a 3.2 GHz Intel i5 dual-core CPU, 4 GB of memory and the Windows XP operating system. We tested 100 encrypted Word document samples and successfully recovered 96


documents, each in less than 2 minutes. This recovery method for encrypted Word files is therefore very efficient in practice.
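The figures in this comparison can be checked with a few lines of arithmetic. The sketch assumes that an average brute-force search covers half of the key space; that assumption is ours and is not stated in the paper.

```python
KEY_SPACE = 2 ** 40

# Full lookup table: 5 bytes per key, as stated in the text.
full_table_bytes = 5 * KEY_SPACE
print(full_table_bytes / 2 ** 40, "TiB for a full table")

# Key-testing rate implied by a 40-day average brute-force search,
# assuming the average search covers half of the key space.
seconds = 40 * 24 * 3600
print(round(KEY_SPACE / 2 / seconds), "keys per second")

# Success rate observed in the experiment: 96 of 100 samples.
print(96 / 100)
```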

5 Conclusion

This paper proposed an efficient plaintext recovery method for encrypted Word documents. The method uses the rainbow table attack technique to find the correct 40-bit decryption key and then reconstructs the original plaintext file according to the encryption principle and file structure of the Word document. In a practical test, the method decrypted Word file ciphertext efficiently, which could be very helpful for forgotten-password retrieval and judicial data forensics.

Acknowledgment: This work is supported by Foundation of National Natural Science (No.61309034). The authors also thank the reviewers for their useful comments, which helped to improve this paper.

References

[1] Anonymous hacker. The problem of encryption functionality in Word document. Available at http://college.Sxhifhway.gov.cn/document/ 20040112161826088.htm.
[2] Wu H. The Misuse of RC4 in Microsoft Word and Excel. Institute for Infocomm Research, Singapore, 2005. Available at http://packets- torm.setnine.com.
[3] Trappe W. Introduction to cryptography with coding theory, Pearson Education press, 2006.
[4] Elcomsoft company software 2016. Advanced Office Password Recovery. An introduction is available at http://www.elcomsoft.com, 2016.
[5] Passware company software. Passware Kit Enterprise and Passware Kit Forensic, available at http://www.lostpassWord.com/index.htm, 2016.
[6] Chen Q, Fang H. Study on Word Document Fast Crack Based on Time-memory Trade-off Algorithm. Computer Engineering, vol.36(16): 137-139, 2010.
[7] Team 509. The use of vulnerability of encryption algorithm in MS Word. Available at http:// rootyscx.net/documents/MSWord encrypt.pdf
[8] Hellman M. A cryptanalytic time-memory trade off. IEEE Transactions on Information Theory, IT, 1980, 6(4): 401-406.
[9] Avoine G., Junod P. and Oechslin P. Time-memory trade-offs: False alarm detection using checkpoints. In Progress in Cryptology Indocrypt 2005, volume 3797 of Lecture Notes in Computer Science, pp. 183-196, Springer-Verlag.
[10] Oechslin P. Making a Faster Cryptanalytic Time-memory Trade-Off, Advances in Cryptology proceedings of Crypto 2003, LNCS 2729, Springer-Verlag, 2007: 617-630.

Gao-yang LI, Kai WANG, Yu-kun ZENG, Guang-ri QUAN*

A Short Reads Alignment Algorithm Oriented to Massive Data

Abstract: DNA sequencing technology has developed rapidly in recent years, and both sequencing throughput and read lengths are growing. In addition, new properties such as paired-end sequencing are emerging. It is therefore of great value to develop a sequence alignment algorithm for this new type of DNA data. In this paper, such an alignment algorithm is proposed. Instead of the Smith-Waterman algorithm, a local alignment algorithm oriented to sparse mutations is used to accelerate seed extension. Moreover, instead of aligning short reads one by one, the software groups all reads with similar seeds together to accelerate seed location. We use human genome reference sequences and short sequencing data from GenBank (40 times coverage) to evaluate our algorithm, and we compare it with Bowtie2 in terms of speed and accuracy. The results show that our algorithm has significant advantages in alignment speed and space overhead on large-scale data.

Keywords: alignment tool; local alignment algorithm; Next-Generation Sequencing

1 Introduction

With the continuous development of Next-Generation Sequencing (NGS) technologies, the capabilities of gene sequencing have grown swiftly while the price has been falling. At the beginning of 2015, Illumina launched the HiSeq 4000, which can sequence at most 1.5T nucleotide data in just 3.5 days (http://www.illumina.com). Such advances have greatly widened the use of NGS in clinical medicine and precision medicine. The raw data generated by a Next-Generation sequencer need to be aligned to the reference genomes before downstream analysis. However, alignment is a time-consuming task with intensive computational requirements. In most cases, the use of high-performance workstations or servers is a necessity (e.g. [1]). This increases the cost of sequencing and makes the clinical application of NGS more difficult. Thus, an alignment tool that can handle massive raw data with cheaper computing resources is needed.

*Corresponding author: Guang-ri QUAN, School of Computer Science and Technology, Harbin Institute of Technology (Weihai), Weihai, China, E-mail: [email protected] Gao-yang LI, Kai WANG, Yu-kun ZENG, School of Computer Science and Technology, Harbin Institute of Technology (Weihai), Weihai, China


2 Methods

Most current aligners follow the seed-and-extend paradigm (e.g. BLAST [2]). First, short sub-strings (seeds) of the reads are aligned, exactly or with a few mismatches, to some regions of the reference genome. In that step, a BWT [3] index (e.g. SOAP3-dp [4], SOAP3 [5, 6], BigBWA [7]) or a hash index (e.g. MOSAIK [8], GMAP [9]) is used to locate candidate positions. Then a local or global alignment algorithm is used to complete the alignment at the candidate locations and generate the final results. Dynamic programming methods such as the Needleman-Wunsch algorithm and the Smith-Waterman algorithm are used as the local or global alignment algorithm [10, 11]. These methods always find the best alignment of two strings of nucleotide or protein sequences.

2.1 A local alignment method oriented to sparse mutations

Assuming both strings have length n, the computational complexity of dynamic programming methods is $O(n^2)$, which is relatively high when n is large. In resequencing, however, the two DNA sequences are similar in most cases, and this similarity can be exploited to make the alignment complexity lower than $O(n^2)$. Most mutation sites are short strings isolated from each other, like islands in the sea. In this paper, a string of k bps that contains mutations is considered isolated if the strings immediately before and after it are free of mutations and longer than k bps. Under such circumstances, the mutation-free strings are long enough to locate the isolated mutated strings. With the method in this paper, the computational complexity of alignment depends on the length of the mutated strings: for two DNA sequences of length n bps, if the longest isolated mutated string is p bps long, the complexity is essentially O(np). Because mutations are sparse and the majority of mutated strings are shorter than a specific constant, the algorithm usually achieves good performance. For two similar strings S1, S2 and a natural number k, we define the function f:

f(S1,S2,k):=[(S1«k)⊕S2]&(S1⊕S2)&[(S2«k)⊕S1] (1)

In function (1), “«” denotes a left shift by k bits, “&” denotes bitwise AND, and “⊕” denotes bitwise XOR. For a natural number k, the equation

f(S1,S2,k) = 0 (2)

in most cases indicates that the two strings have no mutated sub-strings longer than k. We define Md:

Md := Min (k) (3)


when:

f(S1, S2, k) = 0 (k ∈ N) (4)

[Figure: two aligned sequences. S1: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11; S2: A1 A2 A3 A4 A5 A6 A7 T8 A9 A10 A11]
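Function (1) and the quantity Md can be evaluated directly on integers. The sketch below treats each string as a bit string of fixed width, drops bits shifted beyond that width, and lets k start from 0; the paper does not state how sequences are encoded, so all three choices are assumptions made for illustration.

```python
def f(s1: int, s2: int, k: int, width: int) -> int:
    """Function (1) on two bit strings stored as integers of `width` bits."""
    mask = (1 << width) - 1          # assumption: bits shifted past `width` are dropped
    a = ((s1 << k) & mask) ^ s2
    b = s1 ^ s2
    c = ((s2 << k) & mask) ^ s1
    return a & b & c

def md(s1: int, s2: int, width: int) -> int:
    """Smallest natural number k with f(S1, S2, k) = 0, as in (3) and (4)."""
    for k in range(width + 1):
        if f(s1, s2, k, width) == 0:
            return k
    raise AssertionError("unreachable: f is always 0 once k equals width")

print(md(0b1111, 0b1101, 8))
```

With k = 0 the function reduces to S1 ⊕ S2, so Md is 0 exactly when the two bit strings are identical under these assumptions.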

… $x(t_0) = x_0$. (17)

Corollary 1 Let $g_j(\pm T_j) = 0$ $(j = 1, 2, \ldots, n)$. Then (17) is globally mean square exponentially stable if there exist positive constants $\kappa > 0$ and $p_i > 0$ such that, for $i = 1, 2, \ldots, n$,

$$(\kappa + \lambda_i - 2)\,p_i + \sum_{j=1}^{n}\left(p_i a_{ij}^{m} G_j + p_j a_{ji}^{m} G_i\right) < 0. \qquad (18)$$

Proof. The proof of Corollary 1 is similar to that of Theorem 1, so we omit it here.

In the following, we discuss an illustrative example.

Example 1 Consider the following model:


Global Mean Square Exponential Stability of Memristor-Based Stochastic Neural Networks

$$\begin{cases}
dx_1(t) = [-x_1(t) + a_{11}(x_1(t))\,g(x_1(t)) + a_{12}(x_1(t))\,g(x_2(t)) \\
\qquad\qquad + b_{11}(x_1(t))\,f(x_1(t-\tau(t))) + b_{12}(x_1(t))\,f(x_2(t-\tau(t))) \\
\qquad\qquad + I_1]\,dt + \sigma_1(x(t), x(t-\tau(t)), I)\,dB(t), \\
dx_2(t) = [-x_2(t) + a_{21}(x_2(t))\,g(x_1(t)) + a_{22}(x_2(t))\,g(x_2(t)) \\
\qquad\qquad + b_{21}(x_2(t))\,f(x_1(t-\tau(t))) + b_{22}(x_2(t))\,f(x_2(t-\tau(t))) \\
\qquad\qquad + I_2]\,dt + \sigma_2(x(t), x(t-\tau(t)), I)\,dB(t),
\end{cases} \qquad (19)$$

where $x = (x_1, x_2)^T$, $g(x) = \tanh(x_1 - 1)$, $f(x) = \tanh(x - 1)$, and

− − −  1.5, x1 < 1,  1, x1 < 1,  0.5, x2 < 1, a21 ( x1 ) =  a11 ( x1 ) =  a12 ( x2 ) =  x1 > 1, 1, 0.5, x1 > 1, 0.1, x2 > 1, 1, x2 < 1, 10, x1 < 1, − a22 ( x2 ) =   1, x2 < 1, 0.5, x2 > 1, b11 ( x1 ) = −1, x > 1, b12 ( x2 ) = −10, x > 1,   1 2 − 10, x2 < 1.  1, x1 < 1, b21 ( x1 ) =  b22 ( x2 ) =  x2 > 1. −10, x1 > 1, 1, Clearly, the conditions

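$$\max_{1 \le i \le n}\left\{\frac{1}{\theta_i}\sum_{j=1}^{n}\theta_j\left(a_{ji}^{m} G_i + b_{ji}^{m} F_i\right)\right\} < 1$$

$$\begin{pmatrix} (\kappa + \lambda_{\max} - 2)P + 2G_j P A^{m} & F_j P B^{m} \\ F_j P (B^{m})^{T} & \lambda_{\max} P \end{pmatrix}$$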