Cursive Script Text Recognition in Natural Scene Images: Arabic Text Complexities [1st ed. 2020] 978-981-15-1296-4, 978-981-15-1297-1


Table of contents :
Front Matter ....Pages i-xv
Foundations of Cursive Scene Text (Saad Bin Ahmed, Muhammad Imran Razzak, Rubiyah Yusof)....Pages 1-12
Text in a Wild and Its Challenges (Saad Bin Ahmed, Muhammad Imran Razzak, Rubiyah Yusof)....Pages 13-30
Arabic Scene Text Acquisition and Statistics (Saad Bin Ahmed, Muhammad Imran Razzak, Rubiyah Yusof)....Pages 31-42
Methods and Algorithm (Saad Bin Ahmed, Muhammad Imran Razzak, Rubiyah Yusof)....Pages 43-84
Progress in Cursive Wild Text Recognition (Saad Bin Ahmed, Muhammad Imran Razzak, Rubiyah Yusof)....Pages 85-92
Conclusion and Future Work (Saad Bin Ahmed, Muhammad Imran Razzak, Rubiyah Yusof)....Pages 93-95
Back Matter ....Pages 97-111


Saad Bin Ahmed Muhammad Imran Razzak Rubiyah Yusof

Cursive Script Text Recognition in Natural Scene Images Arabic Text Complexities


Saad Bin Ahmed
King Saud bin Abdulaziz University for Health Sciences
Riyadh, Saudi Arabia
Malaysia-Japan International Institute of Technology (M-JIIT)
University of Technology Malaysia
Kuala Lumpur, Malaysia

Muhammad Imran Razzak
School of Information Technology
Deakin University
Geelong, VIC, Australia

Rubiyah Yusof
Malaysia-Japan International Institute of Technology (M-JIIT)
University of Technology Malaysia
Kuala Lumpur, Malaysia

ISBN 978-981-15-1296-4    ISBN 978-981-15-1297-1 (eBook)
https://doi.org/10.1007/978-981-15-1297-1

© Springer Nature Singapore Pte Ltd. 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Whatever you can do, or dream you can, begin it. Boldness has genius, power, and magic in it. —Goethe

Preface

The complexities of cursive scene text are important to highlight when pursuing the possible solutions demonstrated by researchers. Contemporary researchers in the natural language processing (NLP) field need foundational knowledge that elaborates the complexities of cursive text, which in turn helps in defining solutions explicitly designed for Arabic scene text recognition systems. This book addresses researchers who want an elementary knowledge of Arabic scene text and all of its relevant issues. Successful learning systems start with the question of how good features can be derived from camera-captured text images. Text analysis emphasizes the essential skills for building reliable systems that recognize provided samples with high accuracy. When the dynamics of cursive scene text are analyzed with an emphasis on Arabic, it becomes clear that traditional OCR approaches are not suitable for this intricate text pattern, and that the issues of traditional OCR bear little similarity to those of scene text analysis. The difference between the aforementioned systems should be clearly understood by the reader. The context-based LSTM approaches designed for Arabic scene text recognition systems constitute state-of-the-art solutions which need to be discussed in a comprehensive manner. This book is a result of all the difficulties encountered during the underlying research. The rationale behind this effort is to compile the solutions that have been proposed for cursive scene text analysis. It is pertinent to discuss the deep learning architectures believed to be suitable for learning solutions designed for complex scripts appearing in natural images. This book places a clear emphasis on context learning classification methods, which are the result of rigorous research performed in scene text analysis.

Kuala Lumpur, Malaysia
July 2019

Saad Bin Ahmed Muhammad Imran Razzak Rubiyah Yusof


Acknowledgements

First and foremost, praises and thanks to the Almighty for His showers of blessings throughout this research work and the successful completion of this book. Writing a book is harder than I thought and more rewarding than I could have imagined. None of this would have been possible without the precious discussions among groupmates that shaped the flow of contents and overall organization of this book. As the first author of this book, I am grateful to all of those with whom I have had the pleasure to work during this and other related projects. Dr. Muhammad Imran Razzak and Prof. Rubiyah Yusof have provided me extensive personal and professional guidance and taught me a great deal about both scientific research and life in general. Nobody has been more important to me in the pursuit of this project than the members of my family. I would like to thank my parents, whose love and guidance are with me in whatever I pursue. They are the ultimate role models. Most importantly, I wish to thank my loving and supportive wife, Tayyaba, and my three wonderful children, Aiza, Rameen, and Aleyan, who provide unending inspiration. Finally, we would like to thank Springer for providing us the opportunity to write this book, which we hope will be beneficial for newcomers to the fields of document image analysis and natural language processing.


Contents

1 Foundations of Cursive Scene Text .... 1
  1.1 Introduction .... 1
  1.2 Document Image Analysis .... 3
  1.3 What Is Cursive Script? .... 5
  1.4 Role of Context in Cursive Scripts .... 7
  1.5 Cursive Text in Natural Images .... 9
  1.6 Book Overview .... 11

2 Text in a Wild and Its Challenges .... 13
  2.1 Introduction .... 13
  2.2 Inbuilt Complexities Relevant to Cursive Scene Text .... 15
    2.2.1 Scene Text Localization Constraints .... 15
    2.2.2 Classification Techniques for Scene Text Recognition .... 21
    2.2.3 Methods Designed for Feature Extraction .... 25
  2.3 Importance of Implicit Segmentation .... 28

3 Arabic Scene Text Acquisition and Statistics .... 31
  3.1 Dataset Relation with Machine Learning .... 31
  3.2 Arabic Script Properties .... 32
    3.2.1 Dataset Collection .... 33
    3.2.2 Multilingual Scene Text Recognition and Its Need .... 34
    3.2.3 EASTR-42K Dataset .... 35
    3.2.4 Status of Available Arabic Scene Text Datasets Other than EASTR-42k .... 37
  3.3 Pre-processing of Scene Text Images .... 39
  3.4 Generation and Verification of Ground Truth .... 41

4 Methods and Algorithm .... 43
  4.1 Invariant Feature Extraction in Co-occurrence Extremal Regions .... 43
    4.1.1 Detection of Extremal Regions .... 43
    4.1.2 Invariant Feature Extraction .... 45
  4.2 Window-Based Features .... 48
  4.3 Linear Spatial Pyramid .... 48
    4.3.1 Formulation and Pre-processing .... 50
    4.3.2 Formulation of Linear Spatial Pyramids of Cursive Arabic Scene Text .... 51
    4.3.3 Pre-processing of Image Pyramids by Image Filters .... 53
  4.4 MNIST-Based Convolutional Features .... 55
    4.4.1 ConvNets as a Feature Extractor .... 56
  4.5 Deep Learning RNN Model for Cursive Text Analysis .... 57
    4.5.1 MDLSTM Network Training for Arabic Scene Text .... 57
    4.5.2 Experimental Analysis .... 61
  4.6 Deep Convolutional Neural Network .... 65
    4.6.1 ConvNets as a Learning Classifier .... 65
    4.6.2 Experimental Study .... 66
  4.7 Training of Handwritten Urdu Samples on Pre-trained MNIST Dataset .... 69
    4.7.1 Dataset .... 73
    4.7.2 Experimental Study .... 74
  4.8 Hierarchical Sub-sampling-Based Cursive Document and Scene Text Recognition .... 78
    4.8.1 Experimental Analysis .... 82

5 Progress in Cursive Wild Text Recognition .... 85
  5.1 Convolutional-Based Performance Comparison .... 85
    5.1.1 Comparison with ICDAR Competitions .... 88

6 Conclusion and Future Work .... 93

Appendix: Relevant Description .... 97
Glossary .... 99
References .... 101
Index .... 109

About the Authors

Dr. Saad Bin Ahmed is a lecturer at King Saud bin Abdulaziz University for Health Sciences (KSAU-HS), Riyadh, Saudi Arabia. He is also associated with the Center of Artificial Intelligence and Robotics (CAIRO) research lab at the Malaysia-Japan International Institute of Technology (M-JIIT), Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia. He completed his Ph.D. in Intelligent Systems at Universiti Teknologi Malaysia in 2019. Prior to that, he completed his Master of Computer Science in Intelligent Systems at the Technische Universität Kaiserslautern, Germany, and was a research assistant in the Image Understanding and Pattern Recognition research group at the same university. His areas of interest are document image analysis, machine learning, computer vision, and optical character recognition. He has authored more than 25 research articles in leading journals and conferences, as well as book chapters.

Dr. Muhammad Imran Razzak is working at Deakin University, Australia. Before joining Deakin University, he worked at the University of Technology Sydney, Australia, and at King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia. He holds a patent and is the author of more than 70 papers in respected journals and conferences. He has secured research grants of more than $1.3 million and has successfully developed and delivered several research projects. His areas of research include machine learning, document image analysis, and health informatics.

Prof. Dr. Rubiyah Yusof is a director at the Center of Artificial Intelligence and Robotics (CAIRO), M-JIIT, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia. She received her master's degree in Control Systems from the Cranfield Institute of Technology, United Kingdom, in 1986, and her Ph.D. in Control Systems from the University of Tokushima, Japan, in 1994. Throughout her career, Dr. Yusof has made significant contributions to artificial intelligence, process control, and instrumentation design. She is recognized for her work in biometric systems, such as KenalMuka (a face recognition system) and a signature verification system, which won both national and international awards. She is the author of the book Neuro-Control and its Applications, published by Springer-Verlag in 1995 and translated into Russian in 2001. Professor Dr. Yusof is a member of the AI Society Malaysia, the Instrumentation and Control Society Malaysia, and the Institute of Electrical and Electronics Engineers Malaysia.

Acronyms

ANN       Artificial neural network
ASTR      Arabic scene text recognition
ConvNets  Convolutional neural networks
CRF       Conditional random fields
CTC       Connectionist temporal classification
DSLR      Digital single-lens reflex
EASTR     English–Arabic scene text recognition
ESTR      English scene text recognition
FCN       Fully convolutional networks
HOG       Histogram of gradients
ISRI      Information Science Research Institute
LDA       Linear discriminative analysis
LSTM      Long short-term memory networks
MDLSTM    Multidimensional long short-term memory networks
MKL       Multiple kernel learning
ML        Machine learning
MNIST     Modified National Institute of Standards and Technology
MSER      Maximally stable extremal regions
MSRA-TD   MSRA Text Detection
NLP       Natural language processing
NN        Nearest neighbor
OCR       Optical character recognition
RNN       Recurrent neural network
SIFT      Scale-invariant feature transformation
STC       Scene text character
STR       Scene text recognition
SVM       Support vector machine
SVT       Street View Text
UNHD      Urdu Nastaliq Handwritten Dataset
UNLV      University of Nevada, Las Vegas
UPTI      Urdu Printed Text Images

Chapter 1

Foundations of Cursive Scene Text

Abstract This chapter presents the principal concepts of analyzing Arabic and Arabic-like text as it appears in natural images. The Arabic script has a complex writing style which poses numerous problems that must be addressed before state-of-the-art techniques can be applied. Defining the script's complexities in detail supports those researchers who are not familiar with the basic structure of the Arabic script. This chapter aims to provide information about the structure of the Arabic script and its context learning characteristics. Arabic script research has witnessed a shift of interest among document image analysis researchers from recognition of optical characters to recognition of characters appearing in natural images. Literature on Latin and Chinese script-based scene text recognition systems can be found, but the Arabic-like scene text recognition problem is yet to be addressed in detail.

1.1 Introduction

Cursive text analysis has become an interesting field of research in natural image processing and machine learning, whether applied to natural images, synthetic data, or handwritten data. In recent years, the interest of researchers has turned toward recognition of cursive text appearing in natural images. Interest is gradually increasing in the natural language processing community in performing research on cursive scripts such as Chinese, Japanese, and Arabic, and in highlighting the complexities of cursive scripts, Arabic in particular. Research on the Chinese and Japanese cursive scripts has accomplished good results and matured in terms of accuracy, as reported by [1, 2]. For Arabic, however, there is still a need to describe the complexities pertaining specifically to language layout and word or sentence construction, although numerous works have been published in recent years, as mentioned by [3, 4]. The inherent complexities embedded in the Arabic script require detailed description because they make text recognition difficult even with segmentation-based classifiers. This chapter helps in understanding the basic structure of the Arabic script. Figure 1.1 shows camera-captured Arabic scene text. As observed from this figure, explicit techniques are not suitable for character or word segmentation of Arabic text.


Fig. 1.1 Camera-captured cursive scene text images. The text appears in an uncontrolled environment and is affected by illumination and perspective distortion

In Fig. 1.2, Arabic characters are segmented from an Arabic word. Numerous factors such as illumination, image angle, and different font sizes and styles can extensively affect the correct segmentation of Arabic text in natural images. Another problem is the structure of Arabic text. Each character in the Arabic script has four possible positions within a word, and hence four representations of a single character, which makes segmentation techniques fail drastically at correctly segmenting Arabic characters. For the recognition of such scripts, there is a need to apply implicit segmentation techniques.


Fig. 1.2 Character identification in a scene text image. There are five characters in this image. Each character's appearance depends on its position in the word

Other variants of the Arabic script are Urdu and Persian. This book provides an opportunity to look into the details of Arabic text complexities and to investigate how the underlying complexities of this script can be handled.

1.2 Document Image Analysis

Document image analysis encompasses the interpretation of digital documents. A digital document includes text that appears as synthetic data, in scanned form, or captured by specialized cameras, as represented in Fig. 1.3.
• Synthetic data is artificial data that mimics real-world data in terms of representation, for example, text depicting the contents of a web application.
• Scanned data, as its name implies, is captured by a specialized device called a scanner. Handwritten text and book text fall under scanned data.
On the other hand, scene data appears in natural images taken by specialized cameras and needs to be detected and segmented by statistical algorithms. The text appearing in digital documents is processed by Optical Character Recognition (OCR) systems. The overall process is not straightforward, because each of the aforementioned data types has distinct issues associated with it. The associated complexities must be addressed before further statistical methods are applied via an OCR system.


Fig. 1.3 Arabic text in its various representations. Research on the Arabic script depends on whether the text appears in printed, handwritten, or scene text form

The OCR system is a specialized technology which reads document images and translates them into searchable text. Script-specialized OCR systems designed to recognize characters, words, and sentences are routinely developed and improved. Owing to more than a decade of extensive research in the field of OCR, there is a pressing need to generate standard and reliable datasets for the evaluation of presented techniques.


In contrast to OCR systems specialized for cursive scripts, text captured through specialized devices possesses distinct characteristics and complexities. Text analysis in natural images may therefore be regarded as a special case of the OCR problem.

OCR works predominantly on clean document images obtained by synthetic means or by machine rendering. Beyond specialized OCR systems, the advancement of cameras in handheld gadgets prompts their users to capture scene images containing overlaid text. This development takes document analysis research to an advanced level, where the problems must be addressed in depth with big data. In today's era, most individuals carry specialized gadgets to capture scene images. Such images may be beneficial in providing information promptly during work or a journey. These camera-captured images may contain much textual information in addition to the semantic knowledge constituted by graphics or natural scenes, which supports the collection of big data samples. The text appearing in natural images generally serves to convey information to people. Scene text comes in various font styles and sizes against various backgrounds, including buildings, sea, mountains, and forests, which are termed noise and may hinder smooth text recognition from natural images.

Natural images containing text are most visible on signboards, banners, and advertising notes or boards. Text extraction from natural images is an emerging research field as far as cursive scene text recognition is concerned. The problem is termed challenging due to implicit noise such as blur, lighting conditions, text alignment, style, orientation, and size of the text in the image. Moreover, a complex background often makes it hard to extract the text. Traditional OCR systems are not specialized in text extraction; rather, they learn from a provided clean text image. Text extracted from natural scene images is beneficial for applications such as text search in video, text extraction from video, content-based retrieval, and search engines.

1.3 What Is Cursive Script?

In document image analysis, the heterogeneity of scripts always poses a challenge. A script may appear in a plain format, such as printed or handwritten Latin. It may also be represented in a cursive style, which is often difficult to address.


Fig. 1.4 Arabic script constraints

Definition 1.1 A cursive script is any penmanship style in which the various symbols of a language are written in a conjoined, flowing way.

Text other than printed/synthetic Latin is affected by different writing styles. There are numerous complicated scripts which are cursive by nature, such as Arabic, Chinese, and Japanese. The cursiveness of such scripts is not influenced by the text acquisition method: they retain their cursive representation regardless of the mode of acquisition, i.e., captured by specialized cameras, produced by synthetic means, or written by hand. Latin text recognition is no longer an open research problem, as many researchers have proposed efficient solutions for Latin printed, handwritten, and scene text recognition systems, as witnessed in recent research [24, 55, 56]. Uncomplicated segmentation techniques perform well on Latin script even when the text images in question appear in cursive handwritten form, as represented by [56]. On the other hand, the recognition of non-Latin scripts like Arabic still poses a great challenge in terms of complex text representation and requires more effort from the research community to address the complications. It is pertinent to mention here that Arabic and Arabic-like scripts follow two styles of representation, i.e., the Naskh and Nastaliq styles. The Arabic language is written in the Naskh style, whereas Arabic-like Urdu and Persian are written in the Nastaliq style. Moreover, the Arabic and Arabic-like scripts share the same writing direction, i.e., from right to left. Arabic characters are classified into two forms, i.e., joiners and non-joiners. Characters that are joiners may join the preceding or succeeding character in a word, which means that these characters can appear at the initial, middle, or final position in a word, as represented in Fig. 1.4a, whereas non-joiner characters appear in isolation or as the last character of a ligature, as shown in Fig. 1.4b.


Fig. 1.5 Arabic writing styles: a the word is not necessarily bound to its baseline but is written in diagonal form; b the word is strictly written on the baseline

As suggested by the preceding discussion, every character in Arabic can appear in any of four locations, i.e., the initial, middle, final, or isolated position. Numerous works have been proposed to address the complexities of synthetic or handwritten Arabic script; details can be found in [4]. In contrast, there are various complexities associated with Arabic scene text which are relatively difficult to deal with in the presence of the joiner and non-joiner character constraint. In camera-captured text images, there are plenty of other factors to consider, such as text volume, variable font styles, complex and cluttered images, and image perspective. All of these issues require careful consideration in order to extract the text with high precision. Numerous other factors, such as illumination, text angle, font size, appearance, and clarity of the text, pose a huge challenge for researchers aiming to recognize Arabic text in uncontrolled environments. The Arabic script has 39 identified letters including numerals. Being a descendant of the Arabic script, the Urdu script shares similarities with Arabic in terms of representation and thus has the same complexities. Urdu text is written from right to left with adjacent characters joined within a word. Characters may appear in four different positions inside a word, as in Arabic, and the appearance of the same character changes depending on its position inside the word. The Urdu language is written in the Nastaliq style, while the Arabic language is written in the Naskh style. The nature of the Nastaliq style imposes a diagonal representation on Urdu text, which does not sit on a fixed baseline, as depicted in Fig. 1.5. In comparison to Arabic, which sits on the baseline, the Urdu language does not have a baseline constraint; rather, the text is center justified. In Arabic or Arabic-like scripts, a character's position in a word determines its role in making the word meaningful. This property makes Arabic a context-sensitive script.

1.4 Role of Context in Cursive Scripts

Unconstrained cursive scripts like Arabic pose a great challenge, as their complexities must be dealt with in the presence of other image-degrading properties. Context is considered the backbone of understanding Arabic and Arabic-like scripts.

Definition 1.2 In general, context is defined through the environment. With respect to Arabic text, the context is determined by the surrounding characters. The position of characters in an Arabic word helps to establish the context.


Fig. 1.6 Arabic word written with diacritical marks. Compulsory diacritical marks make the word meaningful. The optional diacritical marks help in accurate pronunciation

The prediction of the next character in a word and the next word in a sentence is estimated with reference to the current temporal behavior. As mentioned earlier, Arabic itself is written in the Naskh style, where character understanding is comparatively manageable, but in the Nastaliq style the text is written in diagonal form, as represented in Fig. 1.5. In this particular style the characters overlap each other, which makes it impossible to segment the characters even when presented in printed form. Another issue is the association of diacritical marks above or below the Arabic base characters. Diacritical marks are crucial in Arabic-like scripts because a misplaced mark changes the meaning of a word.

Urdu also uses punctuation marks to separate sentences and leaves white space between ligatures and words for separation. Furthermore, characters may overlap with each other, and the script is very rich in diacritical marks; Urdu contains 22 diacritical marks, and these additional marks associated with a ligature represent short vowels or other sounds. Some diacritical marks are compulsory and cannot be skipped, whereas others are optional and are only added to a word to help with pronunciation, as represented in Fig. 1.6. Owing to the complications of the Nastaliq script, machine-based recognition of Nastaliq is much more complex than that of the Naskh writing style. Explicit segmentation tools cannot handle the complexities of Arabic characters; thus, implicit segmentation is the appropriate choice for dealing with these complexities of the Arabic script. In recent years, research based on implicit segmentation and context learning classifiers for Arabic/Urdu OCR, either printed or handwritten, has been reported, as indicated in [3, 11, 67]. State-of-the-art techniques such as Recurrent Neural Networks (RNNs) [18] have been applied to Urdu cursive script, resulting in remarkable accuracies in recent years [3, 11, 12].


Although work on OCR for Urdu and Arabic-like scripts has produced commendable solutions, as shown in recent research publications on the subject, the recognition of Arabic scene text still requires detailed effort before reliable solutions can be presented. As discussed earlier, the techniques that have been applied in OCR systems fail catastrophically on Arabic scene text recognition; this is due to the complex structure of Arabic scene text in the presence of varied text styles, colors, and orientations that follow no particular font style or size. Efforts are being made to overcome the difficulties in this direction, and some research work has been reported, as described in later chapters.

1.5 Cursive Text in Natural Images

There are additional inter- and intra-class variations in text extracted from natural images. It is comparatively easier to recognize Latin script in natural images than Arabic, Chinese, Japanese, or any other cursive script. Arabic is one of the most common scripts and the second most widely used, and it has the status of the national language of the Arabian Peninsula. More than one billion users around the world read and write languages based on the Arabic script.

Arabic script is written from right to left, in combination with diacritics that are considered an integral part of making a word meaningful. A variety of techniques for Arabic or Arabic-like text recognition, whether printed, as revealed by [12], scanned, or handwritten [67], have been presented in recent years. Existing methods designed for scene text detection and recognition may be categorized into texture-based, component-based, and hybrid methods:
1. Texture-based methods rely on properties of an image such as intensity and hue values and wavelet transformations, applying different filtering techniques that contribute to the representation of the image. Such properties may help to detect the text in an image, as explained in presented work such as [15, 17, 21, 52].
2. Component-based methods depend on specific regions of an image. A region is often marked by color clustering and coordinate values. Different filtering techniques may be applied to segment text and non-text regions of an image. If scene text images are taken in specific settings, then component-based methods produce good results. This method is not suitable for variant text images, e.g., with differences in font size, rotation, etc. Some researchers have proposed techniques using this method, as mentioned in [3, 11, 18, 93, 100].


3. Hybrid methods share the characteristics of both texture-based and component-based methods. The candidate region is determined by using both techniques on the same image, as explored by [84, 92, 101, 103].
The segmentation approaches that have been applied in OCR can also be applied to scene text recognition, as reported by [68, 70]. These approaches produced state-of-the-art results in OCR, whether applied to cursive or non-cursive script recognition systems. However, text in natural images and text in printed or scanned documents do not share the same characteristics, which causes these approaches to fail drastically in scene text recognition systems.

Numerous works have been proposed to address the complexities involved in scene text recognition. Although it may be categorized as a problem within the OCR field, it calls for different approaches. The accuracies reported for various OCR techniques drop sharply on scene text images in the presence of non-text patterns. Scene text recognition may be labeled a specialized OCR problem, but there are distinct issues relevant to scene text recognition compared with typical OCR systems. One of the prominent issues in scene text is the localization of text within natural scene images. Numerous techniques have been presented to address text localization, as explained in [71–74], and text classification, as examined in [75–78]. Active research contributions on cursive scene text, specifically Arabic, have been witnessed during the past few years. Previously, proposed techniques were primarily applied to the localization and recognition of Latin text in natural images [15–17]. Cursive scripts pose a greater challenge for recognizing text in acquired images. As mentioned earlier, the Arabic script is cursive in nature due to the inherent variability of single and joining characters [3, 14]. The focus of this book remains on the solutions proposed to address the problems that emerge during the processing of Arabic document text and the recognition of Arabic text in natural images.

The aim is to investigate and analyze their strengths and to propose adapted deep learning solutions using context-learning RNNs and instance-learning Convolutional Neural Network (ConvNets) approaches. Furthermore, the suggested novel solutions are highlighted with an impetus to handle the complex patterns of Arabic text by exploiting the advantages of deep learning classifiers. The yielded results excel during the text localization and recognition phases. In most of the surveyed work, multiscript scene text images are presented as a prominent focal point of current research interest. Text detection and localization algorithms are not designed for a specific language; instead, this process is the same for any language, but the recognition techniques may vary depending on the nature of the script's characteristics and complexity.


Fig. 1.7 Arabic as a cursive script. The green box shows different positions of Arabic character noon, while blue box represents different positions of Arabic character laam. The red box depicts various positions of Arabic character seen

In Arabic, the text is cursive in nature because isolated characters do not convey any meaning unless they are used in conjunction with other characters, as represented in Fig. 1.7. As observed from the figure, learning the context of the Arabic script is crucial when building specialized intelligent systems for Arabic scripts. Arabic characters are predicted in a word as they appear because of this context learning capability. Therefore, to learn the sequences, RNN-based Long Short-Term Memory (LSTM) network approaches are presented in combination with deep learning architectures.

1.6 Book Overview

This book provides insights into document image analysis and pattern recognition research regarding the challenges of the Arabic cursive script. Moreover, it broadens its scope by providing knowledge about how to apply such analysis and how to overcome the stated challenges of Arabic scene text. The intended audience of this book consists of those individuals who want to know the constraints in Arabic script research. Moreover, it will be helpful for readers who want to probe the ways in which they can contribute to the document analysis field. Arabic scene data, considering all of its limitations, has not yet been fully addressed and recognized.


This book uses analysis of the scientific article genre to provide a clear understanding of the overall text recognition process. The challenges of text in the wild are discussed in Chap. 2, which further explores the constraints that occur during the scene text localization and recognition phases. The importance of implicit segmentation is also discussed later in that chapter. Chapter 3 reveals the importance of having a benchmark dataset. It further explains how the English–Arabic scene text dataset was acquired. The pre-processing steps and ground truth generation mechanism are also explored in that chapter. Strategies for feature extraction from cursive scene text images are presented in Chap. 4. That chapter summarizes the work presented for reliable feature extraction from natural scene text. The details about invariant feature extraction from co-occurrence extremal regions, the window-based approach for feature extraction, MNIST-based convolutional features, and features obtained by linear convolutional means are the points of discussion. Deep learning models designed for Arabic scene text recognition are also presented subsequently in that chapter, which furthermore compiles and highlights the results obtained by the deep learning models. Details about current progress in cursive scene text analysis are presented in Chap. 5, along with a comparison among various techniques designed for cursive scripts. The book culminates by examining the open research issues and discussing future directions in this particular field, as detailed in Chap. 6.

Chapter 2

Text in a Wild and Its Challenges

Abstract This chapter discusses the challenges encountered during the recognition of wild text images. The challenges are defined in detail with reference to recent publications in this field. Over the past decades, a shift of researchers' interest has been witnessed from the recognition of optical characters to the recognition of characters appearing in natural images. Scene text recognition is a challenging problem because the text exhibits variations in font style, size, alignment, orientation, reflection, illumination change, blurriness, and complex backgrounds. Among cursive scripts, Arabic scene text recognition is contemplated as an even more challenging problem due to joined writing, variations of the same character, the large number of ligatures, multiple baselines, etc. Studies on Latin and Chinese script-based scene text recognition systems can be found, but the Arabic and Arabic-like scene text recognition problem has yet to be addressed in detail. The issues pertaining to text localization and feature extraction are also highlighted. Moreover, this chapter emphasizes the importance of having a benchmark cursive scene text dataset.

2.1 Introduction

The advancement of cameras in handheld gadgets prompts their users to capture scene images containing overlaid text. In today's era, most people carry handheld gadgets and capture scene images for information purposes during work or a journey. The captured images may contain much textual information in addition to the semantic knowledge represented by graphics or pictures. The text appearing in natural images usually serves to convey information to people. Scene text is represented by different font styles and sizes against various backgrounds, including buildings, sea, mountains, and forests, which are termed noise and may hinder the smooth process of text recognition from natural images. Natural images bearing text can be seen on signboards, banners, and advertising notes or boards. Text extraction from a natural image is an emerging research field from the perspective of cursive scene text recognition. The problem is termed challenging due to implicit noise such as blur, lighting conditions, text alignment, style, orientation, and size of the text in the image. Work on Chinese and Japanese in this direction has provided various algorithms that obtain good results.


Fig. 2.1 Scene Text Recognition (STR) phases

Arabic script demands more effort from researchers to contribute to the segmentation process, because it is considered one of the more subtle tasks to perform. This chapter focuses on the importance of segmentation, or localization, of Arabic cursive text in natural images. Text detection and recognition are contemplated as subtle tasks in scene text analysis. The three basic challenges identified by [43] are diversity of text, background complexity, and noise embedded in the text created by interference. Scene text is overwhelmed by variations in font size, color, noise, and inconsistent backgrounds, and it becomes a challenge to recognize the text in the presence of such tricky impediments. The phases involved in scene text recognition are represented in Fig. 2.1. After the acquisition of an image, pre-processing is required to obtain a standardized representation of the acquired image. The next challenge is text localization, which requires specialized methods to analyze the provided text image. Features are the distinct characteristics of an image and are extracted from the localized text. It is pertinent to mention here that not all extracted features are treated as potential features; therefore, post-processing is required to retain relevant, limited, but distinct features. In the following subsections, each phase is explained with reference to the current status of research in recent years. To counter the implicit problems associated with scene text, various techniques have been proposed that address the inherent challenges faced by text appearing in natural images. These techniques achieved significant results, but all of them were applied to Latin script, as reported in [15, 16, 50]. This chapter elaborates insight about each phase of the text recognition process with respect to state-of-the-art approaches.


2.2 Inbuilt Complexities Relevant to Cursive Scene Text

Text detection and recognition in natural images are contemplated as subtle tasks. The detection of text becomes possible after the removal of unnecessary and irrelevant image noise. Scene text localization affects the recognition accuracy; therefore, its constraints should be explicated in detail. The following subsections summarize the discussion of currently proposed techniques designed for text localization and recognition of non-Arabic scene text. This discussion conveys an idea of how one technique is preferable over another and on what basis a specific technique can be applied to cursive scene text.

2.2.1 Scene Text Localization Constraints

Text detection/localization is regarded as an important part of information extraction systems. During recent years, novel approaches for text detection have been proposed by the pattern recognition, document image analysis, and computer vision research communities. After careful review of the presented work, script identification in multiscript scene text images emerges as a prominent focal point of current research interest. Text detection and localization algorithms are not designed for a specific language; instead, this process is the same for any language, but the recognition techniques may vary depending on the nature of the script's characteristics and complexity. This section describes the constraints that may be encountered in cursive scene text localization. Details about the methods recommended for feature extraction of cursive scene text are elaborated later in this chapter. Numerous methods describing the correct localization of text in natural images under these constraints have been reported in past years. Some of them are explained in the following paragraphs with their implementation references.

Definition 2.1 (Conditional Random Fields (CRF)) A CRF is an undirected graphical, probabilistic method that predicts the sequence of missing information between the previous and current sequence of content in order to produce an exact or approximate labeling. The statistical model of a CRF is presented in Fig. 2.2. The observations O are depicted as a, whereas the states S are conceptualized as b at the previous, current, and next points in time. It is used to encode consistent interpretations and is often employed where sequence labeling is required, as in Natural Language Processing (NLP) problems.

Q_θ(a | b) = (1 / Z_θ(a)) ∏_{t=1}^{T} exp( Σ_{m=1}^{M} θ_m f_m(b_{t−1}, b_t, a_t) )    (2.1)


Fig. 2.2 Conditional random fields visualization

Considering Eq. (2.1), a = (a_1, ..., a_T) represents the input sequence and b = (b_1, ..., b_T) constitutes the output sequence, i.e., the sequence of labels. {f_m}_{1≤m≤M} is the feature vector, and {θ_m}_{1≤m≤M} are the associated real-valued parameters in Eq. (2.1). The CRF in Eq. (2.1) is sometimes referred to as a linear-chain CRF, although it can be taken as more general, since b_t and a_t need not be composed directly of the individual sequence tokens but can be built on subsequences (e.g., trigrams) or other localized characteristics. A hybrid approach using this method for text localization was proposed by [20]. The motive for discussing their work is to convey how the technique can be useful for text localization. Their proposed method consists of three stages, i.e., pre-processing, connected component analysis, and minimum classification error. The connected component technique may lead to inaccurate localization of text; hence, a CRF is employed to predict the correct labels. In text localization from natural images, there is a high probability of misclassified text, and the CRF is a useful technique for predicting and localizing the text correctly. In Arabic scene text, there exist images where the text appears in diagonal form or in patches. The CRF is a useful technique for predicting the missing sequence of Arabic text, although its implementation requires critical thinking and professional expertise to develop such a specialized application.
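To make the scoring in Eq. (2.1) concrete, the following minimal sketch computes the unnormalized linear-chain score of a tiny label sequence and normalizes it by brute force over all possible sequences. The feature functions, label set, and weights are illustrative assumptions for this sketch, not values taken from any cited system.

import itertools
import numpy as np

labels = ["text", "non-text"]            # illustrative label set
T = 4                                     # length of the observation sequence
obs = np.array([0.9, 0.2, 0.8, 0.7])      # toy per-position observation scores a_t

# Two illustrative feature functions f_m(b_prev, b_cur, a_t):
# f0 rewards labeling a position "text" when its observation score is high,
# f1 rewards keeping the same label as the previous position (smoothness).
def f0(b_prev, b_cur, a_t):
    return a_t if b_cur == "text" else 1.0 - a_t

def f1(b_prev, b_cur, a_t):
    return 1.0 if b_prev == b_cur else 0.0

theta = np.array([2.0, 0.5])              # weights theta_m, chosen arbitrarily

def score(seq):
    # Unnormalized exponential score of Eq. (2.1) for one label sequence.
    s = 0.0
    for t in range(T):
        b_prev = seq[t - 1] if t > 0 else None
        s += theta[0] * f0(b_prev, seq[t], obs[t])
        if b_prev is not None:
            s += theta[1] * f1(b_prev, seq[t], obs[t])
    return np.exp(s)

# Brute-force partition function over all |labels|^T sequences (feasible only for toy sizes).
Z = sum(score(seq) for seq in itertools.product(labels, repeat=T))
best = max(itertools.product(labels, repeat=T), key=score)
print("most probable labeling:", best, "probability:", score(best) / Z)

In a real system the partition function and the best labeling are computed with the forward and Viterbi recursions instead of enumeration; the brute-force version is only meant to make the formula readable.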


Scale-based region growing: Another very pertinent method is the Scale-Invariant Feature Transform (SIFT), which has been used by numerous researchers in combination with other techniques.

Definition 2.2 In SIFT-based region growing, the process of region growing starts from the key-points detected by the Scale-Invariant Feature Transform (SIFT) algorithm, as discussed in [38]. With the SIFT algorithm, the key-points of a given image are extracted and treated as features. The extracted key-points may lie on both text and non-text regions of an image.

Fig. 2.3 Scale-based region growing with identification of key-points and marking of a text
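A minimal sketch of the key-point extraction step described in Definition 2.2, assuming OpenCV (cv2) with the SIFT implementation available from OpenCV 4.4 onwards; the image path and the size-based filter are illustrative assumptions reflecting the key-point size constraint discussed below.

import cv2

# Load a scene text image in grayscale (the path is a placeholder).
gray = cv2.imread("scene_text.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT key-points and their 128-dimensional descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Key-points land on both text and non-text regions; as a crude first filter,
# discard very small key-points (the threshold is purely illustrative).
candidates = [kp for kp in keypoints if kp.size > 3.0]

# Each surviving key-point can seed a growing region around its location (x, y).
for kp in candidates[:5]:
    x, y = kp.pt
    print("seed at (%.1f, %.1f), scale %.2f, orientation %.1f deg" % (x, y, kp.size, kp.angle))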

The SIFT features are geometrically mapped onto the extracted text, which sometimes merges background lines into the candidate region and eventually contributes to decreasing the OCR accuracy. The main constraint of the SIFT technique is the determination of the key-point size. Figure 2.3 depicts text localization performed with the SIFT technique; the key-points extracted from the provided image are also shown. Moreover, how to concentrate on potential key-points, and which key-point is more relevant with respect to the provided text, are the main issues that need to be addressed. The SIFT approach may be applied to cursive scene text localization, as a few research works have proposed [38]. The dimension of the image and the nature of the presented text remain open research questions.

Oriented Stroke Detection: Text stroke orientation also plays a vital role in scene text detection, as discussed by [16]. The assumption lies in the fact that every character is represented by its stroke information. The orientation of the character strokes is measured; for instance, the character "A" is represented by two stroke directions, one at 60° on the left and the other at 120° on the right, both joined at the top, as visualized in Fig. 2.4. A third stroke at 0° joins the left and right strokes from the middle. The gradient projection G of every character is modeled in direction α at scale s, as described in their paper:

G_{α,s} = ∂(R_α S_s I) / ∂β    (2.2)

Here R_α represents a rotation matrix of angle α, while S_s denotes the scaling matrix of scale s; details of the proposed method can be found in the manuscript [16]. The candidate region is detected and grows upon detecting at least a single stroke in the image I. Stroke-based information may be helpful in the localization of cursive scene text provided the text image is clean. Since numerous impediments degrade the quality of acquired images, it becomes very cumbersome under this constraint to concentrate on stroke information, as text images are not consistent in their appearance. Moreover, keeping the statistical information of each stroke appears to be a laborious and time-consuming process.
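The rotate-scale-differentiate pattern behind Eq. (2.2) can be sketched as follows, assuming OpenCV. Differentiating along the image x-axis after rotation stands in for the projection direction β, and the angles and scales swept here are illustrative, not those used in the cited work.

import cv2
import numpy as np

gray = cv2.imread("scene_text.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
h, w = gray.shape

def gradient_projection(image, alpha_deg, scale):
    # Rotate by alpha (R_alpha), rescale by s (S_s), then take a directional derivative.
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), alpha_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h))
    scaled = cv2.resize(rotated, None, fx=scale, fy=scale)
    return cv2.Sobel(scaled, cv2.CV_32F, 1, 0, ksize=3)   # derivative along one fixed axis

# Responses for a few stroke orientations and scales; strong responses hint at oriented strokes.
for alpha in (0, 60, 120):
    for s in (1.0, 0.5):
        G = gradient_projection(gray, alpha, s)
        print("alpha=%3d, s=%.1f, max |G| = %.1f" % (alpha, s, float(np.abs(G).max())))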


Fig. 2.4 Stroke orientation of the handwritten Latin character "A". The circles indicate the angle of each stroke, whereas the arrows indicate the direction of the handwritten stroke

Text detection based on Laplacian and Sobel filters: The orientation of text in video images poses a challenge due to the constraints on the size and orientation of a text image. To address the problems and challenges of multi-oriented text in video images, P. Shivakumara et al. [61] proposed a text detection approach. They applied the product of Laplacian and Sobel filters to enhance pixel values. Discontinuities in an image give a strong indication of the presence of text. The Laplacian is a second-order derivative and is used to detect dissimilarities in four directions, i.e., horizontal, vertical, up-left, and up-right. In this way, the information in low- or high-contrast images was enhanced, but the results were noisy. The noise is handled by the product of the Laplacian and Sobel filters. The Sobel filter is a first derivative and produces fine detail at discontinuities in the horizontal and vertical directions, as shown in Fig. 2.5. They first applied both techniques individually and later combined the results as the product of the two. Localization of cursive text through the product of Sobel and Laplacian filters seems practical, but pre-processing is required before applying this filtering technique for text localization. Generally, acquired scene text images contain implicit non-text noise, and the perspective distortion and orientation of the text contribute to determining the correct localization. Manual supervision is required during the process to assess the resulting images. As scene text images are always taken in uncontrolled environments, there is a need to present solutions that are robust in nature and can handle any type of scene text image with its associated impediments.

Definition 2.3 Connected component analysis is a labeling technique that scans an image and groups its candidate pixels such that all connected pixels sharing similar intensity values form a component. This technique is applied to natural images or video text images to determine the text area. Video/scene text images often have complex backgrounds, cluttered text, and jitter; in this situation it becomes very difficult for any method to produce encouraging results. The connected component analysis technique is suitable for clean text images.
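A minimal sketch combining the two ideas just described, assuming OpenCV: the Laplacian-Sobel product enhances likely text pixels, and connected component labeling then groups the surviving pixels into candidate regions. The threshold and the size filter are illustrative choices, not values from the cited work.

import cv2
import numpy as np

gray = cv2.imread("scene_text.jpg", cv2.IMREAD_GRAYSCALE)

# Enhancement: product of second-order (Laplacian) and first-order (Sobel) responses.
lap = np.abs(cv2.Laplacian(gray, cv2.CV_64F))
sob = np.abs(cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)) + np.abs(cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3))
enhanced = lap * sob
enhanced = cv2.normalize(enhanced, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Binarize (Otsu) and label connected components.
_, binary = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)

# Keep components of plausible size as text candidates (label 0 is the background).
for i in range(1, num):
    x, y, w, h, area = stats[i]
    if 20 < area < 5000:
        print("candidate box:", x, y, w, h)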


Fig. 2.5 Cursive scene text localization by filtration technique

As represented in Fig. 4.13, when the background of an image contains non-text regions, extracting text with the connected component technique becomes difficult, because background regions may merge with the actual text. Careful consideration is therefore required in cursive scene text localization when the connected component technique is being used. As mentioned earlier, perspective distortion and the quality of the image affect the accuracy.

Definition 2.4 (Maximally Stable Extremal Regions (MSER)) In computer vision, MSER is a well-known method for blob detection. It was first introduced by Pajdla et al. [80], who considered the correspondence between image elements across two different viewpoints. The idea is to extract a comprehensive number of image elements that contribute to the matching baseline, which helps in detecting an object in an image. This algorithm has been applied to detect text candidates in various state-of-the-art applications [63–66].

MSER has been applied in combination with other techniques with the intention of improving text localization accuracy, and it is considered a suitable technique for text localization. A multilingual text detection method based on MSER was proposed by [41]. The input of the MSER algorithm is a grayscale image I_g and the output is I_t, where t = 0 to 255. The image must go through a standardization process before the MSER technique is applied; for example, the image size should be standardized while keeping its aspect ratio in view. The image can then be binarized with a threshold T. Every pixel is evaluated and assigned to a black or white area, where 0 means completely black and 255 means completely white. A white area in the image is called an extremal region. To detect extremal regions, the remaining pixel area should stay the same.


The threshold T is applied so as to obtain the exact number of contributing regions of interest. By applying the threshold T over the image, successive regions are obtained that are not affected by the overall process; such regions are said to be Maximally Stable Extremal Regions (MSER). The nature of MSER suggests that individual characters cannot be detected correctly, because most of the time it is used to find a Region of Interest (ROI); hence, it proves suitable for finding a number of characters or words. In general, the MSERs in an image are categorized into three classes. The first class corresponds to individual characters, the second class may contain an arbitrary number of characters, and the third class may contain all non-textual background content. The character area is expanded by calculating a distance transform map, which depends on the computed binary mask. Here, the pixels play an important role, and the local maximum distance is considered for extremal region detection. The character stroke area A_s [72] is calculated by the following equation:

A_s = 2 Σ_{i∈S} d_i    (2.3)

where S is the stroke and d_i is the distance of pixel i to the boundary. This estimate is only correct for strokes of odd width and becomes inaccurate for even widths. The boundary stroke pixels were not connected to each other because of this noise, which has been compensated by introducing a weight w_i as follows:

A_s = 2 Σ_{i∈S} w_i d_i ,   w_i = 3 / |N_i|    (2.4)

where N_i denotes the set of stroke pixels in the 3 × 3 neighborhood of pixel i. Figure 2.6 visualizes cursive text detected by MSER; note that the unnecessary non-text regions detected by MSER still need to be eliminated. After examining numerous manuscripts describing the use of MSER, it proves to be a very important and suitable technique for detecting character candidates in a scene image, and it is also considered invariant to affine transformations. Thanks to its ability to search for points of interest in the provided area, it can produce good results even with a low-quality image. The implementation complexity is O(n log(log(n))), where n is the number of pixels in the image. Although this algorithm is practical for text detection, in some situations it may produce false positives, which can be further investigated by applying various checks to eliminate uninteresting regions from a given image.

Graph-cut method for scene text localization: Another way to localize text in natural images is based on the graph-cut approach presented by [71]. In this approach, the edges of an image are first extracted with a local maximum difference filter, and then the image is clustered based on color information.


Fig. 2.6 Cursive scene text localization by MSER

The spatial information of the scene text is captured in a skeleton image generated from the extracted edges. Character candidates are obtained by applying heuristic rules on top of connected components. Finally, a graph-cut formulation is applied to identify and localize the text lines correctly. The authors concluded that their approach produced state-of-the-art results across diverse fonts, sizes, and colors in different languages, regardless of illumination effects. This approach could also be useful if applied to cursive scene text localization in natural images.
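As a concrete illustration of the MSER-based localization discussed above, the following minimal Python sketch uses OpenCV's MSER detector together with a couple of simple geometric checks to discard obvious non-text regions; the area and aspect-ratio thresholds are illustrative assumptions, not values taken from the studies cited here.

import cv2

def mser_text_candidates(image_path, min_area=60, max_area=10000):
    """Detect candidate text regions with MSER and keep plausible boxes."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)   # stable regions and their bounding boxes

    candidates = []
    for (x, y, w, h) in boxes:
        area = w * h
        aspect = w / float(h)
        # Illustrative geometric checks to discard obvious non-text regions.
        if min_area <= area <= max_area and 0.1 <= aspect <= 10.0:
            candidates.append((x, y, w, h))
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
    return image, candidates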

2.2.2 Classification Techniques for Scene Text Recognition

Classification refers to the statistical analysis of training observations under supervised or unsupervised learning. A classifier analyzes the numerical properties of a given image; these numerical properties are the distinctive features that represent the image in question. Numerous state-of-the-art classifiers have been proposed in recent years, many of which learn through unsupervised methods [31, 84]. The purpose of this section is to detail the classifiers suitable for cursive scene text. The discussion starts with unsupervised classification techniques and then explains why a supervised learning approach is more effective for cursive text analysis. A few proposed classification techniques are summarized with their references as follows.
Nearest Neighbor Classifier: The Nearest Neighbor (NN) classifier is a nonparametric method that assigns a given sample to the class of its closest trained neighbor in the feature space. The NN classifier is also called K-NN because an object is classified by the vote of its k nearest neighbors, where the value of k is adjusted empirically. K-NN is an instance-based learner, meaning that all relevant computation is performed only at classification time.


With respect to cursive scene text recognition, this classifier is not well suited, because in Arabic script each character and word carries context; some computation from previous points in time is therefore needed.
Neural Network as a Classifier: The Artificial Neural Network (ANN) is considered a reliable classifier, inspired by the human way of learning.

Neural networks are designed around various network architectures. The basic form is the backpropagation network, which takes an input, processes it with bias values in the hidden neurons, and delegates the computation to the output neurons. Backpropagation is a basic learning scheme that does not consider previous computations; for problems that are complex in nature and require big data analysis, it fails considerably, and it is not recommended for problems with correlation in the sequence. The Arabic script is represented in a contextual manner: each character depends on its previous character and helps predict the next character at the current point in time.

If context were eliminated from Arabic script, recognition would become very cumbersome and, in practice, nearly impossible. Therefore, for cursive scene text analysis, the interest of the research community has shifted from traditional backpropagation to context learning classifiers. Recent work on Arabic-like scripts, as explained in [3, 11, 12, 67], uses the RNN approach for text classification. The RNN is suitable for problems where context is important to learn. The MDLSTM is a connectionist approach which mainly relies on the Multidimensional Recurrent Neural Network (MDRNN) and LSTM networks. The Multidimensional LSTM follows the RNN approach to learn sequences: all past sequences with respect to the current point in time are accumulated to predict the output character. The RNN therefore provides an appropriate architecture for Arabic scene text recognition. The Arabic scene text image requires extensive pre-processing before a learning classifier is applied; this includes skew correction, removal of non-text elements from the image, conversion of the text image into a standard format, and feature extraction, which is a crucial part of machine learning tasks. The Convolutional Neural Network (ConvNets) is another choice for learning complex patterns, but it is an instance learner and is therefore suitable where the data are uncorrelated, such as individual characters.


Fig. 2.7 K-means depiction of workload given to each individual according to age

If ConvNets are applied on extracted image patches, they cannot provide stable and good convolutional features because of the inherent complexity attached to cursive scripts in the presence of noise, although they are a good option for the feature extraction process and are frequently applied for extracting features.
Definition 2.5 (K-Means Clustering) K-means is one of the most prominent and simple unsupervised learning algorithms; it classifies the given data through a certain number of clusters. During cluster analysis, the k-means algorithm partitions the input data into k clusters. The value of k is selected empirically, which is in general an NP-hard problem. K-means clustering is commonly used for learning features: the approach is to fit k clusters to the provided input data, and this is done for all input data. Each datum is assigned on the basis of its distance to the centroid of each cluster; the distance is measured by the Euclidean distance and the assignment follows predefined criteria, as represented in Fig. 2.7. The process continues until the same points are assigned to the same clusters in consecutive rounds. K-means can be combined with other linear classifiers; if features are encoded for cursive scene text in combination with a linear classifier, this can be helpful in learning the cursive script.

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (2.5)

In Eq. 2.5, the objective function is defined over the number of clusters j from 1 to k and the cases x in each cluster, where i indexes the data in each cluster from 1 to n and c_j is the centroid of cluster j. The constraint in k-means is that there is no standard criterion for finding the optimal number of clusters, which sometimes results in a local optimum or over-fitting. The best learning network is based on an empirically selected value of k, which usually produces good results with a small value.
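A minimal sketch of the k-means feature-learning idea described above, using scikit-learn; the feature matrix here is random toy data and the value of k is an arbitrary assumption chosen only for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: one row per image patch, one column per feature dimension.
features = np.random.rand(500, 64)

k = 10  # chosen empirically, as discussed above
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# Cluster index for each patch and the learned centroids c_j of Eq. 2.5.
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Encode a new patch by its Euclidean distance to every centroid.
new_patch = np.random.rand(1, 64)
distances = np.linalg.norm(new_patch - centroids, axis=1)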


Fig. 2.8 Visualization of feature points selected by Bayesian classifier

Definition 2.6 (Bayesian Classifier) The Bayesian classifier is a probabilistic classifier based on Bayes' theorem with an independence assumption among the feature variables, whether continuous or categorical. It can be applied to Arabic cursive scene text data to classify the provided samples.

p(C_j \mid X) \propto p(C_j) \prod_{k=1}^{d} p(x_k \mid C_j) \qquad (2.6)

With reference to Eq. 2.6, given a set of text images X = {x1, x2, …, xd}, the posterior probability is calculated for the event C_j among a set of possible outcomes C = {c1, c2, …, cd}. The predictor is represented by X, whereas C comprises the categorical levels of the dependent variable. The features can be mapped by posterior probability if they are linearly separable, as represented in Fig. 2.8. A new feature is labeled with the class C_j that has the highest posterior probability. If this scenario is applied to Arabic scene text and the current features are not part of the text, then, because of the highest-posterior rule, the features will be misclassified as part of the text; if the scene text image has non-text regions, such impediments may be treated as potential text, which obviously affects recognition accuracy. Research on character recognition using a Bayesian classifier is presented by [61]. The text candidates were obtained by intersecting the output of the Bayesian classifier with the Canny edge map of the input image. The authors assumed that the orientation and size of a font have no impact on recognition thanks to the proposed Bayesian classifier. The connected component method was used to draw a bounding box around the text. They evaluated the performance of their technique on their own collected dataset named Hua [113] and on ICDAR 2003 [35].
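The following sketch shows how a Gaussian naive Bayes classifier could be trained on text/non-text feature vectors and queried for the posterior probabilities of Eq. 2.6; it uses scikit-learn and synthetic toy data, so it illustrates the principle rather than the setup used in [61].

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Toy data: rows are feature vectors of candidate regions,
# labels are 1 for "text" and 0 for "non-text".
X = np.random.rand(1000, 32)
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# Posterior probabilities p(C_j | x) for each class; the class with the
# highest posterior is chosen, as in Eq. 2.6.
posteriors = clf.predict_proba(X_test)
accuracy = clf.score(X_test, y_test)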


2.2.3 Methods Designed for Feature Extraction

In machine learning and pattern recognition applications, features are considered the backbone of any recognition system. This subsection elaborates the importance of feature extraction techniques by analyzing recently proposed techniques that reported better performance. The feature values derived from the raw or initial set of given data are discriminative and nonredundant, and this discriminative, nonredundant data facilitates the subsequent classification and learning steps.

There are numerous feature extraction methods proposed by various researchers designed specifically for scene text images. The detail about prominent methods is as follows: Global Sampling: As described in [43], the scene text character’s recognition performance is evaluated by considering the comparison of different sampling methods, i.e., local and global sampling, feature descriptors [91, 92, 94, 97], dictionary sizes, coding and pooling schemes [99], and Support Vector Machines (SVMs) kernels [84]. To obtain the features from local sampling as mentioned in [43], the key-points detection, compute local descriptors, build dictionary of visual words, and feature pooling and coding to get the histogram of visual words contemplated as important task due to their discriminative nature. They computed the descriptors from character that is patched by global sampling without considering key-points, local descriptors, coding, and pooling. The features they found by following their process are regarded as distinct features that are ready for classification. Multiscale Histogram Gradient Descriptors: The multiscale histogram of oriented gradient descriptors as a feature vector was reported by [17]. They include features at multiple scale in a column-wise manner of HOG descriptor and evaluate the performance on Latin scene text images having variation of characters. The oriented gradients were calculated using derivative of Gaussian filters. Then total strength of each variation is summed up. This process continues on each block in an image. Each histogram with respect to each block normalized so that to sum up the total variation across all orientations. In this way, they made a single descriptor for each image. They evaluated their proposed technique on two commonly used datasets, i.e., Chars74K [34] and ICDAR03-CH [35]. The Chars74k contains 62 classes which includes number, upper and lower case characters. The other dataset they used was ICDAR03-CH is the character recognition dataset which was presented in robust reading competition in ICDAR 2003. This dataset is similar to Chars74k; in addition, it has included punctuation symbols images. They split the Chars74k dataset into Chars74k-5 and Chars74k-15 training images per class. The ICDAR03-CH-5 dataset were used with 4 training images per class. The reported accuracy using multiscale HOG as 50%, 60%, and 49% on Chars74k-5, Chars74k-15, and ICDAR03-CH-5,


Fig. 2.9 Scale-invariant feature transformation of Arabic text

respectively, while using HOG columns they reported accuracy as 59%, 67%, and 58% on Chars74k-5, Chars74k-15, and ICDAR03-CH-5, respectively. Scale-Invariant Feature Transformation (SIFT): It is dominant computer vision algorithm that is used to define the local features of an image. The local feature which it detects are the key-points which are not affected by image transformation. The scale-invariant approach is applied on Arabic scene text recognition [42] with the combination of sparse coding [36] and spatial pyramid matching [37] as shown in Fig. 2.9. They extracted the local features from SIFT [38] which is considered as very efficient technique that demonstrated and extracted most relevant distinguished local features. In order to get more precise information of an image, the weighted linear superposition function is applied on extracted descriptors. The input image was divided into subregions. The features relevant to specific subregion using different scales were modeled into histogram. Later, they applied pooling technique to summarize all the features represented an image. The evaluation were performed on two publicly available dataset, i.e., Chars74 and ICDAR03 and reported 73.1% and 75.3%, respectively. They also proposed their own dataset named Arabic STC and evaluate the performance of their proposed system on it. Moreover, they reported 60.4% character recognition accuracy on their proposed dataset. Another research which represents Chinese handwritten character recognition by considering SIFT descriptors is proposed by [2]. They modified SIFT descriptor according to the characteristics of Chinese character. The pre-processing is performed by passing each image through linear normalization and then perform elastic meshing [40] to rectify invariance of same characters written by various individuals. In addition to SIFT, they also extracted Gabor and Gradient features of an image. Every extracted feature vector is compressed to 256 dimensions by Linear Discriminative Analysis (LDA). They performed experiments on different window sizes with various dimensions. The discussion about their detailed experiments can be found in their manuscript. A very interesting work on local features extraction based on template images is proposed by [39]. Their motive was to read the text in complex images in presence


Table 2.1 Feature extraction approaches of cursive and non-cursive scene text

Study                     Feature extraction approach                                                   Script
Gomez et al. [30]         Convolutional neural network and K-means                                      Multilingual
Tian et al. [17]          Histogram of oriented gradients (HOGs)                                        Chinese, Bengali
Tounsi et al. [42]        SIFT                                                                          Arabic, Latin
Yi et al. [43]            Global and local HOG                                                          Latin
Newell et al. [44]        HOG multiscale column                                                         Latin
Campos et al. [45]        Geometric blur, shape context, SIFT, patches, SPIN, maximum response (MR8)    Latin, Kannada
Neumann et al. [16, 50]   Stroke orientation                                                            Latin
Mao et al. [46]           SIFT                                                                          Latin, multilingual
Wu et al. [40]            Minimum Euclidean distance, SIFT                                              Chinese
Zheng et al. [49]         SIFT                                                                          Chinese, Japanese, Korean

of built-in noise associated to it. They proposed a new method for building template in absence of influential noise. After performing normalization, enhancement, and binarization, they extracted scale-invariant features from the template image and from complex image as well. If some features are missing or not recognized then their proposed geometrical verification algorithm is applied to correct the error. They evaluated their proposed technique on more than 2,00,000 images having three scripts, i.e., Chinese, Japanese, and Korean script. Hybrid Features: In [41], the hybrid feature extraction approach was proposed by determining the stroke width, area, aspect ratio, perimeter, number, and area of holes as a features associated to each given text image. These features were examined and further pass them to the classifier. The description about each feature is detailed in their manuscript. Their proposed method was evaluated on various benchmark datasets like Latin and multilingual script. The dataset named as MSRA-TD500, ICDAR2011, ICDAR 2013, except ICDAR 2011 dataset, other datasets contains multilingual text including Arabic and Hindi text as well. Furthermore, they also evaluated the performance of proposed algorithm on their collected data samples. Reference [100] uses 8 different hybrid features of the text detected by MSER. The identified features are character width, character surface, aspect ratio, stroke width, character height, character color, vertical distance bottom line, and MSER margin. They passed these features to classifier to train. The two publicly available datasets ICDAR 2003 and Char74K dataset were evaluated. Table 2.1 summarizes the detail about feature extraction approaches that are recently proposed for cursive and noncursive scripts.
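As a hedged illustration of the HOG and SIFT descriptors surveyed above, the sketch below computes both with scikit-image and OpenCV; the patch size and HOG parameters are assumptions for illustration and do not reproduce the exact configurations reported in [17, 42, 43].

import cv2
from skimage.feature import hog
from skimage import io, color, transform

def hog_descriptor(image_path):
    """Single HOG descriptor for a character patch (illustrative parameters)."""
    gray = color.rgb2gray(io.imread(image_path))
    patch = transform.resize(gray, (64, 32))          # normalize the patch size
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def sift_descriptors(image_path):
    """SIFT key-points and 128-D descriptors of a scene text image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors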


2.3 Importance of Implicit Segmentation

In cursive scripts like Arabic, explicit segmentation thoroughly fails to segment the text accurately. Thus, keeping the complex syntax of Arabic in the forefront, implicit segmentation provides a way to segment the characters correctly using techniques like dynamic programming. In implicit segmentation, heuristics or probabilities are usually applied for character segmentation. The heuristic in the segmentation process determines tentative segmentation points that may help the recognizer determine the correct character, and incorrect segmentation points may also be removed heuristically. In this way, a minimal burden is shifted to the recognizer, which may then produce correct results in less computational time. The visualization of segmentation by dynamic programming is presented in Fig. 2.10. Dynamic programming is a combination of mathematical optimization and computer programming: it divides the problem into subproblems and deals with each segment of the problem individually on the basis of previously computed knowledge.1 In Arabic, there are variations of the same character with respect to its position in a word, which may also involve various ligatures.

In such a complex script, it is a challenge to recognize the same character in its variant shapes. Since automated tools cannot accurately segment this intricate script, most of the reported work relies on implicit segmentation of Arabic text. In these intricate scripts, context plays a major role in recognition. The context learning approach named the Multidimensional Long Short Term Memory (MDLSTM) network is discussed by [48]. The MDLSTM is a connectionist approach which mainly relies on the Multidimensional Recurrent Neural Network (MDRNN) and LSTM networks. The MDLSTM follows the RNN approach to learn sequences: all past sequences with respect to the current point in time are accumulated to predict the output character. The RNN is most suitable for sequence learning tasks, and its temporal sequential behavior is captured by Connectionist Temporal Classification (CTC).

1 https://www.hackerearth.com/practice/algorithms/dynamic-programming/introduction-to-dynamic-programming-1/tutorial/.


Fig. 2.10 Depiction of cursive scene text segmentation by dynamic programming

By nature, CTC is suited for use with a bidirectional model in order to make estimations on both sides of the current point during processing. As explained in [12], a general neural network with an objective function requires separate training targets for every segment or time step of the input sequence. This leads to two major implications:
1. The data samples for training should be pre-segmented.
2. The local classification is computed against every input label, but the global aspect, which is required to capture context in large and complex sequential problems, is missing.
To address the aforementioned issues, the trained network needs to predict the final character or symbol from whole sequences while keeping the global aspect in view. The CTC technique addresses this sequence problem: the alignment of a sequence is no longer important after the inclusion of CTC in the RNN architecture, because a prediction can be made at any point in time against the input labels as long as the whole label sequence is correct. The CTC is added as an output layer (a softmax layer) in the RNNLIB library. To complete the sequence, a probability is estimated, which consequently eliminates the effort required for post-processing, as highlighted in [12, 47]. The details about CTC, its working, and its importance with respect to recurrent neural networks can be found in [48]. For problems where sequence matters, a powerful method is required that can learn context and make the recognition task easier. Due to the built-in complexities of Arabic script, it is usually impossible to segment cursive characters by explicit means.
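A minimal PyTorch sketch of the CTC idea described above: the loss is computed from unsegmented label sequences, so no per-time-step alignment or pre-segmentation is needed. The alphabet size of 38 (37 classes plus the CTC blank) and the sequence lengths are assumptions for illustration only.

import torch
import torch.nn as nn

# Toy setup: T time steps, batch of N sequences, C = alphabet size + 1 (CTC blank).
T, N, C = 50, 4, 38
logits = torch.randn(T, N, C, requires_grad=True)     # stand-in for LSTM/RNN outputs
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, 12), dtype=torch.long)    # unsegmented label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the sequence model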


Every text image has distinctive features associated with it; these features are the representation of the provided samples. In recent years, numerous feature extraction techniques have been proposed and have reported reasonable performance as far as the classification of these features is concerned. If a large amount of data is available, the potential of the proposed classification techniques can be measured effectively, so the dataset plays a vital role in the whole process. The next chapter provides details of the presented dataset compiled from Arabic scene text and Urdu handwritten samples.

Chapter 3

Arabic Scene Text Acquisition and Statistics

Abstract Classification relies on the characteristics of the provided data. Machine learning tasks produce good results when classification techniques are applied to big datasets; therefore, the dataset is considered an integral part of machine learning research. Unlike traditional machine learning algorithms, for which dataset size was never critical, state-of-the-art machine learning techniques usually work with huge chunks of data and cannot produce good results unless a significant number of data samples have been trained. This book aims to highlight the challenges and present solutions for the smooth recognition of Arabic text appearing in natural images. Therefore, this chapter presents Arabic scene text datasets and provides researchers with a benchmark database for their presented solutions. It also discusses the sources used to capture scene text. Scene text recognition is a challenging problem due to variations in font style, size, alignment, orientation, reflection, illumination change, blurriness, and complex background. Among cursive scripts, Arabic scene text recognition is regarded as an even more challenging problem due to joined writing, variations of the same character, a large number of ligatures, multiple baselines, etc. As scene text recognition is an application of supervised learning, two ways to generate the ground truth are also presented.

3.1 Dataset Relation with Machine Learning

Generally, problems in machine learning are solved through deep learning mechanisms, but the images in question should be suitable and in an acceptable format. Deep learning is a subfield of machine learning, and machine learning in turn is categorized as a branch of artificial intelligence within computer science.

The preparation of the dataset is a primary step in proposing a solution using deep learning methods. Usually, the focus should be on data that correspond to a predictable output; here, prediction is defined as a supervised output event. The acquired samples are verified by aligning the input labels with the ground truth. For


Fig. 3.1 The camera captured text is not represented in a uniform font style. The unconstrained text variations make the task cumbersome

instance, if one is working on human facial expression recognition then images of cat will not serve the purpose. Data analyst is an appropriate person to investigate either acquired samples align with the described problem. The machine learning approaches are inspired by human way of learning, hence are realized by Artificial Neural Networks (ANNs).

Deep learning and machine learning approaches require a good quality training set for better performance. The data acquisition and data analysis phases consume considerable time in searching for and extracting sizable data for training. The trained network is then used as a benchmark to measure the performance of the training set on unseen data. Machine learning works with two types of datasets, i.e., training data and testing data. In some cases, the training data is further decomposed into validation data, which periodically ensures during training that over-fitting does not happen. The training data contain a large number of samples that run through the neural network and instruct it how to manipulate the feature values so as to minimize the error in the predicted results; the neural network is controlled by the parameters it defines. After training has been performed, the network is tested against randomly sampled data. If the obtained error is small, good accuracy is reported; otherwise, the pre-processing steps are revisited or the network parameters are tuned and training starts again. This section provides detail about the data samples containing Arabic text collected by [88]. The characteristics of Arabic script, the process of acquiring data samples, the pre-processing of captured text samples, and the generation and verification of the ground truth are described in this section.
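A small sketch, under assumed file names, of how a dataset could be divided into training, validation, and test portions so that over-fitting can be monitored during training; it is not the split procedure used for any specific dataset in this book.

from sklearn.model_selection import train_test_split

# Toy stand-ins for image file paths and their ground truth transcriptions.
samples = [f"img_{i}.png" for i in range(100)]
labels = [f"label_{i}" for i in range(100)]

# Hold out unseen test data first, then carve a validation set out of the
# training data to monitor over-fitting during training.
X_train, X_test, y_train, y_test = train_test_split(samples, labels,
                                                    test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.1, random_state=0)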

3.2 Arabic Script Properties

Arabic is considered an ancient script that is followed by many languages, i.e., Arabic, Persian, Urdu, etc. It is rich in vocabulary and is used by one third of the world population [8, 54]. Arabic script is purely cursive and exhibits various forms of a single character depending on the position it occupies in a word. The cursive script's context sensitivity and calligraphic nature make it quite challenging to classify, especially in

3.2 Arabic Script Properties


Fig. 3.2 Red box represents the ligatures which are a part of a word

scene text images as shown in Fig. 3.1. Shape of the character depends on preceding and following character in the ligature (i.e., initial, middle, final, or isolated). The ligature plays an important role in word formulation; moreover, one word may consist of one or many ligatures. Unlike, Latin and other scripts, Arabic script is written from right to left. Figure 3.2 shows two ligatures which make a single word pronounced as khid’mat’h. Another property of Arabic text is representation of joiner and non-joiner characters. The joiner character occurs at final, isolated, initial position, or at middle position and it may completely change its shape at middle position and initial position as explained in detail in Chap. 1. Whenever non-joiner character appears at final position in a word or as isolated form, it must terminate the word. The end character of a word maintains its full shape.

3.2.1 Dataset Collection

Multilingual scene text images have been captured. As detailed by [88], the multilingual nature of the acquired samples prompted the compilation of a parallel dataset for English. The database collected for scene text is divided into three sections: in addition to Arabic and English, a database is prepared for multilingual scripts. The acquired scene text images were taken from the university precinct, advertisement boards, guide boards displayed on roadsides, and various commodity wrappers. In total, 2469 scene text images were captured, comprising a number of text lines and Arabic numerals. The Arabic scene text data is divided into text lines, words, and characters. The images were taken with a Nikon D3300 DSLR camera with a 24-megapixel sensor, a Samsung Galaxy A8, and an HTC One (M8) with a 2.5 GHz quad-core processor and 2 GB RAM, whose 13-megapixel back camera has an ultrapixel sensor, meaning that it can capture more light. All images were captured in an uncontrolled environment. The image dimension is 2688 × 1520 with a fixed 72 dpi horizontal and vertical resolution, and the exposure time varied according to the light falling on the object. The captured samples were taken from the various sources presented in Table 3.1.


Table 3.1 Sources of acquired images

Source                 Number of test images
University precinct    300
Hoardings              383
Roadside guide         1539
Commodity wrappers     247

Fig. 3.3 Cursive and non-cursive handwritten/ multilingual scene text and data samples

3.2.2 Multilingual Scene Text Recognition and Its Need

Recent developments in scene text recognition mostly focus on Latin or English text; only a few efforts have addressed cursive script scene text recognition, specifically Arabic. Even though some efforts have been made on different scripts, other questions still need to be answered. Foremost among them is the need for a multilingual scene text recognition system. Generally, character recognition systems are uni-language, since most documents contain text in a single language, as shown in Fig. 3.4; however, this is not true of scene text images. In non-English countries, such as the Arabian peninsula and the Indian subcontinent, scene text mainly appears in multilingual scripts, as shown in Fig. 3.3. With emerging multilingual techniques, bilingual, trilingual, or even more languages need to be supported. Thus, there is an increasing need to develop a multilingual scene text recognition system that works seamlessly across different scripts. In contrast, some researchers raise their eyebrows, asking whether, if such a dataset is already available, it would be worth the effort to reinvent the wheel. The simple answer


Fig. 3.4 Example images in EASTR dataset

is “yes”, many scene text datasets are discussed as reported in ICDAR competitions by [12, 22, 26, 35], but the scene text dataset for Arabic needs more variations with respect to orientation, illumination, and font styles. One dataset contains camera captured Arabic scene text named ARASTI proposed by [28]. They presented limited number of acquired samples having less variation in taken sample. Moreover, the dataset prepared by [88] contains images taken from commodity wrappers. Dataset also captured images of pamphlets and books presenting Arabic text and numerals. Arabic text itself considered as complex due to its cursive nature which accentuate numerous challenges during recognition. The ASTR dataset contains every possible word that appears in Arabic language with all its permutation in reference to shapes, fonts, size of a text, and font colors. In general, ASTR dataset covers huge variety of Arabic scene text appeared in unconstraint environment. The acquired dataset named “EASTR” as it covers English words in addition to Arabic. The details about EASTR dataset is briefly elaborated below.

3.2.3 EASTR-42K Dataset

This section describes the EASTR-42K dataset collection process and its statistics in detail, as presented by [88]. As mentioned earlier, the dataset contains bilingual (English and Arabic) scene text images and tries to cover every possible word permutation of the Arabic language with its variant shapes. The acquired text images were segmented into English and Arabic text lines, words, and characters. The words segmented from Arabic text lines are depicted in Fig. 3.5, while segmented Arabic characters are represented in Fig. 3.6. Due to the different font styles and the cursive nature of Arabic script, a dataset is needed that covers a maximum amount of Arabic text so that it may consider all possible Arabic vari-


Fig. 3.5 Example images of words in EASTR dataset

Fig. 3.6 Example images of segmented characters in EASTR dataset


Table 3.2 EASTR-42K division based on complexity

Language   Text lines   Words   Characters
Arabic     8915         2593    12000
English    2601         5172    7390

Table 3.3 Number of text lines in Arabic, English, and multilingual

Language       Number of text lines
Arabic         2107
English        983
Multilingual   784

Table 3.4 Number of characters assuming six characters per word in Arabic and English

Language   Number of characters
Arabic     16624
English    5904

ations. EASTR-42K dataset covers huge variety of English and Arabic scene text appeared in unconstraint environment. The details about EASTR-42k dataset division based on complexity, total number of images, text lines, and segmented words and characters are briefly elaborated in Tables 3.2, 3.3, and 3.4. In Table 3.2, the detail about division of collected samples based on complexity with respect to Arabic and English are mentioned. The acquired complex text is divided into text lines, words, and characters. Table 3.3 depicts the total number of good samples exist in EASTR-42k dataset including multilingual text image appeared in Arabic and English. Table 3.4 summarizes the description about number of Arabic and English characters appeared in better quality acquired scene text images.

3.2.4 Status of Available Arabic Scene Text Datasets Other than EASTR-42K

The dataset plays a dominant role in investigating the potential of numerous classifiers, and various efforts have been reported in recent years in capturing and preparing datasets of Arabic scene text images. Some articles survey the available open-access datasets and tools specifically designed for Arabic text detection and recognition in video frames captured from news channels [69]. A benchmark dataset for Arabic scene text still requires more effort to standardize research on Arabic scene text analysis. A description of the few available datasets, including the presented dataset for Arabic scene text, follows.
ARASTEC2015: This dataset was prepared by [42]. The authors captured 260 natural images containing Arabic text. Images were taken from signboards, hoardings, and


Fig. 3.7 Sample image from ARASTEC2015 dataset [42]

Fig. 3.8 Embedded Arabic text of ALIF dataset

advertisements. They manually segmented characters from images and obtained 100 classes depending on the position of a character in a word depicting 28 Arabic characters. They obtained 30–40 variations of each class. The sample Arabic text image of ARASTEC2015 dataset is shown in Fig. 3.7. ALIF It is considered one of the first Arabic embedded text recognition dataset proposed by [82]. The gathered samples were taken from various Arabic TV broadcast, e.g., Aljazeera, AlArabiya, France 24 Arabic, BBC Arabic, and AlHiwar Arabic. The Arabic text was localized from 64 recorded videos. There are a wide variety of text specifications like colors, styles, and font size. In addition, the text visibility is also impacted by acquisition conditions, e.g., contrast, luminosity, background color that makes the dataset vulnerable to be evaluated by state-of-the-art techniques. ALIF dataset contains 89, 819 characters, 52,410 paws, and 18,041 words. The data was collected in more than 20 fonts. Their proposed network was trained on 4,152 text images which covers wide variability of acquired text. The potential of trained network was evaluated on three different test sets. The first test included 900 images which were selected from the same channels used during training. The second test set is applied in the same setting as previous one but with additional 400 images. The third test set has large variations of text with respect to font and size. It has 1,022 text images in total. The complexity of camera captured text image is different in comparison with printed text as shown in Fig. 3.8.


Fig. 3.9 Various fonts used to generate APTI dataset [83]

APTI: Arabic Printed Text Image (APTI) database was introduced by [83]. They generated Arabic data synthetically with font variations, different sizes, different styles like bold/italic. The samples were also decomposed into ligatures. The taken lexicon were divided into 1,13,284 words written in 10 different Arabic font styles by fixing 10 font size but in four different font styles. Their dataset have 45,313,600 single word images with more than 250 million characters. In camera captured text images, there are a lot of other factors to concentrate, such as to ignore non-text objects so that text may extract with high precision. There are numerous other factors like illumination, angle of a text, font size, appearance, and clarity of a text which pose a huge challenge for researchers to recognize Arabic scene text in an uncontrolled environment. In short, based on the provided facts that detailed pre-processing is required for camera captured text image. As APTI dataset was generated synthetically; therefore, this dataset fall into different category because it is generated through variant manner as shown in Fig. 3.9.

3.3 Pre-processing of Scene Text Images

In Arabic, it is cumbersome to disintegrate a word into individual characters, as discussed earlier; in cursive style, it is practically impossible to segment the characters correctly. In Arabic, the shape variation of a character, its position, and the occurrence of two consecutive characters at the same level make it challenging for segmentation techniques to work perfectly on such complex text images. In this scenario, implicit segmentation plays its role, segmenting the characters empirically. The quality of


Fig. 3.10 Captured image with segmented Arabic text lines
Fig. 3.11 Image representation with skew correction

text presented in captured images were impacted by the presence of illumination factor appeared in an uncontrolled environment. Such illumination factor may cause implicit noise attached to acquired samples which ultimately may blur the visibility of a text. The unnecessary data should remove prior to classify them. The scene text image is manually segmented into different text lines. For instance, the scene text image is segmented into six text lines as represented in Fig. 3.10. In EASTR-42k text images, the skew is detected and corrected empirically by RAST-based skew detection and correction technique using OCRopus as explained in [6] and shown in Fig. 3.11. The skew correction is done empirically but there is largely skew images present in EASTR dataset which does not mean that it will provide good results on largely skewed text image as represented in Fig. 3.12. The acquired images having Arabic text in collected samples were standardized with respect to x-height and then applied MSER technique for determination of region of interest. Subsequently, SIFT method were used to consider invariant features in extracted region by MSER. The aspect ratio of each image is maintained by keeping in view the varied width of a text image. The next important step is generation of ground truth labels as explained in following subsection.


Fig. 3.12 Image representation with large skew. The red line indicates actual skew, whereas the blue line shows corrected skew
Fig. 3.13 Ground truth depiction

3.4 Generation and Verification of Ground Truth

The establishment of ground truth is an important step in supervised learning methods and one of the salient steps for matching the learned pattern with the targeted value. The determination of ground truth solely depends on the selected classification technique and the implementation scenario. The ground truth for Arabic script is defined in two different ways. In the first, the extracted regions are labeled and only those x-y coordinates are considered that occur in both the binary image and the image mask. In the second, the ground truth is defined through Latin labels specified for each word in the dataset; the ground truth is written in a text file accompanying its image file with the same name. There are 27 basic characters and 10 numerals in the Arabic language. For ground truth generation, 37 classes are declared, and every class represents a single character regardless of the position it occupies in a word. Latin characters are used to label a ground truth file, which is identified manually as represented in Fig. 3.13; as depicted in this figure, eight characters are used to make two words. The ground truth is labeled as represented in Fig. 3.14: every character is separated with a "−" symbol, and words are separated by a space. Each class is defined using Latin characters. The ground truth file is reviewed and manually corrected if something is missing. Missing characters are detected automatically by the presented approach: characters that are specified in the ground truth but not declared as a class during implementation are reported as a warning during the training session.
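The sketch below illustrates, under assumed class names and the hyphen-separated label format described above, how a ground truth line could be parsed and checked for labels that were never declared as classes; the actual label alphabet of the EASTR ground truth files is not reproduced here.

# Hypothetical class names; the real files use 37 Latin labels for Arabic
# characters and numerals, whose exact spellings are not shown here.
DECLARED_CLASSES = {f"c{i}" for i in range(37)}

def parse_ground_truth(gt_line):
    """Split a ground truth line into per-word lists of character labels."""
    words = gt_line.strip().split()          # words are separated by spaces
    return [word.split("-") for word in words]   # characters are joined by '-'

def undeclared_labels(gt_line):
    """Return labels that appear in the ground truth but were never declared as a class."""
    labels = {c for word in parse_ground_truth(gt_line) for c in word}
    return labels - DECLARED_CLASSES

example = "c1-c5-c12 c3-c7"
print(parse_ground_truth(example), undeclared_labels(example))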


Fig. 3.14 Two different ways of ground truth declaration

The context is treated as an integral part in cursive text recognition. The representation of each character depends on its previous character and so on.

In the next chapter, the importance of implicit segmentation and context learning is validated by describing the LSTM-based network with reference to Arabic scene text recognition.

Summary

This chapter introduced the EASTR-42K scene text image dataset. Although a few efforts have been made to propose Arabic scene text datasets, they are not publicly available and contain few data samples. As Arabic scene text datasets are not available, performance evaluation against benchmark datasets is limited. The EASTR-42K dataset is designed by considering the characteristics of Arabic script and its appearance in captured images. The dataset includes images having text both in English and Arabic while maintaining the prime focus on Arabic script, and it can be employed for the evaluation of text segmentation and recognition tasks.

Chapter 4

Methods and Algorithm

Abstract Camera captured text images have various aspects to investigate. Generally, the emphasis of research depends on the regions of interest: sometimes the focus is on color segmentation, object detection, or scene text analysis. Image analysis, visibility, and layout analysis are tasks that are easy for humans, as suggested by human behavioral traits, but when the same tasks are to be performed by machines, they become challenging. Learning machines always learn from the properties associated with the provided samples. Numerous approaches have been designed in recent years for scene text extraction and recognition, and efforts are underway to improve the accuracy. Explicit segmentation techniques do not demonstrate reliable results; hence, implicit segmentation techniques are pivotal for cursive text analysis. This chapter describes numerous methods and algorithms that specifically address the complexities of Arabic document and scene text analysis.

4.1 Invariant Feature Extraction in Co-occurrence Extremal Regions

Feature selection is the process of selecting relevant attributes to be used in a prediction model. The numerous feature extraction approaches presented in recent years have proved to be very accurate and focused on the text area regardless of the implicit ambiguity and distortions associated with these sorts of images, as represented in [3, 47, 121]. A novel method for quantifying the co-occurrence of invariant features that reside in extremal regions is discussed as follows. The model shows how invariant features of cursive scene text images can be helpful in the detection of extremal regions.

4.1.1 Detection of Extremal Regions

The well-known connected component approach for the detection of extremal regions is the Maximally Stable Extremal Region (MSER) approach.


The word extremal points to the characteristic that all pixels inside an MSER-detected blob have either higher or lower intensity than those on its outer boundary. MSER detects covariant points and merges them together to make a region.

The main idea behind MSER is to detect those points in an image which stay nearly the same over a wide range of threshold values. The regions having minimum variation when the threshold is applied are considered maximally stable regions. During this process, over a large range of threshold values, the binarization of the image is stable, which means that it shows minimum variance under affine transformation of the intensities of the involved pixels, as represented in Eq. 4.1.

f(x) = Ax + T \qquad (4.1)

f(x) is an affine function with a linear part A and a translation variable T. The whole image is evaluated by Eq. 4.2:

f(x_i) = \sum_{i=1}^{n} A x_i + \sum_{i=1}^{n} T_i \qquad (4.2)

The extraction steps of MSER are as follows:
1. Apply a thresholding algorithm over the whole image.
2. Find extremal regions by connected component analysis.
3. Through thresholding, the maximally stable regions are detected in the image, which is discrete in nature.

It is pertinent to mention here that extremal region might be rejected. The rejection may occur if detected region covers maximum area or minimum area. Another reason for rejection might be due to the unstable region detection and a possibility of duplicate extracted regions. The important characteristics of extremal region is the continuous affine transformation, hence this feature some times could not be able to extract exact region of interest that requires considerations from research communities to work on for further precision. In presented work, stable regions needs to be searched out by evaluating binary image and an image mask as shown in Fig. 4.1. The extraction of interested regions depends on image quality. As observed in Fig. 4.1, the precision of text detection is clearly visible in binary image as compared to image mask. But there are some situations where text is localized precisely in an image mask


Fig. 4.1 a Extremal regions-based text detection in binary images (on left). b Text detection in image mask (on right)

in comparison to the binary image; hence, the quality of an image plays a vital role in text localization. The proposed work is validated on most of the presented images and concludes with the established fact that the binary image enhances the quality of a given input image by eliminating unnecessary details that may otherwise hinder performance. The non-interesting regions can be minimized through transformation by applying image filtration methods. The goal is to detect text in natural images, but some non-text regions should be considered as noise. The overall text detection yielded very good accuracy for Arabic script in particular. As mentioned in Algorithm 4.1, an adjacency relation is defined on images. Suppose two different areas p, q ∈ D, where D is the entire image and IR is a subset of D. p is adjacent to q (p ∀ q) if

D = \sum_{i=1}^{d} |p_i - q_i| \qquad (4.3)

where i indexes the regions in the image. For each IR, there is a sequence p such as a1, a2, ..., an.
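One way the co-occurrence idea of this chapter could be realized in code is sketched below: MSER regions are detected in both the binary image and the image mask, and only regions that overlap in both are kept. Both inputs are assumed to be gray scale uint8 arrays, and the overlap (IoU) threshold is an assumption, not a value specified by the authors.

import cv2

def cooccurring_regions(binary_img, mask_img, overlap=0.5):
    """Keep MSER regions of the binary image that also appear in the image mask."""
    mser = cv2.MSER_create()
    _, boxes_bin = mser.detectRegions(binary_img)
    _, boxes_mask = mser.detectRegions(mask_img)

    def iou(a, b):
        # Intersection-over-union of two (x, y, w, h) bounding boxes.
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        x1, y1 = max(ax, bx), max(ay, by)
        x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    return [a for a in boxes_bin if any(iou(a, b) >= overlap for b in boxes_mask)]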

4.1.2 Invariant Feature Extraction

Scale invariance is a distinguishing feature of the Scale-Invariant Feature Transformation (SIFT) method. To achieve scale invariance, SIFT uses a Laplacian pyramid, which is calculated from the difference of Gaussians (DoG) at various levels, as represented in Eq. 4.4.

D(x, y, δ) = (G(x, y, δk) − G(x, y, δ)) ∗ I(x, y) \qquad (4.4)


Algorithm 4.1 Algorithm for Text Localization
Require: A filtered Image
Ensure: Text detected Image
1: Procedure for Text Localization:
2: Take a raw Image I
3: Perform filtration technique on Image I_B
4: where I_B(x, y) = 1 for I_B(x, y) > t, and I_B(x, y) = 0 for I_B(x, y) ≤ t
5: Apply adapted-MSER approach on I_B
6: An adjacency relation is defined on I_B
7: IR is a Maximally Stable Region
8: Let IR_1, ..., IR_{i−1} be a sequence of nested extremal regions IR
9: LOOP Process
10: for i = no. of IR do
11:   if IR has a local minimum at i* then
12:     Select IR as a candidate region
13:   else i = 0
14:   end if
15:   Goto step 8, unless i does not possess any value
16: end for

where

G(x, y, δ) = \frac{1}{2\pi\delta^2} \exp\left(-\frac{x^2 + y^2}{2\delta^2}\right) \qquad (4.5)

Through the Laplacian pyramid L represented in Eq. 4.6, the high-frequency information of an image can easily be obtained, because the features of an image mostly reside in these parts.

L(x, y, δ) = G(x, y, δ) ∗ I(x, y) \qquad (4.6)

The scale space is divided into several octaves. In each octave, the initial image is convolved with the Gaussian G to produce a set of scale-space images, and adjacent Gaussian images are subtracted to obtain the difference of Gaussians.

After each octave, the Gaussian image is down sampled by factor 2 and rest of the process is repeated in the same manner. The number of octaves helps in finding the key-points in different scales. The octave number and scale depends on the size of original image. To calculate key-points, the value of each pixel in I (x, y) is assessed by looking at the current value if is greater(or smaller) than eight adjacent pixels at each level of DoG octave. At these extracted points, compare their values with values of adjacent pixels exist in lower and upper level. In first and last scale, there are not enough adjacent pixels to compute which restrict in finding local minima or maxima. Based on the criteria, minimum or maximum value, location, and scale of a relevant point


Fig. 4.2 Gaussian blurred image represented as gradient magnitudes and orientation

Fig. 4.3 Detected key-point by scale-invariant feature transformation

are recorded. The detection of key-points in an image do not mean that it will be used in same manner as detected but instead there is a need to accept or reject unnecessary feature points which are generated on low contrast region and are poorly localized along the edge. Therefore, to assume that all extremal points extracted through DoG space search help in finding location, scale, and orientation of each key-point. To obtain consistent orientation with respect to each key-point which is based on local image properties, a key-point descriptor is defined to represent orientation information. Suppose, orientation of detected key-points were assigned as shown in Fig. 4.2 and mathematically expressed in Eq. 4.7. The region is selected having a key-point in the center. The region size is covered within the circle where key-point is detected as shown in Fig. 4.3 and mathematically represented through Eqs. 4.7 and 4.8 

B(x, y) = \sqrt{(L(x+1, y) − L(x−1, y))^2 + (L(x, y+1) − L(x, y−1))^2} \qquad (4.7)

θ(x, y) = \tan^{-1}\left(\frac{L(x, y+1) − L(x, y−1)}{L(x+1, y) − L(x−1, y)}\right) \qquad (4.8)


The next step is to define the image descriptor which contains all information of extracted key-points that is considered as a distinctive feature of an image. Those keypoints which are extracted at same location in both images (i.e., binary image and in image mask) are examined. Another constraint is that these key-points should reside in extremal region as shown in Fig. 4.4. The good quality images are described by their invariant features property which should not be affected by any other impediment. The key-points within extremal regions are important because that describe as a feature of an image. Some key-points are consistent in representation but do not appear in extremal region which eventually be rejected and not considered as a focal feature. This might be drawback of proposed system which can be examined later. All extreme points of DoG scale space are located exactly as detected by SIFT.

The low-contrast and unstable edge points are removed later. At each key-point, SIFT computes the gradient magnitude and direction with respect to the neighborhood at the appropriate scale. As a reference, SIFT puts all calculated values into a histogram, and the summation of these values is used as the gradient for key-point selection.
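A small sketch of the DoG construction of Eq. 4.4 using SciPy: one octave of progressively blurred images is built, adjacent levels are subtracted, and the image is down sampled by a factor of 2 before the next octave. The values of δ and k are conventional SIFT-style assumptions, not parameters taken from this book.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(gray, sigma=1.6, k=2 ** 0.5, levels=5):
    """Build one octave of Gaussian blurs and their differences (Eq. 4.4)."""
    gray = np.asarray(gray, dtype=float)
    gaussians = [gaussian_filter(gray, sigma * (k ** i)) for i in range(levels)]
    dogs = [gaussians[i + 1] - gaussians[i] for i in range(levels - 1)]
    return gaussians, dogs

def next_octave(gray):
    """Down sample by a factor of 2 before building the next octave."""
    return gray[::2, ::2]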

4.2 Window-Based Features

The sliding window approach is another way to extract features from a given input text image. This technique has been experimented with successfully on Urdu printed and handwritten text recognition [3, 47]. In the sliding window approach, traversal starts from right to left. The size of the window is determined first, for example 30 × 1 or 60 × 1; an input pattern size of 30 × 1 pixels describes the height-wise pixel values of the given text image. The classifier reads the image data as the sliding window traverses the image. Clean data must be provided, otherwise this approach cannot produce good results. The system reads all sample data by moving a 30 × 1 window over the whole image, as represented in Fig. 4.5, and obtains the corresponding pixel values against the ground truth (Fig. 4.6).
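A minimal sketch of the 30 × 1 sliding window traversal described above, written with NumPy; the right-to-left order follows the Arabic reading direction, and the random test image stands in for a real height-normalized text line.

import numpy as np

def sliding_window_frames(gray, window_width=1):
    """Yield 30 x 1 pixel columns of a height-normalized gray scale text image,
    traversing from right to left as required for Arabic."""
    assert gray.shape[0] == 30, "image is expected to be scaled to x-height 30"
    for x in range(gray.shape[1] - window_width, -1, -window_width):
        yield gray[:, x:x + window_width].reshape(-1)   # 30-value feature frame

# Example: stack the frames of one text line image into a (time, features) sequence.
image = np.random.randint(0, 256, size=(30, 200), dtype=np.uint8)
sequence = np.stack(list(sliding_window_frames(image)))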

4.3 Linear Spatial Pyramid

The background of an image is discriminated by color and other lighting effects. This discrimination usually provides a clue about the presence of text in a particular region. This fact is exploited here to establish how the discriminative property can help in determining text localization. This

4.3 Linear Spatial Pyramid

Fig. 4.4 Depiction of extracted invariant features matching in binary image and image mask Fig. 4.5 Gray scale image of x-height 60 × 1, moving from right to left


Fig. 4.6 Architecture designed for spatial image pyramids

book also explains a novel feature extraction technique that helps in Arabic scene text recognition using linear spatial pyramids based on image analysis filters. The Gaussian pyramid is proposed to smooth the images. Every subsequent image passes through various image processing filters. An adapted convolutional linear approach is applied which considers each image pyramid and establish feature vectors. Every convolutional image is taken into account and passed through the filtration and then convolution process.

4.3.1 Formulation and Pre-processing

The convolutional linear technique has produced very convincing results on image categorization in natural images, but its strength for text categorization in natural images has yet to be explored. A novel way for the correct localization of text in natural images is presented by [89]. Figure 4.9 sketches one of the proposed ideas designed for text localization. Every image is rescaled to a standard size, and the linear pyramid method is then applied, which decomposes the image into six levels as indicated in Fig. 4.7. For each image, the linear pyramid generates five images of different resolutions. Each image passes through the filter pack, and the resultant images are converted into gray scale. Eventually, all pyramid gray scale images are given to the classifier for training. The whole process is described in the following subsections (Fig. 4.8).


Fig. 4.7 Visual representation of linear spatial pyramids with six levels

4.3.2 Formulation of Linear Spatial Pyramids of Cursive Arabic Scene Text

The image pyramid can be contemplated as a set of two-dimensional arrays representing the image from smaller to smallest size, reducing the image information at each level from the base to the top of the pyramid. There are numerous suggested ways to define pyramids, as explained in [7], but in practice the procedure is to build the pyramid from its base to its top. Mathematically, this relation can be represented as in Eq. 4.9:

P = \{(m, n, p, q) : 0 \le n \le S;\; 0 \le p, q \le 2^n - 1;\; m = F(n, p, q)\} \qquad (4.9)

The pyramid P is defined as an array of values representing a cell with F function that computes corresponding value v for each cell. Thus, p is represented as a pyramid of S + 1 levels, where the pixels (m, n, p, q) at each level n has the value m = F(n, p, q).


Fig. 4.8 EASTR dataset pre-processing with good examples

Fig. 4.9 EASTR dataset examples where segmentation is misclassified

The function F determines the construction of pyramids by considering its immediate level below as envisaged in Fig. 4.7. In another presented work, RGB values were taken in an account where each cell contains the percentage of R, G, and B colors. The reason for proposing pyramids for text localization is its ability to group the text image/words in an appropriate resolution so it may contribute in depicting feature as a whole in absence of complex computation. In today’s era, systems are more intelligent in processing the acquired image, but difficult to understand the content represented in an image. The human perception of recognizing text presented in various fonts regardless of their sizes inspired the machine models where its realization is implemented by linear pyramids.


Algorithm 4.2 Pseudo code for generating the linear image pyramid
Require: Input: scene text image
Ensure: Image pyramids
1: read_image ← imread([args])
2: for each pyramid image do
3:   pyramid_gaussian(image, scale[read_image])
4:   if resize.shape[0] < 30 or resize.shape[1] < 30 then
5:     break
6:   else keep resize as the next pyramid image (py-image)
7:   end if
8: end for

The input image is passed through a loop in which the Gaussian pyramid is applied at each level. The loop stops once the resized image falls below an x-height of 30 pixels; this limit is the minimum image size and was selected empirically. The steps of the overall process are presented in Algorithm 4.2.
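For readers who want to experiment with the loop of Algorithm 4.2, the following is a minimal Python sketch using scikit-image's pyramid_gaussian; the file name, the downscale factor of 2, and the gray-scale input are illustrative assumptions rather than the book's exact implementation.

from skimage import io
from skimage.transform import pyramid_gaussian

# read the scene text image as gray scale (the pyramid levels are later
# passed to the classifier as gray scale images); the file name is hypothetical
image = io.imread("scene_text.png", as_gray=True)

pyramid = []
for level in pyramid_gaussian(image, downscale=2):
    # stop once either dimension falls below the empirical 30-pixel limit
    if level.shape[0] < 30 or level.shape[1] < 30:
        break
    pyramid.append(level)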

4.3.3 Pre-processing of Image Pyramids by Image Filters
Initially, irrelevant and unnecessary information is removed from the image so that it does not get merged with the textual objects during the convolution calculation. The text should remain in focus throughout the process, which is why patterns that may confuse the recognition process must be eliminated. The previous section explained the construction of the pyramid without explaining its significance for the recognition process. The novelty of this work is to address the whole image as a single patch and to treat the whole image as a feature after passing it through empirically selected kernels. Each level of the linear pyramid is convolved with various filters, which yields a variety of features, described as follows:
1. Smooth text regions: The text location is determined by applying average blur kernels. Two kernels were used to obtain two varieties of features, a small and a large blur of the pyramid image:

small_blur = np.ones((9, 9)) × (1/(9 ∗ 9))     (4.10)

large_blur = np.ones((23, 23)) × (1/(23 ∗ 23))     (4.11)

2. Text edge detection: The region of interest, such as scene text, is detected by applying the Laplacian kernel:

$$\text{Laplacian} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$


3. Image sharpening: The image is enhanced by highlighting the edges to bring out the fine details of the text:

$$\text{Sharpen} = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{pmatrix}$$

4. Sobel x–y convolution: The Sobel filter is used to accentuate the edges of the given image. Two kernels were designed to respond to the image vertically and horizontally with reference to the image grid, and each kernel was applied individually to every input image to measure the gradient components:

$$\text{Sobel}_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad \text{Sobel}_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$$

Algorithm 4.3 Pseudo code to pre-process pyramid images
Require: py-img
Ensure: Filtered image
1: while image != py-img do
2:   pyramid_gaussian(image, scale[read_image])
3:   Smooth(py-img)
4:   im.save(py-img_s)
5:   Laplacian(py-img)
6:   im.save(py-img_l)
7:   Sobel_x(py-img)
8:   im.save(py-img_sx)
9:   Sobel_y(py-img)
10:  im.save(py-img_sy)
11: end while
12: img_gt ← read_gt(img)
13: while each filtered image fi do
14:   im_gray ← Grayscale(fi)
15:   rf ← Read_features(im_gray)
16:   main_rf ← features.append(rf)
17: end while
18: call_classifier(main_rf, img_gray)

The image values obtained after applying the aforementioned filters serve as the feature vector. In this manner, a variety of features associated with a single image can be mapped.
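A hedged sketch of this filter bank is given below, assuming OpenCV's filter2D for the convolutions; the function name and the kernel dictionary are illustrative and not the authors' actual code.

import cv2
import numpy as np

def filter_bank(gray):
    """Apply the smoothing, Laplacian, sharpening, and Sobel kernels to one
    gray-scale pyramid level and return the filtered responses."""
    gray = np.float32(gray)  # keep negative responses of Laplacian/Sobel
    small_blur = np.ones((9, 9), np.float32) / (9 * 9)
    large_blur = np.ones((23, 23), np.float32) / (23 * 23)
    laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], np.float32)
    sharpen   = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], np.float32)
    sobel_x   = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], np.float32)
    sobel_y   = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], np.float32)
    kernels = {"small_blur": small_blur, "large_blur": large_blur,
               "laplacian": laplacian, "sharpen": sharpen,
               "sobel_x": sobel_x, "sobel_y": sobel_y}
    # ddepth=-1 keeps the (float) input depth for every filtered response
    return {name: cv2.filter2D(gray, -1, k) for name, k in kernels.items()}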


4.4 MNIST-Based Convolutional Features
The ConvNet is widely used for feature extraction, as it extracts detailed features from input images [2, 4]. Here its potential is exploited to extract the intended features. Keeping the implied complexity of Urdu script in the forefront, the idea of implicit segmentation is emphasized. A ConvNet requires a large amount of labeled data for training, which is difficult to handle manually. The hypothesis presented by [102] is to consider the MNIST and UNHD databases together, because both contain handwritten strokes that share the same complexities. A five-layer ConvNet architecture is proposed.

Initially, pre-processing is done by removing noise from the given image and standardizing its representation after converting it into gray scale. The standardization of an image is performed by dividing the current pixel value by the total number of image pixels, as represented in Eq. 4.12:

$$S = \frac{X_{\mathrm{current\,value}}}{T_{p}} \qquad (4.12)$$

where T_p is the total number of pixels in the image; in this way, all values are mapped into the range (0–1). Words were cropped from the Urdu text and each word image was standardized to 96 × 96 pixels. In each feature map, every neuron is mapped to a small 5 × 5 region of the input image. The connection from the input image to the hidden layer is established through a local receptive field, also called the filter size. Each neuron in a layer shares the same bias value. As a single feature map does not cover all of the relevant features, the process is replicated to obtain a variety of features for each given image. A feature map is defined by its shared weights and a bias value; mathematically, this relation can be represented in Eq. 4.13:

$$\alpha\!\left(d + \sum_{e=0}^{4}\sum_{f=0}^{4} W_{e,f}\, A_{j+e,\,k+f}\right) \qquad (4.13)$$

where α is the sigmoid neural activation function and d is the shared bias value. W_{e,f} represents the filter (kernel) weight, which depends on the filter size, and A represents the input activation at point (x, y). In the proposed architecture, 96 feature maps are each defined by a 5 × 5 set of shared weights plus a single bias value, i.e., 26 parameters per map, of which 25 are the shared weights. The convolutional layer therefore comprises 96 feature maps with a total of 96 × 26 = 2496 parameters. As a result, the network can detect 96 different kinds of features at convolutional layer 1, as represented in Fig. 4.10. The same process continues for the next four hidden layers.


Fig. 4.10 Transfer learning-based cursive handwritten text recognition

In order to condense the extracted features, pooling is an essential step in ConvNets. Here, an L2 pooling strategy is used, which takes the square root of the sum over a 5 × 5 region. The condensed features F1 extracted from the MNIST database images are convolved with the Urdu handwritten input image Iw from the UNHD database. The last layer is a fully connected network that connects every neuron of the L2 pool to the output neurons. The prime goal is to find the optimal performance on this intrinsically cursive script.
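As a small arithmetic check of the counts quoted above (25 shared weights plus one bias per feature map, 96 maps), consider the following sketch; the variable names are purely illustrative.

# each of the 96 feature maps shares one 5 x 5 kernel plus a single bias value
feature_maps = 96
kernel_params = 5 * 5        # shared weights per feature map
bias_params = 1              # one shared bias per feature map
per_map = kernel_params + bias_params     # 26 parameters per feature map
total = feature_maps * per_map            # 96 * 26 = 2496 parameters
print(per_map, total)                     # -> 26 2496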

4.4.1 ConvNets as a Feature Extractor
Suppose there is a relatively large image from which 70 features are required, to be learned by a classifier with a fully connected feedforward network. In this situation the computation would be complex and a single epoch would take considerable time; backpropagation would be slower still. To keep the computation tractable in ConvNets, the solution is to limit the connections between hidden units and input units: each hidden unit connects only to a subset of the input units, in particular to a small group of contiguously located pixels. The image volume Iv is determined by the width w, height h, and depth d:

I_v = w × h × d     (4.14)

Let the number of filters be k, the spatial extent f, the stride s, and the amount of zero padding p. Zero padding here relates to the linear output: the nonlinear output is represented by negative values, which are replaced by zero to obtain the linear layer output. At each location reached by moving the filter by the stride value, the output width and height are computed for each kernel as follows, where W_i and H_i are the width and height produced by the i-th kernel and the number of kernels determines the depth d:

W_i = (w_1 − f + 2p)/s + 1     (4.15)

H_i = (h_1 − f + 2p)/s + 1     (4.16)


Fig. 4.11 Feature extraction using ConvNets

As shown in Fig. 4.11, the filter slides over the whole image. Each time it stops (as dictated by the stride), it takes the maximum value of the covered pixels as a feature and writes it to the corresponding position of the output layer, starting at (1, 1). A stride of 1 means the filter moves one pixel to the right and repeats the same operation; after finishing a row, it moves one pixel down and begins again, until the whole image has been processed.
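Equations 4.15 and 4.16 can be evaluated with a small helper such as the sketch below; the function name and the 96 × 96 example are assumptions used only for illustration.

def conv_output_size(w, h, f, s, p):
    """Spatial output size of a convolution with filter size f, stride s,
    and zero padding p, following Eqs. 4.15 and 4.16."""
    w_out = (w - f + 2 * p) // s + 1
    h_out = (h - f + 2 * p) // s + 1
    return w_out, h_out

# e.g. a 96 x 96 input with a 5 x 5 filter, stride 1 and no padding
print(conv_output_size(96, 96, 5, 1, 0))   # -> (92, 92)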

4.5 Deep Learning RNN Model for Cursive Text Analysis
Deep learning is a branch of machine learning inspired by the human way of learning tasks. It usually works on labeled data with the help of artificial neural networks. This section explains the deep learning-based approaches adapted and designed for Arabic text analysis.

4.5.1 MDLSTM Network Training for Arabic Scene Text
In Arabic script, context is what makes a word meaningful; without context, it is difficult to identify the exact word. Keeping this constraint in the forefront, the potential of a context learning approach is investigated on Arabic scene text. Recent research on Arabic-like scripts has likewise suggested complex context learning architectures, as reported in [3, 4, 93]. In the basic recurrent neural network architecture, the recurrent connection of a neuron maintains a history of recent computation, and the computation retained in a node influences the weight calculation for the current element of the sequence. The main constraint on retaining previous computation is a time lag, which varies with the problem at hand. When the problem is large and all previous calculations need to be kept intact, a simple recurrent architecture cannot maintain the history for long inputs. LSTM networks help to overcome this problem.


Fig. 4.12 LSTM memory block consists of an input gate, forget gate, and output gate with multiplicative units and a memory cell. The computation is regulated through multiplicative units and nonessential previous computation discarded via forget gate

As represented in Fig. 4.12, the LSTM keeps the history as long as it is required and then forgets it through its gating mechanism. In the LSTM architecture, the hidden neurons are replaced with LSTM memory blocks and their multiplicative units. The history is manipulated within the memory block by the multiplicative units, which retain or discard the gradient information depending on what the sequence computation requires at a particular point in time. In the proposed system, the common key-point information that appears in the detected extremal regions of the binary image and the image mask is passed conjunctively and recorded as a key-point descriptor. A number of experiments are executed to examine the performance of the presented technique. The multidimensional LSTM (MDLSTM) network is an RNN-based approach: it contains as many recurrent connections per LSTM memory block as the image has dimensions. At each point in the sequence, the network receives the external input and its own previous activation along all dimensions, which makes it suitable for context learning applications. It has been applied to various document image analysis tasks, particularly those relevant to cursive scripts, as reported by [47, 93, 121]. Because an RNN maintains contextual information and temporally correlates new sequences with previous ones, it offers a distinct advantage for this task. The proposed architecture uses a hybrid feature extraction technique and exploits the potential of MDLSTM networks, as represented in Fig. 4.13. The MDLSTM network is a variant of RNN that can handle long sequences in all dimensions. The basic RNN architecture does not retain features for long in the case of complex input; this is known as the vanishing gradient problem, and it is overcome by the introduction of memory blocks in place of hidden neurons.
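Multidimensional LSTM layers are not available off the shelf in common deep-learning toolkits, so the hedged PyTorch sketch below only illustrates the general sequence-labelling setup discussed here: a bidirectional one-dimensional LSTM feeding a CTC output layer. The layer sizes, feature dimension, and 40-symbol alphabet are assumptions, not the authors' configuration.

import torch.nn as nn

class SequenceRecognizer(nn.Module):
    """Bidirectional LSTM over feature-vector sequences with a CTC output."""
    def __init__(self, n_features=16, n_hidden=100, n_classes=40):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # one extra output for the CTC blank label
        self.fc = nn.Linear(2 * n_hidden, n_classes + 1)

    def forward(self, x):                     # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.fc(out).log_softmax(dim=-1)

model = SequenceRecognizer()
# nn.CTCLoss expects (time, batch, classes) log-probabilities during training
ctc_loss = nn.CTCLoss(blank=40, zero_infinity=True)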


Fig. 4.13 MDLSTM-based Arabic scene text recognition system

Each memory block comprises a memory cell and multiplicative units, as depicted in [48]. The input is regulated by the input gate and reaches the memory cell via the multiplicative units, as indicated in the depiction of the LSTM network classifier in Fig. 4.12. After suitable parameters were selected empirically, the segmented Arabic text image was passed to MSER and the invariant features were then extracted. In each experiment, as mentioned in Table 4.3, the invariant points were given to the classifier. The invariant points carry orientation information, which helps the MDLSTM network classifier learn the pattern as it appears, together with the coordinate values.


Fig. 4.14 Learning curve with 100 LSTM memory units observed during character recognition

The coordinate values of the invariant points provide primitive feature information, which is then used for learning. As the input is complex and cursive in nature, the network comprises five hidden layers with 20, 40, 60, 80, and 100 LSTM memory blocks, respectively. As is usual for this RNN architecture, the hidden layers are fully connected. The hidden blocks are passed to a feedforward network with tanh summation units for cell activation. All hidden layer processing is collapsed into a one-dimensional sequence, and Connectionist Temporal Classification (CTC) then assigns labels to the learned content. The network's optimal performance is obtained through careful selection of effective parameters. Figure 4.14 shows the best learning curve obtained during training on the character dataset; the red line marks the best point for learning the given samples. Beyond it, the gap between training and validation increases, which ultimately terminates the training after 450 epochs. The parameters specific to this RNN are the input block, the learning rate, and the number of units per LSTM memory block. The input block is defined as 4 × 4, meaning the invariant points residing in a 4 × 4 block are given as one input; similarly, the hidden block size means that the features of the provided samples are collected into 4 × 4 blocks. Preliminary experiments guided the selection of optimal parameters, and the parameters chosen for training the MDLSTM network on Arabic scene text are presented in Table 4.1. Training stops at the point where no improvement is observed on the validation set. Experiments were conducted with different LSTM hidden memory sizes, as mentioned in Table 4.2, which reports the number of hidden memory units and the time consumed per training iteration. To achieve optimal accuracy, the proposed architecture was evaluated with 20, 40, 60, 80, and 100 hidden layer units, as reported in Table 4.3.

Table 4.1 Parameter selection during training

Parameters                          Values
Input block size                    4 × 4
Hidden block size                   4 × 4
Hidden memory units                 20, 40, 60, 80, 100
Learning rate                       1 × 10−4
Momentum                            0.9
Total network weight for exp 1      2,80,273
Total network weight for exp 2      2,10,028
Total network weight for exp 3      93,724

Table 4.2 Time comparison per iteration on various LSTM memory blocks

Hidden layer sizes              20     40     60     80     100
Time per iteration (seconds)    18.3   29.7   44.1   60.5   73.0

Table 4.3 Accuracies reported on trained data

Exp-variations       No. of SIFTs   No. of hidden layers (%)
                                    20      40      60      80      100
Binary+SIFT+MSER     2,80,273       60.12   63.70   71.53   79.17   90.44
Mask+SIFT+MSER       2,10,028       61.78   63.42   74.16   83.29   92.57
Intersection         93,724         69.72   71.89   79.24   87.61   94.50

The number of SIFT points was also recorded for each experimental combination. From Table 4.3 it can be inferred that good accuracy is obtained when a smaller number of common SIFT features of interest is given to the classifier. This provides an efficient solution, because not every key-point extracted by SIFT is worth considering; the key-points that appear inside a detected extremal region are the more relevant ones for training, as shown in Fig. 4.15.

4.5.2 Experimental Analysis
As mentioned earlier, the experimental study under discussion is carried out on 1500 scene text images in which the text has been segmented into words and invariant features have been extracted. For the third experiment, the assumption is that not all extracted key-point features are worth examining; therefore, only the intersecting key-points detected in the two different images are considered and passed to the classifier.


Fig. 4.15 The common MSER region is mapped onto x–y coordinates. The box indicates SIFT features detected at a location that was not detected by MSER; such points are discarded for classification, since the emphasis is on the common points that lie within a detected extremal region

Figure 4.16 shows the training curves observed for the various LSTM memory block sizes. The network shows an over-fitting trend when the memory block size is 20; the training and validation curves for 40, 60, 80, and 100 LSTM memory blocks are mapped in the same figure. The best accuracy was obtained with 100 LSTM memory blocks. The learning curves represent the trade-off between the training and validation sets, and accuracy is measured at character level. In the proposed dataset, each Arabic character appears in different styles and various orientations, so the invariant feature extraction approach is a suitable choice for classifying them. Figure 4.17 shows bad sample images in the EASTR-42k dataset, which affected the final accuracy. As mentioned in Table 4.3, the experiments considered three variations of Arabic word recognition in natural images; in further experiments, only the common invariant points within extremal regions were considered for character and text-line recognition, as summarized in Table 4.4. In word recognition, the process of extracting SIFT features and detecting extremal regions is shared, but the SIFT features are computed on two different images, the binary image and the image mask, while in the third variation the intersecting points are taken as the features given to the classifier. All performed experiments were labeled: the first variation, which applies invariant feature extraction and extremal region detection to the image mask, is labeled V1. Similarly, the variations using the binary image and the intersected points are referred to as V2 and V3, respectively.


Fig. 4.16 Training curves observed on various LSTM memory block sizes during word recognition

Table 4.4 Accuracies reported during training of word data on ASTR-27k dataset

Parameters        V1          V2          V3
Number of SIFTs   2,80,273    2,10,028    93,724
tp                47.24%      68.71%      71.24%
fp                10.78%      6.16%       4.71%
tn                23.47%      17.59%      15.12%
fn                18.51%      7.54%       8.93%
Precision         0.81        0.92        0.94
Recall            0.72        0.90        0.89
F-measure         0.76        0.91        0.94


Fig. 4.17 Examples of some bad images. a Misclassified text region in binary images. b Text detection is ambiguous in some image masks

Fig. 4.18 Example images where scene text was influenced by light or orientation


The number of SIFT features was counted and the accuracy computed in terms of precision, recall, and F-measure, as detailed in Table 4.4. Some samples contained missing or blank character images after pre-processing; such samples may have been affected by illumination or orientation factors, as shown in Fig. 4.18.

4.6 Deep Convolutional Neural Network
Arabic scene text is also analyzed with the proposed ConvNets. A ConvNet is a type of deep neural network based on the idea of Multilayer Perceptrons (MLPs) and has been applied successfully to the recognition of various objects in images. Unlike RNNs, a ConvNet is a single-instance learner rather than a sequence learner.

Context is not important for ConvNet training. ConvNets are nowadays considered an important tool in machine learning applications [125–127]. The Arabic script is complex and cursive in nature, and although various authors have reported work on synthetic and scanned Arabic text, very little research has been presented on Arabic scene text recognition to date. In Fig. 4.19, an input image of arbitrary size is preprocessed to a fixed size (50 × 50) and converted into gray scale. The image is then saved in five different orientations, so five oriented images are processed for each input image. Convolution is performed and features are extracted through pooling; the details of feature extraction are given in the following subsection. In the last stage, fully connected layers classify the given image and compute the class probabilities for the current input.
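The pre-processing step described above (resizing to 50 × 50, gray-scale conversion, and five orientations per character) could be sketched as follows; the rotation angles and the function name are assumptions, since the book does not list the exact angles used.

import cv2

def make_oriented_samples(image_path, size=50, angles=(-20, -10, 0, 10, 20)):
    """Resize a character image to size x size, convert it to gray scale and
    return one rotated copy per angle (five orientations per input image)."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (size, size))
    centre = (size / 2, size / 2)
    samples = []
    for angle in angles:
        rot = cv2.getRotationMatrix2D(centre, angle, 1.0)
        samples.append(cv2.warpAffine(gray, rot, (size, size)))
    return samples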

4.6.1 ConvNets as a Learning Classifier
Although ConvNets are well suited to feature extraction, they can also be used as a learning classifier, and in the proposed work a ConvNet is used as the classification technique. A fully connected architecture with 3 × 3 and 5 × 5 spatial convolution kernels is used together with a max-pooling strategy, as represented in Eq. 4.17:

$$F'(x) = \max_{k} f(x_{s_j}) \qquad (4.17)$$

The max-pooling strategy takes the maximum value max_k within the filter window observed at pixel x_{s_j}. The Rectified Linear Unit (ReLU) is used as the activation function, replacing negative responses of the processed data with zero.


Fig. 4.19 Proposed methodology based on ConvNets

The features learned during training are compared with the features extracted from the test set; the difference is computed and the accuracy measured. The output neurons of the proposed network represent the activation of each class, and the most active neuron predicts the class of the given input. A softmax layer is used to interpret the activation values as class predictions.
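A minimal sketch of such a classifier is shown below, assuming PyTorch: two 5 × 5 convolutional layers with ReLU and max-pooling, one fully connected layer, and a softmax over the 27 character classes. The channel counts (32 and 64) are illustrative assumptions and are not taken from the book.

import torch.nn as nn

# input: 1-channel 50 x 50 gray-scale character image
arabic_char_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 50 -> 46 -> 23
    nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 23 -> 19 -> 9
    nn.Flatten(),
    nn.Linear(64 * 9 * 9, 27),          # fully connected layer over 27 classes
    nn.LogSoftmax(dim=1),               # softmax interpreting class activations
)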

4.6.2 Experimental Study
The Arabic text images were extracted from the EAST (English–Arabic Scene Text) dataset. To achieve correct text recognition, the text image must be correctly segmented and noise removed, so that the classifier can correctly extract the features, learn, and recognize the text. In Arabic script, 27 classes were identified. Every class is represented by 20 images in the train set, as depicted in Fig. 4.20, and five different orientations are taken into account for each character. As summarized in Table 4.5, 100 representations are thus identified for each Arabic character, following the technique recently presented by [93]. In the test set, each class is represented by 5 variant positions; after orientation, 20 samples were identified for each class.


Fig. 4.20 Various representations of character "aain" and "wao" with five orientations

Table 4.5 Dataset statistics

Dataset classification        Statistics
Number of characters          2700
Classes                       27
Oriented images per class     100
Training set                  2450
Test set                      250

Experiments were conducted with various parameters, such as different filter sizes and learning rates, according to the settings listed in Table 4.6. Training and testing samples are distributed over the 27 identified classes. Every segmented character is rescaled to 50 × 50 and oriented at five different angles. Training was performed on 2450 character images, and the trained network was evaluated on 250 images. The ConvNet was first implemented with two convolutional layers followed by one fully connected layer, both convolutional layers using 5 × 5 convolutions with stride 2; the error rate was 27.01%. In another setting, a max-pooling strategy was introduced after each convolutional layer and an extra fully connected layer added, with stride 1 and a 5 × 5 filter size, while the learning rate was chosen empirically; this yielded a 19.57% error rate. The best accuracy was obtained with a 3 × 3 filter size instead of 5 × 5. The reason for choosing the smaller filter size is to capture more detail of the character image, since Arabic characters also appear with diacritics. With an empirically selected learning rate of 0.005, a 14.57% error rate was obtained. The details of the performed experiments with the observed error rates are summarized in Table 4.6.


ConvNets are suitable for instance learning tasks rather than sequence learning: context cannot be learned by a ConvNet, but it is beneficial for extracting detailed features of the provided pattern.

The features scrutinize the given pattern at pixel level through the varying filter sizes. As mentioned earlier, the ConvNets were evaluated on a small subset of Arabic scene text images and produced encouraging results, the best being obtained with a 3 × 3 filter size, as can be observed in Fig. 4.21. The performed experiments suggest that a smaller filter size captures more features, which suits languages written in cursive scripts.

Table 4.6 Experimental parameters with error rates

Filter size   Stride   Learning rate   Error rate (%)
3 × 3         1        0.005           14.57
3 × 3         1        0.5             20.93
3 × 3         2        0.005           18.24
3 × 3         2        0.5             25.59
5 × 5         1        0.005           19.75
5 × 5         1        0.5             29.01
5 × 5         2        0.005           22.20
5 × 5         2        0.5             33.97

Fig. 4.21 ConvNets performance comparison with 3 × 3 and 5 × 5 filter sizes by keeping learning rate as 0.5 and 0.005


4.7 Training of Handwritten Urdu Samples on Pre-trained MNIST Dataset
A network pre-trained on the MNIST dataset is used to assess the potential of the Urdu handwritten samples.
Multidimensional LSTM networks: For every handwritten Urdu word w, represented as input Iw, processing starts by standardizing the image skeleton and computing the feature maps M with a ConvNet, as described in the previous section. Every feature map contains a series of distinctive features. The MDLSTM network is a variant of the RNN approach [27]. In an RNN, the hidden layer neurons are recurrently connected to themselves as well as to the subsequent neurons of the next layer, which produces a memory effect. When the problem grows, however, this memory gradually fades, which makes long sequences hard to handle. The solution is the LSTM memory block, which replaces the hidden neuron with a memory cell and its multiplicative units: the memory is retained through these blocks for a specific period of time and then forgotten. The LSTM architecture has demonstrated evident success on many recently addressed sequence learning problems [4, 27, 47, 123]. As represented in Fig. 4.22, for a corresponding input image Iw the extracted image skeleton is convolved with five kernels F1–F5 (selected empirically during MNIST training), which are used as features and passed to the MDLSTM classifier, as represented in Fig. 4.23. A 5 × 5 filter is moved over the image (one position at a time) to extract the feature vector, which is passed to the classifier along with the corresponding ground truth.
Transfer learning: The training of UNHD samples starts from the MNIST pre-trained network. The advantage of the handwritten strokes in the MNIST dataset is exploited by incorporating their learning experience into the training of the UNHD samples. To realize this, the MDLSTM network is configured with 10, 20, 40, and 60 hidden memory blocks in a five-layer architecture that scans the given input in all four directions, with tanh as the activation function:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \qquad (4.18)$$

The underlying architecture is fully connected. Taking advantage of transfer learning, the experimental study divides hidden layers 2 and 4 into two sub-sampling layers with sizes (10, 20), (20, 30), (30, 40), and (40, 50), respectively. In this experimental setting, training begins from the MNIST pre-trained network. All sub-sampling layers are feedforward networks that use the sigmoid σ as activation:

$$\sigma(z) = \frac{1}{1 + \exp(-z)} \qquad (4.19)$$


Fig. 4.22 MNIST convolutional matrix with UNHD Database images

Although tanh is believed to be a good choice for optimizing the training because of its faster convergence, the network sometimes requires more time to learn complex data; therefore, σ is used in the sub-sampled layers. Mathematically, this relation can be summed up as

$$A_m = \left( \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} + \frac{1}{1 + \exp(-z)} \right)^{n} \qquad (4.20)$$

Here A_m denotes the mutual activation and n the size of the sub-sampling layers. The features are collected into 4 × 4 hidden blocks, which are passed to feedforward layers with tanh summation activation units, as represented in Fig. 4.23. Another experiment was performed with hidden layers 1 and 3 kept frozen. In this experiment, training again starts from the pre-trained MNIST network, and hidden layers 2 and 4 of the MDLSTM network contain sub-sampled layers that learn the handwritten cursive script in detail using the experience of the MNIST-trained network; in this variant, however, the sub-sampled layers use tanh feedforward units. All hidden layers of the MDLSTM network are finally collapsed into a single sequence vector. The multidimensional LSTM architecture is employed throughout the proposed work.


Fig. 4.23 Experimental variation in which layer 2 has a sub-sampling layer. Each neuron in a memory block is activated with sigmoid; layers 1 and n transfer their output to the CTC directly, whereas the memory blocks of the sub-sampling layers contain neuron cells activated with tanh

In the feedforward pass, the activation a of each cell in an LSTM memory block is computed in four directions. Each LSTM memory block computes the activations of the input gate a_λ^t, the forget gates a_{α,dim}^t, the internal cell a_p^t, and the output gate a_δ^t, in all dimensions dim. The collapsed sequence is fed to the Connectionist Temporal Classification (CTC) layer [10], which forms part of the output layer and comprises one label per target symbol plus one extra label for undeclared or space symbols in a given pattern. Each element of the labels has a path associated with the input sequence I_x, where x is determined by the sequential path. The gradient descent optimizer is used to reduce the loss obtained from the CTC loss function. The LSTM memory block computes its activation as follows, where M_b represents the memory block:

$$M_b = f(a_{\lambda}^{t}) + f(a_{\alpha,\dim}^{t}) + u_{p}^{t} + f(a_{\delta}^{t}) \qquad (4.21)$$

72

4 Methods and Algorithm

The input gate computes its activation over all hidden memory units J as shown in Eq. 4.22; the refined form is given in Eq. 4.23:

$$a_{\lambda}^{t} = \sum_{i=0}^{k} x_{i}^{t} w_{i\lambda} + \sum_{\dim=1}^{n} \sum_{j=1}^{J} b_{j}^{t,\dim} w_{j\lambda}^{\dim} + \sum_{p=1}^{P} w_{p\lambda}\, q_{p}^{t,\dim} \qquad (4.22)$$

where R represents the activation computed at each cell:

$$R_{\lambda}^{t} = f(a_{\lambda}^{t}) \qquad (4.23)$$

Equations 4.24 and 4.25 represent the calculation performed at the forget gate:

$$a_{\alpha,\dim}^{t} = \sum_{i=1}^{k} x_{i}^{t} w_{i(\alpha,\dim)} + \sum_{\dim'=1}^{n} \sum_{j=1}^{J} b_{j}^{t,\dim'} w_{j(\alpha,\dim)}^{\dim'} \qquad (4.24)$$

$$R_{\alpha,\dim}^{t} = f(a_{\alpha,\dim}^{t}) \qquad (4.25)$$

The computation at the cell is represented by Eq. 4.26, with its recurrent connection shown in Eq. 4.27:

$$a_{p}^{t} = \sum_{i=1}^{k} x_{i}^{t} w_{ip} + \sum_{\dim=1}^{n} \sum_{j=1}^{J} R_{j}^{t,\dim} w_{jp}^{\dim} \qquad (4.26)$$

$$u_{p}^{t} = R_{\lambda}^{t}\, g(a_{p}^{t}) + \sum_{\dim=1:\, t_{\dim}>0} u_{p}^{t,\dim}\, R_{\alpha,\dim}^{t} \qquad (4.27)$$

The calculation performed at the output gate is presented in Eqs. 4.28 and 4.29:

$$a_{\delta}^{t} = \sum_{i=1}^{k} x_{i}^{t} w_{i\delta} + \sum_{\dim=1}^{n} \sum_{j=1}^{J} R_{j}^{t,\dim} w_{j\delta}^{\dim} + \sum_{p=1}^{P} w_{p\delta}\, s_{p}^{t} \qquad (4.28)$$

$$R_{\delta}^{t} = f(a_{\delta}^{t}) \qquad (4.29)$$

The LSTM memory blocks in the sub-sampling layers are the product of all LSTM memory blocks involved in the forward and backward directions, as represented in Eq. 4.30, where n is the number of LSTM blocks in the sub-sampling layers:

$$\prod_{s=0}^{n} \left( f(a_{\lambda}^{t}) + f(a_{\alpha,\dim}^{t}) + u_{p}^{t} + f(a_{\delta}^{t}) \right) \qquad (4.30)$$

The proposed network was trained using a gradient descent optimizer with learning rates of 1 × 10−4 and 1 × 10−3.


The best training accuracy was reported with the former learning rate, which was selected empirically. The total number of network weights is 2,75,391. Training stops when no improvement in error rate is observed for 30 consecutive iterations.
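The stopping rule described above (no improvement for 30 consecutive iterations) can be expressed as a small helper; this is a generic sketch, not the authors' training code.

def should_stop(val_errors, patience=30):
    """Return True once the best validation error has not improved for
    `patience` consecutive iterations."""
    if len(val_errors) <= patience:
        return False
    best_before = min(val_errors[:-patience])
    # no value in the last `patience` iterations beat the earlier best
    return min(val_errors[-patience:]) >= best_before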

4.7.1 Dataset
The handwritten Urdu text was collected from various segments of society without imposing any special constraints; the contributors include school-going children, college and university students, and office-going professionals. Figure 4.24 shows Urdu handwritten samples with variations in writing style, which poses an exceptional problem for recognition systems: recognition accuracy then depends heavily on how well the images can be segmented. More variations were captured in an extension of the UNHD dataset by asking writers to write in their natural handwriting without baselines; the collected samples also include scrambled and overwritten Urdu text. In the UNHD dataset, the Urdu text is taken from 500 authors, each author writing 5 text lines with 10 words per line. In this way, almost 10,000 text lines were collected,

Fig. 4.24 UNHD data samples having implicit variations

Table 4.7 Extended UNHD dataset description

Detail            Description
No. of authors    500
Text lines        10000
No. of words      312000


comprising 3,12,000 words, as mentioned in Table 4.7. The experiments are conducted on 150 Urdu text lines written by 30 authors, containing 1509 words and 2851 ligatures. The dataset is divided into a train set and a test set.

4.7.2 Experimental Study
To validate the proposed system, the potential of the demonstrated experiments is assessed by tuning the parameters that could influence the training curve. A further experiment freezes the activations of a few layers while adding sub-sampled layers to the non-frozen layers of the MDLSTM architecture. A brief description of the performed experiments follows.
1. By tuning parameters: The training curves were observed under different parameter values; the parameter values of the MDLSTM classifier are listed in Table 4.8. Providing distinct, empirically chosen values allows the learning behaviour of the network to be assessed. The images are handwritten cursive samples that are trained on the pre-trained network, so as to transfer the learning of the MNIST samples to the UNHD data. The word transcriptions are divided into a 1,74,720-sample train set, a 74,880-sample validation set, and a 62,400-sample test set. The input block size, sub-sample block size, hidden layer size, and learning rate are examined over a range of values, as explained in Table 4.9. As observed from the table, the best accuracy was obtained with a learning rate of 1 × 10−4 and a hidden layer size of 100; the learning rate and the number of hidden units clearly affect the recognition accuracy, while the input block is constant for all experiments. Training starts from the pre-trained MNIST network, using a gradient descent optimizer, to exploit its learning for the UNHD handwritten data.

Table 4.8 MDLSTM parameter details used for training UNHD using ConvNets feature selection

Parameter's detail       Values
Hidden size              20, 60, 100
Sub-sample size          10, 40
Hidden block size        2 × 6, 2 × 6
Input block size         10 × 1
Learning rate            1e-4, 1e-3
Momentum                 0.9
Total network weight     2,75,391


Table 4.9 Tuning the network by keeping focus on selected parameter values

Input block   Learning rate   Hidden layer size   Training error   Testing error
10 × 1        1e-4            20                  0.23             0.25
10 × 1        1e-4            60                  0.17             0.20
10 × 1        1e-4            100                 0.05             0.71
10 × 1        1e-3            20                  0.31             0.39
10 × 1        1e-3            60                  0.29             0.32
10 × 1        1e-3            100                 0.13             0.19
10 × 1        1e-4            20                  0.18             0.21
10 × 1        1e-4            60                  0.11             0.16
10 × 1        1e-4            100                 0.03             0.07
10 × 1        1e-3            20                  0.17             0.20
10 × 1        1e-3            60                  0.15             0.18
10 × 1        1e-3            100                 0.09             0.11

Table 4.10 Error rate on clean and complex data

Details             Clean test data   Complex test data
Number of samples   58827             3573
Error rate          0.063%            0.087%

Several parameters contributed to finding the best performance for the given values. This experiment considered the UNHD dataset without complex data; the complex data comprises scrambled Urdu text with cut and overwritten words. tanh was used as the activation function. The best accuracy was 91.3% on the test set after 370 epochs with a hidden layer size of 100. Table 4.10 reports the accuracy achieved on the complex and the clean handwritten data.
2. By incorporating frozen hidden layers: A further variation investigates performance when sub-sampling layers are incorporated while the activations of the other hidden layers are frozen. As explained earlier, a five-layer architecture is employed comprising 20, 60, and 100 hidden memory units, and the MNIST pre-trained network is used to obtain the benefit of transfer learning. In this variation, training starts from the pre-trained network, but the gradients computed at layers 1, 3, and 5 are kept frozen, while sub-sampling layers are introduced at layers 2 and 4. The numbers of sub-sampled hidden memory units are presented in Table 4.11. The dataset evaluation is categorized into three sections according to the difficulty level of the acquired samples. TL-MDLSTM-1 is the first variation of the MDLSTM transfer learning architecture and evaluates the UNHD


Table 4.11 Experimental results with different models and training methods

Details                Sub-sampling size   Input block size   Precision   Recall   F-measure
TL-MDLSTM-1            10, 20              10 × 1             0.63        0.57     0.59
TL-MDLSTM-1            20, 30              10 × 1             0.69        0.58     0.63
TL-MDLSTM-1            30, 40              10 × 1             0.73        0.65     0.69
TL-MDLSTM-1            40, 50              10 × 1             0.68        0.56     0.61
TL-Deep-MDLSTM         10, 20              10 × 1             0.74        0.68     0.71
TL-Deep-MDLSTM         20, 30              10 × 1             0.73        0.69     0.71
TL-Deep-MDLSTM         30, 40              10 × 1             0.85        0.78     0.81
TL-Deep-MDLSTM         40, 50              10 × 1             0.78        0.75     0.76
TL-MDLSTM (Complex)    10, 20              10 × 1             0.77        0.71     0.74
TL-MDLSTM (Complex)    20, 30              10 × 1             0.79        0.70     0.74
TL-MDLSTM (Complex)    30, 40              10 × 1             0.82        0.72     0.77
TL-MDLSTM (Complex)    40, 50              10 × 1             0.79        0.72     0.75
TL-MDLSTM-1            10, 20              16 × 2             0.82        0.78     0.80
TL-MDLSTM-1            20, 30              16 × 2             0.84        0.78     0.81
TL-MDLSTM-1            30, 40              16 × 2             0.90        0.84     0.86
TL-MDLSTM-1            40, 50              16 × 2             0.88        0.83     0.85
TL-Deep-MDLSTM         10, 20              16 × 2             0.79        0.75     0.77
TL-Deep-MDLSTM         20, 30              16 × 2             0.88        0.82     0.84
TL-Deep-MDLSTM         30, 40              16 × 2             0.93        0.84     0.88
TL-Deep-MDLSTM         40, 50              16 × 2             0.85        0.80     0.82
TL-MDLSTM (Complex)    10, 20              16 × 2             0.66        0.59     0.62
TL-MDLSTM (Complex)    20, 30              16 × 2             0.65        0.49     0.56
TL-MDLSTM (Complex)    30, 40              16 × 2             0.67        0.47     0.55
TL-MDLSTM (Complex)    40, 50              16 × 2             0.63        0.45     0.52

dataset. Every experimental variation considers four sets of sub-sampling size, i.e., (10,20) (20,30) (30,40) (40,50) using input blocks size 10 × 1 and 16 × 2. The second variation is TL-Deep-MDLSTM networks that covers large amount of UNHD dataset samples and also handwritten samples of UCOM dataset [67] take in account for evaluation. The limited number of complicated handwritten Urdu text was evaluated as third experimental study. The challenging text was contributed by 65 authors. The experiments are performed by considering input block size and sub-sampling size as presented in Table 4.11. The accuracy is measured by computing precision and recall of three experimental variations according to the detail provided in Table 4.11. The ROC curves of each experimental study carried out when input block size was 10 × 1 is represented in Fig. 4.25. Each curve is determined by considering input block size.

4.7 Training of Handwritten Urdu Samples on Pre-trained MNIST Dataset

77

Fig. 4.25 The ROC curves obtained from three experimental studies on four different sub-sample sizes when input block size was 10 × 1

Fig. 4.26 The ROC curves obtained from three experimental studies on four different sub-sampling sizes when input block size was 16 × 2

As observed in Fig. 4.25, the best result was obtained when sub-sampling size was (30, 40). In Fig. 4.26, the precision and recall were mapped by keeping in view three experimental studies. In each experiment, it has been witnessed that good performance was measured when sub-sampling size was (30, 40), but for the third experiment as it contains complex Urdu handwritten data that is why the overall accuracy is comparatively low whereas, the best curve was obtained when sub-sampling size was (10, 20). The comparison of obtained results using two input block size represent comparatively good performance when input block size was 16 × 2. But for the third experimental study, gleaned good accuracy on 10 × 1.

78

4 Methods and Algorithm

4.8 Hierarchical Sub-sampling-Based Cursive Document and Scene Text Recognition The adapted hierarchical MDLSTM network architecture based on sub-sampling of hidden layers approach is proposed by [90] for Arabic scene text. The hierarchical sub-sampling usually applies where the data volume is too large and complex.

The hierarchical sub-sampling-based LSTM architecture includes input layer, an output layer, and multiple self-connected hidden layers. The output of each level in the hierarchy is represented as input an to the level up and so on. The input sequences were sub-sampled by predetermined window width. The hierarchical sub-sampling of RNN based networks follows the same structure as defined for ConvNets. The potential of sub-sampling approach was scrutinized by investigating the performance through three-layer architecture which incorporate 20, 40, 60, 80, 100, and 120 hidden memory block sizes. The network learning is based on the empirically selected parameters. The prime objective is to look for appropriate parameters that provides low error rate in comparison. The parameter’s detail along error rates and overall training time is provided in Table 4.12. The Arabic word assorted from scene text initially pre-process to standard size of 70 × 70 by keeping the aspect ratio. The feature map is prepared by convolving the extracted features from given image through filter window. The convolution process is similar as presented in Sect. 4.4 for handwritten Urdu text as depicted in Fig. 4.27. Here, the gray scale values of convolved pixels are passed to classifier by following a specific input size as sketched in Fig. 4.28. In each feature map, every neuron is mapped according to small 5 × 5 region of input image. The connection from input image to hidden layer is established through local receptive field called a filter size. Each neuron in a layer shares a same bias value. As single feature map does not cover the intensive features therefore, the

Table 4.12 Selected parameters during training the network Parameters Values Input block size Hidden block size Sub-sample sizes Hidden sizes Learn rate Momentum Total network weight

4×1 4×2 6 and 20 2, 10 and 50 1 × 10−4 , 1 × 10−3 0.9 732863

4.8 Hierarchical Sub-sampling-Based Cursive Document and Scene Text Recognition

79

Fig. 4.27 Arabic scene text feature extraction by convolutional pixelate method

process is further delegated in order to have variety of features against each given image. A feature map is defined by its share weight and a bias value; mathematically, this relation can be represented as follows: 4  4  α d+ We, f A j+e,k+ f

(4.31)

e=0 f =0

whereas α is neural activation sigmoid function while d is a shared value of bias. We, f represents filter or kernel weight which depends on filter size whereas A represents the input activation at point (x,y). The extracted features by ConvNets are converted into raw pixels and are given to MDLSTM network architecture with corresponding ground truth as presented in Fig. 4.28. The complex nature of Arabic script prompts to propose a hierarchical sub-sampling architecture of MDLSTM networks for learning. The proposed experiments are based on the sub-sampling architecture which is divided into two main categories. As a first evaluation, the experiments were performed having 3 and 5 layers architecture. Each layer incorporate 20, 40, 60, 80, 100 and 120 hidden LSTM memory blocks. The three-layer architecture is defined by number of hidden memory units at each layer. The input is sub-sampled by 6 × 6 and 2 × 9 window size. The deep learning architecture is designed by defining the data into layer-wise manner. The same process is applied on five-layer architecture. The second variation of experiments was performed by defining the same parameters as experimented by [48, 58]. [48] proposed their solution on handwritten Arabic character recognition while [58] presented the same idea on printed Urdu character recognition using similar parameters. The same parameters and network structure are deliberately to compare the performance of handwritten, printed, and scene text Arabic script recognition as shown in Fig. 4.28. All activation functions in sub-sampling layers are feedforward tanh layers, whereas hidden layers are fully connected in all dimensions. The MDLSTM network collapse all processing into one-dimensional CTC layer having 40 classes including a blank label which predict the output symbol. All activation functions in sub-sampling layers are feedforward, whereas hidden layers are fully connected in all dimensions. The performance was evaluated on various settings of proposed architecture as summarized in Table 4.13.

80

4 Methods and Algorithm

Fig. 4.28 Hierarchical sub-sampling approach. As indicated in the output, the character “meem” (in green) is not recognized by the network

4.8 Hierarchical Sub-sampling-Based Cursive Document and Scene Text Recognition Table 4.13 Selected parameters during network’s training Parameters Values Validation No. of epochs Training/error Sub-sample window Hidden memory units

6×6 2×9 20,60,100, 120

Learning rate

1 × 10−4 1 × 10−5 0.9 475723

Momentum Total network weight

0.86/0.83 0.94/0.92 Best (0.97/0.95) Worst (17.28/15.74) 0.80/0.82 0.96/0.98

81

Time/epoch (min)

317 299 461 248

40 34 29 53

319 406

48 51

Table 4.14 Performance comparison of hierarchical sub-sampling on handwritten, synthetic, and scene text Arabic script with predetermined architecture Category

Epochs

Hidden units

Output layer

Weights

Subsample window

LSTM Accuracy dimension (%)

Online Handwritten 85 Arabic [48]

20, 60, 180

CTC

423,926

[1], [2], [2]

1-DLSTM 95.70

Offline Handwritten 91 Arabic [48]

4, 20, 100 CTC

550,334

[4, 3], [4, 2], [4, 2]

2-DLSTM 95.70

Printed Arabic [58]

398

2, 10, 50

CTC

551,405

[4, 3], [4, 2], [4, 2]

MDLSTM 98.25

Arabic Scene Text

406

20, 60, 100, 120

CTC

475,723

[4, 3], [4, 2], [4, 2]

MDLSTM 96.81

The performance comparison of this approach on handwritten, synthetic, and scene text is detailed in Table 4.14. The offline and online handwritten Arabic scripts were investigated by [48], who presented their work in the ICDAR 2009 handwriting competition using a hierarchical architecture, as shown in Table 4.14. Later, [58] used the same architecture with slight changes of parameters such as the hidden memory blocks, and experimented with MDLSTM; the details of their implementation can be found in their manuscript [58]. The presented approach on scene text using hierarchical sub-sampling achieves benchmark accuracy for Arabic scene text recognition.

82

4 Methods and Algorithm

4.8.1 Experimental Analysis The experiments were conducted into manifold with various settings. The experimental settings were apparently outlined on the basis of architectural manipulation and parametric details. Following are the details of conducted experiments: 1. The number of hidden layers was considered to investigate the performance of learning architecture. 2. The number of memory blocks at each layer using sub-sampled input. 3. The performance is explored by empirically selected learning rates. As discussed earlier that proposed network delegate the processing of MDLSTM network learning to hidden layer units. The proposed method was evaluated on threeand five-hidden layer architecture. At first, with three-layer architecture, each layer has 20 LSTM memory units. Then, by following same hidden layer architecture, each layer has 60 LSTM memory blocks and so on. Ultimately, with hidden layers size 3 and 5, the network was

Table 4.15 Details of performed experiments on three-hidden layer architecture Sub-sample Experiments Hidden Learning rate Word Character size units/layer recognition recognition error (%) error (%) 6×6

2×9

Exp-1 Exp-2 Exp-3 Exp-4 Exp-1 Exp-2 Exp-3 Exp-4

20 60 100 120 20 60 100 120

1 ×10−4 1 ×10−4 1 ×10−4 1 ×10−4 1 ×10−5 1 ×10−5 1 ×10−5 1 ×10−5

0.49 0.24 0.17 0.20 0.55 0.33 0.09 0.24

Table 4.16 Details of performed experiments on five-hidden layer architecture Sub-sample Experiments Hidden Learning rate Word size units/layer recognition error (%) 6×6

2×9

Exp-1 Exp-2 Exp-3 Exp-4 Exp-1 Exp-2 Exp-3 Exp-4

20 60 100 120 20 40 100 120

1 ×10−4 1 ×10−4 1 ×10−4 1 ×10−4 1 ×10−5 1 ×10−5 1 ×10−5 1 ×10−5

0.62 0.53 0.11 0.43 0.59 0.31 0.19 0.22

0.40 0.19 0.13 0.17 0.51 0.23 0.06 0.16

Character recognition error (%) 0.54 0.42 0.10 0.34 0.48 0.24 0.12 0.14

4.8 Hierarchical Sub-sampling-Based Cursive Document and Scene Text Recognition

83

evaluated with each 20, 60, 100 and 120 LSTM memory blocks. Consequently, there are eight experimental settings for each proposed architecture based on the number of hidden layers as detailed in Table 4.15. For the activation of input and output unit tanh was used whereas, logistic sigmoid function was used for gate’s activation. The CTC layer has 38 output nodes for 37 input characters including one extra blank node. The 38-character input includes Arabic characters and numerals. All hidden layers in proposed architecture are fully connected to each other. The three-hidden layer architecture was initially proposed

Fig. 4.29 Observed scene text recognition output, the original input images were rescaled and converted into gray scale. The output was mapped with ground truth. The green color symbols at output shows insertions, whereas deletions are presented in red color

84

4 Methods and Algorithm

Table 4.17 Details of performed experiments on five-hidden layer architecture Error type Test set error Deletions Substitutions Insertions

43.75 41.91 30.24

where each layer was sub-sampled at first to 20 LSTM memory blocks; the performance was later evaluated with 40, 60, 80, and 100. The units defined in the sub-sampled layers are also fully connected. The outputs of the hidden units are propagated backward to the main hidden layers, and the calculation of each sub-sample layer is incorporated into the gradient descent of the next hidden layer, with a learning rate of 1 × 10−4 and then 1 × 10−3 and a momentum of 0.9, selected after observing the trend in other cursive text analyses using MDLSTM networks. Training in each experiment was stopped after no significant improvement in performance was observed for 30 epochs. Tables 4.15 and 4.16 give the details and the number of epochs consumed for each experiment with three and five hidden layers. The learning rate and the number of hidden sub-sampled layers over the convolutional features clearly affect the learning performance of the training network. The output is presented in Fig. 4.29. The recorded accuracy is 95.8%, calculated with the Levenshtein distance measure at character level, as indicated in Table 4.17.
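The character-level accuracy mentioned here is computed from the Levenshtein (edit) distance between the recognized and ground-truth strings; the following is a generic sketch of that computation, not the authors' implementation.

def levenshtein(ref, hyp):
    """Edit distance between two character sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def character_accuracy(ref, hyp):
    """1 - normalised edit distance, i.e. the character recognition rate."""
    return 1.0 - levenshtein(ref, hyp) / max(len(ref), 1)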

Summary
This chapter discussed feature extraction approaches and algorithms designed with the complexity of Arabic script in the forefront. The primary effort is directed toward smooth feature extraction, through hybrid approaches or by ConvNets, while numerous architectures have been adapted and presented as solutions for learning the provided cursive scene text patterns. Contextual information is crucial to learn, and it can be captured by RNN-based architectures; the adapted deep learning RNNs are applied using the methodologies explained earlier in this chapter, so that all possible variations of each Arabic character as it appears in an unconstrained environment are captured. The nature of Arabic script is very complex and cursive: to understand an Arabic word, the characters involved in predicting that word must be investigated, and the representation of characters is a considerable issue because every character can occur in a word in four positional forms. This positional constraint makes it difficult for any explicit segmentation technique to determine the characters correctly; there is therefore a need for implicit segmentation techniques that counter the complications associated with Arabic script. Since Arabic is a context-based script, context learning classifiers are suitable for learning it. The experimental evaluation was also explained in detail, showing the learning trend and the recognition accuracy at word and character level for Arabic scene text.

Chapter 5

Progress in Cursive Wild Text Recognition

Abstract This chapter provides the comparison of recent research on Arabic scene text recognition with presented work to assess the performance consideration. Furthermore, the comparison on the performance of reported results of proposed work with ICDAR competition on multilingual script identification is also discussed in detail.

5.1 Convolutional-Based Performance Comparison The work on deep learning convolutional network-based isolated Arabic scene character recognition is presented in [93]. The presented method was evaluated on ASTR dataset and divided the taken samples into train set and test set. The identified classes are 27 in numbers whereas as each class is rescaled into 50 × 50 size. Moreover, consider five different orientations with respect to various angles. The training set consists of 2450 character images while the performance of learned network is evaluated on 250 images. Their proposed architecture used 2 convolutional layers followed by fully connected layer. In comparison to other available work in Arabic scene text analysis, the reported error rate was 0.15%. Figure 5.1 represents sample images in EASTR-42k image dataset. ASTR dataset is a subset of dataset proposed by [93] which only contains Arabic text samples. As observed from the figure that samples were acquired in an uncontrolled environment where text has rendered to numerous challenges. The robust algorithm for detection of Arabic video text is presented by [96]. They proposed Laplacian operator used for text detection. They identified the candidate region by Laplacian operator in frequency domain whereas the edges were detected by projection profile method. They measure the performance by calculating the precision 0.96% and recall 0.95%. One of the latest work proposed by [98], they demonstrated sub-sampling approach by deep learning classifier to evaluate the screen rendered Arabic text. They used convolutional network to segment the input sample against target labels and learn the contextual dependencies among part of segmented sample. They evaluated their results on freely available synthetic Arabic scene text datasets named ACTIV [95] and ALIF [82]. The third dataset they prepared by downloading avail© Springer Nature Singapore Pte Ltd. 2020 S. B. Ahmed et al., Cursive Script Text Recognition in Natural Scene Images, https://doi.org/10.1007/978-981-15-1297-1_5

85

86

5 Progress in Cursive Wild Text Recognition

Fig. 5.1 Challenging text images represented various font styles, sizes, color, and in different angles having complex background

able Arabic scene text on Google Image. They reported 98.17% accuracy on ALIF dataset while on ACTIV dataset they achieved 97.44% accuracy. A comprehensive model for Arabic text, detection, localization, extraction, and recognition is proposed by [104]. The text has been detected by using four different techniques which are, connected component methods, texture classification methods, edge detection methods, and correlation-based methods. They normalized the samples by setting x-height to 26 pixels. They considered dot over the characters as one feature while all characters do not have dots that is why main body of character also considered like projection feature, transition feature, and occlusion extraction are main features relevant to single character. They used supervised learning k-nearest neighbor algorithm to learn the patterns. They measure recall, precision, and f-score on their proposed dataset. They divided the acquired synthesized Arabic video text taken from three different channels such as TV7 Tunisia, Aljazeera, and AlArabiya, they reported precision, recall, and f-measure as mentioned in Table 5.1.


Table 5.1 Comparison of the English results reported by [88] with recently proposed methods evaluated on various scripts

Study | Algorithm used | Language | DB name/size | Precision | Recall | F-score
Halima et al. [9] | Neuro-fuzzy | Arabic | Tunisia 1 TV, Aljazeera, Alarabiya | 0.83, 0.92, 0.89 | 0.87, 0.88, 0.91 | NR
Li et al. [105] | Nonlinear neural network | Arabic | 200 images | 0.70 | 0.72 | 0.66
Veit et al. [85] | Photo OCR algorithms | English | 63,686 images | 0.68 | 0.28 | 0.40
Yi et al. [86] | Character appearance and structure modeling | English | RRD2003, RRD2011 | 0.71, 0.76 | 0.62, 0.68 | 0.63, 0.67
Epshtein et al. [122] | Stroke width transform | English | ICDAR2005 | 0.73 | 0.60 | 0.66
Tian et al. [124] | Histogram of oriented gradients | Multilingual (Chinese/Bengali) | Bengali 487, Chinese 6,262 | NR | NR | 92.2, 71.0
Ahmed et al. [88] | Window-based | English scene text | ESTR-15k | 94.1 | 89.5 | 97.52

*NR = Not Reported
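The supervised k-nearest neighbor step attributed to [104] before Table 5.1 can be sketched as below; the feature vectors (standing in for the projection, transition, occlusion, and dot features) are generic placeholder arrays, and k = 3 is an assumed choice rather than the value used in [104].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder character features: each row stands for projection / transition /
# occlusion / dot measurements extracted from one segmented character.
rng = np.random.default_rng(1)
train_features = rng.random((60, 12))
train_labels = rng.integers(0, 28, size=60)      # 28 hypothetical character classes

knn = KNeighborsClassifier(n_neighbors=3)        # k is an assumed parameter
knn.fit(train_features, train_labels)

test_features = rng.random((4, 12))
print(knn.predict(test_features))                # predicted character classes
```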

Another method, for text localization in Farsi script, is presented by [87]. Candidate text is detected using edge and color information, features are extracted with a wavelet coefficient histogram technique, and an SVM is used to classify text versus non-text patterns. Since no benchmark dataset is available for Farsi, they also propose a new Farsi scene text dataset. Their experiments are split into localization and feature-comparison parts, and performance is evaluated by calculating precision, recall, and f-measure separately for each. Evaluating HOG, the wavelet coefficient histogram, and their combination, they report precision values of 76.0%, 62.6%, and 80.8%, recall values of 76.0%, 71.5%, and 29.4%, and f-scores of 76.0%, 83.3%, and 86.5%, respectively. The work presented in this book measures character, word, and line recognition rates on the acquired dataset images and compares its results with recently proposed work in Table 5.2.


Table 5.2 Comparison of the recognition results of [88] with recently proposed Arabic dataset evaluations

Study | Methods | Dataset | CRR | WRR | LRR
Yousfi et al. [82] | ConvNets+BLSTM | ALIF_Test1 | 94.36 | 71.26 | 55.03
Yousfi et al. [82] | ConvNets+BLSTM | ALIF_Test2 | 90.71 | 65.67 | 44.90
Tounsi et al. [28] | Sparse coding | ARASTEC (Char74k15) | 73.1 | NR | NR
Tounsi et al. [28] | Sparse coding | ICDAR03-15 | 75.3 | NR | NR
Ahmed et al. [93] | ConvNets | EASTR | 85.3 | NR | NR
Ahmed et al. [88] | SIFT+MSER+MDLSTM | ASTR-27k | 96.32 | 94.01 | 75.20

*NR = Not Reported; CRR/WRR/LRR = character/word/line recognition rate
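Returning to the text versus non-text classification stage of [87] described before Table 5.2, the snippet below trains an SVM on HOG descriptors as a loose sketch of that idea. The HOG parameters, patch size, and the use of scikit-image and scikit-learn are assumptions for illustration only, and the wavelet coefficient histogram features of [87] are not reproduced here.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_descriptor(patch):
    """HOG descriptor of a grayscale patch; cell/block sizes are illustrative."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# Placeholder data: 32x32 grayscale patches with labels 1 = text, 0 = non-text.
rng = np.random.default_rng(0)
patches = rng.random((40, 32, 32))
labels = rng.integers(0, 2, size=40)

X = np.array([hog_descriptor(p) for p in patches])
clf = SVC(kernel="rbf").fit(X, labels)

print(clf.predict(X[:5]))  # predicted text / non-text labels
```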

5.1.1 Comparison with ICDAR Competitions

This section presents the ICDAR competition results and compares their performance with that of the proposed technique.

ICDAR 2003 Robust Reading Competition: The Robust Reading competition was organized by ICDAR in 2003. The aim was to locate text regions and to recognize the extracted text. Several datasets of Latin script were provided as zip files containing JPEG images. The images were divided into a sample set, a training set, and a test set: the sample set contains 20 images and gives a quick impression of the testing software, 258 images were selected for the training set, and 251 images were used for testing. The text images were further decomposed into words and characters. The challenge was robust reading, that is, locating text in the given samples. The ICDAR 2003 competition was partially successful for text localization but not for the other tasks such as word or text recognition. The submitted systems were evaluated in terms of their strengths and weaknesses: some systems were more accurate but less time-efficient than others, and illumination and other light sources caused significant recognition problems. The main focus of the competition was locating text in an image, while recognition of the located text remained an open research problem. Table 5.3, reproduced from [35], details the submitted systems and their performance on the same set of samples; the results were compiled for text localization in scene images.

Table 5.3 Results of the ICDAR 2003 text locating competition

System | Precision | Recall | f-measure | t(s)
Ashida | 0.55 | 0.46 | 0.50 | 8.7
HWDavid | 0.44 | 0.46 | 0.45 | 0.3
Wolf | 0.30 | 0.44 | 0.35 | 17.0
Todoran | 0.19 | 0.18 | 0.18 | 0.3
Full | 0.1 | 0.06 | 0.08 | 0.2

ICDAR 2005 Robust Reading Competition: Another edition of the ICDAR robust reading challenge was organized in 2005, as explained by [22]. Four competitions were organized, focusing on character recognition, word recognition, text localization, and text reading. The participants had to provide a running version of their submitted systems for independent testing. Training samples were available for the character recognition competition only; the character dataset was provided as grayscale images and includes digits as well as uppercase and lowercase characters with their ground truth. The main point of concern here is to compare the performance of the systems submitted to ICDAR with the solution proposed by [88], whose focus is not text localization but scene word and character recognition. In ICDAR 2005, text localization results were reported in the same way as in ICDAR 2003.

ICDAR 2011 Robust Reading Competition, Challenge 2: Challenge 2 of ICDAR 2011 was designed to assess the presented approaches to text detection and recognition in scene images, as described by [26]. Text reading in scene images includes text localization and word recognition tasks. The dataset presented in ICDAR 2003 and 2005 was extended with more transcriptions. The organizers pointed out shortcomings such as files with missing ground truth information, vague interpretation of special symbols, and loose bounding boxes around the words; these issues prompted them to prepare the ground truth again from scratch. In addition, camera-captured images were included in the dataset. The final dataset consists of 485 images in which text appears with different variations of color, texture, illumination, and size. Nine text localization methods were measured in ICDAR 2011 Challenge 2, while three entries were submitted for the word recognition task. No text localization method is proposed here; instead, the concern is to compare the word recognition accuracy with the methods presented in the ICDAR 2011 challenge, as summarized in Table 5.4.

ICDAR 2013 Robust Reading Competition: In ICDAR 2013, a multiscript robust reading competition was organized, as reported by [12]. The tasks were confined to text localization, segmentation, and recognition, and the submitted methods were evaluated on Chinese, English, Hindi, and Kannada scripts, with word recognition rates reported on English and Kannada scene word images. The goal was to find script-independent techniques for scene text localization, segmentation, and recognition. For the word recognition task, English and Kannada scripts were evaluated.


Table 5.4 Comparison of ICDAR 2011 word recognition results on English with the proposed method

Method | Correct recognition (%)
TH-OCR system | 41.2
KAIST AIPR system | 35.6
Neumann's method | 33.11
Ahmed et al. [88] method | 97.52

Table 5.5 Comparison of ICDAR 2013 multiscript word recognition results with the proposed method

Method | English | Kannada | Arabic
Benchmark | 57.7 | 11.1 | NR
PLT | 46.9 | 5.3 | NR
MAPS | 46.9 | 4.9 | NR
NESP | 45.1 | 5.8 | NR
Baseline (raw image) | 37.5 | 2.5 | NR
Ahmed et al. [88] | 97.52 | NR | 94.01

*NR = Not Reported

Although no submissions were received for this task in particular, the organizers applied the submitted techniques to their own dataset, as explained in their manuscript; details about the submitted methods can be found in the ICDAR 2013 submission reports. Here, the proposed method is compared with their obtained results, as summarized in Table 5.5.

ICDAR 2015 Robust Reading Competition: Challenge 2 of ICDAR 2015 focused on camera-captured, focused scene text, as explained by [33]. In addition to the previous tasks, an end-to-end system performance task was introduced in the ICDAR 2015 competition, with the ground truth defined at the word level. As reported, most of the presented techniques used Maximally Stable Extremal Regions (MSER) for text localization, whereas the top-performing methods in the competition used commercial OCRs for recognition. Compared with the competitions of previous years, ICDAR 2015 attracted a significantly larger number of researchers, which reflects the growing interest in scene text analysis, and a noteworthy increase in recognition rates was observed for the submitted methods. The dataset is distinguished on the basis of the contextual complexity of the vocabulary words. The eight submitted techniques were evaluated as described and are summarized in Table 5.6.

The ICDAR 2017 robust reading competition, its fifth edition, was held in November 2017. The organizers offered numerous challenges, among them a multilingual scene text detection and script identification challenge. Although the competition closed in March 2018, its results have not been published yet; therefore, the yielded results cannot be compared with them (Fig. 5.2).

In the presented work, the problems are highlighted and solutions are sketched to address the complications that exist in cursive text recognition in natural images. A comprehensive benchmark dataset for Arabic scene text recognition is also prepared, divided into text lines, words, and characters.


Table 5.6 Comparison of ICDAR 2015 word recognition results with the proposed method by [88]

Method | F-measure (%)
Ahmed et al. [88] | 92.01
VGGMaxBBNet | 86.18
Stradivision-1 | 81.28
Baseline (Text spotter) | 77.02
Deep2Text-I | 45.1
MSER-MRF | 71.13
Beam search CUNI | 63.2
Baseline (OpenCV + Tesseract) | 59.47
Beam search CUNI + S | 26.38

Fig. 5.2 Comparison of ICDAR 2015 submitted approaches with Ahmed et al.’s [88] methodology

The multilingual appearance of the captured samples makes it possible to prepare a dataset for Latin script alongside Arabic. In this way, a large collection of character and word representations is obtained without any specific font style. A dataset of Arabic numerals appearing in natural images is also compiled. The Arabic samples are represented in ASTR, while the English samples are included in ESTR. This chapter also discussed the presented hybrid feature extraction approach, by which the proposed methodology considers the relevant regions and the invariant points occurring in those regions. The validation and verification of the proposed dataset are performed with MDLSTM networks


because of their strong sequence-learning ability. The reported experimental analysis covers the ASTR dataset only, and accuracy is computed through recall and precision. The presented work achieves state-of-the-art results in comparison with recently reported work and with the work presented in the ICDAR competitions, and it can be considered a benchmark effort given the scarcity of work relevant to Arabic script.
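The hybrid feature idea summarized above, keeping SIFT key-points that fall inside detected extremal regions, can be sketched with OpenCV as follows. This is an assumed, simplified rendering of the combination, not the exact pipeline of [88]; the file name is hypothetical, and SIFT_create may require a recent OpenCV build (4.4 or later, or the contrib package).

```python
import cv2
import numpy as np

def hybrid_sift_mser_keypoints(gray):
    """Keep SIFT key-points that fall inside MSER bounding boxes.
    A simplified sketch of combining the two detectors."""
    mser = cv2.MSER_create()
    regions, boxes = mser.detectRegions(gray)        # candidate extremal regions

    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)

    kept_kp, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        x, y = kp.pt
        inside = any(bx <= x <= bx + bw and by <= y <= by + bh
                     for bx, by, bw, bh in boxes)
        if inside:                                    # key-point lies in a region
            kept_kp.append(kp)
            kept_desc.append(desc)
    return kept_kp, np.array(kept_desc)

img = cv2.imread("scene_word.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
kps, descs = hybrid_sift_mser_keypoints(img)
print(len(kps), descs.shape)
```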

Chapter 6

Conclusion and Future Work

This book presents a substantial contribution to the field of document image analysis, specifically to cursive scene text recognition. The complexities of Arabic script have been highlighted, and the potential of RNNs is exploited because of their context-learning ability. RNN-based statistical models were adapted to support the learning of complex Arabic script. Among cursive scripts, numerous works have been presented for Chinese and Japanese characters, but Arabic script has not been exposed to state-of-the-art research methods; the few efforts reported to date are clearly not enough to address its complexities. Scene text recognition can be categorized as a subproblem of Optical Character Recognition (OCR), although the appearance of the samples differs considerably. Likewise, there is a large difference between handwritten and printed Arabic text: printed text is cleaner and clearer, whereas handwritten text is dominated by writing styles along with the implicit noise attached to scanned documents.

This book discussed several contributions proposing novel solutions for Arabic scene and handwritten text. The contribution is divided into four parts that address the raised issues. First, there is no publicly available dataset for Arabic scene text; the few works presented in this direction are at an initial stage. The dataset challenge is overcome by proposing a novel benchmark dataset for Arabic scene text. The camera-captured text images were taken from the university precinct, advertisement boards, brochures, and commodity wrappers. In Gulf countries, most text is presented in a bilingual format, which encouraged the preparation of a dataset for Latin script in parallel. In this context, the English–Arabic Scene Text Recognition (EASTR-42k) dataset is presented.

Second, this book provides details about highly accurate RNN-based LSTM networks that are assessed and adapted while taking into consideration the context, which is an integral part of Arabic script understanding. For instance, models were developed to train on hybrid features extracted with a combination of SIFT and MSER feature extraction approaches. The experimental studies suggest that the derived solutions are largely consistent with the representation of Arabic text. However, a few


samples contributed to deviations in the results. The strength of Convolutional Neural Networks (ConvNets) is also explored for Arabic scene text recognition. Image pyramids were used, and the image representation is enhanced with five kernels. The LSTM network model, with its multidimensional capability, is adapted correspondingly: MDLSTM networks are trained on each pyramid level separately, and it was shown that the model can be trained on an image pyramid of any size. The studies show that this analysis of a large amount of data produces good results despite the implicit complexities of the script.

As a third contribution, handwritten Arabic text is evaluated with the suggested LSTM network architecture. The focus was on the Urdu language as a variant of Arabic script, with handwritten samples collected from more than 700 individuals. The collected Urdu dataset is not a major contribution of the presented work, since it had been compiled by the authors' research group before the experiments on Arabic scene text were conducted. The transfer learning ability across two handwritten datasets is evaluated and assessed with adapted MDLSTM networks that delegate training to sub-sampling layers. The conducted experiments demonstrated encouraging results, supporting the idea that learning at different levels can yield good results for complicated script recognition.

The fourth major contribution is to exploit the structure of the LSTM network in a hierarchical manner by reducing the size of the input sequences at each level. The hierarchical model was initially implemented by [56] and experimented on handwritten Arabic script; later, the same model was evaluated on printed Urdu text by [27]. This book also discussed the effectiveness of hierarchical sub-sampling, exploited with parameters similar to those used before but on Arabic scene text samples. The hierarchical sub-sampling approach is investigated in depth by adapting the basic architecture, and the empirically selected parameters help produce benchmark results. The prime motive of this book is to provide the foundations of Arabic scene text and to highlight the important solutions suggested for the development of reliable Arabic scene text recognition systems.
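As a minimal sketch of the image-pyramid preprocessing mentioned in the second contribution above, the following builds a Gaussian pyramid with OpenCV; the number of levels, the file name, and the use of cv2.pyrDown are assumptions for illustration, and the five-kernel enhancement and per-level MDLSTM training described in the book are not reproduced here.

```python
import cv2

def gaussian_pyramid(image, levels=3):
    """Return a list of progressively down-sampled (Gaussian-blurred) images,
    one entry per pyramid level; each level could feed a separate classifier."""
    pyramid = [image]
    for _ in range(levels - 1):
        image = cv2.pyrDown(image)   # blur + halve each spatial dimension
        pyramid.append(image)
    return pyramid

img = cv2.imread("arabic_text_line.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
for level, im in enumerate(gaussian_pyramid(img, levels=3)):
    print(level, im.shape)
```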

The development of methodologies and approaches is still in its infancy and only provides support for adapting the implementation. Although the presented solutions can be considered a benchmark, there are many ways in which they can be improved and innovative solutions introduced. Ideally, contextual-learning-based predictive models would be developed and used while keeping the complexity of Arabic script at the forefront. Explicit segmentation-based approaches cannot produce accurate results on scripts like Arabic; instead, the emphasis is on implicit segmentation-based approaches, which help to mitigate the error rate. Routinely developed contextual models can support reliable and effective scene text recognition systems, but they require intensive resources in terms of human effort and efficient implementation; thus, automating some aspects such as implicit segmentation and the adaptation of contextual learning models


would be extremely helpful. Considerable work is still required even after developing and deploying the models, with the intention of fully understanding and validating the design principles gleaned from them. A possible extension and commercial application of this work is to develop a robust scene text recognition system that assists visually impaired individuals. To realize complicated scene text recognition systems, the development of automated techniques needs particular attention. Implicit segmentation is a suitable choice for contextual languages; when developing systems other than LSTM-based recognition, the segmentation technique must be designed carefully, as it is regarded as a foundational step of the recognition system. The nature of Arabic and Arabic-like scripts needs to be understood in detail for the development of reliable solutions, and techniques can be proposed to handle the challenge of a single character having various representations. Another extension of this work is to propose automated text detection in the presence of non-text regions, termed noise. This book also provided a description of Arabic scene text that was extracted manually; to the best of the information available to date, automated text detection has not yet been proposed for Arabic scene text. Nonetheless, the solutions for scene text recognition presented in this chapter provide benchmark work that highlights issues different from those of conventional OCR approaches.

Appendix

Relevant Description

A.1 Evaluation Metric

The proposed method is evaluated with the same evaluation metrics used by recently reported state-of-the-art work to compute accuracy, as done by [122, 124]. The recognition accuracy is calculated on scale-invariant binary images and image masks. In addition, the match between two images is computed through their common detected extremal regions; extremal regions with intersecting key-points are treated as a feature that is trained on the selected training set. The accuracy is measured by the following equations, where $T_p$ (true positives) counts samples correctly predicted as correct, $F_p$ (false positives) counts wrong patterns incorrectly identified as correct, $T_n$ (true negatives) counts samples correctly predicted as wrong, and $F_n$ (false negatives) counts correct samples incorrectly identified as wrong. The relation among $T_p$, $F_p$, $T_n$, and $F_n$ is expressed by the following equations:

$$\mathit{precision} = \frac{T_p}{T_p + F_p} \qquad (A.1)$$

$$\mathit{recall} = \frac{T_p}{T_p + F_n} \qquad (A.2)$$

After calculating precision and recall, the accuracy of the learned samples is calculated by the f-measure as follows:

$$F_1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \qquad (A.3)$$
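A small self-contained computation of Eqs. (A.1)–(A.3) from raw counts is sketched below; the variable names mirror the equations, and the example counts are invented purely for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (A.1)-(A.3): precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: 90 correct detections, 10 false alarms, 5 misses.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# precision=0.900 recall=0.947 f1=0.923
```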

Another evaluation metric is the Levenshtein distance, also called edit distance, which measures the difference between two strings. The distance between a source string s and a target string t is expressed as the number of deletions, insertions, and substitutions required to transform s into t. The smaller the edit distance between


two strings, the more similar the strings are. The Levenshtein Distance (LD) is commonly used in script analysis, spell checking, and similar tasks. The idea is illustrated with an example: if s is "next" and t is "next", then LD(s, t) = 0 because the characters of the two strings are identical; if s is "next" and t is "nest", then LD(s, t) = 1, since one substitution of "x" by "s" is required.

$$\mathit{Accuracy} = 100 \times \left(1 - \frac{\mathit{Insertions} + \mathit{Deletions} + \mathit{Substitutions}}{\mathit{No.\ of\ test\ set\ transcriptions}}\right) \qquad (A.4)$$

The insertions add to the total length of the target vector, while the deletions are characters removed from the target string after recognition.
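A minimal dynamic-programming implementation of the edit distance and of the accuracy in Eq. (A.4) is sketched below; the example transcriptions are invented, and the "next"/"nest" pair reproduces the LD = 1 case above. The denominator follows Eq. (A.4) literally; character-level variants instead divide by the total number of reference characters.

```python
def levenshtein(s, t):
    """Edit distance between strings s and t (insertions, deletions, substitutions)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def transcription_accuracy(predictions, references):
    """Eq. (A.4): 100 * (1 - total edit operations / number of test transcriptions)."""
    errors = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return 100.0 * (1 - errors / len(references))

print(levenshtein("next", "nest"))                 # 1
print(transcription_accuracy(["next"], ["next"]))  # 100.0
```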

Glossary

Artificial Neural Networks (ANNs) A network system inspired by the structure of the human brain. The ANN is a basic tool in machine learning applications. It consists of an input layer that takes the data, transforms the input via hidden layers of neurons, and maps it to an output layer. Each layer may contain numerous neurons responsible for the computation.

Recurrent Neural Networks (RNN) A type of ANN in which the hidden-layer neurons also establish recurrent connections with themselves in addition to connections to the next layer. In this way they maintain the temporal sequence and are suitable for learning correlated data.

Long Short Term Memory Networks (LSTM) An RNN-based technique proposed to overcome the vanishing gradient problem that arises in typical RNNs when dealing with problems of a complex nature. In an LSTM, the neurons of the hidden layer are replaced by memory blocks comprising an input gate, a forget gate, and an output gate. The computation in the cell (which is part of the memory block) is controlled by three multiplicative units.

Convolutional Neural Networks (ConvNets) Another type of neural network, consisting of learnable neurons and biases. Each neuron at the input layer receives an input, and dot-product operations are performed at convolutional, sub-sampled hidden layers followed by fully connected layers. It is categorized as a deep learning network and is suitable for instance learning rather than context learning like LSTM networks.

Machine Learning A process, defined within artificial intelligence, of learning from given data using statistical techniques. The overall idea is to make systems intelligent enough to make decisions on their own.

Deep Learning A part of machine learning that divides the learning architecture into deeply interconnected neurons using statistical models and tools. The overall motive is to realize artificially intelligent applications.

Optical Character Recognition (OCR) OCR systems perform the mechanical conversion of synthetic, handwritten, and camera-captured text into machine-encoded text.
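For reference, a standard formulation of the LSTM gating described in the entry above can be written as follows; this is the common textbook form, with $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication, and it is not tied to the specific architectures used in this book.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(hidden output)}
\end{aligned}
$$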


Natural Language Processing (NLP) A subfield of artificial intelligence that focuses specifically on text processing using numerous statistical approaches. OCRs may fall under NLP applications.

Scene Text Recognition Scene text recognition systems are specialized systems that process camera-captured scene text images acquired in an uncontrolled environment.

Scale-Invariant Feature Transformation (SIFT) A feature detection approach in computer vision that selects those features which remain invariant under affine transformations.

Maximally Stable Extremal Regions (MSER) A computer vision method used to detect blobs in an image; the detected blobs can contain textual information.

Connectionist Temporal Classification (CTC) A neural-network output scoring function for prediction models, which usually resides before the output layer so that the learned labels can be predicted. It is widely used as a replacement for explicit alignment or language modeling in supervised sequence learning.

Fully Convolutional Networks (FCN) As the name implies, an FCN is a normal ConvNet model in which the last fully connected layer of the architecture is replaced by another convolutional layer with large receptive fields, in order to capture the global aspect of the scene text.

References

1. R. Odate, H. Goto, Highly-accurate fast candidate reduction method for Japanese/Chinese character recognition, in ICIP (2016), pp. 2886–2890. ISBN: 978-1-4673-9961-6 2. Z.Y. Zhang, L.W. Jin, K. Ding, X. Gao, Character-SIFT: a novel feature for offline handwritten chinese character recognition, in ICDAR (2009), pp. 763–767 3. S.B. Ahmed, S. Naz, M.I. Razzak, S.F. Rashid, M.Z. Afzal, T.M. Breuel, Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Computing and Applications 27, 603–613 (2016) 4. S. Naz, A.I. Umar, R. Ahmad, I. Siddiqi, S.B. Ahmed, M.I. Razzak, F. Shafait, Urdu Nastaliq recognition using convolutional-recursive deep learning, in Neurocomputing, vol. 243 (2017), pp. 80–87, http://dblp.uni-trier.de/db/journals/ijon/ijon243.html#NazUASARS17 5. S.B. Ahmed, S. Naz, M.I. Razzak, R. Yusof, T.M. Breuel, Balinese character recognition using bidirectional LSTM classifier, in Advances in Machine Learning and Signal Processing: Proceedings of MALSIP 2015 (Springer International Publishing, Berlin, 2015), pp. 201–211. https://doi.org/10.1007/978-3-319-32213-1_18 6. J. van Beusekom, F. Shafait, T.M. Breuel, Combined orientation and skew detection usinggeometric text-line modeling, in IJDAR, vol. 13 (2010), pp. 79–92, http://dblp.uni-trier.de/db/ journals/ijdar/ijdar13.html#BeusekomSB10 7. C.L. Tan, B. Yuan, C.H. Ang, Agent-based text extraction from pyramid images, in International Conference on Advances in Pattern Recognition (Springer, London, 1999), pp. 344–352. ISBN:978-1-4471-0833-7 8. S. Naz, A.I. Umar, S.H. Shirazi, S.B. Ahmed, M.I. Razzak, I. Siddiqi, Segmentation techniques for recognition of Arabic-like scripts: a comprehensive survey. Educ. Inf. Technol. 21(5), 1225–1241 (2016) 9. M.B. Halima, H. Karray, A.M. Alimi, A.F. Vila, NF-SAVO: neuro-fuzzy system for arabic video OCR. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 3(10) (2012). https://doi.org/10.14569/ IJACSA.2012.031022 10. A. Graves, S. Fernández, F.J. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25–29, 2006, ACM International Conference Proceeding Series, vol. 148 (2006), pp. 369–375. ISBN 1-59593-383-2, http://dblp.uni-trier.de/db/conf/icml/icml2006. html#GravesFGS06 11. S. Naz, A.I. Umar, R. Ahmad, S.B. Ahmed, S.H. Shirazi, I. Siddiqi, M.I. Razzak, Offline cursive Urdu-Nastaliq script recognition using multidimensional recurrent neural networks. Neurocomputing 177, 228–241 (2016) © Springer Nature Singapore Pte Ltd. 2020 S. B. Ahmed et al., Cursive Script Text Recognition in Natural Scene Images, https://doi.org/10.1007/978-981-15-1297-1


12. A. Ul-Hasan, S.B. Ahmed, S.F. Rashid, F. Shafait, T.M. Breuel, Offline printed urdu nastaleeq script recognition with bidirectional LSTM networks, in ICDAR 2013 (IEEE Computer Society, 2013), pp. 1061–1065. ISBN: 978-0-7695-4999-6 13. F. Parwej, An empirical evaluation of off-line arabic handwriting and printed characters recognition system. Int. J. Comput. Sci. Issues 9(6), 29–35 (2012). ISSN: 1694-0784 14. S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, S.A. Madani, S.U. Khan, The optical character recognition of Urdu-like cursive scripts. Pattern Recognit. 47, 1229–1248 (2014), http://dblp. uni-trier.de/db/journals/pr/pr47.html#NazHRAMK14 15. J. Fabrizio, B. Marcotegui, M. Cord, Text segmentation in natural scenes using togglemapping, in IEEE ICIP 2009 (2009), pp. 2373–2376. ISBN: 978-1-4244-5654-3, http://dblp. uni-trier.de/db/conf/icip/icip2009.html#FabrizioMC09 16. L. Neumann, J. Matas, Scene text localization and recognition with oriented stroke detection, in ICCV 2013 (IEEE Computer Society, 2013), pp. 97–104. ISBN: 978-1-4799-2839-2, http:// dblp.uni-trier.de/db/conf/iccv/iccv2013.html#NeumannM13 17. A.J. Newell, L.D. Griffin, Multiscale histogram of oriented gradient descriptors for robust character recognition, in ICDAR 2011 (IEEE Computer Society, 2011), pp. 1085–1089. ISBN: 978-1-4577-1350-7, http://dblp.uni-trier.de/db/conf/icdar/icdar2011.html#NewellG11 18. A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 855–868 (2009) 19. K. Jung, K.I. Kim, A.K. Jain, Text information extraction in images and video: a survey, in Pattern Recognition, vol. 37 (2004), http://dblp.uni-trier.de/db/journals/pr/pr37.html#JungKJ04 20. Y.F. Pan, X.W. Hou, C.L. Liu, Text localization in natural scene images based on conditional random field. ICDAR 2009, 6–10 (2009) 21. J. Mao, H. Li, W. Zhou, S. Yan, Q. Tian, Scale based region growing for scene text detection, in ACM Multimedia Conference 2013, pp. 1007–1016. ISBN: 978-1-4503-2404-5, http://dblp. uni-trier.de/db/conf/mm/mm2013.html#MaoLZYT13 22. S.M. Lucas, ICDAR, text locating competition results, in ICDAR (2005), pp. 80–84. https:// doi.org/10.1109/ICDAR.2005.231 23. M. Muja, D.G. Lowe, Fast approximate nearest neighbors with automatic algorithm configuration, in International Conference on Computer Vision Theory and Applications, Lisboa, Portugal, February 5-8, 2009 (2009), pp. 331–340. ISBN: 978-989-8111-69-2, http://dblp. uni-trier.de/db/conf/visapp/visapp2009-1.html#MujaL09 24. A. Shahab, F. Shafait, A. Dengel, ICDAR 2011 robust reading competition challenge 2: reading text in scene images, in ICDAR 2011 (IEEE Computer Society, 2011), pp. 1491–1496. ISBN: 978-1-4577-1350-7, http://dblp.uni-trier.de/db/conf/icdar/icdar2011. html#ShahabSD11a 25. C. Wolf, J.M. Jolion, Object count/area graphs for the evaluation of object detection and segmentation algorithms. Int. J. Doc. Anal. Recognit. 8, 280–296 (2006) 26. A. Shahab, F. Shafait, A. Dengel, ICDAR 2011 robust reading competition challenge 2: reading text in scene image, in ICDAR (IEEE Computer Society, 2011), pp. 1491–1496. ISBN: 9781-4577-1350-7, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6065245 27. S. Naz, A.I. Umar, R. Ahmad, S.B. Ahmed, S.H. Shirazi, M.I. Razzak, Urdu Nasta’liq text recognition system based on multi-dimensional recurrent neural network and statistical features, in Neural Computing and Applications, vol. 
28 (2017), pp. 219–231, http://dblp.unitrier.de/, https://doi.org/10.1007/s00521-015-2051-4 28. M. Tounsi, I. Moalla, A.M. Alimi, F. Lebourgeois, Arabic characters recognition in natural scenes using sparse coding for feature representations, ICDAR, pp. 1036–1040. ISBN: 9781-4799-1805-8, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7321714 29. B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, in CVPR (2015), http://dblp.unitrier.de/db/journals/corr/corr1507.html#ShiBY15 30. L. Gómez, D. Karatzas, A fine-grained approach to scene text script identification, in 12th IAPR Workshop on Document Analysis Systems (DAS) (2016). ISBN: 978-1-5090-1792-8


31. A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D.J. Wu, A.Y. Ng, Text detection and character recognition in scene images with unsupervised feature learning, in ICDAR (IEEE Computer Society, 2011), pp. 440–445. ISBN: 978-1-4577-1350-7 32. O. Boiman, E. Shechtman, M. Irani, In defense of Nearest-Neighbor based image classification, in CVPR (2008), pp. 1–8 33. N. Sharma, R. Mandal, R. Sharma, U. Pal, M. Blumenstein, in ICDAR 2015, Competition on Video Script Identification (CVSI 2015) (IEEE Computer Society, 2015). ISBN: 978-1-47991805-8, http://dblp.uni-trier.de/db/conf/icdar/icdar2015.html#SharmaMSPB15 34. T.E. de Campos, B.R. Babu, M. Varma, Character recognition in natural images, in Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, Lisboa, Portugal, February 5–8, 2009, vol. 2 (2009), pp. 273–280, http://dblp.uni-trier.de/ db/conf/visapp/visapp2009-2.html#CamposBV09 35. S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, in ICDAR 2003 Robust Reading Competitions (2003), pp. 682–687 36. B.A. Olshausen, D.J. Field, Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Springer Nature 381, 607–609 (1996) 37. S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in CVPR (2006), pp. 2169–2178 38. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004) 39. Q. Zheng, K. Chen, Y. Zhou, C. Gu, H. Guan, Text Localization and Recognition in Complex Scenes Using Local Features. Lecture Notes in Computer Science, ACCV, vol. 6494 (Springer, Berlin, 2010), pp. 121–132. ISBN: 978-3-642-19317-0 40. T. Wu, S.P. Ma, Feature extraction by hierarchical overlapped elastic meshing for handwritten Chinese character recognition, in ICDAR (2003), pp. 529–533 41. M. Zarechensky, N. Vassilieva, Text detection in natural scenes with multilingual text, in Proceedings of the Tenth Spring Researchers Colloquium on Database and Information Systems Veliky Novgorod and Russia (2014) 42. M. Tounsi, I. Moalla, A.M. Alimi, F. Lebourgeois, Arabic characters recognition in natural scenes using sparse coding for feature representations, in ICDAR (2015), pp. 1036–1040. ISBN: 978-1-4799-1805-8. http://dblp.uni-trier.de/db/conf/icdar/icdar2015. html#TounsiMAL15 43. C. Yi, X. Yang, Y. Tian, Feature representations for scene text character recognition: a comparative study, in ICDAR (IEEE Computer Society, 2013), pp. 907–911. ISBN: 978-0-76954999-6, http://dblp.uni-trier.de/db/conf/icdar/icdar2013.html#YiYT13 44. A.J. Newell, L.D. Griffin, Multiscale histogram of oriented gradient descriptors for robust character recognition, in ICDAR (IEEE Computer Society, 2011), pp. 1085–1089. ISBN 9781-4577-1350-7, http://dblp.uni-trier.de/db/conf/icdar/icdar2011.html#NewellG11 45. T.E. de Campos, B.R. Babu, M. Varma, Character recognition in natural images, in Proceedings of the International Conference on Computer Vision Theory and Applications (Lisbon, Portugal, 2009) 46. J. Mao, H. Li, W. Zhou, S. Yan, Q. Tian, Scale based region growing for scene text detection, in ACM Multimedia Conference, MM ’13, Barcelona, Spain, October 21–25, 2013 (2013), pp. 1007–1016. ISBN: 978-1-4503-2404-5, http://dblp.uni-trier.de/db/conf/mm/mm2013.html# MaoLZYT13 47. S.B. Ahmed, S. Naz, S. Swati, M.I. 
Razzak, Handwritten urdu character recognition using 1-dimensional BLSTM classifier, in Neural Computing and Applications (2017). http://dblp. uni-trier.de/db/journals/corr/corr1705.html#AhmedNSR17, http://arxiv.org/abs/1705.05455 48. A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, vol. 385 (Springer Book, Berlin, 2012), pp. 1–131. ISBN:978-3-64224796-5, http://dblp.uni-trier.de/, https://doi.org/10.1007/978-3-642-24797-2 49. Q. Zheng, K. Chen, Y. Zhou, C. Gu, H. Guan, Text localization and recognition in complex scenes using local features, in ACCV. Lecture Notes in Computer Sciences, vol. 6494 (2010), pp.121–132. ISBN: 978-3-642-19317-0, http://dblp.uni-trier.de/db/conf/accv/accv2010-3. html#ZhengCZGG10


50. L. Neumann, J. Matas, Scene text localization and recognition with oriented stroke detection, in 2013 IEEE International Conference on Computer Vision (ICCV 2013) (2013), pp. 97–104. ISBN: 978-1-4799-2839-2 51. Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. Bai, Multi-oriented text detection with fully convolutional networks, in CVPR (2016), https://arxiv.org/abs/1604.04018 52. M. Neuhaus, Learning graph edit distance (Master Thesis) (2003) 53. M. Zarechensky, N. Vassilieva, Text detection in natural scenes with multilingual text, in Spring Researchers Colloquium on Database and Information Systems (Veliky Novgorod, Russia, 2014) 54. M.I. Razzak, F. Anwar, S.A. Husain, A. Belaid, M. Sher, HMM and fuzzy logic: a hybrid approach for online Urdu script-based languages character recognition. Knowl.-Based Syst. 23(8), 914–923 55. M. Liwicki, H. Bunke, Feature selection for HMM and BLSTM based handwriting recognition of whiteboard notes, in IJPRAI, vol. 23 (2009) 56. A. Graves, Teaching computers to read and write: recent advances in cursive handwriting recognition and synthesis with recurrent neural networks, in CORIA 2014 - Conférence en Recherche d’Infomations et Applications- 11th French Information Retrieval Conference. CIFED 2014 Colloque International Francophone sur l’Ecrit et le Document, Nancy, France, March 19–23, 2014 (2014), http://dblp.uni-trier.de/db/conf/coria/coria2014.html#Graves14 57. M.B. Halima, H. Karray, A.M. Alimi, Arabic text recognition in video sequences. Int. J. Comput. Linguist. Res. (2013), https://arxiv.org/abs/1308.3243 58. S. Naz, A.L. Umar, A. Ahmed, M.I. Razzak, S.F. Rashid, F. Sheikh, Urdu Nasta’liq Text Recognition Using Implicit Segmentation Based on Multi-Dimensional Long Short Term Memory Neural Networks, vol. 5 (SpringerPlus, 2016). ISSN: 2193-1801, https://doi.org/10.1186/ s40064-016-3442-4 59. C. Yao, X. Bai, B. Shi, W. Liu, Strokelets: a learned multi-scale representation for scene text recognition, in CVPR (IEEE Computer Society, 2014) ISBN: 978-1-4799-5118-5, pp. 4042–4049, http://dblp.uni-trier.de/db/conf/cvpr/cvpr2014.html#YaoBSL14 60. C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, Z. Zhang, Scene text recognition using part-based tree-structured character detection, in CVPR (IEEE Computer Society, 2013), pp. 2961–2968. ISBN: 978-0-7695-4989-7, http://dblp.uni-trier.de/db/conf/cvpr/cvpr2013. html#ShiWXZGZ13 61. P. Shivakumara, R.P. Sreedhar, T.Q. Phan, S. Lu, C.L. Tan, Multioriented video scene text detection through bayesian classification and boundary growing. IEEE Trans. Circuits Syst. 22 (2012) 62. Q.F. Liu, C.K. Jung, S.K. Kim, Y.S. Moon, J.Y. Kim, Stroke filter for text localization in video images, in ICIP (2006), pp. 1473–1476 63. H. Chen, S.S. Tsai, G. Schroth, D.M. Chen, R. Grzeszczuk, B. Girod, Robust text detection in natural images with edge-enhanced maximally stable extremal regions, in 18th IEEE International Conference on Image Processing, ICIP 2011, Brussels, Belgium, September 11–14, 2011 (2011), pp. 2609–2612. ISBN: 978-1-4577-1304-0, http://dblp.uni-trier.de/db/ conf/icip/icip2011.html#ChenTSCGG11 64. L.G. Bigorda, D. Karatzas, Multi-script text extraction from natural scenes, in ICDAR (IEEE Computer Society, 2013), pp. 467–471. ISBN: 978-0-7695-4999-6, http://dblp.uni-trier.de/ db/conf/icdar/icdar2013.html#GomezK13 65. I.Z. Yalniz, D. Gray, R. Manhmatha, Adaptive exploration of text regions in natural scene images, in 13th International Conference on Document Analysis and Recognition (ICDAR) (2013) 66. X.-C. Yin, X. Yin, K. 
Huang, Robust text detection in natural scene images. IEEE Trans. Pattern Anal. Mach. Intell. 36, 970–983 (2014) 67. S.B. Ahmed, S. Naz, S. Swati, M.I. Razzak, A.A. Khan, A.I. Umer, UCOM offline dataset: Aa Urdu handwritten dataset generation. Int. Arab J. Inf. Technol. 14(2) (2014) 68. A. Bissacco, M. Cummins, Y. Netzer, H. Neven, PhotoOCR: reading text in uncontrolled conditions, in ICCV (IEEE Computer Society, 2013), pp. 785–792. ISBN: 978-1-4799-28392, http://dblp.uni-trier.de/db/conf/iccv/iccv2013.html#BissaccoCNN13


69. O. Zayene, S.M. Touj, J. Hennebert, R. Ingold, N.E.B. Amara, Open datasets and tools for Arabic text detection and recognition in news video frames. J. Imaging 4, 32 (2018). http:// dblp.uni-trier.de/, https://doi.org/10.3390/jimaging4020032 70. Y.M. Alginahi, A survey on Arabic character segmentation. IJDAR 16, 105–126 (2013) 71. X. Liu, W. Wang, An effective graph-cut scene text localization with embedded text segmentation. Multimedia Tools Appl. 74, 4891–4906 (2015) 72. L. Neumann, J. Matas, Efficient Scene text localization and recognition with local character refinement, in Document Analysis and Recognition (ICDAR) (2015), pp. 746–750. https:// doi.org/10.1109/ICDAR.2015.7333861, ISSN: 1520-5363 73. B.H. Shekar, M.L. Smitha, Skeleton matching based approach for text localization in scene images, in 8th International Conference on Image and Signal Processing (Elsevier Publications, 2014), pp. 145–153. ISBN: 9789351072522 74. C. Yu, Y. Song, Y. Zhang, Scene text localization using edge analysis and feature pool, in Neurocomputing, vol. 175 (2016), pp. 652–661, http://dblp.uni-trier.de/db/journals/ijon/ ijon175.html#YuSZ16 75. M. Bušta, L. Neumann, J. Matas, FASText: efficient unconstrained scene text detector, in IEEE International Conference on Computer Vision (2016), pp. 1206–1214. ISBN: 978-14673-8391-2 76. L. Gomez, A. Nicolaou, D. Karatzas, Boosting patch-based scene text script identification with ensembles of conjoined networks, in Computer Science - Computer Vision and Pattern Recognition (2016) 77. A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie, COCO-text: dataset and benchmark for text detection and recognition in natural images, in Computer Science - Computer Vision and Pattern Recognition (2016) 78. S.D.M. Raja, Wavelet features based war scene classification using artificial neural networks, in Scene Classification; HAAR and Daubechies Wavelet (2013), http://www.enggjournals. com/ijcse/doc/IJCSE10-02-09-104.pdf 79. S. Tian, U. Bhattacharya, S. Lu, B. Su, Q. Wang, X. Wei, Y. Lu, C.L. Tan, Multilingual scene character recognition with co-occurrence of histogram of oriented gradients, in Pattern Recognition, vol. 51 (2016) 80. T. Pajdla, M. Urban, O. Chum, J. Matas, Robust wide baseline stereo from maximally stable extremal regions, in BMVC (2002) 81. J. Serra, Toggle Mappings, Published in Book Title ‘From Pixels to Features’ (North Holland, 1989), pp. 61–72 82. S. Yousfi, S.-A. Berrani, C. Garcia, ALIF: a dataset for Arabic embedded text recognition in TV broadcast, in ICDAR (IEEE Computer Society, 2015), pp. 1221–1225. ISBN: 978-14799-1805-8, http://dblp.uni-trier.de/db/conf/icdar/icdar2015.html#YousfiBG15a 83. F. Slimane, R. Ingold, S. Kanoun, A. M. Alimi, J. Hennebert, A New Arabic printed text image database and evaluation protocols, in ICDAR 2009, pp. 946–950 84. D. Simian, F. Stoica, Evaluation of a Hybrid Method for Constructing Multiple SVM Kernels (WSEAS Press, 2009), pp. 619–623. ISBN: 978-960-474-099-4 85. A. Veit, T. Matera, L. Neumann, J. Matas, S.J. Belongie, COCO-text: dataset and benchmark for text detection and recognition in natural images (2016), http://dblp.uni-trier.de/db/ journals/corr/corr1601.html#VeitMNMB16, http://arxiv.org/abs/1601.07140 86. C. Yi, Y. Tian, Text extraction from scene images by character appearance and structure modeling, in Computer Vision and Image Understanding, vol. 117 (2013), pp. 182–194, http://dblp.uni-trier.de/db/journals/cviu/cviu117.html#YiT13 87. M. Darab, M. 
Rahmati, A hybrid approach to localize farsi text in natural scene images, in Proceedings of the 3rd International Neural Network Society Winter Conference, INNS-WC 2012, Bangkok, Thailand, October 3–5, 2012, vol. 13 (Elsevier, 2012), pp. 171–184, http:// www.sciencedirect.com/science/journal/18770509/13 88. S.B. Ahmed, S. Naz, M.I. Razzak, R. Yusof, A novel dataset for English-Arabic scene text recognition (EASTR)-42k and its evaluation using invariant feature extraction on detected extremal regions in IEEE Access (under review), 2019


89. S.B. Ahmed, S. Naz, M.I. Razzak, R. Yusof, An effective use of filters for cursive scene text enhancement by adapted convolutional linear approach, in Pattern Analysis and Applications (PAAA) (2019) 90. S.B. Ahmed, M.I. Razzak, R. Yusof, Sub-sampling approach for unconstrained Arabic scene text analysis by impicit segmentation based deep learning classifier, in NCA (under review) 91. H. Bay, T. Tuytelaars, L.J. Van Gool, SURF: speeded up robust features, in ECCV (2006), pp. 404–417 92. E. Tola, A. Fossati, C. Strecha, P. Fua, Large occlusion completion using normal maps, in Asian Conference on Computer Vision (2010) 93. S.B. Ahmed, S. Naz, M.I. Razzak, R. Yousaf, Deep learning based isolated arabic scene character recognition, in 1st Workshop on Arabic Script Analysis and Recognition (2017), http://arxiv.org/abs/1704.06821 94. M. Calonder, V. Lepetit, M. Özuysal, T. Trzcinski, C. Strecha, P. Fua, BRIEF: computing a local binary descriptor very fast. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1281–1298 (2012). https://doi.org/10.1109/TPAMI.2011.222 95. O. Zayene, J. Hennebert, S.M. Touj, R. Ingold, N.E.B. Amara, A dataset for Arabic text detection, tracking and recognition in news videos- AcTiV, in ICDAR (IEEE Computer Society, 2015), pp. 996–1000. ISBN: 978-1-4799-1805-8. http://ieeexplore.ieee.org/ xpl/mostRecentIssue.jsp?punumber=7321714, http://www.computer.org/csdl/proceedings/ icdar/2015/1805/00/index.html 96. M. Ben Halima, H. Karray, A. M. Alimi, Arabic text recognition in video sequences. Int. J. Comput. Linguist. Res. (2013), http://arxiv.org/abs/1308.3243 97. E. Rublee, V. Rabaud, K. Konolige, G.R. Bradski, ORB: an efficient alternative to SIFT or SURF, in IEEE International Conference on Computer Vision ICCV 2011, Barcelona, Spain, November 6–13, 2011 (IEEE Computer Society, 2011), pp. 2564–2571. ISBN: 978-1-45771101-5 98. M. Jain, M. Mathew, C.V. Jawahar, Unconstrained scene text and video text recognition for Arabic script (2017), bibsource: http://dblp.uni-trier.de/db/journals/corr/corr1704.html# AhmedNRY17, http://arxiv.org/abs/1704.06821 99. L. Liu, L. Wang, X. Liu, In defense of soft-assignment coding, in IEEE International Conference on Computer Vision ICCV 2011, Barcelona, Spain, November 6–13, 2011 (IEEE Computer Society, 2011), pp. 2486–2493. ISBN: 978-1-4577-1101-5 100. L. Neumann, J. Matas, Scene text localization and recognition with oriented stroke detection, in 2013 IEEE International Conference on Computer Vision (2013), pp. 97–104. ISBN: 9781-4799-2839-2, https://doi.org/10.1109/ICCV.2013.19 101. S.J. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24, 509–522 (2002) 102. S.B. Ahmed, S. Naz, M.I. Razzak, R. Yusof, Evaluation of handwritten Urdu text by integration of MNIST dataset learning experience, in Neuro Processing Letters (NEPL) 103. A.C. Berg, T.L. Berg, J. Malik, Shape matching and object recognition using low distortion correspondences, in CVPR (2005), pp. 26–33. https://doi.org/10.1109/CVPR.2005.320 104. M.B. Halima, H. Karray, A.M. Alimi, A Comprehensive Method for Arabic Video Text Detection, Localization, Extraction and Recognition. Lecture Notes in Computer Science, vol. 6298 (2010), pp. 648–659, http://dblp.uni-trier.de, https://doi.org/10.1007/978-3-64215696-0_60, ISBN : 978-3-642-15695-3 105. L. Li, S. Yu, L. Zhong, X. Li, Multilingual text detection with nonlinear neural network. Math. Probl. Eng. (2015). https://doi.org/10.1155/2015/431608 106. D.G. 
Lowe, Object recognition from local scale-invariant features, in ICCV (1999), pp. 1150– 1157 107. S. Lazebnik, C. Schmid, J. Ponce, A sparse texture representation using local affine regions. IEEE Trans. Pattern Anal. Mach. Intell. (2005), http://hal.archives-ouvertes.fr/inria00548530/en/ 108. M. Varma, A. Zisserman, Classifying images of materials: achieving viewpoint and illumination independence, in ECCV (2002), pp. III: 255 ff


109. M. Varma, A. Zisserman, Texture classification: are filter banks necessary?, in CVPR (2003), pp. II: 691–698 110. A. Vedaldi, A. Zisserman, Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell. Arch. 34(3), 480–492 (2012) 111. L. Neumann, J. Matas, A Method for Text Localization and Recognition in Real-World Images, ACCV 2010. Springer Lecture Notes in Computer Science (2010), pp. 770–783. ISBN: 9783-642-19317-0, http://dblp.uni-trier.de/db/conf/accv/accv2010-3.html#NeumannM10 112. Yonina C. Eldar, Albert M. Chan, An optimal whitening approach to linear multiuser detection. IEEE Trans. Inf. Theory 49, 2156–2171 (2003), http://dblp.uni-trier.de/db/journals/tit/tit49. html#EldarC03 113. X.-S. Hua, L. Wenyin, H.J. Zhang, An automatic performance evaluation protocol for video text detection algorithms. IEEE Trans. Circuits Syst. Video Tech. 14, 498–507 (2004), http:// dblp.uni-trier.de/db/journals/tcsv/tcsv14.html#HuaWZ04 114. A. John, D.C. O’Connell, S. Kowal, Personal perspective in TV interviews. Pragmatics 12, 257–271 (2002) 115. D. Cameron, Theoretical debates in feminist linguistics: questions of sex and gender, in Gender and Discourse, ed. by R. Wodak (Sage Publications, London, 1997), pp. 99–119 116. D. Cameron, Feminism and Linguistic Theory (St. Martin’s Press, New York, 1985) 117. J. Dod, Effective substances, in The Dictionary of Substances and Their Effects (Royal Society of Chemistry, 1999). Available via DIALOG. http://www.rsc.org/dose/title of subordinate document. Cited 15 Jan 1999 118. C. Suleiman, D.C. O’Connell, S. Kowal, If you and I, if we, in this later day, lose that sacred fire...?’: Perspective in political interviews. J. Psychol. Res. (2002). https://doi.org/10.1023/ A:1015592129296 119. B. Brown, M. Aaron, The politics of nature, in The Rise of Modern Genomics, 3rd edn., ed. by J. Smith (Wiley, New York, 2001) 120. J. Dod, Effective substances, in The Dictionary of Substances and Their Effects (Royal Society of Chemistry, 1999). Available via DIALOG. http://www.rsc.org/dose/ titleofsubordinatedocument. Cited 15 Jan 1999 121. S.B. Ahmed, S. Naz, M.I. Razzak, R. Yusof, T.M. Breuel, Balinese character recognition using bidirectional LSTM classifier, in Advances in Machine Learning and Signal Processing (Springer International Publishing, Berlin, 2016), pp. 201–211 122. B. Epshtein, E. Ofek, Y. Wexler, Detecting Text in Natural Scenes with Stroke Width Transform (IEEE Computer Society, 2010), pp. 2963–2970. ISBN: 978-1-4244-6984-0, http:// ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5521876 123. S. Naz, S.B. Ahmed, R. Ahmad, M.I. Razzak, Zoning Features and 2DLSTM for Urdu Textline Recognition, vol. 96 (Procedia Computer Science, 2016), pp. 16–22, http://dblp.uni-trier. de/db/conf/kes/kes2016.html#NazAAR16 124. S. Tian, U. Bhattacharya, S. Lu, B. Su, Q. Wang, X. Wei, Y. Lu, C.L. Tan, Multilingual scene character recognition with co-occurrence of histogram of oriented gradients, in Pattern Recognition, vol. 51 (2016), pp. 125–134, http://dblp.uni-trier.de/, https://doi.org/10.1016/j. patcog.2015.07.009 125. G. Qiang, T. Dan, L. Guohui, L. Jun, Memory matters: convolutional recurrent neural network for scene text recognition, in Computer Science - Computer Vision and Pattern Recognition (CVPR) (2016) 126. R. Wu, B. Wang, W. Wang, Y. Yu, Harvesting discriminative meta objects with deep CNN features for scene classification, in Computer Science - Computer Vision and Pattern Recognition -2015 (2015) 127. X. Ren, K. 
Chen, J. Sun, A novel scene text detection algorithm based on convolutional neural network, in Computer Science - Computer Vision and Pattern Recognition (IWPR, 2016) 128. M.K. Slifka, J.L. Whitton, Clinical implications of dysregulated cytokine production. J. Mol. Med. (2000). https://doi.org/10.1007/s001090000086 129. J. Smith, M. Jones Jr., L. Houghton et al., Future of health insurance. N. Engl. J. Med. 965, 325–329 (1999) 130. J. South, B. Blass, The Future of Modern Genomics (Blackwell, London, 2001)

Index

A: Adapted convolutional linear, 50; Adjacency relation, 45; Affine transformation, 44; ALIF, 38; Ancient script, 32; Arabian peninsula, 9, 34; Arabic Printed Text Image (APTI), 39; Arabic script, 1; ARASTI, 35; Artificial data, 3; Artificial intelligence, 31; Artificial neural networks, 32

B: Bias value, 55; Bilingual, 34; Binary image, 44

C: Candidate region, 10; Clean document images, 5; Complicated scripts, 6; Component-based method, 9; Conjoined, 6; Connectionist temporal classification, 60; Context learning, 8; Convolutional features, 55; Convolutional layer, 55; Covariant points, 44; Cursive text recognition, 1

D: Deep convolutional neural network, 65; Deep learning, 31; Diacritical marks, 8; Document analysis, 5; DSLR camera, 33; Dynamic programming, 28

E: EASTR dataset, 35; Empirically selected kernels, 53; Extremal regions, 43

F: Feature map, 55; Feedforward tanh layers, 79; Filter size, 55; Filtration methods, 53

G: Gaussian blurred, 47; Gaussian pyramid, 50; Gray scale, 50; Ground truth, 41

H: Hierarchical MDLSTM, 78; Hybrid feature, 58; Hybrid methods, 10

I: ICDAR competitions, 35; Image mask, 44; Image sharpening, 54; Implicit noise, 5; Implicit segmentation, 3; Indian subcontinent, 34; Input activation, 55; Input pattern size, 48; Instance learning, 68; Intersected points, 62; Invariant features, 43; Invariant font styles, 7

J: Joiner and non-joiner, 6

K: Kernel weight, 55; Key-points, 46

L: Laplacian pyramid, 46; Learning classifier, 65; Level of Gaussian, 45; Ligature, 8; Linear attitude, 44; Linear output, 56; Linear pyramid, 50; Logistic sigmoid, 83; Long Short Term Memory (LSTM) networks, 57; LSTM memory units, 82; L2 pooling strategy, 56

M: Machine learning, 31, 32; Machine rendering, 5; Maximally Stable Extremal Region (MSER), 43; Max-pooling strategy, 65; Memory cell, 59; Minima or maxima, 46; Modified National Institute of Standards and Technology (MNIST), 55; Multidimensional Long Short Term Memory (MDLSTM), 28, 58; Multidimensional Recurrent Neural Network (MDRNN), 28; Multilayer Perceptrons (MLPs), 65; Multilingual scene text, 34; Multiplicative units, 58; Mutual activation, 70

N: Naskh, 6; Nastaliq, 6; Natural language processing, 1; Neural activation, 55

O: Octaves, 46; One-dimensional sequence, 60; Optical character recognition, 3; Over-fitting, 32; Overlaid text, 5

P: Penmanship style, 6; Pooling, 56; Precinct, 33; Precision, 76; Precision of text, 44; Pre-trained network, 69; Printed Latin, 5

R: RAST, 40; Recall, 76; Recurrent connection, 57; Recurrent neural networks, 29; RGB values, 52; RNNLIB library, 29; ROC curves, 77

S: Scale and orientation, 47; Scale-Invariant Feature Transformation (SIFT), 45; Scale space, 46; Search-able text, 4; Sequence learning tasks, 28; Sigmoid function, 55; Sliding window approach, 48; Smooth text region, 53; Sobel x-y convolve, 54; Softmax, 29; Softmax layer, 66; Specialized cameras, 3; Specialized device, 3; Stride, 56; Structure of Arabic text, 2; Styles of representation, 6; Sub-sampled layers, 70; Supervised learning, 41; Support Vector Machine (SVM), 87; Synthetic, 5

T: Temporal sequential, 28; Text acquisition methods, 6; Text edge detection, 53; Texture-based method, 9; Trilingual, 34; Two-dimensional arrays, 51

U: Ultra pixel sensor, 33; Unconstrained cursive scripts, 7; Uncontrolled environment, 33, 39; Urdu handwritten samples, 69; Urdu Nastal'iq Handwritten Dataset (UNHD), 55; Urdu scrambled text, 75

V: Vanishing gradient problem, 58

W: Wavelet coefficient histogram, 87; Word recognition, 62; Word segmentation, 1; Word transcription, 74

X: X-height, 53; X-y coordinates, 41

Z: Zero padding, 56