The two-volume set LNCS 13451 and 13452 constitutes the revised selected papers from CICLing 2019, the 20th International Conference on Computational Linguistics and Intelligent Text Processing, which took place in La Rochelle, France, in April 2019.
Table of contents:
Preface
Organization
Contents – Part II
Contents – Part I
Named Entity Recognition
Neural Named Entity Recognition for Kazakh
1 Introduction
2 Related Work
3 Named Entity Features
4 The Neural Networks
4.1 Mapping Words and Tags into Feature Vectors
4.2 Tensor Layer
4.3 Tag Inference
5 Experiments
5.1 Data-Set
5.2 Model Setup
5.3 Results
6 Conclusions
References
An Empirical Data Selection Schema in Annotation Projection Approach
1 Introduction
2 Related Work
3 Method
3.1 Problems of Previous Method
3.2 Our Method
4 Experiments
4.1 Data Sets and Evaluating Methods
4.2 Results
5 Conclusion
References
Toponym Identification in Epidemiology Articles – A Deep Learning Approach
1 Introduction
2 Previous Work
3 Our Proposed Model
3.1 Embedding Layer
3.2 Deep Feed Forward Neural Network
4 Experiments and Results
4.1 Effect of Domain Specific Embeddings
4.2 Effect of Linguistic Features
4.3 Effect of Window Size
4.4 Effect of the Loss Function
4.5 Use of Lemmas
5 Discussion
6 Conclusion and Future Work
References
Named Entity Recognition by Character-Based Word Classification Using a Domain Specific Dictionary
1 Introduction
2 Related Work
3 Baseline Method
4 Proposed Method
5 Experiments
5.1 Datasets
5.2 Methods
5.3 Pre-trained Word Embeddings
5.4 Experimental Results and Discussion
6 Conclusion
References
Cold Is a Disease and D-cold Is a Drug: Identifying Biological Types of Entities in the Biomedical Domain
1 Introduction
2 Related Work
3 Dataset
4 Approach
4.1 Ontology Creation
4.2 Algorithm: Identify Entity with Its Biological Type
5 Experimental Setup and Results
6 Conclusion
References
A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition
1 Introduction
2 Related Work
2.1 General and Domain-Specific NER
2.2 Types of Supervision in NER
2.3 Unsupervised Word Segmentation and Part-of-Speech Induction
3 Proposed Method
3.1 Task Setting
3.2 Model Overview
3.3 Semi-Markov CRF with a Partially Labeled Corpus
3.4 PYHSMM
3.5 PYHSCRF
4 Experiments
4.1 Data
4.2 Training Settings
4.3 Baselines
4.4 Results and Discussion
5 Conclusion
References
Semantics and Text Similarity
Spectral Text Similarity Measures
1 Introduction
2 Related Work
3 Similarity Measure/Matrix Norms
4 Document Similarity Measure Based on the Spectral Radius
5 Spectral Norm
6 Application Scenarios
6.1 Market Segmentation
6.2 Translation Matching
7 Evaluation
8 Discussion
9 Supervised Learning
10 Conclusion
A Example Contest Answer
References
A Computational Approach to Measuring the Semantic Divergence of Cognates
1 Introduction
1.1 Related Work
1.2 Contributions
2 The Method
2.1 Cross-Lingual Word Embeddings
2.2 Cross-Language Semantic Divergence
2.3 Detection and Correction of False Friends
3 Conclusions
References
Triangulation as a Research Method in Experimental Linguistics
1 Introduction
2 Methodology
2.1 Semantic Research and Experiment
2.2 Expert Evaluation Method in Linguistic Experiment
3 Conclusions
References
Understanding Interpersonal Variations in Word Meanings via Review Target Identification
1 Introduction
2 Related Work
3 Personalized Word Embeddings
3.1 Reviewer-Specific Layers for Personalization
3.2 Reviewer-Universal Layers
3.3 Multi-task Learning of Target Attribute Predictions for Stable Training
3.4 Training
4 Experiments
4.1 Settings
4.2 Overall Results
4.3 Analysis
5 Conclusions
References
Semantic Roles in VerbNet and FrameNet: Statistical Analysis and Evaluation
1 Introduction
2 VerbNet and FrameNet as Linguistic Resources for Analysis
2.1 VerbNet
2.2 FrameNet
2.3 VerbNet and FrameNet in Comparison
3 Basic Statistical Analysis
4 Advanced Statistical Analysis
4.1 Distribution of Verbs per Class in VN and FN
4.2 Distribution of Roles per Class in VN and FN
4.3 General Analysis and Evaluation
5 Hybrid Role-Scalar Approach
5.1 Hypothesis: Roles Are Not Sufficient for Verb Representation
5.2 Scale Representation
6 Conclusion
References
Sentiment Analysis
Fusing Phonetic Features and Chinese Character Representation for Sentiment Analysis
1 Introduction
2 Related Work
2.1 General Embedding
2.2 Chinese Embedding
3 Model
3.1 Textual Embedding
3.2 Training Visual Features
3.3 Learning Phonetic Features
3.4 Sentence Modeling
3.5 Fusion of Modalities
4 Experiments and Results
4.1 Experimental Setup
4.2 Experiments on Unimodality
4.3 Experiments on Fusion of Modalities
4.4 Validating Phonetic Feature
4.5 Visualization of the Representation
4.6 Who Contributes to the Improvement?
5 Conclusion
References
Sentiment-Aware Recommendation System for Healthcare Using Social Media
1 Introduction
1.1 Problem Definition
1.2 Motivation
1.3 Contributions
2 Related Works
3 Proposed Framework
3.1 Sentiment Classification
3.2 Top-N Similar Posts Retrieval
3.3 Treatment Suggestion
4 Dataset and Experimental Setup
4.1 Forum Dataset
4.2 Word Embeddings
4.3 Tools Used and Preprocessing
4.4 UMLS Concept Retrieval
4.5 Relevance Judgement for Similar Post Retrieval
5 Experimental Results and Analysis
5.1 Sentiment Classification
5.2 Top-N Similar Post Retrieval
5.3 Treatment Suggestion
6 Conclusion and Future Work
References
Sentiment Analysis Through Finite State Automata
1 Introduction
2 State of the Art
3 Methodology
3.1 Local Grammars and Finite-State Automata
3.2 Sentita and Its Manually-Built Resources
4 Morphology
5 Syntax
5.1 Opinionated Idioms
5.2 Negation
5.3 Intensification
5.4 Modality
5.5 Comparison
5.6 Other Sentiment Expressions
6 Conclusion
References
Using Cognitive Learning Method to Analyze Aggression in Social Media Text
1 Introduction
2 Related Work
3 Methodology
3.1 Dataset
3.2 Pre-processing
3.3 Feature Extraction
4 Experiments and Results
4.1 Experimental Setup
4.2 Result
4.3 Discussion and Analysis
5 Conclusion and Future Work
References
Opinion Spam Detection with Attention-Based LSTM Networks
1 Introduction
2 Related Work
2.1 Opinion Spam Detection
2.2 Deep Learning for Sentiment Analysis
2.3 Attention Mechanisms
3 Methodology
3.1 Attention-Based LSTM Model
4 Experiments
5 Results and Analysis
5.1 All Three-Domain Results
5.2 In-domain Results
5.3 Cross-domain Results
5.4 Comparison with Previous Work
6 Conclusion and Future Work
References
Multi-task Learning for Detecting Stance in Tweets
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Task Formulation
3.2 Multi-task Learning
3.3 Model Details
4 Experiments
4.1 Dataset
4.2 Training Details
4.3 Baselines
4.4 Evaluation Metrics
4.5 Results
4.6 Ablation Study
4.7 Importance of Regularization
4.8 Effect of Regularization Strength (λ)
4.9 Case-Study and Error Analyses
5 Conclusion
References
Related Tasks Can Share! A Multi-task Framework for Affective Language
1 Introduction
2 Related Work
3 Proposed Methodology
3.1 Hand-Crafted Features
3.2 Word Embeddings
4 Experiments and Results
4.1 Dataset
4.2 Preprocessing
4.3 Experiments
4.4 Error Analysis
5 Conclusion
References
Sentiment Analysis and Sentence Classification in Long Book-Search Queries
1 Introduction
2 Related Work
3 User Queries
4 Sentiment Intensity
5 Reviews Language Model
6 Analysing Scores
6.1 Sentiment Intensity, Perplexity and Usefulness Correlation
6.2 Sentiment Intensity, Perplexity and Information Type Correlation
6.3 Graphs Interpretation
7 Conclusion and Future Work
References
Comparative Analyses of Multilingual Sentiment Analysis Systems for News and Social Media
1 Introduction
1.1 Tasks Description
1.2 Systems Overview
2 Related Work
3 Datasets
3.1 Twitter Datasets
3.2 Targeted Entity Sentiment Datasets
3.3 News Tonality Datasets
4 Evaluation and Results
4.1 Baselines
4.2 Twitter Sentiment Analysis
4.3 Tonality in News
4.4 Targeted Sentiment Analysis
4.5 Error Analysis
5 Conclusion
References
Sentiment Analysis of Influential Messages for Political Election Forecasting
1 Introduction
2 Related Works
2.1 Sentiment Analysis
2.2 Election Forecasting Approaches
3 Proposed Method
3.1 Data Collection
3.2 Feature Generation
3.3 Influential Classifier Construction
3.4 Election Outcome Prediction Model
4 Results and Findings
4.1 Learning Quality
4.2 Features Quality
4.3 Predicting Election Outcome Quality
5 Conclusion
References
Basic and Depression Specific Emotions Identification in Tweets: Multi-label Classification Experiments
1 Introduction
1.1 Emotion Modeling
1.2 Multi-label Emotion Mining Approaches
1.3 Problem Transformation Methods
1.4 Algorithmic Adaptation Methods
2 Baseline Models
3 Experiment Models
3.1 A Cost Sensitive RankSVM Model
3.2 A Deep Learning Model
3.3 Loss Function Choices
4 Experiments
4.1 Data Set Preparation
4.2 Feature Sets
4.3 Evaluation Metrics
4.4 Quantifying Imbalance in Labelsets
4.5 Micro/Macro F-Measures
5 Results Analysis
5.1 Performance with Regard to F-Measures
5.2 Performance with Regard to Data Imbalance
5.3 Confusion Matrices
6 Conclusion and Future Work
References
Generating Word and Document Embeddings for Sentiment Analysis
1 Introduction
2 Related Work
3 Methodology
3.1 Corpus-Based Approach
3.2 Dictionary-Based Approach
3.3 Supervised Contextual 4-Scores
3.4 Combination of the Word Embeddings
3.5 Generating Document Vectors
4 Datasets
5 Experiments
5.1 Preprocessing
5.2 Hyperparameters
5.3 Results
6 Conclusion
References
Speech Processing
Speech Emotion Recognition Using Spontaneous Children's Corpus
1 Introduction
2 Methods
2.1 Data
2.2 Feature Selection
2.3 The i-Vector Paradigm
2.4 Classification Approaches
3 Results
4 Conclusion
References
Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances
1 Introduction
1.1 Background
2 Methodology
2.1 Data Collection and Annotation
2.2 Detecting Utterance-Level Intent Types
3 Experiments and Results
3.1 Utterance-Level Intent Detection Experiments
3.2 Slot Filling and Intent Keyword Extraction Experiments
3.3 Speech-to-Text Experiments for AMIE: Training and Testing Models on ASR Outputs
4 Discussion and Conclusion
References
Audio Summarization with Audio Features and Probability Distribution Divergence
1 Introduction
2 Audio Summarization
3 Probability Distribution Divergence for Audio Summarization
3.1 Audio Signal Pre-processing
3.2 Informativeness Model
3.3 Audio Summary Creation
4 Experimental Evaluation
4.1 Results
5 Conclusions
References
Multilingual Speech Emotion Recognition on Japanese, English, and German
1 Introduction
2 Methods
2.1 Emotional Speech Data
2.2 Classification Approaches
2.3 Shifted Delta Cepstral (SDC) Coefficients
2.4 Feature Extraction
2.5 Evaluation Measures
3 Results
3.1 Spoken Language Identification Using Emotional Speech Data
3.2 Emotion Recognition Based on a Two-Level Classification Scheme
3.3 Emotion Recognition Using Multilingual Emotion Models
4 Discussion
5 Conclusions
References
Text Categorization
On the Use of Dependencies in Relation Classification of Text with Deep Learning
1 Introduction
2 A Syntactical Word Embedding Taking into Account Dependencies
3 Two Models for Relation Classification Using Syntactical Dependencies
3.1 A CNN Based Relation Classification Model (CNN)
3.2 A Compositional Word Embedding Based Relation Classification Model (FCM)
4 Experiments
4.1 SemEVAL 2010 Corpus
4.2 Employed Word Embeddings
4.3 Experiments with the CNN Model
4.4 Experiments with the FCM Model
4.5 Discussion
5 Conclusion
References
Multilingual Fake News Detection with Satire
1 Introduction
2 Experimental Framework and Results
2.1 Text Resemblance
2.2 Domain Type Detection
2.3 Classification Results
2.4 Result Analysis
3 Conclusion
References
Active Learning to Select Unlabeled Examples with Effective Features for Document Classification
1 Introduction
2 Related Works
2.1 Active Learning
2.2 Uncertainty Sampling
3 Proposed Method
4 Experiments
4.1 Data Set
4.2 Experiments on Active Learning
4.3 Experimental Results
5 Conclusion
References
Effectiveness of Self Normalizing Neural Networks for Text Classification
1 Introduction
2 Related Work
3 Self-Normalizing Neural Networks
3.1 Input Normalization
3.2 Initialization
3.3 SELU Activations
3.4 Alpha Dropout
4 Model
4.1 Word Embeddings are Not Normalized
4.2 ELU Activation as an Alternative to SELU
4.3 Model Architecture
5 Experiments and Datasets
5.1 Datasets
5.2 Baseline Models
5.3 Model Parameters
5.4 Training
6 Results and Discussion
6.1 Results
6.2 Discussion
7 Conclusion
References
A Study of Text Representations for Hate Speech Detection
1 Introduction
2 Problem Definition
3 Related Work
3.1 Text Representations for Hate Speech
3.2 Classification Approaches
4 Study and Proposed Method
4.1 Text Representations
4.2 Classification Methods
5 Experiments and Results
5.1 Datasets and Experimental Setup
5.2 Results
5.3 Significance Testing
5.4 Discussion
6 Conclusion and Future Work
References
Comparison of Text Classification Methods Using Deep Learning Neural Networks
1 Introduction
2 Related Work
3 Experimental Evaluation and Analysis
3.1 Non-neural Network Approach
3.2 Analysis of the Experiments
3.3 Comparison Tables
4 Conclusion
References
Acquisition of Domain-Specific Senses and Its Extrinsic Evaluation Through Text Categorization
1 Introduction
2 Acquisition of Domain-Specific Senses
3 Application to Text Categorization
4 Experiments
4.1 Acquisition of Senses
4.2 Text Categorization
5 Related Work
6 Conclusion
References
"News Title Can Be Deceptive" Title Body Consistency Detection for News Articles Using Text Entailment
1 Introduction
2 Related Work
3 Methodology
3.1 Multilayer Perceptron Model (MLP)
3.2 Convolutional Neural Networks Model (CNN)
3.3 Long Short-Term Memory Model (LSTM)
3.4 Combined CNN and LSTM Model
3.5 Modeling
4 Experiments
4.1 Data
4.2 Experimental Setup
4.3 Results and Discussion
4.4 Error Analysis
5 Conclusion and Future Work
References
Look Who's Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog
1 Introduction
2 Related Work
3 Conversation Dataset
4 Message Content
5 Groups over Time
6 Conversation Interaction
7 Model
8 Features
9 Experiments
10 Results
11 Conclusion
References
Computing Classifier-Based Embeddings with the Help of Text2ddc
1 Introduction
2 Related Work
3 Model
3.1 Step 1 and 2: Word Sense Disambiguation
3.2 Step 3: Classifier
3.3 Step 4: Classification Scheme
4 Experiment
4.1 Evaluating text2ddc
4.2 Evaluating CaSe
5 Discussion
5.1 Error Analysis
6 Conclusion
References
Text Generation
HanaNLG: A Flexible Hybrid Approach for Natural Language Generation
1 Introduction
2 Related Work
3 HanaNLG: Our Proposed Approach
3.1 Preprocessing
3.2 Vocabulary Selection
3.3 Sentence Generation
3.4 Sentence Ranking
3.5 Sentence Inflection
4 Experiments
4.1 NLG for Assistive Technologies
4.2 NLG for Opinionated Sentences
5 Evaluation and Results
6 Conclusions
References
MorphoGen: Full Inflection Generation Using Recurrent Neural Networks
1 Introduction
2 Datasets
3 MorphoGen Architecture
4 Generation Experiments
5 Results
6 Conclusions
References
EASY: Evaluation System for Summarization
1 Introduction
2 EASY System Design
2.1 Summarization Quality Metrics
2.2 Baselines
3 Implementation Details
3.1 Input Selection
3.2 Metrics
3.3 Baselines
3.4 Correlation of Results
4 Availability and Reproducibility
5 Conclusions
References
Performance of Evaluation Methods Without Human References for Multi-document Text Summarization
1 Introduction
2 Related Work
2.1 ROUGE-N
2.2 ROUGE-L
2.3 ROUGE-S and ROUGE-SU
3 Evaluation Methods
3.1 Manual Methods
3.2 Automatic Methods
4 Proposed Methodology
5 Obtained Results
5.1 Comparison of the State-of-the-Art Evaluation Methods
6 Conclusions and Future Works
References
EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image
1 Introduction
2 Methodology
2.1 Problem Formulation
2.2 EAGLE: An Enhanced Attention-Based Strategy
2.3 Overall Framework
3 Remote Sensing Question Answering Corpus
3.1 Creation Procedure
3.2 Corpus Statistics
4 Experimental Evaluation
4.1 Models Including Ablative Ones
4.2 Dataset
4.3 Evaluation Metrics
4.4 Implementation Details
4.5 Results and Analysis
5 Related Work
5.1 Visual Question Answering (VQA) with Attention
5.2 Associated Datasets
6 Conclusion
References
Text Mining
Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis
1 Introduction
2 Methodology
2.1 Hierarchy of Word Clusters
2.2 Taxonomy-Augmented Features Given a Set of Predefined Words
2.3 Taxonomy-Augmented Features Given the Hierarchy of Word Clusters
3 Experiments
3.1 Datasets
3.2 Experimental Set-Up
3.3 Experimental Results on Document Classification
3.4 Experimental Results on Document Clustering
3.5 Semantic Analysis
4 Conclusion
References
Adversarial Training Based Cross-Lingual Emotion Cause Extraction
1 Introduction
2 Related Work
2.1 Emotion Cause Extraction
2.2 Cross-Lingual Emotion Analysis
3 Model
3.1 Task Definition
3.2 Adversarial Training Based Cross-Lingual ECA Model
4 Experiments
4.1 Data Sets
4.2 Experimental Settings and Evaluation Metrics
4.3 Comparisons of Different Methods
4.4 Comparisons of Different Architectures
4.5 Effects of Sampling Methods
4.6 Effects of Different Attention Hops
5 Conclusion and Future Work
References
Techniques for Jointly Extracting Entities and Relations: A Survey
1 Introduction
2 Problem Definition
3 Motivating Example
4 Overview of Techniques
5 Joint Inference Techniques
6 Joint Models
7 Experimental Evaluation
7.1 Datasets
7.2 Evaluation of End-to-End Relation Extraction
7.3 Domain-Specific Entities and Relations
8 Conclusion
References
Simple Unsupervised Similarity-Based Aspect Extraction
1 Introduction
2 Background and Definitions
3 Related Work
4 Simple Unsupervised Aspect Extraction
5 Experimental Design
6 Results and Discussion
7 Conclusion
References
Streaming State Validation Technique for Textual Big Data Using Apache Flink
1 Introduction
2 Preliminaries
2.1 Stateful Stream Processing
2.2 Why Using Apache Flink?
2.3 Apache Flink System
2.4 Core Concepts
3 Design Framework
4 Implementation and Evaluation
4.1 Implementation Setup
4.2 Design of the Implementation
4.3 Experimental Setup
4.4 Results
4.5 Evaluation
4.6 Evaluation Matrices
4.7 Visualization of Results
5 Conclusions and Future Work
5.1 Conclusion
5.2 Future Work
References
Automatic Extraction of Relevant Keyphrases for the Study of Issue Competition
1 Introduction
2 Related Work
3 Keyphrases Extraction
3.1 Candidate Identification
3.2 Candidate Scoring
3.3 Top n-rank Candidates
4 Experiments
4.1 Evaluation Metric
4.2 Datasets
4.3 Results
5 Key-Phrase Extraction Using Portuguese Parliamentary Debates
5.1 Candidates Selection
5.2 Visualisation
6 Conclusion
A Appendix
References
Author Index
LNCS 13452
Alexander Gelbukh (Ed.)
Computational Linguistics and Intelligent Text Processing 20th International Conference, CICLing 2019 La Rochelle, France, April 7–13, 2019 Revised Selected Papers, Part II
Lecture Notes in Computer Science

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members
Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Moti Yung, Columbia University, New York, NY, USA
13452
More information about this series at https://link.springer.com/bookseries/558
Editor
Alexander Gelbukh
Instituto Politécnico Nacional, Mexico City, Mexico
ISSN 0302-9743  ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-24339-4  ISBN 978-3-031-24340-0 (eBook)
https://doi.org/10.1007/978-3-031-24340-0

© Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
CICLing 2019 was the 20th International Conference on Computational Linguistics and Intelligent Text Processing. The CICLing conferences provide a wide-scope forum for discussion of the art and craft of natural language processing research, as well as the best practices in its applications.

This set of two books contains three invited papers and a selection of regular papers accepted for presentation at the conference. Since 2001, the proceedings of the CICLing conferences have been published in Springer's Lecture Notes in Computer Science series as volumes 2004, 2276, 2588, 2945, 3406, 3878, 4394, 4919, 5449, 6008, 6608, 6609, 7181, 7182, 7816, 7817, 8403, 8404, 9041, 9042, 9623, 9624, 10761, 10762, 13396, and 13397.

The set has been structured into 14 sections representative of the current trends in research and applications of natural language processing: General; Information Extraction; Information Retrieval; Language Modeling; Lexical Resources; Machine Translation; Morphology, Syntax, Parsing; Named Entity Recognition; Semantics and Text Similarity; Sentiment Analysis; Speech Processing; Text Categorization; Text Generation; and Text Mining.

In 2019 our invited speakers were Preslav Nakov (Qatar Computing Research Institute, Qatar), Paolo Rosso (Universidad Politécnica de Valencia, Spain), Lucia Specia (University of Sheffield, UK), and Carlo Strapparava (Fondazione Bruno Kessler, Italy). They delivered excellent extended lectures and organized lively discussions. Full contributions of these invited talks are included in this book set.

After a double-blind peer review process, the Program Committee selected 95 papers for presentation, out of 335 submissions from 60 countries. To encourage authors to provide algorithms and data along with the published papers, we selected three winners of our Verifiability, Reproducibility, and Working Description Award. The main factors in choosing the awarded submissions were technical correctness and completeness, readability of the code and documentation, simplicity of installation and use, and exact correspondence to the claims of the paper. Unnecessary sophistication of the user interface was discouraged; the novelty and usefulness of the results were not evaluated, since these are assessed for the paper itself and not for the data.

The following papers received the Best Paper Awards, the Best Student Paper Award, as well as the Verifiability, Reproducibility, and Working Description Awards, respectively:

Best Verifiability, Reproducibility, and Working Description Award: "Text Analysis of Resumes and Lexical Choice as an Indicator of Creativity", Alexander Rybalov.

Best Student Paper Award: "Look Who's Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog", Charles Welch, Veronica Perez-Rosas, Jonathan Kummerfeld, Rada Mihalcea.
Best Presentation Award: "A Framework to Build Quality into Non-expert Translations", Christopher G. Harris.

Best Poster Award, Winner (Shared): "Sentiment Analysis Through Finite State Automata", Serena Pelosi, Alessandro Maisto, Lorenza Melillo, and Annibale Elia; and "Toponym Identification in Epidemiology Articles: A Deep Learning Approach", Mohammad Reza Davari, Leila Kosseim, Tien D. Bui.

Best Inquisitive Mind Award: given to the attendee who asked the most (good) questions to the presenters during the conference, Natwar Modani.

Best Paper Award, First Place: "Contrastive Reasons Detection and Clustering from Online Polarized Debates", Amine Trabelsi, Osmar Zaiane.

Best Paper Award, Second Place: "Adversarial Training based Cross-lingual Emotion Cause Extraction", Hongyu Yan, Qinghong Gao, Jiachen Du, Binyang Li, Ruifeng Xu.

Best Paper Award, Third Place (Shared): "EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image", Yeyang Zhou, Yixin Chen, Yimin Chen, Shunlong Ye, Mingxin Guo, Ziqi Sha, Heyu Wei, Yanhui Gu, Junsheng Zhou, Weiguang Qu.

Best Paper Award, Third Place (Shared): "dpUGC: Learn Differentially Private Representation for User Generated Contents", Xuan-Son Vu, Son Tran, Lili Jiang.

A conference is the result of the work of many people. First of all, I would like to thank the members of the Program Committee for the time and effort they devoted to the reviewing of the submitted articles and to the selection process. Obviously, I thank the authors for their patience in the preparation of the papers, not to mention the development of the scientific results that form this book. I also express my most cordial thanks to the members of the local Organizing Committee for their considerable contribution to making this conference become a reality.

November 2022
Alexander Gelbukh
Organization
CICLing 2019 (20th International Conference on Computational Linguistics and Intelligent Text Processing) was hosted by the University of La Rochelle (ULR), France, and organized by the L3i laboratory of the University of La Rochelle (ULR), France, in collaboration with the Natural Language and Text Processing Laboratory of the CIC, IPN, the Mexican Society of Artificial Intelligence (SMIA), and the NewsEye project. The NewsEye project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 770299. The conference aims to encourage the exchange of opinions between the scientists working in different areas of the growing field of computational linguistics and intelligent text and speech processing.
Program Chair

Alexander Gelbukh – Instituto Politécnico Nacional, Mexico
Organizing Committee

Antoine Doucet (Chair) – University of La Rochelle, France
Nicolas Sidère (Co-chair) – University of La Rochelle, France
Cyrille Suire (Co-chair) – University of La Rochelle, France
Members

Karell Bertet – L3i Laboratory, University of La Rochelle, France
Mickaël Coustaty – L3i Laboratory, University of La Rochelle, France
Salah Eddine – L3i Laboratory, University of La Rochelle, France
Christophe Rigaud – L3i Laboratory, University of La Rochelle, France
Additional Support

Viviana Beltran – L3i Laboratory, University of La Rochelle, France
Jean-Loup Guillaume – L3i Laboratory, University of La Rochelle, France
Marwa Hamdi – L3i Laboratory, University of La Rochelle, France
Ahmed Hamdi – L3i Laboratory, University of La Rochelle, France
Nam Le – L3i Laboratory, University of La Rochelle, France
Elvys Linhares Pontes – L3i Laboratory, University of La Rochelle, France
Muzzamil Luqman – L3i Laboratory, University of La Rochelle, France
Zuheng Ming – L3i Laboratory, University of La Rochelle, France
Hai Nguyen – L3i Laboratory, University of La Rochelle, France
Armelle Prigent – L3i Laboratory, University of La Rochelle, France
Mourad Rabah – L3i Laboratory, University of La Rochelle, France
Program Committee

Alexander Gelbukh – Instituto Politécnico Nacional, Mexico
Leslie Barrett – Bloomberg, USA
Leila Kosseim – Concordia University, Canada
Aladdin Ayesh – De Montfort University, UK
Srinivas Bangalore – Interactions, USA
Ivandre Paraboni – University of São Paulo, Brazil
Hermann Moisl – Newcastle University, UK
Kais Haddar – MIRACL Laboratory, Faculté des Sciences de Sfax, Tunisia
Cerstin Mahlow – ZHAW Zurich University of Applied Sciences, Switzerland
Alma Kharrat – Microsoft, USA
Dafydd Gibbon – Bielefeld University, Germany
Evangelos Milios – Dalhousie University, Canada
Kjetil Nørvåg – Norwegian University of Science and Technology, Norway
Grigori Sidorov – CIC-IPN, Mexico
Hiram Calvo – Nara Institute of Science and Technology, Japan
Piotr W. Fuglewicz – TiP, Poland
Aminul Islam – University of Louisiana at Lafayette, USA
Michael Carl – Kent State University, USA
Guillaume Jacquet – Joint Research Centre, EU
Suresh Manandhar – University of York, UK
Bente Maegaard – University of Copenhagen, Denmark
Tarık Kişla – Ege University, Turkey
Nick Campbell – Trinity College Dublin, Ireland
Yasunari Harada – Waseda University, Japan
Samhaa El-Beltagy – Newgiza University, Egypt
Anselmo Peñas – NLP & IR Group, UNED, Spain
Paolo Rosso – Universitat Politècnica de València, Spain
Horacio Rodriguez – Universitat Politècnica de Catalunya, Spain
Yannis Haralambous – IMT Atlantique & UMR CNRS 6285 Lab-STICC, France
Niladri Chatterjee – IIT Delhi, India
Manuel Vilares Ferro – University of Vigo, Spain
Eva Hajicova – Charles University, Prague, Czech Republic
Preslav Nakov – Qatar Computing Research Institute, HBKU, Qatar
Bayan Abushawar – Arab Open University, Jordan
Kemal Oflazer – Carnegie Mellon University in Qatar, Qatar
Hatem Haddad – iCompass, Tunisia
Constantin Orasan – University of Wolverhampton, UK
Masaki Murata – Tottori University, Japan
Efstathios Stamatatos – University of the Aegean, Greece
Mike Thelwall – University of Wolverhampton, UK
Stan Szpakowicz – University of Ottawa, Canada
Tunga Gungor – Bogazici University, Turkey
Dunja Mladenic – Jozef Stefan Institute, Slovenia
German Rigau – IXA Group, UPV/EHU, Spain
Roberto Basili – University of Roma Tor Vergata, Italy
Karin Harbusch – University Koblenz-Landau, Germany
Elena Lloret – University of Alicante, Spain
Ruslan Mitkov – University of Wolverhampton, UK
Viktor Pekar – University of Birmingham, UK
Attila Novák – Pázmány Péter Catholic University, Hungary
Horacio Saggion – Universitat Pompeu Fabra, Spain
Soujanya Poria – Nanyang Technological University, Singapore
Rada Mihalcea – University of North Texas, USA
Partha Pakray – National Institute of Technology Silchar, India
Alexander Mehler – Goethe-University Frankfurt am Main, Germany
Octavian Popescu – IBM, USA
Hitoshi Isahara – Toyohashi University of Technology, Japan
Galia Angelova – Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria
Pushpak Bhattacharyya – IIT Bombay, India
Farid Meziane – University of Derby, UK
Ales Horak – Masaryk University, Czech Republic
Nicoletta Calzolari – Istituto di Linguistica Computazionale – CNR, Italy
Milos Jakubicek – Lexical Computing, UK
Ron Kaplan – Nuance Communications, USA
Hassan Sawaf – Amazon, USA
Marta R. Costa-Jussà – Institute for Infocomm Research, Singapore
Sivaji Bandyopadhyay – Jadavpur University, India
Yorick Wilks – University of Sheffield, UK
Vasile Rus – University of Memphis, USA
Christian Boitet – Université Grenoble Alpes, France
Khaled Shaalan – The British University in Dubai, UAE
Philipp Koehn – Johns Hopkins University, USA
Software Reviewing Committee

Ted Pedersen
Florian Holz
Miloš Jakubíček
Sergio Jiménez Vargas
Miikka Silfverberg
Ronald Winnemöller
Best Paper Award Selection Committee

Alexander Gelbukh
Eduard Hovy
Rada Mihalcea
Ted Pedersen
Yorick Wilks
Contents – Part II
Named Entity Recognition

Neural Named Entity Recognition for Kazakh . . . 3
Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev, and Rustam Mussabayev
An Empirical Data Selection Schema in Annotation Projection Approach . . . 16
Yun Hu, Mingxue Liao, Pin Lv, and Changwen Zheng
Toponym Identification in Epidemiology Articles – A Deep Learning Approach . . . 26
MohammadReza Davari, Leila Kosseim, and Tien D. Bui
Named Entity Recognition by Character-Based Word Classification Using a Domain Specific Dictionary . . . 38
Makoto Hiramatsu, Kei Wakabayashi, and Jun Harashima
Cold Is a Disease and D-cold Is a Drug: Identifying Biological Types of Entities in the Biomedical Domain . . . 49
Suyash Sangwan, Raksha Sharma, Girish Palshikar, and Asif Ekbal
A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition . . . 61
Suzushi Tomori, Yugo Murawaki, and Shinsuke Mori
Semantics and Text Similarity

Spectral Text Similarity Measures . . . 81
Tim vor der Brück and Marc Pouly

A Computational Approach to Measuring the Semantic Divergence of Cognates . . . 96
Ana-Sabina Uban, Alina Cristea (Ciobanu), and Liviu P. Dinu
Triangulation as a Research Method in Experimental Linguistics . . . . . . . . . . . . . 109 Olga Suleimanova and Marina Fomina Understanding Interpersonal Variations in Word Meanings via Review Target Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Daisuke Oba, Shoetsu Sato, Naoki Yoshinaga, Satoshi Akasaki, and Masashi Toyoda
Semantic Roles in VerbNet and FrameNet: Statistical Analysis and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Aliaksandr Huminski, Fiona Liausvia, and Arushi Goel Sentiment Analysis Fusing Phonetic Features and Chinese Character Representation for Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Haiyun Peng, Soujanya Poria, Yang Li, and Erik Cambria Sentiment-Aware Recommendation System for Healthcare Using Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 Alan Aipe, N. S. Mukuntha, and Asif Ekbal Sentiment Analysis Through Finite State Automata . . . . . . . . . . . . . . . . . . . . . . . . . 182 Serena Pelosi, Alessandro Maisto, Lorenza Melillo, and Annibale Elia Using Cognitive Learning Method to Analyze Aggression in Social Media Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 Sayef Iqbal and Fazel Keshtkar Opinion Spam Detection with Attention-Based LSTM Networks . . . . . . . . . . . . . 212 Zeinab Sedighi, Hossein Ebrahimpour-Komleh, Ayoub Bagheri, and Leila Kosseim Multi-task Learning for Detecting Stance in Tweets . . . . . . . . . . . . . . . . . . . . . . . . 222 Devamanyu Hazarika, Gangeshwar Krishnamurthy, Soujanya Poria, and Roger Zimmermann Related Tasks Can Share! A Multi-task Framework for Affective Language . . . . 236 Kumar Shikhar Deep, Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya Sentiment Analysis and Sentence Classification in Long Book-Search Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 Amal Htait, Sébastien Fournier, and Patrice Bellot Comparative Analyses of Multilingual Sentiment Analysis Systems for News and Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 Pavel Pˇribáˇn and Alexandra Balahur Sentiment Analysis of Influential Messages for Political Election Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 Oumayma Oueslati, Moez Ben Hajhmida, Habib Ounelli, and Erik Cambria
Basic and Depression Specific Emotions Identification in Tweets: Multi-label Classification Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Nawshad Farruque, Chenyang Huang, Osmar Zaïane, and Randy Goebel Generating Word and Document Embeddings for Sentiment Analysis . . . . . . . . . 307 Cem Rıfkı Aydın, Tunga Güngör, and Ali Erkan Speech Processing Speech Emotion Recognition Using Spontaneous Children’s Corpus . . . . . . . . . . 321 Panikos Heracleous, Yasser Mohammad, Keiji Yasuda, and Akio Yoneyama Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 Eda Okur, Shachi H. Kumar, Saurav Sahay, Asli Arslan Esme, and Lama Nachman Audio Summarization with Audio Features and Probability Distribution Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Carlos-Emiliano González-Gallardo, Romain Deveaud, Eric SanJuan, and Juan-Manuel Torres-Moreno Multilingual Speech Emotion Recognition on Japanese, English, and German . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 Panikos Heracleous, Keiji Yasuda, and Akio Yoneyama Text Categorization On the Use of Dependencies in Relation Classification of Text with Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 Bernard Espinasse, Sébastien Fournier, Adrian Chifu, Gaël Guibon, René Azcurra, and Valentin Mace Multilingual Fake News Detection with Satire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 Gaël Guibon, Liana Ermakova, Hosni Seffih, Anton Firsov, and Guillaume Le Noé-Bienvenu Active Learning to Select Unlabeled Examples with Effective Features for Document Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 Minoru Sasaki Effectiveness of Self Normalizing Neural Networks for Text Classification . . . . . 412 Avinash Madasu and Vijjini Anvesh Rao
A Study of Text Representations for Hate Speech Detection . . . . . . . . . . . . . . . . . 424 Chrysoula Themeli, George Giannakopoulos, and Nikiforos Pittaras Comparison of Text Classification Methods Using Deep Learning Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 Maaz Amjad, Alexander Gelbukh, Ilia Voronkov, and Anna Saenko Acquisition of Domain-Specific Senses and Its Extrinsic Evaluation Through Text Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 Attaporn Wangpoonsarp, Kazuya Shimura, and Fumiyo Fukumoto “News Title Can Be Deceptive” Title Body Consistency Detection for News Articles Using Text Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462 Tanik Saikh, Kingshuk Basak, Asif Ekbal, and Pushpak Bhattacharyya Look Who’s Talking: Inferring Speaker Attributes from Personal Longitudinal Dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 Charles Welch, Verónica Pérez-Rosas, Jonathan K. Kummerfeld, and Rada Mihalcea Computing Classifier-Based Embeddings with the Help of Text2ddc . . . . . . . . . . 491 Tolga Uslu, Alexander Mehler, and Daniel Baumartz Text Generation HanaNLG: A Flexible Hybrid Approach for Natural Language Generation . . . . . 507 Cristina Barros and Elena Lloret MorphoGen: Full Inflection Generation Using Recurrent Neural Networks . . . . . 520 Octavia-Maria S¸ ulea, Steve Young, and Liviu P. Dinu EASY: Evaluation System for Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529 Marina Litvak, Natalia Vanetik, and Yael Veksler Performance of Evaluation Methods Without Human References for Multi-document Text Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546 Alexis Carriola Careaga, Yulia Ledeneva, and Jonathan Rojas Simón EAGLE: An Enhanced Attention-Based Strategy by Generating Answers from Learning Questions to a Remote Sensing Image . . . . . . . . . . . . . . . . . . . . . . . 558 Yeyang Zhou, Yixin Chen, Yimin Chen, Shunlong Ye, Mingxin Guo, Ziqi Sha, Heyu Wei, Yanhui Gu, Junsheng Zhou, and Weiguang Qu
Text Mining Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 Sattar Seifollahi and Massimo Piccardi Adversarial Training Based Cross-Lingual Emotion Cause Extraction . . . . . . . . . 587 Hongyu Yan, Qinghong Gao, Jiachen Du, Binyang Li, and Ruifeng Xu Techniques for Jointly Extracting Entities and Relations: A Survey . . . . . . . . . . . 602 Sachin Pawar, Pushpak Bhattacharyya, and Girish K. Palshikar Simple Unsupervised Similarity-Based Aspect Extraction . . . . . . . . . . . . . . . . . . . 619 Danny Suarez Vargas, Lucas R. C. Pessutto, and Viviane Pereira Moreira Streaming State Validation Technique for Textual Big Data Using Apache Flink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632 Raheela Younas and Amna Qasim Automatic Extraction of Relevant Keyphrases for the Study of Issue Competition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648 Miguel Won, Bruno Martins, and Filipa Raimundo Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
Contents – Part I
General

Visual Aids to the Rescue: Predicting Creativity in Multimodal Artwork . . . 3
Carlo Strapparava, Serra Sinem Tekiroglu, and Gözde Özbal

Knowledge-Based Techniques for Document Fraud Detection: A Comprehensive Study . . . 17
Beatriz Martínez Tornés, Emanuela Boros, Antoine Doucet, Petra Gomez-Krämer, Jean-Marc Ogier, and Vincent Poulain d'Andecy
Exploiting Metonymy from Available Knowledge Resources . . . 34
Itziar Gonzalez-Dios, Javier Álvez, and German Rigau
Robust Evaluation of Language–Brain Encoding Experiments . . . 44
Lisa Beinborn, Samira Abnar, and Rochelle Choenni
Connectives with Both Arguments External: A Survey on Czech . . . 62
Lucie Poláková and Jiří Mírovský
Recognizing Weak Signals in News Corpora . . . 73
Daniela Gifu
Low-Rank Approximation of Matrices for PMI-Based Word Embeddings . . . 86
Alena Sorokina, Aidana Karipbayeva, and Zhenisbek Assylbekov
Text Preprocessing for Shrinkage Regression and Topic Modeling to Analyse EU Public Consultation Data . . . 95
Nada Mimouni and Timothy Yu-Cheong Yeung
Intelligibility of Highly Predictable Polish Target Words in Sentences Presented to Czech Readers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Klára Jágrová and Tania Avgustinova Information Extraction Multi-lingual Event Identification in Disaster Domain . . . . . . . . . . . . . . . . . . . . . . 129 Zishan Ahmad, Deeksha Varshney, Asif Ekbal, and Pushpak Bhattacharyya
Detection and Analysis of Drug Non-compliance in Internet Fora Using Information Retrieval Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Sam Bigeard, Frantz Thiessard, and Natalia Grabar Char-RNN and Active Learning for Hashtag Segmentation . . . . . . . . . . . . . . . . . . 155 Taisiya Glushkova and Ekaterina Artemova Extracting Food-Drug Interactions from Scientific Literature: Relation Clustering to Address Lack of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 Tsanta Randriatsitohaina and Thierry Hamon Contrastive Reasons Detection and Clustering from Online Polarized Debates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Amine Trabelsi and Osmar R. Zaïane Visualizing and Analyzing Networks of Named Entities in Biographical Dictionaries for Digital Humanities Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 Minna Tamper, Petri Leskinen, and Eero Hyvönen Unsupervised Keyphrase Extraction from Scientific Publications . . . . . . . . . . . . . 215 Eirini Papagiannopoulou and Grigorios Tsoumakas Information Retrieval Retrieving the Evidence of a Free Text Annotation in a Scientific Article: A Data Free Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Julien Gobeill, Emilie Pasche, and Patrick Ruch Salience-Induced Term-Driven Serendipitous Web Exploration . . . . . . . . . . . . . . . 247 Yannis Haralambous and Ehoussou Emmanuel N’zi Language Modeling Two-Phased Dynamic Language Model: Improved LM for Automated Language Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 Debajyoty Banik, Asif Ekbal, and Pushpak Bhattacharyya Composing Word Vectors for Japanese Compound Words Using Dependency Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 Kanako Komiya, Takumi Seitou, Minoru Sasaki, and Hiroyuki Shinnou Microtext Normalization for Chatbots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Ranjan Satapathy, Erik Cambria, and Nadia Magnenat Thalmann
Building Personalized Language Models Through Language Model Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Milton King and Paul Cook dpUGC: Learn Differentially Private Representation for User Generated Contents (Best Paper Award, Third Place, Shared) . . . . . . . . . . . . . . . . . . . . . . . . . . 316 Xuan-Son Vu, Son N. Tran, and Lili Jiang Multiplicative Models for Recurrent Language Modeling . . . . . . . . . . . . . . . . . . . . 332 Diego Maupomé and Marie-Jean Meurs Impact of Gender Debiased Word Embeddings in Language Modeling . . . . . . . . 342 Christine Basta and Marta R. Costa-jussà Initial Explorations on Chaotic Behaviors of Recurrent Neural Networks . . . . . . 351 Bagdat Myrzakhmetov, Rustem Takhanov, and Zhenisbek Assylbekov Lexical Resources LingFN: A Framenet for the Linguistic Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 Shafqat Mumtaz Virk, Per Klang, Lars Borin, and Anju Saxena SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation . . . . . . . . . . . . . . . . . . . . . . 380 Albina Khusainova, Adil Khan, and Adín Ramírez Rivera Cross-Lingual Transfer for Distantly Supervised and Low-Resources Indonesian NER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 Fariz Ikhwantri Phrase-Level Simplification for Non-native Speakers . . . . . . . . . . . . . . . . . . . . . . . 406 Gustavo H. Paetzold and Lucia Specia Automatic Creation of a Pharmaceutical Corpus Based on Open-Data . . . . . . . . . 432 Cristian Bravo, Sebastian Otálora, and Sonia Ordoñez-Salinas Fool’s Errand: Looking at April Fools Hoaxes as Disinformation Through the Lens of Deception and Humour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 Edward Dearden and Alistair Baron Russian Language Datasets in the Digital Humanities Domain and Their Evaluation with Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468 Gerhard Wohlgenannt, Artemii Babushkin, Denis Romashov, Igor Ukrainets, Anton Maskaykin, and Ilya Shutov
Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French . . . . . . . . . . . . . . . . . . . 480 Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, Delphine Battistelli, and Nicolas Béchet Machine Translation Evaluating Terminology Translation in MT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495 Rejwanul Haque, Mohammed Hasanuzzaman, and Andy Way Detecting Machine-Translated Paragraphs by Matching Similar Words . . . . . . . . 521 Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano, and Shinsaku Kiyomoto Improving Low-Resource NMT with Parser Generated Syntactic Phrases . . . . . . 533 Kamal Kumar Gupta, Sukanta Sen, Asif Ekbal, and Pushpak Bhattacharyya How Much Does Tokenization Affect Neural Machine Translation? . . . . . . . . . . . 545 Miguel Domingo, Mercedes García-Martínez, Alexandre Helle, Francisco Casacuberta, and Manuel Herranz Take Help from Elder Brother: Old to Modern English NMT with Phrase Pair Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555 Sukanta Sen, Mohammed Hasanuzzaman, Asif Ekbal, Pushpak Bhattacharyya, and Andy Way Adaptation of Machine Translation Models with Back-Translated Data Using Transductive Data Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way Morphology, Syntax, Parsing Automatic Detection of Parallel Sentences from Comparable Biomedical Texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583 Rémi Cardon and Natalia Grabar MorphBen: A Neural Morphological Analyzer for Bengali Language . . . . . . . . . 595 Ayan Das and Sudeshna Sarkar CCG Supertagging Using Morphological and Dependency Syntax Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 Luyê.n Ngo.c Lê and Yannis Haralambous
Representing Overlaps in Sequence Labeling Tasks with a Novel Tagging Scheme: Bigappy-Unicrossy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 Gözde Berk, Berna Erden, and Tunga Güngör *Paris is Rain. or It is raining in Paris?: Detecting Overgeneralization of Be-verb in Learner English . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636 Ryo Nagata, Koki Washio, and Hokuto Ototake Speeding up Natural Language Parsing by Reusing Partial Results . . . . . . . . . . . . 648 Michalina Strzyz and Carlos Gómez-Rodríguez Unmasking Bias in News . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658 Javier Sánchez-Junquera, Paolo Rosso, Manuel Montes-y-Gómez, and Simone Paolo Ponzetto Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
Named Entity Recognition
Neural Named Entity Recognition for Kazakh

Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev, and Rustam Mussabayev

Institute of Information and Computational Technologies, Almaty, Kazakhstan
[email protected]
Abstract. We present several neural networks to address the task of named entity recognition for morphologically complex languages (MCL). Kazakh is a morphologically complex language in which each root/stem can produce hundreds or thousands of variant word forms. This nature of the language can lead to a serious data sparsity problem, which may prevent deep learning models from being well trained for under-resourced MCLs. In order to model the words of MCLs effectively, we introduce root and entity tag embeddings plus a tensor layer into the neural networks. Their effects on NER performance for MCLs are significant. The proposed models outperform the state of the art, including character-based approaches, and can potentially be applied to other morphologically complex languages.

Keywords: Named entity recognition · Morphologically complex language · Kazakh language · Deep learning · Neural network
1 Introduction
Named Entity Recognition (NER) is a vital part of information extraction. It aims to locate and classify named entities in unstructured text. The entity categories are usually person, location and organization names, etc. Kazakh is an agglutinative language with complex morphological word structures. Each root/stem in the language can produce hundreds or thousands of new words, which leads to a severe data sparsity problem when automatically identifying entities. In order to tackle this problem, Tolegen et al. (2016) [24] gave a systematic study of Kazakh NER using conditional random fields. More specifically, the authors assembled and annotated the Kazakh NER corpus (KNC) and proposed a set of named entity features, exploring their effects. To achieve a state-of-the-art result for Kazakh NER comparable with that of other languages, the authors manually designed feature templates, which in practice is a labor-intensive process and requires a lot of expertise. With the intention of alleviating such task-specific feature engineering, there has been increasing interest in using deep learning to solve the NER task for many languages. However, the effectiveness of deep learning for Kazakh NER is
still unexplored. One of the aims of this work is to use deep learning for Kazakh NER, avoiding task-specific feature engineering, and to achieve a new state-of-the-art result. As in similar studies [5], neural networks (NNs) produce high results for English and other languages by using distributed word representations. However, using only surface word representations in deep learning may not be enough to reach state-of-the-art results for under-resourced MCLs. The main reason is that deep learning approaches are data hungry: their performance is strongly correlated with the amount of available training data. In this paper, we introduce three types of representation for MCL, namely word, root and entity tag embeddings. With the purpose of discovering how these embeddings contribute to model performance independently, we use a simple NN as the baseline for this investigation. We also improve this basic model from two perspectives. One is to apply a tensor transformation layer to extract multi-dimensional interactions among those representations. The other is to map each entity tag into a vector representation. The results show that the use of root embeddings leads to a significant improvement of the models in terms of test results. Our NNs reach good outcomes by transferring intermediate representations learned on large unlabeled data. We compare our NNs with the existing CRF-based NER system for Kazakh [24] and with the bidirectional LSTM-CRF [12] that is considered the state of the art in NER. Our NNs outperform the state of the art, and the results indicate that the proposed NNs can potentially be applied to other morphologically complex languages.

The rest of the paper is organized as follows: Sect. 2 reviews existing work. Section 3 presents the named entity features used in this work. Section 4 describes the details of the neural networks. Section 5 reports the experimental results, and the paper is concluded in Sect. 6 with future work.
2 Related Work
Named entity recognition has been studied for several decades, not only for English [4,9,23], but also for other MCLs, including Kazakh [24] and Turkish [20,29]. For instance, Chieu and Ng (2003) [4] presented maximum entropy based NER systems for English and German, in which the authors used both local and global features to enhance their models and achieved good performance. In order to explore the flexibility of four diverse classifiers (hidden Markov model, maximum entropy, transformation-based learning, robust linear classifier) for NER, the work [6] showed that a combined system of these models under different conditions could reduce the F1-score error by a factor of 15 to 21% on the English data set. Since the maximum entropy approach suffers from the label bias problem [11], researchers then turned to the CRF model [17] and presented CRF-based NER systems with a number of external features. Such supervised NER systems are extremely sensitive to the selection of an appropriate feature set; in the work [23], the authors explored various combinations of a set of features (local and
non-local knowledge features) and compared their impact on recognition performance for English. Using a CRF with an optimized feature template, they obtained a 91.02% F1-score on the CoNLL 2003 [22] data-set. For Turkish, Yeniterzi (2011) [29] analyzed the effect of morphological features: they utilized a CRF enhanced with several syntactic and contextual features, and their model achieved an 88.94% F1-score on Turkish test data. In the same direction, Seker and Eryigit (2012) [20] presented a CRF-based NER system with their own feature set; their final model achieved the highest F1-score (92%). For Kazakh, Tolegen et al. (2016) [24] annotated a Kazakh NER corpus (KNC) and carefully analyzed the effect of morphological (6 features) and word type (4 features) features using CRF. Their results showed that the model could be improved significantly by using morphological features; the final CRF-based NER system achieved an 89.81% F1 on Kazakh test data. In this work, we use this CRF-based NER system as one baseline and compare it to our deep learning models. Recently, deep learning models including biLSTM have obtained significant success on various natural language processing tasks, such as POS tagging [13,25,26,28], NER [4,10], machine translation [2,8], word segmentation [10] and other fields like speech recognition [1,7,15,16]. As the state of the art in NER, the study [12] explored various neural architectures including language-independent character-based biLSTM-CRF models. These models achieved F1-scores of 81.74%, 85.75% and 90.94% on German, Dutch and English, respectively. Our models have several differences compared to other state-of-the-art approaches. One difference is that we introduce root embeddings to tackle the data sparsity problem caused by MCLs. The decoding part (referred to as the CRF layer in the literature [12,14,30]) is combined into the NNs using tag embeddings. The word, root and tag embeddings are then efficiently incorporated and computed by the NNs in the same vector space, which allows us to extract higher-level vector features.
Table 1. The entity features; for more details see Tolegen et al. [24].

Morphological features    Word type features
Root                      Case feature
Part of speech            Start of the sentence
Inflectional suffixes     Latin spelling words
Derivational suffixes     Acronym
Proper noun               –
Kazakh Name suffixes      –

3 Named Entity Features
NER models are often enhanced with named entity features. In this work, with the purpose of making a fair comparison, we utilize the same entity features proposed by Tolegen et al. (2016) [24]. The entity features are given in Table 1
with two categories: morphological and word type information. Morphological features are extracted using the morphological tagger of our own implementation. We used a single value (1 or 0) to represent each feature, according to whether each word has the feature or not. Each word in the corpus is thus associated with an entity feature vector, which is fed into the NNs together with the word, root and tag embeddings.
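To make this construction concrete, the sketch below shows one way such a binary feature vector could be assembled; the feature names and the analysis dictionary are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): building a 0/1 named-entity
# feature vector for one word, following the feature set of Table 1.
FEATURE_NAMES = [
    "root", "part_of_speech", "inflectional_suffix", "derivational_suffix",
    "proper_noun", "kazakh_name_suffix",                              # morphological features
    "case_feature", "sentence_start", "latin_spelling", "acronym",    # word type features
]

def entity_feature_vector(word_analysis: dict) -> list:
    """Map a word's (hypothetical) morphological analysis to a binary vector."""
    return [1 if word_analysis.get(name, False) else 0 for name in FEATURE_NAMES]

# Example usage with a made-up analysis of a sentence-initial proper noun.
analysis = {"proper_noun": True, "sentence_start": True, "case_feature": True}
print(entity_feature_vector(analysis))   # [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]
```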
4 The Neural Networks
In this section, we describe our NNs for MCL NER. Unlike other NNs for English or similar languages, we introduce three types of representations: word, root and tag embeddings. In order to explore the effects of the root and tag embeddings separately and clearly, our first model is a general deep neural network (DNN), which was first proposed by Bengio et al. (2003) [3] for probabilistic language modeling and re-introduced by Collobert et al. (2011) [5] for multiple NLP tasks. The DNN is also a standard model for sequence labeling and serves as a strong baseline. The second model extends the DNN by applying a tensor layer, which can be viewed as a non-linear transformation that extracts higher-dimensional interactions from the input. The architecture of our NN is shown in Fig. 1. The first layer is a lookup table layer which extracts features for each word. Here, the features are a window of words, the root embedding ($S_i$) and the tag embedding ($t_{i-1}$). The concatenation of these feature vectors is fed into the next several layers for feature extraction. The next layer is a tensor layer, and the remaining layers are standard NN layers. The NN layers are trained by backpropagation; the details are given in the following sections.
4.1 Mapping Words and Tags into Feature Vectors
The NNs use two dictionaries¹: one for roots and another for words. For simplicity, we will use one notation for both dictionaries in the following description. Let $D$ be the finite dictionary; each word $x_i \in D$ is represented as a $d$-dimensional vector $M_{x_i} \in \mathbb{R}^{1 \times d}$, where $d$ is the word vector size (a hyper-parameter). All word representations of $D$ are stored in an embedding matrix $M \in \mathbb{R}^{d \times |D|}$, where $|D|$ is the size of the dictionary. Each word $x_i \in D$ corresponds to an index $k_i$, which is a column index of the embedding matrix, and the corresponding word embedding is retrieved by the lookup table layer $LT_M(\cdot)$:

$LT_M(k_i) = M_{x_i}$   (1)
Similar to word embeddings, we introduce tag embeddings $L \in \mathbb{R}^{d \times |T|}$, where $d$ is the vector size and $T$ is the tag set. The lookup table layer can be seen as a simple projection layer where the word embedding for each context word and the tag
¹ The dictionary is extracted from the training data after some pre-processing, namely lowercasing and word stemming. Words outside this dictionary are replaced by a single special symbol.
Fig. 1. The architecture of the neural network.
embedding for the previous word are retrieved by a lookup table operation. To use these features effectively, we use a sliding window approach². More precisely, for each word $x_i \in X$, the embeddings of a window of words are given by the lookup table layer:

$f_\theta^1(x_i) = \left[ M_{x_{i-w/2}} \ldots M_{x_i} \ldots M_{x_{i+w/2}},\; S_i,\; t_{i-1} \right]$   (2)

where $f_\theta^1(x_i) \in \mathbb{R}^{1 \times wd}$ is the concatenation of the $w$ word feature vectors, $w$ is the window size (a hyper-parameter), $t_{i-1} \in \mathbb{R}^{1 \times d}$ is the previous tag embedding, and $S_i$ is the embedding of the current root. These embedding matrices are initialized with small random numbers and trained by back-propagation.
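As a minimal illustration of Eq. (2), the following numpy sketch performs the lookup and window concatenation; the toy vocabulary, padding symbols and dimensions are assumptions, not the authors' code.

```python
# Minimal numpy sketch of the lookup-table layer and the sliding-window
# concatenation of Eq. (2).
import numpy as np

d, w = 50, 3                                # embedding and window sizes (hyper-parameters)
vocab = {"<start>": 0, "<end>": 1, "astana": 2, "qalasy": 3}
M = np.random.randn(len(vocab), d) * 0.01   # word embedding matrix
S = np.random.randn(len(vocab), d) * 0.01   # root embedding matrix
L = np.random.randn(5, d) * 0.01            # tag embedding matrix (|T| = 5 assumed)

def window_features(word_ids, root_ids, prev_tag_id, i):
    """Concatenate w word vectors, the current root vector and the previous tag vector."""
    half = w // 2
    padded = [vocab["<start>"]] * half + list(word_ids) + [vocab["<end>"]] * half
    window = [M[padded[i + k]] for k in range(w)]    # x_{i-1}, x_i, x_{i+1}
    return np.concatenate(window + [S[root_ids[i]], L[prev_tag_id]])

feats = window_features([2, 3], [2, 3], prev_tag_id=0, i=0)
print(feats.shape)                          # (w*d + 2*d,) = (250,)
```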
4.2 Tensor Layer
In order to capture more interactions between roots, surface words, tags and entity features, we extend the DNN to a tensor neural network. We use a 3-way tensor $T \in \mathbb{R}^{h_2 \times h_1 \times h_1}$, where $h_1$ is the size of the previous layer and $h_2$ is the size of the tensor layer. We define the output of a tensor product $h$ via the following vectorized notation:

$h = g(e^{\top} T e + W^3 e + b^3)$   (3)
² The words exceeding the sentence boundaries are mapped to one of two special symbols, namely the "start" and "end" symbols.
where $e \in \mathbb{R}^{h_1}$ is the output of the previous layer, $W^3 \in \mathbb{R}^{h_2 \times h_1}$ and $h \in \mathbb{R}^{h_2}$. Maintaining the full tensor directly leads to a parametric explosion. Here, we use a tensor factorization approach [19] that factorizes each tensor slice as the product of two low-rank matrices, and obtain the factorized tensor function:

$h = g(e^{\top} P^{[i]} Q^{[i]} e + W^3 e + b^3)$   (4)
where the matrices $P^{[i]} \in \mathbb{R}^{h_1 \times r}$ and $Q^{[i]} \in \mathbb{R}^{r \times h_1}$ are two low-rank matrices, and $r$ is the number of factors (a hyper-parameter).
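A small numpy sketch of the factorized tensor layer of Eq. (4) is given below; the layer sizes, the random initialization and the choice of tanh for the non-linearity $g(\cdot)$ are assumptions for illustration only.

```python
# Illustrative numpy sketch of the factorized tensor layer of Eq. (4).
import numpy as np

h1, h2, r = 250, 50, 3
e  = np.random.randn(h1)                     # output of the previous layer
P  = np.random.randn(h2, h1, r) * 0.01       # P[i] in R^{h1 x r} for each slice i
Q  = np.random.randn(h2, r, h1) * 0.01       # Q[i] in R^{r x h1}
W3 = np.random.randn(h2, h1) * 0.01
b3 = np.zeros(h2)

def factorized_tensor_layer(e):
    # For each slice i: e^T (P[i] Q[i]) e, i.e. a low-rank bilinear form.
    bilinear = np.array([e @ P[i] @ Q[i] @ e for i in range(h2)])
    return np.tanh(bilinear + W3 @ e + b3)   # g(.) chosen here as tanh

print(factorized_tensor_layer(e).shape)      # (50,)
```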
4.3 Tag Inference
In NER, there are strong dependencies between the named entity tags in a sentence. In order to capture the tag transitions, we use a transition score $A_{ij}$ [5,31] for jumping from tag $i \in T$ to tag $j \in T$ and an initial score $A_{0i}$ for starting from the $i$th tag. For an input sentence $X$ with a tag sequence $Y$, a sentence-level score can be calculated as the sum of the transition scores and the outputs of the NNs:

$s(X, Y, \theta) = \sum_{i=1}^{N} \left( A_{t_{i-1}, t_i} + f_\theta(t_i \mid i) \right)$   (5)
where $f_\theta(t_i \mid i)$ indicates the score output by the network for tag $t_i$ at the $i$th word. It should be noted that this model calculates the tag transition scores independently from the NNs. One possible way of combining the tag transitions and the neural network outputs is to feed the previous tag embedding to the NNs. The output of the NNs then yields a transition score given the previous tag embedding, and it can be written as follows:

$s(X, Y, \theta) = \sum_{i=1}^{N} f_\theta(t_i \mid i, t_{i-1})$   (6)
At inference time, for a sentence $X$, we can find the best tag path $Y^{*}$ by maximizing the sentence score. The Viterbi algorithm can be used for this inference.
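For readers unfamiliar with this step, the sketch below shows a standard Viterbi decoder over the transition scores $A$ and per-word network scores of Eq. (5); the random scores are placeholders and this is not the authors' implementation.

```python
# Sketch of Viterbi decoding for Eq. (5): find the tag path maximizing the
# sum of transition scores A and per-word network scores f (assumed given).
import numpy as np

def viterbi(f, A, A0):
    """f: (N, T) network scores, A: (T, T) transition scores, A0: (T,) initial scores."""
    N, T = f.shape
    score = A0 + f[0]                                 # best score ending in each tag at word 0
    back = np.zeros((N, T), dtype=int)
    for i in range(1, N):
        total = score[:, None] + A + f[i][None, :]    # total[p, t]: prev tag p -> tag t
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for i in range(N - 1, 0, -1):                     # follow back-pointers
        path.append(int(back[i][path[-1]]))
    return list(reversed(path))

print(viterbi(np.random.randn(6, 4), np.random.randn(4, 4), np.zeros(4)))
```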
5 Experiments
We conducted several experiments to evaluate our NNs. The first explores the effects of the word, root and tag embeddings plus the tensor layer on the MCL NER task, independently. The second shows the results of our models after using pre-trained root and word embeddings. The last compares our models to the state of the art, including the character embedding-based biLSTM-CRF [12].
5.1 Data-Set
In the experiments we used the data from [27] for Turkish and the Kazakh NER corpus (KNC) from [24]. Both corpora were divided into training (80%), development (10%) and test (10%) sets. The development set is used for choosing the hyper-parameters and for model selection. We adopted the IOB tagging scheme [21] for all experiments and used the standard conlleval evaluation script³ to report F-score, precision and recall values.

Table 2. Corpus statistics.

        Kazakh                                   Turkish
        #sent.  #token  #LOC  #ORG  #PER         #sent.  #token  #LOC  #ORG  #PER
Train   14457   215448  5870  2065  3424         22050   397062  9387  7389  13080
Dev.    1807    27277   785   247   413          2756    48990   1171  869   1690
Test    1807    27145   731   247   452          2756    46785   1157  925   1521

5.2 Model Setup
A set of experiments was conducted to choose the hyper-parameters, which were tuned on the development set. The initial learning rate of AdaGrad is set to 0.01 and the regularization is fixed to $10^{-4}$. Generally, the number of hidden units has a limited impact on performance as long as it is large enough. The window size $w$ was set to 3, the word, root and tag embedding sizes were set to 50, and the number of hidden units was 300 for the NNs; for the NNs with a tensor layer, it was set to 50 and the factor size was set to 3. After finding the best hyper-parameters, we trained the final models for all NNs. After each epoch over the training set, we measured the accuracy of the model on the development set and chose the final model that obtained the highest performance on the development set; the test set was then used to evaluate the selected model. We applied several preprocessing steps to the corpora, namely token and sentence segmentation and lowercasing of surface words, while the roots were kept in their original forms.
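For reference, the hyper-parameter values reported above can be collected as a single configuration; this is merely a reading of the text, not a configuration file released by the authors.

```python
# Hyper-parameters as described in the Model Setup section (our reading, not
# an official configuration).
config = {
    "learning_rate": 0.01,          # initial AdaGrad learning rate
    "l2_regularization": 1e-4,
    "window_size": 3,
    "embedding_size": 50,           # shared by word, root and tag embeddings
    "hidden_units": 300,            # 50 for the tensor-layer models
    "tensor_factors": 3,
}
```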
5.3 Results
We evaluate the following model variations in the experiments: i) a baseline neural network, NN, which contains a discrete tag transition; ii) NN+root, a model that uses root embeddings and the discrete tag transition; iii) NN+root+tag, a model in which the discrete tag transition of NN is replaced by named entity tag embeddings; iv) NN+root+tensor, a tensor layer-based model with the discrete tag transition; and v) models with +feat, which use the named entity features.
³ www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.
Table 3. Results of the NNs for Kazakh and Turkish (F1-score, %). Here root and tag indicate root and tag embeddings; tensor means tensor layer; feat denotes entity feature vector; Kaz - Kazakh and Tur - Turkish; Ov - Overall.

L.   #   Models                    Development set                Test set
                                   LOC    ORG    PER    Ov        LOC    ORG    PER    Ov
Kaz  1   NN                        86.69  68.95  68.57  78.66     86.32  69.51  64.78  76.89
     2   NN+root                   87.48  70.23  75.66  81.20     87.74  72.53  75.25  81.36
     3   NN+root+tag               88.85  67.69  79.68  82.81     87.65  73.75  76.13  81.86
     4   NN+root+tensor            89.56  72.54  81.07  84.22     88.51  75.79  77.32  82.83
     5   NN+root+feat              93.48  78.35  91.59  90.40     92.48  78.90  90.75  89.54
     6   NN+root+tensor+feat       93.78  81.48  90.91  90.87     92.22  81.57  91.27  90.11
     7   NN+root+tag+tensor+feat   93.65  81.28  92.42  91.27     92.96  78.89  91.70  90.28
Tur  8   NN                        85.06  74.70  81.11  80.86     83.17  76.26  80.55  80.29
     9   NN+root                   87.38  77.13  84.78  83.78     85.78  78.66  84.03  83.17
     10  NN+root+tag               90.70  84.93  86.67  87.53     90.02  86.14  85.95  87.31
     11  NN+root+tensor            92.43  86.45  89.63  89.78     90.50  87.14  90.00  89.42
     12  NN+root+feat              91.54  89.04  91.62  91.01     90.27  89.50  91.95  90.78
     13  NN+root+tensor+feat       93.60  88.88  92.23  91.88     92.05  89.35  92.01  91.34
     14  NN+root+tag+tensor+feat   91.77  89.72  92.23  91.44     92.80  88.45  91.91  91.39
Table 3 summarizes the results for Kazakh and Turkish. Rows (1–4, 8–11) compare the root and tag embeddings and the tensor layer independently. Rows (5–7, 12–14) show the effect of the entity features. As shown, when only surface word forms are used, the NN gives a 76.89% overall F1-score for Kazakh. The NN gives low F1-scores of 64.78% and 69.51% for PER and ORG, respectively. There are two main reasons for this: i) the numbers of person and organization names are smaller than that of locations (Table 2), and ii) compared to other entities, organization names are much longer and contain words that are ambiguous with person names⁴. For Turkish, the NN yields an 80.29% overall F1. It is evident from rows (2, 9) that NN+root improves significantly in all respects after using the root embedding: there are 4.47% and 2.88% improvements in overall F1 for Kazakh and Turkish compared to NN. More precisely, using root embeddings, NN+root gives 10.47%, 3.02% and 1.42% improvements for the Kazakh PER, ORG and LOC entities, respectively. The results for Turkish follow the same pattern. Rows (3, 10) show the effect of replacing the discrete tag transition with named entity tag embeddings. We observe that NN+root+tag yields overall F1-scores of 81.86% and 87.31% for Kazakh and Turkish. Compared to NN+root, the model with entity tag embeddings shows a significant improvement for Turkish, with 4.14% in overall F1. For both languages, model performance is boosted by using the tensor transformation, which shows that the tensor layer can capture more interactions between root and word vectors. Using the entity features, NN+root+feat gives a significant improvement for Kazakh (from 81.36
⁴ It often appears when the organization name is given after someone's name.
to 89.54%) and for Turkish (from 83.17 to 90.78%). The best result for Kazakh is a 90.28% F1-score, obtained by using the tensor transformation with tag embeddings and entity features. We compare our NNs with the existing CRF-based NER system [24] and with other state-of-the-art models. According to recent studies on NER [12,14,30], the current cutting-edge deep learning model for sequence labeling is the bidirectional LSTM with a CRF layer. On the one hand, we trained such a state-of-the-art NER model for Kazakh to make comparisons. On the other hand, it is also worth investigating how well a character-based model performs for agglutinative languages, because character-based approaches seem well suited to the agglutinative nature of these languages and can serve as a stronger baseline than the CRF. For the biLSTM-based models, we set the hyper-parameters to be comparable with those of the models that yield state-of-the-art results for English [12,14]. The word and character embedding sizes are set to 300 and 100, respectively. The hidden units of the LSTMs for both characters and words are set to 300. The dropout is set to 0.5 and the "Adam" updating strategy is used for learning the model parameters. It should be noted that entities in Kazakh always start with a capital letter, and the data set used for all biLSTM-based models was not converted to lowercase, which could have a positive effect on recognition. For a fair comparison, the following NER models are trained on the same training, development and test sets. Table 4 shows the comparison of our NNs with the state of the art for Kazakh.

Table 4. Comparison of our NNs and the state of the art.

Models                           LOC    ORG    PER    Overall
CRF [24]                         91.71  83.40  90.06  89.81
biLSTM+dropout                   85.84  68.91  72.75  78.76
biLSTM-CRF+dropout               86.52  69.57  75.79  80.28
biLSTM-CRF+Characters+dropout    90.43  76.10  85.88  86.45
NN+root+feat                     92.48  78.90  90.75  89.54
NN+root+tensor+feat              92.22  81.57  91.27  90.11
NN+root+tag+tensor+feat          92.96  78.89  91.70  90.28
NN+root+feat*                    91.74  81.00  90.99  89.70
NN+root+tensor+feat*             92.91  81.76  91.09  90.40
NN+root+tag+tensor+feat*         91.33  81.88  92.00  90.49
The CRF-based system [24] achieved an F1-score of 89.81% using all features with its well-designed feature template. The biLSTM-CRF with character embeddings yields an 86.45% F1-score, which is better than the result of the model without characters. As can be seen, a significant improvement of about 6% in overall F1-score was gained after using character embeddings, indicating that a character-based model fits the nature of MCLs. We initialized the root and word embeddings using pre-trained embeddings. The skip-gram model of
word2vec⁵ [18] is used to train root and word vectors on a large collection of Kazakh news articles and Wikipedia texts⁶. Table 4 also shows the results after pre-training the root and word embeddings, marked with the symbol *. As shown, the pre-trained root and word representations have a minor effect on the overall F1-score of the NN models. For organization names in particular, the pre-trained embeddings have positive effects: the NN+root+feat* and NN+root+tag+tensor+feat* models achieve around a 2% improvement in organization F1-score compared to the models without pre-trained embeddings (the former rises from 78.90% to 81.00% and the latter from 78.89% to 81.88%). Overall, our NNs outperform the CRF-based system and the other state of the art (biLSTM-CRF+Characters+dropout), and the best NN yields an F1 of 90.49%, a new state of the art for Kazakh NER. To show the effect of word embeddings after model training, we calculated the ten nearest neighbors of a few randomly chosen query words (first row of Table 5), with distances measured by cosine similarity. As given in Table 5, the nearest neighbors in the three columns are related to their named entity labels: location, person and organization names are listed in the first, second and third columns, respectively. Compared to the CRF, instead of using discrete features, the NNs project roots and words into a vector space, which can group similar words by their meaning, and the NNs use non-linear transformations to extract higher-level features. In this way, the NNs may reduce the effects of the data sparsity problems of MCLs.

Table 5. Example words in Kazakh and their 10 closest neighbors. Here, we used the Latin alphabet to write Kazakh words for convenience.

Kazakhstan (Location)   Meirambek (Person)   KazMunayGas (Organization)
Kiev                    Oteshev              Nurmukasan
Sheshenstandagy         Klinton              TsesnaBank
Kyzylorda               Shokievtin           Euroodaktyn
Angliada                Dagradorzh           Atletikony
Burabai                 Tarantinonyn         Bayern
Iran                    Nikliochenko         Euroodakka
Singapore               Luis                 CenterCredittin
Neva                    Monhes               Juventus
London                  Fernades             Aldaraspan
Romania                 Fog                  Liverpool

⁵ https://code.google.com/p/word2vec/.
⁶ In order to reduce the dictionary sizes for roots and surface words, we performed some preprocessing, namely lowercasing and word stemming using a morphological analyzer and disambiguator.
6 Conclusions
We presented several neural networks for NER of MCLs. The key aspects of our model for MCLs are the use of different embeddings and layers, namely i) root embeddings, ii) entity tag embeddings and iii) the tensor layer. The effects of these aspects were investigated individually. The use of root embeddings leads to significant improvements in MCL NER, and the other two also give positive effects. For Kazakh, the proposed NNs outperform the CRF-based NER system and the other state of the art, including the character-based biLSTM-CRF model. The comparisons showed that character embeddings are vital for MCL NER. The experimental results indicate that the proposed NNs can potentially be applied to other morphologically complex languages. Acknowledgments. The work was funded by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under grant AP09259324.
References
1. Baba Ali, B., Wójcik, W., Orken, M., Turdalyuly, M., Mekebayev, N.: Speech recognizer-based non-uniform spectral compression for robust MFCC feature extraction. Przegl. Elektrotechniczny 94, 90–93 (2018). https://doi.org/10.15199/48.2018.06.17
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014)
3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
4. Chieu, H.L., Ng, H.T.: Named entity recognition with a maximum entropy approach. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 160–163. CoNLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)
5. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
6. Florian, R., Ittycheriah, A., Jing, H., Zhang, T.: Named entity recognition through classifier combination. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 168–171. CoNLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)
7. Graves, A., Fernández, S., Gomez, F.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the International Conference on Machine Learning, ICML 2006, pp. 369–376 (2006)
8. He, D., et al.: Dual learning for machine translation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 820–828. Curran Associates, Inc. (2016)
9. Klein, D., Smarr, J., Nguyen, H., Manning, C.D.: Named entity recognition with character-level models. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 180–183. CoNLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)
10. Kuru, O., Can, O.A., Yuret, D.: CharNER: character-level named entity recognition. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 911–921. The COLING 2016 Organizing Committee (December 2016)
11. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, San Francisco, CA, USA, pp. 282–289. Morgan Kaufmann Publishers Inc. (2001)
12. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016)
13. Ling, W., et al.: Finding function in form: compositional character models for open vocabulary word representation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1520–1530. Association for Computational Linguistics (September 2015)
14. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), pp. 1064–1074. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/P16-1101, https://aclweb.org/anthology/P16-1101
15. Mamyrbayev, O., Toleu, A., Tolegen, G., Mekebayev, N.: Neural architectures for gender detection and speaker identification. Cogent Eng. 7(1), 1727168 (2020). https://doi.org/10.1080/23311916.2020.1727168
16. Mamyrbayev, O., et al.: Continuous speech recognition of Kazakh language. ITM Web Conf. 24, 01012 (2019). https://doi.org/10.1051/itmconf/20192401012
17. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 188–191. CoNLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)
18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
19. Pei, W., Ge, T., Chang, B.: Max-margin tensor neural network for Chinese word segmentation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), Baltimore, Maryland, pp. 293–303. Association for Computational Linguistics (June 2014)
20. Seker, G.A., Eryigit, G.: Initial explorations on using CRFs for Turkish named entity recognition. In: COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8–15 December 2012, Mumbai, India, pp. 2459–2474 (2012)
21. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In: Proceedings of the 6th Conference on Natural Language Learning - Vol. 20, pp. 1–4. COLING 2002, Association for Computational Linguistics, Stroudsburg, PA, USA (2002)
22. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Vol. 4, pp. 142–147. CoNLL 2003, Association for Computational Linguistics, Stroudsburg, PA, USA (2003)
23. Tkachenko, M., Simanovsky, A.: Named entity recognition: exploring features. In: Jancsary, J. (ed.) Proceedings of KONVENS 2012, pp. 118–127. ÖGAI (September 2012), main track: oral presentations
24. Tolegen, G., Toleu, A., Zheng, X.: Named entity recognition for Kazakh using conditional random fields. In: Proceedings of the 4th International Conference on Computer Processing of Turkic Languages TurkLang 2016, pp. 118–127. Izvestija KGTU im. I. Razzakova (2016)
25. Toleu, A., Tolegen, G., Mussabayev, R.: Comparison of various approaches for dependency parsing. In: 2019 15th International Asian School-Seminar Optimization Problems of Complex Systems (OPCS), pp. 192–195 (2019)
26. Toleu, A., Tolegen, G., Makazhanov, A.: Character-aware neural morphological disambiguation. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers), Vancouver, Canada, pp. 666–671. Association for Computational Linguistics (July 2017)
27. Tür, G., Hakkani-Tür, D., Oflazer, K.: A statistical information extraction system for Turkish. Nat. Lang. Eng. 9(2), 181–210 (2003)
28. Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: Charagram: embedding words and sentences via character n-grams. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1504–1515. Association for Computational Linguistics (November 2016)
29. Yeniterzi, R.: Exploiting morphology in Turkish named entity recognition system. In: Proceedings of the ACL 2011 Student Session, pp. 105–110. HLT-SS 2011, Association for Computational Linguistics, Stroudsburg, PA, USA (2011)
30. Zhai, Z., Nguyen, D.Q., Verspoor, K.: Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pp. 38–43. Association for Computational Linguistics (2018)
31. Zheng, X., Chen, H., Xu, T.: Deep learning for Chinese word segmentation and POS tagging. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 647–657. Association for Computational Linguistics (October 2013)
An Empirical Data Selection Schema in Annotation Projection Approach

Yun Hu¹,²(B), Mingxue Liao², Pin Lv², and Changwen Zheng²

¹ University of Chinese Academy of Sciences, Beijing, China
  [email protected]
² Institute of Software, Chinese Academy of Sciences, Beijing, China
  {mingxue,lvpin,changwen}@iscas.ac.cn
Abstract. Named entity recognition (NER) systems are often realized using supervised methods such as CRF and LSTM-CRF. However, supervised methods often require large amounts of training data, and in some low-resource languages annotated data is hard to obtain. The annotation projection method obtains annotated data from high-resource languages automatically; however, the data obtained this way contains a lot of noise. In this paper, we propose a new data selection schema to select high-quality sentences from the automatically annotated data. The data selection schema computes the sentence score considering the occurrence number of entity-tags and the minimum score of the entity-tags in the sentence. The selected sentences can be used as auxiliary annotated data in low-resource languages. Experiments show that our data selection schema outperforms previous methods. Keywords: Named entity recognition · Annotation projection · Data selection schema
1 Introduction

Named entity recognition is a fundamental Natural Language Processing task that labels each word in a sentence with predefined types, such as Person (PER), Location (LOC), Organization (ORG) and so on. The results of NER can be used in many downstream NLP tasks, such as relation extraction [1] and question answering [16]. Supervised methods like CRF [7] and neural network methods [2, 8] are often used to realize NER systems. However, supervised methods require large amounts of data to train an appropriate model, which means they can only be used in high-resource languages such as English. In low-resource languages, annotation projection is one of the methods that can be used to improve the performance of an NER system. Annotation projection is used to obtain annotated data in low-resource languages through parallel corpora. The method can be formalized as a pipeline. An example
The work is supported by both National Scientific and Technological Innovation Zero (No. 17-H863-01-ZT-005-005-01) and the State's Key Project of Research and Development Plan (No. 2016QY03D0505).
is shown in Fig. 1. We use the English-Chinese language pair as an example.¹ First, we use an English NER system to obtain the NER tags of the English sentences; for example, 'Committee' is labeled as 'ORG' (Organization). Then, an alignment system is used to find word alignment pairs; for example, the GIZA++ [11] model can find that 'Committee' is translated to '委员会'. Next, we take the tag of the English word as the low-resource language tag in each word alignment pair: the tag of 'Committee' is mapped to '委员会', so '委员会' is labeled as an organization. Finally, the data obtained automatically can be used as training data together with the manually annotated data. Directly using the annotation projection data leads to low performance in some languages such as Chinese. In this paper, we use this data together with manual data as the training data of the NER system; the increase in training data results in an increase in performance.
Fig. 1. A case of annotation projection.
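The projection step of this pipeline can be sketched as follows; the toy tags and alignment pairs below are illustrative assumptions, not GIZA++ output or the authors' code.

```python
# Illustrative sketch of the projection step: copy the source-side NER tag to
# the aligned target word.
def project_tags(src_tags, alignment, tgt_len):
    """src_tags: tags of source words; alignment: list of (src_idx, tgt_idx) pairs."""
    tgt_tags = ["O"] * tgt_len                # unaligned target words stay 'O'
    for src_idx, tgt_idx in alignment:
        tgt_tags[tgt_idx] = src_tags[src_idx]
    return tgt_tags

src_tags = ["B-ORG", "O", "O"]                # e.g. "Committee was informed"
alignment = [(0, 2), (1, 1)]                  # source word 0 aligns to target word 2, etc.
print(project_tags(src_tags, alignment, tgt_len=4))   # ['O', 'O', 'B-ORG', 'O']
```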
However, the data obtained from annotation projection contains a lot of noise, and higher-quality training data may lead to higher performance of the NER system. Previous work uses a data selection schema to obtain a high-quality annotation projection corpus. For example, Ni et al. consider the entity-tags and average the entity-tag scores in a sentence as the sentence score [10]. An entity-tag means an entity labeled with a predefined entity type. However, this method selects low-quality sentences when an 'entity' appears only once in the corpus or when one of the entities is predicted incorrectly. In this paper, we propose a new data selection schema which considers the occurrence number of entities and uses the minimum value of the entity-tag scores in a sentence. Experimental results show that our data selection method outperforms previous work and improves over the baseline by 1.93 points.
2 Related Work

Recently, most NER systems are based on Conditional Random Fields (CRF) or neural network models. The CRF model is generally used in sequence labeling tasks like part-of-speech tagging and NER [7]; the correlations between labels in neighborhoods are considered in the CRF model. The shortcoming of the CRF model is that it relies heavily on hand-crafted features. Neural network models partly alleviate the use of hand-crafted features: the features of the sequences are extracted automatically by the model. Collobert et al. used word embeddings and a CNN to capture sentence information [2]. Lample et al. incorporated character embeddings with word embeddings as the
¹ We consider English as a high-resource language. Although Chinese is not truly a low-resource language, we simulate the low-resource environment by limiting the size of the training data, which is similar to [14].
bi-LSTM input and achieved the best results [8]. However, all these models require large amounts of annotated data to train, and annotated data is hard to obtain in low-resource languages. In this paper, we focus on realizing an NER system in low-resource languages with limited annotated data. To build NER systems in low-resource languages, annotation projection and model transfer are widely used. Model transfer focuses on finding features that are independent of the language. Täckström et al. generated cross-lingual word cluster features which are useful for model transfer [13]. Yang et al. used hierarchical recurrent networks to model the character similarity between different languages [15]. However, model transfer is hard to apply to languages that have little similarity. In this paper, we focus on the annotation projection method. The other approach to realizing named entity recognition in low-resource languages is annotation projection, which is not limited by the similarity between the languages. The annotation projection method was first proposed by [17], who used it for POS tagging, base NP chunking, NER and morphological analysis. Kim et al. introduced GIZA++ to perform the entity alignment in annotation projection and applied an entity dictionary to correct the alignment errors [5]. A phrase-based statistical machine translation system and an external multilingual named entity database were used by [4]. Wang and Manning used soft projection to avoid errors of the CRF prediction propagating to the alignment step [14]. Recently, a data selection schema was proposed by [10]. The data selection schema measures sentence quality in the data obtained from annotation projection and does not have to consider which step caused the errors; the low-quality sentences are discarded when the model is trained. The details will be discussed in Sect. 3.1.
3 Method

3.1 Problems of Previous Method
We first introduce the data selection schema proposed by [10]. The method first computes an entity-tag score and then computes the sentence score. The relative frequency is used as the entity-tag score, and the score of a sentence is calculated by averaging the entity-tag scores in the sentence:

$\text{feq-avg} = \frac{\sum_{i=1}^{n} s_i}{n}$   (1)

where $n$ is the number of entities in a sentence and $s_i$ is the entity-tag score. We refer to the data selection schema of [10] as the feq-avg method. An example of computing the entity-tag score is shown in Table 1. The word '货币基金组织' (International Monetary Fund, IMF) appears 49 times in the annotation projection data and is labeled as 'ORG' 40 times, so the entity-tag score of '货币基金组织' labeled as 'ORG' is 0.816. The method does not use a word-tag score, because many entities are composed of common words; for example, '货币' (Money) is a common word. In Fig. 2, the first sentence '秘鲁恢复了在货币基金组织的积极成员地位' (Peru returned to active member status at IMF.) has 2 entities, so the score of the first sentence is 0.878 (the average of 0.94 and 0.816). The sentence score can be an index of sentence quality: when the sentence score is low, we have high confidence that some errors occur.
We checked the annotation projection data selected by the feq-avg method and found that the method cannot handle two situations appropriately. First, in the entity-tag score computation step, the feq-avg method cannot handle the situation where an 'entity' occurs only a few times in the corpus. Such an 'entity' is labeled as an entity by the annotation projection method but is not truly an entity. For example, in Table 1, '推选赛义德·阿姆贾德·阿里' (elected Syed Amjad Ali) is considered a person name in the annotation projection data. The error is that the entity contains the word '推选' (elected), which is a common word. The word '推选' may appear many times in the whole corpus; however, the 'entity' '推选赛义德·阿姆贾德·阿里' appears only once in the whole corpus, so in the feq-avg method its entity-tag score is 1. The same situation also happens for '新闻部有关' (related to the Department of Public Information), which contains '有关' (related to) as part of the entity. Second, in the sentence score computation step, errors may occur when a sentence has many entities, most of which have high scores while one entity has a low score. For example, the second sentence of Fig. 2 contains four entities, and its sentence score is computed as

$\text{sentence-score} = \frac{0.99 + 0.816 + 0.99 + 0.03}{4}$   (2)
The feq-avg method thus obtains a sentence score of 0.7065, which is a high score. However, the sentence contains an error: '汇率' (exchange rate) is labeled as 'ORG'. Because the other entities in the sentence obtain high scores, the entity-tag score of '汇率' is ignored in the sentence score computation.

Table 1. The case of computing entity-tag score.

Words                         Label type   Count   Relative frequency
货币基金组织                    ORG          40      0.816
货币基金组织                    MISC         2       0.041
货币基金组织                    O            7       0.143
推选赛义德·阿姆贾德·阿里         PER          1       1.0
新闻部有关                      ORG          1       1.0
Fig. 2. The case of computing sentence score.
3.2 Our Method
In our method, we use the minimum value instead of the average and a weighted entity-tag score instead of the raw relative frequency. The sentence score is computed as:

$\text{wfeq-min} = \min_i \, \alpha \, s_i$   (3)

where $\alpha$ is

$\alpha = 1 - \frac{1}{2e^{x-1}}$   (4)
In Eq. 3, $s_i$ is the entity-tag score, which is the relative frequency. In Eq. 4, $x$ is the occurrence number of the entity in the corpus. We refer to our data selection schema as the wfeq-min method. For the first problem of feq-avg, the entity-tag score is multiplied by a small $\alpha$ when the entity-tag occurs only a few times; when $x$ is large, $\alpha$ is close to 1 and the weighted entity-tag score is close to the relative frequency. For example, the $\alpha$ and weighted scores of '推选赛义德·阿姆贾德·阿里' labeled as 'PER' and of '新闻部有关' labeled as 'ORG' are 0.5, while the $\alpha$ of '货币基金组织' labeled as 'ORG' is 0.99 and its score is 0.816. For the second problem, we use the minimum value of the entity-tag scores in a sentence as the sentence score. For example, the sentence score of the second sentence in Fig. 2 becomes 0.03, while the sentence score of the first sentence in Fig. 2 is 0.816. After computing the sentence scores, we can set a threshold value q; the sentences whose scores are above q can be seen as high-quality sentences.
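To make the scoring concrete, the sketch below computes the wfeq-min score of Eqs. (3)-(4) for the running example; the occurrence counts assigned to the high-scoring entities are toy assumptions, since only the count for '货币基金组织' is given above.

```python
# Sketch of the wfeq-min sentence score of Eqs. (3)-(4).
import math

def alpha(x):                      # x: how often the entity occurs in the corpus
    return 1.0 - 1.0 / (2.0 * math.exp(x - 1))

def wfeq_min(entities):
    """entities: list of (relative_frequency s_i, occurrence_count x_i) pairs."""
    return min(alpha(x) * s for s, x in entities)

# Second sentence of Fig. 2: three reliable entities and one wrongly projected one.
sentence = [(0.99, 50), (0.816, 49), (0.99, 50), (0.03, 50)]
score = wfeq_min(sentence)
print(round(score, 3), score >= 0.7)   # approx. 0.03, so the sentence is rejected for q = 0.7
```

Note that $\alpha = 0.5$ for an entity seen once ($x = 1$), which matches the example above.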
4 Experiments

4.1 Data Sets and Evaluating Methods
Following the steps of the annotation projection method, we first require an English dataset and an English NER system. The English NER system we used is the LSTM-CRF model [8] trained on CoNLL 2003 data [12]; we replace the 'MISC' tag with the 'O' tag in the English dataset. Then, an alignment system and parallel data are required. The English-Chinese parallel data is from the datum2017 corpus provided by Datum Data Co., Ltd.². The corpus contains 20 files, covering different genres such as news, conversations, law documents, novels, etc. Each file has 50,000 sentences, so the whole corpus contains 1 million sentences. The alignment system we used is GIZA++. Next, the data selection schemas are applied to the annotation projection data. We consider four kinds of data selection schema to process the annotation projection data:
– feq-avg. The sentence score is computed by averaging the entity-tag scores, which are relative frequencies. This is the method described in Sect. 3.1.
– feq-min. The sentence score is computed as the minimum of the entity-tag scores, which are relative frequencies.
– wfeq-avg. The sentence score is computed by averaging the weighted entity-tag scores.
² http://nlp.nju.edu.cn/cwmt-wmt/.
– wfeq-min. The sentence score is computed as the minimum of the weighted entity-tag scores. This is the method described in Sect. 3.2.
Finally, we require evaluation data and an evaluation model to show that our data selection schema outperforms the other methods. We use different training data to train the model and the same test data to evaluate it. In this paper, the training data contains two parts: an annotation projection part and an original news part. The annotation projection part is the data obtained from the annotation projection method. The original news part is data from the training portion of the third SIGHAN Chinese language processing bakeoff [9]. To simulate the low-resource environment, we do not use all the training data. The test data is the test portion of [9]. The third SIGHAN Chinese language processing bakeoff is one of the widely used Chinese NER datasets; it contains 6.1M of training data, 0.68M of development data and 1.3M of test data. Three types of entities are considered in the dataset: PER (Person), ORG (Organization) and LOC (Location). The domain of the NER data is news. The evaluating model is kept the same across all data settings; we use the LSTM-CRF model as our evaluating model. The model is similar to the model in [3] except that we do not use the radical features to extract Chinese character information. The model is shown in Fig. 3. The sentences are fed as input to the character embedding layer.³ We use pretrained character embeddings to initialize our lookup table; each character maps to a low-dimensional dense vector. The dimension of the character embeddings is 100. Then a bi-LSTM layer is used to model the sentence-level information; the LSTM dimension is 100. The projection layer dimension is the same as the number of NER tags. We use a CRF layer to model the relations between neighboring labels. The optimization method we used is Adam [6], and the dropout strategy is used to avoid overfitting.
Fig. 3. The evaluating model.
³ Before using the data from annotation projection in Chinese, we change the word tag schema to character tag schema. For example, the word '委员会' (Committee) is labeled as 'ORG', so the three characters in '委员会' will be labeled as 'ORG'.
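A compact PyTorch sketch of the evaluating model's layers (character embedding, bi-LSTM and projection) is shown below; the CRF layer and training loop are omitted for brevity, the vocabulary and tag-set sizes are assumptions, and this is not the authors' released code.

```python
# Sketch of the evaluating model's layers: char embedding (100) -> BiLSTM (100)
# -> projection to tag scores; the CRF layer on top is omitted here.
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    def __init__(self, n_chars, n_tags, emb_dim=100, hidden=100, dropout=0.5):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(2 * hidden, n_tags)   # one score per NER tag

    def forward(self, char_ids):                    # char_ids: (batch, seq_len)
        x = self.drop(self.emb(char_ids))
        x, _ = self.lstm(x)
        return self.proj(x)                         # emission scores for a CRF layer

scores = CharBiLSTMTagger(n_chars=5000, n_tags=7)(torch.randint(0, 5000, (2, 30)))
print(scores.shape)                                 # torch.Size([2, 30, 7])
```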
4.2 Results
We first ran experiments in which only annotation projection data is used as training data. The results show that the models are hard to train and to converge, and the final F1 score is low (less than 30%). The reason may be that the automatically annotated data contains a lot of noise even after data selection. In this paper, we therefore use the annotation projection data together with manual data to show that the data selection schema can be helpful when the original manual data is limited. We use 0.5M of data from the original news training data and 0.5M of data from annotation projection. The results of using different data selection schemas are shown in Table 2. The baseline system only uses the 0.5M of original news data. The baseline-D system uses 0.5M of data from the original news training data and 0.5M of data from annotation projection without any data selection schema. The results show that directly using the annotation projection data may lead to a performance decline; the reason may be that the increase in training data cannot compensate for the tremendous noise. All four data selection schemas improve the performance of the baseline system, which means that data selection is important for improving sentence quality. The wfeq-min data selection schema described in Sect. 3.2 achieves the best results and improves the baseline system by 1.93 points in F1 score. Both the wfeq-avg and feq-min data selection schemas outperform the feq-avg schema, which shows that both the occurrence number of entity-tags and the minimum score of the entity-tags in a sentence are important.

Table 2. The overview results.

            P      R      F
baseline    68.58  67.98  68.28
baseline-D  71.09  64.01  67.36
feq-avg     71.38  66.95  69.09
wfeq-avg    73.29  65.99  69.45
feq-min     73.87  66.06  69.75
wfeq-min    73.43  67.26  70.21
When we utilize the data selected by a data selection schema, three hyperparameters are considered: the size of the original news training data, the size of the data from annotation projection, and the selection threshold q in the data selection schema. When one hyperparameter is varied, the two other hyperparameters are fixed. The first hyperparameter is the size of the original news training data. The F1 scores are shown in Table 3; the original news data size ranges from 0.25M to 1M, the annotation projection data size is 0.5M, and the selection threshold q is 0.7. The baseline system does not use the annotation projection data. The baseline results show that the larger the training data set, the better the model works. For all sizes, directly using the annotation projection data harms performance, while the models using data from a selection schema all outperform the baseline-D system; the wfeq-min data selection schema obtains the best results. We also observe that the effect of the annotation projection data is reduced when the original data size is increased. For example, when the original data size is 0.25M, the wfeq-min method improves the baseline by 3.04 points. When
the original size is 1.0M, the wfeq-min method improves the baseline by 0.84 points. The experiments indicate that the annotation projection method is more helpful in the low-resource situation.

Table 3. The results of different data sizes from original news data.

size/M      0.25   0.5    0.75   1.0
baseline    56.41  68.28  74.43  77.82
baseline-D  56.87  67.36  72.97  75.80
feq-avg     58.55  69.09  74.52  77.25
wfeq-avg    58.82  69.45  74.64  77.43
feq-min     59.01  69.75  74.78  77.82
wfeq-min    59.45  70.21  75.16  78.12
The second hyperparameter is the size of the data from annotation projection. The size of the original news data is 0.5M and q is set to 0.7. We vary the annotation projection data from 0M to 1.0M. The F1 scores are shown in Table 4. Directly using the annotation projection data harms the performance at all data sizes (baseline-D). For the data selection schemas, we observe that the performance first increases and later declines; the reason may be that the disadvantage of the noisy data grows as the amount of annotation projection data increases. The experiments indicate that the annotation projection data should have a similar size to the original news data.

Table 4. The results of different data sizes from annotation projection.

size/M      0      0.25   0.5    0.75   1.0
baseline-D  68.28  68.19  67.36  66.48  64.93
feq-avg     68.28  70.50  69.09  68.74  67.42
wfeq-avg    68.28  70.93  69.45  68.84  67.52
feq-min     68.28  70.98  69.75  68.97  67.63
wfeq-min    68.28  71.08  70.21  70.04  69.49
The final hyperparameter is the selection threshold q in the data selection schema. The F1 scores are presented in Fig. 4. The training data set contains 0.5M of original news data and 0.5M of annotation projection data, and q is used to select the annotation projection data; a large q value leads to high-quality sentences. To avoid the situation where q is small and the amount of data above q is large, we only select 0.5M of data from the annotation projection data for every q. The figure shows that the performance increases as q increases, which means that the quality of the training data is increasing. For all q, the wfeq-min method obtains the best results and the feq-avg method obtains the worst results. Comparing the feq-min method with the wfeq-avg method, the feq-min method obtains better results when q is small and worse results when q is large. By analyzing the sentences selected by the feq-min method, we find that the first problem described in Sect. 3.1 becomes more pronounced when q is large.
Fig. 4. The results of using different q values.
5 Conclusion

In this paper, we propose a new data selection schema to select high-quality sentences from annotation projection data. The data selection schema considers the occurrence number of entity-tags and uses the minimum value of the entity-tag scores in a sentence. The selected sentences can be used as auxiliary data to obtain higher performance in an NER system. Experiments show that our method obtains higher-quality sentences compared with previous methods. In the future, more sophisticated models will be used to utilize the high-quality sentences instead of directly using them as training data.
References
1. Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (2005). https://aclweb.org/anthology/H05-1091
2. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
3. Dong, C., Zhang, J., Zong, C., Hattori, M., Di, H.: Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In: Lin, C.-Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) ICCPOL/NLPCC 2016. LNCS (LNAI), vol. 10102, pp. 239–250. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50496-4_20
4. Ehrmann, M., Turchi, M.: Building multilingual named entity annotated corpora exploiting parallel corpora. In: Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) (2010)
5. Kim, S., Jeong, M., Lee, J., Lee, G.G.: A cross-lingual annotation projection approach for relation detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 564–571. Coling 2010 Organizing Committee (2010). https://www.aclweb.org/anthology/C10-1064
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. Computer Science (2014)
7. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1030, https://www.aclweb.org/anthology/N16-1030
9. Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117. Association for Computational Linguistics (2006). https://www.aclweb.org/anthology/W06-0115
10. Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1470–1480. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1135, https://www.aclweb.org/anthology/P17-1135
11. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1) (2003). https://www.aclweb.org/anthology/J03-1002
12. Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (2003). https://www.aclweb.org/anthology/W03-0419
13. Täckström, O., McDonald, R., Uszkoreit, J.: Cross-lingual word clusters for direct transfer of linguistic structure. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 477–487. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/N12-1052
14. Wang, M., Manning, C.D.: Cross-lingual projected expectation regularization for weakly supervised learning. Trans. Assoc. Comput. Linguist. 2, 55–66 (2014). https://www.aclweb.org/anthology/Q14-1005
15. Yang, Z., Salakhutdinov, R., Cohen, W.W.: Transfer learning for sequence tagging with hierarchical recurrent networks (2016)
16. Yao, X., Van Durme, B.: Information extraction over structured data: question answering with Freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 956–966. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/P14-1090, https://aclweb.org/anthology/P14-1090
17. Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the First International Conference on Human Language Technology Research (2001). https://www.aclweb.org/anthology/H01-1035
Toponym Identification in Epidemiology Articles – A Deep Learning Approach

MohammadReza Davari(B), Leila Kosseim, and Tien D. Bui

Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
mohammadreza[email protected], {leila.kosseim,tien.bui}@concordia.ca
Abstract. When analyzing the spread of viruses, epidemiologists often need to identify the location of infected hosts. This information can be found in public databases, such as GenBank [3]; however, the information provided in these databases is usually limited to the country or state level. More fine-grained localization information requires phylogeographers to manually read relevant scientific articles. In this work we propose an approach to automate the process of place name identification in medical (epidemiology) articles. The focus of this paper is to propose a deep learning based model for toponym detection and to experiment with the use of external linguistic features and domain-specific information. The model was evaluated using a collection of 105 epidemiology articles from PubMed Central [33] provided by the recent SemEval task 12 [28]. Our best detection model achieves an F1 score of 80.13%, a significant improvement compared to the state of the art of 69.84%. These results underline the importance of domain-specific embeddings as well as specific linguistic features in toponym detection in medical journals.

Keywords: Named entity recognition · Toponym identification · Deep neural network · Epidemiology articles

1 Introduction
With the increase of global tourism and international trade of goods, phylogeographers, who study the geographic distribution of viruses, have observed an increase in the geographical spread of viruses [9,12]. In order to study and model the global impact of the spread of viruses, epidemiologists typically use information on the DNA sequence and structure of viruses, but also rely on meta data. Accurate geographical data is essential in this process. However, most publicly available data sets, such as GenBank [3], lack specific geographical details, providing information only at the country or state level. Hence, localized geographical information has to be extracted through a manual inspection of medical journals. The task of toponym resolution is a sub-problem of named entity recognition (NER), a well studied topic in Natural Language Processing (NLP). Toponym
resolution consists of two sub-tasks: toponym identification and toponym disambiguation. Toponym identification consists of identifying the word boundaries of expressions that denote geographic locations, while toponym disambiguation focuses on labeling these expressions with their corresponding geographical locations. Toponym resolution has been the focus of much work in recent years (e.g. [2,7,30]) and studies have shown that the task is highly dependent on the textual domain [1,8,14,25,26]. The focus of this paper is to propose a deep learning based model for toponym detection and to experiment with the use of external linguistic features and domain-specific information. The model was evaluated using the recent SemEval task 12 dataset [28] and shows that domain-specific embeddings as well as some linguistic features do help in toponym detection in medical journals.
2 Previous Work
The task of toponym detection consists in labeling each word of a text as a toponym or non-toponym. For example, given the sentence:

(1) WNV entered Mexico through at least 2 independent introductions (example from the [33] dataset).

The expected output is shown in Fig. 1.
Fig. 1. An example of input and expected output of toponym detection task. Example from the [33] dataset.
Toponym detection has been addressed using a variety of methods: rule based approaches (e.g. [29]), dictionary or gazetteer-driven approaches (e.g. [18]), as well as machine learning approaches (e.g. [27]). Rule based techniques try to manually capture textual clues or structures which could indicate the presence of a toponym. However, these handwritten rules are often not able to cover all possible cases, hence leading to a relatively large number of false negatives. Gazetteer-driven approaches (e.g. [18]) suffer from a large number of false positive identifications, since they cannot disambiguate entities that refer to geographical locations from other categories of named entities. For example, in the sentence:

(2) Washington was unanimously elected President by the Electoral College in the first two national elections.
the word Washington will be recognized as a toponym since it is present in geographic gazetteers, but in this context the expression refers to a person name. Finally, standard machine learning approaches (e.g. [27]) require large datasets of labeled texts and carefully engineered features. Collecting such large datasets is costly, and feature engineering is a time consuming task with no guarantee that all relevant features have been modeled. This motivated us to experiment with automatic feature learning to address the problem of toponym detection. Deep learning approaches to NER (e.g. [5,6,16,17,31]) have shown how a system can infer relevant features and lead to competitive performances in that domain. The task of toponym resolution for the epidemiology domain is currently the object of the SemEval 2019 shared task 12 [28]. Previous approaches to toponym detection in this domain include a rule based approach [33], Conditional Random Fields [32], and a mixture of deep learning and rule based approaches [19]. The baseline model used at the SemEval 2019 task 12 [28] is modeled after the deep feed forward neural network (DFFNN) architecture presented in [19]. The network consists of 2 hidden layers with 150 rectified linear unit (ReLU) activation functions per layer. The baseline F1 performance is reported to be 69.84%. Building upon the work of [19,28], we propose a DFFNN that uses domain-specific information as well as linguistic features to improve on the state of the art.
3 Our Proposed Model
Fig. 2 shows the architecture of our toponym recognition model. The model is comprised of 2 main layers: an embedding layer and a deep feed-forward network.
3.1 Embedding Layer
As shown in Fig. 2, the model takes as input a word (e.g. derived) and its context (i.e. n words around it). Given a document, each word is converted into an embedding along with its context. Specifically, two types of embeddings are used: word embeddings and feature embeddings. For word embeddings, our basic model uses the pretrained Wikipedia-PubMed embeddings (available at http://bio.nlplab.org/). This embedding model was trained on a vocabulary of 201,380 words and each word is represented by a 200 dimensional feature vector. This embedding model was used as opposed to the more generic Word2vec [20] or GloVe [23] in order to capture more domain specific information (see Sect. 4). Indeed, the corpus used for training the Wikipedia-PubMed embeddings consists of Wikipedia pages and PubMed articles [21]. This entails that the embeddings should be more appropriate when processing medical journals and domain specific words. Moreover, the embedding model can better represent the closeness and relation of words in medical articles.
Fig. 2. Toponym recognition model. Input: words are extracted with a fixed context window (a) Embeddings: For each window, an embedding is constructed (b) Deep Neural Network: A feed-forward neural network with 3 layers and 500 neurons per layer outputs a prediction label indicating whether the word in the center of the context window is a toponym or not.
The word embeddings for the target word and its context are concatenated to form a single word embedding vector of size 200 × (2c + 1), where c is the context size. Specific linguistic features have been shown to be very useful in toponym detection [19]. In order to leverage this information, our model is augmented with embeddings for these features. These include the use of a capital letter for the first character of the word or for all characters of the word. These features are encoded as a binary vector representation: if a word starts with a capitalized letter, its feature embedding is [1, 0]; otherwise it is [0, 1]; and if all of its letters are capitalized, its feature embedding is [1, 1]. Other linguistic features we observed to be useful (see Sect. 4) include part of speech tags and the word embedding of the lemma of the word. The feature embeddings of the input word and its context are combined with the word embeddings via concatenation to form a single vector and passed to the next layer.
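As an illustration of how such an input vector could be assembled, the following is a minimal sketch, not the authors' code: the embedding lookup `emb`, the zero-vector handling of OOV words, the padding at sentence boundaries, and the exact ordering of word and feature embeddings are all assumptions made for the example.

```python
import numpy as np

EMB_DIM = 200          # Wikipedia-PubMed embedding dimension used in the paper
CONTEXT = 2            # context size c (words on each side of the target)

def cap_features(word):
    """Binary capitalization features: [1,0] first letter capitalized,
    [1,1] all letters capitalized, [0,1] otherwise."""
    if word.isupper():
        return [1, 1]
    if word[:1].isupper():
        return [1, 0]
    return [0, 1]

def build_input(tokens, idx, emb):
    """Concatenate word embeddings and capitalization features for the
    target word at position idx and its +/- CONTEXT neighbours."""
    pieces = []
    for j in range(idx - CONTEXT, idx + CONTEXT + 1):
        if 0 <= j < len(tokens):
            w = tokens[j]
            vec = emb.get(w.lower(), np.zeros(EMB_DIM))   # OOV word -> zero vector (assumption)
            pieces.append(vec)
            pieces.append(np.array(cap_features(w), dtype=float))
        else:                                             # pad outside the sentence
            pieces.append(np.zeros(EMB_DIM))
            pieces.append(np.zeros(2))
    return np.concatenate(pieces)    # size (EMB_DIM + 2) * (2*CONTEXT + 1)
```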
3.2 Deep Feed Forward Neural Network
The concatenated embeddings formed in the embedding layer (Sect. 3.1) are fed to a deep feed forward network (DFFNN) (see Fig. 2) whose task is to perform binary classification. This component is comprised of 3 hidden layers and one output layer. Each hidden layer is comprised of 500 ReLU activation nodes. Once an input vector x enters a hidden layer h, the output h(x) is computed as:

h(x) = ReLU(Wx + b)    (1)
The model is defined by applying the above equation recursively for all 3 hidden layers. The output layer contains a 2-dimensional softmax activation function. Upon
receiving the input x, this layer will output O(x) as follows:

O(x) = Softmax(Wx + b)    (2)
The Softmax function was chosen for the output layer since it provides a categorical probability distribution over the labels for an input x, i.e.:

p(x = toponym) = 1 − p(x = non-toponym)    (3)
We employed 2 mechanisms to prevent overfitting: drop-out and early-stopping. In each hidden layer the probability of drop-out was set to 0.5. Early-stopping caused the training to stop if the loss on the development set (see Sect. 4) started to rise, preventing over-fitting and poor generalization. Norm clipping [22] scales the gradient when its norm exceeds a certain threshold and prevents the occurrence of exploding gradients; we experimentally found the best performing threshold to be 1 for our model. We experimented with variations of the model architecture, both in depth and in the number of hidden units per layer, as well as with the other hyper-parameters listed in Table 1. However, deepening the model led to immediate over-fitting due to the small size of the dataset used [13] (see Sect. 4), even with the presence of a dropout function to prevent it. The optimal hyper-parameter configuration, tuned on the development set, can be found in Table 1.

Table 1. Optimal hyper-parameters of the neural network.
Parameter       Value
Learning rate   0.01
Batch size      32
Optimizer       SGD
Momentum        0.1
Loss            Weighted categorical cross-entropy
Loss weights    (2, 1) for toponym vs. non-toponym
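The following is a minimal PyTorch-style sketch of a network with this shape, together with the dropout, weighted loss, optimizer and norm-clipping settings listed above. It is an illustration under assumptions (the input dimension, the ordering of the class weights, and folding the softmax into the cross-entropy loss), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToponymDFFNN(nn.Module):
    """3 hidden layers of 500 ReLU units with dropout 0.5, 2-way output."""
    def __init__(self, input_dim):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(3):
            layers += [nn.Linear(dim, 500), nn.ReLU(), nn.Dropout(0.5)]
            dim = 500
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(500, 2)       # logits; softmax is folded into the loss below

    def forward(self, x):
        return self.out(self.hidden(x))

model = ToponymDFFNN(input_dim=1010)       # assumed size, e.g. (200 + 2) * (2*2 + 1) for c = 2
# Weighted cross-entropy: toponym class weighted 2x (class index order is an assumption)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([2.0, 1.0]))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.1)

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # norm clipping [22]
    optimizer.step()
    return loss.item()
```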
4 Experiments and Results
Our model has been evaluated as part of the recent SemEval 2019 task 12 shared task [28]. As such, we used the dataset and the scorer provided by the organisers (available at https://competitions.codalab.org/competitions/19948#learn_the_details-evaluation). The dataset consists of 105 articles from PubMed annotated with toponym mentions and their corresponding geographical locations. The dataset was split into 3 sections: training, development, and test sets containing 60%, 10%, and 30% of the dataset respectively. Table 2 shows statistics of the dataset.
Table 2. Statistics of the dataset.
                                           Training   Development  Test
Size                                       2.8 MB     0.5 MB       1.5 MB
Number of articles                         63         10           32
Average size of each article (in words)    6422       5191         6146
Average number of toponyms per article     43         44           50
A baseline model for toponym detection was also provided by the organizers for comparative purposes. The baseline, inspired by [19], also uses a DFFNN but only uses 2 hidden layers and 150 ReLU activation functions per layer. Table 3 shows the results of our basic model presented in Sect. 3.2 (see row #4) compared to the baseline (row #3). (At the time of writing this paper, the results of the other teams were not available; hence only a comparison with the baseline can be made at this point.) We carried out a series of experiments to evaluate a variety of parameters. These are described in the next sections.

Table 3. Performance score of the baseline, our proposed model and its variations. The suffixes represent the presence of a feature, P: Punctuation marks, S: Stop words, C: Capitalization features, POS: Part of speech tags, W: Weighted loss, L: Lemmatization feature. For example, DFFNN Basic+P+S+C+POS refers to the model that only takes advantage of the capitalization feature and part of speech tags and does not ignore stop words or punctuation marks.
#  Model                       Context  Precision  Recall   F1
8  DFFNN Basic+P+S+C+POS+W+L   5        80.69%     79.57%   80.13%
7  DFFNN Basic+P+S+C+POS+W     5        76.84%     77.36%   77.10%
6  DFFNN Basic+P+S+C+POS       5        77.55%     70.37%   73.79%
5  DFFNN Basic+P+S+C           2        78.82%     66.69%   72.24%
4  DFFNN Basic+P+S             2        79.01%     63.25%   70.26%
3  Baseline                    2        73.86%     66.24%   69.84%
2  DFFNN Basic+P−S             2        74.70%     63.57%   68.67%
1  DFFNN Basic+S−P             2        64.58%     64.47%   64.53%
4.1 Effect of Domain Specific Embeddings
As [1,8,14,25,26] showed, the task of toponym detection is dependent on the discourse domain; this is why our basic model used the Wikipedia-PubMed embeddings. In order to measure the effect of such domain specific information, we experimented with 2 other pretrained word embedding models: Google News Word2vec [11] and a GloVe model trained on Common Crawl [24]. Table 4 shows the characteristics of these pretrained embeddings. Although the Wikipedia-PubMed model has a smaller vocabulary in comparison to the other embedding models, it suffers from the smallest percentage of out of vocabulary (OOV) words within our dataset since it was trained on a closer domain.
Table 4. Specifications of the word embedding models.
Model                  Vocabulary size  Embedding dimension  OOV words
Wikipedia-PubMed       201,380          200                  28.61%
Common Crawl GloVe     2.2M             300                  29.84%
Google News Word2vec   3M               300                  44.36%
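For reference, the OOV percentage reported in the last column can be computed with a short sketch such as the one below; `dataset_vocab` and `embedding_vocab` are hypothetical variables standing for the word types of the SemEval articles and the keys of the loaded embedding model.

```python
def oov_rate(dataset_vocab, embedding_vocab):
    """Percentage of dataset word types that have no pretrained vector."""
    missing = sum(1 for w in dataset_vocab if w not in embedding_vocab)
    return 100.0 * missing / len(dataset_vocab)

# print(f"OOV rate: {oov_rate(dataset_vocab, embedding_vocab):.2f}%")
```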
We experimented with our DFFNN model with each of these embeddings and optimized the context window size to achieve the highest F-measure on the development set. The performance of these models on the test set is shown in Table 5. As predicted, we observe that Wikipedia-PubMed performs better than the other embedding models. This is likely due to its small number of OOV words and its domain-specific knowledge. As Table 5 shows, the performance of the GloVe model is quite close to the performance of Wikipedia-PubMed. To investigate this further, we decided to combine the two embeddings, train another model, and evaluate its performance. As shown in Table 5, the performance of this model (Wikipedia-PubMed + GloVe) is higher than the GloVe model alone but lower than the Wikipedia-PubMed model. This decrease in performance suggests that, because GloVe embeddings are more general, when the network is presented with a combination of GloVe and Wikipedia-PubMed, they dilute the domain specific information captured by the Wikipedia-PubMed embeddings, and hence the performance suffers. From here on, our experiments were carried out using Wikipedia-PubMed word embeddings alone.

Table 5. Effect of word embeddings on the performance of our proposed model architecture.
Model                      Context window  Precision  Recall   F1
Wikipedia-PubMed           2               79.01%     63.25%   70.26%
Wikipedia-PubMed + GloVe   2               73.09%     67.22%   70.03%
Common Crawl GloVe         1               75.40%     64.05%   69.25%
Google News Word2vec       3               75.14%     58.96%   66.07%
4.2 Effect of Linguistic Features
Although deep learning approaches have led to significant improvements in many NLP tasks, simple linguistic features are often very useful. In the case of NER, punctuation marks constitute strong signals. To evaluate this in our task, we ran the DFFNN Basic without punctuation information. As Table 3 shows, the removal of punctuation decreased the F-measure from 70.26% to 64.53% (see Table 3 #1). A manual error analysis showed that many toponyms appear inside parentheses, near a dot at the end of a sentence, or after a comma. Hence, as shown in [10], punctuation is a good indicator of toponyms and should not be ignored.
As Table 3 (#2) shows, the removal of stop words did not help the model either and led to a decrease in F-measure (from 70.26% to 68.67%). We hypothesize that some stop words, such as in, do help the system detect toponyms as they provide a learnable structure for the detection of toponyms, and that is why the model accuracy suffered once the stop words were removed. As seen in Table 3, our basic model suffers from low recall. A manual inspection of the toponyms in the dataset revealed that either their first letter is capitalized (e.g. Mexico) or all their letters are capitalized (e.g. UK). As mentioned in Sect. 3.1, we used this information in an attempt to help the DFFNN learn more structure from the small dataset. As a result, the F1 performance of the model increased from 70.26% to 72.24% (see Table 3 #5). In order to help the neural network better understand and model the structure of the sentences, we experimented with part of speech (POS) tags as part of our feature embeddings. We used the NLTK POS tagger [4], which uses the Penn Treebank tagset. As shown in Table 3 (#6), the POS tags significantly improve the recall of the network (from 66.69% to 70.37%), hence leading to a higher F1 (from 72.24% to 73.79%). The POS tags help the DFFNN to better learn the structure of the sentences and take advantage of more contextual information (see Sect. 4.3).
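For reference, a minimal example of obtaining Penn Treebank POS tags with NLTK for the POS feature just described; how the tags are then encoded into the feature embedding is not shown, and the download resource names are the standard NLTK ones and may differ across versions.

```python
import nltk
# One-time downloads, if the resources are not already installed:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "WNV entered Mexico through at least 2 independent introductions."
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)   # Penn Treebank tags, e.g. ('Mexico', 'NNP')
print(pos_tags)
```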
4.3 Effect of Window Size
In order to measure the effect of the size of the context window, we varied this value using the basic DFFNN. As seen in Fig. 3, the best performance is achieved at c = 2. With values over this threshold, the DFFNN overfits as it cannot extract any meaningful structure. Due to the small size of the data set, the DFFNN is not able to learn the structure of the sentences, hence increasing the context window alone does not help the performance. In order to help the neural network better understand and use the contextual structure of the sentences in its predictions, we experimented with part of speech (POS) tags as part of our feature embeddings. As shown in Fig. 3, the POS tags help the DFFNN to take advantage of more contextual information; as a result, the DFFNN with POS embeddings achieves a higher performance on larger window sizes. The context window for which the DFFNN achieved its highest performance on the development set was c = 5, and on the test set the performance was increased from 72.24% to 77.10% (see Table 3 #6).
4.4 Effect of the Loss Function
As shown in Table 3, most models suffer from a lower recall than precision. The dataset is quite imbalanced, that is, the number of non-toponyms is much higher than the number of toponyms (99% vs 1%). Hence, the neural network prefers to optimize its performance by concentrating its efforts on correctly predicting the labels of the dominant class (non-toponym). In order to minimize the gap between recall and precision, we experimented with a weighted loss function. We adjusted the importance of predicting the correct labels experimentally and found that by weighing the toponyms 2 times more than the non-toponyms, the system reaches an equilibrium in the precision and recall measures, leading to a higher F1 performance. (This is indicated by "w" in Table 3, row #7.)

Fig. 3. Effect of context window on performance of the model with and without POS features (DFFNN Basic+P+S and DFFNN Basic+P+S+C+POS).
4.5 Use of Lemmas
Neural networks require large datasets to learn structures, and they learn better if the dataset contains similar examples so that the system can cluster them in its learning process. Since our dataset is small and the Wikipedia-PubMed embeddings suffer from 28.61% OOV words (see Table 4), we tried to help the network better cluster the data by adding the lemmatized word embeddings to the feature embeddings and observed how our best model reacted. As shown in Table 3 (#8), this improved the F1 measure significantly (from 77.10% to 80.13%). Furthermore, we picked 2 random toponyms and 2 random non-toponyms to visualize the confidence of our best model and the baseline model in their predictions, as given by the softmax function (see Eq. 2). Figure 4 shows that our model produces much sharper confidence in comparison to the baseline model.
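As a small illustration of the lemmatization feature described above, the following sketch obtains a lemma with NLTK's WordNet lemmatizer. The paper does not state which lemmatizer was used, so this particular choice is an assumption.

```python
from nltk.stem import WordNetLemmatizer
# nltk.download("wordnet") may be required first

lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("introductions")   # -> "introduction"
# The embedding of the lemma is then looked up and concatenated to the
# feature vector alongside the embedding of the surface form.
```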
5 Discussion
Fig. 4. (a) Confidence of our proposed model in its categorical predictions. (b) Confidence of the baseline in its categorical predictions.

Overall, our best model (DFFNN #8 in Table 3) consists of the basic DFFNN plus the capitalization features, POS embeddings, weighted loss function, and lemmatization feature. The experiments and results described in Sect. 4 underline the importance of linguistic insights in the task of toponym detection. Ideally, the system should learn all these insights and features by itself given access to enough data. However, when the data is scarce, as in our case, we should take advantage of the linguistic structure of the data for better performance. Our experiments also underline the importance of domain specific word embedding models. These models reduce the number of OOV words and also provide embeddings that capture the relations of words in the specific domain of study.
6 Conclusion and Future Work
This paper presented the approach we used to participate in the recent SemEval task 12 shared task on toponym resolution [28]. Our best DFFNN approach took advantage of domain specific embeddings as well as linguistic features. It achieves a significant increase in F-measure compared to the baseline system (from 69.84% to 80.13%). However, as the official results were not available at the time of writing, a comparison with other approaches cannot be done at this time. The focus of this paper was to propose a deep learning based model for toponym detection and experiment with the use of external linguistic features and domain specific information. The model was evaluated using the recent SemEval task 12 dataset [28] and shows that domain specific embeddings as well as some linguistic features do help in toponym detection in medical journals. One of the main factors preventing us from exploring deeper models was the small size of the data set. With more human annotated data, the models could be extended for better performance. However, since human annotated data is expensive to produce, we suggest that distant supervision [15] be explored for further increasing performance. As our experiments pointed out, the model could
heavily benefit from linguistic insights; hence, equipping the model with more linguistically driven features could potentially lead to a higher performing model. We did not have the time or computational resources to explore recurrent neural architectures; however, future work could be done focusing on these models.

Acknowledgments. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References 1. Amitay, E., Har’El, N., Sivan, R., Soffer, A.: Web-a-where: geotagging web content. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–280. ACM (2004) 2. Ardanuy, M.C., Sporleder, C.: Toponym disambiguation in historical documents using semantic and geographic features. In: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, pp. 175–180. ACM (2017) 3. Benson, D.A., et al.: Genbank. Nucleic Acids Res. 41(D1), D36–D42 (2012) 4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. ”O’Reilly Media, Inc.”, Sebastopol (2009) 5. Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016) 6. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM (2008) 7. DeLozier, G., Baldridge, J., London, L.: Gazetteer-independent toponym resolution using geographic word profiles. In: AAAI, pp. 2382–2388 (2015) 8. Garbin, E., Mani, I.: Disambiguating toponyms in news. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 363–370. Association for Computational Linguistics (2005) 9. Gautret, P., Botelho-Nevers, E., Brouqui, P., Parola, P.: The spread of vaccinepreventable diseases by international travellers: a public-health concern. Clin. Microbiol. Infect. 18, 77–84 (2012) 10. Gelernter, J., Balaji, S.: An algorithm for local geoparsing of microtext. GeoInformatica 17(4), 635–667 (2013) 11. Google: Pretrained word and phrase vectors. https://code.google.com/archive/p/ word2vec/ (2019). Accessed 10 Jan 2019 12. Green, A.D., Roberts, K.I.: Recent trends in infectious diseases for travellers. Occup, Med. 50(8), 560–565 (2000). https://dx.doi.org/10.1093/occmed/50.8.560 13. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012) 14. Kienreich, W., Granitzer, M., Lux, M.: Geospatial anchoring of encyclopedia articles. In: Tenth International Conference on Information Visualisation (IV 2006), pp. 211–215 (July 2006). https://doi.org/10.1109/IV.2006.57 15. Krause, S., Li, H., Uszkoreit, H., Xu, F.: Large-scale learning of relation-extraction rules with distant supervision from the web. In: Cudr´e-Mauroux, P., et al. (eds.) ISWC 2012. LNCS, vol. 7649, pp. 263–278. Springer, Heidelberg (2012). https:// doi.org/10.1007/978-3-642-35176-1 17
16. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016) 17. Li, L., Jin, L., Jiang, Z., Song, D., Huang, D.: Biomedical named entity recognition based on extended recurrent neural networks. In: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 649–652 (Nov 2015). https://doi.org/10.1109/BIBM.2015.7359761 18. Lieberman, M.D., Samet, H.: Multifaceted toponym recognition for streaming news. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 843–852. ACM (2011) 19. Magge, A., Weissenbacher, D., Sarker, A., Scotch, M., Gonzalez-Hernandez, G.: Deep neural networks and distant supervision for geographic location mention extraction. Bioinformatics 34(13), i565–i573 (2018) 20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 21. Moen, S., Ananiadou, T.S.S.: Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pp. 39–43 (2013) 22. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, pp. 1310–1318 (2013) 23. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) 24. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. https://nlp.stanford.edu/projects/glove/ (2014). Accessed 10 Jan 2019 25. Purves, R.S., et al.: The design and implementation of spirit: a spatially aware search engine for information retrieval on the internet. Int. J. Geogr. Inf. Sci. 21(7), 717–745 (2007) 26. Qin, T., Xiao, R., Fang, L., Xie, X., Zhang, L.: An efficient location extraction algorithm by leveraging web contextual information. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 53–60. ACM (2010) 27. Santos, J., Anast´ acio, I., Martins, B.: Using machine learning methods for disambiguating place references in textual documents. GeoJournal 80(3), 375–392 (2015) 28. SemEval: Toponym resolution in scientific papers. https://competitions.codalab. org/competitions/19948#learn the details-overview (2018). Accessed 20 Jan 2019 29. Tamames, J., de Lorenzo, V.: EnvMine: a text-mining system for the automatic extraction of contextual information. BMC Bioinformatics 11(1), 294 (2010) 30. Taylor, M.: Reduced Geographic Scope as a Strategy for Toponym Resolution. Ph.D. thesis, Northern Arizona University (2017) 31. Wang, P., Qian, Y., Soong, F.K., He, L., Zhao, H.: A unified tagging solution: bidirectional LSTM recurrent neural network with word embedding. arXiv preprint arXiv:1511.00215 (2015) 32. Weissenbacher, D., Sarker, A., Tahsin, T., Scotch, M., Gonzalez, G.: Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods. AMIA Summits Transl. Sci. Proc. 2017, 114 (2017) 33. Weissenbacher, D., et al.: Knowledge-driven geospatial location resolution for phylogeographic models of virus migration. Bioinformatics 31(12), i348–i356 (2015). https://dx.doi.org/10.1093/bioinformatics/btv259
Named Entity Recognition by Character-Based Word Classification Using a Domain Specific Dictionary
Makoto Hiramatsu1(B), Kei Wakabayashi1, and Jun Harashima2
1 University of Tsukuba, Tsukuba, Japan
2 Cookpad Inc., Yokohama, Japan
Abstract. Named entity recognition is a fundamental task in natural language processing and has been widely studied. The construction of a recognizer requires training data that contains annotated named entities. However, it is expensive to construct such training data for low-resource domains. In this paper, we propose a recognizer that uses not only training data but also a domain specific dictionary that is available and easy to use. Our recognizer first uses character-based distributed representations to classify words into categories in the dictionary. The recognizer then uses the output of the classification as an additional feature. We conducted experiments to recognize named entities in recipe text and report the results to demonstrate the performance of our method.

Keywords: Named entity recognition · Recipe text · Neural network
1 Introduction
Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP) [20]. The task is typically formulated as a sequence labeling problem, for example, estimating the most likely tag sequence Y = (y_1, y_2, ..., y_N) for a given word sequence X = (x_1, x_2, ..., x_N). We can train a recognizer using annotated data that consists of (X, Y) pairs. However, the construction of such annotated data is labor-intensive and time-consuming. Although the beginning, inside, and outside (BIO) format is often used in NER, it is challenging to annotate sentences with tags, particularly for people who are not familiar with NLP. Furthermore, there are low-resource domains that do not have a sufficient amount of data. We focus on the recipe domain as an example of such a domain. Even in such domains, we can find a variety of dictionaries available. For example, Nanba et al. [13] constructed a cooking ontology, Harashima et al. [3] constructed a dictionary for ingredients, and Yamagami et al. [21] built a knowledge base for basic cuisine. These resources can be utilized for NER (Fig. 1). In this paper, we propose a method to integrate a domain-specific dictionary into a neural NER using a character-based word classifier.
Fig. 1. LSTM-CRF based neural network.
We demonstrate the effectiveness of the proposed method using experimental results on the Cooking Ontology dataset [13] as a dictionary. We report our experimental results on the recipe domain NE corpus [12].
2 Related Work
In recent years, NER methods that use long short-term memory (LSTM) [4] and conditional random fields (CRF) [7] have been extensively studied [8–10,15,19]. This type of neural network is based on Huang et al. [5]. Note that they used a Bidirectional LSTM (Bi-LSTM), which concatenates two types of LSTM: a forward LSTM and a backward LSTM. In these studies, the researchers assumed that training data with sequence label annotation was provided in advance. In our experiments, we use recipe text as a low-resource domain to evaluate our proposed method. Although Mori et al. [12] constructed an r-NE corpus, it consists of only 266 Japanese recipes. To overcome this problem, Sasada et al. [18] proposed an NE recognizer that is trainable from partially annotated data. However, as seen in Sect. 5.4, the method does not perform better than recent neural network-based methods. Preparing training data for NER is time-consuming and difficult. In addition to the strategy that uses partial annotation, there have been attempts to make use of available resources. Peters et al. [15,16] acquired informative features using language modeling. However, these approaches require a large amount of unlabeled text for training, which makes them difficult to apply in a low-resource scenario. To avoid this difficulty, making use of a task that does not require a large amount of data could be useful.
Whereas it is time-consuming to prepare training data for NER, it is relatively easy to construct a domain-specific dictionary [1,13,21]. Some researchers have used a dictionary as an additional feature [10,19]. Pham et al. [10] incorporated dictionary matching information as additional dimensions of a feature vector of a token. In their method, the representations are zero vectors for words that are not in the dictionary. Our proposed method overcomes this limitation by extracting character-based features from a classifier trained on a dictionary.
3 Baseline Method
Fig. 2. Word-level feature extractor proposed by Lample et al. [8]
As described in Sect. 2, the popular methods use a Bi-LSTM (bidirectional LSTM) and a CRF, the so-called LSTM-CRF. Lample et al. [8] take account of not only word-level but also character-level information to extract features. We show an illustration of the word-level feature extractor proposed by Lample et al. in Fig. 2. Let X = (x_1, x_2, ..., x_N) be an input word sequence and C_t = (c_{t,1}, c_{t,2}, ..., c_{t,M}) be the character sequence of the t-th word. The word distributed representation corresponding to x_t is denoted v_{x_t}, and the character distributed representation corresponding to c_{t,k} is denoted v_{C_{t,k}}. Let V_{C_t} = (v_{C_{t,1}}, v_{C_{t,2}}, ..., v_{C_{t,M}}). Then their model can be represented as

w_t^(char) = Bi-LSTM^(char)(V_{C_t}),    (1)
x_t = [w_t ; w_t^(char)].    (2)

Then, let V_X = (x_1, x_2, ..., x_N) and

h_t = Bi-LSTM(V_X)_t,    (3)

where w_t indicates the word representation corresponding to x_t. After extracting the feature vectors h_t of the sequence, they apply a CRF to predict the tag sequence while considering the tag transitions. Let y = (y_1, y_2, ..., y_N) be a tag sequence. Using H = (h_1, h_2, ..., h_N), we can calculate the probability of the tag sequence as

P(y | H; W, b) = ∏_{i=1}^{n} ψ_i(y_{i−1}, y_i, H) / Σ_{y′ ∈ Y(H)} ∏_{i=1}^{n} ψ_i(y′_{i−1}, y′_i, H),    (4)

where ψ_i(y_{i−1}, y_i, H) = exp(W_{y_i}^T h_i + b_{y_{i−1},y_i}). W_{y_i} is the weight vector and b_{y_{i−1},y_i} is the bias term. What we want is the optimal tag sequence ŷ, which is defined by

ŷ = argmax_{y ∈ Y(H)} P(y | H; W, b).    (5)

We can obtain the optimal tag sequence ŷ by maximizing P with the Viterbi algorithm (Fig. 3).
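For concreteness, the following is a compact sketch of Viterbi decoding for a linear-chain CRF of this form. The per-position scores and the transition scores are assumed to already be in log space, and the variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions:   (N, K) array of per-position tag scores (e.g. W_y^T h_i)
    transitions: (K, K) array of transition scores b_{y', y}
    Returns the best tag sequence as a list of tag indices.
    """
    N, K = emissions.shape
    score = emissions[0].copy()                  # best score ending in each tag
    backptr = np.zeros((N, K), dtype=int)
    for i in range(1, N):
        # score of extending every previous tag y' with every current tag y
        cand = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for i in range(N - 1, 0, -1):                # trace back the best path
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]
```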
Fig. 3. Overview of the character-based word classifier. We use a 3 stacked Bi-LSTM.
4 Proposed Method
In this paper, we propose a recognizer that uses not only training data but also a domain-specific dictionary. As described in Sect. 1, it is expensive to construct training data for a recognizer. We thus make use of a domain-specific dictionary that contains pairs that consist of a word and category.
Fig. 4. Overview of the proposed method. We concatenate the classifier output to a feature vector from the Bi-LSTM.
Figure 4 shows the architecture of our proposed recognizer. Our recognizer can be considered as an extension of Lample et al. [8]. We incorporate a character-based word classifier, which calculates a_t as follows:

h_t^(classifier) = Stacked Bi-LSTM(C_t),    (6)
a′_t = W h_t^(classifier) + b,    (7)
a_t = Softmax(a′_t).    (8)

This classifier is a neural network that consists of an embedding layer, a stacked Bi-LSTM layer, and a fully connected layer. A stacked Bi-LSTM applies a Bi-LSTM k times, where k > 1. The classifier takes the character sequence of a word as input and predicts its category as defined in the dictionary. After passing the word through the classifier, our method concatenates the hidden state h_t calculated in Sect. 3 and the output of the classifier: h′_t = h_t ⊕ a_t. Finally, as in Sect. 3, our method transforms h′_t by z_t = W h′_t + b′, and the CRF predicts the most likely tag sequence.
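The following is a minimal PyTorch-style sketch of such a character-based word classifier and of the concatenation step. Reducing the Bi-LSTM output to a single word vector via the last time step, as well as the layer sizes, are assumptions made for illustration and not necessarily the authors' choices.

```python
import torch
import torch.nn as nn

class CharWordClassifier(nn.Module):
    """Char embeddings -> stacked Bi-LSTM -> fully connected -> softmax over categories."""
    def __init__(self, n_chars, n_categories, char_dim=50, hidden=50, layers=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_categories)

    def forward(self, char_ids):                  # (batch, max_word_len)
        h, _ = self.bilstm(self.embed(char_ids))  # (batch, len, 2*hidden)
        a = self.fc(h[:, -1, :])                  # last time step as the word summary (one simple choice)
        return torch.softmax(a, dim=-1)           # category distribution a_t

# The classifier output a_t is concatenated to the word-level Bi-LSTM state h_t
# before the CRF layer, e.g.:
# h_prime = torch.cat([h_t, a_t], dim=-1)
```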
Table 1. The statistics of the corpora used in our experiments. Note that the r-NE corpus is annotated for NEs with the BIO format. We show character-level information only for r-NE because it is used to train the recognizer.
Attribute    Cookpad      Wikipedia    r-NE
Doc          1,715,589    1,114,896    436
Sent         12,659,170   18,375,840   3,317
Token        216,248,517  600,890,895  60,542
Type         221,161      2,306,396    3,390
Char token   –            –            91,560
Char type    –            –            1,130
Table 2. r-NEs and their frequencies.
NE   Description          # of Examples
F    Food                 6,282
T    Tool                 1,956
D    Duration             409
Q    Quantity             404
Ac   Action by the chef   6,963
Af   Action by foods      1,251
Sf   State of foods       1,758
St   State of tools       216
Although our method is simple, it has two advantages: First, our method is based on character-level distributed representations, which avoid the mismatching problem between words in the training data and words in the dictionary. Second, the method can use a dictionary with arbitrary categories that are not necessarily equal to the NE categories in the sequence labels. Consequently, our method can be applied in all scenarios in which there is a small amount of training data that contains NEs and there is a domain dictionary constructed arbitrarily (Table 1).
5 Experiments

5.1 Datasets
We used the following four datasets:
– r-NE [12]: used to train and test methods. We used 2,558 sentences for training, 372 for validation, and 387 for testing.
Table 3. Word categories, frequencies, and results on classification.
Category                                  # of Examples  Prec  Recall  Fscore
Ingredient-seafood (example: salmon)      452            0.60  0.62    0.61
Ingredient-meat (example: pork)           350            0.88  0.83    0.85
Ingredient-vegetable (example: lettuce)   935            0.75  0.79    0.77
Ingredient-other (example: bread)         725            0.75  0.71    0.73
Condiment (example: salt)                 907            0.81  0.84    0.83
Kitchen tool (example: knife)             633            0.79  0.74    0.76
Movement (example: cut)                   928            0.94  0.99    0.96
Other                                     896            0.70  0.66    0.68
– Cooking Ontology [13]: used to train the word classifier. We use 3,825 words for training, 1,000 for validation, and 1,000 for testing.
– Cookpad [2]: used to train word embeddings. The Cookpad corpus contains 1.7M recipe texts.
– Wikipedia: used to train word embeddings. There are various types of topics in this corpus. We downloaded the raw data of this corpus from the Wikipedia dump (https://dumps.wikimedia.org/jawiki/). The Wikipedia corpus contains 1.1M articles.

As shown in Table 2 and Table 3, the categories in the cooking ontology were different from the tags in the r-NE corpus. However, as described in Sect. 4, our method flexibly incorporates such information into its network.
5.2 Methods
We compared the following methods in our experiments:
– Sasada et al. [18] is a pointwise tagger. They use Logistic Regression (LR) as the tagger.
– Sasada et al. [18]+DP is an extension of LR, which optimizes LR's predictions using dynamic programming. This method achieved state-of-the-art performance for the r-NE task.
– Lample et al. [8] is the LSTM-CRF tagger described in Sect. 2.
– Dictionary is an LSTM-CRF based naive baseline that uses a dictionary. A dictionary feature is added to Lample's features in the form of a one-hot vector.
Table 4. Results on NER (averaged over five times except for Sasada et al. [18] because KyTea [14], the text analysis toolkit used in their experiments, does not have the option to specify a random seed).
Method              Decoder  Embedding  Prec.           Recall          Fscore
Sasada et al. [18]  –        –          82.34           80.18           81.20
Sasada et al. [18]  DP       –          82.94           82.82           82.80
Lample et al. [8]   CRF      Uniform    82.59 (± 0.94)  88.19 (± 0.25)  85.24 (± 0.46)
Lample et al. [8]   CRF      Cookpad    84.54 (± 1.22)  88.47 (± 0.69)  86.40 (± 0.89)
Lample et al. [8]   CRF      Wikipedia  85.31 (± 0.67)  88.22 (± 0.65)  86.68 (± 0.47)
Dictionary          CRF      Uniform    82.36 (± 1.25)  88.28 (± 0.25)  85.18 (± 0.71)
Dictionary          CRF      Cookpad    83.91 (± 1.21)  88.60 (± 0.41)  86.16 (± 0.72)
Dictionary          CRF      Wikipedia  85.44 (± 1.04)  87.67 (± 0.25)  86.50 (± 0.56)
Proposed            CRF      Uniform    82.81 (± 0.88)  88.40 (± 0.41)  85.46 (± 0.58)
Proposed            CRF      Cookpad    85.08 (± 1.30)  88.46 (± 0.18)  86.68 (± 0.71)
Proposed            CRF      Wikipedia  85.63 (± 0.52)  88.87 (± 0.37)  87.18 (± 0.34)
– Proposed is the proposed method that uses the character-level word classifier described in Sect. 4.
5.3 Pre-trained Word Embeddings
In NLP, a popular approach is to make use of pre-trained word embeddings to initialize parameters in neural networks. In this paper, three strategies are used to initialize word vectors:
– Uniform initializes word vectors by sampling from the uniform distribution over [−3/dim, 3/dim].
– Wikipedia initializes word vectors using those trained on the Wikipedia corpus. Word vectors not in the pre-trained word vectors are initialized by Uniform.
– Cookpad initializes word vectors using those trained on the Cookpad corpus. Word vectors not in the pre-trained word vectors are initialized by Uniform.

We train word embeddings with skip-gram with negative sampling (SGNS) [11]. As the hyperparameters of SGNS, we set 100 as the dimension of the word vectors, 5 for the size of the context window, and 5 for the number of negative examples, and use the default parameters defined in Gensim [17] for the other parameters. In our proposed network, we set 50 dimensions for the character-level distributed representations and 2 × 50 for the character-level Bi-LSTM used as the word classifier. The word feature extracted by the word classifier is concatenated with the word-level representation and fed into the word-level Bi-LSTM to obtain the entire
word features. To train the neural networks, we use the Adam optimizer [6] with a mini-batch size of 10 and clip gradients with a threshold of 5.0.
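As a minimal Gensim sketch of training SGNS vectors with these hyperparameters, the following may be helpful; `sentences` is a hypothetical iterable of tokenized sentences from the Cookpad or Wikipedia corpus, and the `vector_size` argument name follows Gensim 4.x (older versions call it `size`).

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension of the word vectors
    window=5,          # context window size
    negative=5,        # number of negative samples
    sg=1,              # skip-gram (with negative sampling)
    min_count=1,
)
vec = model.wv["tomato"]   # 100-dimensional vector for an in-vocabulary word
```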
5.4 Experimental Results and Discussion
Table 3 shows the performance of our word classifier. Our classifier successfully classified words with a certain degree of accuracy. We show the results of comparing each recognizer in Table 4. In our experiments, (i) pre-trained word vectors played an essential role in improving the performance of NER and (ii) our classifier enhanced the performance of Lample's method. Interestingly, we obtained the best result when the pre-trained word vectors were trained on the Wikipedia corpus, which is not a domain-specific corpus. This suggests that our method successfully combined universal knowledge from the pre-trained word vectors and domain-specific knowledge from the classifier trained on a domain-specific dictionary.

Table 5. Results on named entity recognition (for each NE, averaged over five times).
NE   Precision        Recall           Fscore
Ac   91.77 (± 1.02)   95.23 (± 0.42)   93.46 (± 0.33)
Af   78.87 (± 3.68)   78.12 (± 1.19)   78.46 (± 2.22)
D    96.63 (± 1.71)   93.88 (± 2.88)   95.23 (± 2.16)
F    85.84 (± 0.94)   89.01 (± 0.65)   87.39 (± 0.59)
Q    58.70 (± 3.81)   70.00 (± 3.19)   63.69 (± 1.82)
Sf   75.12 (± 4.40)   78.17 (± 1.95)   76.52 (± 2.04)
St   66.03 (± 5.64)   52.63 (± 4.70)   58.46 (± 4.52)
T    82.53 (± 2.30)   89.09 (± 1.26)   85.66 (± 1.21)
We show the label-wise results of prediction in Table 5. In this result, we can see that the proposed model successfully predicted tags of Ac, D, F, and T. However, prediction performances for Af, Q, Sf, and St were limited because there is no entry for these categories in our dictionary.
Fig. 5. Prediction results for an example.
Example          Yoji            de    tome   te
Translation      Cocktail stick  DAT   clip   –
Ground Truth     B-T             O     B-Ac   O
Baseline         B-Sf            O     B-Ac   O
Proposed method  B-T             O     B-Ac   O
Fig. 6. Prediction results for another example.
Example          Denshi renji  (   500    W     )   de
Translation      Microwave     (   500    W     )   DAT
Ground Truth     B-T  I-T      O   B-St   I-St  O   O
Baseline         B-T  I-T      O   B-T    I-T   O   O
Proposed method  B-T  I-T      O   B-St   I-St  O   O
Figure 5 and Fig. 6 show prediction results for the baseline and our method. Note that the abbreviation DAT means dative. In the first example, the word classifier taught the model that cocktail stick was a kitchen tool, which made the proposed method successfully recognize it as a tool. In the second example, the word classifier taught the model that 500W is not a kitchen tool. Then, the proposed method avoided the baseline's failure and estimated the correct NE tag sequence.
6 Conclusion
We proposed a recognizer that is trainable from not only annotated NEs but also a list of examples for some categories related to NE tags. The proposed method uses the output of a character-based word classifier. Thanks to this character-based modeling, the proposed method considers sub-word information to extract dictionary features for words not in the dictionary. Our experiment demonstrates that our method achieves state-of-the-art performance on the r-NE task. This implies that the proposed method successfully extracts an informative feature to improve the performance of NER.
References
1. Chung, Y.J.: Finding food entity relationships using user-generated data in recipe service. In: Proceedings of International Conference on Information and Knowledge Management, pp. 2611–2614 (2012)
2. Harashima, J., Michiaki, A., Kenta, M., Masayuki, I.: A large-scale recipe and meal data collection as infrastructure for food research. In: Proceedings of International Conference on Language Resources and Evaluation, pp. 2455–2459 (2016)
3. Harashima, J., Yamada, Y.: Two-step validation in character-based ingredient normalization. In: Proceedings of Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, pp. 29–32 (2018)
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1–32 (1997)
5. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging (2015). https://arxiv.org/abs/1508.01991
6. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations (2015)
7. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of International Conference on Machine Learning, pp. 282–289 (2001)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270 (2016)
9. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of Annual Meeting of the Association for Computational Linguistics (2016)
10. Mai, K., Pham, et al.: An empirical study on fine-grained named entity recognition. In: Proceedings of International Conference on Computational Linguistics, pp. 711–722 (2018)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of International Conference on Learning Representations (2013)
12. Mori, S., Maeta, H., Yamakata, Y., Sasada, T.: Flow graph corpus from recipe texts. In: Proceedings of International Conference on Language Resources and Evaluation, pp. 2370–2377 (2014)
13. Nanba, H., Takezawa, T., Doi, Y., Sumiya, K., Tsujita, M.: Construction of a cooking ontology from cooking recipes and patents. In: Proceedings of ACM International Joint Conference on Pervasive and Ubiquitous Computing Adjunct Publication, pp. 507–516 (2014)
14. Neubig, G., Nakata, Y., Mori, S.: Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 529–533 (2011)
15. Peters, M.E., Ammar, W., Bhagavatula, C., Power, R.: Semi-supervised sequence tagging with bidirectional language models. In: Proceedings of Annual Meeting of the Association for Computational Linguistics, pp. 1756–1765 (2017)
16. Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2227–2237 (2018)
17. Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of LREC Workshop on New Challenges for NLP Frameworks, pp. 45–50 (2010)
18. Sasada, T., Mori, S., Kawahara, T., Yamakata, Y.: Named entity recognizer trainable from partially annotated data. In: Proceedings of International Conference of the Pacific Association for Computational Linguistics, vol. 593, pp. 148–160 (2015)
19. Sato, M., Shindo, H., Yamada, I., Matsumoto, Y.: Segment-level neural conditional random fields for named entity recognition. In: Proceedings of International Joint Conference on Natural Language Processing, pp. 97–102 (2017)
20. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of International Conference on Computational Linguistics, pp. 2145–2158 (2018)
21. Yamagami, K., Kiyomaru, H., Kurohashi, S.: Knowledge-based dialog approach for exploring user's intention. In: Proceedings of FAIM/ISCA Workshop on Artificial Intelligence for Multimodal Human Robot Interaction, pp. 53–56 (2018)
Cold Is a Disease and D-cold Is a Drug: Identifying Biological Types of Entities in the Biomedical Domain
Suyash Sangwan1(B), Raksha Sharma2, Girish Palshikar3, and Asif Ekbal1
1 Indian Institute of Technology Patna, Bihta, India
2 Indian Institute of Technology Roorkee, Roorkee, India
3 TCS Innovation Labs, Chennai, India
Abstract. Automatically extracting different types of knowledge from authoritative biomedical texts, e.g., scientific medical literature, electronic health records etc., and representing it in a computer analyzable as well as human-readable form is an important but challenging task. One such type of knowledge is the identification of entities with their biological types in the biomedical domain. In this paper, we propose a system which extracts end-to-end entity mentions with their biological types from a sentence. We consider 7 interrelated tags for biological types, viz., gene, biological-process, molecular-function, cellular-component, protein, disease, drug. Our system employs an automatically created biological ontology and implements an efficient matching algorithm for end-to-end entity extraction. We compare our approach with a Noun-based entity extraction system (baseline), and we also show a significant improvement over standard entity extraction tools, viz., Stanford-NER and Stanford-OpenIE.
Keywords: Biomedical entity tagging · Ontology creation · Entity extraction · POS tagging

1 Introduction
An enormous amount of biomedical data have been generated and collected at an unprecedented speed and scale. For example, the application of electronic health records (EHRs) is documenting large amounts of patient data. However, retrieving and processing this information is very difficult due to the lack of formal structure in the natural language used in these documents. Therefore we need to build systems which can automatically extract the information from the biomedical text which holds the promise of easily consolidating large amounts of biological knowledge in computer or human accessible form. Ability to query and use such extracted knowledge-bases can help scientists, doctors and other
users in performing tasks such as question-answering, diagnosis and identifying opportunities for new research. Automatic identification of entities with their biological types is a complex task due to the domain-specific occurrences of entities. Consider the following example to understand the problem well.

– Input Sentence: Twenty courses of 5-azacytidine (5-Aza) were administered as maintenance therapy after induction therapy with daunorubicin and cytarabine.
– Entities Found = {5-azacytidine, daunorubicin, cytarabine}
– Biological types for the Entities = {drug, drug, drug}

All three extracted entities in the example are specific to the biomedical domain, having the biological type drug. Hence, an entity extraction tool trained on generic data will not be able to capture these entities. Figure 1 shows the entities tagged by the Stanford-NER and Stanford-OpenIE tools. Stanford-NER fails to tag any of the entities, while Stanford-OpenIE is able to tag daunorubicin. In this paper, we propose an approach which uses the biomedical ontology to extract end-to-end entity mentions with their biological types from a sentence in the biomedical domain. By end-to-end we mean correctly identifying the boundary of each entity mention. We consider 7 interrelated tags for biological types, viz., gene, biological-process, molecular-function, cellular-component, protein, disease, drug. They together form a complete biological system, where one biological type is the cause or effect of another biological type. The major contributions of this research are as follows.
1. Ontology in the biomedical domain: Automatic creation of an ontology having biological entities with their biological types.
2. Identifying end-to-end entities with their biological types: We have implemented an efficient matching algorithm, named All Subsequences Entity Match (ASEM). It is able to extract entities with their biological types from a sentence using the ontology. ASEM is also able to tag entities which are a subsequence of another entity. For example, for mammalian target of rapamycin, our system detects two entities {mammalian target of rapamycin, rapamycin} with biological types {protein, drug} respectively.
Since nouns are the visible candidates for being entities, we consider a Noun-based entity extraction system as a baseline. This system uses the NLTK POS tagger for tagging the words with POS tags. In addition, we compare the performance of our ASEM-based approach with Stanford-NER (https://nlp.stanford.edu/software/CRF-NER.shtml) and Stanford-OpenIE (https://nlp.stanford.edu/software/openie.html). The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 gives a description of the dataset used. Section 4 presents the ontology creation details and the ASEM algorithm. Section 5 provides the experimental setup and results, and Sect. 6 concludes the paper.
Fig. 1. Entity tagging by Stanford-NER and Stanford-OpenIE
2 Related Work
Entity Extraction has been a widely studied area of research in NLP. There have been attempts for both supervised as well as unsupervised techniques for entity extraction task [5]. Etzioni et al. (2005) [8] proposed an unsupervised approach to extract named entities from the Web. They built a system KNOWITALL, which is a domain-independent system that extracts information from the Web in an unsupervised and open-ended manner. KNOWITALL introduces a novel, generate-and-test architecture that extracts information in two stages. KNOWITALL utilizes a set of eight domain-independent extraction patterns to generate candidate facts. Baluja et al. (2000) [2] presented a machine learning approach for building an efficient and accurate name spotting system. They described a system that automatically combines weak evidence from different, easily available sources: parts-of-speech tags, dictionaries, and surface-level syntactic information such as capitalization and punctuation. They showed that the combination of evidence through standard machine learning techniques yields a system that achieves performance equivalent to the best existing hand-crafted approach. Carreras et al. (2002) [3] presented a Named Entity Extraction (NEE) problem as two tasks, recognition (NER) and classification (NEC), both the tasks were performed sequentially and independently with separate modules. Both modules are machine learning based systems, which make use of binary AdaBoost classifiers. Cross-lingual techniques are also developed to build an entity recognition system in a language with the help of another resource-rich language [1,6,7,11,14,17,19]. There are a few instances of use of already existing ontology or creation of a new ontology for entity extraction task. Cohen and Sarawagi, (2004) [4] considered the problem of improving named entity recognition (NER) systems by using external dictionaries. More specifically, they extended state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. Textpresso’s which is a tool by Muller et al. (2004) [12] has two major elements, it has a collection of the full text of scientific articles split into individual sentences and the implementation of categories of terms for which a database of articles and individual sentences can be searched.
The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. Wang et al. (2009) [16] used approximate dictionary matching with edit distance constraints. Their solution was based on an improved neighborhood generation method employing partitioning and prefix pruning techniques. They showed that their entity recognition system was able to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. There are a few instances of entity extraction in the biomedical domain [10,18,20]. Takeuchi and Collier (2005) [15] applied Support Vector Machine for the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditionally named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. More recently, Joseph et al. (2012) [9] built a search engine dedicated to the biomedical domain, they also populated a dictionary of domain-specific entities. In this paper, we have proposed an unsupervised approach, which first generates a domain-specific ontology and then performs all subsequence matches against the ontology entries for entity extraction in the biomedical domain.
3 Dataset
To evaluate the performance of our algorithm and other approaches, we asked an expert to manually annotate a dataset of 50 abstracts (350 sentences) of Leukemia-related papers from PubMed [13] having cause-effect relations. We obtained 231 biological entities with their biological types. Below is an example from the manually tagged output; entities are enclosed in curly brackets with their types attached: {6-Mercaptopurine} drug (6-MP drug) is one of the main components for the treatment of childhood {acute lymphoblastic leukemia} disease (ALL disease). To observe the performance of our system on a large corpus, we used an untagged corpus of 10,000 documents provided by Sharma et al. (2018) [13]. They downloaded 10,000 abstracts of Leukemia-related papers from PubMed using the Biopython library with the Entrez package and used this dataset (89,947 sentences with 1,935,467 tokens) to identify causative verbs in the biomedical domain.
4 Approach
In this paper, we present an approach to identify end-to-end biological entities together with their biological types in a sentence, using an automatically created ontology. The following Sects. 4.1 and 4.2 elaborate on the ontology creation process and the ASEM algorithm.
4.1 Ontology Creation
We automatically built an ontology for 7 biological types, viz., gene, biological-process, molecular-function, cellular-component, protein, disease, and drug. We referred to various authentic websites that list biological entity names with their types (see Footnote 3). Since direct download links are not available to obtain the complete data, we built a customized HTML parser to extract biological entities with their types; the selection of the websites for this task was done manually. We obtained an ontology of size 90,567 with our customized HTML parser. Joseph et al. (2012) [9] also created a dictionary of biological entities with information about their biological types. They used this dictionary to equip TPX, a Web-based PubMed search enhancement tool that enables faster article searching using analysis and exploration features. Their process of creating the dictionary from various sources has been granted a Japanese patent (JP2013178757). In order to enrich our ontology further, we included the entities available in TPX.

Table 1. Ontology: entity names and biological types
Entity                 Biological type
Chitobiase             Gene
Reproduction           Biological-process
Acyl binding           Molecular-function
Obsolete repairosome   Cellular-component
Delphilin              Protein
Acanthocytoses         Disease
Calcimycin             Drug
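The customized HTML parser itself is not described in the paper. The following is only a rough sketch of how such an ontology could be scraped, under the assumption of a hypothetical source page that lists one entity per table row with its type in a second cell; the markup structure and all names here are illustrative, not the parser actually used.

```python
# Minimal sketch of an ontology scraper (assumption: each source page lists one
# entity per <tr> row with two <td> cells, "name" and "type"; real sources such
# as the Gene Ontology use different formats).
from html.parser import HTMLParser

class EntityTableParser(HTMLParser):
    """Collects (entity_name, biological_type) pairs from simple HTML tables."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.row = []
        self.pairs = []          # list of (entity, type) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
        elif tag == "tr":
            self.row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and len(self.row) == 2:
            self.pairs.append((self.row[0].strip().lower(), self.row[1].strip()))

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data)

def build_ontology(html_pages):
    """Store the ontology as a hash table: entity name -> biological type."""
    ontology = {}
    for page in html_pages:
        parser = EntityTableParser()
        parser.feed(page)
        for name, btype in parser.pairs:
            ontology[name] = btype
    return ontology

if __name__ == "__main__":
    demo = "<table><tr><td>Calcimycin</td><td>Drug</td></tr></table>"
    print(build_ontology([demo]))   # {'calcimycin': 'Drug'}
```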
We found approximately 150,000 new biological entities with their types from the work of Joseph et al. [9]. The entire ontology is stored in a hash-table format, where the entity name is the unique key and the biological type is the value. Table 1 shows a few entries from the ontology used in the paper, and Table 2 lists the total number of entities extracted for each biological type.

4.2 Algorithm: Identify Entity with Its Biological Type
Since nouns are explicitly visible candidates for being biological entities, we designed a Noun-based entity extraction system. The performance of this system depends entirely on the POS tagger, which assigns the NOUN tag. The Noun-based system is not able to identify end-to-end entities or the correct entity boundary. Our ASEM-based system is able to find the boundary of an entity in a sentence without POS tag information.
Footnote 3: 1. http://www.geneontology.org/, 2. https://bioportal.bioontology.org/ontologies/DOID, 3. http://browser.planteome.org/amigo/search/ontology?q=%20regimen.
Table 2. Ontology statistics

Biological type       No. of entities
Gene                  179,591
Biological-process    30,695
Molecular-function    11,936
Cellular-component    4,376
Protein               116,125
Disease               74,470
Drug                  52,923
Noun-Based System. The system comprises 4 modules; Fig. 2 depicts the workflow of the Noun-based system. The modules are described below.

Module-1: POS Tagging. We used the NLTK POS tagger to tokenize words and assign POS tags. Words tagged as nouns are considered candidates for being biological entities. The tagger is trained on a general corpus (see Footnote 4), and we observed that NLTK failed to assign correct tags to many words specific to the biomedical domain. For example, 6-Mercaptopurine is tagged as an adjective by NLTK; however, it is the name of a medicine used in Leukemia treatment and hence should be tagged as a noun.

Module-2: Preprocessing. Since NLTK tokenizes and tags many words erroneously, we apply preprocessing to the output produced by Module-1. In the preprocessing step, we removed all single-letter words tagged as nouns, as well as words starting and ending with a symbol. In order to reduce the percentage of wrongly tagged words, we also removed stop words, using a standard list of stop words (very high-frequency words) in the biomedical domain (see Footnote 5). A few examples of stop words from the list are: {Blood, analysis, acid, binding, brain, complex}.

Module-3: Get Abbreviations (Abv). We observed that there were entries in the ontology for the abbreviation of an entity, but not for the entity itself. To capture such instances, we defined rules to form abbreviations from the words of a sentence. For example, acute lymphoblastic leukemia is also represented as ALL; if acute lymphoblastic leukemia is missing in the ontology but ALL is present, we assign the biological type of ALL to acute lymphoblastic leukemia. A minimal sketch of such an abbreviation rule is given after the footnotes below.
Footnote 4: We also experimented with the Stanford POS tagger, but the performance of this tagger was worse than the NLTK tagger for the biological entities.
Footnote 5: Available at: https://www2.informatik.hu-berlin.de/~hakenber/corpora/medline/wordFrequencies.txt.
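The exact abbreviation-formation rules of Module-3 are not spelled out in the paper; the sketch below shows one plausible rule (concatenating the initial letters of a multi-word term, as in acute lymphoblastic leukemia → ALL) and is an assumption for illustration only.

```python
def get_abbreviation(term):
    """Form a candidate abbreviation from the initial letters of a multi-word term.
    This is an illustrative rule only; the actual rule set used in the paper
    may be more elaborate (e.g., handling hyphens or stop words)."""
    words = term.split()
    if len(words) < 2:
        return None
    return "".join(w[0] for w in words).upper()

print(get_abbreviation("acute lymphoblastic leukemia"))  # ALL
```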
Fig. 2. Work flow of the Noun-based system
Module-4: Extract Biological Type. This module searches for the Entity Candidate (EC) in the ontology (O). If there is an exact match for the candidate word, the Biological Type (BT) of the entity is extracted. The final outcome of this module is the biological type attached to the entity name.

ASEM-Based System. The All Subsequences Entity Match algorithm finds all subsequences up to the sentence length (n) in the sentence whose entities have to be recognized with biological types (see Footnote 6). In order to obtain all possible subsequences, we used the n-gram package of NLTK (see Footnote 7). This system does not require a preprocessing step, as it does not consider POS-tagged words; hence there is no error due to the tagger. Module-3 (Get Abbreviations) of the Noun-based approach is also part of the ASEM algorithm, as it helps to obtain the biological type of an entity whose abbreviation is an entry in the ontology even when the entity itself is not. If we find an entry in the ontology for any subsequence, we consider the subsequence a valid biological entity and retrieve its biological type from the ontology. Algorithm 1 gives the pseudo code of the proposed approach, and Table 3 defines the functions and symbols used in Algorithm 1.
Footnote 6: Though we obtained subsequences up to length n, we observed that no entity was more than 4 words long.
Footnote 7: Available at: http://www.nltk.org/_modules/nltk/model/ngram.html.
Input: WP = {w_B^1, w_B^2, ..., w_B^k}, TPX_Dictionary, S = {s_B^1, s_B^2, ..., s_B^m}
Output: Entity names with their entity types for each s ∈ S

  Ontology := ∅
  for each Web-page wp ∈ WP do
      Ontology[Entity_Name, Entity_Type] := HTMLParser(wp)
  end
  Ontology[Entity_Name, Entity_Type] := Ontology[Entity_Name, Entity_Type] ∪ TPX_Dictionary
  for each sentence s ∈ S do
      E_BT := ∅
      NG := n-grams(s)    // all subsequences of s, where n ∈ {1, 2, ..., length(s)}
      for each ng ∈ NG do
          abv_ng := Get_Abbreviation(ng)
          if ng in Ontology then
              E_BT := E_BT ∪ (ng, Ontology[ng])
          if abv_ng in Ontology then
              E_BT := E_BT ∪ (ng, Ontology[abv_ng])
      end
      Entities in s with their biological types: E_BT
  end

Algorithm 1: Identifying Biological Types of Biological Entities

Table 3. Symbols used in Algorithm 1

Symbol               Description
WP                   Set of relevant Web-pages
TPX_Dictionary       Dictionary by Joseph et al. [9]
S                    Set of sentences in the Biomedical (B) domain
Ontology             Hash-table with entity name as key and its biological type as value
HTMLParser()         Extracts an entity name and its value from an HTML page
E_BT                 Set of entities tagged with Biological Types (BT)
n-grams()            Function to obtain all subsequences (ng) of s ∈ S
Get_Abbreviation()   Function to generate the abbreviation of ng
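For illustration, a compact Python sketch of Algorithm 1 is given below. It assumes the ontology is already available as a hash table (Sect. 4.1) and uses a simple generator in place of the NLTK n-gram package; all function and variable names are illustrative.

```python
def all_subsequences(tokens, max_len=None):
    """Generate all contiguous word n-grams of a sentence (n = 1 .. max_len)."""
    max_len = max_len or len(tokens)
    for n in range(1, min(max_len, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def asem(sentence, ontology, get_abbreviation):
    """All Subsequences Entity Match: return {entity: biological_type}."""
    tokens = sentence.split()
    entities = {}
    for ng in all_subsequences(tokens):
        abv = get_abbreviation(ng)
        if ng.lower() in ontology:                 # exact match in the ontology
            entities[ng] = ontology[ng.lower()]
        elif abv and abv.lower() in ontology:      # match via the abbreviation
            entities[ng] = ontology[abv.lower()]
    return entities

# Toy usage with a tiny illustrative ontology.
ontology = {"6-mercaptopurine": "Drug", "all": "Disease"}
print(asem("6-Mercaptopurine treats acute lymphoblastic leukemia",
           ontology,
           lambda t: "".join(w[0] for w in t.split()).upper()
                     if len(t.split()) > 1 else None))
```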
5 Experimental Setup and Results
In this paper, we hypothesize that matching all subsequences of a sentence against an automatically created ontology in the biomedical domain can efficiently extract end-to-end entities with their biological types. We compare our ASEM-based system with a Noun-based system (Sect. 4.2), an NER-based system, and an OpenIE-based system. Named entities are good candidates for being biological entities; we used the Stanford Named Entity Recognizer (NER) to obtain them. Open information extraction (OpenIE), on the other hand, refers to the extraction of binary relations from plain text, such as (Mark Zuckerberg; founded; Facebook). It assigns subject and object tags to related arguments, and we considered these two arguments as candidates for being entities; here we used the Stanford OpenIE tool. NER and OpenIE are able to extract end-to-end entities, in other words, entities consisting of multiple words. However, both fail to tag many of the entities that are specific to the biomedical domain (see the example in Fig. 1). Algorithm 1 remains the same with NER or OpenIE, except that the all-subsequences set NG is replaced with the set of entities extracted by NER or OpenIE.

Table 4. Precision (P), Recall (R) and F-score (F) using different approaches, in %
System       P      R      F
Noun-based   74.54  35.49  48.09
NER          94.44  7.35   13.65
OpenIE       93.75  12.98  22.81
ASEM         95.86  80.08  87.26
Table 4 shows the results obtained with the 4 different systems, viz., Noun-based, NER-based, OpenIE-based, and ASEM-based (our approach), on test data of 350 sentences with 231 manually annotated entities and their biological types. A True Positive (TP) is counted when both the entity and its type exactly match the manually tagged entry; otherwise it is a False Positive (FP). A False Negative (FN) is counted when a manual tag exists for an entity but the system does not produce it. We used the same ontology to obtain biological types with all 4 systems of Table 4. The results validate our hypothesis that the ASEM-based system obtains a satisfactory level of Precision (P), Recall (R), and F-score (F) for this domain-specific task. Though Precision is good in all cases, the first three systems fail to reach a good Recall, as they rely on external NLP tools to extract entities from text. Table 5 shows the results obtained with the ASEM-based system for each biological type. We obtained a positive Pearson correlation of 0.67 between the Recall obtained for the biological types ('R' column of Table 5) and the per-type size of the ontology ('No. of entities' column of Table 2).
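The exact-match scoring just described can be summarized by the following sketch, which counts an entry as a true positive only when both the entity string and its type match the gold annotation (names and example data are illustrative).

```python
def prf(gold, predicted):
    """Exact-match scoring: an entry counts as TP only if both the entity
    string and its biological type match the manual annotation."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("6-Mercaptopurine", "Drug"), ("acute lymphoblastic leukemia", "Disease")}
pred = {("6-Mercaptopurine", "Drug"), ("leukemia", "Disease")}
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```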
Table 5. Precision, Recall and F-score in % with respect to biological type, for the ASEM-based system

B-Type               P     R     F
Gene                 92    95    92
Biological-process   100   75    86
Molecular-function   86    50    63
Cellular-component   100   10    18
Protein              90    91    93
Disease              100   100   100
Drug                 100   100   100
This positive correlation suggests that enriching the ontology further would enhance the performance of our approach.

Error Analysis: In the Noun-based system, where nouns are considered as candidate entities, precision is the lowest among all approaches. The NLTK (or Stanford) POS tagger is not able to correctly tag domain-specific entities like PI3K/AKT, JNK/STAT, etc. (words containing a special character); it treats PI3K and AKT as two separate words and assigns tags accordingly. Below is an example from the biomedical domain which shows the use of these entities: "Targeted therapies in pediatric leukemia are targeting BCR/ABL, TARA and FLT3 proteins, which activation results in the downstream activation of multiple signaling pathways, including the PI3K/AKT, JNK/STAT, Ras/ERK pathways". These random breaks in entities introduced by the POS tagger cause a drop in the overall precision of the system. In addition, the Noun-based system is not able to detect the boundary of an entity, whereas the biomedical domain is full of entities consisting of multiple words. Hence, the Noun-based system produces a poor F-score of 48.09%. The NER-based system, which uses the Stanford NER tagger, does not break words like BCR/ABL, PI3K/AKT, JNK/STAT, and Ras/ERK into separate entities, unlike the Noun-based system; therefore its Precision is considerably higher. But due to the generic behavior of Stanford NER, it extracts very few entities, so false negatives increase sharply and Recall drops to 7.35%. The OpenIE-based system, on the other hand, considers all subjects and objects as candidate entities, so there is a relatively higher chance of extracting the exact entity. Our ASEM-based approach considers all subsequences of the input sentence as candidate entities and matches these subsequences against the entries in the ontology. Therefore, we are able to capture all one-word entities, abbreviations, and entities consisting of more than one word; consequently, we obtain a high Recall. However, the ontology is automatically created from the Web.
There are a few entities in our gold-standard dataset that are not found in the ontology; on the other hand, the ontology also contains a few words like led, has, and next, which are not biological entities according to our annotator. These lacunae in the ontology cause a drop in the P and F scores of our system. The positive correlation of 0.67 between Recall and ontology size again indicates that enriching the ontology further would enhance the performance of our approach (see the sketch below).
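As a sanity check, the reported correlation can be recomputed from the rounded values published in Tables 2 and 5; the sketch below is purely illustrative and, because of rounding, only approximately reproduces the reported 0.67.

```python
import math

# Per-type Recall (Table 5) and ontology size (Table 2), in the same type order.
recall = [95, 75, 50, 10, 91, 100, 100]
size   = [179591, 30695, 11936, 4376, 116125, 74470, 52923]

def pearson(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(recall, size), 2))
```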
6 Conclusion
The biomedical domain is full of domain-specific entities, which can be distinguished based on their biological types. In this paper, we presented a system to identify biological entities with their types in a sentence. We showed that All Subsequence Entity Match against an automatically created, domain-specific ontology provides a more effective solution than Noun-based entity extraction. In addition, due to the generic behavior of standard entity extraction tools like Stanford NER and Stanford OpenIE, they fail to match the level of performance achieved with the ASEM-based system. Furthermore, the high positive correlation between the Recall obtained with the ASEM-based system and the ontology size emphasizes that expanding the ontology can lead to a better system for this domain-specific knowledge (entity with its type) extraction task. Though we have demonstrated the efficacy of our approach in the biomedical domain, we believe that it can be extended to any other domain where entities are domain-specific and can be distinguished based on their types, for example, the financial or legal domains.
References 1. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 8–15. Association for Computational Linguistics (2003) 2. Baluja, S., Mittal, V.O., Sukthankar, R.: Applying machine learning for highperformance named-entity extraction. Comput. Intell. 16(4), 586–595 (2000) 3. Carreras, X., Marquez, L., Padr´ o, L.: Named entity extraction using adaboost. In: Proceedings of the 6th Conference on Natural Language Learning, vol. 20, pp. 1–4. Association for Computational Linguistics (2002) 4. Cohen, W.W., Sarawagi, S.: Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 89–98. ACM (2004) 5. Collins, M.: Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 489–496. Association for Computational Linguistics (2002) 6. Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th International Conference on Semantic Systems, pp. 121–124. ACM (2013)
7. Darwish, K.: Named entity recognition using cross-lingual resources: Arabic as an example. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1558–1567 (2013) 8. Etzioni, O., et al.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165(1), 91–134 (2005) 9. Joseph, T., et al.: TPX: biomedical literature search made easy. Bioinformation 8(12), 578 (2012) 10. Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., Valencia, A.: CHEMDNER: the drugs and chemical names extraction challenge. J. Cheminform. 7(1), S1 (2015) 11. Laurent, D., S´egu´ela, P., N`egre, S.: Cross lingual question answering using QRISTAL for CLEF 2006. In: Peters, C., et al. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 339–350. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3540-74999-8 41 12. M¨ uller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2(11), e309 (2004) 13. Sharma, R., Palshikar, G., Pawar, S.: An unsupervised approach for causeeffect relation extraction from biomedical text. In: Silberztein, M., Atigui, F., Kornyshova, E., M´etais, E., Meziane, F. (eds.) NLDB 2018. LNCS, vol. 10859, pp. 419–427. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91947-8 43 14. Sudo, K., Sekine, S., Grishman, R.: Cross-lingual information extraction system evaluation. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 882. Association for Computational Linguistics (2004) 15. Takeuchi, K., Collier, N.: Bio-medical entity extraction using support vector machines. Artif. Intell. Med. 33(2), 125–137 (2005) 16. Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 759–770. ACM (2009) 17. Yang, Z., Salakhutdinov, R., Cohen, W.: Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270 (2016) ˇ ˇ Holzinger, A.: An adaptive 18. Yimam, S.M., Biemann, C., Majnaric, L., Sabanovi´ c, S., annotation approach for biomedical entity and relation recognition. Brain Inform. 3(3), 157–168 (2016). https://doi.org/10.1007/s40708-016-0036-4 19. Zhang, B., et al.: ELISA-EDL: a cross-lingual entity extraction, linking and localization system. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 41–45 (2018) 20. Zheng, J.G., et al.: Entity linking for biomedical literature. BMC Med. Inform. Decis. Mak. 15(1), S4 (2015)
A Hybrid Generative/Discriminative Model for Rapid Prototyping of Domain-Specific Named Entity Recognition

Suzushi Tomori1(B), Yugo Murawaki1, and Shinsuke Mori2

1 Graduate School of Informatics, Kyoto University, Kyoto, Japan
[email protected], [email protected]
2 Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan
[email protected]

Abstract. We propose PYHSCRF, a novel tagger for domain-specific named entity recognition that only requires a few seed terms, in addition to unannotated corpora, and thus permits the iterative and incremental design of named entity (NE) classes for new domains. The proposed model is a hybrid of a generative model named PYHSMM and a semi-Markov CRF-based discriminative model, which play complementary roles in generalizing seed terms and in distinguishing between NE chunks and non-NE words. It also allows a smooth transition to full-scale annotation because the discriminative model makes effective use of annotated data when available. Experiments involving two languages and three domains demonstrate that the proposed method outperforms baselines.

Keywords: Named entity recognition · Generative/Discriminative model · Natural Language Processing

1 Introduction
Named entity recognition (NER) is the task of extracting named entity (NE) chunks from texts and classifying them into predefined classes. It has a wide range of NLP applications such as information retrieval [1], relation extraction [2], and coreference resolution [3]. While the standard classes of NEs are PERSON, LOCATION, and ORGANIZATION among others, domain-specific NER with specialized classes has proven to be useful in downstream tasks [4]. A major challenge in developing a domain-specific NER system lies in the fact that a large amount of annotated data is needed to train high-performance systems, and even larger amounts are needed for neural models [5]. In many domains, however, domain-specific NE corpora are small in size or even nonexistent because manual corpus annotation is costly and time-consuming. What is worse, domain-specific NE classes cannot be designed without specialized knowledge of the target domain, and even with expert knowledge, a trial-and-error process is inevitable, especially in the early stage of development.
In this paper, we propose PYHSCRF, a novel NE tagger that facilitates rapid prototyping of domain-specific NER. All we need to run the tagger is a few seed terms per NE class, in addition to an unannotated target domain corpus and a general domain corpus. Even with minimal supervision, it yields reasonable performance, allowing us to go back-and-forth between different NE definitions. It also enables a smooth transition to full-scale annotation because it can straightforwardly incorporate labeled instances. Regarding the technical aspects, the proposed tagger is a hybrid of a generative model and a discriminative model. The generative model, called the Pitman-Yor hidden semi-Markov model (PYHSMM) [6], recognizes high-frequency word sequences as NE chunks and identifies their classes. The discriminative model, semi-Markov CRF (semiCRF) [7], initializes the learning process using the seed terms and generalizes to other NEs of the same classes. It also exploits labeled instances more powerfully when they are available. The two models are combined into one using a framework known as JESS-CM [8]. Generative and discriminative models have mutually complementary strengths. PYHSMM exploits frequency while semiCRF does not, at least explicitly. SemiCRF exploits contextual information more efficiently, but its high expressiveness is sometimes harmful. Because of this, it has difficulty in balancing between positive and negative examples. We treat the seed terms as positive examples and the general corpus as proxy data for negative examples. While semiCRF is too sensitive to use the general corpus as negative examples, PYHSMM utilizes it in a softer manner. We conducted extensive experiments on three domains in two languages and demonstrated that the proposed method outperformed baselines.
2 Related Work

2.1 General and Domain-Specific NER
NER is one of the fundamental tasks in NLP and has been applied not only to English but to a variety of languages such as Spanish, Dutch [9], and Japanese [10,11]. NER can be classified into general NER and domain-specific NER. Typical NE classes in general NER are PERSON, LOCATION, and ORGANIZATION. In domain-specific NER, special NE classes are defined to facilitate the development of downstream applications. For example, the GENIA corpus for the biomedical domain has five NE classes, such as DNA and PROTEIN, to organize research papers [12] and to extract semantic relations [13]. Disease corpora [14– 16], which are annotated with the disease class and the treatment class, are used to solve disease-treatment relation extraction. However, domain-specific NER is not limited to only the biomedical domain; it also covers recipes [17] and game commentaries [18], to name a few examples. In addition, recognition of brand names and product names [19], recognition of the names of tasks, materials, and processes in science texts [20] can be seen as domain-specific NER.
2.2 Types of Supervision in NER
The standard approach to NER is supervised learning. Early studies used the hidden Markov model [21], the maximum entropy model [22], and support vector machines [23] before conditional random fields (CRFs) [24,25] dominated. A CRF can be built on top of neural network components such as a bidirectional LSTM and convolutional neural networks [26]. Although modern high-performance NER systems require a large amount of annotated data in the form of labeled training examples, annotated corpora for domain-specific NER are usually of limited size because building NE corpora is costly and time-consuming. Tang et al. [5] proposed a transfer learning model for domain-specific NER with a medium-sized annotated corpus (about 6,000 sentences). Several methods have been proposed to get around costly annotation; they can be classified into rule-based, heuristic feature-based, and weakly supervised methods. Rau [27] proposed a system to extract company names, while Sekine and Nobata [28] proposed a rule-based NE tagger. Settles [29] proposed a CRF model with hand-crafted features for biomedical NER. These methods are time-consuming to develop and need specialized knowledge. Collins and Singer [30] proposed bootstrap methods for NE classification that exploited a small amount of seed data to classify NE chunks into typical NE classes. Nadeau et al. [31] proposed a two-step NER system in which NE extraction followed NE classification. Since their seed-based NE list generation from Web pages exploited HTML tree structures, it cannot be applied to plain text. Zhang and Elhadad [32] proposed another two-step NER method for the biomedical domain, which first uses a noun phrase chunker to extract NE chunks and then classifies them using TF-IDF and biomedical terminology. Shang et al. [33] and Yang et al. [34] proposed weakly supervised methods that use domain-specific terminologies and an unannotated target domain corpus. Shang et al. [33] automatically build a partially labeled corpus and then train a model on it. Yang et al. [34] also use an automatically labeled corpus and then select sentences to eliminate incompletely and noisily labeled sentences; the selector is trained on a human-labeled corpus. We also use an automatically labeled corpus, but there is a major difference: we focus on rapid prototyping of domain-specific NER that only requires a few seed terms, because domain-specific terminologies are not necessarily available in other domains.

2.3 Unsupervised Word Segmentation and Part-of-Speech Induction
The model proposed in this paper has a close connection to unsupervised word segmentation and part-of-speech (POS) induction [6]. A key difference is that, while they use characters as the unit for the input sequence, we utilize word sequences. Uchiumi et al. [6] can be seen as an extension to Mochihashi et al. [35], who focused on unsupervised word segmentation. They proposed a nonparametric Bayesian n-gram language model based on Pitman-Yor processes. Given an unsegmented corpus, the model infers word segmentation using Gibbs sampling.
Uchiumi et al. [6] worked on the joint task of unsupervised word segmentation and POS induction. We employ their model, PYHSMM, for our task. However, instead of combining character sequences into words and assigning POS tags to them, we group word sequences into NE chunks and give NE classes to them. To efficiently exploit annotated data when available, Fujii et al. [36] extended Mochihashi et al. [35] by integrating the generative word segmentation model into a CRF-based discriminative model. Our model, PYHSCRF, is also a hybrid generative/discriminative model but there are two major differences. First, to extend the approach to NER, we combine PYHSMM with a semiCRF, not an n-gram model with a plain CRF. Second, since our goal is to facilitate rapid prototyping of domain-specific NER, we consider a much weaker type of supervision than fully annotated sentences: a few seed terms per NE class. This is challenging partly because seed terms can only be seen as implicit positive examples although most text fragments are outside of NE chunks (i.e., the O class). Our solution is to use a general domain corpus as implicit negative examples.
Fig. 1. The overall architecture of PYHSCRF for domain-specific NER. Here, the maximum length of NE chunks L = 2. F and Ac stand for FOOD and ACTION, respectively, while O indicates a word outside of any NE chunks.
Fig. 2. Partially labeled sentences. F stands for FOOD.
3 Proposed Method

3.1 Task Setting
NER is often formalized as a sequence labeling task. Given a word sequence x = (x_1, x_2, ..., x_N) ∈ X_l, our system outputs a label sequence y = (y_1, y_2, ..., y_N) ∈ Y_l, where y_i = (z_i, b_i, e_i) means that a chunk starting at the b_i-th word and ending at the e_i-th word belongs to class z_i. The special O class is assigned to any word that is not part of an NE (if z_i = O, then b_i = e_i). In the recipe domain, for example, the word sequence "Sprinkle cheese on the hot dog" contains an NE in the F (FOOD) class, "hot dog," which corresponds to y_5 = (F, 5, 6). Likewise, the third word "on" is mapped to y_3 = (O, 3, 3). We assume that we are given a few typical NEs per class (e.g., "olive oil" for the F class). Since choosing seed terms is by far less laborious than corpus annotation, our task setting allows us to design domain-specific NER in an exploratory manner. In addition to the seed terms, an unannotated target domain corpus X_u and an unannotated general domain corpus X_g are provided. The underlying assumption is that domain-specific NEs are observed characteristically in X_u; contrasting X_u with X_g helps distinguish NEs from the O class.

3.2 Model Overview
Figure 1 illustrates our approach. We use the seed terms as implicit positive examples: we first automatically build a partially labeled corpus ⟨X_l, Y_l⟩ from the seed terms. For example, if "olive oil" is selected as a seed term of class F, sentences in the target domain corpus X_u that contain the term are marked with the corresponding NE chunk and class, as in Fig. 2 (a small sketch of this step is given below). We train the semiCRF using the partially labeled corpus (Sect. 3.3). To recognize high-frequency word sequences as NE chunks, we apply PYHSMM to the unannotated corpus X_u (Sect. 3.4). The general domain corpus X_g is also provided to the generative model as proxy data for the O class, with the assumption that domain-specific NE chunks appear more frequently in the target domain corpus than in the general domain corpus. PYHSMM is thus expected to extract high-frequency word sequences in the target domain as NE chunks. Note that we do not train the semiCRF on the implicit negative examples because the discriminative model is too sensitive to the noise inherent in them. We combine the discriminative and generative models using JESS-CM [8] (Sect. 3.5).
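A minimal sketch of this seed-based partial labeling, assuming tokenized sentences and lower-cased seed terms (all names are illustrative):

```python
def partially_label(sentences, seeds):
    """Mark occurrences of seed terms; all other positions stay unlabeled.
    `seeds` maps an NE class to a list of seed terms, e.g. {"F": ["olive oil"]}.
    Returns, per matching sentence, a list of (start, end, cls) chunks."""
    labeled = []
    for tokens in sentences:
        chunks = []
        for cls, terms in seeds.items():
            for term in terms:
                t = term.split()
                for i in range(len(tokens) - len(t) + 1):
                    if [w.lower() for w in tokens[i:i + len(t)]] == t:
                        chunks.append((i + 1, i + len(t), cls))  # 1-based indices
        if chunks:
            labeled.append((tokens, chunks))
    return labeled

sents = [["Fry", "the", "onion", "in", "olive", "oil"]]
print(partially_label(sents, {"F": ["olive oil", "onion"]}))
```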
3.3 Semi-Markov CRF with a Partially Labeled Corpus
We use semiCRF as the discriminative model, although Markov CRF is more often used as an NE tagger. Markov CRF employs the BIO tagging scheme or variants of it to identify NE chunks. Since each NE class is divided into multiple tags (e.g., B-PERSON and I-PERSON), it is unsuitable for our task, which is characterized by the scarcity of supervision. For this reason, we chose semiCRF.
SemiCRF is a log-linear model that directly infers NE chunks and classes. The probability of y given x is defined as

  p(y|x, Λ) = exp(Λ · F(x, y)) / Z(x),   Z(x) = Σ_{y′ ∈ Y} exp(Λ · F(x, y′)),

where F(x, y) = (f_1, f_2, ..., f_M) are features, Λ = (λ_1, λ_2, ..., λ_M) are the corresponding weights, and Y is the set of all possible label sequences. The feature function can be expressed as the combination of F(b_i, e_i, z_i, z_{i−1}) in relation to x_i, y_i, and y_{i−1}. The training process is different from standard supervised learning because we use a partially labeled corpus ⟨X_l, Y_l⟩. Following Tsuboi et al. [37], we marginalize the probabilities of words that are not labeled. Instead of using the full log likelihood

  LL = F(x, y) − Σ_{y′ ∈ Y} p(y′|x) F(x, y′)

as the objective function, we use the following marginalized log likelihood

  MLL = Σ_{y′ ∈ Y_p} p(y′|Y_p, x) F(x, y′) − Σ_{y′ ∈ Y} p(y′|x) F(x, y′),

where Y_p is the set of all possible label sequences in which the labeled chunks are fixed.

3.4 PYHSMM
The generative model, PYHSMM, was originally proposed for joint unsupervised word segmentation and POS induction. While it was used to group character sequences into words and assign POS tags to them, here we extend it to word-level modeling. In our case, PYHSMM consists of 1) transitions between NE classes and 2) the emission of each NE chunk x_i = x_{b_i}, ..., x_{e_i} from its class z_i. As a semi-Markov model, it employs n-grams not only for calculating transition probabilities but also for computing emission probabilities. The building blocks of PYHSMM are hierarchical Pitman-Yor processes, which can be seen as a back-off n-gram model. To calculate the transition and emission probabilities, we need to keep track of latent table assignments [38]. For notational brevity, let Θ be the set of the model's parameters. The joint probability of the i-th chunk x_i and its class z_i conditioned on history h_{xz} is given by

  p(x_i, z_i | h_{xz}; Θ) = p(x_i | h^n_x, z_i; Θ) p(z_i | h^n_z; Θ),

where h^n_x = x_{i−1}, x_{i−2}, ..., x_{i−(n−1)} and h^n_z = z_{i−1}, z_{i−2}, ..., z_{i−(n−1)}. p(x_i | h^n_x, z_i) is the chunk n-gram probability given its class z_i, and p(z_i | h^n_z) is the class n-gram probability. The posterior predictive probability of the i-th chunk is

  p(x_i | h^n_x, z_i) = (freq(x_i | h^n_x) − d · t_{x_i, h^n_x}) / (θ + freq(h^n_x)) + ((θ + d · t_{h^n_x}) / (θ + freq(h^n_x))) · p(x_i | h^{n−1}_x, z_i),   (1)

where h^{n−1}_x is the shorter (n−1)-gram history, θ and d are hyperparameters, freq(x_i | h^n_x) is the n-gram frequency, t_{h^n_x, x_i} is a count related to table assignments, freq(h^n_x) = Σ_{x_i} freq(x_i | h^n_x), and t_{h^n_x} = Σ_{x_i} t_{h^n_x, x_i}. The class n-gram probability is computed in a similar manner. Gibbs sampling is used to infer PYHSMM's parameters [35]. During training, we randomly select a sentence and remove it from the parameters (e.g., we subtract n-gram counts from freq(x_i | h^n_x)). We sample a new label sequence using forward filtering-backward sampling. We then update the model parameters by adding the corresponding n-gram counts. We repeat the process until convergence. We now explain the sampling procedure in detail, considering the bigram case for simplicity. The forward score α[t][k][z] is the probability that a subsequence (x_1, x_2, ..., x_t) of a word sequence x = (x_1, x_2, ..., x_N) is generated with its last k words being a chunk (x^t_{t−k+1} = x_{t−k+1}, ..., x_t) which is generated from class z. Let L be the maximum length of a chunk and Z the number of classes. α[t][k][z] is computed recursively as follows:

  α[t][k][z] = Σ_{j=1}^{L} Σ_{r=1}^{Z} p(x^t_{t−k+1} | x^{t−k}_{t−k−j+1}, z) p(z|r) α[t−k][j][r].   (2)
The forward scores are calculated from the beginning to the end of the sentence; chunks and classes are then sampled in the reverse direction using the forward scores. There is always a special token EOS with class z_EOS at the end of the sequence. The final chunk and its class are sampled with a score proportional to p(EOS | w^N_{N−k}, z_EOS) · p(z_EOS | z) · α[N][k][z]. The second-to-last chunk is sampled similarly using the score of the last chunk, and we continue this process until we reach the beginning of the sequence. To update the parameters in Eq. (1), we add n-gram counts to freq(x_i | h^n_x) and freq(h^n_x), and also update the table assignment count t_{h^n_x, x_i}. Parameters related to the class n-gram model are updated in the same manner. Recall that we use the general domain corpus X_g to learn the O class: we assume that X_g consists entirely of single-word chunks of the O class. Although the general domain corpus might contain some domain-specific NE chunks, most words indeed belong to the O class. During training, we add and remove sentences in X_g without performing sampling; these sentences can thus be seen as implicit negative examples.
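A minimal sketch of the forward pass of Eq. (2) and of sampling the final chunk is given below; the Pitman-Yor chunk and class probabilities are replaced by caller-supplied dummy functions, so this only illustrates the control flow, not the actual model.

```python
import random

def forward_filtering(words, L, classes, p_emit, p_trans):
    """Forward pass of Eq. (2), bigram case.  alpha[(t, k, z)] is the probability
    that the prefix words[:t] ends in a length-k chunk of class z.
    p_emit(chunk, prev_chunk, z) and p_trans(z, r) stand in for the PYHSMM
    chunk and class n-gram probabilities."""
    N = len(words)
    alpha = {(0, 0, None): 1.0}                         # sentinel for the empty prefix
    for t in range(1, N + 1):
        for k in range(1, min(L, t) + 1):
            chunk = tuple(words[t - k:t])
            for z in classes:
                total = 0.0
                for (t2, j, r), a in list(alpha.items()):   # states ending at t - k
                    if t2 == t - k:
                        prev = tuple(words[t - k - j:t - k]) if j else ("BOS",)
                        total += p_emit(chunk, prev, z) * p_trans(z, r) * a
                alpha[(t, k, z)] = total
    return alpha

def sample_final_chunk(alpha, N, rng=random.Random(0)):
    """Backward sampling of the last chunk, proportional to alpha[N][k][z]."""
    states = [((k, z), a) for (t, k, z), a in alpha.items() if t == N]
    total = sum(a for _, a in states)
    x = rng.uniform(0, total)
    for (k, z), a in states:
        x -= a
        if x <= 0:
            return k, z
    return states[-1][0]

# Toy run with uniform dummy probabilities (illustration only).
words = "add water to the pot".split()
alpha = forward_filtering(words, L=2, classes=["F", "T", "Ac", "O"],
                          p_emit=lambda c, p, z: 0.1, p_trans=lambda z, r: 0.25)
print(sample_final_chunk(alpha, len(words)))
```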
3.5 PYHSCRF
PYHSCRF combines the discriminative semiCRF with the generative PYHSMM in a similar manner to the model presented in Fujii et al. [36]. The probability of label sequence y given word sequence x is written as

  p(y|x) ∝ p_DISC(y|x; Λ) p_GEN(y, x; Θ)^{λ_0},

where p_DISC and p_GEN are the discriminative and generative models, respectively, and Λ and Θ are their corresponding parameters. When p_DISC is a log-linear model like semiCRF,

  p_DISC(y|x) ∝ exp(Σ_{m=1}^{M} λ_m f_m(y, x)),

p(y|x) can itself be expressed as a log-linear model:

  p(y|x) ∝ exp(λ_0 log(p_GEN(y, x)) + Σ_{m=1}^{M} λ_m f_m(y, x)) = exp(Λ* · F*(y, x)),   (3)

where Λ* = (λ_0, λ_1, λ_2, ..., λ_M) and F*(y, x) = (log(p_GEN), f_1, f_2, ..., f_M).
In other words, PYHSCRF is another semiCRF in which PYHSMM is added to the original semiCRF as a feature.

Algorithm 1. Learning algorithm for PYHSCRF. ⟨X_l, Y_l⟩ is a partially labeled corpus and X_u is an unannotated corpus in the target domain. X_g is the general domain corpus used as implicit negative examples.

  for epoch = 1, 2, ..., E do
      for x in randperm(X_u, X_g) do
          if epoch > 1 then
              Remove parameters of y from Θ
          end if
          if x ∈ X_u then
              Sample y according to p(y|x; Λ*, Θ)
          else
              Determine y according to X_g
          end if
          Add parameters of y to Θ
      end for
      Optimize Λ* on ⟨X_l, Y_l⟩
  end for
The objective function is p(Y_l | X_l; Λ*) p(X_u, X_g; Θ).
Algorithm 1 shows our training algorithm. During training, PYHSCRF repeats the following two steps until convergence: 1. fixing Θ and optimizing the semiCRF weights Λ* on ⟨X_l, Y_l⟩; 2. fixing Λ* and optimizing the PYHSMM parameters Θ on X_u and X_g. When updating Λ*, we use the marginalized log likelihood of the partially labeled data. When updating Θ, we sample chunks and their classes from unlabeled sentences in the same manner as in PYHSMM. In PYHSCRF, a modification of Eq. (2) is needed because the forward score α[t][k][z] incorporates the semiCRF score:

  α[t][k][z] = Σ_{j=1}^{L} Σ_{r=1}^{Z} exp(λ_0 log(p(x^t_{t−k+1} | x^{t−k}_{t−k−j+1}, z) p(z|r)) + Λ · F(t−k+1, t, z, r)) α[t−k][j][r],

where F(t−k+1, t, z, r) is a feature function in relation to the chunk candidate x^t_{t−k+1}, its class z, and the class r of the preceding chunk candidate x^{t−k}_{t−k−j+1}. A minimal sketch of this training loop is given below.
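The sketch follows the control flow of Algorithm 1; the generative and discriminative components are represented by stub classes whose methods (remove, add, sample, fit_marginalized) are illustrative stand-ins, not a real API.

```python
import random

class StubPYHSMM:
    """Stand-in for the generative component; real code would maintain
    hierarchical Pitman-Yor n-gram counts and sample with Eq. (2)."""
    def remove(self, x): pass
    def add(self, x, y): pass
    def sample(self, x, scorer=None):
        # Dummy: return one single-word O-class chunk per word.
        return [("O", i, i) for i in range(1, len(x) + 1)]

class StubSemiCRF:
    """Stand-in for the discriminative component (semi-Markov CRF)."""
    def fit_marginalized(self, labeled): pass

def train_pyhscrf(labeled, target_unlabeled, general, semicrf, pyhsmm, epochs=3):
    """Control flow of Algorithm 1: alternate Gibbs updates of the generative
    parameters with optimization of the semiCRF weights on the partial labels."""
    for epoch in range(1, epochs + 1):
        data = [(x, True) for x in target_unlabeled] + [(x, False) for x in general]
        random.shuffle(data)
        for x, from_target in data:
            if epoch > 1:
                pyhsmm.remove(x)                       # subtract its n-gram counts
            if from_target:
                y = pyhsmm.sample(x, scorer=semicrf)   # sample chunks and classes
            else:
                y = [("O", i, i) for i in range(1, len(x) + 1)]  # general corpus = O class
            pyhsmm.add(x, y)                           # add the new counts
        semicrf.fit_marginalized(labeled)              # optimize Lambda* on <Xl, Yl>
    return semicrf, pyhsmm

train_pyhscrf(labeled=[], target_unlabeled=[["Sprinkle", "cheese"]],
              general=[["the", "cat"]], semicrf=StubSemiCRF(), pyhsmm=StubPYHSMM())
```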
4 Experimentals
4.1 Data
Table 1 summarizes the specifications of three domain-specific NER datasets used in our experiments: the GENIA corpus, the recipe corpus, and the game commentary corpus.

Table 1. Statistics of the datasets for the experiments.

Language   Corpus (#NE classes)                      Split   #Sentences   #Words      #NE instances
English    Target: GENIA corpus (5)                  Train   10,000       264,743     -
                                                     Test    3,856        101,039     90,309
           General: Brown (-)                        -       50,000       1,039,886   -
Japanese   Target: Recipe corpus (8)                 Train   10,000       244,648     -
                                                     Test    148          2,667       869
           Target: Game commentary corpus (21)       Train   10,000       398,947     -
                                                     Test    491          7,161       2,365
           General: BCCWJ (-)                        -       40,000       936,498     -
           General: Oral communication corpus (-)    -       10,000       124,031     -
We used the GENIA corpus, together with its test script from the BioNLP/NLPBA 2004 shared task [39], as the English corpus for the biomedical domain. It contains five biological NE classes such as DNA and PROTEIN, in addition to the O class. The corresponding general domain corpus was the Brown corpus [40], which consists of one million words and ranges over 15 domains. The recipe corpus [17] and the game commentary corpus [18] are both in Japanese. The recipe corpus consists of procedural texts from cooking recipes. The game commentary corpus consists of commentaries on professional matches of Japanese chess (shogi) given by professional players and writers. We used gold-standard word segmentation for both corpora. As NEs, eight classes such as FOOD, TOOL, and ACTION were defined for the recipe corpus, while the game commentary corpus was annotated with 21 classes such as PERSON, STRATEGY, and ACTION. Note that NE chunks were not necessarily noun phrases; for example, most NE chunks labeled with ACTION in the two corpora were verbal phrases. The combination of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) [41] and the oral communication corpus [42] was used as the general domain corpus. We automatically segmented sentences in these corpora using KyTea (see Footnote 1) [43]; the segmentation accuracy was higher than 98%.

4.2 Training Settings
Although PYHSMM can theoretically handle arbitrarily long n-grams, we limited our scope to bigrams to reduce computational costs. To initialize PYHSMM's parameters Θ, we treated each word in a given sentence as an O-class chunk. Just as Uchiumi et al. [6] modeled expected word length with negative binomial distributions for the tasks of Japanese word segmentation and POS induction, chunk length was drawn from a negative binomial distribution; Uchiumi et al. [6] set different parameters for character types such as hiragana and kanji, but we used a single parameter. We constrained the maximum chunk length L to 6 for computational efficiency. We used normal priors of truncated N(μ, σ²) to initialize PYHSMM's weight λ_0 and the semiCRF weights λ_1, λ_2, ..., λ_M, setting μ = 1.0 and σ = 1.0. We fixed the L2 regularization parameter C of semiCRF to 1.0.

Table 2. Feature templates for semiCRF. chunk_i consists of the word n-gram w^{e_i}_{b_i} = w_{b_i} w_{b_i+1} ... w_{e_i}. w_{i−1} and w_{i+1} are the preceding word and the following word, respectively. BoW is the set of words (bag-of-words) in chunk_i.

Semi-Markov CRF features:
  chunk_i (w_{b_i} w_{b_i+1} ... w_{e_i})
  w_{i−2}, w_{i−1}, w_{i+1}, w_{i+2}
  BoW(w_{b_i}, w_{b_i+1}, ..., w_{e_i})
Footnote 1: http://www.phontron.com/kytea/ (accessed on March 15, 2017).
We used stochastic gradient descent for the optimization of semiCRF, with the number of iterations J set to 300. Table 2 shows the feature templates for semiCRF. Each target domain corpus was divided into a training set and a test set. For each NE class, the 2 most frequent chunks according to the training set were selected as seed terms; in the GENIA corpus, for example, we automatically chose "IL-2" and "LTR" as seed terms for the DNA class. A small sketch of the chunk features of Table 2 is given below.
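The sketch assumes 0-based token indices and interprets the context words relative to the chunk boundaries (an assumption; the exact indexing convention is not given in the table).

```python
def chunk_features(words, b, e):
    """Features of Table 2 for a chunk candidate words[b:e+1] (0-based, inclusive):
    the chunk string, two words of context on each side, and its bag of words."""
    feats = {"chunk=" + " ".join(words[b:e + 1])}
    for offset, idx in (("-2", b - 2), ("-1", b - 1), ("+1", e + 1), ("+2", e + 2)):
        if 0 <= idx < len(words):
            feats.add("w" + offset + "=" + words[idx])
    feats.update("bow=" + w for w in words[b:e + 1])
    return feats

print(chunk_features("Sprinkle cheese on the hot dog".split(), 4, 5))
```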
4.3 Baselines
In biomedical NER, the proposed model was compared with two baselines. MetaMap is based on a dictionary-matching approach with biomedical terminology [44]. The other baseline is the weakly supervised biomedical NER system proposed by Zhang and Elhadad [32]. To our knowledge, there was no weakly supervised domain-specific NER tool for the recipe and game commentary domains, so for these domains we created a baseline model as follows: we first used a Japanese term extractor (see Footnote 2) to extract NE chunks and then classified them with the seed terms using a Bayesian HMM originally proposed for unsupervised POS induction [45]. Note that only noun phrases were extracted by the term extractor.

4.4 Results and Discussion
Table 3 compares the proposed method with the baselines in terms of precision, recall, and F-measure. We can see that PYHSCRF consistently outperformed the baselines. Taking a closer look at the results, we found that the model successfully inferred NE classes from their contexts. For example, the NE chunk "水" (water) can be both FOOD and TOOL in the recipe domain. It was correctly identified as TOOL when it was part of the phrase "水で洗い流す" (wash with water), while "水を鍋に加える" (add water in the pot) was identified as the FOOD class.

Table 3. Precision, recall, and F-measure of various systems.
Target   Method                                    Precision   Recall   F-measure
GENIA    MetaMap [44]                              N/A         N/A      7.70
         Weakly supervised biomedical NER [32]     15.40       15.00    15.20
         PYHSCRF                                   19.20       23.50    21.13
Recipe   Baseline                                  49.78       25.89    34.07
         PYHSCRF                                   38.45       42.58    40.41
Game     Baseline                                  52.75       29.18    37.57
         PYHSCRF                                   75.57       35.05    47.89
Footnote 2: http://gensen.dl.itc.u-tokyo.ac.jp/termextract.html (accessed on March 15, 2017).
Fig. 3. Learning curve for recipe NER. The horizontal axis shows number of seed terms in each NE class.
Fig. 4. Learning curve for recipe NER. The horizontal axis shows number of general domain sentences.
We conducted a series of additional experiments. First, we changed the number of seed terms to examine their effect: Fig. 3 shows F-measure as a function of the number of seed terms per NE class in the recipe domain, and the F-measure increased almost monotonically as more seed terms became available. A major advantage of PYHSCRF over other seed-based weakly supervised methods for NER [31,32] is that it can straightforwardly exploit labeled instances. To see this, we trained PYHSCRF with fully annotated data (about 2,000 sentences) in the recipe domain and compared it with a vanilla semiCRF.
We found that they achieve competitive performance (the F-measure was 90.01 for PYHSCRF and 89.98 for vanilla semiCRF); in this setting, PYHSCRF ended up simply ignoring PYHSMM (−0.1 < λ_0 < 0.0). Next, we reduced the size of the general domain corpus. Figure 4 shows how F-measure changes with the size of the general domain corpus in recipe NER; we can confirm that PYHSCRF cannot be trained without the general domain corpus because it is a vital source for distinguishing NE chunks from the O class. Finally, we evaluated NE classification performance. Collins and Singer [30] focused on weakly supervised NE classification, in which given NE chunks were classified into three classes (PERSON, LOCATION, and ORGANIZATION) by bootstrapping with seven seed terms and hand-crafted features. We tested PYHSCRF on the CoNLL 2003 dataset [46] in the same setting, without using a general corpus because the NE chunks are given a priori. PYHSCRF achieved competitive performance (over 93% accuracy, compared to over 91% accuracy for Collins and Singer [30]), although the use of different datasets makes a direct comparison difficult. The semiCRF feature templates in our experiments are simple; though not explored here, the accuracies can probably be improved with a wider window size or richer feature sets such as character type and POS. Word embeddings [47,48], character embeddings [49], and n-gram embeddings [50] are other possible improvements, because domain-specific NE chunks exhibit spelling variants. For example, in the Japanese recipe corpus, the NE chunk "玉ねぎ" (onion, kanji followed by hiragana) can also be written as "たまねぎ" (hiragana), "タマネギ" (katakana), and "玉葱" (kanji).
5 Conclusion
We proposed PYHSCRF, a nonparametric Bayesian method for distantly supervised NER in specialized domains. PYHSCRF is useful for rapid prototyping of domain-specific NER because it does not need texts annotated with NE tags and boundaries; we only need a few seed terms as typical NEs for each NE class, an unannotated corpus in the target domain, and a general domain corpus. PYHSCRF incorporates a word-level PYHSMM and a semiCRF, and in addition we use implicit negative examples from the general domain corpus to train the O class. In our experiments, we used a biomedical corpus in English, and a recipe corpus and a game commentary corpus in Japanese, as examples. We conducted domain-specific NER experiments and showed that PYHSCRF achieved higher accuracy than the baselines, so a domain-specific NE recognizer can be built at much lower cost. Additionally, PYHSCRF can easily be applied to other domains for domain-specific NER and is useful for low-resource languages and domains. In the future, we would like to investigate the effectiveness of the proposed method for downstream tasks of domain-specific NER such as relation extraction and knowledge base population.
Acknowledgement. In this paper, we used recipe data provided by Cookpad and the National Institute of Informatics.
References 1. Thompson, P., Dozier, C.C.: Name searching and information retrieval. CoRR cmp-lg/9706017 (1997) 2. Feldman, R., Rosenfeld, B.: Boosting unsupervised relation extraction by using NER. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 473–481 (2006) 3. Lee, H., Recasens, M., Chang, A., Surdeanu, M., Jurafsky, D.: Joint entity and event coreference resolution across documents. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 489–500 (2012) 4. Shahab, E.: A short survey of biomedical relation extraction techniques. CoRR abs/1707.05850 (2017) 5. Tang, S., Zhang, N., Zhang, J., Wu, F., Zhuang, Y.: NITE: a neural inductive teaching framework for domain specific NER. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2642–2647 (2017) 6. Uchiumi, K., Tsukahara, H., Mochihashi, D.: Inducing word and part-of-speech with Pitman-Yor hidden semi-Markov models. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1774–1782 (2015) 7. Sarawagi, S., Cohen, W.W.: Semi-Markov conditional random fields for information extraction. Adv. Neural. Inf. Process. Syst. 17, 1185–1192 (2005) 8. Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data, pp. 665–673. In: Proceedings of ACL 2008: HLT. Association for Computational Linguistics (2008) 9. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: languageindependent named entity recognition. In: Proceedings of the 6th Conference on Natural Language Learning, vol. 31, pp. 1–4 (2002) 10. Grishman, R., Sundheim, B.: Message understanding conference-6: a brief history. In: COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, vol. 1 (1996) 11. Sekine, S., Isahara, H.: IREX: IR and IE evaluation project in Japanese. In: Proceedings of International Conference on Language Resources and Evaluation (2000) 12. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus: a semantically annotated corpus for bio-textmining. Bioinformatics 19(Suppl. 1), i180-2 (2003) ˇ 13. Ciaramita, M., Gangemi, A., Ratsch, E., Saric, J., Rojas, I.: Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 659–664 (2005) ¨ South, B.R., Shen, S., DuVall, S.L.: i2b2/VA challenge on concepts, 14. Uzuner, O., assertions, and relations in clinical text. J. Am. Med. Inform. Assoc. 18(2011), 552–556 (2010) 15. Do˘ gan, R.I., Lu, Z.: An improved corpus of disease mentions in PubMed citations. In: BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 91–99 (2012)
16. Do˘ gan, R.I., Leaman, R., Lu, Z.: NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform. 47, 1–10 (2014) 17. Mori, S., Maeta, H., Yamakata, Y., Sasada, T.: Flow graph corpus from recipe texts. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2370–2377 (2014) 18. Mori, S., Richardson, J., Ushiku, A., Sasada, T., Kameko, H., Tsuruoka, Y.: A Japanese chess commentary corpus. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 1415–1420 (2016) 19. Bick, E.: A named entity recognizer for Danish. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004) (2004) 20. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: Semeval 2017 task 10: scienceie - extracting keyphrases and relations from scientific publications. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval2017), pp. 546–555 (2017) 21. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble: a high-performance learning name-finder. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194–201 (1997) 22. Borthwick, A.E.: A maximum entropy approach to named entity recognition. Ph.D. thesis, AAI9945252 (1999) 23. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 8–15 (2003) 24. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, pp. 282– 289 (2001) 25. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4, pp. 188–191 (2003) 26. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNsCRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074 (2016) 27. Rau, L.F.: Extracting company names from text. In: Proceedings of the Seventh Conference on Artificial Intelligence Applications CAIA-91 (Volume II: Visuals), pp. 189–194 (1991) 28. Sekine, S., Nobata, C.: Definition, dictionaries and tagger for extended named entity hierarchy. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004) (2004) 29. Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 33–38 (2004) 30. Collins, M., Singer, Y.: Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora (1999) 31. Nadeau, D., Turney, P.D., Matwin, S.: Unsupervised named-entity recognition: generating gazetteers and resolving ambiguity. In: Conference of the Canadian Society for Computational Studies of Intelligence, pp. 266–277 (2006)
32. Zhang, S., Elhadad, N.: Unsupervised biomedical named entity recognition: experiments with clinical and biological texts. J. Biomed. Inform. 46, 1088–1098 (2013) 33. Shang, J., Liu, L., Gu, X., Ren, X., Ren, T., Han, J.: Learning named entity tagger using domain-specific dictionary. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2054–2064. Association for Computational Linguistics (2018) 34. Yang, Y., Chen, W., Li, Z., He, Z., Zhang, M.: Distantly supervised NER with partial annotation learning and reinforcement learning. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2159–2169. Association for Computational Linguistics (2018) 35. Mochihashi, D., Yamada, T., Ueda, N.: Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 100–108 (2009) 36. Fujii, R., Domoto, R., Mochihashi, D.: Nonparametric Bayesian semi-supervised word segmentation. Trans. Assoc. Comput. Linguist. 5, 179–189 (2017) 37. Tsuboi, Y., Kashima, H., Mori, S., Oda, H., Matsumoto, Y.: Training conditional random fields using incomplete annotations. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 897–904 (2008) 38. Teh, Y.W.: A hierarchical Bayesian language model based on Pitman-Yor processes. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 985–992 (2006) 39. Kim, J.D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the bioentity recognition task at JNLPBA. In: Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA 2004), pp. 70–75 (2004) 40. Francis, W.N., Kucera, H.: Brown corpus manual. Brown University, vol. 2 (1979) 41. Maekawa, K., et al.: Balanced corpus of contemporary written Japanese. Lang. Resour. Eval. 48, 345–371 (2014) 42. Keene, D., Hatori, H., Yamada, H., Irabu, S.: Japanese-English Sentence Equivalents. Electronic book edn. Asahi Press (1992) 43. Neubig, G., Nakata, Y., Mori, S.: Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 529–533 (2011) 44. Aronson, A.R.: Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17 (2001) 45. Goldwater, S., Griffiths, T.: A fully Bayesian approach to unsupervised part-ofspeech tagging. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 744–751 (2007) 46. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. CoNLL 2003, pp. 142–147 (2003) 47. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 48. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Adv. Neural. Inf. Process. Syst. 26, 3111–3119 (2013)
A Hybrid Generative/Discriminative Model for Rapid Prototyping
77
49. Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: Charagram: embedding words and sentences via character n-grams. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1504–1515 (2016) 50. Zhao, Z., Liu, T., Li, S., Li, B., Du, X.: Ngram2vec: learning improved word representations from ngram co-occurrence statistics. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 244–253 (2017)
Semantics and Text Similarity
Spectral Text Similarity Measures

Tim vor der Brück and Marc Pouly

School of Computer Science and Information Technology, Lucerne University of Applied Sciences and Arts, Lucerne, Switzerland
{tim.vorderbrueck,marc.pouly}@hslu.ch

Abstract. Estimating semantic similarity between texts is of vital importance in many areas of natural language processing like information retrieval, question answering, text reuse, or plagiarism detection. Prevalent semantic similarity estimates based on word embeddings are noise sensitive. Thus, small individual term similarities can have in aggregate a considerable influence on the total estimation value. In contrast, the methods proposed here exploit the spectrum of the product of embedding matrices, which leads to increased robustness when compared with conventional methods. We apply these estimates to two tasks: the assignment of people to the best matching marketing target group, and finding the correct match between sentences belonging to two independent translations of the same novel. The evaluation revealed that our proposed method based on the spectral norm could increase the accuracy compared to several baseline methods in both scenarios.

Keywords: Text similarity · Similarity measures · Spectral radius

1 Introduction
Estimating semantic document similarity is of vital importance in a lot of different areas, like plagiarism detection, information retrieval, or text summarization. One drawback of current state-of-the-art similarity estimates based on word embeddings is that small term similarities can sum up to a considerable amount and make these estimates vulnerable to noise in the data. Therefore, we propose two estimates that are based on the spectrum of the product F of embedding matrices belonging to the two documents to compare. In particular, we propose the spectral radius and the spectral norm of F, where the first denotes F's largest absolute eigenvalue and the second its largest singular value. Eigenvalue and singular value oriented methods for dimensionality reduction aiming to reduce noise in the data have a long tradition in natural language processing. For instance, principal component analysis is based on eigenvalues and can be used to increase the quality of word embeddings [8]. In contrast, Latent Semantic Analysis [11], a technique known from information retrieval to improve search results in term-document matrices, focuses on largest singular values. Furthermore, we investigate several properties of our proposed measures that are crucial for qualifying as proper similarity estimates, while considering both unsupervised and supervised learning.
Finally, we applied both estimates to two natural language processing scenarios. In the first scenario, we distribute participants of an online contest into several target groups by exploiting short text snippets they were asked to provide. In the second scenario, we aim to find the correct matching between sentences originating from two independent translations of a novel by Edgar Allan Poe. The evaluation revealed that our novel estimators outperformed several baseline methods for both scenarios.

The remainder of the paper is organized as follows. In the next section, we look into several state-of-the-art methods for estimating semantic similarity. Section 3 reviews several concepts that are vital for the remainder of the paper and build the foundation of our theoretical results. In Sect. 4, we describe in detail how the spectral radius can be employed for estimating semantic similarity. Some drawbacks and shortcomings of such an approach, as well as an alternative method that very elegantly solves all of these issues by exploiting the spectral norm, are discussed in Sect. 5. The two application scenarios for our proposed semantic similarity estimates are given in Sect. 6. Section 7 describes the conducted evaluation, in which we compare our approach with several baseline methods. The results of the evaluation are discussed in Sect. 8. So far, we covered only unsupervised learning. In Sect. 9, we investigate how our proposed estimates can be employed in a supervised setting. Finally, this paper concludes with Sect. 10, which summarizes the obtained results.
2 Related Work
Until recently, similarity estimates were predominantly based either on ontologies [4] or on typical information retrieval techniques like Latent Semantic Analysis. In the last couple of years, however, so-called word and sentence embeddings became state-of-the-art. The prevalent approach to document similarity estimation based on word embeddings consists of measuring similarity between vector representations of the two documents derived as follows:

1. The word embeddings (often weighted by the tf-idf coefficients of the associated words [3]) are looked up in a hashtable for all the words in the two documents to compare. These embeddings are determined beforehand on a very large corpus, typically using either the skip gram or the continuous bag of words variant of the Word2Vec model [15]. The skip gram method aims to predict the textual surroundings of a given word by means of an artificial neural network. The influential weights of the one-hot-encoded input word to the nodes of the hidden layer constitute the embedding vector. For the so-called continuous bag of words method, it is just the opposite, i.e., the center word is predicted by the words in its surrounding.
2. The centroid over all word embeddings belonging to the same document is calculated to obtain its vector representation.

Alternatives to Word2Vec are GloVe [17], which is based on aggregated global word co-occurrence statistics, and the Explicit Semantic Analysis (or shortly
ESA) [6], in which each word is represented by the column vector in the tf-idf matrix over Wikipedia. The idea of Word2Vec can be transferred to the level of sentences as well. In particular, the so-called Skip Thought Vector (STV) model [10] derives a vector representation of the current sentence by predicting the surrounding sentences. If vector representations of the two documents to compare were successfully established, a similarity estimate can be obtained by applying the cosine measure to the two vectors.

[18] propose an alternative approach for ESA word embeddings that establishes a bipartite graph consisting of the best matching vector components by solving a linear optimization problem. The similarity estimate for the documents is then given by the global optimum of the objective function. However, this method is only useful for sparse vector representations. In case of dense vectors, [14] suggested applying the Frobenius kernel to the embedding matrices, which contain the embedding vectors for all document components (usually either sentences or words, cf. also [9]). However, crucial limitations are that the Frobenius kernel is only applicable if the number of words (sentences, respectively) in the compared documents coincides and that a word from the first document is only compared with its counterpart from the second document. Thus, an optimal matching has to be established beforehand. In contrast, the approach presented here applies to arbitrary embedding matrices. Since it compares all words of the two documents with each other, there is also no need for any matching method. Before going more into detail, we want to review some concepts that are crucial for the remainder of this paper.
3 Similarity Measure/Matrix Norms
According to [2], a similarity measure on some set X is an upper bounded, exhaustive and total function $s : X \times X \to I \subset \mathbb{R}$ with $|I| > 1$ (therefore I is upper bounded and $\sup I$ exists). Additionally, a similarity measure should fulfill the properties of reflexivity (the supremum is reached if an item is compared to itself) and symmetry. We call such a measure normalized if the supremum equals 1 [1]. Note that an asymmetric similarity measure can easily be converted into a symmetric one by taking the geometric or arithmetic mean of the asymmetric measure applied twice to the same arguments in switched order.

A norm is a function $f : V \to \mathbb{R}$ over some vector space V that is absolutely homogeneous, positive definite and fulfills the triangle inequality. It is called a matrix norm if its domain is a set of matrices and if it is sub-multiplicative, i.e., $\|AB\| \le \|A\| \cdot \|B\|$. An example of a matrix norm is the spectral norm, which denotes the largest singular value of a matrix. Alternatively, one can define this norm as $\|A\|_2 := \sqrt{\rho(A^\top A)}$, where the function $\rho$ returns the largest absolute eigenvalue of the argument matrix.
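The equivalence between the two characterizations of the spectral norm can be checked numerically; the following is a small illustrative sketch of ours (not part of the original paper), using NumPy:

import numpy as np

A = np.random.randn(4, 6)

# Spectral norm = largest singular value of A
sigma_max = np.linalg.svd(A, compute_uv=False)[0]

# Equivalently, the square root of the largest eigenvalue of A^T A
rho = np.max(np.abs(np.linalg.eigvalsh(A.T @ A)))
assert np.isclose(sigma_max, np.sqrt(rho))

# NumPy exposes the same quantity directly as the matrix 2-norm
assert np.isclose(sigma_max, np.linalg.norm(A, 2))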
4 Document Similarity Measure Based on the Spectral Radius
For an arbitrary document t we define the embedding matrix E(t) as follows: $E(t)_{ij}$ is the i-th component of the normalized embedding vector belonging to the j-th word of the document t. Let t, u be two arbitrary documents; then the entry (i, j) of the product $F := E(t)^\top E(u)$ specifies the result of the cosine measure estimating the semantic similarity between word i of document t and word j of document u. The larger the matrix entries of F are, the higher is usually the semantic similarity of the associated texts. A straightforward way to measure the magnitude of the matrix is just to sum up all absolute matrix elements, which is called the $L_{1,1}$-norm. However, this approach has the disadvantage that also small cosine measure values are included in the sum, which can have in aggregate a considerable impact on the total similarity estimate, making such an approach vulnerable to noise in the data. Therefore we propose instead to apply an operator which is more robust than the $L_{1,1}$-norm and which is called the spectral radius. This radius denotes the largest absolute eigenvalue of the input matrix and constitutes a lower bound of all matrix norms. It also determines the convergence of the matrix power series $\lim_{n\to\infty} F^n$: the series converges if and only if the spectral radius does not exceed the value of one.

Since the vector components obtained by Word2Vec can be negative, the cosine measure between two word vectors can also assume negative values (rather rarely in practice though). Akin to zeros, negative cosine values indicate unrelated words as well. Because the spectral radius usually treats negative and positive matrix entries alike (the spectral radius of a matrix A and of its negation coincide), we replace all negative values in the matrix by zero. Finally, since our measure should be restricted to values from zero to one, we have to normalize it. Formally, we define our similarity measure as follows:

$sn(t, u) := \frac{\rho(R(E(t)^\top E(u)))}{\sqrt{\rho(R(E(t)^\top E(t))) \cdot \rho(R(E(u)^\top E(u)))}}$

where E(t) is the embedding matrix belonging to document t, in which all embedding column vectors are normalized, and R(M) is the matrix where all negative entries are replaced by zero, i.e., $R(M)_{ij} = \max\{0, M_{ij}\}$.

In contrast to matrix norms, which can be applied to arbitrary matrices, eigenvalues only exist for square matrices. However, the matrix $F^* := R(E(t)^\top E(u))$ that we use as basis for our similarity measures is usually non-quadratic. In particular, this matrix would be quadratic if and only if the numbers of terms in the two documents t and u coincide. Thus, we have to fill up the embedding matrix of the smaller one of the two texts with additional embedding vectors. A quite straightforward choice, which we followed here, is to just use the centroid vector for this. An alternative approach would be to sample the missing vectors.

A further issue is that eigenvalues are not invariant concerning row and column permutations. The columns of the embedding matrices just represent the
words appearing in the texts. However, the word order can be arbitrary for the text representing the marketing target groups (see Sect. 6.1 for details). Since a similarity measure should not depend on some random ordering, we need to bring the similarity matrix F* into some normalized format. A quite natural choice would be to enforce the ordering that maximizes the absolute value of the largest eigenvalue (which is actually our target value). Let us formalize this. We denote with $F^*_{P,Q}$ the matrix obtained from F* by applying the permutation P on the rows and the permutation Q on the columns. Thus, we can define our similarity measure as follows:

$sn_{sr}(t, u) := \max_{P,Q} \rho(F^*_{P,Q})$ (1)
However, solving this optimization problem is quite time-consuming. Let us assume the matrix F* has m rows and columns. Then we would have to iterate over m! · m! different possibilities. Hence, such an approach would be infeasible already for medium-sized texts. Therefore, we instead select the permutations that optimize the absolute value of the arithmetic mean over all eigenvalues, which is a lower bound of the maximum absolute eigenvalue. Let $\lambda_i(M)$ be the i-th eigenvalue of a matrix M. With this, we can formalize our optimization problem as follows:

$\widetilde{sn}_{sr}(t, u) := \rho(F^*_{\tilde{P},\tilde{Q}}), \quad (\tilde{P}, \tilde{Q}) = \arg\max_{P,Q} \left|\sum_{i=1}^{m} \lambda_i(F^*_{P,Q})\right|$ (2)
The sum over all eigenvalues is just the trace of the matrix. Thus,

$(\tilde{P}, \tilde{Q}) = \arg\max_{P,Q} |\mathrm{tr}(F^*_{P,Q})|$ (3)
which is just the sum over all diagonal elements. Since we constructed our matrix F* in such a way that it contains no negative entries, we can get rid of the absolute value operator:

$(\tilde{P}, \tilde{Q}) = \arg\max_{P,Q} \mathrm{tr}(F^*_{P,Q})$ (4)
Because the sum is commutative, the sequence of the individual summands is irrelevant. Therefore, we can leave either the row or column ordering constant and only permute the other one:

$\widetilde{sn}_{sr}(t, u) = \rho(F^*_{\tilde{P},\mathrm{id}}), \quad \tilde{P} = \arg\max_{P} \mathrm{tr}(F^*_{P,\mathrm{id}})$ (5)
$\tilde{P}$ can be found by solving a binary linear programming problem in the following way. Let X be the set of decision variables and let further $X_{ij} \in X$ be one if and only if row i is changed to row j in the reordered matrix and zero otherwise. Then the objective function is given by $\max_X \sum_{i=1}^{m}\sum_{j=1}^{m} X_{ji} F^*_{ji}$. A permutation denotes a 1:1 mapping, i.e.,

$\sum_{i=1}^{m} X_{ij} = 1 \;\; \forall j = 1, \dots, m, \qquad \sum_{j=1}^{m} X_{ij} = 1 \;\; \forall i = 1, \dots, m, \qquad X_{ij} \in \{0, 1\} \;\; \forall i, j = 1, \dots, m$ (6)
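To make the construction of this section concrete, the following is a minimal NumPy/SciPy sketch of the spectral-radius estimate. It is our own illustration rather than the authors' code: it pads the smaller embedding matrix with the centroid vector, replaces the binary linear program above by the equivalent assignment problem solved with the Hungarian algorithm (scipy.optimize.linear_sum_assignment), and normalizes as in the definition of sn. All variable names are ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def embedding_matrix(vectors):
    """Stack word vectors column-wise and normalize each column."""
    E = np.array(vectors, dtype=float).T              # shape: (dim, n_words)
    return E / np.linalg.norm(E, axis=0, keepdims=True)

def pad_with_centroid(E, n_cols):
    """Fill up the matrix with centroid columns until it has n_cols columns."""
    if E.shape[1] >= n_cols:
        return E
    centroid = E.mean(axis=1, keepdims=True)
    centroid /= np.linalg.norm(centroid)
    return np.hstack([E, np.repeat(centroid, n_cols - E.shape[1], axis=1)])

def spectral_radius_similarity(E_t, E_u):
    m = max(E_t.shape[1], E_u.shape[1])
    E_t, E_u = pad_with_centroid(E_t, m), pad_with_centroid(E_u, m)

    def rho_of_R(A, B):
        F = np.maximum(A.T @ B, 0.0)                   # R(.): clip negative cosines
        # Reorder rows so that the trace is maximal (assignment problem),
        # then take the largest absolute eigenvalue of the reordered matrix.
        rows, cols = linear_sum_assignment(-F)         # maximizes the selected entries
        perm = np.empty(m, dtype=int)
        perm[cols] = rows
        return np.max(np.abs(np.linalg.eigvals(F[perm, :])))

    return rho_of_R(E_t, E_u) / np.sqrt(rho_of_R(E_t, E_t) * rho_of_R(E_u, E_u))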
5 Spectral Norm
The similarity estimate described above has several drawbacks:
– The boundedness condition is violated in some cases. Therefore, this similarity does not qualify as a normalized similarity estimate according to the definition in Sect. 3.
– The largest eigenvalue of a matrix depends on the row and column ordering. However, this ordering is arbitrary for our proposed description of target groups by keywords (cf. Sect. 6.1 for the details). To ensure a unique eigenvalue, we apply linear optimization, which is an expensive approach in terms of runtime.
– Eigenvalues are only defined for square matrices. Therefore, we need to fill up the smaller of the embedding matrices to meet this requirement.

An alternative to the spectral radius is the spectral norm, which is defined by the largest singular value of a matrix. Formally, the spectral norm-based estimate is given as

$sn_2(t, u) := \frac{\|R(E(t)^\top E(u))\|_2}{\sqrt{\|R(E(t)^\top E(t))\|_2 \cdot \|R(E(u)^\top E(u))\|_2}}$

where $\|A\|_2 = \sqrt{\rho(A^\top A)}$. By using the spectral norm instead of the spectral radius, all of the issues mentioned above are solved. The spectral norm is not only invariant to column or row permutations, it can also be applied to arbitrary rectangular matrices. Furthermore, boundedness is guaranteed as long as no negative cosine values occur, as stated in the following proposition.

Proposition 1. If the cosine similarity values between all embedding vectors of words occurring in any of the documents are non-negative, i.e., if $R(E(t)^\top E(u)) = E(t)^\top E(u)$ for all document pairs (t, u), then $sn_2$ is a normalized similarity measure.
Symmetry Proof. At first, we focus on the symmetry condition. Let A := E(t), B := E(u), where t and u are arbitrary documents. Symmetry directly follows if we can show that $\|Z\|_2 = \|Z^\top\|_2$ for arbitrary matrices Z, since with this property we have

$sn_2(t, u) = \frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} = \frac{\|(B^\top A)^\top\|_2}{\sqrt{\|B^\top B\|_2 \cdot \|A^\top A\|_2}} = \frac{\|B^\top A\|_2}{\sqrt{\|B^\top B\|_2 \cdot \|A^\top A\|_2}} = sn_2(u, t)$ (7)

Let M and N be arbitrary matrices such that MN and NM are both defined and quadratic; then (see [5])

$\rho(MN) = \rho(NM)$ (8)

where $\rho(X)$ denotes the largest absolute eigenvalue of a square matrix X. Using identity (8) one can easily infer that

$\|Z\|_2 = \sqrt{\rho(Z^\top Z)} = \sqrt{\rho(Z Z^\top)} = \|Z^\top\|_2$ (9)

Boundedness Proof. The following property needs to be shown:

$\frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} \le 1$ (10)

In the proof, we exploit the fact that for every positive-semidefinite matrix X the following equation holds:

$\rho(X^2) = \rho(X)^2$ (11)
We observe that for the denominator

$\|A^\top A\|_2 \cdot \|B^\top B\|_2 = \sqrt{\rho((A^\top A)^\top A^\top A)} \cdot \sqrt{\rho((B^\top B)^\top B^\top B)} = \sqrt{\rho([(A^\top A)^\top]^2)} \cdot \sqrt{\rho([(B^\top B)^\top]^2)} \overset{(11)}{=} \sqrt{\rho((A^\top A)^\top)^2} \cdot \sqrt{\rho((B^\top B)^\top)^2} = \rho(A^\top A) \cdot \rho(B^\top B) \overset{(9)}{=} \|A\|_2^2 \cdot \|B\|_2^2$ (12)

Putting things together we finally obtain

$\frac{\|A^\top B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} \overset{\text{sub-mult.}}{\le} \frac{\|A^\top\|_2 \cdot \|B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} \overset{(9)}{=} \frac{\|A\|_2 \cdot \|B\|_2}{\sqrt{\|A^\top A\|_2 \cdot \|B^\top B\|_2}} \overset{(12)}{=} \frac{\|A\|_2 \cdot \|B\|_2}{\sqrt{\|A\|_2^2 \cdot \|B\|_2^2}} = 1$ (13)
The question remains how the similarity measure induced by matrix norms performs in comparison with the usual centroid method. General statements about the spectral-norm based similarity measure are difficult, but we can draw some conclusions if we restrict ourselves to the case where $A^\top B$ is a square diagonal matrix. Hereby, one word of the first text is very similar to exactly one word of the second text and very dissimilar to all remaining words. The similarity estimate is then given by the largest eigenvalue (the spectral radius) of $A^\top B$, which equals the largest cosine measure value. Noise in the form of small matrix entries is completely ignored.
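As a quick illustration (ours, not taken from the paper), the spectral-norm estimate $sn_2$ reduces to a few lines of NumPy once the normalized embedding matrices are available; R(·) is the element-wise clipping of negative cosine values:

import numpy as np

def spectral_norm_similarity(E_t, E_u):
    """sn_2: spectral-norm based similarity of two documents.

    E_t, E_u: (dim, n_words) matrices of L2-normalized word embeddings.
    """
    def clipped_norm(A, B):
        # R(A^T B): cosine similarities with negative entries set to zero,
        # followed by the largest singular value (spectral norm).
        return np.linalg.norm(np.maximum(A.T @ B, 0.0), 2)

    return clipped_norm(E_t, E_u) / np.sqrt(clipped_norm(E_t, E_t) * clipped_norm(E_u, E_u))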
6 Application Scenarios

We applied our semantic similarity estimates to the following two scenarios:

6.1 Market Segmentation
Market segmentation is one of the key tasks of a marketer. Usually, it is accomplished by clustering over behaviors as well as demographic, geographic and psychographic variables [12]. In this paper, we will describe an alternative approach based on unsupervised natural language processing. In particular, our business
partner operates a commercial youth platform for the Swiss market, where registered members get access to third-party offers such as discounts and special events like concerts or castings. Several hundred online contests per year are launched over this platform, sponsored by other firms; an increasing number of them require the members to write short free-text snippets, e.g. to elaborate on a perfect holiday at a destination of their choice in case of a contest sponsored by a travel agency.

Based on the results of a broad survey, the platform provider's marketers assume five different target groups (called milieus) being present among the platform members: Progressive postmodern youth (people primarily interested in culture and arts), Young performers (people striving for a high salary with a strong affinity to luxury goods), Freestyle action sportsmen, Hedonists (rather poorly educated people who enjoy partying and disco music), and Conservative youth (traditional people with a strong concern for security). A sixth milieu called Special groups comprises all those who cannot be assigned to one of the five milieus above. For each milieu (with the exception of Special groups) a keyword list was manually created describing its main characteristics.

For triggering marketing campaigns, an algorithm shall be developed that automatically assigns each contest answer to the most likely target group: we propose as best match for a contest answer the youth milieu for which the estimated semantic similarity between the associated keyword list and the user answer is maximal. In case the highest similarity estimate falls below the 10 percent quantile of the distribution of highest estimates, the Special groups milieu is selected. Since the keyword list typically consists of nouns (in the German language capitalized) and the user contest answers might contain a lot of adjectives and verbs as well, which do not match very well to nouns in the Word2Vec vector representation, we actually conduct two comparisons for our Word2Vec based measures, one with the unchanged user contest answers and one after capitalizing every word beforehand. The final similarity estimate is then given as the maximum value of both individual estimates.
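A compact sketch of this assignment rule follows. It is our own illustration: the milieu names and the 10% quantile threshold are taken from the description above, and the function similarity stands for any of the estimates defined earlier.

import numpy as np

MILIEUS = ["progressive postmodern youth", "young performers",
           "freestyle action sportsmen", "hedonists", "conservative youth"]

def assign_milieus(answers, keyword_lists, similarity):
    """Assign each contest answer to the best matching milieu.

    answers: list of contest answer strings
    keyword_lists: dict milieu -> keyword list string
    similarity: callable(text_a, text_b) -> float, e.g. the sn_2 estimate
    """
    best_scores, best_milieus = [], []
    for answer in answers:
        # Compare twice: once as-is and once with every word capitalized,
        # and keep the maximum (the keyword lists mostly contain German nouns).
        scores = {m: max(similarity(answer, kw), similarity(answer.title(), kw))
                  for m, kw in keyword_lists.items()}
        milieu, score = max(scores.items(), key=lambda kv: kv[1])
        best_milieus.append(milieu)
        best_scores.append(score)

    # Answers whose best score falls below the 10% quantile go to "special groups".
    threshold = np.quantile(best_scores, 0.10)
    return [m if s >= threshold else "special groups"
            for m, s in zip(best_milieus, best_scores)]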
6.2 Translation Matching
The novel The Purloined Letter authored by Edgar Allan Poe was independently translated by two translators into German¹. We aim to match a sentence from the first translation to the associated sentence of the second by looking for the assignment with the highest semantic relatedness, disregarding the sentence order. To guarantee a 1:1 sentence mapping, periods were partly replaced by semicolons.

¹ This corpus can be obtained under the URL https://www.researchgate.net/publication/332072718_alignmentPurloinedLettertar.
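A minimal sketch of this matching step (ours; similarity is again one of the estimates above, and sentences are given as strings): each sentence of the first translation is paired with the most similar sentence of the second translation, ignoring sentence order.

def match_translations(sentences_a, sentences_b, similarity):
    """For each sentence in translation A, pick the most similar sentence in B."""
    matches = []
    for sent_a in sentences_a:
        scores = [similarity(sent_a, sent_b) for sent_b in sentences_b]
        matches.append(scores.index(max(scores)))
    return matches

def matching_accuracy(matches):
    """Gold alignment is the identity: the i-th sentence maps to the i-th sentence."""
    return sum(int(pred == gold) for gold, pred in enumerate(matches)) / len(matches)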
7 Evaluation
For evaluation we selected three online contests (language: German), where people elaborated on their favorite travel destination (contest 1, see Appendix A for
an example), speculated about potential experiences with a pair of fancy sneakers (contest 2) and explained why they emotionally prefer a certain product out of four available candidates. In order to provide a gold standard, three professional marketers from different youth marketing companies annotated independently the best matching youth milieus for every contest answer. We determined for each annotator individually his/her average inter-annotator agreement with the others (Cohen’s kappa). The minimum and maximum of these average agreement values are given in Table 2. Since for contest 2 and contest 3, some of the annotators annotated only the first 50 entries (last 50 entries respectively), we specified min/max average kappa values for both parts. We further compared the youth milieus proposed by our unsupervised matching algorithm with the majority votes over the human experts’ answers (see Table 3) and computed its average inter-annotator agreement with the human annotators (see again Table 2). The obtained accuracy values for the second scenario (matching translated sentences) are given in Table 4.
Fig. 1. Scatter Plots of Cosine between Centroids of Word2Vec Embeddings (W2VC) vs similarity estimates induced by different spectral measures.
Table 1. Corpus sizes measured by number of words.

Corpus                  # Words
German Wikipedia        651 880 623
Frankfurter Rundschau   34 325 073
News journal 20 min     8 629 955
The Word2Vec word embeddings were trained on the German Wikipedia (dump originating from 20 February 2017) merged with a Frankfurter Rundschau newspaper Corpus and 34 249 articles of the news journal 20 min 2 , where the latter is targeted to the Swiss market and freely available at various Swiss train stations (see Table 1 for a comparison of corpus sizes). By employing articles from 2
http://www.20min.ch.
Table 2. Minimum and maximum average inter-annotator agreements (Cohen's kappa) / average inter-annotator agreement values for our automated matching method.

Method                  Contest 1   Contest 2     Contest 3
Min kappa               0.123       0.295/0.030   0.110/0.101
Max kappa               0.178       0.345/0.149   0.114/0.209
Kappa (spectral norm)   0.128       0.049/0.065   0.060/0.064
# Entries               1544        100           100
Contest 1 2
3
all
0.167 0.357 0.355 0.347 0.347 0.162
0.167 0.288 0.227 0.227 0.197 0.273
0.167 0.335 0.330 0.330 0.322 0.189
0.167 0.254 0.284 0.328 0.299 0.284
Spectral Norm 0.370 0.299 0.353 0.313 Spectral Radius Spectral Radius+W2VC 0.357 0.299
0.288 0.350 0.182 0.326 0.212 0.334
20 min, we want to ensure the reliability of word vectors for certain Switzerland-specific expressions like Velo or Glace, which are underrepresented in the German Wikipedia and the Frankfurter Rundschau corpus. ESA is usually trained on Wikipedia, since the authors of the original ESA paper suggest that the articles of the training corpus should represent disjoint concepts, which is only guaranteed for encyclopedias. However, Stein and Anderka [7] challenged this hypothesis and demonstrated that promising results can be obtained by applying ESA to other types of corpora, like the popular Reuters newspaper corpus, as well. Unfortunately, the implementation we use (Wikiprep-ESA³) expects its training data to be a Wikipedia dump. Furthermore, Wikiprep-ESA only indexes words that are connected by hyperlinks, which are usually lacking in ordinary newspaper articles. Therefore, we could train ESA on Wikipedia only, but we have meanwhile developed a version of ESA that can be applied to arbitrary corpora and which was trained on the full corpus
³ https://github.com/faraday/wikiprep-esa.
(Wikipedia+Frankfurter Rundschau+20 min). In the following, we refer to this implementation as ESA2. The STVs (Skip-Thought Vectors) were trained on the same corpus as our estimates and the Word2Vec embedding centroids (W2VC). The actual document similarity estimation is accomplished by the usual centroid approach. An issue we are faced with for the first evaluation scenario of market segmentation (see Sect. 6.1) is that STVs are not bag-of-words models but actually take the sequence of the words into account, and therefore the obtained similarity estimate between milieu keyword list and contest answer would depend on the keyword ordering. However, this order could have been chosen arbitrarily by the marketers and might be completely random. A possible solution is to compare the contest answers with all possible permutations of keywords and determine the maximum value over all those comparisons. However, such an approach would be infeasible already for medium keyword list sizes. Therefore, we apply for this scenario a beam search that extends the keyword list iteratively while keeping only the n best-performing permutations.

Table 4. Accuracy value obtained for matching a sentence of the first to the associated sentence of the second translation (based on the first 200 sentences of both translations).

Method            Accuracy
ESA               0.672
STV               0.716
Spectral Radius   0.721
W2VC              0.726
Spectral Norm     0.731

8 Discussion
The evaluation showed that the inter-annotator agreement values vary strongly for contest 2 part 2 (minimum average annotator agreement according to Cohen's kappa of 0.030, while the maximum is 0.149, see Table 2). On this contest part, our spectral norm based matching obtains a considerably higher average agreement than one of the annotators. Regarding baseline systems, the most relevant comparison is naturally the one with W2VC, since it employs the same type of data. The similarity estimate induced by the spectral norm performs quite stably over both scenarios and clearly outperforms the W2VC approach. In contrast, however, the performance of the spectral radius based estimate is rather mixed. While it performs well on the first contest, the performance on the third contest is quite poor and lags behind the Word2Vec centroids there. Only the average of both measures (W2VC+Spectral Radius) performs reasonably well on all three
contests. One major issue of this measure is its unboundedness. The typical normalization with the geometric mean of comparing the documents with themselves results in values exceeding the desired upper limit of one in 1.8% of the cases (determined on the largest contest 1). So still some research is needed to come up with a better normalization.

Finally, we produced a scatter plot (see Fig. 1), plotting the values of the spectral similarity estimates against W2VC. While the spectral norm is quite strongly correlated to W2VC, the spectral radius behaves much more irregularly and non-linearly. In addition, its values exceed the desired upper limit of 1 several times, which is a result of its non-boundedness. Furthermore, both of the spectral similarity estimates tend to assume larger values than W2VC, which is a result of their higher robustness against noise in the data. Note that a downside of both approaches in relation to the usual Word2Vec centroids method is the increased runtime, since it requires the pair-wise comparison of all words contained in the input documents. In our scenario with rather short text snippets and keyword lists, this was not much of an issue. However, for large documents, such a comprehensive comparison could soon become infeasible. This issue can be mitigated, for example, by constructing the embedding matrices not on the basis of individual words but of entire sentences, for instance by employing the skip-thought-vector representation.
9 Supervised Learning
So far, our two proposed similarity measures were only applied in an unsupervised setting. However, supervised learning methods usually obtain superior accuracy. For that, we could use our two similarity estimates as kernels for a support vector machine [19] (SVM in short), potentially combined with an RBF kernel applied to an ordinary feature representation consisting of tf-idf weights of word forms or lemmas (not yet evaluated, however). One issue here is to investigate whether our proposed similarity estimates are positive semidefinite and qualify as regular kernels. In case of non-positive-semidefiniteness, the SVM training process can get stuck in a local minimum, failing to reach the global minimum of the hinge loss.

The estimate induced by the spectral radius, and also the spectral norm in case of negative cosine measure values between word embedding vectors, can possibly violate the boundedness constraint and therefore cannot constitute a positive-semidefinite kernel. To see this, let us consider the kernel matrix K. According to Mercer's theorem [13,16], an SVM kernel is positive-semidefinite exactly when, for any possible set of inputs, the associated kernel matrices are positive-semidefinite. So we must show that there is at least one kernel matrix that is not positive-semidefinite. Let us select one kernel matrix K with at least one violation of boundedness. We can assume that K is symmetric, since symmetry is a prerequisite for positive-semidefiniteness. Since our normalization procedure guarantees reflexivity, a text compared with itself always yields the estimated similarity of one. Therefore, the value
of one can only be exceeded for off-diagonal elements. Let us assume the entry $K_{ij} = K_{ji}$ with i < j of the kernel matrix equals $1 + \epsilon$ for some $\epsilon > 0$. Consider a vector v with $v_i = 1$, $v_j = -1$ and all other components equal to zero. Let $w := v^\top K$ and $q := v^\top K v = w v$; then $w_i = 1 - (1 + \epsilon) = -\epsilon$ and $w_j = 1 + \epsilon - 1 = \epsilon$. With this, it follows that $q = -\epsilon - \epsilon = -2\epsilon$, and therefore K cannot be positive-semidefinite.

Note that $sn_2$ can be a proper kernel in certain situations. Consider the case that all of the investigated texts are so dissimilar that the kernel matrices are diagonally dominant for all possible sets of inputs. Since diagonally dominant matrices with non-negative diagonal elements are positive-semidefinite, the kernel is positive-semidefinite as well. It is still an open question if this kernel can also be positive-semidefinite if not all of the kernel matrices are diagonally dominant.
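As a small numerical illustration of this argument (ours, not from the paper): a symmetric kernel matrix with unit diagonal and one off-diagonal entry of 1 + epsilon fails the positive-semidefiniteness test.

import numpy as np

eps = 0.1
K = np.eye(3)
K[0, 1] = K[1, 0] = 1.0 + eps          # boundedness violation: entry > 1

v = np.array([1.0, -1.0, 0.0])         # the vector used in the argument above
print(v @ K @ v)                       # -2 * eps < 0, so K is not PSD

# The smallest eigenvalue is negative as well (approximately -eps):
print(np.linalg.eigvalsh(K).min())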
10 Conclusion
We proposed two novel similarity estimates based on the spectrum of the product of embedding matrices. These estimates were evaluated on two tasks, i.e., assigning users to the best matching marketing target groups and matching sentences of a novel translation with their counterparts from a different translation. Hereby, we obtained superior results compared to the usual centroid of Word2Vec vectors (W2VC) method. Furthermore, we investigated several properties of our estimates concerning boundedness and positive-semidefiniteness.

Acknowledgement. Hereby we thank the Jaywalker GmbH as well as the Jaywalker Digital AG for their support regarding this publication and especially for annotating the contest data with the best-fitting youth milieus.
A Example Contest Answer
The following snippet is an example user answer for the travel contest (contest 1):

1. Jordanien: Ritt durch die Wüste und Petra im Morgengrauen bestaunen, bevor die Touristenbusse kommen
2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen
3. USA: Eine abgespaceste Woche am Burning Man Festival erleben

English translation:

1. Jordan: Ride through the desert and marveling at Petra during sunrise before the arrival of tourist buses
2. Cook Island: Snorkeling with whale sharks and relaxing
3. USA: Experience an awesome week at the Burning Man Festival
References
1. Attig, A., Perner, P.: The problem of normalization and a normalized similarity measure by online data. Trans. Case-Based Reason. 4(1), 3–17 (2011)
2. Belanche, L., Orozco, J.: Things to know about a (dis)similarity measure. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds.) KES 2011. LNCS (LNAI), vol. 6881, pp. 100–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23851-2_11
3. Brokos, G.I., Malakasiotis, P., Androutsopoulos, I.: Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing, Berlin, Germany, pp. 114–118 (2016)
4. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of semantic relatedness. Comput. Linguist. 32(1), 13–47 (2006)
5. Chatelin, F.: Eigenvalues of Matrices - Revised Edition. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania (1993)
6. Gabrilovic, E., Markovitch, S.: Wikipedia-based semantic interpretation for natural language processing. J. Artif. Intell. Res. 34, 443–498 (2009)
7. Gottron, T., Anderka, M., Stein, B.: Insights into explicit semantic analysis. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, UK, pp. 1961–1964 (2011)
8. Gupta, V.: Improving word embeddings using kernel principal component analysis. Master's thesis, Bonn-Aachen International Center for Information Technology (BIT) (2018)
9. Hong, K.J., Lee, G.H., Kom, H.J.: Enhanced document clustering using Wikipedia-based document representation. In: Proceedings of the 2015 International Conference on Applied System Innovation (ICASI), Osaka, Japan (2015)
10. Kiros, R., et al.: Skip-thought vectors. In: Proceedings of the Conference on Neural Information Processing Systems (NIPS), Montréal, Canada (2015)
11. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25, 259–284 (1998)
12. Lynn, M.: Segmenting and targeting your market: strategies and limitations. Technical report, Cornell University (2011). http://scholorship.sha.cornell.edu/articles/243
13. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Phil. Trans. R. Soc. A 209, 441–458 (1909)
14. Mijangos, V., Sierra, G., Montes, A.: Sentence level matrix representation for document spectral clustering. Pattern Recognit. Lett. 85, 29–34 (2017)
15. Mikolov, T., Sutskever, I., Ilya, C., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, pp. 3111–3119 (2013)
16. Murphy, K.P.: Machine Learning - A Probabilistic Perspective. MIT Press, Cambridge (2012)
17. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (2014)
18. Song, Y., Roth, D.: Unsupervised sparse vector densification for short text similarity. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Denver, Colorado (2015)
19. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
A Computational Approach to Measuring the Semantic Divergence of Cognates

Ana-Sabina Uban¹,³, Alina Cristea (Ciobanu)¹,², and Liviu P. Dinu¹,²

¹ Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
{auban,alina.cristea,ldinu}@fmi.unibuc.ro
² Human Language Technologies Research Center, University of Bucharest, Bucharest, Romania
³ Data Science Center, University of Bucharest, Bucharest, Romania
Abstract. Meaning is the foundation stone of intercultural communication. Languages are continuously changing, and words shift their meanings for various reasons. Semantic divergence in related languages is a key concern of historical linguistics. In this paper we investigate semantic divergence across languages by measuring the semantic similarity of cognate sets in multiple languages. The method that we propose is based on cross-lingual word embeddings. In this paper we implement and evaluate our method on English and five Romance languages, but it can be extended easily to any language pair, requiring only large monolingual corpora for the involved languages and a small bilingual dictionary for the pair. This language-agnostic method facilitates a quantitative analysis of cognates divergence – by computing degrees of semantic similarity between cognate pairs – and provides insights for identifying false friends. As a second contribution, we formulate a straightforward method for detecting false friends, and introduce the notion of “soft false friend” and “hard false friend”, as well as a measure of the degree of “falseness” of a false friends pair. Additionally, we propose an algorithm that can output suggestions for correcting false friends, which could result in a very helpful tool for language learning or translation. Keywords: Cognates · Semantic divergence · Semantic similarity
1 Introduction

Semantic change – that is, change in the meaning of individual words [3] – is a continuous, inevitable process stemming from numerous reasons and influenced by various factors. Words are continuously changing, with new senses emerging all the time. [3] presents no less than 11 types of semantic change, which are generally classified into two wide categories: narrowing and widening. Most linguists found structural and psychological factors to be the main cause of semantic change, but the evolution of technology and cultural and social changes are not to be omitted. Measuring semantic divergence across languages can be useful in theoretical and historical linguistics – being central to models of language and cultural evolution – but also in downstream applications relying on cognates, such as machine translation.
Cognates are words in sister languages (languages descending from a common ancestor) with a common proto-word. For example, the Romanian word victorie and the Italian word vittoria are cognates, as they both descend from the Latin word victoria (meaning victory) – see Fig. 1. In most cases, cognates have preserved similar meanings across languages, but there are also exceptions. These are called deceptive cognates or, more commonly, false friends. Here we use the definition of cognates that refers to words with similar appearance and some common etymology, and use “true cognates” to refer to cognates which also have a common meaning, and “deceptive cognates” or “false friends” to refer to cognate pairs which do not have the same meaning (anymore). The most common way cognates have diverged is by changing their meaning. For many cognate pairs, however, the changes can be more subtle, relating to the feeling attached to a word, or its connotations. This can make false friends even more delicate to distinguish from true cognates.
Fig. 1. Example of cognates and their common ancestor
Cognate word pairs can help students when learning a second language and contribute to the expansion of their vocabularies. False friends, however, from the more obvious differences in meaning to the more subtle, have the opposite effect, and can be confusing for language learners and make the correct use of language more difficult. Cognate sets have also been used in a number of applications in natural language processing, including for example machine translation [10]. These applications rely on properly distinguishing between true cognates and false friends.

1.1 Related Work

Cross-lingual semantic word similarity consists in identifying words that refer to similar semantic concepts and convey similar meanings across languages [16]. Some of the most popular approaches rely on probabilistic models [17] and cross-lingual word embeddings [13]. A comprehensive list of cognates and false friends for every language pair is difficult to find or manually build - this is why applications have to rely on automatically identifying them. There have been a number of previous studies attempting to automatically extract pairs of true cognates and false friends from corpora or from dictionaries. Most methods are based either on orthographic and phonetic similarity, or require large parallel corpora or dictionaries [5, 9, 11, 14]. We propose a corpus-based approach that is capable of covering the vast majority of the vocabulary for a large number of languages, while at the same time requiring minimal human effort in terms of manually evaluating word pair similarity or building lexicons, requiring only large monolingual corpora.
In this paper, we make use of cross-lingual word embeddings in order to distinguish between true cognates and false friends. There have been few previous studies using word embeddings for the detection of false friends or cognate words, usually using simple methods on only one or two pairs of languages [4, 15].

1.2 Contributions
The contributions of our paper are twofold: firstly, we propose a method for quantifying the semantic divergence of languages; secondly, we provide a framework for detecting and correcting false friends, based on the observation that these are usually deceptive cognate pairs: pairs of words that once had a common meaning, but whose meaning has since diverged. We propose a method for measuring the semantic divergence of sister languages based on cross-lingual word embeddings. We report empirical results on five Romance languages: Romanian, French, Italian, Spanish and Portuguese. For a deeper insight into the matter, we also compute and investigate the semantic similarity between modern Romance languages and Latin. We finally introduce English into the mix, to analyze the behavior of a more remote language, where words deriving from Latin are mostly borrowings. Further, we make use of cross-lingual word embeddings in order to distinguish between true cognates and false friends. There have been few previous studies using word embeddings for the detection of false friends or cognate words, usually using simple methods on only one or two pairs of languages [4, 15]. Our chosen method of leveraging word embeddings extends naturally to another application related to this task which, to our knowledge, has not been explored so far in research: false friend correction. We propose a straightforward method for solving this task of automatically suggesting a replacement when a false friend is incorrectly used in a translation. Especially for language learners, solving this problem could result in a very useful tool to help them use language correctly.
2 The Method

2.1 Cross-Lingual Word Embeddings
Word embeddings are vectorial representations of words in a continuous space, built by training a model to predict the occurrence of a given word in a text corpus given its context. Based on the distributional hypothesis stating that similar words occur in similar contexts, these vectorial representations can be seen as semantic representations of words and can be used to compute semantic similarity between word pairs (representations of words with similar meanings are expected to be close together in the embeddings space). To compute the semantic divergence of cognates across sister languages, as well as identify pairs of false cognates (pairs of cognates with high semantic distance), which by definition are pairs of words in two different languages, we need to obtain a multilingual semantic space, which is shared between the cognates. Having the representations
[Figure 2: histogram panels (a)-(u), one per language pair, covering all pairs among Es, Fr, It, Pt, Ro, En, and La.]
Fig. 2. Distributions of cross-language similarity scores between cognates.
of both cognates in the same semantic space, we can then compute the semantic distance between them using their vectorial representations in this space. We use word embeddings computed using the FastText algorithm, pre-trained on Wikipedia for the six languages in question. The vectors have dimension 300, and were obtained using the skip-gram model described in [2] with default parameters.
The algorithm for measuring the semantic distance between cognates in a pair of languages (lang1, lang2) consists of the following steps:

1. Obtain word embeddings for each of the two languages.
2. Obtain a shared embedding space, common to the two languages. This is accomplished using an alignment algorithm, which consists of finding a linear transformation between the two spaces that on average optimally transforms each vector in one embedding space into a vector in the second embedding space, minimizing the distance between a few seed word pairs (for which it is known that they have the same meaning), based on a small bilingual dictionary. For our purposes, we use the publicly available multilingual alignment matrices that were published in [12].
3. Compute semantic distances for each pair of cognate words in the two languages, using a vectorial distance (we chose cosine distance) on their corresponding vectors in the shared embedding space.
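The following is a small illustration of steps 2 and 3 (our own sketch, not the authors' code); W_l1 and W_l2 stand for published alignment matrices such as those of [12], which map each language's embedding space onto a common reference space.

import numpy as np

def align(embeddings, W):
    """Map monolingual embeddings (dict word -> vector) into the shared space."""
    return {word: W @ vec for word, vec in embeddings.items()}

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cognate_distances(cognate_pairs, emb_l1, emb_l2, W_l1, W_l2):
    """Semantic distance for each cognate pair (w1, w2) in the shared space."""
    shared_l1, shared_l2 = align(emb_l1, W_l1), align(emb_l2, W_l2)
    return {(w1, w2): cosine_distance(shared_l1[w1], shared_l2[w2])
            for w1, w2 in cognate_pairs
            if w1 in shared_l1 and w2 in shared_l2}

# The divergence of a language pair is then the average distance
# (equivalently, one minus the average similarity) over its cognate set.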
2.2 Cross-Language Semantic Divergence
We propose a definition of semantic divergence between two languages based on the semantic distances of their cognate word pairs in these embedding spaces. The semantic distance between two languages can then be computed as the average of the semantic divergence of each pair of cognates in that language pair.

We use the list of cognate sets in Romance languages proposed by [6]. It contains 3,218 complete cognate sets in Romanian, French, Italian, Spanish and Portuguese, along with their Latin common ancestors. The cognate sets are obtained from electronic dictionaries which provide information about the etymology of the words. Two words are considered cognates if they have the same etymon (i.e., if they descend from the same word).

The algorithm described above for computing semantic distance for cognate pairs stands on the assumption that the (shared) embedding spaces are comparable, so that the averaged cosine similarities, as well as the overall distributions of scores that we obtain for each pair of languages, can be compared in a meaningful way. For this to be true, at least two conditions need to hold:

1. The embedding spaces for each language need to be similarly representative of language, or trained on similar texts - this assumption holds sufficiently in our case, since all embeddings (for all languages) are trained on Wikipedia, which at least contains a similar selection of texts for each language, and at most can be considered comparable corpora.
2. The similarity scores in a certain (shared) embedding space need to be sampled from a similar distribution. To confirm this assumption, we did a brief experiment looking at the distributions of a random sample of similarity scores across all embedding spaces, and did find that the distributions for each language pair are similar (in mean and variance). This result was not obvious but also not surprising, since:
- The way we create shared embedding spaces is by aligning the embedding space of any language to the English embedding space (which is a common reference to all shared embedding spaces).
- The nature of the alignment operation (consisting only of rotations and reflections) guarantees monolingual invariance, as described in these papers: [1, 12].
The Romance Languages. We compute the cosine similarity between cognates for each pair of modern languages, and between modern languages and Latin as well. We compute an overall score of similarity for a pair of languages as the average similarity for the entire dataset of cognates. The results are reported in Table 1.

Table 1. Average cross-language similarity between cognates (Romance languages).

     Fr     It     Pt     Ro     La
Es   0.67   0.69   0.70   0.58   0.41
Fr          0.66   0.64   0.56   0.40
It                 0.66   0.57   0.41
Pt                        0.57   0.41
Ro                               0.40
We observe that the highest similarity is obtained between Spanish and Portuguese (0.70), while the lowest are obtained for Latin. From the modern languages, Romanian has, overall, the lowest degrees of similarity to the other Romance languages. A possible explanation for this result is the fact that Romanian developed far from the Romance kernel, being surrounded by Slavic languages. In Table 2 we report, for each pair of languages, the most similar (above the main diagonal) and the most dissimilar (below the main diagonal) cognate pair for Romance languages.

Table 2. Most similar and most dissimilar cognates

     Es                   Fr                   It                          Ro                          Pt
Es   -                    ocho/huit (0.89)     diez/dieci (0.86)           ocho/opt (0.82)             ocho/oito (0.89)
Fr   caisse/casar (0.05)  -                    dix/dieci (0.86)            décembre/decembrie (0.83)   huit/oito (0.88)
It   prezzo/prez (0.06)   punto/ponte (0.09)   -                           convincere/convinge (0.75)  convincere/convencer (0.88)
Ro   miere/mel (0.09)     face/facteur (0.10)  as/asso (0.11)              -                           opt/oito (0.83)
Pt   pena/paner (0.09)    prez/preço (0.05)    preda/prea (0.08)           linho/in (0.05)             -
The problem that we address in this experiment involves a certain vagueness of reported values (also noted by [8] in the problem of semantic language classification), as there isn't a gold standard that we can compare our results to. To overcome this drawback, we use the degrees of similarity that we obtained to produce a language clustering (using the UPGMA hierarchical clustering algorithm), and observe that it is similar to the generally accepted tree of languages, and to the clustering tree built on intelligibility degrees by [7]. The obtained dendrogram is rendered in Fig. 3.
Fig. 3. Dendrogram of the language clusters
The Romance Languages vs English. Further, we introduce English into the mix as well. We run this experiment on a subset of the used dataset, comprising the words that have a cognate in English as well1 . The subset has 305 complete cognate sets. The results are reported in Table 3, and the distribution of similarity scores for each pair of languages is rendered in Fig. 2. We notice that English has 0.40 similarity with Latin, the lowest value (along with French and Romanian), but close to the other languages. Out of the modern Romance languages, Romanian is the most distant from English, with 0.53 similarity. Another interesting observation relates to the distributions of scores for each language pair, shown in the histograms in Fig. 2. While similarity scores between cognates among romance languages usually follow a normal distribution (or another unimodal, more skewed distribution), the distributions of scores for romance languages with English seem to follow a bimodal distribution, pointing to a different semantic evolution for words in English that share a common etymology with a word in a romance language. One possible explanation is that the set of cognates between English and romance languages (which are pairs of languages that are more distantly related) consist of two distinct groups: for example one group of words that were borrowed directly from the romance language to English (which should have more meaning in common), and words that had a more complicated etymological trail between languages (and for which meaning might have diverged more, leading to lower similarity scores).
1 Here we stretch the definition of cognates, as the term generally refers to sister languages. In this case English is not a sister of the Romance languages, and the words with Latin ancestors that entered English are mostly borrowings.
Table 3. Average cross-language similarity between cognates

     Fr     It     Pt     Ro     En     La
Es   0.64   0.67   0.68   0.57   0.61   0.42
Fr          0.64   0.61   0.55   0.60   0.40
It                 0.65   0.57   0.60   0.41
Pt                        0.56   0.59   0.42
Ro                               0.53   0.40
En                                      0.40
2.3 Detection and Correction of False Friends

In a second series of experiments, we propose a method for identifying and correcting false friends. Using the same principles as in the previous experiment, we can use embedding spaces and semantic distances between cognates in order to detect pairs of false friends, which are simply defined as pairs of cognates which do not share the same meaning, or which are not semantically similar enough. This definition is of course ambiguous: there are different degrees of similarity, and as a consequence different potential degrees of falseness in a false friend. Based on this observation, we define the notions of hard false friend and soft false friend.

A hard false friend is a pair of cognates for which the meanings of the two words have diverged enough such that they don't have the same meaning anymore, and should not be used interchangeably (as translations of one another). In this category fall most known examples of false friends, such as the French-English cognate pair attendre/attend: in French, attendre has a completely different meaning, which is to wait. A different and more subtle type of false friends can result from more minor semantic shifts between the cognates. In such pairs, the meaning of the cognate words may remain roughly the same, but with a difference in nuance or connotation. Such an example is the Romanian-Italian cognate pair amic/amico. Here, both cognates mean friend, but in Italian the connotation is that of a closer friend, whereas the Romanian amic denotes a more distant friend, or even acquaintance. A more suitable Romanian translation for amico would be prieten, while a better translation in Italian for amic could be conoscente. Though their meaning is roughly the same, translating one word for the other would be an inaccurate use of the language. These cases are especially difficult to handle by beginner language learners (especially since the cognate pair may appear as a valid translation in multilingual dictionaries), and using them in the wrong contexts is an easy trap to fall into. Given these considerations, an automatic method for finding the appropriate term to translate a cognate instead of using the false friend would be a useful tool to aid in translation or in language learning.

As a potential solution to this problem, we propose a method that can be used to identify pairs of false friends, to distinguish between the two categories of false friends
defined above (hard false friends and soft false friends), and to provide suggestions for correcting the erroneous usage of a false friend in translation. False friends can be identified as pairs of cognates with high semantic distance. More specifically, we consider a pair of cognates to be a false friend pair if in the shared semantic space, there exists a word in the second language which is semantically closer to the original word than its cognate in that language (in other words, the cognate is not the optimal translation). The arithmetic difference between the semantic distance between these words and the semantic distance between the cognates will be used as a measure of the falseness of the false friend. The word that is found to be closest to the first cognate will be the suggested “correction”. The algorithm can be described as follows:
Algorithm 1. Detection and correction of false friends
1: Given the cognate pair (c1, c2), where c1 is a word in lang1 and c2 is a word in lang2:
2: Find the word w2 in lang2 such that, for any wi in lang2, distance(c1, w2) < distance(c1, wi)
3: if w2 ≠ c2 then
4:    (c1, c2) is a pair of false friends
5:    Degree of falseness = distance(c1, w2) − distance(c1, c2)
6:    return w2 as potential correction
7: end if
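For illustration, a minimal sketch of Algorithm 1 is given below. It assumes the two vocabularies have already been mapped into a shared cross-lingual space (e.g., with an orthogonal mapping in the spirit of [1] or [12]) and are available as plain word-to-vector dictionaries; the variable names (fr_vectors, es_vectors), the use of cosine similarity, and the falseness defined as a similarity difference are our illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def check_false_friend(c1, c2, vectors_l1, vectors_l2):
    """Detect whether the cognate pair (c1, c2) is a false friend pair.

    vectors_l1 / vectors_l2: dicts mapping words of each language to vectors
    that live in a shared cross-lingual embedding space (assumption).
    Returns (is_false_friend, suggested_correction, falseness).
    """
    v_c1 = vectors_l1[c1]
    # Word in the second language that is semantically closest to c1.
    w2 = max(vectors_l2, key=lambda w: cosine(v_c1, vectors_l2[w]))
    if w2 == c2:
        return False, None, 0.0
    # Falseness: how much closer the best translation is to c1 than the cognate is.
    falseness = cosine(v_c1, vectors_l2[w2]) - cosine(v_c1, vectors_l2[c2])
    return True, w2, falseness

# Hypothetical usage with pre-aligned French and Spanish embeddings:
# is_ff, correction, falseness = check_false_friend("long", "luengo",
#                                                   fr_vectors, es_vectors)
```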
We select a few results of the algorithm to show in Table 4, which contains examples of extracted false friends for the language pair French-Spanish, along with the suggested correction and the computed degree of falseness. Depending on the application, the measure of falseness could be used by choosing a threshold to single out pairs of false friends that are harder or softer, with a customizable degree of sensitivity to the difference in meaning.

Table 4. Extracted false friends for French-Spanish.

FR cognate   ES cognate   Correction   Falseness
prix         prez         premio       0.67
long         luengo       largo        0.57
face         faz          cara         0.41
change       caer         cambia       0.41
concevoir    concebir     diseñar      0.18
majeur       mayor        importante   0.14
Evaluation. In this section we describe our overall results on identifying false friends for every language pair between English and five Romance languages: French, Italian, Spanish, Portuguese and Romanian.

Table 5. Performance for Spanish-Portuguese using the curated false friends test set.

                 Accuracy  Precision  Recall
Our method       81.12     86.68      75.59
(Castro et al.)  77.28     85.82      54.50
WN Baseline      69.57     –          –
We evaluate our method in two separate stages. First, we measure the accuracy of false friend detection on a manually curated list of false friends and true cognates in Spanish and Portuguese, used in a previous study [4] and introduced in [15]. This resource is composed of 710 Spanish-Portuguese word pairs: 338 true cognates and 372 false friends. We also compare our results to the ones reported in that study, which uses a method similar to ours (a simple classifier that takes embedding similarities as features to identify false friends) and shows improvements over results in previous research. The results are shown in Table 5.

For the second part of the experiment, we use the list of cognate sets in English and the Romance languages proposed by [6] (the same that we used in our semantic divergence experiments), and try to automatically decide which of these are false friends. Since manually built false friend lists are not available for every language pair that we experiment on, for the language pairs in this second experiment we build our gold standard by using a multilingual dictionary (WordNet) in order to infer false friend and true cognate relationships. We assume two cognates in different languages are true cognates if they occur together in any WordNet synset, and false friends otherwise.

Table 6. Performance for all language pairs using WordNet as gold standard.

        Accuracy  Precision  Recall
EN-ES   76.58     63.88      88.46
ES-IT   75.80     41.66      54.05
ES-PT   82.10     40.0       42.85
EN-FR   77.09     57.89      94.28
FR-IT   74.16     32.81      65.62
FR-ES   73.03     33.89      69.96
EN-IT   73.07     33.76      83.87
IT-PT   76.14     29.16      43.75
EN-PT   77.25     59.81      86.48
We measure accuracy, precision, and recall, where:

– a true positive is a cognate pair that are not synonyms in WordNet and are identified as false friends by the algorithm,
– a true negative is a pair which is identified as true cognates and is found in the same WordNet synset,
– a false positive is a word pair which is identified as a false friend pair by the algorithm but also appears as a synonym pair in WordNet,
– and a false negative is a pair of cognate words that are not synonyms in WordNet, but are also not identified as false friends by the algorithm.

We should also note that with the WordNet-based method we can evaluate results for only slightly over half of the cognate pairs, since not all of them are found in WordNet. This also makes our corpus-based method more useful than a dictionary-based one, since it is able to cover most of the vocabulary of a language (given a large monolingual corpus to train embeddings on). To be able to compare results to the ones evaluated on the manually built test set, we use the WordNet-based method as a baseline in the first experiment. Results for the second evaluation experiment are reported in Table 6. In this evaluation experiment we were able to measure performance for language pairs among all languages in our cognate set except for Romanian (which is not available in WordNet).
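The following sketch illustrates, under stated assumptions, how such a WordNet-based gold standard and the associated metrics could be computed with the Open Multilingual WordNet as exposed by NLTK; the language codes and the predicted_false_friend callable are placeholders standing in for the method described above, not the authors' actual code.

```python
from nltk.corpus import wordnet as wn
# Requires: nltk.download('wordnet'); nltk.download('omw-1.4')

def shares_synset(word1, lang1, word2, lang2):
    """True cognates: the two words co-occur in at least one synset of the
    Open Multilingual WordNet (language codes such as 'spa', 'por', 'ita', 'fra')."""
    synsets1 = set(wn.synsets(word1, lang=lang1))
    synsets2 = set(wn.synsets(word2, lang=lang2))
    return len(synsets1 & synsets2) > 0

def evaluate(cognate_pairs, lang1, lang2, predicted_false_friend):
    """cognate_pairs: list of (w1, w2); predicted_false_friend: callable giving
    the algorithm's decision for a pair (assumed to be Algorithm 1 above)."""
    tp = tn = fp = fn = 0
    for w1, w2 in cognate_pairs:
        # Skip pairs not covered by WordNet (roughly half of them, see text).
        if not wn.synsets(w1, lang=lang1) or not wn.synsets(w2, lang=lang2):
            continue
        gold_false = not shares_synset(w1, lang1, w2, lang2)
        pred_false = predicted_false_friend(w1, w2)
        if gold_false and pred_false:
            tp += 1
        elif not gold_false and not pred_false:
            tn += 1
        elif not gold_false and pred_false:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return accuracy, precision, recall
```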
3 Conclusions

In this paper we proposed a method for computing the semantic divergence of cognates across languages. We relied on word embeddings and extended the pairwise metric to compute the semantic divergence across languages. Our results showed that Spanish and Portuguese are the closest languages, while Romanian is most dissimilar from Latin, possibly because it developed far from the Romance kernel. Furthermore, clustering the Romance languages based on the introduced semantic divergence measure results in a hierarchy that is consistent with the generally accepted tree of languages. When further including English in our experiments, we noticed that, even though most Latin words that entered English are probably borrowings (as opposed to inherited words), its similarity to Latin is close to that of the modern Romance languages. Our results shed some light on a new aspect of language similarity, from the point of view of cross-lingual semantic change.

We also proposed a method for detecting and possibly correcting false friends, and introduced a measure for quantifying the falseness of a false friend, distinguishing between two categories: hard false friends and soft false friends. These analyses and algorithms for dealing with false friends can provide useful tools for language learning or for (human or machine) translation.

In this paper we provided a simple method for detecting and suggesting corrections for false friends independently of context. There are, however, false friend pairs that are context-dependent: the cognates can be used interchangeably in some contexts, but not in others. In the future, the method using word embeddings could be extended to provide false friend correction suggestions in a given context (possibly by using the word embedding model to predict the appropriate word in that context).
Acknowledgements. Research supported by BRD—Groupe Societe Generale Data Science Research Fellowships.
References

1. Artetxe, M., Labaka, G., Agirre, E.: Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2289–2294 (2016)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
3. Campbell, L.: Historical Linguistics. An Introduction. MIT Press, Cambridge (1998)
4. Castro, S., Bonanata, J., Rosá, A.: A high coverage method for automatic false friends detection for Spanish and Portuguese. In: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 29–36 (2018)
5. Chen, Y., Skiena, S.: False-friend detection and entity matching via unsupervised transliteration. arXiv preprint arXiv:1611.06722 (2016)
6. Ciobanu, A.M., Dinu, L.P.: Building a dataset of multilingual cognates for the Romanian lexicon. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pp. 1038–1043 (2014)
7. Dinu, L.P., Ciobanu, A.M.: On the Romance languages mutual intelligibility. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pp. 3313–3318 (2014)
8. Eger, S., Hoenen, A., Mehler, A.: Language classification from bilingual word embedding graphs. In: Proceedings of COLING 2016, Technical Papers, pp. 3507–3518 (2016)
9. Inkpen, D., Frunza, O., Kondrak, G.: Automatic identification of cognates and false friends in French and English. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, vol. 9, pp. 251–257 (2005)
10. Kondrak, G., Marcu, D., Knight, K.: Cognates can improve statistical translation models. In: Companion Volume of the Proceedings of HLT-NAACL 2003 – Short Papers (2003)
11. Nakov, S., Nakov, P., Paskaleva, E.: Unsupervised extraction of false friends from parallel bi-texts using the web as a corpus. In: Proceedings of the International Conference RANLP 2009, pp. 292–298 (2009)
12. Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017)
13. Søgaard, A., Goldberg, Y., Levy, O.: A strong baseline for learning cross-lingual word embeddings from sentence alignments. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, pp. 765–774 (2017)
14. St Arnaud, A., Beck, D., Kondrak, G.: Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2519–2528 (2017)
15. Torres, L.S., Aluísio, S.M.: Using machine learning methods to avoid the pitfall of cognates and false friends in Spanish-Portuguese word pairs. In: Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology (2011)
16. Vulic, I., Moens, M.: Cross-lingual semantic similarity of words as the similarity of their semantic word responses. In: Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, pp. 106–116 (2013)
17. Vulic, I., Moens, M.: Probabilistic models of cross-lingual semantic similarity in context based on latent cross-lingual concepts induced from comparable data. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 349–362 (2014)
Triangulation as a Research Method in Experimental Linguistics

Olga Suleimanova and Marina Fomina(B)

Moscow City University, Moscow 129226, Russia
[email protected]
Abstract. The paper focuses on a complex research procedure based on the hypothesis-deduction method (with the semantic experiment as its integral part), corpus-based experiment, and the analysis of search engine results. The process of verification that increases the validity of research findings by incorporating several methods in the study of the same phenomenon is often referred to as triangulation. Triangulation, a well-established practice in the social sciences, is relatively recent in linguistics. The authors describe a step-by-step semantic research technique employed while studying semantic features of a group of English synonymous adjectives – empty, free, blank, unoccupied, spare, vacant and void. The preliminary stage of the research into the meaning of the adjectives consists in gathering information on their distribution, valence characteristics and all possible contexts they may occur in. The results of this preliminary analysis enable the authors to frame a hypothesis on the meaning of the linguistic units. The authors then proceed to the experimental verification of the proposed hypotheses, supported by corpus-based experiment, the analysis of search engine results, and mathematical-statistical methods and procedures that can help separate the random factor from the part of the informants' grade determined by the system of language. The research findings result in stricter semantic descriptions of the adjectives.

Keywords: Triangulation · Linguistic experiment · Corpus-based experiment · Expert evaluation method · Mathematical statistics · Informant · Semantics
1 Introduction

Triangulation is regarded as a process of verification that increases the validity of research findings by incorporating several methods in the study of the same phenomenon in interdisciplinary research. The proponents of this method claim that "by combining multiple observers, theories, methods, and empirical materials, researchers can hope to overcome the weakness or intrinsic biases and the problems that come from single-method, single-observer, single-theory studies" [1]. In 1959, D. Campbell and D. Fiske advocated an approach to assessing the construct validity of a set of measures in a study [2]. This method, which relied on a matrix ('multitrait-multimethod matrix') of intercorrelations among tests representing at least two traits, each measured by at least two methods, can be viewed as a prototype of the triangulation technique.
In the social sciences, N. Denzin distinguishes between the following triangulation techniques:

– data triangulation (the researcher collects data from a number of different sources to form one body of data);
– investigator triangulation (several independent researchers collect and then interpret the data);
– theoretical triangulation (the researcher interprets the data relying on more than one theory as a starting point);
– methodological triangulation (the researcher relies on more than one research method or data collection technique), which is the most commonly used technique [3].

Triangulation, a well-established practice in the social sciences (e.g. see [1–5] and many others), is relatively recent in linguistics. The 1972 study by W. Labov states the 'complementary principle' and the 'principle of convergence' among the key principles of linguistic methodology that govern the gathering of empirical data [6]. W. Labov stresses the importance of triangulation principles in linguistics, arguing that "the most effective way in which convergence can be achieved is to approach a single problem with different methods, with complementary sources of error" [6].

Modern verification procedures and experimental practices are steadily narrowing the gap between linguistics (as an originally descriptive science relying mostly on qualitative methods in studying linguistic phenomena) and the exact sciences. The results of linguistic research get the status of tested and proved theories and established laws. In addition to well-known research procedures, the linguistic experiment, being entirely based on interviews with native speakers (often referred to as 'informants'), is rapidly gaining ground (see [7] for a detailed account of the verification capacity of the semantic experiment). Recent years have witnessed a significant rise in the number of corpus-based experimental studies. Many linguists support their research procedure with the analysis of search engine results (e.g. Google results). In this paper, we focus on verification procedures that rely on methodological triangulation, with experimental practices supported by corpus-based experiments, the analysis of search engine results, and mathematical-statistical methods.
2 Methodology

2.1 Semantic Research and Experiment

The semantic experiment is an integral, indispensable part of the complex research procedure often referred to as the hypothesis-deduction method. J.S. Stepanov distinguishes four basic steps of the hypothesis-deduction method: (1) to collect practical data and provide its preliminary analysis; (2) to put forward a hypothesis to support the practical data and relate the hypothesis to other existing theories; (3) to deduce rules from the suggested theories; (4) to verify the theory by relating the deduced rules to the linguistic facts [8].
Following these steps, O.S. Belaichuk worked out a step-by-step procedure for the semantic experiment [9]. Let us demonstrate how it works on the semantic analysis of the meanings of the English adjectives empty, free, blank, unoccupied, spare, vacant and void [10].

The preliminary stage of semantic research into the meaning of a language unit consists in gathering information on its distribution, valence characteristics and all possible contexts it may occur in. The results of this preliminary analysis enable the researcher to frame a hypothesis on the meaning of the linguistic unit in question (see [7] for a detailed description of the step-by-step procedure). At the next stage, we arrange a representative sampling by reducing the practically infinite sampling to a workable set. Then an original word in the representative sampling is substituted by its synonym. For example, in the original sentence The waiter conducted two unsteady businessmen to the empty table beside them [11], the word empty is replaced by the adjective vacant: The waiter conducted two unsteady businessmen to the vacant table beside them. Then the other synonyms – free, blank, spare, unoccupied and void – are also put in the same context. At this stage, we may not yet have any hypothesis explaining the difference in the meanings of the given adjectives.

At the next stage of the linguistic experiment, informants grade the acceptability of the offered utterances in the experimental sample according to a scale suggested by A. Timberlake [12] – consider a fragment of a questionnaire (see Fig. 1) used in the interview of native speakers of English [13]. The linguist then processes and analyses the informants' grades to put forward a linguistic hypothesis, and proceeds to the experimental verification of the proposed hypotheses. There is a variety of tests for verifying hypotheses, e.g. when the researcher varies only one parameter of the situation described while the others are kept fixed and invariable (see [7, 14–16] for a detailed account of verification procedures).

In addition to the well-established verification procedures employed in the linguistic experiment, corpus-based experiments and the analysis of search engine results are rapidly gaining ground. Researchers claim that these new IT tools offer a linguist added value: text corpora as well as search engines such as Google provide invaluable data, though they remain underestimated and have not been explored to their full potential [17]. While in the linguistic experiment we obtain the so-called 'negative linguistic material' (the term used by L.V. Scherba), i.e. sentences graded as unacceptable, text corpora do not provide the researcher with marked sentences. The most frequently occurring search results are likely to be acceptable and preferred, while marginally acceptable and not preferred sentences are expected to be rare. To verify a hypothesis with corpora and Google big data, the researcher determines whether, and to what extent, the corpus and Google experimental data comply with his/her predictions and expectations. So, in accordance with the expectations, we get frequent search results for the word empty describing a physical object (a bottle, a box, a table, a room, etc.) construed as three-dimensional physical space, and rare or no results for the word blank in these adjective-noun combinations (see Table 1).
Fig. 1. Questionnaire (a fragment). The form records the informant's name, nationality, age and qualifications, and directs them to grade each sentence on a five-point scale: 1 – unacceptable (not occurring), 2 – marginally acceptable (rare), 3 – not preferred (infrequent), 4 – acceptable, not preferred (frequent), 5 – acceptable, preferred (most frequent). Sentences are graded with reference to the norm of standard English (slang, vernacular, argot and stylistically marked words are not in the focus of investigation); informants are asked not to assess the degree of synonymy of the words analysed, not to develop possible contexts that may seem to be implied, and, if the context seems insufficient, to suggest their own context in the comments column and grade the utterance accordingly. The graded items A–G are: The room is empty / free / blank / spare / unoccupied / vacant / void. All the furniture has been removed.
Table 1. BNC and Google search results (BNC/Google).

       Bottle      Box         Table       Wall        Screen      Sheet of paper
Empty  4.97 m/32   3.98 m/19   1.33 m/15   2.4 m/2     0.381 m/2   0.665/1
Blank  0.654 m/0   0.305 m/0   4.92 m/38   4.13 m/12   1.68 m/20   0.156 m/0
2.2 Expert Evaluation Method in Linguistic Experiment

While grading the sentences, the informant is governed by the rules of the language as well as by some random factors. Thus, each grade, being the result of deterministic and random processes, can be treated as a variate (not to be confused with a 'variable'). In the linguistic experiment, this variate (X) can take on only integer values on the closed interval [1; 5] (five-point system). Therefore, it should be referred to as a discrete variate. Discrete variates can be processed by mathematical-statistical methods. We chose several statistics that best describe such random distributions.

The first one is the expectation for each sentence or, in other words, the mean value of its grades. The expectation corresponds to the centre of a distribution. Thus, it can be interpreted as a numerical expression of the influence of deterministic factors. This characteristic is defined as

    μ_i = (1/m) Σ_{j=1}^{m} χ_ij,   i = 1, 2, …, n    (1)
where μ_i is the mean value (the expectation) of grades for the i-th sentence; i is the sentence number (i = 1, 2, …, n); j is the informant's number (j = 1, 2, …, m); n is the total number of sentences; m is the total number of informants; χ_ij is the grade given to the i-th sentence by the j-th informant (χ_ij = 1 ÷ 5).

The second characteristic is the dispersion. It defines the extent to which the grades are spread around their mean value, i.e. the dispersion is a numerical expression of the influence of random factors. The lower the dispersion of the grades, the more reliable the grade is (the influence of random factors is lower), and vice versa. If the dispersion is high, the researcher should try to find possible reasons which might have led to this value. This statistic can be calculated with (2):

    D_i = (1/m) Σ_{j=1}^{m} (χ_ij − μ_i)²,   i = 1, 2, …, n    (2)

where D_i is the dispersion of grades for the i-th sentence; μ_i is the mean value (the expectation) of grades for the i-th sentence; i is the sentence number (i = 1, 2, …, n); j is the informant's number (j = 1, 2, …, m); n is the total number of sentences;
m is the total number of informants; χ_ij is the grade given to the i-th sentence by the j-th informant (χ_ij = 1 ÷ 5).

The next step of the algorithm is calculating the mean value for each sentence taking into account the competence of the informants (3). The competence of an informant can be expressed via a coefficient of competence, which is a standardized value and can take on any value on the interval (0; 1). The sum of the coefficients over the whole group of informants amounts to 1 (4). These coefficients can be calculated a posteriori, after the interview. We proceed from the assumption that informants' competence should be estimated in terms of the extent to which each informant's grades agree with the mean values [13].

    χ_i = Σ_{j=1}^{m} χ_ij κ_j,   i = 1, 2, …, n    (3)

where χ_i is the mean value of grades for the i-th sentence; i is the sentence number (i = 1, 2, …, n); j is the informant's number (j = 1, 2, …, m); n is the total number of sentences; m is the total number of informants; χ_ij is the grade given to the i-th sentence by the j-th informant (χ_ij = 1 ÷ 5); κ_j is the coefficient of competence for the j-th informant, the coefficient of competence being a standardized value, i.e.

    Σ_{j=1}^{m} κ_j = 1    (4)
The coefficients of competence can be calculated with the recurrence formulas (5), (6) and (7):

    χ_i^t = Σ_{j=1}^{m} χ_ij κ_j^{t−1},   i = 1, 2, …, n    (5)

    λ^t = Σ_{i=1}^{n} Σ_{j=1}^{m} χ_ij χ_i^t,   t = 1, 2, …    (6)

    κ_j^t = (1/λ^t) Σ_{i=1}^{n} χ_ij χ_i^t,   with Σ_{j=1}^{m} κ_j^t = 1,   j = 1, 2, …, m    (7)
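As an illustration of formulas (1)–(7), the sketch below computes the mean grades, dispersions and competence coefficients with NumPy. It is a minimal reading of the procedure described here, not the authors' original implementation, and the function and variable names are ours.

```python
import numpy as np

def expert_statistics(grades, iterations=2):
    """grades: (n_sentences, n_informants) matrix of integer grades 1..5.
    Returns mean grades (1), dispersions (2), and the cluster estimates and
    competence coefficients after `iterations` steps of the recurrence (5)-(7)."""
    grades = np.asarray(grades, dtype=float)
    n, m = grades.shape
    mu = grades.mean(axis=1)                               # (1) expectation per sentence
    disp = ((grades - mu[:, None]) ** 2).mean(axis=1)      # (2) dispersion per sentence
    kappa = np.full(m, 1.0 / m)                            # initial competence kappa_j^0 = 1/m
    for _ in range(iterations):
        chi = grades @ kappa                               # (5) cluster estimates chi_i^t
        lam = float(np.sum(grades * chi[:, None]))         # (6) normalising factor lambda^t
        kappa = (grades * chi[:, None]).sum(axis=0) / lam  # (7) competence kappa_j^t
    return mu, disp, chi, kappa
```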
We start our calculations with t = 1. In (5) the initial values of the competence coefficients are assumed to be equal and take the value κ_j^0 = 1/m. Then the cluster estimate for the i-th sentence in the first approximation (expressed in terms of (5)) is therefore:

    χ_i^1 = (1/m) Σ_{j=1}^{m} χ_ij,   i = 1, 2, …, n    (8)
λ^1 can be obtained using (6):

    λ^1 = Σ_{i=1}^{n} Σ_{j=1}^{m} χ_ij χ_i^1    (9)
The coefficients of competence in the first approximation are calculated according to (7):

    κ_j^1 = (1/λ^1) Σ_{i=1}^{n} χ_ij χ_i^1    (10)
With the coefficients of competence in the first approximation, we may repeat the calculations using (5), (6), (7) to obtain χ_i^2, λ^2, κ_j^2 in the second approximation, etc.

Now consider the results of the interview (a fragment) to illustrate how the algorithm works. Eleven informants were asked to grade seven examples (A. The room is empty. All the furniture has been removed; B. The room is free. All the furniture has been removed; C. The room is blank. All the furniture has been removed; D. The room is spare. All the furniture has been removed; E. The room is unoccupied. All the furniture has been removed; F. The room is vacant. All the furniture has been removed; G. The room is void. All the furniture has been removed) according to the above five-point system (see Fig. 1). Table 2 features the results of the interview in the form of grades.

Table 2. Matrix of grades (a fragment); rows are sentences, columns are informants.

χ_ij   1  2  3  4  5  6  7  8  9  10  11
1      5  5  5  5  5  5  5  5  5  5   5
2      3  3  2  1  1  4  1  4  4  1   4
3      1  1  2  1  1  1  1  1  1  2   1
4      1  1  2  1  1  1  1  1  1  1   1
5      4  4  2  2  1  3  4  3  3  2   3
6      3  4  4  4  2  1  4  1  4  5   1
7      1  1  1  1  1  1  1  1  1  2   1
We start our calculations with t = 1. In (5) the initial values of the competence coefficients are assumed to be equal and take the value κ_j^0 = 1/m = 1/11. Then the cluster estimates for the sentences in the first approximation (expressed in terms of (5)) are as shown in Table 3. λ^1 can be obtained using (6):

    λ^1 = Σ_{i=1}^{7} Σ_{j=1}^{11} χ_ij χ_i^1 = 574.18
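These figures can be checked with a few lines of NumPy (a sketch using the grade matrix of Table 2; the variable names are ours):

```python
import numpy as np

# Grade matrix from Table 2: 7 sentences (rows) x 11 informants (columns).
grades = np.array([
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    [3, 3, 2, 1, 1, 4, 1, 4, 4, 1, 4],
    [1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1],
    [1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1],
    [4, 4, 2, 2, 1, 3, 4, 3, 3, 2, 3],
    [3, 4, 4, 4, 2, 1, 4, 1, 4, 5, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1],
], dtype=float)

chi1 = grades.mean(axis=1)                            # (8): 5, 2.55, 1.18, 1.09, 2.82, 3, 1.09
lam1 = float(np.sum(grades * chi1[:, None]))          # (9): ~574.18
kappa1 = (grades * chi1[:, None]).sum(axis=0) / lam1  # (10): the Table 4 values
```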
Table 3. Matrix of cluster estimates (t = 1).

χ_1^1  χ_2^1  χ_3^1  χ_4^1  χ_5^1  χ_6^1  χ_7^1
5      2.55   1.18   1.09   2.82   3      1.09
Table 4. Matrix of the coefficients of competence (t = 1).

κ_1^1  κ_2^1   κ_3^1  κ_4^1  κ_5^1  κ_6^1  κ_7^1  κ_8^1  κ_9^1  κ_10^1  κ_11^1
0.098  0.1032  0.09   0.08   0.07   0.09   0.09   0.09   0.1    0.09    0.09
Table 4 features the coefficients of competence in the first approximation. With these coefficients, we may repeat the calculations using (5), (6), (7) to obtain χ_i^2, λ^2, κ_j^2 in the second approximation (see Tables 5 and 6), etc.

Table 5. Matrix of cluster estimates (t = 2).

χ_1^2  χ_2^2  χ_3^2  χ_4^2  χ_5^2  χ_6^2  χ_7^2
5      2.59   1.19   1.09   2.89   3.07   1.09
Now consider the statistic used to assess agreement among the informants – the coefficient of concordance. It can be calculated with the following formula:

    W = δ_act² / δ_max²    (11)

where δ_act² is the actual dispersion of the pooled informants' grades, and δ_max² is the dispersion of the pooled grades if there is complete agreement among the informants. The coefficient of concordance may assume a value on the closed interval [0; 1]. If the statistic W is 0, then there is no overall trend of agreement among the informants, and their responses may be regarded as essentially random. If W is 1, then all the informants have been unanimous, and each informant has given the same grade to each of the sentences. Intermediate values of W indicate a greater or lesser degree of unanimity among the informants. To treat the grades as concurring enough, W must be higher than a set normative point W_n (W > W_n). Let us take W_n = 0.5. Thus, if W > 0.5, the informants' opinions are concurring rather than different. Then we admit the results of the expertise to be valid and the group of informants to be reliable. What is more, this means that we have succeeded in the experiment, and that the expertise procedures were accurately arranged to meet all the requirements of the linguistic experiment.
Table 6. Matrix of the coefficients of competence (t = 1; 2).

κ_j^t       t = 1    t = 2
κ_1^t       0.0980   0.0981
κ_2^t       0.1032   0.1034
κ_3^t       0.0929   0.0929
κ_4^t       0.0845   0.0845
κ_5^t       0.0692   0.0690
κ_6^t       0.0871   0.0870
κ_7^t       0.0944   0.0947
κ_8^t       0.0871   0.0872
κ_9^t       0.1028   0.1031
κ_10^t      0.0937   0.0920
κ_11^t      0.0871   0.0892
Σ_j κ_j^t   1        1
Now consider the results of the interview (see Table 7) to illustrate the calculation procedure. If the informants' opinions had coincided absolutely, each informant would have graded the first sentence as 5, the second one as 4, the third and the fourth as 1, the fifth as 3, the sixth as 4, and the seventh sentence as 1. Then the total (pooled) grades given to the sentences would have amounted to 55, 44, 11, 11, 33, 44 and 11, respectively. The mean value of the actual pooled grades is (55 + 28 + 13 + 12 + 31 + 33 + 12) / 7 = 26.3. Then

    δ_act² = (55 − 26.3)² + (28 − 26.3)² + (13 − 26.3)² + (12 − 26.3)² + (31 − 26.3)² + (33 − 26.3)² + (12 − 26.3)² = 1479.43

    δ_max² = (55 − 26.3)² + (44 − 26.3)² + (11 − 26.3)² + (11 − 26.3)² + (33 − 26.3)² + (44 − 26.3)² + (11 − 26.3)² = 2198.14

    W = δ_act² / δ_max² = 1479.43 / 2198.14 = 0.67

The coefficient of concordance equals 0.67, which is higher than the normative point 0.5. Thus, the informants' opinions are rather concurring. Still, the coefficient could have been higher if the grades for the second and sixth examples (see sentences B and F in Fig. 1) had revealed a greater degree of unanimity among the informants – the dispersion of grades for these sentences (D2 and D6) is the highest (see Table 8). We analysed possible reasons which might have led to some of the scatter in the grades. Here we shall consider the use of the adjective free in the following statement: The room is free. All the furniture has been removed. The research into the semantics of free revealed that native English speakers more readily and more frequently associate the word free with 'costing nothing', 'without payment' rather than with 'available, unoccupied, not in use'.
Table 7. Results of the interview (a fragment).

Informant / Sentence      1   2   3   4   5   6   7
1                         5   3   1   1   4   3   1
2                         5   3   1   1   4   4   1
3                         5   2   2   2   2   4   1
4                         5   1   1   1   2   4   1
5                         5   1   1   1   1   2   1
6                         5   4   1   1   3   1   1
7                         5   1   1   1   4   4   1
8                         5   4   1   1   3   1   1
9                         5   4   1   1   3   4   1
10                        5   1   2   1   2   5   2
11                        5   4   1   1   3   1   1
Actual pooled grade       55  28  13  12  31  33  12
Pooled grade (if W = 1)   55  44  11  11  33  44  11
Table 8. Dispersion of grades.

D1  D2   D3    D4    D5    D6    D7
0   1.7  0.15  0.08  0.88  2.00  0.08
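For completeness, the concordance check above can be reproduced in a few lines of NumPy (a sketch; the pooled grades are those of Table 7 and the names are ours):

```python
import numpy as np

actual_pooled = np.array([55, 28, 13, 12, 31, 33, 12], dtype=float)
unanimous_pooled = np.array([55, 44, 11, 11, 33, 44, 11], dtype=float)

mean_pooled = actual_pooled.mean()                      # ~26.3
delta_act = np.sum((actual_pooled - mean_pooled) ** 2)  # ~1479.4
delta_max = np.sum((unanimous_pooled - mean_pooled) ** 2)
W = delta_act / delta_max                               # ~0.67: opinions concur (W > 0.5)
```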
In cases where the word free is used in the latter meaning, it may cause some ambiguity, and native speakers opt for synonymous adjectives such as empty, blank, unoccupied, vacant or available to differentiate it from the meaning of 'without cost'. Consider the following utterances with free: The teacher handed a free test booklet to each student; Jane parked her car in a free lot; Mary entered the free bathroom and locked the door. Informants assess the statements as acceptable provided the adjective free conveys the information that one can have or use the objects (a test booklet, a lot, a bathroom) without paying for them. When we asked the informants to evaluate the same statements with the word free meaning 'available for some particular use or activity', the above sentences were graded as unacceptable: *The teacher handed a free test booklet to each student; *Jane parked her car in a free lot; *Mary entered the free bathroom and locked the door. The study revealed that many statements with free can be conceived of in two different ways depending on the speaker's frame of reference. This ambiguity leads to a high dispersion of informants' grades, i.e. the grades appear to be spread around their mean value to a great extent and thus cannot be treated as valid.
Thus, the use of the word free is often situational. If there is a cost issue assumed by the speaker, it can lead to ambiguities that may explain some of the scatter in the grades. In the following statement, The room is free. All the furniture has been removed, the speaker may have in mind the possibility of a room being available for use without charge, unless it is furnished. Thus, the removal of the furniture has the effect of making the room free from cost, making this reading seem more frequent than it might otherwise be graded. When we asked the informants to assess the statement assuming the word free conveyed the information 'available for some activity', the statement was graded as acceptable, whereas the use of free meaning 'without payment', 'without charge' was found to be not occurring (see [18]).
3 Conclusions

Summing up the results of the research into verification procedures that rely on methodological triangulation, with experimental practices supported by corpus-based experiments, the analysis of search engine results, and mathematical-statistical methods, we may conclude that:

(1) New IT tools offer a linguist added value: text corpora as well as search engines such as Google provide invaluable data, though they remain underestimated – they are yet to be explored as regards their full explanatory potential;
(2) The results of expert evaluation, represented in digital form, can be treated as discrete variates and then processed with mathematical-statistical methods; these methods and procedures can help separate the random factor from the part of the grade determined by the system of language; as a result the researcher obtains a mathematical estimate of the influence of deterministic as well as random factors, of the consistency in the informants' data and, consequently, of the reliability of their grades; high consistency, in its turn, testifies to the 'quality' of the group of informants and means that interviewing this group will yield good, reliable data;
(3) Of prime importance is the elaboration of a comprehensive verification system that relies on more than one research method or data collection technique;
(4) The use of triangulation as a research method in experimental linguistics is steadily bridging the gap between linguistics, as an originally purely descriptive field, and other sciences, where the mathematical apparatus has long been applied.
References

1. Jakob, A.: On the triangulation of quantitative and qualitative data in typological social research: reflections on a typology of conceptualizing “uncertainty” in the context of employment biographies. Forum Qual. Soc. Res. 2(1), 1–29 (2001)
2. Campbell, D., Fiske, D.: Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol. Bull. 56(2), 81–105 (1959)
3. Denzin, N.: The Research Act: A Theoretical Introduction to Sociological Methods. Aldine, Chicago (1970)
4. Yeasmin, S., Rahman, K.F.: “Triangulation” research method as the tool of social science research. BUP J. 1(1), 154–163 (2012)
5. Bryman, A.: Social Research Methods, 2nd edn. Oxford University Press, Oxford (2004)
6. Labov, W.: Some principles of linguistic methodology. Lang. Soc. 1, 97–120 (1972)
7. Souleimanova, O.A., Fomina, M.A.: The potential of the semantic experiment for testing hypotheses. Sci. J. “Modern Linguistic and Metodical-and-Didactic Researches” 2(17), 8–19 (2017)
8. Stepanov, J.S.: Problema obshhego metoda sovremennoj lingvistiki. In: Vsesojuznaja nauchnaja konferencija po teoreticheskim voprosam jazykoznanija (11–16 nojabrja 1974 g.): Tez. dokladov sekcionnyh zasedanij, pp. 118–126. The Institute of Linguistics, Academy of Sciences of the USSR, Moscow (1974)
9. Belaichuk, O.S.: Gipotetiko-deduktivnyj metod dlja opisanija semantiki glagolov otricanija (poshagovoe opisanie metodiki, primenjaemoj dlja reshenija konkretnoj issledovatel’skoj zadachi). In: Lingvistika na rubezhe jepoh: dominanty i marginalii 2, pp. 158–176. MGPU, Moscow (2004)
10. Fomina, M.: Universal concepts in a cognitive perspective. In: Schöpe, K., Belentschikow, R., Bergien, A., et al. (eds.) Pragmantax II: Zum aktuellen Stand der Linguistik und ihrer Teildisziplinen: Akten des 43. Linguistischen Kolloquiums in Magdeburg 2008, pp. 353–360. Peter Lang, Frankfurt a.M. et al. (2014)
11. British National Corpus (BYU-BNC). https://corpus.byu.edu/bnc/. Accessed 17 Jan 2019
12. Timberlake, A.: Invariantnost’ i sintaksicheskie svojstva vida v russkom jazyke. Novoe v zarubezhnoj lingvistike 15, 261–285 (1985)
13. Fomina, M.A.: Expert appraisal technique in the linguistic experiment and mathematical processing of experimental data. In: Souleimanova, O. (ed.) Sprache und Kognition: Traditionelle und neue Ansätze: Akten des 40. Linguistischen Kolloquiums in Moskau 2005, pp. 409–416. Peter Lang, Frankfurt a.M. et al. (2010)
14. Fomina, M.A.: Konceptualizacija “pustogo” v jazykovoj kartine mira. Ph.D. thesis, Moscow City University, Moscow (2009)
15. Seliverstova, O.N., Souleimanova, O.A.: Jeksperiment v semantike. Izvestija AN SSSR. Ser. literatury i jazyka 47(5), 431–443 (1988)
16. Sulejmanova, O.A.: Puti verifikacii lingvisticheskih gipotez: pro et contra. Vestnik MGPU. Zhurnal Moskovskogo gorodskogo pedagogicheskogo universiteta. Ser. Filologija. Teorija jazyka. Jazykovoe obrazovanie 2(12), 60–68 (2013)
17. Suleimanova, O.: Technologically savvy take it all, or how we benefit from IT resources. In: Abstracts, 53. Linguistics Colloquium, 24–27 September 2018, pp. 51–52. University of Southern Denmark, Odense (2018)
18. Fomina, M.: Configurative components of word meaning. In: Küper, Ch., Kürschner, W., Schulz, V. (eds.) Littera: Studien zur Sprache und Literatur: Neue Linguistische Perspektiven: Festschrift für Abraham P. ten Cate, pp. 121–126. Peter Lang, Frankfurt am Main (2011)
Understanding Interpersonal Variations in Word Meanings via Review Target Identification

Daisuke Oba1(B), Shoetsu Sato1,2, Naoki Yoshinaga2, Satoshi Akasaki1, and Masashi Toyoda2

1 The University of Tokyo, Tokyo, Japan
{oba,shoetsu,akasaki}@tkl.iis.u-tokyo.ac.jp
2 Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
{ynaga,toyoda}@iis.u-tokyo.ac.jp
Abstract. When people verbalize what they felt with various sensory functions, they could represent different meanings with the same words or the same meaning with different words; we might mean a different degree of coldness when we say ‘this beer is icy cold,’ while we could use different words such as “yellow” and “golden” to describe the appearance of the same beer. These interpersonal variations in word meanings not only prevent us from smoothly communicating with each other, but also cause troubles when we perform natural language processing tasks with computers. This study proposes a method of capturing interpersonal variations of word meanings by using personalized word embeddings acquired through the task of estimating the target (item) of a given review. Specifically, we adopt three methods for effective training of the item classifier: (1) modeling reviewer-specific parameters in a residual network, (2) fine-tuning of reviewer-specific parameters and (3) multi-task learning that estimates various metadata of the target item described in given reviews written by various reviewers. Experimental results with review datasets obtained from ratebeer.com and yelp.com confirmed that the proposed method is effective for estimating the target items. Looking into the acquired personalized word embeddings, we analyzed in detail which words have a strong semantic variation and revealed some trends in semantic variations of the word meanings.

Keywords: Semantic variation · Personalized word embeddings

1 Introduction
We express what we have sensed with various sensory units as language in different ways, and there exist inevitable semantic variations in the meaning of words because the senses and linguistic abilities of individuals are different. For example, even if we use the word “greasy” or “sour,” how greasy or how sour can differ greatly between individuals. Furthermore, we may describe the appearance of the same beer with different expressions such as “yellow,” “golden” and “orange.” These semantic variations not only cause problems in communicating
with each other in the real world but also delude potential natural language processing (nlp) systems.

In the context of personalization, several studies have attempted to improve the accuracy of nlp models for user-oriented tasks such as sentiment analysis [5], dialogue systems [12] and machine translation [21], while taking into account the user preferences in the task inputs and outputs. However, all of these studies are carried out in settings that estimate subjective output from subjective input (e.g., estimating a sentiment polarity of the target item from an input review or predicting responses from input utterances in a dialogue system). As a result, the model not only captures the semantic variation in the user-generated text (input), but also handles annotation bias of the output labels (the deviation of output labels assigned by each annotator) and selection bias (the deviation of output labels inherited from the targets chosen by users in sentiment analysis) [5]. The contamination caused by these biases hinders us from understanding the sole impact of semantic variation, which is the target in this study.

The goal of this study is to understand which words have large (or small) interpersonal variations in their meanings (hereafter referred to as semantic variation in this study), and to reveal how such semantic variation affects the classification accuracy in tasks with user-generated inputs (e.g., reviews). We thus propose a method for analyzing the degree of personal variations in word meanings by using personalized word embeddings acquired through a review target identification task in which the classifier estimates the target item (objective output) from given reviews (subjective input) written by various reviewers. This task is free from annotation bias because outputs are automatically determined without annotation. Also, selection bias can be suppressed by using a dataset in which the same reviewer evaluates the same target (object) only once, so as not to learn the deviation of output labels caused by the choice of inputs. The resulting model allows us to observe only the impact of semantic variations from the acquired personalized word embeddings.

A major challenge in inducing personalized word embeddings is the number of parameters (reviewers), since it is impractical to simultaneously learn personalized word embeddings for thousands of reviewers. We therefore exploit a residual network to effectively obtain personalized word embeddings using reviewer-specific transformation matrices from a small amount of reviews, and apply a fine-tuning to make the training scalable to the number of reviewers. Also, the number of output labels (review targets) causes an issue when building a reliable model due to the difficulty of extreme multi-class classification. We therefore perform multi-task learning with metadata estimation of the target, to stabilize the learning of the model.

In the experiments, we hypothesize that words related to the five senses have inherent semantic variation, and validate this hypothesis. We utilized two large-scale datasets retrieved from ratebeer.com and yelp.com that include a variety of expressions related to the five senses. Using those datasets, we employ the task of identifying the target item and its various metadata from a given review with the reviewer’s ID. As a result, our personalized model successfully captured semantic variations and achieved better performance than a reviewer-universal model in
both datasets. We then analyzed the acquired personalized word embeddings from three perspectives (frequency, dissemination and polysemy) to reveal which words have large (small) semantic variation.

The contributions of this paper are three-fold:

– We established an effective and scalable method for obtaining personal word meanings. The method induces personalized word embeddings acquired through tasks with objective outputs via effective reviewer-wise fine-tuning on a personalized residual network and multi-task learning.
– We confirmed the usefulness of the obtained personalized word embeddings in the review target identification task.
– We found that the obtained personal semantic variations exhibit trends different from the diachronic and geographical semantic variations observed in previous studies, in terms of three perspectives (frequency, dissemination and polysemy).
2 Related Work
In this section, we introduce existing studies on personalization in natural language processing (nlp) tasks and on the analysis of semantic variation¹ of words. As discussed in Sect. 1, personalization in nlp attempts to capture three types of user preferences: (1) semantic variation in task inputs (biases in how people use words; our target), (2) annotation bias of output labels (biases in how annotators label) and (3) selection bias of output labels (biases in how people choose perspectives (e.g., review targets) that directly affect outputs (e.g., polarity labels)). In the history of data-driven approaches for various nlp tasks, existing studies have focused more on (2) or (3), particularly in text generation tasks such as machine translation [14,17,21] and dialogue systems [12,22]. This is because data-driven approaches without personalization tend to suffer from the diversity of probable outputs depending on writers. Meanwhile, since it is difficult to properly separate these facets, as far as we know, there is no study aiming to analyze only the semantic variations of words depending on individuals.

To quantify the semantic variation of common words among communities, Tredici et al. [20] obtained community-specific word embeddings by using the Skip-gram model [15], and analyzed the obtained word embeddings on multiple metrics such as frequency. Their approach suffers from annotation biases since Skip-gram (or language models in general) attempts to predict words in a sentence given the other words in the sentence, and therefore both inputs and outputs are defined by the same writer. As a result, the same word can have dissimilar embeddings not only because it has different meanings, but also because it just appears with words in different topics.² In addition, their approach is not scalable to the number of communities (reviewers in our case) since it simultaneously learns all the community-specific parameters.

There also exist several attempts in computational linguistics to capture semantic variations of word meanings caused by diachronic [7,10,18], geographic [1,6], or domain [20] variations. In this study, we analyze the semantic variations of word meanings at the individual level by inducing personalized word embeddings, focusing on how semantic variations are correlated with word frequency, dissemination, and polysemy as discussed in [7,20].

¹ Apart from semantic variations, some studies try to find, analyze, or remove biases related to socially unfavorable prejudices (e.g., the association between the words receptionist and female) from word embeddings [2–4,19]. They analyze word “biases” in the sense of political correctness, which are different from the biases in personalized word embeddings we target.
Personalized Word Embeddings
In this section, we describe our neural network-based model for inducing personalized word embeddings via review target identification (Fig. 1). Our model is designed to identify the target item from a given review with the reviewer’s ID. A major challenge in inducing personalized word embeddings is the number of parameters. We therefore exploit a residual network to effectively obtain personalized word embeddings using reviewer-specific transformation matrices and apply a fine-tuning for the scalability to the number of reviewers. Also, the number of output labels makes building a reliable model challenging due to the
Fig. 1. Overview of our model.
2
Let us consider the two user communities of Toyota and Honda cars. Although the meaning of the word “car” used in these two communities is likely to be the same, its embedding obtained by Skip-gram model from two user communities will be different since “car” appears with different sets of words depending on each community.
Understanding Interpersonal Variations in Word Meanings
125
difficulty of extreme multi-class classification. We therefore perform multi-task learning to stabilize the learning of the model. 3.1
Reviewer-Specific Layers for Personalization u
First, our model computes the personalized word embeddings ewji of each word wi in input text via a reviewer-specific matrix Wuj ∈ Rd×d and bias vector u buj ∈ Rd . Concretely, an input word embedding ewi is transformed to ewji as below: (1) euwji = ReLU(Wuj ewi + buj ) + ewi where ReLU is a rectified linear unit function. As shown in Eq. (1), we employ a Residual Network (ResNet) [8] since semantic variation is namely the variation from the reviewer-universal word embedding. By sharing the reviewer-specific parameters for transformation across words and employing ResNet, we aimed for the model to stably learn personalized word embeddings even for infrequent words. 3.2
Reviewer-Universal Layers u
Given the personalized word embedding ewji of each word wi in an input text, our model encodes them through Long short-term Memory (LSTM) [9]. LSTM updates the current memory cell ct and the hidden state ht following the equations below: ⎤ ⎡ ⎤ ⎡ σ it ⎢ ft ⎥ ⎢ σ ⎥ uj ⎥ ⎢ ⎥=⎢ (2) ⎣ ot ⎦ ⎣ σ ⎦ WLSTM · ht−1 ; ewi tanh cˆt ct = ft ct−1 + it cˆt
(3)
ht = ot tanh (ct )
(4)
where it , ft , and ot are the input, forget, and output gate at time step t, respectively. ewi is the input word embedding at time step t, and WLSTM is a weight matrix. cˆt is the current cell state. The operation denotes elementwise multiplication and σ is the logistic sigmoid function. We adopt single-layer Bi-directional LSTM (Bi-LSTM) to utilize the past and the future context. As the representation of the input text h, Bi-LSTM concatenates the outputs from the forward and the backward LSTM:
−−−→ ← − (5) h = hL−1 ; h0 −−−→ ← − Here, L denotes the length of the input text. hL−1 and h0 denote the outputs from forward/backward LSTM at the last time step, respectively. ˆ Lastly, a feed-forward layer computes an output probability distribution y from the representation h with a weight matrix Wo and bias vector bo as: ˆ = softmax (Wo h + bo ) y
(6)
126
3.3
D. Oba et al.
Multi-task Learning of Target Attribute Predictions for Stable Training
We consider that training our model for the target identification task can be unstable because its output space (review targets) is extremely large (more than 50,000 candidates). To mitigate this problem, we set up auxiliary tasks that estimate metadata of the target item and solve them simultaneously with the target identification task (target task) by multi-task learning. This idea is motivated by the hypothesis that understanding related metadata of the target item contributes to the accuracy of target identification. Specifically, we add independent feed-forward layers to compute outputs from the shared sentence representation h defined by Eq. (5) for each auxiliary task (Fig. 1). We assume three types of auxiliary tasks: (1) multi-class classification (same as the target task), (2) multi-label classification, and (3) regression. We perform the multi-task learning under a loss that sums up individual losses for the target and auxiliary tasks. We adopt cross-entropy loss for multi-class classification, a summation of cross-entropy loss of each class for multi-label classification and mean-square loss for regression. 3.4
Training
Considering the case where the number of reviewers is enormous, it is impractical to simultaneously train the reviewer-specific parameters of all reviewers due to memory limitation. Therefore, we first pre-train the model using all the training data without personalization, and then we apply fine-tuning only to reviewerspecific parameters by training independent models from the reviews written by each reviewer. In this pre-training, the model uses reviewer-universal parameters W and b (instead of Wuj and buj ) in Eq. (1), and then initializes the reviewer-specific parameters Wuj and buj by them. This method makes our model scalable even to a large number of reviewers. We fix all the reviewer-universal parameters at the time of fine-tuning. Furthermore, we perform multi-task learning only during the pretraining without personalization. We then fine-tune reviewer-specific parameters Wuj , buj of the pre-trained model while only optimizing the target task. This enables the model to prevent the personalized embeddings from containing the selection bias, otherwise the prior output distribution of the auxiliary tasks by individuals can be implicitly learned.
4
Experiments
We first evaluate the target identification task using two review datasets to confirm the effectiveness of the personalized word embeddings induced by our method. If our model can successfully solve this objective task better than the reviewer-universal model obtained by the pre-taining of our reviewer-specific
Understanding Interpersonal Variations in Word Meanings
127
model, it is considered that those personalized word embeddings capture the personal semantic variation. We then analyze the degree and tendencies of the semantic variation in the obtained word embeddings. 4.1
Settings
Dataset. We adopt review datasets of beer and services related to foods for evaluation, since there are a variety of expressions that describe what we have sensed with various sensory units in these domains. RateBeer dataset is extracted from ratebeer.com3 [13] that includes a variety of beers. We selected 2,695,615 reviews about 109,912 types of beers written by reviewers who posted at least 100 reviews. Yelp dataset is derived from yelp.com4 that includes a diverse range of services. We selected reviews that (1) have location metadata, (2) fall under either the “food” or “restaurant” categories, and (3) are written by a reviewer who posted at least 100 reviews. As a result, we extracted 426,816 reviews of 56,574 services written by 2,414 reviewers in total. We divided these datasets into training, development, and testing sets with the ratio of 8:1:1. In the rest of this paper, we refer the former as RateBeer dataset and the latter as Yelp dataset. Auxiliary Tasks. Regarding the metadata for multi-task learning (MTL), we chose style and brewery for multi-class classification and alcohol by volume (ABV) for regression in the experiments with RateBeer dataset. As for the Yelp dataset, we used location for multi-class classification and category for multi-label classification. Models and Hyperparameters. We compare our model described in Sect. 3 with four different settings.5 Their differences are, (1) whether the fine-tuning for personalization is applied and (2) whether the model is trained through MTL before the fine-tuning. Table 1. Hyperparameters of our model. Model
Optimization
Dimensions of hidden layer
200
Dropout rate 0.2
Dimensions of word embeddings
200
Algorithm
Adam
Vocabulary size (Ratebeer dataset) 100,288 Learning rate 0.0005 Vocabulary size (Yelp dataset)
3 4 5
98,465
Batch size
200
https://www.ratebeer.com. https://www.yelp.com/dataset. We implemented all the models using PyTorch (https://pytorch.org/) version 0.4.0.
128
D. Oba et al.
Table 1 shows major hyperparameters. We initialize the embedding layer by Skip-gram embeddings [15] pretrained from each of the original datasets, containing all the reviews in RateBeer and Yelp datasets, respectively. The vocabulary for each dataset includes all the words that appeared 10 times or more in the dataset. For optimization, we trained the models up to 100 epochs with Adam [11] and selected the model at the epoch with the best results in the target task on the development set as the test model. 4.2
4.2 Overall Results
Table 2 and Table 3 show the results on the two datasets. We gain two insights from the results: (1) in the target task, the model with both MTL and personalization outperformed the others; (2) personalization also improves the auxiliary tasks. The model without personalization assumes that the same words written by different reviewers have the same meanings, while the model with personalization distinguishes them. The improvement by personalization on the target task with objective outputs partly supports the claim that the same words written by different reviewers have different meanings, even though they are used in the same domain (beer or restaurant).

Table 2. Results on the product identification task on the RateBeer dataset. Accuracy and RMSE marked with ∗∗ or ∗ were significantly better than the other models (p < 0.01 or 0.01 < p ≤ 0.05, assessed by a paired t-test for accuracy and a z-test for RMSE).

                          Target task          Auxiliary tasks
Multi-task  Personalize   Product [Acc.(%)]    Brewery [Acc.(%)]   Style [Acc.(%)]   ABV [RMSE]
Baseline                  0.08                 1.51                6.19              2.321
–           –             15.74                n/a                 n/a               n/a
–           ✓             16.69                n/a                 n/a               n/a
✓           –             16.16                (19.98)             (49.00)           (1.428)
✓           ✓             17.56∗∗              (20.81∗∗)           (49.78∗∗)         (1.406∗)
Table 3. Results on the service identification task on the Yelp dataset. Accuracy marked with ∗∗ was significantly better than the others (p < 0.01, assessed by a paired t-test).

                          Target task          Auxiliary tasks
Multi-task  Personalize   Service [Acc.(%)]    Location [Acc.(%)]   Category [Micro F1]
Baseline                  0.05                 27.00                0.315
–           –             6.75                 n/a                  n/a
–           ✓             7.15                 n/a                  n/a
✓           –             9.71                 (70.33)              (0.578)
✓           ✓             10.72∗∗              (83.14∗∗)            (0.577)
Simultaneously solving the auxiliary tasks that estimate metadata of the target item guided the model to understand the target item from various perspectives, like part-of-speech tags of words. We should mention that only the reviewer-specific parameters are updated for the target task in fine-tuning. This means that the improvements on the auxiliary tasks were obtained purely from the semantic variations captured by the reviewer-specific parameters.

Impact of the Number of Reviews for Personalization. We investigated the impact of the number of reviews for personalization when solving the review target identification. We first grouped the reviewers into several bins according to the number of reviews, and then evaluated the classification accuracies for reviews written by the reviewers in the same bin. Figure 2 shows the classification accuracy of the target task plotted against the number of reviews per reviewer; for example, the plots (and error bars) at 10^2.3 represent the accuracy (and its variation) of the target identification for reviews written by each reviewer with n reviews (10^2.1 ≤ n < 10^2.3). Contrary to our expectation, on the (a) RateBeer dataset, all of the models obtained lower accuracies as the number of reviews increased. On the other hand, on the (b) Yelp dataset, only the model employing MTL and personalization obtained higher accuracies as the number of reviews increased. We consider that this difference comes from the biases of frequencies in review targets: the RateBeer dataset is heavily skewed, where the top-10% most frequent beers account for 74.3% of all reviews, while the top-10% most frequent restaurants in the Yelp dataset account for 48.0% of the reviews. Therefore, it is more difficult to estimate infrequent targets in the RateBeer dataset, and such reviews tend to be written by experienced reviewers. Although the model without MTL and personalization also obtained slightly lower accuracies even on the Yelp dataset, the model with both MTL and personalization successfully exploited the increased number of reviews and obtained higher accuracies.
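A minimal sketch of this binning analysis, assuming per-reviewer review counts and per-reviewer accuracies are available (variable names are illustrative):

```python
import numpy as np

def accuracy_per_log_bin(review_counts, reviewer_accuracy, width=0.2):
    """Group reviewers into log10-spaced bins of their review counts and
    average their target-identification accuracy per bin (illustrative)."""
    counts = np.asarray(review_counts, dtype=float)    # one count per reviewer
    accs = np.asarray(reviewer_accuracy, dtype=float)  # one accuracy per reviewer
    log_counts = np.log10(counts)
    edges = np.arange(2.0, log_counts.max() + width, width)  # reviewers posted >= 100 reviews
    bin_idx = np.digitize(log_counts, edges)
    # Label each bin by its (approximate) upper edge, as in the plots.
    return {round(float(edges[b - 1]) + width, 2): float(accs[bin_idx == b].mean())
            for b in np.unique(bin_idx) if b > 0}
```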
Fig. 2. Accuracies in the target identification task against the number of reviews per reviewer on (a) the RateBeer dataset and (b) the Yelp dataset. In the legend, MTL and PRS stand for multi-task learning and personalization.
Fig. 3. Personal semantic variations of the words on the two datasets, plotted against (a) log-frequency, (b) dissemination, and (c) polysemy for the RateBeer dataset, and (d) log-frequency, (e) dissemination, and (f) polysemy for the Yelp dataset. The Pearson correlation coefficients are (a) 0.43, (b) 0.29, (c) −0.07, (d) 0.27, (e) 0.16, and (f) −0.19, respectively. The trendlines show 95% confidence intervals from kernel regressions.
4.3 Analysis
In this section, we analyze the obtained personalized word embeddings to see what kind of personal biases exist in each word. Here, we target only the words used by 30% or more of the reviewers (excluding stopwords) to remove the influence of low-frequency words. We first define the personal semantic variation6 of a word wi, which quantifies how much the representations of the word differ across individuals, as:

\frac{1}{|U(w_i)|} \sum_{u_j \in U(w_i)} \bigl(1 - \cos(e_{w_i}^{u_j}, \bar{e}_{w_i})\bigr) \qquad (7)
where e_{w_i}^{u_j} is the personalized word embedding of the word w_i for a reviewer u_j, \bar{e}_{w_i} is the average of e_{w_i}^{u_j} over U(w_i), and U(w_i) is the set of the reviewers who used the word w_i at least once in the training data. Here, we focus on three perspectives: frequency, dissemination, and polysemy, which have been discussed in studies of semantic variation caused by diachronic or geographical differences of text [6,7,20] (Sect. 2).
6 Unlike the definition of semantic variation in existing studies [20], which measures the degree of change of a word's meaning from one point to another, personal semantic variation measures how much the meanings of a word, as defined by individuals, diverge from each other.
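A direct way to compute this quantity from the learned embeddings, assuming a mapping from each word to its per-reviewer vectors (the data layout here is an illustrative assumption):

```python
import numpy as np

def personal_semantic_variation(personalized):
    """personalized: dict mapping word -> array of shape (n_reviewers, dim),
    one personalized embedding per reviewer who used the word (Eq. 7)."""
    variation = {}
    for word, vectors in personalized.items():
        mean_vec = vectors.mean(axis=0)
        cos = (vectors @ mean_vec) / (
            np.linalg.norm(vectors, axis=1) * np.linalg.norm(mean_vec) + 1e-12)
        variation[word] = float(np.mean(1.0 - cos))
    return variation
```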
Table 4. The list of top-50 words with the largest (and the smallest) semantic variation on the RateBeer dataset and Yelp dataset. Adjectives are boldfaced.

RateBeer dataset, Top-50: ery bready ark slight floral toasty tangy updated citrusy soft deep mainly grassy aroma doughy dissipating grass ot great earthy smell toasted somewhat roasty soapy perfume flowery lingering musty citrus malty background malt present hue minimal earth foamy faint dark medium clean nice copper hay bread herbs chewy complexity toast reddish

RateBeer dataset, Bottom-50: reminds cask batch oil reminded beyond canned conditioned double abv hope horse oats rye brewery blueberry blueberries maple bells old cork shame dogfish become dog hand plastic course remind christmas cross rogue extreme organic fat lost words islands etc. growler hot heat stout alcohol unibroue pass nitro longer scotch rare

Yelp dataset, Top-50: tasty fantastic great awesome delish excellent yummy delicious good amazing phenomenal superb asparagus risotto flavorful calamari salmon creamy chicken got veggies incredible ordered scallops sides outstanding sausage flatbread shrimp eggplant patio ambiance sandwich wonderful desserts salty gnocchi fabulous quesadilla atmosphere bacon mussels sauce vegetables restaurant broth grilled mushrooms ravioli decor food

Yelp dataset, Bottom-50: easily note possibly almost nearly warning aside opposite alone even needless saving yet mark thus wish apart thankfully straight possible iron short eye period thumbs old deciding major zero meaning exact replaced fully somehow single de key personal desired hence pressed rock exactly ups keeping hoping whole meant seeing test hardly
Figure 3 shows the semantic variations against the three metrics. Each of the x-axes corresponds to the log frequency of the word ((a) and (d)), the ratio of the reviewers who used the word ((b) and (e)), and the number of synsets found in WordNet [16] ((c) and (f)), respectively. Interestingly, in contrast to the reports by [7] and [20], semantic variations correlate highly with frequency and dissemination, and poorly with polysemy in our results. This tendency of interpersonal semantic variations can be explained as follows. In the datasets used in our experiments, words related to the five senses, such as "soft" and "creamy," appear frequently, and their usage depends on the feelings and experiences of individuals; therefore, they show high semantic variations. As for polysemy, although the semantic variations might change the degree or nuance of a word sense, they do not change its synset, because those words are still used only in skewed contexts related to food and drink, where word senses do not fluctuate significantly. Table 4 shows the top-50 words with the largest (and smallest) semantic variations. As can be seen from the table, the list of top-50 words contains many more adjectives compared with the list of bottom-50 words; such adjectives are likely to be used to represent individual feelings that depend on the five senses. To see in detail what kinds of words have large semantic variation, we classify the adjectives of the top-50 (and bottom-50) by the five senses: sight (vision), hearing (audition), taste (gustation), smell (olfaction), and touch (somatosensation). From the results, on the RateBeer dataset, there were more words representing each sense except hearing in the top-50 words compared with the bottom-50. On the other hand, the word lists on the Yelp dataset include fewer words related to the five senses than those of the RateBeer dataset, but there are many adjectives that could be applicable to various domains (e.g., "tasty" and "excellent").
Fig. 4. Two-dimensional representation of the words "bready" and "tasty" on (a) the RateBeer dataset and (b) the Yelp dataset, respectively, with the words closest to them in the universal embedding space.
This may be due to the domain size of the Yelp dataset and the lack of reviews detailing specific products in the restaurant reviews. We also analyze whether there are words that get confused with each other. We use the words "bready" and "tasty," which have the highest semantic variation in each dataset. We visualized the personalized word embeddings using Principal Component Analysis (PCA), together with the six words closest to the target words in the universal embedding space, in Fig. 4. As can be seen, the clusters of "cracky," "doughy," and "biscuity" are mixed with each other, suggesting that the words used to express the same meaning may differ by individual.
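This visualization can be reproduced with a simple PCA projection of the personalized vectors; the sketch below assumes the personalized embeddings are available as NumPy arrays (names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_personalized(target_word, neighbor_words, personalized):
    """personalized: dict mapping word -> (n_reviewers, dim) array of
    per-reviewer embeddings; projects them jointly to 2-D with PCA."""
    words = [target_word] + list(neighbor_words)
    stacked = np.vstack([personalized[w] for w in words])
    labels = np.concatenate([[w] * len(personalized[w]) for w in words])
    coords = PCA(n_components=2).fit_transform(stacked)
    for w in words:
        pts = coords[labels == w]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=w)
    plt.legend()
    plt.show()
```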
5 Conclusions
In this study, we focused on interpersonal variations in word meanings and explored the hypothesis that words related to the five senses have inevitable personal semantic variations. To verify this, we proposed a novel method for obtaining semantic variation using personalized word embeddings induced through a task with objective outputs. Experiments using large-scale review datasets from ratebeer.com and yelp.com showed that the combination of multi-task learning and personalization improved the performance of the review target identification, which means that our method could capture interpersonal variations of word meanings. Our analysis showed that words related to the five senses have large interpersonal semantic variations. In future work, besides the factors examined in this study, such as frequency, we plan to analyze relationships between semantic variations and demographic factors of the reviewers, such as gender and age, which inevitably shape how individuals express themselves.
Acknowledgements. We thank Satoshi Tohda for proofreading the draft of our paper. This work was partially supported by Commissioned Research (201) of the National Institute of Information and Communications Technology of Japan.
References
1. Bamman, D., Dyer, C., Smith, N.A.: Distributed representations of geographically situated language. In: 56th ACL, pp. 828–834 (2014)
2. Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: NIPS 2016, pp. 4349–4357 (2016)
3. Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from language corpora contain human-like biases. Science 356(6334), 183–186 (2017)
4. Díaz, M., Johnson, I., Lazar, A., Piper, A.M., Gergle, D.: Addressing age-related bias in sentiment analysis. In: CHI Conference 2018, p. 412. ACM (2018)
5. Gao, W., Yoshinaga, N., Kaji, N., Kitsuregawa, M.: Modeling user leniency and product popularity for sentiment classification. In: 6th IJCNLP, pp. 1107–1111 (2013)
6. Garimella, A., Mihalcea, R., Pennebaker, J.: Identifying cross-cultural differences in word usage. In: 26th COLING, pp. 674–683 (2016)
7. Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: 54th ACL, pp. 1489–1501 (2016)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016, pp. 770–778 (2016)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
10. Jaidka, K., Chhaya, N., Ungar, L.: Diachronic degradation of language models: insights from social media. In: 56th ACL, pp. 195–200 (2018)
11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR 2015 (2015)
12. Li, J., Galley, M., Brockett, C., Spithourakis, G., Gao, J., Dolan, B.: A persona-based neural conversation model. In: 54th ACL, pp. 994–1003 (2016)
13. McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: 7th ACM on Recommender Systems, pp. 165–172 (2013)
14. Michel, P., Neubig, G.: Extreme adaptation for personalized neural machine translation. In: 56th ACL, pp. 312–318 (2018)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS 2013, pp. 3111–3119 (2013)
16. Miller, G.A.: WordNet: a lexical database for English. ACM Commun. 38(11), 39–41 (1995)
17. Mirkin, S., Meunier, J.L.: Personalized machine translation: predicting translational preferences. In: EMNLP 2015, pp. 2019–2025 (2015)
18. Rosenfeld, A., Erk, K.: Deep neural models of semantic shift. In: NAACL-HLT 2018, pp. 474–484 (2018)
19. Swinger, N., De-Arteaga, M., Heffernan, I., Thomas, N., Leiserson, M.D., Kalai, A.T.: What are the biases in my word embedding? arXiv:1812.08769 (2018)
20. Tredici, M.D., Fernández, R.: Semantic variation in online communities of practice. In: 12th IWCS (2017)
21. Wuebker, J., Simianer, P., DeNero, J.: Compact personalized models for neural machine translation. In: EMNLP 2018, pp. 881–886 (2018)
22. Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: I have a dog, do you have pets too? In: 56th ACL, pp. 2204–2213 (2018)
Semantic Roles in VerbNet and FrameNet: Statistical Analysis and Evaluation

Aliaksandr Huminski(B), Fiona Liausvia, and Arushi Goel

Institute of High Performance Computing, Singapore 138632, Singapore
{huminskia,liausviaf,arushi goel}@ihpc.a-star.edu.sg

Abstract. Semantic role theory is a widely used approach for verb representation. Yet, there are multiple indications that the semantic role paradigm is necessary but not sufficient to cover all elements of verb structure. We conducted a statistical analysis of semantic role representation in VerbNet and FrameNet to provide empirical evidence of this insufficiency. The consequence of that is a hybrid role-scalar approach.
Keywords: Verb representation · Semantic role · VerbNet · FrameNet

1 Introduction
The semantic representation of verbs has a long history in linguistics. Fifty years ago, the article "The case for case" [1] gave rise to semantic role theory, which is widely used for verb representation. Since semantic role theory is one of the oldest constructs in linguistics, a variety of resources with different sets of semantic roles have been proposed. There are three types of resources depending on the level of role set granularity. The first level is very specific, with roles like "eater" for the verb eat or "hitter" for the verb hit. The third level is very general, with a range of roles from only two proto-roles [2] to nine roles. The second level is located between them and contains, to the best of our knowledge, from approximately 10 to 50 roles. This rough classification corresponds to the largest linguistic resources: FrameNet [3], VerbNet [4] and PropBank [5], which belong to the first, second and third type of resources, respectively. All of them use semantic role representation for verbs and are combined in the Unified Verb Index system1. They are widely used in advanced NLP and NLU tasks, such as semantic parsing and semantic role labeling, question answering, recognizing textual entailment, and information extraction. Knowledge of semantic representation and verb-argument structure is a key point for NLU systems and applications. The paper is structured as follows. Section 2 briefly introduces VerbNet and FrameNet, the ideas underlying their construction and the main differences between them.
1 http://verbs.colorado.edu/verb-index/.
Section 3 focuses on the basic statistical analysis of VerbNet and FrameNet. Section 4 describes an advanced statistical analysis showing that the role paradigm itself is necessary but not sufficient for proper representation of all verbs. A hybrid role-scalar approach is presented in Sect. 5. The final Sect. 6 reports our concluding observations.
2 VerbNet and FrameNet as Linguistic Resources for Analysis
VerbNet and FrameNet are the two most well-known resources where semantic roles are used. PropBank, which is considered the third resource in the Unified Verb Index, provides a semantic role representation for every verb in the Penn TreeBank [6]. However, we will not analyse it in this article, since PropBank defines semantic roles on a verb-by-verb basis, without making any higher generalizations2. We will not use WordNet [7] for the analysis either, since this resource does not have semantic role representation for verbs.

2.1 VerbNet
VerbNet (VN) is the largest domain-independent computational verb lexicon currently available for English. In this paper we use version 3.33, released in June 2018. It contains 6791 verbs and provides semantic role representation for all of them. VN 3.3, with 39 roles, belongs to the second level of role set resources. In other words, the roles are not as fine-grained as in FrameNet and not as coarse-grained as in PropBank. VN was considered together with the LIRICS role set for the ISO standard 24617-4 for Semantic Role Annotation [8–10].

Idea of Construction. VN is built on Levin's classification of verbs [11]. The verb classification is based on the idea that the syntactic behavior of verbs (syntactic alternations) is to a large degree determined by their meaning. Similar syntactic behavior is taken as a method of grouping verbs into classes that are considered semantic classes. So, verbs that fall into classes according to shared behavior would be expected to show shared meaning components. As a result, each verb4 belongs to a specific class in VN. In turn, each class has a role set that equally characterizes all members (verbs) of the class.
2 Each verb in PropBank has verb-specific numbered roles: Arg0, Arg1, Arg2, etc., with several more general roles that can be applied to any verb. That makes semantic role labeling too coarse-grained. Most verbs have two to four numbered roles. And although the tagging guidelines include a "descriptor" field for each role, such as "kicker" for Arg0 or "instrument" for Arg2 in the frameset of the verb kick, it does not have any theoretical standing [5].
3 http://verbs.colorado.edu/verb-index/vn3.3/.
4 It is more accurate to use the term verb sense here because of verb polysemy.
2.2 FrameNet
FrameNet (FN) is a lexicographic project constructed on the theory of Frame Semantics developed by Fillmore [12]. We consider FrameNet releases 1.5 and 1.75. Roles in FN are extremely fine-grained in comparison with VN. According to the FN approach, situations and events should be represented through highly detailed roles.

Idea of Construction. FN is based on the idea that a word's meaning can be understood only with reference to a structured background [13]. In contrast to VN, FN is first and foremost semantically driven. The same syntactic behavior is not needed to group verbs together. FN takes semantic criteria as primary, and roles (called frame elements in FN) are assigned not to a verb class but to a frame that describes an event. Frames are empirically derived from the British National Corpus, and each frame is considered a conceptual structure that describes an event and its participants. As a result, a frame can include not only verbs but also nouns, multi-word expressions, adjectives, and adverbs. All of them are grouped together according to the frames. As in VN, each frame has a role set that equally characterizes all members of the frame. The role set is essential for understanding an event (situation) represented by a frame.

2.3 VerbNet and FrameNet in Comparison
Table 1 summarizes the differences between VN and FN6.

Table 1. Basic differences of VN and FN.

             FrameNet            VerbNet
Basis        Lexical semantics   Argument syntax
Data source  Corpora             Linguistic literature
Roles        Fine-grained        Coarse-grained
Results      Frames              Verb classes

3 Basic Statistical Analysis
Basic statistical analysis is considered a necessary step for the advanced analysis. Prior to analyzing the relations across verbs, classes/frames7 and roles, we need to extract the classes/frames, those among them in which at least one verb occurs, all unique roles, all verbs in classes/frames, etc.
5 https://framenet.icsi.berkeley.edu/fndrupal/.
6 We modified the original comparison presented in [14] for our own purposes.
7 The expression classes/frames is used hereinafter to emphasize that verbs are grouped into classes in VN and into frames in FN.
Table 2 summarizes the basic statistics related to VN and FN. It is necessary to provide some comments:
1. The numbers of classes/frames with and without verbs differ, since there are classes in VN and frames in FN with no verbs. There are also non-lexical frames in FN with no lexical units inside.
2. In calculating the number of classes in VN, we consider a main class and its subclass as two different classes even if they have the same role set.
3. The number of roles in FN reflects the number of unique roles that occur only in frames with verbs. We distinguish here the number of unique roles from the number of roles counted with duplicates over all frames (10542 in total for FN 1.7).
4. The number of verbs is in reality the number of verb senses that are assigned to different classes/frames. Because of polysemy, the number of verb senses is larger than the number of unique verbs.
Table 2. Basic statistics of VN and FN.

Resource       Number of roles   Number of classes/frames   Number of classes/frames with verbs   Number of verbs   Av. number of verbs per class/frame
VerbNet 3.2    30                484                        454                                    6338              14
VerbNet 3.3    39                601                        574                                    6791              11.8
FrameNet 1.5   656               1019                       605                                    4683              7.7
FrameNet 1.7   727               1221                       699                                    5210              7.45

4 Advanced Statistical Analysis
We will investigate further only the latest versions of VN (3.3) and FN (1.7). The advanced statistical analysis includes the following two types:
– the distribution of verbs per class;
– the distribution of roles per class.
The distribution of verbs per class is about how many verbs similar in meaning are located in one class. The distribution of roles per class is about how many verbs similar in meaning are located in different classes.

4.1 Distribution of Verbs per Class in VN and FN
The distribution of verbs can be presented in two mutually dependent modes. The first mode consists of three steps:
1. calculation of the number of verbs per class;
2. sorting the classes according to the number of verbs;
3. distribution of the verb number per class, starting from the top class.
Fig. 1. Distribution of verbs per class in VN 3.3.
Fig. 2. Distribution of verbs per class in FN 1.7.
Figure 1 for VN and Fig. 2 for FN illustrate the final, third step. Based on them, one can conclude that:
– verbs are not distributed evenly across the classes/frames; there is a sharp deviation from the average value of 11.8 verbs per class in VN 3.3 and 7.45 verbs per frame in FN 1.7;
– regardless of the resource type (coarse-grained or fine-grained role set), the sharp deviation remains surprisingly the same.
Fig. 3. Verb coverage starting from the top classes in VN 3.3.
Fig. 4. Verb coverage starting from the top classes in FN 1.7.
The second mode has the same step 1 and step 2, but the third step is different: it is a distribution of the verb coverage (from 0 to 1) starting from the top class. Figure 3 for VN and Fig. 4 for FN illustrate the final, third step. Based on them, one can conclude that:
– verb coverage is a non-linear function;
– regardless of the resource type (coarse-grained or fine-grained role set), verb coverage remains surprisingly the same non-linear function.
For example, out of the 574 classes in VN 3.3, 50% of all verbs are covered by 123 classes and 90% are covered by 319 classes; out of the 699 frames in FN 1.7, 50% of all verbs are covered by 95 frames and 90% are covered by 416 frames.
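A minimal sketch of how such a coverage curve can be computed from per-class verb counts (the input format and the example counts are illustrative assumptions):

```python
import numpy as np

def verb_coverage_curve(verbs_per_class):
    """verbs_per_class: list of verb(-sense) counts, one per class/frame.
    Returns the cumulative fraction of verbs covered by the top-k classes."""
    counts = np.sort(np.asarray(verbs_per_class))[::-1]  # largest classes first
    return np.cumsum(counts) / counts.sum()

# Example with arbitrary counts: classes needed to cover 50% and 90% of all verb senses.
coverage = verb_coverage_curve([532, 269, 251, 312, 506, 3, 2, 1])
print(int(np.searchsorted(coverage, 0.5) + 1), int(np.searchsorted(coverage, 0.9) + 1))
```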
4.2 Distribution of Roles per Class in VN and FN
If the distribution of verbs reflects similarity between verbs in one class/frame, the distribution of roles shows similarity between verbs located in different classes/frames. This similarity is revealed through identical role sets shared by different classes. We extracted all the different classes that have the same role set and merged them together. Table 3 shows the difference between the total number of classes/frames with verbs and the number of classes/frames remaining after merging those with identical role sets (i.e., the number of distinct role sets).

Table 3. Statistics of classes/frames with different role sets.

Resource       Number of classes/frames with verbs   Number of classes/frames with verbs that have different role sets
VerbNet 3.3    574                                    138
FrameNet 1.7   699                                    619
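The merging step amounts to grouping classes by their role sets, which can be sketched as follows (the example class names are real VerbNet classes, but their role assignments here are illustrative):

```python
from collections import defaultdict

def group_by_role_set(class_roles):
    """class_roles: dict mapping class/frame name -> iterable of role names.
    Returns a dict mapping each distinct role set to the classes sharing it."""
    groups = defaultdict(list)
    for cls, roles in class_roles.items():
        groups[frozenset(roles)].append(cls)
    return groups

classes = {
    "send-11.1":    ["Agent", "Destination", "Initial_Location", "Theme"],
    "slide-11.2":   ["Agent", "Destination", "Initial_Location", "Theme"],
    "swarm-47.5.1": ["Location", "Theme"],
}
groups = group_by_role_set(classes)
print(len(classes), "classes,", len(groups), "distinct role sets")
```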
Table 4 provides some examples of role sets shared by different classes/frames.

Table 4. Examples of the same role sets used in different classes/frames.

Resource   Role set for representation of the class/frame   Number of classes/frames   Number of verbs in all classes
VN 3.3     Agent:Destination:Initial Location:Theme          15                         532
           Location:Theme                                    14                         269
           Agent:Recipient:Topic                             11                         251
           Agent:Instrument:Patient                          7                          312
           Agent:Instrument:Patient:Result                   6                          506
FN 1.7     Entity                                            17                         64
           Agent:Theme:Source                                3                          83
           Experiencer:Stimulus                              2                          138
           Self mover:Source:Path:Goal:Direction             2                          137
           Cause:Theme:Goal or Agent:Theme:Goal              2                          125
The second type of role distribution is the number of role occurrences in all classes/frames (see Fig. 5 for VN and Fig. 6 for FN). Based on this distribution, one can conclude that:
– the distribution of roles is a non-linear function; the top 2–3 roles occur in almost all classes;
Fig. 5. Role distribution in VN 3.3.
Fig. 6. Role distribution in FN 1.7.
– regardless of the resource type (coarse-grained or fine-grained role set), the distribution of roles remains surprisingly the same non-linear function;
– the distribution of roles correlates with the distribution of verbs (compare Fig. 5 and Fig. 6 with Fig. 1 and Fig. 2, respectively).

4.3 General Analysis and Evaluation
Both the verb and role distributions in VN and FN show a sharp deviation from the average value. Despite the fact that VN and FN differ in the principles of their construction (Table 1) and differ significantly in the number of roles (Table 2), we observe an identical picture in both VN and FN for all types of distributions. This similarity is surprising, since the obvious expectation is the following: the larger the number of roles, the more even the role/verb distribution should be, and the less disproportion is expected.

The Reason Why It Happens. Assigning a role representation to all verbs of a language assumes by default that the set of all verbs is homogeneous and, because of this homogeneity, that it can be described through one unique approach: semantic roles. We consider the statistical results an indication that the role paradigm itself is necessary but not sufficient for proper representation of all verbs. We argue that the set of verbs in a language is not homogeneous. Instead, it is heterogeneous and requires at least two different approaches.
5 Hybrid Role-Scalar Approach
For the sake of obtaining a universal semantic representation, we offer a hybrid role-scalar approach.

5.1 Hypothesis: Roles Are Not Sufficient for Verb Representation
By definition, any semantic role is a function of a participant, represented by an NP, towards an event represented by a verb. Nevertheless, to cover all verbs, semantic role theory was extended beyond the traditional definition in such a way as to represent, for example, a change of state. In VN there are roles like Attribute, Value, Extent, Asset, etc. that match abstract participants, attributes, and their changes. For example, in the sentence Oil soared in price by 10%, "price" is in the role of Attribute and "10%" is in the role of Extent which, according to the definition, specifies the range or degree of change. If we are going to represent a change of state through roles, we need to assign a role to the state of a participant, not to the participant itself. Second, a change of state means a change in the value of a state in a particular direction. For example,
the event "heat the water" includes values of the state "temperature" for water as a participant. So, to reflect a change of state, we would need to introduce two new roles: the initial value of the state and its final value on the scale of increasing values on the dimension of temperature. These two new roles look like numbers on a scale, not roles. It is unclear what a "role of a value" really means. We argue that such attempts to extend semantic role theory contradict the nature of a semantic role. Roles are just one of the parts of event representation and do not cover an event completely. While a role is a suitable means for action verbs like "hit" or "stab", a scalar is necessary for the representation of verbs like "kill" or "heat". For instance, in semantic role theory the verb kill has the role set [Agent, Patient], while the meaning of kill contains no information about what the Agent really did towards the Patient. Having Agent and Patient, the verb kill is represented through an unknown action. Meanwhile, what is important for kill is not the action itself but the resulting change of state: the Patient died. And this part of the meaning, being hidden by roles, can be represented via a scalar change "alive-dead". Roles give us a necessary but not a sufficient representation, since change-of-state verbs do not indicate how something was done but what was done. The dichotomy between role and scale can also be expressed as the dichotomy between semantic field and semantic scale. A frame is considered a semantic field whose members are closely related to each other by their meanings, while a semantic scale includes a set of values that are scattered along the scale and oppose each other.

5.2 Scale Representation
A scalar change in an entity involves a change in the value of one of its attributes in a particular direction.

Related Work. The idea of using scales for verb representation has been elaborated by many authors. Dixon [15,16] extracted seven classes of property concepts that are consistently lexicalized across languages: dimension, age, value, color, physical, speed and human propensity. Rappaport Hovav [17,18] noted that change-of-state verbs and so-called result verbs lexicalize a change in a particular direction in the value of a scalar attribute, frequently from the domain of Dixon's property concepts. A similar approach comes from a cognitive science framework [19–21] that considers verb representation to be based on a two-vector structure model: a force vector representing the cause of a change and a result vector representing a change in object properties. It is argued that this framework provides a unified account of a multiplicity of linguistic phenomena related to verbs. Jackendoff [22–24] stated that the representation of result verbs can be derived from physical space. Accordingly, a change in value can be represented the same way as a movement in physical space. For example, a change of possession can be represented as a movement in the space of possession.
Fleischhauer [25] discussed in detail the idea of verbal degree gradation and elaborated the notion of scalar change. Change-of-state verbs are considered one of the prototypical examples of scalar verbs. There are two reasons for this: first, some of the verbs are derived from gradable adjectives, and second, the verbs express a change along a scale. It was stated that a change-of-state verb lexicalizes a scale, even if one or more of the scale parameters remain unspecified in the meaning of the verb.

Scale Representation for VerbNet and FrameNet. We only outline how the verbs from the three largest frames in FN (and the corresponding classes in VN) can be additionally represented via scales; a more detailed analysis of the scale representation goes beyond the limits of this article. The benefit of such an approach is that the large frames can be split by identifying within-frame semantic distinctions. The largest frame in FN is the frame "Self motion". According to its definition, it "most prototypically involves individuals moving under their own power by means of their bodies"8. The frame contains 134 verbs and corresponds to the run-class (51.3.2) with 96 verbs in VN. The necessity of a scale representation for the run-class was directly indicated by Pustejovsky [26]. To make an explicit representation of change of state, he introduced the concept of opposition structure in the generative lexicon (GL) as an enrichment of event structure [27]. He then applied GL-inspired componential analysis to the run-class and extracted six distinct semantic dimensions, which provide clear differentiations in meaning within this class:
– SPEED: amble, bolt, sprint, streak, tear, chunter, flit, zoom;
– PATH SHAPE: cavort, hopscotch, meander, seesaw, slither, swerve, zigzag;
– PURPOSE: creep, pounce;
– BODILY MANNER: amble, ambulate, backpack, clump, clamber, shuffle;
– ATTITUDE: frolic, lumber, lurch, gallivant;
– ORIENTATION: slither, crawl, walk, backpack.
The second largest frame in FN, with 132 verbs, is the frame "Stimulate emotion". This frame is about some phenomenon (the Stimulus) that provokes a particular emotion in an Experiencer9. In other words, the emotion is a direct reaction to the stimulus. It corresponds to the second largest class in VN, the amuse-class (31.1), with 251 verbs. Fellbaum and Mathieu [28] examined experiencer-subject verbs like surprise, fear, hate, love, etc., where the gradation is richly lexicalized by verbs that denote different degrees of intensity of the same emotion (e.g., surprise, strike, dumbfound, flabbergast). The results of their analysis show, first, that the chosen verbs indeed possess scalar qualities; second, they confirm the prior assignment of the verbs into broad classes based on a common underlying emotion;
8 https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Self_motion.
9 https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Stimulate_emotion.
finally, the web data allow the construction of consistent scales with verbs ordered according to the intensity of the emotion. The third largest frame in FN is the frame "Make noise" (105 verbs), which corresponds to the sound emission-class (129 verbs) in VN. The frame is defined as "a physical entity, construed as a source, that emits a sound"10. Snell-Hornby [29] suggested the following scales to characterize verbs of sound: VOLUME (whirr vs. rumble); PITCH (squeak vs. rumble); RESONANCE (rattle vs. thud); DURATION (gurgle vs. beep).
6 Conclusion
Based on a statistical analysis of VerbNet and FrameNet as verb resources, we presented empirical evidence of the insufficiency of roles as the sole approach to verb representation. This supports the hypothesis that roles as a tool for meaning representation do not cover the variety of all verbs. As a consequence, another paradigm, the scalar approach, is needed to fill the gap. The hybrid role-scalar approach looks promising for verb meaning representation and will be elaborated in future work.
References
1. Fillmore, Ch.: The case for case. In: Universals in Linguistic Theory, pp. 1–88 (1968)
2. Dowty, D.: Thematic proto-roles and argument selection. Language 67, 547–619 (1991)
3. Baker, C., Fillmore, Ch., Lowe, J.: The Berkeley FrameNet project. In: Proceedings of the International Conference on Computational Linguistics, Montreal, Quebec, Canada, pp. 86–90 (1998)
4. Kipper-Schuler, K.: VerbNet: a broad-coverage, comprehensive verb lexicon. Ph.D. thesis. Computer and Information Science Department, University of Pennsylvania, Philadelphia, PA (2005)
5. Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31(1), 71–106 (2005)
6. Marcus, M., et al.: The Penn Treebank: annotating predicate argument structure. In: ARPA Human Language Technology Workshop (1994)
7. Fellbaum, Ch., Miller, G.: Folk psychology or semantic entailment? A reply to Rips and Conrad. Psychol. Rev. 97, 565–570 (1990)
8. Petukhova, V., Bunt, H.: LIRICS semantic role annotation: design and evaluation of a set of data categories. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco (2008)
9. Claire, B., Corvey, W., Palmer, M., Bunt, H.: A hierarchical unification of LIRICS and VerbNet semantic roles. In: Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC 2011), Palo Alto, CA, USA (2011)
10 https://framenet2.icsi.berkeley.edu/fnReports/data/frameIndex.xml?frame=Make_noise.
10. Bunt, H., Palmer, M.: Conceptual and representational choices in defining an ISO standard for semantic role annotation. In: Proceedings of the Ninth Joint ACL-ISO Workshop on Interoperable Semantic Annotation (ISA-9), Potsdam, Germany (2013)
11. Levin, B.: English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL (1993)
12. Fillmore, Ch.: Frame semantics. In: Linguistics in the Morning Calm, pp. 111–137. Hanshin, Seoul (1982)
13. Fillmore, Ch., Atkins, T.: Toward a frame-based lexicon: the semantics of RISK and its neighbors. In: Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, pp. 75–102. Erlbaum, Hillsdale (1992)
14. Baker, C., Ruppenhofer, J.: FrameNet's frames vs. Levin's verb classes. In: Proceedings of the 28th Annual Meeting of the Berkeley Linguistics Society, Berkeley, California, USA, pp. 27–38 (2002)
15. Dixon, R.: Where Have All the Adjectives Gone? and Other Essays in Semantics and Syntax. Mouton, Berlin (1982)
16. Dixon, R.: A Semantic Approach to English Grammar. Oxford University Press, Oxford (2005)
17. Rappaport Hovav, M.: Lexicalized meaning and the internal temporal structure of events. In: Crosslinguistic and Theoretical Approaches to the Semantics of Aspect, pp. 13–42. John Benjamins, Amsterdam (2008)
18. Rappaport Hovav, M.: Scalar roots and their results. In: Workshop on Roots: Word Formation from the Perspective of "Core Lexical Elements", Universität Stuttgart, Stuttgart (2009)
19. Gärdenfors, P.: The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press, Cambridge (2017)
20. Gärdenfors, P., Warglien, M.: Using conceptual spaces to model actions and events. J. Semant. 29(4), 487–519 (2012)
21. Warglien, M., Gärdenfors, P., Westera, M.: Event structure, conceptual spaces and the semantics of verbs. Theor. Linguist. 38(3–4), 159–193 (2012)
22. Jackendoff, R.: Semantics and Cognition. MIT Press, Cambridge (1983)
23. Jackendoff, R.: Semantic Structures. MIT Press, Cambridge (1990)
24. Jackendoff, R.: Foundations of Language. Oxford University Press, Oxford (2002)
25. Fleischhauer, J.: Degree Gradation of Verbs. Düsseldorf University Press, Düsseldorf (2016)
26. Pustejovsky, J., Palmer, M., Zaenen, A., Brown, S.: Verb meaning in context: integrating VerbNet and GL predicative structures. In: Proceedings of the LREC 2016 Workshop: ISA-12, Portorož, Slovenia (2016)
27. Pustejovsky, J.: Events and the semantics of opposition. In: Events as Grammatical Objects, pp. 445–482. Center for the Study of Language and Information (CSLI), Stanford (2010)
28. Fellbaum, Ch., Mathieu, Y.: A corpus-based construction of emotion verb scales. In: Linguistic Approaches to Emotions in Context. John Benjamins, Amsterdam (2014)
29. Snell-Hornby, M.: Verb-descriptivity in German and English. A Contrastive Study in Semantic Fields. Winter, Heidelberg (1983)
Sentiment Analysis
Fusing Phonetic Features and Chinese Character Representation for Sentiment Analysis

Haiyun Peng, Soujanya Poria, Yang Li, and Erik Cambria(B)

School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
{penghy,sporia,liyang,cambria}@ntu.edu.sg

Abstract. The Chinese pronunciation system offers two characteristics that distinguish it from other languages: deep phonemic orthography and intonation variations. We are the first to argue that these two important properties can play a major role in Chinese sentiment analysis. Hence, we learn phonetic features of Chinese characters and fuse them with their textual and visual features in order to mimic the way humans read and understand Chinese text. Experimental results on five different Chinese sentiment analysis datasets show that the inclusion of phonetic features significantly and consistently improves the performance of textual and visual representations.

Keywords: Phonetic features · Character representation · Sentiment analysis
1 Introduction

In recent years, sentiment analysis has become increasingly popular for processing social media data on online communities, blogs, wikis, microblogging platforms, and other online collaborative media [3]. Sentiment analysis is a branch of affective computing research that aims to mine opinions from text (but sometimes also videos [4]) based on user intent [9] in different domains [2]. Most of the literature is on the English language, but recently an increasing number of works are tackling the multilinguality issue, especially in booming online languages such as Chinese [17]. Chinese is one of the most popular languages on the Web, and it has two relevant advantages over other languages. Nevertheless, research in Chinese sentiment analysis is still very limited due to the lack of experimental resources. The Chinese language has two fundamental characteristics that make language processing on this language challenging yet interesting. Firstly, it is a pictogram language [8], which means that its symbols intrinsically carry meanings. Through geometric composition, various symbols integrate together to form a new symbol. This is different from Romance or Germanic languages, whose characters do not encode internal meanings. In order to utilize this characteristic, two branches of research exist in the literature. One of them studies the sub-word components (such as Chinese characters and Chinese radicals) via a textual approach [6, 16, 20, 24, 25]. The other branch explores the compositionality using the visual presence of the characters [14, 22] by means of extracting
visual features from bitmaps of Chinese characters to further improve the Chinese textual word embeddings. The second characteristic of the Chinese language is its deep phonemic orthography. In other words, it is difficult to infer the pronunciation of characters or words from their written form. The modern pronunciation system of the Chinese language is called pinyin, which is a romanization of Chinese text. It is so different from the written system that the two can be seen as unrelated languages if their mapping relationship is unknown. However, to the best of our knowledge, no work has utilized this characteristic for Chinese NLP, nor for sentiment analysis. Furthermore, we found one feature of the pinyin system which is beneficial to sentiment analysis. For each Chinese character, the pinyin system has one fundamental pronunciation and four variations1 due to four different tones (tone and intonation are used interchangeably within this paper). These tones have an immediate impact on semantics and sentiment, as shown in Table 1. Although phonograms (or phonosemantic compounds, xingsheng zi) are quite common in the Chinese language, less than 5% of them have exactly the same pronunciation and intonation.

Table 1. Examples of intonations that alter meaning and sentiment.

Text   Pinyin      Meaning      Sentiment polarity
空闲   kòngxián    Free         Neutral
空气   kōngqì      Air          Neutral
好吃   hǎochi      Delicious    Positive
好吃   hàochi      Gluttonous   Negative
We argue that these two factors of the Chinese language can play a vital role in Chinese natural language processing, especially sentiment analysis. In this work, we take advantage of the two above-mentioned characteristics of the Chinese language by means of a multimodal approach to Chinese sentiment analysis. For the first factor, we extract visual features from Chinese character pictograms and textual features from Chinese characters. To consider the deep phonemic orthography and intonation variety of the Chinese language, we propose to use Chinese phonetic information and learn various phonetic features. For the final sentiment classification, we fuse visual, textual and phonetic information with both early fusion and late fusion mechanisms. Unlike previous approaches, which consider either the visual information or the textual information for Chinese sentiment analysis, our method is multimodal: we fuse visual, textual and phonetic information. Most importantly, although the importance of character-level visual features has been explored before in the literature, to the best of our knowledge none of the previous works in Chinese sentiment analysis have considered the deep phonetic orthographic characteristic of the Chinese language. In this work, we leverage this by learning phonetic features and using them for sentiment analysis. The experimental results show that the use of deep phonetic orthographic information is useful for Chinese sentiment analysis. Due to the above, the proposed multimodal framework outperforms the state-of-the-art Chinese sentiment analysis method by a statistically significant margin.
1 The neutral tone, in addition to the four variations, is neglected for the moment due to its lack of connection with sentiment.
In summary, the two main contributions of this paper are as follows:
– We use Chinese phonetic information to improve sentiment analysis.
– We experimentally show that a multimodal approach that leverages phonetic, visual and textual features of Chinese characters at the same time is useful for Chinese sentiment analysis.
The remainder of this paper is organized as follows: we first present a brief review of general embeddings and Chinese embeddings; we then introduce our model and provide technical details; next, we describe the experimental results and present some discussion; finally, we conclude the paper and suggest future work.
2 Related Work

2.1 General Embedding

One-hot representation was the initial numeric word representation method in NLP. However, it usually leads to a problem of high dimensionality and sparsity. To solve this problem, distributed representation (or word embedding) [1] was proposed. Word embedding is a representation which maps words into low-dimensional vectors of real numbers by using neural networks. The key idea is based on the distributional hypothesis, so as to model how to represent context words and the relation between context words and a target word. In 2013, Mikolov et al. [15] introduced both the continuous bag-of-words (CBOW) model and the skip-gram model. The former places context words in the input layer and the target word in the output layer, whereas the latter swaps the input and output of CBOW. In 2014, Pennington et al. [18] created the 'GloVe' embeddings. Unlike the previous models, which learned the embeddings by minimizing a prediction loss, GloVe learns the embeddings with dimensionality reduction techniques on a co-occurrence count matrix.

2.2 Chinese Embedding

Chinese text differs from English text in two key aspects: it does not have word segmentation, and it has a characteristic of compositionality due to its pictographic nature. Based on the former aspect, word segmentation tools are always employed before text representation, such as ICTCLAS [26], THULAC [23], Jieba2 and so forth. Based on the latter aspect, several works have focused on the use of sub-word components (such as characters and radicals) to improve word embeddings. [6] proposed the decomposition of Chinese words into characters and presented a character-enhanced word embedding (CWE) model. [24] decomposed Chinese characters into radicals and developed a radical-enhanced Chinese character embedding.
2 https://github.com/fxsjy/jieba.
In [20], pure radical-based embeddings were trained for short-text categorization, Chinese word segmentation and web search ranking. [25] extended the pure radical embedding by introducing multi-granularity Chinese word embeddings. Multilingual sentiment analysis has become a growing area of research in the past few years, especially for the Chinese language due to the economic growth of China in the last decade. [14, 22] explored integrating visual features into textual word embeddings. The extracted visual features proved to be effective in modeling the compositionality of Chinese characters. To the best of our knowledge, no previous work has integrated pronunciation information into Chinese embeddings. Due to its deep phonemic orthography, we believe that Chinese pronunciation information can elevate the embeddings to a higher level. Thus, we propose to learn phonetic features and present two fusions of multimodal features.
3 Model

In this section, we present how features from three modalities were extracted. Next, we introduce the methods to fuse those features.

3.1 Textual Embedding
As in most recent literature, textual word embedding vectors are treated as the fundamental representation of text [1, 15, 18]. Pennington et al. [18] developed 'GloVe' in 2014, which employs a count-based mechanism to embed word vectors. Following this convention, we used 128-dimensional 'GloVe' character embeddings [18] to represent text. It is worth noting that we set the fundamental token of Chinese text to the character instead of the word for two reasons. Firstly, this is designed to align with the audio features: audio features can only be extracted at the character level, as Chinese pronunciation is defined on each character. Secondly, character-level processing avoids the errors induced by Chinese word segmentation. Although we used character GloVe embeddings for our final model, experimental comparisons were also conducted with both CBOW [15] and Skip-gram embeddings.

3.2 Training Visual Features
Unlike Romance or Germanic languages, the Chinese written language originated from pictograms. Later, simple symbols were combined to form complex symbols in order to express abstract meanings. For example, a geometric combination of three '木 (wood)' creates a new character '森 (forest)'. This phenomenon gives rise to a compositional characteristic of Chinese text. Instead of directly modeling text compositionality using sub-word [6, 25] or sub-character [16, 24] elements, we opt for a visual model. In particular, we constructed a convolutional auto-encoder (convAE) to extract visual features. Details of the convAE are listed in Table 2. The input to the model is a 60 × 60 bitmap for each Chinese character, and the output of the model is a dense vector with a dimension of 512.
Table 2. Configuration of convAE for visual feature extraction.

Layer#    Layer configuration
1         Convolution 1: kernel 5, stride 1
2         Convolution 2: kernel 4, stride 2
3         Convolution 3: kernel 5, stride 2
4         Convolution 4: kernel 4, stride 2
5         Convolution 5: kernel 5, stride 1
Feature   Extracted visual feature: (1,1,512)
6         Dense ReLU: (1,1,1024)
7         Dense ReLU: (1,1,2500)
8         Dense ReLU: (1,1,3600)
9         Reshape: (60,60,1)
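The architecture in Table 2, together with the reconstruction loss described next, can be sketched in PyTorch as follows; the channel widths (other than the final 512) and other unstated details are our own assumptions, so this is an illustrative approximation rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Sketch of the character-bitmap auto-encoder (input: 1 x 60 x 60)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                 # five convolutions as in Table 2
            nn.Conv2d(1, 32, kernel_size=5, stride=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=5, stride=1), nn.ReLU(),  # -> (512, 1, 1)
        )
        self.decoder = nn.Sequential(                 # dense ReLU layers, then reshape
            nn.Flatten(), nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2500), nn.ReLU(),
            nn.Linear(2500, 3600), nn.ReLU(),
        )

    def forward(self, bitmap):
        feature = self.encoder(bitmap)                # visual feature for the lookup table
        recon = self.decoder(feature).view(-1, 1, 60, 60)
        return feature.flatten(1), recon

def reconstruction_loss(x_t, x_r):
    # |x_t - x_r| + (x_t - x_r)^2 per element, averaged here (Eq. 1 sums over samples).
    diff = x_t - x_r
    return (diff.abs() + diff.pow(2)).mean()
```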
The model was trained using the Adagrad optimizer on the reconstruction error between the original bitmap and the reconstructed bitmap. The loss is given as:

\sum_{j=1}^{L} \bigl(|x_t - x_r| + (x_t - x_r)^2\bigr) \qquad (1)
where L is the number of samples, x_t is the original input bitmap and x_r is the reconstructed output bitmap. After training the visual features, we obtained a lookup table where each Chinese character corresponds to a 512-dimensional feature vector.

3.3 Learning Phonetic Features

Written Chinese and spoken Chinese have several fundamental differences. To the best of our knowledge, all previous literature on Chinese NLP has ignored the significance of the audio channel. As cognitive science suggests, human communication depends not only on visual recognition but also on audio activation. This drove us to explore the mutual influence between the audio channel (pronunciation) and the textual representation. Popular Romance and Germanic languages such as Spanish, Portuguese, English, etc. share two remarkable characteristics. Firstly, they have shallow phonemic orthography3. In other words, the pronunciation of a word in such languages is largely dependent on its textual composition: one can almost infer the pronunciation of a word given its textual spelling. From this perspective, textual information can be interchangeable with phonetic information. For instance, if the pronunciations of the English words 'subject' and 'marineland' are known, it is not hard to guess the pronunciation of the word 'submarine', because one can combine the pronunciation of 'sub' from 'subject' and 'marine' from 'marineland'. This implies that phonetic information in these languages may not carry more information entropy than textual information.
3 https://en.wikipedia.org/wiki/Phonemic_orthography.
Secondly, intonation information is limited and implicit in these languages. Generally speaking, emphasis, ascending intonation and descending intonation are the major variations. Although they exert great influence on sentiment polarity during communication, there is no clue from which to infer such information from the text alone. In comparison, Chinese has the opposite characteristics. Firstly, it has deep phonemic orthography: one can hardly infer the pronunciation of a Chinese word or character from its textual writing. For example, the pronunciations of the characters '日' and '月' are 'rì' and 'yuè', respectively, while a combination of the two characters makes another character '明', which is pronounced 'míng'. This characteristic motivates us to find out how the pronunciation of Chinese can affect natural language understanding. Secondly, intonation information in Chinese is rich and explicit. In addition to emphasis, each Chinese character carries one tone (out of four different tones), marked explicitly by diacritics. These intonations (tones) greatly affect the semantics and sentiment of Chinese characters and words, as exemplified in Table 1. To this end, we found it non-trivial to explore how Chinese pronunciation can influence natural language understanding, especially sentiment analysis. In particular, we designed three ways to learn phonetic information, namely extraction from the audio signal, pinyin with intonation, and pinyin without intonation. An illustration is shown in Table 3.

Extracted Feature from Audio Clips (Ex). The symbol system of modern spoken Chinese is named 'Hanyu Pinyin', abbreviated to 'pinyin'. It is the official romanization system for Mandarin in mainland China. The system includes four diacritics denoting tones. Each Chinese character has one corresponding pinyin, and this pinyin has four tone variations. In order to extract phonetic features, for each tone of each pinyin we collected an audio clip recording a female speaker's pronunciation of that pinyin (with tone) from a language learning resource4. Each audio clip lasts around one second and contains a standard pronunciation of one pinyin with tone. The quality of these clips was validated by two native speakers. Next, we used openSMILE [7] to extract phonetic features from each of the obtained pinyin-tone audio clips. Audio features are extracted at a 30 Hz frame rate with a sliding window of 20 ms. They consist of a total of 39 low-level descriptors (LLD) and their statistics, e.g., MFCC, root quadratic mean, etc.

Table 3. Illustration of 3 types of phonetic features (a(x) stands for the audio clip for pinyin 'x').

Text    假设学校明天放假 (Suppose the school is on vacation tomorrow.)
Pinyin  Jiǎ Shè Xué Xiào Míng Tiān Fàng Jià
Ex      a(Jiǎ) a(Shè) a(Xué) a(Xiào) a(Míng) a(Tiān) a(Fàng) a(Jià)
PW      Jia3 She4 Xue2 Xiao4 Ming2 Tian1 Fang4 Jia4
PO      Jia She Xue Xiao Ming Tian Fang Jia
a(Jiˇa) a(Sh`e) a(Xu´e) a(Xi`ao) a(M´ıng) a(Ti¯an) a(F`ang) a(Ji`a) Jia3 She4 Xue2 Xiao4 Ming2 Tian1 Fang4 Jia4 Jia She Xue Xiao Ming Tian Fang Jia
4. https://chinese.yabla.com/
After obtaining features for each pinyin-tone clip, we had an m × 39-dimensional matrix per clip, where m depends on the length of the clip and 39 is the number of features. To obtain a fixed-size representation for each clip, we applied SVD to the matrices and kept the vector of singular values as a 39-dimensional feature vector. In the end, the high-dimensional feature matrix of each pinyin clip is transformed into a dense 39-dimensional vector, and a lookup table between pinyin (with tone) and audio feature vector is constructed accordingly.

Pinyin with Intonation (PW). Instead of collecting audio clips for each pinyin, we directly represent Chinese characters with pinyin tokens, as shown in Table 3, where the trailing number denotes one of the four tones. Specifically, we take the textual corpus that was used to train the Chinese character embeddings and employ a parser (see footnote 5) to convert each character in the corpus to its corresponding pinyin. For example, the sentence '今天心情好' from the textual corpus is converted to the PW sentence 'jin1 tian1 xin1 qing2 hao3'. Although 3.49% of the 3,500 most common Chinese characters are heteronyms, the parser claims to select the most probable pinyin of a heteronym based on context; moreover, addressing heteronyms is beyond the scope of this paper. For the remaining roughly 96.5% of common characters the conversion is unambiguous and therefore accurate. The sequence of Chinese characters in the textual corpus is thus converted into a sequence of pinyin tokens, so the context of the textual corpus is preserved in the pinyin corpus. We then treat each pinyin token in the converted corpus as a new token and train 128-dimensional embeddings for it with the conventional GloVe method [18]. A lookup table between pinyin with intonation (PW) and embedding vector is constructed accordingly.

Pinyin Without Intonation (PO). PO differs from PW in that intonations are removed: Chinese characters are represented by pinyin tokens without tones. In the previous example, the textual sentence is converted to the PO sentence 'jin tian xin qing hao'. Accordingly, 128-dimensional GloVe pinyin embeddings are trained. Pinyin tokens that share the same pronunciation but differ in tone share the same GloVe embedding vector, such as Jiǎ and Jià in Table 3.

3.4 Sentence Modeling

Various deep learning models have been proposed to learn sentence representations, such as convolutional neural networks [12] and recurrent neural networks [10]. Among these, bidirectional recurrent neural networks [27] have proved effective, so we use them to model sentence representations in our experiments. In particular, we use a bidirectional long short-term memory (LSTM) network. For a sentence s of n Chinese characters, s = {x1, x2, ..., xn−1, xn}, where xi stands for the embedding of the i-th character. The sequence is fed to a forward LSTM, whose hidden output is denoted h_forw; the bidirectional LSTM also applies an LSTM in the backward direction to obtain h_back. The representation of the sentence S is the concatenation of the two.
5. https://github.com/mozillazg/python-pinyin
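As a rough sketch of how the three phonetic lookup tables could be built, one might proceed as below. The openSMILE export format and function names are assumptions for illustration; only the use of SVD on the m × 39 matrices and of the python-pinyin parser (footnote 5) follows the description above.

```python
import numpy as np
from pypinyin import pinyin, Style  # the parser referenced in footnote 5

def ex_feature(clip_matrix: np.ndarray) -> np.ndarray:
    """Reduce an (m x 39) openSMILE LLD matrix to a 39-dim vector of singular values."""
    s = np.linalg.svd(clip_matrix, compute_uv=False)  # singular values, length min(m, 39)
    vec = np.zeros(39)
    vec[:len(s)] = s                                   # pad if the clip is very short
    return vec

def to_pw_tokens(sentence: str) -> list:
    """Pinyin with intonation (PW), e.g. '今天心情好' -> ['jin1', 'tian1', 'xin1', 'qing2', 'hao3']."""
    return [p[0] for p in pinyin(sentence, style=Style.TONE3)]

def to_po_tokens(sentence: str) -> list:
    """Pinyin without intonation (PO), e.g. '今天心情好' -> ['jin', 'tian', 'xin', 'qing', 'hao']."""
    return [p[0] for p in pinyin(sentence, style=Style.NORMAL)]
```

The PW and PO corpora are then obtained by mapping every sentence of the textual corpus through these converters before training the GloVe pinyin embeddings.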
3.5 Fusion of Modalities
In the context of the Chinese language, textual embeddings have been applied to various tasks and have proved effective at encoding semantics and sentiment [6, 20, 24, 25]. Recently, visual features pushed the performance of textual embeddings further via multimodal fusion [14, 22], owing to their effective modeling of the compositionality of Chinese characters. In this work, we hypothesize that using phonetic features alongside textual and visual features can further improve performance. We therefore introduce the following fusion methods.

1. Early Fusion: Each Chinese character is represented by the concatenation of three segments, one per modality:

   char = [emb_T ⊕ emb_P ⊕ emb_V]    (2)

   where char is the character representation and emb_T, emb_P, emb_V are the embeddings from the textual, phonetic and visual modality, respectively.

2. Late Fusion: Fusion takes place at the sentence classification level; the sentence representations from each modality are fused just before the softmax layer. Equation 3 shows the late fusion mechanism:

   sentence = [S_T ⊕ S_P ⊕ S_V]    (3)

   where S_T, S_P, S_V are the bidirectional LSTM outputs from the textual, phonetic and visual modality, respectively. The concatenated sentence representation is fed to a softmax classifier whose output is the sentiment polarity.

More complex fusion methods exist in the literature [19]; we did not use them for two reasons: (1) fusion through concatenation is a proven, effective method [11, 14, 21], and (2) it has the added benefit of simplicity, which keeps the emphasis (and contribution) of the system on the features themselves.
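A minimal Keras sketch of the two fusion strategies is given below. The layer sizes, LSTM width and two-class output are illustrative assumptions, not the exact configuration used here; only the concatenation points follow Eqs. 2 and 3.

```python
from tensorflow.keras import layers, Model

SEQ_LEN, D_T, D_P, D_V, N_CLASSES = 150, 128, 39, 512, 2

# Inputs: per-character embeddings from the three modalities.
x_t = layers.Input(shape=(SEQ_LEN, D_T), name="textual")
x_p = layers.Input(shape=(SEQ_LEN, D_P), name="phonetic")
x_v = layers.Input(shape=(SEQ_LEN, D_V), name="visual")

def early_fusion():
    # Concatenate modalities at the character level, then encode the sentence once.
    chars = layers.Concatenate(axis=-1)([x_t, x_p, x_v])           # Eq. (2)
    sent = layers.Bidirectional(layers.LSTM(128))(chars)
    out = layers.Dense(N_CLASSES, activation="softmax")(sent)
    return Model([x_t, x_p, x_v], out)

def late_fusion():
    # Encode each modality separately, concatenate sentence vectors before softmax.
    sents = [layers.Bidirectional(layers.LSTM(128))(x) for x in (x_t, x_p, x_v)]
    fused = layers.Concatenate()(sents)                            # Eq. (3)
    out = layers.Dense(N_CLASSES, activation="softmax")(fused)
    return Model([x_t, x_p, x_v], out)
```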
4 Experiments and Results

In this section, we first introduce the experimental setup. Experiments were conducted in five steps: we compare unimodal features, experiment with various fusion methods, analyze and validate the role of phonetic features, visualize the different features used in the experiments, and finally locate the cause of the improvement.

4.1 Experimental Setup
Datasets. We evaluate our method on five datasets: Weibo, It168, Chn2000, Review-4 and Review-5. The first three datasets consist of reviews extracted from micro-blogs and review websites. The last two contain reviews from [5]: Review-4 has reviews from the computer and camera domains, and Review-5 contains reviews from the car and cellphone domains. The experimental datasets (see footnote 6) are summarized in Table 4. For the phonetic experiments, we employ online code (see footnote 7) to convert the text in the datasets to pinyin with intonations (as this step also functions as disambiguation, we collected three online resources and selected the most reliable one). For the visual features, we use the lookup table to convert characters to visual feature vectors.

Table 4. Number of reviews in the different experimental datasets.

          Weibo  It168  Chn2000  Review-4  Review-5
Positive   1900    560      600      1975      2599
Negative   1900    458      739       879      1129
Sum        3800   1018     1339      2854      3728
Setup and Baselines. We used TensorFlow and Keras to implement our model. All models used an Adam optimizer with a learning rate of 0.001 and an L2-norm regularizer of 0.01. The dropout rate is 0.5 and each mini-batch contains 50 samples. We report the average test results of each model over 5-fold cross-validation, with each model trained for 50 epochs. These parameters were set by a grid search on the validation data. Related works on Chinese textual embedding all aim at improving Chinese word embeddings, such as CWE [6] and MGE [25]; those utilizing visual features [14, 22] also work at the word level. They therefore cannot serve as fair baselines for our proposed model, which studies Chinese character embeddings. There are two major reasons for working at the character level. Firstly, the pinyin pronunciation system is defined at the character level; it does not provide pronunciations for Chinese words. Secondly, the character level bypasses Chinese word segmentation, which may introduce errors. Conversely, using character-level pronunciation to model word-level pronunciation would cause sequence modeling issues. For instance, the Chinese word '你好' comprises the two characters '你' and '好'. For textual embedding, the word can be treated as one single unit by training a word embedding vector. For phonetic embedding, however, we cannot treat the word as a single unit from the perspective of pronunciation: the correct pronunciation of the word is a time sequence of character pronunciations, first '你' and then '好'. Working at the word level would require some representation of the word's pronunciation, such as averaging the character-level phonetic features. To make a fair comparison, we used three embedding methods to train Chinese textual character embeddings, namely skip-gram, CBOW [15] and GloVe [18]. We also implemented the radical-enhanced character embeddings charCBOW and charSkipGram from [13]. To model Chinese sentences, we also tested three deep learning models, namely a convolutional neural network, an LSTM and a bidirectional LSTM (see footnote 6).
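The training setup above could be sketched as follows. This is a plausible reading with a simplified bidirectional-LSTM classifier; the data-loading variables are hypothetical, while the optimizer, regularizer, dropout, batch size, epochs and 5-fold cross-validation follow the description.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras import layers, models, optimizers, regularizers

def build_model(seq_len=150, dim=200, n_classes=2):
    m = models.Sequential([
        layers.Input(shape=(seq_len, dim)),
        layers.Bidirectional(layers.LSTM(128, kernel_regularizer=regularizers.l2(0.01))),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax",
                     kernel_regularizer=regularizers.l2(0.01)),
    ])
    m.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m

# X: (n_samples, 150, 200) character features; y: one-hot labels (hypothetical arrays).
def cross_validate(X, y, n_splits=5):
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], batch_size=50, epochs=50, verbose=0)
        scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
    return float(np.mean(scores))
```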
6. Both the datasets and codes in this paper are available for public download upon acceptance.
7. https://github.com/mozillazg/python-pinyin
Table 5. Classification accuracy of unimodality in bidirectional LSTM.

                      Weibo  It168  Chn2000  Review-4  Review-5
GloVe                 75.97  83.29    84.76     88.33     87.53
CBOW                  73.94  83.57    82.90     86.83     85.70
Skip-gram             75.28  81.81    82.30     87.38     86.91
Visual                63.20  69.72    70.43     79.99     79.83
Phonetic feature Ex   67.20  76.19    80.36     80.98     81.38
Phonetic feature PO   68.12  81.13    80.28     83.36     83.15
Phonetic feature PW   73.31  83.19    84.24     86.58     86.56

4.2 Experiments on Unimodality
For textual embeddings, we tested GloVe, skip-gram and CBOW. For the phonetic representation, the three types of features were tested. For the visual representation, the 512-dimensional feature extracted from the character bitmap was used to represent each character. As shown in Table 5, the textual embeddings (GloVe) achieved the best performance among all three modalities in four datasets. This is because they successfully encode the semantics and dependencies between characters. We also noticed that the visual features achieved the worst performance among the three modalities, which was in line with our expectation: as demonstrated in [22], pure visual features are not representative enough to obtain performance comparable to textual embeddings. Last but not least, phonetic features performed better than visual features. Although visual features captured compositional information of Chinese characters, they failed to distinguish different meanings of characters that share the same written form but differ in tone. These tones can largely alter the sentiment of Chinese words and, in turn, the sentiment of the sentence, as supported by the observation that PW consistently outperformed PO. In order to exploit the complementary information available in the modalities, we evaluate the two proposed fusion techniques below.
4.3 Experiments on Fusion of Modalities
In this set of experiments, we evaluated both early fusion and late fusion with every possible combination of modalities. After extensive experimental trials, we found that the concatenation of the Ex and PW embeddings (denoted ExPW) performed best among all phonetic feature combinations; we therefore used it as the phonetic feature in the fusion experiments. The results in Table 6 suggest that the best performance was achieved by fusing either all of the textual, phonetic and visual features, or the textual and phonetic features. We found that the charCBOW and charSkipGram methods perform quite close to the original CBOW and skip-gram methods; they perform slightly, but not consistently, better than their baselines. We conjecture that this is caused by the relatively small size of our training corpus compared to the original Chinese Wikipedia dump training corpus. As the corpus size increases, all embedding methods are expected to improve. Nevertheless, the corpus we used still provides a fair platform for comparing all methods.
Table 6. Classification accuracy of multimodality. (T and V represent textual and visual, respectively; + denotes the fusion operation; ExPW is the concatenation of Ex and PW.)

                              Weibo  It168  Chn2000  Review-4  Review-5
Unimodal  T                   75.97  83.29    84.76     88.33     87.53
          ExPW                73.89  82.69    84.31     86.41     86.51
          V                   63.20  69.72    70.43     79.99     79.83
          charCBOW [13]       73.31  83.19    83.94     87.07     85.68
          charSkipGram [13]   72.00  83.08    82.52     85.39     84.52
Early     T+ExPW              76.52  86.53    87.60     88.96     88.73
          T+V                 76.07  84.17    85.66     88.26     88.30
          ExPW+V              73.39  84.27    84.99     87.14     88.30
          T+ExPW+V            75.73  86.43    87.98     88.72     89.48
Late      T+ExPW              75.50  85.29    84.63     89.67     86.03
          T+V                 76.05  82.60    83.87     87.53     87.47
          ExPW+V              72.18  82.84    85.22     89.77     85.90
          T+ExPW+V            75.37  85.88    85.45     90.01     86.19
We also noticed that phonetic features, when fused with textual or visual features, improved the performance of both the textual and visual unimodal classifiers. This validates our hypothesis that phonetic features are an important factor in improving Chinese sentiment analysis. Integrating multiple modalities takes advantage of each modality and pushes the overall performance further. A p-value of 0.008 in the paired t-test between the best-performing models with and without phonetic features suggests that the improvement from integrating phonetic features is statistically significant. We also note that early fusion generally outperformed late fusion by a notable gap. We conjecture that late fusion cuts off the complementary information offered by each modality at the character level, and that fusion at the sentence level may not preserve the multimodal information of individual characters; in comparison, early fusion merges the modalities at the character level, encapsulating the multimodal information for each character. Initially, we expected the fusion of all modalities to perform best. However, the results on the Weibo and It168 datasets contradicted this expectation. We attribute this to the poor performance of the visual modality on these two datasets, as shown by the unimodal results in Table 6.

Table 7. Performance of learned and randomly generated phonetic features.

                              Weibo  It168  Chn2000  Review-4  Review-5
Learned phonetic feature  Ex  67.20  76.19    80.36     80.98     81.38
                          PO  68.12  81.13    80.28     83.36     83.15
                          PW  73.31  83.19    84.24     86.58     86.56
Random phonetic feature       53.30  57.63    58.71     69.31     69.82
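The significance test mentioned in the fusion analysis above can be reproduced with a standard paired t-test over per-dataset accuracies. A small sketch follows; the two accuracy lists are placeholders, not the exact figures behind the reported p-value.

```python
from scipy import stats

# Per-dataset accuracies of the best model with and without phonetic features (placeholder values).
with_phonetic    = [76.52, 86.53, 87.98, 90.01, 89.48]
without_phonetic = [76.07, 84.17, 85.66, 88.26, 88.30]

t_stat, p_value = stats.ttest_rel(with_phonetic, without_phonetic)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```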
GloVe was chosen as the textual embedding in our model due to its performance in Table 5. Although we do not show the corresponding results with CBOW or skip-gram, the general trend remains the same: either the fusion of all modalities or the fusion of phonetic and textual embeddings achieves the best performance in most cases. This indicates the positive contribution of phonetic features when used collaboratively with other modalities.
4.4 Validating Phonetic Feature
In the previous section, we showed that phonetic features help improve the overall sentiment classification performance when fused with other modalities. However, the improvement could also be due to the training of the classifier; in other words, the improvement might still occur even if the phonetic features were totally random and encoded no phonetic information. To rule out this concern, we developed a set of controlled experiments to validate the contribution of the phonetic features. In particular, we generated random real-valued vectors as a random phonetic feature for each character. Each dimension of the random phonetic feature vector is a real number between −1 and 1 sampled from a Gaussian distribution. We then used this random feature vector to represent each Chinese character, yielding the results in Table 7. Comparing the learned phonetic features with the random phonetic features, we observe that the learned features outperformed the random features by at least 10% on all datasets. This indicates that the performance improvement comes from the learned phonetic features rather than from the training of the classifiers: the phonetic features themselves are the cause, and random features do not provide similar performance.
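A sketch of how such a control could be generated is shown below; clipping the Gaussian samples to [−1, 1] is our assumption about how the bounded range was obtained.

```python
import numpy as np

def random_phonetic_lookup(characters, dim=39, seed=0):
    """Build a lookup table of random 'phonetic' vectors, one per Chinese character."""
    rng = np.random.default_rng(seed)
    # Gaussian samples, clipped so every dimension lies in [-1, 1] (clipping is an assumption).
    return {ch: np.clip(rng.normal(0.0, 0.5, size=dim), -1.0, 1.0) for ch in characters}
```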
4.5 Visualization of the Representation
We visualize the extracted phonetic features (Ex) to see what information has been captured. As shown on the left of Fig. 1, pinyins that share similar vowels are clustered, and pinyins with the same intonation stay close to each other. This suggests that our phonetic features have captured certain phonetic information of the pinyins. We also visualize the fused embedding that concatenates textual, phonetic and visual features, on the right of Fig. 1. We note that the fused embedding clusters characters that share not only similar pronunciations but also the same components (radicals).
Fig. 1. Selected t-SNE visualization of phonetic embeddings and fused embeddings. (Left: the Ex feature, where the number denotes the intonation. Right: the fused embedding of T+Ex+V. Green/red circles cluster phonetic/compositional (or semantic) similarity.) (Color figure online)
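A minimal sketch of this kind of visualization, using scikit-learn's t-SNE on a feature lookup table (the perplexity and plotting details are illustrative assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(lookup, perplexity=30, seed=0):
    """Project feature vectors (e.g., the Ex lookup table) to 2-D and scatter-plot them."""
    labels = list(lookup.keys())
    vectors = np.stack([lookup[k] for k in labels])
    coords = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(vectors)
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    for (x, y), label in zip(coords, labels):
        plt.annotate(label, (x, y), fontsize=7)
    plt.show()
```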
It can be concluded that the fused embeddings combine the phonetic information from the phonetic features, the compositional information from the visual features and the semantic information from the textual embeddings. Since we highlighted two characteristics of the pinyin system at the beginning, it is reasonable to ask whether it is the deep phonemic orthography or the variety of intonations that contributes to the improvement. This leads to another group of controlled experiments in the following section.

4.6 Who Contributes to the Improvement?

As shown on the left of Fig. 2, the red lines consistently outperform the blue lines in all three fusions. The red lines differ from the blue lines only in having the extra Ex features. The extracted phonetic features are considered to encode the uniqueness of Chinese pronunciation, namely the deep phonemic orthography. This validates that the deep phonemic orthography property of Chinese helps in the sentiment analysis task. Similarly, on the right of Fig. 2, the red lines also outperform the green lines in all fusion cases. The difference between the green and red lines is the lack of intonations: ExPO eliminates intonations compared to ExPW. This difference causes the performance gap between the green and red lines, which further proves the importance of intonations.
[Figure 2 shows two plots of early-fusion accuracy (roughly 68 to 90) on the Weibo, It168, Chn2000, Review-4 and Review-5 datasets. The left panel compares T+PW, V+PW and T+PW+V against T+ExPW, ExPW+V and T+ExPW+V; the right panel compares T+ExPO, V+ExPO and T+ExPO+V against T+ExPW, V+ExPW and T+ExPW+V.]

Fig. 2. Performance comparison between various phonetic features in early fusion. (Color figure online)
5 Conclusion

The modern Chinese pronunciation system (pinyin) provides a new perspective, in addition to the written system, for representing the Chinese language. Owing to its deep phonemic orthography and intonation variations, it is expected to bring new contributions to the statistical representation of Chinese, especially for the task of sentiment analysis. To the best of our knowledge, we are the first to present an approach that learns phonetic information from pinyin (both from pinyin tokens and from the audio signal). We then integrate the extracted information with textual and visual features to create new Chinese representations. Experiments on five datasets demonstrated the positive contribution of phonetic information to Chinese sentiment analysis, as well as the effectiveness of the fusion mechanism. Even though our method is straightforward, it suggests
greater potential in taking advantage of the phonetic information of languages with deep phonemic orthography, such as Arabic and Hebrew. In the future, we plan to extend this work in the following directions. Firstly, we will explore the mutual influence among modalities, as direct concatenation, used in our work, ignores the dependencies between them. Secondly, we will explore how phonetic features encode semantics.
References

1. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3(Feb), 1137–1155 (2003)
2. Cambria, E., Song, Y., Wang, H., Howard, N.: Semantic multi-dimensional scaling for open-domain sentiment analysis. IEEE Intell. Syst. 29(2), 44–51 (2014)
3. Cambria, E., Wang, H., White, B.: Guest editorial: big social data analysis. Knowl.-Based Syst. 69, 1–2 (2014)
4. Chaturvedi, I., Satapathy, R., Cavallari, S., Cambria, E.: Fuzzy commonsense reasoning for multimodal sentiment analysis. Pattern Recogn. Lett. 125, 264–270 (2019)
5. Che, W., Zhao, Y., Guo, H., Su, Z., Liu, T.: Sentence compression for aspect-based sentiment analysis. IEEE Trans. Audio Speech Lang. Process. 23(12), 2111–2124 (2015)
6. Chen, X., Xu, L., Liu, Z., Sun, M., Luan, H.: Joint learning of character and word embeddings. In: IJCAI, pp. 1236–1242 (2015)
7. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. ACM (2010)
8. Hansen, C.: Chinese ideographs and western ideas. J. Asian Stud. 52(2), 373–399 (1993)
9. Howard, N., Cambria, E.: Intention awareness: improving upon situation awareness in human-centric environments. Hum.-Centric Comput. Inf. Sci. 3(9), 1–17 (2013)
10. Irsoy, O., Cardie, C.: Opinion mining with deep recurrent neural networks. In: EMNLP, pp. 720–728 (2014)
11. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
12. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
13. Li, Y., Li, W., Sun, F., Li, S.: Component-enhanced Chinese character embeddings. arXiv preprint arXiv:1508.06669 (2015)
14. Liu, F., Lu, H., Lo, C., Neubig, G.: Learning character-level compositionality with visual features. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 2059–2068 (2017)
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
16. Peng, H., Cambria, E., Zou, X.: Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level. In: FLAIRS, pp. 347–352 (2017)
17. Peng, H., Ma, Y., Li, Y., Cambria, E.: Learning multi-grained aspect target sequence for Chinese sentiment analysis. Knowl.-Based Syst. 148, 167–176 (2018)
18. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
19. Poria, S., Cambria, E., Hazarika, D., Mazumder, N., Zadeh, A., Morency, L.P.: Multi-level multiple attentions for contextual multimodal sentiment analysis. In: ICDM, pp. 1033–1038 (2017)
20. Shi, X., Zhai, J., Yang, X., Xie, Z., Liu, C.: Radical embedding: delving deeper to Chinese radicals. In: ACL-IJCNLP (vol. 2: Short Papers), p. 594 (2015)
21. Snoek, C.G., Worring, M., Smeulders, A.W.: Early versus late fusion in semantic video analysis. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 399–402. ACM (2005)
22. Su, T.R., Lee, H.Y.: Learning Chinese word representations from glyphs of characters. In: EMNLP, pp. 264–273 (2017)
23. Sun, M., Chen, X., Zhang, K., Guo, Z., Liu, Z.: THULAC: an efficient lexical analyzer for Chinese. Technical report (2016)
24. Sun, Y., Lin, L., Yang, N., Ji, Z., Wang, X.: Radical-enhanced Chinese character embedding. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds.) ICONIP 2014. LNCS, vol. 8835, pp. 279–286. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12640-1_34
25. Yin, R., Wang, Q., Li, P., Li, R., Wang, B.: Multi-granularity Chinese word embedding. In: EMNLP, pp. 981–986 (2016)
26. Zhang, H.P., Yu, H.K., Xiong, D.Y., Liu, Q.: HHMM-based Chinese lexical analyzer ICTCLAS. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187. Association for Computational Linguistics (2003)
27. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems, pp. 649–657 (2015)
Sentiment-Aware Recommendation System for Healthcare Using Social Media

Alan Aipe, N. S. Mukuntha(B), and Asif Ekbal

Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India
{alan.me14,mukuntha.cs16,asif}@iitp.ac.in

Abstract. Over the last decade, health communities (known as forums) have evolved into platforms where more and more users share their medical experiences, thereby seeking guidance and interacting with people of the community. The shared content, though informal and unstructured in nature, contains valuable medical and/or health-related information and can be leveraged to produce structured suggestions to the common people. In this paper, we first propose a stacked deep learning model for sentiment analysis of medical forum data. The stacked model comprises a Convolutional Neural Network (CNN) followed by a Long Short Term Memory (LSTM) and then another CNN. For a blog classified with positive sentiment, we retrieve the top-n similar posts. Thereafter, we develop a probabilistic model for suggesting suitable treatments or procedures for a particular disease or health condition. We believe that the integration of medical sentiment and suggestion would help users find relevant content regarding medications and medical conditions, without having to manually stroll through a large amount of unstructured content.

Keywords: Health social media · Deep learning · Suggestion mining · Medical sentiment

1 Introduction
With the increasing popularity of electronic bulletin boards, there has been a phenomenal growth in the amount of social media information available online. Users post about their experiences on social media such as medical forums and message boards, seeking guidance and emotional support from the online community. As discussed in [2], medical social media is an increasingly viable source of useful information. These users, who are often patients themselves or the friends and/or relatives of patients, write their personal views and/or experiences. Their posts are rich in information such as their experiences with disease and their satisfaction with treatment methods and diagnosis. As discussed in [3], medical sentiment refers to a patient's health status, medical conditions and treatment. Extraction of this information, as well as its analysis, has several potential applications. The difficulty in extracting information such as sentiment and suggestions from a forum post can be attributed to a variety of reasons. Forum posts contain informal language combined with usages of medical conditions and terms. The medical domain itself is sensitive to misinformation. Thus, any system built on this data would also have to incorporate relevant domain knowledge.

1.1 Problem Definition
Our main objective is to develop a sentiment-aware recommendation system to help build a patient-assisted health-care system. We propose a novel framework for mining medical sentiments and suggestions from medical forum data. This broad objective can be modularized into the following set of research questions:

RQ1: Can an efficient multi-class classifier be developed to help us understand the overall medical sentiment expressed in a medical forum post?

RQ2: How can we model the similarity between two medical forum posts?

RQ3: Can we propose an effective algorithm for treatment suggestion by leveraging the medical sentiment obtained from forum posts?

By addressing these research questions, we aim to create a patient-assisted health-care system which is able to determine the sentiment a user expresses in a forum post, point the user to similar forum posts for more information, and suggest possible treatment(s) or procedural methods for the user's symptoms and possible disorders.
1.2 Motivation
The amount of health-related information being sought on the Internet is on the rise. As discussed in [4], an estimated 6.75 million health-related searches are made on Google every day. The Pew Internet Survey [5] claims that 35% of U.S. adults have used the Internet to diagnose a medical condition they themselves or another might have, and that 41% of these online diagnosers have had their suspicions confirmed by a clinician. There has also been an increase in the number of health-related forums and discussion boards on the Internet, which contain useful information that is yet to be properly harnessed. Sentiment analysis has various applications, and we believe it can also provide important information in health-care. In addition to a doctor's advice, connecting with other people who have been in similar situations can help with several practical difficulties. According to the Pew Internet Survey, 24% of all adults have obtained information or support from others who have the same health conditions.
A person posting on such a forum is often looking for emotional support from similar people. Consider the following two blog posts: Post 1: Hi. I have been on sertaking 50 mgs for about 2 months now and previously was at 25 mg for two weeks. Overall my mood is alot more stable and I dont worry as much as I did before however I thought I would have a bath and when I dried my hair etc I started to feel anxious, lightheaded and all the lovely feeling you get with panic. Jus feel so yuck at the moment but my day was actually fine. This one just came out of the blue.. I wanted to know if anyone else still gets some bad moments on these. I don’t know if they feel more intense as I have been feeling good for a while now. Would love to hear others stories.
Post 2: Just wanna let you all know who are suffering with head aches/pressure that I finally went to the doctor. Told him how mines been lasting close to 6 weeks and he did a routine check up and says he's pretty I have chronic tension headaches. He prescribed me muscle relaxers, 6 visits to neck massages at a physical therapist and told me some neck exercises to do. I went in on Tuesday and since yesterday morning things have gotten better. I'm so happy I'm finally getting my life back. Just wanted you all to know so maybe you can feel better soon

In the first post, the author discusses an experience with a drug and is looking to hear from people with similar issues. In the second post, the author discusses a positive experience and seeks to help people with similar problems. One of our aims is to develop a system that automatically retrieves and provides such a user with the posts most similar to theirs. Also, in order to make an informed decision, knowing a patient's satisfaction with a given course of treatment might be useful. We also seek to provide suggestions for treatment for a particular patient's problems. The suggestions can subsequently be verified by a qualified professional and then be prescribed to the patients, or in more innocuous cases (such as 'more sleep' or 'meditation'), can be taken directly as advice.
1.3 Contributions
In this paper, we propose a sentiment-aware, patient-assisted health-care system using the information extracted from medical forums. We propose a deep learning model with a stacked architecture that makes use of Convolutional Neural Network (CNN) layers and a Long Short Term Memory (LSTM) network for the classification of a blog post into its medical sentiment class. To the best of our knowledge, exploring the usage of medical sentiment to retrieve similar posts (from medical blogs) and treatment options has not yet been attempted. We summarize the contributions of our proposed work as follows:
– We propose an effective deep learning-based stacked model utilizing CNN and LSTM for medical sentiment classification.
– We develop a method for retrieving relevant medical forum posts similar to a given post.
– We propose an effective algorithm for treatment suggestion that could lead towards building a patient care system.

2 Related Works
Social media is a huge source of information that can be leveraged for building many socially intelligent systems. Sentiment analysis has been explored quite extensively in various domains; however, it has not been addressed in the medical/health domain to the required extent. In [3], the authors analyzed the peculiarities of sentiment and word usage in medical forums, and performed quantitative analysis on clinical narratives and medical social media resources. In [2], multiple notions of sentiment analysis with reference to medical text are discussed in detail. In [1], the authors built a system that identifies drugs causing serious adverse reactions, using messages discussing them from online health forums. They use an ensemble of Naïve Bayes (NB) and Support Vector Machine (SVM) classifiers to successfully identify drugs previously withdrawn from the market. Similarly, in [13], users' written content from social media was used to mine associations between drugs for predicting Adverse Drug Reactions (ADRs). FDA alerts were used as the gold standard, and the Proportional Reporting Ratio (PRR) statistic was shown to be of high importance in solving the problem. In [11], one of the shared tasks involved the retrieval of medical forum posts related to the provided search queries. The queries involved were short, detailed and to the point, typically less than 10 words. Our work, however, focuses more on the medical sentiment of an entire forum post, and helps to retrieve similar posts. Recently, [12] presented a benchmark setup for analyzing the medical sentiment of users on social media. They identified and analyzed multiple forms of medical sentiment in text from forum posts, and developed a corresponding annotation scheme. They also annotated and released a benchmark dataset for the problem.

In our current work we propose a novel stacked deep-learning-based ensemble model for sentiment analysis in the medical domain. This is significantly different from the prior works mentioned above. To the best of our knowledge, no prior attempt has been made to exploit medical sentiment from social media posts to suggest treatment options and build a patient-assisted recommendation system.

3 Proposed Framework
In this section, we describe our proposed framework comprising three phases, each of which tackles one of the research questions enumerated in Sect. 1.1.

3.1 Sentiment Classification
Medical sentiment refers to the health status reflected in a given social media post. We approach this task as a multi-class classification problem using the sentiment taxonomy described in Sect. 4.1. Convolutional Neural Network (CNN) architectures have been extensively applied to sentiment analysis and classification tasks [8, 12]. Long Short Term Memory (LSTM) networks are a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies by handling the vanishing and exploding gradient problem [6]. We propose an architecture consisting of two deep Convolutional Neural Network (CNN) layers and a Long Short Term Memory (LSTM) layer, stacked with a fully connected layer followed by a three-neuron output layer with a softmax activation function. A diagrammatic representation of the classifier is shown in Fig. 1. The social media posts are first vectorized (as discussed in Sect. 4.3) and then fed as input to the classifier. The convolutional layers used in the classifier generate 200-dimensional feature maps with unigram and bigram filter sizes. Feature maps from the final CNN layer are max-pooled, flattened and fed into a fully connected layer with a rectified linear unit (ReLU) activation function. The output of this layer is fed into another fully connected layer with a softmax activation to obtain class probabilities. The sentiment denoted by the class with the highest softmax value is considered the medical sentiment of the input message.

The intuition behind adopting a CNN-LSTM-CNN architecture is as follows. During close scrutiny of the dataset, we observed that users often share experiences following a time line. For example, "I was suffering from anxiety. My doctor asked me to take cit 20 mg per day. Now I feel better". In this post, the user portrays his/her initial condition, explains the treatment that was used and then the effect of the treatment, all in temporal sequence. Moreover, the health status also changes in the same sequence. This trend was observed throughout the dataset; temporal features are therefore key to medical sentiment classification. Hence, in our stacked ensemble model, the first CNN layer extracts top-level features, the LSTM then finds temporal relationships between the extracted features, and the final CNN layer filters the top temporal relationships, which are subsequently fed into a fully connected layer.
Fig. 1. Stacked CNN-LSTM-CNN Ensemble Architecture for medical sentiment classification
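A rough Keras sketch of a classifier in the spirit of this stacked architecture is given below. The exact number of filters per kernel size, the pooling size, the LSTM width and the hidden layer size are assumptions; only the 200-dimensional feature maps, the unigram/bigram filters, the ReLU dense layer and the 3-way softmax follow the description above.

```python
from tensorflow.keras import layers, Model

def build_stacked_classifier(seq_len=150, dim=200, n_classes=3):
    inp = layers.Input(shape=(seq_len, dim))

    # First CNN block: unigram and bigram filters producing 200-dimensional feature maps.
    c1 = layers.Conv1D(200, kernel_size=1, padding="same", activation="relu")(inp)
    c2 = layers.Conv1D(200, kernel_size=2, padding="same", activation="relu")(inp)
    feats = layers.Concatenate()([c1, c2])

    # LSTM layer: models temporal relationships between the extracted features.
    temporal = layers.LSTM(200, return_sequences=True)(feats)

    # Second CNN block followed by max-pooling and flattening.
    c3 = layers.Conv1D(200, kernel_size=2, padding="same", activation="relu")(temporal)
    pooled = layers.MaxPooling1D(pool_size=2)(c3)
    flat = layers.Flatten()(pooled)

    hidden = layers.Dense(128, activation="relu")(flat)
    out = layers.Dense(n_classes, activation="softmax")(hidden)
    return Model(inp, out)
```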
3.2 Top-N Similar Posts Retrieval
Users often share content in forums seeking guidance and to connect with other people who experienced similar medical scenarios. Thus, retrieving the top-N similar posts would help users focus on content relevant to their medical condition, without having to manually scan through all the forum posts. We could have posed this task as a regression problem in which a machine learning (ML) model learns to predict a similarity score for a given pair of forum posts, but to the best of our knowledge there is no suitable dataset available for this task. We therefore tackle the task by creating a similarity metric (shown in Eq. 5) and evaluating it on a manually annotated small test set (as discussed in Sect. 4.5). The similarity metric comprises three terms:

– Disease similarity: the Jaccard similarity computed between the two forum posts with respect to the diseases mentioned in them. Section 4.4 discusses how the diseases are extracted from a given post. Let J(A, B) denote the Jaccard similarity between sets A and B, DS(P, Q) the disease similarity between two forum posts P and Q, and D(P) and D(Q) the sets of diseases mentioned in P and Q, respectively. Then:

  DS(P, Q) = J(D(P), D(Q))    (1)

  where J(A, B) = |A ∩ B| / |A ∪ B|.

– Symptom similarity: the Jaccard similarity between two forum posts with respect to the symptoms mentioned in them. Section 4.4 discusses how the symptoms are extracted from a given post. Let SS(P, Q) denote the symptom similarity between two forum posts P and Q, and S(P) and S(Q) the sets of symptoms mentioned in P and Q, respectively. Then:

  SS(P, Q) = J(S(P), S(Q))    (2)

– Text similarity: the cosine similarity between the document vectors of the two forum posts. The document vector of a post is the sum of the vectors of all its words (Sect. 4.3). Let DP and DQ denote the document vectors corresponding to the forum posts P and Q, and TS(P, Q) the cosine similarity between them. Then:

  TS(P, Q) = (DP · DQ) / (|DP| × |DQ|)    (3)
We compute the above similarities between a pair of posts and use Eq. 5 to obtain the overall similarity score Sim(P, Q) between two given forum posts P and Q. For a given test instance, the training posts are ranked according to this similarity score (with respect to the test post) and the top-N posts are retrieved.

  MISim(P, Q) = (2 × DS(P, Q) + SS(P, Q)) / 3    (4)

  Sim(P, Q) = (2 × MISim(P, Q) + TS(P, Q)) / 3    (5)
where MISim(P, Q) denotes the similarity between P and Q with respect to the relevant medical information. The main objective of similar post retrieval is to search for posts depicting a similar medical experience. The medical information shared in a forum post can be considered an aggregate of the disease conditions and symptoms encountered; the medical experience shared in a forum post can be considered an aggregate of the medical information shared and the semantic meaning of the text, in that order of relevance. This is the intuition behind the adoption of the similarity metric in Eq. 5.
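A compact sketch of this similarity metric (Eqs. 1-5) follows, assuming the disease/symptom sets and document vectors have already been extracted as described in Sects. 4.3 and 4.4; the dictionary keys are hypothetical.

```python
import numpy as np

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def sim(post_p: dict, post_q: dict) -> float:
    """Overall similarity (Eq. 5). Each post is a dict with keys
    'diseases' (set), 'symptoms' (set) and 'docvec' (sum of word vectors)."""
    ds = jaccard(post_p["diseases"], post_q["diseases"])    # Eq. (1)
    ss = jaccard(post_p["symptoms"], post_q["symptoms"])    # Eq. (2)
    ts = cosine(post_p["docvec"], post_q["docvec"])         # Eq. (3)
    misim = (2 * ds + ss) / 3                               # Eq. (4)
    return (2 * misim + ts) / 3                             # Eq. (5)
```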
3.3 Treatment Suggestion
A treatment T mentioned in a forum post P can be considered suitable for a disease D mentioned in post Q depending on how similar the medical experiences depicted in P and Q are, and on the probability that T produces a positive medical sentiment given D. Thus, the suggestion score G(T, D) is given by:

  G(T, D) = Sim(P, Q) × Pr(+ve sentiment | T, D)    (6)

Treatment T is suggested if G(T, D) ≥ τ and not suggested if G(T, D) < τ, where τ is a hyper-parameter of the framework and Pr(A) denotes the probability of event A.
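A sketch of how this suggestion rule could be applied is given below, assuming the corpus has been indexed with the treatments, diseases and sentiment label of each post; the data structures and the example value of τ are hypothetical, and sim() refers to the earlier similarity sketch.

```python
def positive_prob(treatment, disease, corpus):
    """Pr(+ve sentiment | T, D), estimated from posts mentioning both T and D."""
    hits = [p for p in corpus if treatment in p["treatments"] and disease in p["diseases"]]
    if not hits:
        return 0.0
    return sum(p["sentiment"] == "positive" for p in hits) / len(hits)

def suggest_treatments(query_post, corpus, tau=0.3):
    """Return (treatment, disease, score) triples whose suggestion score (Eq. 6) reaches tau."""
    suggestions = set()
    for post in corpus:
        similarity = sim(query_post, post)          # Eq. (5), defined earlier
        for treatment in post["treatments"]:
            for disease in query_post["diseases"]:
                score = similarity * positive_prob(treatment, disease, corpus)
                if score >= tau:
                    suggestions.add((treatment, disease, round(score, 3)))
    return sorted(suggestions, key=lambda x: -x[2])
```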
4 Dataset and Experimental Setup

In this section, we discuss the details of the datasets used for our experiments and the evaluation setups.

4.1 Forum Dataset
We perform experiments using a recently released dataset for sentiment analysis [12]. This dataset consists of social media posts collected from the medical forum 'patient.info'. In total, 5,189 posts were segregated into three classes (Exist, Recover, Deteriorate) based on the medical conditions the post described, and 3,675 posts were classified into three classes (Effective, Ineffective, Serious Adverse
Effect) based on the effect of medication. As our framework operates at a generic level, we combine both segments into a single dataset, mapping labels from each segment to a sentiment taxonomy as discussed in Sect. 4.1. The classes with respect to medical condition are redefined as follows:

– Exist: The user shares the symptoms of a medical problem. This is mapped to the neutral sentiment.
– Recover: The user shares their recovery status from a previous problem. This is mapped to the positive sentiment.
– Deteriorate: The user shares information about worsening health conditions. We map this to the negative sentiment.

The classes with respect to the effect of medication are:

– Effective: The user shares information about the usefulness of a treatment. This is mapped to the positive sentiment.
– Ineffective: The user shares information that the treatment undergone has had no effect as such. This is mapped to the neutral sentiment.
– Serious adverse effect: The user shares negative opinions towards the treatment, mainly due to adverse drug effects. This is mapped to the negative sentiment.

Sentiment Taxonomy. A different sentiment taxonomy is conceptualized keeping in mind the generic behavior of our proposed system: it does not distinguish between forum posts related to medical conditions and those related to medication. Thus, a one-to-one mapping from the sentiment classes used in each segment of the dataset to a more generic taxonomy is essential. We show the class distribution in Table 1.

Table 1. Class distribution in the dataset with respect to the sentiment taxonomy.

Sentiment  Distribution (%)
Positive   37.49
Neutral    32.34
Negative   30.17
– Positive sentiment: Forum posts depicting improvement in overall health status or positive results of the treatment. For example: "I have been suffering from anxiety for a couple of years. Yesterday, my doc prescribed me Xanax. I am feeling better now." This post is considered positive as it depicts positive results of Xanax.
– Negative sentiment: Forum posts describing deteriorating health status or negative results of treatment. For example: "Can citalopram make it really hard for you to sleep? i cant sleep i feel wide awake every night for the last week and im on it for 7 weeks."
– Neutral sentiment: Forum posts where neither positive nor negative sentiment is expressed, with no change in the overall health status of the person. For example: "I was wondering if anyone has used Xanax for anxiety and stress. I have a doctors appointment tomorrow and not sure what will be decided to use."

4.2 Word Embeddings
Capturing the semantic similarity between the target texts is an important step towards accurate classification; for this reason, word embeddings play a pivotal role. We use a pre-trained word2vec [10] model (http://bio.nlplab.org/), induced from PubMed and PMC texts along with texts extracted from a Wikipedia dump.

4.3 Tools Used and Preprocessing
The codebase used during experimentation is written in Python (version 3.6) with external libraries, namely keras (https://keras.io/) for neural network design, sklearn (http://scikit-learn.org/) for evaluation of the baseline and the proposed model, pandas (http://pandas.pydata.org/) for easier access to data in the form of tables (data frames) during execution, nltk (http://www.nltk.org/) for textual analysis, and pickle (https://docs.python.org/3/library/pickle.html) for saving and retrieving the input and output of different modules from secondary storage. The preprocessing phase comprises the removal of non-ASCII characters and stop words and the handling of non-alphanumeric characters, followed by tokenization. Tokens shorter than 3 characters were also removed, as they have a very low probability of becoming indicative features for the classification model. Labels corresponding to the sentiment classes of each segment in the dataset are mapped to the generic taxonomy classes (as discussed in Sect. 4.1), and the corresponding one-hot encodings are generated.

Text Vectorization. Using the pre-trained word2vec model (discussed in Sect. 4.2), each token is converted to a 200-dimensional vector. The vectors are stacked together and padded to form a 2-D matrix of the desired size (150 × 200), where 150 is the maximum number of tokens in any preprocessed forum post belonging to the training set.
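A sketch of the preprocessing and vectorization steps described above is given below; the embedding file name and the exact token filter are illustrative assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Hypothetical path to the PubMed/PMC/Wikipedia word2vec binary.
w2v = KeyedVectors.load_word2vec_format("wikipedia-pubmed-and-PMC-w2v.bin", binary=True)
STOP = set(stopwords.words("english"))

def preprocess(text: str) -> list:
    text = text.encode("ascii", errors="ignore").decode()            # drop non-ASCII characters
    tokens = [t.lower() for t in word_tokenize(text) if t.isalnum()]
    return [t for t in tokens if t not in STOP and len(t) >= 3]      # drop stop words, short tokens

def vectorize(text: str, max_len: int = 150, dim: int = 200) -> np.ndarray:
    mat = np.zeros((max_len, dim))
    tokens = [t for t in preprocess(text) if t in w2v][:max_len]
    for i, tok in enumerate(tokens):
        mat[i] = w2v[tok]
    return mat                                                       # padded (150 x 200) matrix
```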
4.4 UMLS Concept Retrieval
Identification of medical information such as diseases, symptoms and treatments mentioned in a forum post is essential for the top-n similar post retrieval (Sect. 3.2) and treatment suggestion (Sect. 3.3) phases of the proposed framework. The Unified Medical Language System (UMLS, https://www.nlm.nih.gov/research/umls/) is a compendium of many controlled vocabularies in the biomedical sciences (created in 1986). UMLS concept identifiers related to the above-mentioned medical information were therefore retrieved using Apache cTAKES (https://ctakes.apache.org/). Concepts with semantic type 'Disorder or Disease' were added to the set of diseases, those with semantic type 'Sign or Symptom' to the set of symptoms, and those with semantic types 'Medication' and 'Procedures' to the set of treatments.

4.5 Relevance Judgement for Similar Post Retrieval
Annotating pairs of forum posts with their similarity scores as per human judgment is necessary to evaluate how relevant the retrieved text is. This corresponds to the evaluation of the proposed custom similarity metric (Eq. 5). Since annotating every pair of posts is a cumbersome task, 20% of the total posts in the dataset were randomly selected for annotation, maintaining an equal class distribution. For each such post, the top-5 similar posts were retrieved using the similarity metric. Annotators were asked to judge the similarity between each retrieved post and the original post on a Likert-type scale from 1 to 5 (1 represents high dissimilarity while 5 represents high similarity between a pair of posts). Annotators were provided with guidelines for relevance judgments based on two questions: 'Is this post relevant to the original post in terms of medical information?' and 'Are the experiences and situations depicted in the pair of posts similar?'. A pair of posts is given a high similarity rating if both conditions are true, and a low rating if neither is true. Three annotators with post-graduate educational levels performed the annotations. We measure the inter-annotator agreement using Krippendorff's alpha metric [9], which was observed to be 0.78. Disagreements between the annotators can be explained on the basis of ambiguities encountered during the labeling task. We provide a few examples below:

1. There are cases where the original writer of the blog assigned a higher rating (denoting relevant), but the annotator disagreed on what constituted a 'relevant' post. This often corresponds to posts giving general advice for an illness, for example, 'You can take xanax in case of high stress. It worked for me.' Such advice may not be applicable to a specific situation.
2. Ambiguities are also observed in cases where the authors of the posts are of similar age, sex and socio-economic background, but have different health issues (for example, one post depicted a male teenager with severe health anxiety, while the other described a male teenager with social anxiety). For such cases, the similarity ratings varied.
3. Ratings also vary in cases where the symptoms match, but the cause and disorder differ. Annotators face problems in judging posts that do not contain enough medical information. For example, headache can be a symptom of many different diseases.
5 Experimental Results and Analysis

In this section, we report the evaluation results and present the necessary analysis.

5.1 Sentiment Classification
The classification model (described in Sect. 3.1) is trained on a dataset of 8,864 unique instances obtained after preprocessing. We define a baseline model by implementing the CNN-based system proposed in [12] under the same experimental conditions as our proposed architecture. We also develop a model based on an LSTM, and, to see the real impact of the third layer, we also report the performance of a CNN-LSTM model. The batch size for training was set to 32. Results of 5-fold cross-validation are shown in Table 2.

Table 2. Evaluation results of 5-fold cross-validation for sentiment classification.

Model            Accuracy  Cohen-Kappa  Macro Precision  Macro Recall  Macro F1-Score
Baseline [12]    0.63      0.443        0.661            0.643         0.652
LSTM             0.609     0.411        0.632            0.628         0.63
CNN-LSTM         0.6516    0.4559       0.6846           0.6604        0.6708
Proposed model   0.6919    0.4966       0.7179           0.7002        0.7089
The evaluation shows that the proposed model performs better than the baseline system and efficiently captures medical sentiment from social media posts. Table 2 shows the accuracy, Cohen-Kappa, precision, recall and F1 score of the proposed model as 0.6919, 0.4966, 0.7179, 0.7002 and 0.7089, respectively. In comparison to the baseline model this is approximately a 9.13% improvement in terms of all the metrics. Posts usually consist of medical events and experiences; therefore, capturing temporally related, spatially close features is required for inferring the overall health status, and the proposed CNN-LSTM-CNN network proves better at this than the other models. The high value of the Cohen-Kappa metric suggests that the proposed model indeed learns to classify posts into the 3 sentiment classes rather than making random guesses. A closer look at the classification errors revealed instances where the CNN and LSTM predict incorrectly, but the proposed model classifies correctly. Consider the following example, where the baseline and the LSTM both failed to classify correctly but the proposed model succeeded: 'I had a doctors appointment today. He told I was recovering and should be more optimistic. I am still anxious and stressed most of the time'. The baseline model and the LSTM classified
it as positive (possibly because of terms like 'recovering' and 'optimistic') while the proposed model classified it as negative. This shows that the proposed model can satisfactorily capture the contextual information and leverage it effectively for the classification task. To understand where our system fails, we perform a detailed error analysis, both quantitatively and qualitatively. We show the quantitative analysis in terms of the confusion matrix in Fig. 2.
Fig. 2. Confusion matrix of sentiment classification
Close scrutiny of the predicted and actual values of the test instances reveals that the majority of misclassifications occur in cases where the sentiment remains positive or negative throughout the post and suddenly changes at the end. For example: "I have been suffering from anxiety for a couple of years now. Doctors kept prescribing new and new medicines but I felt no change. Stress was unbearable. I lost my parents last year. The grief made me even worse. But I am currently feeling good after taking coQ10". We observe that the proposed model was confused in such cases. Moreover, users often share personal content that does not contribute much medically relevant information; such noise also contributes to misclassification.

Comparison with Existing Models: One of the recent works on medical sentiment analysis is reported in [12]. They trained and evaluated a CNN-based architecture separately for the medical condition and medication segments. As discussed in the dataset and experiment section, we merged the datasets related to medical conditions and medications into one for training and evaluation. Our definition of medical sentiment is thus more generic in nature, and a direct comparison to the existing system is not very meaningful. None of the works mentioned in the related works section addressed sentiment analysis for medical suggestion mining, and the experimental setups, datasets and sentiment classes used in those works are also very different.
5.2 Top-N Similar Post Retrieval
Evaluation of the retrieval task is done by comparing the similarity scores assigned to a pair of forum posts by the system and by a human annotator (as discussed
in Sect. 4.5). Our focus is to determine the correlation between the similarity scores assigned to the pairs of posts by human and system judgments (rather than the actual similarity values); that is, if a human feels that a post P is more relevant to post Q than post R, the system should behave in the same way. We therefore use the Pearson correlation coefficient for evaluation. The statistical significance of the correlation (2-tailed p-value from a t-test with the null hypothesis that the correlation occurred by chance) was found to be 0.00139, 0.0344 and 0.0186, respectively, for each sentiment class. Precision@5 is also calculated to evaluate the relevance of the retrieved forum posts. As annotation was done using the top-5 retrieved posts (as discussed in Sect. 4.5), Precision@10 could not be calculated. We design a baseline using the K-nearest-neighbour algorithm with a cosine similarity metric capturing only textual similarity. We show the results in Table 3. From the evaluation results, it is evident that the similarity scores assigned by the proposed system are more positively correlated with the human judgments than those of the baseline. The correlation can be considered statistically significant as the p-values corresponding to all the sentiment classes are less than 0.05. The better Precision@5 values indicate that a greater number of relevant posts are retrieved by the proposed approach compared to the baseline model.

Table 3. Evaluation of the top-n similar post retrieval. 'A' and 'B' denote the results of the proposed metric and of the K-nearest-neighbour baseline using the text similarity metric, respectively.

Sentiment   Pearson correlation    Precision@5        DCG5
            A        B             A        B         A        B
Positive    0.3586   0.2104        0.6638   0.5467    6.0106   2.1826
Neutral     0.3297   0.2748        0.5932   0.5734    5.3541   2.4353
Negative    0.3345   0.2334        0.623    0.5321    4.7361   2.4825
We also calculate the Discounted Cumulative Gain (DCG) [7] of the similarity scores for both models from the human judgments. The idea behind DCG is that highly relevant documents appearing lower in the ranking should be penalized. A logarithmic reduction factor was applied to the human relevance judgments, which were scaled from 0 to 4, and the DCG accumulated at rank position 5 was calculated with the following formula:

  DCG5 = Σ_{i=1}^{5} rel_i / log2(i + 1)    (7)

where rel_i is the relevance judgment of the post at position i. The NDCG could not be calculated, as annotation was done using only the top-5 retrieved posts (as discussed in Sect. 4.5).
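A small sketch of the DCG@5 computation in Eq. 7, assuming the 0-4 relevance judgments of the top-5 retrieved posts are given in rank order (the example values are hypothetical):

```python
import math

def dcg_at_5(relevances):
    """Discounted cumulative gain at rank 5 (Eq. 7); relevances are 0-4 judgments in rank order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:5], start=1))

# Example: a ranking whose top post was judged 4 and whose fifth was judged 1.
print(dcg_at_5([4, 3, 3, 2, 1]))
```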
During error analysis, we observe a few forum posts in which users share their personal feelings but, due to the low proportion of medically relevant content, the posts are labeled as irrelevant by the system even though they contain some relevant information that could be useful to end users. For example: 'Hello everyone, Good morning to all. I know I had been away for a couple of days. I went outing with my family to get away from the stress I had been feeling lately. Strolled thorugh the park, played tennis with kids and visited cool places nearby. U know Family is the best therapy for every problem. Still feeling a little bit anxious lately. Suggest me something'. This example blog contains proportionally more personal information than medically relevant information; however, 'Feeling a little bit anxious lately' is the medically relevant part of the post. Filtering out such content is therefore required for better performance and would help the system focus on the relevant content. There are two possible ways to tackle this problem, which we would like to look into in the future:

1. Increasing the weight of the medical information similarity (represented as MISim in Eq. 5) while computing the overall similarity score.
2. Identifying and removing personal, medically irrelevant content, possibly by designing a sequence labeling model (classifying relevant vs. irrelevant), by manually verifying the data, or by identifying certain phrases or snippets from the blog.
5.3 Treatment Suggestion
Evaluation of the treatment suggestion module is particularly challenging because it requires annotators with a high level of medical expertise. Moreover, to the best of our knowledge, there is no existing benchmark dataset for this evaluation. Hence, we are not able to provide a quantitative evaluation of the suggestion module. However, it is to be noted that our suggestion module relies on the soundness of the sentiment classification module, and the evaluation presented in the earlier section shows that our sentiment classifier has acceptable output quality. The task of a good treatment suggestion system is to mine the most relevant treatment suggestion for a candidate disease. As the function for computing the suggestion score (Eq. 6) involves computing the probability of positive sentiment given a treatment T and a disorder/disease D, it is always ensured that T is a candidate treatment for D, i.e. the treatment T produced positive results in the context of D in at least one case. In other words, the probability term ensures that irrelevant treatments, which did not give a positive result in the context of D, never appear as treatment suggestions for D. The efficiency of the suggestion module depends on the following three factors:
1. Apache cTAKES retrieved correct concepts in the majority of cases, with only a few exceptions, which are mostly ambiguous in nature. For example, the word 'basis' can denote a clinically-proven NAD+ supplement or can be used as a synonym of the word 'premise'.
2. If an irrelevant post is labeled as relevant by the system, then the suggestions should not contain treatments mentioned in that post. Thus, the similarity
metric plays an important role in picking the right treatment for a given candidate disease.
3. The value of the hyper-parameter τ (Eq. 6): as its value decreases, more candidate treatments are suggested by the system (see the sketch below).
The performance of the module can be augmented and tailored by tweaking the above parameters, depending on the practical application at hand.
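To make the roles of the probability term and of τ concrete, here is a minimal sketch of how such a suggestion score might be computed and thresholded; estimating the probability as a simple ratio of positive mentions, and all names and data below, are our assumptions rather than the paper's Eq. 6.

```python
def positive_probability(mentions):
    """Estimate P(positive sentiment | treatment T, disease D) as the fraction
    of posts mentioning T in the context of D that express positive sentiment.
    `mentions` is a list of sentiment labels, e.g. ["positive", "negative", ...]."""
    if not mentions:
        return 0.0
    return sum(1 for m in mentions if m == "positive") / len(mentions)

def suggest_treatments(candidate_mentions, tau=0.5):
    """Return treatments whose (illustrative) suggestion score exceeds tau.
    candidate_mentions maps each candidate treatment to its sentiment labels."""
    scores = {t: positive_probability(ms) for t, ms in candidate_mentions.items()}
    return {t: s for t, s in scores.items() if s > tau}

# Example: a treatment never mentioned positively for D can never be suggested.
mentions = {
    "cognitive behavioural therapy": ["positive", "positive", "negative"],
    "unrelated drug": ["negative", "negative"],
}
print(suggest_treatments(mentions, tau=0.5))  # only the first treatment survives
```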
6 Conclusion and Future Work
In this paper, we have established the usefulness of medical sentiment analysis for building a recommendation system that will assist in building a patient-assisted healthcare system. A deep learning model has been presented for classifying the medical sentiment expressed in a forum post into conventional polarity-based classes. We have empirically shown that the proposed architecture can satisfactorily capture sentiment from social media posts. We have also proposed a novel similarity metric for the retrieval of forum posts with similar medical experiences and sentiments. A novel treatment suggestion algorithm has also been proposed that utilizes our similarity metric along with the patient-treatment satisfaction ratings. We have performed a very detailed analysis of our model. In our work, we use the UMLS database because of its wide usage and acceptability as a standard database. We also point to other future work, such as annotating a dataset for treatment suggestions, which would widen the scope for machine learning, and developing a sequence labeling model to remove personal, medically irrelevant content. Our work serves as an initial study in harnessing the huge amounts of open, useful information available on medical forums. Acknowledgements. Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, implemented by Digital India Corporation (formerly Media Lab Asia).
References
1. Chee, B.W., Berlin, R., Schatz, B.: Predicting adverse drug events from personal health messages. In: AMIA Annual Symposium Proceedings, vol. 2011, pp. 217–226 (2011)
2. Denecke, K.: Sentiment analysis from medical texts. In: Health Web Science. HIS, pp. 83–98. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20582-3_10
3. Denecke, K., Deng, Y.: Sentiment analysis in medical settings: new opportunities and challenges. Artif. Intell. Med. 64(1), 17–27 (2015). https://doi.org/10.1016/j.artmed.2015.03.006, http://www.sciencedirect.com/science/article/pii/S0933365715000299
4. Eysenbach, G., Köhler, Ch.: What is the prevalence of health-related searches on the World Wide Web? Qualitative and quantitative analysis of search engine queries on the internet. In: AMIA Annual Symposium Proceedings, pp. 225–229 (2003)
5. Fox, S., Duggan, M.: Health Online. Pew Internet & American Life Project, Washington, DC (2013). Accessed 20 Nov 2013. http://www.pewinternet.org/Reports/2013/Health-online/Summary-of-Findings.aspx
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
7. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002). https://doi.org/10.1145/582415.582418
8. Kim, Y.: Convolutional neural networks for sentence classification. CoRR abs/1408.5882 (2014)
9. Krippendorff, K.: Computing Krippendorff's alpha-reliability (2011). https://repository.upenn.edu/asc_papers/43
10. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
11. Palotti, J.R.M., et al.: CLEF 2017 task overview: the IR task at the eHealth evaluation lab - evaluating retrieval methods for consumer health search. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, 11–14 September 2017 (2017). http://ceur-ws.org/Vol-1866/invited_paper_16.pdf
12. Yadav, S., Ekbal, A., Saha, S., Bhattacharyya, P.: Medical sentiment analysis using social media: towards building a patient assisted system. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, 7–12 May 2018 (2018)
13. Yang, C.C., Yang, H., Jiang, L., Zhang, M.: Social media mining for drug safety signal detection. In: Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, SHB 2012, pp. 33–40. ACM, New York (2012). https://doi.org/10.1145/2389707.2389714
Sentiment Analysis Through Finite State Automata

Serena Pelosi, Alessandro Maisto, Lorenza Melillo, and Annibale Elia

Department of Political and Communication Science, University of Salerno, Salerno, Italy
{spelosi,amaisto,amelillo,elia}@unisa.it
Abstract. The present research aims to demonstrate how powerful Finite State Automata (FSA) can be in a domain in which the vagueness of human opinions and the subjectivity of user generated contents make the automatic "understanding" of texts extremely hard. Assuming that the semantic orientation of sentences is based on the manipulation of sentiment words, we built from scratch, for the Italian language, a network of local grammars for the annotation of sentiment expressions and electronic dictionaries for the classification of more than 15,000 opinionated words. In the paper we explain in detail how we made use of FSA for both the automatic population of sentiment lexicons and the sentiment classification of real sentences.
Keywords: Finite State Automata · Sentiment Analysis · Contextual Valence Shifters · Sentiment lexicon · Electronic dictionary

1 Introduction
The Web 2.0, as an interactive medium, offers Internet users the opportunity to freely share thoughts and feelings with web communities. This kind of information is extremely important for the consumer decision-making process; we refer in particular to experience and search goods, and to e-commerce in general, if one considers to what extent the evaluation of product qualities is influenced by the past experiences of customers who have already tried the same goods and posted their opinions online. The automatic treatment of User Generated Contents becomes a relevant research problem when the huge volume of raw texts online makes their semantic content impossible for human operators to manage. As a matter of fact, the largest amount of online data is semistructured or unstructured and, as a result, its monitoring requires sophisticated Natural Language Processing (NLP) tools that must be able to pre-process it from a linguistic point of view and, then, automatically access its semantic content. The Sentiment Analysis research field can have a large impact on many commercial, Government and Business Intelligence applications. Examples are
ad-placement applications, Flame Detection Systems, Social Media Monitoring Systems, Recommendation Systems, Political Analysis, etc. However, it would be difficult for humans to read and summarize such a huge volume of data, and, in other respects, introducing machines to the semantic dimension of human language remains an open problem. In the present work we present a method which exploits Finite State Automata (FSA) with the purpose of building high-performance tools for Sentiment Analysis (Footnote 1). We computed the polarity of more than 15,000 Italian sentiment words, which have been semi-automatically listed in machine-readable electronic dictionaries, through a network of FSA that consists of a syntactic network of grammars composed of 125 graphs. We tested a strategy based on the systematic substitution of semantically oriented classes of (simple or compound) words into the same sentence patterns. The combined use of dictionaries and automata made it possible to apply our method to real text occurrences (Footnote 2). In Sect. 2 we mention the most used techniques for the automatic propagation of sentiment lexicons and for sentence annotation. Section 3 delineates our method, carried out through finite state technologies. Then, in Sects. 4 and 5 we go through our morphological and syntactic solutions to the mentioned challenges.
2 State of the Art
The core of this research consists of two distinct Sentiment Analysis tasks: at the word level, dictionary population and, at the sentence level, the annotation of complex expressions. In this paragraph we summarize other methods used in the literature to address those tasks. Many techniques have been discussed in the literature to perform Sentiment Analysis. They can be classified into lexicon-based methods, learning methods and hybrid methods. In Sentiment Analysis tasks the most effective indicators used to discover subjective expressions are adjectives or adjective phrases [67], but recently the use of adverbs [6], nouns [72] and verbs [55] has become quite common as well. Among the state-of-the-art methods used to build and test dictionaries we mention Latent Semantic Analysis (LSA) [41]; bootstrapping algorithms [65]; graph propagation algorithms [33,71]; conjunctions and morphological relations between adjectives [29]; Context Coherency [35]; and distributional similarity (Footnote 3) [79]. Pointwise Mutual Information (PMI) using seed words (Footnote 4) has been applied to sentiment lexicon propagation by [22,63,69–71]. It has been observed, indeed, that positive words often occur close to positive seed words, whereas negative words are likely to appear around negative seed words [69,70]. Learning and statistical methods for Sentiment Analysis usually make use of Support Vector Machines [52,57,80] or Naïve Bayes classifiers [37,68]. Finally, as regards the hybrid methods, we must cite the works of [1,9,10,24,42,60,64,76].

The largest part of the state-of-the-art work on polarity lexicons for Sentiment Analysis has been carried out on the English language. Italian lexical databases are mostly created by translating and adapting the English ones, SentiWordNet and WordNet-Affect among others. Among the works on the Italian language that deserve to be mentioned, [5] merged the semantic information belonging to existing lexical resources in order to obtain an annotated lexicon of senses for Italian, Sentix (Sentiment Italian Lexicon) (Footnote 5). Basically, MultiWordNet [58], the Italian counterpart of WordNet [20,47], has been used to transfer the polarity information associated with English synsets in SentiWordNet [19] to Italian synsets, thanks to the multilingual ontology BabelNet [53]. Every Sentix entry is described by information concerning its part of speech, its WordNet synset ID, a positive and a negative score from SentiWordNet, a polarity score (from -1 to 1) and an intensity score (from 0 to 1). [8] presented a lexical sentiment resource that contains polarized simple words, multiwords and idioms, annotated with polarity, intensity, emotion and domain labels (Footnote 6). [12] built a lexicon for the EVALITA 2014 task by collecting adjectives and adverbs (extracted from the De Mauro - Paravia Italian dictionary [11]) as well as nouns and verbs (from Sentix), and by classifying their polarity through the online Sentiment Analysis API provided by Ai Applied (Footnote 7). Another Italian sentiment lexicon is the one semi-automatically developed from ItalWordNet v.2, starting from a list of seed key-words classified manually [66]; it includes 24,293 neutral and polarized items distributed in XML-LMF format (Footnote 8). [30] achieved good results in the SentiPolC 2014 task by semi-automatically translating different lexicons into Italian, namely SentiWordNet, the Hu-Liu Lexicon, the AFINN Lexicon and the Whissel Dictionary, among others.

As regards the works on lexicon propagation, we mention three main research lines. The first one is grounded on the richness of already existing thesauri, WordNet (Footnote 9) [47] among others. The second approach is based on the hypothesis that words conveying the same polarity appear close together in the same corpus, so the propagation can be performed on the basis of co-occurrence algorithms [4,36,61,69,78]. Finally, the morphological approach employs morphological structures and relations for the assignment of prior sentiment polarities to unknown words, on the basis of the manipulation of the morphological structures of known lemmas (Footnote 10) [40,50,77]. However, it does not seem to be enough to simply have sentiment dictionaries at one's disposal. Actually, the syntactic structures in which the opinionated lemmas occur have a strong impact on the resulting polarity of the sentences. That is the case of negation, intensification, irrealis markers and conditional tenses. Rule-based approaches that take into account the syntactic dimension of Sentiment Analysis are [49,51,81]. FSA have been used for the linguistic analysis of sentiment expressions by [3,27,43].

Footnote 1: Annibale Elia, Lorenza Melillo and Alessandro Maisto worked on the Conclusion of the paper, while Serena Pelosi worked on the Introduction and Paragraphs 1, 2, 3 and 4.
Footnote 2: We chose a rule-based method, among others, in order to verify the hypothesis that words can be classified together in accordance with both semantic and syntactic criteria.
Footnote 3: Word Similarity is a very frequently used method in dictionary propagation among the thesaurus-based approaches. Examples are the Maryland dictionary, created thanks to a Roget-like thesaurus and a handful of affixes [48], and other lexicons based on WordNet, like SentiWordNet, built on the basis of a quantitative analysis of the glosses associated with synsets [17,18], or other lexicons based on computing a distance measure on WordNet [17,34].
Footnote 4: Seed words are words which are strongly associated with a positive/negative meaning, such as eccellente ("excellent") or orrendo ("horrible"), by which it is possible to build a bigger lexicon, detecting other words that frequently occur alongside them.
Footnote 5: http://valeriobasile.github.io/twita/downloads.html
Footnote 6: https://www.celi.it/
Footnote 7: http://ai-applied.nl/sentiment-analysis-api
Footnote 8: http://hdl.handle.net/20.500.11752/ILC-73
3 Methodology
The present research has been grounded on the richness, in terms of lexical and grammatical resources, of the linguistic databases built in the Department of Political and Communication Science (DSPC) of the University of Salerno by the Computational Linguistics lab "Maurice Gross", which has studied language formalization since 1981 [16,73]. These resources take the shape of lexicon-grammar tables, which cross-check the lexicon and the syntax of any given language, in this case Italian; domain-independent machine-readable dictionaries; and inflectional and derivational local grammars in the form of finite state automata. Differently from other lexicon-based Sentiment Analysis methods, our approach has been grounded on the solidity of the Lexicon-Grammar resources
Footnote 9: Although WordNet does not include semantic orientation information for its lemmas, semantic relations, such as synonymy or antonymy, are commonly used in order to automatically propagate the polarity, starting from a manually annotated set of seed words [2,13,18,28,31,34,39,45]. This approach presents some drawbacks, such as the lack of scalability, the unavailability of sufficient resources for many languages and the difficulty of handling newly coined words, which are not already contained in the thesauri.
Footnote 10: Morphemes allow not only the propagation of a given word polarity (e.g. en-, -ous, -fy), but also its switching (e.g. dis-, -less), its intensification (e.g. super-, over-) and its weakening (e.g. semi-) [54].
and classifications [16,73], which provide fine-grained semantic as well as syntactic descriptions of the lexical entries. Such lexically exhaustive grammars distance themselves from the tendency of other sentiment resources to classify together words that have nothing in common from the syntactic point of view. In the present work, we started from the annotation of a small-sized dictionary of opinionated words (Footnote 11). FSA are used both in the morphological expansion of the lexicon and in the syntactic modeling of the words in context. In this research we assume that words can be classified together not only on the basis of their semantic content, but also according to syntactic criteria. Thanks to finite state technologies, we computed the polarity of individual words by systematically replacing them with other items (endowed with the same and/or a different individual polarity) in many sentence (or phrase) patterns. The hypothesis is that classes of words characterized by the same individual annotation can be generalized when considered in different syntactic contexts, because they undergo the same semantic shifting when occurring in similar patterns. Dictionaries and FSA used in tandem made it possible to verify these assumptions on real corpora.
3.1 Local Grammars and Finite-State Automata
Sentiment words, multiwords and idioms used in this work are listed in Nooj electronic dictionaries, while the local grammars (Footnote 12) used to manipulate their polarities are formalized through Finite State Automata. Electronic dictionaries have been exploited in order to list and to semantically and syntactically classify, in a machine-readable format, the sentiment lexical resources. The computational power of Nooj graphs has, instead, been used to represent the acceptance or rejection of the semantic and syntactic properties through the use of constraints and restrictions. Finite-State Automata (FSA) are abstract devices characterized by a finite set of nodes or "states" connected to one another by transitions that allow us to determine the sequences of symbols related to a particular path. These graphs are read from left to right, or rather, from the initial state to the final state [26].
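To give a flavour of the formalism, the toy automaton below recognises one simple pattern of the kind such local grammars encode (an optional negation followed by a polar adjective); the states, the tiny lexicon and the transition table are invented for illustration and are far simpler than the actual 125-graph network.

```python
# Minimal finite-state automaton sketch: accepts "non? + polar adjective".
NEGATIONS = {"non"}
POLAR_ADJ = {"bello", "brutto", "ottimo", "pessimo"}

# Transition table: state -> {symbol class -> next state}
TRANSITIONS = {
    "q0": {"NEG": "q1", "ADJ": "q_accept"},
    "q1": {"ADJ": "q_accept"},
}

def classify_token(token):
    if token in NEGATIONS:
        return "NEG"
    if token in POLAR_ADJ:
        return "ADJ"
    return "OTHER"

def accepts(tokens):
    """Return True if the token sequence is recognised by the automaton."""
    state = "q0"
    for token in tokens:
        state = TRANSITIONS.get(state, {}).get(classify_token(token))
        if state is None:
            return False
    return state == "q_accept"

print(accepts(["non", "bello"]))  # True
print(accepts(["bello"]))         # True
print(accepts(["non", "molto"]))  # False
```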
Footnote 11: While compiling the dictionary, the judgment on a word's "prior polarity" is given without considering any textual context. The entries of the sentiment dictionary receive the same annotation and are grouped together if they possess the same semantic orientation. The Prior Polarity [56] refers to the individual word's Semantic Orientation (SO) and differs from the SO because it is always independent of the context.
Footnote 12: Local grammars are algorithms that, through grammatical, morphological and lexical instructions, are used to formalize linguistic phenomena and to parse texts. They are defined "local" because, despite any generalization, they can be used only in the description and analysis of limited linguistic phenomena.
3.2 SentIta and Its Manually-Built Resources
In this Paragraph we briefly describe the sentiment lexicon, available for the Italian language, which has been semi-automatically created on the basis of the resources of the DSPC. The tagset used for the Prior Polarity annotation of the resources is composed of four tags: POS (positive), NEG (negative), FORTE (intense) and DEB (weak). Such labels, if combined together, can generate an evaluation scale that goes from –3 to +3 and a strength scale that ranges from –1 to +1. Neutral words (e.g. nuovo "new", with score 0 in the evaluation scale) have been excluded from the lexicon (Footnote 13). In our resources, adjectives and bad words have been manually extracted and evaluated starting from the Nooj Italian electronic dictionary of simple words, preserving their inflectional (FLX) and derivational (DRV) properties. Moreover, compound adverbs [15], idioms [74,75] and verbs [14,16] have been weighted starting from the Italian Lexicon-Grammar tables (Footnote 14), in order to maintain the syntactic, semantic and transformational properties connected to each one of them.
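One possible reading of this tagset is sketched below; the exact way SentIta combines the four tags into a single score is not spelled out here, so the mapping in the code is our assumption, shown only to illustrate how two polarity tags and two strength tags can span a -3..+3 scale.

```python
# Hypothetical mapping from SentIta-style tags to the -3..+3 evaluation scale;
# the authors' actual combination rules may differ from this illustration.
def evaluation_score(tags):
    base = {"POS": +2, "NEG": -2}
    shift = {"FORTE": 1, "DEB": -1}
    polarity = next((t for t in tags if t in base), None)
    if polarity is None:
        return 0  # neutral words are excluded from the lexicon
    score = base[polarity]
    for t in tags:
        if t in shift:
            # intensify/attenuate in the direction of the polarity
            score += shift[t] if score > 0 else -shift[t]
    return max(-3, min(3, score))

print(evaluation_score({"POS", "FORTE"}))  # +3
print(evaluation_score({"NEG", "DEB"}))    # -1
```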
4 Morphology
In this paragraph we describe how FSA have been exploited to enrich the sentiment lexical resources. The adjectives have been used as the starting point for the expansion of the sentiment lexicon, on the basis of the morpho-phonological relations that connect words and their meanings (Footnote 15). Thanks to a morphological FSA it has been possible to enlarge the size of SentIta: more than 5,000 labeled adjectives have been used to predict the orientation of the adverbs with which they are morphologically related. All the adverbs contained in the Italian dictionary of simple words have been used as input, and a morphological FSA has been used to quickly populate the new dictionary by extracting the words ending with the suffix -mente ("-ly") and by making such words inherit the adjectives' polarity. The Nooj annotations consisted of
Footnote 13: The main difference between the words listed in the two scales is the possibility of using them as indicators for subjectivity detection: basically, the words belonging to the evaluation scale are "anchors" that start the identification of polarized phrases or sentences, while the ones belonging to the strength scale are just used as intensity modifiers (see Paragraph 5.3).
Footnote 14: Available for consultation at http://dsc.unisa.it/composti/tavole/combo/tavole.asp.
Footnote 15: The morphological method could also be applied to Italian verbs, but we chose to avoid this solution because of the complexity of their argument structures. We decided, instead, to manually evaluate all the verbs described in the Italian Lexicon-Grammar binary tables, so we could preserve the different lexical, syntactic and transformational rules connected to each one of them [16].
188
S. Pelosi et al.
a list of 3,200+ adverbs that, at a later stage, have been manually checked in order to correct the grammar's mistakes (Footnote 16). In detail, the Precision achieved in this task is 99% and the Recall is 88%. The derivation of quality nouns from qualifier adjectives is another derivational phenomenon of which we took advantage for the automatic enlargement of SentIta. These kinds of nouns make it possible to treat as entities the qualities expressed by the base adjectives. A morphological FSA, following the same idea as the adverb grammar, matches in a list of abstract nouns the stems that are in morpho-phonological relation with our list of hand-tagged adjectives. Because the nouns, differently from the adverbs, need to have their inflection information specified, we associated to each suffix entry, in an electronic dictionary dedicated to the suffixes of quality nouns, the inflectional paradigm that it gives to the words with which it occurs. In order to effortlessly build a noun dictionary of sentiment words we first exploited the hand-made list of nominalizations of the psychological verbs [25,27,46].

Table 1. Analytical description of the most productive quality noun suffixes.

Suffix      Inflection   Correct   Precision
-ità        N602         666       98%
-mento      N5           514       90%
-(z)ione    N46          359       86%
-ezza       N41          305       99%
-enza       N41          148       94%
-ia         N41          145       98%
-ura        N41          142       88%
-aggine     N46          72        97%
-eria       N41          71        95%
-anza       N41          57        86%
TOT         –            2579      93%
Footnote 16: The meaning of the deadjectival adverbs in -mente is not always predictable starting from the base adjectives from which they are derived. The syntactic structures in which they occur also influence their interpretation. Depending on their position in sentences, the deadjectival adverbs can be described as adjective modifiers (e.g. altamente "highly"), predicate modifiers (e.g. perfettamente "perfectly") or sentence modifiers (e.g. ultimamente "lately").
As regards the suffixes used to form the quality nouns (Table 1) [62], it must be said that they generally make the new words simply inherit the orientation of the adjectives from which they derive. Exceptions are -edine and -eria, which almost always shift the polarity of the quality nouns into the weakly negative one (–1), e.g. faciloneria "slapdash attitude". Also the suffix -mento differs from the others, insofar as it belongs to the derivational phenomenon of the deverbal nouns of action [21]. It has been possible to use it in our grammar for the deadjectival noun derivation by using the past participles of the verbs listed in the adjective dictionary of sentiment (e.g. V: sfinire "to wear out", A: sfinito "worn out", N: sfinimento "weariness"). The Precision achieved in this task is 93%. In this work we also drew up an FSA which can interact, at the morphological level, with a list of prefixes able to negate (e.g. anti-, contra-, non-, among others) or to intensify/downtone (e.g. arci-, semi-, among others) the orientation of the words in which they appear [32].
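The derivational strategy described in this section can be illustrated with a small rule-based sketch; the seed entries, the suffix and prefix lists and the inheritance rules below are simplified assumptions and do not reproduce the actual SentIta dictionaries or grammars.

```python
# Hypothetical sketch of morphological polarity propagation: -mente adverbs and
# quality nouns (-ezza, -ità, -mento, ...) inherit the polarity of the base
# adjective, a weakening suffix pushes the score to -1, and negating prefixes
# switch the sign.
SEED_ADJECTIVES = {"eccellente": +3, "orrendo": -3, "sfinito": -2}

WEAKENING_SUFFIXES = ("eria", "edine")        # shift towards weakly negative
NEGATING_PREFIXES = ("anti", "contra", "non")

def propagate(word, base_score):
    """Guess the polarity of `word`, derived from an adjective with base_score."""
    score = base_score                         # default: plain inheritance
    if word.endswith(WEAKENING_SUFFIXES):
        score = -1                             # e.g. facilon-eria "slapdash attitude"
    if word.startswith(NEGATING_PREFIXES):
        score = -score                         # e.g. anti-, contra-, non-
    return score

print(propagate("eccellentemente", SEED_ADJECTIVES["eccellente"]))  # +3
print(propagate("sfinimento", SEED_ADJECTIVES["sfinito"]))          # -2
```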
5 Syntax
Contextual Valence Shifters are linguistic devices able to change the prior polarity of words when co-occurring with them in the same context [38,59]. In this work we handle the contextual shifting by generalizing all the polar words that possess the same prior polarity. A network of local grammars has been designed on a set of rules that compute the words' individual polarity scores according to the contexts in which they occur. In general, the sentence annotation is performed through an Enhanced Recursive Transition Network, by using six different metanodes (Footnote 17) which, working as containers for the sentiment expressions, assign equal labels to the patterns embedded in the same graphs. Among the most used Contextual Valence Shifters we took into account linguistic phenomena like Intensification, Negation, Modality and Comparison. Moreover, we formalized some classes of frozen sentences that modify the polarity of the sentiment words that occur in them. Our network of 15,000 opinionated lemmas and 125 embedded FSA has been tested on a multi-domain corpus of customer reviews (Footnote 18), achieving an average Precision of 75% and a Recall of 73% in the sentence-level sentiment classification task.
Footnote 17: Metanodes are labeled with the six corresponding values of the evaluation scale, which goes from –3 to +3.
Footnote 18: The dataset contains Italian opinionated texts in the form of user reviews and comments from e-commerce and opinion websites; it consists of 600 text units (50 positive and 50 negative for each product class) and refers to six different domains, for all of which different websites (such as www.ciao.it, www.amazon.it, www.mymovies.it, www.tripadvisor.it) have been exploited [44].
5.1 Opinionated Idioms
More than 500 Italian frozen sentences containing adjectives [74,75] have been evaluated and then formalised with a dictionary-grammar pair. Among the idioms considered are the comparative frozen sentences of the type N0 Agg come C1, described by [74], which usually intensify the polarity of the sentiment adjective they contain, as happens in (1).

(1) Mary è bella [+2] come il sole [+3] "Mary is as beautiful as the sun"

Otherwise, it is also possible for an idiom of that sort to be polarised when the adjective contained in it (e.g. bianco, "white") is neutral (2), or even to reverse its polarity, as happens in (3) (e.g. agile, "agile", is positive). In that regard, it is interesting to notice that 84% of the idioms have a clear SO, while just 36% of the adjectives they contain are polarised (Footnote 19).

(2) Mary è bianca [0] come un cadavere [–2] "Mary is as white as a dead body" (Mary is pale)

(3) Mary è agile [+2] come una gatta di piombo [–2] "Mary is as agile as a lead cat" (Mary is not agile)
5.2 Negation
As regards negation, we included in our grammar negative operators (e.g. non, "not"; mica, per niente, affatto, "not at all"), negative quantifiers (e.g. nessuno, "nobody"; niente, nulla, "nothing") and lexical negation (e.g. senza, "without"; mancanza di, assenza di, carenza di, "lack of") [7]. As exemplified in the following sentences, negation indicators do not always switch a sentence's polarity into its positive or negative counterpart (4); they often have the effect of increasing or decreasing the sentence score (5). That is why we prefer to talk about valence "shifting" rather than "switching".

(4) Citroen non [neg] produce auto valide [+2] [–2] "Citroen does not produce efficient cars"

(5) Grafica non proprio [neg] spettacolare [+3] [–2] "The graphics are not quite spectacular"
Footnote 19: Other idioms included in our resources are of the kind N0 essere (Agg + Ppass) Prep C1 (e.g. Max è matto da legare, "Max is so crazy he should be locked up"); N0 essere Agg e Agg (e.g. Max è bello e fritto, "Max is cooked"); C0 essere Agg (come C1 + E) (e.g. Mary ha la coscienza sporca ↔ La coscienza è sporca, "Mary has a guilty conscience" ↔ "The conscience is guilty"); and N0 essere C1 Agg (e.g. Mary è una gatta morta, "Mary is a cock tease").
5.3 Intensification
We included the Intensification rules in our grammar network, firstly, by combining the words belonging to the strength scale (tags FORTE/DEB) with the sentiment words listed in the evaluation scale (tags POS/NEG) (Footnote 20). Besides, the repetition of more than one negative or positive word, or the use of absolute superlative affixes, also has the effect of increasing the words' Prior Polarity. In general, adverbs intensify or attenuate adjectives, verbs and other adverbs, while adjectives modify the intensity of nouns. Intensification and negation can also appear together in the same sentence (a small sketch of how such shifting rules compose is given below).
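As an informal illustration of how negation and intensification rules can compose, consider the following sketch, tuned so that it reproduces examples (4) and (5) above; the real system encodes these effects through embedded graphs rather than arithmetic, so the numeric rules here are our own simplification.

```python
# Hypothetical contextual-valence-shifting sketch on the -3..+3 scale.
def shift_score(prior, negated=False, intensifier=0):
    """prior       -- prior polarity of the sentiment word (-3..+3)
    negated     -- True if a negation operator occurs in the same pattern
    intensifier -- +1 for an intensifier (FORTE), -1 for a downtoner (DEB)
    """
    magnitude = max(0, abs(prior) + intensifier)
    score = magnitude if prior >= 0 else -magnitude
    if negated:
        score = -score
    return max(-3, min(3, score))

print(shift_score(+2, negated=True))                  # example (4): +2 -> -2
print(shift_score(+3, negated=True, intensifier=-1))  # example (5): +3 -> -2
```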
5.4 Modality
According to [7], modality can be used to express possibility, necessity, permission, obligation or desire, through grammatical cues such as adverbial phrases (e.g. "maybe", "certainly"), conditional verbal moods, some verbs (e.g. "must", "can", "may") and some adjectives and nouns (e.g. "a probable cause"). When computing the Prior Polarities of the SentIta items in their textual context, we considered that modality can also have a significant impact on the SO of sentiment expressions. Following the literature, but without specifically adopting the modality categories of [7], we included in the FSA dedicated to modality the following linguistic cues and made them interact with the SentIta expressions: sharpening and softening adverbs, modal verbs, and conditional and imperfect tenses. Examples of modality in our work are the following:

– "Potere" + Indicative Imperfect + Oriented Item:
(6) Poteva [Modal+IM] essere una trama interessante [+2] [–1] "It could be an interesting plot"
– "Potere" + Indicative Imperfect + Comparative + Oriented Items:
(7) Poteva [Modal+IM] andare peggio [I-OpW +2] [–1] "It might have gone worse"
– "Dovere" + Indicative Imperfect:
(70) Questo doveva [Modal+IM] essere un film di sfumature [+1] [–2] "This one was supposed to be a nuanced movie"
– "Dovere" + "Potere" + Past Conditional:
(71) Non [Negation] avrei [Aux+C] dovuto [Modal+PP] buttare via i miei soldi [–2] "I should not have burnt my money"
Footnote 20: Words that, at first glance, seem to be intensifiers but at a deeper analysis reveal a more complex behavior are abbastanza ("enough"), troppo ("too much") and poco ("not much"). In this research we noticed as well that the co-occurrence of troppo, poco and abbastanza with polar lexical items can provoke, in their semantic orientation, effects that can be associated with other contextual valence shifters. The ad hoc rules dedicated to these words are not actually new, but refer to other contextual valence shifting rules that have been discussed in this Paragraph.
5.5 Comparison
Sentences that express a comparison generally carry along with them opinions about two or more entities with regard to their shared features or attributes [23]. As far as the comparative sentences are concerned, we considered in this work the already mentioned comparative frozen sentences of the type N0 Agg come C1; some simple comparative sentences that involve the expressions meglio di, migliore di ("better than"), peggio di, peggiore di ("worse than"), superiore a ("superior to") and inferiore a ("less than"); and the comparative superlative. The comparison with other products has been evaluated with the same measures as the other sentiment expressions, so the polarity can range from –3 to +3.
5.6 Other Sentiment Expressions
In order to reach high levels of Recall, the lexicon-based patterns also require the support of lexicon-independent expressions. In our work, we listed and computed many cases in which expressions that do not involve the words contained in our dictionaries are sentiment indicators as well. This is a case in which one can see the importance of Finite-State Automata: without them it would be really difficult and uneconomical for a programmer to provide the machine with concise instructions to correctly recognise and evaluate opinionated sentences that can often reach high levels of variability. Examples of patterns of this kind are valerne la pena [+2], "to be worthwhile"; essere (dotato + fornito + provvisto) di [+2], "to be equipped with"; grazie a [+2], "thanks to"; essere un (aspetto + nota + cosa + lato) negativo [–2], "to be a negative side"; non essere niente di che [–1], "to be nothing special"; tradire le (aspettative + attese + promesse) [–2], "not to live up to one's expectations"; etc. For simplicity, in the present work we put in this node of the grammar the sentences that involve frozen or semi-frozen expressions and words that, for the moment, are not part of the dictionaries.
6 Conclusion
In this paper we gave our contribution to two of the most challenging tasks of the Sentiment Analysis field: lexicon propagation and sentence-level semantic annotation. The necessity to quickly monitor huge quantities of semistructured and unstructured data from the web poses several challenges to Natural Language Processing, which must provide strategies and tools to analyze these data from the lexical, syntactic and semantic points of view. Unlike many other Italian and English sentiment lexicons, SentIta, built on the interaction of electronic dictionaries and lexicon-dependent local grammars, is able to manage simple and multiword structures, which can take the shape of distributionally free structures, distributionally restricted structures and frozen structures.
In accordance with the major contributions in the Sentiment Analysis literature, we did not consider polar words in isolation. We computed their elementary sentence contexts, with the allowed transformations, and, then, their interaction with contextual valence shifters, the linguistic devices that are able to modify the prior polarity of the words from SentIta when occurring with them in the same sentences. In order to do so, we took advantage of the computational power of finite-state technology. We formalized a set of rules that handle intensification, downtoning and negation, modality detection and the analysis of comparative forms. Here, the difference with other state-of-the-art strategies consists in the elimination of complex mathematical calculations in favor of the easier use of embedded graphs as containers for the expressions designed to receive the same annotations, within a compositional framework.
References 1. Andreevskaia, A., Bergler, S.: When specialists and generalists work together: overcoming domain dependence in sentiment tagging. In: ACL, pp. 290–298 (2008) 2. Argamon, S., Bloom, K., Esuli, A., Sebastiani, F.: Automatically determining attitude type and force for sentiment analysis, pp. 218–231 (2009) 3. Balibar-Mrabti, A.: Une ´etude de la combinatoire des noms de sentiment dans une grammaire locale. Langue fran¸caise, pp. 88–97 (1995) 4. Baroni, M., Vegnaduzzo, S.: Identifying subjective adjectives through web-based mutual information, vol. 4, pp. 17–24 (2004) 5. Basile, V., Nissim, M.: Sentiment analysis on Italian tweets. In: Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 100–107 (2013) 6. Benamara, F., Cesarano, C., Picariello, A., Recupero, D.R., Subrahmanian, V.S.: Sentiment analysis: adjectives and adverbs are better than adjectives alone. In: ICWSM (2007) 7. Benamara, F., Chardon, B., Mathieu, Y., Popescu, V., Asher, N.: How do negation and modality impact on opinions? pp. 10–18 (2012) 8. Bolioli, A., Salamino, F., Porzionato, V.: Social media monitoring in real life with blogmeter platform. In: ESSEM@ AI* IA 1096, 156–163 (2013) 9. Dang, Y., Zhang, Y., Chen, H.: A lexicon-enhanced method for sentiment classification: An experiment on online product reviews. In: Intelligent Systems, IEEE, vol. 25, pp. 46–53. IEEE (2010) 10. Dasgupta, S., Ng, V.: Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-vol. 2, pp. 701–709. Association for Computational Linguistics (2009) 11. De Mauro, T.: Dizionario italiano. Paravia, Torino (2000) 12. Di Gennaro, P., Rossi, A., Tamburini, F.: The ficlit+ cs@ unibo system at the evalita 2014 sentiment polarity classification task. In: Proceedings of the Fourth International Workshop EVALITA 2014 (2014) 13. Dragut, E.C., Yu, C., Sistla, P., Meng, W.: Construction of a sentimental word dictionary, pp. 1761–1764 (2010) 14. Elia, A.: Le verbe italien. Les compl´etives dans les phrases ` aa un compl´ement (1984)
15. Elia, A.: Chiaro e tondo: Lessico-Grammatica degli avverbi composti in italiano. Segno Associati (1990) 16. Elia, A., Martinelli, M., D’Agostino, E.: Lessico e Strutture sintattiche. Liguori, Introduzione alla sintassi del verbo italiano. Napoli (1981) 17. Esuli, A., Sebastiani, F.: Determining the semantic orientation of terms through gloss classification. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 617–624. ACM (2005) 18. Esuli, A., Sebastiani, F.: Determining term subjectivity and term orientation for opinion mining vol. 6, p. 2006 (2006) 19. Esuli, A., Sebastiani, F.: SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of LREC, vol. 6, pp. 417–422 (2006) 20. Fellbaum, C.: WordNet. Wiley Online Library (1998) 21. Gaeta, L.: Nomi d’azione. La formazione d elle parole in italiano. T¨ ubingen: Max Niemeyer Verlag, pp. 314–351 (2004) 22. Gamon, M., Aue, A.: Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms, pp. 57–64 (2005) 23. Ganapathibhotla, M., Liu, B.: Mining opinions in comparative sentences. In: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1. pp. 241–248. Association for Computational Linguistics (2008) 24. Goldberg, A.B., Zhu, X.: Seeing stars when there aren’t many stars: graph-based semi-supervised learning for sentiment categorization. In: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 45–52. Association for Computational Linguistics (2006) 25. Gross, M.: Les bases empiriques de la notion de pr´edicat s´emantique. Langages, pp. 7–52 (1981) 26. Gross, M.: Les phrases fig´ees en fran¸cais. In: L’information grammaticale, vol. 59, pp. 36–41. Peeters (1993) 27. Gross, M.: Une grammaire locale de l’expression des sentiments. Langue fran¸caise, pp. 70–87 (1995) 28. Hassan, A., Radev, D.: Identifying text polarity using random walks, pp. 395–403 (2010) 29. Hatzivassiloglou, V., McKeown, K.R.: Predicting the semantic orientation of adjectives. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 174–181. Association for Computational Linguistics (1997) 30. Hernandez-Farias, I., Buscaldi, D., Priego-S´ anchez, B.: Iradabe: adapting English lexicons to the Italian sentiment polarity classification task. In: First Italian Conference on Computational Linguistics (CLiC-it 2014) and the fourth International Workshop EVALITA2014, pp. 75–81 (2014) 31. Hu, M., Liu, B.: Mining and summarizing customer reviews, pp. 168–177 (2004) 32. Iacobini, C.: Prefissazione. La formazione delle parole in italiano. T¨ ubingen: Max Niemeyer Verlag, pp. 97–161 (2004) 33. Kaji, N., Kitsuregawa, M.: Building lexicon for sentiment analysis from massive collection of html documents. In: EMNLP-CoNLL, pp. 1075–1083 (2007) 34. Kamps, J., Marx, M., Mokken, R.J., De Rijke, M.: Using wordnet to measure semantic orientations of adjectives (2004) 35. Kanayama, H., Nasukawa, T.: Fully automatic lexicon expansion for domainoriented sentiment analysis. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 355–363. Association for Computational Linguistics (2006)
36. Kanayama, H., Nasukawa, T.: Fully automatic lexicon expansion for domainoriented sentiment analysis, p. 355 (2006) 37. Kang, H., Yoo, S.J., Han, D.: Senti-lexicon and improved na¨ıve bayes algorithms for sentiment analysis of restaurant reviews. In: Expert Systems with Applications, vol. 39, pp. 6000–6010. Elsevier (2012) 38. Kennedy, A., Inkpen, D.: Sentiment classification of movie reviews using contextual valence shifters. Comput. Intell. 22(2), 110–125 (2006) 39. Kim, S.M., Hovy, E.: Determining the sentiment of opinions, p. 1367 (2004) 40. Ku, L.W., Huang, T.H., Chen, H.H.: Using morphological and syntactic structures for Chinese opinion analysis, pp. 1260–1269 (2009) 41. Landauer, T.K., Dumais, S.T.: A solution to plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. In: Psychological Review, vol. 104, p. 211. American Psychological Association (1997) 42. Li, F., Huang, M., Zhu, X.: Sentiment analysis with global topics and local dependency. In: AAAI (2010) 43. Maisto, A., Pelosi, S.: Feature-based customer review summarization. In: Meersman, R., et al. (eds.) OTM 2014. LNCS, vol. 8842, pp. 299–308. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45550-0 30 44. Maisto, A., Pelosi, S.: A lexicon-based approach to sentiment analysis. the Italian module for Nooj. In: Proceedings of the International Nooj 2014 Conference, University of Sassari, Italy. Cambridge Scholar Publishing (2014) 45. Maks, I., Vossen, P.: Different approaches to automatic polarity annotation at synset level, pp. 62–69 (2011) 46. Mathieu, Y.Y.: Les pr´edicats de sentiment. Langages, pp. 41–52 (1999) 47. Miller, G.A.: Wordnet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995) 48. Mohammad, S., Dunne, C., Dorr, B.: Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-vol. 2, pp. 599–608. Association for Computational Linguistics (2009) 49. Moilanen, K., Pulman, S.: Sentiment composition, pp. 378–382 (2007) 50. Moilanen, K., Pulman, S.: The good, the bad, and the unknown: morphosyllabic sentiment tagging of unseen words, pp. 109–112 (2008) 51. Mulder, M., Nijholt, A., Den Uyl, M., Terpstra, P.: A lexical grammatical implementation of affect, pp. 171–177 (2004) 52. Mullen, T., Collier, N.: Sentiment analysis using support vector machines with diverse information sources. In: EMNLP, vol. 4, pp. 412–418 (2004) 53. Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network, vol. 193, pp. 217–250 (2012) 54. Neviarouskaya, A.: Compositional approach for automatic recognition of finegrained affect, judgment, and appreciation in text (2010) 55. Neviarouskaya, A., Prendinger, H., Ishizuka, M.: Compositionality principle in recognition of fine-grained emotions from text. In: ICWSM (2009) 56. Osgood, C.E.: The nature and measurement of meaning. Psychol. Bull. 49(3), 197 (1952) 57. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, vol. 10, pp. 79–86. Association for Computational Linguistics (2002)
58. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: developing an aligned multilingual database. In: Proceedings of the first international conference on global WordNet, vol. 152, pp. 55–63 (2002) 59. Polanyi, L., Zaenen, A.: Contextual valence shifters, pp. 1–10 (2006) 60. Prabowo, R., Thelwall, M.: Sentiment analysis: a combined approach. J. Inf. 3, 143–157 (2009) 61. Qiu, G., Liu, B., Bu, J., Chen, C.: Expanding domain sentiment lexicon through double propagation. vol. 9, pp. 1199–1204 (2009) 62. Rainer, F.: Derivazione nominale deaggettivale. La formazione delle parole in italiano, pp. 293–314 (2004) 63. Rao, D., Ravichandran, D.: Semi-supervised polarity lexicon induction. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 675–682. Association for Computational Linguistics (2009) 64. Read, J., Carroll, J.: Weakly supervised techniques for domain-independent sentiment classification. In: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, pp. 45–52. ACM (2009) 65. Riloff, E., Wiebe, J., Wilson, T.: Learning subjective nouns using extraction pattern bootstrapping. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4. pp. 25–32. Association for Computational Linguistics (2003) 66. Russo, I., Frontini, F., Quochi, V.: OpeNER sentiment lexicon italian - LMF (2016). http://hdl.handle.net/20.500.11752/ILC-73, digital Repository for the CLARIN Research Infrastructure provided by ILC-CNR 67. Taboada, M., Anthony, C., Voll, K.: Methods for creating semantic orientation dictionaries. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genova, Italy, pp. 427–432 (2006) 68. Tan, S., Cheng, X., Wang, Y., Xu, H.: Adapting naive bayes to domain adaptation for sentiment analysis. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 337–349. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00958-7 31 69. Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, pp. 417–424 (2002) 70. Turney, P.D., Littman, M.L.: Measuring praise and criticism: inference of semantic orientation from association. ACM Trans. Inf. Syst. (TOIS) 21, 315–346 (2003) 71. Velikovich, L., Blair-Goldensohn, S., Hannan, K., McDonald, R.: The viability of web-derived polarity lexicons. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 777–785. Association for Computational Linguistics (2010) 72. Vermeij, M.: The orientation of user opinions through adverbs, verbs and nouns. In: 3rd Twente Student Conference on IT, Enschede June (2005) 73. Vietri, S.: The Italian module for Nooj. In: In Proceedings of the First Italian Conference on Computational Linguistics, CLiC-it 2014. Pisa University Press (2014) 74. Vietri, S.: On some comparative frozen sentences in Italian. Lingvisticæ Investigationes 14(1), 149–174 (1990) 75. Vietri, S.: On a class of Italian frozen sentences. Lingvisticæ Investigationes 34(2), 228–267 (2011) 76. Wan, X.: Co-training for cross-lingual sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 1, pp. 235–243. Association for Computational Linguistics (2009)
77. Wang, X., Zhao, Y., Fu, G.: A morpheme-based method to Chinese sentence-level sentiment classification. Int. J. Asian Lang. Proc. 21(3), 95–106 (2011) 78. Wawer, A.: Extracting emotive patterns for languages with rich morphology. Int. J. Comput. Linguist. Appl. 3(1), 11–24 (2012) 79. Wiebe, J.: Learning subjective adjectives from corpora. In: AAAI/IAAI, pp. 735– 740 (2000) 80. Ye, Q., Zhang, Z., Law, R.: Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. In: Expert Systems with Applications, vol. 36, pp. 6527–6535. Elsevier (2009) 81. Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques, pp. 427–434 (2003)
Using Cognitive Learning Method to Analyze Aggression in Social Media Text

Sayef Iqbal and Fazel Keshtkar

Science and Mathematics Department, St. John's University College of Professional Studies, Computer Science, 8000 Utopia Parkway, Jamaica, NY 11439, USA
{sayef.iqbal16,keshtkaf}@stjohns.edu
Abstract. Aggression and hate speech are a rising concern on social media platforms, and they are drawing significant attention in the research community, which is investigating different methods to detect such content. Aggression, which can be expressed in many forms, is able to leave victims devastated and often scars them for life. Families and social media users prefer a safer platform to interact with each other, which is why the detection and prevention of aggression and hatred over the internet is a must. In this paper we extract different features from our social media data and apply supervised learning methods to understand which model produces the best results. We also analyze the features to understand whether there is any pattern in the features that is associated with aggression in social media data. We used state-of-the-art cognitive features to gain better insight into our dataset. We also employed n-gram, sentiment and part-of-speech features as a standard model to identify other hate speech and aggression in text. Our model was able to identify texts that contain aggression with an f-score of 0.67.

Keywords: Hate speech · Aggression · Sentiment · Social media · Classification
1 Introduction
According to Wikipedia (Footnote 1), aggression is defined as the action or response of an individual who expresses something unpleasant to another person [4]. Needless to say, aggression on social media platforms has become a major factor in polarizing the community with hatred. Aggression can take the form of harassment, cyberbullying, hate speech and even taking jabs at one another. It is growing as more and more users join social networks. Around 80% of teenagers use social media nowadays, and one in three young people have been found to be victims of cyberbullying [25]. The rise of smartphones and smart devices and the ease of use of social media platforms have led to the spread of aggression over the internet [8]. Recently,
Footnote 1: https://en.wikipedia.org/wiki/Aggression (accessed 22 November 2018).
social media giants like Facebook and Twitter took some action and have been investigating this issue (e.g. deleting suspicious accounts). However, there is still a lack of sophisticated algorithms which can automatically detect these problems. Hence, more investigation needs to be done in order to address this issue at a larger scale. On the other hand, due to the subjectivity of aggression and of the hate associated with it, this problem has been challenging as well. Therefore, an automatic detection system serving as a front line of defense against such aggressive texts would be useful to minimize the spread of hatred across social media platforms, and it can help to maintain a healthy online environment. This paper focuses on generating a binary classification model for analyzing patterns in the 2018 TRAC (Trolling, Aggression and Cyberbullying) shared task dataset [12]. The data was initially annotated into three categories as follows:

– Non Aggressive (NAG): there is no aggression in the text
– Overtly Aggressive (OAG): the text contains open aggressive lexical features
– Covertly Aggressive (CAG): the text contains aggression without open acknowledgement of aggressive lexical features

Examples of each category are shown in Table 1.

Table 1. Some examples with original labels and our modified labels.

Text example                                                            Original label   New label
Cows are definitely gonna vote for Modi ji in 2019 ;)                   CAG              AG
Don't u think u r going too far u Son of a B****........#Nigam          OAG              AG
Happy Diwali.!!let's wish the next one year health, wealth n growth     NAG              NAG
to our Indian economy.
To analyze the aggression patterns, in this paper we focus on building a classification model using the Non Aggressive (NAG) and Aggressive (AG) classes. We combine the overlapping OAG and CAG categories of the initial dataset into the AG category. In this research, we investigate a combination of features such as word n-grams, LIWC, part of speech and sentiment polarity. We also applied different supervised learning algorithms to evaluate our model. While most of the supervised learning methods produced promising results, the Random Forest classifier produced the best accuracy (68.3%) and f-score (0.67), while also producing a state-of-the-art true-positive rate of 83%. Moreover, all the classifiers produced results with greater accuracy and precision for our proposed binary classes (AG and NAG) than for the initial three classes (NAG, CAG and OAG). We also analyzed the n-gram and LIWC features that were used for model building and found that they mostly affirm the presence of non-aggressive content in texts. This paper serves to lay the groundwork for our future work, which is to identify what differentiates OAG from CAG.
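A rough sketch of the kind of feature-based pipeline described above is shown below; the TF-IDF word n-grams stand in for the full feature set (the paper also uses LIWC, part-of-speech and sentiment features), and the example texts, labels and hyper-parameters are illustrative assumptions only.

```python
# Hypothetical binary aggression classifier (AG vs. NAG) using word n-grams
# and a Random Forest, as a stand-in for the fuller feature set in the paper.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "Happy Diwali, wishing health and growth to everyone!",
    "You are a disgrace, get lost.",
    "Great initiative by the local community center.",
    "People like you should be silenced for good.",
]
labels = ["NAG", "AG", "NAG", "AG"]  # OAG and CAG collapsed into AG

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)
print(model.predict(["What a wonderful festival celebration"]))
```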
The rest of the paper is organized as follows: the Related Work section gives a brief overview of the research already done in this area. The Methodology section describes our methodology and the details of the dataset, pre-processing steps, feature extraction and the algorithms that were used. The Experiments and Results section presents the experiments and results of the proposed model and, finally, the conclusion and future work are discussed in the Conclusion and Future Work section.
2 Related Work
Several studies have been done in order to detect the aggression level in social media texts [5]. Some research focuses on labelling texts as expressing either a positive or a negative opinion about a certain topic. Raja and Swamynathan, in their research, analyzed sentiment from tweet posts using a sentiment corpus to score sentimental words in the tweets [15]. They proposed a system which tags any sentimental word in a tweet and then scores the word using SentiWordNet's word list and sentimental relevance scoring using an estimator method. The system produced promising sentiment values for words. However, the research did not focus on the analysis of the lexical features or sentimental words, such as how often they appear in a text and what kind of part of speech the word(s) belong to. Samghabadi et al. analyzed data for both Hindi and English by using a combination of lexical and semantic features [19]. They used a Twitter dataset for training and testing purposes and applied supervised learning algorithms to identify texts as being Non Aggressive, Covertly Aggressive or Overtly Aggressive. They used lexical features such as word n-grams, char n-grams, k-skip n-grams and tf-idf word transformation. For word embedding, they employed Word2vec [24] and also used Stanford's sentiment tool [21] to measure the sentiment scores of words. They also used LIWC to analyze the texts from tweets and Facebook comments [22]. Finally, they used a binary calculation to identify the gender probability to produce an effective linguistic model. They were able to obtain an f-score of 0.5875 after applying classifiers to their linguistic feature model. In contrast to this research, our system produced results with a higher f-score (0.67), even though we used a different feature set and employed a different approach for supervised learning and analysis. On the other hand, Roy et al. [17] used Convolutional Neural Network (CNN) [11] and Support Vector Machine (SVM) classifiers on a pre-processed dataset of tweets and Facebook comments to classify the data. They employed the pre-processing technique of removing regular expressions, urls and usernames from the text. They used an ensemble approach combining CNN and SVM for classifying their data. In contrast to our research, which produced better results when using the Random Forest classifier, the performance of their system improved when SVM was used on unigram and tf-idf features along with a CNN with a kernel size of 2 x embedding size. The system was able to classify the social media posts with an f-score of 0.5099.
On a different note, Sharma et al. proposed a degree-based classification of harmful speech that is often manifested in posts and comments on social media platforms [20]. They extracted bag-of-words and tf-idf features from pre-processed Facebook posts and comments that were annotated subjectively by three different annotators. They applied Naive Bayes, Support Vector Machine and Random Forest classifiers to their model. Random Forest worked the best on their model and gave results with an accuracy of 76.42%. Van Hee et al. explored the territory of cyberbullying content, a branch of aggression, on social media platforms [23]. Cyberbullying can severely affect the confidence, self-esteem and emotional well-being of a victim, especially among the youth. They propose a linear SVM supervised machine learning method for detecting cyberbullying content in social media by exploring a wide range of features in English and Dutch corpora. The detection system provides a quantitative analysis of texts as a way to signal cyberbullying events on social media platforms. They performed a binary classification experiment for the automatic detection of cyberbullying texts and obtained an f-score of 64.32% for the English corpus. Unlike our research, [23] did not employ sentiment or use any psycholinguistic feature for the supervised learning methods. However, our system was able to produce a slightly better f-score even though we use a different dataset. Sahay et al. in their research address the negative aspects of online social interaction. Their work is based on the detection of cyberbullying and harassment on social media platforms [18]. They perform classification analysis of labelled textual content from posts and comments that could possibly contain cyberbullying, trolling, sarcastic and harassment content. They build their classification model based on n-gram features, as opposed to our system, in which we consider other features such as part of speech alongside n-gram features. They apply different machine learning algorithms to evaluate their feature-engineering process, generating scores between 70–90% for the training dataset. Similarly, Reynolds et al. perform lexical feature extraction on a labelled dataset that was collected through web crawling and contained posts mainly from teenagers and college students [16]. They proposed a binary classification of 'yes' or 'no' for posts from 18,554 users on the Formspring.me website that may or may not contain cyberbullying content. They apply different supervised learning methods to their extracted features and found that J48 produced the best true positive accuracy of 61.6% and an average accuracy of 81.7%. While the researchers were able to retrieve results with better accuracy, they do not analyze the texts themselves, which is a strong focal point of our research. Dinakar et al. propose a topic-sensitive classifier to detect cyberbullying content using 4,500 comments from YouTube to train and test their sub-topic classification models [7]. The sub-topics included sexuality, race and culture, and intelligence. They used tf-idf, Ortony lexicons, a list of profane words, part of speech tagging and topic-specific unigrams and bigrams as their features. Although they applied multiple classifiers on their feature model, SVM produced
the most reliable results, with a kappa value above 0.7 for all topic-sensitive classes, and JRip produced the most accurate results for all the classes. They found that building label-specific classifiers was more effective than multiclass classifiers at detecting cyberbullying-sensitive messages. Chen et al. also propose a Lexical Syntactic Feature (LSF) architecture to detect the use of offensive language on social media platforms [6]. They included a user's writing style, structure, and lexical and semantic content of the texts, among many other features, to identify the likelihood of a user putting up offensive content online. They achieved a precision of 98.24% and a recall of 94.34% in sentence offense detection using LSF as a feature in their modelling. They applied both Naive Bayes and SVM classification algorithms, with SVM producing the best accuracy in classifying the offensive content. However, in this paper, we propose a new model which, to the best of our knowledge, has not been used in previous research. We build a model with a combination of psycholinguistic, semantic, word n-gram, part of speech and other lexical features to analyze aggression patterns in the dataset. The Methodology section explains the details of our model.
3 Methodology
In this section we discuss the details of our methodology, dataset, pre-processing, feature extraction, and the algorithms that have been used in this model. The data was collected from the TRAC shared task [12].

3.1 Dataset
The dataset was collected from the TRAC (Trolling, Aggression and Cyberbullying) 2018 workshop [12], held in August 2018 in NM, USA. TRAC focuses on investigating online aggression, trolling, cyberbullying and other related phenomena. The workshop aimed to create a platform for academic discussions on this problem, based on previous joint work that the organizers have done as part of a project funded by the British Council. Our dataset was part of the workshop's English data and comprised 11,999 Facebook posts and comments, with 6,941 comments labelled as Aggressive and 5,030 as Non-aggressive. The comments were annotated subjectively into three categories, NAG, CAG and OAG, by the research scientists and reviewers who organized the workshop. We decided to use binary classes of AG and NAG for these texts. Figure 1 illustrates the distribution of the categories of aggression in the texts. We considered the complete dataset for analysis and model building. The corpus is code-mixed, i.e., it contains texts in English and Hindi (written in both Roman and Devanagari script). However, for our research, we only considered English text written in Roman script. Our final dataset, excluding Devanagari script, contained 11,999 Facebook comments.
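As a small illustration of the relabelling described above, the following sketch (with hypothetical example rows, not the actual TRAC data) collapses the three original categories into the binary AG/NAG scheme:

```python
# Illustrative sketch (not the authors' code): mapping the three-way TRAC
# labels OAG/CAG/NAG to the binary AG/NAG scheme used in this paper.
import pandas as pd

df = pd.DataFrame({
    "text": ["example facebook comment 1", "example facebook comment 2",
             "example facebook comment 3"],          # hypothetical placeholder texts
    "label": ["OAG", "CAG", "NAG"],
})
df["binary_label"] = df["label"].replace({"OAG": "AG", "CAG": "AG"})
print(df["binary_label"].value_counts())             # AG: 2, NAG: 1
```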
Fig. 1. Distribution of the dataset: OAG (22.6%), CAG (35.3%) and NAG (42.1%)
3.2 Pre-processing
Pre-processing is the technique of cleaning and normalizing data, which may consist of removing less important tokens, words or characters in a text, such as 'a', 'and', '@', and lowercasing capitalized words like 'APPLE'. The texts contained several unimportant tokens, for instance urls, numbers, html tags and special characters, which caused noise in the text for analysis. We first cleaned the data by employing the NLTK (Natural Language Toolkit) [2] stemmer and stopwords packages. Table 2 illustrates the transformation of a text before and after pre-processing.

Table 2. Text before and after pre-processing

Before | Respect all religion sir, after all we all have to die, and after death there will be no disturbance and will be complete silence
After  | Respect religion sir die death disturbance complete silence
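A minimal sketch of such a cleaning step with NLTK is shown below. It is not the authors' exact pipeline (the regular expressions and the choice of stemmer are assumptions), but it illustrates the operations described above: lowercasing, stripping URLs, HTML tags, numbers and special characters, and removing stopwords before stemming.

```python
# Illustrative pre-processing sketch using NLTK (requires the 'punkt' and
# 'stopwords' resources to be downloaded beforehand).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # strip numbers / special characters
    tokens = [t for t in word_tokenize(text) if t not in stop_words]
    return " ".join(stemmer.stem(t) for t in tokens)

print(preprocess("Respect all religion sir, after all we all have to die ..."))
```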
3.3 Feature Extraction
In this section we describe the features that we extracted from the dataset. We extracted various features; however, for the sake of this specific research, we only consider the following features due to their better performance in our final model. We adopted the following features: part of speech, n-grams (unigrams, bigrams, trigrams), tf-idf, sentiment polarity and LIWC's psycholinguistic features. Figure 2 illustrates the procedure that was adopted in the process of feature extraction to build a model for supervised learning.

Part-of-Speech Features. Part-of-speech (PoS) tags are classes or lexical categories assigned to words based on their similar grammatical and syntactic properties. Typical examples of parts of speech include adjective, noun, adverb, etc. PoS helps us to tag words into certain categories and to find any pattern they create with regard to aggressive and non-aggressive texts. For the purposes of this research, we applied NLTK's [2] part of speech tagging package on our dataset to count the occurrences of PoS tags in each text.
Fig. 2. Feature extraction architecture of the system.
This led to the extraction of 24 categories of words. For instance, extracting PoS tags from the text 'respect religion sir die death disturbance complete silence' leaves us with 'respect': NN, 'religion': NN, 'sir': NN, 'die': VBP, 'death': NN, 'disturbance': NN, 'complete': JJ, 'silence': NN, where NN tags a noun and JJ and VBP tag an adjective and a verb in non-3rd person singular present form, respectively.

N-gram Features. A language model or n-gram in natural language processing (NLP) refers to a sequence of n items (words, tokens) from a text. Typically, n refers to the number of words in a sequence gathered from a text after applying text processing techniques. N-grams are commonly used in NLP for developing supervised machine learning models [3]. They help us identify which words tend to appear together frequently. We used the Weka [9] tool to extract unigram, bigram and trigram word features from these texts. We utilized Weka's Snowball stemmer to stem the words for standard cases and the Rainbow stopword list to further remove any potential stop words. We considered tf-idf scores as the values of the word n-grams instead of their frequencies. Over 270,000 tokens were extracted after n-gram feature extraction. We also employed Weka's built-in ranker algorithm to identify which features contribute most towards the correct classification of the texts. This helped us understand which words were most useful and related to our annotated classes. We considered only the top 437 items for further analysis. We dropped features ranked below 437 as they were barely of any relevance according to the ranker algorithm. Table 3 illustrates some examples of unigrams,
bigrams and trigrams after applying n-gram feature extraction on the text 'respect religion sir die death disturbance complete silence'.

Table 3. Examples of n-gram features

n-gram  | Example of n-gram tokens
unigram | Respect, religion, disturbance
bigram  | Respect religion, disturbance complete
trigram | Respect religion sir, die death disturbance
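The PoS-count and n-gram/tf-idf features described above can be sketched as follows. The paper used NLTK for tagging and Weka (with its ranker algorithm) for n-gram extraction and feature selection; the snippet below substitutes scikit-learn for Weka, so the exact stemming, stopword list and ranking differ, and the toy texts are placeholders.

```python
# Illustrative feature-extraction sketch (not the authors' Weka-based setup).
from collections import Counter
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def pos_counts(text):
    # Count occurrences of each PoS tag (NN, JJ, VBP, ...) in one document.
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return Counter(tags)

texts = ["respect religion sir die death disturbance complete silence",
         "don t u think u r going too far"]          # placeholder documents
labels = ["NAG", "AG"]

# Unigram/bigram/trigram features weighted by tf-idf instead of raw frequency.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

# Keep only the highest-ranked n-grams (the paper kept the top 437 items).
selector = SelectKBest(chi2, k=min(437, X.shape[1]))
X_top = selector.fit_transform(X, labels)
```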
Sentiment Features. Sentiment features are used to analyze whether an opinion expressed in a text carries a positive, negative or neutral emotion [10]. Sentiment analysis, especially for social media texts, is an important technique for monitoring public opinion on a topic. It helps to understand the opinion expressed in a text by evaluating the sentiment value of each word and of the overall text. We used TextBlob [1] to evaluate the sentiment polarity score of each pre-processed word and of the text as a whole. TextBlob provides easy access to common text-processing operations. The package converts sentences to a list of words and performs word-level sentiment analysis to give a sentiment polarity score for each text. Sentiment polarity is a floating-point number ranging from –1.0 to 1.0. A number closer to –1.0 expresses a negative opinion and a number closer to 1.0 expresses a positive opinion. We keep track of the document id and the corresponding sentiment polarity score as a feature. For instance, the text 'respect religion sir die death disturbance complete silence' produced a sentiment polarity score of –0.10 with a subjectivity of –0.40. However, we only consider the polarity score in the feature set.

Linguistic Inquiry and Word Count Features. LIWC (Linguistic Inquiry and Word Count) performs computerized text analysis to extract psycholinguistic features from texts [14]. We utilized the LIWC 2015 psychometric text analyzer [13] in order to gain insight into our data. The features provide understanding of textual content by scoring and labelling the text segments into many of its categories. In this research we applied Weka's ranking algorithm to the LIWC features to rank the most significant and useful LIWC features that contributed most towards classifying the texts as AG or NAG. We found 12 such LIWC features which were crucial in our analysis and which produced the best accuracy and f-score for our classifiers. Figure 3 illustrates the distribution of the psycholinguistic features among the 11,999 Facebook comments. Each document may contain one or more of these cognitive features.
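The TextBlob-based sentiment polarity feature described above can be computed with a few lines; the sketch below is illustrative only (LIWC 2015 is a licensed tool, so its psycholinguistic categories are not reproduced here).

```python
# Illustrative sketch of the per-document sentiment polarity feature.
from textblob import TextBlob

def sentiment_polarity(text):
    # Polarity is a float in [-1.0, 1.0]; only this value is kept as a feature.
    return TextBlob(text).sentiment.polarity

print(sentiment_polarity("respect religion sir die death disturbance complete silence"))
```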
Fig. 3. Psycholinguistic category distribution using LIWC
4 Experiments and Results

4.1 Experimental Setup
In this section we evaluate the performance of our model using supervised learning algorithms. We report the accuracy and f-score of different supervised learning methods on the models that were created using the features explained in the previous sections. We also evaluated the validity of our models and identified vital features and patterns that caused high and low performance in our system.

4.2 Results
We considered different combinations of features to build the best possible model that could eventually lead to higher performance. We ran various algorithms, such as Support Vector Machine and Random Forest, on different combinations of features. Some combinations of features performed better than others and we picked the one that produced the best result. We noted from our results that Random Forest produced better results. Table 4 shows the results obtained by applying these classifiers on different combinations of features. We kept word n-grams as our gold standard feature in model building and then applied different combinations of other features. The features that were used in model building were Unigram (U), Bigram (B), Trigram (T), Sentiment Polarity (SP), Part of Speech (PoS) and LIWC (LIWC). We applied the different classifiers using both 10-fold cross-validation and 66% of the data for training, with multiple combinations of these features. The best results were obtained when considering U+B+T+SP+LIWC+PoS as features and using 66% of the data for training and 33% for testing.
Table 4. F-scores of classifiers using 66% of the data for training

Feature            | SVM    | Random Forest
U+B+T              | 0.6100 | 0.6340
U+B+T+SP           | 0.6360 | 0.6500
U+B+T+SP+LIWC      | 0.6410 | 0.6450
U+B+T+SP+LIWC+PoS  | 0.6450 | 0.6700
Figure 4 shows the confusion matrix of our model for both classes, AG and NAG. The confusion matrix was generated by applying the Random Forest classifier with the U+B+T+SP+PoS+LIWC features, using 66% of the data for training and the remaining data as the test set. Interestingly, according to the confusion matrix, 1,930 out of 2,351 AG-class texts in the test set were identified correctly. This means that the true positive rate for aggression in texts was 83%, which is extremely promising.
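A minimal sketch of this evaluation setup is given below. The paper used Weka; here scikit-learn stands in, and the feature matrix is a random placeholder for the real U+B+T+SP+LIWC+PoS features, so the numbers it prints are not the paper's results.

```python
# Illustrative sketch of the 66%/33% split, Random Forest training and
# confusion-matrix evaluation described above (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score

rng = np.random.default_rng(0)
X = rng.random((200, 20))            # placeholder feature matrix
y = rng.choice(["AG", "NAG"], 200)   # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(confusion_matrix(y_test, pred, labels=["AG", "NAG"]))
print("F-score:", f1_score(y_test, pred, pos_label="AG"))
```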
Fig. 4. Confusion matrix of random forest classifier.
We also found that the sentiment polarity scores for texts were evenly distributed among both classes, even though sentiment polarity was evaluated as a vital feature
by the ranking algorithm, and it was among the top features that contributed towards higher accuracy and f-score. We also found that some words happened to exist in both unigrams and bigrams, for instance 'loud' in 'loudspeaker' and 'loud noise'. This suggests that those words are key in classifying the texts. When considering the word n-gram features, there were very few bigrams in the model, as it mostly comprised unigrams and contained words related to religion and politics. Also, most words in texts that were annotated as aggressive were adjectives and nouns.

4.3 Discussion and Analysis
A common issue with the dataset was that it often contained abbreviated or unknown words and phrases which could not be matched by any of the lexicons. Hence, these words and phrases were left out of our analysis. Also, some texts contained either only stop words or a mixture of stop words and emoticons, which led to the removal of all of their content upon pre-processing. Performing pre-processing on the text 'hare pm she q ni such or the hare panic mr the h unto ab such ran chalice' led to the removal of the whole text. This also prevented us from further analyzing such texts, even though they may potentially have had some aggressive lexical features or emoticons. But because emoticons can be placed sarcastically in texts, we did not consider them as a feature in our model. There were some texts which comprised non-English words. The words in these texts switched between English and other languages, which made our analysis difficult as it was solely intended for an English corpus. Some words like 'dadagiri', which roughly means 'bossy' in Hindi, were not transliterated, which is why the semantics of the text could not be captured. The sentiment polarity score for the text 'chutiya rio hittel best mobile network india' was 0.0, whereas it clearly should have been scored below 0.0 as it contains a strong negative word in another language (Hindi in Roman script).

Analyzing Result Data. Adjective (JJ), verb in non-3rd person singular present form (VBP) and noun (NN) were among the prominent parts of speech for the n-gram words that we extracted. Figure 5 illustrates the distribution of parts of speech in our n-gram word features. Since Sentiment Polarity (SP) was among the top features ranked by Weka, SP as a feature identified 6,094 texts correctly and 5,905 texts incorrectly. Out of the 6,094 that were correctly identified, only 1,861 texts were labelled as AG and 4,233 as NAG. Figure 6 illustrates the distribution of AG and NAG labels after performing sentiment polarity analysis on the texts. Also, of the top 434 n-gram words ranked by Weka, SP identified 396 n-gram words as NAG and only 37 as AG. This clearly indicates that sentiment polarity is a good feature for identifying NAG texts.
Fig. 5. Part of speech tagging of n-gram features
Fig. 6. Comparison of the sentiment polarity feature
5 Conclusion and Future Work
In this paper, we proposed an approach to detect aggression in texts. We tried to understand patterns in both AG and NAG class texts based on part of speech and sentiment. The model produced promising results as it helps us to make a clear distinction between texts that contain aggression and those that do not. Our system architecture also adapted well to the feature extraction process for aggression detection. For future work, we plan to use more lexicon features for sentiment analysis in order to further improve the accuracy and f-score of our model's classification. We also plan to use hashtags and emoticons, which we think will be promising features. These features will help us to identify more important words and content from texts that were not detected. We would also like to investigate the sub-domains of aggression, Covertly Aggressive and Overtly Aggressive content, and identify distinguishing factors between them.
References

1. Bagheri, H., Islam, M.J.: Sentiment analysis of twitter data. arXiv preprint arXiv:1711.10377 (2017)
2. Bird, S., Loper, E.: NLTK: the Natural Language Toolkit. In: Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, p. 31. Association for Computational Linguistics (2004)
3. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
4. Buss, A.H.: The Psychology of Aggression (1961)
5. Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., Vakali, A.: Mean birds: detecting aggression and bullying on twitter. In: Proceedings of the 2017 ACM on Web Science Conference, pp. 13–22. ACM (2017)
6. Chen, Y., Zhou, Y., Zhu, S., Xu, H.: Detecting offensive language in social media to protect adolescent online safety. In: 2012 International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2012 International Conference on Social Computing (SocialCom), pp. 71–80. IEEE (2012)
7. Dinakar, K., Reichart, R., Lieberman, H.: Modeling the detection of textual cyberbullying. Soc. Mob. Web 11(02), 11–17 (2011)
8. Görzig, A., Frumkin, L.A.: Cyberbullying experiences on-the-go: when social media can become distressing. Cyberpsychology 7(1), 4 (2013)
9. Holmes, G., Donkin, A., Witten, I.H.: Weka: a machine learning workbench. In: Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, pp. 357–361. IEEE (1994)
10. Keshtkar, F., Inkpen, D.: Using sentiment orientation features for mood classification in blogs. In: 2009 International Conference on Natural Language Processing and Knowledge Engineering, pp. 1–6 (2009). https://doi.org/10.1109/NLPKE.2009.5313734
11. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
12. Kumar, R., Reganti, A.N., Bhatia, A., Maheshwari, T.: Aggression-annotated corpus of Hindi-English code-mixed data. arXiv preprint arXiv:1803.09402 (2018)
13. Pennebaker, J.W., Boyd, R.L., Jordan, K., Blackburn, K.: The development and psychometric properties of LIWC2015. Technical report (2015)
14. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC 2001. Lawrence Erlbaum Associates, Mahwah 71(2001), 2001 (2001)
15. Raja, M., Swamynathan, S.: Tweet sentiment analyzer: sentiment score estimation method for assessing the value of opinions in tweets. In: Proceedings of the International Conference on Advances in Information Communication Technology & Computing, p. 83. ACM (2016)
16. Reynolds, K., Kontostathis, A., Edwards, L.: Using machine learning to detect cyberbullying. In: 2011 10th International Conference on Machine Learning and Applications and Workshops (ICMLA), vol. 2, pp. 241–244. IEEE (2011)
17. Roy, A., Kapil, P., Basak, K., Ekbal, A.: An ensemble approach for aggression identification in English and Hindi text. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pp. 66–73 (2018)
18. Sahay, K., Khaira, H.S., Kukreja, P., Shukla, N.: Detecting cyberbullying and aggression in social commentary using NLP and machine learning. People (2018)
19. Samghabadi, N.S., Mave, D., Kar, S., Solorio, T.: RiTUAL-UH at TRAC 2018 shared task: aggression identification. arXiv preprint arXiv:1807.11712 (2018)
20. Sharma, S., Agrawal, S., Shrivastava, M.: Degree based classification of harmful speech using twitter data. arXiv preprint arXiv:1806.04197 (2018)
21. Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
22. Tausczik, Y., Pennebaker, J.: The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29, 24–54 (2010)
23. Van Hee, C., et al.: Automatic detection of cyberbullying in social media text. arXiv preprint arXiv:1801.05617 (2018)
24. Wang, H.: Introduction to word2vec and its application to find predominant word senses. http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec%20and%20its%20application%20to%20wsd.pdf (2014)
25. Zainol, Z., Wani, S., Nohuddin, P., Noormanshah, W., Marzukhi, S.: Association analysis of cyberbullying on social media using apriori algorithm. Int. J. Eng. Technol. 7, 72–75 (2018). https://doi.org/10.14419/ijet.v7i4.29.21847
Opinion Spam Detection with Attention-Based LSTM Networks

Zeinab Sedighi1, Hossein Ebrahimpour-Komleh2, Ayoub Bagheri3, and Leila Kosseim4(B)

1 Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada [email protected]
2 Department of Computer Engineering, University of Kashan, Kashan, Islamic Republic of Iran [email protected]
3 Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands [email protected]
4 Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada [email protected]

Abstract. Today, online reviews have a great influence on consumers' purchasing decisions. As a result, spam attacks, consisting of the malicious inclusion of fake online reviews, can be detrimental to both customers and organizations. Several methods have been proposed to automatically detect fake opinions; however, the majority of these methods focus on feature learning techniques based on a large number of handcrafted features. Deep learning and attention mechanisms have recently been shown to improve the performance of many classification tasks as they enable the model to focus on the most important features. This paper describes our approach to apply LSTM and attention-based mechanisms for detecting deceptive reviews. Experiments with the Three-domain data set [15] show that a BiLSTM model coupled with Multi-Headed Self Attention improves the F-measure from 81.49% to 87.59% in detecting fake reviews.

Keywords: Deep learning · Attention mechanisms · Natural language processing · Opinion spam detection · Machine learning
1 Introduction
Due to the increasing public reliance on social media for decision making, companies and organizations regularly monitor online comments from their users in order to improve their business. To assist in this task, much research has addressed the problems of opinion mining [2,22]. However, the ease of sharing comments and experience on specific topics has also led to an increase in fake review attacks (or spam) by individuals or groups. In fact, it is estimated that as much as one-third
of opinion reviews on the web are spam [21]. These, in turn, decrease the trustworthiness of all online reviews for both users and organizations. Manually discerning spam reviews from non-spam ones has been shown to be both difficult and inaccurate [17]; therefore, developing automatic approaches to detect review spam has become a necessity. Although automatic opinion spam detection has been addressed by the research community, it still remains an open problem. Most previous work on opinion spam detection has used classic supervised machine learning methods to distinguish spam from non-spam reviews. Consequently, much attention has been paid to learning appropriate features to increase the performance of the classification. In this paper we explore the use of an LSTM-based model that uses an attention mechanism to learn representations and features automatically to detect spam reviews. All the deep learning models tested obtained significantly better performance than traditional supervised approaches, and the BiLSTM+Multi-Headed Self Attention yielded the best F-measure of 87.59%, a significant improvement over the current state of the art. This article is organized as follows. Section 2 surveys related work in opinion spam review detection. Our attention-based model is then described in Sect. 3. Results are presented and discussed in Sect. 4. Finally, Sect. 5 proposes future work to improve our model.
2 Related Work
According to [7], spam reviews can be divided into three categories: 1) untruthful reviews which deliberately affect user decisions, 2) reviews whose purpose is to advertise specific brands, and 3) non-reviews which are irrelevant. Types 2 and 3 are easier to detect as the topic of the spam differs significantly from truthful reviews; however, type 1 spam is more difficult to identify. This article focuses on type 1 reviews, which try to mislead users using topic-related deceptive comments.

2.1 Opinion Spam Detection
Much research has been done on the automatic detection of spam reviews. Techniques employed range from unsupervised (e.g. [1]) and semi-supervised (e.g. [10,13]) to supervised methods (e.g. [9,14,27]), with a predominance of supervised methods. Most methods rely on human feature engineering and experiment with different classifiers and hyper-parameters to enhance the classification quality. To train from more data, [17] generated an artificial data set and applied supervised learning techniques for text classification. [10] uses spam detection for text summarization, and [10] applies Naive Bayes, logistic regression and Support Vector Machine methods after feature extraction using Part-of-Speech
tags and LIWC. To investigate cross-domain spam detection, [12] uses a data set consisting of three review domains to avoid the dependency on a specific domain. They examine SVM and SAGE as classification models.

2.2 Deep Learning for Sentiment Analysis
Deep learning models have been widely used in natural language processing and text mining [16], for tasks such as sentiment analysis [20], co-reference resolution [5], POS tagging [19] and parsing [6], as they are capable of learning relevant syntactic and semantic features automatically, as opposed to hand-engineered features. Because long-term dependencies are prominent in natural language, Recurrent Neural Networks (RNNs), and in particular Long Short-Term Memories (LSTMs), have been very successful in many applications (e.g. [18,20,25]). [11] employed an RNN in parallel with a Convolutional Neural Network (CNN) to improve the analysis of sentiment phrases. [20] used an RNN to create sentence representations. [25] presented a context representation for relation classification using a ranking recurrent neural network.

2.3 Attention Mechanisms
Attention mechanisms have also shown much success in the last few years [3]. Using such mechanisms, neural networks are able to better model dependencies in sequences of information in texts, voices, videos, etc. [4,26], regardless of their distance. Because they learn which information from the sequence is most useful to predict the current output, attention mechanisms have increased the performance of many Natural Language Processing tasks [3,23]. An attention function maps an input sequence and a set of key-value pairs to an output. The output is calculated as a weighted sum of the values. The weight assigned to each value is obtained using a compatibility function of the sequence and the corresponding key. In a vanilla RNN without attention, the model embodies all the information of the input sequence by means of the last hidden state. However, when applying an attention mechanism, the model is able to glance back at the entire input: not only by accessing the last hidden state but also by accessing a weighted combination of all input states. Several types of attention mechanisms have been proposed [23]. Self Attention, also known as intra-attention, aims to learn the dependencies between the words in a sentence and uses this information to model the internal structure of the sentence. Scaled Dot-Product Attention [24], on the other hand, calculates the similarity using a scaled dot product. As opposed to Self Attention, Scaled Dot-Product Attention uses a scaling factor based on the key dimension to keep the inner product from becoming too large. If the calculation is performed several times instead of once, the model is able to learn more relevant information concurrently in different sub-spaces. This last model is called Multi-Headed Self Attention. In light of the success of these attention mechanisms, this paper evaluates the use of these techniques for the task of opinion spam detection.
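The attention functions discussed above can be written compactly. The following PyTorch sketch is illustrative only (it is not the paper's implementation, and the tensor sizes are arbitrary): queries, keys and values are compared, scaled by the key dimension, and the values are combined as a weighted sum.

```python
# Minimal sketch of scaled dot-product attention and multi-headed self attention.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Compatibility scores between queries and keys, scaled by sqrt(d_k) so the
    # dot products do not grow too large [24]; output is a weighted sum of values.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Self attention: queries, keys and values all come from the same sequence.
x = torch.randn(1, 10, 64)          # (batch, sequence length, hidden size)
out = scaled_dot_product_attention(x, x, x)

# Multi-headed self attention repeats the computation in several sub-spaces;
# PyTorch ships a ready-made module for it.
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
out_mh, _ = mha(x, x, x)
```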
3 Methodology

3.1 Attention-Based LSTM Model
Figure 1 shows the general model used in our experiments. The look-up layer maps words to a look-up table by applying word embeddings. In order to better capture the relations between distant words, the next layer uses LSTM units. In our experiments (see Sect. 4), we investigated both unidirectional and bi-directional LSTMs (BiLSTM) [8]. We considered the use of one LSTM layer, one BiLSTM layer and two BiLSTM layers. In all cases we used 150 LSTM units in each layer, and the training phase was applied after every 32 time steps using Back Propagation Through Time (BPTT) with a learning rate of 1e-3 and a dropout rate of 30%.
Fig. 1. General architecture of the attention-based models
The results from the LSTM layer are fed into the attention layer. Here, we experimented with two mechanisms: Self Attention and Multi-Headed Self Attention [24]. Finally, a Softmax layer is applied for the final binary classification.
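A minimal PyTorch sketch of this architecture is given below, using the settings reported above (150 LSTM units per direction, 30% dropout, learning rate 1e-3). It is not the authors' code: the pooling of the attention outputs into a document vector and the single-head attention are assumptions, and the vocabulary size is a placeholder.

```python
# Illustrative BiLSTM + self-attention classifier for truthful/deceptive reviews.
import torch
import torch.nn as nn

class AttentionBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=150, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # look-up layer
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1,
                                          batch_first=True)           # self attention
        self.dropout = nn.Dropout(0.3)
        self.out = nn.Linear(2 * hidden, num_classes)                  # softmax layer

    def forward(self, tokens):
        h, _ = self.bilstm(self.embed(tokens))      # (batch, time, 2*hidden)
        a, _ = self.attn(h, h, h)                   # attend over all hidden states
        doc = self.dropout(a.mean(dim=1))           # pool to a document vector (assumption)
        return self.out(doc)                        # logits; softmax applied in the loss

model = AttentionBiLSTM(vocab_size=20000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
logits = model(torch.randint(0, 20000, (4, 50)))    # toy batch of 4 reviews
```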
4 Experiments
To evaluate our models, we use the Three-Domain data set [15] which constitutes the standard benchmark in this field. [15] introduced the Three-domain
review spam data set: a collection of 3032 reviews in three different domains (Hotel, Restaurant and Doctor) annotated by three types of annotators (Turker, Expert and Customer). Each review is annotated with a binary label: truthful or deceptive. Table 1 shows statistics of the data set.

Table 1. Statistics of the three-domain dataset

Data set   | Turker | Expert | Customer | Total
Hotel      | 800    | 280    | 800      | 1880
Restaurant | 200    | 120    | 400      | 720
Doctor     | 200    | 32     | 200      | 432
Total      | 1200   | 432    | 1400     | 3032
To compare our proposed models with traditional supervised machine learning approaches, we also experimented with SVM, Naive Bayes and Log Regression methods. For these traditional feature-engineered models, we pre-processed the text to remove stop words and stemmed the remaining words. Then, to distinguish the role of words, POS tagging was applied. To extract helpful features for classifying reviews, feature engineering techniques are required: bigrams and TF-IDF are applied to extract the most repetitive words in the documents. For the other deep learning models, we used both a CNN and an RNN. The CNN and the RNN use the same word embeddings as our model (see Sect. 3). The CNN uses two convolutional and pooling layers connected to one fully connected hidden layer.
5 Results and Analysis
Our various models were compared using all three domains, in-domain and cross-domain.

5.1 All Three-Domain Results
In our first set of experiments we used the combination of all three domains (Hotel, Restaurant and Doctor), for a total of 3032 reviews. Table 2 shows the results of our deep learning models compared to classic machine learning methods using cross-validation. As Table 2 shows, in general all deep learning models yield a higher F-measure than SVM, Naive Bayes and Log Regression. It is interesting to note that both precision and recall benefit from the deep models and that attention mechanisms yield a significant improvement in performance. The Multi-Headed Self Attention performs better than the Self Attention, and the bidirectional LSTM does provide a significant improvement compared to the unidirectional LSTM.
Table 2. Results in all three-domain classification

Methods                               | Precision | Recall | F-measure
SVM                                   | 72.33     | 68.50  | 70.36
Naive Bayes                           | 61.69     | 63.32  | 62.49
Log Regression                        | 55.70     | 57.34  | 56.50
CNN                                   | 79.23     | 69.34  | 73.95
RNN                                   | 75.33     | 73.41  | 74.35
LSTM                                  | 80.85     | 68.74  | 74.30
BiLSTM                                | 82.65     | 80.36  | 81.49
BiLSTM + Self Attention               | 85.12     | 83.96  | 84.53
BiLSTM + Multi-Headed Self Attention  | 90.68     | 84.72  | 87.59
Table 3. Results in in-domain classification

Data set   | Methods                              | Precision | Recall | F-measure
Hotel      | SVM                                  | 69.97     | 67.36  | 68.64
Hotel      | Naive Bayes                          | 58.71     | 62.13  | 60.37
Hotel      | Log Regression                       | 55.32     | 56.45  | 55.87
Hotel      | RNN                                  | 72.25     | 70.31  | 71.27
Hotel      | CNN                                  | 78.76     | 74.30  | 76.47
Hotel      | LSTM                                 | 84.65     | 81.11  | 82.84
Hotel      | BiLSTM                               | 86.43     | 83.05  | 84.70
Hotel      | BiLSTM + Self Attention              | 90.21     | 85.73  | 87.91
Hotel      | BiLSTM + Multi-Headed Self Attention | 89.33     | 92.59  | 90.93
Restaurant | SVM                                  | 73.76     | 69.43  | 71.52
Restaurant | Naive Bayes                          | 63.18     | 66.32  | 64.71
Restaurant | Log Regression                       | 59.96     | 63.87  | 61.85
Restaurant | RNN                                  | 77.12     | 74.96  | 76.02
Restaurant | CNN                                  | 78.11     | 76.92  | 77.51
Restaurant | LSTM                                 | 80.23     | 78.74  | 79.47
Restaurant | BiLSTM                               | 86.85     | 87.35  | 87.10
Restaurant | BiLSTM + Self Attention              | 88.68     | 86.47  | 87.56
Restaurant | BiLSTM + Multi-Headed Self Attention | 89.66     | 91.00  | 90.32
Doctor     | SVM                                  | 72.17     | 74.39  | 73.26
Doctor     | Naive Bayes                          | 63.18     | 67.98  | 65.49
Doctor     | Log Regression                       | 65.83     | 69.27  | 67.51
Doctor     | RNN                                  | 75.28     | 67.98  | 71.44
Doctor     | CNN                                  | 77.63     | 70.03  | 73.63
Doctor     | LSTM                                 | 79.92     | 74.21  | 77.49
Doctor     | BiLSTM                               | 79.85     | 78.74  | 79.29
Doctor     | BiLSTM + Self Attention              | 82.54     | 80.61  | 81.56
Doctor     | BiLSTM + Multi-Headed Self Attention | 84.76     | 81.10  | 82.88
5.2 In-domain Results
Table 3 shows the results of the same models for each domain. As Table 3 shows, the same general conclusion can be drawn for each specific domain: deep learning
methods show significant improvements compared to classical ML methods, and attention mechanisms increase the performance even more. These results seem to show that neural models are more suitable for deceptive opinion spam detection. The results on the Restaurant data are similar to those on the Hotel domain. However, the models yield lower results on the Doctor domain. A possible reason is that the number of reviews in this domain is relatively lower, which leads to relatively lower performance.

5.3 Cross-domain Results
Finally, to evaluate the generality of our model, we performed an experiment across domains, where we trained models on one domain and evaluated them on another. Specifically, we trained the classifiers on Hotel reviews (for which we had more data) and evaluated their performance on the other domains. Table 4 shows the results of these experiments. Again, the same general trend appears. One notices, however, that the performance of the models does drop compared to Table 3, where the training was done on the same domain.

5.4 Comparison with Previous Work
In order to compare the proposed model with the state of the art, we performed a last experiment in line with the experimental setup of [15]. As indicated in Sect. 2, [15] used an SVM and SAGE based on unigrams + LIWC + POS tags. To our knowledge, their approach constitutes the state-of-the-art approach on the Three-Domain dataset.

Table 4. Results in cross-domain classification

Datasets             | Methods                              | Precision | Recall | F-measure
Hotel vs. Doctor     | SVM                                  | 67.28     | 64.91  | 66.07
Hotel vs. Doctor     | Naive Bayes                          | 59.94     | 63.13  | 61.49
Hotel vs. Doctor     | Log Regression                       | 55.47     | 51.93  | 53.64
Hotel vs. Doctor     | RNN                                  | 72.33     | 68.50  | 70.36
Hotel vs. Doctor     | CNN                                  | 61.69     | 63.32  | 62.49
Hotel vs. Doctor     | LSTM                                 | 74.65     | 70.89  | 72.72
Hotel vs. Doctor     | BiLSTM                               | 76.82     | 71.63  | 74.13
Hotel vs. Doctor     | BiLSTM + Self Attention              | 78.10     | 73.79  | 75.88
Hotel vs. Doctor     | BiLSTM + Multi-Headed Self Attention | 81.90     | 77.34  | 79.55
Hotel vs. Restaurant | SVM                                  | 70.75     | 68.94  | 69.83
Hotel vs. Restaurant | Naive Bayes                          | 64.87     | 67.11  | 65.97
Hotel vs. Restaurant | Log Regression                       | 60.72     | 57.90  | 59.27
Hotel vs. Restaurant | RNN                                  | 79.23     | 69.34  | 73.95
Hotel vs. Restaurant | CNN                                  | 75.33     | 73.41  | 74.35
Hotel vs. Restaurant | LSTM                                 | 80.15     | 73.94  | 76.91
Hotel vs. Restaurant | BiLSTM                               | 80.11     | 79.92  | 80.01
Hotel vs. Restaurant | BiLSTM + Self Attention              | 87.73     | 82.27  | 84.91
Hotel vs. Restaurant | BiLSTM + Multi-Headed Self Attention | 90.68     | 84.72  | 87.59
Table 5. Comparison of the proposed model with [15] on the Customer data

Data       | Precision (Our model / [15]) | Recall (Our model / [15]) | F-Measure (Our model / [15])
Hotel      | 85 / 67                      | 93 / 66                   | 89 / 66
Restaurant | 90 / 69                      | 92 / 72                   | 91 / 70

Table 6. Comparison of the proposed model with [15] on the Expert data

Data       | Precision (Our model / [15]) | Recall (Our model / [15]) | F-Measure (Our model / [15])
Hotel      | 80 / 58                      | 85 / 61                   | 82 / 59
Restaurant | 79 / 62                      | 84 / 64                   | 81 / 70

Table 7. Comparison of the proposed model with [15] on the Turker data

Data       | Precision (Our model / [15]) | Recall (Our model / [15]) | F-Measure (Our model / [15])
Hotel      | 87 / 64                      | 92 / 58                   | 89 / 61
Restaurant | 88 / 68                      | 89 / 70                   | 88 / 68
Although [15] performed a variety of experiments, the setup used applied the classifiers on the Turker and the Customer sections of the dataset only (see Table 1). The model was trained on the entire data set (Customer+Expert+Turker) and tested individually on the Turker, Customer and Expert sections using cross-validation. To compare our approach, we reproduced their experimental setup and, as shown in Table 5 to Table 7, our BiLSTM+Multi-Headed Self Attention model outperforms this state of the art.
6 Conclusion and Future Work
In this paper we showed that an attention mechanism can learn document representations automatically for opinion spam detection and clearly outperform non-attention-based models as well as classic models. Experimental results show that the Multi-Headed Self Attention performs better than the Self Attention, and the
bidirectional LSTM does provide a significant improvement compared to the unidirectional LSTM. A high-performing model that requires no manual feature extraction from documents is effective for improving the detection of spam comments. Utilizing an attention mechanism and an LSTM model enables us to have a comprehensive model for distinguishing different reviews in different domains. This shows the generalization power of our model. One challenge left for the future is to improve the performance of cross-domain spam detection. This would enable the model to be used widely and to reach the performance of in-domain results in all domains.

Acknowledgments. This work was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
References

1. Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., Nunamaker Jr., J.F.: Detecting fake websites: the contribution of statistical learning theory. MIS Quarterly, pp. 435–461 (2010)
2. Bagheri, A., Saraee, M., De Jong, F.: Care more about customers: unsupervised domain-independent aspect detection for sentiment analysis of customer reviews. Knowl.-Based Syst. 52, 201–213 (2013)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
4. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), pp. 4960–4964 (2016)
5. Clark, K., Manning, C.D.: Improving coreference resolution by learning entity-level distributed representations, vol. 1, pp. 643–653 (2016)
6. Collobert, R.: Deep learning for efficient discriminative parsing. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 224–232. FL, USA (2011)
7. Dixit, S., Agrawal, A.: Survey on review spam detection. Int. J. Comput. Commun. Technol. 4, 0975–7449 (2013)
8. Gers, F.A., Schraudolph, N.N., Schmidhuber, J.: Learning precise timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 115–143 (2002)
9. Jindal, N., Liu, B.: Opinion spam and analysis. In: Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM 2008), pp. 219–230. ACM, Palo Alto, California, USA (2008)
10. Jindal, N., Liu, B., Lim, E.P.: Finding unusual review patterns using unexpected rules. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), pp. 1549–1552. Toronto, Canada, October 2010
11. Kuefler, A.R.: Merging recurrence and inception-like convolution for sentiment analysis. https://cs224d.stanford.edu/reports/akuefler.pdf (2016)
12. Lau, R.Y., Liao, S., Kwok, R.C.W., Xu, K., Xia, Y., Li, Y.: Text mining and probabilistic language modeling for online review spam detection. ACM Trans. Manage. Inf. Syst. (TMIS) 2(4), 25 (2011)
13. Li, F., Huang, M., Yang, Y., Zhu, X.: Learning to identify review spam. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2011), pp. 2488–2493 (2011). https://dl.acm.org/citation.cfm?id=2283811
14. Li, H.: Detecting Opinion Spam in Commercial Review Websites. Ph.D. thesis, University of Illinois at Chicago (2016)
15. Li, J., Ott, M., Cardie, C., Hovy, E.: Towards a general rule for identifying deceptive opinion spam. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL 2014), vol. 1, pp. 1566–1576 (2014)
16. Manning, C.D.: Computational linguistics and deep learning. Comput. Linguist. 41(4), 701–707 (2015)
17. Ott, M., Cardie, C., Hancock, J.T.: Negative deceptive opinion spam. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2013), pp. 497–501 (2013)
18. Pascanu, R., Gulcehre, C., Cho, K., Bengio, Y.: How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026 (2013)
19. Santos, C.D., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1818–1826 (2014)
20. Socher, R.: Deep learning for sentiment analysis - invited talk. In: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, p. 36. San Diego, California (2016)
21. Streitfeld, D.: The Best Book Reviews Money Can Buy. The New York Times, New York, 25 (2012)
22. Sun, S., Luo, C., Chen, J.: A review of natural language processing techniques for opinion mining systems. Inf. Fusion 36, 10–25 (2017)
23. Vaswani, A., et al.: Attention is all you need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (2017)
24. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
25. Vu, N.T., Adel, H., Gupta, P., Schütze, H.: Combining recurrent and convolutional neural networks for relation classification. arXiv preprint arXiv:1605.07333 (2016)
26. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML 2015), pp. 2048–2057. Lille, France (2015)
27. Zhang, D., Zhou, L., Kehoe, J.L., Kilic, I.Y.: What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews. J. Manage. Inf. Syst. 33(2), 456–481 (2016)
Multi-task Learning for Detecting Stance in Tweets

Devamanyu Hazarika1(B), Gangeshwar Krishnamurthy2, Soujanya Poria3, and Roger Zimmermann1

1 National University of Singapore, Singapore, Singapore {hazarika,rogerz}@comp.nus.edu.sg
2 Artificial Intelligence Initiative, Agency for Science Technology and Research (A*STAR), Singapore, Singapore [email protected]
3 Nanyang Technological University, Singapore, Singapore

D. Hazarika and G. Krishnamurthy contributed equally.

Abstract. Detecting stance of online posts is a crucial task to understand online content and trends. Existing approaches augment models with complex linguistic features, target-dependent properties, or increase complexity with attention-based modules or pipeline-based architectures. In this work, we propose a simpler multi-task learning framework with auxiliary tasks of subjectivity and sentiment classification. We also analyze the effect of regularization against inconsistent outputs. Our simple model achieves competitive performance with the state of the art in micro-F1 metric and surpasses existing approaches in macro-F1 metric across targets. We are able to show that multi-tasking with a simple architecture is indeed useful for the task of stance classification.

Keywords: Detecting stance · Sentiment analysis · Text classification

1 Introduction
Automatic detection of stance over text is an emerging task of opinion mining. In recent times, its importance has increased due to its role in practical applications. It is used in information retrieval systems to filter content based on the authors' stance, to analyze trends in politics and policies [14], and in summarization systems to understand online controversies [13]. It also finds its use in modern-day problems that plague the Internet, such as the identification of rumor or hate speech [21]. The task involves determining whether a piece of text, such as a tweet or debate post, is FOR, AGAINST, or NONE towards an entity, which can be a person, organization, product, policy, etc. (see Table 1). This task is challenging due to the use of informal language and literary devices such as sarcasm. For example, in the second sample in Table 1, the phrase Thank you God! can mislead a trained model into considering it a favoring stance. Challenges are also amplified because in many tweets the target of the stance may or may not be mentioned.
Table 1. Sample tweets representing stances against target topics.

Target                                | Tweet                                                                                                                                           | Stance
1 Climate change is a real concern    | Incredibly moving as a scientist weeps on @BBCRadio4 for the #ocean and for our grandchildren's future                                         | For
2 Atheism                             | I still remember the days when I prayed God for strength.. then suddenly God gave me difficulties to make me strong. Thank you God!            | Against
3 Feminist                            | When I go up the steps of my house I feel like the @ussoccer wnt .. I too have won the Women's World Cup. #brokenlegprobs #USA                 | Against
In the third sample in Table 1, the tweet doesn't talk about feminism in particular but rather mocks it indirectly using the Women's World Cup. Present state-of-the-art networks in this task largely follow the neural approach. These models increase their complexity either by adding extra complex features – such as linguistic features [26] – as input or through complex networks with attention modules or pipeline-based architectures [9,10]. In this paper, we restrict ourselves from increasing complexity and search for simple solutions for this task. To this end, we propose a simple convolutional neural network, named MTL-Stance, that adopts multi-task learning (MTL) for stance classification (Task A) by jointly training with the related tasks of subjectivity (Task B) and sentiment prediction (Task C). For subjectivity analysis, we categorize both the For and Against stances as subjective and the None stance as objective. It is important to note that, unlike traditional opinion mining, subjectivity here refers to the presence of a stance towards a target in a tweet. Conversely, the objective class contains tweets which either have no stance or are subjective but whose stance is indeterminable. To tackle inconsistent predictions (e.g. Task A predicts a For stance while Task B predicts objective), we explore a regularization term that penalizes the network for inconsistent outputs. Overall, subjectivity represents a coarse-grained version of stance classification and is thus expected to aid the task at hand. We also consider sentiment prediction (Task C) in the MTL framework to allow the model to learn common relations (if any). [18] mentions how sentiment and stance detection are related tasks. However, the two tasks are not the same, as a person might express the same stance towards a target with either a positive or a negative opinion. A clear relationship is also often missing, since the opinion expressed in a text might not be directed towards the target. Nevertheless, both tasks do tend to rely on some related parameters, which motivates us to investigate their joint training.
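As a tiny illustration of the label derivation described above (a sketch, not the authors' code), the auxiliary subjectivity label for Task B can be obtained directly from the stance label of Task A:

```python
# FOR and AGAINST are treated as subjective; NONE is treated as objective.
def subjectivity_label(stance):
    return "Objective" if stance == "NONE" else "Subjective"

assert subjectivity_label("FOR") == "Subjective"
assert subjectivity_label("NONE") == "Objective"
```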
The contributions of this work can be summarized as follows:

– We propose a multi-task learning algorithm for stance classification by associating the related tasks of subjectivity and sentiment detection.
– We demonstrate that a simple CNN-based classifier trained in an end-to-end fashion can supersede models having extra linguistic information or pipeline-based architectures.
– Our proposed model achieves results competitive with the state-of-the-art performance on the SemEval 2016 benchmark dataset [17], whilst having a simpler architecture with a single-phase end-to-end mechanism.

The paper is organized as follows: Sect. 2 presents the related works in the literature and compares them to our proposed work; Sect. 3 presents the proposed model and explains the MTL framework utilized for training; Sect. 4 details the experimental setup and the results on the dataset. Finally, Sect. 5 provides concluding remarks.
2 Related Work
The analysis of stance detection has been performed on various forms of text, such as debates in congress or on online platforms [12,24,27,30], student essays [20], company discussions [1], etc. With the popularity of social media, there has also been a surge of opinionated text in microblogs [17]. Traditional approaches incorporate linguistic features into their models, such as sentiment [24], lexicon-based and dependency-based features [2], and argument features [12]. Many works also use structural information from online user graphs or retweet links [19,22]. With the proliferation of deep learning, several neural approaches have been attempted on this task with state-of-the-art performance. Most of these works utilize either recurrent or convolutional networks to model the text and the target. In [3], the authors use a bi-directional recurrent network to jointly model the text along with the target by initializing the text network with the output of the target network. Convolutional neural networks have also been used for encoding text in this task [29]. Apart from basic networks, existing works also incorporate extra information as auxiliary input into their neural models. These features include user background information such as user tastes, comment history, etc. [5]. We focus on some recent works that have attained state-of-the-art performance. Specifically, we look at the Target-specific Attentional Network (TAN) [10], the Two-phase Attention-embedded LSTM (T-PAN) [9] and the Hierarchical Attention Network (HAN) [26]. Similar to [3], TAN uses a bi-directional LSTM scheme to model the task. It includes the target embeddings into the network by using an attention mechanism [4]. We follow similar motivations to use target-specific information. However, aiming to minimize network complexity, we opt for a simple concatenation-based fusion scheme. T-PAN stands closest to our proposed model as it too incorporates information from classifying the subjectivity of tweets. It is achieved by following a two-phase
model where, in the first phase, subjectivity is decided and, in the second phase, only the subjective tweets from the first phase are classified as favoring or non-favoring stances. In contrast to this approach, we do not use a pipeline-based approach as it bears a higher possibility of error propagation. Instead, in our MTL framework, both classifications are done simultaneously. The Hierarchical Attention Network (HAN), proposed by [26], consists of a hierarchical framework of attention-based layers which includes extra information from sentiment, dependency and argument representations. Unlike HAN, our model is not dependent on complex linguistic features. Rather, we enforce a simple CNN-based model that trains end-to-end under the multi-task regime.

Fig. 1. Stance-MTL: Multi-task framework for stance detection.
3 Proposed Approach
3.1 Task Formulation
The task of stance classification can be defined as follows: given a tweet text and its associated target entity, the aim of the model is to determine the author's stance towards the target. The possible stances are favoring (FOR), against (AGAINST) or inconclusive (NONE). The NONE class consists of tweets that either have a neutral stance or for which determining the stance is not easy.
3.2 Multi-task Learning
Multi-task learning (MTL) is a framework that requires optimizing a network towards multiple tasks [23]. The motivation arises from the belief that features learnt for a particular task can be useful for related tasks [7]. In particular, MTL exploits the synergies between related tasks through joint learning, where supervision from the related/auxiliary tasks provides an inductive bias into the network that allows it to generalize better. The effectiveness of the MTL framework is evident in the literature of various fields such as speech recognition [8], computer vision [11], natural language processing [6], and others.
MTL algorithms can be realized by two primary methods. The first is to train individual models for each task with a common regularization that enforces the models to be similar. The second is to follow a stronger association by sharing common weights across tasks. In this work, we take influence from both these approaches by using a shared model along with explicit regularization against inconsistent output combinations. Below we provide the details of our model, MTL-Stance.
3.3 Model Details
The overall architecture of MTL-Stance is shown in Fig. 1. It takes as input the tweet and its target. The inputs are processed by shared convolutional layers whose outputs are concatenated. The subsequent layers are separated into the three mentioned tasks. Concrete network details are given below.

Input Representation. A training example consists of a tweet text $\{Tw_i\}_{i=0}^{n}$, a target entity $\{Tr_i\}_{i=0}^{m}$, a stance label $y_1 \in \{For, Against, None\}$, the derived subjectivity label $y_2 \in \{Subjective, Objective\}$ and a sentiment label $y_3 \in \{Positive, Negative, Neither\}$. Both $Tw \in \mathbb{R}^{n \times k}$ and $Tr \in \mathbb{R}^{m \times k}$ are sequences of words represented in matrix form, with each word corresponding to its $k$-dimensional word vector [15].

Shared Parameters. To both the tweet and target representations, we apply a shared convolutional layer to extract higher-level features. We use multiple filters of different sizes. The width of each filter is fixed to $k$ but the height $h$ is varied (as a hyper-parameter). For example, let $w \in \mathbb{R}^{h \times k}$ be a filter which extracts a feature vector $z \in \mathbb{R}^{L-h+1}$, where $L$ is the length of the input. Each entry of the vector $z$ is given by:

$z_i = g(w \ast T_{i:i+h-1} + b)$

where $\ast$ is the convolution operation, $b \in \mathbb{R}$ is a bias term, and $g$ is a non-linear function. We then apply a max-over-time pooling operation over the feature vector $z$ to get its maximum value $\hat{z} = \max(z)$. The above convolution layer with $F_l$ filters is applied $M$ times on both tweet and target representations to get an output of $M \cdot F_l$ features. These feature representations of the tweet text and target text are denoted $F_{Tw}$ and $F_{Tr}$. Next, we obtain the joint representation by concatenating them, i.e., $F_T = [F_{Tw}; F_{Tr}] \in \mathbb{R}^{2 \cdot M \cdot F_l}$. This representation is fed to a non-linear fully-connected layer $fc_{inter}$ coupled with Dropout [25]:

$h_{inter} = fc_{inter}(F_T)$

This layer is also the last shared layer before the task-specific layers are applied.
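The following is a minimal tf.keras sketch of the shared encoder together with the task-specific heads described in the next paragraphs. It is our own illustration, not the authors' released code: the layer names, the ReLU non-linearity and the size of fc_inter are assumptions, while the hyper-parameter values follow Sect. 4.2.

```python
# Minimal sketch of the shared encoder and the three task-specific heads.
# Layer names, the ReLU non-linearity and the fc_inter size are assumptions;
# word vectors are assumed to be pre-computed (the paper fine-tunes them).
import tensorflow as tf
from tensorflow.keras import layers

k = 300                        # word-vector dimension
n, m = 30, 6                   # tweet / target lengths (Sect. 4.2)
F_l, heights = 128, (2, 3, 4)  # filters per convolution, window sizes h

tweet_in = layers.Input(shape=(n, k), name="tweet")
target_in = layers.Input(shape=(m, k), name="target")

# Shared convolutions: the same filter banks are applied to tweet and target.
convs = [layers.Conv1D(F_l, h, activation="relu") for h in heights]
pool = layers.GlobalMaxPooling1D()                 # max-over-time pooling

def encode(x):
    return layers.Concatenate()([pool(c(x)) for c in convs])   # M*F_l features

F_T = layers.Concatenate()([encode(tweet_in), encode(target_in)])  # 2*M*F_l
h_inter = layers.Dropout(0.5)(layers.Dense(300, activation="relu")(F_T))

# Task-specific layers and softmax projections (described in the next paragraphs).
outputs = [
    layers.Dense(c, activation="softmax", name=name)(
        layers.Dense(300, activation="relu")(h_inter))
    for name, c in (("stance", 3), ("subjectivity", 2), ("sentiment", 3))
]
model = tf.keras.Model([tweet_in, target_in], outputs)
```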
Task-Specific Layers. For each of the three tasks, i.e., stance, subjectivity, and sentiment classification, we use three different non-linear fully-connected layers. The layer weights are not shared among them so that they can individually learn task-specific features:

$h_i = fc_i(h_{inter}) \quad \forall i \in \{1, 2, 3\}$
Finally, we pass these features through another projection layer with softmax normalization to get the probability distribution over the labels for each task:

$\hat{y}_i = \mathrm{softmax}(W_i \cdot h_i + b_i) \quad \forall i \in \{1, 2, 3\}$

Loss Function. We use the categorical cross-entropy on each of the outputs to calculate the loss. We also add a joint regularization loss ($Reg_{loss}$) to the total loss function:

$Loss = \frac{-1}{N} \sum_{i=1}^{N} \left( \sum_{k=1}^{3} \sum_{j=1}^{C_k} y_k^{i,j} \log_2(\hat{y}_k^{i,j}) \right) + Reg_{loss} \quad (1)$

where $N$ is the number of mini-batch samples, $C_k$ is the number of categories for the $k$-th task (in the order: stance, subjectivity and sentiment), $y_k^{i,j}$ is the probability of the $i$-th sample of the $k$-th task for the $j$-th label, and similarly $\hat{y}_k^{i,j}$ is its predicted counterpart. In our setup, $C_1 = 3$, $C_2 = 2$ and $C_3 = 3$. The regularization term $Reg_{loss}$ is dependent on the output of the first two tasks, and defined as:

$Reg_{loss} = \alpha \cdot \left( \mathrm{sgn}|\operatorname{argmax}(\hat{y}_1^i)| \oplus \mathrm{sgn}|\operatorname{argmax}(\hat{y}_2^i)| \right) \quad (2)$

where $\alpha$ is a weighting term (hyper-parameter), $\mathrm{sgn}|.|$ is the sign function and $\oplus$ is a logical XOR operation used to penalize the instances where both subjectivity and stance are predicted with contradiction.
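A NumPy illustration of how Eqs. (1) and (2) could be computed for a mini-batch is given below. The class indexing (index 0 taken as the None stance and the Objective subjectivity label) and the averaging of the penalty over the batch are our assumptions, not details stated by the authors.

```python
# Illustrative NumPy computation of Eqs. (1)-(2) for one mini-batch.
# The class indexing (index 0 = "None" stance / "Objective" subjectivity) and
# the averaging of the penalty over the batch are assumptions.
import numpy as np

def mtl_loss(y_true, y_pred, alpha=5.0):
    # y_true / y_pred: lists of 3 arrays (stance, subjectivity, sentiment),
    # each of shape (N, C_k): one-hot targets and predicted distributions.
    N = y_true[0].shape[0]
    ce = 0.0
    for yt, yp in zip(y_true, y_pred):                      # k = 1..3
        ce += -np.sum(yt * np.log2(np.clip(yp, 1e-12, 1.0))) / N

    # Eq. (2): sgn(argmax) is 0 when the class at index 0 is predicted,
    # 1 otherwise; XOR flags stance/subjectivity predictions that contradict.
    has_stance = np.sign(np.argmax(y_pred[0], axis=1))
    is_subjective = np.sign(np.argmax(y_pred[1], axis=1))
    reg = alpha * np.mean(np.logical_xor(has_stance, is_subjective))
    return ce + reg
```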
4 Experiments
4.1 Dataset
We utilize the benchmark Twitter Stance Detection corpus for stance classification, originally proposed by [16] and later used in SemEval-2016 Task 6 [17]. The dataset presents the task of identifying the stance of a tweet's author towards a target, determining whether the author is favoring (For) or against (Against) a target, or whether neither inference is likely (None). The dataset comprises English tweets spanning five primary targets: Atheism, Climate Change is a Real Concern, Feminist Movement, Hillary Clinton, and Legalization of Abortion, with pre-defined training and testing splits. The distribution statistics are provided in Table 2. Along with stance labels, sentiment labels are also provided, which we use for joint training (see Table 2).
http://alt.qcri.org/semeval2016/task6/.
Table 2. Percentage distribution of stance and sentiment labels for instances (tweets) in the dataset across targets and splits

Target   | Stance Train: # / For / Against / Neither | Stance Test: # / For / Against / Neither | Sentiment Train: Pos / Neg / None | Sentiment Test: Pos / Neg / None
Atheism  | 513 / 17.9 / 59.3 / 22.8  | 220 / 14.5 / 72.7 / 12.7  | 60.4 / 35.0 / 4.4 | 59.0 / 35.4 / 5.4
Climate  | 395 / 53.7 / 3.8 / 42.5   | 169 / 72.8 / 6.5 / 20.7   | 60.4 / 35.0 / 4.4 | 59.0 / 35.4 / 5.4
Feminism | 664 / 31.6 / 49.4 / 19.0  | 285 / 20.4 / 64.2 / 15.4  | 17.9 / 77.2 / 4.8 | 19.3 / 76.1 / 4.5
Hillary  | 689 / 17.1 / 57.0 / 25.8  | 295 / 15.3 / 58.3 / 26.4  | 32.0 / 64.0 / 3.9 | 25.7 / 70.1 / 4.0
Abortion | 653 / 18.5 / 54.4 / 27.1  | 280 / 16.4 / 67.5 / 16.1  | 28.7 / 66.1 / 5.0 | 20.3 / 72.1 / 7.5
All      | 2914 / 25.8 / 47.9 / 26.3 | 1249 / 24.3 / 57.3 / 18.4 | 33.0 / 60.4 / 6.4 | 29.4 / 63.3 / 7.2
4.2 Training Details
We use the standard training and testing sets provided in the dataset. Hyper-parameters are tuned using held-out validation data (10% of the training data). To optimize the parameters, we use the RMSProp [28] optimizer with an initial learning rate of 1e-4. The hyper-parameters are Fl = 128 and M = 3, and the window sizes (h) of the three filters are 2, 3 and 4. We fix the tweet length n to 30 and the target length m to 6. The number of hidden units in the task-specific layers fc[1/2/3] is 300. We initialize the word vectors with the 300-dimensional pre-trained word2vec embeddings [15], which are optimized during training. Following previous works, we train different models for different targets but with the same hyper-parameters, and the final result is the concatenation of the predictions of these models; a sketch of this per-target setup is given below.
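The per-target training regime described above could look roughly as follows; build_stance_mtl, load_target_split and load_target_test are hypothetical helpers (the first being the encoder sketch from Sect. 3.3), and the epoch count is illustrative.

```python
# Hypothetical per-target training setup; build_stance_mtl, load_target_split
# and load_target_test are assumed helpers, and the epoch count is illustrative.
import tensorflow as tf

TARGETS = ["Atheism", "Climate", "Feminism", "Hillary", "Abortion"]
predictions = []
for target in TARGETS:
    model = build_stance_mtl()                       # sketch from Sect. 3.3
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
                  loss=["categorical_crossentropy"] * 3)
    x_train, y_train = load_target_split(target)     # tweet/target inputs, 3 label sets
    model.fit(x_train, y_train, validation_split=0.1, epochs=30)
    predictions.append(model.predict(load_target_test(target)))
# The final result is the concatenation of the per-target predictions.
```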
4.3 Baselines
We compare MTL-Stance with the following baseline methods:
– SVM: This model accounts for a non-neural baseline that has been widely used in previous works [17]. The model uses simple bag-of-words features for stance classification.
– LSTM: A simple LSTM model without target features for classification.
– TAN: An RNN-based architecture that uses a target-specific attention module to focus on the parts of the tweet that are related to the target topic [10].
– T-PAN: A two-phase model for classifying the stance [9]. The first phase classifies subjectivity and the second phase classifies the stance based on the first phase. Concretely, utterances classified as objective in the first phase are dropped from the second phase and assigned the None label.
– HAN: A hierarchical attention model which uses linguistic features that include sentiment, dependency and argument features [26].
Table 3. Comparison of MTL-Stance with state-of-the-art models on the Twitter Stance Detection corpus. MTL-Stance results are the average of 5 runs with different initializations.

Model      | Atheism | Climate | Feminism | Hillary | Abortion | MacF_avg | MicF_avg
SVM        | 62.16   | 42.91   | 56.43    | 55.68   | 60.38    | 55.51    | 67.01
LSTM       | 58.18   | 40.05   | 49.06    | 61.84   | 51.03    | 52.03    | 63.21
TAN        | 59.33   | 53.59   | 55.77    | 65.38   | 63.72    | 59.56    | 68.79
T-PAN      | 61.19   | 66.27   | 58.45    | 57.48   | 60.21    | 60.72    | 68.84
HAN        | 70.53   | 49.56   | 57.50    | 61.23   | 66.16    | 61.00    | 69.79
MTL-Stance | 66.15   | 64.66   | 58.82    | 66.27   | 67.54    | 64.69    | 69.88
4.4 Evaluation Metrics
We use both micro-average and macro-average of the F1-score across targets as our evaluation metrics, as defined by [26]. The F1-score for the Favor and Against categories over all instances is calculated as:

$F_{[favor/against]} = \frac{2 \times P_{[favor/against]} \times R_{[favor/against]}}{P_{[favor/against]} + R_{[favor/against]}} \quad (3)$

where $P$ and $R$ are precision and recall. Then the final metric $MicF_{avg}$ is the average of $F_{favor}$ and $F_{against}$:

$MicF_{avg} = \frac{F_{favor} + F_{against}}{2} \quad (4)$
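A plain-Python version of Eqs. (3) and (4) is sketched below; the string labels are assumed placeholders for however the gold and predicted stances are encoded.

```python
# Plain-Python version of Eqs. (3) and (4); label strings are assumed.
def f1_for(label, gold, pred):
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mic_f_avg(gold, pred):
    # Average of the Favor and Against F1-scores, Eq. (4).
    return (f1_for("FAVOR", gold, pred) + f1_for("AGAINST", gold, pred)) / 2
```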
4.5 Results
Table 3 shows the performance results on the Twitter Stance Detection corpus. Our model, MTL-Stance, performs significantly better than the state-of-the-art models across most targets. The SVM model does not perform well since it uses only bag-of-words features of the tweet text. The LSTM model also does not exploit the information from the target text; hence its performance is significantly lower, even though it uses a neural architecture. On the other hand, neural models such as TAN, T-PAN, and HAN use both tweet and target text and outperform both SVM and LSTM. This indicates that target information is a useful feature for stance classification.
4.6 Ablation Study
We further experiment with different variations of the MTL-Stance model to analyze the extent to which various features of our model contribute to the performance. The variants of the model are as follows:
– Single: This model does not use the multi-task learning framework. The model is trained with only stance labels.
– Single + subj.: This model uses the multi-task framework with subjectivity labels along with the stance labels.
– Single + subj. + regloss: This model further adds the regularization loss (see Sect. 3.3) to penalize mismatched outputs.
– Single + sent.: This model uses the multi-task framework with sentiment labels along with the stance labels.
– MTL-Stance: Our final model, which uses multi-task learning with the regularization loss. This model uses all three labels: subjectivity, sentiment and stance.
Table 4 provides the performance results of these models. As seen, understanding the subjectivity of a tweet towards the target helps the model make a better judgment about its stance. Intuitively, a tweet that has no stance towards the target tends to be objective, while one with an opinion tends to be subjective. The addition of the regularization penalty further improves the overall performance. Analyzing the confusion matrix between the sentiment and stance labels reveals that stance and sentiment are not correlated [18]. Yet, the addition of the sentiment classification task in MTL improves the performance of the model. This indicates the presence of common underlying relationships that the model is able to learn and exploit. Also note that our single model consists of a very simple architecture and does not beat the state-of-the-art models described in Table 3, but the same architecture outperforms them with a multi-task learning objective and regularization loss. This indicates that the performance could be improved significantly further if complex neural architectures are combined with multi-task learning for stance classification.

Table 4. Ablation across auxiliary tasks. Note: subj. = subjectivity (Task B), sent. = sentiment (Task C)
Model              | Atheism | Climate | Feminism | Hillary | Abortion | MacF_avg | MicF_avg
Single (stance)    | 63.71   | 43.89   | 58.75    | 63.12   | 63.05    | 58.50    | 67.40
+ subj.            | 66.27   | 51.70   | 56.70    | 62.57   | 65.03    | 60.45    | 67.41
+ subj. + regloss  | 64.87   | 51.53   | 60.09    | 64.25   | 65.49    | 61.25    | 68.40
+ sent.            | 66.30   | 62.06   | 56.19    | 63.11   | 64.44    | 62.42    | 67.76
MTL-Stance         | 66.15   | 64.66   | 58.82    | 66.27   | 67.54    | 64.69    | 69.88
4.7 Importance of Regularization
Table 5 shows the effect of the regularization loss that we introduce in this paper. The regularization loss allows the model to learn the correlation between subjectivity and stance more effectively by penalizing cases where the model predicts a tweet as subjective but with a neutral stance (or vice-versa). The performance improvement shows the effectiveness of the regularization in our model.
Table 5. MTL-Stance with and without regularization loss
RegLoss | Atheism | Climate | Feminism | Hillary | Abortion | MacF_avg | MicF_avg
No      | 64.03   | 63.75   | 58.46    | 64.21   | 68.58    | 63.81    | 68.13
Yes     | 66.15   | 64.66   | 58.82    | 66.27   | 67.54    | 64.69    | 69.88
4.8 Effect of Regularization Strength (α)
Figure 2 shows the performance trend of the model as α in the regularization loss is varied. At α = 5, the model reaches its highest performance, with MicF_avg = 69.88 and MacF_avg = 64.69. As the value of α is increased further, we observe that the performance of the model starts dropping. This is expected, as the model starts under-fitting once α exceeds a value of 10, and performance continues to drop as α is increased.
Fig. 2. Performance plot of the MTL-Stance model when α in regularization loss is varied.
4.9 Case Study and Error Analysis
We present an analysis of a few instances where our model, MTL-Stance, succeeds and fails in predicting the correct stance. We also include the subjectivity of the tweet that the model predicts, for more insight into the model's behavior.

Tweet: @violencehurts @WomenCanSee The most fundamental right of them all, the right to life, is also a right of the unborn. #SemST
Target: Legalization of Abortion
Actual Stance: Against. Predicted Stance: Against
Actual Sentiment: Positive. Predicted Sentiment: Positive
Predicted Subjectivity: Subjective
This is an example of a tweet having positive sentiment while expressing an opposing opinion towards the target. Though the stance of the tweet towards the target is Against, the overall sentiment of the tweet, without considering the target, is Positive. We observe that MTL-Stance is able to capture such complex relationships across many instances in the test set.

Tweet: @rhhhhh380 What we need to do is support all Republicans and criticize the opposition. #SemST
Target: Hillary Clinton
Actual Stance: Against. Predicted Stance: None
Predicted Subjectivity: Subjective
For the tweet above, MTL-Stance predicts None whereas the true stance is Against. The tweet is targeted towards 'Hillary Clinton', but we observe that the author refers to Republicans and not the target directly. This is a challenging example since it requires knowledge about the relation between both entities (Hillary and Republicans) to predict the stance label correctly. MTL-Stance, however, is able to correctly predict the subjectivity label, which demonstrates that it captures some of these patterns in the coarse-grained classification.

Tweet: Please vote against the anti-choice amendment to the Scotland Bill on Monday @KevinBrennanMP - Thanks! #abortionrights #SemST
Target: Legalization of Abortion
Actual Stance: Against. Predicted Stance: Favor
Predicted Subjectivity: Subjective
The above instance illustrates other challenges that models face in predicting the stance correctly. The tweet contains multiple negations and requires multi-hop inference in order to come to the right conclusion about the stance. Handling such cases demands rigorous design and fundamental reasoning capabilities.
5 Conclusion
In this paper, we introduced MTL-Stance, a novel model that leverages sentiment and subjectivity information for stance classification through a multi-task learning setting. We also proposed a regularization loss that helps the model learn the correlation between subjectivity and stance more effectively. MTL-Stance
uses a simple end-to-end CNN architecture for stance classification. In addition, it does not use any extra linguistic features or pipeline methods. The experimental results show that MTL-Stance outperforms state-of-the-art models on the Twitter Stance Detection benchmark dataset.

Acknowledgment. This research is supported by Singapore Ministry of Education Academic Research Fund Tier 1 under MOE's official grant number T1 251RES1820.
References 1. Agrawal, R., Rajagopalan, S., Srikant, R., Xu, Y.: Mining newsgroups using networks arising from social behavior. In: Proceedings of the 12th International Conference on World Wide Web, pp. 529–535. ACM (2003) 2. Anand, P., Walker, M., Abbott, R., Tree, J.E.F., Bowmani, R., Minor, M.: Cats rule and dogs drool!: classifying stance in online debate. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pp. 1–9. Association for Computational Linguistics (2011) 3. Augenstein, I., Rockt¨ aschel, T., Vlachos, A., Bontcheva, K.: Stance detection with bidirectional conditional encoding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 876–885 (2016) 4. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949. IEEE (2016) 5. Chen, W.F., Ku, L.W.: Utcnn: a deep learning model of stance classification on social media text. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1635–1645 (2016) 6. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. ACM (2008) 7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493– 2537 (2011) 8. Deng, L., Hinton, G., Kingsbury, B.: New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8599–8603. IEEE (2013) 9. Dey, K., Shrivastava, R., Kaushik, S.: Topical stance detection for twitter: a twophase LSTM model using attention. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 529–536. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7 40 10. Du, J., Xu, R., He, Y., Gui, L.: Stance classification with target-specific neural attention networks. In: International Joint Conferences on Artificial Intelligence (2017) 11. Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015) 12. Hasan, K.S., Ng, V.: Why are you taking this stance? identifying and classifying reasons in ideological debates. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 751–762 (2014)
13. Jang, M., Allan, J.: Explaining controversy on social media via stance summarization. arXiv preprint arXiv:1806.07942 (2018) 14. Lai, M., Hern´ andez Far´ıas, D.I., Patti, V., Rosso, P.: Friends and enemies of clinton and trump: using context for detecting stance in political tweets. In: Sidorov, G., Herrera-Alc´ antara, O. (eds.) MICAI 2016. LNCS (LNAI), vol. 10061, pp. 155–168. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62434-1 13 15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013) 16. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: A dataset for detecting stance in tweets. In: LREC (2016) 17. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: Semeval-2016 task 6: detecting stance in tweets. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 31–41 (2016) 18. Mohammad, S.M., Sobhani, P., Kiritchenko, S.: Stance and sentiment in tweets. ACM Trans. Internet Technol. (TOIT) 17(3), 26 (2017) 19. Murakami, A., Raymond, R.: Support or oppose?: classifying positions in online debates from reply activities and opinion expressions. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 869–875. Association for Computational Linguistics (2010) 20. Persing, I., Ng, V.: Modeling stance in student essays. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 2174–2184 (2016) 21. Poddar, L., Hsu, W., Lee, M.L., Subramaniyam, S.: Predicting stances in twitter conversations for detecting veracity of rumors: a neural approach. In: 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 65–72. IEEE (2018) 22. Rajadesingan, A., Liu, H.: Identifying users with opposing opinions in Twitter debates. In: Kennedy, W.G., Agarwal, N., Yang, S.J. (eds.) SBP 2014. LNCS, vol. 8393, pp. 153–160. Springer, Cham (2014). https://doi.org/10.1007/978-3-31905579-4 19 23. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017) 24. Somasundaran, S., Wiebe, J.: Recognizing stances in ideological on-line debates. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 116–124. Association for Computational Linguistics (2010) 25. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) 26. Sun, Q., Wang, Z., Zhu, Q., Zhou, G.: Stance detection with hierarchical attention network. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2399–2409 (2018) 27. Thomas, M., Pang, B., Lee, L.: Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 327–335. Association for Computational Linguistics (2006) 28. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)
29. Vijayaraghavan, P., Sysoev, I., Vosoughi, S., Roy, D.: Deepstance at semeval-2016 task 6: detecting stance in tweets using character and word-level CNNs. In: Proceedings of SemEval, pp. 413–419 (2016) 30. Walker, M., Anand, P., Abbott, R., Grant, R.: Stance classification using dialogic properties of persuasion. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 592–596. Association for Computational Linguistics (2012)
Related Tasks Can Share! A Multi-task Framework for Affective Language

Kumar Shikhar Deep, Md Shad Akhtar(B), Asif Ekbal, and Pushpak Bhattacharyya

Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, India
{shikhar.mtcs17,shad.pcs15,asif,pb}@iitp.ac.in

Abstract. Expressing the polarity of sentiment as 'positive' or 'negative' usually has limited scope compared with the intensity/degree of the polarity. These two tasks (i.e. sentiment classification and sentiment intensity prediction) are closely related and may offer assistance to each other during the learning process. In this paper, we propose to leverage the relatedness of multiple tasks in a multi-task learning framework. Our multi-task model is based on a convolutional-Gated Recurrent Unit (GRU) framework, which is further assisted by a diverse hand-crafted feature set. Evaluation and analysis suggest that joint learning of the related tasks in a multi-task framework can outperform each of the individual tasks in single-task frameworks.

Keywords: Multi-task learning · Single-task learning · Sentiment classification · Sentiment intensity prediction

1 Introduction
In general, people are always interested in what other people are thinking and what opinions they hold on a number of topics like products, politics, news, sports, etc. The number of people expressing their opinions on various social media platforms such as Twitter, Facebook, LinkedIn, etc. is continuously growing. These social media platforms have made it possible for researchers to gauge public opinion on their topics of interest, and that too on demand. With the increase of content on social media, the automation of Sentiment Analysis [20] is very much required and in huge demand. Users' opinions extracted from these social media platforms are being used as inputs to assist decision making in a number of applications such as business analysis, market research, stock market prediction, etc. Coarse-grained sentiment classification (i.e. classifying a text into either positive or negative sentiment) is a well-established and well-studied task [10]. However, such binary/ternary classification does not always reveal the exact state of the human mind. We use language to communicate not only our sentiments but also the intensity of those sentiments; e.g. one could judge that we are very angry, slightly sad, very much elated, etc. through our utterances. Intensity refers to the degree of sentiment a person may express through their text. It
Table 1. Example sentences with their sentiment classes and intensity scores from the SemEval-2018 dataset on Affect in Tweets [15].

Tweet | Valence | Intensity
@LoveMyFFAJacket FaceTime - we can still annoy you | Pos-S | 0.677
and i shouldve cut them off the moment i started hurting myself over them | Neg-M | 0.283
@VescioDiana You forgot #laughter as well | Pos-S | 0.700
also facilitates analyzing the sentiment at a much finer level rather than only expressing the polarity of the sentiment as positive or negative. In recent times, studies on the amount of positiveness and negativeness of a sentence (i.e. how positive/negative a sentence is, or the degree of positiveness/negativeness) have gained attention due to their potential applications in various fields. A few example sentences are depicted in Table 1. In this work, we focus on fine-grained sentiment analysis [29]. Further, we aim to solve the fine-grained analysis through two different lenses, i.e. fine-grained sentiment classification and sentiment intensity prediction.
– Sentiment or Valence Classification: In this task, we classify each tweet into one of seven possible fine-grained classes (corresponding to various levels of positive and negative sentiment intensity) that best represents the mental state of the tweeter, i.e. very positive (Pos-V), moderately positive (Pos-M), slightly positive (Pos-S), neutral (Neu), slightly negative (Neg-S), moderately negative (Neg-M), and very negative (Neg-V).
– Sentiment or Valence Intensity Prediction: Unlike the discrete labels in the classification task, in intensity prediction we determine the degree or arousal of sentiment that best represents the sentimental state of the user. The scores are real-valued numbers in the range 0 to 1, with 1 representing the highest intensity or arousal.
The two tasks, i.e. sentiment classification and intensity prediction, are related and depend on each other. Building a separate system for each task is often less economical and more complex than a single multi-task system that handles both tasks together. Further, joint learning of two (or more) related tasks provides great assistance to each task and also offers generalization across multiple tasks. In this paper, we propose a hybrid neural network based multi-task learning framework for sentiment classification and intensity prediction for tweets. Our network utilizes a bidirectional gated recurrent unit (Bi-GRU) [28] network in cascade with a convolutional neural network (CNN) [13]. The max-pooled features and a diverse set of hand-crafted features are then concatenated and subsequently fed to the task-specific softmax layer for the final prediction. We evaluate
Valence signifies the pleasant/unpleasant scenarios.
our approach on the benchmark dataset of the SemEval-2018 shared task on Affect in Tweets [15]. We observe that our proposed multi-task framework attains better performance when both tasks are learned jointly. The rest of the paper is organized as follows. In Sect. 2, we discuss the related work. We present our proposed approach in Sect. 3. In Sect. 4, we describe our experimental results and analysis. Finally, we conclude in Sect. 5.
2 Related Work
The motivation behind applying a multi-task model to sentiment analysis comes from [27], which gives a general overview of multi-task learning using deep learning techniques. Multi-task learning (MTL) is not only applied to Natural Language Processing [4] tasks, but has also shown success in areas such as computer vision [9], drug discovery [24] and many others. The authors in [5] used a stacking ensemble technique: hand-crafted features were passed individually through several classifiers/regressors, whose results were merged and fed to a meta classifier/regressor to produce the final prediction; this was reported to have achieved state-of-the-art performance. The authors in [8] used bidirectional Long Short Term Memory (biLSTM) and LSTM networks with an attention mechanism and performed transfer learning by first pre-training the LSTM networks on sentiment data; the penultimate layers of these networks were then concatenated to form a single vector which was fed as input to the dense layers. A gated recurrent unit (GRU) based model with a convolutional neural network (CNN) attention mechanism and stacking-based ensembles was proposed by [26]. In [14], the authors combined three different features generated using deep learning models and traditional methods in support vector machines (SVMs) to create a unified ensemble system. In [21], the authors used a neural network model to extract features by transferring emotional knowledge into it, and passed these features through machine learning models like support vector regression (SVR) and logistic regression. In [2], the authors used a Bi-LSTM in their architecture; to improve the model's performance, they applied a multi-layer self-attention mechanism, which is capable of identifying salient words in tweets as well as giving insight into the model, making it more interpretable. Our proposed model differs from previous models in that we propose an end-to-end neural network based approach that performs both sentiment classification and sentiment intensity prediction simultaneously. We use gated recurrent units (GRU) along with a convolutional neural network (CNN), inspired by [26]. We feed the hidden states of the GRU to a CNN layer in order to get a fixed-size vector representation of each sentence. We also use various features extracted from pre-trained resources such as DeepMoji [7], Skip-Thought Vectors [12], the Unsupervised Sentiment Neuron [23] and EmoInt [5].
Fig. 1. Proposed architecture
3 Proposed Methodology
In this section, we describe our proposed multi-task framework in detail. Our model consists of a recurrent layer (biGRU) followed by a CNN module. Given a tweet, the GRU learns a contextual representation of each word in the sentence, i.e. the representation of each word is learnt based on the sequence of words in the sentence. This representation is then used as input to the CNN module to obtain the sentence representation. Subsequently, we apply max-pooling over the convolved features of each filter and concatenate them. The hidden representation, as obtained from the CNN module, is shared across the multiple tasks (here, two tasks, i.e. sentiment classification and intensity prediction). Further, the hidden representation is assisted by a diverse set of hand-crafted features (c.f. Sect. 3.1) for the final prediction. In our work, we experiment with two different paradigms of predictors: a) the first model is the traditional deep learning framework that makes use of a softmax (or sigmoid) function in the output layer, and b) the second model is developed by replacing the softmax classifier with a support vector machine (SVM) [31] (or support vector regressor (SVR)). In the first model, we feed the concatenated representation to two separate fully-connected layers with softmax (classification) and sigmoid (intensity) functions for the two tasks. In the second model, we feed the hidden representations as feature vectors to the SVM and SVR, respectively, for the prediction. A high-level block diagram of the proposed methodology is depicted in Fig. 1, and an illustrative sketch of the network is given below.
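The following tf.keras sketch illustrates the pipeline: dimensions follow Sect. 4.3, while the filter allocation per n-gram size, the hand-crafted feature dimensionality and the Dropout placement are our assumptions rather than the authors' exact configuration.

```python
# Illustrative tf.keras sketch of the BiGRU -> CNN -> shared representation
# pipeline; per-size filter counts, feature dimensionality and Dropout
# placement are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

feat_dim = 64 + 2304 + 4800 + 4096 + 133     # feature blocks quoted in Sect. 3.1

words_in = layers.Input(shape=(50, 300), name="word_vectors")   # padded tweet
feats_in = layers.Input(shape=(feat_dim,), name="handcrafted")

h = layers.Bidirectional(layers.GRU(256, return_sequences=True))(words_in)
pooled = [layers.GlobalMaxPooling1D()(layers.Conv1D(100, size, activation="relu")(h))
          for size in (2, 3, 4, 5, 6)]
shared = layers.Dropout(0.5)(layers.Concatenate()(pooled + [feats_in]))

valence = layers.Dense(7, activation="softmax", name="valence")(shared)      # 7 classes
intensity = layers.Dense(1, activation="sigmoid", name="intensity")(shared)  # [0, 1]

model = tf.keras.Model([words_in, feats_in], [valence, intensity])
model.compile(optimizer="adam",
              loss={"valence": "categorical_crossentropy", "intensity": "mse"})
```

In the second (ML-based) paradigm, the tensor `shared` would instead be extracted as a feature vector and passed to an SVM/SVR.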
3.1 Hand-Crafted Features
We perform transfer learning from various state-of-the-art deep learning techniques. The following paragraphs explain these resources in detail:
– DeepMoji [7]: DeepMoji performs distant supervision on a very large dataset [19,32] (1.2 billion tweets) comprising noisy labels (emojis). By incorporating transfer learning on various downstream tasks, the authors were able to outperform the state-of-the-art results on 8 benchmark datasets covering 3 NLP tasks across 5 domains. Since our target task is closely related, we adapt this model for our domain. We extract 2 different feature sets:
  • the embeddings from the softmax layer, which are of 64 dimensions;
  • the embeddings from the attention layer, which are of 2304 dimensions.
– Skip-Thought Vectors [12]: Skip-thought is a model trained to reconstruct the surrounding sentences, so as to map sentences that share semantic and syntactic properties into similar vectors. It is able to produce a highly generic semantic representation of a sentence. The skip-thought model has two parts:
  • Encoder: generally a Gated Recurrent Unit (GRU) whose final hidden state is passed to dense layers to get a fixed-length vector representation of each sentence.
  • Decoder: takes this vector representation as input and tries to generate the previous and next sentences; two different GRUs are needed for this.
  Due to its fixed-length vector representation, skip-thought is helpful for our task. The feature extracted from the skip-thought model is of dimension 4800.
– Unsupervised Sentiment Neuron: [23] developed an unsupervised system which learned an excellent representation of sentiment. The model was originally designed to generate Amazon product reviews, but its developers discovered that a single unit of the network was highly predictive of the sentiment of texts. It was able to classify reviews as positive or negative, and its performance was found to be better than some popular models; they even obtained encouraging results when applying the model to the dataset of Yelp reviews and the binary subset of the Stanford Sentiment Treebank. Thus the sentiment neuron model can be used to extract features by transfer learning. The features extracted from the Sentiment Neuron model are of dimension 4096.
– EmoInt [5]: We also use various lexical features, apart from the pre-trained embeddings. EmoInt [5] is a package which provides a high-level wrapper to combine various lexical resources. The lexical features include the following:
  – AFINN [17] contains a list of words which are manually rated for valence between -5 and +5, where -5 indicates very negative sentiment and +5 indicates very positive sentiment.
  – SentiWordNet [1] is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity.
  – SentiStrength [32] gives an estimation of the strength of positivity and negativity of sentiment.
  – The NRC Hashtag Emotion Lexicon [17] consists of emotion word associations computed via hashtags on Twitter texts labelled by emotions.
  – The NRC Word-Emotion Association Lexicon [17] consists of 8 sense-level associations (anger, fear, joy, sadness, anticipation, trust, disgust and surprise) and 2 sentiment-level associations (positive and negative).
  – The NRC Affect Intensity lexicon [16] provides real-valued affect intensity scores.
The final feature vector is the concatenation of all the individual lexicon features; this feature vector is of size (133, 1). A small sketch of how the feature blocks are assembled is given below.
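How the pre-trained feature blocks quoted above might be assembled into the hand-crafted feature set is sketched below; every helper function is an assumed wrapper around the corresponding pre-trained model or lexicon package, not an actual API.

```python
# Assumed assembly of the hand-crafted feature set; each helper stands in for
# the corresponding pre-trained model or lexicon wrapper described above.
import numpy as np

def handcrafted_features(tweet):
    blocks = [deepmoji_softmax(tweet),      # 64-d
              deepmoji_attention(tweet),    # 2304-d
              skip_thought(tweet),          # 4800-d
              sentiment_neuron(tweet),      # 4096-d
              emoint_lexicons(tweet)]       # 133-d concatenation of the lexicon scores
    return np.concatenate(blocks)
```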
3.2 Word Embeddings
The embedding matrix is generated from the pre-processed text using a combination of three pre-trained embeddings:
1. Pre-trained GloVe embeddings for tweets [22]: We use 200-dimensional pre-trained GloVe word embeddings, trained on the Twitter corpus, for the experiments. To make them compatible with the other embeddings, we pad a 100-dimensional zero vector to each embedding.
2. Emoji2Vec [6]: Emoji2Vec provides 300-dimensional vectors for the most commonly used emojis on the Twitter platform (used in case an emoji is not replaced with its corresponding meaning).
3. Character-level embeddings (https://github.com/minimaxir/char-embeddings): Character-level embeddings are trained over the Common Crawl GloVe corpus, providing a 300-dimensional vector for each character (used in case a word is not present in the other two embeddings).
The procedure to generate representations for a tweet using all these embeddings is described in Algorithm 1.
Algorithm 1. Procedure to generate representations
for word in tweet do
    if word in GloVe then
        word_vector = get_vector(GloVe, word)
    else if word in Emoji2Vec then
        word_vector = get_vector(Emoji2Vec, word)
    else
        /* n = number of characters in word */
        word_vector = (1/n) * sum_{i=1..n} get_vector(CharEmbed, chars[i])
    end if
end for
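A possible Python rendering of Algorithm 1 is shown below; the embedding tables are assumed to behave like dictionaries mapping tokens or characters to 300-dimensional vectors.

```python
# Possible Python rendering of Algorithm 1; the embedding tables are assumed
# to behave like dictionaries mapping tokens/characters to 300-d vectors.
import numpy as np

def word_representation(word, glove, emoji2vec, char_embed):
    if word in glove:
        return glove[word]
    if word in emoji2vec:
        return emoji2vec[word]
    # Fall back to the average of the character embeddings.
    return np.mean([char_embed[c] for c in word], axis=0)

def tweet_representation(tokens, glove, emoji2vec, char_embed):
    return np.stack([word_representation(w, glove, emoji2vec, char_embed)
                     for w in tokens])
```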
4 Experiments and Results
4.1 Dataset
We evaluate our proposed model on the dataset of the SemEval-2018 shared task on Affect in Tweets [15]. There are approximately 1181, 449 and 937 tweets for
[Bar chart omitted: tweet counts per sentiment class for the Train (1181), Development (449) and Test (937) sets.]
(a) Sentiment class distribution. (b) Sentiment intensity distribution.
Fig. 2. Sentiment distribution for SemEval-2018 task on Affect in Tweets [15]
training, development and testing, respectively. For each tweet, two labels are given: a) a sentiment class (one of seven classes on the sentiment scale, i.e. very positive (Pos-V), moderately positive (Pos-M), slightly positive (Pos-S), neutral (Neu), slightly negative (Neg-S), moderately negative (Neg-M), and very negative (Neg-V)); and b) an intensity score in the range 0 to 1. We treat the prediction of these two labels as two separate tasks. In our multi-task learning framework, we intend to solve these two tasks together. Brief statistics of the dataset are depicted in Fig. 2.
4.2 Preprocessing
Tweets in raw form are noisy because of the use of irregular, shortened forms of text (e.g. hlo, whtsgoin, etc.), emojis and slang, and are prone to many distortions in terms of semantic and syntactic structure. The preprocessing step modifies the raw tweets to prepare them for feature extraction. We use the Ekphrasis tool [3] for tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction. Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. It uses word statistics from 2 big corpora, i.e. English Wikipedia and Twitter (330 million English tweets). Ekphrasis was developed as part of the text processing pipeline for the SemEval-2017 shared task on Sentiment Analysis in Twitter [25]. We list the preprocessing steps that have been carried out below; a rough illustration of these steps follows the list.
– All characters in the text are converted to lower case.
– Remove punctuation except ! and ?, because '!' and '?' may contribute to a better result for valence detection.
– Remove extra spaces and newline characters.
– Group similar emojis and replace them with their meaning in words using Emojipedia.
– Named entity recognition and replacement with a keyword or token (@shikhar → username, https://www.iitp.ac.in → url).
– Split the hashtags (#iamcool → i am cool).
– Correct misspelled words (facbok → facebook).
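The sketch below is a rough, library-free illustration of the listed steps; the paper itself relies on Ekphrasis, and the regexes as well as the segment()/spell_correct() helpers here are stand-ins, not the actual pipeline.

```python
# Rough, library-free illustration of the listed steps; segment() and
# spell_correct() are assumed stand-in helpers, not Ekphrasis calls.
import re

def preprocess(tweet):
    t = tweet.lower()
    t = re.sub(r"@\w+", " username ", t)                       # mentions -> token
    t = re.sub(r"https?://\S+", " url ", t)                    # URLs -> token
    t = re.sub(r"#(\w+)", lambda m: segment(m.group(1)), t)    # split hashtags
    t = re.sub(r"[^\w\s!?]", " ", t)                           # keep only ! and ?
    t = re.sub(r"\s+", " ", t).strip()                         # drop extra spaces/newlines
    return [spell_correct(tok) for tok in t.split()]           # fix misspellings
```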
4.3 Experiments
We pad each tweet to a maximum length of 50 words. We employ 300-dimensional word embeddings for the experiments (c.f. Sect. 3.2). The GRU dimension is set to 256. We use 100 different filters of varying sizes (i.e. 2-gram, 3-gram, 4-gram, 5-gram and 6-gram filters) with a max-pool layer in the CNN module. We use the ReLU [18] activation and set the Dropout [30] rate to 0.5. We optimize our model using Adam [11] with cross-entropy and mean-squared-error (MSE) loss functions for sentiment classification and intensity prediction, respectively. For the experiments, we employ the Python-based deep learning library Keras with TensorFlow as the backend. We adopt the official evaluation metric of the SemEval-2018 shared task on Affect in Tweets [15], i.e. the Pearson correlation coefficient, for measuring the performance of both tasks. We train our model for a maximum of 100 epochs with an early stopping criterion (patience = 20).

Table 2. Pearson correlation for STL and MTL frameworks for sentiment classification and intensity prediction. + Reported in [5]; * Reproduced by us.
Framework | Sentiment classification: DL (softmax) | Sentiment classification: ML (SVM) | Intensity prediction: DL (sigmoid) | Intensity prediction: ML (SVR)
Single-task learning (STL) | 0.361 | 0.745 | 0.821 | 0.818
Multi-task learning (MTL)  | 0.408 | 0.772 | 0.825 | 0.830
State-of-the-art [5]       | 0.836+ (0.776*) |       | 0.873+ (0.829*) |
In the single-task learning (STL) framework, we build separate systems for sentiment classification and intensity prediction. We pass the normalized tweet to our convolutional-GRU framework for learning. Since the number of training samples is too small to effectively learn a deep model, we assist the model with various hand-crafted features. The concatenated representations are fed to the softmax layer (or sigmoid) for the sentiment (intensity) prediction. We obtain a Pearson coefficient of 0.361 for sentiment classification and 0.821 for intensity prediction. Further, we also exploit traditional machine learning algorithms for prediction: we extract the concatenated representations and feed them as input to an SVM for sentiment classification and an SVR for intensity prediction. Consequently, the SVM attains an increased Pearson score of 0.745 for sentiment classification, whereas we observe a comparable result (a Pearson score of 0.818) for intensity prediction. The MTL framework yields improved performance for both tasks in both scenarios. In the first model, MTL reports Pearson scores of 0.408 and 0.825 compared with 0.361 and 0.821 in the STL framework for sentiment classification and intensity prediction, respectively. Similarly, the MTL framework reports 3 and 2 points higher Pearson scores in the second model for the two tasks, respectively. These improvements clearly suggest that the MTL framework indeed exploits the inter-relatedness of multiple tasks in
order to enhance the individual performance through a joint model. Further, we observe the improvement of the MTL models to be statistically significant with 95% confidence, i.e. p-value < 0.05 for a paired t-test. On the same dataset, Duppada et al. [5] (the winning system of the SemEval-2018 task on Affect in Tweets [15]) report Pearson scores of 0.836 and 0.873 for sentiment classification and intensity prediction, respectively. The authors in [5] passed the same hand-crafted features individually through XGBoost and Random Forest classifiers/regressors and combined the results of all the classifiers/regressors using a stacking ensemble technique; the results of these models were then passed as input to a meta classifier/regressor, for which an Ordinal Logistic Classifier and a Ridge Regressor were used. In comparison, our proposed system (i.e. MTL in the ML framework) obtains Pearson scores of 0.772 and 0.830 for sentiment classification and intensity prediction, respectively. It should be noted that when we tried to reproduce the work of Duppada et al. [5], we obtained Pearson scores of only 0.776 and 0.829, respectively. Further, our proposed MTL model has lower complexity compared to the state-of-the-art systems: unlike them, we do not require a separate system for each task; rather, an end-to-end single model addresses both tasks simultaneously.
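For reference, the paired t-test mentioned above can be run as follows; the score arrays are illustrative placeholders, not the authors' actual per-run results.

```python
# Illustration of the paired t-test; the score arrays are placeholder values.
from scipy import stats

stl_scores = [0.815, 0.820, 0.824, 0.819, 0.822]   # e.g. per-run Pearson, STL
mtl_scores = [0.823, 0.826, 0.829, 0.824, 0.827]   # e.g. per-run Pearson, MTL

t_stat, p_value = stats.ttest_rel(mtl_scores, stl_scores)
significant = p_value < 0.05                        # 95% confidence level
```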
4.4 Error Analysis
In Fig. 3, we present the confusion matrices for both models (the first and second, based on the DL and ML paradigms, respectively). It is evident from the confusion matrices that most of the misclassifications are within close proximity of the actual labels, and our systems only occasionally confuse the 'positive' and 'negative' polarities (i.e. only 43 and 22 misclassifications for the first, DL-based, model and the second, ML-based, model, respectively).
(a) Multi-task:DL
(b) Multi-task:ML
Fig. 3. Confusion matrices for sentiment classification.
We also perform an error analysis on the obtained results. A few frequently occurring error cases are presented below:
– Metaphoric expressions: The presence of metaphoric/ironic/sarcastic expressions in tweets makes correct prediction challenging for the systems.
Table 3. MTL vs STL for sentiment classification and intensity prediction

a) Sentiment Classification
Sentence | Actual | DL: MTL / STL | ML: MTL / STL
Maybe he was partly right. THESE emails might lead to impeachment and 'lock him up' #ironic #ImpeachTrump | Neg-M | Neg-M / Neg-V | Neg-M / Neg-S
#Laughter strengthens #relationships. #Women are more attracted to someone with the ability to make them #laugh. | Pos-M | Pos-M / Pos-S | Pos-M / Pos-S

b) Intensity Prediction
Sentence | Actual | DL: MTL / STL | ML: MTL / STL
I graduated yesterday and already had 8 family members asking what job I've got now #nightmare | 0.55 | 0.57 (+0.02) / 0.51 (-0.04) | 0.59 (+0.04) / 0.64 (+0.09)
@rohandes Lets see how this goes. We falter in SL and this goes downhill. | 0.49 | 0.48 (-0.01) / 0.35 (-0.14) | 0.49 (+0.00) / 0.29 (-0.20)
It's kind of shocking how amazing your rodeo family is when the time comes that you need someone | 0.52 | 0.53 (+0.01) / 0.55 (+0.03) | 0.52 (+0.00) / 0.51 (-0.01)
  • "@user But you have a lot of time for tweeting #ironic". Actual: Neg-M; Prediction: Neu
– Neutralizing effect of opposing words: The presence of opposing phrases in a sentence neutralizes the effect of the actual sentiments.
  • "@user Macron slips up and has a moment of clarity & common sense... now he is a raging racist. Sounds right. Liberal logic". Actual: Neg-M; Prediction: Neu
We further analyze the predictions of our MTL models against the STL models. The analysis suggests that our MTL model indeed improves the predictions of many examples that are misclassified (or have larger error margins) under the STL models. In Table 3, we list a few examples showing the actual labels, the MTL prediction and the STL prediction for both sentiment classification and intensity prediction.
5 Conclusion
In this paper, we have presented a hybrid multi-task learning framework for affective language. We propose a convolutional-GRU network, assisted by a diverse hand-crafted feature set, for learning shared hidden representations for multiple tasks. The learned representation is fed to an SVM/SVR for the predictions. We have evaluated our model on the benchmark dataset of the SemEval-2018 shared task on Affect in Tweets for the two tasks (i.e. sentiment classification and intensity prediction). The evaluation suggests that a single multi-task model obtains improved results compared to separate single-task systems.
Acknowledgement. Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).
References 1. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: LREC, vol. 10, pp. 2200– 2204 (2010) 2. Baziotis, C., et al.: NTUA-SLP at SemEval-2018 task 1: predicting affective content in tweets with deep attentive RNNs and transfer learning. arXiv Preprint arXiv:1804.06658 (2018) 3. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval2017), Vancouver, Canada, pp. 747–754 (2017) 4. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, NY, USA, pp. 160–167 (2008) 5. Duppada, V., Jain, R., Hiray, S.: SeerNet at SemEval-2018 task 1: domain adaptation for affect in tweets. arXiv Preprint arXiv:1804.06137 (2018) 6. Eisner, B., Rockt¨ aschel, T., Augenstein, I., Boˇsnjak, M., Riedel, S.: emoji2vec: learning emoji representations from their description. arXiv Preprint arXiv:1609. 08359 (2016) 7. Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., Lehmann, S.: Using-millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2017) 8. Gee, G., Wang, E.: psyML at SemEval-2018 task 1: transfer learning for sentiment and emotion analysis. In: Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 369–376 (2018) 9. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015) 10. Kim, S.M., Hovy, E.: Determining the sentiment of opinions. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 1367 (2004) 11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv Preprint arXiv:1412.6980 (2014) 12. Kiros, R., et al.: Skip-thought vectors. arXiv Preprint arXiv:1506.06726 (2015) 13. LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. In: The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10 (1995) 14. Meisheri, H., Dey, L.: TCS research at SemEval-2018 task 1: learning robust representations using multi-attention architecture. In: Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 291–299 (2018) 15. Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval-2018 task 1: affect in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 1–17 (2018) 16. Mohammad, S.M., Bravo-Marquez, F.: Emotion intensities in tweets. arXiv Preprint arXiv:1708.03696 (2017)
17. Mohammad, S.M., Bravo-Marquez, F.: WASSA-2017 shared task on emotion intensity. arXiv Preprint arXiv:1708.03700 (2017) 18. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-2010), pp. 807–814 (2010) 19. Nakov, P., Ritter, A., Rosenthal, S., Sebastiani, F., Stoyanov, V.: SemEval-2016 task 4: sentiment analysis in Twitter. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 1–18 (2016) 20. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-2002 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 79–86 (2002) 21. Park, J.H., Xu, P., Fung, P.: PlusEmo2Vec at SemEval-2018 task 1: exploiting emotion knowledge from emoji and hashtags. arXiv Preprint arXiv:1804.08280 (2018) 22. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) 23. Radford, A., Jozefowicz, R., Sutskever, I.: Learning to generate reviews and discovering sentiment. arXiv Preprint arXiv:1704.01444 (2017) 24. Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., Pande, V.: Massively multitask networks for drug discovery. arXiv Preprint arXiv:1502.02072 (2015) 25. Rosenthal, S., Farra, N., Nakov, P.: SemEval-2017 task 4: sentiment analysis in Twitter. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 502–518, August 2017 26. Rozental, A., Fleischer, D.: Amobee at SemEval-2018 task 1: GRU neural network with a CNN attention mechanism for sentiment classification. arXiv Preprint arXiv:1804.04380 (2018) 27. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv Preprint arXiv:1706.05098 (2017) 28. Schuster, M., Paliwal, K.: Bidirectional recurrent neural networks. Trans. Sig. Proc. 45(11), 2673–2681 (1997) 29. Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013) 30. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014) 31. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999). https://doi.org/10.1023/A: 1018628609742 32. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength detection in short informal text. J. Am. Soc. Inform. Sci. Technol. 61(12), 2544– 2558 (2010)
Sentiment Analysis and Sentence Classification in Long Book-Search Queries

Amal Htait, Sébastien Fournier, and Patrice Bellot

Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
{amal.htait,sebastien.fournier,patrice.bellot}@lis-lab.fr
Aix Marseille Univ, Avignon Université, CNRS, EHESS, OpenEdition Center, Marseille, France
{amal.htait,sebastien.fournier,patrice.bellot}@openedition.org

Abstract. Handling long queries can involve either reducing their size by retaining only useful sentences, or decomposing the long query into several short queries based on their content. Proper sentence classification improves the utility of these procedures. Can sentiment analysis play a role in sentence classification? This paper analyses the correlation between sentiment analysis and sentence classification in long book-search queries. It also studies the similarity in writing style between book reviews and sentences in book-search queries. To accomplish this study, a semi-supervised method for sentiment intensity prediction and a language model based on book reviews are presented, in addition to graphical illustrations reflecting the findings of this study, followed by interpretations and conclusions.

Keywords: Sentiment intensity · Language model · Search queries · Books · Word embedding · Seed-words · Book reviews

1 Introduction
The book search field is a subsection of the data search domain, with a recommendation aspect. Users seek book suggestions and recommendations by means of a request in natural language text form, called a user query. One of the main characteristics of queries in book search is their length. The user query is often long, descriptive, and even narrative. Users express their need for a book, opinions toward certain books, descriptions of content or events in a book, and sometimes even share personal information (e.g., I am a teacher). Being able to differentiate the types of sentences in a query can help in many tasks. Detecting non-useful sentences in the query (e.g., Thanks for any and all help.) can help in query reduction, and classifying sentences by the type of information within them can be used for adapted search. For example, sentences describing a good reading experience, with a book title, can be oriented to a book
similarity search, but sentences expressing certain topic preferences should lead to a topic search. Likewise, sentences including personal information can be used for personalised search. In this work, sentence classification is studied on two levels: the usefulness of a sentence towards the search, and the type of information provided by a useful sentence. Three types of information are highlighted: book titles and author names (e.g., I read "Peter the Great His Life and World" by Robert K. Massie.), personal information (e.g., I live in a very conservative area), and narration of book content or story (e.g., The story opens on Elyse overseeing the wedding preparation of her female cousin). "Different types of sentences express sentiment in very different ways" [4]; therefore, the correlation between the sentiment in a sentence and its type is studied. For this task, sentiment intensity is predicted using a semi-supervised method, explained in Sect. 4.

In addition, sentences in a query can share a similar writing style and subject with book reviews. Below is part of a long book-search query:

I just got engaged about a week and a half ago and I'm looking for recommendations on books about marriage. I've already read a couple of books on marriage that were interesting. Marriage A History talks about how marriage went from being all about property and obedience to being about love and how the divorce rate reflects this. The Other Woman: Twenty-one Wives, Lovers, and Others Talk Openly About Sex, Deception, Love, and Betrayal not the most positive book to read but definitely interesting. Dupont Circle A Novel I came across at Kramerbooks in DC and picked it up. The book focuses on three different couples including one gay couple and the laws issues regarding gay marriage ...

In the example query, the parts in bold present descriptions of specific books' content with book titles, e.g. "Marriage A History", and interpretations or personal points of view with expressions like "not the most positive book ... but definitely interesting". These sentences read like book-review sentences. Therefore, finding similarities between certain sentences in a query and book reviews can be an important feature for sentence classification. To calculate that similarity in a general form, a statistical language model of reviews is used to find, for each sentence in the query, its probability of being generated from that model (and therefore its similarity to that model's training dataset of reviews).

This work covers an analysis of the correlation of a sentence's type with its sentiment intensity and its similarity to reviews, and the paper is organised as follows:
– Presenting the user queries used for this work.
– Extracting the sentiment intensity of each sentence in the queries.
– Creating a statistical language model based on reviews, and calculating the probability of each sentence being generated from the model.
– Analysing the relation between language model scores, sentiment intensity scores and the type of sentences.
2 Related Work
For the purpose of query classification, many machine learning techniques have been applied, including supervised [9], unsupervised [6] and semi-supervised learning [2]. In the book-search field, fewer studies have covered query classification. Ollagnier et al. [10] worked on a supervised machine learning method (Support Vector Machines) for classifying queries into the following classes: oriented (a search on a certain subject with orienting terms), non-oriented (a search on a theme in general), specific (a search for a specific book with an unknown title), and non-comparable (when the search does not belong to any of the previous classes). Their work was based on 300 annotated queries from INEX SBS 2014 (https://inex.mmci.uni-saarland.de/data/documentcollection.html). But the mentioned work, and many others, addressed the classification of whole queries and not of the sentences within a query. The length of book-search queries creates new obstacles, and the most difficult one is the variety of information in their long content, which requires classification at the sentence level.

Sentences, in general, reveal sentiment in different ways depending on their type; therefore, Chen et al. [4] focused on using classified sentences to improve sentiment analysis with deep machine learning. In this work, the possibility of the opposite perspective is studied, which is the improvement of sentence classification using sentiment analysis. In addition, this work studies the improvement of sentence classification using a language model technique. Language models (LM) have been successfully applied to text classification. In [1], models were created using annotated training datasets and then used to compute the likelihood of generating the test sentences. In this work, a model is created based on book reviews and used to compute the likelihood of generating query sentences, as a measurement of similarity between the style of book reviews and the sentences of book-search queries.
3 User Queries
The dataset of user queries used in this work is provided by the CLEF Social Book Search Lab – Suggestion Track (http://social-book-search.humanities.uva.nl/#/suggestion). The track provides realistic search requests (also known as user queries), collected from LibraryThing (https://www.librarything.com/). Out of 680 user queries from the 2014 dataset of the Social Book Search Lab, 43 queries are randomly selected based on their length. These 43 queries have more than 55 words, stop-words excluded. Then, each query is segmented into sentences, which results in a total of 528 sentences. These sentences are annotated based on their usefulness towards the search, and on the information provided, as: book titles and author names, personal information, and narration of book content; an example is shown in the XML extraction in Fig. 1.
Fig. 1. An example of annotated sentences from user queries.
4 Sentiment Intensity
As part of this work, sentiment intensity is calculated for each sentence of the queries. The following method is inspired by a semi-supervised method for sentiment intensity prediction in tweets, which was built on the concepts of adapted seed-words and word embeddings [8]. Note that seed-words are words with strong semantic orientation, chosen for their lack of sensitivity to the context. They are used as paradigms of positive and negative semantic orientation, and adapted seed-words are seed-words with the characteristic of being used in a certain context or subject. Word embedding is a method to represent words as high-quality learned vectors, from large amounts of unstructured and unlabelled text data, to predict neighbouring words.

In the work of Htait et al. [8], the extracted seed-words were adapted to micro-blogs. For example, the word cool is an adjective that refers to a moderately low temperature and has no strong sentiment orientation, but it is often used in micro-blogs as an expression of admiration or approval. Therefore, cool is considered a positive seed-word in micro-blogs. In this paper, book search is the targeted domain for sentiment intensity prediction; therefore, the extracted seed-words are adapted to the book-search domain and, more specifically, extracted from book reviews, since reviews have the richest vocabulary in the book-search domain. Using the book reviews annotated as positive and negative by Blitzer et al.4 [3], the list of most common words in every annotation class is collected. Then, after removing the stop words, the 70 words most relevant to the book domain, with strong sentiment, are selected manually from each previously described list, as positive and negative seed-words. Examples of positive seed-words adapted to book search are insightful, inspirational and masterpiece, and examples of negative seed-words are endless, waste and silly.
4 Book reviews from the Multi-Domain Sentiment Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html.
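To make the frequency-based pre-selection step concrete, a minimal sketch is given below. It is not the authors' code: the review lists and the stop-word set are placeholder examples, and the final 70 seed-words per class are still chosen manually from such frequency lists.

```python
from collections import Counter

# Placeholder inputs: tokenised, lowercased reviews per polarity class.
positive_reviews = [["an", "insightful", "masterpiece"], ["inspirational", "story"]]
negative_reviews = [["what", "a", "waste"], ["endless", "and", "silly"]]
stopwords = {"a", "an", "and", "the", "what", "of"}

def candidate_seeds(reviews, top_n=500):
    """Most common non-stop words of one polarity class, as seed-word candidates."""
    counts = Counter(w for review in reviews for w in review if w not in stopwords)
    return [w for w, _ in counts.most_common(top_n)]

# The 70 adapted seed-words per class are then picked manually from these lists.
print(candidate_seeds(positive_reviews)[:10])
print(candidate_seeds(negative_reviews)[:10])
```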
Word embeddings, or distributed representations of words in a vector space, are capable of capturing lexical, semantic, syntactic, and contextual similarity between words. To determine the similarity between two words, the cosine similarity between the vectors of these two words in the word embedding model is used. In this paper, a word embedding model is created based on more than 22 million Amazon book reviews [7] as training dataset, after applying pre-processing to the corpora to improve their usefulness (e.g. tokenization, replacing hyperlinks and emoticons, removing some characters and punctuation). To learn word embeddings from the previously prepared corpora (which are raw text), Word2Vec is used with the Skip-Gram training strategy (in which the model is given a word and attempts to predict its neighbouring words). To train the word embeddings and create the models, the Gensim framework for Python (https://radimrehurek.com/gensim/index.html) is used. As for the parameters, the models are trained with word representations of dimensionality 400, a context window of one, and negative sampling for five iterations (k = 5). As a result, a model is created with a vocabulary size of more than 2.5 million words.

Then, for each word in a sentence, the difference between its average cosine similarity with the positive seed-words and its average cosine similarity with the negative seed-words represents its sentiment intensity score, using the previously created model. For example, the word confusing has an average cosine similarity with the positive seed-words equal to 0.203 and an average cosine similarity with the negative seed-words equal to 0.322, which makes its sentiment intensity score equal to −0.119 (a negative score represents a negative feeling); for the word young, the sentiment intensity score equals 0.012. To predict the sentiment intensity of an entire sentence, first the adjectives, nouns and verbs are selected from the sentence using the Stanford POS tagger [12]; then the ones with high sentiment intensity are used, by adding up their scores, to obtain a total score for the sentence. Note that the tool created to calculate the sentiment intensity of words, the Adapted Sentiment Intensity Detector (ASID), is shared by this work's researchers as open source (https://github.com/amalhtait/ASID).
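The following sketch illustrates this training and scoring procedure under stated assumptions: it uses the gensim 4.x Word2Vec API, a tiny placeholder corpus standing in for the Amazon reviews, and small illustrative subsets of the seed-word lists. It is not the released ASID implementation.

```python
from gensim.models import Word2Vec

# Placeholder corpus standing in for the ~22M tokenised Amazon book reviews.
sentences = [
    ["an", "insightful", "and", "inspirational", "masterpiece"],
    ["a", "silly", "and", "endless", "waste", "of", "paper"],
    ["a", "confusing", "and", "endless", "story"],
    ["an", "insightful", "read", "about", "young", "people"],
] * 100

# Skip-gram model with the parameters reported above: 400 dimensions,
# context window of 1, negative sampling (k = 5); epochs=5 is an assumption.
model = Word2Vec(sentences, vector_size=400, window=1, sg=1,
                 negative=5, epochs=5, min_count=1, workers=2)

# Illustrative subsets of the 70 positive and 70 negative adapted seed-words
# (assumed to occur in the training corpus).
pos_seeds = ["insightful", "inspirational", "masterpiece"]
neg_seeds = ["endless", "waste", "silly"]

def word_intensity(word):
    """Average cosine similarity to the positive seeds minus the average
    cosine similarity to the negative seeds (negative score = negative feeling)."""
    if word not in model.wv:
        return 0.0
    pos = sum(model.wv.similarity(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(model.wv.similarity(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

def sentence_intensity(content_words):
    """Sum the scores of the adjectives, nouns and verbs selected by a POS tagger."""
    return sum(word_intensity(w) for w in content_words)

print(word_intensity("confusing"), sentence_intensity(["confusing", "story"]))
```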
5 Reviews Language Model
The book reviews are considered a reference for detecting sentence characteristics, since a similarity in style is noticed between certain sentences of user queries and the reviews. To calculate this similarity in writing style, a statistical language modelling approach is used to compute the likelihood of generating a sentence of a query from a book-reviews language model.

Statistical language modelling of this kind was introduced by Collins in [5]; it is the science of building models that estimate the prior probabilities of word strings [11]. The model can be presented as θ_R = P(w_i|R) with i ∈ [1, |V|], where P(w_i|R) is the probability of the word w_i in the reviews corpus R, and |V| is the size of the vocabulary. This model is used to denote the probability of a word according to the distribution as P(w_i|θ_R) [13]. The probability of a sentence W being generated from the book-reviews language model θ_R is defined as the conditional probability P(W|θ_R) [13], which is calculated as follows:

P(W|θ_R) = ∏_{i=1}^{m} P(w_i|θ_R)    (1)
where W is a sentence, w_i is a word in the sentence W, and θ_R represents the book-reviews model.

The SRILM toolkit [11] (http://www.speech.sri.com/projects/srilm/) is used to create the model from the book-reviews dataset (as training data) and to compute the probability of the sentences in the queries being generated from the model (as test data). The language model is created as a standard trigram language model with Good-Turing (Katz) discounting for smoothing, based on 22 million Amazon book reviews [7] as training dataset. SRILM provides details in its diagnostic output, such as the number of words in the sentence, the likelihood of the sentence under the model (the log-likelihood log P(W|θ_R)), and the perplexity, which is the inverse probability of the sentence normalised by the number of words. In this paper, the length of the sentences varies from one word to almost 100 words; therefore, the perplexity score seems more reliable for a comparison between sentences. Note that minimising perplexity is equivalent to maximising likelihood, and a low perplexity indicates that the probability distribution is good at predicting the sample.
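The computation behind Eq. (1) and the perplexity score can be illustrated with the simplified sketch below. It uses a unigram model with add-one smoothing on placeholder data instead of the trigram, Good-Turing-smoothed SRILM model actually used.

```python
import math
from collections import Counter

# Placeholder training corpus of tokenised book-review sentences.
reviews = [["a", "great", "and", "insightful", "read"],
           ["a", "waste", "of", "time"]]

# Unigram estimate of P(w|theta_R) with add-one smoothing, so that unseen
# words do not make the product in Eq. (1) collapse to zero.
counts = Counter(w for sent in reviews for w in sent)
total = sum(counts.values())
vocab = len(counts)

def p_word(w):
    return (counts[w] + 1) / (total + vocab + 1)

def log_prob(sentence):
    # log P(W|theta_R) = sum_i log P(w_i|theta_R), cf. Eq. (1)
    return sum(math.log(p_word(w)) for w in sentence)

def perplexity(sentence):
    # inverse probability normalised by the number of words
    return math.exp(-log_prob(sentence) / len(sentence))

query_sentence = ["a", "great", "story"]
print(log_prob(query_sentence), perplexity(query_sentence))
```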
6 Analysing Scores
As previously explained in Sect. 3, a corpus of 528 sentences from user queries is created and annotated as in the examples in Fig. 1. Then, for each sentence, the sentiment intensity score and the perplexity score are calculated following the methods previously explained in Sects. 4 and 5. To present the scores, violin plots are used for their ability to show the probability density of the data at different values. They also include a marker (white dot) for the median of the data and a box (black rectangle) indicating the interquartile range.
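A minimal sketch of such a plot is shown below, assuming a pandas DataFrame with hypothetical column names and toy values; seaborn's default violin plot already draws the white median dot and the black interquartile box described above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical table of annotated query sentences and their scores.
df = pd.DataFrame({
    "useful":     [True, False, True, False, True],
    "sentiment":  [0.31, 0.02, -0.18, 0.01, 0.44],
    "perplexity": [210.0, 540.0, 180.0, 820.0, 260.0],
})

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sns.violinplot(data=df, x="useful", y="sentiment", ax=axes[0])
sns.violinplot(data=df, x="useful", y="perplexity", ax=axes[1])
plt.tight_layout()
plt.show()
```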
6.1 Sentiment Intensity, Perplexity and Usefulness Correlation
The graph in Fig. 2 shows the distribution (or probability density) of sentiment intensity for two categories of sentences: on the right, the sentences which are useful to the search, and on the left, the sentences which are not useful to the search. The shape on the left is horizontally stretched compared to the right one, and mostly dilated over the area of neutral sentiment intensity (sentiment score = 0), where the median of the data also lies. On the other hand, the shape on the right is vertically stretched, showing the diversity of sentiment intensity in the sentences useful to the search, but concentrated mostly in the positive area, at sentiment scores higher than zero but lower than 0.5.
Fig. 2. The distribution of sentiment intensity between two categories of sentences: on the right the sentences which are useful to the search and on the left the sentences which are not useful to the search.
The graph in Fig. 3 represents the distribution of perplexity for two categories of sentences: on the right, the sentences which are useful to the search, and on the left, the sentences which are not useful to the search. Both shapes are vertically compressed and dilated over the area of low perplexity. However, the graph on the right, of the useful sentences, shows the median of the data at a lower level of perplexity than the left graph, which is explained by the slight horizontal dilation of the left graph above the median level.

Fig. 3. The distribution of perplexity between two categories of sentences: on the right the sentences which are useful to the search and on the left the sentences which are not useful to the search.
6.2 Sentiment Intensity, Perplexity and Information Type Correlation
The graphs in Fig. 4 show the distribution of sentiment between the informational sentence categories, consecutively from top to bottom:
– Book titles and author names: on the right, the sentences with book titles or author names, and on the left, the sentences without book titles or author names. The graph on the right shows a high distribution of positive sentiment, while the left graph shows a high concentration on neutral sentiment with a small distribution of positive and negative sentiment. The lack of negative sentiment in sentences with book titles or author names is also noticeable.
– Personal information: on the right, the sentences containing personal information about the user, and on the left, the sentences without personal information. The graph on the right shows a high concentration on neutral sentiment, where the median of the data also lies, and then a smaller distribution of positive sentiment. On the left, the graph shows a lower concentration on neutral sentiment, but the existence of sentences with extremely high positivity is noticeable.
– Narration of book content: on the right, the sentences containing book content or events, and on the left, the sentences without book content. Both graphs are vertically stretched but have different shapes. The graph on the right shows a higher distribution of negative sentiment for sentences with book content, and the graph on the left shows higher positive values.

Fig. 4. The distribution of sentiment between the informational categories of sentences: book titles or author names, personal information and narration of book content.

The graphs in Fig. 5 show the distribution of perplexity between the informational sentence categories, consecutively from top to bottom: book titles and author names, personal information, and narration of book content. When comparing the first set of graphs, of book titles and author names, the left graph has its median of the data at a lower perplexity level than the right graph, with a higher concentration of data in a tighter interval of perplexity. For the second set of graphs, of personal information, the right graph shows a lower interquartile range than the left graph. As for the third set of graphs, of book content, only a slight difference can be detected between the two graphs, where the left graph is more stretched vertically.

Fig. 5. The distribution of perplexity between the informational categories of sentences: book titles or author names, personal information and narration of book content.
6.3 Graphs Interpretation
Observing the distribution of the data in the graphs of the previous sections, several conclusions can be drawn:
– In Fig. 2, it is clear that useful sentences tend to have a high level of emotion (positive or negative), whereas non-useful sentences are more likely to be neutral.
– Figure 3 shows that sentences with high perplexity, which means they are not similar to book-review sentences, have a higher probability of being non-useful sentences than useful ones.
– Figure 4 gives an idea of the correlation of sentiment with the information in sentences: sentences with book titles or author names have a high level of positive emotion, whereas sentences with personal information tend to be neutral, and sentences with book content narration are distributed over the area of moderate emotional level, with a higher probability of positive than negative sentiment.
– Figure 5 gives an idea of the correlation of review-style similarity with the information in sentences: sentences with no book titles are more similar to reviews than the ones with book titles. Also, sentences with personal information tend to be similar to reviews, and sentences with book content narration show slightly more similarity with the style of review sentences than the sentences with no book content narration.
7 Conclusion and Future Work
This paper analyses the relation between sentiment intensity, similarity to reviews, and sentence type in long book-search queries: first, by presenting the user queries and book collections; then by extracting the sentiment intensity of each sentence of the queries (using the Adapted Sentiment Intensity Detector (ASID)); then by creating a statistical language model based on reviews and calculating the probability of each sentence being generated from that model; and finally by presenting, in graphs, the relation between the sentiment intensity score, the language model score, and the type of sentences.

The graphs show that sentiment intensity can be an important feature to classify sentences based on their usefulness to the search, since non-useful sentences are more likely to be neutral in sentiment than useful sentences. The graphs also show that sentiment intensity can be an important feature to classify sentences based on the information within. It is clear in the graphs that sentences containing book titles are richer in sentiment, and mostly positive, compared to sentences not containing book titles. In addition, the graphs show that sentences with personal information tend to be neutral, with a higher probability than those with no personal information. On the other hand, the graphs show that the similarity of sentences to the style of reviews can also be a feature to classify sentences by usefulness and by their information content, but at a slightly lower level of importance than sentiment analysis. The similarity between sentences and the style of book reviews is higher for useful sentences, for sentences with personal information and for sentences with narration of book content, but not for sentences containing book titles.

The previous analysis and conclusions give a preview of the effect of sentiment analysis and similarity to reviews in the sentence classification of long book-search queries. The next task would be to test these conclusions by using sentiment analysis and similarity to reviews as new features in a supervised machine learning classification of sentences in long book-search queries.
Acknowledgement. This work has been supported by the French State, managed by the National Research Agency under the "Investissements d'avenir" program under the EquipEx DILOH projects (ANR-11-EQPX-0013).
References

1. Bai, J., Nie, J.Y., Paradis, F.: Using language models for text classification. In: Proceedings of the Asia Information Retrieval Symposium, Beijing, China (2004)
2. Beitzel, S.M., Jensen, E.C., Frieder, O., Lewis, D.D., Chowdhury, A., Kolcz, A.: Improving automatic query classification via semi-supervised learning. In: Fifth IEEE International Conference on Data Mining (ICDM 2005), pp. 8-pp. IEEE (2005)
3. Blitzer, J., Dredze, M., Pereira, F.: Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447 (2007)
4. Chen, T., Xu, R., He, Y., Wang, X.: Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Syst. Appl. 72, 221–230 (2017)
5. Collins, M.: Three generative, lexicalised models for statistical parsing. arXiv preprint cmp-lg/9706022 (1997)
6. Diemert, E., Vandelle, G.: Unsupervised query categorization using automatically-built concept graphs. In: Proceedings of the 18th International Conference on World Wide Web, pp. 461–470 (2009)
7. He, R., McAuley, J.: Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In: Proceedings of the 25th International Conference on World Wide Web, pp. 507–517 (2016)
8. Htait, A., Fournier, S., Bellot, P.: LSIS at SemEval-2017 task 4: using adapted sentiment similarity seed words for English and Arabic tweet polarity classification. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 718–722 (2017)
9. Kang, I.H., Kim, G.: Query type classification for web document retrieval. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 64–71 (2003)
10. Ollagnier, A., Fournier, S., Bellot, P.: Analyse en dépendance et classification de requêtes en langue naturelle, application à la recommandation de livres. Traitement Automatique des Langues 56(3) (2015)
11. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing (2002)
12. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 252–259 (2003)
13. Zhai, C.: Statistical language models for information retrieval. In: Synthesis Lectures on Human Language Technologies, vol. 1, no. 1, pp. 1–141 (2008)
Comparative Analyses of Multilingual Sentiment Analysis Systems for News and Social Media

Pavel Přibáň1,2(B) and Alexandra Balahur1

1 European Commission Joint Research Centre, Via E. Fermi 2749, 21027 Ispra, VA, Italy
[email protected], [email protected]
2 Faculty of Applied Sciences, Department of Computer Science and Engineering, University of West Bohemia, Univerzitni 8, 301 00 Pilsen, Czech Republic
Abstract. In this paper, we present an evaluation of three in-house sentiment analysis (SA) systems, originally designed for three distinct SA tasks, in a highly multilingual setting. For the evaluation, we collected a large number of available gold-standard datasets in different languages and of varied text types. The aim of using datasets from different domains was to obtain a clear snapshot of the overall performance of the systems and thus a better-quality evaluation. We compare the results obtained with those of the best-performing systems evaluated on these datasets and perform an in-depth error analysis. Based on the results, we can see that some systems perform better on datasets and tasks different from the ones they were designed for, showing that we could replace one system with another and gain an improvement in performance. Our results are hardly comparable with the original dataset results, because the datasets often contain a different number of polarity classes than we used, and for some datasets there are not even basic results to compare with. For the cases in which a comparison was possible, our results show that our systems perform very well in view of multilinguality.
Keywords: Sentiment analysis · Multilinguality · Evaluation

1 Introduction
Recent years have seen a growing interest in the task of Sentiment Analysis (SA). In spite of these efforts, however, real applications of sentiment analysis are still challenged by a series of aspects, such as multilinguality and domain dependence. Sentiment analysis can be divided into different sub-tasks, such as aspect-based SA, polarity or fine-grained SA, and entity-centered SA. SA can also be applied at many different levels of scope – document level, or sentence or phrase level. Performing sentiment analysis in a multilingual setting is even more challenging, as most available datasets are annotated for English texts and low-resourced languages
suffer from a lack of annotated datasets on which machine learning models can be trained.

In this paper, we describe an evaluation of our three in-house SA systems, designed for three distinct SA tasks, in a highly multilingual setting. These systems process a tremendous amount of text every day, and therefore it is essential to know their quality and to be able to evaluate these applications correctly. At present, these systems cannot be sufficiently evaluated. Due to the lack of a correct evaluation, we decided to prepare appropriate resources and tools for the evaluation, assess these applications, and summarize the obtained results. We collect and describe a rich collection of publicly available datasets for sentiment analysis, and we present the performance of the individual systems on the collected datasets. We also carry out additional experiments with the datasets, and we show that for news articles the classification performance increases when the title of the news article is added to the body text.
1.1 Tasks Description
The evaluated systems are intended to solve three sentiment-related tasks – the Twitter Sentiment Analysis (TSA) task, the Tonality in News (TON) task and the Targeted Sentiment Analysis (ESA) task, which can also be called Entity-Centered Sentiment Analysis. In the Twitter Sentiment Analysis and Tonality tasks, the systems have to assign a polarity which determines the overall sentiment of a given tweet or news article (generally speaking, text). The Targeted Sentiment Analysis (ESA) task is the task of sentiment polarity classification towards an entity mention in a given text. For all the mentioned tasks, the sentiment polarity can be one of the positive, negative or neutral labels, or a number from −100 to 100, where a negative value indicates negative sentiment, a positive value indicates positive sentiment and zero (or values close to zero) means neutral sentiment. In our evaluation experiments, we used the 3-point scale (positive, negative, neutral).
1.2 Systems Overview
The TwitOMedia system [4] for the TSA task uses a hybrid approach, which employs supervised learning with Support Vector Machines trained with Sequential Minimal Optimization [32] on unigram and bigram features.

The EMMTonality system for the TON task counts occurrences of language-specific sentiment terms from our in-house language-specific dictionaries. Each sentiment term has a sentiment value assigned. The system sums up the values of all words (which are present in the mentioned dictionary) in a given text. The resulting number is normalized and scaled to a range from −100 to 100, where a negative value indicates negative tonality, a positive value indicates positive tonality and neutral tonality is expressed by zero. The EMMTonality system also contains a module for the ESA task which computes the sentiment towards an entity in a given text. This approach is the same as for
the tonality in news articles, with the difference that only a certain number of words surrounding the entity are used to compute the sentiment value towards the entity. The EMMSenti system is intended to solve only the ESA task. This system uses a similar approach to the EMMTonality system; see [38] for a detailed description.
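A simplified sketch of this kind of dictionary-based scoring is given below. The dictionary entries, the window size and the normalisation to [−100, 100] are illustrative assumptions, since the in-house dictionaries and the exact scaling are not described here.

```python
# Placeholder sentiment dictionary: term -> sentiment value.
SENTIMENT_DICT = {"crisis": -2.0, "growth": 1.5, "fraud": -3.0, "success": 2.0}

def tonality(tokens):
    """Sum the dictionary values over all tokens and scale to [-100, 100].
    The scaling below (dividing by the number of tokens) is an assumption."""
    raw = sum(SENTIMENT_DICT.get(t.lower(), 0.0) for t in tokens)
    score = 100.0 * raw / max(len(tokens), 1)
    return max(-100.0, min(100.0, score))

def entity_tonality(tokens, entity_index, window=6):
    """ESA variant: only words in a window around the entity mention are used."""
    left = max(0, entity_index - window)
    right = min(len(tokens), entity_index + window + 1)
    return tonality(tokens[left:right])

tokens = "The company reported strong growth despite the fraud scandal".split()
print(tonality(tokens), entity_tonality(tokens, tokens.index("company")))
```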
2 Related Work
In [35], the authors summarize eight publicly available datasets for Twitter sentiment analysis and give an overview of the existing evaluation datasets and their characteristics. Another comparison of available methods for sentiment analysis is given in [15]. The authors describe four different approaches (machine learning, lexicon-based, statistical and rule-based) and distinguish between three different levels of scope of sentiment analysis, i.e. document level, sentence level and word/phrase/sub-sentence level.

In recent years, most of the state-of-the-art systems and approaches for sentiment analysis have used neural networks and deep learning techniques. The Convolutional Neural Network (CNN) [24] and Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) [21] or the Gated Recurrent Unit (GRU) [12] have become very popular. In [22], a CNN architecture was used for sentiment analysis and question answering. One proof of the success of neural networks is that most of the top teams [8,14,18] in sentiment analysis (or tasks related to sentiment analysis) in the recent SemEval [28,34] and WASSA [23,27] competitions used deep learning techniques. In [41], a comprehensive survey of current applications of sentiment analysis is presented. [5] compare several models on six different benchmark datasets, which belong to different domains and additionally have different levels of granularity. They showed that LSTM-based neural networks are particularly good at fine-grained sentiment tasks. In [39], the authors introduced sentiment-specific word embeddings (SSWE) for Twitter sentiment classification, which encode sentiment information in the continuous representation of words.

The majority of sentiment analysis research mainly focuses on monolingual methods, especially in English, but some effort is being made on multilingual approaches as well. [2] propose an approach to obtain training data for French, German and Spanish using three distinct Machine Translation (MT) systems. They translated English data into the three languages, and then evaluated the performance for sentiment analysis after using the three MT systems. They showed that the gap in classification performance between systems trained on English and on translated data is minimal, and they claim that MT systems are mature enough to be reliably employed to obtain training data for languages other than English, and that sentiment analysis systems can obtain performances comparable to the ones obtained for English. In [3], they extended the work from [2] and showed that tf-idf weighting with unigram features has a positive impact on the results. In [11], the authors study the possibility of using an English model for sentiment analysis in Russian, Spanish, Turkish and Dutch, languages where the
annotated data are more limited. They propose a multilingual approach where a single RNN model is built in the language for which the largest sentiment analysis resources are available. They then use MT to translate the test data into English and finally use the model to classify the translated data. The paper [16] provides a review of multilingual sentiment analysis. They compare their implementation of existing approaches on common data. The precision observed in their experiments is typically lower than the one reported by the original authors, which could be caused by the lack of detail in the original presentation of those approaches. In [42], bilingual sentiment word embeddings were created, based on the idea of encoding sentiment information into semantic word vectors. A related multilingual approach for sentiment analysis of low-resource languages is presented in [6]. They introduced Bilingual Sentiment Embeddings (BLSE), which are jointly optimized to represent (a) semantic information in the source and target languages, which are bound to each other through a small bilingual dictionary, and (b) sentiment information, which is annotated in the source language only. In [7], the authors extend the approach from [6] to domain adaptation for sentiment analysis. Their model takes as input two mono-domain embedding spaces and learns to project them to a bi-domain space, which is jointly optimized to project across domains and to predict sentiment.

From the previous review, we can deduce that the current state-of-the-art approaches for sentiment analysis in English are solely based on neural networks and deep learning techniques. Deep learning techniques usually require more data than the "traditional" machine learning approaches (Support Vector Machines, Logistic Regression), and it is evident that they will be used for rich-resource languages (English). On the other hand, much less effort has been invested in multilingual approaches and low-resource languages compared to English. The first studies on multilingual approaches mostly relied on machine translation systems, but in recent years neural networks along with deep learning techniques have been employed as well. Another common idea of multilingual approaches in SA is that researchers are trying to find a way to create a model based on data from a rich-resource language and transfer the knowledge in such a way that it is possible to use the model for other languages.
3 Datasets
In this section, we describe the datasets we collected for the evaluation. The assessed applications require different types of datasets, or at least different domains, to carry out a proper evaluation. We collected mostly publicly available datasets, but we also used our in-house non-public datasets. The polarity labels for all collected Twitter and news datasets are positive, neutral or negative. If the original dataset contained other polarity labels than the three mentioned, we either discarded them or mapped them to the positive, neutral or negative polarity labels.
Sentiment analysis of tweets is a prevalent problem, and much effort has been put into solving this problem and related problems in recent years [19,20,23,27,29,30,34]. Therefore, datasets for this task are easier to find. On the other hand, finding datasets for the ESA task is much more challenging because less research effort has been put into this task and thus there are fewer existing resources. For sentiment analysis in news articles, we were not able to find a proper public dataset for the English language, and therefore we used our in-house datasets. For some languages, publicly available corpora exist, such as Slovenian [10], German [25], Brazilian Portuguese [1], Ukrainian and Russian [9].
3.1 Twitter Datasets
In this subsection, we present the sentiment datasets for the Twitter domain. We collected 2.8M labelled tweets in total from several datasets; see Table 1 for detailed statistics. Next, we briefly describe each of these datasets.

Table 1. Twitter datasets statistics.

Dataset                 Total      Positive   Negative  Neutral
Sentiment140 Test       498        182        177       139
Sentiment140 Train      1 600 000  800 000    800 000   –
Health Care Reform      2394       543        1381      470
Obama-McCain Debate     1904       709        1195      –
Sanders                 3424       519        572       2333
T4SA                    1 179 957  371 341    179 050   629 566
SemEval 2017 Train      52 806     20 555     8430      23 821
SemEval 2017 Test       12 284     2375       3972      5937
InHouse Tweets Test     3813       1572       601       1640
InHouse Tweets Train    4569       2446       955       1168
Total                   2 861 649  1 200 242  996 333   665 074
The Sentiment140 [19] dataset consists of two parts – training and testing. The training part includes 800k positive and 800k negative automatically labelled tweets. The authors of this dataset collected tweets containing certain emoticons and assigned to every tweet a label based on the emoticon. For example, :) and :-) both express positive emotion, and thus tweets containing these emoticons were labelled as positive. The testing part of this dataset is composed of 498 manually annotated tweets (177 negative, 139 neutral and 182 positive). A detailed description of this approach is given in [19].

The authors of [37] created the Health Care Reform dataset based on tweets about the health care reform in the USA. They extracted tweets containing the health care reform hashtag "#hcr" from early 2010. This dataset contains 543 positive, 1381 negative and 470 neutral examples.
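The distant-supervision idea behind the Sentiment140 training part can be sketched as follows; the emoticon lists are abbreviated and the exact rules of [19] (for example, the handling of tweets containing emoticons of both polarities) are assumptions.

```python
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", "=("}

def distant_label(tweet):
    """Label a tweet by the emoticons it contains and strip them,
    in the spirit of how the Sentiment140 training set was built."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:          # none or both -> discard the tweet
        return None
    label = "positive" if has_pos else "negative"
    text = " ".join(t for t in tokens
                    if t not in POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS)
    return text, label

print(distant_label("great book , loved it :)"))
```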
The Obama-McCain Debate [36] dataset was manually annotated with Amazon Mechanical Turk by one or more annotators with the categories positive, negative, mixed or other. A total of 3269 tweets posted during the presidential debate on September 26th, 2008 between Barack Obama and John McCain were annotated. We filtered this dataset to obtain only tweets with a positive or negative label (no neutral class was present). After the filtering process, we obtained 709 positive and 1195 negative examples.

The T4SA [40] dataset was collected from July to December 2016. The authors discarded retweets, tweets not containing any static image and tweets whose text was less than five words long. The authors were able to gather 3.4M tweets in English. Then, they classified the sentiment polarity of the texts and selected the tweets having the most confident textual sentiment predictions. This approach resulted in approximately a million labelled tweets. For the sentiment polarity classification, the authors used an adapted version of the ItaliaNLP Sentiment Polarity Classifier [13]. This classifier uses a tandem LSTM-SVM architecture. Along with the tweets, the authors also crawled the images contained in the tweets. The aim was to automatically build a training set for learning a visual classifier able to discover the sentiment polarity of a given image [40].

The SemEval-2017 dataset was created for the Sentiment Analysis in Twitter task [34] at SemEval 2017. The authors made available all the data from previous years of the Sentiment Analysis in Twitter tasks [30], and they also collected some new tweets. They chose English topics based on popular events that were trending on Twitter. The topics included a range of named entities (e.g., Donald Trump, iPhone), geopolitical entities (e.g., Aleppo, Palestine), and other entities. The dataset is divided into two parts – SemEval 2017 Train and SemEval 2017 Test. They used CrowdFlower to annotate the new tweets. We removed all duplicated tweets from the SemEval 2017 Train part, which resulted in approximately 20K positive, 8K negative and 23K neutral examples, and 2K positive, 4K negative and 6K neutral examples for the SemEval 2017 Test part (see Table 1).

The InHouse Tweets dataset consists of two datasets, InHouse Tweets Train and InHouse Tweets Test, used in [4]. These datasets come from the SemEval 2013 task 2, Sentiment Analysis in Twitter [20].

The Sanders twitter dataset1, created by Sanders Analytics, consists of 5512 tweets manually labelled by one annotator. Each tweet is related to one of four topics (Apple, Google, Microsoft, Twitter). Tweets are labelled as either positive, negative, neutral or irrelevant. We discarded the tweets labelled as irrelevant. In [35], the authors also described and used the Sanders twitter dataset.
3.2 Targeted Entity Sentiment Datasets
For the ESA task, we were able to collect three labelled datasets. The datasets from [17,26] are created from tweets, and our InHouse Entity dataset [38] contains sentences from news articles. Detailed statistics are shown in Table 2.
1 Dataset can be obtained from https://github.com/pmbaumgartner/text-feat-lib.
Table 2. Targeted Entity Sentiment Analysis datasets statistics.

Dataset          Total   Positive  Negative  Neutral
Dong             6940    1734      1733      3473
Mitchel          3288    707       275       2306
InHouse Entity   1281    169       189       923
Total            11 509  2610      2197      6702
Dong [17] is a manually annotated dataset for the ESA task, consisting of 1734 positive, 1733 negative and 3473 neutral examples. Each example consists of a tweet, an entity and a class label which denotes the sentiment towards the entity.

[26] used Amazon Mechanical Turk to annotate the Mitchel dataset with 3288 examples (tweet–entity pairs) for the ESA task. Tweets with a single highlighted named entity were shown to the annotators, and they were instructed to select the sentiment being expressed towards the entity (positive, negative or no sentiment).

For the evaluation, we also used our InHouse Entity dataset created in [38]. This dataset was created as a multilingual parallel news corpus annotated with sentiment towards entities. The authors used data from the Workshops on Statistical Machine Translation (2008, 2009, 2010)2. First, they recognized the named entities, and then selected examples were manually annotated by two annotators. The disagreed cases were judged by a third annotator. They were able to obtain 1281 labelled examples (707 positive, 275 negative and 923 neutral), i.e. sentences with an annotated entity and the sentiment expressed towards the entity.
3.3 News Tonality Datasets
For the TON task3, we used our two non-public multilingual datasets. First, our InHouse News dataset consists of 1830 manually labelled texts from news articles about the Macedonian referendum, in 23 languages, the majority of which are Macedonian, Bulgarian, English, Italian and Russian; see Table 3. Each example contains the title and description of a given article. For the evaluation of our systems, we used only Bulgarian, English, Italian and Russian, because the other languages are either not supported by the evaluated systems or have fewer than 60 examples.

The EP News dataset contains more than 50K manually labelled news articles about the European Parliament and the European Union in 25 European languages. Each news article in this dataset consists of a title and the full text of the article, together with their English translations. We selected five main European languages (English, German, French, Italian and Spanish) for the evaluation; see Table 4 for details.
2 http://www.statmt.org/wmt10/translation-task.html.
3 For this task we also used the tweets described in Subsect. 3.1.
Table 3. InHouse News dataset statistics.

InHouse News      Total  Positive  Negative  Neutral
Macedonian        974    516       234       224
Bulgarian         215    118       26        71
English           339    198       35        106
Italian           62     41        3         18
Russian           65     17        34        14
Other Languages   175    60        44        71
Total             1830   950       376       504
Table 4. EP Tonality News dataset statistics.

EP News   Total   Positive  Negative  Neutral
English   2193    263       172       1758
German    5122    389       179       4554
French    2964    574       308       2082
Italian   1544    291       152       1101
Spanish   3594    324       135       3135
Total     15417   1841      946       12630
4 Evaluation and Results
In this section, we present a summary of all the evaluation results for all three systems. For each system, we select an appropriate collection of datasets, and we classify the examples of each selected dataset separately. Then, we merge all selected datasets, and we classify them together. Except for the InHouse News dataset and the EP News dataset, all experiments are performed on English texts. We carry out experiments on the EMMTonality system with the InHouse News dataset in Bulgarian, English, Italian and Russian. Experiments with the EP News dataset are performed on the TwitOMedia and EMMTonality systems in English, German, French, Italian and Spanish (on the EMMTonality system we performed experiments with all available languages, but we report results only for these five). Each sample is classified as positive, negative or neutral, and for all the named systems we did not apply any additional preprocessing steps (except for the baseline systems). As an evaluation metric, we used Accuracy and the Macro F1 score, which are defined as:

F1^M = (2 × P^M × R^M) / (P^M + R^M)    (1)
where P^M denotes the Macro Precision and R^M denotes the Macro Recall. The precision P_i and recall R_i are first computed separately for each class (n is the number of classes) and then averaged as follows:

P^M = (1/n) × Σ_{i=1}^{n} P_i    (2)

R^M = (1/n) × Σ_{i=1}^{n} R_i    (3)
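Equations (1)–(3) correspond directly to the short implementation below (hypothetical label lists). A related metric is available in scikit-learn as f1_score(..., average='macro'), although that variant averages per-class F1 scores rather than combining the macro precision and recall as in Eq. (1).

```python
def macro_scores(y_true, y_pred, classes=("positive", "negative", "neutral")):
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        pred_c = sum(1 for p in y_pred if p == c)
        true_c = sum(1 for t in y_true if t == c)
        precisions.append(tp / pred_c if pred_c else 0.0)   # P_i
        recalls.append(tp / true_c if true_c else 0.0)      # R_i
    p_macro = sum(precisions) / len(classes)                 # Eq. (2)
    r_macro = sum(recalls) / len(classes)                    # Eq. (3)
    f1_macro = 2 * p_macro * r_macro / (p_macro + r_macro)   # Eq. (1)
    return p_macro, r_macro, f1_macro

y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "negative"]
print(macro_scores(y_true, y_pred))
```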
4.1 Baselines
For a basic comparison, we created baseline models for the TSA and TON tasks. These baseline models are based on unigram or unigram-bigram features. The results are shown in Tables 5, 6, 7, 8 and 9. For the baseline models, we apply minimal preprocessing steps such as lowercasing and word normalization, which includes the conversion of URLs, emails, money, phone numbers, usernames, dates and number expressions into a common placeholder token (for example, the token "www.google.com" is converted into a URL placeholder token). These steps lead to a reduction of the feature space, as shown in [19]. We use the ekphrasis library from [8] for word normalization.
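A sketch of this normalization step is shown below, assuming the TextPreProcessor interface of the ekphrasis library; the exact option set used for the baselines is not stated, so the configuration and the illustrated output are assumptions.

```python
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Assumed option set: the expression types listed above are normalised to
# common placeholder tokens, and the social tokenizer lowercases the text.
text_processor = TextPreProcessor(
    normalize=['url', 'email', 'money', 'phone', 'user', 'date', 'number'],
    fix_html=True,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tokens = text_processor.pre_process_doc("Check www.google.com, it cost $10 on 1/1/2019!")
print(tokens)  # e.g. ['check', '<url>', ',', 'it', 'cost', '<money>', 'on', '<date>', '!']
```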
Table 5. Results of baseline models for the InHouse Tweets Test dataset with unigram features (models were trained on the InHouse Tweets Train dataset).

Baseline          Macro F1  Accuracy
Log. regression   0.5525    0.5843
SVM               0.5308    0.5641
Naive Bayes       0.4233    0.4993
To train the baseline models, we use implementations of Support Vector Machines (SVM) – concretely, Support Vector Classification (SVC) with a linear kernel – Logistic Regression with the lbfgs solver, and Naive Bayes from the scikit-learn library [31]; default values are used for the other parameters of the mentioned classifiers. Our InHouse News dataset does not contain a large number of examples, and therefore we perform the experiments with 10-fold cross-validation; the same approach is applied for the EP News dataset. For the news datasets (InHouse News and EP News), we train baseline models with different combinations of data. Table 6 shows the results for models which are trained on a concatenation of examples in different languages: for each dataset, we select all untranslated examples (texts in the original languages), and we train the model regardless of the language. The model is then able to classify texts in all languages which were used to train it. This approach should lead to a performance improvement, as shown in [4].
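A condensed sketch of such a baseline is given below, using the scikit-learn classifiers named above with unigram-bigram counts and 10-fold cross-validation. The toy texts, the MultinomialNB variant and the exact vectorizer settings are assumptions rather than the authors' configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder normalised texts (optionally title + " " + text) and labels.
texts = ["<user> this book is a masterpiece", "endless and silly", "a new title was announced"] * 10
labels = ["positive", "negative", "neutral"] * 10

classifiers = {
    "Log. regression": LogisticRegression(solver="lbfgs", max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
}

for name, clf in classifiers.items():
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
    # 10-fold cross-validation with macro-averaged F1, as in the tables below.
    scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="f1_macro")
    print(name, scores.mean())
```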
Table 6. Macro F1 score and Accuracy results of baseline models with unigram and bigram features. The InHouse News dataset and the EP News dataset with all examples (all languages) were used. We used 10-fold cross-validation (the results in the table are averages over the individual folds). Bold values denote the best results for each dataset.

                                  InHouse News          EP News
Baseline          Config          Macro F1  Accuracy    Macro F1  Accuracy
Log. regression   text            0.663     0.705       0.551     0.864
                  text + title    0.704     0.738       0.578     0.870
SVM               text            0.657     0.697       0.564     0.856
                  text + title    0.717     0.747       0.591     0.866
Naive Bayes       text            0.612     0.676       0.513     0.845
                  text + title    0.646     0.702       0.552     0.852
The same approach is used to acquire the results in Table 7, but only specific languages are used: for the InHouse News dataset these are English, Bulgarian, Italian and Russian, and for the EP News dataset they are English, French, Italian, German and Spanish. Table 8 contains results for models trained only on the original English texts. In Tables 6, 7, 8 and 9, the column Config denotes whether only the text of an example is used, or whether the title of the example is concatenated with the text and used as well.

Table 7. Macro F1 score and Accuracy results of baseline models with unigram and bigram features. The InHouse News dataset with Bulgarian, English, Italian and Russian examples and the EP News dataset with English, French, Italian, German and Spanish examples were used. We used 10-fold cross-validation (the results in the table are averages over the individual folds). Bold values denote the best results for each dataset.
                                  InHouse News          EP News
Baseline          Config          Macro F1  Accuracy    Macro F1  Accuracy
Log. regression   text            0.629     0.682       0.497     0.833
                  text + title    0.692     0.729       0.529     0.841
SVM               text            0.630     0.677       0.513     0.819
                  text + title    0.684     0.718       0.540     0.833
Naive Bayes       text            0.585     0.657       0.432     0.816
                  text + title    0.612     0.678       0.457     0.817
If we compare the baseline results from Table 8 with the results from Table 10 (the last five lines of that table), we can see that the baselines perform much better than our current systems (see the Macro F1 scores in the tables). The TwitOMedia system was trained on tweet messages, so it is evident that its performance on news articles will be lower, but the EMMTonality system should achieve better results.
Table 8. Macro F1 score and Accuracy results of baseline models with unigram and bigram features. The InHouse News dataset and the EP News dataset only with original English examples were used. We used 10-fold cross-validation (the results in the table are averages over the individual folds). Bold values denote the best results for each dataset.

                                  InHouse News          EP News
Baseline          Config          Macro F1  Accuracy    Macro F1  Accuracy
Log. regression   text            0.612     0.730       0.510     0.820
                  text + title    0.685     0.769       0.534     0.826
SVM               text            0.608     0.719       0.530     0.815
                  text + title    0.674     0.760       0.546     0.827
Naive Bayes       text            0.502     0.695       0.441     0.818
                  text + title    0.547     0.713       0.446     0.819
Our results from Tables 6, 7 and 8 confirm the claims from [4] that joining data in different languages leads to a performance improvement. Models trained on all examples (regardless of language), see Table 6, achieve the best results.

Table 9. Macro F1 score and Accuracy results of baseline models trained on the SemEval 2017 Train and Test datasets with unigram features. The evaluation was performed on the original English examples from our InHouse News and EP News datasets. Bold values denote the best results for each dataset.

                                  InHouse News          EP News
Baseline          Config          Macro F1  Accuracy    Macro F1  Accuracy
Log. regression   text            0.395     0.432       0.312     0.518
                  text + title    0.408     0.462       0.310     0.495
SVM               text            0.380     0.429       0.283     0.408
                  text + title    0.389     0.456       0.287     0.397
Naive Bayes       text            0.237     0.296       0.313     0.639
                  text + title    0.239     0.293       0.314     0.620
We collected a large manually labelled dataset of tweets, and we wanted to study the possibility of using this dataset to train a model. This model would then be used for the classification of news articles, which are from a different domain than the training data. After comparing the results from Table 9 with the results from Table 10 (the last five lines of that table), we can see that our simple baseline is not outperformed on the InHouse News dataset by the other two systems. These results show that it is possible to use data from a different domain for training and still obtain good results. We also observe that incorporating the title (concatenating the title and the text) of a news article leads to an increase in performance across all datasets and
combinations of data used for training the models. These results show that the title is an essential part of a news article and contains significant sentiment and semantic information despite its short length.
4.2 Twitter Sentiment Analysis
To evaluate the system for the TSA task, we used a domain-rich collection of tweet datasets. We collected datasets with almost 3M labelled tweets; detailed statistics of the used datasets can be seen in Table 1. Table 10 shows the obtained results for the Accuracy and Macro F1 measures. From Table 10 it is evident that the TwitOMedia system [4] performs best on the InHouse Tweets Test dataset (bold values in the table). This dataset is based on data from [20] and was used to develop (train and test) this system. The reason why the TwitOMedia system performs better on the InHouse Tweets Test dataset than on the InHouse Tweets Train dataset (HTTr) is that the system was trained on translations of the HTTr dataset: the original training dataset (HTTr) was translated into several languages, and then the translations were merged into one training dataset which was used to train the model. This approach leads to a performance improvement, as shown in [4].

For the other datasets the performance is lower, especially for the domain-specific ones and for the datasets which do not contain instances of the neutral class, for example the Health Care Reform dataset or the Sentiment140 Train dataset. The first reason is most likely that the system was trained on a different, too dissimilar domain of texts, and thus it is not able to successfully classify (generalize to) texts from these domain-specific datasets. Secondly, the Sentiment140 Train dataset and the Obama-McCain Debate dataset do not contain examples with a neutral class.
4.3 Tonality in News
The EMMTonality system for the TON task was evaluated on the same set of datasets as the TwitOMedia system. The obtained results are shown in Table 10. If we compare the results of the TwitOMedia system and the results of the EMMTonality system, we can see that the EMMTonality system achieves better results for these datasets: Sentiment140 Test, Health Care Reform, Obama-McCain Debate, Sanders, SemEval 2017 Train, and SemEval 2017 Test. The overall results are better for the TwitOMedia system. The results for the InHouse News and EP News datasets are comparable for both evaluated systems. Regarding multilinguality, the EMMTonality system slightly outperforms the TwitOMedia system in Macro F1 score; see Table 11, which contains the results for the EP News dataset for five European languages (English, German, French, Italian and Spanish).
Table 10. Macro F1 score and Accuracy results of the evaluated TwitOMedia and EMMTonality systems. Bold values denote the best results in each dataset category (individual Twitter datasets, joined Twitter datasets and news datasets), and underlined values denote the best results for each dataset category and each system separately.

                                        TwitOMedia            EMMTonality
Dataset                                 Macro F1  Accuracy    Macro F1  Accuracy
Sentiment140 Test                       0.566     0.530       0.666     0.639
Health Care Reform                      0.410     0.326       0.456     0.403
Obama-McCain Debate (OMD)               0.270     0.290       0.331     0.357
Sanders                                 0.468     0.591       0.526     0.618
Sentiment140 Train (S140T)              0.312     0.358       0.250     0.375
SemEval 2017 Train                      0.501     0.529       0.538     0.561
SemEval 2017 Test                       0.460     0.500       0.552     0.564
T4SA                                    0.603     0.669       0.410     0.392
InHouse Tweets Test (HTT)               0.710     0.708       0.583     0.610
InHouse Tweets Train (HTTr)             0.629     0.599       0.580     0.574
All Tweets w/o S140T, OMD, T4SA         0.597     0.660       0.545     0.563
All Tweets w/o S140T, T4SA              0.507     0.528       0.542     0.558
InHouse News en                         0.397     0.425       0.398     0.425
EP News en, text                        0.368     0.698       0.422     0.678
EP News en, title + text                0.372     0.690       0.425     0.675
EP News translated, text                0.368     0.432       0.390     0.278
EP News translated, title + text        0.369     0.388       0.393     0.238
Table 11. Macro F1 score and Accuracy results for the EP News dataset for English, German, French, Italian and Spanish examples.

                        TwitOMedia            EMMTonality
Lang.  Config           Macro F1  Accuracy    Macro F1  Accuracy
EN     Text             0.368     0.698       0.422     0.678
       Text + Title     0.372     0.690       0.425     0.675
DE     Text             0.333     0.711       0.348     0.846
       Text + Title     0.344     0.687       0.360     0.730
FR     Text             0.354     0.614       0.389     0.549
       Text + Title     0.356     0.602       0.383     0.472
IT     Text             0.314     0.692       0.397     0.347
       Text + Title     0.351     0.690       0.405     0.330
ES     Text             0.337     0.828       0.392     0.386
       Text + Title     0.332     0.823       0.392     0.333
4.4 Targeted Sentiment Analysis
We evaluated the EMMSenti and EMMTonality systems for the ESA task on the Dong, Mitchel and InHouse Entity datasets; see Table 12 for the results. We obtained the best results for the InHouse Entity dataset, both in terms of the Accuracy measure and the Macro F1 score. The best results across all datasets and systems are obtained for the neutral class (not reported in the table); for the other classes our systems perform worse. The classification algorithm (for both systems) is based on counting subjective terms (words) around entity mentions (no machine learning algorithm or approach is involved). It is obvious that the quality of the dictionaries used, as well as their adaptation to the domain, is crucial. If no subjective term from the text is found in the dictionary, the example is assigned the neutral label. The best performance of our systems for the neutral class can be explained by the fact that most of the neutral instances do not contain any subjective term. We also have to note that we were not able to reproduce the results reported in [38], and our performance on this dataset is worse. It is possible that the authors of [38] used slightly different lexicons than we did.

Table 12. Macro F1 score and Accuracy results for the EMMSenti and EMMTonality systems evaluation. Bold values denote the best results for each dataset.

Dataset            EMMSenti             EMMTonality
                   Macro F1  Accuracy   Macro F1  Accuracy
Dong               0.491     0.512      0.496     0.501
Mitchel            0.483     0.660      0.490     0.640
InHouse Entity     0.517     0.663      0.507     0.659
All                0.571     0.512      0.557     0.505
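The entity-level classification described above, counting subjective terms around an entity mention and defaulting to neutral when none is found, can be sketched as follows. The lexicon, window size and scoring rule are illustrative assumptions for the sketch, not the exact configuration of the EMMSenti or EMMTonality systems.

# Illustrative lexicon and window size; both are placeholders.
SUBJECTIVE_TERMS = {"great": 1, "good": 1, "bad": -1, "terrible": -1}
WINDOW = 6  # tokens considered on each side of the entity mention

def targeted_sentiment(tokens, entity_index, lexicon=SUBJECTIVE_TERMS, window=WINDOW):
    start = max(0, entity_index - window)
    end = min(len(tokens), entity_index + window + 1)
    score = sum(lexicon.get(tok.lower(), 0) for tok in tokens[start:end])
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    # No subjective term near the mention: default to neutral, which is why
    # the neutral class gets the best results.
    return "neutral"

tokens = "the new reform is a terrible burden for citizens".split()
print(targeted_sentiment(tokens, entity_index=2))  # entity mention: "reform"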
4.5 Error Analysis
In order to understand the causes leading to erroneous classification, we analyze the misclassified examples from the Twitter and News datasets for the EMMTonality and TwitOMedia systems. We categorize the errors into four groups (see below)6. We randomly selected 40 incorrectly classified examples for each class and for each system across all datasets used for the evaluation of these systems, which resulted in 240 manually evaluated examples in total. We found the following major groups of errors: 1. Implicit sentiment/external knowledge: Sentiment is often expressed implicitly, or external knowledge is needed for a correct classification. The evaluated text does not contain any explicit attributes (words, phrases,
Each incorrectly classified example may be contained in more than one error group. Some examples were also (in our view) annotated incorrectly. For some cases, we were not able to discover the reason for misclassification.
emoji/emoticons) which would clearly indicate the sentiment, and because our systems are based on surface-level features (unigrams/bigrams or counting occurrences of sentiment words), they fail on these examples. For example, a text like “We went to Stanford University today. Got a tour. Made me want to go back to college.” indicates positive sentiment, but to make this decision we have to know that Stanford University is a prestigious university (which is positive) and, according to the sentence “Made me want to go back to college.”, that the author probably has a positive relationship to universities or to his previous studies. This group of errors is the most common in our set of error analysis examples; we observed it in 94 cases and only for positive or negative examples.

2. Slang expression: Misclassified examples in this group contain domain-specific words, slang expressions, emojis, unconventional linguistic means, and misspelled or uppercased words like “4life”, “YEAH BOII”, “yessss”, “grrrl”, “yummmmmy”. We observed this type of error in 29 examples, and most of them were caused by the EMMTonality system (which is reasonable because this system is intended for news). An appropriate solution for part of this problem is the application of preprocessing steps like spell correction, lowercasing, text normalization (“yesssss” ⇒ “yes”) or extending the dictionaries. In the case of extending the dictionaries, we have to deal with the Twitter vocabulary, because the vocabulary of tweets changes quite fast (new expressions and hashtags are introduced often) and thus the dictionaries have to be updated regularly. On the other hand, the TwitOMedia system would have to be retrained every time with new examples in order to extend its feature set, or a more advanced normalization system should be used in the pre-processing stage.

3. Negation: Negation of terms is an essential aspect of sentiment classification [33]. Negations can easily change or reverse the sentiment orientation. This error appeared in 35 cases in our set of error analysis examples.

4. Opposite sentiment words: The last type of error is caused by sentiment words which express the opposite of, or a different sentiment than, the entire text. This type of error was typical for examples annotated with a neutral label. For example, the tweet “#Yezidi #Peshmerga forces playing volleyball and crushing #ISIS in the frontline.” is annotated as neutral but contains words like “crushing”, “#ISIS” or “frontline” which can indicate negative sentiment. We observed this error in 20 examples.

The first group of errors (Implicit sentiment/external knowledge) was the most common among the evaluated examples and is also the hardest one, because the system would need access to world knowledge or the ability to detect implicit sentiment in order to classify correctly. This error was observed only for examples annotated with positive or negative labels; there, the explicit sentiment markers are missing. The majority of these examples were misclassified as the neutral class. In this case, the sentiment analysis system must be complemented with a system for emotion detection, similar to one of the top systems from [23], to improve classification performance. For examples classified as neutral, we would change the neutral class according to the detected emotion. The examples with negative
emotions like sadness, fear or anger would be changed to the negative class, and examples with positive emotions like joy or surprise would be changed to the positive class. Figure 1 shows the confusion matrices for the EMMTonality and TwitOMedia systems. We can see that a noticeable number of misclassified examples were predicted as the neutral class, so the proposed improvement should, according to our statistics from the error analysis, positively affect a significant number of examples.
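The proposed post-processing, replacing a predicted neutral label by the polarity of a detected emotion, amounts to a simple lookup. The sketch below follows the emotion-to-polarity mapping given above; the emotion detector itself is an assumed external component.

NEGATIVE_EMOTIONS = {"sadness", "fear", "anger"}
POSITIVE_EMOTIONS = {"joy", "surprise"}

def refine_neutral_prediction(sentiment_label, detected_emotion):
    # Only predictions labelled neutral are revised; positive/negative stay as-is.
    if sentiment_label != "neutral" or detected_emotion is None:
        return sentiment_label
    if detected_emotion in NEGATIVE_EMOTIONS:
        return "negative"
    if detected_emotion in POSITIVE_EMOTIONS:
        return "positive"
    return sentiment_label

print(refine_neutral_prediction("neutral", "sadness"))  # -> negative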
Fig. 1. Confusion matrices for (a) the TwitOMedia system and (b) the EMMTonality system on all tweets without the S140T and T4SA datasets.
Lastly, we have to note that we were not able to determine the reason for misclassification in 35 cases. In our view, in seven cases the annotated label itself was incorrect.
5 Conclusion
In this paper, we showed the process of thoroughly evaluating three systems for sentiment analysis and compared their performance. We collected and described a rich collection of publicly available datasets, performed experiments with these datasets and reported the performance of the individual systems. We carried out additional experiments with the collected datasets and showed that for news articles it is beneficial to also include the title of the news article along with the text of the article itself. We performed a thorough error analysis and proposed potential solutions for each category of misclassified examples. In our future work, we will explore current state-of-the-art methods and develop new approaches (including deep learning methods, multilingual embeddings and other recent machine learning approaches) for multilingual sentiment analysis in order to implement them in our highly multilingual environment.
Acknowledgments. This work was partially supported by the ERDF project “Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)” (no.: CZ.02.1.01/0.0/0.0/17 048/0007267) and by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
References 1. de Arruda, G.D., Roman, N.T., Monteiro, A.M.: An annotated corpus for sentiment analysis in political news. In: Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology, pp. 101–110 (2015) 2. Balahur, A., Turchi, M.: Multilingual sentiment analysis using machine translation? In: Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA 2012, Stroudsburg, PA, USA, pp. 52–60. Association for Computational Linguistics (2012). dl.acm.org/citation.cfm?id=2392963.2392976 3. Balahur, A., Turchi, M.: Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis. Comput. Speech Lang. 28(1), 56–75 (2014) 4. Balahur, A., et al.: Resource creation and evaluation for multilingual sentiment analysis in social media texts. In: LREC, pp. 4265–4269. Citeseer (2014) 5. Barnes, J., Klinger, R., Schulte im Walde, S.: Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. In: Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 2–12. Association for Computational Linguistics (2017). https://doi. org/10.18653/v1/W17-5202. aclweb.org/anthology/W17-5202 6. Barnes, J., Klinger, R., Schulte im Walde, S.: Bilingual sentiment embeddings: joint projection of sentiment across languages. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2483–2493. Association for Computational Linguistics (2018). aclweb.org/anthology/P18-1231 7. Barnes, J., Klinger, R., Schulte im Walde, S.: Projecting embeddings for domain adaption: joint modeling of sentiment analysis in diverse domains. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 818–830. Association for Computational Linguistics (2018). aclweb.org/anthology/C18-1070 8. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 747–754. Association for Computational Linguistics, August 2017 9. Bobichev, V., Kanishcheva, O., Cherednichenko, O.: Sentiment analysis in the Ukrainian and Russian news. In: 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pp. 1050–1055. IEEE (2017) ˇ 10. Buˇcar, J., Znidarˇ siˇc, M., Povh, J.: Annotated news corpora and a lexicon for sentiment analysis in Slovene. Lang. Resour. Eval. 52(3), 895–919 (2018). https://doi. org/10.1007/s10579-018-9413-3
11. Can, E.F., Ezen-Can, A., Can, F.: Multilingual sentiment analysis: an RNN-based framework for limited data. CoRR abs/1806.04511 (2018) 12. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1179. aclweb.org/anthology/D14-1179 13. Cimino, A., Dell’Orletta, F.: Tandem LSTM-SVM approach for sentiment analysis. In: CLiC-it/EVALITA (2016) 14. Cliche, M.: BB twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 573–580. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/S17-2094. aclweb.org/anthology/S17-2094 15. Collomb, A., Costea, C., Joyeux, D., Hasan, O., Brunie, L.: A study and comparison of sentiment analysis methods for reputation evaluation. Rapport de recherche RRLIRIS-2014-002 (2014) 16. Dashtipour, K., et al.: Multilingual sentiment analysis: state of the art and independent comparison of techniques. Cogn. Comput. 8(4), 757–771 (2016). https:// doi.org/10.1007/s12559-016-9415-7 17. Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., Xu, K.: Adaptive recursive neural network for target-dependent Twitter sentiment classification. In: The 52nd Annual Meeting of the Association for Computational Linguistics (ACL). ACL (2014) 18. Duppada, V., Jain, R., Hiray, S.: SeerNet at SemEval-2018 task 1: domain adaptation for affect in tweets. In: Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 18–23. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/S18-1002. aclweb.org/anthology/S18-1002 19. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1(12) (2009) 20. Hltcoe, J.: SemEval-2013 task 2: sentiment analysis in Twitter, Atlanta, Georgia, USA 312 (2013) 21. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 22. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1181. aclweb.org/anthology/D14-1181 23. Klinger, R., De Clercq, O., Mohammad, S., Balahur, A.: IEST: WASSA-2018 implicit emotions shared task. In: Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, pp. 31–42. Association for Computational Linguistics, October 2018. aclweb.org/anthology/W18-6206 24. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 25. Lommatzsch, A., B¨ utow, F., Ploch, D., Albayrak, S.: Towards the automatic sentiment analysis of German news and forum documents. In: Eichler, G., Erfurth, C., Fahrnberger, G. (eds.) I4CS 2017. CCIS, vol. 717, pp. 18–33. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60447-3 2 26. Mitchell, M., Aguilar, J., Wilson, T., Van Durme, B.: Open domain targeted sentiment. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1643–1654 (2013)
27. Mohammad, S.M., Bravo-Marquez, F.: WASSA-2017 shared task on emotion intensity. In: Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA), Copenhagen, Denmark (2017) 28. Mohammad, S.M., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval2018 task 1: affect in tweets. In: Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA (2018) 29. Mohammad, S.M., Kiritchenko, S., Zhu, X.: NRC-Canada: building the state-ofthe-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242 (2013) 30. Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V., Sebastiani, F.: SemEval-2016 task 4: sentiment analysis in Twitter. In: Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval 2016, San Diego, California. Association for Computational Linguistics, June 2016 31. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 32. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods, pp. 185–208 (1999) 33. Reitan, J., Faret, J., Gamb¨ ack, B., Bungum, L.: Negation scope detection for Twitter sentiment analysis. In: Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 99– 108 (2015) 34. Rosenthal, S., Farra, N., Nakov, P.: SemEval-2017 task 4: sentiment analysis in Twitter. In: Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval 2017, Vancouver, Canada. Association for Computational Linguistics, August 2017 35. Saif, H., Fernandez, M., He, Y., Alani, H.: Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold (2013) 36. Shamma, D.A., Kennedy, L., Churchill, E.F.: Tweet the debates: understanding community annotation of uncollected sources. In: Proceedings of the First SIGMM Workshop on Social Media, pp. 3–10. ACM (2009) 37. Speriosu, M., Sudan, N., Upadhyay, S., Baldridge, J.: Twitter polarity classification with label propagation over lexical links and the follower graph. In: Proceedings of the First Workshop on Unsupervised Learning in NLP, pp. 53–63. Association for Computational Linguistics (2011) 38. Steinberger, J., Lenkova, P., Kabadjov, M., Steinberger, R., Van der Goot, E.: Multilingual entity-centered sentiment analysis evaluated by parallel corpora. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 770–775 (2011) 39. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific word embedding for Twitter sentiment classification. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1555–1565. Association for Computational Linguistics (2014). https:// doi.org/10.3115/v1/P14-1146. aclweb.org/anthology/P14-1146 40. Vadicamo, L., et al.: Cross-media learning for image sentiment analysis in the wild. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 308–317, October 2017. https://doi.org/10.1109/ICCVW.2017.45
41. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: a survey. CoRR abs/1801.07883 (2018). arxiv.org/abs/1801.07883 42. Zhou, H., Chen, L., Shi, F., Huang, D.: Learning bilingual sentiment word embeddings for cross-language sentiment classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 430–440. Association for Computational Linguistics (2015). https:// doi.org/10.3115/v1/P15-1042. aclweb.org/anthology/P15-1042
Sentiment Analysis of Influential Messages for Political Election Forecasting

Oumayma Oueslati (B), Moez Ben Hajhmida, and Habib Ounelli, University of Tunis El Manar, Tunis, Tunisia, [email protected]
Erik Cambria, Nanyang Technological University, Singapore, Singapore
Abstract. In this paper, we explore the use of sentiment analysis of influential messages on social media to improve political election forecasting. Since social media users are not necessarily representative of the overall electorate, bias correction of users' messages is critical for producing a reliable forecast. The observation that motivates our work is that people on social media consult each other's messages before taking a decision; this means that social media users influence each other. We first built a classifier to detect politically influential messages based on different aspects (message content, time, sentiment, and emotion). Then, we predicted electoral candidates' votes using the sentiment degree of influential messages. We applied our proposed model to the 2016 United States presidential election and conducted experiments at different time intervals. Results show that our approach achieves better performance than both off-line polling and classical approaches.
Keywords: Sentiment analysis · Political election forecasting · Presidential election

1 Introduction
Nowadays, writing and messaging on social media is a part of our daily routine. Facebook, for example, enjoys more than one billion daily active users. The exponential growth of social media has engendered the growth of user-generated content (UGC) available on the web. The availability of UGC raised the possibility of monitoring electoral campaigns by tracking and exploring citizens' preferences [1]. Jin et al. [2] stated that analyzing social media during an electoral campaign may be more useful and accurate than traditional off-line polls and surveys. This approach represents not only a more economical process to predict the election outcome but also a faster way to analyze such a massive amount of data. Thus, many studies proved that analyzing social media based on several indicators led to a reliable forecast of the final result. Some works [3,4] have relied on
simple techniques such as the volume of data related to candidates. More recent works tried to provide a better alternative to traditional off-line polls using sentiment analysis of UGC [5,6]. Whatever technique is used, addressing the data bias is a crucial phase which impacts the quality of the outcome. Since social media contents are not necessarily all relevant for the prediction, an appropriate technique to correct the bias in UGC is needed. In this paper, we propose a sentiment analysis based approach to predict political elections by relying only on influential messages shared on social media. Social influence has been observed not only in political participation but also in many other domains such as health behaviors and idea generation [7]. To the best of our knowledge, our work is the first to investigate politically influential messages to forecast election outcomes. According to Cialdini and Trost [8], social influence occurs when an individual's views, emotions, or actions are impacted by the views, emotions or actions of another individual. By analogy, political influence is achieved through direct interaction with voters on social media platforms. Politicians tweet on Twitter and post on Facebook to receive voters' feedback and understand their expectations. Hence, we built a classifier to select influential messages based on content, time, sentiment, and emotion features. To compute sentiment features, we adopted SenticNet [10], a concept-level sentiment analysis framework widely recommended in the literature [9]. To extract emotion features, we built an emotional lexicon based on Facebook reactions. For each electoral candidate, the number of votes was predicted using the sentiment polarity and degree of influential messages appearing on the candidate's official Facebook page. We applied the proposed approach to the 2016 United States presidential election. To evaluate the prediction quality, we mainly considered two kinds of ground truth for comparison: the election outcome itself and polls released by traditional polling institutions. We also compared our approach with classical approaches merely based on data volume. Experiments were conducted at different time intervals. Results showed that using influential messages led to a more accurate prediction. In terms of structure, the rest of the paper is organized as follows: Section 2 explores the current literature; Sect. 3 addresses the research methods; Sect. 4 presents the results, discussion and implications; lastly, Sect. 5 gives a synopsis of the main concluding remarks.
2 Related Works

2.1 Sentiment Analysis
Social media platforms have changed the way people use information to make decisions. People tend to consult each other's reviews before making their choices and decisions. Sentiment analysis in social media is a challenging problem that has attracted a large body of research [11,12]. In [13], the authors investigated the impact of sentiment analysis tools for extracting useful information from unstructured data, ranging from evaluating consumer products, services, healthcare, and financial services to analyzing social events and political elections.
Cambria et al. [10] introduced SenticNet, a concept-level sentiment analysis framework consisting of 100,000 concept entries. SenticNet acts as a semantic link between concept-level emotion and natural word-level language data. Five affiliated semantic nodes are listed for each concept. These nodes are connected by semantic relations, four sentics, and a sentiment polarity value. The four sentics, namely Introspection, Temper, Attitude, and Sensitivity, present a detailed emotional description of the concept they belong to. The sentiment polarity value is an integrated evaluation of the concept sentiment based on these four parameters; the sentiment polarity provided by SenticNet is a float number in the range between −1 and 1. Many applications have been developed by employing SenticNet, in fields such as the analysis of large amounts of social data and human-computer interaction. In [14], Bravo-Marquez et al. used SenticNet to build a sentiment analysis system for Twitter. In [15], the authors used SenticNet to build an e-health system called iFeel which analyzes patients' opinions about the provided healthcare. Another study, by Qazi et al. [9], recommended SenticNet to extract sentiment features. Encouraged by these works, we also used the SenticNet framework to extract sentiment features from the extracted messages.
2.2 Election Forecasting Approaches
Forecasting elections from social media has become the latest buzzword. Politicians have adopted social media, predominantly Facebook and Twitter, as a campaigning tool. On the other hand, the general public has widely adopted social media to conduct political discussions [16]. Hence, Bond et al. [17] affirm that social media content may influence citizens' political behavior. Sang and Bos [18] stated that many studies have shown that analyzing social media using several techniques and based on different indicators leads to a reliable forecast of electoral campaigns and results. Tumasjan et al. [4] were the first to use Twitter to predict the outcome of a German federal election. They used a simple technique based on counting the number of tweets that a party receives. Despite their success in predicting the winner of the 2009 German federal elections, their simple technique attracted much criticism. Jungherr et al. [19] highlighted the lack of methodological justification. Furthermore, Gayo-Avello [5,20] stressed making use of sentiment analysis to produce more accurate results. In [5], Gayo-Avello reported a better error rate when using sentiment analysis (17.1% using volume, 7.6% using sentiment). Consequently, many works have taken this advice, such as [6,21–23]. Addressing the data bias is an essential phase in predicting an electoral outcome [24,25]. Social media users are not necessarily representative of the overall population. However, many works such as [4,19] did not attempt to correct for this bias. Some other works, such as [5,24], attempted to reduce the bias according to user age and geolocation, in order to improve the overall view of the electorate. However, the authors reported that the success was minimal and the improvement somewhat marginal. A very recent work by Arroba et
al. [25] explores geographic weighting factors to improve political prediction results. They state that geographic weighting, along with sentiment polarity and relevance, leads to a better outcome.
3 Proposed Method
In this section, we introduce the approach used to build our model, shown in Fig. 1. Our methodology consists of a series of steps that range from the extraction of Facebook user messages (FUMs) to the election prediction process. Our work is influenced by the advice of [5]. Instead of merely relying on volume (the number of messages a candidate receives), we used sentiment analysis in our methodology along with an attempt to reduce data bias by selecting only influential messages. We applied this methodology to the 2016 U.S. presidential election, which took place on November 8, 2016 with two front-running candidates: the Republican Donald Trump and the Democrat Hillary Clinton. Republican Donald Trump lost the popular vote to Democrat Hillary Clinton by more than 2.8 million votes.
Fig. 1. Workflow of the proposed model.
3.1 Data Collection
Twitter is the platform most often used to predict election outcomes, thanks to the ease with which it allows data to be extracted. To choose our data source, we compared Facebook and Twitter in terms of data quality and platform popularity. Many previous studies [20,26,27] found that Twitter data was unreliable for predicting electoral outcomes. This is mainly due to selecting tweets unrelated to the
candidates. Selecting tweets based on a manually constructed list of keywords certainly leads to a loss of relevant information: even if tweets do not contain any keyword from the pre-defined list, it does not mean that they are necessarily irrelevant. In contrast, Facebook provides official candidate pages which allow having a large sample of relevant data independently of keywords. It also provides more information about the text message and does not limit the user to a specific number of characters, whereas Twitter limits its users to 240 characters, which forces them to express their opinions briefly and sometimes partially. Furthermore, favorable statistics on U.S. Facebook users encouraged us to rely on it for an accurate electoral prediction: the total Facebook audience in the United States amounted to 214 million users, of which more than 208 million are older than 17 years1. We extracted data from the candidates' official Facebook pages. Namely, we extracted FUMs along with: users' responses to the FUM, users' reactions (Like, Love, Haha, Wow, Sad, and Angry), and timestamps (FUM publication time, first FUM reply time, and last FUM reply time). The collection was done directly from public verified Facebook pages with a self-made application, using the Facebook Graph API2. Data collection was conducted within one year before the presidential election in November 2016 so that we could test our model over several time periods (one year before election day, six months before, one week before, etc.). In the first pre-processing step, we deleted URLs, empty messages, non-English messages, and duplicated raw data. If a message is duplicated but has different metrics, we kept it; for example, the message “WE NEED TRUMP NOW!!!” appeared three times in our raw data but each time with different numbers of likes and replies, so we kept all three. After the data cleaning step, we kept 10k messages from Hillary Clinton's official Facebook page and 12k messages from Donald Trump's official Facebook page.
3.2 Feature Generation
This subsection describes the features we use to characterize influential messages. Based on the definition of social influence stated by Cialdini and Trost [8], “Social influence occurs when an individual's views, emotions, or actions are impacted by the views, emotions or actions of another individual”, we designed four kinds of features (sentiment, emotion, time, and content). In total, we designed 20 features to characterize whether a message is influential or not. Sentiment Feature: We conducted the sentiment analysis task using SenticNet. We attributed to each FUM a sentiment score between −1 and 1. Practically, SenticNet is inspired by the Hourglass of Emotions model [28]. In order to calculate the sentiment score, each term is represented on the basis of the intensity of four basic emotional dimensions, namely sensitivity, aptitude, attention, and pleasantness. In our work, we computed the sentiment features using the Sentic
www.statista.com/statistics/398136/us-facebook-user-age-groups/. www.developers.facebook.com/tools/explorer/.
API3. The sentiment features are as follows: (1) OSS, the Overall Sentiment Score of the FUM; (2) SSPMax, the Sentiment Score of the most Positive term in the FUM; (3) SSPMin, the Sentiment Score of the least Positive term in the FUM; (4) SSNMax, the Sentiment Score of the most Negative term in the FUM; and (5) SSNMin, the Sentiment Score of the least Negative term in the FUM. Emotion Feature: To extract the emotion features, we first built an emotion lexicon based on Facebook reactions. Kumar and Vadlamani [29] stated that in social networks, if someone reacted to a public post or review (message), it means that the person has positive or negative emotions towards the entity in question. So emotions may be denoted explicitly through reviews and messages or implicitly through reactions. In our work, we explore Facebook reactions to construct an emotion lexicon. This lexicon allows emotion extraction from any FUM, even if the FUM has not received any reaction yet. There are six reactions that Facebook users use to express their emotions toward a message (Like, Love, Haha (laughing), Wow (surprised), Sad, and Angry). From the collected data we selected all the messages which had received any reaction. After cleaning the messages and deleting stop words, we selected, based on the reactions, terms reflecting emotions. We thus obtained a list of emotional terms and, based on the reaction counts, attributed a score to each term. For example, the term 'waste' appears in two messages with (5, 10) Like, (0, 1) Love, (12, 0) Haha, (3, 18) Wow, (0, 40) Sad, and (30, 7) Angry reactions. So the term 'waste' has in total 15 Like, 1 Love, 12 Haha, 18 Wow, 40 Sad, and 37 Angry reactions. We normalized the scores by the sum of all reactions (123 in this example). Lastly, the emotion features were extracted based on the constructed emotion lexicon. The first emotional feature, EMT, evaluates the presence of EMotional Terms in the FUM (the number of emotional terms divided by the number of all terms). The six other features are LKR, LVR, LGR, SPR, AGR, and SDR, representing the Like ratio, the Love ratio, the Laugh ratio, the Surprise ratio, the Anger ratio, and the Sadness ratio respectively. Time Feature: The time aspect is important to analyze. Indeed, we generated two time features to evaluate users' engagement towards FUMs:

LCF = (LastPostedReplyTime − MessagePublicationTime) / (ElectionPredictionTime − MessagePublicationTime)

RCT = 1 − (FirstPostedReplyTime − MessagePublicationTime) / (ElectionPredictionTime − MessagePublicationTime)
The Life Cycle feature (LCF) measures how long the message persists and remains popular, i.e., how long its content can drive user attention and engagement. The Life Cycle value lies between zero and one. The Reaction Time feature (RCT) evaluates the time a FUM takes to start receiving
http://sentic.net/api.
responses. This feature indicates whether a message rapidly engaged users and drew their attention. Content Feature: The generated content features attempt to evaluate the quality of the message content. A message which is not clear and readable cannot be influential. Hence, the content features include (1) NBC, the Number of Characters in the FUM; (2) NBW, the Number of Words in the FUM; (3) NBS, the Number of Sentences in the FUM; (4) NBWS, the Number of Words per Sentence; (5) NBSE, the Number of Spelling Errors in the message; and (6) ARI, the Automated Readability Index. ARI is calculated as follows: ARI = 4.71 ∗ (CharCount / WordCount) + 0.5 ∗ (WordCount / SentCount) − 21. This score indicates the US educational level required to comprehend a given text. The higher the score, the less readable the text [30].
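The time and content features above reduce to a few arithmetic expressions. The sketch below assumes timestamps are given as Unix seconds and uses a naive tokenizer, so it approximates the feature extractor rather than reproducing it exactly.

import re

def lcf(msg_time, last_reply_time, prediction_time):
    # Life cycle: how long the message kept receiving replies, relative to the
    # window between its publication and the prediction moment.
    return (last_reply_time - msg_time) / (prediction_time - msg_time)

def rct(msg_time, first_reply_time, prediction_time):
    # Reaction time: messages that get their first reply quickly score close to 1.
    return 1 - (first_reply_time - msg_time) / (prediction_time - msg_time)

def ari(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21

msg = "We went to Stanford University today. Got a tour."
print(round(ari(msg), 2))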
3.3 Influential Classifier Construction
In our work, we propose to reduce the data bias based on a message's influence rather than on who wrote the message. As social influence has been observed in political participation [17], we built a classifier to select only politically influential FUMs, i.e., those whose writer's actions and emotions impact the actions and emotions of other users. To build our classifier we need a labeled dataset. Since it is too expensive to label influential messages manually, we selected messages which received many responses from other users. If the message and the responses have approximately the same sentiment polarity (positive or negative), the message is marked as influential. On the other hand, if the message and its responses have different sentiment polarities, the message is marked as non-influential. We manually revised the messages that have the same sentiment polarity as their responses but whose score margin exceeds 0.5. Through this technique of semi-automatic labeling, we obtained a labeled dataset of 1561 messages: 709 labeled influential and 852 labeled non-influential.
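A minimal version of this semi-automatic labelling rule reads as follows. Sentiment scores are assumed to come from SenticNet, aggregating the replies by their average score is a simplification made for the sketch, and the 0.5 margin mirrors the threshold for manual revision mentioned above.

def auto_label(message_score, reply_scores, margin=0.5):
    # Returns "influential", "non-influential", or "manual" (needs human revision).
    if not reply_scores:
        return "manual"
    avg_reply = sum(reply_scores) / len(reply_scores)
    if (message_score > 0) != (avg_reply > 0):
        return "non-influential"          # message and replies disagree in polarity
    if abs(message_score - avg_reply) > margin:
        return "manual"                   # same polarity but scores diverge too much
    return "influential"

print(auto_label(0.7, [0.6, 0.4, 0.8]))   # -> influential
print(auto_label(0.7, [-0.3, -0.5]))      # -> non-influential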
3.4 Election Outcome Prediction Model
We used the method of [5] with some changes. While Gayo-Avello et al. counted every positive message and every negative message, we included only the influential ones. The predicted vote share for a candidate C1 was then computed as follows:

share(C1) = (infPosSent(C1) + infNegSent(C2)) / (infPosSent(C1) + infNegSent(C1) + infPosSent(C2) + infNegSent(C2))

C1 is the candidate for whom support is being computed while C2 is the opposing candidate. infPosSent(C) and infNegSent(C) are, respectively, the number of positive influential and the number of negative influential messages, multiplied by their sentiment scores.
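Computationally, the vote-share formula only needs the four aggregates. The sketch below assumes that each influential message contributes the absolute value of its sentiment score to the corresponding aggregate, which is one reading of "multiplied by their sentiment scores"; the message lists are placeholders.

def aggregate(messages):
    # messages: list of (sentiment_score, is_influential) tuples for one candidate.
    pos = sum(score for score, influential in messages if influential and score > 0)
    neg = sum(-score for score, influential in messages if influential and score < 0)
    return pos, neg

def predicted_share(c1_messages, c2_messages):
    pos1, neg1 = aggregate(c1_messages)
    pos2, neg2 = aggregate(c2_messages)
    denominator = pos1 + neg1 + pos2 + neg2
    return (pos1 + neg2) / denominator if denominator else 0.5

c1 = [(0.8, True), (-0.2, True), (0.9, False)]   # placeholder scores
c2 = [(0.4, True), (-0.7, True)]
print(round(predicted_share(c1, c2), 3))         # predicted share of candidate C1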
4 Results and Findings
First of all, we compared the performance of several supervised classification algorithms to select the best one. Subsequently, relying on the best algorithm, we derived our prediction model and evaluated its performance. In our experimentation, we used machine learning algorithms from the scikit-learn package and ten-fold cross-validation to improve generalization and avoid overfitting.
4.1 Learning Quality
To obtain a model that reasonably fits our objective, we performed the learning phase with several supervised classification algorithms. Then, we selected the best algorithm with respect to accuracy (ACC), F-measure (F1) and AUC. To better understand classifier performance, we also examine how the classifiers label the test data; therefore, we focus on the True Positive (TP) and True Negative (TN) rates generated by each classifier. Classifier performance is reported in Table 1.

Table 1. Performance comparison of various classification algorithms.

Classifier   ACC     F1      AUC     %TP     %TN
NN           73.48   72.76   74.34   83.78   64.91
RBF SVM      55.54   69.35   51.85   11.57   92.14
DT           84.10   82.29   79.83   84.74   82.51
RF           89.11   90.02   89.02   88.01   90.02
ANN          52.72   64.86   49.98   20.03   79.93
NB           62.91   70.38   61.11   41.47   80.75
LR           75.53   75.79   76.07   81.95   70.19
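The comparison in Table 1 follows a standard scikit-learn cross-validation recipe. The snippet below is a schematic reconstruction in which a random placeholder feature matrix stands in for our 20 FUM features; it is not the exact experimental code.

import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1561, 20))        # placeholder for the 20 features per labelled FUM
y = rng.integers(0, 2, size=1561)      # 1 = influential, 0 = non-influential (placeholder)

classifiers = {
    "NN": KNeighborsClassifier(),
    "RBF SVM": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "ANN": MLPClassifier(max_iter=500),
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=("accuracy", "f1", "roc_auc"))
    print(name,
          round(scores["test_accuracy"].mean(), 3),
          round(scores["test_f1"].mean(), 3),
          round(scores["test_roc_auc"].mean(), 3))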
In terms of ACC, F1, and AUC, the Random Forest (RF) achieved the best performance, followed by Decision Tree (DT), Logistic Regression (LR), Nearest Neighbors (NN), and RBF SVM, while Naive Bayes (NB) and the Artificial Neural Network (ANN) perform poorly. Regarding the TP and TN rates, Random Forest also achieved the best rates. Moreover, Random Forest was the classifier achieving the best balance between the two classes (88.01% TP and 90.02% TN). In Fig. 2, we plot the ROC curve of each classifier in the same graph. Upon visual inspection, we observe that the curve of the Random Forest classifier is closer than the other curves to the upper-left corner of the ROC space. This shows that the Random Forest classifier has the best trade-off between sensitivity (TP rate) and specificity (1 − FP rate). Random Forest shows the best performance in correctly predicting the Influential class with minimal false positives. We therefore used the Random Forest classifier for all further classifications. Among the sentiment features, the overall sentiment score (OSS) of the FUM is the most important, followed by the sentiment score of the most negative term and the sentiment score of the least positive term (SSNMax and SSPMin).
Fig. 2. ROC curve of the different classifiers.
4.2 Features Quality
In this subsection, we assess the relevance of the generated features through their prediction strength. We draw the feature importance plot for the Random Forest classification, as shown in Fig. 3. We notice that features related to FUM sentiment are the most important, followed by the features related to content and the features related to time.
Fig. 3. Feature importance in Random Forest classification.
We find that strongly negative FUMs tend to be more attractive and influential than strongly positive FUMs. This finding is in line with observations for
the features related to emotions. The Like ratio (LKR) is the most important, followed by the rate of emotional term presence (EMT) and the sadness ratio (SDR). We note that the Like button is ambiguous: before October 2015 the other reactions did not exist, and only the Like reaction was available, which made it overused to express both positive and negative emotions. Even after the introduction of the other emotional reactions, Like is still overused. Therefore, LKR reflects users' engagement with the FUM more than the emotion that users give off. In contrast, we note that the sadness ratio (SDR) is more decisive than the love ratio (LVR). This observation also confirms that strongly negative FUMs implying negative emotions tend to be more influential than FUMs implying positive emotions like Love. For content features, the features related to FUM length (NBC and NBS) and readability (ARI) are the most important. We find that a brief FUM cannot be as influential as a long FUM; however, the FUM must be readable and comprehensible by a wide range of people to be influential. We find that the ARI measure performs well in the context of social media because it is based on length indicators. The spelling error rate (NBSE), however, is not critical in social media because people tend to use colloquial and invented words and to make frequent mistakes. Lastly, for the time features, we find that the RCT feature is more important than the LCF feature. FUMs that take less time to engage users tend to have a longer life cycle; the first replies to a FUM reflect whether it will be influential or not.
4.3 Predicting Election Outcome Quality
In order to quantify the difference between the prediction and the ground truth, we relied on the Mean Absolute Error (MAE), like the vast majority of previous works. The MAE is defined as

MAE = (1/n) Σ_{i=1}^{n} |P_i − R_i|

where n is the number of candidates, P_i is the predicted vote percentage of candidate i and R_i is the true election result percentage of candidate i. We applied our approach to different time intervals. We also tried other previous approaches, to better evaluate the contribution of influential message selection and sentiment analysis: Message Count (MC), Message Sentiment (MS), and our Influential Message Sentiment (IMS). Results are presented in Table 2.

Table 2. MAE at different time intervals.

        1 year   6 Months   3 Months   1 month   2 Weeks   1 week
IMS     01.29    01.52      03.02      00.88     01.10     01.42
MS      03.00    03.83      06.33      02.71     02.00     02.50
MC      06.92    08.72      15.53      07.16     06.07     06.80
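For completeness, the MAE used in Table 2 is simply the average absolute gap between predicted and actual vote percentages over the candidates. The numbers in the sketch below are purely illustrative, not the real 2016 results or our predictions.

def mae(predicted, actual):
    # predicted, actual: dicts mapping candidate -> vote percentage.
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)

# Illustrative values only.
actual = {"candidate_A": 48.0, "candidate_B": 46.0}
predicted = {"candidate_A": 49.1, "candidate_B": 45.3}
print(round(mae(predicted, actual), 2))   # -> 0.9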
The best MAE achieved by the most well-known polling institutes4 is 2.3, by Reuters/Ipsos, while the worst MAE is 4.5, by the LA Times/U.S.C Tracking poll. Our approach was capable of achieving an MAE of 0.88 by choosing influential messages posted one month before election day, 1.10 by selecting influential messages published two weeks before, and 1.52 by selecting influential messages published six months before. Also, compared to the MC and MS approaches, our approach was more accurate, achieving an error rate below one, whereas the best error rate achieved by MC was 6.07 and the best error rate achieved by MS was 2.00. Relying only on data volume led to the highest error. When sentiment is considered in addition to volume, the MAE slightly decreases, and especially after removing non-influential messages the MAE improves considerably. Furthermore, scoring each vote by the strength of the expressed sentiment helps the prediction model ignore weak messages. To better visualize the difference between the approaches, we illustrate the MAE in Fig. 4. We note that, independently of the time interval, relying only on data volume always led to the highest error. We also note that predicting the election outcome one year before election day achieves good performance compared to the other time intervals. Exploring the candidates' online presence strategy before the election is relevant for assessing how well a candidate worked on his or her public image; based on the success of the latter, we can accurately predict the election result. There are mainly two kinds of online presence strategy: the long-term (one year) and the short-term (less than one month). However, the more the historical record is reduced, the worse the forecast performance becomes.
Fig. 4. MAE overview on different time intervals by different approaches.
The error rate is reduced when forecasting one year and one month before election day. In contrast, the error is enormous six months and one week before election day. That is to say, exploring the historical record only partially
www.realclearpolitics.com.
is like analyzing an online political strategy only by half. Moreover, in the few days before election day, noise is more present than in any other period.
5 Conclusion
In this paper, we proposed a novel model for election forecasting using sentiment analysis of influential messages. We collected data through the Facebook Graph API. Then, we constructed a classifier to select only the influential messages based on message content, time, sentiment, and emotion. The Random Forest algorithm showed the best classification performance. We applied our model to the 2016 United States presidential election and demonstrated that it is reliable to predict election results based on sentiment analysis of influential messages. We also demonstrated that data bias is appropriately addressed by influential message selection. We found that our approach was capable of achieving a better MAE than both off-line polls and classical approaches. In the future, we plan to continue our work on sentiment analysis of influential messages using other modalities, such as the definition of an influence degree.
References 1. Woodly, D.: New competencies in democratic communication? Blogs, agenda setting and political participation. Public Choice 134, 109–123 (2008) 2. Jin, X., Gallagher, A., Cao, L., Luo, J., Han, J.: The wisdom of social multimedia: using flickr for prediction and forecast. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1235–1244. ACM (2010) 3. Williams, C., Gulati, G.: What is a social network worth? Facebook and vote share in the 2008 presidential primaries. In: American Political Science Association (2008) 4. Tumasjan, A., Sprenger, T.O., Sandner, P.G., Welpe, I.M.: Predicting elections with twitter: what 140 characters reveal about political sentiment. In: ICWSM, vol. 10, pp. 178–185 (2010) 5. Gayo Avello, D., Metaxas, P.T., Mustafaraj, E.: Limits of electoral predictions using twitter. In: AAAI Conference on Weblogs and Social Media (2011) 6. Burnap, P., Gibson, R., Sloan, L., Southern, R., Williams, M.: 140 characters to victory?: Using twitter to predict the UK: general election. Electoral Stud. 41(2016), 230–233 (2015) 7. Romero, D.M., Reinecke, K., Robert Jr., L.P.: The influence of early respondents: information cascade effects in online event scheduling. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 101–110. ACM (2017) 8. Cialdini, R.B., Trost, M.R.: Social influence: social norms, conformity and compliance (1998) 9. Qazi, A., Raj, R.G., Tahir, M., Cambria, E., Syed, K.B.S.: Enhancing business intelligence by means of suggestive reviews. Sci. World J. 2014 (2014) 10. Cambria, E., Poria, S., Hazarika, D., Kwok, K.: SenticNet 5: discovering conceptual primitives for sentiment analysis by means of context embeddings. In: AAA, no. 1, pp. 1795–1802 (2018)
11. Cambria, E., Hussain, A.: Sentic album: content-, concept-, and context-based online personal photo management system. Cogn. Comput. 4, 477–496 (2012) 12. Grassi, M., Cambria, E., Hussain, A., Piazza, F.: Sentic web: a new paradigm for managing social media affective information. Cogn. Comput. 3, 480–489 (2011) 13. Cambria, E., Song, Y., Wang, H., Howard, N.: Semantic multidimensional scaling for open-domain sentiment analysis. IEEE Intell. Syst. 29, 44–51 (2014) 14. Bravo-Marquez, F., Mendoza, M., Poblete, B.: Meta-level sentiment models for big social data analysis. Knowl.-Based Syst. 69, 86–99 (2014) 15. Ara´ ujo, M., Gon¸calves, P., Cha, M., Benevenuto, F.: iFeel: a system that compares and combines sentiment analysis methods. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 75–78. ACM (2014) 16. Strandberg, K.: A social media revolution or just a case of history repeating itself? The use of social media in the, finish parliamentary elections. New Media Soc. 15(2013), 1329–1347 (2011) 17. Bond, R.M., et al.: A 61-million-person experiment in social influence and political mobilization. Nature 489, 295–298 (2012) 18. Sang, E.T.K., Bos, J.: Predicting the 2011 Dutch senate election results with twitter. In: Proceedings of the Workshop on Semantic Analysis in Social Media, pp. 53–60. Association for Computational Linguistics (2012) 19. Jungherr, A.: Tweets and votes, a special relationship: the 2009 federal election in Germany. In: Proceedings of the 2nd Workshop on Politics, Elections and Data, pp. 5–14. ACM (2013) 20. Gayo-Avello, D.: “I wanted to predict elections with twitter and all i got was this lousy paper”-a balanced survey on election prediction using twitter data. arXiv preprint arXiv:1204.6441 (2012) 21. Franch, F.: (wisdom of the crowds) 2: UK election prediction with social media. J. Inf. Technol. Polit. 10(2013), 57–71 (2010) 22. Ceron, A., Curini, L., Iacus, S.M., Porro, G.: Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens’ political preferences with an application to Italy and France. New Media Soc. 16, 340–358 (2014) 23. Caldarelli, G., et al.: A multi-level geographical study of Italian political elections from twitter data. PLoS ONE 9, e95809 (2014) 24. Choy, M., Cheong, M.L., Laik, M.N., Shung, K.P.: A sentiment analysis of Singapore presidential election 2011 using twitter data with census correction. arXiv preprint arXiv:1108.5520 (2011) 25. Arroba Rimassa, J., Llopis, F., Mu˜ noz, R., Guti´errez, Y., et al.: Using the twitter social network as a predictor in the political decision. In: 19th CICLing Conference (2018) 26. Bermingham, A., Smeaton, A.: On using twitter to monitor political sentiment and predict election results. In: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), pp. 2–10 (2011) 27. Metaxas, P.T., Mustafaraj, E., Gayo-Avello, D.: How (not) to predict elections. In: Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), pp. 165–171 (2011) 28. Susanto, Y., Livingstone, A., Ng, B.C., Cambria, E.: The hourglass model revisited. IEEE Intell. Syst. 35, 96–102 (2020) 29. Kumar, R., Vadlamani, R.: A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl.-Based Syst. 89, 14–46 (2015) 30. Smith, E.A., Kincaid, J.P.: Derivation and validation of the automated readability index for use with technical materials. Hum. Factors 12, 457–564 (1970)
Basic and Depression Specific Emotions Identification in Tweets: Multi-label Classification Experiments

Nawshad Farruque (B), Chenyang Huang, Osmar Zaïane, and Randy Goebel
Alberta Machine Intelligence Institute (AMII), Department of Computing Science, University of Alberta, Edmonton, AB T6G 2R3, Canada
{nawshad,chuang8,zaiane,rgoebel}@ualberta.ca
Abstract. We present an empirical analysis of basic and depression specific multi-emotion mining in Tweets, using state of the art multi-label classifiers. We choose our basic emotions from a hybrid emotion model consisting of the commonly identified emotions from four highly regarded psychological models. Moreover, we augment that emotion model with new emotion categories arising from their importance in the analysis of depression. Most of these additional emotions have not been used in previous emotion mining research. Our experimental analyses show that a cost sensitive RankSVM algorithm and a Deep Learning model are both robust, measured by both Micro F-Measures and Macro F-Measures. This suggests that these algorithms are superior in addressing the widely known data imbalance problem in multi-label learning. Moreover, our application of Deep Learning performs the best, giving it an edge in modeling deep semantic features of our extended emotional categories. Keywords: Emotion identification · Sentiment analysis
1 Introduction

Mining multiple human emotions can be a challenging area of research since human emotions tend to co-occur [12]. For example, human emotions such as joy and surprise most often tend to occur together rather than separately as just joy or just surprise (see Table 1 for some examples from our dataset). In addition, identifying these co-occurrences of emotions and their compositionality can provide insight for a fine-grained analysis of emotions in various mental health problems. However, there is little literature that has explored multi-label emotion mining from text [2, 11, 17]. With the increasing use of social media, where people share their day to day thoughts and ideas, it is easier than ever to capture the presence of different emotions in their posts. Our main research focus is therefore to provide insights on identifying multiple emotions in social media posts such as Tweets. To compile the list of emotions we want to identify, we have used a mixed emotion model [17] which is based on four distinct and widely used emotion models from psychology. Furthermore, we augment this emotion model with further emotions that are deemed useful for a depression identification task we
intend to pursue later. Here we separate our experiments into two: one for a smaller emotion model (using nine basic human emotions), and another for an augmented emotion model (using both basic and depression related human emotions). We present a detailed analysis of the performance of several state of the art algorithms used in multi-label text mining tasks on both sets of data, which have varying degrees of data imbalance.

Table 1. Example Tweets with multi-label emotions

Tweets                                                                                           Labels
“Feel very annoyed by everything and I hope to leave soon because I can’t stand this anymore”   Angry, sad
“God has blessed me so much in the last few weeks I can’t help but smile”                       Joy, love
1.1 Emotion Modeling
In affective computing research, the most widely accepted model for emotion is the one suggested by [5] previous models, augmented with a small number of additional emotions: love, thankfulness and guilt, all of which are relevant to our study. We further seek to confirm emotions such as betrayal, frustration, hopelessness, loneliness, rejection, schadenfreude and self-loathing; any of these could contribute to the identification of a depressive disorder [1, 16]. Our mining of these emotions, with the help of RankSVM and an attention-based deep learning model, is a new contribution not previously made [7, 13].
Multi-label Emotion Mining Approaches
Earlier research in this area has employed learning algorithms from two broad categories: (1) Problem Transformation and (2) Algorithmic Adaptation. A brief description of each of these follows in the next subsections.

1.3 Problem Transformation Methods
In the problem transformation methods approach (PTM), multi-label data is transformed into single label data, and a series of single label (or binary) classifiers are trained for each label. Together, they predict multiple labels (cf. details provided in Sect. 2). This method is often called a “one-vs-all” model. The problem is that these methods do not consider any correlation amongst the labels. This problem was addressed by a model proposed by [2], which uses label powersets (LP) to learn an ensemble of k-labelset classifiers (RAKEL) [18]. Although this classifier method respects the correlation among labels, it is not robust against the data imbalance problem, which is an inherent problem in multi-label classification, simply because of the typical uneven class label distribution.
1.4 Algorithmic Adaptation Methods

The alternative category to PTMs are the so-called algorithmic adaptation methods (AAMs), where a single label classifier is modified to do multi-label classification. Currently popular AAMs are based on trees, such as the classic C4.5 algorithm adapted for multi-label tasks [4], probabilistic models such as [6], and neural network based methods such as BP-MLL [19]. However, as with PTMs, these AAM methods are also not tailored for imbalanced data learning, and fail to achieve good accuracy with huge multi-label datasets where imbalance is a common problem. In our approach, we explore two state of the art methods for multi-label classification. One is a cost sensitive RankSVM and the other is a deep learning model based on Long Short Term Memory (LSTM) and Attention. The former is an amalgamation of PTMs and AAMs, with the added advantage of large margin classifiers. This choice provides an edge in learning from huge imbalanced multi-label data, while still considering the label correlations. The latter is a purely AAM approach, which is able to more accurately capture the latent semantic structure of Tweets. In Sects. 2 and 3 we provide the technical details of our baseline and experimental models.
2 Baseline Models

According to the results in [17], good accuracy can be achieved with a series of Naïve Bayes (NB) classifiers (also known as a one-vs-all classifier), where each classifier is trained on balanced positive and negative samples (i.e., Tweets, represented by bag-of-words features) for each class, especially with respect to other binary classifiers (e.g., Support Vector Machines (SVMs)). To recreate this baseline, we have implemented our own "one-vs-all" method. To do so, we transform our multi-label data into sets of single label data, then train separate binary NB classifiers for each of the labels. An NB classifier NB_i in this model uses emotion E_i as positive samples and all other emotion samples as negative samples, where i is a representative index of our n emotion repertoire. We then concatenate the binary outputs of these individual classifiers to get the final multi-label output. Note that previous closely related research [2, 11] used simple SVM and RAKEL, which are not robust against data imbalance and did not look at short texts, such as Tweets, for multi-label emotion mining. On the other hand, [17] focused on emotion mining from Tweets, but their methods were multi-class, unlike our work, which is concerned with multi-label emotion mining.
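A minimal sketch of such a one-vs-all baseline is given below. It is not the authors' implementation, and the tiny Tweets and label matrix are hypothetical, but it shows how binary NB outputs are concatenated into a multi-label prediction with scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data: two Tweets, three emotion labels (angry, sad, joy).
tweets = ["feel very annoyed and sad today", "so blessed i cannot help but smile"]
Y = np.array([[1, 1, 0],
              [0, 0, 1]])

X = CountVectorizer().fit_transform(tweets)            # bag-of-words features

# One binary NB classifier per emotion label (problem transformation).
classifiers = [MultinomialNB().fit(X, Y[:, i]) for i in range(Y.shape[1])]

# Concatenating the binary outputs yields the multi-label prediction.
predictions = np.column_stack([clf.predict(X) for clf in classifiers])
print(predictions)
```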
3 Experiment Models

3.1 A Cost Sensitive RankSVM Model

Conventional multi-label classifiers learn a mapping function h : X → 2^q from a D-dimensional feature space X ∈ R^D to the label space Y ⊆ {0, 1}^q, where q is the number of labels. A simple label powerset algorithm (LP) considers each distinct combination of labels (also called labelsets) that exists in the training data as a single label, thus retaining the correlation of labels. In multi-label learning, some labelsets occur
more frequently than others, and traditional SVM algorithms perform poorly in these scenarios. In many information retrieval tasks, the RankSVM algorithm is widely used to learn rankings for documents, given a query. This idea can be generalized to multi-label classification, where the relative ranks of labels are of interest. [3] have proposed two optimized versions of the RankSVM [8] algorithm: one is called RankSVM (LP), which not only incorporates the LP algorithm but also associates a misclassification cost λ_i with each training instance. This misclassification cost is higher for the label powersets that have smaller numbers of instances, and is automatically calculated based on the distribution of label powersets in the training data. To further reduce the number of generated label powersets (and to speed up processing), they proposed another version of their algorithm called RankSVM (PPT), where labelsets are pruned by an a priori threshold based on properties of the data set.

3.2 A Deep Learning Model
Results by [20] showed that a combination of Long Short Term Memory (LSTM) and an alternative to LSTM, Gated Recurrent Unit layers (GRU), can be very useful in learning phrase level features, and has very good accuracy in text classification. [9] achieved state of the art results in sentence classification with the help of a bidirectional LSTM (bi-LSTM) combined with a self-attention mechanism. Here we adopt [9]'s model and further enable it for multi-label classification by using a suitable loss function and a thresholded softmax layer to generate multi-label output. We call this model LSTM-Attention (LSTM-Att), as shown in Fig. 1; w_i is the word embedding (which can be either one-hot bag-of-words or dense word vectors), h_i is the hidden state of the LSTM at time step i, and the output of this layer is fed to the Self Attention (SA) layer. The SA layer's output is then sent to a linear layer, which translates the final output to a probability of different labels (in this case emotions) with the help of softmax activation. Finally, a threshold is applied on the softmax output to get the final multi-label predictions.

3.3 Loss Function Choices
The choice of a loss function is important in this context. For the multi-label classification task, [10] has shown that Binary Cross Entropy (BCE) loss over sigmoid activation is very useful. Our use of a BCE objective can be formulated as follows:

\[ \text{minimize} \quad \frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}\Big[\, y_{il}\log\big(\sigma(\hat{y}_{il})\big) + (1 - y_{il})\log\big(1 - \sigma(\hat{y}_{il})\big) \Big] \tag{1} \]

where L is the number of labels, n is the number of samples, y_i is the target label, ŷ_i is the predicted label from the last linear layer (see Fig. 1) and σ is the sigmoid function, σ(x) = 1/(1 + e^{-x}).
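In practice this objective corresponds to the standard binary cross-entropy applied independently to every label. A minimal PyTorch sketch (not the authors' code) follows; the batch size and label count are arbitrary placeholders.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 16, requires_grad=True)   # raw outputs of the last linear layer (batch x L)
targets = torch.randint(0, 2, (4, 16)).float()    # binary multi-label targets

# BCEWithLogitsLoss applies the sigmoid internally and averages over samples and labels.
loss = nn.BCEWithLogitsLoss()(logits, targets)
loss.backward()
```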
Fig. 1. LSTM-Attention model for multi-label classification (input word embeddings feed LSTM layers, followed by a self-attention layer, a linear layer, and a softmax with thresholding that produces the multi-label predictions)
Thresholding. The output of the last linear layer is a k-dimensional vector, where k is the number of different labels of the classification problem. For our task we use the softmax function to normalize each ŷ_i within the range of (0, 1) as follows:

\[ \hat{y}_i = \frac{e^{\hat{y}_i}}{\sum_{i=1}^{k} e^{\hat{y}_i}} \tag{2} \]

Since we are interested in more than one label, we use a threshold and consider only those labels with predicted probability beyond that threshold. Let the threshold be t; the final prediction for each label p_i is then

\[ p_i = \begin{cases} 1, & \hat{y}_i > t \\ 0, & \text{otherwise} \end{cases} \tag{3} \]

To adjust the threshold for the LSTM-Att model, we use a portion of our training data as an evaluation set. Based on our choice of evaluation set, we have found that the threshold t = 0.3 provides the best results.
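The two equations above translate directly into a few lines of code. The sketch below assumes hypothetical scores for k = 4 labels and the t = 0.3 threshold reported above.

```python
import torch

scores = torch.tensor([[2.1, 0.3, 1.8, -0.5]])   # hypothetical outputs of the last linear layer
probs = torch.softmax(scores, dim=-1)            # Eq. (2)
preds = (probs > 0.3).int()                      # Eq. (3) with t = 0.3
print(probs, preds)
```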
4 Experiments

We use the well-known bag-of-words (BOW) and pre-trained word embedding vectors (WE) as our feature sets, for two RankSVM algorithms: RankSVM-LP and RankSVM-PPT, a one-vs-all Naïve Bayes (NB) classifier and a deep learning model (LSTM-Att).
We name our experiments with algorithm names suffixed by feature names; for example, RankSVM-LP-BOW names the experiment with the RankSVM Label Powerset function on a bag-of-words feature set. We run these experiments on our two sets of data, and we use the RankSVM implementation provided by [3]. For multi-label classification, we implement our own one-vs-all model using the Python library scikit-learn and its implementation of multinomial NB. We implement our deep learning model in PyTorch (http://pytorch.org/). In the next section, we present a detailed description of the datasets, data collection, data pre-processing, feature set extraction and evaluation metrics.

4.1 Data Set Preparation
We use two sets of multi-label data (we intend to release our dataset online upon publication of this paper). Set 1 (we call it the 9 emotion data) consists of Tweets only from the "Clean and Balanced Emotion Tweets" (CBET) dataset provided by [17]. It contains 3,000 Tweets from each of nine emotions, plus 4,303 double labeled Tweets (i.e., Tweets which have two emotion labels), for a total of 31,303 Tweets. To create Set 2 (we call it the 16 emotion data), we add extended emotion Tweets (having single and double labels) to the Set 1 data, adding up to a total of 50,000 Tweets. We used Tweets having only one and two labels because this is the natural distribution of labels in our collected data. Since previous research showed that hashtag labeled Tweets are consistent with the labels given by human judges [14], these Tweets were collected based on relevant hashtags and key-phrases (we use specific key-phrases to gather Tweets for specific emotions, e.g. for loneliness we use "I am alone" to gather more data for our extra emotion data collection process). Table 2 lists the additional emotions we are interested in. Our data collection process is identical to [17], except that we use the Twitter API and key-phrases along with hashtags. In this case, we gather Tweets with these extra emotions between June 2017 and October 2017. The statistics of the gathered Tweets are presented in Table 3. Both of our data sets have the following characteristics after preprocessing (a toy sketch of a few of these rules follows the list):

– All Tweets are in English.
– All the letters in Tweets are converted to lowercase.
– White space characters, punctuation and stop words are removed.
– Duplicate Tweets are removed.
– Incomplete Tweets are removed.
– Tweets shorter than 3 words are removed.
– Tweets having more than 50% of their content as name mentions are removed.
– URLs are replaced by 'url'.
– All name mentions are replaced with '@user'.
– Multi-word hashtags are decomposed into their constituent words.
– Hashtags and key-phrases corresponding to emotion labels are removed to avoid data overfitting.
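The sketch below is a rough illustration of a few of these rules using regular expressions; it is not the authors' pipeline, and stop-word removal, duplicate filtering and multi-word hashtag decomposition are omitted.

```python
import re

def preprocess(tweet: str) -> str:
    t = tweet.lower()                              # lowercase all letters
    t = re.sub(r"https?://\S+", "url", t)          # replace URLs with 'url'
    t = re.sub(r"@\w+", "@user", t)                # replace name mentions with '@user'
    t = re.sub(r"#(\w+)", r"\1", t)                # strip the '#' so label hashtags can be dropped later
    t = re.sub(r"[^\w\s@]", " ", t)                # remove punctuation
    return re.sub(r"\s+", " ", t).strip()

print(preprocess("Feeling #hopeless today :( http://t.co/xyz @friend"))
```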
Finally, we use 5 fold cross validation (CV) (train 80%–test 20% split). In each fold we further create a small validation set (10% of the training set) and use it for parameter
tuning in RankSVM, and for threshold finding for LSTM-Att. Our baseline model does not have any parameters to tune. Finally, our results are averaged over the test sets of the 5-fold CV, based on the best parameter combination found in each fold's validation set, and we report that. Using this approach, we find that threshold = 0.3 is generally better. We do not do heavy parameter tuning in LSTM-Att. Also, we use the Adam optimizer with a 0.001 learning rate.

Table 2. New emotion labels and corresponding hashtags and key phrases

  Emotion        List of Hashtags
  Betrayed       #betrayed
  Frustrated     #frustrated, #frustration
  Hopeless       #hopeless, #hopelessness, no hope, end of everything
  Loneliness     #lonely, #loner, i am alone
  Rejected       #rejected, #rejection, nobody wants me, everyone rejects me
  Schadenfreude  #schadenfreude
  Self loath     #selfhate, #ihatemyself, #ifuckmyself, i hate myself, i fuck myself
Table 3. Sample size for each new emotion (after cleaning)

  Emotion        Number of Tweets
  Betrayed       1,724
  Frustrated     4,424
  Hopeless       3,105
  Loneliness     4,545
  Rejected       3,131
  Schadenfreude  2,236
  Self loath     4,181
  Total          23,346
4.2 Feature Sets

We create a vocabulary of the 5,000 most frequent words which occur in at least three training samples. If we imagine this vocabulary as a vector where each of its indices represents a unique word in that vocabulary, then a bag-of-words (BOW) feature can be represented by marking as 1 those indices whose corresponding word matches a Tweet word. To create word embedding features we use the 200-dimensional GloVe [15] word embeddings trained on a corpus of 2B Tweets with 27B tokens and a 1.2M vocabulary (https://nlp.stanford.edu/projects/glove/). We represent each Tweet with the average word embedding of its constituent words that are also present in the pre-trained word embeddings. We further normalize the word embedding features using min-max normalization.
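An illustrative sketch of these two feature sets is given below; it is not the authors' code, the toy Tweets are hypothetical, and the `glove` dictionary is a stand-in for the pre-trained vectors referenced above (the normalisation is applied per vector here, which is one possible reading of the description).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_tweets = ["i feel so alone tonight", "what a wonderful surprise", "so alone and so tired"]

# Binary bag-of-words over the most frequent vocabulary (min_df=3 in the paper; 1 for this toy corpus).
bow = CountVectorizer(max_features=5000, min_df=1, binary=True)
X_bow = bow.fit_transform(train_tweets)

# Stand-in for 200-d GloVe vectors keyed by word.
glove = {w: np.random.rand(200) for w in "i feel so alone tonight what a wonderful surprise and tired".split()}

def embedding_feature(tweet):
    vecs = [glove[w] for w in tweet.split() if w in glove]
    v = np.mean(vecs, axis=0)                           # average word embedding
    return (v - v.min()) / (v.max() - v.min() + 1e-8)   # min-max normalisation

X_we = np.vstack([embedding_feature(t) for t in train_tweets])
```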
4.3 Evaluation Metrics
We first report metrics that quantify the degree of data imbalance in our datasets, and then the label-based F-Measures used to evaluate the analysed approaches in multi-label learning.

4.4 Quantifying Imbalance in Labelsets
We use the same idea as mentioned in [3] for the data imbalance calculation in labelsets, inspired by the idea of kurtosis. The following equation is used to calculate the imbalance in labelsets:

\[ \mathit{ImbalLabelSet} = \frac{\sum_{i=1}^{l} (L_i - L_{max})^4}{(l-1)\, s^4} \tag{4} \]

where

\[ s = \frac{1}{l} \sum_{i=1}^{l} (L_i - L_{max}) \tag{5} \]
Here, l is the number of all labelsets, L_i is the number of samples in the i-th labelset, and L_max is the number of samples in the labelset with the maximum samples. The value of ImbalLabelSet is a measure of the distribution shape of the histograms depicting the labelset distribution. A higher value of ImbalLabelSet indicates a larger imbalance, and determines the "peakedness" level of the histogram (see Fig. 2), where the x axis depicts the labelsets and the y axis denotes their counts. The numeric imbalance level in labelsets is presented in Table 4; there we notice that the 16 emotion dataset is more imbalanced than the 9 emotion dataset.

Table 4. Degree of imbalance in labelset and labels

  Dataset          Labelset Imbal.  Label Imbal.
  9 Emotion data   27.95            37.65
  16 Emotion data  52.16            69.93
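A direct transcription of Eqs. (4) and (5) as printed is sketched below; the labelset counts are hypothetical.

```python
import numpy as np

def labelset_imbalance(counts):
    counts = np.asarray(counts, dtype=float)
    l, l_max = len(counts), counts.max()
    s = np.sum(counts - l_max) / l                                  # Eq. (5)
    return np.sum((counts - l_max) ** 4) / ((l - 1) * s ** 4)       # Eq. (4)

print(labelset_imbalance([3000, 2800, 400, 120, 60]))   # toy labelset sizes
```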
Fig. 2. Labelset imbalance: histogram for 9 emotion dataset on the left and histogram for 16 emotion dataset on the right
4.5 Mic/Macro F-Measures

We use label-based Micro/Macro F-Measures to report the performance of our classifiers in classifying each label; see Eqs. 6 and 7:

\[ \mathit{Micro\text{-}FM} = \mathit{F\text{-}Measure}\left( \sum_{j=1}^{l} TP_j,\; \sum_{j=1}^{l} FP_j,\; \sum_{j=1}^{l} FN_j \right) \tag{6} \]

\[ \mathit{Macro\text{-}FM} = \frac{1}{l} \sum_{j=1}^{l} \mathit{F\text{-}Measure}(TP_j, FP_j, FN_j) \tag{7} \]
where l is the total number of labels, F-Measure is the standard F1-Score (https://en.wikipedia.org/wiki/F-score), and TP, FP, FN are the True Positive, False Positive and False Negative labels respectively. Micro-FM reflects the performance of our classifiers while taking into account the imbalance in the dataset, unlike Macro-FM.
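Eqs. (6) and (7) correspond to the micro- and macro-averaged F1 scores available in scikit-learn; a small sketch with hypothetical indicator matrices:

```python
import numpy as np
from sklearn.metrics import f1_score

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])   # toy gold labels (samples x labels)
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])   # toy predictions

micro_fm = f1_score(Y_true, Y_pred, average="micro")   # pools TP/FP/FN over labels, Eq. (6)
macro_fm = f1_score(Y_true, Y_pred, average="macro")   # averages per-label F1, Eq. (7)
print(micro_fm, macro_fm)
```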
5 Results Analysis

Overall, LSTM-Attention with word embedding features (LSTM-Att-WE) performs the best in terms of Micro-FM and Macro-FM measures, compared with the baseline multi-label NB and the best performing RankSVM models, averaged across the 9 emotion and 16 emotion datasets. These results are calculated according to Eq. 8, where BestFM_i refers to the model with the best F-Measure value (either Micro or Macro), the subscript index i refers to our 9 or 16 emotion dataset, AVG denotes the function that calculates the average, and the other variables are self explanatory:

\[ \mathit{FM\text{-}Inc} = \mathit{AVG}\left( \frac{\mathit{BestFM}_{16} - \mathit{BaselineFM}_{16}}{\mathit{BaselineFM}_{16}},\; \frac{\mathit{BestFM}_{9} - \mathit{BaselineFM}_{9}}{\mathit{BaselineFM}_{9}} \right) \times 100 \tag{8} \]

To compare with RankSVM, we use the best RankSVM F-Measures instead of BaselineFM in the above equation. In the Micro-FM measure, LSTM-Att achieves a 44% increase with regard to baseline NB and a 23% increase with regard to the best RankSVM models. In the Macro-FM measure, LSTM-Att-WE shows a 37% increase with regard to baseline NB-BOW and an 18% increase with regard to the best RankSVM models. The percentage values were rounded to the nearest integers. It is worth noting that a random assignment to classes would result in an accuracy of 11% in the case of the 9 emotions (1/9) and 6% in the case of the 16 emotion dataset (1/16). The following sections present analyses based on F-Measures, data imbalance and confusion matrices (see Tables 5, 6 and Fig. 3).
Table 5. Results on 9 emotion dataset

  Models               Macro-FM  Micro-FM
  NB-BOW (baseline)    0.3915    0.3920
  NB-WE                0.3715    0.3617
  RankSVM-LP-BOW       0.3882    0.3940
  RankSVM-LP-WE        0.4234    0.4236
  RankSVM-PPT-BOW      0.4275    0.4249
  RankSVM-PPT-WE       0.3930    0.3920
  LSTM-Att-BOW         0.4297    0.4492
  LSTM-Att-WE          0.4685    0.4832
Table 6. Results on 16 emotion dataset

  Models               Macro-FM  Micro-FM
  NB-BOW (baseline)    0.2602    0.2608
  NB-WE                0.2512    0.2356
  RankSVM-LP-BOW       0.3523    0.3568
  RankSVM-LP-WE        0.3342    0.3391
  RankSVM-PPT-BOW      0.3406    0.3449
  RankSVM-PPT-WE       0.3432    0.3469
  LSTM-Att-BOW         0.3577    0.3945
  LSTM-Att-WE          0.4020    0.4314
Fig. 3. Comparative results on 9 and 16 emotion datasets
5.1 Performance with Regard to F-Measures
Overall, in all models, the Micro-FM and Macro-FM values are very close to each other (see Table 5 and 6) indicating that all of the models have similar performance in terms of both most populated and least populated classes.
5.2 Performance with Regard to Data Imbalance

The performance of all the models generally drops as the imbalance increases. For this performance measure, we take the average F-Measures (Micro or Macro) across WE and BOW features for NB and DL; for RankSVM the values are averaged over the two types of RankSVM models as well (i.e., LP and PPT). See Eq. 9, where the performance drop function PD(model) takes any model and calculates the decrease of F-Measures (either Micro or Macro) averaged over WE and BOW features:

\[ PD(\mathit{model}) = \left( \frac{\mathit{AVG}_{9}(\mathit{model.BOW.FM}, \mathit{model.WE.FM}) - \mathit{AVG}_{16}(\mathit{model.BOW.FM}, \mathit{model.WE.FM})}{\mathit{AVG}_{9}(\mathit{model.BOW.FM}, \mathit{model.WE.FM})} \right) \times 100 \tag{9} \]

For RankSVM, we input RankSVM-PPT and RankSVM-LP, and take the average to determine the performance drop of the overall RankSVM algorithm based on Eq. 10:

\[ \mathit{RankSVM\text{-}PD} = \mathit{AVG}\big(PD(\mathit{RankSVM\text{-}PPT}),\; PD(\mathit{RankSVM\text{-}LP})\big) \tag{10} \]

We observe that LSTM-Att's performance drop from the 9 emotion data (less imbalanced) to the 16 emotion data (more imbalanced) is 11% for Micro-FM and 15% for Macro-FM. In comparison, RankSVM has a higher drop (17% for Micro-FM and 18% for Macro-FM) and multi-label NB has the highest drop (41% for Micro-FM and 38% for Macro-FM). These results indicate that overall the LSTM-Att and RankSVM models are more robust against data imbalance.

5.3 Confusion Matrices

Fig. 4. Confusion matrix for 9 emotions
An ideal confusion matrix would be strictly diagonal, with all other values set to zero. In our multi-label case, we see that our confusion matrices have the highest values along the diagonal, implying that most of the emotions are classified correctly. On the other hand, non-diagonal values imply incorrect classification; "love" and "joy", and "frustrated" and "hopeless", are the most frequently confused label pairs because these emotion labels tend to occur together (Figs. 4 and 5).
Fig. 5. Confusion matrix for 16 emotions
6 Conclusion and Future Work

We have experimented with two state of the art models for a multi-label emotion mining task. We have provided details of data collection and processing for our two multi-label datasets, one containing Tweets with nine basic emotions and another having those Tweets augmented with additional Tweets from seven new emotions (related to depression). We also use two widely used features for this task, namely bag-of-words and word embeddings. Moreover, we provide a detailed analysis of these algorithms' performance based on Micro-FM and Macro-FM measures. Our experiments indicate that a deep learning model exhibits superior results compared to the others; we speculate that this is because of improved capture of subtle differences in the language, but we lack an explanatory mechanism to confirm this. In the future, we would like to explore several self-explainable and post-hoc explainable deep learning models to shed some light on what these deep learning models look at for the multi-label emotion classification task compared
to their non-deep learning counterparts. Moreover, the deep learning and RankSVM models are both better at handling data imbalance. It is also to be noted that the word embedding feature-based deep learning model is better than the bag-of-words feature-based deep learning model, unlike the Naïve Bayes and RankSVM models. As expected, this confirms that deep learning models work well with dense word vectors rather than very sparse bag-of-words features. In the future, we would like to do a finer grained analysis of Tweets from depressed people, based on these extended emotions, and identify the subtle language features from the attention layer outputs, which we believe will help us to detect early signs of depression, and to monitor depressive condition, its progression and treatment outcome.

Acknowledgements. We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Alberta Machine Intelligence Institute (AMII) for their generous support to pursue this research.
References
1. Abramson, L.Y., Metalsky, G.I., Alloy, L.B.: Hopelessness depression: a theory-based subtype of depression. Psychol. Rev. 96(2), 358 (1989)
2. Bhowmick, P.K.: Reader perspective emotion analysis in text through ensemble based multi-label classification framework. Comput. Inf. Sci. 2(4), 64 (2009)
3. Cao, P., Liu, X., Zhao, D., Zaiane, O.: Cost sensitive ranking support vector machine for multi-label data learning. In: Abraham, A., Haqiq, A., Alimi, A.M., Mezzour, G., Rokbani, N., Muda, A.K. (eds.) HIS 2016. AISC, vol. 552, pp. 244–255. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52941-7_25
4. Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: De Raedt, L., Siebes, A. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168, pp. 42–53. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44794-6_4
5. Ekman, P.: An argument for basic emotions. Cogn. Emot. 6(3–4), 169–200 (1992)
6. Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 195–200. ACM (2005)
7. Hasan, M., Agu, E., Rundensteiner, E.: Using hashtags as labels for supervised learning of emotions in Twitter messages. In: Proceedings of the Health Informatics Workshop (HIKDD) (2014)
8. Joachims, T.: Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. ACM (2002)
9. Lin, Z., et al.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
10. Liu, J., Chang, W.C., Wu, Y., Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM (2017)
11. Luyckx, K., Vaassen, F., Peersman, C., Daelemans, W.: Fine-grained emotion detection in suicide notes: a thresholding approach to multi-label classification. Biomed. Inf. Insights 5(Suppl 1), 61 (2012)
12. Mill, A., Kööts-Ausmees, L., Allik, J., Realo, A.: The role of co-occurring emotions and personality traits in anger expression. Front. Psychol. 9, 123 (2018)
13. Mohammad, S.M.: #Emotional tweets. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pp. 246–255. Association for Computational Linguistics (2012)
14. Mohammad, S.M., Kiritchenko, S.: Using hashtags to capture fine emotion categories from tweets. Comput. Intell. 31(2), 301–326 (2015)
15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
16. Pietraszkiewicz, A., Chambliss, C.: The link between depression and schadenfreude: further evidence. Psychol. Rep. 117(1), 181–187 (2015)
17. Shahraki, A.G., Zaïane, O.R.: Lexical and learning-based emotion mining from text. In: International Conference on Computational Linguistics and Intelligent Text Processing (CICLing) (2017)
18. Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 23(7), 1079–1089 (2011)
19. Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006)
20. Zhou, C., Sun, C., Liu, Z., Lau, F.: A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630 (2015)
Generating Word and Document Embeddings for Sentiment Analysis

Cem Rıfkı Aydın(B), Tunga Güngör, and Ali Erkan

Computer Engineering Department, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Abstract. Sentiments of words can differ from one corpus to another. Inducing general sentiment lexicons for languages and using them cannot, in general, produce meaningful results for different domains. In this paper, we combine contextual and supervised information with the general semantic representations of words occurring in the dictionary. Contexts of words help us capture the domain-specific information and supervised scores of words are indicative of the polarities of those words. When we combine supervised features of words with the features extracted from their dictionary definitions, we observe an increase in the success rates. We try out the combinations of contextual, supervised, and dictionary-based approaches, and generate original vectors. We also combine the word2vec approach with hand-crafted features. We induce domain-specific sentimental vectors for two corpora, which are the movie domain and the Twitter datasets in Turkish. When we thereafter generate document vectors and employ the support vector machines method utilising those vectors, our approaches perform better than the baseline studies for Turkish with a significant margin. We evaluated our models on two English corpora as well and these also outperformed the word2vec approach. It shows that our approaches are cross-domain and portable to other languages.

Keywords: Sentiment analysis · Opinion mining · Word embeddings · Machine learning
1 Introduction

Sentiment analysis has recently been one of the hottest topics in natural language processing (NLP). It is used to identify and categorise opinions expressed by reviewers on a topic or an entity. Sentiment analysis can be leveraged in marketing, social media analysis, and customer service. Although many studies have been conducted for sentiment analysis in widely spoken languages, this topic is still immature for Turkish and many other languages. Neural networks outperform the conventional machine learning algorithms in most classification tasks, including sentiment analysis [9]. In these networks, word embedding vectors are fed as input to overcome the data sparsity problem and to make the representations of words more "meaningful" and robust. Those embeddings indicate how close the words are to each other in the vector space model (VSM).
Most of the studies utilise embeddings, such as word2vec [14], which take into account the syntactic and semantic representations of the words only. Discarding the sentimental aspects of words may lead to words of different polarities being close to each other in the VSM, if they share similar semantic and syntactic features. For Turkish, there are only a few studies which leverage sentimental information in generating the word and document embeddings. Unlike the studies conducted for English and other widely-spoken languages, in this paper, we use the official dictionaries for this language and combine the unsupervised and supervised scores to generate a unified score for each dimension of the word embeddings in this task. Our main contribution is to create original and effective word vectors that capture syntactic, semantic and sentimental characteristics of words, and use all of this knowledge in generating embeddings. We also utilise the word2vec embeddings trained on a large corpus. Besides using these word embeddings, we also generate hand-crafted features on a review-basis and create document vectors. We evaluate those embeddings on two datasets. The results show that we outperform the approaches which do not take into account the sentimental information. We also had better performances than other studies carried out on sentiment analysis in Turkish media. We also evaluated our novel embedding approaches on two English corpora of different genres. We outperformed the baseline approaches for this language as well. The source code and datasets are publicly available1 . The paper is organised as follows. In Sect. 2, we present the existing works on sentiment classification. In Sect. 3, we describe the methods proposed in this work. The experimental results are shown and the main contributions of our proposed approach are discussed in Sect. 4. In Sect. 5, we conclude the paper.
2 Related Work

In the literature, the main consensus is that the use of dense word embeddings outperforms the sparse embeddings in many tasks. Latent semantic analysis (LSA) used to be the most popular method in generating word embeddings before the invention of word2vec and other word vector algorithms, which are mostly created by shallow neural network models. Although many studies have been employed on generating word vectors including both semantic and sentimental components, generating and analysing the effects of different types of embeddings on different tasks is an emerging field for Turkish. Latent Dirichlet allocation (LDA) is used in [3] to extract mixtures of latent topics. However, it focusses on finding the latent topics of a document, not the word meanings themselves. In [19], LSA is utilised to generate word vectors, leveraging indirect cooccurrence statistics. These outperform the use of sparse vectors [5]. Some of the prior studies have also taken into account the sentimental characteristics of a word when creating word vectors [4, 11, 12]. A model with semantic and sentiment components is built in [13], making use of star-ratings of reviews. In [10], a sentiment lexicon is induced preferring the use of domain-specific cooccurrence statistics over the word2vec method, and they outperform the latter.
1 https://github.com/cemrifki/sentiment-embeddings.
In a recent work on sentiment analysis in Turkish [6], they learn embeddings using Turkish social media. They use the word2vec algorithm, create several unsupervised hand-crafted features, generate document vectors and feed them as input into the support vector machines (SVM) approach. We outperform this baseline approach using more effective word embeddings and supervised hand-crafted features. In English, much of the recent work on learning sentiment-specific embeddings relies only on distant supervision. In [7], emojis are used as features and a bi-directional long short-term memory (bi-LSTM) neural network model is built to learn sentimentaware word embeddings. In [18], a neural network that learns word embeddings is built by using contextual information about the data and supervised scores of the words. This work captures the supervised information by utilising emoticons as features. Most of our approaches do not rely on a neural network model in learning embeddings. However, they produce state-of-the-art results.
3 Methodology

We generate several word vectors, which capture the sentimental, lexical, and contextual characteristics of words. In addition to these mostly original vectors, we also create word2vec embeddings to represent the corpus words by training the embedding model on these datasets. After generating these, we combine them with hand-crafted features to create document vectors and perform classification, as will be explained in Sect. 3.5.

3.1 Corpus-Based Approach

Contextual information is informative in the sense that, in general, similar words tend to appear in the same contexts. For example, the word smart is more likely to cooccur with the word hardworking than with the word lazy. This similarity can be defined semantically and sentimentally. In the corpus-based approach, we capture both of these characteristics and generate word embeddings specific to a domain. Firstly, we construct a matrix whose entries correspond to the number of cooccurrences of the row and column words in sliding windows. Diagonal entries are assigned the number of sliding windows that the corresponding row word appears in in the whole corpus. We then normalise each row by dividing its entries by the maximum score in that row. Secondly, we perform the principal component analysis (PCA) method to reduce the dimensionality. It captures latent meanings and takes into account high-order cooccurrence while removing noise. The attribute (column) number of the matrix is reduced to 200. We then compute the cosine similarity between each row pair w_i and w_j as in (1) to find out how similar two word vectors (rows) are:

\[ \cos(w_i, w_j) = \frac{w_i \cdot w_j}{\lVert w_i \rVert\, \lVert w_j \rVert} \tag{1} \]
Thirdly, all the values in the matrix are subtracted from 1 to create a dissimilarity matrix. Then, we feed the matrix as input into the fuzzy c-means clustering algorithm. We chose the number of clusters as 200, as it is considered a standard for word embeddings in the literature. After clustering, the dimension i for a corresponding word indicates the degree to which this word belongs to cluster i. The intuition behind this idea is that if two words are similar in the VSM, they are more likely to belong to the same clusters with analogous probabilities. In the end, each word in the corpus is represented by a 200-dimensional vector. In addition to this method, we also perform singular value decomposition (SVD) on the cooccurrence matrices, where we compute the matrix M^PPMI = UΣV^T. Positive pointwise mutual information (PPMI) scores between words are calculated and the truncated singular value decomposition is computed. We take into account the U matrix only for each word. We have chosen the singular value number as 200. That is, each word in the corpus is represented by a 200-dimensional vector as follows:

\[ w_i = (U)_i \tag{2} \]
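A compact sketch of this SVD variant is given below (not the authors' code); the cooccurrence counts are random placeholders and the dimensionality is reduced for the toy matrix.

```python
import numpy as np

def ppmi(M):
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)
    col = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / (row @ col))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute no information
    return np.maximum(pmi, 0.0)           # keep only positive PMI values

rng = np.random.default_rng(0)
M = rng.integers(0, 5, size=(50, 50)).astype(float)   # placeholder cooccurrence counts
U, S, Vt = np.linalg.svd(ppmi(M))
word_vectors = U[:, :20]                  # the paper keeps 200 columns of U; 20 here for the toy matrix
```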
3.2 Dictionary-Based Approach
In Turkish, there do not exist well-established sentiment lexicons as in English. In this approach, we made use of the TDK (Türk Dil Kurumu, "Turkish Language Institution") dictionary to obtain word polarities. Although it is not a sentiment lexicon, combining it with domain-specific polarity scores obtained from the corpus led us to state-of-the-art results. We first construct a matrix whose row entries are corpus words and column entries are the words in their dictionary definitions. We followed the Boolean approach. For instance, for the word cat, the column words occurring in its dictionary definition are given a score of 1. Those column words not appearing in the definition of cat are assigned a score of 0 for that corresponding row entry. When we performed clustering on this matrix, we observed that those words having similar meanings are, in general, assigned to the same clusters. However, this similarity fails in capturing the sentimental characteristics. For instance, the words happy and unhappy are assigned to the same cluster, since they have the same words, such as feeling, in their dictionary definitions. However, they are of opposite polarities and should be discerned from each other. Therefore, we utilise a metric to move such words away from each other in the VSM, even though they have common words in their dictionary definitions. We multiply each value in a row with the corresponding row word's raw supervised score, thereby obtaining more meaningful clusters. Using the training data only, the supervised polarity score per word is calculated as in (3):

\[ w_t = \log \frac{\dfrac{N_t}{N} + 0.01}{\dfrac{N'_t}{N'} + 0.01} \tag{3} \]

Here, w_t denotes the sentiment score of word t, N_t is the number of documents (reviews or tweets) in which t occurs in the dataset of positive polarity, N is the number of all the words in the corpus of positive polarity, and N' denotes the corpus of negative
polarity. N'_t and N' denote the corresponding values for the negative polarity corpus. We perform normalisation to prevent the imbalance problem and add a small number to both the numerator and the denominator for smoothing. As an alternative to multiplying with the raw supervised polarity scores, we also separately multiplied all the row scores with only +1 if the row word is a positive word, and with −1 if it is a negative word. We have observed that this boosts the performance more compared to using the raw scores.
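Eq. (3) reduces to a few lines of arithmetic; the counts below are hypothetical.

```python
import math

def polarity_score(n_t_pos, n_pos, n_t_neg, n_neg):
    # Eq. (3): log-ratio of smoothed relative frequencies in the positive and negative corpora.
    return math.log((n_t_pos / n_pos + 0.01) / (n_t_neg / n_neg + 0.01))

print(polarity_score(n_t_pos=120, n_pos=50000, n_t_neg=5, n_neg=48000))   # a mostly positive word
```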
Fig. 1. The effect of using the supervised scores of words in the dictionary algorithm. It shows how sentimentally similar word vectors get closer to each other in the VSM.
The effect of this multiplication is exemplified in Fig. 1, showing the positions of word vectors in the VSM. Those “x” words are sentimentally negative words, those “o” words are sentimentally positive ones. On the top coordinate plane, the words of opposite polarities are found to be close to each other, since they have common words in their dictionary definitions. Only the information concerned with the dictionary definitions
are used there, discarding the polarity scores. However, when we utilise the supervised score (+1 or −1), words of opposite polarities (e.g. "happy" and "unhappy") get far away from each other as they are translated across coordinate regions. Positive words now appear in quadrant 1, whereas negative words appear in quadrant 3. Thus, in the VSM, words that are sentimentally similar to each other could be clustered more accurately. Besides clustering, we also employed the SVD method to perform dimensionality reduction on the unsupervised dictionary algorithm and used the newly generated matrix by combining it with other subapproaches. The number of dimensions is chosen as 200 again according to the U matrix. The details are given in Sect. 3.4. When using and evaluating this subapproach on the English corpora, we used the SentiWordNet lexicon [2]. We have achieved better results for the dictionary-based algorithm when we employed the SVD reduction method compared to the use of clustering.

3.3 Supervised Contextual 4-Scores
Our last component is a simple metric that uses four supervised scores for each word in the corpus. We extract these scores as follows. For a target word in the corpus, we scan through all of its contexts. In addition to the target word's polarity score (the self-score), out of all the polarity scores of words occurring in the same contexts as the target word, the minimum, maximum, and average scores are taken into consideration. The word polarity scores are computed using (3). Here, we obtain those scores from the training data. The intuition behind this method is that these four scores are more indicative of a word's polarity than only one (the self-score). This approach is fully supervised, unlike the previous two approaches.

3.4 Combination of the Word Embeddings
In addition to using the three approaches independently, we also combined all the matrices generated in the previous approaches. That is, we concatenate the reduced forms (SVD - U) of the corpus-based, dictionary-based, and the whole of the 4-score vectors of each word, horizontally. Accordingly, each corpus word is represented by a 404-dimensional vector, since the corpus-based and dictionary-based vector components are each composed of 200 dimensions, whereas the 4-score vector component is formed by four values. The main intuition behind the ensemble method is that some approaches compensate for what the others may lack. For example, the corpus-based approach captures the domain-specific, semantic, and syntactic characteristics. On the other hand, the 4-scores method captures supervised features, and the dictionary-based approach is helpful in capturing the general semantic characteristics. That is, combining those three approaches makes word vectors more representative.

3.5 Generating Document Vectors
After creating several embeddings as mentioned above, we create document (review or tweet) vectors. For each document, we sum all the vectors of words occurring in
that document and take their average. In addition to it, we extract three hand-crafted polarity scores, which are minimum, mean, and maximum polarity scores, from each review. These polarity scores of words are computed as in (3). For example, if a review consists of five words, it would have five polarity scores and we utilise only three of these sentiment scores as mentioned. Lastly, we concatenate these three scores to the averaged word vector per review. That is, each review is represented by the average word vector of its constituent word embeddings and three supervised scores. We then feed these inputs into the SVM approach. The flowchart of our framework is given in Fig. 2. When combining the unsupervised features, which are word vectors created on a word-basis, with supervised three scores extracted on a review-basis, we have better state-of-the-art results.
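A minimal sketch of this construction (not the authors' code) follows; the word vectors and polarity scores are random placeholders, and the 404-dimensional size corresponds to the concatenated embeddings described in Sect. 3.4.

```python
import numpy as np

word_vectors = {w: np.random.rand(404) for w in ["good", "film", "plot"]}   # placeholder embeddings
word_polarity = {"good": 1.8, "film": 0.1, "plot": -0.2}                    # placeholder Eq. (3) scores

def document_vector(tokens):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    scores = [word_polarity.get(t, 0.0) for t in tokens]
    avg = np.mean(vecs, axis=0)                                              # averaged word embedding
    return np.concatenate([avg, [min(scores), np.mean(scores), max(scores)]])  # plus the 3 supervised scores

x = document_vector(["good", "film", "plot"])   # fed to the SVM classifier
```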
Fig. 2. The flowchart of our system.
4 Datasets

We utilised two datasets for both Turkish and English to evaluate our methods. For Turkish, as the first dataset, we utilised the movie reviews which are collected from a popular website (https://www.beyazperde.com). The number of reviews in this movie corpus is 20,244 and the average number of words in reviews is 39. Each of these reviews has a star-rating score which is indicative of sentiment. These polarity scores are between the values 0.5 and 5, at intervals of 0.5. We consider a review to be negative if the score is equal to or lower than 2.5. On the other hand, if it is equal to or higher than 4, it is assumed to
be positive. We have randomly selected 7,020 negative and 7,020 positive reviews and processed only them. The second Turkish dataset is the Twitter corpus which is formed of tweets about Turkish mobile network operators. Those tweets are mostly much noisier and shorter compared to the reviews in the movie corpus. In total, there are 1,716 tweets; 973 of them are negative and 743 of them are positive. These tweets are manually annotated by two humans, where the labels are either positive or negative. We measured the Cohen's Kappa inter-annotator agreement score to be 0.82. If there was a disagreement on the polarity of a tweet, we removed it. We also utilised two other datasets in English to test the portability of our approaches to other languages. One of them is a movie corpus collected from the web (https://github.com/dennybritz/cnn-text-classification-tf). There are 5,331 positive reviews and 5,331 negative reviews in this corpus. The other is a Twitter dataset, which has nearly 1.6 million tweets annotated through a distant supervised method [8]. These tweets have positive, neutral, and negative labels. We have selected 7,020 positive tweets and 7,020 negative tweets randomly to generate a balanced dataset.
5 Experiments

5.1 Preprocessing
In Turkish, people sometimes prefer to spell English characters in place of the corresponding Turkish characters (e.g. i for ı, c for ç) when writing in electronic format. To normalise such words, we used the Zemberek tool [1]. All punctuation marks except "!" and "?" are removed, since they do not contribute much to the polarity of a document. We took into account emoticons, such as ":))", and idioms, such as "kafayı yemek" (lose one's mind), since two or more words can express a sentiment together, irrespective of the individual words thereof. Since Turkish is an agglutinative language, we used the morphological parser and disambiguation tools [16, 17]. We also performed negation handling and stop-word elimination. In negation handling, we append an underscore to the end of a word if it is negated. For example, "güzel değil" (not beautiful) is redefined as "güzel_" (beautiful_) in the feature selection stage when supervised scores are being computed.
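A toy sketch of this negation handling rule is given below; it only covers the separate particle "değil", not negation expressed by verbal suffixes, which the morphological tools handle.

```python
def handle_negation(tokens):
    out = []
    for i, tok in enumerate(tokens):
        if tok == "değil":
            continue                       # the negation particle itself is dropped
        if i + 1 < len(tokens) and tokens[i + 1] == "değil":
            out.append(tok + "_")          # mark the negated word with an underscore
        else:
            out.append(tok)
    return out

print(handle_negation(["film", "güzel", "değil"]))   # ['film', 'güzel_']
```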
5.2 Hyperparameters
We used the LibSVM utility of the WEKA tool. We chose the linear kernel option to classify the reviews. We trained word2vec embeddings on all the four corpora using the Gensim library [15] with the skip-gram method. The dimension size of these embeddings is set at 200. As mentioned, other embeddings, which are generated utilising the clustering and the SVD approach, are also of size 200. For c-means clustering, we set the maximum number of iterations at 25, unless it converges.
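For reference, the word2vec training setup described above corresponds to a call along these lines (Gensim 4 API; older versions use `size` instead of `vector_size`); the tokenised sentences here are placeholders.

```python
from gensim.models import Word2Vec

sentences = [["güzel", "bir", "film"], ["berbat", "bir", "senaryo"]]   # toy tokenised corpus
model = Word2Vec(sentences, vector_size=200, sg=1, min_count=1)        # 200-d skip-gram embeddings
vector = model.wv["film"]
```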
Table 1. Accuracies for different feature sets fed as input into the SVM classifier in predicting the labels of reviews. The word2vec algorithm is the baseline method.

  Word embedding type                          Turkish Movie (%)  Turkish Twitter (%)  English Movie (%)  English Twitter (%)
  Corpus-based + SVD (U)                       76.19              64.38                66.54              87.17
  Dictionary-based + SVD (U)                   60.64              51.36                55.29              60.00
  Supervised 4-scores                          89.38              76.00                75.65              72.62
  Concatenation of the above three             88.12              73.23                73.40              73.12
  Corpus-based + Clustering                    52.27              52.73                51.02              54.40
  word2vec                                     76.47              46.57                57.73              62.60
  Corpus-based + SVD (U) + 3-feats             88.45              72.60                76.85              85.88
  Dictionary-based + SVD (U) + 3-feats         88.64              71.91                76.66              80.40
  Supervised 4-scores + 3-feats                90.38              78.00                77.05              72.83
  Concatenation of the above three + 3-feats   89.77              72.60                77.03              80.20
  Corpus-based + Clustering + 3-feats          87.89              71.91                75.02              74.40
  word2vec + 3-feats                           88.88              71.23                77.03              75.64
316
C. R. Aydın et al.
Since the TDK dictionary covers most of the domain-specific vocabulary used in the movie reviews, the dictionary method performs well. However, the dictionary lacks many of the words occurring in the tweets; therefore, its performance is not the best of all. When the TDK method is combined with the 3-feats technique, we observed a great improvement, as can be expected. Success rates obtained for the movie corpus are much better than those for the Twitter dataset for most of our approaches, since tweets are, in general, much shorter and noisier. We also found out that, when choosing the p value as 0.05, our results are statistically significant compared to the baseline approach in Turkish [6]. Some of our subapproaches also produce better success rates than those sentiment analysis models employed in English [7, 18]. We have achieved state-of-the-art results for the sentiment classification task for both Turkish and English. As mentioned, our approaches, in general, perform best in predicting the labels of reviews when the three supervised scores are additionally utilised. We also employed the convolutional neural network model (CNN). However, the SVM classifier, which is a conventional machine learning algorithm, performed better. We did not include the performances of CNN for each embedding type here due to the page limit of the paper. As a qualitative assessment of the word representations, given some query words, we visualised the most similar words to those words using the cosine similarity metric. By assessing the similarities between a word and all the other corpus words, we can find the most akin words according to different approaches. Table 2 shows the most similar words to the given query words. Those words which are indicative of sentiment are, in general, found to be most similar to words of the same polarity. For example, the most similar word to muhteşem (gorgeous) is 10/10, both of which have positive polarity. As can be seen in Table 2, our corpus-based approach is more adept at capturing domain-specific features as compared to word2vec, which generally captures general semantic and syntactic characteristics, but not the sentimental ones.

Table 2. Most similar words to given queries according to our corpus-based approach and the baseline word2vec algorithm.

  Query Word             Corpus-based        word2vec
  Muhteşem (Gorgeous)    10/10               Harika (Wonderful)
  Berbat (Terrible)      Vasat (Mediocre)    Kötü (Bad)
  İlginç (Interesting)   Fark (Difference)   Tespit (Finding)
  İyi (Good)             Güzel (Beautiful)   Kötü (Bad)
  Kötü (Bad)             Sıkıcı (Boring)     İyi (Good)
  Senaryo (Script)       Kurgu (Plot)        Kurgu (Plot)
6 Conclusion

We have demonstrated that using word vectors that capture only semantic and syntactic characteristics may be improved by taking into account their sentimental aspects as well. Our approaches are portable to other languages and cross-domain. They can be applied to other domains and other languages than Turkish and English with minor changes. Our study is one of the few that perform sentiment analysis in Turkish; it leverages the sentimental characteristics of words in generating word vectors and outperforms all the others. Any of the approaches we propose can be used independently of the others. Our approaches without using sentiment labels can be applied to other classification tasks, such as topic classification and concept mining. The experiments show that even unsupervised approaches, as in the corpus-based approach, can outperform supervised approaches in classification tasks. Combining approaches which can compensate for what others lack can help us build better vectors. Our word vectors are created by conventional machine learning algorithms; however, they, as in the corpus-based model, produce state-of-the-art results. Although we preferred to use a classical machine learning algorithm, which is SVM, over a neural network classifier to predict the labels of reviews, we achieved accuracies of over 90% for the Turkish movie corpus and about 88% for the English Twitter dataset. We performed only binary sentiment classification in this study, as most of the studies in the literature do. We will extend our system in the future by using neutral reviews as well. We also plan to employ the Turkish WordNet to enhance the generalisability of our embeddings as another future work.

Acknowledgments. This work was supported by Boğaziçi University Research Fund Grant Number 6980D, and by the Turkish Ministry of Development under the TAM Project number DPT2007K12-0610. Cem Rıfkı Aydın has been supported by TÜBİTAK BİDEB 2211E.
References
1. Akın, A.A., Akın, M.D.: Zemberek, an open source NLP framework for Turkic languages. Structure 10, 1–5 (2007)
2. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: Calzolari, N., et al. (eds.) LREC. European Language Resources Association (2010)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Boyd-Graber, J., Resnik, P.: Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 45–55. Association for Computational Linguistics (2010)
5. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). https://doi.org/10.1023/A:1022627411411
6. Ertuğrul, A.M., Önal, I., Acartürk, C.: Does the strength of sentiment matter? A regression based approach on Turkish social media. In: Natural Language Processing and Information Systems - 22nd International Conference on Applications of Natural Language to Information Systems, NLDB 2017, Liège, Belgium, 21–23 June 2017, Proceedings, pp. 149–155 (2017). https://doi.org/10.1007/978-3-319-59569-6_16
7. Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., Lehmann, S.: Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In: EMNLP, pp. 1615–1625. Association for Computational Linguistics (2017)
8. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. In: Processing, pp. 1–6 (2009)
9. Goldberg, Y.: A primer on neural network models for natural language processing. J. Artif. Intell. Res. 1510(726), 345–420 (2016)
10. Hamilton, W.L., Clark, K., Leskovec, J., Jurafsky, D.: Inducing domain-specific sentiment lexicons from unlabeled corpora. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 595–605. Association for Computational Linguistics (2016)
11. Li, F., Huang, M., Zhu, X.: Sentiment analysis with global topics and local dependency. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pp. 1371–1376. Association for Computational Linguistics (2010)
12. Lin, C., He, Y.: Joint sentiment/topic model for sentiment analysis. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 375–384. ACM (2009). https://doi.org/10.1145/1645953.1646003
13. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142–150. Association for Computational Linguistics (2011)
14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR 1301(3781), 1–12 (2013)
15. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, Valletta, May 2010
16. Sak, H., Güngör, T., Saraçlar, M.: Morphological disambiguation of Turkish text with perceptron algorithm. In: Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2007), pp. 107–118. CICLing Press (2007). https://doi.org/10.1007/978-3-540-70939-8_10
17. Sak, H., Güngör, T., Saraçlar, M.: Turkish language resources: morphological parser, morphological disambiguator and web corpus. In: Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS (LNAI), vol. 5221, pp. 417–427. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85287-2_40
18. Tang, D., Wei, F., Qin, B., Liu, T., Zhou, M.: Coooolll: a deep learning system for Twitter sentiment classification. In: Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, 23–24 August 2014, pp. 208–212 (2014)
19. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)
Speech Processing
Speech Emotion Recognition Using Spontaneous Children's Corpus

Panikos Heracleous1, Yasser Mohammad2,4(B), Keiji Yasuda1,3, and Akio Yoneyama1

1 KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama 356-8502, Japan {pa-heracleous,yoneyama}@kddi-research.jp
2 National Institute of Advanced Industrial Science and Technology, Tokyo, Japan [email protected]
3 Nara Institute of Science and Technology, Ikoma, Japan [email protected]
4 Assiut University, Asyut, Egypt [email protected]
Abstract. Automatic recognition of human emotions is a relatively new field and is attracting significant attention in research and development areas because of the major contribution it could make to real applications. Previously, several studies reported speech emotion recognition using acted emotional corpus. For real world applications, however, spontaneous corpora should be used in recognizing human emotions from speech. This study focuses on speech emotion recognition using the FAU Aibo spontaneous children’s corpus. A method based on the integration of feed-forward deep neural networks (DNN) and the i-vector paradigm is proposed, and another method based on deep convolutional neural networks (DCNN) for feature extraction and extremely randomized trees as classifier is presented. For the classification of five emotions using balanced data, the proposed methods showed unweighted average recalls (UAR) of 61.1% and 59.2%, respectively. These results are very promising showing the effectiveness of the proposed methods in speech emotion recognition. The two proposed methods based on deep learning (DL) were compared to a support vector machines (SVM) based method and they demonstrated superior performance. Keywords: Speech emotion recognition · Spontaneous corpus · Deep neural networks · Feature extraction · Extremely randomized trees
1 Introduction
Emotion recognition plays an important role in human-machine communication [4]. Emotion recognition can be used in human-robot communication, when robots communicate with humans in accord with the detected human emotions, and also has an important role to play in call centers in detecting a caller's emotional state in cases of emergency (e.g., hospitals, police stations), or to identify
the level of the customer’s satisfaction (i.e., providing feedback). In the current study, emotion recognition based on speech is experimentally investigated. Previous studies reported automatic speech emotion recognition using Gaussian mixture models (GMMs) [28,29], hidden Markov models (HMM) [24], support vector machines (SVM) [21], neural networks (NN) [20], and DNN [9,26]. In [17], a study based on concatenated i-vectors is reported. Audiovisual emotion recognition is presented in [18]. Previously, i-vectors were used in speech emotion recognition [17]. However, only a very few studies reported speech emotion recognition using i-vectors integrated with DNN [30]. Furthermore, to our knowledge the integration of i-vectors and DL for speech emotion recognition when limited data are available has not been investigated exhaustively so far and, therefore, the research area still remains open. Additionally, in the current study the FAU Aibo [25] state-of-theart spontaneous children’s emotional corpus is used for the classification of five emotions based on DNN and i-vectors. Another method is proposed that uses DCNN [1,14] to extract informative features, which are then used by extremely randomized trees [8] for emotion recognition. The extremely randomized trees classifier is similar to the random forest classifier [11], but with randomized tree splitting. The motivation for using extremely randomized trees lies in previous observations showing their effectiveness in the case of a small number of features, and also because of the computational efficiency. The proposed methods based on DL are compared with a baseline classification approach. In the baseline method, i-vectors and SVM are being used. To further increase temporal information in the feature vectors, in the current study, shifted delta cepstral (SDC) coefficients [3,27] were also used along with the well-known mel-frequency cepstral coefficients (MFCC) [23].
2 Methods

2.1 Data
The FAU Aibo corpus consists of 9 h of German speech of 51 children between the ages of 10 and 13 interacting with Sony's pet robot Aibo. Spontaneous, emotionally colored children's speech was recorded using a close-talking microphone. The data was annotated in relation to 11 emotion categories by five human labelers on a word level. In the current study, the FAU Aibo data are used for classification of the emotional states of angry, emphatic, joyful, neutral, and rest. To use balanced training and test data, 590 training utterances and 299 test utterances randomly selected from each emotion were used.

2.2 Feature Selection
MFCC features are used in the experiments. MFCCs are very commonly used features in speech recognition, speaker recognition, emotion recognition, and language identification. Specifically, in the current study, 12 MFCCs plus energy are extracted every 10 ms using a window length of 20 ms.
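As a concrete illustration, the sketch below extracts the frame-level features just described (12 MFCCs plus energy, 10 ms shift, 20 ms window). The use of librosa and a 16 kHz sampling rate are assumptions for illustration only, not choices stated in the paper.

import librosa
import numpy as np

def extract_mfcc_energy(wav_path, sr=16000):
    # 12 MFCCs plus frame energy, computed every 10 ms over 20 ms windows.
    y, sr = librosa.load(wav_path, sr=sr)
    hop, win = int(0.010 * sr), int(0.020 * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop, n_fft=win)
    energy = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)
    return np.vstack([mfcc, energy]).T  # shape: (num_frames, 13)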
Fig. 1. Computation of SDC coefficients using MFCC and delta MFCC features.
SDC coefficients have been successfully used in language recognition. In the current study, the use of SDC features in speech emotion recognition is also experimentally investigated. The motivation for using SDC is to increase the temporal information in the feature vectors, which consist of frame-level features with limited temporal information. The SDC features are obtained by concatenating delta cepstral features across multiple frames. The SDC features are described by four parameters, N, d, P and k, where N is the number of cepstral coefficients computed at each frame, d represents the time advance and delay for the delta computation, k is the number of blocks whose delta coefficients are concatenated to form the final feature vector, and P is the time shift between consecutive blocks. Accordingly, kN parameters are used for each SDC feature vector, as compared with 2N for conventional cepstral and delta-cepstral feature vectors. The SDC is calculated as follows:

Δc(t + iP) = c(t + iP + d) − c(t + iP − d)    (1)
The final vector at time t is given by the concatenation of all Δc(t + iP) for all 0 ≤ i ≤ k − 1, where c(t) is the original feature value at time t. In the current study, the feature vectors with static MFCC features and SDC coefficients are of length 112. The concatenated MFCC and SDC features are used as input when using the DCNN with extremely randomized trees and conventional CNN classifiers. In the case of using DNN and SVM, the MFCC and SDC features are used to construct i-vectors used in classification. Figure 1 illustrates the extraction of SDC features.
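A minimal sketch of the SDC computation in Eq. (1) is given below. The array layout and the default values of d, P, and k are illustrative assumptions; the paper reports only the final 112-dimensional MFCC+SDC vector length.

import numpy as np

def sdc(cep, d=1, P=3, k=7):
    # cep: (num_frames, N) cepstral features; returns (num_frames, k*N) SDC features.
    T, N = cep.shape
    pad = d + (k - 1) * P
    padded = np.pad(cep, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros((T, k * N))
    for t in range(T):
        tp = t + pad  # position of frame t inside the padded array
        blocks = [padded[tp + i * P + d] - padded[tp + i * P - d]  # Eq. (1)
                  for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out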
2.3 The i-Vector Paradigm
A widely used classification approach in speaker recognition is based on GMMs with universal background models (UBM). In this approach, each speaker model is created by adapting the UBM using maximum a posteriori (MAP) adaptation. A GMM supervector is constructed by concatenating the means of the adapted models. As in speaker recognition, GMM supervectors can also be used for emotion classification.
The main disadvantage of GMM supervectors, however, is the high dimensionality, which imposes high computational and memory costs. In the i-vector paradigm, the limitations of high dimensional supervectors (i.e., concatenation of the means of GMMs) are overcome by modeling the variability contained in the supervectors with a small set of factors. Considering speech emotion classification, an input utterance can be modeled as:

M = m + Tw    (2)
where M is the emotion-dependent supervector, m is the emotion-independent supervector, T is the total variability matrix, and w is the i-vector. Both the total variability matrix and emotion-independent supervector are estimated from the complete set of training data.
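The following simplified sketch shows the GMM-UBM supervector construction that the i-vector model factorizes, assuming scikit-learn's GaussianMixture as the UBM and standard relevance-MAP mean adaptation; the component count and relevance factor are illustrative, and training of the total variability matrix T itself is omitted.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_frames, n_components=64):
    # Universal background model fitted on pooled training frames.
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(all_frames)

def map_supervector(ubm, utt_frames, r=16.0):
    # MAP-adapt the UBM means to one utterance and concatenate them.
    post = ubm.predict_proba(utt_frames)          # (T, C) frame posteriors
    n_c = post.sum(axis=0)                        # zero-order statistics
    f_c = post.T @ utt_frames                     # first-order statistics, (C, dim)
    alpha = (n_c / (n_c + r))[:, None]
    adapted = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted.ravel()                        # supervector of length C * dim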
2.4 Classification Approaches
Deep Neural Networks. DL [10] is behind several of the most recent breakthroughs in computer vision, speech recognition, and agents that achieved human-level performance in several games such as go, poker etc. A DNN is a feed-forward neural network with more than one hidden layer. The units (i.e., neurons) of each hidden layer take all outputs of the lower layer and pass them through an activation function. In the current study, three hidden layers with 64 units and the ReLu activation function are used. On top, a Softmax layer with five classes was added. The number of batches was set to 512, and 500 epochs were used. Convolutional Neural Networks. A CNN is a special variant of the conventional network, which introduces a special network structure consisting of alternating convolution and pooling layers. CNN have been successfully applied to sentence classification [13], image classification [22], facial expression recognition [12], and in speech emotion recognition [16]. Furthermore, in [7] bottleneck features extracted from CNN are used for robust language identification. In this paper, DCNN for learning informative features from the signal that is then used for emotion classification is investigated. The MFCC and SDC features are calculated using overlapping windows with a length of 20 ms. This generates a multidimensional time-series that represent the data for each session. The proposed method is a simplified version of the method recently proposed in [19] for activity recognition using mobile sensors. The proposed classifier consists of a DCNN followed by extremely randomized trees instead of the standard fully connected classifier. The motivation for using extremely randomized trees lies in previous observations showing their effectiveness in the case of a small number of features. The network architecture is shown in Fig. 2, and consists of a series of five blocks, each of which consists of two convolutional layers (64 5×5) followed by a max-pooling layer (2×2). Outputs from the last three blocks are then combined and flattened to represent the learned features. Training of the classifier proceeds in three stages as shown in the Fig. 3:
Fig. 2. The architecture of the deep feature extractor along with the classifier used during feature learning.
Network training, feature selection, and tree training. During network training, the DCNN is trained with predefined windows of 21 feature MFCC/SDC blocks (21 × 112 features). Network training consists of two sub-stages: First, the network is concatenated with its inverse to form an auto-encoder that is trained in unsupervised mode using all data in the training set and without the labels (i.e., pre-training stage). Second, three fully connected layers are attached to the output of the network, and the whole combined architecture is trained as a classifier using the labeled training set. These fully connected layers are then removed, and the output of the neural network (i.e., deep feature extractor) represents the learned features. Every hidden layer is an optimized classifier, and an optimized classifier is a useful feature extractor because the output is discriminative. The second training stage (i.e., feature selection) involves selecting a few of the outputs from the deep feature extractor to be used in the final classification. Each feature (i.e., neuronal output i) is assigned a total quality Q(i) according to Eq. 3, where Ī_j(i) is the z-score normalized feature importance I_j(i) according to a base feature selection method:

Q(i) = Σ_{j=0}^{n_f} w_j Ī_j(i)    (3)
In the current study, three base selectors are utilized: randomized logistic regression [6], linear SVMs with L1 penalty, and extremely randomized trees. Random linear regression (RLR) estimates feature importance by randomly selecting subsets of training samples and fitting them using a L1 sparsity inducing penalty that is scaled for a random set of coefficients. The features that appear repeatedly in such selections (i.e., with high coefficients) are assumed to be more important and are given higher scores. The second base extractor uses a linear SVM with an L1 penalty to fit the data and then select the features that have nonzero coefficients, or coefficients under a given threshold, from the fitted model. The third feature selector employs extremely randomized trees. During fitting of decision trees, features that appear at lower depths are generally more important. By fitting several such trees, feature importance can be estimated as
Fig. 3. The proposed training process showing the three stages of training and the output of each stage.

Table 1. Equal error rates (EER) for individual emotions when using three different classifiers.

Classifier              | Angry | Emphatic | Joyful | Neutral | Rest | Average
DNN                     | 20.1  | 19.8     | 16.4   | 21.1    | 29.8 | 21.4
DCNN + Randomized trees | 24.1  | 24.7     | 23.7   | 30.4    | 29.4 | 26.5
SVM                     | 23.7  | 27.3     | 20.4   | 30.4    | 41.5 | 28.7
the average depth of each feature in the trees. Feature selection uses n-fold cross validation to select an appropriate number of neurons to retain in the final (fast) feature extractor (Fig. 3). For this study, the features (outputs) whose quality (Qi ) exceeds the median value of qualities are retained. Given the selected features from the previous step, an extremely randomized tree classifier is then trained using the labeled data set (i.e., tree training stage). Note that the approach described above allows a classification decision to be generated for each of the 21 MFCC/SDC blocks. To generate a single emotion prediction for each test sample, the outputs of the classifier need to be combined. One possibility is to use a recurrent neural network (RNN), an LSTM, or HMM to perform this aggregation. Nevertheless, in this study, the simplest voting aggregator, in which the label of the test file is the mode of the labels of all its data, is used.
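A sketch of the feature-selection and tree-training stages is given below, assuming scikit-learn estimators as stand-ins for the three base selectors (plain L1-penalized logistic regression replaces the randomized variant) and equal weights w_j in Eq. (3); deep_feats denotes the deep feature extractor outputs for the 21-block windows.

import numpy as np
from collections import Counter
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def zscore(v):
    return (v - v.mean()) / (v.std() + 1e-12)

def feature_quality(deep_feats, labels):
    # Eq. (3): sum of z-score normalized importances from three base selectors.
    imps = [
        np.abs(LogisticRegression(penalty="l1", solver="liblinear").fit(deep_feats, labels).coef_).mean(axis=0),
        np.abs(LinearSVC(penalty="l1", dual=False).fit(deep_feats, labels).coef_).mean(axis=0),
        ExtraTreesClassifier(n_estimators=100).fit(deep_feats, labels).feature_importances_,
    ]
    return sum(zscore(i) for i in imps)

def train_and_classify(deep_feats, labels, test_utterances):
    q = feature_quality(deep_feats, labels)
    keep = q > np.median(q)                                  # retain above-median features
    clf = ExtraTreesClassifier(n_estimators=200).fit(deep_feats[:, keep], labels)
    predictions = []
    for blocks in test_utterances:                           # blocks: (num_blocks, num_feats)
        block_preds = clf.predict(blocks[:, keep])
        predictions.append(Counter(block_preds).most_common(1)[0][0])  # voting aggregator
    return predictions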
3 Results
In the current study, the equal error rate (EER) and the UAR are used as evaluation measures. The UAR is defined as the mean value of the recall for each class. In addition, detection error tradeoff (DET) graphs are also shown. Table 1 shows the EERs when using the three classifiers. As shown, by using DNN along with i-vectors, the lowest EER is obtained. Specifically, when using DNN, the EER was 21.4%. The second lowest EER was obtained using DCNN with extremely randomized trees. In this case, a 26.5% EER was obtained. Using
Table 2. Confusion matrix of five emotions recognition when using DNN with i-vectors.

         | Angry | Emphatic | Joyful | Neutral | Rest
Angry    | 63.5  | 14.4     | 6.7    | 5.0     | 10.4
Emphatic | 15.1  | 63.9     | 0.3    | 14.3    | 6.4
Joyful   | 3.7   | 2.3      | 68.9   | 4.4     | 20.7
Neutral  | 3.3   | 14.4     | 6.4    | 60.2    | 15.7
Rest     | 12.3  | 9.0      | 17.1   | 12.4    | 49.2
Table 3. Confusion matrix of five emotions recognition when using DCNN and extremely randomized trees.

         | Angry | Emphatic | Joyful | Neutral | Rest
Angry    | 65.2  | 1.7      | 2.7    | 0.3     | 30.1
Emphatic | 13.7  | 61.2     | 2.0    | 0       | 23.1
Joyful   | 4.2   | 2.2      | 61.3   | 0       | 32.3
Neutral  | 8.7   | 9.4      | 1.0    | 43.8    | 37.1
Rest     | 16.8  | 10.5     | 8.7    | 5.3     | 58.7
Table 4. Confusion matrix of five emotions recognition when using conventional CNN.

         | Angry | Emphatic | Joyful | Neutral | Rest
Angry    | 51.5  | 15.1     | 11.4   | 10.0    | 12.0
Emphatic | 10.7  | 53.8     | 14.7   | 12.7    | 8.1
Joyful   | 13.4  | 12.0     | 51.8   | 12.0    | 10.8
Neutral  | 11.7  | 13.4     | 12.4   | 52.5    | 10.0
Rest     | 13.7  | 8.0      | 16.4   | 9.4     | 52.5
SVM, the EER was 28.7%. The results also show that joyful, emphatic, and angry emotions show the lowest EERs. A possible reason may be the higher emotional information included in these three emotions. On the other hand, the highest EERs were obtained in the case of the neutral and rest emotions (i.e., less emotional states). The UAR when using DNN with i-vectors was 61.1%. This is a very promising result and superior to other similar studies [2,5,15] that used different classifiers and features with unbalanced data. The results also show that DNN and i-vectors can be effectively integrated in speech emotion recognition even in the case of limited training data. The second highest UAR was obtained in the case of DCNN with extremely randomized trees. In this case, a 59.2% UAR was achieved. When a fully-connected layer was used on top of the convolutional layers (i.e., conventional CNN classifier) the UAR was 52.4%. This rate was lower compared to the extremely randomized trees classifier with deep feature
Table 5. Confusion matrix of five emotions recognition when using SVM.

         | Angry | Emphatic | Joyful | Neutral | Rest
Angry    | 55.2  | 15.7     | 6.0    | 7.0     | 16.1
Emphatic | 16.7  | 44.5     | 3.3    | 17.1    | 18.4
Joyful   | 3.3   | 2.7      | 62.2   | 4.7     | 27.1
Neutral  | 7.7   | 12.4     | 13.6   | 35.5    | 30.8
Rest     | 11.4  | 9.4      | 18.7   | 14.0    | 46.5
Fig. 4. DET curves of speech emotion recognition using DNN.
extractor. Finally, when using SVM and i-vectors, a 48.8% UAR was achieved. The results show that when using the two proposed methods based on DL, higher UARs are achieved compared to the baseline approach. Tables 2, 3, 4, and 5 show the confusion matrices. As shown, in the case of DNN, the classification rates are comparable (with the exception of rest). The joyful, emphatic, angry classes are recognized with the highest rates, and rest is recognized with the lowest rate. In the case of using DCNN with extremely randomized trees, the classes angry and joyful show the highest rates. When using the conventional CNN, similar rates were obtained for all emotions. In the
case of SVM, joyful and angry are recognized with the highest accuracy. It can be, therefore, concluded that the emotions angry and joyful are recognized with the highest rates in most cases.
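The reported UARs can be recovered directly from these confusion matrices; for example, averaging the per-class recalls on the diagonal of Table 2 reproduces the 61.1% UAR of the DNN/i-vector system.

recalls = [63.5, 63.9, 68.9, 60.2, 49.2]   # diagonal of Table 2: angry, emphatic, joyful, neutral, rest
uar = sum(recalls) / len(recalls)
print(round(uar, 1))                        # 61.1 (%), matching the reported UAR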
Fig. 5. DET curves of speech emotion recognition using DCNN and extremely randomized trees.
Figures 4, 5, and 6 show the DET curves of the five individual emotions recognition. As shown, in all cases, superior performance was achieved for the emotion joyful. Figure 7 shows the overall DET curves for the three classifiers. The figure clearly demonstrates that by using the two proposed methods based on DL, the highest performance is achieved. More specifically, the highest performance is obtained when using DNN and i-vectors. Note that above 30% FPR, SVM shows superior performance compared to DCNN with extremely randomized trees. The overall EER, however, is lower in the case of DCNN with extremely randomized trees compared to SVM.
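For reference, a common way to estimate the per-emotion EERs reported above from detection scores is sketched below; the scoring interface is an assumption, as the paper does not describe its EER tooling.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(targets, scores):
    # targets: 1 for the emotion of interest, 0 otherwise; scores: detection scores.
    fpr, tpr, _ = roc_curve(targets, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FPR and FNR are closest
    return (fpr[idx] + fnr[idx]) / 2.0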
Fig. 6. DET curves of speech emotion recognition using SVM.
Fig. 7. DET curves of speech emotion recognition using three different classifiers.
4 Conclusion
The current paper focused on speech emotion recognition based on deep learning, using the state-of-the-art FAU Aibo emotion corpus of children's speech. The proposed method based on DNN and i-vectors achieved a 61.1% UAR. This result is very promising and superior to previously reported results obtained on the same data. The results also show that i-vectors and DNN can be efficiently used in speech emotion recognition, even in the case of very limited training data. The UAR when using DCNN with extremely randomized trees was 59.2%. The two proposed methods were compared to a baseline SVM-based classification scheme, and they showed superior performance. Currently, speech emotion recognition using the proposed methods and the FAU Aibo data in noisy and reverberant environments is being investigated.
References 1. Abdel-Hamid, O., Mohamed, A.R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 1533–1545 (2014) 2. Attabi, Y., Alam, J., Dumouchel, P., Kenny, P., Shaughnessy, D.O.: Multiple windowed spectral features for emotion recognition. In: Proceedings of ICASSP, pp. 7527–7531 (2013) 3. Bielefeld, B.: Language identification using shifted delta cepstrum. In: Fourteenth Annual Speech Research Symposium (1994) 4. Busso, C., Bulut, M., Narayanan, S.: Toward effective automatic recognition systems of emotion in speech. In: Gratch, J., Marsella, S. (eds.) Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction, pp. 110–127. Oxford University Press, New York (2013) 5. Cao, H., Verma, R., Nenkova, A.: Combining ranking and classification to improve emotion recognition in spontaneous speech. In: Proceedings of INTERSPEECH (2012) 6. Friedman, J., Hastie, T., et al.: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 28(2), 337– 407 (2000) 7. Ganapathy, S., Han, K., Thomas, S., Omar, M., Segbroeck, M.V., Narayanan, S.S.: Robust language identification using convolutional neural network features. In: Proceedings of Interspeech (2014) 8. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006) 9. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Proceedings of Interspeech, pp. 2023–2027 (2014) 10. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process. Mag. 29(6), 82–97 (2012) 11. Ho, T.K.: Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, pp. 278–282 (1995)
12. Huynh, X.-P., Tran, T.-D., Kim, Y.-G.: Convolutional Neural Network Models for Facial Expression Recognition Using BU-3DFE Database. In: Information Science and Applications (ICISA) 2016. LNEE, vol. 376, pp. 441–450. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0557-2_44 13. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014) 14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates, Inc. (2012) 15. Le, D., Provost, E.M.: Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks. In: Proceedings of IEEE ASRU, pp. 216–221 (2013) 16. Lim, W., Jang, D., Lee, T.: Speech emotion recognition using convolutional and recurrent neural networks. In: Proceedings of Signal and Information Processing Association Annual Summit and Conference (APSIPA) (2016) 17. Liu, R.X.Y.: Using i-vector space model for emotion recognition. In: Proceedings of Interspeech, pp. 2227–2230 (2012) 18. Metallinou, A., Lee, S., Narayanan, S.: Decision level combination of multiple modalities for recognition and analysis of emotional expression. In: Proceedings of ICASSP, pp. 2462–2465 (2010) 19. Mohammad, Y., Matsumoto, K., Hoashi, K.: Deep feature learning and selection for activity recognition. In: Proceedings of the 33rd ACM/SIGAPP Symposium On Applied Computing, pp. 926–935. ACM SAC (2018) 20. Nicholson, J., Takahashi, K., Nakatsu, R.: Emotion recognition in speech using neural networks. Neural Comput. Appl. 9(4), 290–296 (2000) 21. Pan, Y., Shen, P., Shen, L.: Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012) 22. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Commun. 29, 2352–2449 (2017) 23. Sahidullah, M., Saha, G.: Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Commun. 54(4), 543–565 (2012) 24. Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model-based speech emotion recognition. In: Proceedings of the IEEE ICASSP, vol. 1, pp. 401–404 (2003) 25. Steidl, S.: Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech. Logos Verlag, Berlin (2009) 26. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., Schuller, B.: Deep neural networks for acoustic emotion recognition: raising the benchmarks. In: Proceedings of ICASSP, pp. 5688–5691 (2011) 27. Torres-Carrasquillo, P., Singer, E., Kohler, M.A., Greene, R.J., Reynolds, D.A., Deller, J.R.: Approaches to language identification using gaussian mixture models and shifted delta cepstral features. In: Proceedings of ICSLP 2002 - INTERSPEECH 2002, pp. 16–20 (2002) 28. Tang, H., Chu, S., Johnson, M.H.: Emotion recognition from speech via boosted Gaussian mixture models. In: Proceedings of ICME, pp. 294–297 (2009)
29. Xu, S., Liu, Y., Liu, X.: Speaker recognition and speech emotion recognition based on GMM. In: 3rd International Conference on Electric and Electronics (EEIC 2013), pp. 434–436 (2013) 30. Zhang, T., Wu, J.: Speech emotion recognition with i-vector feature and RNN model. In: 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), pp. 524–528 (2015)
Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances

Eda Okur1(B), Shachi H. Kumar2, Saurav Sahay2, Asli Arslan Esme1, and Lama Nachman2

1 Intel Labs, Hillsboro, USA {eda.okur,asli.arslan.esme}@intel.com
2 Intel Labs, Santa Clara, USA {shachi.h.kumar,saurav.sahay,lama.nachman}@intel.com
Abstract. Understanding passenger intents and extracting relevant slots are crucial building blocks towards developing contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal Incabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our current explorations, we focused on AMIE scenarios describing usages around setting or changing the destination and route, updating driving behavior or speed, finishing the trip, and other use-cases to support various natural commands. We collected a multi-modal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via a realistic scavenger hunt game activity. After exploring various recent Recurrent Neural Networks (RNN) based techniques, we introduced our hierarchical joint models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results outperformed certain competitive baselines and achieved overall F1-scores of 0.91 for utterance-level intent detection and 0.96 for slot filling tasks. In addition, we conducted initial speech-totext explorations by comparing intent/slot models trained and tested on human transcriptions versus noisy Automatic Speech Recognition (ASR) outputs. Finally, we evaluated the results with single passenger rides versus the rides with multiple passengers. Keywords: Intent recognition · Slot filling · Hierarchical joint learning · Spoken language understanding (SLU) · In-cabin dialogue agent
1 Introduction
One of the exciting yet challenging areas of research in Intelligent Transportation Systems is developing context-awareness technologies that can enable
autonomous vehicles to interact with their passengers, understand passenger context and situations, and take appropriate actions accordingly. To this end, building multi-modal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort and gain user confidence in AV interaction systems. Among many components of such systems, intent recognition and slot filling modules are one of the core building blocks towards carrying out successful dialogue with passengers. As an initial attempt to tackle some of those challenges, this study introduces in-cabin intent detection and slot filling models to identify passengers' intent and extract semantic frames from the natural language utterances in AV. The proposed models are developed by leveraging the User Experience (UX) grounded realistic (ecologically valid) in-cabin dataset. This dataset is generated with naturalistic passenger behaviors, multiple passenger interactions, with the presence of a Wizard-of-Oz (WoZ) agent in moving vehicles with noisy road conditions.

1.1 Background
Long Short-Term Memory (LSTM) networks [7] are widely used for temporal sequence learning or time-series modeling in Natural Language Processing (NLP). These neural networks are commonly employed for sequence-to-sequence (seq2seq) and sequence-to-one (seq2one) modeling problems, including slot filling tasks [11] and utterance-level intent classification [5,17] which are wellstudied for various application domains. Bidirectional LSTMs (Bi-LSTMs) [18] are extensions of traditional LSTMs which are proposed to improve model performance on sequence classification problems even further. Jointly modeling slot extraction and intent recognition [5,25] is also explored in several architectures for task-specific applications in NLP. Using Attention mechanism [16,24] on top of RNNs is yet another recent breakthrough to elevate the model performance by attending inherently crucial sub-modules of a given input. There exist various architectures to build hierarchical learning models [10,22,27] for document-tosentence level, and sentence-to-word level classification tasks, which are highly domain-dependent and task-specific. Automatic Speech Recognition (ASR) technology has recently achieved human-level accuracy in many fields [20,23]. For spoken language understanding (SLU), it is shown that training SLU models on true text input (i.e., human transcriptions) versus noisy speech input (i.e., ASR outputs) can achieve varying results [9]. Even greater performance degradations are expected in more challenging and realistic setups with noisy environments, such as moving vehicles in actual traffic conditions. As an example, a recent work [26] attempts to classify sentences as navigation-related or not using the DARPA-supported CUMove in-vehicle speech corpus [6], a relatively old and large corpus focusing on route navigation. For this binary intent classification task, the authors observed that detection performances are largely affected by high ASR error rates due to background noise and multi-speakers in the CU-Move dataset (not publicly available). For in-cabin dialogue between car assistants and driver/passengers,
recent studies explored creating a public dataset using a WoZ approach [3], and improving ASR for passenger speech recognition [4]. A preliminary report on research designed to collect data for human-agent interactions in a moving vehicle is presented in a previous study [19], with qualitative analysis on initial observations and user interviews. Our current study is focused on the quantitative analysis of natural language interactions found in this in-vehicle dataset [14], where we address intent detection and slot extraction tasks for passengers interacting with the AMIE in-cabin agent. Contributions. In this study, we propose intent recognition and slot filling models with UX grounded naturalistic passenger-vehicle interactions. We defined in-vehicle intent types and refined their relevant slots through a data-driven process based on observed interactions. After exploring existing approaches for jointly training intents and slots, we applied certain variations of these models that perform best on our dataset to support various natural commands for interacting with the car agent. The main differences in our proposed models can be summarized as follows: (1) Using the extracted intent keywords in addition to the slots to jointly model them with utterance-level intents (where most of the previous work [10,27] only join slots and utterance-level intents, ignoring the intent keywords); (2) The 2-level hierarchy we defined by word-level detection/extraction for slots and intent keywords first, and then filtering-out predicted non-slot and non-intent keywords instead of feeding them into the upper levels of the network (i.e., instead of using stacked RNNs with multiple recurrent hidden layers for the full utterance [10,22], which are computationally costly for long utterances with many non-slot & non-intent-related words), and finally using only the predicted valid-slots and intent-related keywords as an input to the second level of the hierarchy; (3) Extending joint models [5,25] to include both beginning-of-utterance and end-of-utterance tokens to leverage Bi-LSTMs (after observing that we achieved better results by doing so). We compared our intent detection and slot filling results with the results obtained from Dialogflow1 , a commercially available intent-based dialogue system by Google, and we showed that our proposed models perform better for both tasks on the same dataset. We also conducted initial speech-to-text explorations by comparing models trained and tested (10-fold CV) on human transcriptions versus noisy ASR outputs (via Cloud Speech-to-Text2 ). Finally, we evaluated the results with single passenger rides versus the rides with multiple passengers.
2 Methodology

2.1 Data Collection and Annotation
Our AV in-cabin dataset includes around 30 h of multi-modal data collected from 30 passengers (15 female, 15 male) in a total of 20 rides/sessions. In 10 sessions,

1 https://dialogflow.com.
2 https://cloud.google.com/speech-to-text/.
Fig. 1. AMIE In-cabin data collection setup
single passenger was present (i.e., singletons), whereas the remaining 10 sessions include two passengers (i.e., dyads) interacting with the vehicle. The data is collected “in the wild” on the streets of Richmond, British Columbia, Canada. Each ride lasted about 1 h or more. The vehicle is modified to hide the operator and the human acting as an in-cabin agent from the passengers, using a variation of WoZ approach [21]. Participants sit in the back of the car, separated by a semisound proof and translucent screen from the human driver and the WoZ AMIE agent at the front. In each session, the participants were playing a scavenger hunt game by receiving instructions over the phone from the Game Master. Passengers treat the car as AV and communicate with the WoZ AMIE agent via speech commands. Game objectives require passengers to interact naturally with the agent to go to certain destinations, update routes, stop the vehicle, give specific directions regarding where to pull over or park (sometimes with a gesture), find landmarks, change speed, get in and out of the vehicle, etc. Further details of the data collection design and scavenger hunt protocol can be found in the preliminary study [19]. See Fig. 1 for the vehicle instrumentation to enhance multi-modal data collection setup. Our study is the initial work on this multi-modal dataset to develop intent detection and slot filling models. We leveraged data from the back-driver video/audio stream recorded by an RGB camera (facing the passengers) for manual transcription and annotation of the in-cabin utterances. In addition, we used the audio data recorded by Lapel 1 Audio and Lapel 2 Audio (Fig. 1) as our input resources for the ASR. For in-cabin intent understanding, we described four groups of usages to support various natural commands for interacting with the vehicle: (1) Set/Change Destination/Route (including turn-by-turn instructions), (2) Set/Change Driving Behavior/Speed, (3) Finishing the Trip Use-cases, and (4) Others (open/close door/window/trunk, turn music/radio on/off, change AC/temperature, show
Table 1. AMIE dataset statistics: utterance-level intent types

AMIE scenario                     | Intent type    | Utterance count
Finishing the Trip Use-cases      | Stop           | 317
                                  | Park           | 450
                                  | PullOver       | 295
                                  | DropOff        | 281
Set/Change Destination/Route      | SetDestination | 552
                                  | SetRoute       | 676
Set/Change Driving Behavior/Speed | GoFaster       | 265
                                  | GoSlower       | 238
Others (Door, Music, etc.)        | OpenDoor       | 142
                                  | Other          | 202
Total                             |                | 3418
map, etc.). According to those scenarios, ten types of passenger intents are identified and annotated as follows: SetDestination, SetRoute, GoFaster, GoSlower, Stop, Park, PullOver, DropOff, OpenDoor, and Other. For the slot filling task, relevant slots are identified and annotated as: Location, Position/Direction, Object, Time Guidance, Person, Gesture/Gaze (e.g., 'this', 'that', 'over there', etc.), and None/O. In addition to utterance-level intents and slots, word-level intent-related keywords are annotated as Intent. We obtained 1331 utterances having commands to the AMIE agent from our in-cabin dataset. We expanded this dataset via the creation of similar tasks on Amazon Mechanical Turk [2] and reached 3418 utterances with intents in total. Intent and slot annotations are obtained on the transcribed utterances by majority voting of 3 annotators. The annotation results for utterance-level intent types, slots, and intent keywords can be found in Table 1 and Table 2 as a summary of dataset statistics.

Table 2. AMIE dataset statistics: slots and intent keywords

Slot type          | Slot count
Location           | 4460
Position/Direction | 3187
Person             | 1360
Object             | 632
Time Guidance      | 792
Gesture/Gaze       | 523
None               | 19967
Total              | 30921

Keyword type          | Keyword count
Intent                | 5921
Non-Intent            | 25000
Valid-Slot            | 10954
Non-Slot              | 19967
Intent ∪ Valid-Slot   | 16875
Non-Intent ∩ Non-Slot | 14046
Total                 | 30921
2.2 Detecting Utterance-Level Intent Types
As a baseline system, we implemented term-frequency and rule-based mapping mechanisms between word-level intent keywords extraction to utterancelevel intent recognition. To further improve the utterance-level performance, we explored various RNN architectures and developed hierarchical (2-level) models to recognize passenger intents along with relevant entities/slots in utterances. Our hierarchical model has the following 2-levels: – Level-1: Word-level extraction (to automatically detect/predict and eliminate non-slot & non-intent keywords first, as they would not carry much information for understanding the utterance-level intent-type). – Level-2: Utterance-level recognition (to detect final intent-types for given utterances, using valid slots and intent keywords as inputs only, which are detected at Level-1). RNN with LSTM Cells for Sequence Modeling. In this study, we employed an RNN architecture with LSTM cells that are designed to exploit long-range dependencies in sequential data. LSTM has a memory cell state to store relevant information and various gates, which can mitigate the vanishing gradient problem [7]. Given the input xt at time t, and hidden state from the previous time step ht−1 , the hidden and output layers for the current time step are computed. The LSTM architecture is specified by the following equations: it = σ(Wxi xt + Whi ht−1 + bi )
(1)
ft = σ(Wxf xt + Whf ht−1 + bf)    (2)
ot = σ(Wxo xt + Who ht−1 + bo)    (3)
gt = tanh(Wxg xt + Whg ht−1 + bg)    (4)
ct = ft ⊙ ct−1 + it ⊙ gt    (5)
ht = ot ⊙ tanh(ct)    (6)
where W and b denote the weight matrices and bias terms, respectively. The sigmoid (σ) and tanh are activation functions applied element-wise, and ⊙ denotes the element-wise vector product. LSTM has a memory vector ct to read/write or reset using a gating mechanism and activation functions. Here, input gate it scales down the input, the forget gate ft scales down the memory vector ct, and the output gate ot scales down the output to achieve final ht, which is used to predict yt (through a softmax activation). Similar to LSTMs, GRUs [1] are proposed as simpler and faster alternatives, having reset and update gates only. For Bi-LSTM [5,18], two LSTM architectures are traversed in forward and backward directions, where their hidden layers are concatenated to compute the output.

Extracting Slots and Intent Keywords. For slot filling and intent keywords extraction, we experimented with various configurations of seq2seq LSTMs [17] and GRUs [1], as well as Bi-LSTMs [18]. A sample network architecture can be
Fig. 2. Seq2seq Bi-LSTM network for slot filling and intent keyword extraction
seen in Fig. 2 where we jointly trained slots and intent keywords. The passenger utterance is fed into the LSTM/GRU network with an embedding layer, and this sequence of words is transformed into word vectors. We also experimented with GloVe [15], word2vec [12,13], and fastText [8] as pre-trained word embeddings. To prevent overfitting, we used a dropout layer with 0.5 rate for regularization. Best performing results are obtained with Bi-LSTMs and GloVe embeddings (6B tokens, 400K vocabulary size, vector dimension 100). Utterance-Level Recognition. For utterance-level intent detection, we mainly experimented with 5 groups of models: (1) Hybrid: RNN + Rule-based, (2) Separate: Seq2one Bi-LSTM with Attention, (3) Joint: Seq2seq Bi-LSTM for slots/intent keywords & utterance-level intents, (4) Hierarchical & Separate, (5) Hierarchical & Joint. For (1), we detect/extract intent keywords and slots (via RNN) and map them into utterance-level intent-types (rule-based). For (2), we feed the whole utterance as input sequence and intent-type as a single target into the Bi-LSTM network with an Attention mechanism. For (3), we jointly train word-level intent keywords/slots and utterance-level intents (by adding < BOU > / < EOU > terms to the beginning/end of utterances with intenttypes as their labels). For (4) and (5), we detect/extract intent keywords/slots first and then only feed the predicted keywords/slots as a sequence into (2) and (3), respectively.
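A minimal Keras sketch of this seq2seq tagger is given below; the vocabulary size, tag inventory size, and LSTM width are placeholders, and the random matrix stands in for a separately prepared 100-dimensional GloVe embedding matrix.

import numpy as np
from tensorflow.keras import initializers
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM, TimeDistributed
from tensorflow.keras.models import Sequential

vocab_size, embed_dim, num_tags = 10000, 100, 9
glove_matrix = np.random.rand(vocab_size, embed_dim)   # placeholder for the real GloVe-100 matrix

model = Sequential([
    Embedding(vocab_size, embed_dim,
              embeddings_initializer=initializers.Constant(glove_matrix), mask_zero=True),
    Dropout(0.5),                                         # regularization, as described above
    Bidirectional(LSTM(64, return_sequences=True)),       # one output per input token
    TimeDistributed(Dense(num_tags, activation="softmax")),  # slot / intent-keyword / None tag per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")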
3 Experiments and Results

3.1 Utterance-Level Intent Detection Experiments
The details of the five groups of models and their variations we experimented with for utterance-level intent recognition are summarized in this section. Hybrid Models. Instead of purely relying on machine learning (ML) or deep learning (DL) systems, hybrid models leverage both ML/DL and rule-based systems. In this model, we defined our hybrid approach as using RNNs first for
Fig. 3. Hybrid models network architecture
detecting/extracting intent keywords and slots; then applying rule-based mapping mechanisms to identify utterance-level intents (using the predicted intent keywords and slots). A sample network architecture can be seen in Fig. 3 where we leveraged seq2seq Bi-LSTM networks for word-level extraction before the rule-based mapping to utterance-level intent classes. The model variations are defined based on varying mapping mechanisms and networks as follows:
– Hybrid-0: RNN (Seq2seq LSTM for intent keywords extraction) + Rule-based (mapping extracted intent keywords to utterance-level intents)
– Hybrid-1: RNN (Seq2seq Bi-LSTM for intent keywords extraction) + Rule-based (mapping extracted intent keywords to utterance-level intents)
– Hybrid-2: RNN (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Rule-based (mapping extracted intent keywords & 'Position/Direction' slots to utterance-level intents)
– Hybrid-3: RNN (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Rule-based (mapping extracted intent keywords & all slots to utterance-level intents)
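The rule-based second stage of these Hybrid models can be pictured as a simple lookup from the predicted word-level tags to an utterance-level intent, as in the sketch below; the specific keyword-to-intent rules are hypothetical, since the paper does not enumerate its mapping rules.

def map_to_intent(tokens, predicted_tags):
    # Hypothetical mapping rules from extracted intent keywords (and slots) to intent types.
    keyword_rules = {"stop": "Stop", "park": "Park", "faster": "GoFaster", "slow": "GoSlower"}
    for token, tag in zip(tokens, predicted_tags):
        if tag == "Intent" and token.lower() in keyword_rules:
            return keyword_rules[token.lower()]
        if tag == "Position/Direction" and token.lower() == "over":
            return "PullOver"
    return "Other"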
Separate Seq2one Models. This approach is based on separately training sequence-to-one RNNs for utterance-level intents only. These are called separate models as we do not leverage any information from the slot or intent keyword tags (i.e., utterance-level intents are not jointly trained with slots/intent keywords). Note that in seq2one models, we feed the utterance as an input sequence, and the LSTM layer will only return the hidden state output at the last time step. This single-output (or concatenated output of last hidden states from the forward and backward LSTMs in the Bi-LSTM case) will be used to classify the intent type of the given utterance. The idea behind this is that the last hidden state
Fig. 4. Separate models network architecture: (a) Separate Seq2one Network; (b) Separate Seq2one with Attention.
of the sequence will contain a latent semantic representation of the whole input utterance, which can be utilized for utterance-level intent prediction. See Fig. 4 (a) for sample network architecture of the seq2one Bi-LSTM network. Note that in the Bi-LSTM implementation for seq2one learning (i.e., when not returning sequences), the outputs of backward/reverse LSTM is actually ordered in reverse time steps (tlast ... tfirst). Thus, as illustrated in Fig. 4 (a), we concatenate the hidden state outputs of forward LSTM at the last time step and backward LSTM at the first time step (i.e., the first word in a given utterance), and then feed this merged result to the dense layer. Figure 4 (b) depicts the seq2one Bi-LSTM network with Attention mechanism applied on top of Bi-LSTM layers. For the Attention case, the hidden state outputs of all time steps are fed into the Attention mechanism that will allow pointing at specific words in a sequence when computing a single output [16]. Another variation of the Attention mechanism we examined is the AttentionWithContext, which incorporates a context/query vector jointly learned during the training process to assist the attention [24]. All seq2one model variations we experimented with can be summarized as follows:
– Separate-0: Seq2one LSTM for utterance-level intents
– Separate-1: Seq2one Bi-LSTM for utterance-level intents
– Separate-2: Seq2one Bi-LSTM with Attention [16] for utterance-level intents
– Separate-3: Seq2one Bi-LSTM with AttentionWithContext [24] for utterance-level intents
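For the Separate-1 variant, the concatenation of the forward last-step and backward first-step outputs described above is exactly what a Keras Bidirectional LSTM returns when sequences are not returned, as in this sketch (all dimensions are placeholders):

from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential

model = Sequential([
    Embedding(10000, 100),                               # word embedding layer
    Dropout(0.5),
    Bidirectional(LSTM(64, return_sequences=False)),     # merged forward-last / backward-first states
    Dense(10, activation="softmax"),                      # the ten AMIE utterance-level intent types
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")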
Joint Seq2seq Models. Using sequence-to-sequence networks, the approach here is jointly training annotated utterance-level intents and slots/intent keywords by adding <BOU>/<EOU> tokens to the beginning/end of each utterance, with utterance-level intent-type as labels of such tokens. Our approach is an extension of [5], in which only an end-of-sentence term is added, with intent-type tags associated to this end-of-sentence token, both for LSTM and Bi-LSTM cases.
Fig. 5. Joint models network architecture
However, we experimented with adding both <BOU> and <EOU> terms as Bi-LSTMs will be used for seq2seq learning, and we observed that slightly better results are achieved by doing so. The idea behind this is that, since this is a seq2seq learning problem, at the last time step (i.e., prediction at <EOU>), the reverse pass in Bi-LSTM would be incomplete (refer to Fig. 4 (a) to observe the last Bi-LSTM cell). Therefore, adding the <BOU> token and leveraging the backward LSTM output at the first time step (i.e., prediction at <BOU>) would potentially help for joint seq2seq learning. Overall network architecture can be found in Fig. 5 for our joint models. We will report the experimental results on two variations (with and without intent keywords) as follows:
– Joint-1: Seq2seq Bi-LSTM for utterance-level intent detection (jointly trained with slots)
– Joint-2: Seq2seq Bi-LSTM for utterance-level intent detection (jointly trained with slots & intent keywords)
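A small helper illustrates how such joint training examples can be constructed: the sentinel tokens carry the utterance-level intent label, while ordinary words keep their slot or intent-keyword tags (the example utterance is hypothetical).

def add_sentinels(tokens, word_tags, intent_label):
    # Wrap the utterance with <BOU>/<EOU>, both labeled with the utterance-level intent.
    return ["<BOU>"] + tokens + ["<EOU>"], [intent_label] + word_tags + [intent_label]

# e.g., add_sentinels(["pull", "over", "here"], ["Intent", "Intent", "Gesture/Gaze"], "PullOver")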
Hierarchical and Separate Models. Proposed hierarchical models are detecting/extracting intent keywords & slots using sequence-to-sequence networks first (i.e., level-1), and then feeding only the words that are predicted as intent keywords & valid slots (i.e., not the ones that are predicted as ‘None/O’) as an input sequence to various separate sequence-to-one models (described above) to recognize final utterance-level intents (i.e., level-2). A sample network architecture is given in Fig. 6 (a). The idea behind filtering out non-slot and non-intent keywords here resembles providing a summary of the input sequence to the upper levels of the network hierarchy. Note that we learn this summarized sequence of keywords using another RNN layer. That would potentially result in focusing the utterance-level classification problem on the most salient words of the input sequences (i.e., intent keywords & slots) and also effectively reducing
(a) Hierarchical & Separate Model
(b) Hierarchical & Joint Model
Fig. 6. Hierarchical models network architecture
the length of input sequences (i.e., improving the long-term dependency issues observed in longer sequences). Note that, according to the dataset statistics in Table 2, 45% of the words found in transcribed utterances with passenger intents are annotated as non-slot and non-intent keywords. For example, ‘please’, ‘okay’, ‘can’, ‘could’, incomplete/interrupted words, filler sounds like ‘uh’/‘um’, certain stop words, punctuation, and many other tokens exist that are not related to intent/slots. Therefore, the proposed approach would reduce the sequence length nearly by half at the input layer of level-2 for utterance-level recognition. For hierarchical & separate models, we experimented with four variations based on which separate model was used at the second level of the hierarchy, and these are summarized as follows: – Hierarchical & Separate-0: Level-1 (Seq2seq LSTM for intent keywords & slots extraction) + Level-2 (Separate-0: Seq2one LSTM for utterance-level intent detection) – Hierarchical & Separate-1: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Separate-1: Seq2one Bi-LSTM for utterance-level intent detection) – Hierarchical & Separate-2: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Separate-2: Seq2one Bi-LSTM + Attention for utterance-level intent detection) – Hierarchical & Separate-3: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Separate-3: Seq2one Bi-LSTM + AttentionWithContext for utterance-level intent detection) Hierarchical and Joint Models. Proposed hierarchical models detect/extract intent keywords & slots using sequence-to-sequence networks first, and then only the words that are predicted as intent keywords & valid slots (i.e., not the ones that are predicted as ‘None/O’) are fed as input to the joint sequence-to-sequence models (described above). See Fig. 6 (b) for sample network architecture. After
the filtering or summarization of the sequence at level-1, the <BOU> and <EOU> tokens are appended to the shorter input sequence before level-2 for joint learning. Note that in this case, using the Joint-1 model (jointly training annotated slots & utterance-level intents) for the second level of the hierarchy would not make much sense (without intent keywords). Hence, the Joint-2 model is used for the second level as described below:
– Hierarchical & Joint-2: Level-1 (Seq2seq Bi-LSTM for intent keywords & slots extraction) + Level-2 (Joint-2 Seq2seq models with slots & intent keywords & utterance-level intents)

Table 3 summarizes the results of various approaches we investigated for utterance-level intent understanding. We achieved a 0.91 overall F1-score with our best-performing model, namely Hierarchical & Joint-2. All model results are obtained via 10-fold cross-validation (10-fold CV) on the same dataset. For our AMIE scenarios, Table 4 shows the intent-wise detection results with the initial (Hybrid-0) and currently best performing (H-Joint-2) intent recognizers. With our best model (H-Joint-2), the relatively problematic SetDestination and SetRoute intents' detection performances in the baseline model (Hybrid-0) jumped from 0.78 to 0.89 and 0.75 to 0.88, respectively. We compared our intent detection results with Dialogflow's Detect Intent API. The same AMIE dataset is used to train and test (10-fold CV) Dialogflow's intent detection and slot filling modules, using the recommended hybrid mode (rule-based and ML). As shown in Table 4, an overall F1-score of 0.89 is achieved with Dialogflow for the same task, and our Hierarchical & Joint models obtained higher results than Dialogflow for 8 out of 10 intent types.
3.2 Slot Filling and Intent Keyword Extraction Experiments
Slot filling and intent keyword extraction results are given in Table 5 and Table 6, respectively. For slot extraction, we reached a 0.96 overall F1-score using the seq2seq Bi-LSTM model, which is slightly better than using the LSTM model. Note that although the overall performance is only slightly improved with the Bi-LSTM model, the relatively problematic Object, Time Guidance, and Gesture/Gaze slots' F1-score performances increased from 0.80 to 0.89, 0.80 to 0.85, and 0.87 to 0.92, respectively. With Dialogflow, we reached a 0.92 overall F1-score for the entity/slot filling task on the same dataset. As shown, our models reached significantly higher F1-scores than Dialogflow for 6 out of 7 slot types (except Time Guidance).
3.3 Speech-to-Text Experiments for AMIE: Training and Testing Models on ASR Outputs
For transcriptions, utterance-level audio clips were extracted from the passenger-facing video stream, which was the single source used for human transcriptions of all utterances from passengers, the AMIE agent, and the game master. Since
our transcriptions-based intent/slot models assumed perfect (at least close to human-level) ASR in the previous sections, we experimented with the more realistic scenario of using ASR outputs for intent/slot modeling. We employed Cloud Speech-to-Text API to obtain ASR outputs on audio clips with passenger utterances, which were segmented using transcription time-stamps. We observed an overall word error rate (WER) of 13.6% in ASR outputs for all 20 sessions of AMIE. Considering that a generic ASR is used with no domain-specific acoustic models for this moving vehicle environment with in-cabin noise, the initial results were quite promising to move on with the model training on ASR outputs. For initial explorations, we created a new dataset having utterances with commands using ASR outputs of the in-cabin data (20 sessions with 1331 spoken utterances). A human transcriptions version of this set is also created. Although the dataset size is limited, slot/intent keyword extraction and utterance-level intent recognition models are not severely affected when trained and tested (10-fold CV) on ASR outputs instead of manual transcriptions. See Table 7 for the overall F1-scores of the compared models. Singleton versus Dyad Sessions. After the ASR pipeline described above was completed for all 20 sessions of the AMIE in-cabin dataset (ALL with 1331 utterances), we repeated all our experiments on the following two subsets: (i) 10 sessions having a single passenger (Singletons with 600 utterances), and (ii) remaining 10 sessions having two passengers (Dyads with 731 utterances). We observed overall WER of 13.5% and 13.7% for Singletons and Dyads, respectively. The overlapping speech cases with slightly more conversations going on

Table 3. Utterance-level intent detection performance results (10-fold CV)
Prec Rec
Hybrid-0: RNN (LSTM) + Rule-based (intent keywords)
0.86 0.85 0.85
Hybrid-1: RNN (Bi-LSTM) + Rule-based (intent keywords)
0.87 0.86 0.86
F1
Hybrid-2: RNN (Bi-LSTM) + Rule-based (intent keywords & Pos slots) 0.89 0.88 0.88 Hybrid-3: RNN (Bi-LSTM) + Rule-based (intent keywords & all slots)
0.90 0.90 0.90
Separate-0: Seq2one LSTM
0.87 0.86 0.86
Separate-1: Seq2one Bi-LSTM
0.88 0.88 0.88
Separate-2: Seq2one Bi-LSTM + Attention
0.88 0.88 0.88
Separate-3: Seq2one Bi-LSTM + AttentionWithContext
0.89 0.89 0.89
Joint-1: Seq2seq Bi-LSTM (uttr-level intents & slots)
0.88 0.87 0.87
Joint-2: Seq2seq Bi-LSTM (uttr-level intents & slots & intent keywords) 0.89 0.88 0.88 Hierarchical & Separate-0 (LSTM)
0.88 0.87 0.87
Hierarchical & Separate-1 (Bi-LSTM)
0.90 0.90 0.90
Hierarchical & Separate-2 (Bi-LSTM + Attention)
0.90 0.90 0.90
Hierarchical & Separate-3 (Bi-LSTM + AttentionWithContext)
0.90 0.90 0.90
Hierarchical & Joint-2 (uttr-level intents & slots & intent keywords)
0.91 0.90 0.91
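The WER figures quoted above are standard word-level Levenshtein alignment scores; the following minimal reference implementation (not the tooling used in this work) illustrates the computation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("pull over at the next gas station", "pull over at next gas station"))  # ~0.143
```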
Table 4. Intent-wise performance results of utterance-level intent detection

AMIE scenario       Intent type  Baseline (Hybrid-0)  Best (H-Joint-2)     Dialogflow intent detection
                                 Prec  Rec   F1       Prec  Rec   F1       Prec  Rec   F1
Finishing The Trip  Stop         0.88  0.91  0.90     0.93  0.91  0.92     0.89  0.90  0.90
                    Park         0.96  0.87  0.91     0.94  0.94  0.94     0.95  0.88  0.91
                    PullOver     0.95  0.96  0.95     0.97  0.94  0.96     0.95  0.97  0.96
                    DropOff      0.90  0.95  0.92     0.95  0.95  0.95     0.96  0.91  0.93
Dest/Route          SetDest      0.70  0.88  0.78     0.89  0.90  0.89     0.86  0.89  0.88
                    SetRoute     0.80  0.71  0.75     0.84  0.91  0.87     0.83  0.86  0.84
Speed               GoFaster     0.86  0.89  0.88     0.89  0.90  0.90     0.89  0.86  0.88
                    GoSlower     0.92  0.84  0.88     0.94  0.92  0.93     0.93  0.87  0.90
Others              OpenDoor     0.95  0.95  0.95     0.95  0.95  0.95     0.83  0.81  0.82
                    Other        0.92  0.72  0.80     0.94  0.93  0.93     0.88  0.73  0.80
Overall                          0.86  0.85  0.85     0.91  0.90  0.91     0.90  0.89  0.89
Table 5. Slot filling results (10-fold CV)

                    Our slot filling models                  Dialogflow slot filling
Slot type           Seq2seq LSTM        Seq2seq Bi-LSTM
                    Prec  Rec   F1      Prec  Rec   F1      Prec  Rec   F1
Location            0.94  0.92  0.93    0.96  0.94  0.95    0.94  0.81  0.87
Position/Direction  0.92  0.93  0.93    0.95  0.95  0.95    0.91  0.92  0.91
Person              0.97  0.96  0.97    0.98  0.97  0.97    0.96  0.76  0.85
Object              0.82  0.79  0.80    0.93  0.85  0.89    0.96  0.70  0.81
Time Guidance       0.88  0.73  0.80    0.90  0.80  0.85    0.93  0.82  0.87
Gesture/Gaze        0.86  0.88  0.87    0.92  0.92  0.92    0.86  0.65  0.74
None                0.97  0.98  0.97    0.97  0.98  0.98    0.92  0.98  0.95
Overall             0.95  0.95  0.95    0.96  0.96  0.96    0.92  0.92  0.92
Table 6. Intent keyword extraction results (10-fold CV)

Keyword type  Prec  Rec   F1
Intent        0.95  0.93  0.94
Non-Intent    0.98  0.99  0.99
Overall       0.98  0.98  0.98
Table 7. F1-scores of models trained/tested on transcriptions vs. ASR outputs

Slot Filling & Intent Keywords            Train/Test on Transcriptions   Train/Test on ASR Outputs
                                          ALL    Singleton  Dyad         ALL    Singleton  Dyad
Slot Filling                              0.97   0.96       0.96         0.95   0.94       0.93
Intent Keyword Extraction                 0.98   0.98       0.97         0.97   0.96       0.96
Slot Filling & Intent Keyword Extraction  0.95   0.95       0.94         0.94   0.92       0.91

Utterance-level Intent Detection          ALL    Singleton  Dyad         ALL    Singleton  Dyad
Hierarchical & Separate                   0.87   0.85       0.86         0.85   0.84       0.83
Hierarchical & Separate + Attention       0.89   0.86       0.87         0.86   0.84       0.84
Hierarchical & Joint                      0.89   0.87       0.88         0.87   0.85       0.85
4 Discussion and Conclusion
We introduced AMIE, the intelligent in-cabin car agent responsible for handling certain AV-passenger interactions. We developed hierarchical and joint models to extract various passenger intents along with the relevant slots for actions to be performed in the AV, achieving F1-scores of 0.91 for intent recognition and 0.96 for slot extraction. Even when trained and tested on the noisy outputs of a generic ASR, our models achieve results comparable to those of models trained on human transcriptions. We believe that ASR performance can be improved by collecting more in-domain data to obtain domain-specific acoustic models. These initial models will allow us to collect more speech data via bootstrapping with the intent-based dialogue application we have built. In addition, the hierarchy we defined can reduce costly annotation efforts in the future, especially for the word-level slots and intent keywords. Once enough domain-specific multi-modal data is collected, our future work is to explore training end-to-end dialogue agents for our in-cabin use cases. We are also planning to exploit other modalities for an improved understanding of the in-cabin dialogue. Acknowledgments. We want to show our gratitude to our colleagues from Intel Labs, especially Cagri Tanriover for his tremendous efforts in coordinating and implementing the vehicle instrumentation to enhance the multi-modal data collection setup (as illustrated in Fig. 1), and John Sherry and Richard Beckwith for the insights and expertise that guided the collection of this UX-grounded and ecologically valid dataset (via the scavenger hunt protocol and WoZ research design). The authors are also immensely
grateful to the GlobalMe, Inc. members, especially Rick Lin and Sophie Salonga, for their extensive efforts in organizing and executing the data collection, transcription, and some annotation tasks for this research in collaboration with our team at Intel Labs.
Audio Summarization with Audio Features and Probability Distribution Divergence

Carlos-Emiliano González-Gallardo1(B), Romain Deveaud1, Eric SanJuan1, and Juan-Manuel Torres-Moreno1,2

1 LIA - Avignon Université, 339 chemin des Meinajaries, 84140 Avignon, France
{carlos-emiliano.gonzalez-gallardo,eric.sanjuan,juan-manuel.torres}@univ-avignon.fr
2 Département de GIGL, Polytechnique Montréal, C.P. 6079, succ. Centre-ville, Montréal, Québec H3C 3A7, Canada
Abstract. The automatic summarization of multimedia sources is an important task that facilitates understanding by condensing the source while maintaining relevant information. In this paper we focus on audio summarization based on audio features and probability distribution divergence. Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached. It takes into account each segment's length, position, and informativeness value. The informativeness of each segment is obtained by mapping a set of audio features issued from its Mel-frequency cepstral coefficients to their corresponding Jensen-Shannon divergence score. Results over a multi-evaluator scheme show that our approach provides understandable and informative summaries.

Keywords: Audio summarization · JS divergence · Informativeness · Human language understanding

1 Introduction
Multimedia summarization has become a major need since Internet platforms like Youtube1 provide easy access to massive online resources. In general, automatic summarization intends to produce an abridged and informative version of its source [17]. The type of automatic summarization we focus on in this article is audio summarization, whose source corresponds to an audio signal. Audio summarization can be performed with the following three approaches: directing the summary using only audio features [2,9,10,21]; extracting the text inside the audio signal and directing the summarization process using textual methods [1,13,16]; and a hybrid approach which consists of a mixture of the first two [15,19,20]. Each approach has advantages and disadvantages with regard to
1 https://www.youtube.com/
the others. Using only audio features for creating a summary has the advantage of being totally transcript-independent; however, this may also be a problem given that the summary is then based only on how things are said. By contrast, directing the summary with textual methods benefits from the information contained within the text, leading to more informative summaries; nevertheless, in some cases transcripts are not available. Finally, using both audio features and textual methods can boost the summary quality; yet the disadvantages of both approaches remain. The method we propose in this paper follows a hybrid approach during the training phase while being text-independent during summary creation. It relies on using textual information to learn an informativeness representation, based on probability distribution divergences, that standard audio summarization with audio features does not consider. During the summarization process this representation is used to obtain an informativeness score without a textual representation of the audio signal to summarize. To our knowledge, probability distribution divergences have not previously been used for audio summarization. The rest of this article is organized as follows. In Sect. 2 we give an overview of audio summarization, including its advantages and disadvantages compared with other summarization techniques. In Sect. 3 we explain how probability distribution divergence may be used within an audio summarization framework and describe our summarization proposal in detail. In Sect. 4 we describe the dataset used during the training and summary generation phases, the evaluation metric adopted to measure the quality of the produced summaries, and the results of the experimental evaluation of the proposed method. Finally, Sect. 5 concludes the article.
2 Audio Summarization
Audio summarization without any textual representation aims to produce an abridged and informative version of an audio source using only the information contained in the audio signal. This kind of summarization is challenging because the available information corresponds only to how things are said; on the other hand, it is advantageous in terms of transcript availability. Hybrid audio summarization methods and text-based audio summarization algorithms need automatic or manual speech transcripts to select the pertinent segments and produce an informative summary [19,20]. Nevertheless, speech transcripts may be expensive, unavailable, or of low quality, which has repercussions on summarization performance. Duxans et al. [2] managed to generate audio-based summaries of soccer match re-transmissions by detecting highlighted events. They based their detection algorithm on two acoustic features: the block energy and the acoustic repetition indexes. The performance was measured in terms of goal recall and summary precision, showing high rates for both categories. Maskey et al. [10] presented an audio-based summarization method using a hidden Markov model (HMM) framework. They used a set of different acoustic/prosodic features to represent the HMM observation vectors: speaking rate; F0 min, max, mean, range and slope; min, max and mean RMS energy; RMS
slope and sentence duration. The hidden variables represented the inclusion or exclusion of a segment in the summary. They performed experiments over 20 CNN shows and 216 stories previously used in [9]. Evaluation was made with the standard Precision, Recall, and F-measure information retrieval measures. The results show that the HMM framework had a very good coverage (Recall = 0.95) but a very poor precision (Precision = 0.26) when selecting pertinent segments. Zlatintsi et al. [21] addressed the audio summarization task by exploring the potential of a modulation model for the detection of perceptually important audio events. They performed a saliency computation of audio streams based on a set of saliency models and various linear, adaptive, and nonlinear fusion schemes. Experiments were performed over audio data extracted from six 30-minute movie clips. Results were reported in terms of frame-level precision scores, showing that nonlinear fusion schemes perform best. Audio summarization based only on acoustic features like fundamental frequency, energy, volume change, and speaker turns has the big advantage that no textual information is needed. This approach is especially useful when human transcripts are not available for the spoken documents and Automatic Speech Recognition (ASR) transcripts have a high word error rate. However, for highly informative content like broadcast news, bulletins, or reports, the most relevant information resides in the things that are said, while audio features are limited to how things are said.
3 Probability Distribution Divergence for Audio Summarization
All the methods presented in the previous section omit the informative content of the audio streams. In order to overcome this lack of information, we propose an extractive audio summarization method capable of representing the informativeness of a segment in terms of its audio features, learned during a training phase; informativeness is mapped by a probability distribution divergence model. Then, when creating a summary, text independence is reached by using only audio-based features. Divergence is defined by Manning [8] as a function which estimates the difference between two probability distributions. In the framework of automatic text summarization evaluation, [6,14,18] have used divergence-based measures such as Kullback-Leibler and Jensen-Shannon (JS) divergence to compare the probability distribution of words between automatically produced summaries and their sources. Extractive summarization based on the divergence of probability distributions has been discussed in [6], and a method has been proposed in [17] (DIVTEX). Our proposal, based on an extractive summarization approach, aims to select the most pertinent audio segments until a time threshold is reached. A training phase is in charge of learning a model that maps a set of 277 audio features to an informativeness value. A big dataset is used to compute the informativeness by obtaining the divergence between the dataset documents and their corresponding
segments. During the summarization phase, the method takes into account the segment's length, position, and the informativeness mapped from its audio features to rank the pertinence of each audio segment.

3.1 Audio Signal Pre-processing
During the pre-processing step, the audio signal is split into background and foreground channels. This process is normally used on music recordings for separating vocals and other sporadic signals from the accompanying instrumentation. Rafii et al. [12] achieved this separation by identifying recurrent elements, looking for similarities instead of periodicities. Their approach is useful for song recordings in which repetitions happen intermittently or without a fixed period; however, we found that applying the same method to newscast and report audio files made it much easier to segment them using only the background signal. We assume this phenomenon is due to the fact that newscasts and reports are heavily edited, with a low volume of background music playing while the journalists speak and louder music/noises for transitions (foreground). Following [12], to suppress non-repetitive deviations from the average spectrum and discard vocal elements, audio frames are compared using the cosine similarity. Similar frames separated by at least two seconds are aggregated by taking their per-frequency median value to avoid being biased by local continuity. Next, assuming that both signals are additive, a pointwise minimum between the obtained frames and the original signal is applied to obtain a raw background filter. Then, foreground and background time-frequency masks are derived from the raw background filter and the input signal with a soft-mask operation. Finally, the foreground and background components are obtained by multiplying the time-frequency masks with the input signal.
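A sketch of this background/foreground split, following librosa's similarity-based separation recipe (which implements the same ideas as [12]), is shown below; the input file name, mask margins, and window width are assumptions.

```python
# Sketch of the background/foreground split described above (REPET-SIM style).
import numpy as np
import librosa

y, sr = librosa.load("newscast.wav", sr=16000)            # hypothetical input file
S_full, phase = librosa.magphase(librosa.stft(y))

# Aggregate frames that are similar (cosine) and at least 2 s apart
S_filter = librosa.decompose.nn_filter(
    S_full, aggregate=np.median, metric="cosine",
    width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)                   # pointwise minimum -> raw background

margin_bg, margin_fg, power = 2, 10, 2                    # assumed soft-mask parameters
mask_bg = librosa.util.softmask(S_filter, margin_bg * (S_full - S_filter), power=power)
mask_fg = librosa.util.softmask(S_full - S_filter, margin_fg * S_filter, power=power)

background = mask_bg * S_full                             # used later for segmentation
foreground = mask_fg * S_full
```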
3.2 Informativeness Model
Informativeness is learned from the transcripts of a big audio dataset of newscasts and reports. A mapping between a set of 277 audio features and an informativeness value is learned during the training phase. The informativeness value corresponds to the Jensen-Shannon divergence (DJS) between the segmented transcripts and their source. The DJS is based on the Kullback-Leibler divergence [4], with the main difference that it is symmetric. The DJS between a segment Q and its source P is defined by [7,18] as:

D_{JS}(P \| Q) = \frac{1}{2} \sum_{w \in P} \left[ P_w \log_2 \frac{2 P_w}{P_w + Q_w} + Q_w \log_2 \frac{2 Q_w}{P_w + Q_w} \right]    (1)

P_w = \frac{C_w^{P} + \delta}{|P| + \delta \times \beta}    (2)

Q_w = \frac{C_w^{Q} + \delta}{|Q| + \delta \times \beta}    (3)
where C_w^{P|Q} is the frequency of word w in P or Q. To avoid shifting the probability mass to unseen events, the scaling parameter δ is set to 0.0005. |P| and |Q| correspond to the number of tokens in P and Q. Finally, β = 1.5 × |V|, where |V| is the vocabulary size of P. Each segment Q has a length of 10 s and is represented by 277 audio features, where 275 correspond to 11 statistical values of 25 Mel-frequency cepstral coefficients (MFCC) and the other two correspond to the number of frames in the segment and its starting time. The 11 statistical values are listed in Table 1, where φ′ and φ′′ correspond to the first and second MFCC derivatives.

Table 1. MFCC-based statistical values: min, max, median, mean, variance, skewness, and kurtosis are computed over the MFCC, and a subset of these statistics is additionally computed over φ′ and φ′′, giving 11 values per coefficient.
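The following function is a direct transcription of Eqs. (1)-(3); tokenization is deliberately naive and the example sentences are illustrative.

```python
# Smoothed Jensen-Shannon divergence between a segment transcript Q and its source P.
import math
from collections import Counter

def js_divergence(p_tokens, q_tokens, delta=0.0005):
    cp, cq = Counter(p_tokens), Counter(q_tokens)
    beta = 1.5 * len(set(p_tokens))            # beta = 1.5 * |V|, V = vocabulary of P
    div = 0.0
    for w in cp:                               # sum over the words of the source P
        pw = (cp[w] + delta) / (len(p_tokens) + delta * beta)   # Eq. (2)
        qw = (cq[w] + delta) / (len(q_tokens) + delta * beta)   # Eq. (3)
        div += pw * math.log2(2 * pw / (pw + qw)) + qw * math.log2(2 * qw / (pw + qw))
    return 0.5 * div                           # Eq. (1)

doc = "the storm reached the coast early this morning".split()
seg = "the storm reached the coast".split()
print(js_divergence(doc, seg))
```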
A linear least-squares regression model (LR(X, Y)) is trained to map the 277 audio features (X) to an informativeness score (Y). Figure 1 shows the whole training phase (informativeness model). All audio processing and feature extraction is performed with the Librosa library2 [11].
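A sketch of the segment representation and of the linear least-squares mapping is given below; which statistics are kept for the derivatives is an assumption (see Table 1), and the helper names are hypothetical.

```python
# 277-dimensional segment features and the linear regression to the JS score.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis
from sklearn.linear_model import LinearRegression

def segment_features(y, sr, start_time, n_mfcc=25):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1, d2 = librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)
    stats = [np.min, np.max, np.median, np.mean, np.var, skew, kurtosis]
    feats = [f(mfcc, axis=1) for f in stats]                 # 7 statistics on the MFCC
    feats += [np.mean(d1, axis=1), np.var(d1, axis=1)]       # assumed stats on phi'
    feats += [np.mean(d2, axis=1), np.var(d2, axis=1)]       # assumed stats on phi''
    vec = np.concatenate(feats)                              # 11 * 25 = 275 values
    return np.append(vec, [mfcc.shape[1], start_time])       # + frame count and start time

# X: one row per training segment, y_js: its JS divergence from Eq. (1)
# model = LinearRegression().fit(X, y_js); informativeness = model.predict(X_new)
```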
3.3 Audio Summary Creation
The summary creation for a document P follows the same audio signal pre-processing steps described in Sect. 3.1. During this phase, only the audio signal is needed, and the informativeness of each candidate segment Qi ∈ P is predicted with the LR(Qi, YQi) model. Figure 2 shows the full summarization pipeline used to obtain a summary of length threshold θ for an audio document P. After the background signal is isolated from the main signal, a temporally-constrained agglomerative clustering routine is used to partition the audio stream into k contiguous segments:

k = \frac{P_{length}}{60} \times 20    (4)

where P_length is the length in seconds of P.

2 https://librosa.github.io/librosa/index.html
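A sketch of this segmentation step, assuming librosa's temporally-constrained agglomerative clustering and a hop length of 512 samples, is:

```python
# Partition the background channel into k contiguous segments (Eq. 4).
import librosa

def segment_stream(background_mag, sr, hop_length=512, duration_s=None):
    if duration_s is None:
        duration_s = background_mag.shape[1] * hop_length / sr
    k = int(duration_s / 60 * 20)                               # Eq. (4): 20 segments/minute
    bounds = librosa.segment.agglomerative(background_mag, k)   # frame indices of boundaries
    times = librosa.frames_to_time(bounds, sr=sr, hop_length=hop_length)
    return k, times                                             # segment start times in seconds
```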
Fig. 1. Informativeness model scheme
To rank the pertinence of each segment Q1...Qk, a score SQi is computed. Audio summarization is then performed by choosing the segments with the highest SQi scores, in order of appearance, until θ is reached. SQi is defined as:

S_{Q_i} = \frac{1}{1 + e^{-(\Delta t_i - 5)}} \times \frac{|Q_i|}{|P|} \times e^{-\frac{t_{Q_i}}{\Delta t_i}} \times e^{1 - LR_{Q_i}}    (5)

where Δt_i = t_{Q_{i+1}} − t_{Q_i}, with t_{Q_i} the starting time of segment Q_i and t_{Q_{i+1}} the starting time of Q_{i+1}. |Q_i| and |P| correspond to the lengths in seconds of segment Q_i and document P, respectively.
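Equation (5) translates directly into code; the greedy selection loop mentioned above is sketched in the trailing comment.

```python
# Direct implementation of the segment score in Eq. (5).
import math

def segment_score(t_start, t_next, seg_len, doc_len, lr_score):
    """t_start/t_next: start times of Q_i and Q_{i+1} (s); seg_len, doc_len in seconds;
    lr_score: informativeness predicted by the linear regression model."""
    dt = t_next - t_start
    return (1.0 / (1.0 + math.exp(-(dt - 5)))       # favour segments longer than ~5 s
            * (seg_len / doc_len)
            * math.exp(-t_start / dt)                # favour segments early in the stream
            * math.exp(1.0 - lr_score))              # favour informative segments

# Greedy summary: rank candidate segments by this score, then keep them in order
# of appearance until the duration threshold (35% of the source in Sect. 4.1) is reached.
```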
4 Experimental Evaluation
We trained the informativeness model explained in Sect. 3.2 with a set of 5,989 audio broadcasts corresponding to more than 310 h of audio in French, English, and Arabic [5]. Transcripts were obtained with the ASR system described in [3]. During audio summary creation we focused on a small dataset of 10 English audio samples. In this phase no ASR system was used, given the text independence our system achieves once the informativeness model has been obtained. The selected sample lengths vary between 102 s (1 m 42 s) and 584 s (9 m 44 s), with an average length of 318 s (5 m 18 s). Similar to Rott et al. [13], we implement a 1-5 subjective scaled opinion metric to evaluate the quality of the generated summaries and their parts. During evaluation, we provided a set of five evaluators with the original audio, the generated summary, their corresponding segments, and the scale shown in Table 2.
Fig. 2. Summary creation scheme
4.1 Results
The summary length was set to 35% of the original audio length during experimentation. Evaluation was performed over the complete audio summaries as well as over each summary segment. We are interested in measuring the informativeness of the generated summaries, but also the informativeness of each of their segments.

Table 2. Evaluation scale

Score  Explanation
5      Full informative
4      Mostly informative
3      Half informative
2      Quite informative
1      Not informative
Table 3 shows the length of each video and the number of segments that were selected during the summarization process. "Full Score" corresponds to the evaluation of the complete audio summaries, while "Average Score" corresponds to the average score of their summary segments. The two metrics capture different things yet appear to be correlated. To validate this observation,
we computed the linear correlation between these two metrics, obtaining a PCC value equal to 0.53. The average scores of all evaluators can be seen in Table 3. The lowest "Full Score" average value obtained during evaluation was 2.75 and the highest 4.67, meaning that the summarization algorithm generated at least half-informative summaries. "Average Score" values oscillate between 2.49 and 3.76. An interesting case is sample #6, which according to its "Full Score" is "mostly informative" (Table 2) but has the lowest "Average Score" of all samples. This difference arises because 67% of its summary segments have an informativeness score < 3, yet overall the summary manages to communicate almost all the relevant information. Figure 3 plots the average score of each of the 30 segments of sample #6.

Table 3. Audio summarization performance over complete summaries and summary segments

Sample  Length    Segments  Full score  Average score
1       3 m 19 s   8        4.20        2.90
2       5 m 21 s  13        3.50        2.78
3       2 m 47 s   5        3.80        3.76
4       1 m 42 s   5        3.60        2.95
5       8 m 47 s  22        4.67        3.68
6       9 m 45 s  30        4.00        2.49
7       5 m 23 s   8        3.20        3.75
8       6 m 24 s  20        3.75        2.84
9       7 m 35 s  18        3.75        3.19
10      2 m 01 s   4        2.75        2.63
Fig. 3. Audio summarization performance for sample #6
A graphical representation of the audio summaries and their performance can be seen in Fig. 4. Full audio streams are represented by white bars, while summary segments are represented by the gray zones. The height of each summary segment corresponds to its informativeness score.
Fig. 4. Graphical representation of audio summarization performance
From Fig. 4 it can be seen that samples #2, #3, #7, #8 and #10 have all their summary segments clustered to the left. This is due to the preference the summarization technique gives to the first part of the audio stream, the region in which, within a standard newscast, most of the information is gathered. The problem is that in cases where different topics are covered across the newscast (multi-topic newscasts, interviews, round tables, reports, etc.), relevant information is distributed over the whole video. If a large number of relevant segments are grouped in this region, the summarization algorithm uses up the space available for the summary very quickly, discarding a large region of the audio stream. This is the case for samples #7 and #10, whose "Full Scores" are below 3.50. Concerning sample #5, a good distribution of its summary segments is observed. From its 22 segments, only 4 had an informativeness score ≤ 3, achieving the highest "Full Score" of all samples and a good "Average Score".
5 Conclusions
In this paper we presented an audio summarization method based on audio features and on the hypothesis that mapping the informativeness from a
pre-trained model using only audio features may help to select those segments which are most pertinent for the summary. The informativeness of each segment was obtained by mapping a set of audio features issued from its Mel-frequency cepstral coefficients to their corresponding Jensen-Shannon divergence score. Summarization was performed over a sample of English newscasts, demonstrating that the proposed method is able to generate at least half-informative extractive summaries. We can deduce that there is not a clear correlation between the quality of a summary and the quality of its parts; however, this behavior could be modeled as a recall-based relation between both measures. As future work we will validate this hypothesis as well as expand the evaluation dataset from a multilingual perspective to consider French and Arabic summarization. Acknowledgments. We would like to acknowledge the support of CHIST-ERA for funding this work through the Access Multilingual Information opinionS (AMIS) (France - Europe) project.
References
1. Christensen, H., Gotoh, Y., Renals, S.: A cascaded broadcast news highlighter. IEEE Trans. Audio Speech Lang. Process. 16(1), 151–161 (2008)
2. Duxans, H., Anguera, X., Conejero, D.: Audio based soccer game summarization. In: 2009 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB'09), pp. 1–6. IEEE (2009)
3. Jouvet, D., Langlois, D., Menacer, M., Fohr, D., Mella, O., Smaïli, K.: Adaptation of speech recognition vocabularies for improved transcription of YouTube videos. J. Int. Sci. Gen. Appl. 1(1), 1–9 (2018)
4. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
5. Leszczuk, M., Grega, M., Koźbiał, A., Gliwski, J., Wasieczko, K., Smaïli, K.: Video summarization framework for newscasts and reports – work in progress. In: Dziech, A., Czyżewski, A. (eds.) MCSS 2017. CCIS, vol. 785, pp. 86–97. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69911-0_7
6. Louis, A., Nenkova, A.: Automatic summary evaluation without human models. In: TAC (2008)
7. Louis, A., Nenkova, A.: Automatically evaluating content selection in summarization without human models. In: 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 306–314. ACL (2009)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
9. Maskey, S., Hirschberg, J.: Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In: 9th European Conference on Speech Communication and Technology (2005)
10. Maskey, S., Hirschberg, J.: Summarizing speech without text using hidden Markov models. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 89–92. Association for Computational Linguistics (2006)
11. McFee, B., et al.: librosa: audio and music signal analysis in Python. In: 14th Python in Science Conference, pp. 18–25 (2015)
12. Rafii, Z., Pardo, B.: Music/voice separation using the similarity matrix. In: ISMIR, pp. 583–588 (2012)
13. Rott, M., Červa, P.: Speech-to-text summarization using automatic phrase extraction from recognized text. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 101–108. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5_12
14. Saggion, H., Torres-Moreno, J.M., da Cunha, I., SanJuan, E.: Multilingual summarization evaluation without human models. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters (COLING'10), pp. 1059–1067. Association for Computational Linguistics, Stroudsburg, PA, USA (2010). https://dl.acm.org/doi/10.5555/1944566.1944688
15. Szaszák, G., Tündik, M.Á., Beke, A.: Summarization of spontaneous speech using automatic speech recognition and a speech prosody based tokenizer. In: KDIR, pp. 221–227 (2016)
16. Taskiran, C.M., Pizlo, Z., Amir, A., Ponceleon, D., Delp, E.J.: Automated video program summarization using speech transcripts. IEEE Trans. Multimedia 8(4), 775–791 (2006). https://doi.org/10.1109/TMM.2006.876282
17. Torres-Moreno, J.M.: Automatic Text Summarization. John Wiley & Sons, Hoboken (2014)
18. Torres-Moreno, J., Saggion, H., da Cunha, I., SanJuan, E., Velázquez-Morales, P.: Summary evaluation with and without references. Polibits 42, 13–19 (2010). https://polibits.cidetec.ipn.mx/ojs/index.php/polibits/article/view/42-2/1781
19. Zechner, K.: Spoken language condensation in the 21st century. In: 8th European Conference on Speech Communication and Technology (2003)
20. Zlatintsi, A., Iosif, E., Maragos, P., Potamianos, A.: Audio salient event detection and summarization using audio and text modalities. In: 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 2311–2315. IEEE (2015)
21. Zlatintsi, A., Maragos, P., Potamianos, A., Evangelopoulos, G.: A saliency-based approach to audio event detection and summarization. In: 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 1294–1298. IEEE (2012)
Multilingual Speech Emotion Recognition on Japanese, English, and German

Panikos Heracleous1(B), Keiji Yasuda1,2, and Akio Yoneyama1

1 Education and Medical ICT Laboratory, KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama 356-8502, Japan
{pa-heracleous,yoneyama}@kddi-research.jp
2 Nara Institute of Science and Technology, Ikoma, Japan
[email protected]
Abstract. The current study focuses on human emotion recognition based on speech, and particularly on multilingual speech emotion recognition using Japanese, English, and German emotional corpora. The proposed method exploits conditional random fields (CRF) classifiers in a two-level classification scheme. Specifically, in the first level, the language spoken is identified, and in the second level, speech emotion recognition is carried out using emotion models specific to the identified language. In both the first and second levels, CRF classifiers fed with acoustic features are applied. The CRF classifier is a popular probabilistic method for structured prediction, and is widely applied in natural language processing, computer vision, and bioinformatics. In the current study, the use of CRF in speech emotion recognition when limited training data are available is experimentally investigated. The results obtained show the effectiveness of using CRF when only a small amount of training data are available and methods based on deep neural networks (DNN) are less effective. Furthermore, the proposed method is also compared with two popular classifiers, namely support vector machines (SVM) and probabilistic linear discriminant analysis (PLDA), and higher accuracy was obtained using the proposed method. For the classification of four emotions (i.e., neutral, happy, angry, sad), the proposed method based on CRF achieved classification rates of 93.8% for English, 95.0% for German, and 88.8% for Japanese. These results are very promising, and superior to the results obtained in other similar studies on multilingual or even monolingual speech emotion recognition.

Keywords: Speech emotion recognition · Multilingual · Conditional random fields · Two-level classification · i-vector paradigm · Deep learning
1 Introduction
Automatic recognition of human emotions [1] is a relatively new field, and is attracting considerable attention in research and development areas because of
its high importance in real applications. Emotion recognition can be used in human-robot communication, when robots communicate with humans according to the detected human emotions, and also has an important role at call centers to detect the caller’s emotional state in cases of emergency (e.g., hospitals, police stations), or to identify the level of a customer’s satisfaction (i.e., providing feedback). In the current study, multilingual emotion recognition based on speech is experimentally investigated. Specifically, using English, German, and Japanese emotional speech data, multilingual emotion recognition experiments are conducted based on several classification approaches and the i-vector paradigm framework. Previous studies reported automatic speech emotion recognition using Gaussian mixture models (GMMs) [2], support vector machines [3], neural networks (NN) [4], and deep neural networks (DNN) [5]. Most studies in speech emotion recognition have focused solely on a single language, and cross-corpus speech emotion recognition has been addressed in only a few studies. In [6], experiments on emotion recognition are described using comparable speech corpora collected from American English and German interactive voice response systems, and the optimal set of acoustic and prosodic features for mono-, cross-, and multilingual anger recognition are computed. Cross-language speech emotion recognition based on HMMs and GMMs is reported in [7]. Four speech databases for cross-corpus classification, with realistic, non-prompted emotions and a large acoustic feature vector are reported in [8]. In the current study, however, multilingual speech emotion recognition using Japanese, English, and German corpora based on a two-level classification scheme is demonstrated. Specifically, spoken language identification and emotion recognition are integrated in a complete system capable of recognizing four emotions from English, German, and Japanese databases. In the first level, spoken language identification using emotional speech is performed, and in the second level the emotions are classified using acoustic models of the language identified in the first level. For classification in both the first and second levels, CRF classifiers are applied and compared to SVM and PLDA classifiers. A similar study –but with different objectives– is presented in [9]. In a more recent study [10], a three-layer perception model is used for multilingual speech emotion recognition using Japanese, Chinese, and German emotional corpora. In that specific study, the volume of training and test data used in classification is closely comparable with the data used in the current study, and, therefore, comparisons are, to some extent, possible. Although very limited training data were available, DNN and convolutional neural networks (CNN) were also considered for comparison purposes. Automatic language identification (LID) is a process whereby a spoken language is identified automatically. Applications of language identification include, but are not limited to, speech-to-speech translation systems, re-routing incoming calls to native speaker operators at call centers, and speaker diarization. Because of the importance of spoken language identification in real applications, many studies have addressed this issue. The approaches reported are categorized into
the acoustic-phonetic approach, the phonotactic approach, the prosodic approach, and the lexical approach [11]. In phonotactic systems [11,12], sequences of recognized phonemes obtained from phone recognizers are modeled. In [13], a typical phonotactic language identification system is used, where a language dependent phone recognizer is followed by parallel language models (PRLM). In [14], a universal acoustic characterization approach to spoken language recognition is proposed. Another method based on vector-space modeling is reported in [11,15], and presented in [16]. In acoustic modeling-based systems, different features are used to model each language. Earlier language identification studies reported methods based on neural networks [17,18]. Later, the first attempt at using deep learning has also been reported [19]. Deep neural networks for language identification were used in [20]. The method was compared with i-vector-based classification, linear logistic regression, linear discriminant analysis-based (LDA), and Gaussian modelingbased classifiers. In the case of a large amount of training data, the method demonstrated its superior performance. When limited training data were used, the i-vector yields the best identification rate. In [21] a comparative study on spoken language identification using deep neural networks was presented. Other methods based on DNN and recurrent neural networks (RNN) were presented in [22,23]. In [24], experiments on language identification using i-vectors and CRF were reported. The i-vector paradigm for language identification with SVM [25] was also applied in [26]. SVM with local Fisher discriminant analysis is used in [27]. Although significant improvements in LID have been achieved using phonotactic approaches, most state-of-the-art systems still rely on acoustic modeling.
2 Methods

2.1 Emotional Speech Data
Four professional female actors simulated Japanese emotional speech. These comprised neutral, happy, angry, sad, and mixed emotional states. Fifty-one utterances for each emotion were produced by each speaker. The sentences were selected from a Japanese book for children. The data were recorded at 48 kHz and down-sampled to 16 kHz, and they also contained short and longer utterances varying from 1.5 s to 9 s. Twenty-eight utterances were used for training and 20 for testing. The remaining utterances were excluded due to poor speech quality. For the English emotional speech data, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) set [28] was used. RAVDESS uses a set of 24 actors (12 male, 12 female) speaking and singing with various emotions, in a North American English accent, and contains 7,356 high-quality video recordings of emotionally-neutral statements, spoken and sung with a range of emotions. The speech set consists of the 8 emotional expressions: neutral, calm, happy, sad, angry, fearful, surprised, and disgusted. The song set consists of the 6 emotional expressions: neutral, calm, happy, sad, angry, and fearful. All emotions except
neutral are expressed at two levels of emotional intensity: normal and strong. There are 2,452 unique vocalizations, all of which are available in three formats: full audio-video (720p, H.264), video only, and audio only (wav). The database has been validated in a perceptual experiment involving 297 participants. The data are encoded as 16-bit, 48-kHz wav files, and down-sampled to 16 kHz. In the current study, 96 utterances for neutral, happy, angry, and sad emotional states were used. For training, 64 utterances were used for each emotion, and 32 for testing. The German database used was the Berlin database [29], which includes seven emotional states: anger, boredom, disgust, anxiety, happiness, sadness, and neutral speech. The utterances were produced by ten professional German actors (5 female, 5 male) speaking ten sentences with an emotionally neutral content but expressed with the seven different emotions. The actors produced 69 frightened, 46 disgusted, 71 happy, 81 bored, 79 neutral, 62 sad, and 127 angry emotional sentences. For training, 42 utterances were used in the study, and for testing, 20 utterances, in the neutral, happy, angry, and sad modes.

2.2 Classification Approaches
Conditional Random Fields (CRF). CRF is a modern approach similar to HMMs, but of a different nature. CRFs are undirected graphical models, a special case of conditionally trained finite state machines. They are discriminative models, which maximize the conditional probability of observation and state sequences. CRFs assume frame dependence, and as a result context is also considered. The main advantage of CRFs is their flexibility to include a wide variety of non-independent features. CRFs have been successfully used for meeting segmentation [30], for phone classification [31], and for event recognition and classification [32]. A language identification method based on deep-structured CRFs has been reported in [33]. The current study is based on the popular and very simple linear-chain CRF, along with a low-dimensional feature representation using i-vectors. Similarly to [34] for object recognition using CRF, each input sentence is represented by a single vector (i.e., an i-vector); this scenario differs from conventional classification approaches in machine learning, where the input space is represented as a set of feature vectors. In CRF, the probability of a class label k given the observation sequence o = (o_1, o_2, ..., o_T) is given by the following equation:

p(k|o, \lambda) = \frac{1}{z(o, \lambda)} \sum_{s \in k} e^{\lambda \cdot f(k, s, o)}    (1)

where λ is the parameter vector, f is the sufficient statistics vector, and s = (s_1, s_2, ..., s_T) is a hidden state sequence. The function z(o, λ) ensures that the model forms a properly normalized probability and is defined as:

z(o, \lambda) = \sum_{k} \sum_{s \in k} e^{\lambda \cdot f(k, s, o)}    (2)
Figure 1 demonstrates the structure of HMM and CRF models.
Fig. 1. Structures of hidden Markov models (HMM) and conditional random fields (CRF).
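A hedged sketch of such a linear-chain CRF classifier over i-vectors, using sklearn-crfsuite (not necessarily the toolkit used by the authors) and treating each utterance as a length-one sequence, is:

```python
# CRF classification over i-vectors; regularization values are illustrative.
import sklearn_crfsuite

def to_features(ivector):
    return {f"dim_{d}": float(v) for d, v in enumerate(ivector)}

def train_crf(train_ivecs, train_labels):
    """train_ivecs: list of 100-dim i-vectors; train_labels: e.g. 'happy', 'sad', ..."""
    X = [[to_features(v)] for v in train_ivecs]      # one observation per sequence
    y = [[label] for label in train_labels]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, y)
    return crf

# predictions = [seq[0] for seq in crf.predict([[to_features(v)] for v in test_ivecs])]
```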
Support Vector Machines (SVM). A support vector machine (SVM) is a two-class classifier constructed from sums of a kernel function K(·, ·):

f(x) = \sum_{i=1}^{L} \alpha_i t_i K(x, x_i) + d    (3)

where the t_i are the ideal outputs, \sum_{i=1}^{L} \alpha_i t_i = 0, and \alpha_i > 0. An SVM is a discriminative classifier, which is widely used in regression and classification. Given a set of labeled training samples, the algorithm finds the optimal hyperplane, which categorizes new samples. SVM is among the most popular machine learning methods. The advantages of SVM include support for high dimensionality, memory efficiency, and versatility. However, when the number of features exceeds the number of samples, the SVM performs poorly. Another disadvantage is that the SVM is not probabilistic, because it works by categorizing objects based on the optimal hyperplane. Originally, SVMs were used for binary classification. Currently, the multi-class SVM, a variant of the conventional SVM, is widely used in solving multi-class classification problems. The most common way to build a multi-class SVM is to use K one-versus-rest binary classifiers (commonly referred to as "one-versus-all" or OVA classification). Another strategy is to build one-versus-one classifiers, and to choose the class that is selected by the most classifiers. In this case, K(K-1)/2 classifiers are required, and the training time decreases because less training data are used for each classifier. Probabilistic Linear Discriminant Analysis (PLDA). PLDA is a popular technique for dimension reduction using the Fisher criterion. Using PLDA, new axes are found which maximize the discrimination between the different classes. PLDA was originally applied to face recognition, and can be used to specify
Multilingual Speech Emotion Recognition on Japanese, English, and German
367
Fig. 2. Classification scheme based on the i-vector paradigm.
a generative model of the i- vector representation. A study using UBM-based LDA for speaker recognition was also presented in [35]. Adapting this to language identification and emotion classification, for the i-th language or emotion, the i-vector wi,j representing the j-th recording can be formulated as: wi,j = m + Sxi + ei,j
(4)
where S represents the between-language or between-emotion variability, and the latent variable x is assumed to have a standard normal distribution, and to represent a particular language or emotion and channel. The residual term ei,j represents the within-language or within-emotion variability, and it is assumed to have a normal distribution. Figure 2 shows the two-level classification scheme used in the current study. 2.3
Shifted Delta Cepstral (SDC) Coefficients
Previous studies showed that language identification performance is improved by using SDC feature vectors, which are obtained by concatenating delta cepstra across multiple frames. The SDC features are described by the N number of cepstral coefficients, d time advance and delay, k number of blocks concatenated for the feature vector, and P time shift between consecutive blocks. For each SDC final feature vector, kN parameters are used. In contrast, in the case of conventional cepstra and delta cepstra feature vectors, 2N parameters are used. The SDC is calculated as follows: Δc(t + iP ) = c(t + iP + d) − c(t + iP − d)
(5)
368
P. Heracleous et al.
The final vector at time t is given by the concatenation of all Δc(t + iP ) for all 0 ≤ i < k, where c(t) is the original feature value at time t. In the current study, SDC coefficients were used not only in spoken language identification, but also in emotion classification. 2.4
Feature Extraction
In automatic speech recognition, speaker recognition, and language identification, mel-frequency cepstral coefficients (MFCC) are among the most popular and most widely used acoustic features. Therefore, in modeling the languages being identified and the emotions being recognized, this study similarly used 12 MFCC, concatenated with SDC coefficients to form feature vectors of length 112. The MFCC features were extracted every 10 ms using a window-length of 20 ms. The extracted acoustic features were used to construct the i-vectors used in emotion and spoken language identification modeling and classification. A widely used approach for speaker recognition is based on Gaussian mixture models (GMM) with universal background models (UBM). The individual speaker models are created using maximum a posteriori (MAP) adaptation of the UBM. In many studies, GMM supervectors are used as features. The GMM supervectors are extracted by concatenating the means of the adapted model. The problem of using GMM supervectors is their high dimensionality. To address this issue, the i-vector paradigm was introduced which overcomes the limitations of high dimensionality. In the case of i-vectors, the variability contained in the GMM supervectors is modeled with a small number of factors, and the whole utterance is represented by a low dimensional i-vector of 100-400 dimension. Considering language identification, an input utterance can be modeled as: M = m + Tw
(6)
where M is the language-dependent supervector, m is the language-independent supervector, T is the total variability matrix, and w is the i-vector. Both the total variability matrix and language-independent supervector are estimated from the complete set of the training data. The same procedure is used to extract i-vectors used in speech emotion recognition. 2.5
Evaluation Measures
In the current study, the equal error rates (EER) (i.e., equal false alarms and miss probability) and the classification rates are used as evaluation measures. The classification rate is defined as: n 1 N o. of corrects f or class k · 100 (7) acc = n N o. of trials f or class k k=1
where n is the number of the emotions. In addition, the detection error trade-off (DET) curves, which show the function of miss probability and false alarms, are also given.
Multilingual Speech Emotion Recognition on Japanese, English, and German
3
369
Results
This section presents the results for multilingual emotion classification based on a two-level classification scheme using Japanese, English, and German corpora. 3.1
Spoken Language Identification Using Emotional Speech Data
The i-vectors used in modeling and classification are constructed using MFCC features and SDC coefficients. For training, 160 utterances from each language are used, and 80 utterances for testing. The dimension of the i-vectors is set to 100, and 256 Gaussian components are used in the UBM-GMM. Due to the fact that only three target languages are used, the identification was perfect almost in all cases (except in the case of using PLDA was 98.8%). On the other hand, it should be noted that language identification is conducted using emotional speech data, and this result indicates that spoken language classification using emotional speech data does not present any particular difficulties compared to normal speech. 3.2
Emotion Recognition Based on a Two-Level Classification Scheme
Table 1 shows the average emotion classification rates when using MFCC features only. As shown, high classification rates are being obtained. The results show that the two classifiers based on DNN and CNN show lower rates (except for Japanese). A possible reason may be the small volume of training data in the case of English and German. Table 2 shows the average classification rates when using MFCC features along with SDC coefficients. As shown, the CRF classifier shows superior performance in most of cases, followed by SVM. The results show that using SDC coefficients along with MFCC features improves classification rates. This result indicates that SDC coefficients are effective not only in spoken language identification, but also in speech emotion recognition. Note, however, that in this case of DNN and CNN, small or no improvements are being obtained. The results Table 1. Average emotion classification rates when using MFCC features for the ivector construction. Classifier Language Japanese English German PLDA
85.2
77.3
91.7
CRF
79.4
87.5
90.0
SVM
82.8
80.5
91.3
DNN
90.6
68.3
85.2
CNN
90.2
71.0
88.7
370
P. Heracleous et al.
The results indicate that, due to the limited training data, DNN and CNN are less effective for this task.

Table 2. Average emotion classification rates when using MFCC features and SDC coefficients for the i-vector construction.

Classifier  Japanese  English  German
PLDA        87.6      90.9     91.7
CRF         88.8      93.8     95.0
SVM         90.9      93.0     95.0
DNN         83.7      76.2     82.7
CNN         88.8      77.1     84.5

Table 3 shows the classification rates for the four emotions when using the CRF classifier and MFCC features along with SDC coefficients. In the case of Japanese, the average accuracy was 88.8%; in the case of English, the average was 93.8%; and in the case of German, a 95.0% accuracy was obtained. Concerning the German corpus, the results obtained are significantly higher than the results reported in [36], where the same corpus was used.

Table 3. Emotion classification rates when using the CRF classifier and MFCC features, along with SDC coefficients, for the i-vector construction.

Corpus    Neutral  Happy  Anger  Sad    Average
Japanese  85.0     83.8   88.8   97.5   88.8
English   87.5     100.0  96.9   90.6   93.8
German    100.0    95.0   100.0  85.0   95.0

Table 4 shows the individual classification rates when SVM was used. In the case of Japanese, an 82.8% average accuracy was achieved; in the case of English, the average accuracy was 91.4%; and when using the German corpus, the average accuracy was 95.0%.

Table 4. Emotion classification rates when using the SVM classifier and MFCC features, along with SDC coefficients, for the i-vector construction.

Corpus    Neutral  Happy  Anger  Sad    Average
Japanese  92.5     70.0   81.3   87.5   82.8
English   84.4     100.0  93.8   87.5   91.4
German    95.0     95.0   100.0  90.0   95.0

Table 5 shows the recognition rates when using the PLDA classifier. The average accuracy for Japanese was 85.2%, the accuracy for English was 90.2%, and for the German corpus an accuracy of 89.3% was achieved.

Table 5. Emotion classification rates when using the PLDA classifier and MFCC features, along with SDC coefficients, for the i-vector construction.

Corpus    Neutral  Happy  Anger  Sad    Average
Japanese  72.8     86.4   90.1   91.4   85.2
English   93.9     90.9   97.0   78.8   90.2
German    95.2     81.0   85.7   95.2   89.3

The results show that when using CRF, superior performance was obtained, followed by SVM. The lowest rates were obtained when the PLDA classifier was used. The results also show that the emotion sad is recognized with the highest rates in most cases.
3.3 Emotion Recognition Using Multilingual Emotion Models
In this baseline approach, a single-level classification scheme is used. Using emotional speech data from the Japanese, English, and German languages, common emotion models are trained. For training, 112 Japanese, 64 English, and 40 German i-vectors are used for each emotion. For testing, 80 Japanese, 32 English, and 20 German i-vectors are used for each emotion. Since using SDC coefficients also improves the performance of the two-level approach, in this method i-vectors are constructed using MFCC features in conjunction with SDC coefficients. Table 6 shows the classification rates. As shown, using a universal multilingual model, the average emotion classification accuracies for the three languages are 75.2%, 77.7%, and 75.0% when using the PLDA, CRF, and SVM classifiers, respectively. This is a promising result and superior to the results obtained in other similar studies. While the rates achieved are lower than with the two-level approach, this approach uses a single level with reduced system complexity (i.e., language identification is not applied). Furthermore, the classification rates may be improved with a larger amount of training data. These results show that i-vectors can efficiently be applied to multilingual emotion recognition when universal, multilingual emotion models are used. The results also show that, in most cases, the performance of the CRF classifier is superior.

Table 6. Average emotion classification rates when using a universal emotion model with MFCC features and SDC coefficients for the i-vector construction.

Classifier  Japanese  English  German  Average
PLDA        75.3      68.2     82.1    75.2
CRF         80.9      73.4     78.8    77.7
SVM         76.9      71.9     76.3    75.0
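As a rough illustration of this universal model, the sketch below trains a single multilingual SVM on pooled i-vectors; the random placeholder data, the emotion label set, and the SVM hyperparameters are assumptions for illustration only, not the paper's actual setup.

```python
# Minimal sketch of a universal (multilingual) emotion model: one SVM trained on
# i-vectors pooled across Japanese, English, and German. All data here is random
# placeholder data; counts follow the split described above.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
emotions = ["neutral", "happy", "anger", "sad"]

# (112 + 64 + 40) pooled training i-vectors per emotion, 100-dimensional.
X_train = rng.normal(size=(4 * (112 + 64 + 40), 100))
y_train = np.repeat(emotions, 112 + 64 + 40)

model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)

# e.g., the 80 Japanese test i-vectors per emotion (placeholder data).
X_test = rng.normal(size=(4 * 80, 100))
y_test = np.repeat(emotions, 80)
print("accuracy:", model.score(X_test, y_test))
```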
Table 7. Equal error rates (EER) when using a universal emotion model with MFCC features and SDC coefficients for the i-vector construction.

Classifier  Japanese  English  German  Average
PLDA        22.6      22.6     17.5    20.9
CRF         17.6      22.9     17.5    19.3
SVM         22.5      26.8     20.4    23.2
[DET curves plotting False Negative Rate (FNR) [%] against False Positive Rate (FPR) [%] for the SVM, PLDA, and CRF classifiers: (a) Japanese, (b) English, (c) German.]
Fig. 3. DET curves for the three languages used in emotion classification when using common multilingual models.
Table 7 shows the EER when a universal, multilingual emotion model is used. As shown, the EER for German is the lowest among the three languages, followed by the EER for Japanese. The EERs averaged over the three languages are 20.9%, 19.3%, and 23.2% when using the PLDA, CRF, and SVM classifiers, respectively. In this case, too, the lowest EERs were obtained using the CRF classifier. Figure 3 shows the DET curves for multilingual emotion recognition using a universal emotion model.
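For reference, a minimal sketch of how an EER such as those in Table 7 can be computed from per-utterance scores is given below; the scikit-learn ROC utilities and the synthetic scores are illustrative assumptions, not part of the paper's toolchain.

```python
# Minimal sketch: equal error rate (EER) from detection scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for target emotion, 0 otherwise; scores: classifier confidence."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER is the operating point where false positive and false negative rates cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Example with synthetic scores (for illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.normal(scale=0.8, size=200)
print(f"EER: {100 * equal_error_rate(labels, scores):.1f}%")
```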
4 Discussion
Although using real-world emotional speech data would represent a more realistic situation, acted emotional speech data are widely used in speech emotion classification. Furthermore, the current study mainly investigated classification schemes and feature extraction methods, so using acted speech is a reasonable and acceptable approach. Because of the limited emotional data, deep learning approaches to multilingual emotion recognition were not investigated. Instead, a method is proposed that integrates spoken language identification and emotion classification. In addition to the SVM and PLDA classifiers, the CRF classifier is also used in combination with the i-vector paradigm. The results obtained show the advantage of using the CRF classifier, especially when limited data are available. For comparison purposes, deep neural networks were also considered; because of the limited training data, however, the classification rates when using DNN and CNN were significantly lower. In order to address the problems associated with using acted speech, an initiative to obtain a large quantity of spontaneous emotional speech is currently being undertaken. With such data, it will also be possible to analyze the behavior of additional classifiers, such as deep neural networks, and to investigate the problem of multilingual speech emotion recognition in realistic situations (e.g., noisy or reverberant environments).
5 Conclusions
The current study experimentally investigated multilingual speech emotion classification. A two-level classification approach was used, integrating spoken language identification and emotion recognition. The proposed method was based on the CRF classifier and the i-vector paradigm. When classifying four emotions, the proposed method achieved a 93.8% classification rate for English, a 95.0% rate for German, and an 88.8% rate for Japanese. These results are very promising and demonstrate the effectiveness of the proposed methods for multilingual speech emotion recognition. An initiative to obtain realistic, spontaneous emotional speech data for a large number of languages is currently being undertaken. As future work, the effect of noise and reverberation will also be investigated.
References

1. Busso, C., Bulut, M., Narayanan, S.: Toward effective automatic recognition systems of emotion in speech. In: Gratch, J., Marsella, S. (eds.) Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction, pp. 110–127. Oxford University Press, New York (2013)
2. Tang, H., Chu, S., Johnson, M.H.: Emotion recognition from speech via boosted Gaussian mixture models. In: Proceedings of ICME, pp. 294–297 (2009)
3. Pan, Y., Shen, P., Shen, L.: Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012)
4. Nicholson, J., Takahashi, K., Nakatsu, R.: Emotion recognition in speech using neural networks. Neural Comput. Appl. 9(4), 290–296 (2000)
5. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Proceedings of Interspeech, pp. 223–227 (2014)
6. Polzehl, T., Schmitt, A., Metze, F.: Approaching multi-lingual emotion recognition from speech - on language dependency of acoustic/prosodic features for anger detection. In: Proceedings of Speech Prosody (2010)
7. Bhaykar, M., Yadav, J., Rao, K.S.: Speaker dependent, speaker independent and cross language emotion recognition from speech using GMM and HMM. In: 2013 National Conference on Communications (NCC), pp. 1–5. IEEE (2013)
8. Eyben, F., Batliner, A., Schuller, B., Seppi, D., Steidl, S.: Cross-corpus classification of realistic emotions - some pilot experiments. In: Proceedings of the Third International Workshop on EMOTION (satellite of LREC) (2010)
9. Sagha, H., Matejka, P., Gavryukova, M., Povolny, F., Marchi, E., Schuller, B.: Enhancing multilingual recognition of emotion in speech by language identification. In: Proceedings of Interspeech (2016)
10. Li, X., Akagi, M.: A three-layer emotion perception model for valence and arousal-based detection from multilingual speech. In: Proceedings of Interspeech, pp. 3643–3647 (2018)
11. Li, H., Ma, B., Lee, K.A.: Spoken language recognition: from fundamentals to practice. Proc. IEEE 101(5), 1136–1159 (2013)
12. Zissman, M.A.: Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4(1), 31–44 (1996)
13. Caseiro, D., Trancoso, I.: Spoken language identification using the SpeechDat corpus. In: Proceedings of ICSLP 1998 (1998)
14. Siniscalchi, S.M., Reed, J., Svendsen, T., Lee, C.-H.: Universal attribute characterization of spoken languages for automatic spoken language recognition. Comput. Speech Lang. 27, 209–227 (2013)
15. Lee, C.-H.: Principles of spoken language recognition. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds.) Springer Handbook of Speech Processing, pp. 785–796. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-49127-9_39
16. Reynolds, D.A., Campbell, W.M., Shen, W., Singer, E.: Automatic language recognition via spectral and token based approaches. In: Benesty, J., Sondhi, M.M., Huang, Y.A. (eds.) Springer Handbook of Speech Processing, pp. 811–824. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-49127-9_41
17. Cole, R., Inouye, J., Muthusamy, Y., Gopalakrishnan, M.: Language identification with neural networks: a feasibility study. In: Proceedings of IEEE Pacific Rim Conference, pp. 525–529 (1989)
18. Leena, M., Rao, K.S., Yegnanarayana, B.: Neural network classifiers for language identification using phonotactic and prosodic features. In: Proceedings of Intelligent Sensing and Information Processing, pp. 404–408 (2005)
19. Montavon, G.: Deep learning for spoken language identification. In: NIPS Workshop on Deep Learning for Speech Recognition and Related Applications (2009)
20. Moreno, I.L., Dominguez, J.G., Plchot, O., Martinez, D., Rodriguez, J.G., Moreno, P.: Automatic language identification using deep neural networks. In: Proceedings of ICASSP, pp. 5337–5341 (2014)
21. Heracleous, P., Takai, K., Yasuda, K., Mohammad, Y., Yoneyama, A.: Comparative study on spoken language identification based on deep learning. In: Proceedings of EUSIPCO (2018)
22. Jiang, B., Song, Y., Wei, S., Liu, J.-H., McLoughlin, I.V., Dai, L.-R.: Deep bottleneck features for spoken language identification. PLoS ONE 9(7), 1–11 (2010)
23. Zazo, R., Diez, A.L., Dominguez, J.G., Toledano, D.T., Rodriguez, J.G.: Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS ONE 11(1), e0146917 (2016)
24. Heracleous, P., Mohammad, Y., Takai, K., Yasuda, K., Yoneyama, A.: Spoken language identification based on i-vectors and conditional random fields. In: Proceedings of IWCMC, pp. 1443–1447 (2018)
25. Cristianini, N., Taylor, J.S.: Support Vector Machines. Cambridge University Press, Cambridge (2000)
26. Dehak, N., Carrasquillo, P.A.T., Reynolds, D., Dehak, R.: Language recognition via i-vectors and dimensionality reduction. In: Proceedings of Interspeech, pp. 857–860 (2011)
27. Shen, P., Lu, X., Liu, L., Kawai, H.: Local Fisher discriminant analysis for spoken language identification. In: Proceedings of ICASSP, pp. 5825–5829 (2016)
28. Livingstone, S.R., Peck, K., Russo, F.A.: RAVDESS: the Ryerson audio-visual database of emotional speech and song. In: 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS), Kingston, ON (2012)
29. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of Interspeech (2005)
30. Reiter, S., Schuller, B., Rigoll, G.: Hidden conditional random fields for meeting segmentation. In: Proceedings of ICME, pp. 639–642 (2007)
31. Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random fields for phone classification. In: Proceedings of Interspeech, pp. 1117–1120 (2005)
32. Llorens, H., Saquete, E., Colorado, B.N.: TimeML events recognition and classification: learning CRF models with semantic roles. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 725–733 (2010)
33. Yu, D., Wang, S., Karam, Z., Deng, L.: Language recognition using deep-structured conditional random fields. In: Proceedings of ICASSP, pp. 5030–5033 (2010)
34. Quattoni, A., Collins, M., Darrell, T.: Conditional random fields for object recognition. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 17, pp. 1097–1104. MIT Press (2005)
35. Yu, C., Liu, G., Hansen, J.H.L.: Acoustic feature transformation using UBM-based LDA for speaker recognition. In: Proceedings of Interspeech, pp. 1851–1854 (2014)
36. Li, X., Akagi, M.: Multilingual speech emotion recognition system based on a three-layer model. In: Proceedings of Interspeech, pp. 3606–3612 (2016)
Text Categorization
On the Use of Dependencies in Relation Classification of Text with Deep Learning

Bernard Espinasse, Sébastien Fournier, Adrian Chifu(B), Gaël Guibon, René Azcurra, and Valentin Mace

Aix-Marseille Université, Université de Toulon, CNRS, LIS, Marseille, France
{bernard.espinasse,sebastien.fournier,adrian.chifu,gael.guibon,rene.azcurra,valentin.mace}@lis-lab.fr

Abstract. Deep Learning is more and more used in NLP tasks such as relation classification of texts. This paper assesses the impact of syntactic dependencies on this task at two levels. The first level concerns the generic Word Embedding (WE) used as input of the classification model; the second level concerns the corpus whose relations have to be classified. Two classification models are studied: the first is based on a CNN using a generic WE and does not take into account the dependencies of the corpus to be treated, while the second is based on a compositional WE combining a generic WE with syntactic annotations of the corpus to be classified. The impact of dependencies in relation classification is estimated using two different WEs. The first is essentially lexical and trained on the English Wikipedia corpus, while the second is also syntactic, trained on the same corpus previously annotated with syntactic dependencies. The two classification models are evaluated on the SemEval 2010 reference corpus using these two generic WEs. The experiments show the importance of taking dependencies into account at different levels in relation classification.

Keywords: Dependencies · Relation classification · Deep learning · Word embedding · Compositional word embedding

1 Introduction
Deep Learning is more and more used for various tasks of Natural Language Processing (NLP), such as relation classification from text. It should be recalled that Deep Learning emerged mainly with convolutional neural networks (CNNs), originally proposed in computer vision [5]. These CNNs were later used in language processing to solve problems such as sequence labelling [1], semantic analysis (semantic parsing) [11], relation extraction, etc. CNNs are the most commonly used deep neural network models for the relation classification task. One of the first contributions is certainly the basic CNN model proposed by Lui et al. (2013) [7]. We can then mention the model proposed by Zeng et al. (2014) [13] with max-pooling, and the model proposed by Nguyen and Grishman (2015) [10] with multi-size windows.
The performance of these CNN-based relation classification models is low in terms of precision and recall. These low performances can be explained by two reasons. First, despite their success, CNNs have a major limitation in language processing: they were invented to manipulate pixel arrays in image processing, and therefore only take into account consecutive sequential n-grams on the surface string. Thus, in relation classification, CNNs do not consider long-distance syntactic dependencies, although these dependencies play a very important role in linguistics, particularly in the treatment of negation and subordination, which is fundamental in sentiment analysis, for example [8]. Second, Deep Learning-based relation classification models generally use as input a representation of words obtained by lexical immersion, or Word Embedding (WE), trained on a large corpus. Skip-Gram or Continuous Bag-of-Words WE models generally only consider the local context of a word, in a window of a few words before and a few words after, without considering syntactic characteristics. Consequently, syntactic dependencies are not taken into account in relation classification using Deep Learning-based models. As syntactic dependencies play a very important role in linguistics, it makes sense to take them into account for relation classification or extraction. These dependencies can be taken into account in Deep Learning models at different levels. At a first level (Syntactical Word Embedding), dependencies are taken into account upstream, at the basic word representation level, in a generic WE trained on a large syntactically annotated corpus and generated with a specific tool. At a second level, related to the relation classification corpus (Compositional Word Embedding), a generic WE trained on a large corpus is combined with specific features, such as dependencies, extracted from the words in the sentences of the corpus to be classified. This paper assesses the impact of syntactic dependencies in relation classification at these two levels. The paper is organized as follows. In Sect. 2, we present a generic syntactical WE trained on a large corpus previously annotated with syntactic dependencies, which considers, for each word, the dependencies in which it is involved. In Sect. 3, two Deep Learning models for relation classification are presented. The first model, which we have developed, is based on a CNN using as input a generic WE trained on a large corpus, complemented by a positional embedding of the corpus to classify. The second model, the FCM model, implemented with a perceptron-type neural network, is based on a compositional WE strategy, using as input a combination of a generic WE with specific syntactic features from the corpus whose relations are to be classified. In Sect. 4, we present the results of experiments obtained with these two relation classification models on the SemEval 2010 reference corpus using different WEs. Finally, we conclude by reviewing our work and presenting some perspectives for future research.
2 A Syntactical Word Embedding Taking into Account Dependencies
In the Deep Learning approach, relation classification models generally use as input a representation of the words of a natural language obtained by lexical immersion, or Word Embedding (WE). We can distinguish two main WE models: Skip-Gram and Continuous Bag-of-Words. These WEs only consider the local context of a word, in a window of a few words before and a few words after; syntactic dependencies are not taken into account in these WE models, whereas they play a very important role in NLP tasks. Consider a classic Bag-of-Words WE that takes into account the neighbours upstream and downstream of a word, according to a defined window, and the following sentence: "Australian scientist discovers star with telescope". With a 2-word window, the contexts of the word discovers are Australian, scientist, star and with. This misses the important context telescope. By setting the window to 5, one can capture more topical content, for example the word telescope, but this also weakens the importance of information targeted on the word itself. A more relevant word contextualization consists in integrating the different syntactic dependencies in which the word participates, dependencies that can involve words that are very far apart in the text. The syntactic dependencies that we consider in this paper are the Stanford Dependencies defined by [2]. Note that syntactic dependencies are both more inclusive and more focused than Bag-of-Words contexts. They capture relations with distant words that are out of reach of a small Bag-of-Words window (for example, the instrument of the discovery is the telescope, via the preposition with), and they also filter out incidental contexts that are in the window but not directly related to the target word (for example, Australian is not used as a context for discovers). Levy and Goldberg [6] proposed a generalization of the Skip-Gram WE model in which the linear Bag-of-Words contexts are replaced by arbitrary contexts, in particular contexts based on syntactic dependencies, which produces similarities of a very different kind. They also demonstrated how the resulting WE model can be queried for the discriminative contexts of a given word, and observed that the learning procedure seems to favour relatively local syntactic contexts, as well as conjunctions and prepositional objects. Levy and Goldberg developed a variant of the Word2Vec tool [9], named Word2Vec-f (https://bitbucket.org/yoavgo/Word2Vecf), based on such a syntactic dependency-based contextualization.
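To make the notion of dependency-based contexts concrete, the sketch below extracts (word, context) pairs in the spirit of Levy and Goldberg from the example sentence; the use of spaCy and its dependency labels is an assumption made for illustration (the paper relies on Stanford dependencies and the Word2Vec-f tool).

```python
# Minimal sketch: dependency-based (word, context) pairs for a dependency-based WE.
# spaCy and its label set are illustrative assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Australian scientist discovers star with telescope")

pairs = []
for token in doc:
    for child in token.children:
        # Each dependency arc yields two contexts: head -> (child, label)
        # and the inverse child -> (head, label^-1).
        pairs.append((token.text, f"{child.text}/{child.dep_}"))
        pairs.append((child.text, f"{token.text}/{child.dep_}-1"))

for word, ctx in sorted(pairs):
    print(word, ctx)
# In Levy and Goldberg's setup, prepositions are additionally collapsed, so that
# "discovers" directly receives the context "telescope/prep_with"; this sketch
# omits that step for brevity.
```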
3 Two Models for Relation Classification Using Syntactical Dependencies
In this section, two models for relation classification by supervised Deep Learning are presented; they are used in the experiments described in Sect. 4. The first is a model that we developed, based on a CNN using WEs trained on a large corpus. The second, proposed by [4], is based on a compositional WE combining a generic WE with a syntactic annotation of the corpus whose relations are to be classified.
3.1 A CNN-Based Relation Classification Model (CNN)
The first model, which we have developed, is based on a CNN using a generic WE trained on a large corpus. This model is inspired by the one used by Nguyen and Grishman (2015) [10]; it takes as input a WE obtained with either the Word2Vec or the Word2Vec-f tool, and a positional embedding relative to the corpus from which we want to extract relations. The architecture of our CNN network (Fig. 1) consists of five main layers:
Fig. 1. CNN architecture.
– Two convolutional layers, each with a defined number and size of convolutional filters to capture the characteristics of the pre-processed input; the filter size differs between the two layers. Each convolutional layer is followed by a pooling layer (max pooling) with an aggregation function (max) that identifies the most important characteristics produced by the output vector of the convolutional layer.
– A fully connected layer that uses the ReLU (Rectified Linear Unit) activation function.
– A fully connected layer using the Softmax activation function to classify the relations to be found.
– A logistic regression layer that optimizes the network weights, with a function that updates these values iteratively on the training data.

This architecture is implemented on the TensorFlow platform (version 1.8; https://www.tensorflow.org/), using the Tflearn API (http://tflearn.org/), which facilitates its implementation and experimentation.
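The following is a rough Keras sketch of such a network (the paper's own implementation uses TensorFlow 1.8 with the Tflearn API); the sequence length, embedding dimensions, filter counts and sizes are illustrative assumptions, and the final optimization step is folded into the model compilation.

```python
# Rough sketch of a CNN for relation classification over pre-computed
# word + positional embeddings. All dimensions are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

seq_len = 100          # tokens per sentence (assumed)
emb_dim = 300 + 2 * 5  # generic WE plus two positional embeddings (assumed sizes)
n_classes = 19         # e.g., the 19 relation labels of SemEval 2010 Task 8

model = models.Sequential([
    # First convolutional layer and its max-pooling layer.
    layers.Conv1D(128, kernel_size=3, activation="relu",
                  input_shape=(seq_len, emb_dim)),
    layers.MaxPooling1D(pool_size=2),
    # Second convolutional layer with a different filter size, also max-pooled.
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    # Fully connected ReLU layer, then Softmax over the relation classes.
    layers.Dense(128, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```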
3.2 A Compositional Word Embedding Based Relation Classification Model (FCM)
This model, named FCM (Factor-based Compositional embedding Model), was proposed by Gormley, Yu, and Dredze (2014-2015) [3,4]. The FCM model is
based on a compositional WE, which combines a generic WE trained on a large corpus with specific features at the syntactic level from the corpus whose relations are to be classified. More precisely, this compositional WE is developed by combinin