Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part II [1st ed.] 978-3-030-15718-0;978-3-030-15719-7


LNCS 11438

Leif Azzopardi · Benno Stein · Norbert Fuhr · Philipp Mayr · Claudia Hauff · Djoerd Hiemstra (Eds.)

Advances in Information Retrieval 41st European Conference on IR Research, ECIR 2019 Cologne, Germany, April 14–18, 2019 Proceedings, Part II


Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board Members
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA

11438

More information about this series at http://www.springer.com/series/7409

Leif Azzopardi · Benno Stein · Norbert Fuhr · Philipp Mayr · Claudia Hauff · Djoerd Hiemstra (Eds.)

Advances in Information Retrieval 41st European Conference on IR Research, ECIR 2019 Cologne, Germany, April 14–18, 2019 Proceedings, Part II


Editors
Leif Azzopardi, University of Strathclyde, Glasgow, UK
Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany
Norbert Fuhr, Universität Duisburg-Essen, Duisburg, Germany
Philipp Mayr, GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
Claudia Hauff, Delft University of Technology, Delft, The Netherlands
Djoerd Hiemstra, University of Twente, Enschede, The Netherlands

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-15718-0 ISBN 978-3-030-15719-7 (eBook) https://doi.org/10.1007/978-3-030-15719-7 Library of Congress Control Number: 2019934339 LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The 41st European Conference on Information Retrieval (ECIR) was held in Cologne, Germany, during April 14–18, 2019, and brought together hundreds of researchers from Europe and abroad. The conference was organized by GESIS – Leibniz Institute for the Social Sciences and the University of Duisburg-Essen—in cooperation with the British Computer Society’s Information Retrieval Specialist Group (BCS-IRSG). These proceedings contain the papers, presentations, workshops, and tutorials given during the conference. This year the ECIR 2019 program boasted a variety of novel work from contributors from all around the world and provided new platforms for promoting information retrieval-related (IR) activities from the CLEF Initiative. In total, 365 submissions were fielded across the tracks from 50 different countries. The final program included 39 full papers (23% acceptance rate), 44 short papers (29% acceptance rate), eight demonstration papers (67% acceptance rate), nine reproducibility full papers (75% acceptance rate), and eight invited CLEF papers. All submissions were peer reviewed by at least three international Program Committee members to ensure that only submissions of the highest quality were included in the final program. As part of the reviewing process we also provided more detailed review forms and guidelines to help reviewers identify common errors in IR experimentation as a way to help ensure consistency and quality across the reviews. The accepted papers cover the state of the art in IR: evaluation, deep learning, dialogue and conversational approaches, diversity, knowledge graphs, recommender systems, retrieval methods, user behavior, topic modelling, etc., and also included novel application areas beyond traditional text and Web documents such as the processing and retrieval of narrative histories, images, jobs, biodiversity, medical text, and math. The program boasted a high proportion of papers with students as first authors, as well as papers from a variety of universities, research institutes, and commercial organizations. In addition to the papers, the program also included two keynotes, four tutorials, four workshops, a doctoral consortium, and an industry day. The first keynote was presented by this year’s BCS IRSG Karen Sparck Jones Award winner, Prof. Krisztian Balog, On Entities and Evaluation, and the second keynote was presented by Prof. Markus Strohmaier, On Ranking People. The tutorials covered a range of topics from conducting lab-based experiments and statistical analysis to categorization and deep learning, while the workshops brought together participants to discuss algorithm selection (AMIR), narrative extraction (Text2Story), Bibliometrics (BIR), as well as social media personalization and search (SoMePeAS). As part of this year’s ECIR we also introduced a new CLEF session to enable CLEF organizers to report on and promote their upcoming tracks. In sum, this added to the success and diversity of ECIR and helped build bridges between communities. The success of ECIR 2019 would not have been possible without all the help from the team of volunteers and reviewers. We wish to thank all our track chairs for


coordinating the different tracks along with the teams of meta-reviewers and reviewers who helped ensure the high quality of the program. We also wish to thank the demo chairs: Christina Lioma and Dagmar Kern; student mentorship chairs: Ahmet Aker and Laura Dietz; doctoral consortium chairs: Ahmet Aker, Dimitar Dimitrov and Zeljko Carevic; workshop chairs: Diane Kelly and Andreas Rauber; tutorial chairs: Guillaume Cabanac and Suzan Verberne; industry chair: Udo Kruschwitz; publicity chair: Ingo Frommholz; and sponsorship chairs: Jochen L. Leidner and Karam Abdulahhad. We would like to thank our webmaster, Sascha Schüller and our local chair, Nina Dietzel along with all the student volunteers who helped to create an excellent online and offline experience for participants and attendees. ECIR 2019 was sponsored by: DFG (Deutsche Forschungsgemeinschaft), BCS (British Computer Society), SIGIR (Special Interest Group on Information Retrieval), City of Cologne, Signal Media Ltd, Bloomberg, Knowledge Spaces, Polygon Analytics Ltd., Google, Textkernel, MDPI Open Access Journals, and Springer. We thank them all for their support and contributions to the conference. Finally, we wish to thank all the authors, reviewers, and contributors to the conference. April 2019

Leif Azzopardi Benno Stein Norbert Fuhr Philipp Mayr Claudia Hauff Djoerd Hiemstra

Organization

General Chairs
Norbert Fuhr, Universität Duisburg-Essen, Germany
Philipp Mayr, GESIS – Leibniz Institute for the Social Sciences, Germany

Program Chairs
Leif Azzopardi, University of Glasgow, UK
Benno Stein, Bauhaus-Universität Weimar, Germany

Short Papers and Poster Chairs
Claudia Hauff, Delft University of Technology, The Netherlands
Djoerd Hiemstra, University of Twente, The Netherlands

Workshop Chairs
Diane Kelly, University of Tennessee, USA
Andreas Rauber, Vienna University of Technology, Austria

Tutorial Chairs
Guillaume Cabanac, University of Toulouse, France
Suzan Verberne, Leiden University, The Netherlands

Demo Chairs
Christina Lioma, University of Copenhagen, Denmark
Dagmar Kern, GESIS – Leibniz Institute for the Social Sciences, Germany

Industry Day Chair
Udo Kruschwitz, University of Essex, UK

Proceedings Chair
Philipp Mayr, GESIS – Leibniz Institute for the Social Sciences, Germany


Publicity Chair
Ingo Frommholz, University of Bedfordshire, UK

Sponsor Chairs
Jochen L. Leidner, Thomson Reuters/University of Sheffield, UK
Karam Abdulahhad, GESIS – Leibniz Institute for the Social Sciences, Germany

Student Mentoring Chairs
Ahmet Aker, Universität Duisburg-Essen, Germany
Laura Dietz, University of New Hampshire, USA

Website Infrastructure
Sascha Schüller, GESIS – Leibniz Institute for the Social Sciences, Germany

Local Chair
Nina Dietzel, GESIS – Leibniz Institute for the Social Sciences, Germany

Program Committee Mohamed Abdel Maksoud Ahmed Abdelali Karam Abdulahhad Dirk Ahlers Qingyao Ai Ahmet Aker Elif Aktolga Dyaa Albakour Giambattista Amati Linda Andersson Avi Arampatzis Ioannis Arapakis Jaime Arguello Leif Azzopardi Ebrahim Bagheri Krisztian Balog Alvaro Barreiro

Codoma.tech Advanced Technologies, Egypt Research Administration GESIS - Leibniz institute for the Social Sciences, Germany Norwegian University of Science and Technology, Norway University of Massachusetts Amherst, USA University of Duisburg Essen, Germany Apple Inc., USA Signal Media, UK Fondazione Ugo Bordoni, Italy Vienna University of Technology, Austria Democritus University of Thrace, Greece Telefonica Research, Spain The University of North Carolina at Chapel Hill, USA University of Strathclyde, UK Ryerson University, USA University of Stavanger, Norway University of A Coruña, Spain

Alberto Barrón-Cedeño Srikanta Bedathur Alejandro Bellogin Patrice Bellot Pablo Bermejo Catherine Berrut Prakhar Biyani Pierre Bonnet Gloria Bordogna Dimitrios Bountouridis Pavel Braslavski Paul Buitelaar Guillaume Cabanac Fidel Cacheda Sylvie Calabretto Pável Calado Arthur Camara Ricardo Campos Fazli Can Iván Cantador Cornelia Caragea Zeljko Carevic Claudio Carpineto Pablo Castells Long Chen Max Chevalier Manoj Chinnakotla Nurendra Choudhary Vincent Claveau Fabio Crestani Bruce Croft Zhuyun Dai Jeffery Dalton Martine De Cock Pablo de La Fuente Maarten de Rijke Arjen de Vries Yashar Deldjoo Kuntal Dey Emanuele Di Buccio Giorgio Maria Di Nunzio Laura Dietz Dimitar Dimitrov Mateusz Dubiel

Qatar Computing Research Institute, Qatar IIT Delhi, India Universidad Autonoma de Madrid, Spain Aix-Marseille Université - CNRS (LSIS), France Universidad de Castilla-La Mancha, Spain LIG, Université Joseph Fourier Grenoble I, France Yahoo CIRAD, France National Research Council of Italy – CNR, Italy Delft University of Technology, The Netherlands Ural Federal University, Russia Insight Centre for Data Analytics, National University of Ireland Galway, Ireland IRIT - Université Paul Sabatier Toulouse 3, France Universidade da Coruña, Spain LIRIS, France Universidade de Lisboa, Portugal Delft University of Technology, The Netherlands Polytechnic Institute of Tomar, Portugal Bilkent University, Turkey Universidad Autónoma de Madrid, Spain University of Illinois at Chicago, USA GESIS, Germany Fondazione Ugo Bordoni, Italy Universidad Autónoma de Madrid, Spain University of Glasgow, UK IRIT, France Microsoft, India International Institute of Information Technology, Hyderabad, India IRISA - CNRS, France University of Lugano (USI), Switzerland University of Massachusetts Amherst, USA Carnegie Mellon University, USA University of Glasgow, UK University of Washington, USA Universidad de Valladolid, Spain University of Amsterdam, The Netherlands Radboud University, The Netherlands Polytechnic University of Milan, Italy IBM India Research Lab, India University of Padua, Italy University of Padua, Italy University of New Hampshire, USA GESIS, Germany University of Strathclyde, UK


Carsten Eickhoff Tamer Elsayed Liana Ermakova Cristina España-Bonet Jose Alberto Esquivel Hui Fang Hossein Fani Paulo Fernandes Nicola Ferro Ingo Frommholz Norbert Fuhr Michael Färber Patrick Gallinari Shreyansh Gandhi Debasis Ganguly Wei Gao Dario Garigliotti Anastasia Giachanou Giorgos Giannopoulos Lorraine Goeuriot Julio Gonzalo Pawan Goyal Michael Granitzer Guillaume Gravier Shashank Gupta Cathal Gurrin Matthias Hagen Shuguang Han Allan Hanbury Preben Hansen Donna Harman Morgan Harvey Faegheh Hasibi Claudia Hauff Jer Hayes Ben He Nathalie Hernandez Djoerd Hiemstra Frank Hopfgartner Andreas Hotho Gilles Hubert Ali Hürriyetoğlu Dmitry Ignatov

Brown University, USA Qatar University, Qatar UBO, France UdS and DFKI, Germany Signal Media, UK University of Delaware, USA University of New Brunswick, Canada PUCRS, Brazil University of Padua, Italy University of Bedfordshire, UK University of Duisburg-Essen, Germany University of Freiburg, Germany LIP6 - University of Paris 6, France Walmart Labs, USA IBM Ireland Research Lab, Ireland Victoria University of Wellington, New Zealand University of Stavanger, Norway University of Lugano, Switzerland Imis Institute, Athena R.C., Greece University Grenoble Alpes, CNRS, France UNED, Spain IIT Kharagpur, India University of Passau, Germany CNRS, IRISA, France International Institute of Information Technology, Hyderabad, India Dublin City University, Ireland Martin-Luther-Universität Halle-Wittenberg, Germany Google, USA Vienna University of Technology, Austria Stockholm University, Sweden NIST, USA Northumbria University, UK Norwegian University of Science and Technology, Norway Delft University of Technology, The Netherlands Accenture, Ireland University of Chinese Academy of Sciences, China IRIT, France University of Twente, The Netherlands The University of Sheffield, UK University of Würzburg, Germany IRIT, France Koc University, Turkey National Research University Higher School of Economics, Russia

Bogdan Ionescu Radu Tudor Ionescu Shoaib Jameel Adam Jatowt Shen Jialie Jiepu Jiang Xiaorui Jiang Alexis Joly Gareth Jones Jaap Kamps Nattiya Kanhabua Jaana Kekäläinen Diane Kelly Liadh Kelly Dagmar Kern Roman Kern Dhruv Khattar Julia Kiseleva Dietrich Klakow Yiannis Kompatsiaris Kriste Krstovski Udo Kruschwitz Vaibhav Kumar Oren Kurland Chiraz Latiri Wang-Chien Lee Teerapong Leelanupab Mark Levene Liz Liddy Nut Limsopatham Chunbin Lin Christina Lioma Aldo Lipani Nedim Lipka Elena Lloret Fernando Loizides David Losada Natalia Loukachevitch Bernd Ludwig Mihai Lupu Craig Macdonald Andrew Macfarlane Joao Magalhaes

University Politehnica of Bucharest, Romania University of Bucharest, Romania Kent University, UK Kyoto University, Japan Queen’s University, Belfast, UK Virginia Tech, USA Aston University, UK Inria, France Dublin City University, Ireland University of Amsterdam, The Netherlands NTENT, Spain University of Tampere, Finland University of Tennessee, USA Maynooth University, Ireland GESIS, Germany Graz University of Technology, Austria International Institute of Information Technology Hyderabad, India Microsoft Research AI, USA Saarland University, Germany CERTH - ITI, Greece University of Massachusetts Amherst, USA University of Essex, UK Carnegie Mellon University, USA Technion, Israel University of Tunis, Tunisia The Pennsylvania State University, USA King Mongkut’s Institute of Technology Ladkrabang Birkbeck, University of London, UK Center for Natural Language Processing, Syracuse University, USA Amazon Amazon AWS, USA University of Copenhagen, Denmark University College London, UK Adobe, USA University of Alicante, Spain Cardiff University, UK University of Santiago de Compostela, Spain Research Computing Center of Moscow State University, Russia University of Regensburg, Germany Research Studios, Austria University of Glasgow, UK City University London, UK Universidade NOVA de Lisboa, Portugal


Walid Magdy Marco Maggini Maria Maistro Thomas Mandl Behrooz Mansouri Ilya Markov Stefania Marrara Miguel Martinez-Alvarez Bruno Martins Luis Marujo Yosi Mass Philipp Mayr Edgar Meij Massimo Melucci Marcelo Mendoza Alessandro Micarelli Dmitrijs Milajevs Seyed Abolghasem Mirroshandel Stefano Mizzaro Boughanem Mohand Felipe Moraes Ajinkya More Jose Moreno Yashar Moshfeghi Josiane Mothe André Mourão Henning Müller Franco Maria Nardini Wolfgang Nejdl Mahmood Neshati Dong Nguyen Massimo Nicosia Jian-Yun Nie Qiang Ning Boris Novikov Andreas Nuernberger Aurélie Névéol Neil O’Hare Salvatore Orlando Iadh Ounis Mourad Oussalah Deepak P.

The University of Edinburgh, UK University of Siena, Italy University of Padua, Italy University of Hildesheim, Germany Rochester Institute of Technology, USA University of Amsterdam, The Netherlands Consorzio C2T, Italy Signal, UK IST and INESC-ID - Instituto Superior Técnico, University of Lisbon, Portugal Snap Inc., USA IBM Haifa Research Lab, Israel GESIS, Germany Bloomberg L.P., UK University of Padua, Italy Universidad Técnica Federico Santa María, Brazil Roma Tre University, Italy Queen Mary University of London, UK Sharif University of Technology, Iran University of Udine, Italy IRIT University Paul Sabatier Toulouse, France Delft University of Technology, The Netherlands Netflix, USA IRIT/UPS, France University of Strathclyde, UK Institut de Recherche en Informatique de Toulouse, France Universidade NOVA de Lisboa, Portugal HES-SO, Switzerland ISTI-CNR, Italy L3S and University of Hannover, Germany Sharif University of Technology, Iran University of Twente, The Netherlands University of Trento, Italy University of Montreal, Canada University of Illinois at Urbana-Champaign, USA National Research University Higher School of Economics, Russia Otto-von-Guericke University of Magdeburg, Germany LIMSI, CNRS, Université Paris-Saclay, France Yahoo Research, USA Università Ca’ Foscari Venezia, Italy University of Glasgow, UK University of Oulu, Finland Queen’s University Belfast, Ireland

Jiaul Paik Joao Palotti Girish Palshikar Javier Parapar Gabriella Pasi Arian Pasquali Bidyut Kr. Patra Pavel Pecina Gustavo Penha Avar Pentel Raffaele Perego Vivien Petras Jeremy Pickens Karen Pinel-Sauvagnat Florina Piroi Benjamin Piwowarski Vassilis Plachouras Bob Planque Senja Pollak Martin Potthast Georges Quénot Razieh Rahimi Nitin Ramrakhiyani Jinfeng Rao Andreas Rauber Traian Rebedea Navid Rekabsaz Steffen Remus Paolo Rosso Dmitri Roussinov Stefan Rueger Tony Russell-Rose Alan Said Mark Sanderson Eric Sanjuan Rodrygo Santos Kamal Sarkar Fabrizio Sebastiani Florence Sedes Giovanni Semeraro Procheta Sen Armin Seyeditabari Mahsa Shahshahani Azadeh Shakery

IIT Kharagpur, India Vienna University of Technology/Qatar Computing Research Institute, Austria/Qatar Tata Research Development and Design Centre, India University of A Coruña, Spain Università degli Studi di Milano Bicocca, Italy University of Porto, Portugal VTT Technical Research Centre, Finland Charles University in Prague, Czech Republic UFMG University of Tallinn, Estonia ISTI-CNR, Italy Humboldt-Universität zu Berlin, Germany Catalyst Repository Systems, USA IRIT, France Vienna University of Technology, Austria CNRS, Pierre et Marie Curie University, France Facebook, UK Vrije Universiteit Amsterdam, The Netherlands University of Ljubljana, Slovenia Leipzig University, Germany Laboratoire d’Informatique de Grenoble, CNRS, France University of Tehran, Iran TCS Research, Tata Consultancy Services Ltd., India University of Maryland, USA Vienna University of Technology, Austria University Politehnica of Bucharest, Romania Idiap Research Institute, Switzerland University of Hamburg, Germany Universitat Politècnica de València, Spain University of Strathclyde, UK Knowledge Media Institute, UK UXLabs, UK University of Skövde, Sweden RMIT University, Australia Laboratoire Informatique d’Avignon- Université d’Avignon, France Universidade Federal de Minas Gerais, Brazil Jadavpur University, Kolkata, India Italian National Council of Research, Italy IRIT P. Sabatier University, France University of Bari, Italy Indian Statistical Institute, India UNC Charlotte, USA University of Amsterdam, The Netherlands University of Tehran, Iran


Manish Shrivastava Ritvik Shrivastava Rajat Singh Eero Sormunen Laure Soulier Rene Spijker Efstathios Stamatatos Benno Stein L. Venkata Subramaniam Hanna Suominen Pascale Sébillot Lynda Tamine Thibaut Thonet Marko Tkalcic Nicola Tonellotto Michael Tschuggnall Theodora Tsikrika Denis Turdakov Ferhan Ture Yannis Tzitzikas Sumithra Velupillai Suzan Verberne Vishwa Vinay Marco Viviani Stefanos Vrochidis Shuohang Wang Christa Womser-Hacker Chenyan Xiong Grace Hui Yang Peilin Yang Tao Yang Andrew Yates Hai-Tao Yu Hamed Zamani Eva Zangerle Fattane Zarrinkalam Dan Zhang Duo Zhang Shuo Zhang Sicong Zhang Guoqing Zheng Leonardo Zilio Guido Zuccon

International Institute of Information Technology, Hyderabad, India Columbia University, USA International Institute of Information Technology, Hyderabad, India University of Tampere, Finland Sorbonne Universités UPMC-LIP6, France Cochran, The Netherlands University of the Aegean, Greece Bauhaus-Universität Weimar, Germany IBM Research, India The ANU, Australia IRISA, France IRIT, France University of Grenoble Alpes, France Free University of Bozen-Bolzano, Italy ISTI-CNR, Italy Institute for computer science, DBIS, Innsbruck, Austria Information Technologies Institute, CERTH, Greece Institute for System Programming RAS Comcast Labs, USA University of Crete and FORTH-ICS, Greece KTH Royal Institute of Technology, Sweden Leiden University, The Netherlands Adobe Research Bangalore, India Università degli Studi di Milano-Bicocca - DISCo, Italy Information Technologies Institute, Greece Singapore Management University, Singapore Universität Hildesheim, Germany Carnegie Mellon University; Microsoft, USA Georgetown University, USA Twitter Inc., USA University of California at Santa Barbara, USA Max Planck Institute for Informatics, Germany University of Tsukuba, Japan University of Massachusetts Amherst, USA University of Innsbruck, Austria Ferdowsi University, Iran Facebook, USA Kunlun Inc. University of Stavanger, Norway Georgetown University, USA Carnegie Mellon University, USA Université catholique de Louvain, Belgium The University of Queensland, Australia

Sponsors


Contents – Part II

Short Papers (Continued) Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dimitrios Pritsos, Anderson Rocha, and Efstathios Stamatatos

3

Exploiting Global Impact Ordering for Higher Throughput in Selective Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michał Siedlaczek, Juan Rodriguez, and Torsten Suel

12

Cross-Domain Recommendation via Deep Domain Adaptation . . . . . . . . . . . Heishiro Kanagawa, Hayato Kobayashi, Nobuyuki Shimizu, Yukihiro Tagami, and Taiji Suzuki

20

It’s only Words and Words Are All I Have . . . . . . . . . . . . . . . . . . . . . . . . Manash Pratim Barman, Kavish Dahekar, Abhinav Anshuman, and Amit Awekar

30

Modeling User Return Time Using Inhomogeneous Poisson Process . . . . . . . Mohammad Akbari, Alberto Cetoli, Stefano Bragaglia, Andrew D. O’Harney, Marc Sloan, and Jun Wang

37

Inductive Transfer Learning for Detection of Well-Formed Natural Language Search Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bakhtiyar Syed, Vijayasaradhi Indurthi, Manish Gupta, Manish Shrivastava, and Vasudeva Varma

45

Towards Spatial Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paul Mousset, Yoann Pitarch, and Lynda Tamine

53

Asymmetry Sensitive Architecture for Neural Text Matching . . . . . . . . . . . . Thiziri Belkacem, Jose G. Moreno, Taoufiq Dkaki, and Mohand Boughanem

62

QGraph: A Quality Assessment Index for Graph Clustering . . . . . . . . . . . . . Maria Halkidi and Iordanis Koutsopoulos

70

A Neural Approach to Entity Linking on Wikidata . . . . . . . . . . . . . . . . . . . Alberto Cetoli, Stefano Bragaglia, Andrew D. O’Harney, Marc Sloan, and Mohammad Akbari

78

Self-attentive Model for Headline Generation . . . . . . . . . . . . . . . . . . . . . . . Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh

87


Can Image Captioning Help Passage Retrieval in Multimodal Question Answering?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shurong Sheng, Katrien Laenen, and Marie-Francine Moens

94

A Simple Neural Approach to Spatial Role Labelling . . . . . . . . . . . . . . . . . Nitin Ramrakhiyani, Girish Palshikar, and Vasudeva Varma

102

Neural Diverse Abstractive Sentence Compression Generation . . . . . . . . . . . Mir Tafseer Nayeem, Tanvir Ahmed Fuad, and Yllias Chali

109

Fully Contextualized Biomedical NER. . . . . . . . . . . . . . . . . . . . . . . . . . . . Ashim Gupta, Pawan Goyal, Sudeshna Sarkar, and Mahanandeeshwar Gattu

117

DeepTagRec: A Content-cum-User Based Tag Recommendation Framework for Stack Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Suman Kalyan Maity, Abhishek Panigrahi, Sayan Ghosh, Arundhati Banerjee, Pawan Goyal, and Animesh Mukherjee

125

Document Performance Prediction for Automatic Text Classification . . . . . . . Gustavo Penha, Raphael Campos, Sérgio Canuto, Marcos André Gonçalves, and Rodrygo L. T. Santos

132

Misleading Metadata Detection on YouTube . . . . . . . . . . . . . . . . . . . . . . . Priyank Palod, Ayush Patwari, Sudhanshu Bahety, Saurabh Bagchi, and Pawan Goyal

140

A Test Collection for Passage Retrieval Evaluation of Spanish Health-Related Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eleni Kamateri, Theodora Tsikrika, Spyridon Symeonidis, Stefanos Vrochidis, Wolfgang Minker, and Yiannis Kompatsiaris On the Impact of Storing Query Frequency History for Search Engine Result Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Erman Yafay and Ismail Sengor Altingovde

148

155

AspeRa: Aspect-Based Rating Prediction Model . . . . . . . . . . . . . . . . . . . . . Sergey I. Nikolenko, Elena Tutubalina, Valentin Malykh, Ilya Shenbin, and Anton Alekseev

163

Heterogeneous Edge Embedding for Friend Recommendation. . . . . . . . . . . . Janu Verma, Srishti Gupta, Debdoot Mukherjee, and Tanmoy Chakraborty

172

Public Sphere 2.0: Targeted Commenting in Online News Media . . . . . . . . . Ankan Mullick, Sayan Ghosh, Ritam Dutt, Avijit Ghosh, and Abhijnan Chakraborty

180

An Extended CLEF eHealth Test Collection for Cross-Lingual Information Retrieval in the Medical Domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shadi Saleh and Pavel Pecina

188

An Axiomatic Study of Query Terms Order in Ad-Hoc Retrieval . . . . . . . . . Ayyoob Imani, Amir Vakili, Ali Montazer, and Azadeh Shakery

196

Deep Neural Networks for Query Expansion Using Word Embeddings . . . . . Ayyoob Imani, Amir Vakili, Ali Montazer, and Azadeh Shakery

203

Demonstration Papers Online Evaluations for Everyone: Mr. DLib’s Living Lab for Scholarly Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joeran Beel, Andrew Collins, Oliver Kopp, Linus W. Dietz, and Petr Knoth

213

StyleExplorer: A Toolkit for Textual Writing Style Visualization . . . . . . . . . Michael Tschuggnall, Thibault Gerrier, and Günther Specht

220

MedSpecSearch: Medical Specialty Search . . . . . . . . . . . . . . . . . . . . . . . . . Mehmet Uluç Şahin, Eren Balatkan, Cihan Eran, Engin Zeydan, and Reyyan Yeniterzi

225

Contender: Leveraging User Opinions for Purchase Decision-Making . . . . . . Tiago de Melo, Altigran S. da Silva, Edleno S. de Moura, and Pável Calado

230

Rethinking ‘Advanced Search’: A New Approach to Complex Query Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tony Russell-Rose, Jon Chamberlain, and Udo Kruschwitz

236

node-indri: Moving the Indri Toolkit to the Modern Web Stack . . . . . Felipe Moraes and Claudia Hauff

241

PaperHunter: A System for Exploring Papers and Citation Contexts . . . . . . . Michael Färber, Ashwath Sampath, and Adam Jatowt

246

Interactive System for Automatically Generating Temporal Narratives . . . . . . Arian Pasquali, Vítor Mangaravite, Ricardo Campos, Alípio Mário Jorge, and Adam Jatowt

251

CLEF Organizers Lab Track Early Detection of Risks on the Internet: An Exploratory Campaign . . . . . . . David E. Losada, Fabio Crestani, and Javier Parapar

259


CLEF eHealth 2019 Evaluation Lab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liadh Kelly, Lorraine Goeuriot, Hanna Suominen, Mariana Neves, Evangelos Kanoulas, Rene Spijker, Leif Azzopardi, Dan Li, Jimmy, João Palotti, and Guido Zuccon

267

LifeCLEF 2019: Biodiversity Identification and Prediction Challenges . . . . . . Alexis Joly, Hervé Goëau, Christophe Botella, Stefan Kahl, Marion Poupard, Maximillien Servajean, Hervé Glotin, Pierre Bonnet, Willem-Pier Vellinga, Robert Planqué, Jan Schlüter, Fabian-Robert Stöter, and Henning Müller

275

CENTRE@CLEF 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff

283

A Decade of Shared Tasks in Digital Text Forensics at PAN . . . . . . . . . . . . Martin Potthast, Paolo Rosso, Efstathios Stamatatos, and Benno Stein

291

ImageCLEF 2019: Multimedia Retrieval in Lifelogging, Medical, Nature, and Security Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Ionescu, Henning Müller, Renaud Péteri, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Minh-Triet Tran, Mathias Lux, Cathal Gurrin, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Asma Ben Abacha, Sadid A. Hasan, Vivek Datla, Joey Liu, Dina Demner-Fushman, Obioma Pelka, Christoph M. Friedrich, Jon Chamberlain, Adrian Clark, Alba García Seco de Herrera, Narciso Garcia, Ergina Kavallieratou, Carlos Roberto del Blanco, Carlos Cuevas Rodríguez, Nikos Vasillopoulos, and Konstantinos Karampidis CheckThat! at CLEF 2019: Automatic Identification and Verification of Claims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tamer Elsayed, Preslav Nakov, Alberto Barrón-Cedeño, Maram Hasanain, Reem Suwaileh, Giovanni Da San Martino, and Pepa Atanasova A Task Set Proposal for Automatic Protest Information Collection Across Multiple Countries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan, and Osman Mutlu

301

309

316

Doctoral Consortium Papers Exploring Result Presentation in Conversational IR Using a Wizard-of-Oz Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Souvick Ghosh

327


Keyword Search on RDF Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dennis Dosso

332

Learning User and Item Representations for Recommender Systems . . . . . . . Alfonso Landin

337

Improving the Annotation Efficiency and Effectiveness in the Text Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Zlabinger

343

Logic-Based Models for Social Media Retrieval . . . . . . . . . . . . . . . . . . . . . Firas Sabbah

348

Adapting Models for the Case of Early Risk Prediction on the Internet . . . . . Razan Masood

353

Integration of Images into the Patent Retrieval Process . . . . . . . . . . . . . . . . Wiebke Thode

359

Dialogue-Based Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abhishek Kaushik

364

Dynamic Diversification for Interactive Complex Search . . . . . . . . . . . . . . . Ameer Albahem

369

The Influence of Backstories on Queries with Varying Levels of Intent in Task-Based Specialised Information Retrieval . . . . . . . . . . . . . . . . . . . . . Manuel Steiner

375

Workshops Proposal for the 1st Interdisciplinary Workshop on Algorithm Selection and Meta-Learning in Information Retrieval (AMIR) . . . . . . . . . . . . . . . . . . Joeran Beel and Lars Kotthoff

383

The 2nd International Workshop on Narrative Extraction from Text: Text2Story 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alípio Mário Jorge, Ricardo Campos, Adam Jatowt, and Sumit Bhatia

389

Bibliometric-Enhanced Information Retrieval: 8th International BIR Workshop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guillaume Cabanac, Ingo Frommholz, and Philipp Mayr

394

Third Workshop on Social Media for Personalization and Search (SoMePeAS 2019) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ludovico Boratto and Giovanni Stilo

401


Tutorials Conducting Laboratory Experiments Properly with Statistical Tools: An Easy Hands-On Tutorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tetsuya Sakai

405

Text Categorization with Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jacques Savoy

408

Concept to Code: Neural Networks for Sequence Learning. . . . . . . . . . . . . . Omprakash Sonie, Muthusamy Chelliah, Surender Kumar, and Bidyut Kr. Patra

410

A Tutorial on Basics, Applications and Current Developments in PLS Path Modeling for the IIR Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Kattenbeck and David Elsweiler

413

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

417

Contents – Part I

Modeling Relations Learning Lexical-Semantic Relations Using Intuitive Cognitive Links . . . . . . Georgios Balikas, Gaël Dias, Rumen Moraliyski, Houssam Akhmouch, and Massih-Reza Amini

3

Relationship Prediction in Dynamic Heterogeneous Information Networks . . . Amin Milani Fard, Ebrahim Bagheri, and Ke Wang

19

Retrieving Relationships from a Knowledge Graph for Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Puneet Agarwal, Maya Ramanath, and Gautam Shroff

35

Embedding Geographic Locations for Modelling the Natural Environment Using Flickr Tags and Structured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . Shelan S. Jeawak, Christopher B. Jones, and Steven Schockaert

51

Classification and Search Recognising Summary Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mark Fisher, Dyaa Albakour, Udo Kruschwitz, and Miguel Martinez Towards Content Expiry Date Determination: Predicting Validity Periods of Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Axel Almquist and Adam Jatowt Dynamic Ensemble Selection for Author Verification . . . . . . . . . . . . . . . . . Nektaria Potha and Efstathios Stamatatos Structural Similarity Search for Formulas Using Leaf-Root Paths in Operator Subtrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Zhong and Richard Zanibbi

69

86 102

116

Recommender Systems I PRIN: A Probabilistic Recommender with Item Priors and Neural Models . . . Alfonso Landin, Daniel Valcarce, Javier Parapar, and Álvaro Barreiro Information Retrieval Models for Contact Recommendation in Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Javier Sanz-Cruzado and Pablo Castells

133

148


Conformative Filtering for Implicit Feedback Data . . . . . . . . . . . . . . . . . . . Farhan Khawar and Nevin L. Zhang

164

Graphs Binarized Knowledge Graph Embeddings. . . . . . . . . . . . . . . . . . . . . . . . . . Koki Kishimoto, Katsuhiko Hayashi, Genki Akai, Masashi Shimbo, and Kazunori Komatani A Supervised Keyphrase Extraction System Based on Graph Representation Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Corina Florescu and Wei Jin

181

197

Recommender Systems II Comparison of Sentiment Analysis and User Ratings in Venue Recommendation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xi Wang, Iadh Ounis, and Craig Macdonald

215

AntRS: Recommending Lists Through a Multi-objective Ant Colony System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pierre-Edouard Osche, Sylvain Castagnos, and Anne Boyer

229

Automated Early Leaderboard Generation from Comparative Tables . . . . . . . Mayank Singh, Rajdeep Sarkar, Atharva Vyas, Pawan Goyal, Animesh Mukherjee, and Soumen Chakrabarti

244

Query Analytics Predicting the Topic of Your Next Query for Just-In-Time IR . . . . . . . . . . . Seyed Ali Bahrainian, Fattane Zarrinkalam, Ida Mele, and Fabio Crestani Identifying Unclear Questions in Community Question Answering Websites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Trienes and Krisztian Balog Local and Global Query Expansion for Hierarchical Complex Topics . . . . . . Jeffrey Dalton, Shahrzad Naseri, Laura Dietz, and James Allan

261

276 290

Representation Word Embeddings for Entity-Annotated Texts . . . . . . . . . . . . . . . . . . . . . . Satya Almasian, Andreas Spitz, and Michael Gertz

307

Vectors of Pairwise Item Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gaurav Pandey, Shuaiqiang Wang, Zhaochun Ren, and Yi Chang

323


Reproducibility (Systems) Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joel Mackenzie, Antonio Mallia, Matthias Petri, J. Shane Culpepper, and Torsten Suel

339

An Experimental Study of Index Compression and DAAT Query Processing Methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Antonio Mallia, Michał Siedlaczek, and Torsten Suel

353

Reproducing and Generalizing Semantic Term Matching in Axiomatic Information Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peilin Yang and Jimmy Lin

369

Optimizing Ranking Models in an Online Setting . . . . . . . . . . . . . . . . . . . . Harrie Oosterhuis and Maarten de Rijke

382

Simple Techniques for Cross-Collection Relevance Feedback . . . . . . . . . . . . Ruifan Yu, Yuhao Xie, and Jimmy Lin

397

Reproducibility (Application) A Comparative Study of Summarization Algorithms Applied to Legal Case Judgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and Saptarshi Ghosh Replicating Relevance-Ranked Synonym Discovery in a New Language and Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrew Yates and Michael Unterkalmsteiner On Cross-Domain Transfer in Venue Recommendation . . . . . . . . . . . . . . . . Jarana Manotumruksa, Dimitrios Rafailidis, Craig Macdonald, and Iadh Ounis The Effect of Algorithmic Bias on Recommender Systems for Massive Open Online Courses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ludovico Boratto, Gianni Fenu, and Mirko Marras

413

429 443

457

Neural IR Domain Adaptive Neural Sentence Compression by Tree Cutting . . . . . . . . . Litton J. Kurisinkel, Yue Zhang, and Vasudeva Varma

475

An Axiomatic Approach to Diagnosing Neural IR Models . . . . . . . . . . . . . . Daniël Rennings, Felipe Moraes, and Claudia Hauff

489


Cross Lingual IR Term Selection for Query Expansion in Medical Cross-Lingual Information Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shadi Saleh and Pavel Pecina

507

Zero-Shot Language Transfer for Cross-Lingual Sentence Retrieval Using Bidirectional Attention Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Goran Glavaš and Ivan Vulić

523

QA and Conversational Search QRFA: A Data-Driven Model of Information-Seeking Dialogues . . . . . . . . . Svitlana Vakulenko, Kate Revoredo, Claudio Di Ciccio, and Maarten de Rijke Iterative Relevance Feedback for Answer Passage Retrieval with Passage-Level Semantic Match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keping Bi, Qingyao Ai, and W. Bruce Croft

541

558

Topic Modeling LICD: A Language-Independent Approach for Aspect Category Detection . . . Erfan Ghadery, Sajad Movahedi, Masoud Jalili Sabet, Heshaam Faili, and Azadeh Shakery Topic Grouper: An Agglomerative Clustering Approach to Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Pfeifer and Jochen L. Leidner

575

590

Metrics Meta-evaluation of Dynamic Search: How Do Metrics Capture Topical Relevance, Diversity and User Effort? . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ameer Albahem, Damiano Spina, Falk Scholer, and Lawrence Cavedon A Markovian Approach to Evaluate Session-Based IR Systems. . . . . . . . . . . David van Dijk, Marco Ferrante, Nicola Ferro, and Evangelos Kanoulas Correlation, Prediction and Ranking of Evaluation Metrics in Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Soumyajit Gupta, Mucahid Kutlu, Vivek Khetan, and Matthew Lease Modeling User Actions in Job Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alfan Farizki Wicaksono, Alistair Moffat, and Justin Zobel

607 621

636 652


Image IR Automated Semantic Annotation of Species Names in Handwritten Texts . . . Lise Stork, Andreas Weber, Jaap van den Herik, Aske Plaat, Fons Verbeek, and Katherine Wolstencroft

667

Tangent-V: Math Formula Image Search Using Line-of-Sight Graphs . . . . . . Kenny Davila, Ritvik Joshi, Srirangaraj Setlur, Venu Govindaraju, and Richard Zanibbi

681

Figure Retrieval from Collections of Research Articles . . . . . . . . . . . . . . . . Saar Kuzi and ChengXiang Zhai

696

“Is This an Example Image?” – Predicting the Relative Abstractness Level of Image and Text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Otto, Sebastian Holzki, and Ralph Ewerth

711

Short Papers End-to-End Neural Relation Extraction Using Deep Biaffine Attention. . . . . . Dat Quoc Nguyen and Karin Verspoor

729

Social Relation Inference via Label Propagation . . . . . . . . . . . . . . . . . . . . . Yingtao Tian, Haochen Chen, Bryan Perozzi, Muhao Chen, Xiaofei Sun, and Steven Skiena

739

Wikipedia Text Reuse: Within and Without . . . . . . . . . . . . . . . . . . . . . . . . Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen, and Martin Potthast

747

Stochastic Relevance for Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . . . . Marco Ferrante, Nicola Ferro, and Eleonora Losiouk

755

Exploiting a More Global Context for Event Detection Through Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dorian Kodelja, Romaric Besançon, and Olivier Ferret Faster BlockMax WAND with Longer Skipping . . . . . . . . . . . . . . . . . . . . . Antonio Mallia and Elia Porciani A Hybrid Modeling Approach for an Automated Lyrics-Rating System for Adolescents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jayong Kim and Mun Y. Yi Evaluating Similarity Metrics for Latent Twitter Topics . . . . . . . . . . . . . . . . Xi Wang, Anjie Fang, Iadh Ounis, and Craig Macdonald

763 771

779 787


On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonathan Donnelly and Adam Roegiest

795

Image Tweet Popularity Prediction with Convolutional Neural Network. . . . . Yihong Zhang and Adam Jatowt

803

Enriching Word Embeddings for Patent Retrieval with Global Context . . . . . Sebastian Hofstätter, Navid Rekabsaz, Mihai Lupu, Carsten Eickhoff, and Allan Hanbury

810

Incorporating External Knowledge to Boost Machine Comprehension Based Question Answering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huan Wang, Weiming Lu, and Zeyun Tang

819

Impact of Training Dataset Size on Neural Answer Selection Models . . . . . . Trond Linjordet and Krisztian Balog

828

Unsupervised Explainable Controversy Detection from Online News . . . . . . . Youngwoo Kim and James Allan

836

Extracting Temporal Event Relations Based on Event Networks . . . . . . . . . . Duc-Thuan Vo and Ebrahim Bagheri

844

Dynamic-Keyword Extraction from Social Media . . . . . . . . . . . . . . . . . . . . David Semedo and João Magalhães

852

Local Popularity and Time in top-N Recommendation . . . . . . . . . . . . . . . . . Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio, Azzurra Ragone, and Joseph Trotta

861

Multiple Keyphrase Generation Model with Diversity . . . . . . . . . . . . . . . . . Shotaro Misawa, Yasuhide Miura, Motoki Taniguchi, and Tomoko Ohkuma

869

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

877

Short Papers (Continued)

Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio

Dimitrios Pritsos1, Anderson Rocha2, and Efstathios Stamatatos1

1 University of the Aegean, 83200 Karlovassi, Samos, Greece
{dpritsos,stamatatos}@aegean.gr
2 Institute of Computing, University of Campinas (Unicamp), Campinas, SP, Brazil

Abstract. Web genre identification can boost information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. The open-set scenario is more realistic for this task as web genres evolve over time and it is not feasible to define a universally agreed genre palette. In this work, we bring to bear a novel approach to web genre identification underpinned by distributional features acquired by doc2vec and a recently-proposed open-set classification algorithm—the nearest neighbors distance ratio classifier. We present experimental results using a benchmark corpus and a strong baseline and demonstrate that the proposed approach is highly competitive, especially when emphasis is given to precision.

Keywords: Web genre identification · Open-set classification · Distributional features

1 Introduction

Web Genre Identification (WGI) is a multi-class text classification task aiming at the association of web pages to labels (e.g., blog, e-shop, personal home page, etc.) corresponding to their form, communicative purpose, and style rather than their content. WGI can enhance the potential of information retrieval (IR) systems by allowing more complex and informative queries, whereby topic-related keywords and genre labels are combined to better express the information need of users, and by grouping search results by genre [16,31]. Moreover, WGI is especially useful for enhancing the performance of Natural Language Processing (NLP) methods, such as part-of-speech (POS) tagging [22] and text summarization [36], by enabling genre-specific model development.

In spite of WGI's immediate applications, there are certain fundamental difficulties hindering its deployment in practice. First, there is a lack of both a consensus on the exact definition of genre [5] and a genre palette that comprises all available genres and sub-genres [17,18,33,34] to aim for. New web genres appear on-the-fly and existing genres evolve over time [4]. Furthermore, it is not clear whether a whole web page should belong to a single genre or sections of the same web page can belong to different genres [8,15]. Finally, the style of documents is affected by both genre-related choices and author-related choices [24,34].

Instead of aiming to anticipate all possible web genres that may appear in a practical scenario, it would be wiser to properly handle the genres of interest while also accounting for "unseen" genres. In this vein, WGI can be viewed as an open-set classification task to better deal with incomplete genre palettes [2,26–28,37]. This scheme requires strong generalization in comparison to the traditional closed-set setup—the one in which all genres of interest are known or defined a priori. One caveat, though, is that open-set classification methods tend to perform better when operating in not-so-high-dimensional manifolds. However, to date, the most common and effective stylometric features in prior art, e.g., word and character n-grams, yield high-dimensional spaces [10,34].

Aiming to properly bring the power of open-set classification to the WGI setup, in this paper we apply a recently-proposed open-set classification algorithm, the Nearest Neighbors Distance Ratio (NNDR) classifier [19], to WGI. To produce a compact representation of web pages—more amenable to open-set modelling—we rely upon Distributional Features (DF) [39]. Finally, we also use an evaluation methodology that is more appropriate for the open-set classification framework with unstructured noise [27].

We organize the remainder of this paper into four more sections. Sect. 2 presents previous work on WGI, while Sect. 3 describes the proposed approach. Sect. 4 discusses the experimental setup and obtained results. Finally, Sect. 6 draws the main conclusions of this study and presents some future work directions.

2 Related Work

Most previous studies in WGI consider the case where all web pages should belong to a predefined taxonomy of genres [7,10,14,32]. From a machine learning point of view, this setup amounts to what is known as a closed-set problem definition. However, this naïve assumption is not appropriate for most applications related to WGI, as it is not possible to construct a universal genre palette a priori nor to force web pages to always fall into any of the predefined genre labels. Such web pages are considered noise and include web documents where multiple genres co-exist [13,33]. Santini [33] defines structured noise as the collection of web pages belonging to several genres, unknown during training. Such structured noise can be used as a negative class for training a binary classifier [38]. However, it is highly unlikely that such a collection represents the real distribution of pages of the web at large. On the other hand, unstructured noise is a random collection of pages [33] for which no genre labels are available. The effect of noise in WGI was first studied in [6,11,13,35].

Open-set classification models for WGI were first described in [28,37]. However, these models were only tested on noise-free corpora [26]. Asheghi [2] showed that it is much more challenging to perform WGI in the noisy web setup in comparison to noise-free corpora. Recently, Ensemble Methods were shown to achieve high effectiveness in open-set WGI setups [27].

Historically, great attention in WGI has been given to the appropriate definition of features that are capable of capturing genre characteristics—these include, but are not limited to, character n-grams or word n-grams, part-of-speech histograms, the frequency of the most discriminative words, etc. [10,12–14,17,23,24,34]. Additionally, some useful features might come from exploiting the HTML structure and/or the hyperlink functionality of web pages [1,3,7,29,40]. Recently, deep learning methods have also been tested in genre detection setups with promising results [39].

3 Proposed Approach—Open-Set Web Genre Identification

3.1 Distributional Features Learning

In this study, we rely upon a Doc2Vec text representation to provide distributional features for the WGI problem [20,21,30]. In particular, we have implemented a special module inside our package, named Html2Vec (https://github.com/dpritsos/html2vec), where a whole corpus can be used as input and one Bag-of-Words Paragraph Vector (PV-BOW) is returned per web page of the corpus. PV-BOW consists of a Neural Network (NNet) comprising a softmax multi-class classifier that approximates $\max \frac{1}{T}\sum_{a=k}^{T-k} \log p(t_a \mid t_{a-k}, \ldots, t_{a+k})$. PV-BOW is trained using stochastic gradient descent, where the gradient is obtained via back-propagation. Given a sequence of training n-grams (word or character) $t_1, t_2, t_3, \ldots, t_T$, the objective function of the NNet is the maximized average log-probability above, where $p(t_a \mid t_{a-k}, \ldots, t_{a+k}) = \frac{e^{y_{t_a}}}{\sum_i e^{y_i}}$. For training the PV-BOW in this study, in each iteration of the stochastic gradient descent a text window of size $w_{size}$ is sampled. Then a random term (n-gram) is sampled from the text window to form a classification task given the paragraph vector. Thus $y = b + s(t_1, t_2, t_3, \ldots, t_{w_{size}})$, where $s()$ is the sequence of word n-grams or character n-grams of the sampled window. Each type of n-grams is used separately, as suggested in [25]. This model provides us with a representation of web pages of pre-defined dimensionality $DF_{dim}$.
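As an illustration, distributional features of this kind could be obtained with gensim's Doc2Vec in PV-DBOW mode, roughly as in the sketch below. This is an assumed reimplementation, not the authors' Html2Vec module; the tokenization helper, the corpus format, and the chosen parameter values (drawn from the ranges listed in Sect. 4.2) are illustrative, and gensim 4 or later is assumed.

```python
# Minimal sketch: learning PV-BOW (PV-DBOW) distributional features per web page
# with gensim. Illustrative only; not the authors' Html2Vec code.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def char_ngrams(text, n=4):
    """Split a document into overlapping character n-grams (e.g., C4G)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_corpus(pages):
    """pages: list of (page_id, text) pairs; one TaggedDocument per web page."""
    return [TaggedDocument(words=char_ngrams(text), tags=[page_id])
            for page_id, text in pages]

def train_pv_bow(pages, dim=100, window=8, min_count=3, epochs=10):
    corpus = build_corpus(pages)
    model = Doc2Vec(corpus,
                    dm=0,                 # PV-DBOW (Bag-of-Words Paragraph Vector)
                    vector_size=dim,      # DF dimensionality (DF_dim)
                    window=window,        # window parameter (w_size in the paper)
                    min_count=min_count,  # discard very low-frequency terms (TF_min)
                    alpha=0.025,
                    epochs=epochs)
    # One distributional feature vector per web page, indexed by its tag.
    return {page_id: model.dv[page_id] for page_id, _ in pages}
```

Word unigrams or word 3-grams would be handled the same way by swapping the tokenizer, keeping each n-gram type in a separate model as described above.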

3.2 Nearest Neighbors Distance Ratio Classifier

The Nearest Neighbors Distance Ratio (NNDR) classifier is an open-set classification algorithm introduced in [19], which, in turn, is an extension of the Nearest Neighbors (NN) algorithm. NNDR calculates the distance of a new sample s to its nearest neighbor t and to the closest training sample u belonging to a different class with respect to t. Then, if the ratio d(s, t)/d(s, u) is lower than a threshold, the new sample is assigned the class of t. Otherwise, it is left unclassified. It is remarkable that, in contrast to other open-set classifiers, training of NNDR requires both known samples (belonging to classes known during training) and unknown examples (belonging to other/unknown classes). In more detail, the Distance Ratio Threshold (DRT) used to classify new samples is adjusted by maximizing the Normalized Accuracy $NA = \lambda A_{KS} + (1-\lambda) A_{US}$, where $A_{KS}$ is the accuracy on known samples and $A_{US}$ is the accuracy on unknown samples. The parameter $\lambda$ regulates the trade-off between mistakes on known and unknown samples. Since usually only known classes are available in the training phase, Mendes et al. [19] propose an approach that repeatedly splits the available training classes into two sets (known and “simulated” unknown). In our implementation of NNDR (https://github.com/dpritsos/OpenNNDR), we use the cosine distance rather than the Euclidean distance, because previous work found this type of distance more suitable for WGI [27].
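To make the decision rule concrete, the following is a minimal sketch of an NNDR-style prediction step with cosine distances. It is not the authors' OpenNNDR implementation; it assumes the DRT has already been selected (e.g., by maximizing NA over repeated known/"simulated unknown" splits of the training classes) and that the training data contain at least two classes.

```python
# Minimal sketch of the NNDR decision rule with cosine distance (illustrative,
# not the authors' OpenNNDR code). X_train holds the DF vectors of the known
# genres, y_train their labels; drt is a pre-selected Distance Ratio Threshold.
import numpy as np
from scipy.spatial.distance import cdist

def nndr_predict(X_train, y_train, X_test, drt=0.8, unknown_label=-1):
    y_train = np.asarray(y_train)
    dists = cdist(X_test, X_train, metric="cosine")   # pairwise cosine distances
    preds = []
    for row in dists:
        order = np.argsort(row)
        t = order[0]                                   # nearest neighbor overall
        label_t = y_train[t]
        # nearest training sample belonging to a class different from t's class
        # (assumes at least two classes are present in the training set)
        u = next(j for j in order[1:] if y_train[j] != label_t)
        ratio = row[t] / (row[u] + 1e-12)
        # accept the nearest neighbor's label only when it is clearly closer
        preds.append(label_t if ratio < drt else unknown_label)
    return np.array(preds)
```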

4 Experiments

4.1 Corpus

Our experiments are based on SANTINIS, a benchmark corpus already used in previous work in WGI [18,27,32]. This dataset comprises 1,400 English webpages evenly distributed into seven genres (blog, eshop, FAQ, frontpage, listing, personal home page, search page) as well as 80 BBC web-pages evenly categorized into four additional genres (DIY mini-guide, editorial, features, short-bio). In addition, the dataset comprises a random selection of 1,000 English web-pages taken from the SPIRIT corpus [9]. The latter can be viewed as unstructured noise since genre labels are missing. 4.2

4.2 Experimental Setup

To represent web-pages, we use features exclusively related to textual information, excluding any structural information, URLs, etc. The following representation schemes are examined: Character 4-grams (C4G), Word unigrams (W1G), and Word 3-grams (W3G). For each of these schemes, we use either Term-Frequency (TF) weights or DF features. The feature space for TF is defined by a vocabulary V_TF, which is extracted based on the most frequent terms of the training set—we consider V_TF = {5k, 10k, 50k, 100k}. The DF space is predefined in the PV-BOW model—we consider DF_dim = {50, 100, 250, 500, 1000}. In PV-BOW, the terms with very low frequency in the training set are discarded. In this study, we examine TF_min = {3, 10} as the cutoff frequency threshold. The text window size is selected from w_size = {3, 8, 20}.


The remaining parameters of PV-BOW are set as follows: α = 0.025, epochs = {1, 3, 10}, and decay = {0.002, 0.02}. Regarding the NNDR open-set classifier, there are two parameters, λ and DRT, and their considered values are λ = {0.2, 0.5, 0.7} and DRT = {0.4, 0.6, 0.8, 0.9}. All aforementioned parameters are adjusted based on grid search using only the training part of the corpus. For a proper comparison with prior art, we use two open-set WGI approaches with good previously reported results as baselines: Random Feature Subset Ensemble (RFSE) and one-class SVM (OCSVM) [27,28]. All parameters of these methods have been adjusted as suggested in [27] (based on the same corpus). We follow the open-set evaluation framework with unstructured noise introduced in [27]. In particular, the open-set F1 score [19] is calculated over the known classes (the noisy class is excluded). The reported evaluation results are obtained by performing 10-fold cross-validation and, in each fold, we include the full set of 1,000 pages of noise. This setup is comparable to previous studies [27].

Table 1. Performance of baselines and NNDR on the SANTINIS corpus. All evaluation scores are macro-averaged.

Model   Features  Dim.  Precision  Recall  AUC    F1
RFSE    TF-C4G    50k   0.739      0.780   0.652  0.759
RFSE    TF-W1G    50k   0.776      0.758   0.657  0.767
RFSE    TF-W3G    50k   0.797      0.722   0.615  0.758
OCSVM   TF-C4G    5k    0.662      0.367   0.210  0.472
OCSVM   TF-W1G    5k    0.332      0.344   0.150  0.338
OCSVM   TF-W3G    10k   0.631      0.654   0.536  0.643
NNDR    TF-C4G    5k    0.664      0.403   0.291  0.502
NNDR    TF-W1G    5k    0.691      0.439   0.348  0.537
NNDR    TF-W3G    10k   0.720      0.664   0.486  0.691
NNDR    DF-C4G    50    0.829      0.600   0.455  0.696
NNDR    DF-W1G    50    0.733      0.670   0.541  0.700
NNDR    DF-W3G    100   0.827      0.615   0.564  0.706

5 Results

We apply the baselines and NNDR to the SANTINIS corpus. In the training phase, we use only the 11 known genre classes, while in the test phase we also consider an additional class (unstructured noise). Table 1 shows the performance of the tested methods when either TF or DF representation schemes, based on C4G, W1G, or W3G features, are used.


First, we compare NNDR using TF features with the baselines, which also use this kind of features. In this case, NNDR outperforms OCSVM. On the other hand, RFSE performed better than NNDR for Macro-F1 and Macro-AUC. This is consistent for any kind of features (C4G, W1G, or W3G). There is a notable difference in the dimensionality of representation used by the examined approaches, though. RFSE relies upon a 50k-D manifold while NNDR and OCSVM are based on much lower dimensional spaces. It has to be noted that RFSE builds an ensemble by iteratively selecting a subset of the available features (randomly). That way, it internally reduces the dimensionality for each constituent base classifier. On the other hand, NNDR seems to be confused when thousands of features are considered, as it is based on distance calculations.

Next, we compare NNDR models using either TF or DF features. There is a notable improvement when DFs are used in association with the open-set NNDR classifier. The dimensionality of DF is much lower than TF, and this seems to be crucial to improve the performance of NNDR. This is consistent for all three feature types (C4G, W1G, and W3G). NNDR with the TF scheme is competitive only when W3G features are used. It should also be noted that in all cases the selected value of parameter DRT is 0.8. This indicates that NNDR is a very robust algorithm.

Finally, the proposed approach using NNDR and DF outperforms OCSVM but is outperformed by the strong baseline RFSE in both macro-AUC and macro-F1. However, where precision is concerned, NNDR is much better. A closer look at the comparison of the two methods is provided in Fig. 1, where precision curves at the 11 standard recall levels are depicted. The precision value at level r_j is interpolated as follows: P(r_j) = max_{r_j ≤ r ≤ r_{j+1}} P(r).
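For clarity, a small sketch of this 11-point interpolation is given below; it assumes the precision-recall operating points of a classifier have already been computed, and it follows the interpolation rule as quoted above. The names are illustrative only.

```python
# Sketch of precision interpolated at the 11 standard recall levels
# (0.0, 0.1, ..., 1.0), following the rule quoted in the text:
# P(r_j) = max over observed recall r with r_j <= r <= r_{j+1} of P(r).
# Recall levels never reached by the classifier (e.g., because part of the
# corpus is left unclassified) are reported as 0.0 here; in the figure the
# curve simply stops at that point.
import numpy as np

def eleven_point_interpolated_precision(recall, precision):
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    levels = np.linspace(0.0, 1.0, 11)
    curve = []
    for j, r_j in enumerate(levels):
        r_next = levels[j + 1] if j + 1 < len(levels) else 1.0
        mask = (recall >= r_j) & (recall <= r_next)
        curve.append(precision[mask].max() if mask.any() else 0.0)
    return levels, np.array(curve)
```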

Fig. 1. Precision curves in 11-standard recall levels of the examined open-set classifiers using either W3G features (left) or W1G features (right).

The NNDR-DF model maintains very high precision scores for low levels of recall. The difference between NNDR-DF and RFSE at that point is clearer when W3G features are used. NNDR-TF is clearly worse than both NNDR-DF and


RFSE. In addition, OCSVM is competitive in terms of precision only when W3G features are used, but its performance drops abruptly in comparison to that of NNDR-DF. Note that the point where each curve ends indicates the percentage of the corpus that is left unclassified (assigned to the unknown class). RFSE correctly recognizes a larger part of the corpus (more than 70%) compared to NNDR-DF, which reaches 60%.

6 Conclusions

It seems that distributional features provide a significant enhancement to the NNDR open-set method. The low dimensionality of DF is crucial to boost the performance of NNDR. Yet, RFSE proves to be a hard-to-beat baseline, at the expense of relying upon a much higher-dimensional representation space (usually in the thousands of features). However, with respect to precision, the proposed approach is much more conservative and prefers to leave web-pages unclassified rather than guessing an inaccurate genre label. Depending on the application of WGI, precision can be considered much more important than recall, and this is where our proposed algorithm shines. Further research could focus on more appropriate distance measures within NNDR, especially with recent data-driven features obtained with powerful NLP convolutional and recurrent deep networks. Moreover, alternative types of distributional features could be used (e.g., topic modeling). Finally, a combination of NNDR with RFSE models could be studied, as they seem to exploit complementary views of the same problem.

Acknowledgement. Prof. Rocha thanks the financial support of the FAPESP DéjàVu project (Grant #2017/12646-3) and the CAPES DeepEyes grant.

References 1. Abramson, M., Aha, D.W.: What’s in a URL? Genre classification from URLs. Intelligent techniques for web personalization and recommender systems. AAAI Technical report. Association for the Advancement of Artificial Intelligence (2012) 2. Asheghi, N.R.: Human Annotation and Automatic Detection of Web Genres. Ph.D. thesis, University of Leeds (2015) 3. Asheghi, N.R., Markert, K., Sharoff, S.: Semi-supervised graph-based genre classification for web pages. In: TextGraphs-9, p. 39 (2014) 4. Boese, E.S., Howe, A.E.: Effects of web document evolution on genre classification. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 632–639. ACM (2005) 5. Crowston, K., Kwa´snik, B., Rubleske, J.: Problems in the use-centered development of a taxonomy of web genres. In: Mehler, A., Sharoff, S., Santini, M. (eds.) Genres on the Web. Text, Speech and Language Technology, vol. 42, pp. 69–84. Springer, Dordrecht (2011). https://doi.org/10.1007/978-90-481-9178-9 4 6. Dong, L., Watters, C., Duffy, J., Shepherd, M.: Binary cybergenre classification using theoretic feature measures. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), pp. 313–316 (2006)


7. Jebari, C.: A pure URL-based genre classification of web pages. In: 2014 25th International Workshop on Database and Expert Systems Applications (DEXA), pp. 233–237. IEEE (2014) 8. Jebari, C.: A combination based on OWA operators for multi-label genre classification of web pages. Procesamiento del Lenguaje Nat. 54, 13–20 (2015) 9. Joho, H., Sanderson, M.: The spirit collection: an overview of a large web collection. SIGIR Forum 38(2), 57–61 (2004) 10. Kanaris, I., Stamatatos, E.: Learning to recognize webpage genres. Inf. Process. Manage. 45(5), 499–512 (2009) 11. Kennedy, A., Shepherd, M.: Automatic identification of home pages on the web. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences, HICSS 2005, p. 99c. IEEE (2005) 12. Kumari, K.P., Reddy, A.V., Fatima, S.S.: Web page genre classification: impact of n-gram lengths. Int. J. Comput. Appl. 88(13), 13–17 (2014) 13. Levering, R., Cutler, M., Yu, L.: Using visual features for fine-grained genre classification of web pages. In: Proceedings of the 41st Annual Hawaii International Conference on System Sciences, pp. 131–131. IEEE (2008) 14. Lim, C.S., Lee, K.J., Kim, G.C.: Multiple sets of features for automatic genre classification of web documents. Inf. Process. Manage. 41(5), 1263–1276 (2005) 15. Madjarov, G., Vidulin, V., Dimitrovski, I., Kocev, D.: Web genre classification via hierarchical multi-label classification. In: Jackowski, K., Burduk, R., Walkowiak, K., Wo´zniak, M., Yin, H. (eds.) IDEAL 2015. LNCS, vol. 9375, pp. 9–17. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24834-9 2 16. Malhotra, R., Sharma, A.: Quantitative evaluation of web metrics for automatic genre classification of web pages. Int. J. Syst. Assur. Eng. Manage. 8(2), 1567–1579 (2017) 17. Mason, J., Shepherd, M., Duffy, J.: An n-gram based approach to automatically identifying web page genre. In: HICSS, pp. 1–10. IEEE Computer Society (2009) 18. Mehler, A., Sharoff, S., Santini, M.: Genres on the Web: Computational Models and Empirical Studies. Text, Speech and Language Technology. Springer, Heidelberg (2010). https://doi.org/10.1007/978-90-481-9178-9 19. Mendes J´ unior, P.R., et al.: Nearest neighbors distance ratio open-set classifier. Mach. Learn. 106, 1–28 (2016) 20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013) 21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013) 22. Nooralahzadeh, F., Brun, C., Roux, C.: Part of speech tagging for French social media data. In: COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 23–29 August 2014, Dublin, Ireland, pp. 1764–1772 (2014) 23. Onan, A.: An ensemble scheme based on language function analysis and feature engineering for text genre classification. J. Inf. Sci. 44(1), 28–47 (2018) 24. Petrenz, P., Webber, B.: Stable classification of text genres. Comput. Linguist. 37(2), 385–393 (2011) 25. Posadas-Dur´ an, J.P., G´ omez-Adorno, H., Sidorov, G., Batyrshin, I., Pinto, D., Chanona-Hern´ andez, L.: Application of the distributed document representation in the authorship attribution task for small corpora. Soft Comput. 21(3), 627–639 (2017)


26. Pritsos, D., Stamatatos, E.: The impact of noise in web genre identification. In: Mothe, J., et al. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 268–273. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24027-5 27 27. Pritsos, D., Stamatatos, E.: Open set evaluation of web genre identification. Lang. Resour. Eval. 52, 1–20 (2018) 28. Pritsos, D.A., Stamatatos, E.: Open-set classification for automated genre identification. In: Serdyukov, P., et al. (eds.) ECIR 2013. LNCS, vol. 7814, pp. 207–217. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36973-5 18 29. Priyatam, P.N., Iyengar, S., Perumal, K., Varma, V.: Don’t use a lot when little will do: genre identification using URLs. Res. Comput. Sci. 70, 207–218 (2013) ˇ uˇrek, R., Sojka, P.: Software framework for topic modelling with large cor30. Reh˚ pora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, pp. 45–50, May 2010. http://is.muni.cz/ publication/884893/en 31. Rosso, M.A.: User-based identification of web genres. J. Am. Soc. Inf. Sci. Technol. 59(7), 1053–1072 (2008). https://doi.org/10.1002/asi.20798 32. Santini, M.: Automatic identification of genre in web pages. Ph.D. thesis, University of Brighton (2007) 33. Santini, M.: Cross-testing a genre classification model for the web. In: Mehler, A., Sharoff, S., Santini, M. (eds.) Genres on the Web. Text, Speech and Language Technology, vol. 42, pp. 87–128. Springer, Dordrecht (2011). https://doi.org/10. 1007/978-90-481-9178-9 5 34. Sharoff, S., Wu, Z., Markert, K.: The Web Library of Babel: evaluating genre collections. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation, pp. 3063–3070 (2010) 35. Shepherd, M.A., Watters, C.R., Kennedy, A.: Cybergenre: automatic identification of home pages on the web. J. Web Eng. 3(3–4), 236–251 (2004) 36. Stewart, J.G.: Genre oriented summarization. Ph.D. thesis, Carnegie Mellon University (2009) 37. Stubbe, A., Ringlstetter, C., Schulz, K.U.: Genre as noise: noise in genre. Int. J. Doc. Anal. Recogn. (IJDAR) 10(3–4), 199–209 (2007) 38. Vidulin, V., Luˇstrek, M., Gams, M.: Using genres to improve search engines. In: Proceedings of the International Workshop Towards Genre-Enabled Search Engines, pp. 45–51 (2007) 39. Worsham, J., Kalita, J.: Genre identification and the compositional effect of genre in literature. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1963–1973 (2018) 40. Zhu, J., Zhou, X., Fung, G.: Enhance web pages genre identification using neighboring pages. In: Bouguettaya, A., Hauswirth, M., Liu, L. (eds.) WISE 2011. LNCS, vol. 6997, pp. 282–289. Springer, Heidelberg (2011). https://doi.org/10.1007/9783-642-24434-6 23

Exploiting Global Impact Ordering for Higher Throughput in Selective Search

Michal Siedlaczek(B), Juan Rodriguez, and Torsten Suel

Computer Science and Engineering, New York University, New York, USA
{michal.siedlaczek,jcr365,torsten.suel}@nyu.edu

Abstract. We investigate potential benefits of exploiting a global impact ordering in a selective search architecture. We propose a generalized, ordering-aware version of the learning-to-rank-resources framework [9] along with a modified selection strategy. By allowing partial shard processing we are able to achieve a better initial trade-off between query cost and precision than the current state of the art. Thus, our solution is suitable for increasing query throughput during periods of peak load or in low-resource systems.

Keywords: Selective search · Global ordering · Shard selection

1 Introduction

Substantial advances have recently been made in selective search—a type of federated search where a collection is clustered into topical shards, and a selection algorithm selects for each query a small set of relevant shards for processing. The latest state of the art proposed by Dai et al. [9] uses a learning-to-rank technique from complex document ranking [18] to select shards during query processing. We propose a generalization of this approach that exploits a global impact ordering of the documents, in addition to topic-based clustering. Query-independent global impact scores model the overall quality of documents in a collection. They have previously been used in unsafe early termination techniques, such as tiering, which limit query cost by disregarding documents deemed unimportant, unless a query is found to require more exhaustive processing. However, to our knowledge, this is the first attempt at combining selective search and global ordering-based early termination.

Contributions. We produce global orderings for the GOV2 and Clueweb09-B collections; describe a modified selective search architecture that can exploit the orderings, and expand the state-of-the-art solution to allow for partial shard selection; compare our results with the baseline, and show that we achieve a better quality-efficiency trade-off at low query costs; and discuss a number of research opportunities that are motivated by our work.


2 Related Work

Selective Search. A number of researchers have recently explored selective search architectures, where a collection is clustered into topical shards [10,16]. Then, when processing a query, we select a limited number of shards that are predicted to be most relevant to the query topic. The literature identifies three classes of shard selection algorithms: Term-based methods [1,7] model the language distribution of each shard. Sample-based methods [16,24,25] use results from a centralized sample index, which contains a small random sample (up to 2%) of the entire index. Supervised methods [9,13] use labeled data to learn models that predict shard relevance. The current state of the art, proposed by Dai et al. [9], belongs to the last group. We derive our solution from their model, and also use it as a baseline for comparison. Global Impact Ordering and Index Tiering. A global impact ordering is a query-independent ordering of the documents in a collection in terms of quality or importance. There are a number of ways to compute such orderings, including Pagerank [20], spam scores [8], performance of a document on past queries [2, 12], or machine-learned orderings [22]. Global orderings can be exploited for faster query processing by organizing index structures such that higher-quality documents appear first during index traversal. One well known approach that exploits a global ordering is called index tiering [4,17,19,21,23]. Here, the collection is partitioned into two or more subsets called tiers based on document quality, and an inverted index is built for each of them. Each query is then first evaluated on the first tier with the highestquality documents, and only evaluated on additional tiers if results on earlier tiers are considered insufficient. An alternative approach simply assigns IDs to documents in descending order of quality, and then stops index traversal once enough high quality documents have been evaluated for a query [3,12]. We refer to this method as global rank cutoff (GRC). Our Approach. We apply GRC to a selective search environment by ordering documents inside each cluster by global impact ordering. Our ordering is determined by the performance of documents on past queries as in [2,12], which appears to provide a stronger ordering than Pagerank or spam scores, though limited improvements may be possible with an ML-based approach that combines a number of features [22]. We note that index tiering and GRC are orthogonal and complementary to safe early termination techniques such as WAND [6], BMW [11], or Max-Score [26]. In fact, previous work [15] has shown that these techniques provide benefits in selective search architectures, and we would expect them to also be profitable in our proposed architecture.

3 Ordering-Aware Selective Search

Clustering. Our index relies on the collection of documents being clustered into a number of topical shards. Since this step is orthogonal to our work, we adopt the clusters used by Dai et al. [9,10].


Reordering. Documents in each shard are reordered according to a global ordering obtained by counting document hits on a query log: given a set of queries Q and a document d, we define a hit count h(d) = Σ_{q∈Q} I_q(d), where I_q(d) is equal to 1 if d is among the top k results retrieved for query q, and 0 otherwise (we used k = 1000). For Clueweb09-B, we used 40,000 queries from the TREC 2009 Million Query Track. For GOV2, we used a set of 2 million queries randomly generated by a language model learned from a sample of several TREC query tracks. In each shard, we then assign document IDs to documents based on decreasing hit count. Evaluation is performed on a set of queries that is disjoint from those used to compute the ordering or train the language model.

Ordered Partitioning. The idea behind reordering is to facilitate global rank cutoff within topical clusters. In this paper, we approximate it by partitioning consecutive document IDs in each shard into b ranges of equal size, effectively setting up b cutoff points where processing can stop. We call these partitions buckets and number them with consecutive integers, where 1 denotes the first (highest quality) bucket and b denotes the last one. This approach allows us to extend selective search with global ordering while trading off model accuracy and complexity. Large values of b model GRC well but make both learning and feature extraction more expensive. Structures with small b are fast and resemble discrete tiers. When b = 1, we get the standard selective search approach, as done in previous work. From this point forward, we refer to a solution with b buckets as B_b. In particular, B_1 denotes the baseline.

Learning Model. We modify the learning-to-rank-resources model proposed by Dai et al. [9]; see that paper for more details. We discard CSI-based features, as they were shown to provide only negligible improvement but are expensive to compute [9]. To the shard-level features (shard popularity and term-based statistics), we add only one bucket-level feature, namely the bucket number, from 1 to b. Thus, we train and predict a ranking of buckets instead of shards. This ranking is further converted into the final selection as described below.

Bucket Cost. Given a query, every bucket i in a shard s has an associated processing cost c(s, i). We consider two cost models: (1) uniform cost: c(s, i) = 1/b, used during shard selection, and (2) posting cost: the number of postings for the query terms within bucket (s, i), used during evaluation. We also present one set of experiments where we add an additional per-shard access cost to the posting cost, in order to model the overhead of forwarding a query to a shard.

Shard Selection. In contrast to previous shard selection algorithms, our approach requires us to select shards as well as a cut-off point for early termination inside each selected shard (Fig. 1). Our model predicts a bucket-level ranking, but to model GRC, we need to make sure to select a prefix of the bucket list in each shard. Thus, given a budget T per query, we iterate over all buckets from highest to lowest ranked, until the remaining query budget t (initially t = T) drops to 0. For a bucket at position i in a shard s, we consider all unselected buckets (s, j) for j ≤ i, and denote the sum of their costs as C. If C ≤ t, then we update the budget t ← t − C and mark all buckets (s, j), j ≤ i as selected.
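To make the selection step concrete, the following sketch implements the greedy, prefix-preserving bucket selection described above under the uniform cost model; the function and variable names are ours, not the authors' code.

```python
# Sketch of order-aware selection: given a predicted ranking of (shard, bucket)
# pairs and a per-query budget T, select for each shard a prefix of its buckets,
# modeling a global-rank cutoff inside the shard. Uniform bucket cost 1/b is
# assumed, as in the selection step of the paper; names are illustrative.

def select_buckets(ranked_buckets, b, budget_T):
    """ranked_buckets: list of (shard_id, bucket_no) sorted by predicted relevance,
    bucket_no in 1..b. Returns dict shard_id -> deepest selected bucket number."""
    cost = 1.0 / b                       # uniform cost model
    t = budget_T                         # remaining budget
    selected = {}                        # shard -> largest selected bucket (prefix)
    for shard, i in ranked_buckets:
        deepest = selected.get(shard, 0)
        if i <= deepest:
            continue                     # buckets (s, j), j <= i already selected
        extra = (i - deepest) * cost     # cost C of the unselected buckets j <= i
        if extra <= t:
            t -= extra
            selected[shard] = i          # mark buckets 1..i of this shard as selected
        if t <= 0:
            break
    return selected
```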


Fig. 1. An illustration of order-aware shard selection. Based on the predicted bucket ranking, portions of shards are selected (shaded area). While strongly relevant shards (S3, S8) may be fully processed, others could terminate early using GRC (S1, S4, S7), or be ignored entirely (S2, S5, S6).

4 Methodology

Data Sets. We conduct our experiments on GOV2 and Clueweb09-B collections, consisting of approximately 25 million and 50 million documents, respectively. Following previous work [9], we remove the documents with Waterloo spam score [8] below 50 from Clueweb09-B. The resulting collection consists of nearly 38 million documents. We use the default HTML parser from the BUbiNG library, and stem terms using the Porter2 algorithm. No stopwords are removed. We use 150 queries from the TREC 04-05 Terabyte Track (GOV2) and 200 queries from the TREC 09-12 Web Track (Clueweb09-B) topics for training our model. We evaluate our method using 10-fold cross-validation.

Training Model. Following [9], we use the SVMrank library [14] with linear kernel to train our ranking prediction model. It implements the pair-wise approach to learning to rank [18]. We compute the relevance-based ground truth as described in [9].

Evaluation Metrics. Following [9], we report search accuracy in terms of P@10 and the more recall-oriented MAP@1000. (Due to space restrictions, we exclude NDCG@30, as it exhibited similar behavior to MAP@1000.) Query cost is reported in terms of the number of postings as described in Sect. 3.

Search Architecture. The GOV2 collection is clustered into 199 shards, and Clueweb09-B into 123 shards, as done in previous work [9,10]. We process our queries using the MG4J search engine [5] with default BM25 scoring (k1 = 1.2, b = 0.5). When scoring documents in shards, we use global values for collection size, posting count, term frequencies, and occurrence counts, in order to obtain scores that are comparable between shards.
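As an illustration of the last point, the sketch below scores a document with BM25 (k1 = 1.2, b = 0.5) using globally shared collection statistics instead of per-shard ones, so that scores from different shards are comparable. This is only an assumption-laden approximation; the exact IDF variant used by MG4J may differ.

```python
# Sketch of BM25 scoring where the collection statistics (document count N,
# document frequencies df, average document length) are the global ones rather
# than per-shard values, so scores from different shards are comparable.
# One common IDF variant is used here; illustrative only.
import math

def bm25_score(query_terms, doc_tf, doc_len, global_N, global_df, global_avgdl,
               k1=1.2, b=0.5):
    """doc_tf: dict term -> frequency in the document;
    global_df: dict term -> document frequency in the whole collection."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        idf = math.log(1.0 + (global_N - global_df[t] + 0.5) / (global_df[t] + 0.5))
        norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / global_avgdl))
        score += idf * norm
    return score
```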

5 Experiments

We experimented with b = 1, 10, 20, where b = 1 is equivalent to the current state of the art without using a global ordering in [9], which serves as the baseline. Since the baseline disregards differences in query cost due to inverted list lengths during shard selection, we also do so, using the uniform cost model for this step. Following previous work, we report the resulting posting costs in the evaluation. We express budgets in terms of the numbers of selected shards, where one shard is translated into b buckets for b > 1. We report results for budgets up to


20 shards. However, our goal is to limit the cost with little decrease in quality; therefore, we are mostly interested in very low costs, below what has already been shown to work on a par with exhaustive search. Figure 2 shows the tradeoffs between quality measures (P@10 and MAP@1000) and query cost in terms of processed postings. Reported values are averaged over all queries.

Fig. 2. Quality-cost trade-off of bucket selection for different bucket counts and data sets. Shards were selected under the uniform cost model and with budgets from 1 to 20 shards. For Clueweb09-B, additional budgets of 0.25 and 0.5 were tested for b > 1. The quality of exhaustive search is indicated by a dashed line.

Both measures improve much faster for B_10 and B_20 than for B_1. For instance, P@10 in Clueweb09-B improves upon exhaustive search for a budget as small as 1 shard. At low budgets, we need to process only about a third of what is needed in B_1 to achieve the same quality under P@10. Similarly, achieving the MAP@1000 quality level of exhaustive search requires far fewer postings to be traversed, with smaller improvements observed for GOV2. These improvements are achieved by enabling partial shard access; instead of traversing unimportant documents in highly relevant shards, we allocate budget towards traversing highly important documents in slightly less topically relevant shards. This is achieved by learning a model that automatically adjusts the cutoff levels.

Shard Access Overhead. While our results show that we can decrease resource requirements, there is an important caveat. One expected side effect of the proposed solution is that, for the average query, we contact more shards than for b = 1. However, dispatching a query to more shards and collecting the results is likely to result in some amount of overhead. We now address this concern. As shown in Fig. 3, B_10 and B_20 contact about 2 to 3 times as many shards as B_1. We now estimate how this would impact the performance of our measures.


Fig. 3. Average number of shards used in query processing, relative to the baseline.

To do this, we add an overhead cost for contacting a shard equal to 10% of the average cost of an exhaustive query on a single shard. We believe this is a conservative upper bound to the overhead of contacting a shard. Figure 4 compares the performance of the no-overhead case (in red) to the 10%-overhead case (in blue), and shows that the improvement over the baseline is still significant.

Fig. 4. Quality-cost trade-off of bucket selection with per-shard access overheads of 0% and 10% of the average cost of an exhaustive query on a single shard. (Color figure online)

6 Conclusions

In this paper, we have described how to combine selective search with global ordering-based early termination. Our approach can significantly improve throughput in scenarios where resources are scarce, or where a system experiences peak loads. Although our solution results in additional overhead for contacting more shards, we showed that overall costs still decrease significantly. Our results motivate a number of research questions that we are currently pursuing, including the use of better global orderings based on machine learning, use of


additional bucket-level features, the performance of the schemes when used to generate candidates for reranking under a complex ranker, and possible schemes for adapting to high loads under realistic service level agreements (SLAs). Acknowledgement. This research was partially supported by NSF Grant IIS1718680 and a grant from Amazon.

References 1. Aly, R., Hiemstra, D., Demeester, T.: Taily: shard selection using the tail of score distributions. In: Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 673–682 (2013) 2. Anagnostopoulos, A., Becchetti, L., Leonardi, S., Mele, I., Sankowski, P.: Stochastic query covering. In: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, pp. 725–734 (2011) 3. Asadi, N., Lin, J.: Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In: Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 997–1000 (2013) 4. Baeza-Yates, R., Murdock, V., Hauff, C.: Efficiency trade-offs in two-tier web search systems. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 163–170. ACM (2009) 5. Boldi, P., Vigna, S.: MG4J at TREC 2005. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings (2005) 6. Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the 12th International Conference on Information and Knowledge Management, pp. 426–434 (2003) 7. Callan, J.P., Lu, Z., Croft, W.B.: Searching distributed collections with inference networks. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 21–28 (1995) 8. Cormack, G.V., Smucker, M.D., Clarke, C.L.A.: Efficient and effective spam filtering and re-ranking for large web datasets. Inf. Retrieval 14(5), 441–465 (2011) 9. Dai, Z., Kim, Y., Callan, J.: Learning to rank resources. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 837–840 (2017) 10. Dai, Z., Xiong, C., Callan, J.: Query-biased partitioning for selective search. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 1119–1128 (2016) 11. Ding, S., Suel, T.: Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 993–1002 (2011) 12. Garcia, S., Williams, H.E., Cannane, A.: Access-ordered indexes. In: Proceedings of the 27th Australasian Conference on Computer Science, pp. 7–14 (2004) 13. Hong, D., Si, L., Bracke, P., Witt, M., Juchcinski, T.: A joint probabilistic classification model for resource selection. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 98–105 (2010) 14. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226 (2006)


15. Kim, Y., Callan, J., Culpepper, J.S., Moffat, A.: Does selective search benefit from WAND optimization? In: Ferro, N., Crestani, F., Moens, M.-F., Mothe, J., Silvestri, F., Di Nunzio, G.M., Hauff, C., Silvello, G. (eds.) ECIR 2016. LNCS, vol. 9626, pp. 145–158. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30671-1 11 16. Kulkarni, A., Callan, J.: Selective search: efficient and effective search of large textual collections. ACM Trans. Inf. Syst. (TOIS) 33(4), 17 (2015) 17. Leung, G., Quadrianto, N., Tsioutsiouliklis, K., Smola, A.J.: Optimal web-scale tiering as a flow problem. In: Advances in Neural Information Processing Systems, pp. 1333–1341 (2010) 18. Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retrieval 3(3), 225–331 (2009) 19. Ntoulas, A., Cho, J.: Pruning policies for two-tiered inverted index with correctness guarantee. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 191–198 (2007) 20. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report 1999–66 (1999) 21. Panigrahi, D., Gollapudi, S.: Document selection for tiered indexing in commerce search. In: Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pp. 73–82. ACM (2013) 22. Richardson, M., Prakash, A., Brill, E.: Beyond PageRank: machine learning for static ranking. In: Proceedings of the 15th International Conference on World Wide Web, pp. 707–715 (2006) 23. Risvik, K.M., Aasheim, Y., Lidal, M.: Multi-tier architecture for web search engines. In: Proceedings of the First Conference on Latin American Web Congress, pp. 132–143 (2003) 24. Si, L., Callan, J.: Relevant document distribution estimation method for resource selection. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 298–305 (2003) 25. Thomas, P., Shokouhi, M.: SUSHI: scoring scaled samples for server selection. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 419–426 (2009) 26. Turtle, H., Flood, J.: Query evaluation: strategies and optimizations. Inf. Process. Manage. 31(6), 831–850 (1995)

Cross-Domain Recommendation via Deep Domain Adaptation

Heishiro Kanagawa1(B), Hayato Kobayashi2,4, Nobuyuki Shimizu2, Yukihiro Tagami2, and Taiji Suzuki3,4

1 Gatsby Unit, UCL, London, UK
[email protected]
2 Yahoo Japan Corporation, Tokyo, Japan
{hakobaya,nobushim,yutagami}@yahoo-corp.jp
3 The University of Tokyo, Tokyo, Japan
4 RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
[email protected]

Abstract. The behavior of users in certain services indicates their preferences, which may be used to make recommendations for other services they have never used. However, the cross-domain relation between items and user preferences is not simple, especially when there are few or no common users and items across domains. We propose a content-based cross-domain recommendation method for cold-start users that does not require user- or item-overlap. We formulate recommendations as an extreme classification task, and the problem is treated as an instance of unsupervised domain adaptation. We assess the performance of the approach in experiments on large datasets collected from Yahoo! JAPAN video and news services and find that it outperforms several baseline methods including a cross-domain collaborative filtering method.

Keywords: Cross-domain recommendation · Deep domain adaptation

1 Introduction

Conventional recommender systems are known to be ineffective in cold-start scenarios, e.g., when the user is new to the service, or the goal is to recommend items from a service that the user has not used, because they require knowledge of the user's past interactions [32]. On the other hand, as the variety of Web services has increased, information about cold-start users can often be obtained from their activities in other services. Therefore, cross-domain recommender systems (CDRSs), which utilize such information from other domains, have gained research attention as a promising solution to the cold-start problem [4]. (This work was conducted during the first author's internship at Yahoo! JAPAN. TS was partially supported by MEXT Kakenhi (18H03201) and JST-CREST.)

This paper addresses a problem of cross-domain recommendation in which we cannot expect users or items to be shared across domains. Our task is to


recommend items in a domain (service) to users who have not used it before but have a history in another domain (see Sect. 5 for details). This situation is common in practice. For instance, when one service is not known (e.g., when it is newly built) to users of another more popular service, there are few overlapping users, which was actually observed in the dataset of this study. Another probable instance is when user identities are anonymized to preserve privacy [17]; there are no shared users in this case because we cannot know if two users are identical across domains. The case in which domains have a user- or item-overlap has been extensively studied because an overlap helps us learn relations between users and items of different domains [24,25,27,36]. The problem, however, is harder when there is little or no overlap as learning of user-item relations becomes more challenging. Previous studies dealt with this challenge by using specific forms of auxiliary information that are not necessarily available including user search queries on Bing for recommendations in other Microsoft services [8], tags on items given by users such as in a photo sharing service, e.g., Flickr [1,2], and external knowledge repositories such as Wikipedia [9,19,20,23]. Methods that use the content information of items like this are called content-based (CB) methods. Collaborative filtering (CF) methods, which do not require auxiliary information, have also been proposed [12,17,43]. Despite their broader applicability, a drawback of CF methods compared to CB ones is that they suffer from data sparsity and require a substantial amount of user interactions. Our main contribution is to fill in the gap between CB and CF approaches. We propose a CB method that (a) only uses content information generally available to content service providers and (b) does not require shared users or items. Our approach is the application of unsupervised domain adaptation (DA). Although unsupervised DA has shown fruitful results in machine learning [3,11,33], its connection with CDRSs has been largely unexplored. Our formulation by extreme multi-class classification [6,30], where labels (items) corresponding to a user are predicted, allows us the use of this technique. We use a neural network (NN) architecture for unsupervised DA, the domain separation network (DSN) [3], because NNs can learn efficient features, as shown in the successful applications to recommender systems [6,8,14,28,38–41,44]. We provide a practical case study through experiments on large datasets collected from existing commercial services of Yahoo! JAPAN. We compare our method with several baselines including the state-of-the-art CF method [17]. We show that (a) the CF method does not actually perform well in this real application, and (b) our approach shows the highest performance in DCG.

2 Related Work

Approaches to CDRS can be categorized into two types [4]. The first type aggregates knowledge by combining datasets from multiple domains in a common format, e.g., a common rating matrix [24]. As a result, user-item overlaps or a specific data format are typically assumed [1,2,8,25–27,35]. In particular, the


deep-learning approach [15] uses, as in our method, unstructured texts to learn user and item representations but assumes a relatively large number of common users. The second type links domains by transferring learned knowledge. This line of research has been limited to matrix-factorization based CF methods because sharing latent factors across domains allows knowledge transfer [12,17,43]. Our method is a CB method and solves the aforementioned data sparsity problem of CF methods. Also, DA provides an alternative method of knowledge transfer to shared latent factors. There are works labeled as cross-domain recommendation that solve a different task. Studies including [7,22,29,42] have dealt with sparsity reduction. The task is to improve recommendation quality within a target domain and is not cross-selling from a different domain. This is performed, for example, by conducting CF on a single domain having sparse data with the help of other domains, such as using a shared latent factor learned in dense domains [7,22]. Although the work [42] uses DA, it cannot solve the same problem as ours. Finally, online reinforcement learning can be used to obtain feedback from cold-start users [31]. However, it is difficult to deploy it in commercial applications because the performance is harder to evaluate compared to offline methods such as ours since the model constantly changes over time.

3 Problem Setting

Let X be the input feature space, such as R^d for a positive integer d. Let Y = {1, . . . , L} be a collection of items that we wish to recommend. We are given two datasets from two different domains. The first is a labeled dataset D_S = {(x_i^S, y_i^S)}_{i=1}^{N_S}. Each element denotes a user-item pair, with y_i^S ∈ Y being a label representing the item and x_i^S ∈ X being a feature vector that represents the previous history of a user and is formed from the content information of the items in the history. This domain, from which the labeled dataset comes, is called the source domain. The second dataset is D_T^X = {x_i^T}_{i=1}^{N_T} (the superscript X denotes that the data is missing the labels associated with its input vectors), where x_i^T ∈ X denotes the feature vector of a user's history consisting of items in the other domain. We call this data domain the target domain. The goal is to recommend items in the source domain to users in the target domain. Our approach is to construct a classifier η : X → Y such that η(x^T) gives the most probable item in Y for a new user history x^T from the target domain.
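The following sketch shows one way the two datasets of this setting can be materialized from raw histories, using TF-IDF features over item attribute words with a shared vocabulary as described in Sect. 5; the function names and the scikit-learn vectorizer configuration are illustrative assumptions, not the authors' pipeline.

```python
# Sketch of the problem setting: each source user contributes a (history vector,
# item label) pair, while target users contribute unlabeled history vectors.
# TF-IDF over concatenated item attribute words with a shared vocabulary is
# assumed; all names and settings are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

def build_datasets(source_histories, source_labels, target_histories, vocab):
    """*_histories: list of strings, each the concatenated attribute words of the
    items in one user's history; source_labels: the item (in Y = {1, ..., L})
    associated with each source history; vocab: the common vocabulary shared by
    both domains."""
    vectorizer = TfidfVectorizer(vocabulary=vocab)
    X_S = vectorizer.fit_transform(source_histories)   # labeled source features
    y_S = source_labels
    X_T = vectorizer.transform(target_histories)        # unlabeled target features
    return (X_S, y_S), X_T
```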

4 Unsupervised Domain Adaptation

We define the problem of unsupervised DA [3,11]. We are given two datasets D_S ∼ P_S and D_T^X ∼ P_T^X (both i.i.d.) as in the previous section. Here, P_S and P_T are probability distributions over X × Y, which are assumed to be similar but different. P_T^X is the marginal distribution over X in domain T. The goal of unsupervised DA is to obtain a classifier η with a low risk, defined by R_T(η) = Pr_{(x,y)∼P_T}[η(x) ≠ y], using only the datasets D_S and D_T^X.

Domain Separation Networks: We introduce the neural network architecture, domain separation networks (DSNs) [3]. A DSN learns predictive domain-invariant features, separately from domain-specific features, which gives a shared representation of inputs from different domains. Figure 1 shows the architecture of a DSN model, which consists of the following components: a shared encoder E_c(x; θ_c), private encoders E_p^S(x; θ_p^S) and E_p^T(x; θ_p^T), a shared decoder D(h; θ_d), and a classifier G(h; θ_g), where θ_c, θ_p^S, θ_p^T, θ_d, and θ_g denote parameters. Given an input vector x, which comes from the source domain or the target domain, the shared encoder function E_c maps it to a hidden representation h_c, which represents features shared across domains. For the source (target) domain, a DSN has a private encoder E_p^S (E_p^T) that maps an input vector to a hidden representation h_p^S (h_p^T), which serves as a feature vector specific to the domain. A common decoder D reconstructs an input vector x of the source (or target) domain from the sum of the shared and private hidden representations, h_c and h_p^S (h_p^T). The classifier G takes a shared hidden representation h_c as input and predicts the corresponding label.

Fig. 1. Architecture of DSN

Training is performed by minimizing the following objective function L_DSN with respect to the parameters θ_c, θ_p^S, θ_p^T, θ_d, and θ_g: L_DSN = L_task + α L_recon + β L_diff + γ L_sim, where α, β, and γ are parameters that control the effect of the associated terms. L_task is a classification loss (cross-entropy loss). L_recon is a reconstruction loss (squared Euclidean distance). The loss L_diff encourages the shared and private encoders to extract different types of features; as in [3], we impose a soft subspace orthogonality loss. The similarity loss L_sim encourages the shared encoder to produce representations whose domain is hardly distinguishable; we use the domain-adversarial similarity loss [10,11]. For the output layer of the classifier, we use the softmax activation. We suggest a list of items in descending order of the output probabilities.
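A condensed PyTorch sketch of such a DSN and its combined loss is given below; the layer sizes, the soft orthogonality term, and the gradient-reversal similarity loss are simplified stand-ins chosen for illustration and do not reproduce the authors' implementation.

```python
# Condensed sketch of a DSN-style model for this task. Layer sizes are simplified
# relative to the configuration reported in the paper; the difference loss is a
# soft orthogonality penalty and the similarity loss is a domain classifier
# trained adversarially through gradient reversal, as in [3,10,11].
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(sizes):
    # Fully-connected stack with ELU activations, as used in the paper.
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ELU()]
    return nn.Sequential(*layers)

class GradReverse(torch.autograd.Function):
    # Gradient reversal layer for the domain-adversarial similarity loss.
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DSN(nn.Module):
    def __init__(self, d_in, n_items, h=64):
        super().__init__()
        self.shared_enc = mlp([d_in, 256, 128, h])    # E_c
        self.private_src = mlp([d_in, 256, 128, h])   # E_p^S
        self.private_tgt = mlp([d_in, 256, 128, h])   # E_p^T
        self.decoder = mlp([h, 128, 256])             # D (shared decoder)
        self.decoder_out = nn.Linear(256, d_in)       # reconstruct the input
        self.classifier = nn.Sequential(mlp([h, 256]), nn.Linear(256, n_items))  # G
        self.domain_clf = nn.Sequential(mlp([h, 64]), nn.Linear(64, 2))

    def loss(self, x_src, y_src, x_tgt, alpha=1e-3, beta=1e-2, gamma=100.0):
        hc_s, hp_s = self.shared_enc(x_src), self.private_src(x_src)
        hc_t, hp_t = self.shared_enc(x_tgt), self.private_tgt(x_tgt)
        # L_task: cross-entropy on source labels, using shared features only.
        l_task = F.cross_entropy(self.classifier(hc_s), y_src)
        # L_recon: reconstruct each input from shared + private representations.
        rec_s = self.decoder_out(self.decoder(hc_s + hp_s))
        rec_t = self.decoder_out(self.decoder(hc_t + hp_t))
        l_recon = F.mse_loss(rec_s, x_src) + F.mse_loss(rec_t, x_tgt)
        # L_diff: soft orthogonality between shared and private representations.
        l_diff = (hc_s.t() @ hp_s).pow(2).sum() + (hc_t.t() @ hp_t).pow(2).sum()
        # L_sim: domain classification loss through a gradient reversal layer.
        h_all = GradReverse.apply(torch.cat([hc_s, hc_t], dim=0))
        d_lab = torch.cat([torch.zeros(x_src.size(0)),
                           torch.ones(x_tgt.size(0))]).long().to(x_src.device)
        l_sim = F.cross_entropy(self.domain_clf(h_all), d_lab)
        return l_task + alpha * l_recon + beta * l_diff + gamma * l_sim
```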

5 Experiments

We conduct a performance evaluation on two real-world datasets in collaboration with Yahoo! JAPAN. The datasets consist of the browsing logs of a video on demand service (VIDEO) and a news aggregator (NEWS). The task is to recommend videos to NEWS users who have not used the VIDEO service.


Dataset Description: For VIDEO, we create a labeled dataset where a label is a video watched by a particular user, and the input is the list of videos previously watched by the user. Similarly, for NEWS, we build an unlabeled dataset that consists of histories of news articles representing users. The number of instances for each dataset is roughly 11 million. For evaluation, we form a test dataset of size 38,250 from users who used both services, where each instance is a pair of a news history (input) and a video (label). Items in both services have text attributes. For VIDEO, we use the following attributes: title, category, short description, and cast information. For NEWS, article title and category are used. We treat a history as a document comprised of attribute words and represent it with the TF-IDF scheme. We form a vocabulary set for each dataset according to the TF-IDF value. Combining the two sets gives a common vocabulary set of 50,000 words.

Experimental Settings: We construct a DSN consisting of fully-connected layers (the hidden layer unit sizes are E_c = E_p^S = E_p^T = (256, 128, 128, 64), D = (128, 128, 256), and G = (256, 256, 256, 64), with the input on the left). The exponential linear unit (ELU) [5] is used as the activation function for all layers. The weight and bias parameters for the fully-connected layers are initialized following [13]. We apply dropout [34] to all fully-connected layers, with rate 0.25 for all the encoders and 0.5 for the decoder and the classifier. We also impose an L2 penalty on the weight parameters, with the regularization parameter chosen from {10^-1, 10^-2, 10^-3, 10^-4}. The parameters α, β, and γ in L_DSN are set to 10^-3, 10^-2, and 100, respectively. The parameter γ is chosen by the similarity of the marginals (we train a classifier that detects the domain of the input represented by the shared encoder and test the classification accuracy; this is due to the adversarial training) and agrees with the setting in [3]. The ADAM optimizer [21] is used, and the initial learning rate is set to 10^-3.

We use discounted cumulative gain (DCG) [18] for evaluation. DCG gives a higher score when a correct item appears earlier in a suggested-item list and thus measures ranking quality. It is defined by DCG@M = Σ_{m=1}^{M} I(ŷ_m = y) / log(m + 1), where the function I(ŷ_m = y) returns 1 when the m-th suggested item ŷ_m is y and 0 otherwise. The use of metrics such as precision is not appropriate, since negative feedback is not well-defined in implicit feedback; unobserved items may express users' distaste or may simply have been unseen, whereas we can treat viewed items as true positives.

We evaluate the method for five different training and test dataset pairs, which are randomly sampled from the datasets. We sample 80% of both the whole training and test datasets. For each pair, we compute DCG averaged over the test set. We use a validation set that consists of logs of common users on the same dates as the training data for model selection, to investigate the attainable performance of domain adaptation. Note that this setting still keeps our target situation in which there are not enough overlapping users for training.
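For reference, the DCG@M definition above can be computed as in the following sketch; since each test instance has a single relevant item y, at most one rank contributes to the sum. The text does not state the logarithm base, so the natural logarithm is used here; base 2, the usual convention of [18], would only rescale the scores.

```python
# Sketch of DCG@M as defined above: each test user has one relevant item y, so
# at most one rank m contributes 1/log(m + 1).
import math

def dcg_at_m(suggested_items, y, M):
    """suggested_items: list of item ids ranked by predicted probability."""
    for m, item in enumerate(suggested_items[:M], start=1):
        if item == y:
            return 1.0 / math.log(m + 1)
    return 0.0
```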

Table 1. Performance comparison of DSN, NN, CdMF, POP. (DCG) of DSN and NN denotes models that are chosen according to the value of DCG@100. Similarly, (CEL) denotes the cross-entropy loss. Each entry is Mean ± Std.

Method      DCG@1            DCG@10           DCG@50           DCG@100
DSN (DCG)   0.0618 ± 0.0212  0.2133 ± 0.0154  0.2873 ± 0.0151  0.2945 ± 0.0153
DSN (CEL)   0.0406 ± 0.0212  0.1668 ± 0.0384  0.2583 ± 0.0230  0.2655 ± 0.0229
NN (DCG)    0.0415 ± 0.0211  0.1938 ± 0.0131  0.2735 ± 0.0102  0.2797 ± 0.0107
NN (CEL)    0.0282 ± 0.0301  0.1616 ± 0.0279  0.2473 ± 0.0247  0.2556 ± 0.0238
CdMF        0.0005 ± 0.0000  0.0040 ± 0.0000  0.0135 ± 0.0000  0.0644 ± 0.0004
POP         0.0398 ± 0.0007  0.2099 ± 0.0012  0.2790 ± 0.0016  0.2871 ± 0.0010

Baseline Models: The following baseline methods are compared.

– Most Popular Items (POP). This suggests the most popular items in the training data. The comparison with this shows how well the proposed approach achieves personalization.
– Cross-domain Matrix Factorization (CdMF) [17]. This is the state-of-the-art collaborative filtering method for CDRSs requiring no user- or item-overlap. To alleviate sparsity, we eliminate users with history logs fewer than 5 for the video data and 20 for the news data from the training data. We construct a user-item matrix for each domain, with the value of observed entries 1 and of unobserved entries 0. The unobserved entries are randomly subsampled, as the size of the whole set of unseen entries is too large to be processed. As with our method, we use 80% of the instances for training and the remaining 20% for validation. The hyper-parameters α, β_0, and ν_0 are set at the same values as in [17]. The method requires inference of latent variables by Gibbs sampling; each sampling step involves computing matrix inverses. Due to the computational complexity, we were unable to optimize these hyper-parameters, and therefore they are fixed.
– Neural Network (NN). To investigate the effect of DA, the same network without DA is evaluated. This is obtained by minimizing the same loss as DSN, except that β and γ in L_DSN are set to 0. It can be thought of as a strong content-based single-domain method, as more robust features are learned with the reconstruction loss L_recon than only with the task loss L_task (as in [39], with the roles of item and user swapped).

Experimental Results: We report the results in Table 1. CdMF underperformed the other methods. As mentioned in [16], this is likely because CdMF cannot process the implicit feedback or accurately capture the popularity structure in the dataset. The DSN model chosen by the cross-entropy loss (DSN (CEL)) also had a worse performance than POP. We also chose the model based on the values of DCG@100 on the validation set (denoted by DSN (DCG)), which showed the best performance in all cut-off number settings. This result can be interpreted as follows. As predicting the most probable item is hard, the cross-entropy loss does not give a useful signal for model selection. On the other hand,


DCG measures the quality of the ranking, and therefore a classifier that captures the joint distribution of items and users well is more likely to be chosen. The improved performance of DSN (DCG) over NN (DCG) supports this claim, since DA improves the learning of the joint distribution. This implies that replacing the loss with a ranking loss such as [37] could improve ranking quality.

Conclusion and Future Work: Our evaluation demonstrated the potential of the proposed DA-based approach. However, extreme classification is still challenging and should be addressed in future work. One possible direction would be the incorporation of item information as in [39], as this would make items more distinguishable and enable predictions of unobserved items.

References 1. Abel, F., Ara´ ujo, S., Gao, Q., Houben, G.-J.: Analyzing cross-system user modeling on the social web. In: Auer, S., D´ıaz, O., Papadopoulos, G.A. (eds.) ICWE 2011. LNCS, vol. 6757, pp. 28–43. Springer, Heidelberg (2011). https://doi.org/10.1007/ 978-3-642-22233-7 3 2. Abel, F., Herder, E., Houben, G.J., Henze, N., Krause, D.: Cross-system user modeling and personalization on the social web. User Model. User-Adap. Inter. 23(2), 169–209 (2013). https://doi.org/10.1007/s11257-012-9131-2 3. Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 343–351. Curran Associates, Inc. (2016). http://papers.nips.cc/paper/6254-domain-separationnetworks.pdf 4. Cantador, I., Fern´ andez-Tob´ıas, I., Berkovsky, S., Cremonesi, P.: Cross-domain recommender systems. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 919–959. Springer, Boston (2015). https://doi.org/10.1007/ 978-1-4899-7637-6 27 5. Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). CoRR abs/1511.07289 (2015). http://arxiv. org/abs/1511.07289 6. Covington, P., Adams, J., Sargin, E.: Deep neural networks for YouTube recommendations. In: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys 2016, pp. 191–198. ACM, New York (2016). https://doi.org/10.1145/ 2959100.2959190 7. Cremonesi, P., Quadrana, M.: Cross-domain recommendations without overlapping data: myth or reality? In: Proceedings of the 8th ACM Conference on Recommender Systems, RecSys 2014, pp. 297–300. ACM, New York (2014). https:// doi.org/10.1145/2645710.2645769 8. Elkahky, A.M., Song, Y., He, X.: A multi-view deep learning approach for cross domain user modeling in recommendation systems. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, pp. 278–288. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2015). https://doi.org/10.1145/2736277.2741667


9. Fern´ andez-Tob´ıas, I., Cantador, I., Kaminskas, M., Ricci, F.: A generic semanticbased framework for cross-domain recommendation. In: Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec 2011, pp. 25–32. ACM, New York (2011). https://doi.org/10. 1145/2039320.2039324 10. Ganin, Y., Lempitsky, V.S.: Unsupervised domain adaptation by backpropagation. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pp. 1180–1189 (2015). http://jmlr.org/ proceedings/papers/v37/ganin15.html 11. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(59), 1–35 (2016). http://jmlr.org/papers/v17/15-239.html 12. Gao, S., Luo, H., Chen, D., Li, S., Gallinari, P., Guo, J.: Cross-domain recommendation via cluster-level latent factor model. In: Blockeel, H., Kersting, K., Nijssen, ˇ S., Zelezn´ y, F. (eds.) ECML PKDD 2013. LNCS (LNAI), vol. 8189, pp. 161–176. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40991-2 11 13. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. CoRR abs/1502.01852 (2015). http://arxiv.org/abs/1502.01852 14. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, pp. 173–182. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2017). https://doi.org/ 10.1145/3038912.3052569 15. Hu, G., Zhang, Y., Yang, Q.: MTNet: a neural approach for cross-domain recommendation with unstructured text. In: KDD 2018 Deep Learning Day (2018). https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18 paper 5.pdf 16. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM 2008, pp. 263–272. IEEE Computer Society, Washington (2008). https://doi.org/10.1109/ICDM.2008.22 17. Iwata, T., Takeuchi, K.: Cross-domain recommendation without shared users or items by sharing latent vector distributions. In: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), AISTATS 2015, USA, pp. 379–387 (2015) 18. J¨ arvelin, K., Kek¨ al¨ ainen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002). https://doi.org/10.1145/582415. 582418 19. Kaminskas, M., Fern´ andez-Tob´ıas, I., Cantador, I., Ricci, F.: Ontology-based identification of music for places. In: Cantoni, L., Xiang, Z.P. (eds.) Information and Communication Technologies in Tourism 2013, pp. 436–447. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36309-2 37 20. Kaminskas, M., Fern´ andez-Tob´ıas, I., Ricci, F., Cantador, I.: Knowledge-based identification of music suited for places of interest. Inf. Technol. Tourism 14(1), 73–95 (2014). https://doi.org/10.1007/s40558-014-0004-x 21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980 22. Li, B., Yang, Q., Xue, X.: Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI 2009, pp. 2052–2057. Morgan Kaufmann Publishers Inc., San Francisco (2009)

28

H. Kanagawa et al.

23. Loizou, A.: How to recommend music to film buffs: enabling the provision of recommendations from multiple domains, May 2009. https://eprints.soton.ac.uk/66281/ 24. Loni, B., Shi, Y., Larson, M., Hanjalic, A.: Cross-domain collaborative filtering with factorization machines. In: de Rijke, M., et al. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 656–661. Springer, Cham (2014). https://doi.org/10.1007/978-3-31906028-6 72 25. Low, Y., Agarwal, D., Smola, A.J.: Multiple domain user personalization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011, pp. 123–131. ACM, New York (2011). https:// doi.org/10.1145/2020408.2020434 26. Man, T., Shen, H., Jin, X., Cheng, X.: Cross-domain recommendation: an embedding and mapping approach. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, pp. 2464–2470 (2017). https://doi.org/10.24963/ijcai.2017/343 27. Nakatsuji, M., Fujiwara, Y., Tanaka, A., Uchiyama, T., Ishida, T.: Recommendations over domain specific user graphs. In: Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence, pp. 607–612. IOS Press, Amsterdam (2010) 28. Oord, A.V.D., Dieleman, S., Schrauwen, B.: Deep content-based music recommendation. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS 2013, pp. 2643–2651. Curran Associates Inc., USA (2013) 29. Pan, W., Xiang, E.W., Liu, N.N., Yang, Q.: Transfer learning in collaborative filtering for sparsity reduction. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, pp. 230–235. AAAI Press (2010) 30. Partalas, I., et al.: LSHTC: a benchmark for large-scale text classification. CoRR abs/1503.08581 (2015). http://arxiv.org/abs/1503.08581 31. Rubens, N., Elahi, M., Sugiyama, M., Kaplan, D.: Active learning in recommender systems. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 809–846. Springer, Boston (2015). https://doi.org/10.1007/978-1-48997637-6 24 32. Sahebi, S., Brusilovsky, P.: Cross-domain collaborative recommendation in a coldstart context: the impact of user profile size on the quality of recommendation. In: Carberry, S., Weibelzahl, S., Micarelli, A., Semeraro, G. (eds.) UMAP 2013. LNCS, vol. 7899, pp. 289–295. Springer, Heidelberg (2013). https://doi.org/10.1007/9783-642-38844-6 25 33. Sener, O., Song, H.O., Saxena, A., Savarese, S.: Learning transferrable representations for unsupervised domain adaptation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 2110–2118. Curran Associates, Inc. (2016). http://papers.nips.cc/paper/6360-learning-transferrable-representationsfor-unsupervised-domain-adaptation.pdf 34. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014). http://jmlr.org/papers/v15/srivastava14a.html 35. Sun, M., Li, F., Zhang, J.: A multi-modality deep network for cold-start recommendation. Big Data Cogn. Comput. 2(1), 7 (2018). https://doi.org/10.3390/ bdcc2010007. http://www.mdpi.com/2504-2289/2/1/7

Cross-Domain Recommendation via Deep Domain Adaptation

29

36. Tang, J., Wu, S., Sun, J., Su, H.: Cross-domain collaboration recommendation. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, pp. 1285–1293. ACM, New York (2012). https://doi.org/10.1145/2339530.2339730 37. Valizadegan, H., Jin, R., Zhang, R., Mao, J.: Learning to rank by optimizing NDCG measure. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems 22, pp. 1883–1891. Curran Associates, Inc. (2009). http://papers.nips.cc/paper/3758learning-to-rank-by-optimizing-ndcg-measure.pdf 38. Wang, H., SHI, X., Yeung, D.Y.: Collaborative recurrent autoencoder: recommend while learning to fill in the blanks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 415–423. Curran Associates, Inc. (2016) 39. Wang, H., Wang, N., Yeung, D.Y.: Collaborative deep learning for recommender systems. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2015, pp. 1235–1244. ACM, New York (2015). https://doi.org/10.1145/2783258.2783273 40. Wu, C.Y., Ahmed, A., Beutel, A., Smola, A.J., Jing, H.: Recurrent recommender networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, pp. 495–503. ACM, New York (2017). https://doi.org/10.1145/3018661.3018689 41. Wu, Y., DuBois, C., Zheng, A.X., Ester, M.: Collaborative denoising auto-encoders for top-n recommender systems. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM 2016, pp. 153–162. ACM, New York (2016). https://doi.org/10.1145/2835776.2835837 42. Zhang, Q., Wu, D., Lu, J., Liu, F., Zhang, G.: A cross-domain recommender system with consistent information transfer. Decis. Support Syst. 104, 49–63 (2017). https://doi.org/10.1016/j.dss.2017.10.002. http://www.sciencedirect.com/ science/article/pii/S016792361730180X 43. Zhang, Y., Cao, B., Yeung, D.Y.: Multi-domain collaborative filtering. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2010, pp. 725–732. AUAI Press, Arlington (2010) 44. Zheng, Y., Tang, B., Ding, W., Zhou, H.: A neural autoregressive approach to collaborative filtering. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning, ICML 2016, vol. 48, pp. 764–773. JMLR.org (2016)

It’s only Words and Words Are All I Have

Manash Pratim Barman1, Kavish Dahekar2, Abhinav Anshuman3, and Amit Awekar4(B)

1 Indian Institute of Information Technology Guwahati, Guwahati, India
2 SAP Labs, Bengaluru, India
3 Dell India R&D Center, Bengaluru, India
4 Indian Institute of Technology Guwahati, Guwahati, India
[email protected]

Abstract. The central idea of this paper is to demonstrate the strength of lyrics for music mining and natural language processing (NLP) tasks using the distributed representation paradigm. For music mining, we address two prediction tasks for songs: genre and popularity. Existing works for both these problems have two major bottlenecks. First, they represent lyrics using handcrafted features that require intricate knowledge of language and music. Second, they consider lyrics a weak indicator of genre and popularity. We overcome both bottlenecks by representing lyrics using distributed representation. In our work, genre identification is a multi-class classification task, whereas popularity prediction is a binary classification task. We achieve an F1 score of around 0.6 for both tasks using only lyrics. Distributed representation of words is now heavily used in various NLP algorithms. We show that lyrics can be used to improve the quality of this representation.

Keywords: Distributed representation · Music mining

1 Introduction

The dramatic growth in streaming music consumption in the past few years has fueled research in music mining [2]. More than 85% of online music subscribers search for lyrics [1], which indicates that lyrics are an important part of the musical experience. This work is motivated by the observation that lyrics are not yet used to their true potential for understanding music and language computationally. There are three main components to experiencing a song: visual through video, auditory through music, and linguistic through lyrics. Compared to the video and audio components, lyrics have two main advantages when it comes to analyzing songs. First, the purpose of the song is mainly conveyed through the lyrics. Second, lyrics, as text data, require far fewer resources to analyze computationally. In this paper, we focus on lyrics to demonstrate their value for two broad domains: music mining and NLP. (The title is a line from the song Words by Bee Gees.)

Fig. 1. Genre & popularity prediction

Fig. 2. Improving word vectors

A recent trend in NLP is to move away from handcrafted features in favor of distributed representation. Methods such as word2vec and doc2vec have achieved tremendous success on various NLP tasks in conjunction with Deep Learning [14]. Given a song, we focus on two prediction tasks: genre and popularity. We apply distributed representation learning methods to jointly learn the representation of lyrics as well as genre and popularity labels. Using these learned vectors, we experiment with various traditional supervised machine learning and Deep Learning models, and we apply the same methodology for popularity prediction. Please refer to Figs. 1 and 2 for an overview of our approach. Our work has three research contributions. First, this is the first work that demonstrates the strength of distributed representation of lyrics for music mining and NLP tasks. Second, contrary to existing work, we show that lyrics alone can be a good indicator of genre and popularity. Third, the quality of word vectors can be improved by capitalizing on the knowledge encoded in lyrics.
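A minimal sketch of this joint-learning setup is given below, using the gensim library; the toy corpus, genre tags, and training hyperparameters are illustrative assumptions, not the exact configuration used in the experiments. The same construction, with low/high popularity tags in place of genre tags, yields the popularity vectors used later.

```python
# Sketch: jointly embed lyrics and genre labels with doc2vec, then classify a
# new song by cosine similarity to the learned genre vectors.
# Assumes gensim >= 4.0; the corpus below is a toy placeholder.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_songs = [
    (["gold", "chains", "street", "hustle"], "Rap"),
    (["lonesome", "highway", "whiskey", "train"], "Country"),
]

# Each song is tagged with its own id and its genre label, so the model
# learns one vector per song and one vector per genre.
docs = [TaggedDocument(words=w, tags=[f"song_{i}", g])
        for i, (w, g) in enumerate(train_songs)]

model = Doc2Vec(docs, vector_size=300, window=5, min_count=1, epochs=40)

def predict_genre(lyric_tokens, genres=("Rap", "Country")):
    v = model.infer_vector(lyric_tokens)
    sims = {g: np.dot(v, model.dv[g]) /
               (np.linalg.norm(v) * np.linalg.norm(model.dv[g]))
            for g in genres}
    return max(sims, key=sims.get)      # the "Genre Vector" prediction rule

print(predict_genre(["whiskey", "train", "lonesome"]))
```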

2 Dataset

Lyrics are protected by copyright and cannot be shared directly. Most researchers in the past have used either small datasets that are manually curated or large datasets that represent lyrics as a bag of words [3–5,7,11–13]. Small datasets are not enough for training distributed representations, and a bag-of-words representation lacks information about the order of words in lyrics, so such datasets cannot be used for training distributed representations either. To get around this problem, we harvested lyrics from user-generated content on the Web. Our dataset contains around 400,000 songs in English. We had to do extensive preprocessing to remove text that is not part of the lyrics, and we also had to detect and remove duplicate lyrics. Metadata about the lyrics, namely genre and popularity, was obtained from Fell and Sporleder [4]. However, for genre and popularity prediction we were constrained to use only a subset of the dataset due to the class imbalance problem.
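As an illustration of the duplicate-removal step, the sketch below normalizes each lyric and drops exact copies via hashing; the normalization rules are assumptions and not the actual preprocessing pipeline.

```python
# Sketch: remove duplicate lyrics by hashing a normalized form of the text.
# The normalization (lowercase, strip punctuation, collapse whitespace) is an
# illustrative assumption, not the paper's exact preprocessing.
import hashlib
import re

def normalize(lyrics: str) -> str:
    text = lyrics.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def deduplicate(songs):
    seen, unique = set(), []
    for song in songs:
        key = hashlib.md5(normalize(song["lyrics"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(song)
    return unique

corpus = [{"title": "a", "lyrics": "Hello, world!"},
          {"title": "b", "lyrics": "hello world"}]
print(len(deduplicate(corpus)))   # 1
```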

3 Genre Prediction

Our dataset contains songs from eight genres: Metal, Country, Religious, Rap, R&B, Reggae, Folk, and Blues. It suffers from severe class imbalance, with genres such as Rap dominating. Using the complete dataset resulted in prediction models that were highly biased towards the dominant classes.

Table 1. F1-scores for genre prediction. Highest value for each genre is in bold.

Model           Metal  Country  Religious  Rap    R&B    Reggae  Folk   Blues  Average
SVM             0.575  0.493    0.634      0.815  0.534  0.608   0.437  0.532  0.579
KNN             0.463  0.457    0.557      0.729  0.457  0.547   0.428  0.515  0.519
Random Forest   0.552  0.536    0.644      0.791  0.525  0.599   0.474  0.559  0.585
Genre Vector    0.605  0.551    0.641      0.738  0.541  0.716   0.475  0.590  0.607
CNN             0.543  0.466    0.668      0.801  0.504  0.628   0.471  0.563  0.580
GRU             0.479  0.467    0.558      0.745  0.462  0.601   0.355  0.531  0.525
Bi-GRU          0.494  0.471    0.567      0.752  0.492  0.609   0.372  0.488  0.531

Hence, we use an undersampling technique to generate balanced training and test datasets, repeated to produce ten different versions. Each version of the dataset had about 8,000 songs, with about 1,000 songs per genre. Lyrics of each genre were randomly split into two partitions: 80% for training and 20% for testing. Experimental results reported here are averages across these ten datasets. We did not observe any significant variance in results across the different instances of training and test datasets, indicating the robustness of the results.
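A sketch of this balancing procedure follows, assuming a pandas DataFrame with "lyrics" and "genre" columns (assumed column names); the per-genre sample size and seeds are illustrative.

```python
# Sketch: build one balanced version of the dataset by undersampling each genre
# to ~1000 songs, then split 80/20 with stratification; repeating with ten
# different seeds gives the ten dataset versions described above.
import pandas as pd
from sklearn.model_selection import train_test_split

def balanced_split(df, per_genre=1000, seed=0):
    sampled = (df.groupby("genre", group_keys=False)
                 .apply(lambda g: g.sample(min(per_genre, len(g)),
                                           random_state=seed)))
    return train_test_split(sampled, test_size=0.2, random_state=seed,
                            stratify=sampled["genre"])

# versions = [balanced_split(songs_df, seed=s) for s in range(10)]
```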

Fig. 3. Confusion matrix for genre. Rows: true label, Columns: predicted label.

Distributed representations of lyrics and genres were jointly learned using the doc2vec model [6]. This model gave eight genre vectors (one vector representation per genre) and a vector representation for each song in the training and test datasets. We experimented with the vector dimensionality and found 300 to be optimal for our task. Using this vector representation, we experimented with both traditional machine learning models (SVM, KNN, Random Forest, and Genre Vectors) and deep learning models (CNN, GRU, and Bidirectional GRU) for the genre prediction task; please refer to Table 1. For the KNN model, the genre of a test instance was determined based on the genres of its K nearest neighbors in the training dataset, with nearest neighbors determined using cosine similarity. We tried three values for K (10, 25, and 50) but found no significant difference in results. For the Genre Vector model, the genre
of a test instance was determined by the cosine similarity of the test instance with the vector obtained for each genre. We can observe that Rap is the easiest genre to predict, as rap songs have a distinctive vocabulary, while the Folk genre is the most difficult to identify. For each genre, the worst performing model is KNN, indicating that the local neighborhood of a test instance is not the best indicator of its genre. On average, the Genre Vector model performs best. Please refer to Fig. 3, which shows the confusion matrix for one version of the training dataset using the Genre Vector model. Each row of the matrix sums to around 200, as the number of test instances per genre was around 200. We can notice that the confusion relationships are asymmetric. We say that a genre X is confused with genre Y if the genre prediction model identifies many songs of genre X as having genre Y. For example, observe the row corresponding to the Folk genre: it is mainly confused with the Religious genre, as about 16% of Folk songs are identified as Religious. However, for the Religious genre, Folk does not appear as one of the top confused genres. Similarly, Reggae is most confused with R&B, whereas R&B is least confused with Reggae.
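The asymmetry described here can be checked directly from a confusion matrix; the sketch below uses scikit-learn with toy labels standing in for the real predictions.

```python
# Sketch: quantify the asymmetric confusions discussed above. C[i, j] counts
# songs of true genre i predicted as genre j; comparing C[i, j] with C[j, i]
# exposes one-directional confusion (e.g. Folk -> Religious). Labels here are
# toy placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

genres = ["Folk", "Religious", "Reggae", "R&B"]
y_true = ["Folk", "Folk", "Religious", "Reggae", "R&B", "Folk"]
y_pred = ["Religious", "Folk", "Religious", "R&B", "R&B", "Religious"]

C = confusion_matrix(y_true, y_pred, labels=genres)
for i in range(len(genres)):
    for j in range(len(genres)):
        if i != j and C[i, j] > C[j, i]:
            print(f"{genres[i]} confused with {genres[j]}: "
                  f"{C[i, j]} vs {C[j, i]} in the opposite direction")
```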

4 Popularity Prediction

Only a subset of songs had user rating data available, with ratings ranging from 1 to 5 [4]. For two genres, Folk and Blues, we did not obtain popularity data for a sufficient number of songs, so the number of genres for the popularity prediction task was reduced to six. The number of songs per genre is: Metal (15,254), Country (2,640), Religious (3,296), Rap (19,774), R&B (6,144), and Reggae (294). Songs of each genre were randomly partitioned into two disjoint sets: 80% for training and 20% for testing. To ensure robustness of the results, we performed experiments on ten such versions of the dataset; experimental results reported here are averages across the ten runs. We model popularity prediction as a binary classification problem. For each genre, we divided songs into two categories: low popularity (rated 1, 2, or 3) and high popularity (rated 4 or 5). The number of songs in each class was balanced to avoid overfitting. Considering the distinctive nature of each genre, we built a separate model per genre for popularity prediction. For each genre, using the doc2vec model we generated two popularity vectors (one each for low and high popularity) and a vector representation for each song in the training and test datasets. As in the genre prediction task, we experimented with seven prediction models; please refer to Table 2. We can observe that Deep Learning based models perform better than the other models. However, for every genre, the gap between the best and worst model has narrowed compared to the genre prediction task.

5 Improving Word Vectors with Lyrics

A large text corpus such as Wikipedia is necessary to train distributed representations of words. Lyrics are poetic creations that require significant creativity.

Table 2. F1-scores for popularity prediction. Highest value for each genre is in bold.

Model              Country  Metal   Rap     Reggae  Religious  R&B
SVM                0.6238   0.6756  0.7301  0.7539  0.5681     0.6342
KNN                0.5871   0.6351  0.7071  0.7713  0.5387     0.5814
Random Forest      0.6201   0.6683  0.7176  0.7647  0.5635     0.6460
Popularity Vector  0.6180   0.6776  0.7663  0.7820  0.5886     0.6401
CNN                0.6320   0.6717  0.7652  0.8011  0.5933     0.6661
GRU                0.5801   0.6479  0.7505  0.5187  0.5661     0.6434
Bi-GRU             0.6037   0.6581  0.7684  0.6613  0.5517     0.5886

Table 3. Results of word analogy tasks. Highest value for each task is in bold.

Tasks                              Lyrics  Wikipedia  Sampled Wiki  Lyrics+Wiki
(1) capital-common-countries       10.95   87.75      89.43         87.94
(2) capital-world                  07.26   90.35      79.25         90.00
(3) currency                       02.94   05.56      01.85         05.56
(4) city-in-state                  07.87   66.55      61.71         66.73
(5) family                         81.05   94.74      82.82         94.15
(6) gram1-adjective-to-adverb      08.86   35.71      25.53         36.51
(7) gram2-opposite                 19.88   51.47      33.46         51.10
(8) gram3-comparative              83.56   91.18      80.66         90.00
(9) gram4-superlative              53.33   75.72      59.49         77.83
(10) gram5-present-participle      74.32   73.33      60.82         75.81
(11) gram6-nationality-adjective   06.29   97.01      92.60         97.08
(12) gram7-past-tense              54.44   68.07      65.47         69.00
(13) gram8-plural                  73.19   87.60      76.81         89.52
(14) gram9-plural-verbs            60.00   71.85      63.32         72.62
Overall across all tasks           50.33   75.71      66.60         78.11

The knowledge encoded in them can be utilized for training distributed representations of words. For this task, we used our entire dataset of 400K songs. Using the word2vec model, we generated four sets of word vectors. The four training datasets were: Lyrics only (D1, 470 MB), Complete Wikipedia (D2, 13 GB), Sampled Wikipedia (D3, 470 MB), and Lyrics combined with Wikipedia (D4, 13.47 GB). For dataset D3, we randomly sampled pages from Wikipedia until we had collected a dataset of a size comparable to our lyrics dataset; we created ten such sampled versions of Wikipedia, and the results given here for D3 are averages across these ten datasets.
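A sketch of the embedding training with gensim's word2vec is shown below; the corpus file paths and hyperparameters are assumptions, and the text files are presumed to be pre-tokenized, one sentence per line.

```python
# Sketch: train word2vec embeddings on the lyrics corpus (D1) and on lyrics
# combined with Wikipedia (D4). File paths and hyperparameters are illustrative
# assumptions; LineSentence expects one whitespace-tokenized sentence per line.
from itertools import chain
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

lyrics = LineSentence("lyrics_corpus.txt")       # D1 (assumed path)
wiki = LineSentence("wikipedia_corpus.txt")      # D2 (assumed path)

d1_model = Word2Vec(sentences=lyrics, vector_size=300, window=5,
                    min_count=5, workers=4)

# D4: both corpora together. chain() yields a one-shot iterator, so it is
# materialized here; for very large corpora a restartable iterable is needed.
combined = list(chain(lyrics, wiki))
d4_model = Word2Vec(sentences=combined, vector_size=300, window=5,
                    min_count=5, workers=4)
d4_model.save("lyrics_plus_wiki.model")
```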

To compare these four sets of word embeddings, we used the 14 word analogy tasks proposed by Mikolov [8]; please refer to Table 3. Each cell in the table reports the accuracy (in percent) of a particular word vector set on a particular word analogy task. The first five tasks consist of finding a related pair of words and can be grouped as semantic tests. The next nine tasks (6 to 14) check syntactic properties of word vectors using various grammar-related tests and can be grouped as syntactic tests. By sheer size, we expect D2 to beat our dataset D1. However, we can observe that for tasks 5, 8, 12, 13, and 14, D1 gives results comparable to D2, and for task 10, D1 beats D2 despite the significant size difference. Datasets D3 and D1 are comparable in size. For task 10, D1 significantly outperforms D3, and for all other tasks the performance gap between D3 and D1 narrows noticeably. We can observe that D1 performs better on syntactic tests than on semantic tests. However, the main takeaway from this experiment is that dataset D4 performs best for a majority of the tasks and is the best performing dataset overall. These results indicate that lyrics can be used in conjunction with a large text corpus to further improve distributed representations of words.
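The analogy evaluation itself can be reproduced with gensim's built-in test harness, which ships Mikolov's questions-words.txt; the model path below is an assumption (for example, a model saved from the previous sketch).

```python
# Sketch: run Mikolov's word-analogy test sets against a trained embedding.
# Each section of the result corresponds to one row of Table 3.
from gensim.models import Word2Vec
from gensim.test.utils import datapath

wv = Word2Vec.load("lyrics_plus_wiki.model").wv          # assumed saved model
accuracy, sections = wv.evaluate_word_analogies(
    datapath("questions-words.txt"))

print(f"overall accuracy: {accuracy:.2%}")
for section in sections:
    total = len(section["correct"]) + len(section["incorrect"])
    if total:
        print(section["section"], round(len(section["correct"]) / total, 3))
```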

6 Related Work

Existing works that use lyrics for genre and popularity prediction can be partitioned into two categories: those that use lyrics to augment the acoustic features of the song [5,7], and those that do not use acoustic features [3,4,10,13]. However, all of them represent lyrics using either handcrafted features or bag-of-words models. Identifying features manually requires intricate knowledge of music, and such features vary with the underlying dataset. Mikolov and Le have shown that distributed representations of words and documents are superior to bag-of-words models [6,9]. To the best of our knowledge, this is the first work that capitalizes on such a representation of lyrics for genre and popularity prediction. However, our results cannot be directly compared with existing works, as the datasets, sets of genres, definitions of popularity, and distributions of target classes are not identical. Still, our results stand in contrast with existing works that have concluded that lyrics alone are a weak indicator of genre and popularity. These works report significantly lower performance of lyrics for the genre prediction task. For example, Rauber et al. report an accuracy of 34% [10], Doraisamy et al. report an accuracy of 40% [13], McKay et al. report an accuracy of 43% [7], and Hu et al. report an abysmal accuracy of 19% [5]. The accuracy of our method is around 63%.

7 Conclusion and Future Work

This work has demonstrated that, using distributed representation, lyrics can serve as a good indicator of genre and popularity. Lyrics can also be used to improve distributed representations of words. Deep Learning based models could deliver better results if larger training datasets were available. Our method can
be easily integrated with recent music mining algorithms that use an ensemble of lyrical, audio, and social features.

References

1. Lyrics take centre stage in streaming music, a MIDiA Research white paper (2017). https://www.midiaresearch.com/app/uploads/2018/01/Lyrics-Take-Centre-Stage-In-Streaming-%E2%80%93LyricFind-Report.pdf
2. Nielsen 2017 U.S. music year-end report. https://www.nielsen.com/us/en/insights/reports/2018/2017-musicus-year-end-report.html
3. Logan, B., Kositsky, A., Moreno, P.: Semantic analysis of song lyrics. In: IEEE International Conference on Multimedia and Expo (ICME), pp. 159–168, June 2004
4. Fell, M., Sporleder, C.: Lyrics-based analysis and classification of music. In: COLING (2014)
5. Hu, Y., Ogihara, M.: Genre classification for million song dataset using confidence-based classifiers combination. In: ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 1083–1084 (2012)
6. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
7. McKay, C., Burgoyne, J.A., Hockman, J., Smith, J.B., Vigliensoni, G., Fujinaga, I.: Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features. In: ISMIR (2010)
8. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
10. Rauber, A., Mayer, R., Neumayer, R.: Rhyme and style features for musical genre classification by song lyrics. In: International Conference on Music Information Retrieval (ISMIR), pp. 337–342, June 2008
11. Rauber, A., Mayer, R., Neumayer, R.: Combination of audio and lyrics features for genre classification in digital audio collections. In: Proceedings of the 16th ACM International Conference on Multimedia, pp. 159–168, October 2008
12. Doraisamy, S., Ying, T.C., Abdullah, L.N.: Genre and mood classification using lyric features. In: 2012 International Conference on Information Retrieval and Knowledge Management (CAMP). IEEE, March 2012
13. Ying, T.C., Doraisamy, S., Abdullah, L.N.: Genre and mood classification using lyric features. In: International Conference on Information Retrieval and Knowledge Management, pp. 260–263 (2012)
14. Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning based natural language processing [review article]. IEEE Comput. Intell. Mag. 13(3), 55–75 (2018)

Modeling User Return Time Using Inhomogeneous Poisson Process

Mohammad Akbari1,2(B), Alberto Cetoli2, Stefano Bragaglia2, Andrew D. O’Harney2, Marc Sloan2, and Jun Wang1

1 University College London, London, UK
{m.akbari,jun.wang}@ucl.ac.uk
2 Context Scout, London, UK
{alberto,stefano,andy,marc}@contextscout.com

Abstract. For Intelligent Assistants (IAs), user activity is often used as a lag metric for user satisfaction or engagement. Conversely, predictive leading metrics for engagement can be helpful with decision making and with evaluating changes in satisfaction caused by new features. In this paper, we propose User Return Time (URT), a fine-grained metric for gauging user engagement. To compute URT, we model the continuous inter-arrival times between a user's uses of the service via a log-Gaussian Cox process (LGCP), a form of inhomogeneous Poisson process which captures the irregular variations in usage rate and the personal preferences typical of an IA. We show the effectiveness of the proposed approach in predicting user return times on real-world data collected from an IA. Experimental results demonstrate that our model is able to predict user return times reasonably well, and considerably better than strong baselines that make predictions based on past utterance frequency.

Keywords: User Return Time Prediction · Intelligent Assistant

1 Introduction

Intelligent Assistants (IAs) are software agents that interact with users to complete a specific task. The success of an IA is directly linked to long-term user engagement, which can be measured by observing when users return to reuse the IA. Indeed, repeated usage of the service is a typical characteristic of IAs; thus, monitoring a user's usage pattern is a necessary measure of engagement. Predicting when usage will next occur is therefore a useful indicator of whether a user will continue to be engaged, and can serve as a reference to compare against when introducing new features, as well as a method for managing churn for business purposes. For example, the predicted return times of customers can be utilized for clustering customers according to their activity and for narrowing attention to a specific group of users with short or long inter-arrival times for target marketing [3,13]. Furthermore, it helps the service to
prepare content for customers in advance so as to better serve them at their next engagement; for example, a targeted marketing and advertising program can be planned for the user's next engagement. Modeling the inter-arrival time of user engagement events is a challenging task due to the complex temporal patterns exhibited. Users typically engage with IA systems in an ad hoc fashion, starting tasks at different time points with irregular frequency. For example, Fig. 1 shows the inter-arrival times (denoted by black crosses) of two users in one week. Notice the existence of regions of both high and low density of inter-arrival times over the one-week interval. Users do not arrive at evenly spaced intervals; instead, their arrival times are usually clustered, as they complete several tasks in bursts. This may be attributed to a user's personal preferences and their tasks' priorities. Retrospective studies on modeling inter-arrival times between events treat events as independent and exponentially distributed over time with a constant rate [2]. They hence fail to make accurate predictions when there exist time-varying patterns between events [15]. Further, they mostly focus on web search queries [1].

Fig. 1. Intensity functions (dotted lines) and corresponding predicted user inter-arrival times for different users (black crosses). Light regions depict the uncertainty of the estimations.

To this end, we define User Return Time (URT) as the predicted inter-arrival time until the next user activity, allowing us to predict user engagement to aid in creating an optimal IA system. To do so, we propose to model the inter-arrival times between a user's usage sessions with a doubly stochastic process. More specifically, we leverage the log-Gaussian Cox process (LGCP), an inhomogeneous Poisson process (IPP), to model the inter-arrival times between events. The LGCP models return times as being generated by an intensity function which varies across time. It assumes a non-parametric form for the intensity function, allowing the model complexity to depend on the dataset. We evaluate the proposed model using a real-world dataset from an IA and demonstrate that it provides good predictions for inter-arrival return times, improving upon the baselines. Even though the main application is return time estimation for an IA, one could apply the proposed approach to other events in e-commerce systems, such as the arrival times of requests, the return times of customers, etc. The contributions of this paper can be summarized as follows:

– We propose User Return Time as a measure to predict user engagement with an IA.
– We leverage a doubly-stochastic process, i.e. the log-Gaussian Cox process, to predict URT for an IA, and show that it effectively captures the time-varying patterns of usage between events.
– We verified the effectiveness of the proposed method on a real-world dataset and found that Gaussian and Periodic kernels can make accurate estimations.

2 Problem Statement

Let U = {u1, u2, . . . , un} denote a set of n different users who used the IA to assist with different tasks. Let Hi = {ti,j : j = 1, . . . , ni} denote the task history of the i-th user, where ti,j represents the start time of the j-th task performed by the i-th user and ni is the number of tasks performed by user i. Our aim is to estimate a user's next start time from their past interactions with the IA, i.e., to predict ti,k+1 based on the start times of the previous sessions {ti,j : j = 1, . . . , k}. Based on the above discussion, we formally define the problem of Predicting User Return Time as: given a user history of session start times Hi, predict the next times that the user will use the IA.

3 Model

Poisson processes have been widely adopted for estimating the inter-arrival times between different events, such as device failures, social media events, and purchases on e-commerce sites [12]. The Homogeneous Poisson process (HPP) is a class of point processes that assumes events are generated with a constant intensity rate λ (with respect to time and product features). However, in an IA scenario, user engagement often occurs at a varying rate over time, with several spikes of usage in a short period followed by a long period of absence (when the user performs other tasks). Thus, we exploit the Inhomogeneous Poisson process (IPP) [7], which can model events happening at a variable rate by considering the intensity to be a function of time, i.e. λ(t). For example, Fig. 1 shows intensity functions learned from two different IPP models. Notice how the generated inter-event times vary according to the intensity function values. To model inter-arrival times, we employed a log-Gaussian Cox process, which models the intensity function of a point process as a stochastic function [8]. The LGCP learns the intensity function λ(t) non-parametrically via a latent function sampled from a Gaussian process [10]. Here, to impose non-negativity on the intensity function (as an intensity cannot be negative), we assume an exponential form for it, i.e., λ(t) = exp(f(t)). We adopted a non-parametric approach to model the intensity function which utilizes Bayesian inference to train the model, where the complexity of the model is learned from the available training data. In the next section, we explain the details of the proposed model and how to learn the model parameters from training data.
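To make the contrast with a constant-rate HPP concrete, the sketch below simulates event times from an IPP with a given λ(t) via Lewis-Shedler thinning; the example intensity (a bursty daily pattern) is an illustrative assumption.

```python
# Sketch: simulate event times from an inhomogeneous Poisson process with a
# given intensity lambda(t) using Lewis-Shedler thinning.
import numpy as np

rng = np.random.default_rng(0)

def intensity(t):                     # events per hour, peaking once per day
    return 0.2 + 2.0 * np.exp(-((t % 24) - 10) ** 2 / 4.0)

def simulate_ipp(lam, t_max, lam_max):
    """Propose HPP(lam_max) points, keep each with probability lam(t)/lam_max."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t > t_max:
            return np.array(events)
        if rng.uniform() < lam(t) / lam_max:
            events.append(t)

times = simulate_ipp(intensity, t_max=7 * 24, lam_max=2.2)   # one week
inter_arrivals = np.diff(times)
print(len(times), inter_arrivals[:5])
```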

3.1 Modeling Inter-arrival Time

An inhomogeneous Poisson process (unlike an HPP) uses a time-varying intensity function, and hence the distribution of inter-arrival times is not independent and identically distributed [11]. In an IPP, the number of tasks y (signals of a returning user to the IA) occurring in an interval [p, q] is Poisson distributed with rate ∫_p^q λ(t) dt,

p(y | λ(t), [p, q]) = Poisson(y | ∫_p^q λ(t) dt) = ( (∫_p^q λ(t) dt)^y · exp(−∫_p^q λ(t) dt) ) / y!    (1)

Assume that the k-th event occurred at time tk = p and we are interested in the inter-arrival time δk = tk+1 − tk of the next event. The arrival time of the next event can be obtained as tk+1 = tk + δk. The cumulative distribution of δk, which gives the probability that an event occurs by time p + q, can be obtained as

p(δk ≤ q) = 1 − p(δk > q | λ(t), tk = p) = 1 − exp(−∫_p^{p+q} λ(t) dt) = 1 − exp(−∫_0^q λ(p + t) dt).    (2)

The derivation is obtained by considering the Poisson probability of zero counts in [p, p + q], with rate parameter ∫_p^{p+q} λ(t) dt, and applying integration by substitution to obtain Eq. (2). The probability density function of δk is obtained by taking the derivative of Eq. (2) with respect to q,

p(δk = q) = λ(p + q) · exp(−∫_0^q λ(p + t) dt).    (3)
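Given any learned intensity, Eqs. (2) and (3) can be evaluated numerically; the sketch below uses a hand-picked λ(t) as a stand-in for exp(f(t)).

```python
# Sketch: evaluate the return-time density of Eq. (3) and the probability of a
# return within q hours from Eq. (2) by numerical integration. The intensity
# below is an assumed example, not a fitted LGCP.
import numpy as np

def intensity(t):                                  # assumed example lambda(t)
    return 0.3 + 0.5 * np.sin(2 * np.pi * t / 24) ** 2

def cumulative_intensity(p, q, n=1000):            # integral of lambda over [p, p+q]
    grid = np.linspace(p, p + q, n)
    vals = intensity(grid)
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(grid) / 2.0))

def prob_return_within(p, q):                      # Eq. (2)
    return 1.0 - np.exp(-cumulative_intensity(p, q))

def return_time_density(p, q):                     # Eq. (3)
    return intensity(p + q) * np.exp(-cumulative_intensity(p, q))

last_event = 10.0                                  # t_k = p (hours)
print(prob_return_within(last_event, 6.0))         # chance of returning within 6 h
print(return_time_density(last_event, 6.0))
```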

We associate a distinct intensity function λi(t) = exp(fi(t)) to each user ui, as they have different temporal preferences. The latent function fi(t) is modeled to come from a zero-mean Gaussian process (GP) [10] prior. The Squared Exponential (SE) kernel is a common choice for GPs and it is defined as

k(ti, tj) = σ · exp(−(ti − tj)² / l).    (4)

Equation (4) imposes smoothness over time on the intensity function. We also experiment with periodic kernels, which allow the modelling of functions that repeat themselves exactly. Periodic kernels can model complex periodic structure relating to the working week by finding a proper periodicity hyperparameter in the kernel. The periodic kernel is defined as

k_periodic(ti, tj) = σ² · exp(−2 sin²(π |ti − tj| / r) / l²),    (5)

where σ and l are the output variance and length-scale, respectively, and r is the periodicity hyperparameter.
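The two kernels and the exponential link can be put together in a few lines; the sketch below draws one latent function from a zero-mean GP with the SE kernel of Eq. (4) and exponentiates it to obtain a valid intensity. All hyperparameter values are illustrative assumptions.

```python
# Sketch: the SE kernel of Eq. (4), the periodic kernel of Eq. (5), and one
# LGCP-style intensity sample lambda(t) = exp(f(t)) with f ~ GP(0, K).
import numpy as np

def se_kernel(t1, t2, sigma=1.0, l=24.0):
    d = t1[:, None] - t2[None, :]
    return sigma * np.exp(-d ** 2 / l)

def periodic_kernel(t1, t2, sigma=1.0, l=1.0, r=24.0 * 7):
    d = np.abs(t1[:, None] - t2[None, :])
    return sigma ** 2 * np.exp(-2 * np.sin(np.pi * d / r) ** 2 / l ** 2)

rng = np.random.default_rng(1)
t = np.linspace(0, 24 * 7, 200)                      # one week
K = se_kernel(t, t) + 1e-6 * np.eye(len(t))          # jitter for stability
f = rng.multivariate_normal(np.zeros(len(t)), K)     # latent GP draw
lam = np.exp(f)                                      # intensity, always positive
print(lam.min(), lam.max())
```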

3.2 Inference

We learn the model parameters by maximizing the marginal likelihood over all users in the dataset. The likelihood of the return times over the dataset is obtained by taking the product of the per-user return-time likelihoods.
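As a heavily simplified illustration of this principle, the sketch below fits one user's event times by maximizing the inhomogeneous-Poisson log-likelihood Σ_i log λ(t_i) − ∫_0^T λ(t) dt, with a small parametric intensity standing in for the full LGCP marginal likelihood; the intensity form, toy data, and optimizer settings are all assumptions.

```python
# Sketch: maximum-likelihood fit of a simple parametric intensity
#   lambda(t) = exp(a + b*sin(2*pi*t/24) + c*cos(2*pi*t/24))
# to one user's event times. This is an illustration of the inference
# principle only, not the paper's LGCP inference.
import numpy as np
from scipy.optimize import minimize

event_times = np.array([1.0, 2.5, 9.0, 10.2, 26.0, 33.5, 34.0])   # toy data (hours)
T = 48.0                                                           # observation window

def neg_log_likelihood(params, grid=np.linspace(0.0, T, 2000)):
    a, b, c = params
    lam = lambda t: np.exp(a + b * np.sin(2 * np.pi * t / 24)
                             + c * np.cos(2 * np.pi * t / 24))
    vals = lam(grid)
    integral = np.sum((vals[1:] + vals[:-1]) * np.diff(grid) / 2.0)
    return -(np.sum(np.log(lam(event_times))) - integral)

result = minimize(neg_log_likelihood, x0=np.zeros(3), method="L-BFGS-B")
print(result.x)   # fitted intensity parameters for this user
```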

4 Experiments

4.1 Dataset

We conducted an empirical study in which we used the access log of a commercial web IA to construct our dataset. The IA has been developed as an extension for the Chrome Web browser that assists recruiters in automatically finding and segregating job candidate information, such as skills and contact details, based on the information found on popular social networking platforms such as LinkedIn. For the purpose of this experiment, the tool was altered to log the interactions of users with the tool as they performed their tasks. Various interactions were logged with a time stamp, based on which we constructed our dataset. Table 1 depicts a sample log record for a specific user from our dataset. Inspired by [4,5], we split action sequences into sessions based on a time gap of 15 min, which is commonly used in information retrieval and web search to identify sessions. We collected all logs of a random set of users within a period of 3 months, from the beginning of April 2018 until the end of June 2018. The dataset consists of 2,999,593 interaction events committed by 133 distinct users.
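A sketch of this sessionization step is shown below; the records mirror the layout of Table 1 and are placeholders.

```python
# Sketch: split a user's time-stamped interaction log into sessions using a
# 15-minute inactivity gap, as described above.
from datetime import datetime, timedelta

GAP = timedelta(minutes=15)

log = [
    ("484", "New web page",           datetime(2018, 7, 1, 17, 35, 25)),
    ("484", "Opened IA",              datetime(2018, 7, 1, 17, 35, 27)),
    ("484", "Clicked Contact Button", datetime(2018, 7, 1, 17, 35, 51)),
    ("484", "New web page",           datetime(2018, 7, 1, 18, 5, 25)),
]

def sessionize(events):
    sessions, current = [], [events[0]]
    for prev, cur in zip(events, events[1:]):
        if cur[2] - prev[2] > GAP:      # gap larger than 15 min: new session
            sessions.append(current)
            current = []
        current.append(cur)
    sessions.append(current)
    return sessions

sessions = sessionize(sorted(log, key=lambda e: e[2]))
session_starts = [s[0][2] for s in sessions]       # the t_{i,j} used for URT
print(len(sessions), session_starts)
```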

Table 1. Example log of a user’s interaction sequence. The first three interactions occurred within a single task. The final interaction indicated the start of a new task.

User  Action                  Time stamp
484   New web page            2018-07-01 17:35:25
484   Opened IA               2018-07-01 17:35:27
484   Clicked Contact Button  2018-07-01 17:35:51
484   New web page            2018-07-01 18:05:25

4.2 Baselines and Evaluation Metrics

Here the proposed model is compared against several baseline methods to evaluate their effectiveness in predicting URT. We discuss the advantages, assumptions, and limitations of each and provide empirical results on a real-world dataset. We examine the following models: (1) Linear Regression: a linear regression model trained on a historical window of URTs; we used the last 20 (chosen empirically) inter-arrival times as features. (2) HPP: a homogeneous Poisson process (HPP) [6], which models exponentially distributed inter-arrival times with a fixed rate λ. The rate parameter was learned
based on the maximum likelihood approach. (3) HP: we compare against the Hawkes Process (HP) [14], a self-exciting point process in which the occurrence of an event increases the probability of another event arriving soon afterwards. We consider a univariate Hawkes process where the intensity function is modeled as λ(t) = μ + ti