Mobility Patterns, Big Data and Transport Analytics: Tools and Applications for Modeling 0128129700, 9780128129708

Mobility Patterns, Big Data and Transport Analytics provides a guide to the new analytical framework and its relation to


English Pages 452 [432] Year 2018


Table of contents:
Cover
Front Matter
Copyright
Dedication
Contributors
About the Editors
1
Big Data and Transport Analytics: An Introduction
Introduction
Book Structure
Special Acknowledgments
References
Further Reading
Part I: Methodological
2
Machine Learning Fundamentals
Introduction
A Little Bit of History
Deep Neural Networks and Optimization
Bayesian Models
Basics of Machine Learning Experiments
Concluding Remarks
References
Further Reading
3
Using Semantic Signatures for Social Sensing in Urban Environments
Introduction
Spatial Signatures
Spatial Point Pattern
Spatial Autocorrelations
Spatial Interactions With Other Geographic Features
Place-Based Statistics
Temporal Signatures
Thematic Signatures
Examples
Comparing Place Types
Comparison Using Spatial Signatures
Comparison Using Temporal Signatures
Coreference Resolution Across Gazetteers
Geoprivacy
Temporally Enhanced Geolocation
Regional Variation
Extraction of Urban Functional Regions
Summary
References
4
Geographic Space as a Living Structure for Predicting Human Activities Using Big Data
Introduction
Living Structure and the Topological Representation
Data and Data Processing
Prediction of Tweet Locations Through Living Structure
Correlations at the Scale of Thiessen Polygons
Correlations at the Scale of Natural Cities
Degrees of Wholeness or Life or Beauty
Implications on the Topological Representation and Living Structure
Conclusion
Acknowledgments
References
5
Data Preparation
Introduction
Tools and Techniques
Scripting and Statistical Analysis Software
Python
R
MatLab
Database Management Software
MySQL
PostgreSQL
Commercial DBMS
NoSQL Data Management
Working With Web Data
Probe Vehicle Traffic Data
Formats and Protocols
TMC Codes
Open Location Referencing
Data Characteristics
Data Sources
Mobile Location Services
Consumer GPS Devices
Commercial Vehicle Transponders
Other Sources
Granularity
Vendor Quality Control and Imputation
Challenges
Completeness
Data Quality and Accuracy
Data Preparation and Quality Control
Data Loading
Geospatial Data
Tabular Traffic Data
Outlier and Error Detection
Visual Analysis
Rule-Based Outlier Detection
Statistical Methods
Imputation
Missing Data Patterns
Univariate Methods
Multivariate Methods
Multiple Imputation
Context Data
The Role of Context Data
Types of Context Data
Weather Data
Incident, Roadworks, and Road Blockages Data
Events Data
Social Media Data
Formats and Data Collection
Data Cleaning and Preparation
Information Extraction
Topic Modeling
Sentiment Analysis
References
6
Data Science and Data Visualization
Introduction
Structured Visualization
Multidimensional Data Visualization Techniques
Parallel Coordinates
Multidimensional Scaling (MDS)
t-Distributed Stochastic Neighbor Embedding for High-Dimensional Data Sets (t-SNE)
Case Studies
Experimental Setup
Car Characteristics Data Set
Congestion on I95
Dimensionality Reduction on NYC Taxi Flows
Dimensionality Reduction on the NYC Turnstile Data Set
Conclusions
References
Further Reading
7
Model-Based Machine Learning for Transportation
Introduction
Background Concepts
Notation
Case Study 1: Taxi Demand in New York City
Initial Probabilistic Model: Linear Regression
Likelihood Function
Priors
Key Components of MBML
Generative Process
Probabilistic Graphical Model
Joint Probability Distribution
Inference
Bayesian Inference
Exact vs Approximate Bayesian Inference
Model Improvements
Heteroscedasticity
Count Data
Case Study 2: Travel Mode Choices
Improvement: Hierarchical Modeling
Case Study 3: Freeway Occupancy in San Francisco
Autoregressive Model
State-Space Model
Linear Dynamical Systems
Common Enhancements to LDS
Filling Gaps
External Data
Regimes
NonLinear Variations on LDS
Case Study 4: Incident Duration Prediction
Preprocessing
Bag-of-Words Encoding
Latent Dirichlet Allocation
Formal Definition
Summary
Further Reading
References
8
Textual Data in Transportation Research: Techniques and Opportunities
Introduction
Big Textual Data, Text Sources, and Text Mining
Meaning of Text in the Context of Computational Linguistics
Text Mining
Text Mining Process Model
Textual Data Sources in Transportation
Fundamental Concepts and Techniques in Literature
Topic Modeling
Word2Vec-Text Embeddings With Deep Learning
Application Examples of Big Textual Data in Transportation
Developing Transportation and Logistics Performance Classifiers Using NLTK and Naïve Bayes
Understanding the Public Opinion Toward Driverless Cars With Topic Modeling
Predicting Taxi Demand in Special Events With Text Embeddings and Deep Learning
Conclusions
References
Further Reading
Part II: Applications
9
Statewide Comparison of Origin-Destination Matrices Between California Travel Model and Twitter
Introduction
California Statewide Travel Demand Model
Twitter Data
Trip Extraction Methods
Models for Matrix Conversion
Tobit Regression Model
Latent Class Regression Model
Summary and Conclusion
References
10
Transit Data Analytics for Planning, Monitoring, Control, and Information
Introduction
Measuring System Performance From the Passenger's Point of View
The Individual Reliability Buffer Time (IRBT)
Denied Boarding
Decision Support With Predictive Analytics
Framework
Demand Prediction Engine
Online Simulation Engine (Performance Prediction)
Demand, Supply, Information Loop
Implementation
Application: Provision of Crowding Predictive Information
Optimal Design of Transit Demand Management Strategies
Framework and Problem Formulation
Application: Prepeak Discount Design
Conclusion
Acknowledgments
References
Further Reading
11
Data-Driven Traffic Simulation Models: Mobility Patterns Using Machine Learning Techniques
New Modeling Challenges and Data Opportunities
New Modeling Requirements
New Data Sources
Future Challenges
Background
Data-Driven Traffic Performance Modeling: Overall Framework
Modeling Approach
Model Components
Clustering and Classification
Clustering
Classification
Flexible Fitting Models
Locally Weighted Regression
Multivariate Adaptive Regression Splines
Kernel Support Vector Machines
Gaussian Processes
Bayesian Regularized Neural Networks
Forecasting
Application to Mesoscopic Modeling
Data and Experimental Design
Case Study Setup
Application and Results
Application to Microscopic Traffic Modeling
Data and Experimental Design
Case Study Setup
Application and Results
Application to Weak Lane Discipline Modeling
Data and Experimental Design
Case Study Setup
Identification of Lead and Lag Vehicle
Determination of Virtual Lanes
Application and Results
Network-Wide Application
Implementation Aspects
Case Study Setup
Results
Conclusions
Acknowledgments
References
12
Introduction
The Role of Big Data in Traffic Safety Analysis
Real-Time Crash Prediction
Driving Behavior
ADAS and Autonomous Vehicles (AVs)
Conclusions
References
13
A Back-Engineering Approach to Explore Human Mobility Patterns Across Megacities Using Online Traffic Maps
Introduction
Data and Traffic Information Extraction Methods
Cities Characteristics
Data Gathering and Preprocessing
Extracting Traffic Information by Image Processing
Temporal and Spatiotemporal Mobility Patterns
Temporal Patterns
Spatiotemporal Patterns
Dynamic Clustering and Propagation of Congestion
Conclusions
References
14
Pavement Patch Defects Detection and Classification Using Smartphones, Vibration Signals and Video Images
Introduction
Brief Literature Review
Vibration-Based Methods
Vision-Based Methods
Methodology
Anomaly Detection Using ANNs and Timeseries Analysis of Vibration Signals
Anomaly Detection Using Entropic-Filter Image Segmentation
Patch Detection and Measurement Using Support Vector Machines (SVM)
Conclusions
References
15
Collaborative Positioning for Urban Intelligent Transportation Systems (ITS) and Personal Mobility (PM): Chal ...
Introduction
C-ITS in Support of the Smart Cities Concept
Scientific and Policy Perspectives of Urban C-ITS
Taxonomy of Urban C-ITS Applications
User Requirements for Urban C-ITS
Requirements Overview
Positioning Requirements and Parameters Definition
Positioning Technologies for Urban ITS
Radio Frequency-Based (RF) Technologies
Global Navigation Satellite Systems (GNSS)
Cellular Phone Networks
Wi-Fi, Bluetooth, ZigBee
Ultra-wideband (UWB)
Radio Frequency Identification (RFID)
MEMS-Based Inertial Navigation
Optical Technologies
Measuring Types and Positioning Techniques
Absolute Positioning Techniques
Proximity
Lateration
Fingerprinting
Relative and Hybrid Positioning Techniques
Dead Reckoning
Map Matching
Other Techniques
CP for C-ITS
From Single Sensor Positioning to CP
Fusion Algorithms and Techniques for CP
Application Cases of Integrated Urban C-ITS
Case 1: Smart-Bike Systems as a Component of Urban C-ITS
Case 2: Smart Intersection for Traffic Control and Safety
Discussion, Perspectives, and Conclusions
References
Further Reading
Conclusions
References
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Z
Back Cover


Mobility Patterns, Big Data and Transport Analytics Tools and Applications for Modeling


Edited by

Constantinos Antoniou Technical University of Munich Munich, Germany

Loukas Dimitriou University of Cyprus Nicosia, Cyprus

Francisco Pereira Technical University of Denmark Kongens Lyngby, Denmark

Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

© 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-812970-8

For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Joe Hayton
Acquisition Editor: Tom Stover
Editorial Project Manager: Naomi Robertson
Production Project Manager: Vijayaraj Purushothaman
Cover Designer: Mark Rogers
Typeset by SPi Global, India

Dedication

Constantinos Antoniou: To my family, especially my loving parents, Babi and Marika.
Loukas Dimitriou: To Diana.
Francisco Pereira: To David, Alice, and Ana.

Contributors

Numbers in parentheses indicate the pages on which the authors’ contributions begin.

Mohamed Abdel-Aty (297), Department of Civil, Environmental and Construction Engineering, Orlando, FL, United States
Constantinos Antoniou (1, 107, 263), Department of Civil, Geo and Environmental Engineering, Technical University of Munich, Munich, Germany
Stanislav S. Borysov (9), Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark
Symeon E. Christodoulou (365), Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus
François Combes (173), IFSTTAR/AME/SPLOTT, Paris, France
Adam Davis (201), Department of Geography and GeoTrans Lab, University of California, Santa Barbara, Santa Barbara, CA, United States
Loukas Dimitriou (1, 297, 345), Laboratory for Transport Engineering, Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus
Song Gao (31), University of Wisconsin, Madison, WI, United States
Vassilis Gikas (381), National Technical University of Athens, Athens, Greece
Vana Gkania (345), Laboratory for Transport Engineering, Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus
Konstadinos G. Goulias (201), Department of Geography and GeoTrans Lab, University of California, Santa Barbara, Santa Barbara, CA, United States
George Hadjidemetriou (365), Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus
Kristian Henrickson (73), Department of Civil and Environmental Engineering, University of Washington, Seattle, WA, United States
Yingjie Hu (31), University of Tennessee, Knoxville, TN, United States
Krzysztof Janowicz (31), University of California, Santa Barbara, CA, United States
Bin Jiang (55), Faculty of Engineering and Sustainable Development, Division of GIScience, University of Gävle, Gävle, Sweden
Samaneh Beheshti Kashi (173), University of Bremen, Bremen, Germany
Allison Kealy (381), RMIT University, Melbourne, VIC, Australia
Aseem Kinra (173), Copenhagen Business School, Copenhagen, Denmark
Haris N. Koutsopoulos (229, 263), Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States
Charalambos Kyriakou (365), Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus
Jae Hyun Lee (201), Department of Geography and GeoTrans Lab, University of California, Santa Barbara, Santa Barbara, CA, United States
Zhenliang Ma (229), Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States
Elizabeth McBride (201), Department of Geography and GeoTrans Lab, University of California, Santa Barbara, Santa Barbara, CA, United States
Grant McKenzie (31), University of Maryland, College Park, MD, United States
Peyman Noursalehi (229), Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States
Vasileia Papathanasopoulou (263), National Technical University of Athens, Athens, Greece
Inon Peled (145), Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark
Francisco Câmara Pereira (1, 9, 73, 145, 173), Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark
Zheng Ren (55), Faculty of Engineering and Sustainable Development, Division of GIScience, University of Gävle, Gävle, Sweden
Guenther Retscher (381), Technical University of Vienna, Vienna, Austria
Filipe Rodrigues (73, 145), Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark
Werner Rothengatter (173), Karlsruhe Institute of Technology, Karlsruhe, Germany
Katerina Stylianou (297), Laboratory for Transport Engineering, Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus
Michalis Xyntarakis (107), Cambridge Systematics, Medford, MA, United States
Rui Zhu (31), University of California, Santa Barbara, CA, United States
Yiwen Zhu (229), Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States

About the Editors

Constantinos Antoniou is a full professor, Chair of Transportation Systems Engineering, at the Technical University of Munich (TUM), Germany. He holds a Diploma in Civil Engineering from NTUA (1995), as well as an MS in Transportation (1997) and a PhD in Transportation Systems (2004), both from MIT. His research focuses on big data analytics, modeling and simulation of transportation systems, intelligent transport systems (ITS), calibration and optimization applications, road safety, and sustainable transport systems. In his more than 20 years of experience he has held key positions in a number of research projects in Europe, the United States, and Asia, while he has also participated in a number of consulting projects. He has received numerous awards, including the 2011 IEEE ITS Outstanding Application Award. He has authored more than 300 scientific publications, including more than 100 papers in international, peer-reviewed journals (including Transportation Research Parts A and C, Transport Policy, Accident Analysis and Prevention, and Transport Geography), 210 in international conference proceedings, 2 books, and 17 book chapters. He is a member of several professional and scientific organizations’ editorial boards (he is a member of the Editorial Board of Transportation Research—Part C, Accident Analysis and Prevention, the Journal of Intelligent Transportation Systems, and Smart Cities, and associate editor of the EURO Journal of Transportation and Logistics, IET Intelligent Transportation Systems, and Transportation Letters) and committees (such as TRB committees AHB45—Traffic Flow Theory and Characteristics, and ABJ70—Artificial Intelligence and Advanced Computing Applications, the Steering Committee of hEART—The European Association for Research in Transportation, and FGSV committee 3.10 “Theoretical fundamentals of road traffic”), and is a frequent reviewer for a large number of scientific journals, scientific conferences, research proposals, and scholarships.


Loukas Dimitriou is an assistant professor in the Department of Civil and Environmental Engineering, University of Cyprus (UCY), Cyprus, and founder and head of the LαB for Transport Engineering, UCY. His research interests focus on the application of advanced computational intelligence methods, concepts, and techniques for understanding the complex phenomena involved in realistic transport systems, and also on developing design and control strategies to optimize their performance. The methodological paradigms that he utilizes combine elements from Data Science, behavioral analytics, complex systems modeling, and advanced optimization, which are then applied in traditional fields of transport, such as demand modeling, travel behavior and systems organization, optimization, and control. He has authored more than 120 publications in peer-reviewed journals, proceedings of conferences, and book chapters, while he is a regular reviewer for almost 50 international journals. He is also an active member of many international scientific organizations and committees.

Francisco Pereira has been a full professor at the Technical University of Denmark (DTU), Kongens Lyngby, Denmark, since August 2015, where he leads the Machine Learning for Mobility (MLM) group. He holds a master’s (2000) and PhD (2005) degree in Computer Science and Artificial Intelligence from the University of Coimbra (UC), Portugal. Previously, he was an assistant professor at UC, Department of Computer Engineering, and then senior research scientist at the MIT ITS Lab, with particular focus on the Singapore-MIT Alliance for Research and Technology, Future Urban Mobility project (SMART/FM). His methodological research combines Machine Learning and Transportation Research, and his preferred applications generally relate to transportation research problems, such as real-time traffic prediction, behavior modeling, advanced data collection technologies, and transport modeling.
He has contributed to top journals and conferences in both Machine Learning (e.g., IEEE Transactions on Pattern Analysis and Machine Intelligence, or AAAI) and Transportation (e.g., Transport Research Part C, ISTTT), and thus lives with his feet constantly in both worlds, which he believes gives him a different perspective, despite the hard challenges. He is currently associate editor of Engineering Applications of AI (EAAI, Elsevier), a committee member of TRB ADB40 (travel demand forecasting), and has been guest editor in Transport Part C. He has been a Marie Curie fellow twice (2011 and 2015), and has won several international awards, such as the Singapore Challenge 2013 (for a white paper on the future of transportation) and the TRB Pyke Johnson Award (for smartphone-based travel survey research).

Chapter 1

Big Data and Transport Analytics: An Introduction

Constantinos Antoniou*, Loukas Dimitriou† and Francisco Câmara Pereira‡

*Department of Civil, Geo and Environmental Engineering, Technical University of Munich, Munich, Germany; †Laboratory for Transport Engineering, Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus; ‡Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark

Chapter Outline
1 Introduction
2 Book Structure
Special Acknowledgments
References
Further Reading

1 INTRODUCTION

The aim of this book is to contribute to the question of how the transportation profession and research community can benefit from the new era of Big Data and Data Science, the opportunities that arise, the new threats that are emerging, and how old and new challenges can be addressed by the enormous quantities of information that are foreseen to be available. The current era can be characterized by three main components:

1. an unprecedented availability of (structured and unstructured) information, collected through traditional sources/sensors, but also by the extensive wealth of nontraditional sources, like internet-of-things and crowdsourcing;
2. a vast expansion of computational means (hardware and—most significantly—paradigms) exceeding Moore’s law (Moore, 1965); and
3. the development of new powerful computational methods able to treat the challenges of extensive information, able to be executed only by powerful computational means (interconnected and cloud integrated).

These three elements triggered a tremendous boost in inspiration and incentives for new developments for business and industrial applications, in the associated research community, as well as in social and governmental organizations overall. The stage has changed.

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00001-4 © 2019 Elsevier Inc. All rights reserved.



This constitutes the new vibrant scientific area of Data Science, adding a new data-driven analytical paradigm to the existing traditional three, viz., the empirical, the theoretical, and the computational. Like any newcomer, Data Science has been received by many with some reluctance (Pigliucci, 2009; Milne and Watling, 2017), but by others as a path to new (and easy) revelations. Famously, Chris Anderson declared that this is the “End of Theory,”1 following the long tradition of human ambition for conquering knowledge and the future, starting from the biblical “Tree of Knowledge,” to statements of prolific figures of science like (purportedly) Charles Holland Duell’s “Everything that can be invented has been invented,” Lord Kelvin’s “There is nothing new to be discovered in physics now; All that remains is more and more precise measurement” and David Hilbert’s “We must know; We will know!,” until each of them was defeated (e.g., Gödel, 1931). However, Anderson’s (2008) statement that “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all,” naively taken, may lead to the misconception that the fundamental use of models (hypothesis testing and explanatory analysis) is obsolete. Of course, this is not the essence of data analysis, though data-centric analysis has an impact on the experimental design, the type of information that is used for a particular research purpose, and the type of confirmatory criteria used for evaluating results. These differences in data use (in data availability and volume) and model building (in models’ typology and fundamental assumptions) signify a turning point in scientific reasoning, requiring new theoretical and practical developments for treating the new scientific threats, as well as the preparation of a new generation of scientists able to appropriately handle the new “tool(s)” that will increasingly become available.
Focusing on the field of transportation systems analysis, Data Science endeavors suit it well, given its characteristics, which—succinctly—include:

– Complex and large scale, composed of multiple distinctive units, arranged in multiple sequences, layers, or parallel operations;
– Spatially distributed, establishing connectivity and service among remote locations by a synthesis of supply means (transport infrastructure and transport modes);
– Multiple-agent engagement, involving cooperative, noncooperative, and competitive relationships among the agents and with the transport infrastructure;
– Dynamic/transient, since transport is by definition a dynamic phenomenon of movement in space and time; and
– Stochastic, since transport operations stand for the manifestation of the decision-making processes of agents (travelers, shippers, carriers, etc.) with different characteristics, properties, opportunities, “flavors,” and criteria,

1. Wired, 06.23.08.


while decisions are made in a fluctuating environment in terms of the physical, economic, and other elements. The above fundamental characteristics of transportation systems comprise sources of complexity, reflected as inaccuracies (or failures) of the typical/traditional analytical paradigms, especially when applied in real-world circumstances. The use of Big Data, treated within the new analytical field of Data Science, in our view stands for a promising new era for understanding and managing existing and future transportation phenomena. The effective exploitation of the Big Data and Data Science “promises” depends on the rate of endorsement of emerging methods and applications by the relevant scientific and industrial community. It should be highlighted that the general public (such as the end-users and the markets) anticipates new developments, with the community of “early-adopters” growing rapidly. But what are the characteristics of the relevant contemporary (and future) transportation scientist? How should the new generation of transportation scientists be equipped? Is it all about data handling/processing/analysis? The view that can be identified throughout this book reflects the idea that a strong scientific background in the field, topic, or system is a compulsory prerequisite for testing or adopting data-centric applications. This is far from the so-called black-box approaches and the jeopardies involved in such cases or applications. Advanced data analysis and Data Science concepts stand for an additional tool of the transportation professional, who should be formally prepared (possibly by dedicated programs) for embracing them. However, this should not be viewed as a shortcut to avoid the fundamentals.
Finally, the idea for this book was conceived during a Summer School on these topics, organized by the Editors and held in June 2016 on the premises of the University of Cyprus, Lab for Transport Engineering, with the participation of most of the (co)authors. During the Summer School, the multidisciplinary makeup of both the instructors and the attendees became immediately evident. This pluralism of ideas, approaches, and concepts is reflected here. We are confident that the readers will benefit from the contents of this book and will enjoy this guided trip through the different topics, models, and applications, aiming to cover some of the most important fields of the transportation profession where Big Data applications have matured enough.

2 BOOK STRUCTURE

Rather than taking the form of a textbook, the book is structured to provide a guided trip for the reader, from introductory concepts about the use of Big Data and Data Science methods in transportation to the presentation of indicative (though mature) applications. In this way, the reader may gain helpful insight into how concepts may be treated with new methods, so as to develop innovative applications. We think that this innovation-oriented process may be more


FIG. 1 Flowchart of the book structure.

inspiring, and of more timely importance, than the presentation of a “strict” closed-form/self-contained textbook of specific applications on some topics of mobility. To achieve this, the book is divided into four sections: the introductory, the methodological, the applications, and the outlook (Fig. 1). In the current introductory part (this chapter), the general view of the book on the use and value of data-centric analysis is provided. Then, in the second part of the book, a review and presentation of fundamentals of machine-learning methods is provided (Chapter 2), while a discussion about the combination of theory-driven and data-driven methods is offered in Chapter 3. The stage of mobility analysis, human activities, and the living structure within geographical space is then set in Chapter 4. Issues regarding data preparation (Chapter 5) and data visualization (Chapter 6) provide important preparation for prospective transportation data-analytics professionals and researchers. The methodological part of the book also comprises the theoretical underpinnings of the integration of model-based machine-learning approaches in transportation (Chapter 7) and the use of nontraditional textual data for analyzing mobility (Chapter 8). After having read the methodological part, the reader will be equipped with essential methodological tools in order to be able to better follow the applications-oriented part. In detail, indicative applications of nontraditional examples using data-analytic approaches are selected, facilitating the understanding of how new methods have been adopted in order to analyze transportation demand and systems (Chapters 9, 10, 11, and 15), road safety (Chapter 12), mobility patterns (Chapter 13), and transport infrastructure (Chapter 14).
By the end of the book’s third part the reader will have gained knowledge of applications of currently state-of-the-art methods in various elements of the transportation field and, hopefully, inspiration for extending or improving the use of Big Data and Data Science methods in the field. The fourth and final part of the book provides an outlook on the use of advanced data analytics methods in transport, aiming to offer useful foresight to potential transport professionals, developers, and researchers.

SPECIAL ACKNOWLEDGMENTS

The Editors of this book would like to gratefully acknowledge the patient and helpful support of the Editorial Managers (especially Mr. Tom Stover, Acquisitions Editor, Transport, and Ms. Naomi Robertson) throughout the development process, as they provided their valuable experience and organizational facilities for putting together and concluding this work.


Furthermore, the Editors would like to express their sincere and deep gratitude to all the distinguished authors who contributed to this effort and agreed to share their academic and professional experience with the scientific community through this volume. Without their invaluable contribution, this book would not have been possible.



Chapter 2

Machine Learning Fundamentals

Francisco Câmara Pereira and Stanislav S. Borysov
Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark

Chapter Outline
1 Introduction
2 A Little Bit of History
3 Deep Neural Networks and Optimization
4 Bayesian Models
5 Basics of Machine Learning Experiments
6 Concluding Remarks
References
Further Reading

1 INTRODUCTION

At the time of writing this chapter, with so much buzz around concepts such as artificial intelligence, big data, deep learning, and probabilistic graphical models (PGMs), machine learning has gained a bit of a mystical fog around it. This excitement has more recently led to a series of dystopian visions that, among others, literally suggest the end of the world (e.g., Kurzweil, 2005; Bostrom, 2014)! We will delve into some of these aspects below, in the historical perspective, but first we want to demystify machine learning as a general discipline. Machine learning combines statistics, optimization, and computer science. The statistics that we mention here is exactly the same discipline that has been taught for several centuries. No more, no less. In their vast majority, machine learning models also consist of functions that directly or indirectly end up as

y = f(x, β) + ε.

In other words, we want to estimate (or predict) the value of a response variable, y, through a function f(x, β), where x is a vector with our observed (input) variables, and β is a vector with the parameters of our model. Since, in practice, our data (contained in x) are not perfect, there is always a bit of noise and/or unobserved data, which is usually represented by the error term, ε. Because


we cannot really know the true value of ε, it is itself a random variable. Of course, as a consequence, y is a random variable as well. It is almost sure that you are familiar with the most basic form of f, the linear model:

f(x, β) = β0 + β1x1 + β2x2 + … + βnxn.

So, if you have a dataset with pairs of (x, y), the task is to estimate the values of β that best reproduce the linear relationship. You do this using optimization, sometimes called the "training" process, which is the minimization of the difference between the true values of y and the model's predictions, also known as a loss function. Bored by now? Great, let us now imagine that f(x) is instead defined as

f(x, β) = fK(fK−1(…(f1(x, β1)…), βK−1), βK),

where each function fk transforms its input and gives its result to the following one. Or, before doing that, it processes its data in a distributed fashion, for example, fK,1(x1, …, xj), fK,2(xj+1, …, xk), …, fK,L(xk+1, …, xr), where we have L subfunctions, each processing a subpart of the data. Imagine you have really a lot (thousands!) of such functions. Congrats, you just discovered a deep neural network (DNN) (or, in fact, many other types of models, including PGMs, another popular one these days). Of course, it can become much more complicated, but the principle is precisely the same. You have a sequence of functions, each one with its own set of parameters (sometimes shared among many different functions), where the output of one is the input of the other. Due to the large number, the estimation of these parameters may require a large amount of data and computation. And that is where it becomes distinct from classical statistics.

In fact, there are several ways to categorize machine learning tasks. The most common way depends on whether the target variable (our y above) is present in the problem itself, leading to the supervised and unsupervised learning paradigms.
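As a minimal sketch of the linear case above, the parameters β can be estimated by minimizing the squared loss with ordinary least squares; the data below are made up for illustration.

```python
import numpy as np

# Hypothetical data: y depends linearly on two observed variables plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # observed inputs x
beta_true = np.array([1.5, -2.0])
y = 3.0 + X @ beta_true + rng.normal(scale=0.1, size=100)  # y = f(x, beta) + eps

# Estimate beta by minimizing the squared loss (ordinary least squares).
X1 = np.column_stack([np.ones(len(X)), X])    # prepend a column of ones for beta_0
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                               # close to [3.0, 1.5, -2.0]
```

The estimated vector recovers the intercept and slopes up to the noise in the data, which is exactly the "training" process described above for the simplest possible f.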
Supervised learning implies the presence of a target variable we would like to predict, so the model receives feedback from the "teacher" on how close its prediction is to the ground truth. This paradigm also includes semi-supervised learning (where the target values are partially missing), active learning (where the number of target values that can be obtained is limited, so the model should decide which data samples to label first), and reinforcement learning (where y is given in the form of a reward for a set of actions performed by the algorithm). Sometimes the target variable is not present; you only have data points x, and thus the problem formulation becomes slightly different. In such a case, the objective typically becomes to find patterns in the data, or clusters of data points that seem to fit together. This is an example of unsupervised learning. From a statistical point of view, it is an example of generative learning of the data's joint distribution p(x), while the most well-known supervised learning algorithms are regarded as discriminative, that is, we estimate the conditional distribution p(y | x). Some supervised learning algorithms exist, though, that are also generative (i.e., we estimate p(y, x)), which is the case of PGMs.


Depending on their output, machine learning models can be classified as regression (the target variable is continuous), classification (the target variable is categorical), and others, like clustering (often unsupervised), probability density estimation, and dimensionality reduction (e.g., the principal component analysis algorithm). Although it would be almost impossible to provide a complete overview of this highly dynamic field, this chapter will present the major forms of what is known today as machine learning and provide some tips about where to go next, should the reader want to be part of this revolution, or simply understand more of what it is about. But first, it is worthwhile to give a historical perspective.

2 A LITTLE BIT OF HISTORY

Machine learning was born as one branch within the major field of artificial intelligence, which also includes others such as Knowledge Representation, Perception, and Creativity (Russel and Norvig, 2009; Boden, 2006). The term "machine learning" was coined by Arthur Samuel, who, as early as 1952, created the first program that could play and learn the game of checkers (Samuel, 1959). The "learning" process here corresponded to incrementally updating a database with moves (board positions) and their scores, according to their probability of later leading to a win or a loss. As the computer played more, it improved its ability to win the game. This is probably the earliest version of reinforcement learning,1 today an established subfield for scenarios where the whole data set is only incrementally available and/or the label of each data point is not directly observable (e.g., the value of a game move may only be observable later in the game, or even at its very end, in the binary form of "win" or "lose").

During the 1960s and 1970s, many researchers were enchanted by the concept of a machine that is pure logic, and the memory and processing limitations of computers were extremely tight compared with nowadays. More than that, the belief that human intelligence could all be represented through logic (a "computationalist" point of view) was widespread and exciting. This naturally led to an emphasis on rule-based systems, representing knowledge through logic (e.g., logical rules, facts, symbols), and natural language processing. In parallel, other researchers believed that we should instead study the neurobiology of our brain and replicate it (a "connectionist" point of view), in what became known as artificial neural networks (ANNs). The first well-known example is the perceptron (Rosenblatt, 1957), which applies a threshold rule to a linear function to discriminate a binary output. It is depicted in Fig. 1.
1. There are records of an earlier program with reinforcement learning capabilities, the "shopping machine," by Anthony G. Oettinger in 1951. The task was to answer the question "in what shop may article j be found?", and the algorithm could simulate trips to eight different shops (Moor, 2003).


FIG. 1 Perceptron.

FIG. 2 Linear separation: linearly separable (left) vs. nonlinearly separable (right). There are two classes (+ and o), and we want a rule (a classifier) that discriminates between them.
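The perceptron's threshold rule and its error-driven weight update can be sketched in a few lines; the learning rate, epoch count, and training data below are made up for illustration.

```python
import numpy as np

# Threshold rule on a linear function: output 1 if w.x + b > 0, else 0.
def perceptron_predict(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0

# Classic error-driven update: adjust weights only when a prediction is wrong.
def perceptron_train(X, y, epochs=20, lr=0.1):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - perceptron_predict(xi, w, b)
            w, b = w + lr * err * xi, b + lr * err
    return w, b

# It can learn the (linearly separable) AND gate -- but it could never learn XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w, b = perceptron_train(X, np.array([0, 0, 0, 1]))
print([perceptron_predict(xi, w, b) for xi in X])  # [0, 0, 0, 1]
```

Replacing the target column with the XOR labels [0, 1, 1, 0] would leave the algorithm cycling forever, which is exactly the limitation Minsky and Papert pointed out.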

The perceptron was quite popular for a time, as it can represent some logical gates (AND and OR), but not all (XOR). Due to the latter limitation, such a simple ANN is limited to linearly separable problems (Fig. 2, left), whereas real problems quite often are inherently nonlinear in nature (Fig. 2, right). This strong criticism, pointed out by Minsky and Papert (1969), led to a decade with virtually no research on neural networks (NNs), also known as the "first AI winter." The third relevant research thread from the 1960s and 1970s relates to the "nearest neighbor" concept. Cover and Hart (1967) published "Nearest neighbor pattern classification," which effectively marks the creation of the "pattern recognition" field and the birth of the well-known K-nearest neighbors algorithm (K-NN). Its underlying principle is simple: if we have a problem to solve (e.g., given an input vector x, classify it into a class, so the target variable y becomes categorical), we can look for the K most similar situations in our database. Of course, the key research question is how to define similarity, which boils down to the comparison of x vectors. The typical option is the Euclidean distance, but there are a lot of other metrics that may be much more suitable given the nature of the data at hand (e.g., categorical, textual, temporal). Another challenge in K-NN is to define K. What would be the best number of similar examples from the data to use in the algorithm? This introduces the concept of a hyperparameter, which should be defined a priori, in contrast to the parameters β, which are determined during training. Of course, in those


days, the complexity of data and the available computational power and memory were so low that good answers were only found much later.

The most basic unsupervised clustering algorithm, known as k-means, was first proposed by Lloyd (1957) at Bell Labs (but published only in 1982) and independently by Forgy (1965). Its main procedure for finding clusters in the data is, having k cluster centroids randomly initialized (k is another hyperparameter!), to assign each data point to the cluster with the nearest mean (step 1), and to update these centroids using the data points assigned to them (step 2). Repeat these steps iteratively until convergence and the task is solved. Later, this procedure was rethought and generalized as the expectation-maximization algorithm for a Gaussian mixture model (Dempster et al., 1977).

The 1980s saw the rebirth of ANNs, with enough computing power to allow for multilayer networks and nonlinear functions. Notice that, if we replace the function f(x) in Fig. 1 with a logistic sigmoid function (as opposed to the 0/1 step function), we essentially obtain a binary logit model (or logistic regression), and this is the basis for the multilayer perceptron (MLP), summarized in Fig. 3. Notice that now there are two sets of weights, w(1) and w(2). The first one associates a vector of weights, wj(1), to each of the fj functions in the hidden layer. As mentioned, the typical form of these functions is either the logistic sigmoid (Fig. 4A) or the hyperbolic tangent. Each one of these then provides the input to the final (output) layer, weighted by the vector w(2) = [w1(2), w2(2), …, wm(2)]T. The function σ is typically another sigmoid (if we are doing classification) or the identity function (if we are doing regression).
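Returning to k-means for a moment, the two steps described above can be sketched in a few lines of NumPy; the two well-separated Gaussian blobs, the value of k, and the fixed iteration count are all made up for illustration.

```python
import numpy as np

# Hypothetical data: two well-separated 2-D clusters around (0,0) and (5,5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

k = 2  # the hyperparameter: number of clusters, chosen a priori
centroids = X[rng.choice(len(X), k, replace=False)]  # random initialization

for _ in range(10):  # iterate until convergence (fixed count for brevity)
    # Step 1: assign each data point to the cluster with the nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it.
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(np.round(centroids))  # near the true cluster centers (0,0) and (5,5)
```

With clusters this well separated, the procedure settles on the true centers within a few iterations; on harder data, the result depends on the random initialization, which is one reason the EM generalization mentioned above is of interest.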

(1)

x1

b1

xn

fm (x)

bm

Input layer

w

.....

.....

f1(x)

Hidden layer

(2)

! ()

y

b

Output layer

FIG. 3 Multilayer perceptron.

FIG. 4 (A) Sigmoid function and (B) rectified linear unit (ReLU).
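One forward pass through the MLP of Fig. 3 can be sketched as follows; the layer sizes and the (random) weights are hypothetical, standing in for parameters that a real network would learn from data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions: n inputs, m hidden units.
rng = np.random.default_rng(0)
n, m = 3, 4
x = rng.normal(size=n)

W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)  # w(1) and hidden biases
w2, b2 = rng.normal(size=m), 0.5                      # w(2) and output bias

h = sigmoid(W1 @ x + b1)       # hidden layer: m sigmoid units f_j
y = sigmoid(w2 @ h + b2)       # output sigma: another sigmoid (classification)
# For regression, sigma would be the identity: y = w2 @ h + b2.
print(round(float(y), 3))
```

Swapping `sigmoid` for `np.tanh` in the hidden layer gives the hyperbolic-tangent variant mentioned in the text; either way, the output stays a simple composition of weighted sums and nonlinearities.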


From a statistical modeling perspective, an MLP is simply a model with m latent variables that are combined in the output layer. The development of the backpropagation algorithm (Rumelhart et al., 1986), which allows the MLP to be efficiently estimated from data, was fundamental for its popularity. Furthermore, Hornik et al. (1989) demonstrated that one layer of suitable nonlinear neurons followed by a linear layer can approximate any nonlinear function with arbitrary accuracy, given enough nonlinear neurons. This means that an MLP network is a universal function approximator, which motivated plenty of excitement. However, it was also verified during the 1990s that the MLP often runs into overfitting problems, and its opaqueness (neither the weights nor the hidden layer have a clear interpretation) often limits its application. Thus, after the MLP, we saw relatively low investment in NN research, marking what became known (guess what…) as the "second AI winter," until the boom of DNN models since 2010. The key ingredients for their success were new sophisticated architectures, computational resources, and the amounts of data available. Unlike an MLP, which always has one hidden layer (Hastie et al., 2009), a DNN may have multiple layers, sometimes dozens or hundreds. While the MLP is always fully connected (all elements, or neurons, in one layer are connected with all elements of the subsequent layer), a DNN is often very selective, with different sets of neurons connected to different parts of the subsequent layer. The most popular activation function is no longer the sigmoid; instead, it is the rectified linear unit (ReLU), shown in Fig. 4B. The ReLU function became popular as it helps to mitigate the vanishing gradient problem (the exponential decrease of the feedback signal as the errors propagate to the bottom layers), which hindered the training of NNs with many layers.

Back in the 1980s, it was the golden age of Expert Systems.
These usually relied on explicit representation of domain knowledge in the form of logical rules and facts, obtained by interviewing experts. The new profession of Knowledge Engineer would typically combine the ability to conduct such interviews and to code the knowledge into systems such as FRED (Freeway Realtime Expert System Demonstration) or FASTBRID (Fatigue Assessment of Steel Bridges), to mention just the transportation domain. For a notion of the expectations and opportunities of such systems for transportation, through the eyes of the early 1990s, the paper by Wentworth (1993) is an interesting read. The major problem with Expert Systems, and related rule-based paradigms, has always been the costly process of collecting and translating such knowledge into a machine. To make things more complicated, such knowledge is often not deterministic or objective, which eventually led to the development of fuzzy logic as a way to improve the potential of Expert Systems, which were still huge collections of hand-coded rules. On the other hand, one should ask: is this machine learning at all? Aren't Expert Systems instead an attempt to mechanize human knowledge, explicitly extracted and hand-crafted by humans (the coders)?


A different approach to this problem comes from probability theory. Between 1982 and 1985, Judea Pearl created the concept of the Bayesian Network, where domain knowledge is also important to structure relationships between variables, but no longer as long, complex rules. Instead, he proposed to decompose knowledge into individual relationships, inspired by the concepts of causality and evidence. More importantly, he developed the belief propagation (BP) method for learning (or making inference) from data (Pearl, 1982, 1985). Such networks were successful, particularly in classification problems, and later the challenge of automatically generating their structure from data (as opposed to domain expertise) was proposed by Spirtes and Glymour (1991); it remains a relevant topic to this day.

But perhaps the hottest trend in the 1980s and 1990s was in fact the kernel methods, and the support vector machines (SVMs) in particular. We get back to the problem of linear separation (Fig. 2), where reality tends to be nonlinearly separable. If we can find a function mapping that, when applied to a data set, makes it linearly separable, then we solve the problem. Typically, this implies mapping the data to a higher dimension. Fig. 5 shows an example. On the left, we have the original data set (class "o" has points {−7, −5, −4, 2, 3, 4}, while class "+" is {−2, −1, 0, 1}). This is one-dimensional, so let us add a new dimension by simply squaring the values. Each point is now (x, x²). On the right, we plot those points. There it goes: we can now draw a straight line to separate the classes! Vapnik and others discovered that there is always one (higher) dimension where a data set is linearly separable. More importantly, they found a process, called the kernel trick, where such mapping is done without the need to explicitly project to the higher dimension (Boser et al., 1992), as we did above.
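The explicit mapping of Fig. 5 can be checked numerically. The points below are those listed in the example (with the minus signs, lost in typesetting, restored from the figure's axis), and the particular separating line x₂ = −x₁ + 3 is our own choice for illustration; any line between the two mapped classes would do.

```python
import numpy as np

# The two 1-D classes from the Fig. 5 example (minus signs restored).
o = np.array([-7, -5, -4, 2, 3, 4])  # class "o"
p = np.array([-2, -1, 0, 1])         # class "+"

# Explicit feature map: x -> (x, x^2). In the original 1-D space the classes
# interleave, so no single threshold separates them; in 2-D a line can.
def above_line(x):
    # Is the mapped point (x, x^2) above the (hypothetical) line x2 = -x1 + 3?
    return x**2 > -x + 3

print(all(above_line(o)), any(above_line(p)))  # True False
```

All "o" points land above the line and all "+" points below it, so the mapped data are linearly separable; the kernel trick achieves the same effect without ever computing the mapped coordinates.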
Using convex optimization to train prediction models on data, the SVM obtains global optimality guarantees for the model/parameters learned. The SVM algorithm learns the separating line (a straight line in our example, or a hyperplane in higher dimensions) that has the largest distance to the nearest training-data point

FIG. 5 Mapping 1-D data points, which are not linearly separable, to a higher-dimensional (2-D) space makes them linearly separable: original space (left) and mapped space (right).


FIG. 6 SVM finding the best hyperplane. There are many possible choices of linear separation (left). Support vectors are found that maximize the margin (right).

of any class. In other words, it finds the dividing line where each side has the maximum margin to the closest data points. The larger this margin, the lower the error. Fig. 6 illustrates the concept. Notice that the maximum margin always coincides with vectors from each class. These are the closest data points, the support vectors. Of course, all this happens at a potentially very high dimensionality, and the 2-D example is only provided for illustration purposes.

The concept of a kernel is useful well beyond the SVM. Intuitively, a kernel function defines the similarity between two data points. The concept is illustrated in Fig. 7 with one-dimensional kernels. Imagine that you want to compare two

FIG. 7 Four different kernels.


data points, x1 and x2. The parameter u is simply the scaled difference between them, u = (x1 − x2)/h. The vertical axis represents their similarity. Naturally, in all cases, the similarity is maximum when u = 0, but the way it decays is considerably different from kernel to kernel. Take the K-NN, created in the 1960s: now we can use kernels to define similarity in a different way. The similarity of a vector x to other vectors x′ is simply determined by the value of the kernel function, which depends on the distance between them and hence has a higher value for neighbors. We can even use different distance definitions for different dimensions of x and, more importantly, we do not need to define K anymore. However, hyperparameters of the kernel, such as the characteristic length h, now come into play!

By the early 2000s, four general trends in machine learning had formed:

1. Rule-based systems: These include hand-crafted rules à la expert systems, decision trees, decision tables, and logic programming. They are the most interpretable machine learning systems. From a layman's standpoint, they are intuitive to implement and debug, and often have competitive accuracies.

2. Kernel-based algorithms: These are typically based on the concept of neighborhood, and include a wide range of methods, from K-NN to the SVM and their derivatives. They demand a careful definition of similarity (or kernel), and do not require the definition of the model's functional form, as opposed, for example, to linear or polynomial regression, where we define the general functional form and estimate its parameters from the data. For these aspects, these models are generally called nonparametric. To run a kernel-based algorithm, one has to retain the data set (or part of it) for use whenever a prediction is made. Another weakness of such methods is poor scaling with the data set size, due to the computationally costly data matrix inversion involved in the training.
These issues have contributed to the revival of NNs, which also support online learning, avoiding reuse of the entire data set to incorporate new data points.

3. DNNs: A rebirth of the NN view of machine learning, DNNs are reaching the news in almost every corner of the world. They have hit fantastic benchmarks, particularly in areas where plenty of high-dimensional data are present, such as computer vision, automatic translation, robotics, computer games, and speech recognition. Such successes are behind the resurgence of artificial intelligence as a hot topic for at least the next few years. These NNs are called "deep" because they rely on many layers of neurons, sometimes hundreds, stacked together, allowing them to process the data in a hierarchical manner. Often having thousands or even millions of parameters to tune, they usually depend on massive amounts of data and computing power. Curiously, advanced computer


graphics cards (initially designed for gaming!) became an essential piece of hardware for deep learning methods, because they can manipulate extremely large matrices in parallel (and you can put many such cards in a single computer now). A major drawback of this approach remains its low interpretability.2 A trained DNN ends up with thousands of parameters that hardly have a meaning for a human modeler. Also, deep learning algorithms still lack a sufficiently established set of theoretical foundations, being mainly empirical and ad hoc.

4. Bayesian statistics: Another current trend builds on Bayesian probability theory to make inference on data, that is, to predict the distribution of a variable of interest and of the model's parameters given a set of observations. Typically, such models require a definition of their structure (how the different variables relate to each other and to the observations), either manually or through automatic processes (a.k.a. Bayesian Network structure learning). A particular advantage of these models is that they allow the combination of domain-based formulations (e.g., well-known domain laws, with clear functional forms) with data-driven methods (e.g., unknown functional form). This area has been extremely prolific in the 2010s, with the emergence of "probabilistic programming languages"3 and the general family of Probabilistic Graphical Models. In contrast to other methods, the Bayesian approach allows one to define the full distribution of the model's parameters, instead of the point estimates found, for instance, by the backpropagation algorithm in NNs.
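Before moving on, the kernel-as-similarity idea from Fig. 7 can be made concrete: with a Gaussian kernel, the hard "pick the K nearest points" of K-NN becomes a smooth weighting of all points by distance, with the bandwidth h as the hyperparameter instead of K. The toy regression data below are made up for illustration.

```python
import numpy as np

# Gaussian kernel: similarity as a function of the scaled difference u.
def gaussian_kernel(u):
    return np.exp(-0.5 * u**2)

# Kernel-weighted prediction: every training point votes, weighted by its
# similarity to the query; the bandwidth h replaces K as the hyperparameter.
def kernel_predict(x_query, X, y, h=0.2):
    u = (X - x_query) / h            # scaled differences, u = (x1 - x2)/h
    w = gaussian_kernel(u)           # similarity of each training point
    return np.sum(w * y) / np.sum(w) # weighted average (regression)

# Toy data: noiseless samples of sin(x) on a grid.
X = np.linspace(0, 2 * np.pi, 50)
y = np.sin(X)
print(round(kernel_predict(np.pi / 2, X, y), 2))  # close to sin(pi/2) = 1.0
```

Shrinking h makes the predictor trust only very close neighbors (low bias, high variance); growing it averages over ever wider neighborhoods, which is exactly the kind of hyperparameter trade-off the text describes.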
To conclude this brief historical overview, it is worth saying that, despite the many striking advances in machine learning we witnessed in the 2010s (e.g., generative adversarial networks can generate photorealistic pictures of nonexisting humans, and deep reinforcement learning algorithms can easily beat the best Go players), we have achieved artificial intelligence only in a narrow (or weak) sense, where it outperforms human capabilities in very specific tasks. We are still quite far away from artificial general intelligence, which could be applied to any problem on a par with humans, not to mention artificial superintelligence, which is even difficult to define, except that it should be qualitatively smarter than any human being. At the time of writing, we can identify two major trends of new research in the machine learning community. So, in the remainder, we will focus on DNNs and Bayesian models, highlighting their weaknesses and strengths. To complete our short introduction to the subject, we will also briefly touch on the basics of machine learning experiments, such as model testing and comparison.

2. However, various indirect approaches can be applied to address this issue to some extent; for instance, see Ribeiro et al. (2016).
3. http://probabilistic-programming.org/.


3 DEEP NEURAL NETWORKS AND OPTIMIZATION

As we already saw, a DNN is a straightforward extension of an MLP with multiple layers of artificial neurons (Fig. 8). Its loss function, L, measures the difference between the model's predictions and the ground truth. For instance, it can be the mean squared error (MSE) for continuous outputs (regression) or the cross-entropy loss for categorical outputs (classification). Since it is a function of the parameters (weights and biases), in order to minimize it, the gradient descent algorithm is usually applied. All we need is to calculate the gradients of the loss function with respect to the parameters and update them in the direction of a loss function minimum:

wijk(new) = wijk(old) − η ∂L/∂wijk,

where η is the learning rate, which defines how big the steps should be. The famous backpropagation algorithm is basically the chain rule applied to the loss function differentiation to calculate the required gradients. In a nutshell, it says that each neuron contributes to the loss function proportionally to the weights of its connections to the following neurons. We can simply use this fact to calculate these contributions starting from the output layer and propagate them backward through the network using its weights and the derivative of the activation function f. The problem is that plain gradient descent will almost surely get stuck in a local minimum or saddle point (Fig. 9). Many different optimization techniques have been elaborated to tackle this issue. The most simple and popular one is stochastic gradient descent, where the gradients of the errors are calculated over small subsets of the data, called mini-batches. In fact, this introduces some noise into the estimated gradients, so the parameters have a chance to escape

FIG. 8 A generic deep neural network architecture. To compute predictions, the output from a layer is passed to the next layer using the network's parameters. During the training step, the parameters of each layer are updated using the loss function gradients propagating in the opposite direction ("backpropagation").


FIG. 9 The loss function is almost always a nonconvex function with a lot of local minima, so the plain gradient descent algorithm will almost surely converge to one of them.

from unwanted regions of the loss function. Many modifications to the parameters' update rule have also been proposed: to name a few, momentum, Nesterov, Adagrad, Adadelta, RMSprop, Adam, AdaMax, and many others. These modifications usually incorporate gradients from the previous learning steps into the update rule.

Being quite general and flexible, the DNN has become a very popular and powerful model with a wide range of applications. One of its main strengths is the ability to process information in a hierarchical way, automatically capturing new levels of abstraction in the data and effectively dealing with the curse of dimensionality. This problem arises because, as the dimensionality (i.e., the number of features) of the data grows, the volume of a unit sphere in this space grows exponentially. As a consequence, to explore this volume and provide reasonable statistical estimations, a model requires exponentially more data samples. One of the known approaches to tackle this problem is dimensionality reduction, which assumes that there are only a few important dimensions, which can be represented as (often nonlinear) combinations of the initial ones. A DNN is believed to do this automatically, where each following layer learns new low-dimensional representations of the data. For example, consider a document topic classification problem given scanned handwritten text as an input. It would be almost impossible to perform this semantic analysis directly on the raw pixel intensities. Instead, a DNN detects strokes and curves first, then tries to identify letters, which in turn constitute words. This can be achieved thanks to the layered structure that can theoretically learn any complex mapping. We stress the word "theoretically" because it appears not so trivial to do in practice. The main challenge is the training of a DNN: fitting its parameters essentially boils down to finding a minimum of a nonconvex function in a highly (thousand- or even million-) dimensional space.
Therefore, plain fully connected DNNs are rarely used


alone. Many different hybrid architectures of DNNs, tailored for different purposes, have been proposed. Below, we discuss the convolutional neural network (CNN) and the recurrent neural network (RNN), the two most fundamental prototypes, which can be found in almost all modern deep learning models.

A CNN architecture is believed to resemble the visual system of the brain. It consists of several convolutional-pooling layer pairs followed by a fully connected network (Fig. 10). Neurons in a convolutional layer, which share their weights, scan through the input and produce multiple outputs of the same size as the input,4 also known as feature maps. The layer is called convolutional because it literally performs convolutions of the input using adaptive kernels represented by an NN. A pooling layer downsamples these feature maps, using, for example, the max function over small contiguous regions, effectively reducing their dimensionality. These layers preserve spatial correlations and are able to capture the before-mentioned hierarchical features, such as strokes, basic geometric shapes, and so on. They are followed by a fully connected NN with a small number of hidden layers, which is now able to solve the initially complex task (e.g., classification of cats vs. noncats). Originally, the CNN was designed to process images (Lecun et al., 1998) but, in principle, it can be applied to any correlated data of a fixed length, such as text or time series. For image-related tasks, a CNN can be pretrained on a large set of images for further reuse, where the fully connected layer is then replaced and trained from scratch for the task at hand (an approach known as transfer learning).

An RNN has a quite simple underlying idea that appeared to be very powerful in practice. It is capable of storing the state of the network, so it has some notion of memory (Fig. 11). The output of an RNN depends not only on the current input but also on all the previous inputs.
RNNs are used to process sequences of variable length, with particular emphasis on natural language processing applications. However, it turned out that the plain RNN implementation was not practical, due to its limited ability to maintain its memory for a long time.
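A single step of the plain RNN cell of Fig. 11 can be sketched as follows; the input, hidden, and output sizes and the (random) weights are hypothetical stand-ins for parameters a real network would learn.

```python
import numpy as np

# Hypothetical sizes: 3 input features, a hidden state of 5, scalar output.
rng = np.random.default_rng(0)
d_in, d_h = 3, 5
Wx = rng.normal(scale=0.1, size=(d_h, d_in))  # input-to-hidden weights
Wh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights (memory)
Wy = rng.normal(scale=0.1, size=(1, d_h))     # hidden-to-output weights

def rnn_step(x, h):
    # The new hidden state mixes the current input with the carried-over state.
    h_new = np.tanh(Wx @ x + Wh @ h)
    return Wy @ h_new, h_new

# "Unrolling" in time: feed the hidden state back in at every step, so the
# output at the last step depends on the whole (here, 4-element) sequence.
h = np.zeros(d_h)
for x_t in rng.normal(size=(4, d_in)):
    y_t, h = rnn_step(x_t, h)
print(y_t.shape, h.shape)
```

Repeated multiplication by `Wh` inside the loop is also what makes the plain cell forget: contributions from early inputs shrink (or blow up) exponentially with sequence length, which is the memory limitation the text refers to.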

FIG. 10 An example of a convolutional neural network. A sliding window over the input image (or all its channels, such as RGB) is used as an input to a convolutional layer (which is basically a fully connected NN with two layers), which produces a number of outputs (feature maps). These feature maps are then downsampled using the max function in a pooling layer. The output can be passed to the next convolutional layer, and the whole procedure repeats several times. The final output is flattened into a vector and passed to fully connected layers, which compute predictions.
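The convolution and pooling operations just described can be sketched in a few lines of NumPy. The image, kernel values, and window sizes below are invented for illustration; real CNNs learn the kernels and stack many such layers:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-D convolution as used in CNNs (technically cross-correlation)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):               # slide the kernel window over the input
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the max of each size x size block."""
    h = fmap.shape[0] - fmap.shape[0] % size
    w = fmap.shape[1] - fmap.shape[1] % size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy input: dark left half, bright right half; the kernel detects vertical edges.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])
fmap = convolve2d(image, kernel)      # feature map: responds only along the edge
pooled = max_pool(fmap)               # downsampled feature map
```

On this toy image, the feature map is nonzero only where the dark-to-bright transition falls under the kernel, and pooling then shrinks the map while keeping that response.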

4. Or slightly smaller, taking into account the parameters of the sliding window.

22 PART

I Methodological

FIG. 11 An example of a recurrent neural network. The RNN computes the output, y, using not only the input variables, x, but also its hidden state, h (left). The hidden state usually comes from the previous inputs to the RNN, so for a sequence of inputs, the RNN can be “unrolled” in time (right). The memory cells can be either plain fully connected NNs or contain special internal gates to control their memory more effectively.
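The unrolled computation of Fig. 11 can be written down directly. The weights below are random and untrained; the point is only to show how the hidden state makes the output depend on all previous inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.5, size=(3, 2))  # input-to-hidden weights
Whh = rng.normal(scale=0.5, size=(3, 3))  # hidden-to-hidden weights (the memory loop)
Why = rng.normal(scale=0.5, size=(1, 3))  # hidden-to-output weights

def rnn_forward(xs, h0=None):
    """Unrolled vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1}), y_t = Why h_t."""
    h = np.zeros(3) if h0 is None else h0
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)    # the state summarizes all inputs so far
        ys.append(Why @ h)
    return ys

# Feeding the SAME input three times gives three DIFFERENT outputs, because the
# hidden state (the network's memory) differs at every step:
ys = rnn_forward([np.array([1.0, 0.0])] * 3)
```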

To address this issue, many recurrent neuron types were proposed. The most popular ones are the long short-term memory (LSTM) cell (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014), which contain special gates to control and propagate their internal (or hidden) state in a more efficient way. Similar ideas were also applied to regular feed-forward NNs, leading to network architectures such as deep residual networks and highway networks, which are capable of propagating information (both forward and backward) through hundreds of layers. These networks have become highly effective for image-processing tasks.

There are countless ways to modify and combine various DNN architectures, thanks to the modular structure of DNNs equipped with end-to-end training via backpropagation. For instance, for a video frame prediction task, a CNN can be used to encode each frame as a low-dimensional vector (embedding), using, for example, the output of the top hidden layer of its fully connected part, while an RNN is then applied for sequential prediction of these vectors. Finally, a deconvolutional NN can be used to decode these representations back into images. Another example is a caption-generation task, where an RNN is used to process textual input and a CNN to process images. Interestingly, a single joint loss function which accounts for both the CNN and the RNN can be used in an unsupervised manner to encode images and text in the same low-dimensional vector space. In this case, a picture of a cat and the word “cat” will end up being very close in this space of embeddings.

Some would say that deep learning has been so prominent by the end of the 2010s that it overshadows almost everything else. While it has been behind the rebirth of Artificial Intelligence as a hot topic, supported by gigantic teams at Google, Facebook, or Amazon, we would recommend caution in this interpretation.
Deep learning has shown incredible results in image and sound processing, but it has a long way to go in domains such as transport, which often involve human behavior modelling and simulation. The main reasons are the low interpretability of DNNs, the difficulty of incorporating prior and domain knowledge, stability issues, and the poor ability to provide statistical properties for the estimates. In this regard, Bayesian approaches and PGMs may have a strong say.

Machine Learning Fundamentals Chapter 2

4 BAYESIAN MODELS

Machine learning can be viewed from a statistical point of view as the estimation of probability distributions, and a very powerful way of representing such distributions is through a PGM. A PGM is primarily a graphical representation framework. In a PGM, each node corresponds to a variable and each edge to a dependence between a pair of variables. Whenever a PGM is a directed graph representing conditional dependencies, it is called a Bayesian network, which is what we illustrate here. Fig. 12A shows a linear regression model. Notice that the response variable, y, is dependent on X (inputs), β (parameters), and σ (standard deviation of the error). A PGM represents a factorization of the joint probability of all random variables. In Fig. 12A, we have p(y, X, β, σ) = p(y | X, β, σ) p(β) p(σ). Notice that X is observed data (constant), which is why it is shaded. The likelihood p(y | X, β, σ) is specified as N(βᵀX, σ²).

PGMs are in fact more than just a representation. They are a machine learning model in their own right, having a variety of tools for model inference, one of which is belief propagation (BP). BP is based on the idea of message passing, whereby inference for each random variable follows a specific sequence, based on what is known at each moment. To give an example, let us estimate the parameters β for the linear model in Fig. 12A. We will add a trivial but useful extension, two prior distributions for β and σ (e.g., from domain knowledge, or previous data), determined by parameters ϕ and γ, that is, we have p(β | ϕ) and p(σ | γ), respectively. Imagine we have received a batch of data (matrix X is a set of input values, and vector y a set of y values). A possible sequence of messages to be passed is illustrated in Fig. 12B, with numbered arrows.5 Both priors propagate their distributions initially (1 and 2). We now know the distribution of σ, to be further transmitted (3). β is what we want to estimate, so it will not transmit

FIG. 12 A graphical model for regression (A) and belief propagation example (B) calculating the value (marginal) of β.

5. Factor graphs, a specific variant of PGMs, are known as a more appropriate representation for belief propagation, but we keep a simpler representation for the sake of an intuitive explanation.


anything; it will only receive. On the other side, input data X is observed and also transmitted as a point mass (i.e., as a constant) (4). Together with the received data y, we now have all the ingredients to propagate to β the message (5), as p(y | β) = ∫ p(y | β, X, σ) p(σ | γ) dσ. Notice that we are marginalizing out σ (we do not marginalize y, X, and γ because they are constants), and so message (5) is exclusively a function of β and works like a likelihood function. Effectively, the marginal distribution of β is determined by multiplying (2) and (5), that is, p(β | y) = (1/Z) p(β | ϕ) p(y | β), where Z is the normalizing constant. We could also simultaneously estimate σ and β, in which case we would have an iterative process until convergence (β would also send a message). Moreover, some or all of these messages may have nontractable forms and need to be approximated by well-known methods such as Markov Chain Monte Carlo (MCMC) or Variational Bayes.

BP is a very intuitive and flexible method for Bayesian inference that builds on the modularity of a PGM. Directionality and message sequence depend on what we know and what we want to know at each moment, so in practice any nonobserved variable or parameter can be estimated through this process, as we did for β, and this obviously includes the response variable, y. Finally, advances in inference algorithms and the growth of computational power have opened a path toward deep Bayesian models similar to DNNs, capable of dealing with numerous latent variables in a hierarchical way. Applications of Bayesian approaches in deep learning and combinations of PGMs and DNNs are further areas of research, where the Variational Autoencoder (Doersch, 2016) is a typical example.
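For the special case where σ is known and the prior over β is Gaussian, the marginal of β has a closed form, which makes a convenient numerical check of the scheme above. The noise level, prior variance, and synthetic data below are arbitrary choices for illustration:

```python
import numpy as np

def posterior_beta(X, y, sigma2, tau2):
    """Closed-form marginal of beta for the regression model of Fig. 12, assuming
    the noise variance sigma2 is known and the prior is beta ~ N(0, tau2 * I)."""
    d = X.shape[1]
    S = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)  # posterior covariance
    m = S @ (X.T @ y) / sigma2                              # posterior mean
    return m, S

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.uniform(-2.0, 2.0, 200)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(0.0, 0.5, 200)               # sigma = 0.5

m, S = posterior_beta(X, y, sigma2=0.25, tau2=10.0)
# m concentrates around beta_true; diag(S) quantifies the remaining uncertainty.
```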

5 BASICS OF MACHINE LEARNING EXPERIMENTS

Having the model f(x, β) trained, a natural question to ask is how to estimate its performance. If our target variable y is continuous, common choices of performance metrics are the MSE and its root (RMSE), the mean absolute error (which is less sensitive to outliers), or the Pearson correlation coefficient. For classification problems, a confusion matrix and a whole zoo of metrics derived from it are used. For binary classification, the confusion matrix has the following form:

                      Predicted positive    Predicted negative
Actual positive (P)   True Positive (TP)    False Negative (FN)
Actual negative (N)   False Positive (FP)   True Negative (TN)


The most popular metrics are accuracy = (TP + TN)/(P + N), precision = TP/(TP + FP), recall (or sensitivity) = TP/(TP + FN), and the F1 score, which is simply the harmonic mean of the last two: F1 = 2 · precision · recall/(precision + recall). It is worth noting that there is no single “gold standard” in the world of metrics. Each metric should be considered in connection with the modelling objectives. For example, if it is important not to miss any document in an information retrieval task, priority might be given to recall over precision. Or, in medicine, the cost of a false negative error might be much higher than the cost of a false positive.

But the most important question to ask (which is basically what the whole field is about) is the following: How well can the model generalize, that is, make reliable predictions for new, unseen data? Of course, machine learning models are not oracles and cannot guarantee the exact value of something which has not been observed yet. Nevertheless, they can still provide some useful statistical intuition about the expected future performance using some tricks with the available data. With this aim, the data are divided into train and test sets,6 pretending that the latter is hidden from the model. The division ratio is arbitrary, but usually 70%–80% of the data is used for training and the remaining 20%–30% for test purposes. So, we simply use the train set to fit the parameters β, but evaluate the model’s performance using the data from the test set. Be careful about i.i.d. assumptions about the data samples, that is, whether they can be randomly shuffled or not. Another important point to remember is that machine learning models are mainly designed for interpolation, so it is difficult to expect fair performance of the model on any data outside the boundaries of the train set. In other words, a test set should normally possess the same statistical properties as the train data.
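These formulas translate directly into code. The confusion-matrix counts below are made up to show how accuracy can look flattering on an imbalanced data set while recall tells a different story:

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics derived from the binary confusion matrix."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)           # a.k.a. sensitivity
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Imbalanced example: 120 actual positives among 1000 samples.
acc, prec, rec, f1 = binary_metrics(tp=80, fp=20, fn=40, tn=860)
# acc = 0.94 looks great, yet a third of the positives were missed (rec = 2/3).
```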
Also, closer attention should be paid to various data set problems, especially when dealing with imbalanced data sets, and to statistical techniques to address them (e.g., boosting and bagging). Using this trick, we can also compare different models, f1, …, fn, to select the best one. Here, the importance of baseline models should be stressed. It is a rule of thumb to try simpler models first. A mean predictor (the simplest baseline you can imagine, which always predicts the mean of the train set for regression, or the most frequent class for classification, independently of the input x) or a linear model is always a good choice to start with. It gives a reference point for your fancy model and provides a better understanding of the data. Then, you can try to beat it with more complex, state-of-the-art models. Given the same performance of two models, you should always favor the less complex one (the famous Occam’s razor). However, sometimes the choice might be made towards a more interpretable “white-box” model.

The concept of generalization can also be illustrated in the following way. Consider a training dataset consisting of N data points which come from the true (but unknown to us) distribution y = x² + ε, where ε is a normally distributed

6. In-sample and out-of-sample data in statistics terminology.


FIG. 13 Underfitting and overfitting.

noise term with zero mean and small variance. If we choose our model in this quadratic form, we will unsurprisingly get low errors on both the train and test sets (Fig. 13, middle). But suppose that we want to try a linear model first. In this case, the model will have higher errors on both the train and test sets, or, in other words, underfit, or have high bias (Fig. 13, left). If we go to the other extreme and consider a polynomial model of degree N − 1, y = Σ_{i=0}^{N−1} βᵢxⁱ, it will go through every point in the train data. The train error in this case will be exactly zero, but the model’s performance on the test set will still be poor (Fig. 13, right). In this case, the model overfits, or has high variance. The under/overfitting dilemma is also called the bias-variance tradeoff. The problem of finding the best model is to find a sweet spot somewhere in between. Here, plotting the learning curves might be really helpful (Fig. 14).

If the data set is not huge, a straightforward extension of the train/test partition, called cross-validation (CV), can be employed. In the conventional k-fold CV approach, the data set is divided into k bins. Then, the data from k − 1 bins are used for training and the remaining kth bin is used for testing. The procedure

FIG. 14 Learning curves illustrate prediction errors for train and test data depending on the model’s complexity.
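The quadratic toy example can be reproduced numerically. The sample sizes and noise level below are arbitrary choices; the pattern of errors is the point:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    return x, x**2 + rng.normal(0.0, 0.05, n)   # true curve y = x^2 plus noise

x_tr, y_tr = make_data(15)                      # N = 15 training points
x_te, y_te = make_data(200)

def poly_errors(deg):
    """Least-squares polynomial fit of a given degree; returns train/test MSE."""
    V = lambda x: np.vander(x, deg + 1)
    coefs, *_ = np.linalg.lstsq(V(x_tr), y_tr, rcond=None)
    mse = lambda x, y: np.mean((V(x) @ coefs - y) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

tr1,  te1  = poly_errors(1)    # underfitting: high error on both sets
tr2,  te2  = poly_errors(2)    # the right capacity: errors near the noise level
tr14, te14 = poly_errors(14)   # degree N-1: (near-)zero train error, poor test error
```

The degree-1 model should miss the curvature on both sets, the degree-2 model should sit near the noise floor, and the degree-14 interpolant should memorize the train set while failing on the test set.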


is repeated k times, so the model’s performance is averaged over all bins. The extreme (and the most statistically precise) case is leave-one-out CV, when k equals the size of the data set. In practice, training a model can be a very computationally demanding task, so a small k (e.g., k = 10) is often used.

The train data can be further subdivided to extract a separate validation set, which is used to fit hyperparameters. As mentioned before, hyperparameters are the parameters that should be defined a priori, before fitting the model’s parameters. They can also control the model’s complexity; examples are the highest power in polynomial regression or the number of layers and neurons in an NN (or even parameters of the training algorithm itself, such as the learning rate!). In the same spirit, we can compare models with different hyperparameters on this validation set and select the best one, to finally estimate its performance on the test set. Regularization is another way to control a model’s complexity. The main idea behind it is to introduce a complexity-dependent penalty term in the loss function (or use a special prior in the Bayesian setting), for example, requiring that most of the inferred parameters β be close to zero.
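A minimal k-fold CV loop can be written out with plain NumPy. The data and the line-fitting model below are placeholders for whatever model is being evaluated:

```python
import numpy as np

def k_fold_mse(X, y, fit, predict, k=10, seed=0):
    """Plain k-fold CV: train on k-1 bins, test on the held-out bin, average MSE."""
    idx = np.random.default_rng(seed).permutation(len(y))  # shuffle: assumes i.i.d.!
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return float(np.mean(scores))

# Hypothetical usage with a straight-line model on noisy linear data:
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, 100)
y = 3.0 * X + rng.normal(0.0, 0.1, 100)
cv_mse = k_fold_mse(X, y,
                    fit=lambda X, y: np.polyfit(X, y, 1),
                    predict=lambda m, X: np.polyval(m, X))
# cv_mse should land close to the irreducible noise variance (0.1**2 = 0.01).
```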

6 CONCLUDING REMARKS

At the time of writing, artificial intelligence and machine learning are two red-hot areas where exciting news arrives almost every day. The main intention of this chapter was to provide you with a quick overview of this dynamic field and to introduce some of its basic concepts. We have described a small subset of models and approaches, mainly from supervised learning. To deepen your knowledge in this direction, we recommend also reading about kernel regression, Gaussian processes, decision trees, and ensemble methods (random forests, gradient boosting). You might also want to become familiar with unsupervised algorithms such as clustering (hierarchical clustering, DBSCAN), dimensionality reduction (PCA, autoencoders), and generative models (variational autoencoders, generative adversarial networks), and a few basic probabilistic models such as naïve Bayes, Gaussian mixtures, and latent Dirichlet allocation.

For a keen reader, this chapter will raise far more questions than it provides answers. Given the intense dynamics of the field, the best advice is to use your favorite search engine and look for the concepts we talk about here. In any case, we encourage you to read some classic textbooks covering both machine learning and statistics. The book by Bishop (2006) might be a good starting point for understanding machine learning from a statistical point of view. If you would like to focus more on deep learning, Goodfellow et al. (2016) might be another good read. If you are interested in PGMs, a pretty smooth introduction is the MBML e-book (http://www.mbmlbook.com), but for a more technical one, we recommend Koller and Friedman (2009). We also encourage you to check the countless online resources, such as courses, video lectures, tutorials, blog posts, talks, and presentations. Stay tuned!


REFERENCES

Bishop, C., 2006. Pattern Recognition and Machine Learning. Springer-Verlag, New York.
Boden, M., 2006. Mind As Machine. Oxford University Press, Oxford.
Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT ’92), p. 144.
Bostrom, N., 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.
Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13 (1).
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B (Methodological), 1–38.
Doersch, C., 2016. Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908.
Forgy, E.W., 1965. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–769.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press, Cambridge, Massachusetts, USA.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hornik, K., Stinchcombe, M., White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2 (5).
Koller, D., Friedman, N., 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, Massachusetts, USA.
Kurzweil, R., 2005. The Singularity Is Near. Viking Books, New York.
Lecun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
Lloyd, S.P., 1957. Least squares quantization in PCM. Bell Telephone Laboratories Paper (published as Lloyd, S.P., 1982, IEEE Trans. Inf. Theory 28 (2), 129–137).
Minsky, M., Papert, S., 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Massachusetts, USA.
Moor, J. (Ed.), 2003. The Turing Test: The Elusive Standard of Artificial Intelligence. Springer Science & Business Media, Dordrecht, Netherlands.
Pearl, J., 1982. Reverend Bayes on inference engines: a distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, pp. 133–136.
Pearl, J., 1985. Bayesian networks: a model of self-activated memory for evidential reasoning. In: Proceedings of the 7th Conference of the Cognitive Science Society, pp. 329–334.
Rosenblatt, F., 1957. The Perceptron: A Perceiving and Recognizing Automaton. Cornell Aeronautical Laboratory Report 85-460-1.
Rumelhart, D., Hinton, G., Williams, R., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.
Russell, S.J., Norvig, P., 2009. Artificial Intelligence: A Modern Approach, third ed. Prentice Hall, Upper Saddle River, NJ.
Samuel, A.L., 1959. Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3 (3).


Spirtes, P., Glymour, C., 1991. An algorithm for fast recovery of sparse causal graphs. Soc. Sci. Comput. Rev. 9 (1), 62–72.
Wentworth, J.A., 1993. Expert systems in transportation. AAAI Technical Report WS-93-04.

FURTHER READING

Ribeiro, M.T., Singh, S., Guestrin, C., 2016. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.

Chapter 3

Using Semantic Signatures for Social Sensing in Urban Environments

Krzysztof Janowicz*, Grant McKenzie†, Yingjie Hu‡, Rui Zhu* and Song Gao§

*University of California, Santa Barbara, CA, United States; †University of Maryland, College Park, MD, United States; ‡University of Tennessee, Knoxville, TN, United States; §University of Wisconsin, Madison, WI, United States

Chapter Outline
1 Introduction 31
2 Spatial Signatures 33
2.1 Spatial Point Pattern 34
2.2 Spatial Autocorrelations 34
2.3 Spatial Interactions With Other Geographic Features 34
2.4 Place-Based Statistics 37
3 Temporal Signatures 37
4 Thematic Signatures 41
5 Examples 45
5.1 Comparing Place Types 45
5.2 Coreference Resolution Across Gazetteers 48
5.3 Geoprivacy 48
5.4 Temporally Enhanced Geolocation 49
5.5 Regional Variation 50
5.6 Extraction of Urban Functional Regions 50
6 Summary 52
References 53

1 INTRODUCTION

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00003-8. © 2019 Elsevier Inc. All rights reserved.

Several terms have been introduced over the past years to characterize a broader underlying paradigm shift in the ways research is carried out across many domains, ranging from the social to the physical sciences. Big data, for instance, highlights the increasing availability of massive datasets, which enable researchers to answer new questions by giving access to a higher spatial, temporal, and thematic resolution than before, but requires novel techniques (e.g., parallelization) to handle the size of these data. The related concept of data science focuses on techniques to collect, clean, integrate, analyze, and visualize this data deluge. Several variations of these original terms have been introduced more recently to address some criticism related to big data and data science. For instance, broad data and smart data are both meant to highlight the fact that size alone is of less importance than the heterogeneous sources such data may come from or the meaningful preselection and interpretation of the data (Sheth, 2014). Gray’s notion of a fourth paradigm of science (Hey et al., 2009) focuses on how the wide availability of data changes the inner workings of scientific workflows (e.g., through the unexpected/opportunistic reuse of existing data). Finally, others have pointed to the increasing need for techniques to support the meaningful integration and synthesis of datasets given their growing volume, variety, and velocity (Janowicz et al., 2015).

Given this broader trend, it is worth asking how these new datasets are created and how insights derived from these data can be made more readily available, that is, without the need to access the full data. Interestingly, many recent breakthroughs in the broader field of data science are the result of social machines, that is, large-scale, sociotechnical systems that arise from the interaction of humans and machines (Hendler and Berners-Lee, 2010; Shadbolt et al., 2013). Typical examples of such systems are Wikipedia, CAPTCHA-like systems to improve optical character recognition, or massive datasets labeled by human users or via their usage. One increasingly important method for collecting observational data of human behavior and interaction with the environment is social sensing (Aggarwal and Abdelzaher, 2013; Liu et al., 2015). It describes crowd-sourcing techniques and applications that make use of sensors that are closely attached to humans (e.g., as parts of smartphones) and are either directly or indirectly used to provide sensor observations at a high spatial and temporal resolution.
While user-generated content, such as volunteered geographic information (VGI) (Goodchild, 2007), typically relies on conscious and active contributions, social sensing often utilizes data that are created as by-products of human behavior and interaction with technology. To give a concrete example, VGI includes tasks such as digitizing streets for the OpenStreetMap (OSM) project, while social sensing may utilize the fact that certain streets or neighborhoods are digitized and updated earlier and more frequently than others, or that people visit types of places during characteristic hours or in distinctive sequences (Ye et al., 2011). Social sensing offers great potential for applications in urban planning, transportation, health, crime prevention, disaster management, and so on. For instance, social sensing has been proposed as a method for crowd-sourced earthquake early warning systems (Kong et al., 2016).

In this work, we focus on a technique called semantic signatures to extract and share high-dimensional data about types of places and neighborhoods. Semantic signatures are an analogy to the spectral signatures that play a crucial role in remote sensing. While spectral signatures uniquely identify types, for example, land cover classes, via characteristic reflectance or emittance patterns in the wavelengths (called bands) of electromagnetic energy, semantic signatures utilize data traces from human behavior. Just like the libraries of spectral signatures that have been used in fields ranging from agriculture to studying the atmosphere of distant planets, semantic signatures can be used in a variety of ways. In fact, we will discuss examples such as reverse geocoding, geoprivacy, coreference resolution, and so forth.

Using Semantic Signatures for Social Sensing in Urban Environments Chapter 3

To give an intuitive example, semantic signatures rely on the fact that people frequently go to bakeries during the morning hours and are more likely to mention them in the context of baking, coffee, cakes, sandwiches, and so forth, while nightclubs show very distinct temporal patterns and would be unlikely to be mentioned in a sentence together with baking. From an inferential perspective, this implies that an unknown place visited during Friday night, co-located with other places visited during evening hours, and mentioned in the context of drinks and dancing is very unlikely to be a bakery, but rather a bar or nightclub. Each data collection and analysis method introduced in the following Sections 2–4 can be seen as a semantic band, and any combination of these bands that uniquely identifies a place type becomes a signature. For example, bars and nightclubs may be difficult to tell apart by just looking at the hours and days they are visited, but conversations about bars, for example, in a local business review, are more likely to mention “sports” or “taps,” while these terms are less likely to occur in the context of nightclubs. Hence, combining thematic and temporal data can help uniquely identify place types. It is worth mentioning that many of these distinctions are intuitive to humans, but we need probabilistic models to integrate them into computational models and workflows. Place types themselves are a key component of geographic information retrieval, recommender systems, urban planning, and so forth, as they are a proxy for the affordances (Jordan et al., 1998) (i.e., action potentials) of places and neighborhoods.

The remainder of this chapter is organized as follows. From Sections 2 to 4, we present an overview of spatial, temporal, and thematic signatures, respectively, and discuss the methods that can be used for extracting these signatures.
In Section 5, we outline a variety of examples from previous work to demonstrate the value of these signatures by highlighting their usage. Finally, Section 6 summarizes this chapter and discusses future directions.
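As a toy illustration of combining such bands probabilistically, consider a naive-Bayes style fusion of one temporal band with a few thematic terms. All probabilities below are invented for illustration; real signatures are estimated from observation data:

```python
import numpy as np

# Invented "bands": P(time-of-day band | type) and P(term | type).
temporal = {"bakery":    np.array([0.05, 0.60, 0.25, 0.10]),  # night/morning/afternoon/evening
            "nightclub": np.array([0.55, 0.02, 0.08, 0.35])}
thematic = {"bakery":    {"coffee": 0.50, "cake": 0.40, "dancing": 0.01},
            "nightclub": {"coffee": 0.05, "cake": 0.02, "dancing": 0.60}}

def classify_place(time_band, words, prior=0.5):
    """Naive-Bayes style fusion of one temporal band with thematic evidence."""
    scores = {}
    for t in temporal:
        p = prior * temporal[t][time_band]
        for w in words:
            p *= thematic[t].get(w, 0.001)   # tiny floor for unseen terms
        scores[t] = p
    z = sum(scores.values())                 # normalize to posterior probabilities
    return {t: p / z for t, p in scores.items()}

post = classify_place(time_band=1, words=["coffee", "cake"])  # a morning visit
# Morning hours plus breakfast-related terms point overwhelmingly to "bakery".
```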

2 SPATIAL SIGNATURES

Spatial signatures capture the characteristics of places through their distributions over geographic space, as places of a given type often have a unique pattern in which they appear and colocate with other places. For example, the distribution of mountains is likely to be different from that of hotels; and the same comparison can be made for other urban points of interest (POI) such as restaurants and schools. We adopt a set of spatial statistics and use them to characterize the semantics of place types. We call the collection of these type-wise statistics a spatial signature (Zhu et al., 2016a). These signatures have been employed for tasks such as aligning place types across different gazetteers (e.g., GeoNames, Getty Thesaurus of Geographic Names [TGN], DBpedia Places) and POI datasets (e.g., Foursquare, Factual, Google Places) to increase the interoperability across different data sources. A variety of spatial statistics can be adopted to extract spatial signatures. In the following, we describe four types of statistics using


specific examples. A more comprehensive discussion on many other statistics can be found in our previous work (Zhu et al., 2016a).

2.1 Spatial Point Pattern

As geographic information in most gazetteers and social media is stored in the form of point features (i.e., without more detailed geometries), we first describe techniques from spatial point pattern analysis to quantify the point distribution of feature types across a study domain. Both local and global point patterns can be extracted. Regarding local point patterns, both intensity-based (e.g., local intensity and kernel density estimations of local areas) and distance-based analyses (e.g., nearest neighbor analysis, Ripley’s K, standard deviational ellipse analysis) are employed. These statistics capture the spatial arrangement of points within a local scope. With respect to global point patterns, we compute the points’ intensity and estimate their kernel density on a global scale to capture their global spatial distribution. Corresponding statistics, such as the range of Ripley’s K and the bandwidth of kernel density estimations, are selected from these statistics. Fig. 1 illustrates a comparison between the place types Park and Dam in terms of their point patterns using Ripley’s K. It shows that parks in the DBpedia Places dataset are more clustered than dams, as the observed curve (solid black line) of parks deviates more from the theoretical one (dotted red line), which is computed under complete spatial randomness.
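A naive estimate of Ripley's K (without the edge corrections a production implementation would apply) is short to write down. The point patterns below are synthetic stand-ins for the Park/Dam data:

```python
import numpy as np

def ripley_k(points, r, area):
    """Naive Ripley's K estimate (no edge correction): average number of other
    points within distance r of a point, scaled by the inverse intensity."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    pairs = (d <= r).sum() - n            # drop the n zero self-distances
    return area * pairs / (n * (n - 1))

rng = np.random.default_rng(0)
random_pts = rng.uniform(0.0, 10.0, size=(400, 2))           # CSR-like pattern
centers = rng.uniform(1.0, 9.0, size=(20, 2))
clustered_pts = (centers[rng.integers(0, 20, size=400)]      # 20 tight clusters
                 + rng.normal(0.0, 0.15, size=(400, 2)))

r = 0.5
k_random = ripley_k(random_pts, r, area=100.0)      # near pi * r^2 under CSR
k_clustered = ripley_k(clustered_pts, r, area=100.0)  # well above pi * r^2
```

Under complete spatial randomness K(r) is approximately πr²; clustering, as for the parks above, pushes the observed curve above the theoretical one.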

2.2 Spatial Autocorrelations

In addition to spatial point pattern analysis, in which the distribution of points is the main focus, spatial autocorrelation analysis is adopted with a focus on investigating spatial interactions among features represented by point geometries. Second-order interaction analysis, Moran’s I, and semivariances are utilized in this group. Moran’s I quantifies how the intensities of cells differ from those of their neighbors, and semivariances measure the variation of cell intensities within a specific distance lag class. For semivariances, we select the values at the first, median, and last distance lags as bands for our spatial signatures, as they represent variation on small, median, and large scales, respectively. Fig. 2 shows that the patterns of spatial autocorrelation (e.g., nugget, range, sill, trend) differ between Park and Dam in TGN.
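Moran's I for gridded intensities with simple rook-contiguity (4-neighbor) weights can be sketched as follows; the two toy grids are invented to show the positive and negative extremes:

```python
import numpy as np

def morans_i(grid):
    """Moran's I for a 2-D grid of cell intensities with rook binary weights."""
    x = grid - grid.mean()
    # each horizontally/vertically adjacent pair contributes twice (w_ij = w_ji = 1)
    num = 2 * ((x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum())
    rows, cols = grid.shape
    w_sum = 2 * (rows * (cols - 1) + (rows - 1) * cols)
    return (grid.size / w_sum) * num / (x**2).sum()

smooth = np.add.outer(np.arange(8.0), np.arange(8.0))   # a smooth spatial gradient
checker = np.indices((8, 8)).sum(axis=0) % 2 * 1.0      # perfectly alternating cells
```

A smooth gradient (neighbors similar) yields a strongly positive I, while a checkerboard (neighbors always dissimilar) yields a strongly negative one.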

2.3 Spatial Interactions With Other Geographic Features

This group of statistics extends spatial signatures to consider the interactions between target place types and other geographic information. Such external information can be population-based, climate-based, or based on road networks. One of the reasons to choose these types of data is that they are semantically rich. For instance, features such as mountains are less likely to occur in densely


FIG. 1 Ripley’s K of Park (left) and Dam (right) from DBpedia Places.

FIG. 2 Experimental semivariogram of Park (left) and Dam (right) from TGN.


populated areas, while the opposite is true for hospitals. Likewise, the frequency distributions of the nearest road types for Amusement Park and Restaurant are significantly different (see Fig. 3). Amusement parks are more likely to be located on avenues, while restaurants have a higher chance of being located on roads.

2.4 Place-Based Statistics In addition to the aforementioned traditional spatial analysis, place-based statistics can be used to characterize the semantics of place types as well. In contrast to spatial statistics, they focus more on describing the topological and hierarchical relations between places. In our case, for example, the number (and entropy) of distinct states (or counties) a place type occurs in, as well as the number (and entropy) of adjacent states (or counties) that also contain features of the target type, are included to indicate the topological relations (e.g., contains and meets) between places and administrative regions. These statistics are beneficial for distinguishing feature types such as Glacier (which occurs in eight US states according to DBpedia) and River (which occurs in all states). Several other kinds of place-based statistics can be used to uniquely tell apart places of certain types. The statistics used are listed in Table 1. In summary, spatial signatures are formed by bands extracted from spatial and place-based statistics to uniquely identify place types based on their interactions with other features and alternative sources of geographic data (e.g., climate classifications). Put differently, given a set of statistics about places, we can successfully identify their types, and we can compare these types, for example, to study their similarity.
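A minimal sketch of the state-count and entropy statistics described above; the state lists here are hypothetical stand-ins for values harvested from DBpedia or another gazetteer:

```python
import math
from collections import Counter

def region_stats(regions):
    """Number of distinct regions a place type occurs in, plus the entropy
    (in bits) of its distribution over those regions."""
    counts = Counter(regions)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return len(counts), entropy

# Hypothetical state lists for two feature types: one concentrated, one widespread.
glacier_states = ["AK", "AK", "WA", "MT", "WY"]
river_states = ["AK", "CA", "TX", "NY", "FL", "OH"]

glacier_n, glacier_h = region_stats(glacier_states)
river_n, river_h = region_stats(river_states)
```

A type spread evenly over many regions (River) yields a higher count and entropy than a type concentrated in a few regions (Glacier), which is exactly the distinction these signature bands encode.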

FIG. 3 Histogram of road types for Amusement Park (left) and Restaurant (right) from Google.

TABLE 1 A Summary of the 41 Statistics for Spatial Signature

Spatial Point Pattern
- Intensity: Min, Max, Mean, Std.
- Mean and Std. of distance to nearest neighbor
- Kernel density (range); Kernel density (bandwidth)
- Ripley's K (mean deviation); Ripley's K (range)
- Std. ellipse (rotation); Std. ellipse (std. along x-axis); Std. ellipse (std. along y-axis)

Spatial Autocorrelations
- Moran's I (global and local)
- Semivariogram (first distance lag); Semivariogram (median distance lag); Semivariogram (last distance lag)

Spatial Interaction With Other Geographic Features
- Population: Min, Max, Mean, and Std. of shortest distance
- Road network: Entropy of nearest road types
- Climate: Mean and Std. of precipitation, temperature max, temperature min, and water vapor pressure

Place-Based Statistics
- Number of distinct states (or counties); Entropy of states (or counties)
- Number of adjacent states (or counties) that have the same feature type
- Number of distinct feature types for nearest neighbor; Entropy of feature types for nearest neighbor
- LDA-based approach: Mean KL divergence of the topic distribution; Entropy of the topic distribution

3 TEMPORAL SIGNATURES

Though geospatial properties of place play a key role, there are additional dimensions, or attributes, that help to differentiate places from one another. In fact, it is the combination of properties and attributes that contributes to one's understanding of place. A dimension of place that is of substantial importance to this cause is time. There is a temporal component to our definition of place types, one that is reflected in our representation of semantic signatures. A metro station, for instance, is a very different place at 9 a.m. on a Monday than at 3 a.m. on a Saturday, just as the Roman Colosseum serves a very different purpose today than it did nearly 2000 years ago. The same geographic space can change dramatically depending on the time of day, day of the week, or season of the year at which you visit it. The ubiquity of sensor-rich mobile devices has given rise to applications that offer opportunities for users to contribute and share sensor information. Many of these applications and platforms use a gamification model to coerce users into contributing information that can be curated, sold, or analyzed to better understand topics ranging from human urban mobility patterns to health and exercise monitoring. Local business and social platforms such as Yelp and Foursquare offer services that allow their users to check in to the digital representations of local POI (Noulas et al., 2011; Li et al., 2016). In essence, the process of checking in is the social media equivalent of telling your friends (or the public) that you are at a specific place at a certain time. The underlying place gazetteers consist of rich datasets, which contain place attributes ranging from photographs and reviews to curated, user-contributed hierarchies of place types. Accessing these check-ins gives urban researchers an unprecedented opportunity to examine the temporal visiting behavior of individuals across a plethora of place types. By querying the public-facing Foursquare application programming interface, previous work accessed approximately 3.6 million check-ins to 1 million POI from 421 place types across the United States, United Kingdom, and Australia. Check-in counts per place type were cleaned and aggregated to the nearest hour of day and day of the week. This results in place-type-specific temporal signatures such as the one shown in Fig. 4.

FIG. 4 Hourly check-in patterns aggregated to 1 week for the Restaurant place type.

This figure shows the visiting behavior to restaurants in Los Angeles, CA at an hourly resolution over the course of an average week. We can clearly identify days of the week based on cyclical daytime versus nighttime patterns (e.g., a limited number of check-ins at 2 a.m.). The peaks in the figure reflect the typical busy times at a restaurant, namely lunch and dinner time, with a slight reduction in the volatility of the popular times on Saturday and Sunday. These temporal signatures can be further manipulated to explore patterns at a variety of temporal scales. Fig. 5 shows the typical Restaurant place type visiting behavior for weekdays versus weekends, days of the week, and hours of the day.
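The aggregation pipeline described above, binning check-ins by hour of day and day of week and then normalizing, can be sketched as follows; the timestamps are hypothetical stand-ins for real check-in data:

```python
from datetime import datetime

def temporal_signature(timestamps):
    """Normalize check-in timestamps into a 168-bin (hour-of-week) signature.

    Bin 0 is Monday 00:00-00:59; the returned values sum to 1."""
    bins = [0] * 168
    for ts in timestamps:
        bins[ts.weekday() * 24 + ts.hour] += 1
    total = sum(bins) or 1  # avoid division by zero for an empty list
    return [b / total for b in bins]

# Hypothetical check-ins: two Friday-lunch visits and one Saturday-dinner visit.
checkins = [datetime(2018, 3, 2, 12, 15),   # Friday, 12:15
            datetime(2018, 3, 2, 12, 45),   # Friday, 12:45
            datetime(2018, 3, 3, 19, 5)]    # Saturday, 19:05
sig = temporal_signature(checkins)
```

Aggregating the 168 bins further (weekday vs. weekend, day of week, hour of day) is then a matter of summing the appropriate slices of the signature.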
Depending on the use case, these temporal aggregates can be used to inform anyone from transit and urban planners to police and commercial entities.

FIG. 5 Aggregating temporal signatures at three different scales: (A) weekdays vs. weekends; (B) days of the week; and (C) hours of the day.

4 THEMATIC SIGNATURES

So far, we have discussed how models of place can be developed based on their geospatial distributions (i.e., spatial signatures) and the temporal characteristics of human-place interactions (i.e., temporal signatures). In this section, we take a thematic perspective to formalize place and present the concept of thematic signatures. In his seminal work, Tuan (1977) defined place as space filled with human experience. While human experience is often an intangible concept, people use language to describe their perceptions, feelings, and attachments toward places. Traditionally, many of these human descriptions were in oral form and were ephemeral. In today's big data era, and with the support of various Web 2.0 platforms, such descriptions are often automatically recorded in various data sources, such as online reviews (e.g., review comments on restaurants, hotels, and state parks), travel blogs, and social media posts. These large volumes of data enable large-scale, computational studies of human experiences of places. Thematic signatures, therefore, aim to capture the characteristics of place types based on natural language descriptions from people, which serve as a proxy for human experiences. Different places are often situated in different environments and offer functionalities that afford various sets of human activities (Gibson, 1979). Accordingly, people tend to use different terms when describing different places. Intuitively, we are more likely to use terms such as hike, camping, waterfall, and nature when describing a state park. By contrast, terms such as movie, popcorn, seat, and ticket are more likely to be used


when we describe experiences related to cinemas. In relation to spatial and temporal signatures, thematic signatures provide an additional and complementary perspective for understanding and modeling places and their types. How can we extract such thematic signatures to represent places? The data sources for deriving signatures are descriptive words conveying human experiences. Depending on the way we organize place descriptions, we can extract thematic signatures at both the place-instance and place-type levels. At the place-instance level, we focus on the descriptions for a specific place instance. For example, we can analyze the reviews for a restaurant, Bob's BBQ joint, from different people, and learn the main topics that are generally mentioned about this restaurant. At the place-type level, we can aggregate the descriptions for all place instances belonging to the same place type, and extract the thematic signatures for this place type. For example, we can aggregate the reviews for all restaurants in a dataset, and learn the main topics related to the place type Restaurant. Aggregated over millions of reviews, these signatures provide a rich representation of place types. Both types of thematic signatures are useful and can be applied in different situations. A variety of computational models can be utilized to derive thematic signatures for places based on their related natural language descriptions. A simple approach is term frequency-inverse document frequency (TF-IDF) from the field of information retrieval (Manning et al., 2008). TF-IDF is based on the bag-of-words model, and highlights the words that are used frequently in a document but not very frequently in other documents (Hu et al., 2015). For the task of extracting thematic signatures, we can adapt TF-IDF to identify the words that show up frequently for one place instance or place type but not so frequently for other places. The adapted version of TF-IDF is:

w_ij = tf_ij × log(|P| / |P_j|)   (1)

where w_ij is the weight of a term j for place i, tf_ij is the frequency of term j in the descriptions of place i, |P| is the total number of places, and |P_j| is the number of places whose descriptions also contain the term j. Once we have computed the weights of the different terms related to a place, word clouds can be employed to visualize the top terms related to that place. These terms with distinct weights can then be used as thematic signatures for places. Fig. 6 shows the word clouds based on the reviews of two place types: Asian Restaurant and Stadium. We can tell the general place types of these two examples even without looking at their place type labels.

FIG. 6 Word clouds for two place types: (A) Asian Restaurant and (B) Stadium.

Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a more advanced approach, which extracts the major topics associated with different place types. Compared with TF-IDF, LDA is more robust to noise contained in the textual descriptions, handles synonyms, and can capture the semantic relatedness between words. LDA is a generative model, which considers each textual document as generated from a probabilistic distribution of topics and each topic as characterized by a distribution over words. As an unsupervised model, LDA discovers semantic topics from the texts without requiring labeled data. Accordingly, each place type or instance can be characterized by the probabilities of different semantic topics based on the related textual descriptions (Adams et al., 2015; McKenzie and Janowicz, 2017). Fig. 7 shows the LDA topic distributions of two place instances: Right Proper Brewing Co. and Moon Under Water Pub & Brewery. Both examples are of the same place type (i.e., Pub), and thus share similarities in terms of their topics, such as topic 6, topic 13, and topic 24. However, we can also identify topics under which the two pubs show different characteristics, such as topic 37.

FIG. 7 Probability distribution of the LDA topics of two pubs.
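The adapted TF-IDF weighting of Eq. (1) takes only a few lines of Python; the place names and token lists below are hypothetical:

```python
import math
from collections import Counter

def tfidf_signature(descriptions, place):
    """Adapted TF-IDF (Eq. 1): w_ij = tf_ij * log(|P| / |P_j|).

    descriptions : dict mapping each place to a list of description tokens
    place        : the place (instance or type) to compute weights for
    """
    n_places = len(descriptions)
    # |P_j|: number of places whose descriptions contain term j
    df = Counter()
    for tokens in descriptions.values():
        df.update(set(tokens))
    tf = Counter(descriptions[place])
    return {term: tf[term] * math.log(n_places / df[term]) for term in tf}

# Hypothetical tokenized descriptions for three places.
docs = {
    "state_park": ["hike", "waterfall", "nature", "nature"],
    "cinema": ["movie", "popcorn", "ticket", "seat"],
    "drive_in": ["movie", "popcorn", "nature"],
}
weights = tfidf_signature(docs, "state_park")
```

Terms unique to a place (hike, waterfall) receive the highest weights, shared terms (nature) are discounted, and a term occurring in every place would receive a weight of zero, mirroring the intuition behind Eq. (1).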

5 EXAMPLES

In this section, we briefly showcase six examples of studying and applying semantic signatures, including basic operations such as place type comparison as well as more specific applications such as improving geoprivacy.

5.1 Comparing Place Types Due to the semantic heterogeneity of place types, tasks such as query federation, data integration, and conflation become challenging. Therefore, semantic signatures have been applied to compare and align place types across different geospatial data sources. In our work, semantic signatures extracted from all three perspectives (i.e., spatial, temporal, and thematic) can be represented as vectors. Let p1 and p2 represent two place types; then two vectors can be constructed based on their semantic signatures:

p1 = <f11, f12, …, f1D>   (2)

p2 = <f21, f22, …, f2D>   (3)

where fij represents a (normalized) feature of the semantic signature (e.g., the range of Ripley's K, or the probability of an LDA topic). With such vector representations, we can measure the semantic similarity between place types using several approaches. One is cosine similarity, which is defined as:

s(p1, p2) = (Σ_{j=1..D} f1j f2j) / (sqrt(Σ_{j=1..D} f1j²) · sqrt(Σ_{j=1..D} f2j²))   (4)

Cosine similarity measures the angle between the two vectors constructed from semantic signatures, and is robust to the different magnitudes of values in the vectors. Therefore, cosine similarity is especially suitable for semantic signatures whose vector element values can differ largely. When the vector elements are probabilities (e.g., the topic distribution in thematic signatures), we can also use measures such as the Jensen-Shannon divergence (JSD), which quantifies the similarity between two probability distributions. Eqs. (5), (6) show the calculation of JSD, where KLD(P||Q) is the Kullback-Leibler divergence between two discrete probability distributions P and Q (the semantic signatures of the two places to be compared) and M = (P + Q)/2:

JSD(P||Q) = (1/2) KLD(P||M) + (1/2) KLD(Q||M)   (5)

KLD(P||Q) = Σ_i P(i) ln(P(i) / Q(i))   (6)
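Eqs. (4)-(6) translate directly into code; the two topic distributions below are hypothetical:

```python
import math

def cosine_similarity(p1, p2):
    """Eq. (4): cosine of the angle between two signature vectors."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return dot / (norm1 * norm2)

def kld(p, q):
    """Eq. (6): Kullback-Leibler divergence; zero-probability bins of p contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Eq. (5): Jensen-Shannon divergence with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# Hypothetical three-topic distributions for two place types.
pub = [0.6, 0.3, 0.1]
cafe = [0.5, 0.4, 0.1]
similarity = cosine_similarity(pub, cafe)
divergence = jsd(pub, cafe)
```

Unlike the raw KLD, the JSD is symmetric and always finite, which is why it is preferred when comparing probability-valued signatures such as topic distributions.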

In this section, we demonstrate the use of both spatial and temporal signatures for comparing place types.

5.1.1 Comparison Using Spatial Signatures Fig. 8 depicts a 2D visualization of the similarities and differences among place types of three gazetteers, DBpedia Places, GeoNames, and TGN, after mapping their high-dimensional spatial signatures into 2D using multidimensional scaling. In general, it can be observed that place types from these gazetteers overlap significantly. Furthermore, three cases are illustrated to show the strength of spatial signatures in aligning specific place types. Case 1 in Fig. 8 shows that the parks in DBpedia Places and TGN are semantically similar compared to those in GeoNames. This makes sense, as the GeoNames gazetteer includes almost all known places such as parks, while DBpedia Places and TGN only record those that are significant in some sense (e.g., historically or culturally). As another example, Case 2 demonstrates that although the same label for a specific place type, Mountain in this case, is shared by the three gazetteers, their semantics do not align with each other. This case is common across different geospatial ontologies, as they are mostly designed by domain experts with certain applications or domains in mind. Semantic signatures have the ability to quantify such ontological commitments. In Case 3, three place types that have totally different labels, that is, AMD2, County, and AdministrativeRegion, are shown to be semantically similar, all representing counties, when using spatial signatures. Such alignments are difficult to establish if only string matching and structural similarities are considered. In summary, by using spatial signatures, one can quantify and subsequently improve the alignment of place types across geospatial ontologies and gazetteers.

FIG. 8 Two-dimensional visualization of the alignment of place types across DBpedia Places, GeoNames, and TGN. Case 1: Park, Case 2: Mountain, and Case 3: County. Each dot represents a place type.
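The dimensionality reduction behind Fig. 8 is multidimensional scaling; a numpy-only sketch of classical (Torgerson) MDS is shown below, with a synthetic distance matrix standing in for real signature dissimilarities:

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: embed items in k dimensions so that
    pairwise distances are preserved as well as possible."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    b = -0.5 * j @ d2 @ j                 # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]      # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# Synthetic check: distances between four known 2D points should be recovered.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
coords = classical_mds(d, k=2)
```

For the signature case, `dist` would hold pairwise dissimilarities between place types (e.g., one minus cosine similarity), and each row of `coords` would give a dot in the 2D plot.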


5.1.2 Comparison Using Temporal Signatures Exploring different place types, we find that many place types have a unique temporal signature and that these signatures can in fact be used to assess the similarity between place types. Fig. 9 shows the hourly pattern for airports. Compared to the Restaurant temporal signature, airports are relatively a-temporal with few peaks throughout the day and limited access at night. To give another example, the temporal signature for Church shows a clear increase in popularity on Sunday morning with smaller peaks on Sunday and Wednesday evenings (Fig. 10).

FIG. 9 Hourly check-in patterns aggregated to 1 week for the Airport place type.

FIG. 10 Hourly check-in patterns aggregated to 1 week for the Church place type.

Through assessing the similarity of the temporal signatures, we achieve a better understanding of urban visiting behavior as well as an appreciation of the complexity of modeling the urban landscape. With the goal of better understanding the role that these place types play in defining the city, we developed the POI Pulse observatory (see http://poipulse.com/), a web-based visual platform for exploring the interaction between people and places within the city of Los Angeles, CA (McKenzie et al., 2015b). In this work, the similarity between the temporal signatures is assessed along with both the geospatial (Section 2) and thematic (Section 4) signatures using information entropy to classify the numerous place types into lower level categories. These lower level categories provided the foundation on which to visually depict the pulse of the city through marker opacity and color.

5.2 Coreference Resolution Across Gazetteers In addition to aligning place types using spatial signatures, which is discussed in Section 5.1.1, this section outlines an approach for using spatial signatures to match individual geographic features between gazetteers, known as coreference resolution. Beyond conventional approaches, such as string and structure matching, spatial signatures can be adopted to capture the fact that places have a spatial context that can be used as part of the coreference resolution process (Zhu et al., 2016b). The city of Kobani, Syria is selected here as an example to illustrate the power of spatial signatures, due to its military and geographic importance but also its high ambiguity in different gazetteers (i.e., there are several dissimilar toponyms for Kobani, including Aarab Peunar, Kobane, and Ayn al Arab). The type-level and instance-level spatial signatures can be combined to match the Kobani from DBpedia Places to GeoNames, which has in total five candidates that cannot be easily matched using conventional approaches. Euclidean distance is used to compute the dissimilarity between the candidates and the target in terms of their spatial-signature representations; the candidate with the smallest dissimilarity to the target is regarded as the match. Experiments show that although one of the candidates in GeoNames is also labeled as PopulatedPlace, by taking spatial signatures into account, the Kobani in DBpedia Places, labeled as PopulatedPlace, can still be correctly matched to Ayn al Arab in GeoNames, which is labeled as Seat of a Second-order Administrative Division (Zhu et al., 2016b).
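The matching step reduces to a nearest-neighbor search in signature space. In this sketch the three-element signature vectors are hypothetical; only the candidate toponyms are taken from the example above:

```python
import math

def best_match(target, candidates):
    """Pick the candidate whose signature vector has the smallest
    Euclidean distance to the target's signature."""
    def euclid(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(candidates, key=lambda name: euclid(target, candidates[name]))

# Hypothetical (already normalized) signature vectors.
kobani_dbpedia = [0.8, 0.1, 0.3]
geonames_candidates = {
    "Ayn al Arab": [0.75, 0.15, 0.28],
    "Aarab Peunar": [0.2, 0.9, 0.1],
    "Kobane": [0.5, 0.5, 0.5],
}
match = best_match(kobani_dbpedia, geonames_candidates)
```

In the actual experiments the vectors are much longer, combining type-level and instance-level spatial-signature bands, but the decision rule is the same minimum-distance criterion.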

5.3 Geoprivacy Concerns over location privacy have seen a resurgence in recent years. Mobile devices today are ubiquitous and the sensors available on these devices allow for the collection and distribution of a wide variety of contextual information.


In combination with the social web, private information is being shared and distributed at an alarming volume and velocity with arguably little understanding as to the ramifications. The concept of semantic signatures sits very much in the midst of this concern over the sharing of private data as much of the digital footprints that we leave can be curated and compared to the spatial data signatures that have been extracted from millions of online sources. For instance, it is possible to substantially limit the possible locations that someone may be at purely based on the textual data that they choose to share online. A microblog post containing the text “looking forward to burritos and tequila” posted at 5 p.m. on a Friday in Los Angeles, for instance, provides a high amount of information that can be matched against our probabilistic signatures. The text itself contains references to Mexican food and alcohol while the timing of the post indicates a likelihood that the person posting the material will be going to a restaurant rather than a nightclub. Accessing the plethora of freely available gazetteers we can limit the possible locations for the person that created the post (McKenzie et al., 2016). Such an approach does not require access to actual geographic location information. Following the same thought process, signatures can also be used to foster geo-privacy, namely by showing which terms and times are most indicative of a certain activity and place. For instance, replacing “tequila” with “drinks” and sending out the message an hour before may increase information entropy to a degree where identifying a place may be less likely (McKenzie et al., 2016). Further work in this area has focused on spoofing one’s location and interests based on the inclusion of contextually relevant noise (Zakhary et al., 2017) while previous work has focused on the obfuscation of personal identifiable information (Duckham and Kulik, 2005).
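The entropy argument can be made concrete; the conditional distributions P(place type | term) below are hypothetical illustrations, not values from the cited study:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a distribution over candidate place types."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical P(place type | term): a specific term concentrates probability
# on few candidates, a generic term spreads it more evenly.
p_given_tequila = [0.70, 0.20, 0.05, 0.05]  # e.g., Mexican restaurant dominates
p_given_drinks = [0.30, 0.30, 0.25, 0.15]   # e.g., bar, pub, restaurant, cafe

reveals_more = entropy(p_given_tequila) < entropy(p_given_drinks)
```

The lower the entropy of the distribution induced by a post's terms and timing, the easier it is to pin down the author's likely location; swapping a specific term for a generic one raises the entropy and thus protects privacy.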

5.4 Temporally Enhanced Geolocation Information related to the temporal dimension of places can be useful in a number of everyday scenarios as well. Take, for example, the process of geolocating or reverse geocoding. This is a geographic querying method that is executed by millions of people a day as they request the nearest place instances to them based on provided geographic coordinates. Standard approaches to geolocating take a pair of latitude and longitude coordinates (e.g., from a GPS-enabled mobile device) and return a set of nearby places (e.g., Dan’s Automotive Shop or Handlebar Coffee Shop). The problem with this approach, however, is that it makes the erroneous assumption that one has the same likelihood of being at a place, regardless of the time of day or day of the week. In actuality, we know that the probability of somebody being at a pub on a Friday at 11 p.m. is significantly higher than the probability of a person being at the Department of Motor Vehicles. Socio-institutional affordances (Raubal et al., 2004) aside, temporal signatures generated from the visiting behavior of millions of individuals clearly demonstrate that there are unique temporal patterns in how people interact with different place types.


Exploiting these temporal patterns, existing work shows that traditional distance-only approaches to reverse geocoding can be augmented through the inclusion of these time-based probabilistic models (McKenzie and Janowicz, 2015). In fact, through a comparison of various methods for including time in such a process, we show that a temporally enhanced geolocation method can improve upon the accuracy of the distance-only method by over 24% (based on a Mean Reciprocal Rank assessment). Such work, in combination with others, has led to Popular Times plots being included in major mapping and local business platforms (Lardinois, 2016).
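A minimal sketch of such time-aware ranking, assuming a per-type 24-bin hourly signature and a simple score of signature probability divided by distance (the candidate POI, distances, and signatures are hypothetical, and the actual location-distortion model in the cited work is more involved):

```python
def rank_candidates(candidates, hour):
    """Rank nearby POI by combining proximity with the temporal signature:
    score = P(check-in at this hour | place type) / distance."""
    def score(c):
        return c["signature"][hour] / c["distance_m"]
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates near a queried coordinate.
flat = [1 / 24] * 24                                           # uniform hourly signature
pub_sig = [0.01] * 18 + [0.12, 0.13, 0.14, 0.15, 0.16, 0.12]   # evening-heavy, sums to 1
candidates = [
    {"name": "DMV office", "distance_m": 40, "signature": flat},
    {"name": "The Red Lion Pub", "distance_m": 60, "signature": pub_sig},
]
top_at_11pm = rank_candidates(candidates, hour=23)[0]["name"]
```

At 11 p.m. the farther-away pub outranks the nearer office because its hourly signature concentrates probability in the evening, which is exactly the behavior a distance-only method cannot capture.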

5.5 Regional Variation Temporal signatures built from geosocial visiting behavior in a single city such as Los Angeles, CA are one thing; building temporal signatures for cities around the world is another, as there will be cultural differences. The question remains as to the uniqueness of place type interactions depending on region. Using check-in data collected from across the United States, Australia, and the United Kingdom, the check-ins are split by major cities. Focusing on the cities of Los Angeles, CA, Chicago, IL, and New York City, NY, we find that there are significant differences in how the inhabitants interact with place types. Using Watson's two-sample test, we show that approximately 50% of place types vary significantly (P < .05) in their temporal signatures (McKenzie et al., 2015a), while the others remain invariant. Fig. 11 shows temporal signatures for two place types split by US city. What this work demonstrates is that the temporal visiting behavior of some place types is a-spatial (e.g., Drug Stores), while that of others is regionally variant (e.g., Theme Parks). Additional research on cities outside the United States, namely Sydney, Australia and London, UK, supports these findings on a more restricted place type dataset. These results also confirm previous research on the habitual behavior of humans in an urban setting. The finding that roughly 50% of place type temporal behavior is a-spatial is important for the usefulness of such signatures as well, as it implies that only half of these temporal signatures have to be acquired at a local level for tasks such as the reverse geocoding mentioned in Section 5.4, while the other 50% of place types can be well represented using a single, global signature.

FIG. 11 Circular plots depicting hourly temporal signatures for Theme Park (A, B, C) and Drug Store (D, E, F). (A, D) Los Angeles; (B, E) Chicago; (C, F) New York City.

5.6 Extraction of Urban Functional Regions Cities support a variety of human activities including living, working, shopping, eating, socializing, and recreation, which usually take place at different types of POI. Compared to other datasets and methods in remote sensing and field mapping, using POI data, social media, etc., and the associated social sensing methods can lead to a better understanding of individual-level and group-level utilization of urban space at a fine-grained spatial, temporal, and thematic resolution (Liu et al., 2015). We use the POI that support specific types of human activities on the ground as a proxy to delineate regions with various colocation patterns of POI types (Gao et al., 2017). The same type of POI can be located in different land use types and may also support different functions. For example, restaurants are found in residential areas, in commercial areas, as well as in industrial areas. The main function of the POI type University is education, but universities also support sports activities, music shows, and so on. We argue that the semantic signatures of POI types can be employed to derive latent classification features, which then enable the detection and abstraction of higher-level functional regions (i.e., semantically coherent areas of interest) such as shopping areas, business districts, educational areas, and art zones in cities. We collected a large-scale dataset of Foursquare venues and associated user check-in data in the most populated US cities. Based on the aforementioned data-processing procedures and LDA topic models that incorporate a popularity score based on unique Foursquare check-in users, we can infer the probabilistic combination of topics composing an urban function for a region, given POI type cooccurrence patterns. For the city of Denver, for instance, we were able to discover (Gao et al., 2017) a high relevance of Topic 25, which consists of a variety of prominent POI types such as art museum, art gallery, history museum, concert hall, and American restaurant. Such a place may serve multiple functions. The second most prominent LDA topic in this region is Topic 121, which contains a large percentage of brewery places. In fact, the region in Denver for which the signatures revealed a dominance of these topics is known as "Santa Fe Dr.," an "Art District" that attracts many local residents, artists, and tourists. This example illustrates the inference capability of our method in identifying urban functional regions given thematic signatures.

6 SUMMARY In this work we have presented an overview of spatial, temporal, and thematic signatures by discussing the utilized data and the methods to compute and compare signatures, and by providing a variety of examples from our previous work. These examples range from reverse geocoding, neighborhood extraction, coreference resolution, and ontology alignment to geoprivacy. We have also addressed the question of how local these signatures are, that is, whether their quality decays when applied to other geographic regions. The results depend on the studied place types, and while some types show high variation, others do not. Consequently, global signatures can be augmented with locally trained data to improve results. Recently, there has been increased interest in utilizing embedding techniques to compare place types (Yan et al., 2017; Cocos and Callison-Burch, 2017), and early results show that these techniques yield results that strongly correlate with human similarity judgments. Utilizing such techniques for the creation of semantic signatures will be one of the directions for future work.


REFERENCES

Adams, B., McKenzie, G., Gahegan, M., 2015. Frankenplace: interactive thematic mapping for ad hoc exploratory search. In: Proceedings of the 24th International Conference on World Wide Web, pp. 12–22.

Aggarwal, C.C., Abdelzaher, T., 2013. Social sensing. In: Managing and Mining Sensor Data. Springer, New York, NY, pp. 237–297.

Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.

Cocos, A., Callison-Burch, C., 2017. The language of place: semantic value from geospatial context. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers, vol. 2, pp. 99–104.

Duckham, M., Kulik, L., 2005. A formal model of obfuscation and negotiation for location privacy. In: International Conference on Pervasive Computing, pp. 152–170.

Gao, S., Janowicz, K., Couclelis, H., 2017. Extracting urban functional regions from points of interest and human activities on location-based social networks. Trans. GIS 21 (3), 446–467.

Gibson, J.J., 1979. The Ecological Approach to Visual Perception. Houghton Mifflin Harcourt (HMH), Boston.

Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal 69 (4), 211–221.

Hendler, J., Berners-Lee, T., 2010. From the semantic web to social machines: a research challenge for AI on the world wide web. Artif. Intell. 174 (2), 156–161.

Hey, T., Tansley, S., Tolle, K.M., et al., 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. vol. 1. Microsoft Research, Redmond, WA.

Hu, Y., Gao, S., Janowicz, K., Yu, B., Li, W., Prasad, S., 2015. Extracting and understanding urban areas of interest using geotagged photos. Comput. Environ. Urban Syst. 54, 240–254.

Janowicz, K., Van Harmelen, F., Hendler, J.A., Hitzler, P., 2015. Why the data train needs semantic rails. AI Mag. 36, 5–14.

Jordan, T., Raubal, M., Gartrell, B., Egenhofer, M., 1998. An affordance-based model of place in GIS. In: 8th Int. Symposium on Spatial Data Handling, SDH, vol. 98, pp. 98–109.

Kong, Q., Allen, R.M., Schreier, L., Kwon, Y.W., 2016. MyShake: a smartphone seismic network for earthquake early warning and beyond. Sci. Adv. 2 (2), e1501055.

Lardinois, F., 2016. Google can now tell you how busy a place is before you arrive. https://techcrunch.com/2016/11/21/google-can-now-tell-you-how-busy-a-place-is-before-you-arrive-in-real-time/. Accessed 29 January 2018.

Li, M., Westerholt, R., Fan, H., Zipf, A., 2016. Assessing spatiotemporal predictability of LBSN: a case study of three foursquare datasets. GeoInformatica. https://doi.org/10.1007/s10707-016-0279-5.

Liu, Y., Liu, X., Gao, S., Gong, L., Kang, C., Zhi, Y., Chi, G., Shi, L., 2015. Social sensing: a new approach to understanding our socioeconomic environments. Ann. Assoc. Am. Geogr. 105 (3), 512–530.

Manning, C.D., Raghavan, P., Schütze, H., et al., 2008. Introduction to Information Retrieval. vol. 1. Cambridge University Press, Cambridge.

McKenzie, G., Janowicz, K., 2015. Where is also about time: a location-distortion model to improve reverse geocoding using behavior-driven temporal semantic signatures. Comput. Environ. Urban Syst. 54, 1–13.

McKenzie, G., Janowicz, K., 2017. The effect of regional variation and resolution on geosocial thematic signatures for points of interest. In: The Annual International Conference on Geographic Information Science, pp. 237–256.

McKenzie, G., Janowicz, K., Gao, S., Gong, L., 2015a. How where is when? On the regional variability and resolution of geosocial temporal signatures for points of interest. Comput. Environ. Urban Syst. 54, 336–346.

McKenzie, G., Janowicz, K., Gao, S., Yang, J.A., Hu, Y., 2015b. POI Pulse: a multi-granular, semantic signature-based information observatory for the interactive visualization of big geosocial data. Cartographica 50 (2), 71–85.

McKenzie, G., Janowicz, K., Seidl, D., 2016. Geo-privacy beyond coordinates. In: The 19th AGILE Conference on Geographic Information Science, Geospatial Data in a Changing World. Springer, Helsinki, Finland, pp. 157–175.

Noulas, A., Scellato, S., Mascolo, C., Pontil, M., 2011. An empirical study of geographic user activity patterns in foursquare. ICWSM 11 (70-573), 2.

Raubal, M., Miller, H.J., Bridwell, S., 2004. User-centred time geography for location-based services. Geogr. Ann. Ser. B Hum. Geogr. 86 (4), 245–265.

Shadbolt, N.R., Smith, D.A., Simperl, E., Van Kleek, M., Yang, Y., Hall, W., 2013. Towards a classification framework for social machines. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 905–912.

Sheth, A., 2014. Transforming big data into smart data: deriving value via harnessing volume, variety, and velocity using semantic techniques and technologies. In: 2014 IEEE 30th International Conference on Data Engineering (ICDE).

Tuan, Y.F., 1977. Space and Place: The Perspective of Experience. University of Minnesota Press, Minneapolis, London.

Yan, B., Janowicz, K., Mai, G., Gao, S., 2017. From ITDL to Place2Vec: reasoning about place type similarity and relatedness by learning embeddings from augmented spatial contexts. Proc. SIGSPATIAL 17, 7–10.

Ye, M., Janowicz, K., Mülligann, C., Lee, W.C., 2011. What you are is when you are: the temporal dimension of feature types in location-based social networks.
Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 102–111. Zakhary, V., Sahin, C., Georgiou, T., El Abbadi, A., 2017. LocBorg: hiding social media user location while maintaining online persona. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, p. 12. Zhu, R., Hu, Y., Janowicz, K., McKenzie, G., 2016. Spatial signatures for geographic feature types: examining gazetteer ontologies using spatial statistics. Trans. GIS 20 (3), 333–355. Zhu, R., Janowicz, K., Yan, B., Hu, Y., 2016. Which Kobani? A case study on the role of spatial statistics and semantics for coreference resolution across gazetteers. In: International Conference on GIScience Short Paper Proceedings, vol. 1, pp. 1–4.

Chapter 4

Geographic Space as a Living Structure for Predicting Human Activities Using Big Data

Bin Jiang and Zheng Ren

Faculty of Engineering and Sustainable Development, Division of GIScience, University of Gävle, Gävle, Sweden

Chapter Outline
1 Introduction
2 Living Structure and the Topological Representation
3 Data and Data Processing
4 Prediction of Tweet Locations Through Living Structure
4.1 Correlations at the Scale of Thiessen Polygons
4.2 Correlations at the Scale of Natural Cities
4.3 Degrees of Wholeness or Life or Beauty
5 Implications on the Topological Representation and Living Structure
6 Conclusion
Acknowledgments
References

I propose a view of physical reality which is dominated by the existence of this one particular structure, W, the wholeness. In any given region of space, some subregions have higher intensity as centers, others have less. Many subregions have weak intensity or none at all. The overall configuration of the nested centers, together with their relative intensities, comprise a single structure. I define this structure as "the" wholeness of that region. Christopher Alexander (2002–2005, Book 1, p. 96)

1 INTRODUCTION

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00004-X
© 2019 Elsevier Inc. All rights reserved.

Emerging geo-referenced big data from the Internet, particularly from social media such as OpenStreetMap and Twitter, provides a new instrument for geospatial research. Big data shows some distinguishing features compared with small data (Mayer-Schönberger and Cukier, 2013). For example, big data is accurately measured
and individually based, with geolocations and time stamps, rather than estimated and aggregated as small data is. This makes big data unique and powerful for developing new insights into geographic forms and processes (e.g., Jiang and Miao, 2015). On the other hand, big data poses enormous challenges in terms of data representation, structuring, and analytics. Unlike small data, big data is unstructured and massive, so conventional structured databases are unlikely to be of much use for data management. Conventional Gaussian statistics and Euclidean geometry are also not of much use for big data analytics (Jiang and Thill, 2015; Jiang, 2015b). Furthermore, conventional geographic representations, such as raster and vector, are inconvenient for providing deep insights into geographic forms and processes. For example, with raster and vector representations, geographic space is abstracted as either a large set of pixels or a variety of points, lines, and polygons, layer by layer (Longley et al., 2015). Under these conventional geographic representations, geographic space is just a collection of numerous lifeless pieces, such as pixels, points, lines, and polygons, with little connection between the pieces except for immediate, nearby relationships. These local relationships suggest that geographic space is full of more or less similar things, and one can hardly see geographic space as a living structure consisting of far more small things than large ones. Present geographic representations are very much influenced by the mechanistic worldview inherited from 300 years of science. Under this worldview, geographic space is mechanistically conceptualized as continuous fields or discrete objects (Couclelis, 1992; Cova and Goodchild, 2002), which carry little meaning in our minds and cognition. These representations, based mainly on Newton's absolute space and Leibniz's relational space, are essentially geometry based rather than topology oriented.
By geometry, we mean geometric details such as locations, sizes, and directions, whereas topology enables us to see the underlying living structure of far more small things than large ones. The notion of topology here thus differs fundamentally from the notion conceived and used in the geographic information systems (GIS) literature. Geographic space as a whole is made of things: spatially coherent entities such as rivers, buildings, streets, and cities. Things are connected to other things to constitute even larger things. For example, a set of streets or buildings constitutes a neighborhood, a set of neighborhoods constitutes a city, and a set of cities constitutes a country. These things have not yet become basic units of geographic representations in GIS (Longley et al., 2015), mainly because current science is largely mechanistic. The mechanistic worldview is remarkable, and all that we have achieved in science and technology is essentially based on it. However, it is limited, in particular with respect to rebuilding architecture or making good built environments, as argued by Alexander (2002–2005). In order to build beautiful buildings, Alexander (2002–2005) conceived and developed a new worldview: a new conception of how the physical world is constituted. This new world picture is organic, so it differs fundamentally from the mechanistic one.


Under the organic worldview, the world is an unbroken whole that possesses a physical structure, called a living structure or wholeness (see Section 2 for an introduction). Alexandrine organic space constitutes the third view of space, which we discuss further in Section 5. Based on this third view of space, a topological representation has previously been developed (see Section 2) in order to show living structures in built environments, and to further argue why the design principles of differentiation and adaptation are essential to reach living structures (Salingaros, 2005; Jiang, 2017). The present paper further explores how the topological representation, or the living structure it illustrates, can be used to predict human activities. We will demonstrate, contrary to what we naively think, that geospatial big data is extremely well structured according to its underlying living structure, or the underlying scaling of far more small things than large ones. We will show that tweet locations can be well predicted by the living structure of street nodes extracted from OpenStreetMap. This predictability holds not only for the current status but also for the future status. We further illustrate why the topological representation is a truly multiscale representation, and subsequently put forward a new way of structuring geospatial data, both big and small. The remainder of this paper is structured as follows. Section 2 introduces the living structure or wholeness as a field of centers. Section 3 describes the data and data processing. Section 4, based on case studies of the United Kingdom (UK), shows that tweet locations can be predicted using street nodes. Section 5 discusses the implications of the results, particularly how living structures of geographic space can be used to structure geospatial data. Finally, Section 6 draws conclusions and points to future work.

2 LIVING STRUCTURE AND THE TOPOLOGICAL REPRESENTATION

Living structure is a key concept of this paper, developed by Alexander (2002–2005) in his Theory of Centers; it is also called wholeness, life, or beauty. A living structure consists of many individual centers that appear at different levels of detail of the structure, and that tend to overlap and nest within each other to form a coherent whole. The terms living and life are used not in the biological sense but in terms of the underlying recursive structure: one with far more small things than large ones, with numerous smallest things, a very few largest things, and some in between the smallest and the largest. Alexander (2002–2005) identified 15 fundamental properties (Table 1) that help judge whether a thing is a living structure, or whether a thing is living or with life; usually, the more properties the thing has, the more living the thing is. In this section, we first introduce the notion of living structure and its fundamental properties using an ornament as a working example (Fig. 1), and then


TABLE 1 The 15 Fundamental Properties of the Living Structure (Alexander, 2002–2005)

Levels of scale          Good shape                     Roughness
Strong centers           Local symmetries               Echoes
Thick boundaries         Deep interlock and ambiguity   The void
Alternating repetition   Contrast                       Simplicity and inner calm
Positive space           Gradients                      Not separateness

FIG. 1 An ornament and its topological representation. The ornament (A), presumably drawn by Alexander, appears on the book cover of The Nature of Order (Alexander, 2002–2005). The dot sizes in the topological representation (B) illustrate the degree of livingness of the ornament's structure: the larger the dots, the higher the degree of livingness.

use a configuration of 10 fictive cities (Fig. 2) to illustrate the topological representation and to show how the degree of livingness can be measured. The ornament is part of the book cover of The Nature of Order (Alexander, 2002–2005), and it shows a strong sense of livingness, with far more small things than large ones (Fig. 1A). The central theme of the four-volume book is the nature of order, which is presumably represented by the big circle, while the four volumes, we suspect, are represented by the four small circles. The four small circles are enhanced by the smaller dots within them, through so-called differentiation processes. The big circle is further differentiated, and therefore strengthened, by the diamond-shaped piece, to which there are four

FIG. 2 Illustration of the topological representation. (A) The 10 fictive cities, with sizes of 1, 1/2, 1/3, …, 1/10, are given locations in a square space. (B) The 10 cities are put into 3 hierarchical levels, indicated by the 3 colors, and their corresponding Thiessen polygons nest within each other. (C) A complex network is then created to capture adjacency relationships among polygons at the same level, and nesting relationships among polygons across levels.

dots attached. The big circle can be perceived as four arcs, which are further enhanced by the four little dots or strokes, as well as by the four boundaries. The ornament possesses many of the 15 properties (Table 1). There are at least three levels of scale: the big circle, the small circles, and the dots. There are many strong centers, some with thick boundaries. Alternating repetition is present, although less apparent, at the edge between figure and ground of the ornament. Good shapes repeat alternately in the ornament, which is a good shape itself, since it contains many good shapes in a recursive manner. Local symmetries are present in the diamond at the center, as well as in the big circle with the four strokes. The ornament looks hand drawn, but we believe Alexander deliberately drew it so; he knew better than anyone else that roughness is an important property of living structure. The diamond with four dots in the middle appears to echo the big circle with four small circles. All the identified centers are not separate from each other but tie together into a coherent whole. This whole is topologically represented as a graph (Fig. 1B). The topological representation, developed by Jiang (2017), builds up supporting relationships among individual centers. For example, the big circle, consisting of four arcs, is supported by the four small circles, and the diamond is supported by the four attached dots. These supporting relationships are indicated by directed links in the topological representation of the ornament (Fig. 1B). It should be noted that the topological representation does not show all potential centers; interested readers can refer to Alexander (2002–2005) and Jiang (2016), in which a paper with a tiny dot induces up to 20 centers. With the topological representation or graph, and based on the mathematical model of wholeness (Jiang, 2015a), we can compute degrees of livingness, as shown in Fig. 1B.
Geographic space is much more complex than the ornament, and its centers are much harder to identify than those of the ornament. Let’s assume 10 fictive cities in a square space, and their sizes are respectively 1, 1/2, 1/3, … 1/10 (Fig. 2A). The 10 cities can be put into 3 hierarchical levels based on the


head/tail breaks classification (Jiang, 2013, 2015c). The three hierarchical levels are indicated by the three colors in Fig. 2B, in which the dots of the three hierarchical levels are respectively used to create Thiessen polygons. A complex network is then created for all the polygons, with directed relationships from smaller dots to adjacent larger ones at the same hierarchical level, and from contained polygons to containing ones between two consecutive levels (Fig. 2C). With the network and the mathematical model of wholeness (Jiang, 2015a), the degree of livingness can be obtained and visualized by the dot sizes in Fig. 2C. In what follows, we briefly introduce the ideas behind the mathematical model. The set of cities constitutes a living structure, and this living structure has two statuses: the current (ti) and the future (ti+1), shown in Fig. 2B and C, respectively. In the current status, the degree of livingness is measured by city sizes, whereas in the future status, it is measured by Google's PageRank (PR) scores (Jiang, 2015a). The major difference between these two measures lies in the configuration, i.e., how the 10 cities support each other to constitute a coherent whole. The configuration effect is continuous, which means that city sizes, as they are now, are the outcome of the configuration, and the PR scores can be regarded as the ranking of future city sizes. For example, the two middle-sized cities are supported by three and four small cities, respectively, and the largest city is supported by the two middle-sized cities. For a spatial configuration that is well adapted, the two statuses, or the two measures (sizes and PR scores), differ little in terms of their individual rankings. Space, or spatial configuration to be specific, is not neutral: it has the capacity to be more adapted or less adapted, or equivalently to be more whole or less whole.
This dynamic view of space is what underlies Alexander's organic worldview: space is not lifeless or neutral, but a living structure involving far more small centers than large ones, and, more importantly, space is in a continuous process of adaptation. This adaptation depends not only on how the various centers adapt to each other within their whole spatial configuration, but also on how the whole fits its surroundings. For example, the degree of livingness of the ornament is decided not only by its centers within, but also by its surroundings, within the bigger whole of the book cover and even beyond.
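The future-status measure can be sketched in code. Below is a minimal power-iteration PageRank applied to a hypothetical assignment of the supporting links among the 10 fictive cities (three small cities support one middle-sized city, four support the other, and the two middle-sized cities support the largest, as described above); the exact links of Fig. 2 may differ, and all names are our own.

```python
# Minimal power-iteration PageRank over a hypothetical "support"
# network of the 10 fictive cities; edge assignments are illustrative.
def pagerank(edges, nodes, d=0.85, iters=100):
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:  # pass rank along the supporting links
                share = d * pr[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:       # dangling node: spread its mass uniformly
                for m in nodes:
                    nxt[m] += d * pr[n] / len(nodes)
        pr = nxt
    return pr

cities = ["c%d" % i for i in range(1, 11)]  # c1 largest ... c10 smallest
edges = ([("c4", "c2"), ("c5", "c2"), ("c6", "c2")]    # 3 small -> middle
         + [("c%d" % i, "c3") for i in range(7, 11)]   # 4 small -> middle
         + [("c2", "c1"), ("c3", "c1")])               # 2 middle -> largest
pr = pagerank(edges, cities)
ranking = sorted(cities, key=pr.get, reverse=True)  # c1 comes out on top
```

For a well-adapted configuration, this PR ranking largely agrees with the ranking by current city size, which is exactly the comparison made in this section.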

3 DATA AND DATA PROCESSING

In order to demonstrate that living structure can be used to predict human activities, or how human activities are shaped by the underlying living structure, we used two big datasets covering the United Kingdom: street nodes, and tweet locations collected between June 1 and 8, 2014 (Table 2). The street nodes were extracted from OpenStreetMap for building up natural cities (Jiang and Miao, 2015) as a living structure, while the tweet locations are used to verify whether they can be predicted by the living structure. The street nodes refer to both street junctions and ending nodes. Street nodes can easily be derived if one writes a simple script to extract the nodes with one street segment or with at least three street


TABLE 2 The Two Datasets and Derived Natural Cities or Hotspots

                  OSM        Tweets     NCities
United Kingdom    4,715,279  2,933,153  123,551
London            308,999    424,970    16,080
London II         37,982     91,325     2,512
London III        1,929      2,796      89

OSM, street nodes; Tweets, tweet locations; NCities, natural cities at different levels of scale; London II, the largest natural city within London; and London III, the largest natural city within London II.

segments in a street network. The nodes with one segment are ending points, while those with at least three segments are junctions. For the convenience of readers, we introduce how to use ArcGIS to derive street nodes from a network. There are two sets of procedures, which vary in efficiency and accuracy. The first is very accurate but slow, and thus suitable for city-scale networks; the second is less accurate but fast, and thus suitable for country-scale networks. The first set of procedures relies on ArcGIS' topology-building function, which partitions all streets at their junctions into different arcs. Under ArcToolbox Data Interoperability Tools, use Quick Export to create the coverage format, which contains detailed topology. Use ArcToolbox Data Management Tools > Feature Vertices To Points to get both ends of these arcs, and then use Find Identical to count the ends at each location. The ends with a count of 1 or of at least 3 are valid street nodes. The second set of procedures is based on ArcGIS' Intersect function, again within ArcToolbox. The OSM street data must first be merged by street name. Then use Intersect to get all junctions, and Feature Vertices To Points to get the dangling ends. Note that the Intersect operation can generate duplicate junctions, which must be removed with Delete Identical to obtain all valid junctions. Finally, merge the junctions and dangling ends to get all street nodes. The derived street nodes are used to generate natural cities at different levels of scale. Natural cities are naturally, objectively derived patches from big data, such as nighttime imagery, street nodes, points of interest, and tweet locations, based on head/tail breaks (Jiang, 2013, 2015c). We first build a huge triangulated irregular network (TIN) by connecting all the street nodes or locations.
This TIN comprises a large number of edges whose lengths are heavy-tail distributed, indicating far more short edges than long ones. Based on head/tail breaks, all the edges are put into two categories: those longer than the mean, called the head, and those shorter than the mean, called the tail. The edges in the tail, which are shorter and thus imply higher-density locations, eventually constitute individual patches called natural cities (see the Appendix in Jiang and Miao (2015) for a tutorial). It should be noted that the same notion of natural cities was used


to refer to naturally evolved cities (Alexander, 1965), rather than the naturally derived cities used in this paper. However, the naturally derived cities are likely to resemble the naturally evolved cities, thus becoming an important instrument for studying city structure and dynamics; for example, the natural cities derived from time-stamped check-in locations of the social medium Brightkite (Jiang and Miao, 2015) can be considered naturally evolved cities, in particular in terms of the underlying mechanisms. Every natural city is a living center, and all the natural cities of a country constitute an interconnected whole as a living structure (Jiang, 2017). The next section will examine how tweet locations can be predicted by the living structure of geographic space at different levels of scale. Natural cities at the city scale could be called hotspots; however, for the sake of convenience, we will use the term natural cities to refer to the naturally derived patches at all levels of scale. Fig. 3 shows the natural cities defined at four levels of scale or spaces: London III, London II, London, and the United Kingdom.
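The TIN edge split described above can be sketched in a few lines. The edge lengths below are made-up illustrative values with a heavy-tailed shape (far more short edges than long ones), not the UK data; the tail edges are the ones that coalesce into natural cities.

```python
# Head/tail split of TIN edge lengths around the mean: the short
# ("tail") edges mark high-density locations, i.e., natural-city
# candidates. The lengths are illustrative, not real TIN edges.
def split_head_tail(lengths):
    mean = sum(lengths) / len(lengths)
    head = [e for e in lengths if e > mean]   # long edges: low density
    tail = [e for e in lengths if e <= mean]  # short edges: natural cities
    return head, tail

edge_lengths = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 40, 90]
head, tail = split_head_tail(edge_lengths)  # head = [40, 90]
```

Reapplying the same split within each resulting patch yields natural cities at finer levels of scale, which is how the nested structures of Fig. 3 arise.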

FIG. 3 Nested natural cities in the United Kingdom at different levels of scale. (A) The United Kingdom and its natural cities, (B) London (a bit smaller than the M25 highway) and its natural cities, (C) London II (at the very center, on the northern side of the Thames) and its natural cities, and (D) London III (near the City of London) and its natural cities, giving the nested relationship London III ⊂ London II ⊂ London ⊂ UK.


4 PREDICTION OF TWEET LOCATIONS THROUGH LIVING STRUCTURE

The derived natural cities of the four spaces are topologically represented to set up complex networks, according to the principle illustrated in Fig. 2. We built four complex networks, one for each of the four spaces, which are nested within each other (Fig. 3). With these complex networks, we computed the degree of wholeness both for individual centers and for their wholes. As seen in Fig. 3, the four spaces are living structures, since each of them contains far more small natural cities than large ones. In what follows, we examine how well tweet locations can be predicted by the living structures. Each of the living structures consists of far more small centers (represented as polygons) than large ones; see the example of London II (Fig. 4A), in which both the natural cities and their Thiessen polygons act as centers (see Fig. 4B for an enlarged view). The centers we refer to here comprise both Thiessen polygons and natural cities, whereas the previous study by Jiang (2017) referred only to Thiessen polygons. The natural cities act as the cores of their corresponding Thiessen polygons, reflecting the centeredness of the centers (Alexander, 2002–2005).
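The network-building principle of Fig. 2 (directed links from smaller polygons to adjacent larger ones at the same level, and from contained polygons to their containing ones across consecutive levels) can be made concrete with a small sketch. The five-polygon layout, sizes, adjacencies, and parent assignments below are hypothetical, chosen only to exercise both rules.

```python
# Build the directed support links of the topological representation
# from hypothetical polygon records (level, size, same-level
# neighbors, containing parent).
polygons = {
    "A": {"level": 1, "size": 10, "neighbors": [], "parent": None},
    "B": {"level": 2, "size": 5, "neighbors": ["C"], "parent": "A"},
    "C": {"level": 2, "size": 3, "neighbors": ["B"], "parent": "A"},
    "D": {"level": 3, "size": 1, "neighbors": ["E"], "parent": "B"},
    "E": {"level": 3, "size": 2, "neighbors": ["D"], "parent": "C"},
}

def build_links(polys):
    links = set()
    for name, p in polys.items():
        for nb in p["neighbors"]:
            q = polys[nb]
            if q["level"] == p["level"] and q["size"] > p["size"]:
                links.add((name, nb))       # smaller -> adjacent larger
        if p["parent"] is not None:
            links.add((name, p["parent"]))  # contained -> containing
    return sorted(links)

links = build_links(polygons)  # 6 directed links in total
```

The resulting directed graph is exactly the kind of network on which the PageRank-style wholeness scores of Section 2 are computed.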

4.1 Correlations at the Scale of Thiessen Polygons

We found good correlations between the living structure and tweet locations, with R-square values ranging from 0.63 to 0.99 (Table 3). The column "OSM/Tweets" indicates the current status, while the column "Life/Tweets" indicates the future status. The good correlations imply that living structures can predict tweet locations at the scale of Thiessen polygons. Fig. 5 roughly illustrates how well street nodes correlate with tweet locations. However, we are not used to this view of space

FIG. 4 The topological representation of London II. London II demonstrates a living structure with a very striking scaling hierarchy of far more small polygons than large ones: (A) an overall view of London II, and (B) an enlarged view around the largest natural city of London II. The spectral color legend is used to indicate the living structure of far more small polygons (cold colors) than large ones (warm colors), or numerous smallest (blue), a very few largest (red), and some in between the smallest and largest (other colors between blue and red).


TABLE 3 R-Square Values Among Street Nodes, Tweet Locations, and Degree of Wholeness at the Scale of Thiessen Polygons

                  OSM/Tweets  Life/Tweets
United Kingdom    0.99        0.76
London            0.98        0.81
London II         0.92        0.63
London III        0.97        0.85

OSM, street nodes; Tweets, tweet locations; Life, degrees of wholeness; and /, between.

FIG. 5 Illustration of the good correlation between the polygon sizes based on street nodes (A) and their degrees of wholeness (B) for the natural city London (a bit smaller than the M25 highway). The dot sizes are proportional to their values; they are deliberately left unclassified in order to show the good correlation.

or to this kind of multiscale spatial unit, nested within each other. This is somewhat like a series of spatial units involving a country, its states, and their counties, with the small units contained in the big ones. We are used to single-scale units, either all states or all counties, rather than mixed units of states and counties. It is exactly the multiscale units that make the topological representation unique and powerful, since it captures the scaling or living structure of space. Besides the correlation at the scale of the nested polygons, we can also examine the prediction at the scale of the natural cities.
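An R-square like those reported in Table 3 can be computed as the squared Pearson correlation between log-transformed counts (the correlations in this chapter are based on the logarithmic scale of the data). The per-polygon counts below are synthetic stand-ins, not the chapter's data.

```python
import math

# R-square as the squared Pearson correlation of two series,
# here applied on the logarithmic scale; the counts are synthetic.
def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

osm_nodes = [12000, 3500, 900, 400, 150, 60, 25, 10]  # street nodes per polygon
tweets = [9000, 2600, 700, 350, 100, 45, 20, 8]       # tweets per polygon
r2 = r_squared([math.log(v) for v in osm_nodes],
               [math.log(v) for v in tweets])
```

The log transform matters here: both counts are heavy-tailed, so a correlation on raw values would be dominated by the one or two largest polygons.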

4.2 Correlations at the Scale of Natural Cities

Good correlations also occur at the scale of natural cities. Table 4 presents the R-square values of the different pairs at the different levels, although they are not as good


TABLE 4 Percentages of Data Within Natural Cities and R Square Values among Street Nodes, Tweet Locations, and Degree of Wholeness at the Scale of Natural Cities OSM%

Tweets%

OSM/Tweets

Life/Tweets

United Kingdom

0.89

0.84

0.93

0.55

London

0.84

0.40

0.85

0.44

London II

0.86

0.24

0.70

0.58

London III

0.55

0.18

0.84

0.64

OSM%, street nodes included in natural cities; Tweets%, tweet locations included in natural cities; OSM, street nodes; Tweets, tweet locations; Life, degree of wholeness; and /, between.

as those at the level of polygons. It should be noted that the percentages of both street nodes and tweet locations included in natural cities decrease dramatically with scale. For example, the percentage of tweets decreases from 84% at the country scale to 18% at the London III scale. This is because tweets are more evenly distributed within the natural cities than across the country; alternatively put, tweets are more heterogeneously distributed across the country than within the natural cities. The living structure, or this kind of topological analysis, can predict tweet locations well, but only for those within natural cities. The problem is that, at the finer scales, a vast majority of tweets are not within natural cities. In this case, we do not recommend using natural cities, but rather Thiessen polygons, for effective prediction. In other words, natural cities can predict those tweets that are highly clustered or concentrated, while Thiessen polygons can be used for all tweets, both highly clustered and highly segregated.
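Computing the percentages in Table 4 requires deciding which tweets fall inside a natural city, i.e., a point-in-polygon test. Below is a minimal ray-casting sketch with a toy square footprint and made-up tweet coordinates; real natural-city polygons are far more irregular, and a production workflow would add a spatial index, but the test itself is the same.

```python
# Ray casting: a point is inside a simple polygon if a rightward
# ray from it crosses the boundary an odd number of times.
def point_in_polygon(pt, poly):
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

city = [(0, 0), (4, 0), (4, 4), (0, 4)]     # toy natural-city footprint
tweets = [(1, 1), (5, 5), (2, 3), (-1, 2)]  # made-up tweet coordinates
inside = [t for t in tweets if point_in_polygon(t, city)]
share = len(inside) / len(tweets)           # fraction within the city
```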

4.3 Degrees of Wholeness or Life or Beauty

The United Kingdom as a whole, and its subwholes, such as London, London II, and London III, are living structures. It is interesting to compare their degrees of wholeness, life, or beauty. For this purpose, we computed correlations between street nodes and the wholeness of the natural cities (Table 5; see the columns OSM/Life (TS) and OSM/Life (NC), and their averages, shown in the column OSM/Life). The United Kingdom has the highest degree of wholeness, indicating that the country is more beautiful than its cities. On the other hand, the United Kingdom, London, London II, and London III all have about the same degree of adaptation, given their very similar correlations between city sizes and degrees of wholeness. It should be noted that all the correlations examined above are significant at the 0.01 level (two-tailed), based on the logarithmic scale of the data. Through these computed results, we can understand why the United Kingdom is a living


TABLE 5 Comparison of Wholeness Among the United Kingdom, London, London II, and London III

                  HtIndex  OSM/Life (TS)  OSM/Life (NC)  OSM/Life
United Kingdom    8        0.78           0.75           0.77
London            6        0.85           0.65           0.75
London II         6        0.76           0.79           0.77
London III        4        0.86           0.63           0.75

OSM, street nodes; Life, degree of wholeness; /, between; TS, Thiessen polygons; and NC, natural cities.

structure because its constituents are living. To be more specific, the country is living because its constituents (e.g., London) are living; London is living because its centers (e.g., London II) are living; and London II is living because its centers (e.g., London III) are living. More generally, the goodness of things is assessed in this recursive way, which is what underlies the notion of living structures. From this case study, we have seen how living structure is extracted from big data of street nodes, and how human activities are subsequently shaped by the living structure. Big data differs from small data, and it is better than small data at capturing the underlying configuration of geographic space.

5 IMPLICATIONS ON THE TOPOLOGICAL REPRESENTATION AND LIVING STRUCTURE

Through the above case studies, we have seen that space is not lifeless or neutral, but a living structure involving far more small centers than large ones. In this regard, the topological representation proves efficient and effective for illustrating the underlying living or scaling structure. The topological representation is a multiscale representation, covering multiple scales ranging from the smallest to the largest. To illustrate, we partition the representation in Fig. 2 into three scales in Fig. 6 (Panels A, B, and C). In contrast, existing geographic representations such as raster and vector are essentially single scale, reflecting the mechanistic views of space of Newton and Leibniz (see Panels D, E, and F of Fig. 6). Single-scale representations create many scale-related problems that have been major concerns in geographical analysis, such as the modifiable areal unit problem (Gehlke and Biehl, 1934; Openshaw, 1984), the conundrum of length (Richardson, 1961; Mandelbrot, 1982; Batty and Longley, 1994; Frankhauser, 1994; Chen, 2008), and the ecological fallacy

FIG. 6 The topological representation as a multiscale representation versus single-scale representations commonly used in GIS. To illustrate, the topological representation is partitioned into three scales, ranging from (A) the smallest, to (B) the medium, to (C) the largest. There are three singlescale representations: (D) administrative boundaries, (E) a regular grid, and (F) an image or a pixelbased representation.

(Robinson, 1950; King, 1997; Wu et al., 2006). Single-scale representations are suitable for showing geographic features of more or less similar scales. It is therefore not surprising that current spatial statistics focuses much on autocorrelation and little on the scaling, fractal, or living property of far more small geographic features than large ones (Jiang, 2015b). The topological representation enables us to see not only more or less similar things at one scale (spatial dependency), but also far more small things than large ones across all scales (spatial heterogeneity). The notion of far more small things than large ones, or spatial heterogeneity, has been formulated as a scaling law (Jiang, 2015b). It complements Tobler's law (1970), the first law of geography, in characterizing geographic space or the Earth's surface: the scaling law is global, while Tobler's law is local. The scaling law refers only to a statistical property, without referring to the underlying geometrical property. In this respect it resembles Zipf's law (1949) on city-size distribution, which makes no reference to the spatial configuration of cities formulated by Central Place Theory (Christaller, 1933). The Theory of Centers (Alexander, 2002–2005) about living structure, however, concerns both the statistical and the geometrical aspects. The statistical aspect illustrates the fact that city sizes follow a power-law relationship, while the geometrical aspect

points to the fact that all cities adapt to each other to form a coherent whole (Jiang, 2017). Seen from both the statistical and geometrical aspects, space is far more heterogeneous than current spatial statistics and Euclidean geometry can effectively deal with. It is in this sense that we must adopt fractal geometry and Paretian statistics for geospatial analysis; it is in this sense that we must adopt the topological representation to gain insight into the living structure of space; and it is in this sense that existing geographic representations show critical limitations.

Given the limitations of single-scale representations, the topological representation (or the underlying living structure) is a better alternative for representing or structuring geospatial data. Hierarchical data structures, such as the quadtree (Samet, 2006), should be adopted to reflect the living structure of geographic space or features. All geographic features should be put into a whole, and their status can be indicated by their hierarchical levels, or their ht-index, within the whole. Fig. 7 illustrates three kinds of geographic features, all of which demonstrate living structures. A set of far more small cities than large ones constitutes a coherent whole (Fig. 7A), and a curve is regarded as a set of far more small bends than large ones (Fig. 7B), rather than a set of more or less similar segments. The Sierpinski carpet (Fig. 7C) is a proxy for many areal geographic features, such as islands, lakes, and land-use patches. A geographic space or feature is hierarchically well structured, so mechanistically imposed representations, such as raster and vector, are not appropriate for illustrating its living structure; administrative boundaries, regular grids, and pixel-based images are therefore not appropriate for revealing the living structure of geographic features. In this regard, naturally or organically derived cities provide a significant instrument for structuring big data.
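The ht-index used to indicate hierarchical levels can be derived with head/tail breaks (Jiang, 2013): values with a heavy-tailed distribution are split recursively around the arithmetic mean. A minimal sketch in Python, on an invented list of city sizes (the 40% head threshold used here is the conventional choice, not fixed by the method):

```python
def head_tail_breaks(values, head_limit=0.4):
    """Recursively split values around the mean; ht-index = len(breaks) + 1."""
    breaks = []
    head = list(values)
    while len(head) > 1:
        mean = sum(head) / len(head)
        new_head = [v for v in head if v > mean]
        # Stop when the head is empty or no longer a clear minority.
        if not new_head or len(new_head) / len(head) > head_limit:
            break
        breaks.append(mean)
        head = new_head
    return breaks

# Synthetic heavy-tailed sizes: far more small "cities" than large ones.
sizes = [1] * 80 + [10] * 15 + [100] * 5
breaks = head_tail_breaks(sizes)
ht_index = len(breaks) + 1
```

For the synthetic sizes above, two breaks are found, giving an ht-index of 3, that is, three hierarchical levels of the kind shown in Fig. 7.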
To this point, we can further elaborate on how Alexander's organic worldview constitutes a third view of space. In the history of science, there have been two dominant conceptions of space, also called the absolute and relative views of space.


FIG. 7 Structuring geographic features based on their scaling hierarchy. Geographic features of different shapes: (A) points, (B) lines, and (C) polygons, in which the ht-indices or numbers indicate their hierarchical levels.


The absolute view arises out of Newtonian physics and implies that phenomena can be defined in themselves, so that space can be considered a container. The relative view comes from Leibniz's conception that space can be defined as the set of all possible relationships among phenomena. These two views reflect pretty well the conception of space by Descartes (1954), who described space as a neutral and strictly abstract geometric medium through his uniform spatial scheme of analytical or coordinate geometry. This spatial scheme, similar to the common geographic representations of raster and vector, led us to think of space as a neutral, lifeless, and dead substance. Alexander (2002–2005) challenged this mechanistic conception of space and conceived an organic worldview, under which space has the capacity to be more living or less living according to its inherent structure. Unlike absolute or relative space, this new view of space is organic, based on the concept of wholeness, which finds its roots in many disciplines, such as quantum physics (Bohm, 1980) and Gestalt psychology (Köhler, 1947). Alexander (2002–2005) further argues for the ubiquity of wholeness in nature, in buildings, in works of art, and more specifically in any part of space at different levels of scale. Eventually, the goodness of a given part of space may be understood only as a consequence of the wholeness that exists; in other words, wholeness is the essence of geographic space.

Geography as a science has three fundamental issues to address about geographic space: (1) how it looks, (2) how it works, and (3) what it ought to be. The first issue concerns mainly geographic forms or urban structure, which are governed by two fundamental laws: the scaling law (Jiang, 2015b) and Tobler's law (1970).
The scaling law states that there are far more small things than large ones on the Earth's surface, whereas Tobler's law refers to the fact that similar things tend to be nearby or related. Kriging interpolation is possible because of Tobler's law; prediction of tweet locations is possible because of the scaling law, or living structure. The second issue refers to the underlying mechanisms of how a complex or living structure evolves. This is what physicists are primarily concerned with: for example, how cities evolve, and how topographic surfaces are formed with respect to geological processes. The surface complexity arises out of deep simplicity, that is, deep nonlinear and chaotic processes (Gribbin, 2004). In other words, geographic processes, and urban dynamics in particular, fluctuate very much like stock prices. The first two issues are fundamental to many other sciences, such as physics, biology, and chemistry, for understanding and explaining complex natural and societal phenomena. The third issue is not so common in other sciences, which are hardly concerned with creation or design (Alexander, 2002–2005). The issue of what ought to be addresses how to create a living structure, and how to make a living structure more living or more harmonic. This third issue has not been well addressed in geography, yet it is unique and important in terms of how to make better, more sustainable built environments. As demonstrated by Alexander (2002–2005), the concept of

living structure has paved a way toward this creation and design of living environments through harmony-seeking computation (Alexander, 2009). In this connection, harmony-seeking computation, as a kind of adaptive computation, deserves further research in the future.

6 CONCLUSION

Inspired by Alexander's new cosmology, this paper demonstrated that human activities, such as tweet locations, can be well predicted by the underlying living structure using topological representation and analysis. From this study, we have a better understanding of how human activities are shaped by space or, more precisely, by its underlying living structure. This finding further validates topological representation as an effective tool for geospatial analysis, particularly in the context of big data. More importantly, we showed that living structure exists in space, to varying degrees at different levels of scale, so it is a legitimate object of inquiry for better understanding the goodness of built environments. Unlike existing geographic representations, such as raster and vector, which are essentially single scale, the topological representation is a de facto multiscale representation, covering multiple scales, from the smallest to the largest and including those in between, within a single representation. This multiscale representation can help avoid many scale issues caused by traditional single-scale representations. Big data not only provides a new type of data source for geographic research, but also poses a big challenge in how to manage it efficiently and effectively, and particularly how to develop new, penetrating insights. Unlike small data, which relies on samples for understanding the population, big data can effectively extract the living structure of space; it is more representative than small data in capturing that underlying structure. This is why big data, or living structure, can effectively predict human activities. From another aspect, big data is commonly considered to be unstructured, yet the underlying living structure makes big data incredibly well structured, because space itself is inherently well structured.
It is in this context that we suggest a new way of structuring big data through living structure. The living structure, and in particular its current and future statuses, indicates that space is living rather than lifeless, and dynamic rather than static. The essence of geographic space is its living structure, which can be used to structure geospatial data through its inherent hierarchy. This proposal deserves further research in the future.

ACKNOWLEDGMENTS

We would like to thank the three anonymous reviewers for their constructive comments. We would also like to thank Junjun Yin for kindly sharing the tweet location data. This chapter is a reprint of Jiang and Ren (2018) with the permission of the original publisher, Taylor and Francis.


REFERENCES

Alexander, C., 1965. A city is not a tree. Architect. Forum 122 (1+2), 58–62.
Alexander, C., 2002–2005. The Nature of Order: An Essay on the Art of Building and the Nature of the Universe. Center for Environmental Structure, Berkeley, CA.
Alexander, C., 2009. Harmony-Seeking Computations. Revised and expanded version of a keynote speech at the International Workshop on the Grand Challenge in Non-Classical Computation, University of York, UK, April 2005. http://www.livingneighborhoods.org/library/harmony-seeking-computations.pdf.
Batty, M., Longley, P., 1994. Fractal Cities: A Geometry of Form and Function. Academic Press, London.
Bohm, D., 1980. Wholeness and the Implicate Order. Routledge, London and New York.
Chen, Y., 2008. Fractal Urban Systems: Scaling, Symmetry, Spatial Complexity. Science Press, Beijing (in Chinese).
Christaller, W., 1933. Central Places in Southern Germany (trans. C.W. Baskin, 1966). Prentice Hall, Englewood Cliffs, NJ.
Couclelis, H., 1992. People manipulate objects (but cultivate fields): beyond the raster-vector debate in GIS. In: Frank, A.U., Campari, I. (Eds.), Theories and Methods of Spatio-Temporal Reasoning in Geographic Space. Springer-Verlag, Berlin, pp. 65–77.
Cova, T.J., Goodchild, M.F., 2002. Extending geographical representation to include fields of spatial objects. Int. J. Geogr. Inf. Sci. 16 (6), 509–532.
Descartes, R., 1954. The Geometry of Rene Descartes (trans. D.E. Smith and M.L. Latham). Dover Publications, New York.
Frankhauser, P., 1994. La Fractalité des Structures Urbaines. Economica, Paris.
Gehlke, C.E., Biehl, H., 1934. Certain effects of grouping upon the size of the correlation coefficient in census tract material. J. Am. Stat. Assoc. (Suppl.) 29, 169–170.
Gribbin, J., 2004. Deep Simplicity: Chaos, Complexity and the Emergence of Life. Penguin Books, New York.
Jiang, B., 2013. Head/tail breaks: a new classification scheme for data with a heavy-tailed distribution. Prof. Geogr. 65 (3), 482–494.
Jiang, B., 2015a. Wholeness as a hierarchical graph to capture the nature of space. Int. J. Geogr. Inf. Sci. 29 (9), 1632–1648.
Jiang, B., 2015b. Geospatial analysis requires a different way of thinking: the problem of spatial heterogeneity. GeoJournal 80 (1), 1–13. (Reprinted in Behnisch, M., Meinel, G. (Eds.), 2017. Trends in Spatial Analysis and Modelling: Decision-Support and Planning Strategies. Springer, Berlin, pp. 23–40.)
Jiang, B., 2015c. Head/tail breaks for visualization of city structure and dynamics. Cities 43, 69–77. (Reprinted in Capineri, C., Haklay, M., Huang, H., Antoniou, V., Kettunen, J., Ostermann, F., Purves, R. (Eds.), 2016. European Handbook of Crowdsourced Geographic Information. Ubiquity Press, London, pp. 169–183.)
Jiang, B., 2016. A complex-network perspective on Alexander's wholeness. Phys. A Stat. Mech. Appl. 463, 475–484. (Reprinted in Ye, X., Lin, H. (Eds.), 2019. Advances in Spatially Integrated Social Sciences and Humanities. Higher Education Press, Beijing.)
Jiang, B., 2017. A topological representation for taking cities as a coherent whole. Geogr. Anal. 50 (3), 298–313. https://doi.org/10.1111/gean.12145. (Reprinted in D'Acci, L. (Ed.), 2019. Mathematics of Urban Morphology. Springer, Berlin.)
Jiang, B., Miao, Y., 2015. The evolution of natural cities from the perspective of location-based social media. Prof. Geogr. 67 (2), 295–306. (Reprinted in Plaut, P., Shach-Pinsly, D. (Eds.), 2018. ICT Social Networks and Travel Behaviour in Urban Environments. Routledge.)
Jiang, B., Ren, Z., 2018. Geographic space as a living structure for predicting human activities using big data. Int. J. Geogr. Inf. Sci. https://doi.org/10.1080/13658816.2018.1427754.
Jiang, B., Thill, J.-C., 2015. Volunteered geographic information: towards the establishment of a new paradigm. Comput. Environ. Urban Syst. 53, 1–3.
King, G., 1997. A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press, Princeton.
Köhler, W., 1947. Gestalt Psychology: An Introduction to New Concepts in Modern Psychology. Liveright, New York.
Longley, P.A., Goodchild, M.F., Maguire, D.J., Rhind, D.W., 2015. Geographic Information Science and Systems. Wiley, Chichester.
Mandelbrot, B., 1982. The Fractal Geometry of Nature. W.H. Freeman and Co., New York.
Mayer-Schonberger, V., Cukier, K., 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Eamon Dolan/Houghton Mifflin Harcourt, New York.
Openshaw, S., 1984. The Modifiable Areal Unit Problem. Geo Books, Norwich, Norfolk.
Richardson, L.F., 1961. The problem of contiguity: an appendix to Statistics of Deadly Quarrels. Gen. Syst. 6, 139–187. Society for General Systems Research, Ann Arbor, MI.
Robinson, W.S., 1950. Ecological correlations and the behavior of individuals. Am. Sociol. Rev. 15, 351–357.
Salingaros, N.A., 2005. Principles of Urban Structure. Techne Press, Delft.
Samet, H., 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, CA.
Tobler, W., 1970. A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 46 (2), 234–240.
Wu, J., Jones, K.B., Li, H., Loucks, O.L. (Eds.), 2006. Scaling and Uncertainty Analysis in Ecology: Methods and Applications. Springer, Berlin.
Zipf, G.K., 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.

Chapter 5

Data Preparation

Kristian Henrickson*, Filipe Rodrigues† and Francisco Camara Pereira†

*Department of Civil and Environmental Engineering, University of Washington, Seattle, WA, United States; †Department of Management Engineering, Technical University of Denmark (DTU), Lyngby, Denmark

Chapter Outline
1 Introduction 73
2 Tools and Techniques 74
  2.1 Scripting and Statistical Analysis Software 74
  2.2 Database Management Software 76
  2.3 Working With Web Data 79
3 Probe Vehicle Traffic Data 81
  3.1 Formats and Protocols 81
  3.2 Data Characteristics 83
  3.3 Challenges 85
  3.4 Data Preparation and Quality Control 88
4 Context Data 95
  4.1 The Role of Context Data 95
  4.2 Types of Context Data 96
  4.3 Formats and Data Collection 99
  4.4 Data Cleaning and Preparation 99
References 102

1 INTRODUCTION

This chapter is designed as an introduction to data preparation, with a focus on nonconventional data sources such as probe vehicles, the internet, and web services. These nonconventional data promise to support new analysis methods that can provide a better understanding of our transport systems, and they are rapidly making their way into mainstream transport engineering, analysis, and decision making. However, such data come with a number of new challenges associated with their size and/or complexity, new acquisition channels, and added quality-control considerations. In a general sense, data preparation is the process of reading, organizing, transforming, and quality checking raw data to turn it into something accessible and useful for subsequent analysis. Of course, the end goal is to support transport analysis that accurately predicts or explains some aspect of the physical transport system. This goal is best served by a data preparation framework that is designed and documented to ensure that the result is truly representative of the underlying system, and that the processing steps applied are transparent and interpretable to the end user. Developing such a framework requires an understanding of the nature of these emerging data sources and associated

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00005-1 © 2019 Elsevier Inc. All rights reserved.

challenges. Toward this end, this chapter also serves as a broader introduction to several important and increasingly prevalent transport data sources. Additionally, extracting, structuring, and preparing these data requires familiarity with the standard tools and methods for dealing with large-scale structured and unstructured datasets. Providing an introduction to these data sources, and to the essential tools and methods required to prepare them for analysis, is the goal of this chapter. The chapter is organized as follows. First, we describe some popular software tools for data management, processing, and analysis, and provide an introduction to web data formatting and extraction. Next, we introduce and discuss probe-vehicle-based traffic data, with an emphasis on formatting and location referencing, data sources and characteristics, and quality control methods. Finally, we introduce context data, that is, data relating to the physical and human context in which the transportation system operates. This includes weather, web and social media, and event data, with a particular emphasis on internet sources and web services.

2 TOOLS AND TECHNIQUES

This section briefly introduces some important software tools and methods for managing and analyzing data relevant to traffic analysis. Specifically, we introduce popular scripting and analysis software, database management systems, and common markup languages and tools for text and web data. We emphasize free and/or open source tools, as these typically provide the cheapest, most interoperable, and generally most accessible options for getting started with nonconventional transportation data.

2.1 Scripting and Statistical Analysis Software

A great variety of programming/scripting languages and analytical tools are available. Selecting the best alternative for a particular application will depend on a range of factors, including the expertise of the development team or individual, the development vs execution time trade-off, and the features of the analytical libraries currently available. Here we introduce a few popular choices for data processing and statistical analysis, with a focus on open source tools.

2.1.1 Python

Python is a popular object-oriented interpreted scripting language, and has been widely applied in statistical computing, machine learning, web development, and other applications. Having been designed to facilitate rapid development and human-readable code, Python is dynamically typed, relatively easy to learn, and very flexible. Python is open source and free (even for commercial applications), and it supports a wide variety of data processing, analysis, and visualization tools through third-party packages. Compared to R, Python is more of


a general-purpose language, and may be preferable when performance becomes an issue or when data processing/analysis code needs to be integrated into a larger software framework. Some popular data analysis and statistical packages are listed below.

Pandas is a data analysis and modeling library that is very useful for analyzing and summarizing tabular data. At the core of Pandas are the DataFrame and Series data structures, which simplify restructuring, aggregating, indexing, and summarizing large datasets with mixed data types.

Scikit-learn is a statistical and machine learning library which offers a variety of powerful tools for regression, classification, clustering, and other methods.

Statsmodels is another library for statistical analysis and processing and, compared to scikit-learn, is focused more on statistical methods than on machine learning.

SciPy is a general-purpose library for scientific and engineering computing. In addition to core functionality such as numerical integration and optimization, SciPy anchors an ecosystem of Python packages that includes Pandas, NumPy (a popular numerical computing library), and Matplotlib (a general-purpose plotting library).
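To make the DataFrame description concrete, here is a hedged sketch (the column names and values are invented for the example) that groups a small table of travel-time observations and summarizes each group:

```python
import pandas as pd

# Hypothetical probe-vehicle records: segment IDs and observed travel times.
records = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "travel_time_s": [62.0, 58.0, 120.0, 131.0, 125.0],
})

# Aggregate to one row per segment: mean travel time and observation count.
summary = (records
           .groupby("segment")["travel_time_s"]
           .agg(["mean", "count"])
           .reset_index())
print(summary)
```

The same pattern scales to millions of rows, and `agg` accepts arbitrary aggregation functions, which is why Pandas is a natural first stop for summarizing probe data.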

2.1.2 R

R is an open source interpreted language focused on statistical analysis and visualization. It is widely used in data mining, statistics, and general data analysis and visualization tasks, and much of its core functionality and third-party packages are developed by the statistical community. R was initially developed for data analysis and statistical modeling, and it is supported by a large and active community of developers. Compared to Python, R offers much greater breadth of functionality in terms of new and novel statistical methods, domain-specific analysis tools, and data visualization; however, Python is more flexible and capable for general-purpose programming. A few popular R packages for general data analysis are introduced briefly below.

dplyr is a popular R package for manipulating and restructuring data frames (the core tabular data structure in R) and database tables.

stats is a general package for statistical calculations, and includes a variety of modeling tools supporting time series analysis, generalized linear models, and statistical testing. The stats package is part of the base R installation, and can be used in R "out of the box."

forecast is a popular R package for time series analysis and forecasting. This package includes the very useful auto.arima function for automatic order selection and training of a univariate ARIMA model.

ggplot2 is the most popular visualization package in R. Many of the details are abstracted away, which makes even complex plots relatively easy to construct.

2.1.3 MATLAB

MATLAB is a commercial mathematical computing environment and interpreted programming language. It is often compared with R due to its rich numerical and statistical computing functionality, though the MATLAB environment is preferable in some cases for general programming tasks. Packages in MATLAB are often more mature and robust than their R counterparts, but the software is quite expensive and each additional package represents added licensing costs. As is often the case for commercial software, the documentation and developer support for MATLAB tend to be better than those for open source alternatives. As a general-purpose programming language, MATLAB is not as flexible, interoperable, or, in some cases, fast as Python. Both R and Python can achieve numerical computing performance comparable to that of MATLAB, though this varies significantly depending on which tools/packages are used and the user's ability to parallelize and optimize code.
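As a hedged illustration of the optimization point above (no timings are claimed, and the speed values are invented), NumPy lets an interpreted Python loop be replaced by a single vectorized array operation executed in compiled code:

```python
import numpy as np

speeds_kph = np.array([88.0, 92.5, 60.0, 101.0])

# Loop version: one interpreted Python operation per element.
as_mph_loop = [s * 0.621371 for s in speeds_kph]

# Vectorized version: the whole array is converted in one call.
as_mph_vec = speeds_kph * 0.621371

# Both produce the same numbers; the vectorized form scales far better.
assert np.allclose(as_mph_loop, as_mph_vec)
```

This style of vectorization is the main lever by which Python and R reach MATLAB-like numerical performance.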

2.2 Database Management Software

A database management system (DBMS) is an indispensable tool for storing, organizing, and interacting with large volumes of data. In addition to providing an interface for programmatic remote access to data, a DBMS offers a range of desirable features, including transaction management, user access controls, and others. A relational DBMS, or RDBMS, is the most common type of DBMS, where the word relational refers to the relationships between different entities represented in the database. A relational DBMS is a structured data management framework, and it enforces whatever structure is specified by the user. For this reason, database design, or data modeling, is a crucial step in the development of any database with nontrivial structure. A data model describes the structure, relationships, and constraints in a database, and should be developed prior to building and populating tables. A data model is important both to plan and clearly specify the structure of a database, and to communicate this structure to potential users such as programmers and analysts. Using domain knowledge and information provided by the data provider, a data model should be designed with consideration of query efficiency, indexing, data integrity under ongoing inserts, updates, and deletions, and possible future changes to the geographic scope of the data. Here we discuss some of the most popular database solutions and their associated strengths and weaknesses. With the exception of Microsoft SQL Server, all of the relational database management software described below runs on both Windows and Linux (Linux support for Microsoft SQL Server was added in the 2017 version).
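To make the data-modeling discussion concrete, here is a hedged sketch using Python's built-in sqlite3 module (the table and column names are invented for the example); it declares two related tables, with a foreign key expressing the relationship between detectors and their measurements:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# A minimal data model: each speed measurement references one detector.
cur.executescript("""
CREATE TABLE detector (
    detector_id INTEGER PRIMARY KEY,
    roadway     TEXT NOT NULL,
    milepost    REAL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    detector_id    INTEGER NOT NULL REFERENCES detector(detector_id),
    measured_at    TEXT NOT NULL,
    speed_kph      REAL CHECK (speed_kph >= 0)
);
CREATE INDEX idx_measurement_detector ON measurement(detector_id);
""")

cur.execute("INSERT INTO detector VALUES (1, 'I-5', 167.3)")
cur.execute("INSERT INTO measurement VALUES (1, 1, '2018-03-01T08:00', 92.5)")
conn.commit()

# The relationship declared in the model drives the join at query time.
row = cur.execute(
    "SELECT d.roadway, m.speed_kph FROM measurement m "
    "JOIN detector d ON d.detector_id = m.detector_id"
).fetchone()
```

The same schema, expressed in the SQL dialect of MySQL or PostgreSQL, would serve as the starting point for the larger deployments discussed below.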


2.2.1 MySQL

MySQL is a very popular relational database management system (RDBMS) maintained by Oracle, and is available in both commercial and free (under the General Public License) versions. The open source Community edition does not include enterprise features such as encryption or sophisticated thread management, but it is an easy-to-use and surprisingly powerful solution for a variety of database applications. Current versions support geospatial data types and spatial indexing. Being relatively simple to set up and use, MySQL Community edition is recommended for those getting familiar with the use of an RDBMS, and in general for transactional rather than analytical applications. For more complex databases, or those requiring more sophisticated analytical capabilities, it may become necessary to switch to a more powerful solution such as PostgreSQL.

2.2.2 PostgreSQL

PostgreSQL is among the most powerful open source RDBMSs available and, while not as popular as MySQL, it is widely used in commercial applications. PostgreSQL supports geospatial data types and indexing through the PostGIS extension, and offers excellent analytical query capabilities compared to other open source alternatives (e.g., MySQL). PostgreSQL is recommended as a general-purpose RDBMS, despite being somewhat more complex to set up and use, and in particular for complex databases or those requiring more powerful analytical features. Further, the geospatial capabilities available in PostGIS make PostgreSQL one of the best, if not the best, RDBMS solutions for managing and analyzing spatial data.

2.2.3 Commercial DBMS

There are a number of very powerful commercial DBMSs available, some of which provide very useful tools and substantial performance advantages over open source alternatives. For example, Microsoft SQL Server provides excellent versioning tools, sophisticated multithreading, and of course excellent support for other Microsoft products such as Azure cloud services. Other enterprise-scale commercial options include Oracle RDBMS, IBM DB2, and Sybase ASE. In general, commercial DBMSs are quite expensive, and may not provide significant advantages for many users. Microsoft Access is a less expensive and user-friendly option that is often used in business environments, but comes with a number of limitations (most notably a database size limit of 2 GB) that make it less applicable for dealing with large datasets. Commercial DBMSs have the advantage of support provided by the manufacturer, though open source alternatives often have a very active and helpful user base.

2.2.4 NoSQL Data Management

NoSQL, or "Not only SQL," is a broad concept encompassing a range of database technologies. In general terms, NoSQL data systems are designed to support the management and analysis of very large data, high-velocity streaming data, and/or data that does not fit the conventional relational model. Much of the current generation of distributed and NoSQL data processing and management software is designed to allow databases to scale out rather than scale up. That is, rather than purchasing a small number of very powerful servers, the storage and processing load can be distributed across a large number of commodity computers. Such systems are designed for redundancy and easy recovery, making it possible to continue operating as normal after the loss of one or even several machines in the network. The benefits of a NoSQL data system vary by application and software, but generally include rapid scaling, high data throughput, built-in redundancy and recovery, and cost advantages associated with the ability to scale incrementally rather than make big upfront hardware investments. However, such systems have a number of characteristics that may be more or less problematic depending on the application. For example, NoSQL data systems often place less emphasis on data governance, and store data in an unstructured or variable-schema form that may lead to quality and consistency issues. This may be an advantage for applications that need to ingest large volumes of data with variable or undefined structure, but undesirable for maintaining the integrity of well-structured data. Also, unlike most relational DBMSs, many NoSQL systems do not support full ACID (Atomicity, Consistency, Isolation, and Durability) compliance. Such systems are described as "eventually consistent," which means that the result of a query may not reflect the most recent changes to the data.
Other possible drawbacks of NoSQL databases include lower-level query languages and limited support for complex queries and joins. Examples of NoSQL databases include column stores like Apache Cassandra and Amazon Redshift, document databases such as MongoDB and CouchDB, and key-value stores like Berkeley DB and Amazon DynamoDB. The easiest and most cost-effective way to get started with NoSQL is on a managed cluster through one of the several available platform-as-a-service providers. For example, Amazon provides a variety of NoSQL managed server options on AWS, as does Microsoft through Azure. However, unless one has access to an existing computing cluster and the staff expertise needed to build and use NoSQL databases, significant upfront monetary and time investment will be needed to get started with NoSQL at scale. With careful indexing and a database server of reasonable performance, conventional relational database software will perform adequately for nearly any scale of interest for transportation analysis.
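To make the schema flexibility discussed above concrete, the following Python sketch mimics how a document store accepts records with variable structure. This is an illustrative in-memory stand-in, not any vendor's API; the class and field names are invented for the example.

```python
# Minimal sketch of a schemaless document store: records with different
# fields coexist side by side, which is convenient for ingestion but
# pushes consistency checking onto the application.
class DocumentStore:
    def __init__(self):
        self._docs = {}  # key -> document (a plain dict)

    def put(self, key, document):
        self._docs[key] = document

    def get(self, key, default=None):
        return self._docs.get(key, default)

store = DocumentStore()
# Two traffic records with entirely different schemas are both accepted.
store.put("rec1", {"tmc": "110N04754", "speed_mph": 54.2})
store.put("rec2", {"segment": "I-90 EB", "travel_time_s": 312, "source": "probe"})

print(store.get("rec1")["speed_mph"])  # -> 54.2
```

A relational DBMS would reject the second record unless its table schema anticipated those columns; here nothing enforces that the two records can be analyzed together.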

Data Preparation Chapter 5

2.3 Working With Web Data

When working with web data, the most popular data formats that one is likely to encounter are XML, JSON, HTML, and plain natural language text. XML is a markup language that defines a set of rules for encoding objects in a format that is both human-readable and machine-readable. JSON is an open-standard format that also seeks human and machine readability for encoding objects. However, in JSON these objects consist of attribute-value pairs. Table 1 shows a sample of weather data encoded in the XML and JSON formats. XML and JSON formats will be encountered mostly when gathering data from RSS feeds or APIs. In order to retrieve data from an RSS feed or an API, the first step is to construct the request URL with all the desired parameters. This varies considerably from source to source. Therefore, a careful read of the documentation is necessary. Moreover, most APIs require users to sign up for an API key, which in many cases is free. Given a URL address with the request, it is then trivial to retrieve the data corresponding to the request using, for example, a popular programming language like Python. In Python 3.5, retrieving the data for the upcoming events in New York City can be done in just two lines of code using the requests library (Reitz, 2018) as shown in Fig. 1. The retrieved data then needs to be parsed in order to make it machine-readable. Fortunately, many of the most common programming languages

TABLE 1 XML and JSON Examples

XML:

<weather>
    <timestamp>2017-01-20 12:28:32</timestamp>
    <condition>Sunny</condition>
    <temperature>10 °C</temperature>
    <humidity>85%</humidity>
    <wind>NNE 2 mph</wind>
</weather>

JSON:

{
    "weather": {
        "timestamp": "2017-01-20 12:28:32",
        "condition": "Sunny",
        "temperature": "10 °C",
        "humidity": "85%",
        "wind": "NNE 2 mph"
    }
}

FIG. 1 Example of an API request in Python.
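The figure itself is not reproduced in this text. A sketch of the kind of request it describes is shown below; the endpoint URL, parameter names, and API key are hypothetical, and the actual two-line version would simply pass the parameters to requests.get. The standard library is used here so that the sketch runs without network access.

```python
from urllib.parse import urlencode

# Hypothetical events API endpoint; real providers document their own
# URLs, query parameters, and API-key requirements.
base = "https://api.example.com/events"
query = urlencode({"city": "New York", "apikey": "YOUR_KEY"})
url = base + "?" + query
print(url)
# With the requests library (Reitz, 2018), the retrieval itself is:
# resp = requests.get(base, params={"city": "New York", "apikey": "YOUR_KEY"})
```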

80 PART

I Methodological

FIG. 2 Example of JSON parsing in Python.

FIG. 3 Example of XML parsing in Python.
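The figures are not reproduced in this text; the parsing they describe can be sketched as follows using only the standard library (xml.etree.ElementTree stands in for BeautifulSoup in the XML case so the sketch is self-contained). The payload is the Table 1 weather sample.

```python
import json
import xml.etree.ElementTree as ET

# Parse the JSON variant from Table 1 and read out the temperature.
json_text = ('{"weather": {"timestamp": "2017-01-20 12:28:32", '
             '"condition": "Sunny", "temperature": "10 °C", '
             '"humidity": "85%", "wind": "NNE 2 mph"}}')
data = json.loads(json_text)           # nested dicts
print(data["weather"]["temperature"])  # -> 10 °C

# Parse the XML variant and read out the same value.
xml_text = ("<weather><timestamp>2017-01-20 12:28:32</timestamp>"
            "<condition>Sunny</condition><temperature>10 °C</temperature>"
            "<humidity>85%</humidity><wind>NNE 2 mph</wind></weather>")
root = ET.fromstring(xml_text)         # element tree
print(root.find("temperature").text)   # -> 10 °C
```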

provide good support for this. As an example, suppose that the "resp" object contains the JSON content in Table 1. Using the "json" package of Python (Python Software Foundation, 2018), one can easily process the JSON to obtain and print out the temperature using just 3 lines of code, as shown in Fig. 2. The same could be done in the case of the XML version of the data, by using the "BeautifulSoup" package from Python (Richardson, 2017), as shown in Fig. 3.

As the examples above demonstrate, it is fairly straightforward to access an RSS feed or an API. However, many websites don't provide such good interfaces. In these cases, one might need to resort to a technique called "screen scraping." In this context, screen scraping refers to the process of obtaining the source HTML code of a web page, parsing it, and extracting the desired content from it (although the legal terms of the website may apply). If using Python, one can once again rely on the "BeautifulSoup" package for this procedure, although the difficulty of scraping the contents of a web page depends significantly on how well structured it is. For example, for a properly structured website like AllEventsIn, retrieving the description of an event can be as easy as the example shown in Fig. 4. Unfortunately, for poorly structured websites, extracting the relevant information from the source HTML code can be quite cumbersome. To make things worse, the structure of a website can change rather frequently. Hence, a scraper that works perfectly today may stop working tomorrow.

In many cases, the use of regular expressions can significantly ease the process of scraping the contents of a web page. A regular expression (regex or regexp for short) is a special text string for describing a search pattern. Regexes make use of special characters or symbols to represent certain patterns. For example, the symbol "." is used to match any character in a string of text, and the symbol "*" is used to indicate that the previous pattern can repeat many times or be absent. By exploiting these and other features of regexes, one can,

FIG. 4 HTML scraping example in Python.
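The figure is not reproduced in this text. It uses BeautifulSoup, but the same idea can be sketched with the standard library's html.parser; the page snippet and the "event-description" class name below are invented for illustration, and the parser is deliberately simplified (it does not handle nested divs).

```python
from html.parser import HTMLParser

class EventDescriptionParser(HTMLParser):
    """Collect the text inside a <div class="event-description"> element."""
    def __init__(self):
        super().__init__()
        self._inside = False
        self.description = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "event-description") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "div":          # simplified: any closing div ends capture
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.description.append(data)

page = ('<html><body><div class="event-description">'
        'Street parade, Saturday 2 pm.</div></body></html>')
parser = EventDescriptionParser()
parser.feed(page)
print("".join(parser.description))  # -> Street parade, Saturday 2 pm.
```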

Data Preparation Chapter

5

81

for example, use the following regex to match all the emails in a text: "[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}". For further details on how to use regexes, please refer to Goyvaerts (2016). Using regexes in Python is quite easy through the "re" package.
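For instance, the email pattern above can be applied with the "re" package as in this short sketch (the sample text is made up; re.IGNORECASE lets the uppercase character classes match lowercase addresses):

```python
import re

# Case-insensitive use of the email pattern discussed above.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE)

text = "Contact alice@example.com or bob.smith@transport.dk for details."
emails = EMAIL_RE.findall(text)
print(emails)  # -> ['alice@example.com', 'bob.smith@transport.dk']
```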

3 PROBE VEHICLE TRAFFIC DATA

Mobile GPS and communications devices are an increasingly popular source of link-level speed and travel time data for transportation planning and analysis. A number of commercial providers offer a version of this type of data, typically as a combination of road link definitions, GIS file(s) specifying road link locations and topology, and a collection of travel time or speed measurements for each road link. Two of the world's leading providers of probe vehicle-based traffic data products are INRIX, based in the United States, and HERE, based in the Netherlands. Throughout this discussion, repeated references to the National Performance Management Research Data Set (NPMRDS) will be made. This is a large probe vehicle-based traffic dataset covering the national highway system in the United States, acquired by the Federal Highway Administration for use in performance measurement and management. The first version of the NPMRDS was provided by HERE North America, but since mid-2017 it has been provided by a consortium of academic and industry partners including INRIX. Rather than describe the specific product offerings from different providers, which in general will be determined by a procurement agreement, a general introduction to the concepts, terminology, and data characteristics is given here.

3.1 Formats and Protocols

Historical traffic data will typically be delivered in the form of large text files containing traffic records, GIS shapefile(s) containing the road network, and TMC lookup table(s) describing the TMC segments and their relationship to the geospatial data. Due to the size and relational structure of link-level speed and travel time data, spreadsheets and text files are not suitable data management and analysis tools. In most cases, relational database management software with geospatial extensions should be used, along with scripting and data analysis tools such as Python, R, and/or Matlab. Real-time speed and travel time data will typically be delivered via web service in the form of JSON or XML, as a response to a user request. Such data will typically require some scripting to retrieve and format for storage, analysis, and presentation. With a relational table structure and industry-standard units, working with commercial traffic data is similar to working with more conventional traffic data sources (e.g., fixed mechanical traffic sensors). The primary difference, and source of complexity when integrating such data into an existing traffic database, is in the location referencing methods used. While conventional traffic data is often encoded using linear referencing, commercial probe vehicle

82 PART

I Methodological

data is generally referenced to locations using a combination of (a) predefined road segments defined in geospatial terms, and (b) traffic records which are linked to road segments through road link-specific codes. The Traffic Message Channel (TMC) is the most widely used location encoding standard in commercial traffic data, and nearly all providers offer TMC-based products. As an alternative to the TMC standard, TomTom develops and maintains the OpenLR standard for location referencing, an open source standard which offers greater coverage and flexibility than TMC-based referencing. Some vendors develop and maintain their own location referencing tables beyond the TMC standard (e.g., the OpenLR-compliant INRIX TMC XD segments), and in some cases use an alternate encoding in addition to TMC codes to enable sub-TMC resolution. A brief discussion of the two leading standards for traffic data location encoding is given in the following two subsections.

3.1.1 TMC Codes

Common to this type of data is the use of Traffic Message Channel (TMC) codes, unique identifying codes used to assign traffic information to specific points or segments on a road network. Initially developed for encoding and transmitting traffic information on radio side bands, TMC codes are now widely used in a variety of wireless and web-based traffic information systems. The Traveler Information Services Association (TISA) maintains the TMC standard, and is responsible for making sure that new TMC definitions conform to this standard. TMC definition tables (conforming to TISA standards) are maintained by various organizations for many countries around the world, and are licensed to map makers and data vendors for use in their services. Due to the use of different TMC tables, as well as the flexibility in the geometric interpretation of a TMC definition, significant differences may be present in the TMC segment geometry used by different providers. TMC codes are used to assign traffic data to a road segment or point both for real-time web service calls and in archived historical data. For a set of geospatial data describing a road network, a TMC may refer to one or more consecutive road segments, and each segment may be associated with zero or more TMCs. For example, the NPMRDS includes a GIS shapefile describing the traffic network and a TMC lookup table relating TMCs to road segment IDs in a many-to-many relationship. A few important notes about the use of TMC tables for encoding and maintaining traffic data are given below.

1. TMC tables used by a specific data vendor are proprietary, and do not cover every road segment. Typically, coverage is greatest on freeways and decreases for each subsequent functional class.
2. TMC length varies significantly, even within a single area and roadway functional class. There is pressure to develop TMCs with finer granularity, but once a TMC is defined, TISA standards require that it only be removed if the corresponding road segment is taken out of service.

Data Preparation Chapter

5

83

3. TMC support for freeway ramps varies between providers, and in most cases traffic data will not be available for all ramps.
4. As with any such standard, significant time is required to develop, approve, and implement new TMC definitions. This is often cited as a drawback of TMC-based encoding.
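The many-to-many relationship between TMCs and road segment IDs described above can be illustrated with a small lookup-table sketch; the TMC codes and segment IDs below are made up, and the table would normally come from the vendor's lookup file rather than a literal list.

```python
from collections import defaultdict

# Hypothetical lookup table: each row links one TMC code to one segment ID.
# A TMC may map to several segments, and a segment to several TMCs.
lookup = [("110N04754", "seg_001"),
          ("110N04754", "seg_002"),
          ("110N04755", "seg_002")]

tmc_to_segments = defaultdict(list)
segment_to_tmcs = defaultdict(list)
for tmc, seg in lookup:
    tmc_to_segments[tmc].append(seg)
    segment_to_tmcs[seg].append(tmc)

# Assigning a speed record reported for one TMC to every segment it covers:
record = {"tmc": "110N04754", "speed_mph": 48.0}
per_segment = {seg: record["speed_mph"]
               for seg in tmc_to_segments[record["tmc"]]}
print(per_segment)
```

In practice this join is done in the relational database, but the structure is the same: traffic records key on TMC, geometry keys on segment ID, and the lookup table bridges the two.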

3.1.2 Open Location Referencing

TomTom launched the Open Location Referencing (OpenLR) project in 2009 to develop an open source standard for location referencing, and has since been pushing for widespread industry adoption. Unlike TMC representation, OpenLR enables dynamic location encoding and a great deal of flexibility for representing different types of geospatial data including 2-dimensional shapes, points, and nonroad lines (TomTom, 2012). In OpenLR, a road segment is represented as a series of waypoints that can be combined, along with bearing and road attributes (functional class and way type), to represent a travel path. Even in a system that uses OpenLR for location encoding and exchange, the data are represented fundamentally as road link-level records (possibly with greater coverage and granularity than TMC encoding). OpenLR has a number of advantages over TMC encoding. Here we list a few key considerations for data encoded using this standard.

- OpenLR segments can represent any road segment, as well as geographic features that are not associated with the road network. That said, the road segments and time periods for which traffic data is available will still vary by vendor and product.
- OpenLR can represent the traffic network at higher granularity than TMCs in general, but the resolution of the underlying data is still limited to that of the database maintained by each vendor.
- OpenLR is a dynamic standard, and as such requires no time to develop and approve a new segment definition, as is the case for TMC segments. However, a certain amount of work and time delay should be expected for a given vendor to include a new road segment in their traffic data products.
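To make the waypoint-based representation more concrete, here is a sketch of an OpenLR-style location reference as a plain data structure. The field choices follow the general description above (waypoints with bearing, functional class, and way type), not the actual OpenLR binary format, and the coordinate values are invented.

```python
from dataclasses import dataclass

@dataclass
class LocationReferencePoint:
    """One waypoint in an OpenLR-style path reference (illustrative fields)."""
    lon: float
    lat: float
    bearing_deg: int            # approximate bearing of the path at this point
    functional_road_class: int  # coarse road-importance class
    form_of_way: str            # way type, e.g., "SINGLE_CARRIAGEWAY"
    dist_to_next_m: int         # distance along the path to the next point (0 at the end)

path = [
    LocationReferencePoint(12.568, 55.676, 45, 2, "SINGLE_CARRIAGEWAY", 820),
    LocationReferencePoint(12.579, 55.681, 50, 2, "SINGLE_CARRIAGEWAY", 0),
]
total_length_m = sum(p.dist_to_next_m for p in path)
print(total_length_m)  # -> 820
```

A decoder matches each waypoint to its own map using the coordinates plus the bearing and road attributes, which is what makes the reference map-independent.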

3.2 Data Characteristics

3.2.1 Data Sources

Data sources vary between different vendors and within vendors by product. In cases where multiple sources are combined to create a single aggregate record, the exact source(s) of any particular record may not be presented to the end user. Instead, users will be given a single aggregate value representing the vendor's best guess at the true traffic state. The primary sources of commercial traffic data are briefly summarized below.

84 PART

I Methodological

3.2.1.1 Mobile Location Services

For certain providers, much of the data they collect is from mobile location services. That is, when an individual uses a contributing mobile phone application and agrees to share their location data, this data is automatically shared with the owners of the application and used to generate traffic statistics. This type of crowdsourcing is used by nearly all commercial traffic data vendors, though the set of applications contributing data will vary between vendors. In some cases, application users are provided with some mechanism to opt out of data reporting.

3.2.1.2 Consumer GPS Devices

Consumer GPS units, including both stand-alone units and in-vehicle systems, provide a significant percentage of the traffic data available from commercial vendors. TomTom, for example, is a well-established manufacturer of consumer GPS units, and both their historical and real-time data products are obtained (at least in part) through such devices (TomTom, 2011, 2014). Users contribute data by uploading their device GPS logs or in real time via communications-enabled GPS units. Although commercial vendors are not always entirely forthcoming about all of their data sources, both HERE and INRIX source user GPS data through partners in the automotive and consumer GPS industries (also note that HERE is majority owned by a consortium consisting of BMW, Daimler, and Volkswagen).

3.2.1.3 Commercial Vehicle Transponders

Commercial vehicle fleets were one of the first sources of commercial probe vehicle data and, despite the ubiquity of mobile computing and consumer GPS, remain a crucial resource. Commercial fleets including taxis, shuttles, delivery vehicles, and freight trucks are often equipped with transponders which often contain a GPS receiver. By contracting with fleet operators, some commercial traffic vendors are able to obtain the data generated by these transponders. In addition to providing a reliable and accurate source of vehicle GPS data, transponder data can target a specific vehicle subpopulation. For example, through a partnership with the American Transportation Research Institute, freight truck transponder data was used to generate truck-specific travel times that were included in the NPMRDS.

3.2.1.4 Other Sources

Fixed traffic sensors, including inductance loops, traffic cameras, and others, are listed among the data sources used by several commercial data vendors. For example, several traffic data vendors collect incident and congestion information based on manual observation of traffic cameras, public agency alerts, and emergency reporting systems. HERE runs several reporting centers in the United


States and Europe to collect such data, while TomTom contracts with third parties for this purpose.

3.2.2 Granularity

As noted previously, TMC segment lengths vary significantly even within a single region and roadway functional class. Most vendors have made efforts to move beyond the limitations of the TMC representation to achieve greater coverage, granularity, and responsiveness. In 2013, INRIX introduced the TMC XD encoding, which is maintained exclusively by INRIX (though it does conform to the OpenLR standard) and offers greater coverage and granularity compared to standard TMC definitions. While TMC segments are often several miles or more in length, TMC XD segments have a maximum length of 1.5 miles in the United States (INRIX, 2014). TomTom products are offered in both TMC and OpenLR encoding. Although HERE uses TMC encoding, they utilize an offset scheme to provide additional flexibility and detail. As of 2016, the NPMRDS contained 385,000 TMCs with an average length of 1.74 miles (Pu, 2018).

3.2.3 Vendor Quality Control and Imputation

The quality and completeness of link-wise probe vehicle data vary between vendors and procurement processes. For example, the standard product offered by INRIX provides a fully complete data series for all TMCs, part of which represents imputation performed by the company. On the opposite end of the spectrum, the NPMRDS was specified in the procurement contract to contain only measured values, with no imputation or processing applied. Because of this, outliers are present and a great number of observations are missing. Unprocessed and incomplete data may seem undesirable at first glance, but it allows the user to retain more control over the quality control, imputation, and uncertainty estimation, which can result in more informed analysis and decision making. Some providers do include information that can aid in quality assessment in both web service responses and archived historical data. For example, in the I-95 vehicle probe project carried out by the University of Maryland, a confidence metric was developed which considers vehicle density and the likelihood of the current observation conditioned on the previous 45 min and on a longer time window based on historical data (INRIX, 2014). However, much of the processing applied to commercial data sets is typically not presented to the end user, which complicates any uncertainty and error assessment.

3.3 Challenges

3.3.1 Completeness

Data completeness, or missing data, is a significant challenge in a variety of scientific and engineering analyses. Probe vehicle data in particular relies on


observations from contributing vehicles on the transport network, and so has unique missing data challenges compared to, for example, mechanical sensor data. This primarily relates to the fact that higher-volume, slower traffic conditions are more likely to be represented in the dataset, and so there is the potential for differences in speed or travel time distributions between the missing and observed data. For most commercial probe vehicle data, the number and granularity of TMC-based observations have increased significantly in the last 5 years, as has the per-month average data completeness (Hosuri, 2017). This was illustrated by Cambridge Systematics and Texas Transportation Institute (2015), who analyzed NPMRDS travel time data for a set of 10 eastern US states from 2014. They showed that travel time data tends to be most complete during daytime hours and least complete during the night/early morning hours. For the states analyzed in the Cambridge Systematics study, interstate highway segments were approximately 58% complete for all time periods and vehicle types, and 22% complete for all other highway classes in 2014. Kim and Coifman (2014) analyzed probe vehicle data from INRIX and compared it with loop detector data on a single corridor over a period of 2 months in 2014. They found that, despite data being complete for all sampling periods, there were a great number of instances of repeated measurements. This suggests that, at least at the temporal resolution at which the data is provided, the completeness was significantly lower than 100%. Regardless of whether imputation is completed by a commercial data provider, by the end user, or not at all, it is clear that missing data will have a significant impact on the accuracy and usefulness of this data source, which underscores the importance of how such missing values are dealt with.

Previous research has shown that the quantities of interest, both missing and observed, are correlated with the probability of being missing. Further, for data that is observed, there is a clear relationship between penetration rate, sampling frequency, and accuracy of the derived traffic measures (Bucknell and Herrera, 2014; Patire et al., 2015). All else being equal, longer segments are less likely to produce missing traffic records, but provide lower granularity in all records.
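Completeness of this kind can be quantified as the ratio of observed to expected records per segment, as in the following sketch. A 5-minute reporting interval and the per-TMC record counts are assumed for illustration.

```python
# One day of 5-minute epochs gives 288 expected records per TMC.
EPOCHS_PER_DAY = 24 * 60 // 5

# Hypothetical observed record counts per TMC for one day.
observed = {"110N04754": 213, "110N04755": 97, "110N04756": 288}

completeness = {tmc: n / EPOCHS_PER_DAY for tmc, n in observed.items()}
for tmc, share in sorted(completeness.items()):
    print(f"{tmc}: {share:.1%} complete")
```

In practice the same ratio would be computed by time of day and day of week, since as noted above completeness is systematically lower overnight.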

3.3.2 Data Quality and Accuracy

Due to the commercial nature of third-party probe vehicle data, there are some restrictions on how validation and comparative analysis can be conducted and presented. Here, we summarize some previous work on the accuracy of probe vehicle data to give a general sense of the challenges associated with this type of data. One notable study, part of the I-95 Corridor Coalition Vehicle Probe Project conducted by the University of Maryland, assessed the accuracy and bias of commercial probe vehicle data on different road types in the eastern United States (Young et al., 2015). The findings of this work indicate that probe vehicle


data is most reliable on roadways with low signal density, low access point frequency, and relatively high annual average daily traffic (AADT). The authors of this study state that probe vehicle data is not suitable for roadways with signal density greater than two signals per mile or with low annual average daily traffic (AADT).

Multiple imputation addresses missing data by generating m > 1 replacement values, resulting in m complete datasets. Each of the imputed datasets is then analyzed using complete-data methods, and the results are combined to give a final estimate and confidence bounds for the desired quantities (model parameters, predicted values, etc.) which incorporate both the uncertainty inherent in the complete data and the uncertainty due to the missing values. There are a number of multiple imputation libraries in R, although in most cases the treatment of time series data is limited to including lag terms for the variables of interest. The Amelia library in R provides access to an expectation maximization algorithm for generating multiple imputations, and offers some tools for working with time series (limited in this case to lagged variables and polynomial/spline fitting; Honaker et al., 2016). The MICE R library performs multiple imputation by chained equations (hence the name MICE), an iterative imputation method which, unlike joint modeling approaches such as that offered by the Amelia package, does not assume that the variables of interest follow a specified joint distribution (Buuren and Groothuis-Oudshoorn, 2011). This allows a good deal of flexibility in model specification by allowing different models to be specified for each variable, and a variety of common prediction models for real-valued, nominal, and ordinal data types are supported.
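The combination step of multiple imputation follows Rubin's rules, sketched here in plain Python for m point estimates and their within-imputation variances (the numeric values are made up for illustration):

```python
from statistics import mean

# Parameter estimates and variances from m = 5 imputed datasets (made-up values).
estimates = [2.10, 2.30, 2.05, 2.25, 2.15]
variances = [0.04, 0.05, 0.04, 0.06, 0.05]
m = len(estimates)

pooled = mean(estimates)   # pooled point estimate
within = mean(variances)   # within-imputation variance W
between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # between-imputation variance B
total_var = within + (1 + 1 / m) * between  # Rubin's total variance T = W + (1 + 1/m) B

print(round(pooled, 3), round(total_var, 4))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; if the m estimates agree closely, the total variance collapses toward the ordinary complete-data variance.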

4 CONTEXT DATA

4.1 The Role of Context Data

The increasing availability of sensor data, such as that described in Section 3, provides a unique potential for developing great tools for traffic management and traveler decision making. However, the complex role of human behavior in the transportation system demands considerations that might not be captured with sensors that are focused on the network or vehicles. For example, the traffic manager needs to understand why certain congestion forms (Is it an incident? A special event? A religious ceremony? Weather? School pick-up/drop-off?)


and to predict how it will evolve. A special event leads to different patterns and management procedures than an incident or a flooding event. In other words, besides knowing that a problem exists, traffic managers and prediction systems need to know its context. We define context as any available semantic information that can be associated with observations from the traffic-sensing system (for example, cameras, loop counters, and GPS probes). Context can be important to explain and help predict many transport-related phenomena. For example, a sudden demand peak in an area can be due to special events, religious activities, political demonstrations, or street fairs; general demand pattern changes can be associated with school holidays; and nonrecurrent supply changes can be caused by incidents, roadworks, road blockages, and harsh weather. For example, Pereira et al. (2015) use contextual data from the internet to explain nonhabitual transport overcrowding. From a somewhat different perspective, context can be used to analyze aspects that cut across behavior and transport, such as wellbeing (for example, sentiment analysis on public transport) or the environment (online reports on emissions).

Context mining is therefore complementary to traffic-sensing technologies. While the latter provide information on what is happening in traffic, the former helps understand why. When properly aligned in space and time, they become essential to understand how traffic will evolve, by becoming inputs to transport-prediction algorithms. Knowing context is particularly relevant in nonhabitual scenarios. While in recurrent scenarios traffic managers and commuters are aware of their evolution and available options, in nonrecurrent ones they need good predictive capability to make decisions. Adding semantic causal information to the prediction process, together with observations, should contribute to improving its accuracy, especially in nonrecurrent scenarios.

4.2 Types of Context Data

There are several types of context data covering different external factors that can play a role in transport systems, such as information about weather, incidents, roadworks, road blockages, special events, natural disasters, acts of terrorism, etc. In the following, we describe a few of these data types individually and discuss some potential sources for obtaining this data.

4.2.1 Weather Data

Weather data is perhaps the most well-known type of contextual data. While in certain areas of the globe the weather has a small effect on the transport system, in other areas accounting for weather information is critical. For example, in colder regions accounting for the presence of snow or ice is essential from both a management (e.g., displaying alerts, dispatching snowplows or spreading salt) and a modeling perspective (e.g., predicting travel times). Similarly, it is well


known that precipitation affects road safety by increasing accident frequency and congestion, especially during peak hours (Koetse and Rietveld, 2009), and that dense fog conditions or the position of the sun in the sky can cause poor visibility conditions, thus making drivers slow down. Indeed, the weather can affect transport systems in a multitude of ways, from road travel times to, ultimately, people's mode choices.

There are two main sources for weather data: proprietary and online data. Proprietary data is typically collected by public agencies and usually consists of very detailed information about the weather conditions. This can include road surface temperature, relative humidity above the road surface, water film height, ice percentage, friction, etc. This data is often stored in a raw format in databases and used internally. On the other hand, online data is publicly available. There are several popular websites that provide application programming interfaces (APIs) to access this data for most places in the world, such as OpenWeatherMap (OpenWeatherMap Inc., 2018) or Wunderground (TWC Product and Technology LLC, 2018). Alternatively, one could rely on RSS feeds from local authorities, although their quality can vary quite significantly between different countries. The weather data from these sources usually comes in XML or JSON format.
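As a small illustration of turning such XML/JSON weather feeds into model inputs, the following sketch maps a JSON record to numeric features; the record structure and field names are invented for the example, not taken from any particular provider.

```python
import json

# Hypothetical weather observation as it might arrive from an online API.
weather_json = '{"condition": "Snow", "temperature_c": -3.5, "humidity_pct": 91}'
obs = json.loads(weather_json)

# Simple feature encoding: scaled humidity, a freezing indicator,
# and a one-hot flag for snow conditions.
features = {
    "temp_c": obs["temperature_c"],
    "humidity": obs["humidity_pct"] / 100.0,
    "is_freezing": int(obs["temperature_c"] <= 0),
    "is_snow": int(obs["condition"].lower() == "snow"),
}
print(features)
```

Features like these can then be aligned in time with the traffic records from Section 3 and fed to a prediction model.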

4.2.2 Incident, Roadworks, and Road Blockages Data

Traffic incidents have been identified as one of the major contributors to increased congestion, causing about one-quarter of the congestion on US roadways (Haas, 2006). They are estimated to cause more than 50% of the total delay experienced by motorists across all urban areas. Furthermore, for every minute that the primary incident remains a hazard, the likelihood of a secondary crash increases by 2.8% (Karlaftis et al., 1999). In order to support a timely response to incidents, traffic management centers establish workflows that consist of collecting information, analyzing it, and executing the chosen strategy, continuously using updated information to control traffic, disseminate information, and manage incident response resources. In this process, a lot of contextual information is gathered (usually in a database or in the form of logs), which can be used to enrich our understanding of the observed traffic behavior. This information is then (partially) made publicly available, usually through broadcasting systems such as official RSS feeds or social networks (e.g., Twitter or Facebook).

Like incidents, roadworks and road blockages are another major contributor to road congestion, since they can severely reduce road capacity and can last for weeks or even months. With the exception of roadworks and road blockages that are caused by an incident, this information is typically made publicly available online well in advance by official governmental websites or through the use of RSS feeds.


More recently, with the development of community-driven location-based services like Waze (Waze Mobile, 2018), sharing this kind of information has become much easier. Waze is a mobile application that allows users to automatically share the traffic conditions of the road they are travelling on (e.g., travel time information), as well as to report accidents, traffic jams, speed traps, police controls, etc. As the size of the community of Waze users grows, so does the quality of the information provided. However, although all this information is available to the users of the mobile application, it is not readily available to the general public, thus making it harder to use, for example, when developing transportation models.

4.2.3 Events Data Over the last decade, the Internet has become the preferred resource to announce, search, and comment about social events such as concerts, sports games, parades, demonstrations, sales, or any other public event that potentially gathers a large group of people. These planned special events often carry potentially disruptive impacts on the transportation system, because they correspond to nonhabitual behavior patterns that are hard to predict and plan for. There is a myriad of websites that provide rather detailed descriptions of most planned special events that take place in cities around the world, especially in larger metropolitan areas. Particularly popular examples are: Eventful (Entercom, 2018), Timeoutworld (Time Out Group, 2018) and AllEventsIn (Alleventsin, 2018). These are typically user-contributed and have the advantage of providing world-wide coverage, rather than being limited to a certain region. Furthermore, these typically provide APIs to the users, which greatly simplifies the data collection. However, it is important to note that local public authorities typically maintain their own official event directories (e.g., New York City (Nyc.com, 2018)). A popular alternative to event websites are social networks. In particular, Facebook has become a great source for people to create and publicize events. Furthermore, the social nature of Facebook gives the potential for understanding the online popularity of events, which then may or may not be reflected in the transportation system. 4.2.4 Social Media Data Social media data is by far the most widely available on the Internet, with social networks like Facebook and Twitter being two of the most popular examples. It is widely understood that user-generated messages in social media platforms now play a determinant role in different areas such as politics, business and entertainment. Although perhaps to a lesser extent, the transport sector is no exception. 
Data Preparation Chapter 5

Indeed, researchers have realized the potential of social media in providing a better understanding of the behavior of the transport system, and several research works seek to explore that potential. Examples of this range from using social media to understand the opinion of users toward the public transport system in certain cities (Schweitzer, 2012), to detecting incidents in the road network by analyzing user tweets (D'Andrea et al., 2015). All of the most popular social media platforms provide APIs, which make accessing their data quite easy. However, data access in some of these APIs is quite restrictive due to privacy issues. For example, Facebook only allows a user to access the posts of their friends or of other users who have public profiles.

4.3 Formats and Data Collection

Depending on its source, context data can come in very different formats and flavors. The most popular ones are XML, JSON, HTML, relational databases and plain natural language text. XML and JSON formats will be encountered mostly when gathering data from RSS feeds or APIs. As we saw in Section 2.3, these greatly simplify accessing and processing contextual data from the web. However, many websites don't provide such good interfaces. In these cases, one might need to resort to working with the source HTML code of web pages directly, using techniques such as "screen scraping."

Although most contextual data sources provide a fair amount of structure that can be exploited (e.g., with formats like XML, JSON, relational databases, or even HTML tags), we quite often encounter context data in the form of plain natural language text. A good example is incident logs/reports such as the ones used in (Pereira et al., 2013) for incident duration prediction. In these situations, one might need to consider natural language processing (NLP) techniques (Manning and Schütze, 1999). Furthermore, even when using a well-structured format like XML or JSON, some text-valued attributes might require further processing of the natural language text. A good example of this are event descriptions. These usually provide a significant amount of important information about the characteristics of the event. Therefore, NLP techniques need to be applied in order to turn that text into features (or attributes) that can be used, for example, in demand prediction models. Social networks, like Facebook or Twitter, are another contextual information source where a significant part of the data is raw text. Hence, if one considers using, for example, Twitter as an incident detection mechanism or as a feedback tool for understanding the opinion of the public toward the transportation system, the use of NLP techniques is essential.
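When no API or feed is available, the screen-scraping idea can be sketched with nothing but the Python standard library. The page layout below (the `event-*` class names) is a hypothetical example; real sites differ, and a library like Beautiful Soup (Richardson, 2017) makes this far more convenient and robust.

```python
# Sketch: extracting event fields from an HTML snippet using only the
# standard library. The markup and class names are hypothetical.
from html.parser import HTMLParser

PAGE = """
<div class="event">
  <span class="event-title">Jazz in the Park</span>
  <span class="event-date">2018-06-21</span>
  <span class="event-venue">Central Park</span>
</div>
"""

class EventParser(HTMLParser):
    """Collects the text of every <span> whose class starts with 'event-'."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field name currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "span" and cls.startswith("event-"):
            self._current = cls[len("event-"):]

    def handle_data(self, data):
        if self._current:
            # accumulate, since handle_data may fire more than once
            self.fields[self._current] = (
                self.fields.get(self._current, "") + data).strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

parser = EventParser()
parser.feed(PAGE)
print(parser.fields)
# {'title': 'Jazz in the Park', 'date': '2018-06-21', 'venue': 'Central Park'}
```

The same what/when/where fields could then feed a demand prediction model directly.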
In the following section, we provide some guidelines on how to approach this type of data and turn it into useful information for transport models.

PART I Methodological

4.4 Data Cleaning and Preparation

In the previous sections, we discussed different types of context data, their sources and typical formats, as well as how to access those sources in order to retrieve the data. However, in most situations, having access to the raw data isn't sufficient. We need to translate such data into features that are useful for ITS applications. We want to extract information about what (for example, an incident, concert, sports game, or religious celebration), when, where, and other relevant attributes (such as how many lanes are blocked in an incident, the cost of concert tickets, the public's age range, or a temple's size). Sometimes this information can be directly obtained through APIs or screen scraped from well-structured websites (e.g., specific fields in XML). This was the technique used, for example, for demand prediction in special events scenarios in (Pereira and Rodrigues, 2015; Rodrigues et al., 2016), where the event type, location, and start time were used as input to a neural network model. Unfortunately, quite often, well-structured, ready-to-use information won't be available, making it necessary to extract information from unstructured text using natural language processing (NLP) techniques, such as information extraction, named entity recognition, topic modeling and sentiment analysis. In the following, we briefly discuss each of these individually, how they can be relevant for ITS applications, and software packages that implement them.

4.4.1 Information Extraction

Information Extraction (IE) corresponds to the general task of automatically extracting structured information from unstructured text. An example consists in identifying the names of persons and locations that are mentioned in a document and the relationships between them. A particularly well-known subtask of IE is Named Entity Recognition (NER). The goal of NER is to locate named entities in text and classify them into a set of preestablished categories, such as the names of persons, organizations, locations, dates, quantities, monetary values, etc.

In the context of ITS applications, IE techniques can be used, for example, to identify where an event will take place, or the names of its performers, from its textual description. Other uses include identifying the number of lanes blocked or vehicles involved from an incident report, or the location of an incident being reported on Twitter.

There are several software packages that provide IE functionalities. For example, the "Stanford NER" is a very popular implementation of a Named Entity Recognizer in Java. In Python, the "NLTK" package provides a lot of NLP functionality, including NER and many others such as text classification, tokenization (breaking a sentence into tokens), stemming (reducing words to their root; e.g., "smoking" reduces to "smoke"), and part-of-speech tagging (identifying nouns, verbs, adjectives, adverbs, etc.). Using the NLTK package in Python (Bird et al., 2017), splitting a sentence into tokens, assigning each its corresponding part-of-speech (POS) tag and identifying named entities is relatively easy, as shown in Fig. 8. For further details, the NLTK documentation is a great resource (Bird et al., 2017).
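Fig. 8 shows the NLTK pipeline; as a complementary, dependency-free illustration of the information-extraction idea, the sketch below pulls structured fields out of a free-text incident report with regular expressions. The report wording and the patterns are hypothetical, and a trained NER model would be far more robust in practice.

```python
# Sketch: rule-based extraction of structured fields from a free-text
# incident report. Report format and regexes are illustrative only.
import re

report = ("Accident on I-95 northbound at exit 27: 2 lanes blocked, "
          "3 vehicles involved, expected clearance 45 minutes.")

patterns = {
    "road":      r"\bon\s+([A-Z]+-?\d+)",     # e.g., "on I-95"
    "lanes":     r"(\d+)\s+lanes?\s+blocked",
    "vehicles":  r"(\d+)\s+vehicles?\s+involved",
    "clearance": r"clearance\s+(\d+)\s+minutes?",
}

# Keep only the fields whose pattern actually matches this report.
extracted = {name: m.group(1)
             for name, rx in patterns.items()
             if (m := re.search(rx, report))}
print(extracted)
# {'road': 'I-95', 'lanes': '2', 'vehicles': '3', 'clearance': '45'}
```

Such fields (lanes blocked, vehicles involved) are exactly the kind of attributes used as model inputs in the incident duration prediction work cited above.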

Data Preparation Chapter

5

101

FIG. 8 Tokenizing, POS tagging, and NER example using nltk in Python.

4.4.2 Topic Modeling

Quite often, instead of being interested in extracting a specific piece of information from a text, we are interested in obtaining a representation of the text as a whole, which can be easily interpreted by a machine and fed, for example, as additional features/attributes to a prediction model. In such situations, a popular simple solution is to use a bag-of-words representation of the text, in which we disregard grammar and even word order and keep track only of the number of occurrences of each word in a document. In this way, a document is represented as a long vector of counts, whose length is the size of the vocabulary and where each entry corresponds to the number of occurrences of each word in that document.

The simplicity and efficiency of the bag-of-words model make it very appealing for practical applications. However, in some scenarios, its limitations can be problematic. For example, the size of the representation scales linearly with the size of the vocabulary, which means that even for small corpora, vocabulary sizes in the order of 20,000 or 30,000 words are frequent. Furthermore, the bag-of-words representation is not able to capture the semantics of words. This means that the words "car" and "automobile" are represented as completely independent, even though they refer to the exact same concept.

A common solution to some of these problems is to use topic models. In fact, the growing need to analyze large document corpora has led to great developments in topic modeling. Topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), allow us to analyze large collections of documents by revealing their underlying themes, or topics, and how each document exhibits them. In LDA, each topic is a pattern represented by a distribution over the words in the vocabulary. Its result is a set of topics and topic proportions associated with each document, meaning that documents are mixtures of topics and topics are mixtures of words. Hence, by using LDA, a document can be represented as a vector of size K, where K is the number of topics, and each element in the vector corresponds to the proportion of a given topic.

A great number of open source and proprietary software packages provide implementations of LDA (and variants), including Matlab (MathWorks, 2018), R (Chang, 2015; Grün and Hornik, 2017), Python (Rehurek and Sojka, 2010), and Python/Scala in Apache Spark's MLlib (Apache Software Foundation, 2018). In Python, one can easily convert a collection of texts to the bag-of-words representation and then apply LDA by making use of the "gensim" package, as shown in Fig. 9. For further details, see the gensim tutorials and documentation (Řehůřek, 2018).
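As a minimal illustration of the bag-of-words representation described above, the following standard-library sketch builds count vectors for a tiny synthetic corpus. Real pipelines would also normalize case, strip stop words and punctuation, and typically hand the result to a package such as gensim (as in Fig. 9).

```python
# Sketch: bag-of-words vectors for a tiny synthetic corpus.
from collections import Counter

docs = [
    "road closed after accident on the bridge",
    "concert tonight traffic expected near the bridge",
    "accident cleared road reopened",
]

tokenized = [d.split() for d in docs]
# One vocabulary index per distinct word, in a fixed (sorted) order.
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]  # one count per vocabulary word

vectors = [bow_vector(doc) for doc in tokenized]
print(len(vocab))   # 14 distinct words across the three documents
print(vectors[0])   # counts for the first document, in vocab order
```

Each document becomes a vector of length `len(vocab)`; with LDA, the same documents would instead be summarized by K topic proportions.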

102 PART

I Methodological

FIG. 9 LDA model example using gensim in Python.

FIG. 10 Sentiment analysis example using nltk in Python.

4.4.3 Sentiment Analysis

Sentiment analysis corresponds to the process of identifying the sentiment associated with a piece of text. It usually relies on applying machine learning techniques to classify documents based on a collection of features extracted from the text using other NLP techniques, such as the presence of certain words or the coverage of certain topics. In the context of ITS, sentiment analysis can be applied, for example, to large-scale collections of tweets in order to determine how people feel about the transport system in general, or about certain aspects of it in particular.

Various packages provide sentiment analysis functionality, such as the "RSentiment" package in R (Bose and Goswami, 2017) or the "nltk" package in Python (Bird et al., 2017). Most of these allow the user to train their own sentiment classifiers by providing a dataset of texts along with their corresponding sentiments. However, pretrained versions also exist. For example, in Python one can easily use the "nltk" package to identify the sentiment of texts, as shown in Fig. 10.
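To make the idea concrete, here is a deliberately tiny lexicon-based sentiment scorer. The word lists and example tweets are illustrative only; trained classifiers such as those in nltk handle negation, intensity, and far larger vocabularies.

```python
# Sketch: minimal lexicon-based sentiment scoring of transport-related
# tweets. The lexicon and tweets are hypothetical.
POSITIVE = {"great", "fast", "reliable", "clean", "love"}
NEGATIVE = {"late", "delayed", "crowded", "dirty", "awful"}

def sentiment(text):
    words = text.lower().split()
    # net count of positive minus negative lexicon hits
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Love the new metro line, fast and clean"))  # positive
print(sentiment("Bus was late again and totally crowded"))   # negative
```

Aggregating such labels over many tweets gives a rough pulse of public opinion toward, say, a specific bus route or a fare change.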

REFERENCES

Alleventsin (2018) All Events in. Available at: https://allevents.in/. Accessed 29 March 2018.
Apache Software Foundation (2018) Machine Learning Library (MLlib) Main Guide—Spark 2.3.0 Documentation. Available at: https://spark.apache.org/docs/latest/ml-guide.html. Accessed 29 March 2018.
Basu, S. and Meckesheimer, M. (2007) 'Automatic outlier detection for time series: an application to sensor data', Knowl. Inf. Syst. Available at: http://www.springerlink.com/index/442WP9XQ56286245.pdf. Accessed 1 February 2017.
Bhaskar, A., Chung, E. and Dumont, A.-G. (2009) 'Integrating cumulative plots and probe vehicle for travel time estimation on signalized urban network', in 9th Swiss Transport Research Conference, Monte Verita/Ascona. Available at: http://www.strc.ch/conferences/2009/Bhaskar.pdf. Accessed 1 February 2017.
Bird, S., Klein, E. and Loper, E. (2017) Natural Language Processing with Python—Analyzing Text with the Natural Language Toolkit. Available at: http://www.nltk.org/book/. Accessed 29 March 2018.


Bitar, N. (2016) 'Big Data Analytics in Transportation Networks Using the NPMRDS'. Available at: https://www.researchgate.net/profile/Naim_Bitar/publication/305994432_Big_Data_Analytics_in_Transportation_Networks_Using_the_NPMRDS/links/57a92b4608aef20758cd124c.pdf. Accessed 31 January 2017.
Blei, D., Ng, A. and Jordan, M. (2003) 'Latent Dirichlet allocation', J. Mach. Learn. Res. Available at: http://www.jmlr.org/papers/v3/blei03a.html. Accessed 24 February 2017.
Bose, S. and Goswami, S. (2017) Package 'RSentiment'. Available at: https://cran.r-project.org/web/packages/RSentiment/RSentiment.pdf. Accessed 29 March 2018.
Bucknell, C., Herrera, J.C., 2014. A trade-off analysis between penetration rate and sampling frequency of mobile sensors in traffic state estimation. Transp. Res. Part C: Emerg. Technol. 46, 132–150. https://doi.org/10.1016/j.trc.2014.05.007.
Buuren, S. and Groothuis-Oudshoorn, K. (2011) 'mice: Multivariate imputation by chained equations in R', J. Stat. Softw. Available at: http://doc.utwente.nl/78938/. Accessed 27 January 2017.
Cambridge Systematics and Texas Transportation Institute, 2015. NPMRDS Missing Data and Outlier Analysis. https://www.regulations.gov/document?D=FHWA-2013-0054-0103.
Chang, J. (2015) Package 'lda': Gibbs Sampling Methods for Topic Models. Available at: https://cran.r-project.org/web/packages/lda/lda.pdf. Accessed 29 March 2018.
Chiou, J.-M., et al., 2014. A functional data approach to missing value imputation and outlier detection for traffic flow data. Transportmetrica B: Transp. Dyn. 2 (2), 106–129. https://doi.org/10.1080/21680566.2014.892847.
D'Andrea, E., Ducange, P. and Lazzerini, B. (2015) 'Real-time detection of traffic from twitter stream analysis', IEEE transactions on. Available at: http://ieeexplore.ieee.org/abstract/document/7057672/. Accessed 24 February 2017.
Entercom (2018) Eventful. Available at: www.eventful.com. Accessed 29 March 2018.
Florida Department of Transportation (2012) 'Probe Data Analysis Evaluation of NAVTEQ, TrafficCast, and INRIX Travel Time System Data in the Tallahassee Region'. Available at: http://www.fdot.gov/traffic/ITS/Projects_Deploy/2012-03-26_Probe_Data_Analysis_v2-0.pdf. Accessed 31 January 2017.
Goyvaerts, J. (2016) Regular-Expressions.info—Regex Tutorial, Examples and Reference—Regexp Patterns. Available at: http://www.regular-expressions.info/. Accessed 29 March 2018.
Grün, B. and Hornik, K. (2017) Package 'topicmodels'. Available at: https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf. Accessed 29 March 2018.
Gupta, M., Gao, J. and Aggarwal, C. (2014) 'Outlier detection for temporal data: a survey', on Knowledge and Data …. Available at: http://ieeexplore.ieee.org/abstract/document/6684530/. Accessed 2 February 2017.
Haas, K., 2006. Benefits of Traffic Incident Management. National Traffic Incident Management Coalition.
Haghani, A., Zhang, X., Hamedi, M., 2015. Validation and Augmentation of INRIX Arterial Travel Time Data Using Independent Sources. Maryland State Highway Administration.
Hallenbeck, M. and McCormack, E. (2015) 'Developing a System for Computing and Reporting MAP-21 and Other Freight Performance Measures'. Available at: http://wadot.wa.gov/NR/rdonlyres/4869900F-9E88-4B2E-968B-EF2CA3B3D1FD/107207/8441.pdf. Accessed 31 January 2017.
Henrickson, K., Zou, Y. and Wang, Y. (2015) 'Flexible and robust method for missing loop detector data imputation', J. Transp. Available at: http://trrjournalonline.trb.org/doi/abs/10.3141/2527-04. Accessed 24 February 2017.
Hodge, V. and Austin, J. (2004) 'A survey of outlier detection methodologies', Artif. Intell. Rev. Available at: http://link.springer.com/article/10.1007/s10462-004-4304-y. Accessed 27 January 2017.


Honaker, J., King, G., Blackwell, M., 2016. Amelia II: A Program for Missing Data.
Hosuri, S.R. (2017) Congestion Quantification Using the National Performance Management Research Dataset. University of Alabama at Birmingham. Available at: https://search.proquest.com/docview/1914681659?pq-origsite=gscholar. Accessed 28 March 2018.
Hyndman, R. (2016) 'Package "forecast": Forecasting Functions for Time Series and Linear Models'. Available at: http://github.com/robjhyndman/forecast. Accessed 26 January 2017.
INRIX (2014) 'I-95 Vehicle Probe Project II Interface Guide'.
Karlaftis, M., Latoski, S. and Richards, N. (1999) 'ITS impacts on safety and traffic management: an investigation of secondary crash causes', J. Intell. Available at: http://www.tandfonline.com/doi/abs/10.1080/10248079908903756. Accessed 24 February 2017.
Kaushik, K., Ernest Young, S., 2015. Computing performance measures using National Performance Management Research Data Set (NPMRDS) data. Transp. Res. Rec. 2529, 10–26.
Kim, S., Coifman, B., 2014. Comparing INRIX speed data against concurrent loop detector stations over several months. Transp. Res. Part C: Emerg. Technol. 49, 59–72. https://doi.org/10.1016/j.trc.2014.10.002.
Koetse, M. and Rietveld, P. (2009) 'The impact of climate change and weather on transport: an overview of empirical findings', Transp. Res. Part D: Transp. Available at: http://www.sciencedirect.com/science/article/pii/S136192090800165X. Accessed 24 February 2017.
Laurikkala, J. et al. (2000) 'Informal identification of outliers in medical data', on Intelligent Data …. Available at: http://www.academia.edu/download/30622241/All.pdf#page=24. Accessed 27 January 2017.
Li, X. et al. (2009) 'Temporal outlier detection in vehicle traffic data', in Data Engineering, 2009 (ICDE '09). Available at: http://ieeexplore.ieee.org/abstract/document/4812530/. Accessed 2 February 2017.
Li, L., Li, Y., Li, Z., 2014. Missing traffic data: comparison of imputation methods. IET Intell. Transp. Syst. 8 (1), 51–57. https://doi.org/10.1049/iet-its.2013.0052.
Liu, Z., Sharma, S., Datla, S., 2008. Imputation of missing traffic data during holiday periods. Transp. Plan. Technol. 31 (5), 525–544. https://doi.org/10.1080/03081060802364505.
Manning, C. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. Available at: http://www.mitpressjournals.org/doi/pdf/10.1162/coli.2000.26.2.277. Accessed 24 February 2017.
MathWorks (2018) Latent Dirichlet allocation (LDA) model, MATLAB 9.4 Documentation. Available at: https://www.mathworks.com/help/textanalytics/ref/ldamodel.html?requestedDomain=true. Accessed 29 March 2018.
Moritz, S. (2017) 'Package "imputeTS": Time Series Missing Value Imputation'. Available at: https://github.com/SteffenMoritz/imputeTS/issues. Accessed 26 January 2017.
Ni, D. et al. (2005) 'Multiple imputation scheme for overcoming the missing values and variability issues in ITS data', J. Transp. Available at: http://ascelibrary.org/doi/abs/10.1061/(ASCE)0733-947X(2005)131:12(931). Accessed 24 February 2017.
Nyc.com (2018) New York Events and Event Calendar | NYC.com—New York's Box Office. Available at: https://www.nyc.com/events/. Accessed 29 March 2018.
OpenWeatherMap Inc. (2018) Current weather and forecast—OpenWeatherMap. Available at: https://openweathermap.org/. Accessed 29 March 2018.


Park, E., Turner, S. and Spiegelman, C. (2003) 'Empirical approaches to outlier detection in intelligent transportation systems data', Transp. Res. Rec. Available at: http://trrjournalonline.trb.org/doi/abs/10.3141/1840-03. Accessed 31 January 2017.
Patire, A.D., et al., 2015. How much GPS data do we need? Transp. Res. Part C: Emerg. Technol. 58, 325–342. https://doi.org/10.1016/j.trc.2015.02.011.
Pereira, F. and Rodrigues, F. (2015) 'Using data from the web to predict public transport arrivals under special events scenarios', J. Intell. Available at: http://www.tandfonline.com/doi/abs/10.1080/15472450.2013.868284. Accessed 24 February 2017.
Pereira, F., Rodrigues, F. and Ben-Akiva, M. (2013) 'Text analysis in incident duration prediction', Transp. Res. Part C. Available at: http://www.sciencedirect.com/science/article/pii/S0968090X13002088. Accessed 24 February 2017.
Pereira, F.C., et al., 2015. Why so many people? Explaining nonhabitual transport overcrowding with internet data. IEEE Trans. Intell. Transp. Syst. 16 (3), 1370–1379. https://doi.org/10.1109/TITS.2014.2368119.
Pu, W., 2018. Interstate speed profiles. Transp. Res. Rec. https://doi.org/10.1177/0361198118755713.
Python Software Foundation (2018) 19.2. json—JSON encoder and decoder—Python 3.6.5 documentation. Available at: https://docs.python.org/3/library/json.html. Accessed 29 March 2018.
Ramaswamy, S., Rastogi, R. and Shim, K. (2000) 'Efficient algorithms for mining outliers from large data sets', SIGMOD Rec. Available at: http://dl.acm.org/citation.cfm?id=335437. Accessed 1 February 2017.
Ran, B. et al. (2015) 'Traffic speed data imputation method based on tensor completion', Comput. Intell. Neurosci. 2015, 1–9. https://doi.org/10.1155/2015/364089.
Ravichandran, S., 2001. State space modelling versus ARIMA time-series modelling. J. Ind. Soc. Agric. Stat. 54 (1), 43–51.
Řehůřek, R. (2018) Gensim: Topic modelling for humans. Available at: https://radimrehurek.com/gensim/. Accessed 29 March 2018.
Rehurek, R. and Sojka, P. (2010) 'Software framework for topic modelling with large corpora', in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.695.4595. Accessed 29 March 2018.
Reitz, K. (2018) Requests: HTTP for Humans—Requests 2.18.4 documentation. Available at: http://docs.python-requests.org/en/master/. Accessed 29 March 2018.
Richardson, L. (2017) Beautiful Soup Documentation—Beautiful Soup 4.4.0 documentation. Available at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. Accessed 29 March 2018.
Rodrigues, F., Borysov, S. and Ribeiro, B. (2016) 'A Bayesian additive model for understanding public transport usage in special events', IEEE transactions on. Available at: http://ieeexplore.ieee.org/abstract/document/7765036/. Accessed 24 February 2017.
Rubin, D. (1976) 'Inference and missing data', Biometrika. Available at: http://biomet.oxfordjournals.org/content/63/3/581.short. Accessed 18 January 2017.
Rubin, D., Little, R.J.A., 1987. Statistical Analysis with Missing Data, first ed. Wiley and Sons, New York.
Schafer, J. (1997) Analysis of Incomplete Multivariate Data. Available at: https://www.google.com/books?hl=en&lr=&id=3TFWRjn1f-oC. Accessed 26 January 2017.


Schafer, J. and Graham, J. (2002) 'Missing data: our view of the state of the art', Psychol. Methods. Available at: http://psycnet.apa.org/journals/met/7/2/147/. Accessed 18 January 2017.
Schweitzer, L. (2012) 'How are we doing? Opinion mining customer sentiment in US transit agencies and airlines via Twitter', in Transportation Research Board 91st Annual Meeting. Available at: https://trid.trb.org/view.aspx?id=1129878. Accessed 24 February 2017.
Shekhar, S., Lu, C. and Zhang, P. (2001) 'Detecting graph-based spatial outliers: algorithms and applications (a summary of results)', in Proceedings of the Seventh ACM SIGKDD. Available at: http://dl.acm.org/citation.cfm?id=502567. Accessed 31 January 2017.
Tak, S., Woo, S. and Yeo, H. (2016) 'Data-driven imputation method for traffic data in sectional units of road links', IEEE Trans. Intell. Available at: http://ieeexplore.ieee.org/abstract/document/7444178/. Accessed 27 January 2017.
The PostGIS Development Group (2018) PostGIS 2.4.4dev Manual. Available at: https://postgis.net/docs/. Accessed 26 March 2018.
Time Out Group (2018) Time Out World. Available at: https://world.timeout.com/. Accessed 29 March 2018.
TomTom, 2011. TomTom Real Time Traffic Information. http://www.tomtom.com/lib/img/REAL_TIME_TRAFFIC_WHITEPAPER.pdf.
TomTom, 2012. OpenLR White Paper Version 1.5 revision 2. http://www.openlr.org/data/docs/OpenLR-Whitepaper_v1.5.pdf.
TomTom, 2014. Historical Traffic Information. http://www.tomtom.com/lib/img/HISTORICAL_TRAFFIC_WHITEPAPER.pdf.
TWC Product and Technology LLC (2018) Weather Forecast & Reports—Long Range & Local | Weather Underground. Available at: https://www.wunderground.com/. Accessed 29 March 2018.
Wang, Z. et al. (2018) 'A cross-vendor and cross-state analysis of the GPS-probe data latency', in Transportation Research Board Annual Meeting, Washington, DC. Available at: http://amonline.trb.org/2017trb-1.3983622/t005-1.4000488/200-1.4001272/18-05026-1.3997223/18-05026-1.4001275#tab_0=1. Accessed 29 March 2018.
Waze Mobile (2018) Free Community-based GPS, Maps & Traffic Navigation App | Waze. Available at: https://www.waze.com/. Accessed 29 March 2018.
Young, S.E., et al., 2015. I-95 Corridor Coalition Vehicle Probe Project: Validation of Arterial Probe Data. http://i95coalition.org/wp-content/uploads/2015/02/I-95_Arterial_Validation_Report_July2015-FINAL.pdf.
Zhong, J. and Ling, S. (2015) 'Key factors of k-nearest neighbours nonparametric regression in short-time traffic flow forecasting', Proceedings of the 21st International Conference on. Available at: http://link.springer.com/chapter/10.2991/978-94-6239-102-4_2. Accessed 27 January 2017.

Chapter 6

Data Science and Data Visualization

Michalis Xyntarakis* and Constantinos Antoniou†
*Cambridge Systematics, Medford, MA, United States; †Department of Civil, Geo and Environmental Engineering, Technical University of Munich, Munich, Germany

Chapter Outline
1 Introduction 107
2 Structured Visualization 115
3 Multidimensional Data Visualization Techniques 120
  3.1 Parallel Coordinates 121
  3.2 Multidimensional Scaling (MDS) 123
  3.3 t-Distributed Stochastic Neighbor Embedding for High-Dimensional Data Sets (t-SNE) 123
4 Case Studies 124
  4.1 Experimental Setup 124
  4.2 Car Characteristics Data Set 125
  4.3 Congestion on I95 128
  4.4 Dimensionality Reduction on NYC Taxi Flows 132
  4.5 Dimensionality Reduction on the NYC Turnstile Data Set 140
5 Conclusions 142
References 143
Further Reading 144

1 INTRODUCTION

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00006-3. © 2019 Elsevier Inc. All rights reserved.

Data analysis and data visualization are very important tools for engineers, analysts, policy-makers, and decision makers. Developed originally for "small data," these techniques have been met with varied success over the past centuries and decades. There are a few famous visualizations that are very effective in capturing the essence of the data, and of course there are many infamous examples of poor visualizations. Even well-known publications and researchers often provide visualizations with considerable faults. Usually, a successful figure requires a lot of work, customization, attention to detail, and refinement.

This leads naturally to the question of visualization quality. What is a good visualization? Many will say an eye-catching one, a colorful one. Others will say one that is accessible to color-blind or sight-impaired individuals. Others

will quote the data-to-ink ratio, calculate the number of information elements in the figure, and so on. In the context of big data, additional constraints emerge, as the amount of data to be plotted is very large. Therefore, in order to convey meaningful messages, one needs to resort to preprocessing techniques to extract some meaningful structure from the data. With big data, the process of visualization thus becomes inherently entangled with data analytics.

Another important concept relating to visualization is beauty. Steele and Iliinsky (2010) describe beauty in this context as having four components: besides being aesthetically pleasing, a visualization must also be novel, informative, and efficient. The authors identify Mendeleev's Periodic Table of Elements (Fig. 1) and Harry Beck's London Metro Map as two historic visualization examples that satisfy these rules. The periodic table was a novel and efficient representation of dense information (providing up to nine pieces of information per item), and in its early versions it did not include color (the figure could thus be produced on a typewriter). This stresses the point that "strong graphic design treatment is not a requirement for beauty" (Steele and Iliinsky, 2010). The London Map uses visual conventions and standards, but does not aim to be geographically accurate. Instead, it strips away unnecessary information and focuses on an abstract visual style that provides exceptional clarity.

It is therefore clear that visualization is part science and part art, and also that it is difficult, even when dealing with "small" data. The emergence of data-collection technologies, accompanied by the explosion of user-generated content, has led to an abundance of data. In the process, visualization has become both more important and more challenging. The challenging part is easily understood.
The growing importance stems from the fact that as data become larger, extracting meaningful relationships from them becomes harder. Visualization of big data is not merely a process of showing data in the best way; it usually involves a certain degree of actual analysis or modeling, such as clustering, data mining, or data reduction.

In this chapter, we use a number of mobility-related data sets to demonstrate some of the state-of-the-art data visualization techniques. A small data set with passenger car characteristics from the 70s and 80s is used to showcase parallel coordinates and the other dimensionality reduction techniques (Figs. 2–6 and 10). Variables that describe the cars include miles per gallon (mpg), number of cylinders, displacement (cc), power (hp), weight (lb), time to reach 60 miles per hour, and year of construction. The data set contains information for 408 cars and can be downloaded from https://bl.ocks.org/jasondavies/1341281 (Parallel Coordinates, 2018).

FIG. 1 Periodic table. (By Armtuk, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2010645.)
FIG. 2 Parallel coordinates plot.
FIG. 3 Parallel coordinates plot: selected observation.
FIG. 4 Parallel coordinates plot: simple selection query.
FIG. 5 Scatterplot matrix of the pairwise relationships in the car data set.
FIG. 6 t-SNE versus parallel coordinates on the car data set.

The corridor congestion data set was derived from the National Performance Management Research data set that was purchased by the United States Federal Highway Administration (FHWA). The derived data set contains link travel times aggregated every 5 min for network links on the northbound direction of US Interstate I95 between Washington D.C. and Baltimore. The distance between the two beltways is 24.5 miles, and it is covered by 26 links of about one mile each. A feature in this data set consists of a matrix of 5-min speeds from 3 PM to 8 PM that pertain to a given day. The top-left graph of Fig. 14 shows the distribution of average corridor speeds in 2017. Fig. 7 uses a small multiple display to show each of these matrices for all the weekdays between March 27, 2017 and April 21, 2017.
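The per-day feature just described can be sketched as follows. The speed values below are synthetic placeholders; only the dimensions follow the text (26 links, and 5-min bins from 3 PM to 8 PM, i.e., 60 bins).

```python
# Sketch: assembling one day's feature from the corridor data set as a
# 26-link x 60-bin matrix of 5-min speeds, then flattening it into a
# single vector. Speeds are random placeholders, not real NPMRDS data.
import random

random.seed(0)
N_LINKS, N_BINS = 26, 60  # 26 one-mile links; 5 h of 5-min bins

# One day's observation: speeds[link][bin], in mph (synthetic).
speeds = [[random.uniform(20, 65) for _ in range(N_BINS)]
          for _ in range(N_LINKS)]

# Flatten to one feature vector per day, the shape expected by
# dimensionality reduction methods such as MDS or t-SNE.
feature = [v for link in speeds for v in link]
print(len(feature))  # 1560 values per day
```

Stacking one such vector per day yields the observations-by-features matrix used in the case studies.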


The NYC Taxi and limousine commission publishes taxi trip records since January 2009. The NYC taxi data set contains a record for every ride a yellow taxi undertook between June 30, 2016 and June 30, 2017. The origins and the destinations of each of the trips are provided at the zone level. For each day in the data set, a variable is generated that contains the taxi trips from zone i to zone j for a given time period. Most of the taxi trips belong to a small percentage of zone pairs. Fig. 18 shows that 5000 zone pairs account for almost 80% of the total number of trips in the New York region. This is a publicly available data set that can be downloaded from http://www.nyc.gov/html/tlc/html/about/trip_ record_data.shtml. The New York Metropolitan Transit Administration (MTA) turnstile data set contains turnstile counts for every turnstile and station in NYC from June 2010 to August 2017. The extended time period of reporting can be used to visualize operations during extreme events, such as hurricane Sandy (Fig. 8) or visualize ridership trends. Fig. 9 shows turnstile entry counts aggregated for every day of the year. The raw data set that is publicly available contains 2646 days and 466 stations. After eliminating days and stations that have a significant number of missing values or implausible measurements, the test data set contains 2364 days and 266 stations. In the test data set, an observation is a day and a count station measurement is a column. The data set can be downloaded from http://web.mta.info/developers/turnstile.html. Naturally, the approaches demonstrated in this chapter are not the only techniques available, nor is it implied that they are the most suitable for each task. The reader should be able to adapt the background and insight obtained from reading this chapter, to seek and apply the most suitable techniques. All the data sets except from the corridor congestion one are publicly available online. 
Computational notebooks that replicate the analysis are available on the companion website. The remainder of this chapter is structured as follows. Section 2 provides a quick review of structured visualization techniques. Section 3 provides an overview of the multidimensional data visualization techniques that are considered in this chapter. Section 4 presents the results of these analyses, starting from the experimental setup that is employed to demonstrate these algorithms on the considered data sets. Section 5 provides some concluding remarks.

2 STRUCTURED VISUALIZATION

The literature about what constitutes good visualization is long and ranges from individual researchers to stakeholders (such as the European Environmental Agency, 2018). While many of these guidelines and suggestions seem obvious, they are often violated, not only by students, but also by more experienced researchers, journalists, and other graphics creators. For example, quite often figures have illegible elements (axis labels, legends, and titles); use color excessively; have distracting elements; and are ill-constructed by having misleading

116 PART I Methodological

FIG. 7 Corridor link speed heatmaps for I-95 NB between District of Columbia and Baltimore.

[Figure: "Daily demand for end of 2012": daily subway entries (millions), September 2012 to January 2013, with Thanksgiving, Christmas, and Hurricane Sandy annotated.]
FIG. 8 New York city subway ridership for the end of 2012.

[Figure: average daily ridership (millions) by day of week, Monday through Sunday.]
FIG. 9 Changes in average daily subway ridership in New York from 2010 to 2017.

axis ranges or misleading axis labeling (e.g., using equal spacing for unequal ranges). Bateman et al. (2010) explore the potential value of visual embellishment on the effectiveness of charts, using both insight and focus groups. Contrary to expectations, it appears that visual embellishments are not always detrimental to comprehension, as they sometimes create mental associations for the viewer. Therefore, the choice of whether such approaches should be followed is harder to judge a priori, and it depends on the intended audience and specific circumstances.

Creating good graphics is not a new field. For example, Edward Tufte in his popular books (Tufte, 1983, 1990, 1997) has curated a number of exemplary and ill-constructed visualizations and put forward guidelines on information design. Tukey (1977), Cleveland and Cleveland (1985), and Kosslyn (1994) are seminal works on visualizing statistical information. The ideas and guidance in these seminal references are often simple and straightforward. For example, Tufte's six principles of graphical integrity, as outlined in Tufte (1983), include that the representation of numbers should match the true proportions and that labeling should be clear and detailed. However, quite often, ill-constructed visualizations see the light of day, even from respected researchers or leading journalistic institutions. (This has also led to the creation of the term "chart-junk.")

The Grammar of Graphics (Wilkinson, 2006) is a seminal book that organizes information visualization through a series of structured rules: "the grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements)." Creating visualizations thus becomes an organic process that focuses on the essence and not the ornaments, although "the rules of graphics grammar are sometimes mathematical and sometimes aesthetic" (Wilkinson, 2006). One way to look at this is as object-oriented visualization.
Wickham (2010) presents his "layered grammar of graphics", based on these ideas, which has been operationalized into the ggplot2 R package. Wickham (2016) provides a more detailed presentation of the powerful ggplot2 tool. While not the only such initiative, this is one that has been very well established and allows the generation of very powerful and aesthetically pleasing visualizations.

The previously mentioned principles and ideas are very important and have served the research and practitioner communities for decades. In the era of big data and high-dimensional data, however, additional challenges emerge. Manual inspection using traditional data visualization techniques, such as histograms and scatterplots, is infeasible. Dimensionality reduction techniques create a two- or three-dimensional map from the high-dimensional data, where each data point is a dot in the map. Investigating the different clusters that appear in the two-dimensional map interactively can reveal similarity patterns and other properties that are hard to deduce otherwise. The remainder of this section will briefly introduce multidimensional and high-dimensional visualization techniques and apply them on the four test data sets introduced earlier.

3 MULTIDIMENSIONAL DATA VISUALIZATION TECHNIQUES

Multivariate visualization techniques can be classified into two broad groups (De Oliveira and Levkowitz, 2003). In the first group, additional dimensions in the data set are encoded using different elements, such as multiple axes, or visual encodings, such as color, shape, and texture. To avoid overplotting, the amount of data displayed is usually not more than a few thousand records. The number of dimensions usually does not exceed 10, due to visual cluttering and human perception limitations. Effective multidimensional visualizations include, for example:

- The parallel coordinates plot, which can effectively display up to a few thousand records and a dozen variables. This plot is discussed later in this chapter.
- Small multiple displays, such as the scatterplot matrix (Fig. 5) or a heatmap grid (Fig. 7).
- Pixel-oriented visualization techniques that map each data value in a data set to a color-coded pixel on the screen (Keim, 2000).
- Scatterplots or other frequently used two-dimensional graphs in which additional variables are encoded using color, shape, size, and other visual elements in addition to position. For example, gapminder, a visualization that has been viewed millions of times on the web, uses an animated scatter plot that shows the relationship between income and life expectancy for many countries from 1800 to 2018. In the scatterplot, point color represents a country's continent and point size represents population. Animation is used to show how life expectancy changes over time.
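A static frame of such a scatterplot, with position, size, and color each encoding a different variable, can be sketched in matplotlib. The data below are synthetic stand-ins, not the actual gapminder series:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 50
income = rng.lognormal(mean=9, sigma=1, size=n)             # x position
life_exp = 45 + 6 * np.log10(income) + rng.normal(0, 2, n)  # y position
population = rng.lognormal(mean=16, sigma=1, size=n)        # size encoding
continent = rng.integers(0, 5, size=n)                      # color encoding

fig, ax = plt.subplots()
ax.scatter(income, life_exp,
           s=300 * population / population.max(),  # marker area ~ population
           c=continent, cmap="tab10", alpha=0.7)
ax.set_xscale("log")
ax.set_xlabel("Income per capita")
ax.set_ylabel("Life expectancy (years)")
fig.savefig("gapminder_style.png")
```

An animated version would simply redraw this frame once per year of data.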

In the second broad group, often termed high-dimensional visualization, dimensionality reduction methods, such as principal component analysis or PCA (Jolliffe, 1986), are applied to data sets with dozens, hundreds, or thousands of variables to obtain a lower-dimensional projection that is frequently shown as a two-dimensional scatterplot map. For example, the first two principal components of PCA can be visualized in a scatterplot to produce a two-dimensional projection of the full data set regardless of its dimensions. Projection methods, such as multidimensional scaling (MDS) or t-SNE, can be applied either on the original data set, if it is computationally feasible, or on a lower-dimension projection obtained by PCA. A thorough comparison between dimensionality reduction techniques is presented by Van Der Maaten et al. (2009) and Bunte et al. (2012), while a more general overview of multidimensional visualization techniques is given by Liu et al. (2017). Popular projection techniques include

- Principal component analysis (Jolliffe, 1986)
- Multidimensional scaling (Borg and Groenen, 2005)
- Isomap (Tenenbaum et al., 2000)
- Locally linear embedding (Roweis and Saul, 2000)
- t-distributed stochastic neighbor embedding (t-SNE; Van Der Maaten and Hinton, 2008, 2014)

Projecting the multidimensional space into two dimensions is an ill-posed problem. Different techniques make different assumptions or tradeoffs and may be applicable to different problems. Unlike PCA, some of these techniques are probabilistic and parameterized and may not always yield the same result from the same inputs. For these reasons, a two-dimensional projection should be viewed as an exploratory method that is to be used interactively and possibly in combination with other techniques, such as clustering.
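As a concrete example, a two-dimensional PCA map can be computed directly from the singular value decomposition of the centered data matrix. The synthetic 20-dimensional data below are an illustrative assumption:

```python
import numpy as np

def pca_project(X, d=2):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)                        # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)                # variance explained per component
    return Xc @ Vt[:d].T, explained[:d]

rng = np.random.default_rng(2)
# 200 points lying near a 2-D plane embedded in 20 dimensions.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 20))
X += 0.05 * rng.normal(size=X.shape)

Y, explained = pca_project(X, d=2)   # Y is the 200 x 2 scatterplot map
```

A projection method such as MDS or t-SNE could then be run on Y (or on a higher-dimensional PCA projection) instead of the raw data.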

3.1 Parallel Coordinates

Parallel coordinates is a well-researched visualization that allows the analyst to discover predominant multivariate patterns interactively (Wegman, 1990). Parallel coordinates can effectively visualize a few thousand observations, each having up to 10 or 15 variables depending on the data. If judged only as a static plot, the visual display can be seen as cluttered by many spaghetti lines. But when used interactively, parallel coordinates can reveal outliers, clusters, and relationships that can be further clarified using scatterplots or other simpler displays. Parallel coordinates (through their interaction) can help investigate the following:

- Focus interactively on any given observation or subspace. Selection queries are easily generated by brushing.
- Examine the relationship between the variables in the data set. Parallel lines between adjacent axes show positive correlation and intersecting lines a negative correlation. Sliding a selection window is required to uncover relationships between nonadjacent variables.
- Uncover clusters or patterns in the data or in a selected subspace (Artero et al., 2004). This can be made easier if opacity and color are used.
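A static parallel coordinates plot (without the interactive brushing discussed above) can be produced with pandas. The car-like records below are synthetic stand-ins for the chapter's data set:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(3)
n = 60
cyl = rng.choice([4, 6, 8], size=n)
df = pd.DataFrame({
    "weight": 800 + 250 * cyl + rng.normal(0, 100, n),
    "power": 20 * cyl + rng.normal(0, 10, n),
    "mpg": 60 - 4.5 * cyl + rng.normal(0, 2, n),
    "cylinders": cyl,
})

# Normalize each numeric axis to [0, 1] so the ranges are comparable.
num = ["weight", "power", "mpg"]
df[num] = (df[num] - df[num].min()) / (df[num].max() - df[num].min())

# One line per car, color-coded by the number of cylinders.
ax = parallel_coordinates(df, class_column="cylinders",
                          colormap="viridis", alpha=0.5)
ax.figure.savefig("parallel_coordinates.png")
```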

Fig. 2 visualizes the previously described car data set using parallel coordinates. In the figure, each variable takes values in a parallel vertical axis whose scale corresponds to that variable's domain of values. An observation, or a car in this particular case, is represented by a point/coordinate on each of the parallel axes. A line connects all points that belong to a single observation. In Fig. 3, the car that has the highest horsepower, approximately 230 hp, is selected and shown as a blue line. The rest of the cars/lines are shown in gray. It is easy to inspect the selected car's characteristics in relation to the rest of the data set. For example, the selected car was built in 1973 (last variable on the right); it was one of the heaviest and at the same time one of the fastest in the data set.

Relationships between adjacent variables are revealed by studying how lines are arranged between the two adjacent axes. If the lines are parallel to each other, regardless of slope, then there is a positive correlation between the two variables. In Fig. 4, this is the relationship between displacement and power for the selected four-cylinder cars. Negative correlation is shown as lines that cross each other. For example, in Fig. 4, this is the relationship between fuel economy and the number of cylinders. The higher the negative correlation, the smaller the area of intersecting lines. Uncorrelated variables in adjacent axes are shown as lines with a mix of crossing angles. In the scatterplot matrix of the car characteristics data set (Fig. 5), the pairwise correlations can be seen more clearly, but a scatterplot matrix cannot visualize as many variables as a parallel coordinates plot.

The power of parallel coordinates lies in interactive record selection and animation. Data selection is achieved by drawing an adjustable rectangle around the desired range on any given axis. By selecting records based on multiple criteria, the relationship between variables in a subspace can be explored. When the user changes the selection interactively by moving the selection window up or down, an informative animation is generated that reveals the relationships between nonadjacent variables. If lines are color-coded, and opacity is possibly used, clusters can become visible, as shown in the bottom part of Fig. 6.
The range of each variable in this parallel coordinates plot has been normalized between zero and one, as explained later in the chapter. Cars have been color-coded based on the number-of-cylinders variable (which is otherwise not included in the plot). At least three distinct patterns emerge from the display. Cars with a small displacement have low power and weight and are slower, but more fuel-efficient. On the contrary, cars with high displacement have more power and are faster, despite weighing more. A third category of cars, those with six cylinders, lies in between the four- and eight-cylinder cars.

Unlike other high-dimensional visualizations, parallel coordinates can visualize side by side categorical, ordinal, and numeric variables. However, nonnumeric variables concentrate many lines through the same point, something that makes the plot harder to comprehend. A scatterplot matrix is a much more effective way to visualize pairwise relationships, but it can become unwieldy when there are many variables to analyze. If there are n variables to visualize, there are n(n-1)/2 unique scatterplots to investigate. For example, for 10 variables, there are 45 scatterplots on the same display to go over. As stated earlier, Fig. 5 shows the scatterplot matrix of the car data set. While pairwise correlations are immediately evident in Fig. 6, the scatterplot matrix does not show the clusters that exist. Parallel coordinates and scatterplot matrices ought to be used complementarily to discover data set structure.
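The quadratic growth in the number of unique pairwise panels is easy to see with pandas' scatter_matrix; the data below are synthetic and purely illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))

# An n x n grid of panels, of which n(n-1)/2 off-diagonal pairs are unique.
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
n = df.shape[1]
unique_pairs = n * (n - 1) // 2
plt.savefig("scatter_matrix.png")
```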


3.2 Multidimensional Scaling (MDS)

Given a matrix of dissimilarities between n observations represented as pairwise distances d_ij, MDS (Torgerson, 1952) constructs a representative point y_i in R^d for each observation i in R^D, in such a way that the pairwise distances between the original objects and their representations are maintained. In classical scaling (Borg and Groenen, 2005), the low-dimensional representations of the high-dimensional data are found by minimizing the sum of the square differences between the high-dimensional pairwise distances and the low-dimensional representations. Specifically, the objective function to be minimized is the following:

Stress(y_1, y_2, ..., y_n) = ( Σ_{i<j} ( d_ij − ‖y_i − y_j‖ )² )^{1/2}
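Classical scaling can be computed in closed form by double centering the squared distance matrix and taking the top eigenpairs. The sketch below, with synthetic ground-truth points as an illustrative assumption, recovers a configuration whose pairwise distances match the input:

```python
import numpy as np

def classical_mds(D, d=2):
    """Torgerson's classical MDS: embed points from a pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered squared distances
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:d]           # top-d eigenpairs
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))                               # ground-truth points
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, d=2)
# Pairwise distances are recovered (up to rotation/reflection of Y).
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```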

Statewide Comparison of Origin-Destination Matrices Chapter 9

PART II Applications

TABLE 3 Estimated Partial Effect in Tobit Model

Independent Variables | Partial Effect (Coef.) | S.E. | z | P[|Z| > z] | Conf. Interval

Twitter OD
  T_OD | 33.021 | 0.285 | 115.870 | 0.000 | 32.462, 33.579

Spatial lag variables from CSTDM OD
  C_Oad_Dad | 0.294 | 0.009 | 31.010 | 0.000 | 0.313, 0.276
  C_Oad_D | 0.536 | 0.009 | 61.930 | 0.000 | 0.519, 0.552
  C_O_Dad | 0.572 | 0.009 | 63.180 | 0.000 | 0.554, 0.590

Spatial lag variables from Twitter
  T_Oad_Dad | 21.991 | 1.526 | 14.410 | 0.000 | 19.000, 24.981
  T_Oad_D | 40.263 | 0.884 | 45.540 | 0.000 | 41.997, 38.531
  T_O_Dad | 41.251 | 0.889 | 46.380 | 0.000 | 42.995, 39.509

Demographic characteristics from US census
  O_AREA (km2) | 0.030 | 0.003 | 8.620 | 0.000 | 0.023, 0.036
  D_AREA (km2) | 0.029 | 0.003 | 8.400 | 0.000 | 0.022, 0.036
  O_Population | 0.004 | 0.001 | 3.300 | 0.001 | 0.001, 0.006
  D_Population | 0.003 | 0.001 | 2.780 | 0.006 | 0.001, 0.005
  O_Housing | 0.005 | 0.002 | 1.940 | 0.052 | 0.000, 0.010
  D_Housing | 0.007 | 0.002 | 2.660 | 0.008 | 0.002, 0.011
  O_POP_Density (person/km2) | 0.255 | 0.035 | 7.380 | 0.000 | 0.323, 0.188
  D_POP_Density (person/km2) | 0.311 | 0.035 | 8.970 | 0.000 | 0.379, 0.243
  O_Housing Density (House/km2) | 0.425 | 0.084 | 5.090 | 0.000 | 0.261, 0.588
  D_Housing Density (House/km2) | 0.512 | 0.083 | 6.130 | 0.000 | 0.348, 0.675

Land use characteristics from NETS dataset
  O_Number of Employees | 1.855 | 0.444 | 4.170 | 0.000 | 0.983, 2.726
  D_Number of Employees | 0.800 | 0.445 | 1.800 | 0.073 | 0.073, 1.673
  O_Business Diversity | 306.117 | 107.016 | 2.860 | 0.004 | 96.369, 515.864
  D_Business Diversity | 91.372 | 106.899 | 0.850 | 0.393 | 118.146, 300.891

Distance
  Route distance between OD (km) | 3.629 | 0.070 | 52.200 | 0.000 | 3.765, 3.492


estimation step parameters are not close enough). Therefore, we use a hierarchical iterative process to estimate this model as follows:

(a) Start with one class without covariates.
(b) Proceed by increasing the number of classes for the models until any parameter fails to be identified and the size of a class becomes too small to be meaningful.
(c) Estimate a series of Latent Class Regressions with different combinations of exogenous variables and select the most suitable number of classes based on changes in goodness-of-fit criteria, such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), and the Consistent Akaike Information Criterion (CAIC), following McCutcheon (2002) and Nylund et al. (2007).
(d) Compare the models with different specifications and select the best model based on multiple statistical goodness-of-fit measures, as in the second step, as well as classification errors and R2 values. A higher R2 indicates a better model in predicting the endogenous variable, while a lower classification error means a better model in classifying spatially homogeneous groups.

The first step is identifying a suitable number of classes describing this OD trip data set. Similarly to the spatial lag Tobit model, we use CSTDM OD trips as the dependent variable and estimate a series of Latent Class models (also called mixture regression models), starting with one class and increasing the number of classes until we find an optimal model. No explanatory variable was added in this step, and eight models were identified (Table 4). Although model fit improves with each additional class, the goodness-of-fit indices (BIC, AIC, AIC3) ceased to improve dramatically beyond the four-class model, reaching an asymptote. This indicates that it is possible to explain the heterogeneous nature of the CSTDM trips efficiently with four latent classes representing the different groups of zones. Therefore, the subsequent latent class regression models are estimated using four classes.
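The class-enumeration loop in steps (a)-(c) can be sketched with a generic finite mixture as a stand-in for the latent class regression; the data, the Gaussian mixture, and the range of candidate class counts below are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
# Synthetic data drawn from three well-separated groups.
X = np.concatenate([
    rng.normal(0, 0.5, 100),
    rng.normal(10, 0.5, 100),
    rng.normal(20, 0.5, 100),
]).reshape(-1, 1)

# Fit models with an increasing number of classes; keep each model's BIC.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)   # the class count where BIC bottoms out
```

With this toy data, BIC stops improving beyond three classes, mirroring the asymptote-based selection described in the text.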
Covariates and predictors play different roles in estimating latent class regression models, as discussed earlier: covariates influence the latent classes and predictors influence the dependent variable. Since we use latent class analysis to capture spatial heterogeneity, covariates in this model reflect spatial characteristics of origins and destinations. All our exogenous variables contain zonal information; therefore, all of them could be used as covariates, predictors, or both. Although there is no consensus in the literature about which exogenous variables should be used for this type of analysis, our previous experiment in the Southern California Association of Governments area found that the latent regression model using spatial lag variables as covariates and all others as predictors produced the best results in OD matrix conversion. As mentioned earlier, the model was estimated with four classes, and their estimated membership proportions are reported in Table 5. The largest proportion of the sample (OD pairs) was found in the first class (88%, 61,995 OD pairs),

TABLE 4 The List of Estimated Latent Class Models

Model | LL | BIC (LL) | AIC (LL) | AIC3 (LL) | CAIC (LL) | Npar | Class.Err.
1-Class | 628029 | 1256258 | 1256094 | 1256112 | 1256276 | 18 | 0
2-Class | 90927.72 | 181376 | 181769 | 181726 | 181333 | 43 | 0.0049
3-Class | 106110.6 | 211462 | 212085 | 212017 | 211394 | 68 | 0.0046
4-Class | 111243.8 | 221450 | 222302 | 222209 | 221357 | 93 | 0.0096
5-Class | 114619.8 | 227923 | 229004 | 228886 | 227805 | 118 | 0.0161
6-Class | 114784.9 | 227974 | 229284 | 229141 | 227831 | 143 | 0.0145
7-Class | 117099.5 | 232324 | 233863 | 233695 | 232156 | 168 | 0.0247
8-Class | 117687.8 | 233222 | 234990 | 234797 | 233029 | 193 | 0.0256

TABLE 5 Proportion of Latent Classes and Descriptive Statistics of Each Class

Class Modal | CTrips | C_Oa_Da | C_Oa_D | C_O_Da | T_OD | T_Oa_Da | T_Oa_D | T_O_Da

1. (N = 61,995, 1,339,309 CSTDM trips)
  Mean | 21.6 | 31.5 | 25.3 | 22.1 | 0.9 | 0.6 | 0.7 | 0.6
  Min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
  Median | 3.0 | 0.8 | 1.5 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0
  Max | 813.9 | 8481.1 | 3007.5 | 4054.7 | 265.0 | 67.0 | 128.0 | 134.0
  Std. dev. | 53.5 | 155.9 | 89.5 | 85.8 | 4.4 | 2.3 | 2.9 | 2.8

2. (N = 4388, 4,196,934 CSTDM trips)
  Mean | 956.5 | 1765.2 | 1281.5 | 1233.6 | 11.2 | 14.9 | 12.6 | 12.6
  Min | 6.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
  Median | 767.5 | 1038.9 | 867.8 | 814.8 | 4.0 | 6.0 | 5.0 | 4.0
  Max | 9489.1 | 27872.0 | 16419.2 | 11263.3 | 588.0 | 274.0 | 524.0 | 525.0
  Std. dev. | 659.4 | 2159.3 | 1351.7 | 1369.8 | 29.1 | 27.5 | 30.6 | 32.5

3. (N = 2851, 24,067,710 CSTDM trips)
  Mean | 8441.8 | 7815.0 | 8230.8 | 8022.3 | 27.7 | 34.8 | 30.6 | 29.6
  Min | 149.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
  Median | 5071.7 | 6116.4 | 6228.4 | 5955.6 | 14.0 | 17.0 | 17.0 | 16.0
  Max | 416149.1 | 65133.9 | 70473.9 | 70473.9 | 2161.0 | 424.0 | 565.0 | 561.0
  Std. dev. | 13521.0 | 7341.8 | 7465.9 | 7976.3 | 68.2 | 48.7 | 46.7 | 46.2

4. (N = 991, 61,038,429 CSTDM trips)
  Mean | 61592.8 | 16504.3 | 20061.4 | 19035.9 | 301.7 | 79.6 | 119.6 | 116.7
  Min | 905.7 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0
  Median | 35309.9 | 13430.1 | 16995.5 | 16255.8 | 117.0 | 47.0 | 52.0 | 51.0
  Max | 492169.6 | 85610.1 | 87855.6 | 88592.9 | 8194.0 | 589.0 | 2134.0 | 1980.0
  Std. dev. | 71505.5 | 14768.0 | 16505.1 | 16178.6 | 613.9 | 97.7 | 193.8 | 194.0

Total (N = 70,225, 90,642,383 CSTDM trips)
  Mean | 1290.7 | 688.3 | 719.7 | 690.9 | 6.9 | 4.0 | 4.3 | 4.1
  Min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
  Median | 4.3 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
  Max | 492169.6 | 85610.1 | 87855.6 | 88592.9 | 8194.0 | 589.0 | 2134.0 | 1980.0
  Std. dev. | 11591.1 | 3408.9 | 3773.5 | 3706.8 | 82.8 | 20.5 | 30.3 | 30.2

followed by the second, third, and fourth classes (6%, 4388 OD pairs; 4%, 2851 OD pairs; and 2%, 991 OD pairs, respectively). However, in terms of CSTDM OD trips, by far the largest proportion of trips (67.3%, 61,038,429 OD trips) was found in the fourth class, followed by the third, second, and first classes (26.6%, 24,067,710 OD trips; 4.6%, 4,196,934 OD trips; and 1.5%, 1,339,309 OD trips, respectively). Although the fourth class consists of the smallest number of OD pairs, it has the largest number of CSTDM OD trips. Because we use spatial lag variables as covariates, these latent classes represent relatively homogeneous groups of OD flows with respect to their neighbors' OD flow patterns.

The right-hand side of Table 5 provides the descriptive statistics of both CSTDM and Twitter OD trips and covariates for each class. The first class captures OD pairs with relatively few trips; these pairs have relatively small numbers of both CSTDM and Twitter OD trips and are adjacent to similarly low-traffic OD pairs. The second and third classes captured zone pairs with moderate and mid-high levels of CSTDM trips, and the fourth class consists of the OD pairs with the largest number of trips by both measures as well as large interactions between their neighboring zones.

Table 6 shows latent class-specific coefficients of predictors as well as Wald statistics (significance of coefficients). The coefficients with grey shading indicate values significant at the 5% level. Significant predictors of CSTDM OD trips are the classes themselves (i.e., the overall class-specific averages are different). Based on the Wald test statistics, all of the predictors differ among latent groups except for the housing-related variables and population size in origins. Most importantly, the coefficients of Twitter trips turned out to be very different.
The smallest unit contribution of Twitter OD trips was found in the first class and the largest in the third class (2.1279 and 183.3002, respectively). Although the CSTDM OD pairs in the fourth class have the largest number of trips per OD pair, the coefficient on Twitter OD trips was relatively small, because a large number of Twitter OD trips were also found in the fourth class. This indicates that a Twitter-based OD trip should be used in a different way, depending on the underlying spatial structures, when we validate model-based OD trips. This result also shows the necessity of using a methodology that is able to reflect the heterogeneous nature of geography and the people living in different geographies.

Although the first latent class regression model had the smallest R-square value (0.2960) among classes, it included a variety of significant predictors (14 in total), and their signs are the same as in the output of the spatial lag Tobit model in the previous section. However, the magnitude of the coefficients is smaller than in the spatial lag Tobit model results (e.g., the unit contribution of a Twitter OD trip for this latent group was 2.1279, a difference of 30.893 from the Tobit model). The density of housing and population in both origins and destinations has different directional effects in the Tobit model and the latent class model, especially in the first class. Higher housing density and lower population in origins

TABLE 6 Class-Specific Coefficients of Predictors


and destinations indicate a higher number of CSTDM trips between two zones in the Tobit model, but their effects in the first latent class were the opposite. The smallest number of significant predictors was found in the second model, with a moderate R2 value (0.4713). Among the 16 predictors, only two significant predictors were found in the second model, but Twitter trips play the most important role in this class. Also, a negative coefficient was found for the number of employees in destinations, meaning that a higher number of employees in this class implies a lower number of trips.

The highest R2 value (0.9317) was found in the third model, with ten significant predictors; Twitter OD trips and area and population sizes in origins and destinations have positive coefficients, but the number of houses, business employees in both origins and destinations, and route distances between zones are negatively associated with the trips in the CSTDM output. However, none of the density and diversity variables is significantly related to the CSTDM OD matrix. The highest coefficient of Twitter OD trips across all the classes was found in this class (183.3002). Presumably, this is associated with the higher number of CSTDM trips and lower number of Twitter trips (Table 5).

The fourth class regression model yields an R2 of 0.5450, with ten significant predictor variables. This model produced the unit contribution of a Twitter OD trip closest to the Tobit model (32.6650). All of the significant coefficients in this class have a relatively higher impact on CSTDM trips than the coefficients in other classes. For example, the coefficients on route distance between origins and destinations were: Class 1: 0.1380, Class 2: 0.0235, Class 3: 133.5920, Class 4: 1782.6917. This is presumably due to the shortest mean distance between origins and destinations being in Class 4 (Class 1: 424.1, Class 2: 44.5, Class 3: 27.2, Class 4: 13.9).
Finally, the spatial lag variables play important roles as covariates in this model; the coefficients can be found in Table 6. Based on Wald statistics, the amount of trips from the neighborhood area to the destinations from the CSTDM model was the most important variable in classifying the latent classes, followed by two other spatial lag variables from the CSTDM data.

In terms of spatial distributions, the OD pairs of the latent classes are distributed differently across California (Fig. 4). In this map, straight lines between OD pairs are used to illustrate the distributions of the OD pairs. The first class seems to represent all of the long-distance OD pairs; the straight lines in this class cover the entire state of California. The second, third, and fourth classes show spatial distributions of the OD pairs of much shorter trips. Notably, the second and third classes also cover some interregional OD pairs between zones. The fourth latent class represents the inner-zone trips as well as the shortest OD pairs.

Fig. 5 shows a set of maps describing California with bar charts indicating the amount of trips, with the origins in red and destinations in blue. Each map shows each latent class's CSTDM OD trips. The first class OD pairs are widely distributed across California. The second and third classes are more densely


FIG. 4 Spatial distribution of the OD pairs of each latent class.

concentrated in the City of Los Angeles and the San Francisco Bay Area, and the fourth class is quite evenly distributed, like the first class. These maps also show spatial clusters of the zones that have OD trip patterns similar to their neighbors' trip patterns. This classification also reflects the effect of the size and relative location of the zones, because those were captured via the spatial lag variables and used as covariates.

Fig. 6 describes the proportion of the OD pairs that are classified by the latent class regression model within each Metropolitan Planning Organization (MPO) in California. The total number of OD pairs is also provided underneath the proportional bar chart. The four largest MPOs (SCAG, MTC, SANDAG, and SACOG) contain diverse latent classes. On the other hand, smaller MPOs are mainly populated by the third and fourth classes. This is presumably because the larger MPOs consist of diverse OD pairs, from short


FIG. 5 Spatial distributions of the CSTDM OD trips in each class.


[Figure: stacked bar chart of the share of Cluster 1 through Cluster 4 OD pairs (0%-100%) within each California MPO, with the total number of OD pairs shown beneath each bar.]
FIG. 6 The proportion of OD pairs in each latent class within each MPO.

TABLE 7 Four MPOs and Their Conversion Coefficients

 | MTC | SACOG | SANDAG | SCAG | Total
Mean of Twitter OD trips | 27.7 | 33.2 | 76.2 | 18.8 | 6.9
Mean of CSTDM OD trips | 5805.9 | 16,444.3 | 15,924.5 | 2821.3 | 1289.4
Conversion coefficients (Tobit) | 55.1 | 191.4 | 40.6 | 24.3 | 33.1
N | 3025 | 323 | 484 | 15,376 | 70,225

distance to long-distance OD pairs, and of urban and rural areas. This result reinforces the fact that spatially heterogeneous OD pairs require different conversion coefficients from Twitter trips to CSTDM trips, tailored to California's regions. Moreover, the OD pairs in different MPOs may need their own unique conversion coefficients, because their combinations of latent classes are different from each other. In this regard, we estimated Tobit models for the four largest MPOs separately and found different conversion coefficients (Table 7). The SCAG model has the lowest conversion coefficient (24.3), while the highest was found for the SACOG model (191.4). This result verifies the necessity of using conversion models that account for spatial heterogeneity and travel context.
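Applying the region-specific coefficients of Table 7 amounts to a simple lookup. The function below is a hypothetical illustration; the function name and the fallback to the statewide Total coefficient are our assumptions, not part of the chapter's method:

```python
# Tobit conversion coefficients reported in Table 7, per MPO.
CONVERSION = {"MTC": 55.1, "SACOG": 191.4, "SANDAG": 40.6, "SCAG": 24.3}
STATEWIDE = 33.1  # the "Total" column, used here when an MPO has no tailored model


def convert_twitter_od(twitter_trips, mpo):
    """Scale a Twitter OD trip count to model-scale trips for a given MPO."""
    return twitter_trips * CONVERSION.get(mpo, STATEWIDE)
```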

6 SUMMARY AND CONCLUSION

In this paper, a Twitter data harvester was developed in Python, and data were stored in JSON format on a MongoDB 3.0.x server; approximately 8 million geo-tagged tweets were used as input to infer travel from an origin to a destination in California. Three different Twitter trip extraction algorithms were developed, and one algorithm produced the most suitable list of Twitter trips from the geo-tagged Twitter data. The list of Twitter trips was transformed into an OD matrix with spatial aggregation at the PUMA level (70,225 OD pairs). Then, we statistically compared the Twitter-based OD matrix with a recent OD matrix from a statewide travel simulation model maintained by the California Department of Transportation.

We used a spatial lag Tobit model to develop an unbiased conversion method between Twitter trips and travel demand model output. We also used Latent Class Regression models to take into account the heterogeneous nature of space and travel. The spatial lag Tobit model produced a single unit contribution of a Twitter trip. The Latent Class Regression model produced four different unit contributions, depending on the underlying spatial organization of the zones between which trips are made. We also found that different regional jurisdictions in California contain different groups of latent classes and require different unit contributions of Twitter-inferred trips to use as ODs. Interestingly, the largest and most diverse region, the SCAG area (surrounding the Los Angeles metropolis), has the lowest conversion coefficient, and the SACOG area (surrounding Sacramento, the capital of California) has the highest. In terms of Twitter trip extraction, we recommend Rule #2 for smaller input data, but Rule #3 would be the best extraction method, theoretically and practically.
In addition, the OD matrix conversion with a spatial lag latent class regression model provides the functionality needed to account for spatial autocorrelation as well as to capture the spatial heterogeneity of OD pairs. However, the spatial lag Tobit model would be helpful for estimating individual MPOs’ conversion coefficients if latent classes are not desired. A variety of research directions emerge from the lessons learned in this research. Although the 6-month observation was a long period of data collection, it would be better to collect data for more than a year, like the CHTS, so that yearlong dynamics of travel behavior can be observed. The number of tweets, however, and their spatial distribution yield sparse matrices and cannot replace other travel demand forecasting methods for statewide travel models. For this reason, we need to expand the repertory of social media used as data sources on travel behavior. To this end, we envision the creation of an observatory in which social media data are collected for a very long period and segmented into years (e.g., to mimic statewide household travel surveys and other data collection programs in the United States). This could provide valuable information not only to state governments but also to the MPOs. Additionally, we can also impute missing trips when

226 PART II Applications

two tweets’ time difference is much longer than the estimated travel time based on each user’s home location and major tweeting locations. In this way, we can also identify social media data with high potential to complement traditional survey data, as other researchers have pointed out in the past.
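As an illustration of how such region-specific conversion coefficients might be applied, the sketch below scales a Twitter-inferred OD matrix cell by cell using the MPO of the origin zone. The coefficient values (24.3 for SCAG, 191.4 for SACOG) are those reported in the text; the dictionary layout and the fallback value are assumptions for illustration only.

```python
# Illustrative sketch: apply MPO-specific Tobit conversion coefficients
# (CSTDM trips per Twitter trip) to a Twitter-inferred OD matrix.
COEFF = {"SCAG": 24.3, "SACOG": 191.4}  # values reported in the text

def convert_od(twitter_od, mpo_of_zone, coeff=COEFF, default=100.0):
    """twitter_od: {(origin, dest): Twitter trip count}.
    Each cell is scaled by the coefficient of the origin zone's MPO;
    'default' is an assumed fallback for zones outside the listed MPOs."""
    converted = {}
    for (o, d), n in twitter_od.items():
        converted[(o, d)] = n * coeff.get(mpo_of_zone.get(o), default)
    return converted
```

A cell with 2 Twitter trips originating in a SCAG zone would thus be converted to 2 × 24.3 = 48.6 model trips.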


Chapter 10

Transit Data Analytics for Planning, Monitoring, Control, and Information

Haris N. Koutsopoulos, Zhenliang Ma, Peyman Noursalehi and Yiwen Zhu
Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States

Chapter Outline
1 Introduction
2 Measuring System Performance From the Passenger’s Point of View
2.1 The Individual Reliability Buffer Time (IRBT)
2.2 Denied Boarding
3 Decision Support With Predictive Analytics
3.1 Framework
3.2 Application: Provision of Crowding Predictive Information
4 Optimal Design of Transit Demand Management Strategies
4.1 Framework and Problem Formulation
4.2 Application: Prepeak Discount Design
5 Conclusion
Acknowledgments
References
Further Reading

INTRODUCTION

Automated data collection systems are transforming the planning, scheduling, monitoring, and operations control of transit systems. They provide operators extensive disaggregate data about the state of their system and the movement of passengers within the system (Wilson et al., 2008; Pelletier et al., 2011). The main categories of automated data sources include the following:

- Automated vehicle location systems (AVL)
- Automated passenger counting systems (APC)
- Automated fare-collection systems (AFC, smart cards)

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00010-5 © 2019 Elsevier Inc. All rights reserved.


In addition to the previously mentioned main data sources, passengers also carry sensors (smartphones) that can provide detailed information about how they use the system (Calabrese et al., 2013; Goulet-Langlois et al., 2016; Zhao et al., 2018b). Data from these sources have different characteristics, both with respect to the information they convey and their availability in time. AVL and APC data have been available for a long time and used for operations planning and scheduling (e.g., run time distributions, bus loads, etc.). AFC systems are a rather recent development. They are becoming common among transit agencies because of the convenience smart cards offer to passengers and the efficiency gains in other functions (e.g., accounting).

AFC systems are, in general, open or closed (the functionality is mainly dictated by the agency’s fare policy). Open systems require that passengers only tap in when they enter the system (e.g., MBTA in Boston). Closed systems require both tapping in and tapping out (e.g., the transit system in Seoul, Korea). As such, they provide direct information about origin-destination (OD) flows. Many systems are hybrid, utilizing an open architecture on the bus side and a closed one on the subway side (e.g., London).

The previously mentioned technologies are vehicle or station based (i.e., they are part of the agency’s infrastructure). AVL data is typically available in real time at the transit control center. However, APC and AFC data is not yet communicated in real time (although technically feasible); the data is stored locally and usually uploaded overnight. For this reason, real-time applications of AFC data are only recently emerging.

Passengers and infrastructure are also increasingly interconnected, allowing effective communication. The introduction of the mobile internet and the apps ecosystem has changed the way transit systems communicate directly with their customers.
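Because closed systems record both the tap-in and tap-out station, OD flows can be tallied directly from the transaction log. A minimal sketch, assuming a simplified record layout of (card id, entry station, exit station):

```python
from collections import Counter

def od_flows(transactions):
    """Count trips per (origin, destination) pair from closed-AFC records.
    transactions: iterable of (card_id, entry_station, exit_station).
    The record layout is an assumption for illustration."""
    return Counter((entry, exit_) for _, entry, exit_ in transactions)
```

In an open system, by contrast, the exit station is missing and must be inferred (e.g., from the next tap-in), which is why OD estimation there is a modeling task rather than a simple count.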
These technological advances conveniently link passengers and services. Passengers receive real-time information, for example about bus arrivals (which alters how waiting times are traditionally estimated) and updates about incidents, and provide feedback to operators about the quality of their services. Furthermore, apps and mobile sensors can provide additional information that complements the data collected from smart card systems, enhancing the development of customer-centric performance metrics, measures of equity and inclusion to inform policy, and better planning of operations and services. Accelerating the adoption of such technological advances is the foundation for innovation and an important means to increase public transportation effectiveness and appeal.

Fig. 1 summarizes the evolution of transit analytics as a function of the technological advances and the introduction of new systems and sensors. Solid arrows indicate existing capabilities and dashed arrows emerging applications. The fusion of data from the various sources is a key element in fully capitalizing on the potential contributions to public transport. All major agency

FIG. 1 Transit analytics evolution.

functions can benefit from such data: planning, performance measurement, operations control and management, and customer information. The latter two functions take place in real time (and hence require real-time availability of data). Fig. 2 suggests a framework for planning and managing transit service in light of the availability of automated data. It connects the off-line and real-time functions, recognizing the roles of the operator and the user of the system. The transit analytics element refers to fundamental building blocks for analysis, such as monitoring, performance evaluation metrics, and prediction. Predictive analytics has not yet received much attention. As data from the various sources become increasingly available in real time, prediction is an important capability to design better control strategies, generate

FIG. 2 Automated data collection and key functions (Koutsopoulos et al., 2017).

customized passenger information, deploy more dynamic services, and implement proactive transit demand management strategies (Zhao et al., 2018b; Noursalehi et al., under revision). The development of such methods is complicated by the role of customer behavior and response to information (the feedback loop on the demand side in Fig. 2). An important building block in pursuing the various applications is the inference of trip OD matrices by fusing AFC and AVL data (Munizaga and Palma, 2012; Gordon et al., 2013; Sánchez-Martínez, 2017). The complexity of this task varies depending on the AFC system (open, closed, or hybrid). In this chapter, we assume that the OD matrix has been estimated.

The objective of this chapter is to provide an overview of the solution of several problems enabled by the availability of extensive transit databases and to demonstrate the use of automated data to better understand and deal with the planning, monitoring, and control of transit systems. The discussion is on applications related to closed, urban heavy rail systems (subways) and focuses on the following problems: (a) measuring system performance from a passenger’s point of view; (b) making real-time decisions to improve operations and level of service; and (c) designing transit demand management strategies to increase capacity utilization.

These areas represent views of the system from three distinct perspectives with respect to time. The first one uses AFC/AVL data to evaluate past system performance. The second area deals with the problem where historical data and real-time observations are fused to inform proactive operations control. It presents a decision support platform that combines online prediction of passenger demand and simulation-based transit network performance, applied to generate customer information with respect to expected levels of crowding.
The last area uses available information on how the system is used by passengers to design optimal transit demand management (TDM) strategies. It focuses on promotions (e.g., off-peak discounts) which aim to incentivize users to modify their choices of departure time, route, and transfer stations to reduce crowding and passenger congestion, and hence improve the utilization of available capacity.

2 MEASURING SYSTEM PERFORMANCE FROM THE PASSENGER’S POINT OF VIEW

The availability of AFC and AVL data affords the opportunity to monitor system performance and facilitate the development of relevant metrics to measure passenger experience, such as service reliability, crowding levels, excess waiting time due to limited capacity, etc. (Pelletier et al., 2011; Bagchi and White, 2005; Agard et al., 2006; Zhao et al., 2007). We present metrics for


measuring passenger experience with respect to: (a) system reliability; and (b) crowding.

2.1 The Individual Reliability Buffer Time (IRBT)

Transit operators’ ability to understand and improve the service reliability experienced by passengers depends upon their ability to measure it. Traditionally, largely due to lack of data, passengers’ experience of reliability has not been measured directly. Several studies have shown that the commonly used on-time performance and headway regularity measurements do not effectively capture passengers’ experienced reliability (Abkowitz et al., 1983; Furth and Muller, 2006; Henderson et al., 1990; Ma et al., 2013). These traditional metrics generally measure reliability at the route or line level and focus on deviations from a published timetable. For passengers, however, reliability is experienced at the level of individual journeys between OD pairs, ranging from short trips on a portion of a route to long trips involving multiple interchanges. Furthermore, especially for high frequency services, passengers’ perceptions of reliability are largely determined by the predictability of their journey times rather than adherence to a published schedule; passengers are unlikely to consult a schedule before starting their journeys when using high frequency services.

With the increasing availability of automatic data collection systems, it has become feasible to measure passengers’ reliability experience at a detailed level. Addressing the limitations of traditional reliability metrics, the reliability buffer time (RBT) has been proposed to quantify passenger-experienced transit service reliability. RBT is the extra time passengers need to budget into their journey time to reduce the likelihood of late arrival to an acceptable level. The RBT is a function of service reliability and is typically defined as the difference between the Nth and 50th (median) percentiles of the distribution of passenger journey times for a specific OD pair and time period.
It represents the additional time passengers need to budget for in order to achieve an N-percent likelihood of on-time arrival. The median journey time is usually used since it is not sensitive to outliers:

RBT_OD = TT_N% - TT_50%   (1)

TT_N% and TT_50% indicate the Nth percentile and median of this distribution, respectively, for a specific OD pair and time period. Fig. 3 illustrates the concept of the RBT metric. The graph shows the journey time distribution and the values corresponding to the 50th and 95th percentiles. A distribution skewed to the right, with a long right-side tail, indicates that passengers may experience longer journey times, so the RBT increases. If travel times are consistent from day to day (for a given time period), more trips will be concentrated around the median, so the difference between percentiles will be small. For N = 95, on average, only 1 in 20 trips (or about

FIG. 3 The RBT metric.

one trip per month for a commuter) will exceed the allocated time for the trip. The metric captures service reliability from the passengers’ perspective: a high value suggests an unreliable service, with users experiencing frequent delays (e.g., crowding or incidents) that should be taken into account when scheduling their trips; low values suggest a reliable service with journey times that are consistent from day to day.

AVL and AFC data have been used in the past for the calculation of the RBT. The AVL-based RBT, proposed by Furth and Muller (2006) and later extended by Ehrlich (2010) and Ma et al. (2014), is mainly applied to bus services. It models the journey time distribution indirectly, using vehicle headways and running times from AVL data. The AFC-based RBT was first proposed by Chan (2007) and extended by Uniman (2009). It calculates the journey time distribution directly from gate-to-gate travel times based on AFC transactions. Since it requires both tap-in and tap-out times, it can be used with closed AFC systems.

Typical AFC-based approaches use the journey time distribution across users for calculating the RBT. Since the distribution is estimated based on gate-to-gate travel times, it captures the complete journey variability. It includes operational variability caused by, for example, delays and denied boarding. However, it also includes variability contributed by the interpersonal variation among users (Wood et al., 2018; Wood, 2015). This cross-passenger variation is caused by differences in walking speed, route choices, and access and egress paths within the stations (often influenced by the degree of familiarity with the system). As a result, the AFC-based RBT can be biased, as it captures both the variation of operations and the variation among passengers. The impact of the cross-passenger variation is illustrated in Fig. 4, using AFC data from a busy subway system for 1 month.
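The RBT of Eq. (1) reduces to a percentile difference over a sample of journey times for one OD pair and time period. A minimal sketch, assuming journey times in minutes and linear-interpolation percentiles:

```python
import numpy as np

def rbt(journey_times, n=95):
    """Eq. (1): RBT = Nth percentile minus the median of the journey
    time distribution (same units as the input, e.g., minutes)."""
    jt = np.asarray(journey_times, dtype=float)
    return float(np.percentile(jt, n) - np.percentile(jt, 50))
```

For perfectly consistent travel times the RBT is zero; the heavier the right tail of the distribution, the larger the RBT.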
Passengers are divided into two groups based on their frequency of using the system. One group had more than 20 trips per passenger in the analysis month, while the other had fewer than 10 trips in the same period. Fig. 4 compares the RBT for the two groups for different periods of the day, using the 95th percentile. The RBTs for frequent riders are lower than for infrequent ones. The overall RBT is close to the RBT for infrequent users, as the demand on any given day is dominated by

FIG. 4 RBT for frequent and infrequent users.

infrequent users. Hence, individual variability, if not controlled for, influences the calculation of the RBT. Journey times, calculated as gate-to-gate times (as reported by the AFC transactions), consist of access and egress times, transfer times, waiting times, and in-vehicle times. Individual characteristics impact access, egress, and interchange times, and interpersonal variations in these journey time components should not contribute to the calculation of the RBT when the metric is used for measuring system performance. To deal with this problem, we have proposed calculating the RBT at different levels of user-group aggregation: all users, specific groups, and individuals. The individual RBT (IBT) is, by the definition of RBT, the most accurate measure of reliability as experienced by the users. It is defined as (Wood et al., 2018):

IBT = TT_N% - TT_50%   (2)

where TT_N% and TT_50% represent the individual’s journey time percentiles. The factors affecting the journey time distributions can be grouped into two categories: service-related and passenger-related. Passenger-related factors contribute to the population journey time variability, but for the same individual they are unlikely to vary significantly across trips. Hence, the passenger-related component of the typical individual’s travel time variability should be small relative to the service-related component. The IBT reflects mostly the individual’s service experience, whereas typical AFC-based RBT values reflect the combination of both passenger-related and service-related factors. Consequently, RBT values may be higher than IBT values; the RBT, in effect, “overestimates” the typical individual’s buffer time. This hypothesis was tested by analyzing individuals’ actual travel times from the same system for a specific OD pair and for a period

FIG. 5 Individual RBT and median journey times.

of 2 months (Wood et al., 2018). Fig. 5 compares the individual RBT (IBT) and 95th percentile journey time values (broken down into median time and IBT) for each hour of the day, for passengers who had at least 20 trips over the 2-month period. Fig. 5A orders the individuals by median journey time, with that user’s RBT added on top, while in Fig. 5B individuals are ordered by their IBT. The RBT is calculated over those 2 months for all users, and for users with at least 20 trips in the same period. Midday and night hours have been aggregated for the individual RBTs because of the small number of frequent users in these periods (the overall hourly RBTs are still shown hour by hour). The individual RBT for frequent riders is lower than the RBT calculated over all passengers for different groups. The overall RBT is similar to the higher end of the IBT, and hence can play the role of an upper bound on buffer time.

Fig. 5A further suggests that having a low median travel time does not correlate with a low RBT. While the individuals are ordered by median travel time, the total height is irregular, implying that some users with shorter median


travel times actually have a longer median + RBT time. Some passengers are fairly consistent, even if their trip is long, while others may have a wider distribution of journey times even if their median time is low. Passengers with consistently longer travel times may be in such a group for reasons unrelated to service variability; for example, they could have lower walk speeds.

While the IBT as defined earlier applies at the individual level, in order to quantify the “typical” passenger’s reliability experience at the system level, the Individual Reliability Buffer Time (IRBT) is defined as follows:

IRBT_OD = median{ IBT_OD }   (3)

IBT_OD is the IBT for users traveling on the corresponding OD pair. The IRBT can be calculated, depending on the application, at different levels of spatial aggregation (line segment, direction-line, network, and transfer pattern), in addition to the OD level. At the line level, for example, it is calculated as follows:

IRBT_Line = (Σ_{OD ∈ Line} f_OD · IRBT_OD) / (Σ_{OD ∈ Line} f_OD)   (4)

where Line is the set of same-line OD pairs and f_OD is the OD pair demand.

We demonstrate the use of the IRBT metric in a scenario where the demand in the system increased due to external shocks (Wood, 2015). In particular, the system-wide demand increased by about 9%, with one of the lines experiencing a surge of more than 25%. Furthermore, the increase was mostly concentrated in the peak periods, especially the morning peak. With a system already operating at capacity, this surge in demand led to additional delays for passengers, especially due to denied boarding. The IRBT metric can be used to assess the impact of the demand increase by comparing its value for a period of 6 weeks before (base) and 6 weeks during the surge. The IRBT_Line for passengers using the most affected line in the system (i.e., both origin and destination stations belong to the line) increased by 1.5 minutes during the peak periods. The IRBT for passengers transferring to the most impacted line, using one of the most congested stations in the entire system as their transfer station, increased by more than 3 minutes in the peak (about 100%). This is mainly because the transfer station was already very crowded (even before the demand increase) and close to the line’s peak load point. Denied boarding and associated delays were the main contributors to this increase. The IRBT during the off-peak periods remained mostly the same. The results indicate that the IRBT was able to capture the significant impact of the demand surge on passengers’ experienced service reliability, for certain journeys and times of day.
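Eqs. (2) through (4) chain together naturally from AFC journey time records. The sketch below assumes a simplified record layout of (user id, OD pair, journey time) and the 95th percentile; both are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def irbt_by_od(trips, n=95):
    """Eqs. (2)-(3): compute each user's IBT (Nth minus 50th percentile of
    that user's journey times on an OD pair), then take the median of IBTs
    across users of each OD pair. trips: iterable of (user, od, time)."""
    per_user = defaultdict(list)
    for user, od, jt in trips:
        per_user[(user, od)].append(jt)
    ibts = defaultdict(list)
    for (user, od), jts in per_user.items():
        ibts[od].append(float(np.percentile(jts, n) - np.percentile(jts, 50)))
    return {od: float(np.median(v)) for od, v in ibts.items()}

def irbt_line(irbt_od, demand_od, line_ods):
    """Eq. (4): demand-weighted average of OD-level IRBTs over the set of
    same-line OD pairs."""
    total = sum(demand_od[od] for od in line_ods)
    return sum(demand_od[od] * irbt_od[od] for od in line_ods) / total
```

In practice a minimum trip count per user (e.g., the 20-trip threshold used in the text) would be applied before computing individual percentiles.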


2.2 Denied Boarding

Increases in ridership are outpacing capacity in many transit systems, such as Hong Kong’s Mass Transit Railway (MTR), the London Underground, and the New York subway system (Zhu et al., 2017a). Crowding at stations and on trains is an important issue due to its impact on safety, service quality, and operating efficiency. Various studies have measured passengers’ willingness to pay for less crowded conditions (Li and Hensher, 2011) and suggest incorporating the crowding disutility in investment appraisals (Haywood and Koning, 2015). Given the interest in dealing with crowding-related problems effectively, developing related measures of performance is important.

Denied boarding due to overcrowding has become a major concern for many transit operators. Hence, the number of times passengers are denied boarding and how long they wait before they can board a train are often used as related performance metrics. However, measuring the number of times a passenger is denied boarding is not trivial (some agencies conduct manual counts to collect

TABLE 1 Approaches to Denied Boarding Estimation From AFC/AVL Data

Approach | Data | Level | Applications | Characteristics
Statistical inference | AFC (tap-in and out); AVL; access/egress distance/speed | Station | Performance measurement | Needs access/egress time distributions; unsupervised learning
Regression | AFC (tap-in); AVL; denied boarding observations | Station | Performance measurement; prediction | Requires actual observations of denied boarding for calibration; supervised learning
Network assignment | OD flows; AVL; capacity; path choice fractions | Network | Performance measurement; planning; various crowding metrics | Requires capacity; applied at the network level; deterministic


the needed data). Since it is not directly observable by current automated data sources, various approaches have been proposed to estimate it by fusing smart card (AFC) and train movement (AVL) data. These methods belong to two broad categories: (a) statistical (inference and regression models); and (b) assignment. Table 1 summarizes their main characteristics. The statistical methods are either based on unsupervised learning or use actual observations for calibration.

Recently, we presented a method for the estimation of denied boarding using two data sources: (i) fare transaction records from a closed AFC system (i.e., a system where passengers both tap in and tap out), and (ii) train tracking data from the AVL system, which provides station arrival and departure times (Zhu et al., 2017b). We assume a closed AFC system, where the tap-in/out times of passengers are known. Train arrival/departure times at stations are also known from the train control and signaling system (AVL). Fig. 6 shows the movement of a passenger who enters the system at t_in and exits at t_out. The estimation uses data from trips without transfers and route choice. Access time is defined as the time

FIG. 6 Time-space diagram for a passenger and trains (Zhu et al., 2017a).

PART II Applications

to walk from the tap-in (entry) gate to the platform; waiting time is the time spent waiting on the platform; and egress time is the time to walk to the tap-out (exit) gate after alighting. For each passenger i, we define the set of feasible trains. A train j is feasible if it arrives at the origin station after the passenger reaches the platform and at the destination before the passenger taps out. This is a conservative definition, assuming zero access and egress times. For example, the passenger in Fig. 6 can board one of three trains (1, 2, 3). The following notation is used in the discussion:

t_i^in: passenger i's tap-in time.
t_i^out: passenger i's tap-out time.
t_i^a: passenger i's access time.
t_i^e: passenger i's egress time.
τ_i^a: minimum access time for passenger i (set conservatively to zero).
τ_i^e: minimum egress time for passenger i (set conservatively to zero).
M_i: number of feasible itineraries for passenger i.
M: maximum number of feasible itineraries in the group.
DT_{i,j}: "relative" departure time from the origin station of the jth train in the feasible itinerary set (after setting the tap-in time of passenger i to zero), for j ≤ M_i.
AT_{i,j}: "relative" arrival time at the destination station of the jth train in the feasible itinerary set (after setting the tap-in time of passenger i to zero), for j ≤ M_i.
JT_i: journey time of passenger i.
f_a(t): access time distribution.
f_e(t): egress time distribution.
P_n: probability of being left behind n times.
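The feasible itinerary set can be computed directly from the tap records and the timetable. The following is a minimal sketch with hypothetical times in seconds; per the conservative definition above, the minimum access and egress times τ_i^a, τ_i^e are set to zero, and we take "departs the origin at or after tap-in" as the boarding condition:

```python
def feasible_trains(tap_in, tap_out, departures, arrivals):
    """Return the indices of feasible trains for one passenger.

    A train j is feasible if it departs the origin at or after the
    passenger's tap-in (zero access time assumed) and arrives at the
    destination at or before the passenger's tap-out (zero egress time).
    """
    return [j for j, (d, a) in enumerate(zip(departures, arrivals))
            if d >= tap_in and a <= tap_out]
```

For example, a passenger tapping in at 0 s and out at 500 s, with trains departing at 30, 90, 150, and 400 s and arriving at 330, 390, 450, and 700 s, has trains 1–3 in the feasible set; the last train arrives after the tap-out and is excluded.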


FIG. 7 Possible passenger trajectories (Zhu et al., 2017a).


Fig. 7 illustrates all possible instances for a passenger with M_i feasible itineraries. Given the feasible set, the passenger may have arrived at the platform in any of the train departure intervals (based on his/her access time), and may board the first train upon arrival at the platform or be left behind if there is no available capacity. The branches with a dashed line represent the denied boarding instances. For example, even if the passenger arrives before the first train, he/she can be left behind and have to board the second (or third) train due to capacity constraints. We assume that the access/egress speed distributions are known. They can be estimated from manual surveys conducted at stations. Zhu et al. (2017a) discuss how AFC data can be used to estimate walk speeds at stations using observations from trips that have only one feasible train. The method corrects for the bias inherent in such observations, since passengers with one feasible train may not be representative of the population. The approach was shown to be effective when compared to manual observations, and provides an alternative to collecting access/egress time data through expensive (manual) surveys. Assuming that, during a short time period, the probability distribution of denied boarding at one station is constant, the parameters of the denied boarding probability mass function can be estimated by maximizing the likelihood of the observations. As shown in Fig. 7, the probability of passenger i arriving at the origin station platform between the departures of trains j−1 and j is as follows:

$$P\left(DT_{i,j-1} \le t_i^a < DT_{i,j}\right) = \int_{DT_{i,j-1}}^{DT_{i,j}} f_a(t)\,dt \quad \text{for } 1 \le j \le M_i \tag{5}$$

with $DT_{i,0} = 0$. It can be shown that the probability of passenger i tapping out at their observed exit time can be calculated as follows:

$$
\begin{aligned}
L_i(Z) &= \sum_{j=1}^{M_i} \sum_{k=j}^{M_i} P\left(t_i^{out},\ \text{board train } k,\ DT_{i,j-1} \le t_i^a < DT_{i,j}\right) \\
&= \sum_{j=1}^{M_i} \sum_{k=j}^{M_i} \int_{DT_{i,j-1}}^{DT_{i,j}} f_a(t)\,dt \; P_{k-j}\, f_e\!\left(JT_i - AT_{i,k}\right) \\
&= \sum_{j=1}^{M_i} \int_{DT_{i,j-1}}^{DT_{i,j}} f_a(t)\,dt \sum_{k=j}^{M_i} P_{k-j}\, f_e\!\left(JT_i - AT_{i,k}\right)
\end{aligned} \tag{6}
$$

where $Z = [P_0, P_1, \ldots, P_{M-1}]^T$ is the vector of parameters of the denied boarding distribution, and $f_e(JT_i - AT_{i,k})$ is the egress time probability distribution (derived from the walk speed distribution). We assume that the maximum number of times a passenger is denied boarding equals the maximum number of feasible itineraries in the group minus one, i.e., the length of Z is equal to M.


For the whole group, assuming conditional independence among passengers, the probability of observing the journey times of all N passengers in the group is as follows:

$$L(Z) = \prod_{i=1}^{N} L_i(Z) \tag{7}$$
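The estimator in Eqs. (5)–(7) can be sketched numerically as follows. The access/egress distributions and all numbers below are illustrative placeholders, not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm

def passenger_likelihood(Z, DT, AT, JT, Fa, fe):
    """L_i(Z), Eq. (6): sum over the platform-arrival interval j (Eq. 5)
    and the boarded train k; Z[n] is P_n, the probability of being
    denied boarding n times."""
    Li, prev = 0.0, 0.0
    for j, dt in enumerate(DT):
        p_arrive = Fa(dt) - Fa(prev)       # Eq. (5), with DT_{i,0} = 0
        prev = dt
        for k in range(j, len(DT)):        # board train k after k-j denials
            Li += p_arrive * Z[k - j] * fe(JT - AT[k])
    return Li

def neg_log_likelihood(Z, passengers, Fa, fe):
    """Negative log of Eq. (7), suitable for minimization."""
    return -sum(np.log(passenger_likelihood(Z, *p, Fa, fe) + 1e-300)
                for p in passengers)

# Hypothetical inputs (in seconds): uniform access times on [0, 120],
# normally distributed egress times; one trip with three feasible trains.
Fa = lambda t: float(np.clip(t / 120.0, 0.0, 1.0))   # access time CDF
fe = norm(60.0, 15.0).pdf                            # egress time PDF
trip = ([30.0, 150.0, 270.0], [600.0, 720.0, 840.0], 700.0)  # DT, AT, JT
L = passenger_likelihood([0.7, 0.2, 0.1], *trip, Fa, fe)
```

The vector Z would then be estimated by minimizing `neg_log_likelihood` subject to Z ≥ 0 and ΣZ = 1, e.g., with `scipy.optimize.minimize` (method SLSQP), consistent with the chapter's note that the MLE problem was solved with the SciPy optimization package.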

The maximum likelihood formulation of the problem is based on Eq. (7) and yields the probability that a passenger, during the corresponding time period, is denied boarding n times. The model was validated with synthetic data and also applied using an extensive AFC/AVL data set from a congested subway system (Zhu et al., 2017a). The MLE problem was solved using the SciPy optimization package (Jones et al., 2001). The data used reflect AFC transactions at two busy stations (S1, S2). A heavily used OD pair was used to estimate denied boarding probabilities at station S1. The journey time distribution for the OD pair S1–S2 is shown in Fig. 8. The journey times increase during the peak, reflecting longer waiting times due to denied boarding. The estimated denied boarding probabilities for passengers boarding at Station 1 are shown in Fig. 9 and compared against manual surveys that took place at the station on the same day as the AFC/AVL data used for the estimation. The estimation results are similar to the survey results. In many cases, the quantity of most interest is the denied boarding rate, defined as the percentage of passengers not able to board the first train. The results from the method

FIG. 8 Journey time distribution.


FIG. 9 Probability of denied boarding.

presented here are consistent with the manual observations with respect to the denied boarding rate. The differences observed in the detailed distributions can be attributed both to the method (e.g., assumptions about access/egress speed distributions) and to the inaccuracy of manually counting passengers during congested times (especially those waiting for several trains). This is due to the counting process itself, as well as the assumption, used in processing the manual data, that boarding follows a first come, first served (FCFS) principle. The agency has used the model for denied boarding analysis and estimated the corresponding probabilities over a period of 2 months. Fig. 10 illustrates the heat map of the denied boarding rate. While, in general, the performance of the system is consistent from day to day, on a few occasions the denied boarding was higher than expected, for example, on some weekend days. Further analysis revealed that on those days, incidents reduced the train frequency and increased the number of passengers who had to wait for an extra train.

3 DECISION SUPPORT WITH PREDICTIVE ANALYTICS

Most of the literature on the use of automated transit data has focused on the retrospective analysis of system performance (evaluation and monitoring, developing performance measures, and understanding how passengers use the system). Predictive models, however, enable proactive strategy implementation to deal with abnormal conditions (incidents, surges in demand, etc.). They can also be used to generate information for users about the upcoming state of the network. Atypical conditions may be caused by demand fluctuations, due either to day-to-day variations or to external factors such as large-scale events and incidents (e.g., line or station closures). Short-term demand predictions, less than an hour


FIG. 10 Denied boarding rate (% of passengers unable to board the first train).

into the future, are thus important for developing anticipatory dynamic control strategies and providing useful customer information. Predictive control and information provide the opportunity for improving passenger experience by adjusting service and possibly influencing passengers’ trip making choices. Operators can foresee upcoming, and most importantly, unexpected demand patterns at stations and proactively adjust service or implement crowd management strategies.


There is also a growing demand from passengers to be provided with information on the near-future service conditions of the system. The prevalence of smartphones facilitates the delivery of such information to users in real time. This dissemination of information provides the opportunity to incite cooperative behavior from passengers as they make informed travel decisions. For example, passengers whose origin stations are predicted to experience overcrowding, and who are unlikely to board the first arriving train, can be advised to delay their arrival or use an alternate route. This may alleviate the pressure on the transit system and reduce congestion on platforms and trains, resulting in better utilization of the available capacity and improved passenger experience as well.

3.1 Framework

We discuss a predictive decision support platform that addresses both operations control and customer information needs. Fig. 11 illustrates the main structure of the proposed framework. The platform consists of two modules: (i) the short-term demand prediction engine, and (ii) the online mesoscale simulation engine (performance prediction). The framework accounts for demand–supply interactions, especially passenger response to information about the state of the system (if available). The inputs include real-time AFC transactions and train position data (and train car loads if available, e.g., from train load sensors), as well as timetables, historical AFC transactions, and information about exogenous events.

3.1.1 Demand Prediction Engine

The short-term demand prediction engine has two components: prediction of arrivals at stations, and OD prediction. Arrivals at stations are predicted in real time for the next few time periods (for example, each time period may be 15 minutes long). Information about major (planned) events that are known

FIG. 11 Predictive decision support platform.


to happen on that day (e.g., football games) is input to the system, and its effects are reflected in the predictions. Model specification for each station can be a time-consuming and tedious task, and station-specific models do not easily account for possible relationships between demands at other stations. For practical purposes, it is desirable to have models which capture such interactions intrinsically and deal with a large number of stations simultaneously. Dynamic factor models (DFMs) are an effective way of capturing such effects. The main characteristic of DFMs is their ability to model a large number of time series simultaneously, through a few common factors. Fig. 12 shows the 1-step (15 minutes) ahead prediction for arrivals at a busy station in the London Underground network, and compares it to the true, observed demand, as well as the historical average values (the model was estimated using data from 27 weekdays, and the leave-one-out method was used for validation). Major social and sports events can increase demand for public transportation significantly over a short period of time. The method is able to deal with planned events, assuming enough days with such cases are available in the historical database (training set). Fig. 13 compares the 1-step ahead prediction of arrivals during a soccer game night with the observed demand. The model predicts the demand surge, as opposed to responding to it with a lag (see Noursalehi, 2017 for details).
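The gist of the factor-model idea can be sketched with a deliberately simplified version: extract a few common factors from the station-by-period arrival matrix via an SVD and forecast each factor with an AR(1) model. This is an illustrative approximation, not the chapter's actual DFM specification:

```python
import numpy as np

def fit_factor_model(Y, k):
    """Y: (T, N) matrix of arrivals (T periods, N stations); k: number of factors."""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    W = Vt[:k].T                     # (N, k) factor loadings
    F = Yc @ W                       # (T, k) common factor time series
    # least-squares AR(1) coefficient for each factor
    phi = np.array([f[:-1] @ f[1:] / (f[:-1] @ f[:-1]) for f in F.T])
    return mu, W, F, phi

def one_step_ahead(mu, W, F, phi):
    """Jointly predict next-period arrivals at all N stations."""
    return mu + W @ (phi * F[-1])
```

The appeal mirrors the text: one small model produces simultaneous forecasts for all stations, with cross-station dependence carried by the shared factors rather than by station-specific specifications.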

FIG. 12 One-step ahead arrival predictions (Koutsopoulos, et al., 2017).


FIG. 13 One-step ahead arrivals prediction vs observed demand during a special event (Koutsopoulos et al., 2017).

While station demand is useful for station-specific crowd management, OD flows are important as they drive many decisions related to crowd management for the whole network and customer information. An interesting characteristic of the OD prediction problem is that the true OD demand cannot be observed until all passengers entering a station at time t have finished their trips at their destination station at some time t′ > t. The observation lag, Δt = t′ − t, is a function of the travel time from the origin to the destination, which itself depends on many other factors, such as dwell times, other OD demands, train capacities and speeds, etc. Therefore, at each time period t, observations include arrivals at the origin station and the number of trips to the destination station that have been completed by that time. OD demands in consecutive time intervals are likely correlated. There is also a correlation between OD demand and passenger arrivals at the origin station. In addition, OD flows often exhibit nonsmooth behavior, with fluctuations in consecutive time steps. This is in part due to the lower demand per time interval, compared to station arrivals. Demand patterns during the peak hours are typically different from the rest of the day, and the learning model should be able to capture both of these patterns. Because of these characteristics, tree-based ensemble methods are used for this prediction task. There are two major types of tree-based ensemble methods: random forests, which are based on the idea of bagging, and gradient boosted trees, which use boosting for combining the trees (Breiman, 2001). Fig. 14


FIG. 14 MAE distribution of 1-step ahead predictions for a few major OD pairs.

compares the predictive performance of the two models for a number of busy stations in the London Underground using the mean absolute error (MAE) as the performance metric. It also shows predictions based on the historical average. The MAE has been calculated over 27 days using the leave-one-out cross-validation method (i.e., the model is estimated on 26 days and used to predict demand for the remaining day). Fig. 15 compares the 1-step ahead prediction of the demand for a busy OD pair with the historical averages. The predictions are generated using gradient boosted trees. The model is able to capture the demand fluctuations, even during the peak period, when the actual demand is much higher than the historical average.
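To make the boosting idea concrete, here is a from-scratch miniature: gradient boosting for squared loss with depth-1 regression stumps, fed lagged origin arrivals and lagged (partially observed) OD counts as features. The feature set, data, and hyperparameters are purely illustrative; the study's models and features differ, and in practice a library implementation would be used:

```python
import numpy as np

class Stump:
    """Depth-1 regression tree: one feature, one threshold, two leaf values."""
    def fit(self, X, r):
        best_err = np.inf
        for f in range(X.shape[1]):
            for s in np.quantile(X[:, f], np.linspace(0.1, 0.9, 9)):
                left = X[:, f] <= s
                if left.all() or not left.any():
                    continue
                lv, rv = r[left].mean(), r[~left].mean()
                err = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
                if err < best_err:
                    best_err, self.f, self.s, self.lv, self.rv = err, f, s, lv, rv
        return self

    def predict(self, X):
        return np.where(X[:, self.f] <= self.s, self.lv, self.rv)

def boost_fit(X, y, n_trees=60, lr=0.1):
    """Boosting: each stump is fit to the current residuals (the negative
    gradient of the squared loss), then added with a shrinkage factor."""
    base, pred, trees = y.mean(), np.full(len(y), y.mean()), []
    for _ in range(n_trees):
        t = Stump().fit(X, y - pred)
        pred = pred + lr * t.predict(X)
        trees.append(t)
    return base, trees

def boost_predict(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)

def lagged_features(arrivals, od, lags=4):
    """Features for period t: recent origin arrivals and lagged OD counts."""
    X = [np.r_[arrivals[t - lags:t], od[t - lags:t]] for t in range(lags, len(od))]
    return np.array(X), od[lags:]
```

On synthetic demand with a smooth peak pattern, such a boosted ensemble should beat the historical-mean baseline, which is the qualitative comparison the figures in this section make.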


FIG. 15 One-step ahead demand prediction for a busy OD pair.

3.1.2 Online Simulation Engine (Performance Prediction)

The second building block of the predictive decision support platform is the online simulation engine for performance evaluation. It simulates train movements based on link-specific speed distributions and station-specific dwell times, taking into account a minimum buffer between consecutive trains. The model also simulates passenger arrivals at stations and assigns them to a specific direction, line, and train, according to the OD predictions and available train capacity. If information related to crowding on upcoming trains is available at the platforms, a passenger may defer boarding the first train and wait for the next one if the information indicates more favorable conditions. The simulation model is designed to be computationally efficient, as it is used online (in real time). Another important design feature is its ability to self-correct based on real-time train position data that may be available (usually from the signaling system).

3.1.3 Demand, Supply, Information Loop

The decision support platform explicitly captures the demand–supply interactions, especially in the presence of information. Information influences the travel decisions of passengers, impacting their pretrip decisions or path choices while in the system. If, for example, information about crowding on upcoming trains is provided to passengers waiting at the platform, the information affects the passengers' decision about whether to board the current train. This, in turn,


changes the train loads and their available space upon arrival at the next station, which may change the predicted boarding likelihood. It is therefore important for the decision support system to incorporate the passenger response to information in its prediction of train loads. If predictions ignore this feedback loop, the information may be unreliable. Unreliable information can erode trust, with users eventually ignoring it. The problem is treated as a fixed-point problem, and an iterative algorithm is developed to capture passengers' response to information in the prediction framework (a similar problem exists in the context of generating traffic information; see Ben-Akiva et al., 2010). Details of the predictive information logic implemented in the decision support system can be found in Noursalehi (2017).
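The fixed-point logic can be sketched abstractly: iterate the predict → inform → respond → reload map until the loads that generate the information are consistent with the loads that result from passengers' response. A toy response curve stands in for the full simulation; all functions and numbers here are hypothetical:

```python
def fixed_point(mapping, x0, tol=1e-9, max_iter=500):
    """Iterate x <- mapping(x) until successive values converge."""
    x = x0
    for _ in range(max_iter):
        x_new = mapping(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def crowding_map(c):
    """Toy stand-in for the loop: displayed crowding level c induces
    deferrals, which in turn lower the resulting load."""
    defer_share = min(1.0, 0.05 * c)    # hypothetical response curve
    return 10.0 - 4.0 * defer_share
```

Because the response dampens the load (the map is a contraction here), simple successive substitution converges; the consistent prediction is the value at which published information and resulting loads agree.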

3.1.4 Implementation

In a typical implementation, the decision support system will be used on a continuous basis, constantly updating predictions and information as data become available. At 7:00 am, for example, the demand prediction module outputs OD predictions for the next 30 min, in 15-minute intervals. Based on the available data, the simulation engine simulates passenger arrivals and train movements for the same time interval. As new data become available, the simulation corrects its state based on the most recent information. At 7:05 am, it compares actual train position data (e.g., from the signaling system) with the predicted positions for that time, and updates them accordingly. With its corrected state, the simulation engine then models the system for the prediction horizon.
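The rolling update cycle described above can be sketched as a loop over update intervals; all callables are placeholders for the platform's actual modules:

```python
def rolling_horizon(correct_state, predict_demand, simulate, data_feed, n_cycles):
    """Each cycle: ingest fresh AFC/AVL data, reconcile the simulated state
    with it, predict demand for the horizon, then simulate performance."""
    state, results = None, []
    for cycle in range(n_cycles):
        data = data_feed(cycle)               # new AFC/AVL observations
        state = correct_state(state, data)    # self-correct simulated state
        demand = predict_demand(data)         # e.g., next 30 min in 15-min bins
        results.append(simulate(state, demand))
    return results
```

The key design point carried over from the text is the state correction step: each cycle, predicted train positions are reconciled with observed ones before the next performance prediction is produced.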

3.2 Application: Provision of Crowding Predictive Information

We use the decision support model in an application that provides passengers waiting on platforms with information about the upcoming trains, including their predicted arrival time and the available space for passengers to board. For the purposes of this application, we assume that information about crowding is communicated to the passengers at stations using the design shown in Fig. 16. The color coding scheme translates the expected residual capacity of a train upon arrival at the station into the likelihood of boarding. Passengers

FIG. 16 Information displayed to passengers waiting on platforms.


see the predicted arrival times and predicted available space for the next two upcoming trains. Based on this information, they may defer boarding the first train if the information indicates that a subsequent train is expected to provide a better experience. Passengers make their boarding decisions based on the state of the train at the platform (observed) and the displayed predictive information for the next arriving train. It is assumed that a passenger who was denied boarding the previous train, or who had decided not to board it, will attempt to board the current train regardless of the information about upcoming trains. If there is enough space available on the train for all the passengers waiting at the platform to board (guaranteed boarding), then he/she will always do so. If boarding is not guaranteed, the passenger consults the information on upcoming trains. If there is an upcoming train that arrives in less than the tolerance time (e.g., 5 minutes), and the predictive information about its crowding state is "Green" (i.e., guaranteed boarding), the passenger may decide to wait for the next train with some probability p. Otherwise, the passenger will always attempt to board the current one.
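The boarding rules above can be summarized in a small decision function. Parameter names, the 5-minute tolerance, and the deferral probability are illustrative; `u` is a uniform(0,1) draw supplied by the simulation:

```python
def attempts_to_board(was_left_behind, guaranteed_boarding, next_train_wait,
                      next_train_green, u, tolerance=5.0, p_defer=0.3):
    """True if the passenger attempts to board the current train."""
    if was_left_behind:
        return True      # denied or deferred last time: always try now
    if guaranteed_boarding:
        return True      # enough space for everyone waiting on the platform
    if next_train_green and next_train_wait < tolerance:
        return u >= p_defer   # defers to the next train with probability p_defer
    return True
```

Ordering matters: the left-behind rule dominates the information rule, so a passenger never defers twice in a row, mirroring the assumption stated in the text.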

FIG. 17 Number of left-behind passengers (observed vs predicted) vs deferring probability threshold.


FIG. 18 Data-driven dashboard for station and train crowding levels.

The study area consists of 46 stations on the Central and Victoria lines, for a total of 2070 OD pairs. The simulation is performed for the morning peak hours, from 6 am to 11 am. Different scenarios are considered based on the assumed probability of deferring boarding. p = 0 corresponds to the case where there is no information. As p increases, passengers become more responsive to the information and make their boarding decisions accordingly. Fig. 17 compares the predicted number of left-behind passengers with the "observed" ones, with respect to the deferring probability p. As expected, as p increases, meaning that passenger decisions are more responsive to the information about the status of upcoming trains, the number of passengers left behind decreases. The run time for simulating 30 minutes of operations ranges from 40 to 70 seconds on a computer with 16 GB of RAM and a 3.4 GHz CPU. The run time depends on the number of concurrent computations (number of updates, simultaneous train loadings, etc.), making the platform suitable for real-time applications. An interactive visualization module has been developed. It provides animation of train movements and crowding information. Stations predicted to experience overcrowding are shown in red (otherwise in green or white, depending on the line). Trains operating at capacity are shown in purple. Selecting any train displays a time series of its load up to that point. Similarly, selecting any station displays the current number of people waiting on each platform, as well as crowding levels for the previous time period (Fig. 18).

4 OPTIMAL DESIGN OF TRANSIT DEMAND MANAGEMENT STRATEGIES

While control strategies, like the ones described in the previous section, are used to mitigate crowding situations in real time, many agencies also deploy transit demand management (TDM) strategies to divert passengers to less crowded


routes and time periods. Well-structured transit TDM strategies can help agencies better manage the available system capacity when the opportunity and investment to expand are limited. Many transit TDM strategies currently implemented use off-peak discounts to incentivize users to shift out of peak periods. Within this context, various implementations offer station-based promotion schemes, such as the "early bird" promotion in Hong Kong, where passengers exiting the designated stations between 7:15 am and 8:15 am receive a 25% discount (Halvorsen et al., 2016). The discount serves as an incentive for passengers to shift their travel times to an earlier period, resulting in less crowding during the peak, especially on lines that operate at capacity (critical links). An issue with such discount strategies is that the promotion benefits many passengers beyond those who actually shift from the peak period. Consider a promotion that gives a discount to passengers travelling in the prepeak period, and denote period I as the early morning period, period II as the prepeak promotion period, and period III as the peak period. The main groups of passengers involved in such a promotion are:

- G1: Passengers who usually exit in period I (off-peak) and shift to period II (prepeak) to receive the discount. Their change in behavior has no impact on crowding levels at the critical links during the peak period.
- G2: Passengers who typically exit in period II. They receive the discount without having to alter their behavior, with no benefit to peak-hour crowding levels.
- G3: Passengers who usually exit in period III and shift to period II, but do not contribute to congestion during the peak period because they do not use the most congested paths or critical links.
- G4: Passengers who usually exit in period III, shift to period II, and travel through the critical links; their behavioral change reduces crowding on the critical links during peak period III.

Passengers in group G4 are the ones actually targeted by the promotion (effective passengers), while passengers in groups G1–G3 are not targeted, since they do not contribute to the load on the critical links, but still benefit from the promotion (ineffective passengers). Hence, the cost of implementing such promotions includes the revenue lost to ineffective passengers, in addition to the other implementation costs. An effective TDM strategy should minimize the number of passengers who receive this "free lunch" and maximize the number of effective passengers. However, transit systems are complex, and the design of a TDM scheme, deciding when, where, and how much discount or surcharge to implement, is not trivial. We discuss a general framework for the optimal design of TDM promotion schemes for urban heavy rail (subway) systems. The approach is enabled by the


availability of AFC data, which reveal the spatiotemporal travel characteristics of the users, as well as accurate AVL data.

4.1 Framework and Problem Formulation

The TDM strategies developed by various systems vary in their design. For example, Singapore, Hong Kong, and Melbourne offer free travel or discounts to trips entering/exiting designated stations in specified time periods. These strategies may succeed in mitigating crowding, but are not necessarily cost-efficient, as discussed earlier. Congestion in urban rail systems is usually unbalanced, with only a few links overcrowded (critical links). Furthermore, users, based on their travel patterns and sociodemographic characteristics, may respond differently to promotion strategies. Halvorsen et al. (2016), in their analysis of the Hong Kong promotion, grouped passengers into six groups based on spatiotemporal trip characteristics; the response to the promotion clearly varied among the groups. Therefore, TDM strategies can be more effective and efficient by targeting passengers who use the congested links during the peak and are sensitive to the promotion. The main design parameters of a TDM strategy include:

– Spatial structure: the entry/exit stations, OD pairs, routes, and links (or a combination of these) that are targeted. Passengers entering or exiting the designated stations, travelling between specified OD pairs, or using the designated links, routes, or transfer stations during the promotion time period receive a discount.
– Temporal structure: the time periods during which the promotion is effective. A discount may be provided during the prepeak and/or postpeak periods.
– Discount structure: the nature and timing of the discount. A flat structure uses the same discount level across the discount time period, while a step-wise one has varying discount levels (e.g., in 15-minute intervals) within the discount time period.

These design parameters, place (e.g., station, OD pair, routes, and links), time and duration (e.g., discounted time period), and pricing (e.g., discount level), determine where, when, and how much discount is offered in order to better target the effective passengers who contribute to the load on congested links during the peak time periods. The general approach consists of two main components: assignment and optimization. The assignment component updates the OD demand in response to the promotion, assigns the OD demand to the network, and outputs the load on the critical links. The optimization component has as inputs the system's network topology, operational characteristics (schedule), fare table, and the OD demand by time period in response to a specific TDM policy. It uses the various decision variables discussed earlier (promotion structures) to


formulate alternative objective functions. It identifies optimal designs considering the trade-off between performance and cost, and by better targeting users (e.g., users who contribute to the network performance of interest and are sensitive to the TDM strategy). In the case of a discount promotion, the output of the approach is, for example: users exiting station 1 between 7:15 am and 8:15 am get a 25% discount; users exiting station 2 between 7:30 am and 8:30 am get a 30% discount; etc. For the optimal TDM design problem formulation, the following notation is used:

X = {x_{jt̂τ}}: set of binary decision variables (x_{jt̂τ} = 1 if station s_j is eligible for discount level τ in time period t̂; 0 otherwise).
θ_lh: minimum acceptable load reduction (% of base case) on critical link l in time period h.
f_lh: base case (no promotion) passenger load on link l in time period h.
f_lh^X: passenger load on link l in time period h given a promotion design X.
ν_x: cost of promotion decision x (e.g., fare revenue loss).
f_lh^max: maximum acceptable passenger load on critical link l in time period h.
S = {s_1, s_2, …, s_Ns}: set of network stations.
L = {l_1, l_2, …, l_Nl}: set of network links.
T = {t_1, t_2, …, t_Nt}: set of time periods, where time period t = (t, t + Λ], e.g., 7:00–7:15 am.
B: budget constraint.

(L_c and H_c denote the sets of critical links and peak time periods, and Γ the set of candidate discount levels.) Given this notation, the problem of minimizing the total load on the critical links, subject to budget and minimum performance constraints, is formulated as a 0–1 integer program:

$$
\begin{aligned}
\min_{X}\quad & \sum_{l \in L_c} \sum_{h \in H_c} f_{lh}^{X} \\
\text{subject to}\quad & \sum_{x \in X} \nu_x x \le B, \\
& f_{lh}^{X} \le f_{lh}^{\max} \quad \forall l \in L_c,\ \forall h \in H_c, \\
& \sum_{\hat t \in T} \sum_{\tau \in \Gamma} x_{j\hat t \tau} = 1 \quad \forall j \in S, \\
& x \in \{0, 1\} \quad \forall x \in X.
\end{aligned} \tag{8}
$$

The constraints guarantee that the load on all critical links does not exceed the maximum acceptable load, that the cost (lost revenue) does not exceed the available budget, and that only one strategy is selected for each station (for details of the formulation see Ma and Koutsopoulos, 2017). Other objective functions may also be considered, for example, minimization of the cost subject to constraints on the acceptable loads at the critical links (maximum acceptable load).
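For a toy instance, the structure of program (8) can be illustrated by brute-force enumeration: choose exactly one (time period, discount) option per station, including a "no promotion" option, subject to the budget and load constraints. The case study itself uses a commercial solver (Gurobi); this sketch only mirrors the structure of the problem, and the data are hypothetical:

```python
from itertools import product

def optimal_design(options, base_load, max_load, budget):
    """options[j]: list of (load_reduction, cost) tuples for station j,
    including a (0, 0) 'no promotion' entry. Enumerates all assignments
    (one option per station, the assignment constraint in Eq. 8) and
    returns (load, chosen indices, cost) of the best feasible design."""
    best = None
    for choice in product(*(range(len(o)) for o in options)):
        cost = sum(options[j][c][1] for j, c in enumerate(choice))
        load = base_load - sum(options[j][c][0] for j, c in enumerate(choice))
        if cost <= budget and load <= max_load:      # budget & performance
            if best is None or load < best[0]:
                best = (load, choice, cost)
    return best
```

Enumeration is exponential in the number of stations, which is exactly why the real problem is handed to an integer programming solver; the toy version is useful only to see the decision variables and constraints in action.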


4.2 Application: Prepeak Discount Design

We apply the previously described methodology to the design of a morning peak discount strategy based on exit time, using the MTR subway system in Hong Kong as an example. The network consists of 90 stations, and 4 of the links operate close to capacity. The objective is to maximize the load reduction on the critical links under different budget constraints. The behavioral response, measured by the fraction of passengers shifting to the discount periods, is based on Halvorsen (2015). Fig. 19 shows an example of the expected demand shifts assuming a discount level of 25%. Passengers who regularly exit between 8:15–8:30 am or 8:30–8:45 am and switch to an earlier time period typically shift to 8:00–8:15 am, the latest time period in which they can still receive the discount. The problem was solved requiring that the load on the critical links be less than 98.5% of the base load (no promotion). Different levels of budget constraints (lost revenue due to ineffective passengers) were considered. The Gurobi solver was used to solve the resulting integer optimization problem in Python (Gurobi, 2017). Table 2 summarizes the results, showing the expected load reduction as a function of the budget and the design of the promotion (discount structure and timing). Each cell in the table is the optimal solution for the corresponding design structure and budget level. The table therefore provides a portfolio of schemes that can be adopted based on given budget constraints and implementation considerations. For example, with a budget of $18 million/year, the reduction in the load on the critical links ranges from 1.59% to 2.03% compared to the base case of no TDM. By targeting a specific performance level, e.g., 1.8%, different strategies can be implemented, although at different budget levels.
Based on the results, and given the behavioral response assumptions used in the study, as expected, the performance improvement will not exceed 2.10%

FIG. 19 Behavioral response: % users shifting from the peak to the promotion time period, by exit time before the promotion (8:15–9:15 am) and exit time after the promotion (7:15–8:15 am).

TABLE 2 Comparison of Promotion Design Effectiveness (Load Reduction) of Different Strategies

              Budget Level (USD Million/Year)
Strategies      8       10      12      14      16      18      20      22
Step_VT_VD    1.71%   1.83%   1.92%   2.00%   2.01%   2.03%   2.05%   2.06%
Flat_VT_VD     NA      NA     1.62%   1.71%   1.80%   1.82%   1.83%   1.83%
Step_FT_VD    1.62%   1.74%   1.81%   1.85%   1.86%   1.87%   1.87%   1.87%
Flat_FT_VD     NA      NA     1.43%   1.48%   1.60%   1.61%   1.61%   1.61%
Step_FT_FD    1.58%   1.67%   1.78%   1.79%   1.80%   1.80%   1.80%   1.80%
Flat_FT_FD     NA      NA      NA     1.49%   1.56%   1.59%   1.59%   1.59%

Note: VT (VD): discount time (level) varies by station; FT (FD): same discount time (level) at all stations; NA: infeasible given the budget level and constraints.


regardless of the budget invested. However, transit agencies can accomplish the maximum possible load reduction more efficiently if they design their TDM scheme carefully. The results also show that discount-based promotions are not enough to reduce crowding during the peak periods. They should be considered in combination with other strategies, such as reward programs, in order to improve the overall effectiveness of transit TDM.
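The strategy-selection problem solved above can be sketched in miniature. The snippet below is illustrative only: the station names, candidate strategies, costs (lost revenue), and load reductions are all invented, and brute-force enumeration stands in for the Gurobi integer-programming solver used in the study.

```python
from itertools import product

# Hypothetical candidates per station: (strategy, cost in $M/year,
# % load reduction on the critical links). "none" = no promotion.
strategies = {
    "Station A": [("none", 0.0, 0.0), ("10% discount", 4.0, 0.6), ("25% discount", 7.0, 1.1)],
    "Station B": [("none", 0.0, 0.0), ("10% discount", 3.0, 0.4), ("25% discount", 6.0, 1.0)],
}

def best_design(strategies, budget):
    """Select exactly one strategy per station so that the total cost stays
    within the budget and the total load reduction is maximized."""
    best = None
    for combo in product(*strategies.values()):
        cost = sum(c for _, c, _ in combo)
        reduction = sum(r for _, _, r in combo)
        if cost <= budget and (best is None or reduction > best[0]):
            best = (reduction, cost, dict(zip(strategies, (s for s, _, _ in combo))))
    return best

reduction, cost, design = best_design(strategies, budget=10.0)
print(design)  # one selected strategy per station
```

In the study itself the problem has many more stations and explicit link-load constraints, so it is formulated as an integer program and handed to a solver; enumeration is only viable for toy instances like this one.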

5 CONCLUSION

Automated data collection systems have the potential to better inform the critical functions of transit agencies. The fusion of AFC and AVL data enables measurement of system performance that best captures actual passenger experience, reveals the way individual customers use the system, supports predictive customer information and operations control, and informs planning functions, such as effective and cost-efficient design of demand management strategies.

An important advantage of using AFC/AVL data in developing metrics for performance measurement is that such metrics capture performance from the passenger's point of view. The RBT has been used before as a service reliability measure, but it suffers from a number of drawbacks, since it does not separate the impact of variability in operations from inter-passenger variability. In contrast, the IRBT uses individual AFC transaction data. It is based on the reliability buffer time calculated separately for frequent passengers, based on their individual records of travel times. It controls for the impact of personal variability and is effective in monitoring service reliability and measuring the impact of various operating factors (e.g., incidents).

With many systems experiencing increased demand, near-capacity operations result in crowding at stations and on trains. An important crowding-related metric, from the passenger's point of view, is the probability of a passenger being denied boarding at busy subway stations. The chapter discusses methods that can be used to estimate this probability from AFC/AVL data. The estimated probabilities compare favorably with manual surveys and provide crowding information at a very detailed level.

While AFC and AVL data have traditionally been used to measure past system performance, there is potential to develop predictive decision support systems for proactive real-time control of operations and information dissemination.
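As a concrete illustration of the individual buffer-time idea mentioned above, the sketch below assumes the usual 95th-percentile-minus-median definition of the reliability buffer time and aggregates the per-passenger values with a median; the travel-time records, percentile convention, and aggregation step are illustrative assumptions, not the exact procedure of the chapter.

```python
import math

def percentile(values, p):
    """Percentile with linear interpolation (0 <= p <= 100)."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return xs[f]
    return xs[f] * (c - k) + xs[c] * (k - f)

def individual_rbt(travel_times, upper=95):
    """Buffer time for one frequent passenger on a fixed OD pair and time
    period: extra time beyond the median trip needed to be on time
    `upper`% of the time."""
    return percentile(travel_times, upper) - percentile(travel_times, 50)

def irbt(per_passenger_times, upper=95):
    """Individual-based RBT: aggregate (here: median) of per-passenger
    buffer times, controlling for inter-passenger variability."""
    return percentile([individual_rbt(t, upper) for t in per_passenger_times], 50)
```

For example, a passenger whose trips take 10 minutes except for one 30-minute trip in twenty gets a buffer of about 1 minute, regardless of how slow or fast other passengers are.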
We presented an on-line, self-correcting simulation-based decision support platform. The platform consists of a demand prediction module (based on historical and real-time AFC data) and a simulation performance module that uses AVL data to update its state representation of the transit network. The simulation models the interaction between supply and demand and captures the impact of information, predicting the near-future state of the transit network. A case study illustrates the use of the decision support system for generating information about expected crowding levels on upcoming trains, taking into account passenger response to information as well.


On the planning side, the problem of designing transit TDM strategies to deal with crowding benefits from the availability of detailed AFC data. Such strategies, especially when they are based on incentives, typically suffer from inefficiencies introduced by the fact that many users may be rewarded without actually changing their behavior. We presented a framework that can be used to optimally design TDM strategies, incorporating a wide range of TDM structures as well as diverse responses from various user groups (which can be identified based on their spatiotemporal characteristics as revealed by the AFC data). The case study demonstrates the applicability of the proposed method. The results also show that discount-based promotions are not enough to reduce crowding during the peak periods. They should be complemented by other strategies, such as reward programs, in order to improve the overall effectiveness of transit TDM.

ACKNOWLEDGMENTS The authors would like to thank the various transit agencies for their support and data sharing. We would also like to thank Anne Halvorsen and Daniel Wood for their work on the individual RBT metric and colleagues in the Transit Lab for many helpful discussions.

REFERENCES

Abkowitz, M., Slavin, H., Waksman, R., Englisher, L., Wilson, N.H.M., 1983. Transit service reliability. Tech. report, U.S. Dept. of Transportation.
Agard, B., Morency, C., Trepanier, M., 2006. Mining public transport user behaviour from smart card data. IFAC Proc. Vol. 39 (3), 399–404.
Bagchi, M., White, P.R., 2005. The potential of public transport smart card data. Transp. Policy 12 (5), 464–474.
Ben-Akiva, M., Koutsopoulos, H.N., Antoniou, C., Balakrishna, R., 2010. Traffic simulation with DynaMIT. In: Barcelo, J. (Ed.), Fundamentals of Traffic Simulation. Springer-Verlag, New York, pp. 363–398 (Chapter 10).
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Calabrese, F., Diao, M., Di Lorenzo, G., Ferreira, J., Ratti, C., 2013. Understanding individual mobility patterns from urban sensing data: a mobile phone trace example. Transp. Res. Part C: Emerg. Technol. 26, 301–313.
Chan, J., 2007. Rail Transit OD Matrix Estimation and Journey Time Reliability Metrics Using Automated Fare Data. Master of Science in Transportation thesis, Massachusetts Institute of Technology.
Ehrlich, J.E., 2010. Applications of Automatic Vehicle Location Systems Towards Improving Service Reliability and Operations Planning in London. Master of Science in Transportation thesis, Massachusetts Institute of Technology.
Furth, P., Muller, T., 2006. Service reliability and hidden waiting time: insights from AVL data. Transp. Res. Rec. 1955, 79–87.
Gordon, J., Koutsopoulos, H.N., Wilson, N.H.M., Attanucci, J., 2013. Automated inference of linked transit journeys in London using fare-transaction and vehicle location data. Transp. Res. Rec. J. Transp. Res. Board 2343, 17–24.


Goulet-Langlois, G., Koutsopoulos, H.N., Zhao, J., 2016. Inferring patterns in the multi-week activity sequences of public transport users. Transp. Res. Part C: Emerg. Technol. 64 (Suppl. C), 1–16.
Halvorsen, A., 2015. Improving Transit Demand Management With Smart Card Data: General Framework and Applications. Master of Science in Transportation thesis, Massachusetts Institute of Technology.
Halvorsen, A., Koutsopoulos, H.N., Lau, S., Au, T., Zhao, J., 2016. Reducing subway crowding: analysis of an off-peak discount experiment in Hong Kong. Transp. Res. Rec. J. Transp. Res. Board 2544, 38–46.
Haywood, L., Koning, M., 2015. The distribution of crowding costs in public transport: new evidence from Paris. Transp. Res. Part A: Policy Pract. 77, 182–201.
Henderson, G., Adkins, H., Kwong, P., 1990. Toward a passenger-oriented model of subway performance. Transp. Res. Rec. 1266, 221–228.
Jones, E., Oliphant, T., Peterson, P., 2001. SciPy: open source scientific tools for Python. http://www.scipy.org/ (accessed 18 February 2018).
Koutsopoulos, H.N., Noursalehi, P., Zhu, Y., Wilson, N.H.M., 2017. Automated data in transit: recent developments and applications. In: Proceedings of the 5th IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS 2017), pp. 604–609.
Li, Z., Hensher, D.A., 2011. Crowding and public transport: a review of willingness to pay evidence and its relevance in project appraisal. Transp. Policy 18 (6), 880–887.
Ma, Z., Koutsopoulos, H.N., 2017. Optimal design of transit demand management strategies. In: IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan.
Ma, Z., Ferreira, L., Mesbah, M., 2013. A framework for the development of bus service reliability measures. In: Proceedings of the Australian Transport Research Forum.
Ma, Z., Ferreira, L., Mesbah, M., 2014. Measuring service reliability using automatic vehicle location data. Math. Probl. Eng. 2014, 1–12.
Munizaga, M.A., Palma, C., 2012. Estimation of a disaggregate multimodal public transport origin–destination matrix from passive smartcard data from Santiago, Chile. Transp. Res. Part C: Emerg. Technol. 24, 9–18.
Noursalehi, P., 2017. Decision Support Platform for Urban Rail Systems: Real-Time Crowding Prediction and Information Generation. Ph.D. dissertation, Department of Civil and Environmental Engineering, Northeastern University.
Noursalehi, P., Koutsopoulos, H.N., Zhao, J. Real-time transit demand prediction capturing station interactions and impact of special events. Transp. Res. Part C: Emerg. Technol. (under revision).
Gurobi Optimization, Inc., 2017. Gurobi Optimizer Reference Manual. http://www.gurobi.com.
Pelletier, M.-P., Trepanier, M., Morency, C., 2011. Smart card data use in public transit: a literature review. Transp. Res. Part C: Emerg. Technol. 19 (4), 557–568.
Sánchez-Martínez, G.E., 2017. Inference of public transportation trip destinations by using fare transaction and vehicle location data. Transp. Res. Rec. J. Transp. Res. Board 2652, 1–7.
Uniman, D., 2009. Service Reliability Measurement Framework Using Smart Card Data: Application to the London Underground. Master of Science in Transportation thesis, Massachusetts Institute of Technology.
Wilson, N.H.M., Zhao, J., Rahbee, A., 2008. The potential impact of automated data collection systems on urban public transport planning. In: Schedule-Based Modeling of Transportation Networks: Theory and Applications, Operations Research/Computer Science Interfaces Series. Springer, Boston, MA, pp. 75–97.


Wood, D., 2015. A Framework for Measuring Passenger-Experienced Transit Reliability Using Automated Data. Master of Science in Transportation thesis, Massachusetts Institute of Technology.
Wood, D., Halvorsen, A., Koutsopoulos, H.N., Wilson, N.H.M., 2018. Measuring passengers' reliability experience from AFC data. In: INSTR 2018 (extended abstract), 7th International Conference on Transport Network Reliability.
Zhao, J., Rahbee, A., Wilson, N.H.M., 2007. Estimating a rail passenger trip origin-destination matrix using automatic data collection systems. Comput. Aided Civ. Infrastruct. Eng. 22, 376–387.
Zhao, Z., Koutsopoulos, H.N., Zhao, J., 2018b. Individual mobility prediction using transit smart card data. Transp. Res. Part C: Emerg. Technol. 89, 19–34.
Zhu, Y., Koutsopoulos, H.N., Wilson, N.H.M., 2017a. A probabilistic passenger-to-train assignment model based on automated data. Transp. Res. Part B: Methodol. 104, 522–542.
Zhu, Y., Koutsopoulos, H.N., Wilson, N.H.M., 2017b. Inferring left behind passengers in congested metro systems from automated data. Transp. Res. Part C: Emerg. Technol. 94, 323–337.

FURTHER READING

Goulet-Langlois, G., Koutsopoulos, H.N., Zhao, Z., Zhao, J., 2017. Measuring regularity of individual travel patterns. IEEE Trans. Intell. Transp. Syst. 99, 1–10.
Zhao, Z., Koutsopoulos, H.N., Zhao, J., 2018a. Detecting pattern changes in individual travel behavior: a Bayesian approach. Transp. Res. Part B: Methodol. 112, 73–88.

Chapter 11

Data-Driven Traffic Simulation Models: Mobility Patterns Using Machine Learning Techniques

Vasileia Papathanasopoulou*, Constantinos Antoniou† and Haris N. Koutsopoulos‡

*National Technical University of Athens, Athens, Greece
†Department of Civil, Geo and Environmental Engineering, Technical University of Munich, Munich, Germany
‡Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, United States

Chapter Outline

1 New Modeling Challenges and Data Opportunities
  1.1 New Modeling Requirements
  1.2 New Data Sources
  1.3 Future Challenges
2 Background
3 Data-Driven Traffic Performance Modeling: Overall Framework
  3.1 Modeling Approach
  3.2 Model Components
4 Application to Mesoscopic Modeling
  4.1 Data and Experimental Design
  4.2 Case Study Setup
  4.3 Application and Results
5 Application to Microscopic Traffic Modeling
  5.1 Data and Experimental Design
  5.2 Case Study Setup
  5.3 Application and Results
6 Application to Weak Lane Discipline Modeling
  6.1 Data and Experimental Design
  6.2 Case Study Setup
  6.3 Application and Results
7 Network-Wide Application
  7.1 Implementation Aspects
  7.2 Case Study Setup
  7.3 Results
8 Conclusions
Acknowledgments
References

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00011-7
© 2019 Elsevier Inc. All rights reserved.

1 NEW MODELING CHALLENGES AND DATA OPPORTUNITIES

Transportation is experiencing a period of great development potential and change, including new modes and new data sources. In an era of big data and autonomous vehicles, traffic simulation models need to adapt to these new challenges. The objective of this research is to provide an alternative modeling approach for traffic simulation models, one that can take advantage of a wide range of available data and is therefore suitable for implementation in the context of intelligent transportation systems.

1.1 New Modeling Requirements

The emergence of new transportation modes (services and technologies), along with new challenges in modeling traffic systems, has created a need for more robust and advanced traffic simulation models. In particular, traffic simulation models need to capture the operations of, and interactions among, new traffic systems, which affect different aspects of traffic demand and driving behavior. Autonomous vehicles, for example, are under active development and are expected to be gradually introduced to the market. Autonomous driving therefore constitutes one of the future modes that should be modeled, as does the interaction between autonomous and conventional vehicles. In addition, new modes such as car-sharing (Barth et al., 2004) may not be easily modeled using current models. These modes share attributes of both private and public transport and are expected to become more popular, as they offer a plausible alternative to severe parking problems in metropolitan areas (Xu and Lim, 2007). Furthermore, there has been increasing interest in modeling driving behavior in developing countries, where conditions such as non-lane discipline and heterogeneity in vehicle types prevail; traffic flow there is very complex in nature, and safety issues arise. The focus is also moving from the local level to the network level: toward sustainable mobility, vehicle-to-vehicle and vehicle-to-infrastructure architectures should be employed in real-time applications to develop effective traffic solutions based on real-time data. Data-driven models may offer a reliable alternative for the simulation of all these new modes, which requires the evolution of traffic simulation models, including the fusion of various data sources.

1.2 New Data Sources

The rapid development of technology has contributed to the availability of high-quality traffic data, leading the way for the development of more advanced simulation models. As a result of the explosive increase in the data being generated and collected, data-driven modeling is emerging as a fast-developing field of transportation research. Zhang et al. (2011) have expressed the need for a shift from a conventional technology-driven system to a more powerful, multifunctional, data-driven intelligent transportation system.

On the demand side, social media networks provide a huge volume of data, including temporal, spatial, and textual information, that could be exploited in the transportation field (Chaniotakis et al., 2016). In the era of big data, it is important to be able to handle the available information to increase the accuracy and reliability of traffic models.

On the supply side, technological advances have significantly improved traffic data collection capabilities, and increasing volumes of potentially useful data are readily available from low-cost opportunistic sensors. Other sources of data (such as cameras, GPS, cell phone tracking, and probe vehicles) are increasingly used as supplementary measurement systems (El Faouzi et al., 2011). Methods such as differential GPS allow the collection of high-fidelity traffic data (Ranjitkar et al., 2005) and consequently may improve the accuracy of traffic simulation models. Ubiquitous sensors (e.g., accelerometers and gyroscopes) in regular smartphones can provide a much richer sample of heterogeneous data, which could facilitate both the development of a new generation of models and their calibration, for example, utilizing distributions rather than point values (Antoniou et al., 2014). For a review of novel data collection techniques and their applications to traffic management, see Antoniou et al. (2011). Drones could also be a future option for data collection; drones equipped with video cameras have been used for the acquisition of accurate vehicle tracking profiles (Guido et al., 2016).

1.3 Future Challenges

Kaisler et al. (2013) define "Big Data" as the amount of data just beyond technology's capability to store, manage, and process efficiently. Advances in information technology are likely to offer new opportunities for transportation and generate changes in speed and efficiency. New data sources can help optimize transportation networks and improve the balance between demand and supply. Providing accurate traffic information is becoming a major challenge for road traffic management and the deployment of intelligent transportation systems. The focus should be placed on travel time and cost minimization, as well as on environmental challenges. New transportation modes need to be integrated in cities, and a new balance between new and old systems needs to be found, especially as the penetration level of autonomous vehicles keeps changing. In a future world with cooperative vehicle-to-vehicle and vehicle-to-infrastructure communication, all traffic modes and conditions need to be modeled. It is important to be able to offer solutions online and to provide information and guidance back to drivers.

2 BACKGROUND

Limitations of conventional models have motivated the exploration of alternative approaches to model estimation that incorporate flexible data-driven components. Such methods have been used in several transport-related applications, and various machine learning techniques have been applied in transportation research in recent years.


More than 10 years ago, Antoniou and Koutsopoulos (2006b) developed a framework for speed estimation using machine learning concepts, including clustering algorithms and locally weighted regression (loess). Antoniou and Koutsopoulos (2006a) compared a number of machine learning techniques for speed estimation, including loess, support vector regression, and neural networks. Other data-driven methods, including neural networks (Huval et al., 2015), Gaussian processes (GP) (Chen et al., 2014), and kernel methods offering similar capabilities, have also been used in applications (Karlaftis and Vlahogianni, 2011). Antoniou et al. (2013) developed a framework for dynamic traffic state estimation and prediction using machine learning methods, and Papathanasopoulou and Antoniou (2015) developed a methodology for the estimation of data-driven models; the remainder of this chapter relies heavily on these last two references. Furthermore, Kleyko et al. (2015) compared three machine learning techniques, specifically logistic regression, neural networks, and support vector machines (SVM), for a vehicle classification problem and indicated that logistic regression provided the best results. Jenelius and Koutsopoulos (2013) presented a statistical model for urban road network travel time estimation using low-frequency probe vehicle data, and Jenelius and Koutsopoulos (2018) used probabilistic component methods for traffic state prediction. Lv et al. (2015) and Huang et al. (2014) used deep learning for traffic flow prediction. Clustering and classification are popular techniques with many applications.
El Faouzi (2004) presents a data-driven approach that aggregates multiple estimators, attempting to retain all the information that each estimation model embodies (some of which might be lost if only the "best" model were chosen and applied), while El Faouzi and Lefevre (2006) use two different approaches from evidence theory (classifier fusion and distance-based classification) for clustering and classification in road travel time estimation. Azimi and Zhang (2010) apply three different unsupervised learning methods (K-means, fuzzy C-means, and CLARA) to classify freeway traffic flow conditions based on the characteristics of the flow.

Focusing on microscopic models, data-driven approaches have already been used in developing a fully adaptive cruise control system (Simonelli et al., 2009; Bifulco et al., 2013) and in modeling car-following behavior via artificial neural networks (Colombaroni and Fusco, 2014; Chong et al., 2013; Zheng et al., 2013). Simonelli et al. (2009) applied neural networks to develop a real-time learning model that captures car-following behavior, taking into consideration individual drivers' characteristics. Bifulco et al. (2013) extended the work of Simonelli et al. (2009) into a framework for reproducing spacing in adaptive cruise control applications. Furthermore, Panwai and Dia (2007) developed a car-following model based on neural networks and fuzzy neural networks; they tried different types of neural networks and validated their model using field data from two vehicles equipped with radar detectors. The results were promising, as their models outperformed Gipps' model. Zheng et al. (2013) also proposed a neural-network-based model; the difference is that they used a two-level neural network structure, in which the first level estimates the dynamic reaction delay and the second predicts the acceleration of the following vehicle. While most data-driven studies adopt a neural network approach, several methods either have not been adequately explored or have not been compared with other methods on the same data, which would provide a better understanding of how the choice of algorithm influences the results.

Kumar et al. (2013) proposed a learning-based approach, using SVM and Bayesian filtering, for online lane-change intention prediction; their model predicts a driver's intention to change lanes about 1.3 seconds in advance. Ding et al. (2013) explored the ability of a back-propagation (BP) neural network to learn the uncertainties and perceptions in human behavior from real driving data in order to predict a lane-changing trajectory. Hou et al. (2014) developed a lane-changing assistance system that advises drivers on safe gaps and on whether it is safe to execute a mandatory lane change; the model is validated on NGSIM data and predicts whether a driver will merge as a function of certain input variables, using a Bayes classifier and decision trees. Bi et al. (2016) developed a data-driven model to simulate the lane-changing process in traffic simulation using random forest and BP neural network algorithms; however, they do not take driver heterogeneity into account. Wang et al. (2017) modeled various merging behaviors at expressway on-ramp bottlenecks using SVM models, considering four merging behaviors with different degrees of merging risk. In comparison with other models, including a discrete choice model, a Bayesian network, and a classification and regression tree, the SVM achieved the best prediction results.

3 DATA-DRIVEN TRAFFIC PERFORMANCE MODELING: OVERALL FRAMEWORK

The overall framework for data-driven model development is presented first, followed by two conceptual case studies, one for mesoscopic modeling and one for microscopic modeling. The former concerns a data-driven approach for local traffic state estimation and prediction, while the latter provides a paradigm for the estimation of car-following models and speed prediction.

3.1 Modeling Approach

The overall process for data-driven model development is outlined in Fig. 1. The approach includes two parts: training and application. First, the required explanatory variables of the model are determined and the appropriate surveillance data are collected. In the training step, traffic models are estimated from the available surveillance data, while in the application step these traffic models are applied to provide predictions using new observations.


FIG. 1 Process diagram for data-driven model development.

The training process is initialized with the identification of clusters, based on underlying patterns in the available data, corresponding to traffic states with similar characteristics. A flexible regression technique is applied to each cluster separately, and representative models are formed for each group of the data (calibration). The fitted models are stored in a knowledge database. In the application step, when new measurements become available, the new data are classified into the appropriate class based on their characteristics. The model that has been estimated for that class is then retrieved from the knowledge base and applied to the new data for the estimation of the response variable. The predicted values are evaluated, and the next iteration improves the model.
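The training and application steps just described can be sketched end to end. The example below is a toy: it clusters on a single explanatory variable (say, density) with a plain k-means, fits an ordinary least-squares line per cluster as the "flexible" model, and classifies a new observation to the nearest centroid before predicting; the data and variable choices are invented.

```python
import random
from statistics import mean

def kmeans_1d(xs, k, iters=50, seed=0):
    """Plain k-means on scalars; returns centroids and cluster labels."""
    rng = random.Random(seed)
    centroids = rng.sample(xs, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in xs]
        centroids = [mean([x for x, l in zip(xs, labels) if l == j] or [centroids[j]])
                     for j in range(k)]
    return centroids, labels

def fit_line(points):
    """Ordinary least squares y = a + b*x for one cluster."""
    xs, ys = zip(*points)
    xbar, ybar = mean(xs), mean(ys)
    b = sum((x - xbar) * (y - ybar) for x, y in points) / sum((x - xbar) ** 2 for x in xs)
    return ybar - b * xbar, b

def train(data, k=2):
    """Training: cluster the observations, then fit one model per cluster
    (the 'knowledge base')."""
    centroids, labels = kmeans_1d([x for x, _ in data], k)
    models = [fit_line([p for p, l in zip(data, labels) if l == j]) for j in range(k)]
    return centroids, models

def predict(centroids, models, x):
    """Application: classify the new observation, retrieve the model
    estimated for that class, and predict the response."""
    a, b = models[min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))]
    return a + b * x

# Invented (density, speed) observations in two traffic regimes
data = ([(d, 60 - 0.5 * d) for d in (5, 10, 15, 20)]   # free-flow
        + [(d, 95 - d) for d in (60, 70, 80, 90)])     # congested
centroids, models = train(data, k=2)
print(predict(centroids, models, 12), predict(centroids, models, 75))
```

Each regime ends up with its own speed-density relationship, which is exactly the role of the per-cluster models in the framework.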

3.2 Model Components

In this section, we present some of the alternative methods for fitting, clustering, and classification that can be applied in this context. Fig. 2 summarizes the interconnection of the various components.

FIG. 2 Main methodological components for data-driven modeling.

The available observations have different characteristics that can be used to cluster them into groups with similar characteristics. Clustering is a well-researched area with a large number of available approaches and algorithms, often based on heuristics. Clustering involves several decisions, such as the choice of the number of clusters that groups the observations into meaningful clusters. Conflicting objectives characterize this task: on the one hand, a larger number of clusters may provide a more precise clustering, while a smaller number of clusters provides a more manageable (and possibly easier to interpret) clustering (Antoniou et al., 2013).

The classified observations result in a time series of clusters. Studying the evolution of this time series provides the ability to predict the future state, based on the last few states, through the estimation of an appropriate state-predictive process (e.g., Markov chains). Appropriate flexible regression models are employed based on the observations belonging to the corresponding cluster. When new observations become available, they can be classified into one of the available clusters based on their specific attributes. Once the cluster of a future observation has been predicted, the appropriate function for that cluster can be selected and used to make a speed prediction.
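The state-predictive step can be illustrated with a first-order Markov chain estimated from a made-up sequence of cluster labels (a chain conditioned on the last few states would use tuples of recent labels instead of a single label):

```python
from collections import Counter, defaultdict

def transition_counts(states):
    """Count first-order transitions in a sequence of cluster labels."""
    counts = defaultdict(Counter)
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return counts

def transition_matrix(states):
    """Maximum-likelihood estimate of P(next state | current state)."""
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in transition_counts(states).items()}

def predict_next(states, current):
    """Most likely next cluster given the current cluster."""
    return transition_counts(states)[current].most_common(1)[0][0]

# Invented label sequence: "F" = free-flow cluster, "C" = congested cluster
history = list("FFCFCCCC")
print(transition_matrix(history))
print(predict_next(history, "F"))
```

The predicted cluster then determines which per-cluster regression model is applied to produce, e.g., the speed prediction.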

3.2.1 Clustering and Classification

Clustering and classification are two important branches of machine learning. Clustering is carried out in an unsupervised way, by trying to find subsets of the data with similar characteristics without having a predefined notion of the clusters.


On the other hand, classification involves the supervised assignment of data observations to predefined and known classes (Bagirov et al., 2003).

Clustering

A simple form of clustering is the k-means algorithm (MacQueen, 1967; Hartigan and Wong, 1979), which, as its name suggests, minimizes the distance between each point and the center of its cluster for k given clusters. This is achieved by assigning each point to the nearest mean and then reestimating (moving) the mean to the center of its cluster. It can be regarded as maximum-likelihood clustering. The objective function to be minimized is

$$\min_{\mu_1, \ldots, \mu_k} \sum_{h=1}^{k} \sum_{x \in X_h} \lVert x - \mu_h \rVert^2 \qquad (1)$$

where μh is the mean of cluster h. A hypothesis h1 = ⟨μ1, …, μk⟩ consisting of the means of the k different normal distributions is sought, and a random hypothesis is assumed to initialize the procedure. Each instance can be written as ⟨xi, zi1, zi2, …, zik⟩, where xi is the observed variable and zij equals 1 if xi was generated by the jth normal distribution and 0 otherwise. A maximum-likelihood hypothesis is sought through iterative reestimation of the expected values of zij; a new maximum-likelihood hypothesis h2 is then calculated using the expected values from the previous step. The new hypothesis replaces the earlier one, and the iterations continue until the algorithm converges.

Fraley and Raftery (2002, 2003) proposed model-based clustering, which combines hierarchical clustering, the expectation-maximization (EM) algorithm for mixture models, and the Bayesian Information Criterion (BIC) for the selection of the model and the number of classes (Schwarz, 1978). Hierarchical clustering, used for model-based hierarchical agglomeration, is initialized by default with each observation of the data in a cluster by itself and finishes when all observations have been merged into a single cluster. A classification maximum-likelihood approach is used to determine which two groups are merged at each stage (Banfield and Raftery, 1993; McLachlan and Krishnan, 1997; Fraley, 1998). The EM algorithm, included in the R mclust package, is applied for maximum-likelihood clustering with parameterized Gaussian mixture models (Dempster et al., 1977; McLachlan and Krishnan, 1997). It is implemented in two steps: the E-step, which calculates a matrix zik corresponding to the likelihood that observation i belongs to cluster k given the current parameter estimates, and the M-step, which calculates maximum-likelihood parameter estimates given z.
Each cluster is represented by a Gaussian model φk(x | μk, Σk), where x denotes the data and k is an integer indexing a cluster centered at mean μk with covariance Σk. The maximum-likelihood values for the Gaussian mixture model are then given by Eq. (2) (Fraley and Raftery, 2002), where τk are the mixing proportions.

Data-Driven Traffic Simulation Models, Chapter 11

L_mix(μ_1, …, μ_K; Σ_1, …, Σ_K; τ_1, …, τ_K | x) = ∏_{i=1}^{n} ∑_{k=1}^{K} τ_k ϕ_k(x_i | μ_k, Σ_k)    (2)
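The E-step/M-step iteration described above can be sketched for the simplest case: a one-dimensional mixture of k Gaussians with known, equal variances, estimating only the means, i.e., the hypothesis h = ⟨μ1, …, μk⟩. The chapter's implementations use the R Mclust package; the following is an illustrative Python sketch with hypothetical data, not the Mclust algorithm:

```python
import math

def em_means(xs, k=2, sigma=1.0, iters=50):
    """Toy EM for a 1-D mixture of k Gaussians with known, equal variance:
    estimates only the means (requires k >= 2)."""
    # deterministic initialization spread over the data range
    # (the text describes a random initial hypothesis instead)
    mus = [min(xs) + j * (max(xs) - min(xs)) / (k - 1) for j in range(k)]
    n = len(xs)
    for _ in range(iters):
        # E-step: z[i][j] = expected membership of x_i in component j
        z = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mus]
            z.append([wj / sum(w) for wj in w])
        # M-step: maximum-likelihood means given the memberships
        mus = [sum(z[i][j] * xs[i] for i in range(n)) /
               sum(z[i][j] for i in range(n)) for j in range(k)]
    return sorted(mus)

data = [0.1, -0.2, 0.05, 4.9, 5.2, 5.1]   # two well-separated groups
print(em_means(data, k=2))                # means near 0 and near 5
```

With well-separated groups, the soft memberships become nearly hard and the procedure converges to the per-group sample means.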

Banfield and Raftery (1993) suggested a clustering strategy based on a maximization algorithm and Bayes factors. This strategy was upgraded by Fraley (1998) and later by Fraley and Raftery (2002, 2003) and can be carried out with the following steps:

- A maximum number of clusters and a subset of covariance structures are considered.
- A hierarchical agglomeration that maximizes the classification likelihood for each model is performed, and the corresponding classifications are obtained for up to M groups.
- The EM algorithm is applied for each model and each number of clusters 2, …, M. The procedure is initialized from the classification result of the hierarchical agglomeration.
- The BIC is calculated for the one-cluster case for each model and for the mixture model with the optimal parameters from EM for 2, …, M clusters. Each combination corresponds to a unique probability model.
- The model with the highest BIC is selected and the best classification is recovered.

Although the optimal number of classes is determined in this way, a lower number of classes could be chosen, aiming at the development of more parsimonious models.
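The BIC comparison in the final steps above can be illustrated with a minimal sketch (Python for illustration; Mclust performs this over fully parameterized Gaussian mixtures). Here BIC = 2 log L − m log n, so larger values are better; the two-component parameters come from a crude hard split of hypothetical data, purely to show the comparison:

```python
import math

def gauss_logpdf(x, mu, var):
    # log density of a univariate Gaussian
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def bic(loglik, n_params, n_obs):
    # BIC in the Mclust convention: 2*logL - m*log(n); larger is better
    return 2 * loglik - n_params * math.log(n_obs)

data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
n = len(data)

# Model 1: a single Gaussian (2 parameters: mean, variance)
mu = sum(data) / n
var = sum((x - mu) ** 2 for x in data) / n
ll1 = sum(gauss_logpdf(x, mu, var) for x in data)

# Model 2: two components from a crude hard split at 2.5
# (5 parameters: two means, two variances, one mixing proportion)
lo = [x for x in data if x < 2.5]
hi = [x for x in data if x >= 2.5]
params = [(sum(g) / len(g),
           sum((x - sum(g) / len(g)) ** 2 for x in g) / len(g))
          for g in (lo, hi)]
ll2 = sum(math.log(0.5 * math.exp(gauss_logpdf(x, *params[0])) +
                   0.5 * math.exp(gauss_logpdf(x, *params[1])))
          for x in data)

print(bic(ll2, 5, n) > bic(ll1, 2, n))  # True: BIC favors two clusters
```

For clearly bimodal data, the gain in log-likelihood of the two-component model outweighs the penalty for its three extra parameters.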

Classification

One of the most common methods of classification is k-nearest neighbors (Mitchell et al., 1997). According to this method, all observations correspond to points in n-dimensional space. New data points are assigned to the class of the nearest neighbors among the already grouped data. In essence, nearest-neighbor classification amounts to the calculation of the correlation map:

f(z) = arg min_{y ∈ M} d(z, y)    (3)

In a pattern space P, M ⊂ P is the set of already classified points, z ∈ P, and d(z, y) is a metric on P. The evaluation of Eq. (3) can easily be achieved on a computer in three steps: computation of an array with the distances from z to each y ∈ M, finding the minimum distance after comparisons, and exporting the final result y* ∈ M (Muezzinoglu and Zurada, 2005). The nearest neighbors can be defined according to the Euclidean distance (Roughan et al., 2004): if a point x is described as ⟨a1(x), a2(x), …, an(x)⟩, where ar(x) corresponds to the value of the rth attribute of x (attributes could include density, traffic flow, and time), the distance between two points is defined by Eq. (4) (Mitchell et al., 1997). Thus the class of a new observation xi is the same as the class of the point xj that minimizes the distance d(xi, xj):

d(xi, xj) = sqrt( ∑_{r=1}^{n} [a_r(xi) − a_r(xj)]² )    (4)

PART II Applications

Classification can also be performed using neural networks. Neural networks (cf. e.g., Ripley, 1996) have featured in many traffic-related applications (e.g., Vlahogianni et al., 2005).
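The nearest-neighbor rule of Eqs. (3), (4) reduces to a few lines of code. A minimal Python sketch with hypothetical (density, flow) observations follows; as noted above, in practice the attribute values should first be brought to a comparable scale:

```python
import math

def euclid(a, b):
    # Eq. (4): Euclidean distance over the n attributes of a point
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(z, labeled, k=3):
    """k-nearest-neighbor classification: assign z the majority class
    among the k closest already-grouped points (Eq. 3 is the k=1 case)."""
    neighbors = sorted(labeled, key=lambda p: euclid(z, p[0]))[:k]
    votes = {}
    for _, cls in neighbors:
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)

# hypothetical traffic observations: (density, flow) -> regime label
train = [((10, 600), "free"), ((12, 700), "free"),
         ((45, 900), "congested"), ((50, 850), "congested")]
print(knn_classify((11, 650), train, k=3))  # free
```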

3.2.2 Flexible Fitting Models

In this section we provide an overview of some key computational models that can be fitted to data without the explicit specification of a functional form. These models can therefore provide a superior fit over conventional models that are restricted by their theory-based functional forms. Naturally, this comes at a potential cost, as these models are less easy to interpret.

Locally Weighted Regression

Loess can be considered a generalization of the k-nearest neighbor method (Mitchell et al., 1997). It was first introduced by Cleveland (1979), and the following analysis is based on Cleveland and Devlin (1988). Loess, y_i = g(x_i) + ε_i, where i = 1, …, n indexes the observations, g is the regression function, and ε_i are the residual errors, provides an estimate g(x) of the regression surface at any value x in the d-dimensional space of the independent variables. Correlations between observations of the response variable y_i and the vectors x_i of d-tuples of the d predictor variables are identified. Local regression estimates the function g(x) near x = x_0 by its value in a particular parametric class. This estimate is obtained by fitting a regression surface to the data points within a neighborhood of the point x_0, which is bounded by a smoothing parameter, the span. The span determines the percentage of the data considered for each local fit and hence influences the smoothness of the estimated surface (Cohen, 1999). The span ranges from 0 (wavy curve) to 1 (smooth curve). Each local regression uses either a first- or a second-degree polynomial, specified by the value of the "degree" parameter of the method (degree = 1 or degree = 2). The data are weighted according to their distance from the center of the neighborhood x; therefore, a distance and a weight function are required.
As the distance function p, the Euclidean distance can be used for a single independent variable; in the multiple-regression case, every variable should be brought to a common scale before applying a standard distance function (Cleveland et al., 1988). The weight function defines the size of each data point's influence on the fit, reflecting that nearby points have greater influence than more distant ones. It therefore uses the distance between each point and the estimation point and assigns values on a scale from 0 to 1, with the highest values for the nearest observations. A weight function should meet the requirements determined by Cleveland (1979); the most common one is the tri-cube function:

W(u) = (1 − u³)³ if 0 ≤ u ≤ 1, and W(u) = 0 otherwise    (5)

The weight of each observation (y_i, x_i) is defined as follows:

w_i(x) = W[p(x, x_i)/d(x)] = (1 − (|x_i − x| / d(x))³)³    (6)

where d(x) is the distance of the most distant predictor value within the area of influence. In the loess method, weighted least squares are used so that linear or quadratic functions of the independent variables can be fitted at the centers of the neighborhoods (Cleveland, 1979). The objective function to be minimized is:

∑_{i=1}^{n} w_i ε_i²    (7)
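A locally weighted degree-1 fit using the tri-cube weights of Eqs. (5)-(7) can be sketched as follows (a single-predictor Python illustration; the chapter's applications use the loess implementation in R):

```python
def tricube(u):
    # Eq. (5): tri-cube weight function
    return (1 - u ** 3) ** 3 if 0 <= u <= 1 else 0.0

def loess_point(x0, xs, ys, span=0.5):
    """Locally weighted linear (degree = 1) estimate at x0.
    `span` is the fraction of the data used in each local fit;
    assumes distinct x values in the neighborhood."""
    n = len(xs)
    q = max(3, int(span * n))                      # neighborhood size
    idx = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:q]
    d = max(abs(xs[i] - x0) for i in idx) or 1.0   # d(x0) in Eq. (6)
    w = {i: tricube(abs(xs[i] - x0) / d) for i in idx}
    # weighted least squares for y = a + b*x (minimizes Eq. 7)
    sw = sum(w.values())
    sx = sum(w[i] * xs[i] for i in idx)
    sy = sum(w[i] * ys[i] for i in idx)
    sxx = sum(w[i] * xs[i] ** 2 for i in idx)
    sxy = sum(w[i] * xs[i] * ys[i] for i in idx)
    b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    a = (sy - b * sx) / sw
    return a + b * x0

print(loess_point(2.5, [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]))  # 2.5
```

For data lying exactly on a line, the local weighted fit reproduces the line regardless of the span.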

Multivariate Adaptive Regression Splines

MARS was introduced by Friedman (1991). It is a nonparametric method for flexible regression modeling of high-dimensional data that identifies nonlinearities and interactions between variables. In this research, the method is implemented using the package "earth" (Milborrow, 2017) in R (R Core Team, 2018). MARS builds a model of the form:

f(x) = ∑_{i=1}^{k} c_i B_i(x)    (8)

The model is a weighted sum of basis functions B_i(x), where the c_i are coefficients estimated by minimizing the residual sum of squares (Happe et al., 2010). The model-building strategy is similar to stepwise linear regression, except that basis functions are considered instead of the raw observations. An independent variable is translated into a series of linear segments joined together at points called knots (Courtois and Woodside, 2000). Each segment uses a piecewise linear basis function, which is constructed around a knot. MARS selects the knot locations dynamically, in a forward-pass/backward-pass process that decreases the training error. The optimal number of terms in the model is estimated using generalized cross-validation (Happe et al., 2010).
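The hinge (piecewise linear) basis functions underlying Eq. (8) can be illustrated with a small sketch. The coefficients, knot, and variable below are hypothetical, purely illustrative values; in practice "earth" estimates them via the forward/backward passes and generalized cross-validation described above:

```python
def hinge(x, knot, direction=1):
    # piecewise linear basis function max(0, +/-(x - knot)) around a knot
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate a MARS-style model f(x) = c0 + sum_i c_i * B_i(x) (Eq. 8),
    where each B_i is a product of hinge functions over selected variables."""
    y = intercept
    for coef, factors in terms:
        b = 1.0
        for var_idx, knot, direction in factors:
            b *= hinge(x[var_idx], knot, direction)
        y += coef * b
    return y

# hypothetical fitted model: speed = 60 - 0.8 * max(0, density - 30)
model_terms = [(-0.8, [(0, 30.0, 1)])]
print(mars_predict([45.0], 60.0, model_terms))  # 60 - 0.8*15 = 48.0
```

Below the knot (density 30) the hinge is inactive and the prediction stays at the intercept; above it, the fitted slope takes over.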

Kernel Support Vector Machines

SVM are based on the structural risk minimization principle (Cortes and Vapnik, 1995). An SVM model is a representation of the training data as points in space. Training an SVM leads to the following quadratic optimization problem with bound constraints and one linear equality constraint (Cortes and Vapnik, 1995):

W(α) = −∑_{i=1}^{n} α_i + (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} y_i y_j α_i α_j K(x_i, x_j)    (9)

subject to

∑_{i=1}^{n} y_i α_i = 0,  0 ≤ α_i ≤ C    (10)

where n is the number of training examples, α is a vector of n variables, each component α_i corresponds to a training example (x_i, y_i), K(x_i, x_j) is the kernel function used as a similarity measure between objects x_i and x_j, and C is an upper bound on α_i.

Gaussian Processes

GP are based on the idea that adjacent observations convey information about each other (Williams and Rasmussen, 1996). Observations are considered to be normally distributed, and the relationship between them is represented by the covariance matrix of a normal distribution. The kernel matrix is used as the covariance matrix in order to extend Bayesian modeling to nonlinear situations. The following analysis is based on Quiñonero-Candela and Rasmussen (2005). For regression estimation it is assumed that observations f(x) can be written as y(x) = f(x) + ε, where ε is Gaussian noise with zero mean, ε ~ N(0, σ_n²). A Gaussian distribution is fully described by the mean μ and covariance Σ of the distribution in terms of hyperparameters θ. The log marginal likelihood is given by Eq. (11):

L = log p(y | x, θ) = −(1/2) log |Σ| − (1/2) (y − μ)ᵀ Σ⁻¹ (y − μ) − (n/2) log(2π)    (11)
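Eq. (11) can be evaluated directly once the covariance (kernel) matrix has been formed. A self-contained Python sketch using a plain Cholesky factorization follows (in practice the "kernlab" GP implementation used in the chapter handles this; the covariance matrix must be positive definite):

```python
import math

def log_marginal_likelihood(y, mu, cov):
    """Eq. (11): L = -0.5*log|Sigma| - 0.5*(y-mu)^T Sigma^-1 (y-mu)
    - (n/2)*log(2*pi), computed via a Cholesky factorization cov = L L^T."""
    n = len(y)
    # Cholesky factorization (assumes cov is symmetric positive definite)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    # forward substitution: solve L a = (y - mu)
    r = [yi - mi for yi, mi in zip(y, mu)]
    a = []
    for i in range(n):
        a.append((r[i] - sum(L[i][k] * a[k] for k in range(i))) / L[i][i])
    logdet = 2 * sum(math.log(L[i][i]) for i in range(n))   # log|Sigma|
    quad = sum(ai * ai for ai in a)     # (y-mu)^T Sigma^-1 (y-mu)
    return -0.5 * logdet - 0.5 * quad - 0.5 * n * math.log(2 * math.pi)

# tiny check with an identity covariance (two independent unit Gaussians)
print(log_marginal_likelihood([1.0, 2.0], [0.0, 0.0],
                              [[1.0, 0.0], [0.0, 1.0]]))
```

With Σ = L Lᵀ, solving L a = (y − μ) gives aᵀa = (y − μ)ᵀ Σ⁻¹ (y − μ), avoiding an explicit matrix inverse.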

Bayesian Regularized Neural Networks

In the Bayesian framework, model parameters are treated as probabilistic variables. The posterior probability of the weights is given according to Bayes' rule by Eq. (12):

p(w | D) = p(D | w) p(w) / p(D)    (12)

where D is a set of observations, p(D | w) is the probability of the observations given a choice of weights w, p(w) is a prior distribution over the weights, and p(D) is a normalization factor. Bayesian Regularized Neural Networks (BRNN) address one of the difficulties in building a neural network, namely determining the number of hidden neurons. To overcome this difficulty, the BRNN algorithm incorporates Bayes' theorem into the regularization scheme. Foresee and Hagan (1997) provide a detailed description of BRNN.

3.2.3 Forecasting

One of the most general models for a stationary categorical process taking values in a finite categorical space X is a full Markov chain (of possibly high, but finite, order) (Markov, 1971). For example, in a traffic flow theory context, traffic flow might be categorized as one of four states (say A, B, C, and D), defining the categorical space X. A stationary full Markov chain of order p exists whenever the transition mechanism has no specific structure; that is, the state space is the entire X^p. While such general models may be theoretically attractive, they also have practical limitations. For example, a full Markov chain is rather inflexible in terms of the number of parameters that it can represent. For a model with four states (as in the simple example of A through D earlier), chains of order 0 through 4 have dimensions of 3, 12, 48, 192, and 768, respectively. Markov chains can only be fitted at these "intervals," thus reducing the model flexibility (e.g., if 48 parameters are not enough, one needs to estimate 192 parameters; intermediate values are not possible). This points to another problem with the full Markov chain model, the "dimensionality curse," as the dimension of the model increases exponentially with the order p. Markov processes have found applications in a diverse range of fields. For example, Geroliminis and Skabardonis (2005) propose an analytical methodology for the prediction of platoon arrival profiles and queue lengths along signalized arterials using Markov decision processes, while Yeon et al. (2008) develop a model that estimates travel time on a freeway using Discrete Time Markov Chains, where the states correspond to whether or not the link is congested. Variable length Markov chains (VLMC) address both issues introduced previously (inflexibility in terms of the number of parameters, and lack of scalability) and provide a natural and elegant way to avoid (some of) the associated difficulties.
The idea is to allow the memory of the Markov chain to have a variable length, depending on the observed past values (Mächler and Bühlmann, 2004). For example, while a past history of states A-A-B-C-C may be a good indication of the next state being D, for other cases it might be sufficient to "store" the sequence A-C as a precursor to a state C. The first example would indicate that the order p of the Markov chain would be five, resulting in a model of very high dimension. However, a history of A-C is sufficient to make another state prediction, so the detailed transition history is not required and can be deleted (or pruned, as is said in this context). A memory of variable length (five in the first case, but only two in the latter) is thus appropriate. Using this idea, fitting a VLMC from data involves estimation of the structure of the variable-length memory, which can be reformulated as a problem of


estimating a tree, using the so-called context algorithm (Rissanen, 1983), which can be implemented very efficiently. In the fitted tree-structured models, every terminal node (as well as some internal nodes) represents a state in the Markov chain and is equipped with corresponding transition probabilities. The context algorithm grows a large tree and prunes it back. The pruning part requires specification of a tuning parameter, the so-called cutoff. The cutoff K is a threshold value when comparing a tree with its subtree by pruning away one terminal node; the comparison is made with respect to the difference of deviance from the two trees. A large cutoff has a stronger tendency for pruning and yields smaller estimated context trees (i.e., a smaller dimension of the model). In practice, the way the VLMC works is that it computes a huge tree and then prunes it, based on some appropriate parameter value.
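The dimension counts and transition estimation discussed above can be made concrete with a short sketch (Python for illustration; the full context-algorithm tree growing and pruning of a VLMC is beyond this sketch, which covers only the full-chain case):

```python
from collections import Counter, defaultdict

def full_chain_dim(n_states, order):
    # dimension of a full Markov chain of order p: (|X| - 1) * |X|^p,
    # e.g. 3, 12, 48, 192, 768 for four states and orders 0..4
    return (n_states - 1) * n_states ** order

def fit_transitions(seq, order=1):
    """Estimate transition probabilities P(next | last `order` states)
    from a categorical sequence (a VLMC would instead prune these
    fixed-length contexts back to variable length)."""
    counts = defaultdict(Counter)
    for t in range(order, len(seq)):
        counts[tuple(seq[t - order:t])][seq[t]] += 1
    return {ctx: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for ctx, cnt in counts.items()}

print([full_chain_dim(4, p) for p in range(5)])  # [3, 12, 48, 192, 768]
trans = fit_transitions(list("AABBAABBAABB"), order=1)
print(trans[("B",)])  # {'B': 0.6, 'A': 0.4}
```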

4 APPLICATION TO MESOSCOPIC MODELING The objective of this case study is to validate the proposed data-driven methodology for mesoscopic traffic modeling, in particular the identification and short-term prediction of traffic state and local speed, in order to take advantage of the ever-increasing availability of traffic data through emerging sensors.

4.1 Data and Experimental Design

Two freeway datasets, from Irvine, CA, and Tel Aviv, Israel, have been used for this research. In both cases, weekday data were used. The Irvine dataset includes 5 days of sensor data from freeway I-405. The application involved training/calibration with 4 days of data and subsequent testing/validation of the model on the 5th day (not used in the calibration). Data from 10 a.m. to 12 midnight have been used, since this period includes the (p.m.) peak flow for this direction. Speed, occupancy, and flow data over 2-minute intervals were available for calibration and validation. The second dataset was collected on Highway 20 (Ayalon Highway), a major intracity freeway running through the center of Tel Aviv, Israel. Four days of data were used for the training of the models and a different 5th day was used for validation. Speed, density, and flow data were available and were aggregated over 5-minute intervals.

4.2 Case Study Setup

The overall framework, including the main methodological components and the information flows, is outlined in Fig. 2. In general, each observation may include multiple attributes (e.g., [lagged] speed, density, flow, number of lanes, grade, meteorological information, vehicle mix, driver mix). The application comprises training and application steps. During the training step, archived surveillance data are used to (A) identify the various traffic states through clustering; (B) estimate the transition processes between these regimes; and (C) estimate cluster-specific traffic models. This information is stored in a knowledge base and supports the application of the framework. As new measurements become available, they are (D) classified into the appropriate regimes, and, based on the transition processes and the short-term evolution of the traffic state, (E) short-term predictions of the traffic state are performed using the applicable estimated transition processes. Furthermore, (F) the appropriate flexible traffic model is retrieved and applied to the new observations to (G) perform speed predictions.

4.3 Application and Results

Fig. 3 summarizes the prediction results. The performance of the models is evaluated using several goodness-of-fit measures: normalized root mean square error (RMSN), root mean square percentage error (RMSPE), mean percentage error (MPE), and Theil's U, Um, and Us coefficients (for details and a discussion of these metrics see Antoniou et al., 2013). Overall, the results are encouraging, with prediction errors of about 3%-4% according to the RMSN and RMSPE metrics. This represents an improvement of about 50% over the typical speed-density relationship for both data sets. The MPE measure is in general low. Also, the components of the Theil inequality coefficient have very low values in absolute terms and are considerably improved after the application of the complete framework. In addition, the following observations can be made:

- Loess applied to the entire dataset (i.e., without clustering) provides superior performance to the typical speed-density relationship. This is expected, as (i) it can integrate additional explanatory variables, and (ii) its functional form is less restricted (i.e., it can better follow the data).
- Decreasing the number of clusters from 8 to 5, in the application of the full-blown methodology, does not significantly affect the performance (in terms of accuracy in traffic state and local speed prediction). Further reduction (to three clusters for the Ayalon data set) produces a deterioration, but still much better performance than the typical speed-density relationship.

For further information see Antoniou et al. (2013).

5 APPLICATION TO MICROSCOPIC TRAFFIC MODELING

The objective of this case study is to validate the proposed data-driven methodology for the estimation of microscopic traffic simulation models. In particular, speed estimation is performed using real traffic data in order to develop a representative model of driving behavior.

[Fig. 3: bar charts of the goodness-of-fit measures (RMSN, RMSPE, MPE, U, Um, Us; unitless values roughly between 0.00 and 0.10) for the speed-density relationship, loess without clustering, and loess and naive predictors with different numbers of clusters, for Irvine, CA (top) and Ayalon, IL (bottom).]

FIG. 3 Visual comparison of the measures of effectiveness for all scenarios of mesoscopic modeling (top: Irvine, bottom: Ayalon).
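The goodness-of-fit measures reported above can be computed as in the following Python sketch. The formulas are the commonly used definitions; see Antoniou et al. (2013) for the exact definitions adopted in the chapter:

```python
import math

def rmsn(obs, est):
    # normalized root mean square error:
    # RMSN = sqrt(N * sum((est - obs)^2)) / sum(obs)
    n = len(obs)
    return math.sqrt(n * sum((e - o) ** 2 for o, e in zip(obs, est))) / sum(obs)

def rmspe(obs, est):
    # root mean square percentage error
    return math.sqrt(sum(((e - o) / o) ** 2 for o, e in zip(obs, est)) / len(obs))

def mpe(obs, est):
    # mean percentage error (its sign reveals over-/under-prediction bias)
    return sum((e - o) / o for o, e in zip(obs, est)) / len(obs)

obs = [50.0, 40.0, 30.0]     # hypothetical observed speeds
est = [52.0, 38.0, 33.0]     # hypothetical predicted speeds
print(round(rmsn(obs, est), 4))   # 0.0595
```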

5.1 Data and Experimental Design

A series of data-collection experiments was carried out on roads surrounding the city of Naples, Italy (Punzo et al., 2005). All data were collected from the same platoon under real traffic conditions in October 2002. The same four drivers drove the same vehicles in the same sequence, but in different driving sessions. The driving routes and traffic conditions differed among the datasets. Datasets with index A and C correspond to a one-lane urban road, while datasets with index B correspond to a two-lane extraurban highway.

TABLE 1 Characteristics of Naples Data

a/a   Dataset   No. of Observations
1     B1695     1695
2     C621      621
3     A358      358
4     A172      172
5     C168      168
6     C171      171

However, all selected roads have one lane per direction in order to avoid effects on driving behavior from lane changing. GPS receivers on the vehicles recorded the X, Y, Z coordinates of each vehicle every 0.1 seconds (i.e., at 10 Hz) (Table 1).

5.2 Case Study Setup

The process diagram in Fig. 1 comprises a combination of computational methods, such as flexible regression techniques, model-based clustering, and classification algorithms. In the following case study, a simplified process that employs a flexible regression technique (without prior clustering) is adopted. It has been applied using different state-of-the-art machine learning techniques, namely loess, MARS, GP, KSVM, and Bayesian regularized neural networks (BRNN). Gipps' model is used as a reference benchmark and has been applied to the same data for a fair comparison. For further details on the calibration of Gipps' model see Papathanasopoulou and Antoniou (2015).

5.3 Application and Results

Traffic models are trained using as input the most representative data series, B1695, as it is the dataset with the longest duration and the widest speed range. Relationships among the predictor variables (the speed of the subject vehicle v(t), the speed of the preceding vehicle v_front(t), and their distance D_front(t)) and the response variable (the speed of the subject vehicle at the next time instant, v(t + τ)) are identified using observations from data series B1695. After the model fitting, the proposed methods are applied to the remainder of the data series for validation. The R statistical software (R Core Team, 2018) was used; specifically, for MARS the "earth" package (Milborrow, 2017), for KSVM and GP the "kernlab" package (Zeileis et al., 2004), and for BRNN the "brnn" package (Perez-Rodriguez and Gianola, 2013).

FIG. 4 Normalized root mean square error for all scenarios of microscopic modeling (Naples data). [Bar chart of RMSN (%), roughly 0-12, by data series (B1695, C621, A358, A172, C168, C171) for the Gipps', loess, MARS, GP, KSVM, and BRNN models.]

Their performance in terms of RMSN is presented in Fig. 4. The results indicate that the most stable performance is achieved by the loess method and GP for the majority of the data series. The loess method, which has the added benefit of being very simple to implement, seems to be the best choice for this case study, as speed estimation with the lowest error is achieved consistently. Similar behavior is observed using the other machine learning techniques and all of them provide good alternatives for estimation of data-driven models using the available data.

6 APPLICATION TO WEAK LANE DISCIPLINE MODELING

Modeling driving behavior in mixed traffic streams is still a challenge. A heterogeneous mixture of vehicle types and the violation of lane regulations are common characteristics in cities in developing countries. These characteristics are difficult to simulate using conventional microscopic models. In car-following situations, it is difficult to determine leader-follower pairs due to multiple-leader following. Furthermore, in lane-changing situations it is difficult to determine lanes, as drivers do not obey the actual lane markings. Asaithambi et al. (2016) review driver behavior models under mixed traffic conditions and point out limitations of current models, arguing that the main limitation is that they do not explicitly consider the wider range of situations that drivers in mixed traffic face. Munigety and Mathew (2016) identify that, due to weak lane discipline, drivers maneuvering in mixed traffic streams exhibit some peculiar patterns, such as maintaining shorter headways, swerving, and filtering. They also propose that the lane should be divided into small strips in order to handle virtual lane movements. Li et al. (2015) propose a car-following model that considers the effect of two-sided lateral gaps and show that their model has a larger stable region compared to a car-following model that captures the impact of the lateral gap on only one side. In addition, Parsuvanathan (2015) has used proxy lanes between the main lanes, assuming that free space is perceived as lanes by small vehicles; however, the distribution and types of vehicles could affect the width of these lanes. A grid-based modeling approach akin to cellular automata (Gundaliya et al., 2008) and a strip-based modeling method (Mathew et al., 2013) have also been proposed. Mathew et al. (2013) base their idea on portions of traffic queues instead of regular main-lane queues. Kanagaraj et al. (2013) evaluate the performance of different car-following models under mixed traffic conditions; however, they do not take into account the fact that a vehicle may not be exactly in line with its leading vehicle due to weak lane discipline in mixed traffic. Metkari et al. (2013) modify an existing car-following model in order to take lateral movements into account and include mixed traffic conditions. Choudhury and Islam (2016) develop a latent leader acceleration model. Papathanasopoulou and Antoniou (2017) use data-driven approaches for modeling mixed traffic and propose virtual lanes for weak lane discipline conditions. This section is based on that analysis.

6.1 Data and Experimental Design

In order to evaluate the feasibility of data-driven modeling for mixed traffic conditions, data from an experiment in India (Kanagaraj et al., 2015) were used. The video data were collected on a six-lane separated urban arterial road at the Maraimalai Adigalar Bridge in Saidapet, Chennai, India. Collection took place on the northbound approach. The section was on a bridge, which ensured that the road geometry was uniform and that there were no nearby intersections, bus stops, parked vehicles, or other side factors that could affect drivers' behavior. Furthermore, there was no interaction between vehicle traffic and pedestrians, because the pedestrian walkway is segregated by a barrier. A detailed description of the data can be found in Kanagaraj et al. (2015). The data are presented in two parts: two Excel files for the data collected in the periods 2:45-3:00 p.m. (data245) and 3:00-3:15 p.m. (data300), respectively, on February 13, 2014. Each Excel sheet contains data such as vehicle type, length and width, longitudinal and lateral positions, longitudinal and lateral speeds, and longitudinal and lateral accelerations. The trajectory data are publicly available at http://toledo.net.technion.ac.il/downloads/.


6.2 Case Study Setup

6.2.1 Identification of Lead and Lag Vehicle

Since multiple leader vehicles may be present in heterogeneous traffic conditions, the critical leader vehicle has to be identified. The probability of a given front vehicle being the governing leader depends on the type of the lead vehicle and the extent of its lateral overlap with the following vehicle (Choudhury and Islam, 2016). In order to apply a microscopic model, it has to be determined whether there is a follower-leader vehicle pair. The main characteristic of mixed traffic is that the size of the overlap between the leader and the follower varies. Assuming that the lateral and longitudinal coordinates of the front center of each vehicle, (x_i^c, y_i^c), are known, it can be determined which vehicle follows the other. The coordinates of the left and right lateral bounds of each vehicle are estimated at each time instant t by Eqs. (13), (14) (as shown in Fig. 5):

x_i^l(t) = x_i^c(t) − w_i/2 − s_i(t)    (13)

x_i^r(t) = x_i^c(t) + w_i/2 + s_i(t)    (14)

FIG. 5 Estimation of lateral coordinates. [Diagram: vehicles i, i+1, i+2 moving in the same direction, each with front-center coordinate x^c, left and right front bounds x^l and x^r, width w, and lateral safety distance s on each side.]

where i = 0, 1, 2, …, n is the vehicle index, x_i^c is the lateral coordinate of the front center of vehicle i, x_i^l is the lateral coordinate of the front left bound of vehicle i, x_i^r is the lateral coordinate of the front right bound of vehicle i, w_i is the width of vehicle i, and s_i is a lateral safety distance for vehicle i. In order to define the car-following vehicle pairs, the longitudinal position of the leader should be in front of the following vehicle and within a distance L that could influence the movement of the following vehicle (Eq. 15). In addition, a part of the front side of one vehicle should overlap a part of the front side of the other vehicle (Eq. 16). This overlap is shown in Fig. 6 in light-blue color.

FIG. 6 Identification of leader-follower pair. [Four cases of partial lateral overlap between the front sides of the leader (bounds x^l_leader, x^r_leader) and the follower (bounds x^l_follower, x^r_follower), for vehicles moving in the same direction.]

Each vehicle i is considered as a follower, and a leader vehicle is required to fulfill the conditions described by Eqs. (15), (16) at the same instant t:

y_follower(t) ≤ y_leader(t) ≤ y_follower(t) + L    (15)

x^l_follower(t) ≤ x^r_leader(t) and x^l_leader(t) ≤ x^r_follower(t)    (16)

Four cases of leader-follower vehicle pairs have been identified, as shown in Fig. 6. Furthermore, a scenario with two leaders and one follower is also possible. For instance, a bus could be the follower, and a part of its front side may overlap with two leaders, such as two motorcycles or a small vehicle and a motorcycle. In this case the closest vehicle along the direction of movement is chosen as the most critical leader. If no vehicle is identified as a leader, then the vehicle is in a free-flow driving situation.
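The leader-identification logic of Eqs. (13)-(16) can be sketched as follows (a Python illustration with hypothetical vehicles; the dictionary field names are assumptions of this sketch, not the chapter's data format):

```python
def lateral_bounds(xc, w, s):
    # Eqs. (13)-(14): left/right front bounds from the front-center
    # coordinate xc, vehicle width w, and lateral safety distance s
    return xc - w / 2 - s, xc + w / 2 + s

def find_leader(follower, vehicles, L=200.0, s=0.20):
    """Pick the critical leader of `follower`: among vehicles ahead within
    distance L (Eq. 15) whose front sides laterally overlap the follower's
    (Eq. 16), choose the longitudinally closest. Vehicles are dicts with
    lateral front-center `xc`, longitudinal position `y`, and width `w`."""
    fl, fr = lateral_bounds(follower["xc"], follower["w"], s)
    best = None
    for v in vehicles:
        if v is follower:
            continue
        if not (follower["y"] <= v["y"] <= follower["y"] + L):  # Eq. (15)
            continue
        ll, lr = lateral_bounds(v["xc"], v["w"], s)
        if fl <= lr and ll <= fr:                               # Eq. (16)
            if best is None or v["y"] < best["y"]:
                best = v     # closest vehicle ahead = most critical leader
    return best              # None means free-flow

bus = {"id": "bus", "xc": 2.0, "y": 0.0, "w": 2.5}
moto = {"id": "moto", "xc": 1.2, "y": 6.0, "w": 0.8}
car = {"id": "car", "xc": 2.4, "y": 15.0, "w": 1.7}
print(find_leader(bus, [bus, moto, car])["id"])  # moto
```

Both the motorcycle and the car overlap the wide bus laterally, so the nearer motorcycle is selected as the critical leader.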

6.2.2 Determination of Virtual Lanes

Temporary virtual lanes are proposed for the simulation of mixed traffic conditions. Heterogeneity in vehicle types implies various vehicle widths and thus various widths of virtual lanes. A typical example of a virtual lane change is illustrated in Fig. 7. In this figure, there are two vehicles. The first vehicle follows virtual lane i. As long as its lateral movements remain small, it is considered not to change lane. However, when its movement is constrained by the hatched vehicle at the breakpoint, it is considered to change lane and then follows virtual lane i + 1. The challenge is that vehicles are constantly moving laterally. This can be addressed in two distinct ways. The first is to estimate a threshold that indicates a lane change. The second is to use change detection algorithms capable of finding major changes in a data sequence, for example via the "strucchange" package (Kleiber et al., 2002).
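A minimal stand-in for the change-detection idea (the chapter points to the R "strucchange" package; the sketch below is a toy alternative, not that package's method) is a single-breakpoint search over a lateral-position series, minimizing the total within-segment squared error:

```python
def best_breakpoint(xs):
    """Single-breakpoint detector for a lateral-position series: choose the
    split index minimizing the total within-segment squared error."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)
    best_k, best_cost = None, float("inf")
    for k in range(2, len(xs) - 1):     # keep at least 2 points per segment
        cost = sse(xs[:k]) + sse(xs[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# hypothetical lateral positions with a shift suggesting a lane change
lateral = [3.1, 3.0, 3.2, 3.1, 3.0, 5.4, 5.5, 5.6, 5.5]
print(best_breakpoint(lateral))  # 5: first index of the new virtual lane
```

Small lateral oscillations leave the within-segment errors nearly unchanged, while a sustained shift in the mean pulls the optimal split to the shift point, mirroring the breakpoint in Fig. 7.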

FIG. 7 Virtual lanes.

6.3 Application and Results

The application includes the fitting of data-driven models for car-following situations in the available Indian data. The problem addressed is the estimation of the speed of each vehicle at the next time instant, taking into account its speed, the speed of the preceding vehicle, and the distance between the two vehicles (in the previous time instant). The time step is 0.5 seconds. Loess was used for this application. In the training step the flexible car-following model is fitted (calibrated) on the surveillance data (data245), and it is then validated on another dataset (data300). For further details see Papathanasopoulou and Antoniou (2017). The speed of each vehicle was estimated as the resultant speed by Eq. (17):

v_i(t) = sqrt( v_long,i(t)² + v_lat,i(t)² )    (17)

where v_i is the resultant speed of vehicle i, v_long,i is the longitudinal speed of vehicle i, and v_lat,i is the lateral speed of vehicle i. The next step was to define the car-following sequence, namely which vehicle is in front of the other. This was a challenging task due to the nature of mixed traffic data: Kanagaraj et al. (2015) found that in 45% of the observations the overlap between the leader and the follower is less than half the follower width. The identification of the front vehicle was based on Eqs. (15), (16). A lateral safety distance of s = 0.20 m was considered for each vehicle on both sides, and the distance L in Eq. (15) was set to L = 200 m. The same procedure was applied to the dataset "data300." If no vehicles were identified as leaders, the corresponding observations were omitted, as they do not correspond to a car-following state. Finally, dataset "data245" includes 47,036 observations and dataset "data300" 45,982 observations. A conventional car-following model, Gipps' model (Gipps, 1981), is used as a reference benchmark in order to monitor and evaluate the effectiveness of data-driven modeling. The model was calibrated using the Improved Stochastic Ranking Evolution Strategy (ISRES) algorithm, which is included in the package "nloptr" (Runarsson and Yao, 2005) and is appropriate for nonlinearly constrained global optimization.
This method is implemented in a simple way and supports arbitrary nonlinear inequality and equality constraints in addition to the bound constraints. For further details on the Gipps' model calibration see Papathanasopoulou and Antoniou (2017). The data-driven model identifies the relationships between the predictor variables (vleader(t), vfollower(t), and the distance D(t) between the two vehicles) and the response vfollower(t + τ), where τ = 0.5 seconds. After the relevant pattern has been identified from the "data245" data series, the proposed method is applied to the "data300" data series. It requires the input data vleader(t), vfollower(t), and distance D(t), and exports the estimated output vfollower(t + 0.5). RMSN values have been estimated per time instant t in order to compare predicted and observed speed values and assess the performance of this modeling approach. The estimated RMSN for dataset "data300" is 0.19 using the Gipps' model and 0.12 using the loess model. The flexible model outperforms the conventional model and produces a more reliable speed prediction. Figs. 8 and 9 present an analysis of the results per vehicle type. Fig. 8 shows the density of RMSN per vehicle type. The best performance

286 PART II Applications

FIG. 8 Density of RMSN per vehicle type for dataset "data300." (Vehicle types: motorcycle, car, bus, truck, light commercial vehicle, autorickshaw; x-axis: RMSN, 0.0-0.5; y-axis: density.)

FIG. 9 Density plot of RMSN per vehicle type of the preceding vehicle when the follower is a car (dataset "data300"). (Vehicle types: motorcycle, car, bus, truck, light commercial vehicle, autorickshaw; x-axis: RMSN, 0.0-0.5; y-axis: density.)

of the loess method is achieved for cars and light commercial vehicles, while higher RMSN values are observed for the other vehicle types, especially trucks and autorickshaws. In Fig. 9, RMSN densities are outlined per vehicle type of the leader when the follower is a car. The vehicle pairs car-car and motorcycle-car (leader-follower) have a density-curve peak below RMSN = 0.1. The density


curve of the truck-car pair corresponds to higher RMSN than those of the other vehicle pairs. Data-driven approaches could be a promising tool for modeling mixed traffic. They lead to flexible car-following models and thus to a more robust and reliable representation of driving behavior. Data-driven estimation techniques are designed to address cases in which traditional approaches do not perform well. Models developed for lane-based traffic conditions may not be appropriate for simulating traffic in developing countries, where weak lane discipline is often observed. Traffic in the developing world is so heterogeneous that lane-based models often cannot be realistic. Furthermore, vehicle-dependent models need to be developed for heterogeneous traffic, as the drivers of vehicles with unequal dimensions tend to exhibit different driving behaviors; in addition, different vehicle types are characterized by varying vehicle kinematics. Thus, it is expected that further exploration of data-driven approaches could open up opportunities to understand and simulate driving behavior in non-lane-discipline conditions with heterogeneous vehicle types. The integration of data-driven methods in advanced driver assistance systems under mixed traffic conditions could also be very interesting, though additional research is needed.
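The leader-identification step used in the application above can be sketched roughly as follows. This is an illustrative approximation rather than a reproduction of Eqs. (15), (16); the dict-based vehicle layout is an assumption, though the lateral safety margin s = 0.20 m and look-ahead distance L = 200 m follow the text.

```python
def find_leader(follower, vehicles, s=0.20, L=200.0):
    """Return the nearest vehicle ahead whose lateral extent overlaps the
    follower's path, widened by a safety margin s (m) on both sides,
    searching up to L meters downstream. Each vehicle is a dict with keys
    'x' (longitudinal position, m), 'y' (lateral center, m), 'width' (m);
    this data layout is illustrative, not the study's actual format."""
    leader, best_gap = None, L
    f_left = follower['y'] - follower['width'] / 2 - s
    f_right = follower['y'] + follower['width'] / 2 + s
    for v in vehicles:
        gap = v['x'] - follower['x']
        if 0 < gap <= best_gap:
            v_left = v['y'] - v['width'] / 2
            v_right = v['y'] + v['width'] / 2
            if v_right > f_left and v_left < f_right:  # lateral overlap
                leader, best_gap = v, gap
    return leader
```

In mixed traffic the "virtual lane" of a motorcycle can overlap several wider vehicles at once; taking the nearest laterally overlapping vehicle, as above, is one common simplification.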

7 NETWORK-WIDE APPLICATION

The results presented in the last three sections provide clear evidence that data-driven traffic approaches have the potential to contribute to improved modeling capabilities, in light of new data and emerging simulation needs. In this section, we present a network-wide validation of these results, using a microscopic traffic simulator (SUMO).

7.1 Implementation Aspects

Flexible regression techniques are integrated into the simulator in order to create a flexible environment. Three parts are combined: training, application, and simulation. The training step involves the estimation of traffic models on the acquired surveillance data, which can be processed offline. The online process, however, includes speed prediction for the following cars at the next time instant and interaction with the simulator during the simulation process. SUMO was used in the analysis (sumo.dlr.de/wiki/Downloads). SUMO is open-source software, which can be extended to meet specific needs and requirements (Black, 1999; Krajzewicz et al., 2012). SUMO, developed by DLR, is written in C++, and the open-source files for compilation are available on the official DLR website, providing a convenient environment for further modification.


FIG. 10 Architecture of SUMO, extended with data-driven components.

In order to integrate a data-driven model into the SUMO traffic microsimulator, it is necessary to build a connection between the traffic simulator and a flexible regression tool. As the flexible regression estimator, R was chosen, bridged through C++ via Rcpp [http://www.rcpp.org/]. Since Rcpp offers fast calculation, this is a suitable choice, as the software can return the estimated values to the simulator in real time. Instead of estimating speed with its default native car-following models, SUMO calls the Rcpp component at the car-following stage of the vehicle behavior model. Rcpp loads the R script containing the necessary functions as well as the observation data, thereby combining them and calculating the estimated speed (Fig. 10). SUMO offers extended control over the simulation process as well as modification of parameters. This is feasible through the interface called TraCI (Wegener et al., 2008). TraCI builds a connection between the controller of the simulation process and SUMO; it serves as a server through which the simulation can be manipulated online. Python is used to control the whole process, as TraCI commands are transmitted to SUMO through Python. For the Python implementation, the library package rpy2 is used. For the case study, R 3.3.2 along with rpy2-0.2.8 (a Python library) built on WinPython 2.7.10.3 were used. The process starts with the Python code creating an environment for running SUMO. Python sends the call function via TraCI to SUMO in order to set the speed of the first car in the network. Afterward, via the same interface, the Python controller retrieves the speeds of the leading and following cars as well as the distance between them, and sends these observations to the R script via rpy2, which returns the estimated speed of the


FIG. 11 SUMO connection to R over Python.

following car for the next time step. The SUMO connection to R over Python is outlined in Fig. 11.
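The per-step interaction just described can be sketched with a simplified control loop. To keep the example self-contained, a duck-typed `sim` object stands in for the TraCI connection and a plain Python function stands in for the rpy2/R speed estimator; both stand-ins are assumptions, not the chapter's actual code.

```python
class FakeSim:
    """Minimal stand-in for a TraCI connection (illustrative only)."""
    def __init__(self, speeds, gap):
        self.speeds = dict(speeds)  # vehicle id -> speed (m/s)
        self.gap = gap              # leader-follower spacing (m)

    def get_speed(self, veh_id):
        return self.speeds[veh_id]

    def get_gap(self, leader_id, follower_id):
        return self.gap

    def set_speed(self, veh_id, speed):
        self.speeds[veh_id] = speed


def run_step(sim, predictor, leader_id, follower_id):
    """One simulation step: read the leader/follower state, ask the
    data-driven model for the follower's next-step speed, write it back."""
    v_leader = sim.get_speed(leader_id)
    v_follower = sim.get_speed(follower_id)
    distance = sim.get_gap(leader_id, follower_id)
    v_next = predictor(v_leader, v_follower, distance)
    sim.set_speed(follower_id, v_next)
    return v_next
```

Against a real SUMO instance, the `sim` methods would map to TraCI calls such as `traci.vehicle.getSpeed` and `traci.vehicle.setSpeed`, and `predictor` would dispatch to the fitted loess model through rpy2.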

7.2 Case Study Setup

Two flexible regression methods, kernel regression (Nadaraya, 1964) and loess (Cleveland, 1979), as well as a conventional reference model, the Krauss car-following model, were implemented and validated using the Naples data (see Section 5.1). All the models were used to predict the speed of the following vehicle in a car-following state. The trajectory data from Naples are based on time steps of 10 milliseconds, and the same time step was used in SUMO. The Naples network map was retrieved from www.openstreetmap.org and adapted accordingly; this included the conversion of simple edge-node roads into polyline links to provide a more realistic representation. To create the trips, a unified Python code was used, in which the number of vehicles in the network can be set by entering the necessary value.
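Of the two flexible estimators, kernel regression is the simpler to illustrate: the Nadaraya-Watson estimator predicts with a locally weighted average of the training responses. A minimal single-point sketch with a Gaussian kernel follows; the bandwidth and data are illustrative, not the case study's settings.

```python
import math

def nw_predict(x_train, y_train, x, bandwidth=1.0):
    """Nadaraya-Watson kernel regression at a single point x: a weighted
    average of the y_train values, with Gaussian weights that decay with
    the distance of each x_train point from x."""
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)
```

In the car-following setting the predictors are multivariate (leader speed, follower speed, spacing); the same idea applies with a multivariate kernel over the standardized predictor space.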

7.3 Results

The outcomes of both regression techniques are quite promising. In all data series, both kernel regression and loess showed more accurate results than the Krauss model (Figs. 12 and 13). Both flexible regression techniques provide more reliable speed prediction in the simulator environment. The results also show that loess is even more accurate in its predictions than the kernel regression technique.
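Loess differs from plain kernel regression in that it fits a weighted local polynomial (here a straight line) around each query point instead of taking a weighted average. A rough single-point sketch with tricube weights follows; the span fraction and the first-degree fit are simplifying assumptions relative to the full loess of Cleveland (1979).

```python
def tricube(u):
    """Tricube weight, zero beyond the neighborhood boundary (u >= 1)."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess_predict(x_train, y_train, x, frac=0.5):
    """First-degree loess at a single point: fit a weighted straight line
    to the frac*n nearest neighbors (tricube weights) and evaluate at x."""
    n = len(x_train)
    k = max(2, int(frac * n))
    nearest = sorted(range(n), key=lambda i: abs(x_train[i] - x))[:k]
    d_max = max(abs(x_train[i] - x) for i in nearest) or 1.0
    w = [tricube(abs(x_train[i] - x) / d_max) for i in nearest]
    # weighted least squares for intercept and slope (closed form)
    sw = sum(w)
    sx = sum(wi * x_train[i] for wi, i in zip(w, nearest))
    sy = sum(wi * y_train[i] for wi, i in zip(w, nearest))
    sxx = sum(wi * x_train[i] ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * x_train[i] * y_train[i] for wi, i in zip(w, nearest))
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx ** 2)
    return (sy - slope * sx) / sw + slope * x
```

Because each prediction solves a small weighted least-squares problem, loess adapts to local curvature in the speed-spacing relationship, which is one plausible reason for its edge over the plain kernel average in these results.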


FIG. 12 Normalized root mean square error—network-wide simulation-based validation. (RMSN (%) of the Krauss, loess, and kernel models for data series B1695, C621, A358, A172, C168, and C171.)

FIG. 13 Root mean square percentage error—network-wide simulation-based validation. (RMSPE of the Krauss, loess, and kernel models for data series B1695, C621, A358, A172, C168, and C171.)

8 CONCLUSIONS

Computational intelligence in general has proven its applicability to traffic simulation models. Data-driven models offer reliable and robust alternative solutions for modeling driving behavior, overcoming some of the limitations associated with conventional models. An integrated methodological framework


has been developed and successfully demonstrated using actual data from a variety of facilities for mesoscopic and microscopic modeling. Data-driven approaches could be a promising tool for future challenges. Flexible models allow the incorporation of additional predictor variables, whereas resorting to cumbersome reformulations of a fixed model form can be impractical. Although traditional models may provide better insight into traffic flow theory, they cannot meet new modeling requirements and exploit the availability of big data. Traffic conditions vary along multiple dimensions, including across individual drivers and vehicles, spatially, and temporally. The computational and data requirements of data-driven modeling are such that the methodology can be applied to each individual vehicle. This is very important in the context of autonomous vehicles. The proposed methodology allows parameters to be estimated and predicted not only per vehicle class but indeed for each individual vehicle, in real time. Extrapolation will be needed when only a sample of vehicles has the ability to collect or receive the required data. In this case, it could be practical to identify classes of vehicles, estimate and predict these parameters for the vehicles comprising the sample of each class, and then extrapolate this information to the entire population of vehicles of that class in the studied area. Besides the temporal variability of these parameters by class, one can of course foresee a spatial distribution, as traffic conditions, road characteristics, fleet mix, and other factors could influence their values. Data-driven modeling could thus be leveraged for the improvement of traffic simulation models: data-driven estimation of traffic models appears to be a promising tool that could offer considerable benefits if integrated into traffic simulation models, resulting in higher accuracy and reliability of model outputs.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Vincenzo Punzo from the University of Naples Federico II for providing the data for the Naples case study, and Prof. Tomer Toledo from the Technion, Israel Institute of Technology, and the US Federal Highway Administration (FHWA) for making the data from India and the NGSIM project, respectively, freely available. The first author is thankful for a scholarship from the National Technical University of Athens.

REFERENCES

Antoniou, C., Koutsopoulos, H.N., 2006a. A comparison of machine learning methods for speed estimation. In: Proceedings of the 11th IFAC Symposium on Control in Transportation Systems.
Antoniou, C., Koutsopoulos, H.N., 2006b. Estimation of traffic dynamics models with machine learning methods. Transp. Res. Rec. J. Transp. Res. Board 1965, 103–111.


Antoniou, C., Balakrishna, R., Koutsopoulos, H.N., 2011. A synthesis of emerging data collection technologies and their impact on traffic management applications. Eur. Transp. Res. Rev. 3 (3), 139–148.
Antoniou, C., Koutsopoulos, H.N., Yannis, G., 2013. Dynamic data-driven local traffic state estimation and prediction. Transp. Res. C Emerg. Technol. 34, 89–107.
Antoniou, C., Gikas, V., Papathanasopoulou, V., Mpimis, T., Markou, I., Perakis, H., 2014. Towards distribution-based calibration for traffic simulation. In: 2014 IEEE 17th International Conference on Intelligent Transportation Systems (ITSC), pp. 786–791.
Asaithambi, G., Kanagaraj, V., Toledo, T., 2016. Driving behaviors: models and challenges for non-lane based mixed traffic. Transp. Dev. Econ. 2 (2), 19.
Azimi, M., Zhang, Y., 2010. Categorizing freeway flow conditions by using clustering methods. Transp. Res. Rec. J. Transp. Res. Board 2173, 105–114.
Bagirov, A.M., Rubinov, A.M., Soukhoroukova, N.V., Yearwood, J., 2003. Unsupervised and supervised data classification via nonsmooth and global optimization. Top 11 (1), 1–75.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821.
Barth, M., Todd, M., Xue, L., Jan. 2004. User-Based Vehicle Relocation Techniques for Multiple-Station Shared-Use Vehicle Systems. Paper presented at the Transportation Research Board 83rd Annual Meeting, Washington, DC.
Bi, H., Mao, T., Wang, Z., Deng, Z., 2016. A data-driven model for lane-changing in traffic simulation. In: Symposium on Computer Animation, pp. 149–158.
Bifulco, G.N., Pariota, L., Simonelli, F., Di Pace, R., 2013. Development and testing of a fully adaptive cruise control system. Transp. Res. C Emerg. Technol. 29, 156–170.
Black, P.E. (Ed.), 1999. Levenshtein Distance. Algorithms and Theory of Computation Handbook. CRC Press LLC (from Dictionary of Algorithms and Data Structures, NIST).
Chaniotakis, E., Antoniou, C., Pereira, F., 2016. Mapping social media for transportation studies. IEEE Intell. Syst. 31 (6), 64–70.
Chen, X.Y., Pao, H.K., Lee, Y.J., 2014. Efficient traffic speed forecasting based on massive heterogenous historical data. In: Big Data (Big Data), 2014 IEEE International Conference on, pp. 10–17.
Chong, L., Abbas, M.M., Flintsch, A.M., Higgs, B., 2013. A rule-based neural network approach to model driver naturalistic behavior in traffic. Transp. Res. C Emerg. Technol. 32, 207–223.
Choudhury, C.F., Islam, M.M., 2016. Modelling acceleration decisions in traffic streams with weak lane discipline: a latent leader approach. Transp. Res. C Emerg. Technol. 67, 214–226.
Cleveland, W.S., 1979. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74 (368), 829–836.
Cleveland, W.S., Devlin, S.J., 1988. Locally weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. 83 (403), 596–610.
Cleveland, W.S., Devlin, S.J., Grosse, E., 1988. Regression by local fitting: methods, properties, and computational algorithms. J. Econom. 37 (1), 87–114.
Cohen, R.A., 1999. An introduction to PROC LOESS for local regression. In: Proceedings of the 24th SAS Users Group International Conference, Paper, Vol. 273.
Colombaroni, C., Fusco, G., 2014. Artificial neural network models for car following: experimental analysis and calibration issues. J. Intell. Transp. Syst. 18 (1), 5–16.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297.
Courtois, M., Woodside, M., 2000. Using regression splines for software performance analysis. In: Proceedings of the 2nd International Workshop on Software and Performance, pp. 105–114.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 39, 1–38.
Ding, C., Wang, W., Wang, X., Baumann, M., 2013. A neural network model for driver's lane-changing trajectory prediction in urban traffic flow. Math. Prob. Eng. 2013, 8. Article ID 967358.


El Faouzi, N.E., 2004. Data-driven aggregative schemes for multisource estimation fusion: a road travel time application. In: Multisensor, Multisource Information Fusion: Architectures, Algorithms, and Applications, vol. 5434, pp. 351–360.
El Faouzi, N.E., Lefevre, E., 2006. Classifiers and distance-based evidential fusion for road travel time estimation. In: Multisensor, Multisource Information Fusion: Architectures, Algorithms, and Applications, vol. 6242, p. 62420A.
El Faouzi, N.E., Leung, H., Kurian, A., 2011. Data fusion in intelligent transportation systems: progress and challenges—a survey. Inf. Fusion 12 (1), 4–10.
Foresee, F.D., Hagan, M.T., 1997. Gauss-Newton approximation to Bayesian learning. In: International Conference on Neural Networks, vol. 3, pp. 1930–1935.
Fraley, C., 1998. Algorithms for model-based Gaussian hierarchical clustering. SIAM J. Sci. Comput. 20 (1), 270–281.
Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97 (458), 611–631.
Fraley, C., Raftery, A.E., 2003. Enhanced model-based clustering, density estimation, and discriminant analysis software: MCLUST. J. Classif. 20 (2), 263–286.
Friedman, J.H., 1991. Multivariate adaptive regression splines (with discussion). Ann. Stat. 19, 1–141.
Geroliminis, N., Skabardonis, A., 2005. Prediction of arrival profiles and queue lengths along signalized arterials by using a Markov decision process. Transp. Res. Rec. J. Transp. Res. Board 1934, 116–124.
Gipps, P.G., 1981. A behavioural car-following model for computer simulation. Transp. Res. B Methodol. 15 (2), 105–111.
Guido, G., Gallelli, V., Rogano, D., Vitale, A., 2016. Evaluating the accuracy of vehicle tracking data obtained from unmanned aerial vehicles. Int. J. Transp. Sci. Technol. 5 (3), 136–151.
Gundaliya, P.J., Mathew, T.V., Dhingra, S.L., 2008. Heterogeneous traffic flow modelling for an arterial using grid based approach. J. Adv. Transp. 42 (4), 467–491.
Happe, J., Westermann, D., Sachs, K., Kapová, L., 2010. Statistical inference of software performance models for parametric performance completions. In: International Conference on the Quality of Software Architectures, pp. 20–35.
Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 28 (1), 100–108.
Hou, Y., Edara, P., Sun, C., 2014. Modeling mandatory lane changing using Bayes classifier and decision trees. IEEE Trans. Intell. Transp. Syst. 15 (2), 647–655.
Huang, W., Song, G., Hong, H., Xie, K., 2014. Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Trans. Intell. Transp. Syst. 15 (5), 2191–2201.
Huval, B., Wang, T., Tandon, S., Kiske, J., Song, W., Pazhayampallil, J., Andriluka, M., Cheng-Yue, R., Mujica, F., Coates, A., Rajpurkar, P., Migimatsu, T., Ng, A.Y., 2015. An empirical evaluation of deep learning on highway driving. ArXiv preprint arXiv:1504.01716.
Jenelius, E., Koutsopoulos, H.N., 2013. Travel time estimation for urban road networks using low frequency probe vehicle data. Transp. Res. B Methodol. 53, 64–81.
Jenelius, E., Koutsopoulos, H.N., 2018. Urban network travel time prediction based on a probabilistic principal component analysis model of probe data. IEEE Trans. Intell. Transp. Syst. 19 (2), 436–445.
Kaisler, S., Armour, F., Espinosa, J.A., Money, W., 2013. Big data: issues and challenges moving forward. In: 46th Hawaii International Conference on System Sciences (HICSS), pp. 995–1004.
Kanagaraj, V., Asaithambi, G., Kumar, C.H.N., Srinivasan, K.K., Sivanandan, R., 2013. Evaluation of different vehicle following models under mixed traffic conditions. Procedia Soc. Behav. Sci. 104, 390–401.


Kanagaraj, V., Asaithambi, G., Toledo, T., Lee, T.C., 2015. Trajectory data and flow characteristics of mixed traffic. Transp. Res. Rec. J. Transp. Res. Board 2491, 1–11.
Karlaftis, M.G., Vlahogianni, E.I., 2011. Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transp. Res. C Emerg. Technol. 19 (3), 387–399.
Kleiber, C., Hornik, K., Leisch, F., Zeileis, A., 2002. Strucchange: an R package for testing for structural change in linear regression models. J. Stat. Softw. 7 (2), 1–38.
Kleyko, D., Hostettler, R., Birk, W., Osipov, E., 2015. Comparison of machine learning techniques for vehicle classification using road side sensors. In: IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pp. 572–577.
Krajzewicz, D., Erdmann, J., Behrisch, M., Bieker, L., 2012. Recent development and applications of SUMO—simulation of urban mobility. Int. J. Adv. Syst. Meas. 5 (3/4), 128–138.
Kumar, P., Perrollaz, M., Lefevre, S., Laugier, C., 2013. Learning-based approach for online lane change intention prediction. In: Intelligent Vehicles Symposium (IV), 2013 IEEE, pp. 797–802.
Li, Y., Zhang, L., Peeta, S., Pan, H., Zheng, T., Li, Y., He, X., 2015. Non-lane-discipline-based car-following model considering the effects of two-sided lateral gaps. Nonlinear Dyn. 80 (1–2), 227–238.
Lv, Y., Duan, Y., Kang, W., Li, Z., Wang, F.Y., 2015. Traffic flow prediction with big data: a deep learning approach. IEEE Trans. Intell. Transp. Syst. 16 (2), 865–873.
Mächler, M., Bühlmann, P., 2004. Variable length Markov chains: methodology, computing, and software. J. Comput. Graph. Stat. 13 (2), 435–455.
MacQueen, J., et al., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.
Markov, A., 1971. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. In: Howard, R. (Ed.), Dynamic Probabilistic Systems, vol. 1. Wiley, Hoboken, NJ (Reprinted in Appendix B).
Mathew, T.V., Munigety, C.R., Bajpai, A., 2013. Strip-based approach for the simulation of mixed traffic conditions. J. Comput. Civil Eng. 29 (5), 04014069.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, pp. 361–369.
Metkari, M., Budhkar, A., Maurya, A.K., 2013. Development of simulation model for heterogeneous traffic with no lane discipline. Procedia Soc. Behav. Sci. 104, 360–369.
Milborrow, S., 2017. Earth: Multivariate Adaptive Regression Splines. R package version 4.4.9 [software]. Derived from mda:mars by T. Hastie and R. Tibshirani. Available from: https://CRAN.R-project.org/package=earth (accessed 7 April 2017).
Mitchell, T.M., et al., 1997. Machine Learning. WCB/McGraw-Hill, Boston, MA.
Muezzinoglu, M.K., Zurada, J.M., 2005. A recurrent RBF network model for nearest neighbor classification. In: Proceedings, IEEE International Joint Conference on Neural Networks, IJCNN'05, vol. 1, pp. 343–348.
Munigety, C.R., Mathew, T.V., 2016. Towards behavioral modeling of drivers in mixed traffic conditions. Transp. Dev. Econ. 2 (1), 6.
Nadaraya, E.A., 1964. On estimating regression. Theory Probab. Appl. 9 (1), 141–142.
Panwai, S., Dia, H., 2007. Neural agent car-following models. IEEE Trans. Intell. Transp. Syst. 8 (1), 60–70.
Papathanasopoulou, V., Antoniou, C., 2015. Towards data-driven car-following models. Transp. Res. C Emerg. Technol. 55, 496–509.


Papathanasopoulou, V., Antoniou, C., January 2017. Flexible car-following models on mixed traffic trajectory data. In: Proceedings of the 96th Annual Meeting of the Transportation Research Board, Washington, DC.
Parsuvanathan, C., 2015. Proxy-lane algorithm for lane-based models to simulate mixed traffic flow conditions. Int. J. Traffic Transp. Eng. 4 (5), 131–136.
Perez-Rodriguez, P., Gianola, D., 2013. brnn: Bayesian Regularization for Feed-Forward Neural Networks. R package version 0.3.
Punzo, V., Formisano, D., Torrieri, V., 2005. Part 1: traffic flow theory and car following: nonstationary Kalman filter for estimation of accurate and consistent car-following data. Transp. Res. Rec. J. Transp. Res. Board 1934, 1–12.
Quiñonero-Candela, J., Rasmussen, C.E., 2005. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6 (Dec), 1939–1959.
R Core Team, 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Version 3.5.1 (2018-07-02). Available from: http://www.R-project.org/ (accessed 10 August 2018).
Ranjitkar, P., Suzuki, H., Nakatsuji, T., 2005. Microscopic traffic data with real-time kinematic global positioning system. In: Proceedings of Annual Meeting of Infrastructure Planning and Management, Japan Society of Civil Engineers, Miyazaki, Preprint CD.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Rissanen, J., 1983. A universal data compression system. IEEE Trans. Inf. Theory 29 (5), 656–664.
Roughan, M., Sen, S., Spatscheck, O., Duffield, N., 2004. Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification. In: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pp. 135–148.
Runarsson, T.P., Yao, X., 2005. Search biases in constrained evolutionary optimization. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 35 (2), 233–243.
Schwarz, G., et al., 1978. Estimating the dimension of a model. Ann. Stat. 6 (2), 461–464.
Simonelli, F., Bifulco, G., De Martinis, V., Punzo, V., 2009. Human-like adaptive cruise control systems through a learning machine approach. Appl. Soft Comput. 52, 240–249.
Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C., 2005. Optimized and meta-optimized neural networks for short-term traffic flow prediction: a genetic approach. Transp. Res. C Emerg. Technol. 13 (3), 211–234.
Wang, E.G., Sun, J., Jiang, S., Li, F., 2017. Modeling the various merging behaviors at expressway on-ramp bottlenecks using support vector machine models. Transp. Res. Procedia 25, 1327–1341.
Wegener, A., Piórkowski, M., Raya, M., Hellbrück, H., Fischer, S., Hubaux, J.P., 2008. TraCI: an interface for coupling road traffic and network simulators. In: Proceedings of the 11th Communications and Networking Simulation Symposium, pp. 155–163.
Williams, C.K.I., Rasmussen, C.E., 1996. Gaussian processes for regression. Adv. Neural Inf. Process. Syst. 8, 514–520.
Xu, J.-X., Lim, J.S., 2007. A new evolutionary neural network for forecasting net flow of a car sharing system. In: IEEE Congress on Evolutionary Computation, CEC 2007, pp. 1670–1676.
Yeon, J., Elefteriadou, L., Lawphongpanich, S., 2008. Travel time estimation on a freeway using discrete time Markov chains. Transp. Res. B Methodol. 42 (4), 325–338.
Zeileis, A., Hornik, K., Smola, A., Karatzoglou, A., 2004. kernlab—an S4 package for kernel methods in R. J. Stat. Softw. 11 (9), 1–20.
Zhang, J., Wang, F.Y., Wang, K., Lin, W.H., Xu, X., Chen, C., 2011. Data-driven intelligent transportation systems: a survey. IEEE Trans. Intell. Transp. Syst. 12 (4), 1624–1639.
Zheng, J., Suzuki, K., Fujita, M., 2013. Car-following behavior with instantaneous driver-vehicle reaction delay: a neural-network-based methodology. Transp. Res. C Emerg. Technol. 36, 339–351.

Chapter 12

Big Data and Road Safety: A Comprehensive Review

Katerina Stylianou*, Loukas Dimitriou* and Mohamed Abdel-Aty†

*Laboratory for Transport Engineering, Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus; †Department of Civil, Environmental and Construction Engineering, University of Central Florida, Orlando, FL, United States

Chapter Outline
1. Introduction 297
2. The Role of Big Data in Traffic Safety Analysis 298
2.1. Real-Time Crash Prediction 299
2.2. Driving Behavior 327
3. ADAS and Autonomous Vehicles (AVs) 332
4. Conclusions 336
References 337

1 INTRODUCTION

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00012-9
© 2019 Elsevier Inc. All rights reserved.

The widespread use of sensors, along with the latest developments in wireless technologies, has led to the prevalence of network monitoring and Intelligent Transport Systems (ITS). The main focus is to collect data from the distributed sensors, archive them, analyze them, transform them into actionable knowledge, and diffuse this knowledge through various transportation applications related to planning, mobility, and safety. Data play a critical role in the transportation sector and in all modes of transport, and nowadays a plethora of information is available to transport operators for improving performance, efficiency, service, and safety. With an increasing trend, data are continuously collected through in-situ sensors, remote sensors, cameras, microphones, wireless sensor networks, and mobile devices. With this wide availability of data through different technologies, datasets can rapidly become so large and complex that they are difficult to process with traditional data analytics; at the same time, because of the increased demands of transportation development, data are progressively significant in the management and use of transportation systems. The increase in data is manifest in the availability of traffic information, which is continuously growing in volume due to the growth in the amount of traffic and in the number of detectors. Big Data in the traffic area, enabled by the rapid



popularization of ITS, are continuously collected over vast geographic scales and, though seemingly abstract and unorganized in nature, could be used to enhance expert knowledge about the transportation system. These data can be used to identify potential research trends and to investigate the impact of specific events or even decisions. The benefits of traffic detection technologies and vast data collection include both direct and indirect applications. Direct applications include travel time estimation, route planning, and congestion reduction, whereas indirect applications are carried out through the calibration and validation of models and traffic simulations. It is important to identify the needs for data and its capabilities, and in recent years a transformational change has occurred whereby traffic data have been recognized as a foundational element of planning, constructing, and operating road networks. A more data-driven framework is followed, enabling a quicker response to safety-related issues, travel time optimization, congestion reduction, and, overall, a more effective deployment of public spending. Big Data applications have introduced new perspectives in safety analysis as well. Traffic safety has long been deemed a priority, and research projects in this field have highlighted the importance of Big Data in supporting safety science applications and services. Studies on traffic safety cover a wide variety of topics, including crash frequency modeling, real-time safety analysis, human factors, before-and-after analyses of safety evaluations, economic appraisal, and injury severity modeling. These studies are heavily dependent on data, and hence there is a continuous need for additional and more refined traffic data.
This chapter begins with a presentation of the role of Big Data in traffic safety analysis, where a review of the most recent real-time crash prediction studies is provided, followed by a description of the types of studies aimed at examining driving behavior. Section 3 focuses on Advanced Driving Assistance Systems (ADAS) and Automated Vehicles (AVs), where a brief overview of these two correlated fields is provided, and Section 4 presents the conclusions.

2 THE ROLE OF BIG DATA IN TRAFFIC SAFETY ANALYSIS

In previous years, detailed driving data and crash data were typically not available; as a result, researchers addressed this limitation by examining the factors that affect the number of crashes on a roadway entity (usually segments or intersections) over a specified time period, which led to the traditional crash frequency studies. With the advent of Big Data, though, major changes have been triggered in the field of traffic safety. In the recent decade, and through technological advancements, researchers have shifted their attention to dynamic traffic data. To capture the dynamic processes between safety and its contributing factors, Big Data generated from ITS are leveraged to develop real-time analyses and offer new perspectives in safety analysis, such as the ability to match the exact traffic conditions with crash cases. However, other than traffic data

Big Data and Road Safety Chapter

12

299

through ITS applications, another field of interest is human factors; it is important to gain a deep understanding in driver behavior to be able to develop crash avoidance mechanisms. In order to harness the power of Big Data in the traffic safety arena therefore, it is imperative to take full advantage of its real-time nature and the vast information contained therein. This section will begin with a literature review of real-time crash prediction and will continue with presentation of studies investigating driving behavior.

2.1 Real-Time Crash Prediction

Real-time safety analysis relies on traffic surveillance systems, and its objective is the identification of crash likelihood based on very short time period traffic and weather information. The main assumption in this type of analysis is that crashes are attributed to factors appearing shortly before the occurrence of the crash; hence, real-time information from reliable surveillance systems can help capture the micro-level influences of these factors. Real-time analysis is mainly implemented in traffic management, such as adaptive signal control or ramp metering control. Since the 1990s, there has been extensive research linking crashes to traffic characteristics, weather, and geometric factors, and now, due to the real-time nature of Big Data in the transportation area, proactive traffic management for better system performance is made possible. Existing traffic safety studies fall into two groups: (i) aggregate studies, characterized by crash frequency modeling, representing counts of crashes using highly aggregated traffic data such as ADT (average daily traffic); and (ii) disaggregate studies, characterized by real-time safety analysis employing real-time traffic data such as traffic variables in 1-min intervals. The former group of studies is employed to study the factors that affect the number of crashes occurring in a specified network section (e.g., an intersection) over a defined time period (e.g., a year), resulting in crash-frequency data and therefore suggesting the application of count-data regression methods, such as Poisson, negative binomial, Poisson-lognormal, etc. (Lord and Mannering, 2010).
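To make the aggregate formulation concrete, the sketch below fits a simple Poisson crash-frequency model, a minimal safety performance function, by iteratively reweighted least squares on synthetic segment data. All variable names, coefficients, and data are hypothetical; a real analysis would typically use a statistical package and test a negative binomial specification to handle overdispersion.

```python
import numpy as np

# Synthetic crash-frequency data: hypothetical freeway segments with
# log(ADT) and segment length as the only explanatory variables.
rng = np.random.default_rng(42)
n = 500
log_adt = rng.normal(10.0, 0.5, n)      # log of average daily traffic
length_km = rng.uniform(0.5, 3.0, n)    # segment length in km
X = np.column_stack([np.ones(n), log_adt - 10.0, length_km])
beta_true = np.array([0.2, 0.6, 0.3])   # illustrative coefficients
y = rng.poisson(np.exp(X @ beta_true))  # observed crash counts per segment

# Poisson regression via iteratively reweighted least squares (IRLS):
# repeat beta <- beta + (X' W X)^-1 X' (y - mu), with W = diag(mu).
beta = np.zeros(3)
for _ in range(50):
    mu = np.exp(X @ beta)
    step = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
    beta += step
    if np.max(np.abs(step)) < 1e-8:
        break

print("estimated coefficients:", beta.round(2))
```

With enough segments, the recovered coefficients approach the generating values, which is exactly how an SPF links exposure variables such as ADT and segment length to expected crash counts.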
PART II Applications

Through such analysis, Safety Performance Functions (SPFs) have been developed in order to identify crash contributing factors, such as geometric design characteristics, and the results can be used in network screening, hot spot identification, Crash Modification Factor (CMF) estimation, and countermeasure comparisons, such as improvements of geometric characteristics. However, a well-known issue of crash-frequency models is the averaging of time-varying explanatory variables, which ignores potentially important within-period variations. On the other hand, disaggregate or real-time studies analyze crash likelihood under short-term conditions by utilizing dynamic traffic and weather data. Disaggregate studies were made possible by the recent enhancements in the capabilities of collecting, storing, and analyzing real-time traffic data through ITS applications. With the focus on Big Data, this section reviews relevant papers to catalog the research progress made in real-time safety analysis.

As stated, real-time analysis is heavily dependent on data, and as such this type of analysis takes full advantage of traffic detection technologies. In general, traffic detection technologies fall into three categories: intrusive detectors (in-roadway), nonintrusive detectors (above roadway), and off-roadway technologies (Martin, 2003). Intrusive sensors are those embedded in the pavement or attached to the surface of the roadway (e.g., inductive loop detectors), whereas nonintrusive detectors are usually installed above or on the sides of the roadway without disrupting traffic flow (e.g., microwave radars). Off-roadway technologies refer to probe vehicles, including Global Positioning System (GPS) devices and cellular phones, as well as remote sensing, which could, for example, use satellite images to extract traffic information. Traditionally, crash prediction models have been developed utilizing loop detector data, but more recently radar data as well as Automatic Vehicle Identification (AVI) data have also been used. This section focuses on reviewing real-time safety studies utilizing traffic and weather data; geometric factors have also been explored in the literature but are not reviewed here.

A literature search for relevant studies published from 2001 to 2018 in international journals was conducted using several online databases, with "real-time crash prediction," "real-time crash evaluation," "disaggregate safety study," and "big data traffic safety" used as search terms. Studies that used traffic variables aggregated in intervals larger than 6 min (i.e., hourly crash studies) are excluded from this review. After controlling for these factors, 48 studies were identified, as presented in Table 1.
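The interval-based traffic variables that these studies rely on (5-min means, standard deviations, and coefficients of variation of speed, volume, and occupancy) are derived by aggregating raw detector records. A minimal sketch of that preprocessing step, on a hypothetical 20-s detector feed with made-up column names and values, could look as follows:

```python
import numpy as np
import pandas as pd

# Hypothetical raw detector feed: one record every 20 s for 30 min.
rng = np.random.default_rng(7)
idx = pd.date_range("2018-03-01 08:00", periods=90, freq="20s")
raw = pd.DataFrame(
    {
        "speed_kph": rng.normal(95, 8, len(idx)),  # spot speeds
        "volume": rng.poisson(12, len(idx)),       # vehicles per 20 s
        "occupancy": rng.uniform(0.05, 0.25, len(idx)),
    },
    index=idx,
)

# Aggregate to the 5-min interval features used by most disaggregate
# studies: mean, standard deviation, and coefficient of variation.
agg = raw.resample("5min").agg(
    speed_mean=("speed_kph", "mean"),
    speed_std=("speed_kph", "std"),
    volume_sum=("volume", "sum"),
    occ_mean=("occupancy", "mean"),
)
agg["speed_cv"] = agg["speed_std"] / agg["speed_mean"]

print(agg.round(2))
```

Each resulting row is one 5-min observation that can then be labeled as a pre-crash or non-crash interval; choosing the aggregation window (20 s vs. 5 min vs. 1 h) is precisely the design decision that separates the study groups reviewed here.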
It should be mentioned that the literature presented in the current chapter is not exhaustive but rather exemplary, in order to provide a wide view of the topics being researched within real-time crash prediction. For consistency, the following information was collected for each study:

i. Study title.
ii. Authors and publication year.
iii. Study scope, indicating the focus of the study. The studies were grouped as: (i) traffic, if the scope was the identification of traffic characteristics contributing to crash occurrence; (ii) weather, if the aim was the identification of the influence of weather conditions on crash occurrence; (iii) models, for studies investigating new modeling approaches; (iv) other, for studies which did not fall under the previous three groups.
iv. Methodological approach, discriminating between statistical and data mining techniques.
v. Traffic variables, capturing information on the traffic variables used and their aggregation level.
vi. Data, indicating the data collection sources.
vii. Key findings, summarizing the most important results of the study.
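Most of the disaggregate studies cataloged in Table 1 reduce to the same core formulation: a binary outcome (pre-crash vs. non-crash interval) modeled on short-interval traffic variables. The sketch below fits such a model by plain gradient descent on synthetic data in which pre-crash intervals are assumed to show higher speed variance and slightly lower mean speed; all numbers and variable names are illustrative and are not taken from any of the reviewed studies.

```python
import numpy as np

# Synthetic labeled data: 5-min intervals, y = 1 if a crash followed.
# Assumed (illustrative) pattern: pre-crash intervals have higher speed
# standard deviation and slightly lower mean speed.
rng = np.random.default_rng(0)
n_nc, n_c = 400, 100
speed_std = np.concatenate([rng.normal(3, 1, n_nc), rng.normal(6, 1, n_c)])
speed_mean = np.concatenate([rng.normal(100, 5, n_nc), rng.normal(92, 5, n_c)])
y = np.concatenate([np.zeros(n_nc), np.ones(n_c)])

# Standardize the features and add an intercept column.
F = np.column_stack([speed_std, speed_mean])
F = (F - F.mean(axis=0)) / F.std(axis=0)
X = np.column_stack([np.ones(len(y)), F])

# Logistic regression by batch gradient descent.
w = np.zeros(X.shape[1])
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)

pred = (1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(float)
accuracy = (pred == y).mean()
print(f"in-sample accuracy: {accuracy:.2f}")
print("weights (intercept, speed_std, speed_mean):", w.round(2))
```

The sign of the fitted weights carries the safety interpretation emphasized throughout the reviewed literature: a positive weight on speed variability means higher variance raises estimated crash odds. The published studies refine this basic setup with matched case-control sampling, Bayesian estimation, or machine learning classifiers.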

TABLE 1 Real-Time Crash Prediction Studies

Traffic

Real-Time Estimation of freeway accident likelihood (Oh et al., 2001)
Scope: Demonstration of the potential capability of identifying traffic conditions that lead to crashes from real-time data.
Method: Nonparametric Bayesian model.
Traffic variables: 5-min: vehicle count, speed, occupancy.
Data: Crash data (from probe vehicles), loop detector data.
Key findings: Reducing speed variation is advantageous in reducing crash likelihood.

Analysis of crash precursors on instrumented freeways (Lee et al., 2002)
Scope: Exploration of factors contributing to changes in crash rate for individual vehicles travelling over an urban freeway.
Method: Log-linear model.
Traffic variables: 20-s: vehicle count, occupancy, speed.
Data: Crash data, loop detector data.
Key findings: Crashes are more likely to occur when the variation of speed and the variation of speed difference across lanes increase, when density increases, during peak hours, in road sections with more frequent lane changes, in normal weather conditions, and as exposure increases.

Freeway safety as a function of traffic flow (Golob et al., 2004)
Scope: Development of a classification scheme by which traffic flow conditions on an urban freeway can be classified into mutually exclusive clusters that differ as much as possible in terms of likelihood of crash by type.
Method: Nonlinear (nonparametric) canonical correlation analysis (NLCCA).
Traffic variables: 30-s: vehicle count, occupancy.
Data: Crash data, loop detector data.
Key findings: Collision type is the best explained characteristic and is related to the median speed and to left-lane and interior-lane variations in speed.

Real-time estimation of accident likelihood for safety enhancement (Oh et al., 2005)
Scope: Identification of traffic conditions that might lead to a traffic accident from real-time traffic data.
Method: Bayesian model.
Traffic variables: 5-min: vehicle count, occupancy, speed.
Data: Crash data, loop detector data.
Key findings: Speed variation is the most contributing factor differentiating between pre-crash and non-crash conditions.

Identifying crash propensity using specific traffic speed conditions (Abdel-Aty and Pande, 2005)
Scope: Identification of patterns in freeway loop detector data that potentially precede traffic crashes.
Method: Probabilistic neural network.
Traffic variables: 5-min: vehicle count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: Lower speeds associated with high variance are rear-end crash-prone driving conditions.

Crash risk assessment using intelligent transportation systems data and real-time intervention strategies to improve safety on freeways (Abdel-Aty et al., 2007b)
Scope: Presentation of a comprehensive overview of the novel idea of real-time safety improvement on freeways.
Method: Logistic regression.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of vehicle count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: Higher speed upstream of a ramp and lower speed downstream increase the probability of crashes on off-ramps; lower speed upstream of a ramp and higher speed downstream increase the probability of crashes on on-ramps; lower ramp volume contributes more to crash occurrence on off-ramps, whereas higher ramp volume contributes more to crash occurrence on on-ramps.

Impact of traffic oscillations on freeway crash occurrences (Zheng et al., 2010)
Scope: Understanding of the impact of traffic oscillations on the likelihood of freeway crash occurrences by analyzing high-resolution traffic data on a freeway segment.
Method: Conditional logistic regression.
Traffic variables: 20-s: vehicle count, occupancy, speed.
Data: Crash data, loop detector data.
Key findings: Standard deviation of speed is a significant variable in crash likelihood; the likelihood of a rear-end crash increases by about 8% with an additional unit increase in the standard deviation of speed.

Evaluation of the impacts of traffic states on crash risk on freeways (Xu et al., 2012)
Scope: Division of traffic flow into different states and evaluation of the safety performance associated with each state.
Method: Conditional logistic regression models.
Traffic variables: 5-min: average and std. dev. of vehicle count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: Each traffic state can be assigned a certain safety level; the impacts of traffic flow parameters on crash risk differ across traffic states.

Evaluation of the impacts of speed variation on freeway traffic collision in various traffic states (Li et al., 2013)
Scope: Evaluation of the impacts of speed variation on the likelihood of traffic collision in various types of traffic states in a freeway section.
Method: Logistic regression model.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of speed.
Data: Crash data, loop detector data.
Key findings: Standard deviation and coefficient of variation of speed are significantly related to crashes; in free flow, the coefficient of variation of speed is positively related to crashes; in congested traffic and at the back of a queue, the standard deviation and coefficient of variation of speed have a positive impact on crash likelihood.

Predicting crash likelihood and severity on freeways with real-time loop detector data (Xu et al., 2013a)
Scope: Development of a model that predicts crash likelihood at different levels of severity, with a particular focus on severe crashes.
Method: Binary logit model.
Traffic variables: 5-min: vehicle count, occupancy, speed; std. dev. of occupancy and mean speed; average absolute difference of occupancy and speed.
Data: Crash data, loop detector data, weather data, geometric data.
Key findings: Traffic flow characteristics contributing to crash likelihood differ at different severity levels; PDO crashes are more likely to occur under congested traffic flow conditions with highly variable speed and frequent lane changes; fatal and incapacitating injury crashes and nonincapacitating crashes are more likely under less congested conditions; high speed coupled with large speed differences between adjacent lanes increases the likelihood of severe crashes.

A dynamic Bayesian network model for real-time crash prediction using traffic speed conditions data (Sun and Sun, 2015)
Scope: Investigation of the relationship of traffic flow characteristics and crash risk on urban expressways.
Method: Dynamic Bayesian network.
Traffic variables: 5-min: vehicle count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: Traffic state variables and speed condition data are advantageous for crash prediction.

Real-time crash prediction for expressway weaving segments (Wang et al., 2015)
Scope: Real-time safety analysis on highway weaving segments.
Method: Multilevel Bayesian logistic regression.
Traffic variables: 5-min: vehicle count, speed, lane occupancy, std. dev. of speed, weaving volume ratio, speed difference between beginning and end of weaving segment.
Data: Crash data, microwave vehicle detection system (MVDS) data, geometric data, weather data.
Key findings: The mainline speed at the beginning of the segment, the speed difference between the beginning and the end of the segment, and the logarithm of volume have a significant impact on crash risk; weaving segments with no need for lane changes present higher crash risk; a wet pavement surface increases crash ratio.

Big Data applications in real-time traffic operation and safety monitoring and improvement of urban expressways (Shi and Abdel-Aty, 2015)
Scope: Exploration of the viability of a proactive real-time traffic monitoring strategy evaluating operation and safety simultaneously, to improve the system performance of urban expressways by reducing congestion and crash risk.
Method: Random forest, Bayesian logit model.
Traffic variables: Logarithm of vehicle count, truck percentage, average, std. dev., and logarithm of coefficient of variation of speed, speed difference between lanes, congestion index.
Data: Crash data, MVDS data.
Key findings: Congestion has a significant effect on rear-end crash likelihood.

Development of a real-time crash risk prediction model incorporating the various crash mechanisms across different traffic states (Xu et al., 2015)
Scope: Identification of traffic flow variables contributing to crash risk under different traffic states and development of a real-time crash risk model incorporating the varying crash mechanisms across states.
Method: NLCCA.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of vehicle count, speed, and occupancy.
Data: Crash data, loop detector data, geometric data.
Key findings: Traffic states significantly affect crash likelihood, collision type, and injury severity.

Impact of real-time traffic characteristics on crash occurrence: preliminary results of the case of rare events (Theofilatos et al., 2018)
Scope: Investigation of crash occurrence on motorways by utilizing real-time traffic data when the number of crashes is low.
Method: Bias correction method, penalized maximum likelihood estimation, exact logistic regression.
Traffic variables: 1-h: flow, occupancy, mean time speed, truck proportion.
Data: Crash data, loop detector data.
Key findings: Lower speeds are positively associated with crash risk.

Analysis of real-time crash risk for expressway ramps using traffic, geometric, trip generation and sociodemographic predictors (Wang et al., 2017b)
Scope: Exploration of the impact of sociodemographic and trip generation parameters on real-time crash risk.
Method: Bayesian logistic regression and support vector machine.
Traffic variables: 5-min: vehicle count, average speed, speed standard deviation, average lane occupancy.
Data: Crash data, MVDS data, geometric data (RCI and ArcGIS), trip generation and sociodemographic data (SWTAZs).
Key findings: The logarithm of vehicle count, speed, and percentage of home-based-work production have a positive impact on crash risk; off-ramps and nondiamond ramps experience higher crash potential.

Real-time crash prediction in an urban expressway using disaggregated data (Basso et al., 2018)
Scope: Study of precursors of crashes on an urban expressway using online data to create a real-time accident prediction model.
Method: Random forest, support vector machine, logistic regression.
Traffic variables: 5-min: vehicle count, speed, std. dev. of speed, density, density change.
Data: Crash data, automatic vehicle identification (AVI) data.
Key findings: Crash probability is higher when density drops upstream with ensuing high speeds and when there are unusually low speeds downstream.

Weather

Real-time prediction of visibility-related crashes (Abdel-Aty et al., 2012)
Scope: Examination of the possibility of using AVI data to predict visibility-related crashes.
Method: Bayesian matched case-control logistic regression.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of speed.
Data: Crash data, loop detector data, AVI data, weather data.
Key findings: The average speed at the nearest downstream station, along with the coefficient of variation of speed at the nearest upstream station 5-10 min prior to the crash, have a significant effect on visibility-related (VR) crashes; loop detector data are possibly better than AVI data for predicting VR crashes on expressways.

Bayesian random effect models incorporating real-time weather and traffic data to investigate mountainous freeway hazardous factors (Yu et al., 2013)
Scope: Exploration of real-time weather and traffic data in crash frequency.
Method: Poisson model, random effect model with Bayesian inference.
Traffic variables: 5-min: vehicle count, speed, occupancy.
Data: Crash data, remote traffic microwave sensor (RTMS) radar data, weather data, geometric data.
Key findings: Real-time weather and traffic data are essential to crash frequency models; single-vehicle crashes are more related to weather conditions and vehicle speeds, while more traffic variables are statistically significant in multi-vehicle crashes.

Identifying crash-prone traffic conditions under different weather on freeways (Xu et al., 2013b)
Scope: Development of separate crash risk prediction models for different weather conditions.
Method: Bayesian random intercept logistic regression models.
Traffic variables: 5-min: vehicle count, occupancy, speed.
Data: Crash data, loop detector data, weather data.
Key findings: Traffic flow characteristics contributing to crash risk differ across weather conditions; the speed difference between upstream and downstream stations had the largest impact on crash risk in reduced-visibility weather.

Predicting reduced visibility-related crashes on freeways using real-time traffic flow data (Hassan and Abdel-Aty, 2013)
Scope: Investigation of whether real-time traffic flow data collected from loop detectors and radar sensors on freeways can be used to predict crashes occurring under reduced-visibility conditions.
Method: Random forest, logistic regression models.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of vehicle count, speed, and occupancy.
Data: Crash data, loop detector data, radar sensors.
Key findings: Higher occupancy rates downstream 10-15 min prior to the crash, coupled with an increase of the average speed downstream and upstream 5-10 min prior to the crash, increase the likelihood of a visibility-related crash; traffic flow variables leading to visibility-related crashes are slightly different than those leading to clear-visibility crashes.

Analyzing crash injury severity for a mountainous freeway incorporating real-time traffic and weather data (Yu and Abdel-Aty, 2014b)
Scope: Development of crash injury severity analysis models for a mountainous freeway section.
Method: Random forest, fixed parameter logit model, support vector machine (SVM), random parameter logit model.
Traffic variables: 6-min: average, std. dev., and coefficient of variation of speed.
Data: Crash data, AVI data, weather data, geometric data.
Key findings: Large speed variations increase the likelihood of severe crashes; severe crashes are less likely to happen during the snow season; the presence of steep grades could lead to more severe crashes; low temperature would increase the probability of severe crashes; SVM and the random parameter logit model outperform the fixed parameter logit model.

Real-time assessment of fog-related crashes using airport weather data: a feasibility analysis (Ahmed et al., 2014)
Scope: Examination of the viability of using airport weather information in real-time road crash risk assessment at locations with recurrent fog problems.
Method: Bayesian logistic regression.
Traffic variables: N/A.
Data: Crash data, weather data (from airports).
Key findings: The reduction in visibility reported by airports' weather stations is associated with crash occurrence.

Crash risk analysis during fog conditions using real-time traffic data (Wu et al., 2017)
Scope: Investigation of traffic flow changes and crash risk under fog conditions.
Method: Binary logistic regression model.
Traffic variables: 5-min: vehicle count, average speed, average occupancy.
Data: Weather data, loop detector data, microwave vehicle detection data.
Key findings: The average 5-min speed and the average 5-min volume are prone to decrease during fog; crash risk can increase under fog conditions, and the risk is more likely to increase near ramp areas.

Models

Calibrating a real-time traffic crash prediction model using archived weather and ITS traffic data (Abdel-Aty and Pemmanaboina, 2006)
Scope: Development of a crash-likelihood prediction model using real-time traffic-flow variables and rain data potentially associated with crash occurrence.
Method: Principal component analysis (PCA), logistic regression, logit model.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of vehicle count, occupancy, speed.
Data: Crash data, loop detector data, weather data.
Key findings: The 5-min average occupancy and std. dev. of volume at the downstream station, and the 5-min coefficient of variation of speed at the station closest to the crash during the 5-10 min prior to crash occurrence, along with the rain index, are found to affect crash occurrence.

A Bayesian network based framework for real-time crash prediction on the basic freeway segments of urban expressways (Hossain and Muromachi, 2012)
Scope: Development of a real-time prediction model for urban expressways.
Method: Bayesian belief network.
Traffic variables: 5-min: cumulative vehicle count, heavy-vehicle count, average speed, average occupancy.
Data: Crash data, loop detector data.
Key findings: A detector placed approximately 250 m downstream of the centroid of the section under consideration can capture the abnormality with the highest precision.

Bayesian updating approach for real-time safety evaluation with AVI data (Ahmed et al., 2012)
Scope: Presentation of a Bayesian updating framework to identify real-time traffic conditions prone to cause crashes using expressway AVI data.
Method: Linear logistic regression with Bayesian updating.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of speed.
Data: Crash data, AVI data.
Key findings: The hazard ratio for the standard deviation of speed of the crash segment in the 5-10 min before the crash was found to be more than twice as large for the rear-end crash model as for the overall crash model; an increase in the variation of speed at any given segment, coupled with a decrease in the average speed in the downstream segment, may result in a rear-end crash more than any other type of crash.

A real-time crash prediction model for the ramp vicinities and urban expressways (Hossain and Muromachi, 2013)
Scope: Identification of dynamically forming hazardous traffic conditions near ramp vicinities with high-resolution real-time traffic flow data.
Method: Random multinomial logit, Bayesian belief net.
Traffic variables: 5-min: vehicle count, heavy-vehicle count, average speed, occupancy, congestion index.
Data: Crash data, loop detector data.
Key findings: The newly developed BBN models could predict 50%, 42%, 43%, and 55% of all crashes in the evaluation dataset for downstream of entrance ramps, downstream of exit ramps, upstream of entrance ramps, and upstream of exit ramps, respectively.

Utilizing support vector machine in real-time crash risk evaluation (Yu and Abdel-Aty, 2013b)
Scope: Evaluation of real-time crash risk using the statistical learning model of support vector machine.
Method: Support vector machine (SVM), Bayesian logistic regression.
Traffic variables: 5-min: vehicle count, speed, occupancy.
Data: Crash data, RTMS radar data.
Key findings: A smaller sample size could enhance the SVM model's classification accuracy; a variable selection procedure is needed prior to SVM model estimation; explanatory variables have identical effects on crash occurrence for the SVM and logistic regression models.

A data fusion framework for real-time risk assessment on freeways (Ahmed and Abdel-Aty, 2013)
Scope: Proposition of a framework to fuse traffic data from multiple sources with weather and geometry data using an advanced machine learning (ML) technique.
Method: Stochastic gradient boosting (SGB).
Traffic variables: 6-min: average, std. dev., and logarithm of coefficient of variation of space mean speed, time mean speed, vehicle count, occupancy.
Data: Crash data, geometric data, AVI data, RTMS data.
Key findings: Crash prediction from AVI data is comparably equivalent to RTMS data; augmenting information from multiple traffic detectors, weather, and geometry provides the best results in crash prediction.

A novel visible network approach for freeway crash analysis (Wu et al., 2013)
Scope: Development of a visible network model and its application to freeway crash data analysis via mapping real-time freeway crash data to the network.
Method: Visible network model.
Traffic variables: 5-min: average, std. dev., and logarithm of coefficient of variation of speed, vehicle count, occupancy.
Data: Crash data, RTMS radar data, weather data, geometric data.
Key findings: Through the visible network, traffic states with a higher crash occurrence probability can be identified; low traffic flow conditions with bad weather and low downstream average speed contribute to crash occurrence; bad weather conditions together with high average occupancy and stop-and-go driving conditions increase the probability of multivehicle crashes.

Transferability and robustness of real-time freeway crash risk assessment (Shew et al., 2013)
Scope: Evaluation of the transferability of a predictive model developed on one freeway segment to other similar facilities.
Method: Classification tree, MLP neural network.
Traffic variables: 5-min: vehicle count, lane occupancy.
Data: Crash data, loop detector data, vehicle detection station (VDS) data.
Key findings: Models for most locations may be transferable from one freeway to another, but some locations on the same freeway may require additional training.

Using the Bayesian updating approach to improve the spatial and temporal transferability of real-time crash risk prediction models (Xu et al., 2014)
Scope: Improvement of the spatial and temporal transferability of real-time crash risk prediction models using Bayesian updating.
Method: Bayesian logistic regression.
Traffic variables: 5-min: vehicle count, speed, occupancy; std. dev. of count, speed, occupancy; coefficient of variation of count, speed, occupancy; difference between adjacent lanes of count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: The transferability results suggest that crash risk models cannot be directly transferred across time and space; the Bayesian updating approach is effective in improving spatial and temporal transferability.

Utilizing the eigenvectors of freeway loop data spatiotemporal schematic for real time crash prediction (Fang et al., 2016)
Scope: Proposition of a new method to model real-time crash likelihood based on loop detector data through schematic eigenvectors.
Method: Eigenvectors.
Traffic variables: 5-min and 2-min: vehicle count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: Eigenvectors and eigenvalues can significantly impact crash likelihood; 5-min slice data represent traffic flow variation in the period before the crash better than 2-min data; speed variation in the section near the crash is one of the major crash risks.

Safety analytics for integrating crash frequency and real-time risk modeling for expressways (Wang et al., 2017a)
Scope: Integration of crash frequency and real-time safety analysis using expressway data.
Method: Poisson lognormal model, Bayesian logistic regression.
Traffic variables: 5-min: vehicle count, speed, lane occupancy; std. dev. of speed and lane occupancy; truck percentage.
Data: Crash data, MVDS data, geometric data.
Key findings: The logarithm of vehicle count has a positive impact on crash likelihood; the average speed is negatively related to crash likelihood; the standard deviations of speed and lane occupancy are positively related to crash likelihood.

Predicting real-time crash risk for urban expressways in China (Liu and Chen, 2017)
Scope: Division of freeway traffic flow into different states and evaluation of the safety performance associated with each state.
Method: Decision tree, adaptive neural network fuzzy inference system (ANFIS).
Traffic variables: 5-min: average, standard deviation, and coefficient of variation of vehicle count and speed.
Data: Crash data, microwave detector data, video detector data.
Key findings: The proposed model has smaller training and testing errors compared to logistic regression, decision tree, and SVM.

Analysis and comparison of safety models using average daily, average hourly and microscopic traffic (Wang et al., 2018)
Scope: Evaluation of the performance of the three types of safety analysis (daily, hourly, microscopic traffic) in identifying important crash contributing factors, and comparison of their abilities to use the identified significant variables to predict traffic safety conditions at different time intervals.
Method: Bayesian Poisson-lognormal, Bayesian logistic regression.
Traffic variables: 5-min: vehicle count, speed, occupancy; std. dev. of speed and occupancy; truck percentage.
Data: Crash data, MVDS data, geometric data.
Key findings: All models showed that the log of volume, standard deviation of speed, log of segment length, and existence of a diverge segment are positively related to crash occurrence; for almost all cases, the three safety studies are transferable to obtain the safety conditions at other time intervals.

Other

ATMS implementation system for identifying traffic conditions leading to potential crashes (Abdel-Aty and Pande, 2006)
Scope: Development of a strategy to identify crash-prone conditions on freeways in real time.
Method: Logistic regression.
Traffic variables: 5-min: average, std. dev., and coefficient of variation of vehicle count, speed, occupancy.
Data: Crash data, loop detector data.
Key findings: The 5-min coefficient of variation of speed at the loop detector station closest to the crash location affects crash occurrence most significantly; the 5-min average occupancy and 5-min std. dev. of volume are also significant factors in crash occurrence.

Considering various ALINEA ramp metering strategies for crash risk mitigation on freeways under congested regime (Abdel-Aty et al., 2007a)
Scope: Proposition of introducing ramp metering to produce a significant decrease in the risk of crashes on the freeway.
Method: Split models (for crash risk model).
Traffic variables: 5-min: speed, vehicle count, occupancy.
Data: Crash data, loop detector data.
Key findings: Ramp metering helps in reducing the risk of crashes by decreasing variances in speed, increasing the overall average speeds, and decreasing the average occupancy on the mainline.

The viability of using AVI data for real-time crash prediction (Ahmed and Abdel-Aty, 2012)
Scope: Identification of freeway locations with high crash potential using real-time speed data collected from AVI.
Method: Random forest, linear logistic regression model.
Traffic variables: 5-min: speed, std. dev. of speed, coefficient of variation of speed.
Data: Crash data, AVI data.
Key findings: AVI data were found to be promising in providing a measure of crash risk in real time; the length of the AVI segment was found to be a crucial factor that affects the usefulness of the AVI data.

Multilevel Bayesian analyses for single- and multi-vehicle freeway crashes (Yu and Abdel-Aty, 2013a)
Scope: Presentation of aggregate and disaggregate analyses for single-vehicle and multi-vehicle crashes.
Method: Multilevel Bayesian logistic regression.
Traffic variables: 5-min: vehicle count, speed, occupancy.
Data: Crash data, RTMS radar data, weather data, geometric data.
Key findings: For multi-vehicle crashes, average speed and standard deviation of occupancy along with visibility conditions are

… on a mountainous freeway (Yu and Abdel-Aty, 2014c)
Scope: Introduction of real-time weather and traffic data to analyze crash injury severity.
Method: Binary probit models, hierarchical Bayesian binary probit models.
Traffic variables: 6-min: average, std. dev., and coefficient of variation of speed.

Crash data, AVI data, weather data, geometric data

- Large variations of speed prior to the crash occurrence would increase the likelihood of a severe crash

Continued

Big Data and Road Safety Chapter

Using hierarchical Bayesian binary probit models to analyze crash injury severity on high speed facilities with real-time traffic data

significant, whereas for single-vehicle crashes average speed, sum volume, and standard deviation of occupancy are significant - Bayesian random parameters models are capable of accounting for seasonal variation effects

12

321

Authors

Study Scope

Method

Traffic Variables

Data

Key Findings

An optimal variable speed limits system to ameliorate traffic safety risk

(Yu and AbdelAty, 2014a)

The investigation of the feasibility of utilizing a variable speed limits system, one key part of Active Traffic Management, to improve traffic safety on freeways

Logistic regression (for crash risk model)

5-min: speed, density, vehicle count

Crash data, RTMS radar data

- VSL would effectively improve safety under high and moderate compliance levels

Real-time estimation of secondary crash likelihood on freeways using high-resolution loop detector data

(Xu et al., 2016)

The development of a secondary crash risk prediction model on freeways using realtime traffic flow data

Bayesian random effect logit model

5-min: vehicle count, speed, occupancy, std. dev.: count, speed, occupancy coefficient of variation: count, speed, occupancy, difference between adjacent lanes: count, speed, occupancy

Crash data, loop detector data

- The occurrence of a primary crash leads to turbulent conditions, which propagates upstream. The upstream drivers may be involved in a secondary crash of they travel with high speed and short following distance when met with such turbulent traffic conditions.

II Applications

Study Title

322 PART

TABLE 1 Real-Time Crash Prediction Studies—cont’d

(Park et al., 2018a, b)

The sequential prediction of secondary incidents

SGB decision trees, Bayesian neural networks

N/A

Crash data, traffic message channel (TMC) data

- Unexpected traffic congestion incurred by an incident is a dominant causative factor for the causation of secondary incidents at different stages of incidence clearance

Assessing the impact of reduced visibility on traffic crash risk using microscopic data and surrogate safety measures

(Peng et al., 2017)

The investigation of change of traffic risk in foggy conditions using real-time vehicle based traffic and weather data

Log-inverse Gaussian regression model

Individual speed, length, headway

Conflict data, RTMS data, weather data

- Reduced visibility would significantly increase the traffic crash risk especially rear-end crashes - The impact on crash risk is different for different vehicle types and different lanes

Assessing rearend crash potential in urban locations based on

(Dimitriou et al., 2018)

To assess rear-end crash potential at a microscopic level in an urban environment by

Multinomial logit model

- Individual speed, length, headway

Conflict data, loop detector data, geometric data

- Rear-end crash potential is higher when traffic flow and speed standard deviation are higher

- 5-min: vehicle count,

12

Continued

Big Data and Road Safety Chapter

Real-time prediction and avoidance of secondary crashes under unexpected traffic congestion

323

Study Title

Authors

Analysis of rearend conflicts in urban networks using Bayesian networks

Method

investigating vehicleby-vehicle interactions

(Stylianou and Dimitriou, 2018)

The analysis of rearend conflict likelihood in an urban network using the Time to Collision (TTC) indicator as a conflict indicator, by estimating a Bayesian Network (BN)

Traffic Variables

Data

speed, std. dev. speed

Bayesian networks

- Individual speed, speed difference, headway, length - 5-min: vehicle count, speed, std. dev. and coeff. of variation of speed

Key Findings - Speeds are lower and headways are higher when Heavy Goods Vehicles (HGV) lead

Conflict data, loop detector data, geometric data, weather data

- Rear-end conflict likelihood is increased when involved vehicles are of different type, when the following vehicle’s speed is higher than the leading’s, when individual speed is high, when individual headway is small, with higher coefficient of variation of speed, when type of intersection is priority, when carriageway is dual and when it is rainy

II Applications

vehicle-byvehicle interactions, geometric characteristics and operational conditions

Study Scope

324 PART

TABLE 1 Real-Time Crash Prediction Studies—cont’d

Big Data and Road Safety Chapter

12

325

A great effort has been devoted in the literature to analyzing the interrelationships between crashes and traffic flow variables collected from traffic detection technologies, and researchers have systematically reported that certain traffic conditions are associated with crash likelihood. Concerning data collection, traffic data for real-time crash prediction are usually collected through loop detectors, but more recently other data sources such as Microwave Vehicle Detection System (MVDS) data (Shi and Abdel-Aty, 2015; Wang et al., 2015, 2017a, b, 2018; Wu et al., 2017), Remote Traffic Microwave Sensor (RTMS) data (Ahmed and Abdel-Aty, 2013; Wu et al., 2013; Yu et al., 2013; Yu and Abdel-Aty, 2013a, 2014a), and AVI data (Abdel-Aty et al., 2012; Ahmed et al., 2012; Ahmed and Abdel-Aty, 2012, 2013; Basso et al., 2018; Yu and Abdel-Aty, 2014b, 2014c) have been utilized. With regard to the methodological approach, there are two general approaches in real-time safety analysis: (i) statistical methods and (ii) data-mining or artificial intelligence techniques. The former includes simple or matched case-control logistic regression (Abdel-Aty et al., 2007b; Abdel-Aty and Pande, 2006; Hassan and Abdel-Aty, 2013; Li et al., 2013; Xu et al., 2012; Yu and Abdel-Aty, 2014a; Zheng et al., 2010), log-linear modeling (Lee et al., 2002), and Bayesian statistical models (Abdel-Aty et al., 2012; Ahmed et al., 2012, 2014; Oh et al., 2001, 2005; Wang et al., 2015, 2017a, 2018; Xu et al., 2013b, 2014; Yu and Abdel-Aty, 2013a, 2014b). The latter includes neural networks (Abdel-Aty and Pande, 2005), Bayesian networks (Hossain and Muromachi, 2012; Shew et al., 2013; Sun and Sun, 2015), Stochastic Gradient Boosting (SGB) (Ahmed and Abdel-Aty, 2013; Park et al., 2018a, b), classification trees (Shew et al., 2013), support vector machines (Yu and Abdel-Aty, 2013b), and random forests (Ahmed and Abdel-Aty, 2012).
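Most of the models above are fed with short-interval aggregates rather than raw detector readings. As an illustration (the data layout and readings below are hypothetical, not taken from any of the cited studies), the 5-min mean, standard deviation, and coefficient of variation of speed that recur as predictors throughout this literature can be computed as:

```python
import statistics

def five_min_aggregates(speeds):
    """Aggregate raw detector speed readings from one 5-min window into
    the features commonly used by real-time crash prediction models:
    mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(speeds)
    std = statistics.stdev(speeds)  # sample standard deviation
    cv = std / mean                 # coefficient of variation
    return {"mean": mean, "std": std, "cv": cv}

# Hypothetical 30-s speed readings (km/h) over one 5-min interval;
# the mix of ~90 km/h and ~60 km/h readings yields a high speed variance.
window = [92, 88, 95, 60, 55, 90, 93, 58, 61, 89]
print(five_min_aggregates(window))
```

The same three statistics would be computed analogously for vehicle count and occupancy, one set per detector station and time slice.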
Statistical approaches interpret the effects of traffic variables on crash likelihood in an understandable way, but these methods assume a linear relationship between the independent and dependent variables. On the other hand, data-mining approaches have high prediction accuracy but are usually criticized for their black-box-like process. More recently, researchers have opted to combine the two approaches by using data-mining techniques for significant variable identification and statistical approaches for variable effect interpretation (Basso et al., 2018; Hassan and Abdel-Aty, 2013; Shi and Abdel-Aty, 2015; Wang et al., 2017b; Yu and Abdel-Aty, 2014b). The effect of traffic characteristics on real-time safety was the first to gain considerable attention in the literature. Oh et al. (2001) were the first to establish a statistical link between real-time traffic conditions (normal and disruptive traffic) and crashes. Using a Bayesian model, the study concluded that reducing speed variation would be advantageous in reducing crash likelihood. Later, another study by the same authors reported that speed variation was the most important contributing factor in differentiating between pre-crash and noncrash conditions (Oh et al., 2005). The term crash "precursors" was introduced by Lee et al. (2002), referring to the characteristics observed prior to a


crash occurrence. By using a log-linear model, it was shown that the variation of speed and traffic density were significant predictors of crash frequency. Another detailed study by Golob et al. (2004) used nonlinear canonical correlation analysis (NLCCA) to analyze patterns in crash characteristics as a function of traffic flow. A classification scheme was developed by which traffic flow conditions were classified into mutually exclusive clusters that differed as much as possible in terms of crash likelihood by crash type, and it was concluded that collision type is related to median speed and the variation in speed between the left lane and the interior lane. Based on the analyses of the archived data used in these disaggregate studies, combined with historical crash records, several traffic flow measures have been identified as precursors of crashes, with the variation of speed being the most frequently reported. In the study of Zheng et al. (2010), it was reported that an increase of one unit in the standard deviation of speed would increase the likelihood of (rear-end) crashes by about 8%. Furthermore, in a meta-analysis study by Roshandel et al. (2015), it was reported that the summary effect of speed variation (from selected studies) was 1.226, indicating that an additional unit increase in speed variation increases the odds ratio of crash occurrence by 22.6%. Park and Ritchie (2004) developed a framework in which traffic safety was presented as a function of speed variance, which in turn was a function of vehicle heterogeneity and driver behavior. This study showed that lane-changing behavior as well as the presence of long vehicles within a freeway section significantly impacts the section speed variability. An innovative feature of this study was its use of state-of-the-art vehicle-signature-based traffic monitoring technology, which provided individual vehicle trajectories and vehicle classification.
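The odds-ratio arithmetic behind such reported effects follows directly from the logistic form: a one-unit increase in a predictor multiplies the crash odds by exp(β). A minimal sketch, using coefficients chosen to reproduce the cited percentages rather than the actual fitted values from those studies:

```python
import math

def odds_ratio_change(beta: float) -> float:
    """Percent change in crash odds for a one-unit increase in a
    predictor with logistic-regression coefficient beta:
    100 * (exp(beta) - 1)."""
    return (math.exp(beta) - 1.0) * 100.0

# Illustrative coefficients only: beta = ln(1.08) reproduces the ~8%
# figure reported by Zheng et al. (2010) for the std. dev. of speed,
# and beta = ln(1.226) reproduces the 22.6% summary effect from the
# Roshandel et al. (2015) meta-analysis.
print(round(odds_ratio_change(math.log(1.08)), 1))   # -> 8.0
print(round(odds_ratio_change(math.log(1.226)), 1))  # -> 22.6
```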
The literature has also suggested that the crash mechanism might not be the same under different traffic conditions. Xu et al. (2012) reported that each traffic state can be assigned a certain safety level, and that the effects of traffic flow parameters on crash risk differ among the traffic states. Another study (Li et al., 2013) also found that different variables were significant under free-flow conditions, congested traffic, and at the back of a queue. Similarly, Xu et al. (2013a) reported that the traffic flow characteristics that contribute to crash likelihood differ across levels of severity, and more specifically stated that Property Damage Only (PDO) crashes are more likely to occur under congested traffic, whereas fatal and incapacitating injury crashes are more likely to occur under less congested traffic. There are also numerous studies that have investigated weather factors in crash prediction models. Weather characteristics are usually obtained from records stored in meteorological stations close to the section under study, but recently data from airport weather stations have also been utilized (Ahmed et al., 2014). Weather data mainly refer to precipitation and visibility. Yu et al. (2013) reported that real-time weather and traffic data are essential inputs to crash frequency models. In their study, the results showed that single-vehicle crashes were more related to weather conditions. Hassan and Abdel-Aty (2013) found that the traffic variables leading to visibility-related crashes are


slightly different from those leading to clear-visibility crashes. When investigating crash severity, Yu and Abdel-Aty (2014b) showed that severe crashes were less likely to happen during the snow season. Furthermore, Wu et al. (2017) investigated crash risk under fog conditions and, by employing a binary logistic regression model, showed that crash risk can increase under fog. As effort in real-time crash research has grown, the area has expanded to investigating crash severity (Yu and Abdel-Aty, 2014c), secondary crashes (Park et al., 2018a, b; Xu et al., 2016), developing advanced traffic management methodologies to ameliorate crash risk (Abdel-Aty et al., 2007a; Yu and Abdel-Aty, 2014a), and using conflict-based analysis (Dimitriou et al., 2018; Peng et al., 2017; Stylianou and Dimitriou, 2018). The previously mentioned literature on real-time crash prediction indicates that considerable research has been conducted to develop real-time crash risk prediction models and to interpret the variables most significant in crash occurrence. Through this highly data-driven research, the relationship between traffic and weather characteristics and crashes has been revealed, which is crucial for developing proactive safety strategies that help ameliorate hazardous traffic patterns. The advancement of traffic data collection systems, as well as the vast data availability, has helped enhance the models and their transferability across time and space. Researchers have shifted from examining only one stretch of road with highly aggregated data to using microscopic data in larger-scale studies.

2.2 Driving Behavior

The identification of driving behavior represents another fundamental requirement in the traffic safety arena. In the past, due to the lack of information, priority was generally given to the identification of risk factors through epidemiological studies of crash causation, while failing to consider driving behavior. However, newer technologies have made it possible to also collect behavioral data. It became apparent to researchers that collecting real "naturalistic" data is an advantageous approach for obtaining the necessary human factors data; thus, Naturalistic Driving Studies (NDS) emerged in the early 2000s. NDS are large-scale studies in which volunteers drive their own vehicles, equipped with an unobtrusive Data Acquisition System (DAS), with the aim of continuously recording driving behavior (e.g., vision tracking), vehicle behavior (e.g., speed) and the interaction with other road users. This method, therefore, overcomes the problems associated with traditional data collection approaches, as it records information in all the situations a driver is involved in, and allows for the direct observation of the driver's behavior. The first large-scale NDS, called the "100-Car Naturalistic Study", was pioneered by the Virginia Tech Transportation Institute in the United States (Dingus et al., 2006a, b). This study served as a pilot in the USA for a much larger Naturalistic Driving Study under the


Second Strategic Highway Research Program, which was carried out later; Europe and Australia have since followed. The 100-Car Naturalistic Driving Study was the first instrumented-vehicle study undertaken aiming at the large-scale collection of naturalistic driving data. The research was initiated to provide a novel data set containing detail concerning driver performance, behavior, environment and other factors associated with critical incidents, near-crashes and crashes for 100 drivers across the period of one year. The data set includes approximately 2,000,000 vehicle miles, almost 43,000 h of data, 241 primary and secondary drivers, 12–13 months of data collection for each vehicle, and data from a highly capable instrumentation system including five channels of video and vehicle kinematic sensors, totaling 6.4 TB of data (Dingus et al., 2006a, b). A primary goal of the study was also the provision of vital exposure and pre-crash data necessary for understanding the causes of crashes. Altogether during the study, 82 crashes were documented, of which 29% were rear-end striking and 25% were rear-end struck crashes. Additionally, 761 near-crashes and over 8000 incidents were reported. The drivers were not given any special instructions, and the results indicated that they disregarded the presence of the instrumentation, as many cases of 'extreme' driving behavior were recorded, such as traffic violations, severe fatigue or aggressive driving. Following this naturalistic approach, it was possible to gather a wealth of information regarding pre-crash and crash events, therefore filling a void in the existing safety research methods. A novel and important finding from the study was the ability to capture low-severity, property-damage-only crashes, which are usually not reported to the police. Specifically, it was reported that for urban/suburban settings the total crash involvement may be as much as five times higher than police-reported crashes.
Moreover, inattention was found to be a contributing factor in 93% of conflict-with-lead-vehicle crashes and minor collisions, while the rate of inattention-related crash and near-crash events is as much as four times lower in older driver groups relative to novice drivers. Another important finding of the study was that the development of purely quantitative near-crash criteria is not feasible, demonstrated by the fact that vehicle kinematics associated with near-crashes were virtually identical to those of common driving situations that were not indicative of a crash, therefore implying that both qualitative and quantitative criteria are dependent upon one another (Dingus et al., 2006a, b). The Second Strategic Highway Research Program Naturalistic Driving Study (SHRP 2 NDS), completed in 2015 in the USA, is the largest of its kind. The data set comprises more than 2 PB of naturalistic driving data collected over a 3-year period from over 3500 participants at six site centers. This data set includes some 50 million miles travelled and well over one million hours of video. The goal of this effort was to collect and archive the largest store of naturalistic data ever attempted to date, taking advantage of advances in the collection, movement and storage of Big Data (Dingus et al., 2014). The vehicles were equipped with the same Data Acquisition System (DAS) which was


developed for the 100-Car Naturalistic Study, which collected four video views (forward roadway; driver's face and upper torso; driver interactions with the wheel and center stack; and the rear and right of the vehicle), vehicle network data (e.g., accelerometer, brake pedal activation, speed) and other information from additional sensors such as GPS and forward radar. The advantage of the SHRP 2 NDS over the 100-Car Naturalistic Driving Study was that the former included an order-of-magnitude larger sample size, thus allowing the sole use of crash events to determine the safety outcome for risk factor evaluation (Dingus et al., 2016). In 2012, the European Commission funded the first large-scale NDS in Europe, called the UDRIVE project (Eenink et al., 2014). The focus of the study was the identification of well-founded and tailored measures to improve road safety, and the identification of approaches for reducing harmful emissions and fuel consumption. The field trials included three types of vehicles (120 cars, 40 powered two-wheelers, and 50 trucks), seven countries (France, Germany, Poland, UK, Austria, Spain, and the Netherlands) and a study period of 21 months. In total, the project involved 290 participants. A common DAS was used for data collection, including 5 to 8 cameras (depending on the vehicle type), 1 smart camera for environmental recording, a CAN interface (depending on the vehicle type), 1 GPS/3G antenna, 1 accelerometer/gyroscope and 1 speed sensor (for motorcycles) (Eenink et al., 2014). The results of the UDRIVE project were used to draw comparisons among different European countries, and the data collected are considered a valuable resource facilitating research, traffic safety and eco-improvements for many years (Jonas et al., 2017). In April 2015, Australia also launched its first large-scale Naturalistic Driving Study, involving 360 participants in New South Wales and Victoria over a 4-month period.
The overall aim of the project was to use the NDS to understand what people do when driving their cars in normal and safety-critical situations (Regan et al., 2012). The data collected for the Australian NDS are comparable with those collected in the SHRP 2 NDS, given that the DAS and data management protocols were the same in both studies. Beyond the previously mentioned large-scale NDS, smaller-scale NDS have also been undertaken to examine several research issues related to human factors and traffic. For example, one field of investigation through NDS has been novice drivers, as they are over-represented in road crashes. Through the Naturalistic Teenage Driving Study (NTDS), in which 42 teenagers were monitored for their first 18 months of independent driving, it was possible to examine the exposure and crash risk factors of novice young drivers (Lee et al., 2011). The DAS installed in the vehicles consisted of cameras and sensors (accelerometers, GPS, front radar, lane position etc.) for the continuous monitoring of the drivers. The results of the study showed, among other things, that crash and near-crash rates were significantly higher during the first six months of licensure, that the rates were higher for the novice drivers compared to the adult


drivers, and that crash and near-crash types were similar between male and female drivers (Lee et al., 2011). Similarly, Prato et al. (2010) investigated the behavior of 62 novice young drivers during the initial 12 months after licensure. The data used in the study were collected through in-vehicle data recorders (IVDR), which continuously measured speeds and accelerations, and the results of this study showed, among other things, that the risk-taking behavior of young drivers is influenced by their gender, sensation-seeking tendency, and the driving behavior of their parents (Prato et al., 2010). Another focus has been distracted driving and secondary tasks. Using the NTDS data and the 100-Car NDS data, Klauer et al. (2014) examined the relationship between the performance of secondary tasks, including cell-phone use, and the risk of crashes and near-crashes, and it was found that the risk of a crash or near-crash among novice drivers increased with the performance of many secondary tasks, including texting and dialing cell phones (Klauer et al., 2014). Similarly, Wu and Xu (2018) analyzed the influence of familiarity on the involvement in secondary tasks and driving operation using NDS data. The distracted driving activities and driving operations used in this study were generated from the SHRP 2 NDS, and it was shown that drivers were more likely to engage in more distraction types on familiar roads. Other driver characteristics have also been the subject of NDS, such as older drivers and their behavior (Aksan et al., 2013; Charlton et al., 2013; Guo et al., 2015), drivers with Alzheimer's disease (Festa et al., 2013), dementia (Eby et al., 2012), driver fatigue (Dingus et al., 2006a, b), driving errors (Precht et al., 2017) and many more. Naturalistic driving studies provide a powerful tool for safety researchers, as they provide novel data and analytic methods with which to explore driving behavior.
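The event-detection side of such studies can be illustrated with a toy kinematic trigger: NDS analyses typically scan DAS accelerometer streams for hard-braking episodes to locate crash and near-crash candidates for manual review. A minimal sketch, where the threshold value and the trace are illustrative assumptions, not the actual criteria used in the 100-Car study:

```python
def flag_events(accel_trace, decel_threshold=-0.6):
    """Return the indices where longitudinal acceleration (in g) drops
    to or below a hard-braking threshold, as a crude proxy for the
    kinematic triggers used to flag crash/near-crash candidates in
    naturalistic driving data. The -0.6 g default is illustrative only."""
    return [i for i, a in enumerate(accel_trace) if a <= decel_threshold]

# Hypothetical accelerometer trace (longitudinal g, one sample per tick)
trace = [0.02, -0.05, -0.1, -0.65, -0.72, -0.3, 0.0]
print(flag_events(trace))  # -> [3, 4]
```

In practice such triggers are only a first filter; as noted above, purely quantitative criteria also flag many ordinary maneuvers, so flagged epochs are verified against video.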
Through these types of studies, driver observations and driving operations are collected unobtrusively, which yields the most realistic type of data for driving behavior. Such large-scale databases are an enormous asset for researchers to gain understanding of a wide array of driver behaviors, and may possibly serve as decision support tools for crash mitigation mechanisms. The NDS method is the only one with the ability to collect detailed driver observations and driving operations simultaneously. New and more detailed data can be collected for a wider range of drivers, vehicles and environments, which can provide information about the context in which drivers choose to engage in distractions, such as looking at their phones, as well as data on near-crash situations that could be valuable for the optimization of driver assistance systems. However, the drawback of such studies is that they are extremely costly, labor-intensive and require prolonged periods for both data collection and analysis. An alternative approach to NDS for studying driving performance and human factors is driving simulators. Driving simulators are virtual reality devices designed to conduct experiments in a controlled environment for the measurement of human performance. The first driving simulator was built by Volkswagen in the early 1970s (Nordmark, 1994), and ever since a


strong increase in the use of driving simulators, not only for research purposes but also for training, has occurred. The primary purpose of the development of driving simulators was to understand driver behavior in the laboratory and therefore potentially improve traffic safety in the real world. In research, driving simulators are used for many purposes and are a rich source of data on human factors in various traffic situations. The applications of driving simulators range from studies of driving behavior, such as the influence of secondary tasks, for instance the use of a cell phone (Alm and Nilsson, 1994; Atchley et al., 2011; Lipovac et al., 2017; Shinar et al., 2005; Strayer and Johnston, 2001; Törnros and Bolling, 2006), to the investigation of the effects of alcohol on driving (Irwin et al., 2017; Laude and Fillmore, 2015; Li et al., 2016; Vollrath and Fischer, 2017), to driver performance (Kang and Momtaz, 2018; Nilsson et al., 2018; Pawar and Patil, 2018; Soria et al., 2014), to the testing of Driver Assistance Systems (DAS) (Bianchi Piccinini et al., 2014). The main advantage of driving simulators is their versatility; in a simulator study the experimenter can recreate the traffic scenario multiple times, in a risk-free and cost-effective manner. Different scenarios can be created to cater for the needs of the research, and vehicle and environmental characteristics can quickly be altered. Additionally, the experiments take place in a controlled environment which places the drivers under no real risk or critical conditions. On the other hand, simulator studies have been criticized for not producing results comparable with the real world, as drivers do not always behave in simulators as they would in the real world. As traffic surveillance through cameras has been expanding in recent years, video-based analysis and safety level estimation has become another active area of research.
Together with the advancements in computer vision techniques, applications of video-based analyses include vehicle detection and recognition, vehicle tracking, incident detection, as well as behavior analysis. The most popular technology used today is traffic cameras, which allow tracking a vehicle's trajectory via image processing techniques. One of the biggest trajectory databases was developed through the Next Generation SIMulation (NGSIM) program, a public-private project of the Federal Highway Administration of the USA with the aim of supporting the development of algorithms for driver behavior at microscopic levels (US Department of Transportation, 2007; Kovvali et al., 2007). This data set was used by a number of researchers for safety analyses, such as the calibration of microscopic traffic models (Cunto and Saccomanno, 2008; Duong et al., 2010), the development of behavioral car-following models (Chen et al., 2012) and the assessment of rear-end collision risk (Zhao and Lee, 2018). Additionally, other studies have been performed using video trajectory data. For example, Park et al. (2018a, b) developed a lane change risk index using vehicle trajectory data obtained from a drone flown over a freeway work zone for a 20-min period. Similarly, Oh and Kim (2010) utilized trajectory data collected from traffic surveillance systems in order to develop a methodology for estimating rear-end crash potential.
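The Time to Collision (TTC) indicator used in conflict-based analyses of such trajectory data can be derived directly from the position and speed of a car-following pair. A minimal sketch, where the variable names and units (meters, m/s) are assumptions for illustration:

```python
def time_to_collision(gap_m: float, v_follower: float, v_leader: float) -> float:
    """Time to Collision (s) for a car-following pair: the spacing between
    the two vehicles divided by their closing speed. TTC is defined only
    when the follower travels faster than the leader; otherwise the pair
    is not on a collision course."""
    closing_speed = v_follower - v_leader
    if closing_speed <= 0:
        return float("inf")  # not closing in on the leader
    return gap_m / closing_speed

# Follower at 20 m/s, leader at 15 m/s, 25 m apart -> TTC = 25 / 5 = 5 s
print(time_to_collision(25.0, 20.0, 15.0))  # -> 5.0
```

In conflict studies, frames where TTC falls below a chosen threshold (often a few seconds) are counted as rear-end conflicts, i.e., surrogate safety events.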


The analysis of driving behavior, whether through simulator studies, naturalistic studies or video-based analysis, has provided researchers with valuable insights regarding road user behavior. An in-depth understanding of road user behavior is needed not only for the identification of the reasons behind high-risk situations in traffic, but also for the design of methodological approaches aimed at mitigating the negative consequences of such actions.

3 ADAS AND AUTONOMOUS VEHICLES (AVs)

Tremendous steps by innovators in recent years have resulted in fascinating enhancements in the transportation industry, which have led to the production of huge amounts of data that can be used to improve the safety and efficiency of the transportation system. Developers have been driven to figure out when humans should be in charge and when machines should intervene in order to improve safety, reduce costs and improve operation, and data is the fuel that powers the insights needed to enhance this system. Powered by Big Data, the transportation system is poised to become safer and smarter. In the field of ITS, a major continuously evolving area is ADAS. The aim of ADAS is to improve traffic safety and ultimately support efficient transport systems, as they not only control the driving task but also directly influence the interaction among road users. ADAS are designed to assist drivers while operating their vehicle, and a number of these systems have the primary purpose of preventing unsafe actions while driving. These systems continuously monitor various parameters, and as soon as predefined thresholds are exceeded, warnings are sent to the driver. Not all ADAS have safety as their primary purpose, yet in general these systems indirectly affect road safety as well. In the context of this chapter, this section focuses on systems that prevent unsafe situations while participating in traffic. In the early stages of the development of driver assistance functions, the focus was on vehicle control and specifically its stabilization. The first active assistance system developed was the Anti-lock Braking System (ABS) in the late 1970s, a system which prevents the wheels from locking while the driver is braking in order to avoid uncontrolled skidding on the road surface.
Soon after that, the Traction Control System (TCS) augmented the system, and years later, in the mid-1990s, Electronic Stability Control (ESC) was introduced (Bengler et al., 2014). These systems were the first developed to prevent unsafe situations or actions during driving and to support vehicle control; the most well-known ADAS aimed at preventing unsafe situations in traffic to date are presented in Table 2. The purpose of ADAS is to reduce driver error, or even eliminate it, in order to enhance efficiency. These systems can broadly be categorized into three groups based on their function (Carsten and Nilsson, 2001; Gietelink et al., 2006):

(i) Informative systems
(ii) Warning systems
(iii) Physically intervening systems
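The threshold-based logic described above (monitor a parameter, warn the driver when a threshold is crossed, and, in intervening systems, act if the driver does not react) can be sketched for a forward-collision function. This is a minimal illustration, not an implementation from the chapter: the time-to-collision thresholds (2.5 s and 1.0 s), the class, and the function names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class LeadVehicleState:
    gap_m: float              # distance to the preceding vehicle (m)
    closing_speed_ms: float   # relative approach speed (m/s), > 0 when closing

def ttc_seconds(state: LeadVehicleState) -> float:
    """Time-to-collision: gap / closing speed (infinite when not closing)."""
    if state.closing_speed_ms <= 0:
        return float("inf")
    return state.gap_m / state.closing_speed_ms

def fca_response(state: LeadVehicleState,
                 warn_ttc_s: float = 2.5,
                 brake_ttc_s: float = 1.0) -> str:
    """Map the monitored parameter onto the three ADAS categories:
    inform (no action needed), warn the driver, or physically intervene."""
    ttc = ttc_seconds(state)
    if ttc <= brake_ttc_s:
        return "intervene: autonomous emergency braking"
    if ttc <= warn_ttc_s:
        return "warn: forward collision warning"
    return "inform: normal driving"

# A vehicle 20 m ahead, approached at 10 m/s -> TTC = 2.0 s -> warning
print(fca_response(LeadVehicleState(gap_m=20.0, closing_speed_ms=10.0)))
```

The same skeleton covers the warning-only systems (drop the intervention branch) and the intervening systems (add an actuator command instead of a message).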

Big Data and Road Safety, Chapter 12, p. 333

TABLE 2 Advanced Driver Assistance (ADAS) Systems Aimed To Prevent Unsafe Situations in Traffic

Category: Vehicle control
- Antilock braking system (ABS): Prevents wheels from locking to avoid skidding on the road surface
- Traction control system (TCS): Prevents loss of traction on wheels
- Electronic stability control (ESC): Prevents the car from skidding during, for example, emergency evasive maneuvers
- Adaptive cruise control (ACC): Maintains a threshold distance from the preceding vehicle by adjusting vehicle speed
- Lane keep assist system (LKAS): (i) Assists the driver by warning when lane departure occurs; (ii) applies force feedback on the steering wheel to return to the lane

Category: Information, communication and comfort
- Forward collision avoidance (FCA): (i) Warns the driver in case of an imminent forward collision; (ii) provides automatic control of the vehicle if the driver does not intervene
- Intersection collision avoidance (ICA): (i) Warns the driver in case of a collision at an intersection; (ii) provides automatic control of the vehicle if the driver does not intervene
- Intelligent speed adaptation (ISA): (i) Passive ISA: warns the driver when the speed limit is exceeded; (ii) active ISA: automatically corrects vehicle speed when the speed limit is exceeded
- Vision enhancement system: Assists the driver by providing enhanced vision information in adverse lighting and weather conditions
- Blind spot detection system: Assists the driver by providing information on objects outside their range of vision

Informative systems provide information beyond what is available from the road and traffic environment and increase the driver's situation awareness; examples include navigation systems and park assist systems. Warning systems, such as lane departure warning systems, assist drivers in the driving task by actively warning them in case of a potential danger, thus allowing the driver

334 PART II Applications

to take corrective actions to avert a potential risk. Physically intervening systems, on the other hand, provide active support to the driver by taking control of the vehicle to avoid imminent danger.

This continuous development of ADAS is gradually leading to the concept of Autonomous Vehicles (AVs), where the ultimate future DAS should be capable of automated driving in all situations at a significantly superior safety level to that of a human driver (Bengler et al., 2014). In rough terms, an AV is any vehicle which adopts a technology that supports and assists the driver in controlling the vehicle and monitoring the surrounding environment. The official definition of automated driving comes from the Society of Automotive Engineers (SAE), which provides a taxonomy for motor vehicle driving automation ranging from no driving automation (level 0) to full driving automation (level 5). Fig. 1 summarizes the definitions of the six levels of driving automation (SAE, 2016). These levels are defined by reference to the specific role played by each of three primary actors in the performance of the Dynamic Driving Task (DDT): (i) the human (driver); (ii) the driving automation system; and (iii) the other vehicle systems and components. The DDT is defined as "all the real time operational and tactical functions required to operate a vehicle in on-road traffic." A key distinction in the six levels of vehicle automation is the "fallback" agent, which serves as a back-up in case of failure of the Autonomous Technology (AT) of the vehicle, thus discriminating between "semi-autonomous" vehicles, where

FIG. 1 Summary of levels of driving automation (SAE, 2016).


the fallback performance is placed on the driver (levels 1–3), and "fully autonomous" vehicles, where the back-up is on the system (levels 4–5). However, at whichever level of automation, the AT is subject to disengagement, making it possible for the vehicle to enter manual mode and return control to the driver.

Automated driving has been a research topic for many years now, and a large number of projects have significantly advanced the research towards autonomous vehicles. Some of the most noteworthy projects are listed below in chronological order:

- In 1986, the EUREKA PROMETHEUS Project was funded and defined the state of the art of autonomous vehicles
- In 1994, two Mercedes 500 SELs (VaMP from UniBwM and VITA II from Daimler-Benz) demonstrated automatic driving in dense traffic on French highways, including lane changes (Franke et al., n.d.; Maurer et al., 1996)
- In 1995, an AV drove a long-distance ride from Munich (Germany) to Odense (Denmark) with vision-based automation (Maurer et al., 1996)
- In 1995, in the trip dubbed "No Hands Across America," an AV using the RALPH computer program drove from Pittsburgh to Washington, DC, steering autonomously 96% of the way (Pomerleau, n.d.)
- In 2004, the first DARPA (Defense Advanced Research Projects Agency) Grand Challenge (funded by the US Government) was created for the development of technologies needed for autonomous vehicles. The vehicles had to traverse a distance of 175 miles through the desert, but none of the contestants was successful (Seetharaman et al., 2006)
- In the second DARPA challenge, in 2005, five vehicles were able to complete the course (Seetharaman et al., 2006)
- In the third DARPA challenge, named the Urban Challenge, in 2007, autonomous vehicles had to interact with both manned and unmanned vehicle traffic in an urban environment. That year, six contestants were successful in completing the course (Urmson et al., 2006)
- In 2006, the first European Land-Robot Trial (ELROB) was conducted in Hammelburg
- In 2010, the VisLab Intercontinental Autonomous Challenge took place, consisting of a 13,000-km-long test for intelligent vehicle applications (Broggi et al., 2012)
- In 2011, the Grand Cooperative Driving Challenge took place in Helmond, where teams were shuffled to a random starting position in a random platoon over 15 runs. Participating teams had to come up with strategies able to perform as well as possible without knowing the algorithms and technical equipment of the other vehicles in the platoon (Geiger et al., 2011)
- In 2013, BRAiVE, VisLab's most advanced autonomous vehicle, drove in downtown Parma (Teoh and Kidd, 2017)


- Waymo, previously known as Google's self-driving car project, has been performing supervised autonomous driving since 2009 and has self-driven more than 5 million miles to date. Initially, the tests began with a fleet of Toyota Prius cars, but in 2012 Google switched to a fleet of modified Lexus RX450h SUVs and expanded the tests from highways to urban streets. In 2017, Waymo's fully self-driving vehicles began test-driving on public roads

Autonomous and driverless vehicles are the next tech quantum leap in driving, and many automotive manufacturers have announced testing of driverless car systems in the current decade. Level 1 automation features, such as automated parking, adaptive cruise control, and lane keep assist, have become standard features in current vehicle models, but higher levels of automation are continuously sought by manufacturers such as Volvo, BMW, Mercedes-Benz, Tesla Motors, Toyota, Google, and more. Examples of commercial vehicles include (Van Brummelen et al., 2018):

- 2015 Infiniti Q50S
- 2016 Lexus RX
- 2016 Volvo XC90
- BMW 750i xDrive
- Ford (high-end production vehicles)
- Mercedes-Benz E- and S-Class
- Otto semi-trucks
- Renault GT Nav
- Tesla Model S

The ultimate goal of AVs is the elimination of human error by taking vehicle control away from the driver. To achieve even the lowest level of automation, it is essential that vehicles are equipped with sensors that collect information from the vehicle and the surrounding environment, so it comes as no surprise that the operation of such vehicles relies on Big Data and analytics.
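The SAE taxonomy and fallback roles discussed earlier in this section can be captured in a small lookup structure. This is an illustrative sketch: the short level names paraphrase SAE (2016) rather than quote it, and the dictionary layout and function names are assumptions made for the example.

```python
# SAE (2016) driving automation levels and the agent responsible for the
# dynamic driving task (DDT) fallback, as discussed in the text: levels 1-3
# place the fallback on the driver, levels 4-5 on the system itself.
SAE_LEVELS = {
    0: ("No driving automation", "driver"),
    1: ("Driver assistance", "driver"),
    2: ("Partial driving automation", "driver"),
    3: ("Conditional driving automation", "driver"),
    4: ("High driving automation", "system"),
    5: ("Full driving automation", "system"),
}

def fallback_agent(level: int) -> str:
    """Who serves as DDT fallback at a given automation level."""
    return SAE_LEVELS[level][1]

def is_fully_autonomous(level: int) -> bool:
    """Per the chapter's distinction: 'fully autonomous' means the system
    itself is the fallback agent (levels 4-5)."""
    return fallback_agent(level) == "system"

assert is_fully_autonomous(2) is False
assert is_fully_autonomous(5) is True
```

Such a mapping is a convenient way to encode, for example, which disengagement and hand-over logic a vehicle platform must implement at each level.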

4 CONCLUSIONS

Big traffic data are collected through various instruments which are continuously evolving and increasing in number. New opportunities are presented to identify problems in the transportation area, and especially in traffic safety. The quality and quantity of the high-resolution data used for safety estimation are very important for the successful interpretation of the dynamic mechanisms leading to higher risk levels in the network. Data are continuously generated through ITS applications and are leveraged by researchers in the search for previously unknown patterns and trends. By leveraging the real-time nature of Big Data, researchers have developed real-time crash prediction models which are subsequently used in


Active Traffic Management, aiming for better traffic system performance. The continuous evolution of sensors and computer analytics has also provided the opportunity to perform large-scale studies that collect individual driver data used to understand correlations between human factors and crash causation. Big Data analytics are also proving crucial in the next tech quantum leap of Autonomous Vehicles, which rely on a huge amount of data analytics to ensure safe operation and are also data producers, as they are equipped with myriad sensors. To conclude, in recent decades traffic safety analysis has seen significant progress, and the evolution of Big Data has been an important milestone for future steps.

REFERENCES

Abdel-Aty, M.A., Hassan, H.M., Ahmed, M., Al-Ghamdi, A.S., 2012. Real-time prediction of visibility related crashes. Transp. Res. Part C Emerg. Technol. 24, 288–298. https://doi.org/10.1016/j.trc.2012.04.001. Abdel-Aty, M.A., Pemmanaboina, R., 2006. Calibrating a real-time traffic crash-prediction model using archived weather and ITS traffic data. IEEE Trans. Intell. Transp. Syst. 7, 167–174. https://doi.org/10.1109/TITS.2006.874710. Abdel-Aty, M., Dhindsa, A., Gayah, V., 2007a. Considering various ALINEA ramp metering strategies for crash risk mitigation on freeways under congested regime. Transp. Res. Part C Emerg. Technol. 15, 113–134. https://doi.org/10.1016/j.trc.2007.02.003. Abdel-Aty, M., Pande, A., 2006. ATMS implementation system for identifying traffic conditions leading to potential crashes. IEEE Trans. Intell. Transp. Syst. 7, 78–91. https://doi.org/10.1109/TITS.2006.869612. Abdel-Aty, M., Pande, A., 2005. Identifying crash propensity using specific traffic speed conditions. J. Safety Res. 36, 97–108. https://doi.org/10.1016/j.jsr.2004.11.002. Abdel-Aty, M., Pande, A., Lee, C., Gayah, V., Santos, C.D., 2007b. Crash risk assessment using intelligent transportation systems data and real-time intervention strategies to improve safety on freeways. J. Intell. Transp. Syst. 11, 107–120. https://doi.org/10.1080/15472450701410395. Ahmed, M., Abdel-Aty, M., 2013. A data fusion framework for real-time risk assessment on freeways. Transp. Res. Part C Emerg. Technol. 26, 203–213. https://doi.org/10.1016/j.trc.2012.09.002. Ahmed, M., Abdel-Aty, M., Yu, R., 2012. Bayesian updating approach for real-time safety evaluation with automatic vehicle identification data. Transp. Res. Rec. J. Transp. Res. Board 2280, 60–67. https://doi.org/10.3141/2280-07. Ahmed, M.M., Abdel-Aty, M.A., 2012. The viability of using automatic vehicle identification data for real-time crash prediction. IEEE Trans. Intell. Transp. Syst. 13, 459–468.
https://doi.org/ 10.1109/TITS.2011.2171052. Ahmed, M.M., Abdel-Aty, M., Lee, J., Yu, R., 2014. Real-time assessment of fog-related crashes using airport weather data: a feasibility analysis. Accid. Anal. Prev. 72, 309–317. https://doi. org/10.1016/j.aap.2014.07.004. Aksan, N., Dawson, J.D., Emerson, J.L., Yu, L., Uc, E.Y., Anderson, S.W., Rizzo, M., 2013. Naturalistic distraction and driving safety in older drivers. Hum. Factors J. Hum. Factors Ergon. Soc. 55, 841–853. https://doi.org/10.1109/TMI.2012.2196707.Separate.


Alm, H., Nilsson, L., 1994. Changes in driver behaviour as a function of handsfree mobile phones—a simulator study. Accid. Anal. Prev. 26, 441–451. https://doi.org/10.1016/0001-4575(94)90035-3. Atchley, P., Atwood, S., Boulton, A., 2011. The choice to text and drive in younger drivers: behavior may shape attitude. Accid. Anal. Prev. 43, 134–142. https://doi.org/10.1016/j.aap.2010.08.003. Basso, F., Basso, L.J., Bravo, F., Pezoa, R., 2018. Real-time crash prediction in an urban expressway using disaggregated data. Transp. Res. Part C Emerg. Technol. 86, 202–219. https://doi.org/10.1016/j.trc.2017.11.014. Bengler, K., Dietmayer, K., Farber, B., Maurer, M., Stiller, C., Winner, H., 2014. Three decades of driver assistance systems: review and future perspectives. IEEE Intell. Transp. Syst. Mag. 6, 6–22. https://doi.org/10.1109/MITS.2014.2336271. Bianchi Piccinini, G.F., Rodrigues, C.M., Leitão, M., Simões, A., 2014. Driver's behavioral adaptation to adaptive cruise control (ACC): the case of speed and time headway. J. Safety Res. 49, 77–84. https://doi.org/10.1016/j.jsr.2014.02.010. Broggi, A., Medici, P., Zani, P., Coati, A., Panciroli, M., 2012. Autonomous vehicles control in the VisLab intercontinental autonomous challenge. Annu. Rev. Control 36, 161–171. https://doi.org/10.1016/j.arcontrol.2012.03.012. Carsten, O., Nilsson, L., 2001. White rose research online. Eur. J. Transp. Infrastruct. Res. 1, 225–243. https://doi.org/10.1016/S1366-5545(02)00012-1. Charlton, J.L., Catchlove, M., Scully, M., Koppel, S., Newstead, S., 2013. Older driver distraction: a naturalistic study of behaviour at intersections. Accid. Anal. Prev. 58, 271–278. https://doi.org/10.1016/j.aap.2012.12.027. Chen, D., Laval, J., Zheng, Z., Ahn, S., 2012. A behavioral car-following model that captures traffic oscillations. Transp. Res. Part B Methodol. 46, 744–761. https://doi.org/10.1016/j.trb.2012.01.009. Cunto, F., Saccomanno, F.F., 2008.
Calibration and validation of simulated vehicle safety performance at signalized intersections. Accid. Anal. Prev. 40, 1171–1179. https://doi.org/10.1016/j. aap.2008.01.003. Dimitriou, L., Stylianou, K., Abdel-Aty, M.A., 2018. Assessing rear-end crash potential in urban locations based on vehicle-by-vehicle interactions, geometric characteristics and operational conditions. Accid. Anal. Prev. 0–1. https://doi.org/10.1016/j.aap.2018.02.024. Dingus, T.A., Guo, F., Lee, S., Antin, J.F., Perez, M., Buchanan-King, M., Hankey, J., 2016. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proc. Natl. Acad. Sci. 113, 2636–2641. https://doi.org/10.1073/pnas.1513271113. Dingus, T.A., Hankey, J.M., Antin, J.F., Lee, S.E., Eichelberger, L., Stulce, K.E., McGraw, D., Perez, M., Stowe, L., 2014. Naturalistic Driving Study: Technical Coordination and Quality Control. https://doi.org/10.17226/22362. Dingus, T.A., Neale, V.L., Klauer, S.G., Petersen, A.D., Carroll, R.J., 2006a. The development of a naturalistic data collection system to perform critical incident analysis: an investigation of safety and fatigue issues in long-haul trucking. Accid. Anal. Prev. 38, 1127–1136. https:// doi.org/10.1016/j.aap.2006.05.001. Dingus, T.A., Klauer, S.G., Neale, V.L., Petersen, A., Lee, S.E., Sudweeks, J., Perez, M.A., Hankey, J., Ramsey, D., Gupta, S., Bucher, C., Doerzaph, Z.R., Jermeland, J., Knipling, R.R., 2006b. The 100-Car naturalistic driving study phase II—results of the 100-Car field experiment. Chart No. HS-810 593. doi:DOT HS 810 593. Duong, D., Saccomanno, F., Hellinga, B., 2010. Calibration of microscopic traffic model for simulating safety performance. In: 89th Annu. Transp. Res. Board Meet. Eby, D.W., Silverstein, N.M., Molnar, L.J., Leblanc, D., Adler, G., 2012. Driving behaviors in early stage dementia: a study using in-vehicle technology. Accid. Anal. Prev. 49, 330–337. https:// doi.org/10.1016/j.aap.2011.11.021.


Eenink, R., Barnard, Y., Baumann, M., Augros, X., Utesch, F., 2014. UDRIVE the European naturalistic driving study. Transp. Res. Arena 32, 1–10. Fang, S., Xie, W., Wang, J., Ragland, D.R., 2016. Utilizing the eigenvectors of freeway loop data spatiotemporal schematic for real time crash prediction. Accid. Anal. Prev. 94, 59–64. https:// doi.org/10.1016/j.aap.2016.05.013. Festa, E.K., Ott, B.R., Manning, K.J., Davis, J.D., Heindel, W.C., 2013. Effect of cognitive status on self-regulatory driving behavior in older adults: an assessment of naturalistic driving using in-car video recordings. J. Geriatr. Psychiatry Neurol. 26, 10–18. https://doi.org/ 10.1177/0891988712473801. Franke, U., Mehring, S., Suissa, A., Hahn, S., n.d. The Daimler-Benz steering assistant: a spin-off from autonomous driving. Proc. Intell. Veh. ‘94 Symp 120–124. doi: https://doi.org/10.1109/ IVS.1994.639486. Geiger, A., Lauer, M., Moosmann, F., Ranft, B., Rapp, H., Stiller, C., Ziegler, J., 2011. Team AnnieWAY’s entry to the grand cooperative driving challenge 2011. IEEE Trans. Intell. Transp. Syst. 13, 1–10. https://doi.org/10.1109/TITS.2012.2189882. Gietelink, O., Ploeg, J., De Schutter, B., Verhaegen, M., 2006. Development of advanced driver assistance systems with vehicle hardware-in-the-loop simulations. Veh. Syst. Dyn. 44, 569–590. https://doi.org/10.1080/00423110600563338. Golob, T.F., Recker, W.W., Alvarez, V.M., 2004. Freeway safety as a function of traffic flow. Accid. Anal. Prev. 36, 933–946. https://doi.org/10.1016/j.aap.2003.09.006. Guo, F., Fang, Y., Antin, J.F., 2015. Older driver fitness-to-drive evaluation using naturalistic driving data. J. Safety Res. 54, 49–54. https://doi.org/10.1016/j.jsr.2015.06.013. Hassan, H.M., Abdel-Aty, M.A., 2013. Predicting reduced visibility related crashes on freeways using real-time traffic flow data. J. Safety Res. 45, 29–36. https://doi.org/10.1016/j.jsr.2012.12.004. Hossain, M., Muromachi, Y., 2013. 
A real-time crash prediction model for the ramp vicinities of urban expressways. IATSS Res. 37, 68–79. https://doi.org/10.1016/j.iatssr.2013.05.001. Hossain, M., Muromachi, Y., 2012. A Bayesian network based framework for real-time crash prediction on the basic freeway segments of urban expressways. Accid. Anal. Prev. 45, 373–381. https://doi.org/10.1016/j.aap.2011.08.004. Irwin, C., Iudakhina, E., Desbrow, B., McCartney, D., 2017. Effects of acute alcohol consumption on measures of simulated driving: a systematic review and meta-analysis. Accid. Anal. Prev. 102, 248–266. https://doi.org/10.1016/j.aap.2017.03.001. Jonas, B., van Nes, N., Michiel, C., Jansen, R., Heijne, V., Carsten, O., Dotzauer, M., Utech, F., Svanberg, E., Cocron, M.P., Forcolin, F., Kovaceva, J., Guyonvarch, L., Hibberd, D., Lotan, T., Winkelbauer, M., Sagberg, F., Stemmier, E., Gellerman, H., Val, C., Quintero, K., Tattegrain, H., Donabauer, M., Pommer, A., Neumann, I., Alber, G., Welsh, R., Fox, C., 2017. The UDrive Dataset and Key Analysis Results. https://doi.org/10.26323/UDRIVE. Kang, M.W., Momtaz, S.U., 2018. Assessment of driver compliance on roadside safety signs with auditory warning sounds generated from pavement surface—a driving simulator study. J. Traffic Transp. Eng. (English Ed.) 5, 1–13. https://doi.org/10.1016/j.jtte.2017.09.001. Klauer, S.G., Guo, F., Simmons-Morton, B.G., Ouimet, M.C., Lee, S.E., Dingus, T.A., 2014. Distracted driving and risk of road crashes among novice and experienced drivers. J. Emerg. Med. 46, 600–601. https://doi.org/10.1016/j.jemermed.2014.02.017. Kovvali, V., Systematics, C., Alexiadis, V., Zhang, L., Length, P., 2007. Video-Based Vehicle Trajectory Data Collection. pp. 1–18. Laude, J.R., Fillmore, M.T., 2015. Simulated driving performance under alcohol: effects on driver-risk versus driver-skill. Drug Alcohol Depend. 154, 271–277. https://doi.org/10.1016/j.drugalcdep.2015.07.012.


Lee, C., Saccomanno, F., Hellinga, B., 2002. Analysis of crash precursors on instrumented freeways. Transp. Res. Rec. 1784, 1–8. https://doi.org/10.3141/1784-01. Lee, S.E., Simons-Morton, B.G., Klauer, S.E., Ouimet, M.C., Dingus, T.A., 2011. Naturalistic assessment of novice teenage crash experience. Accid. Anal. Prev. 43, 1472–1479. https://doi.org/10.1016/j.aap.2011.02.026. Li, Y.C., Sze, N.N., Wong, S.C., Yan, W., Tsui, K.L., So, F.L., 2016. A simulation study of the effects of alcohol on driving performance in a Chinese population. Accid. Anal. Prev. 95, 334–342. https://doi.org/10.1016/j.aap.2016.01.010. Li, Z., Wang, W., Chen, R., Liu, P., Xu, C., 2013. Evaluation of the impacts of speed variation on freeway traffic collisions in various traffic states. Traffic Inj. Prev. 14, 861–866. https://doi.org/10.1080/15389588.2013.775433. Lipovac, K., Đerić, M., Tešić, M., Andrić, Z., Marić, B., 2017. Mobile phone use while driving - literary review. Transp. Res. Part F Traffic Psychol. Behav. 47, 132–142. https://doi.org/10.1016/j.trf.2017.04.015. Liu, M., Chen, Y., 2017. Predicting real-time crash risk for urban expressways in China. Math. Probl. Eng. 2017. https://doi.org/10.1155/2017/6263726. Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transp. Res. Part A Policy Pract. 44, 291–305. https://doi.org/10.1016/j.tra.2010.02.001. Martin, P.T., 2003. Detector Technology Evaluation (MPC-03-154) 140. Maurer, M., Behringer, R., Fürst, S., Thomanek, F., Dickmanns, E.D., 1996. In: A compact vision system for road vehicle guidance. Proc. Int. Conf. Pattern Recognit. vol. 3, pp. 313–317. https://doi.org/10.1109/ICPR.1996.546962. Nilsson, P., Laine, L., Sandin, J., Jacobson, B., Eriksson, O., 2018. On actions of long combination vehicle drivers prior to lane changes in dense highway traffic—a driving simulator study. Transp. Res. Part F Psychol. Behav. 55, 25–37.
https://doi.org/10.1016/j.trf.2018.02.004. Nordmark, S., 1994. In: Driving simulators, trends and experiences. Driv. Simul. Conf. Oh, C., Kim, T., 2010. Estimation of rear-end crash potential using vehicle trajectory data. Accid. Anal. Prev. 42, 1888–1893. https://doi.org/10.1016/j.aap.2010.05.009. Oh, C., Oh, J.-S., Ritchie, S., Chang, M., 2001. Real-time estimation of freeway accident likelihood. In: Proc. 80th Annu. Meet. Transp. Res. Board. Oh, J.-S., Oh, C., Ritchie, S.G., Chang, M., 2005. Real-time estimation of accident likelihood for safety enhancement. J. Transp. Eng. 131, 358–363. https://doi.org/10.1061/(ASCE)0733947X(2005)131:5(358). Park, H., Haghani, A., Samuel, S., Knodler, M.A., 2018a. Real-time prediction and avoidance of secondary crashes under unexpected traffic congestion. Accid. Anal. Prev. 112, 39–49. https://doi.org/10.1016/j.aap.2017.11.025. Park, H., Oh, C., Moon, J., Kim, S., 2018b. Development of a lane change risk index using vehicle trajectory data. Accid. Anal. Prev. 110, 1–8. https://doi.org/10.1016/j.aap.2017.10.015. Park, S., Ritchie, S.G., 2004. In: Exploring the relationship between freeway speed variance, lane changing and vehicle heterogeneity.Presented at the 83rd Annual Meeting of the Transportation Research Board. Pawar, D.S., Patil, G.R., 2018. Response of major road drivers to aggressive maneuvering of the minor road drivers at unsignalized intersections: a driving simulator study. Transp. Res. Part F Traffic Psychol. Behav. 52, 164–175. https://doi.org/10.1016/j.trf.2017.11.016. Peng, Y., Abdel-Aty, M., Shi, Q., Yu, R., 2017. Assessing the impact of reduced visibility on traffic crash risk using microscopic data and surrogate safety measures. Transp. Res. Part C Emerg. Technol. 74, 295–305. https://doi.org/10.1016/j.trc.2016.11.022.


Pomerleau, D., n.d. RALPH: rapidly adapting lateral position handler. Proc. Intell. Veh. ‘95. Symp 506–511. doi: https://doi.org/10.1109/IVS.1995.528333. Prato, C.G., Toledo, T., Lotan, T., Taubman - Ben-Ari, O., 2010. Modeling the behavior of novice young drivers during the first year after licensure. Accid. Anal. Prev. 42, 480–486. https://doi. org/10.1016/j.aap.2009.09.011. Precht, L., Keinath, A., Krems, J.F., 2017. Identifying the main factors contributing to driving errors and traffic violations—results from naturalistic driving data. Transp. Res. Part F Traffic Psychol. Behav. 49, 49–92. https://doi.org/10.1016/j.trf.2017.06.002. Regan, M.a., Williamson, A., Grzebieta, R., Tao, L., 2012. In: Naturalistic driving studies: literature review and planning for the Australian naturalistic driving study.Australas. Coll. Road Saf. Natl. Conf, pp. 1–13. Roshandel, S., Zheng, Z., Washington, S., 2015. Impact of real-time traffic characteristics on freeway crash occurrence: systematic review and meta-analysis. Accid. Anal. Prev. 79, 198–211. https://doi.org/10.1016/j.aap.2015.03.013. SAE, 2016. Taxonomy and Definitions for Terms Related to Driving Automation systems for On-Road Motor Vehicles. Seetharaman, G., Lakhotia, A., Blasch, E.P., 2006. Unmanned vehicles come of age: the DARPA grand challenge. Computer (Long. Beach. Calif ). 39, 26–29. https://doi.org/10.1109/ MC.2006.447. Shew, C., Pande, A., Nuworsoo, C., 2013. Transferability and robustness of real-time freeway crash risk assessment. J. Safety Res. 46, 83–90. https://doi.org/10.1016/j.jsr.2013.04.005. Shi, Q., Abdel-Aty, M., 2015. Big data applications in real-time traffic operation and safety monitoring and improvement on urban expressways. Transp. Res. Part C Emerg. Technol. 1, 1–15. https://doi.org/10.1016/j.trc.2015.02.022. Shinar, D., Tractinsky, N., Compton, R., 2005. Effects of practice, age, and task demands, on interference from a phone task while driving. Accid. Anal. Prev. 37, 315–326. 
https://doi.org/10.1016/j.aap.2004.09.007. Soria, I., Elefteriadou, L., Kondyli, A., 2014. Assessment of car-following models by driver type and under different traffic, weather conditions using data from an instrumented vehicle. Simul. Model. Pract. Theory 40, 208–220. https://doi.org/10.1016/j.simpat.2013.10.002. Strayer, D., Johnston, W., 2001. Driven to distraction—dual task studies of driving and conversing on a cellular phone. Psychol. Sci. 12, 462–466. Stylianou, K., Dimitriou, L., 2018. Analysis of rear-end conflicts in urban networks using Bayesian networks. Transp. Res. Rec. J. Transp. Res. Board, 1–19. Sun, J., Sun, J., 2015. A dynamic Bayesian network model for real-time crash prediction using traffic speed conditions data. Transp. Res. Part C Emerg. Technol. 54, 176–186. https://doi.org/10.1016/j.trc.2015.03.006. Teoh, E.R., Kidd, D.G., 2017. Rage against the machine? Google's self-driving cars versus human drivers. J. Safety Res. 63, 57–60. https://doi.org/10.1016/j.jsr.2017.08.008. Theofilatos, A., Yannis, G., Kopelias, P., Papadimitriou, F., 2018. Impact of real-time traffic characteristics on crash occurrence: preliminary results of the case of rare events. Accid. Anal. Prev. 1–9. https://doi.org/10.1016/j.aap.2017.12.018. Törnros, J., Bolling, A., 2006. Mobile phone use—effects of conversation on mental workload and driving speed in rural and urban environments. Transp. Res. Part F Traffic Psychol. Behav. 9, 298–306. https://doi.org/10.1016/j.trf.2006.01.008. Urmson, C., Anhalt, J., Bangekk, D., Baker, C., Bittner, R., Clark, M.N., Dolan, J., Duggins, D., Galatali, T., Geyer, C., Gittleman, M., Harbaugh, S., Hebert, M., Howard, T.M., Kolski, S., Kelly, A., Likhachev, M., McNaughton, M., Miller, N., Peterson, K., Pilnick, B.,


Rajkumar, R., Rybski, P., Salesky, B., Seo, Y.-W., Singh, S., Sni, J., Zigler, J., 2006. Autonomous driving in urban environments: boss and the urban challenge. J. F. Robot. 23, 245–267. https://doi.org/10.1002/rob. US Department of Transportation, 2007. NGSIM—Next Generation Simulation. http://www.ngsim. fhwa.dot.gov. Van Brummelen, J., O’Brien, M., Gruyer, D., Najjaran, H., 2018. Autonomous vehicle perception: the technology of today and tomorrow. Transp. Res. Part C Emerg. Technol., 1–23. https://doi. org/10.1016/j.trc.2018.02.012. Vollrath, M., Fischer, J., 2017. When does alcohol hurt? A driving simulator study. Accid. Anal. Prev. 109, 89–98. https://doi.org/10.1016/j.aap.2017.09.021. Wang, L., Abdel-Aty, M., Lee, J., 2017a. Safety analytics for integrating crash frequency and realtime risk modeling for expressways. Accid. Anal. Prev. 104, 58–64. https://doi.org/10.1016/j. aap.2017.04.009. Wang, L., Abdel-Aty, M., Lee, J., Shi, Q., 2017b. Analysis of real-time crash risk for expressway ramps using traffic, geometric, trip generation, and socio-demographic predictors. Accid. Anal. Prev., 1–7. https://doi.org/10.1016/j.aap.2017.06.003. Wang, L., Abdel-Aty, M., Shi, Q., Park, J., 2015. Real-time crash prediction for expressway weaving segments. Transp. Res. Part C Emerg. Technol. 61, 1–10. https://doi.org/10.1016/j. trc.2015.10.008. Wang, L., Abdel-Aty, M., Wang, X., Yu, R., 2018. Analysis and comparison of safety models using average daily, average hourly, and microscopic traffic. Accid. Anal. Prev. 111, 271–279. https:// doi.org/10.1016/j.aap.2017.12.007. Wu, J., Abdel-Aty, M., Yu, R., Gao, Z., 2013. A novel visible network approach for freeway crash analysis. Transp. Res. Part C Emerg. Technol. 36, 72–82. https://doi.org/10.1016/j. trc.2013.08.005. Wu, J., Xu, H., 2018. The influence of road familiarity on distracted driving activities and driving operation using naturalistic driving study data. Transp. Res. Part F Traffic Psychol. Behav. 52, 75–85. 
https://doi.org/10.1016/j.trf.2017.11.018. Wu, Y., Abdel-Aty, M., Lee, J., 2017. Crash risk analysis during fog conditions using real-time traffic data. Accid. Anal. Prev. 1–8. https://doi.org/10.1016/j.aap.2017.05.004. Xu, C., Liu, P., Wang, W., Li, Z., 2012. Evaluation of the impacts of traffic states on crash risks on freeways. Accid. Anal. Prev. 47, 162–171. https://doi.org/10.1016/j.aap.2012.01.020. Xu, C., Liu, P., Yang, B., Wang, W., 2016. Real-time estimation of secondary crash likelihood on freeways using high-resolution loop detector data. Transp. Res. Part C Emerg. Technol. 71, 406–418. https://doi.org/10.1016/j.trc.2016.08.015. Xu, C., Tarko, A.P., Wang, W., Liu, P., 2013a. Predicting crash likelihood and severity on freeways with real-time loop detector data. Accid. Anal. Prev. 57, 30–39. https://doi.org/10.1016/j. aap.2013.03.035. Xu, C., Wang, W., Liu, P., 2013b. Identifying crash-prone traffic conditions under different weather on freeways. J. Safety Res. 46, 135–144. https://doi.org/10.1016/j.jsr.2013.04.007. Xu, C., Wang, W., Liu, P., Guo, R., Li, Z., 2014. Using the bayesian updating approach to improve the spatial and temporal transferability of real-time crash risk prediction models. Transp. Res. Part C Emerg. Technol. 38, 167–176. https://doi.org/10.1016/j.trc.2013.11.020. Xu, C., Wang, W., Liu, P., Zhang, F., 2015. Development of a real-time crash risk prediction model incorporating the various crash mechanisms across different traffic states. Traffic Inj. Prev. 16, 28–35. https://doi.org/10.1080/15389588.2014.909036.


Yu, R., Abdel-Aty, M., 2014a. An optimal variable speed limits system to ameliorate traffic safety risk. Transp. Res. Part C Emerg. Technol. 46, 235–246. https://doi.org/10.1016/j. trc.2014.05.016. Yu, R., Abdel-Aty, M., 2014b. Analyzing crash injury severity for a mountainous freeway incorporating real-time traffic and weather data. Saf. Sci. 63, 50–56. https://doi.org/10.1016/j. ssci.2013.10.012. Yu, R., Abdel-Aty, M., 2014c. Using hierarchical Bayesian binary probit models to analyze crash injury severity on high speed facilities with real-time traffic data. Accid. Anal. Prev. 62, 161–167. https://doi.org/10.1016/j.aap.2013.08.009. Yu, R., Abdel-Aty, M., 2013a. Multi-level Bayesian analyses for single- and multi-vehicle freeway crashes. Accid. Anal. Prev. 58. https://doi.org/10.1016/j.aap.2013.04.025. Yu, R., Abdel-Aty, M., 2013b. Utilizing support vector machine in real-time crash risk evaluation. Accid. Anal. Prev. 51, 252–259. https://doi.org/10.1016/j.aap.2012.11.027. Yu, R., Abdel-Aty, M., Ahmed, M., 2013. Bayesian random effect models incorporating real-time weather and traffic data to investigate mountainous freeway hazardous factors. Accid. Anal. Prev. 50, 371–376. https://doi.org/10.1016/j.aap.2012.05.011. Zhao, P., Lee, C., 2018. Assessing rear-end collision risk of cars and heavy vehicles on freeways using a surrogate safety measure. Accid. Anal. Prev. 113, 149–158. https://doi.org/10.1016/j. aap.2018.01.033. Zheng, Z., Ahn, S., Monsere, C.M., 2010. Impact of traffic oscillations on freeway crash occurrences. Accid. Anal. Prev. 42, 626–636. https://doi.org/10.1016/j.aap.2009.10.009.

Chapter 13

A Back-Engineering Approach to Explore Human Mobility Patterns Across Megacities Using Online Traffic Maps

Vana Gkania and Loukas Dimitriou

Laboratory for Transport Engineering, Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus

Chapter Outline
1 Introduction
2 Data and Traffic Information Extraction Methods
2.1 Cities Characteristics
2.2 Data Gathering and Preprocessing
2.3 Extracting Traffic Information by Image Processing
3 Temporal and Spatiotemporal Mobility Patterns
3.1 Temporal Patterns
3.2 Spatiotemporal Patterns
4 Dynamic Clustering and Propagation of Congestion
5 Conclusions
References

1 INTRODUCTION

Observing and modeling human movement in urban environments has gained research interest in many fields, such as transportation engineering, urban planning, and social science. Initially, mobility data was derived from census data, yesterday's big data (Cottineau et al., 2017), a useful but costly and time-consuming procedure. The rise of the Internet and the ubiquity of telecommunication networks have enriched the available mobility data, as more and more people, vehicles, and goods can now be tracked and tagged in real time (Claudel et al., 2016). The contemporary "Big Data" era, together with new tools for analysis and visualization, enables the recognition of formerly invisible patterns. Relevant research projects are Real Time Rome, which explores aggregated data from cell phones, buses, and taxis in Rome, revealing the pulse of the city (Calabrese et al., 2011), and Live Singapore, which

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00013-0 © 2019 Elsevier Inc. All rights reserved.


PART II Applications

utilizes various datasets, including social media, weather, public and private transit, and telecommunications, to enhance the portrait of the city (Kloeckl et al., 2012). A comparative study of three global cities based on the detection of mobile phone usage patterns revealed a universal structure of cities, with core financial centers all sharing similar activity patterns (Grauwin et al., 2015). Other seminal work that paved a new research field dealing with human mobility understood from digital records can be found in Candia et al. (2008), González et al. (2008), and Song et al. (2010). Apart from mobile phone trajectories, another valuable source of data used to analyze and model human behavior in the real world is produced by web-based services that create enormous digital records. Data from online location-based social networks such as Twitter, Facebook, and Foursquare have been used in many research efforts, such as real-time event detection (Sakaki et al., 2010), prediction purposes (Asur and Huberman, 2010), traffic information extraction and classification (Chaniotakis et al., 2016; Wanichayapong et al., 2011), and the exploration of universal urban mobility patterns (Noulas et al., 2012). The common tendency is that the interaction with web services can be used to gain insight from the derived data and thus to achieve a better understanding of urban dynamics.

Online traffic maps are a web-based service that gives a snapshot of a road network's current traffic state, typically depicted through standard color-coding of road links, with each color corresponding to a distinct traffic condition. The user base of online maps has expanded exponentially in recent years, as their reliability has increased due to the extensive use of crowdsourced information and the augmentation of data from many sources (mobile tracing, GPS information, traffic counts).
Among the most popular online traffic map providers, in terms of coverage and usage rate, are Google, Bing, Here Maps, Yandex, and Baidu. The way these providers track traffic and depict it on maps varies according to the data and the method that each provider uses. Overall, the developments in web mapping are a continuous blend of improvements in technologies, data and information proliferation, growth in users and usage, and their interaction (Veenendaal, 2016). As the reliability of the real-time traffic conditions displayed on traffic maps depends on the scale of user interaction, the more users start utilizing them, the better the quality and reliability achieved. Taking into account the proliferation of users in recent years, especially in cities where the negative consequences of traffic congestion are greater (Bugliarello et al., 1996), these online maps can be a reliable data source for further traffic analysis and the interpretation of mobility patterns. This inspired the current research work to explore how online maps, designed to inform users of the real-time traffic state, can be used to analyze and compare mobility patterns, revealed by traffic dynamics, across multiple cities. The main contribution of this approach lies in the synthetic nature of the mobility data that online maps depict, in contrast to the single type of digital data used in other


studies, and in the extension of the application from one limited spatial entity (in general, a city) to the global coverage provided by traffic maps. Specifically, this chapter takes advantage of a huge collection of data derived from online traffic maps, both in time and space, in six cities—Paris, London, Moscow, New York, Los Angeles, and Tokyo—to investigate questions such as: Is the activity detected in one traffic layer (e.g., slow traffic) independent of the others (e.g., fast, medium traffic) in an urban road network? Are there any regularities between spatiotemporal mobility patterns among cities? How can we compare the results across multiple cities? What are the signatures of the congestion patterns in major US, European, or Asian cities, and how do they compare? To answer these questions, this comparative study uses one week's images from one standard, commercial, freely available online traffic map, collected for six cities that span three continents and six countries. The data were preprocessed prior to analysis to accommodate errors that arise from sourcing data from online maps. Afterwards, image processing techniques were applied to isolate the traffic layers in every image. Finally, further computer vision methods and clustering techniques were used to assess the extent to which urban mobility (independently of the city in which it appears) shares similar patterns in the diurnal cycle.

2 DATA AND TRAFFIC INFORMATION EXTRACTION METHODS

2.1 Cities Characteristics

The investigation of the spatiotemporal mobility patterns based on traffic maps is applied to six cities, as shown in Fig. 1, all of which have available online traffic coverage. Three cities were chosen from Europe (Paris, London, Moscow), two from North America (New York, Los Angeles), and Tokyo from Asia. The key characteristics of the selected cities related to population, land coverage, density (Demographia, 2017), and hours spent in traffic congestion (Cookson and Pishue, 2017) are given in Table 1. The population, land, and population density data refer to the urban area, which may differ from the municipality. At first glance, the cities' population densities vary, but all have in common a high number of hours spent in congestion. Regarding this last metric, the average hours spent in congestion are calculated by applying the average peak-period congestion rate to travel times, allowing a derivation of the daily time spent in peak-period congestion. Assuming 240 working days a year, the average number of hours spent in congestion during peak hours is estimated for every city. This is a metric of the impact on the typical car commuter, where peak hours are locally defined based upon the actual driving habits in each city (Cookson and Pishue, 2017).
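The annualized-hours metric described above is simple arithmetic and can be sketched directly; the daily-delay figure used in the example is back-computed for illustration and is not taken from the INRIX scorecard:

```python
def annual_congestion_hours(daily_peak_delay_minutes, working_days=240):
    """Annualize a commuter's average daily peak-period delay,
    assuming `working_days` commuting days per year (as in the
    INRIX-based metric of Table 1)."""
    return daily_peak_delay_minutes * working_days / 60.0

# Working backwards, Paris's 65.3 h/year corresponds to roughly
# 65.3 * 60 / 240 ~ 16.3 min of peak-period delay per working day.
print(round(annual_congestion_hours(16.325), 1))  # 65.3
```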


FIG. 1 Map of the selected cities.

TABLE 1 Cities Characteristics

Geography        City             Population (2016)  Land (km2)  Population Density (per km2)  Average Hours Drivers Spend in Congestion in 2016
France           Paris            10,950,000         2,845       3,700                         65.3
United Kingdom   London           10,470,000         1,738       5,600                         73.4
Russia           Moscow           16,710,000         5,698       2,900                         91.4
United States    New York         21,445,000         11,875      1,700                         89.4
United States    Los Angeles      15,500,000         6,299       2,300                         104.1
Japan            Tokyo-Yokohama   37,900,000         8,547       4,400


Regardless of their apparent differences in terms of road network topology, weather conditions, and geographical distances, one can expect emerging similarities between these cities as they share globalization characteristics. They are therefore representative candidates for tracking universal mobility and congestion patterns.

2.2 Data Gathering and Preprocessing

The analysis is based on raster images collected from one standard, commercial, freely available online traffic map (website). Data were collected between May 29 and June 3, 2017 at 5-min intervals across the six cities. In total, over 12,000 images were included in the original dataset, although a small percentage of images was discarded due to bad quality, as shown in Table 2. The percentage of data retained per city is listed in the same table.
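A sketch of the screenshot-quality filter implied by Table 2; the chapter does not state its rejection criteria, so the checks below (missing, truncated, or near-blank captures) and the `min_std` threshold are assumptions:

```python
import numpy as np

def is_usable(img, expected_shape, min_std=2.0):
    """Hypothetical quality filter for captured map screenshots:
    reject missing, truncated, or near-blank (uniform) images."""
    if img is None or img.shape != expected_shape:
        return False
    # A failed or blank tile renders as an almost uniform color,
    # so its pixel-value spread is close to zero.
    return float(img.std()) >= min_std

shape = (480, 640, 3)
good = np.random.randint(0, 256, shape, dtype=np.uint8)   # textured map
blank = np.full(shape, 230, dtype=np.uint8)               # failed render
print(is_usable(good, shape), is_usable(blank, shape))    # True False
```

Applying such a filter to the 2016 screenshots per city yields the "postcleaning" counts of Table 2.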

2.3 Extracting Traffic Information by Image Processing

The first step in extracting traffic information from map images relies on distinguishing and then capturing the traffic layers from the rest of the map, as shown in Fig. 2, providing a structured database for further analysis. The approach followed for achieving this is based on raster image processing of online traffic maps and makes use of some important image features:

i. the discretization of areas into seamless sections/digits (pixels, in raster images),

TABLE 2 Data Filtering Statistics Per City

Country          City          Precleaning Images   Postcleaning Images   Data Retained (%)
France           Paris         2016                 1939                  96.18
United Kingdom   London        2016                 1939                  96.18
Russia           Moscow        2016                 1932                  95.83
United States    New York      2016                 1967                  97.56
United States    Los Angeles   2016                 1932                  95.83
Japan            Tokyo         2016                 1939                  96.18



FIG. 2 Traffic layers’ extraction from online traffic maps for the cities: (A) Paris; (B) London; (C) Moscow; (D) New York; (E) Los Angeles; (F) Tokyo.

ii. each pixel has selected dimensions (length and width) representing an area at a specific location, and
iii. each pixel captures, in a seamless manner, the traffic characteristics of the area it represents.

Image processing of urban networks has been used for many years, but this back-engineering approach is valuable for postprocessing traffic map information and investigating its quantitative and qualitative characteristics in terms of traffic flow fundamentals, as a tool to further explore mobility across cities.


There are many ways of extracting color layers from raster images; here, the most straightforward and reliable way is preferred, based on identifying the color codes that are used for traffic depiction in a given color mode. Starting from the representation of maps by raster images, these use a three-dimensional matrix of size [r, c, 3], where each pixel at location [r, c] reflects the Red-Green-Blue (RGB) color model. When traffic maps are captured in raster images, the map space is discretized into homogeneous digits/pixels, each assigned one color. Several levels of discretization can be assumed and adopted for different purposes, making use of the pixels' characterization and the traffic state that they represent. For each color code, one representative value of the traffic variable can be assigned. For example, as each color represents a range of values for each traffic variable (traffic volume/flow, average speed, average density, etc.), one can reasonably assume one characteristic value for each color. Usually, green is assigned to the free-flow traffic state, orange to medium traffic, red to heavy traffic, and dark red to congestion. Representative average speed values for each traffic layer could be over 80 km/h for green, between 40 and 80 km/h for orange, between 20 and 40 km/h for red, and below 20 km/h for dark red.

3 TEMPORAL AND SPATIOTEMPORAL MOBILITY PATTERNS

3.1 Temporal Patterns

To quantify the similarity between sequential map images, the Structural Similarity (SSIM) index of Wang et al. (2004) is used, combining a luminance term l(x, y), a contrast term c(x, y), and a structure term s(x, y):

l(x, y) = (2μxμy + C1) / (μx^2 + μy^2 + C1)    (1)
c(x, y) = (2σxσy + C2) / (σx^2 + σy^2 + C2)    (2)
s(x, y) = (σxy + C3) / (σxσy + C3)             (3)
SSIM(x, y) = [l(x, y)]^α [c(x, y)]^β [s(x, y)]^γ    (4)

where α, β, γ > 0 are parameters used to adjust the relative importance of the three components, and μx, μy, σx, σy, and σxy are the local means, standard deviations, and the cross-covariance for images x, y. The parameters C1, C2, C3 are included to avoid instability when the denominator is very close to zero. The default value for the exponents (α, β, γ) is equal to one. In our case, the measure of similarity between two sequential images can capture and quantify the variation in mobility in the next time interval.
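The color-code matching described above can be sketched in a few lines of Python; the RGB triplets below are illustrative placeholders, since each map provider uses its own palette:

```python
import numpy as np

# Illustrative RGB codes for the four traffic layers; each map
# provider uses its own palette, so these values are assumptions.
TRAFFIC_COLORS = {
    "free_flow": (99, 214, 104),   # green
    "medium":    (255, 151, 77),   # orange
    "heavy":     (242, 60, 50),    # red
    "congested": (129, 31, 31),    # dark red
}

def extract_layers(img, tolerance=30):
    """Return a boolean mask per traffic layer by matching each pixel
    of an [r, c, 3] RGB image against the layer's color code."""
    masks = {}
    for layer, rgb in TRAFFIC_COLORS.items():
        diff = np.abs(img.astype(int) - np.array(rgb))
        masks[layer] = diff.max(axis=-1) <= tolerance
    return masks

img = np.zeros((4, 4, 3), dtype=np.uint8)     # black = no information
img[0, :] = TRAFFIC_COLORS["congested"]       # one congested row
img[1, :] = TRAFFIC_COLORS["free_flow"]       # one free-flow row
masks = extract_layers(img)
print(int(masks["congested"].sum()), int(masks["free_flow"].sum()))  # 4 4
```

The per-layer pixel share, `masks[layer].sum() / masks[layer].size`, is the quantity whose diurnal variation the temporal analysis tracks.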
Furthermore, if an initial image is chosen as a baseline (i.e., the off-peak-hour state of the network), the propagation of congestion can be monitored and compared across the cities. Thus, two different SSIM indices were used. The first, named SSIM1, compares image x to image y according to Eq. (5), where x corresponds to the image taken at time interval tk and y corresponds to the image taken at time interval t0. The variable k ranges over [1, 288] for one day's images, while time interval t0 refers to the off-peak-hour state of the network.

SSIM1(img_tk, img_t0) = l(img_tk, img_t0) · c(img_tk, img_t0) · s(img_tk, img_t0)    (5)

By calculating the SSIM1 index for all the selected cities, the variation of mobility patterns compared to the off-peak-hour state of the network can be dynamically monitored and compared. Obviously, if image x is identical to image y the index is equal to 1, while the more image x differs from y the closer the index is to 0. Fig. 5 shows the daily variation of the SSIM1 index across the six cities. Initially, the index values are equal or close to 1, as activity drops at night and the sequential images resemble the off-peak-hour state of the network. A decline is observed in the morning peak hours for all the cities, while Paris,

FIG. 5 SSIM1 index for the selected cities.


Moscow, and London show a steeper drop (from 1 to 0.7–0.85) compared to the other three. Tokyo, in particular, shows the smallest variation at the morning peak (SSIM1 equal to 0.9) and a more stable pattern in general. During the rest of the day, the variation becomes more stable, while a second drop is observed in the evening peak hours for all the cities. A rise then follows, while activity drops again during the night hours. To identify the alternation of traffic states in the diurnal cycle, as reflected in the sequence of the screenshots taken, a second index, SSIM2, was calculated as follows:

SSIM2(img_tk+1, img_tk) = l(img_tk+1, img_tk) · c(img_tk+1, img_tk) · s(img_tk+1, img_tk)    (6)

The SSIM2 index compares image x to image y, where x corresponds to the image taken at time interval tk+1 and y to the image taken at time interval tk, while k again ranges over [1, 288]. Fig. 6 plots the SSIM2 index for the same day and for all the selected cities. It can be said that all cities display a broadly comparable rhythm, common to all components of activity, based on the second index. Interestingly, Tokyo seems to differ, as a systematic alternation is observed in its sequence of images. Steep drops or peaks in these timeseries suggest a sharp transition between the mobility patterns at the next time interval; thus, SSIM2 could detect irregular operations and, moreover, anticipate upcoming special events. To further investigate the spatiotemporal dimension of mobility, a spatial separation of the total urban vehicles' activity was applied using the total pixels' marginal distributions per traffic layer (Figs. 7 and 8). For each city, kernel density plots per traffic layer were generated for every time interval. The following figures show the traffic layers' spatiotemporal distribution, grouped per traffic state, during the evening peak hour (18:00).
As the red and dark red traffic layers represent the congested links of the road network within a city, these two traffic layers were grouped together. Respectively, the green and orange traffic layers, which relate to light and moderate traffic, were treated as a separate cluster. So far, the temporal dimension of mobility patterns has been investigated across cities by monitoring the variation of pixel percentages per traffic layer and utilizing the SSIM indices. A common rhythm of mobility was observed across the cities during the day, while a closer look at the SSIM2 index revealed a dissimilar pattern for the city of Tokyo. In the next section, emphasis is given to the spatial dimension of mobility.
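The SSIM1/SSIM2 comparisons of Eqs. (5) and (6) can be sketched with a global-statistics SSIM (α = β = γ = 1); the constants follow the defaults of Wang et al. (2004) with C3 = C2/2, which the chapter does not specify, so they are assumptions:

```python
import numpy as np

def ssim(x, y, L=255, k1=0.01, k2=0.03):
    """Global-statistics SSIM = l(x,y) * c(x,y) * s(x,y) with
    alpha = beta = gamma = 1; constants follow Wang et al. (2004),
    C3 = C2/2 (an assumption)."""
    x, y = x.astype(float), y.astype(float)
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    C3 = C2 / 2
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)   # luminance
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)   # contrast
    s = (sxy + C3) / (sx * sy + C3)                     # structure
    return l * c * s

rng = np.random.default_rng(0)
base = rng.integers(0, 256, (64, 64)).astype(float)   # off-peak image t0
later = base + rng.normal(0, 40, base.shape)          # later image tk
print(round(ssim(base, base), 3))   # 1.0  (identical images)
# SSIM1(img_tk, img_t0) drops below 1 as the map diverges from the baseline:
print(ssim(later, base) < 1.0)      # True
```

Practical implementations compute SSIM over sliding local windows and average the result (e.g., `skimage.metrics.structural_similarity`); the global version above just keeps the l·c·s structure visible.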

3.2 Spatiotemporal Patterns

In this section, we go further in the exploration of the spatiotemporal patterns among cities. The first analysis (Figs. 3 and 4) indicated the total percentage of space, in terms of the road network, that remains uncongested across the cities


FIG. 6 SSIM2 index for the selected cities.

during the day. To further investigate the spatiotemporal dimension of mobility, a spatial separation of the total request activity was applied using the total pixels' marginal distributions per traffic layer. For each megacity, kernel density plots per traffic layer were generated for every time interval. The following figures show the traffic layers' spatial distribution, grouped in pairs of traffic layers. The peaks of the distributions indicate the inhomogeneity of activity in space. The spatial repartition of mobility among the six cities shows a concentric organization, with strong activity levels in the city centers of Moscow, New York, and Los Angeles, while Paris presents the most homogeneous profile. In Tokyo, the area near the port shows strong activity during the day that spreads toward the inner urban area. In London, one big center of activity is the City of London, while secondary centers can be found as one moves away from the City. The dynamic monitoring of these marginal distributions enables a deeper understanding of urban mobility patterns both in time and space. The observation of these sequences of plots allows the tracking of the areas that show stronger activity within the city and how they alternate in time.
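The marginal distributions can be sketched as a one-dimensional Gaussian kernel density over the positions of congested pixels; the bandwidth and the synthetic two-center sample below are illustrative assumptions:

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=10.0):
    """Gaussian kernel density estimate of a 1-D sample (e.g. the
    column indices of all congested pixels) evaluated on `grid`."""
    d = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * d ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Synthetic column positions of congested pixels: two activity
# centers, mimicking a polycentric city.
rng = np.random.default_rng(1)
cols = np.concatenate([rng.normal(120, 15, 500), rng.normal(400, 25, 500)])
grid = np.linspace(0.0, 640.0, 641)          # image width in pixels
density = gaussian_kde_1d(cols, grid)
# The estimate integrates to ~1 (grid spacing is 1 pixel) and peaks
# near the two activity centers.
print(abs(density.sum() - 1.0) < 0.02)       # True
```

Repeating the estimate for each 5-min frame gives the sequence of marginal plots whose peaks track the moving centers of activity.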



FIG. 7 Traffic layers' spatial distribution: (A) Paris, (B) London, (C) Tokyo.

4 DYNAMIC CLUSTERING AND PROPAGATION OF CONGESTION

As a next step, clustering analysis is used to identify the propagation of congestion and areas with similar congestion patterns within the city. In our approach, these clusters emerge through image segmentation into perceptually meaningful atomic regions, known as superpixels. The superpixel algorithm applied to our dataset is Simple Linear Iterative Clustering (SLIC)



FIG. 8 Traffic layers' spatial distribution: (A) Moscow, (B) New York, (C) Los Angeles.

(Achanta et al., 2012). By default, the only parameter of the algorithm is the desired number of approximately equally sized superpixels. In our case, this number was set to 3500, resulting in a grid of 14 × 14-pixel regions, each corresponding to an urban area of 150 m by 150 m. Fig. 9 presents the steps followed to produce the congestion patterns. The first step is the extraction of traffic information, as previously described. Here, only the red and dark red traffic layers were used, as they indicate heavy traffic and



FIG. 9 Steps followed to create congestion patterns: (Step 1) heavy traffic layers isolation; (Step 2) image segmentation to meaningful atomic regions by clustering pixels; (Step 3) setting the color of each superpixel region; (Step 4) congestion patterns visualization.

congestion. In the second step, the SLIC algorithm is applied and the initial image (step 1) is divided into several regions, in which two main clusters can be easily observed: the first contains grid-shaped regions, while the second consists of regions uniform in both size and shape. The grid-shaped area emerges due to the lack of traffic information (black pixels), while the shape changes in areas where pixels or their adjacent pixels are colored. In the next step, the color of each superpixel region is chosen according to the maximum RGB value of its pixels; thus, dark red and red dominate over black pixels and color the superpixel accordingly. Last, in step 4, a combination of the images in steps 2 and 3 gives a better visualization of the congestion patterns. This transition from the linear (link-based) perspective to regions/clusters with a similar level of congestion provides useful insights regarding the propagation of congestion both in space and time. Figs. 10 and 11 show the spatiotemporal variation of congestion during the morning peak hours for the cities of Moscow and Los Angeles, respectively. By comparing the two cities, we can see the differences in the propagation of congestion in space: in Moscow, the red clusters are scattered and their number rapidly increases as we move to the next time interval, whereas in Los Angeles congestion starts from the center and spreads to the periphery during the morning peak hour.
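Full SLIC is available off the shelf (e.g., `skimage.segmentation.slic`); the sketch below replaces steps 2–3 with a fixed-grid approximation, marking each 14 × 14-pixel cell as congested if any of its pixels carries a heavy-traffic color. The cell size matches the chapter's setup, but the region boundaries do not adapt to image content as SLIC's do:

```python
import numpy as np

def congestion_grid(heavy_mask, cell=14):
    """Simplified stand-in for superpixel steps 2-3: partition the
    heavy-traffic mask into `cell` x `cell` regions and mark a region
    as congested if any of its pixels is red/dark red (colored pixels
    dominate the black, no-information background)."""
    h, w = heavy_mask.shape
    h, w = h - h % cell, w - w % cell            # crop to a whole grid
    blocks = heavy_mask[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return blocks.any(axis=(1, 3))               # one flag per region

mask = np.zeros((140, 140), dtype=bool)
mask[30:36, 70:90] = True                        # one congested corridor
grid = congestion_grid(mask)
print(grid.shape, int(grid.sum()))               # (10, 10) 2
```

Differencing the resulting grids between consecutive 5-min frames is one way to quantify how the congested regions grow or shift, as in Figs. 10 and 11.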


FIG. 10 Spatiotemporal congestion patterns for Moscow during morning peak hours (07:00–08:30 a.m.)


FIG. 11 Spatiotemporal congestion patterns for Los Angeles during morning peak hours (07:00–08:30 a.m.)


Regarding the rest of the cities, monitoring of the sequential images (step 4) revealed congestion patterns similar to Los Angeles for London and Tokyo, where congestion starts from the City of London and the Tokyo port, respectively, and moves toward the inner city. Paris's congestion patterns resemble Moscow's, as its congestion clusters were scattered in space. On the contrary, in New York, congestion spread from the boroughs of Queens, the Bronx, and Brooklyn toward the core of New York, Manhattan, over the same time intervals. This clustering technique, based on traffic data, revealed dynamic human behaviors across cities in a more meaningful way than classic land-use mapping or mobile phone activity mapping. Furthermore, the regularities and differences revealed, attributable to geographic, economic, and technological factors, enrich our understanding of the multiple dynamics of city structures.

5 CONCLUSIONS

The current chapter proposed a new method to explore mobility patterns based on traffic data derived from online traffic maps. The main contributions of this approach can be summarized as the synthetic nature of the mobility data that online maps depict, and the extension of the application from one limited spatial entity (in general, a city) to the global coverage provided by traffic maps. Initially, the valuable traffic information was extracted using image processing techniques, resulting in a structured database for further analysis. The temporal variation of mobility patterns was analyzed first, revealing a broadly comparable rhythm across the cities. As a next step, emphasis was given to the spatiotemporal patterns. Kernel density plots were applied to investigate the homogeneity of activity within each city. This spatial repartition of mobility among the six cities showed a concentric organization, with strong activity levels in the city centers of Moscow, New York, and Los Angeles, while Paris presented the most homogeneous profile. In Tokyo, the area near the port showed strong activity during the day that spread toward the inner urban area. Last, clustering analysis was employed to identify the propagation of congestion and areas with similar congestion patterns within each city. The transition from the linear perspective to regions/clusters with a similar level of congestion provided useful insights regarding the propagation of congestion both in space and time. Comparing the cities, Moscow's and Paris's congestion clusters were scattered, and their number increased rapidly from one time interval to the next, whereas in Los Angeles, London, and Tokyo congestion started from the center and spread to the periphery during the morning peak hour. On the other hand, New York exhibited a different congestion pattern, as strong activity moved from the periphery boroughs toward the core of New York, Manhattan.
On a broader note, this chapter presented a new method of exploring mobility patterns in the era of Big Data. We confronted the challenge of collecting mobile traffic data by utilizing web mapping and then proceeded to further


analysis, enriching our understanding of human mobility. Much more remains to be researched, extending this preliminary analysis with more analytical and computational means toward understanding human mobility at the city scale.

REFERENCES

Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S., 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34 (11), 2274–2281.
Asur, S., Huberman, B.A., 2010. Predicting the future with social media. Proc. Int. Conf. Web Intell. Intell. Agent Technol., 492–499.
Bugliarello, G., Boland, J., Gakenheimer, R., Kahan, R., Assembly, U., Kasarda, J., Richardson, H., Rowland, F.S., 1996. Committee on Megacity Challenges.
Calabrese, F., Ratti, C., Colonna, M., Lovisolo, P., Parata, D., 2011. Real-time urban monitoring using cell phones: a case study in Rome. IEEE Trans. Intell. Transp. Syst. 12 (1), 141–151.
Candia, J., González, M.C., Wang, P., Schoenharl, T., Madey, G., Barabási, A.L., 2008. Uncovering individual and collective human dynamics from mobile phone records. J. Phys. A Math. Theor. 41 (22), 224015.
Chaniotakis, E., Antoniou, C., Aifadopoulou, G., Dimitriou, L., 2016. Inferring activities from social media data. 96th Annu. Meet. Transp. Res. Board.
Claudel, M., Nagel, T., Ratti, C., 2016. From origins to destinations: the past, present and future of visualizing flow maps. Built Environ. 42 (3), 338–355.
Cookson, G., Pishue, B., 2017. INRIX global traffic scorecard. Inrix Glob. Traffic Scorec., p. 44.
Cottineau, C., Hatna, E., Arcaute, E., Batty, M., 2017. Diverse cities or the systematic paradox of urban scaling laws. Comput. Environ. Urban Syst. 63, 80–94.
Demographia, 2017. Demographia world urban areas & population projections. Demographia, 18.
González, M.C., Hidalgo, C.A., Barabási, A.-L., 2008. Understanding individual human mobility patterns. Nature 453 (June), 779–782.
Grauwin, S., Sobolevsky, S., Moritz, S., Gódor, I., Ratti, C., 2015. Towards a comparative science of cities: using mobile traffic records in New York, London, and Hong Kong. In: Computational Approaches for Urban Environments, pp. 363–387.
Kloeckl, K., Senn, O., Ratti, C., 2012. Enabling the real-time city: LIVE Singapore! J. Urban Technol. 19 (2), 89–112.
Noulas, A., Scellato, S., Lambiotte, R., Pontil, M., Mascolo, C., 2012. A tale of many cities: universal patterns in human urban mobility. PLoS ONE 7 (5), e37027.
Sakaki, T., Okazaki, M., Matsuo, Y., 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. Proc. 19th Int. Conf. World Wide Web, pp. 851–860.
Song, C., Qu, Z., Blumm, N., Barabási, A.L., 2010. Limits of predictability in human mobility. Science 327 (5968), 1018–1021.
Veenendaal, B., 2016. Eras of web mapping developments: past, present and future. Int. Arch. Photogr. Remote Sens. Spat. Inf. Sci. – ISPRS Arch. 41, 247–252.
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), 600–612.
Wanichayapong, N., Pruthipunyaskul, W., Pattara-Atikom, W., Chaovalit, P., 2011. Social-based traffic information extraction and classification. 11th Int. Conf. ITS Telecommun., pp. 107–112.

Chapter 14

Pavement Patch Defects Detection and Classification Using Smartphones, Vibration Signals and Video Images

Symeon E. Christodoulou, Charalambos Kyriakou and George Hadjidemetriou

Department of Civil and Environmental Engineering, University of Cyprus, Nicosia, Cyprus

Chapter Outline
1 Introduction
2 Brief Literature Review
2.1 Vibration-Based Methods
2.2 Vision-Based Methods
3 Methodology
3.1 Anomaly Detection Using ANNs and Timeseries Analysis of Vibration Signals
3.2 Anomaly Detection Using Entropic-Filter Image Segmentation
3.3 Patch Detection and Measurement Using Support Vector Machines (SVM)
4 Conclusions
References

1 INTRODUCTION

In recent years, several national and transnational roadway management programs, such as the US "LTPP" and the EU's "TEN-T" programs, have been put into action in an effort to improve the condition of transport networks and to mitigate the effects of time and heavy usage on these networks. In fact, several regional and international studies estimate the annual potential impacts of changes in roadway maintenance expenditures (as these impacts relate to vehicle operating costs, safety, the environment, and the wider economy) at billions of dollars worldwide (Chatti and Zaabar, 2012; National Economic Council, 2014; Gleave et al., 2014). It is thus becoming increasingly important that automated roadway pavement condition assessment technologies are employed, so that sustainable and efficient roadway network management

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00014-2 © 2019 Elsevier Inc. All rights reserved.



systems are developed. A key function of such technologies should be the automatic detection and classification of roadway defects. Pavement surface condition, determined by the anomalies in the pavement surface that affect the ride quality of a vehicle, is one of the most significant indicators of road quality. Not only can pavement surface anomalies damage vehicles and cause unpleasant driving, but they are also a source of traffic accidents, injuries, and/or fatalities. Further, the condition of pavement surfaces is in constant flux over time, as pavements deteriorate from causes related to location, materials used, traffic, weather, and so on. Currently, even though continuous monitoring of pavement surface defects would improve the ride quality and safety of travellers, pavement monitoring agencies typically monitor and assess pavement quality approximately once a year, because current methods are expensive and laborious. The rise, though, of low-cost technologies could provide a viable solution to the aforementioned problem in the form of distributed mobile sensing by use of connected vehicles and smartphones. In recent years, smartphone technology has gained significant attention within the infrastructure, transportation, and automotive industries. Smartphones can be exploited to collect vehicle sensor data by use of their built-in sensors (such as accelerometer, gyroscope, and GPS sensors), and/or linked to on-board diagnostic (OBD) devices to collect data from the host vehicle on how it performs while in movement. This combination of hardware and software mechanisms can enable the real-time monitoring of, among others, GPS latitude and longitude, forward and lateral acceleration, and vehicle roll and pitch. The goal of roadway anomaly detection by use of smartphone technology, OBD devices, and vehicles is set in parallel with the premise that such technology can be applied in GIS-based pavement management systems (PMS).
An adequate number of vehicles collecting such crowd-sourced data can be used to generate georeferenced events at points where vehicles encounter pavement surface anomalies within a roadway network. Even though multiple vehicles may provide conflicting data with regard to pavement surface conditions, the aggregate effect and the joint “knowledge provided by participatory sensing” inherent in the collected data do provide an accurate model of the pavement surface in relation to how an average user experiences the roadway condition. With regard to the volume and complexity of the data processed, “big data” techniques may provide viable solutions to the challenges posed by the problem. The term “big data” does not necessarily refer to the size of the datasets being processed (even though in this case the sensor and image-related data is voluminous); it often refers simply to the use of predictive and other advanced data analytics methods that extract value from data. The chapter presents both a data-driven vibration-based method for the detection and classification of pavement anomalies by use of low-cost

Pavement Patch Defects Detection, Chapter 14, p. 367

(smartphone) technology, and a vision-based method for the enhancement of the detection process. The vibration-based component makes use of artificial neural networks (ANN) for data mining and of time-series analysis of vibration signals to detect vibration-inducing roadway anomalies, while the vision-based method uses video data, image segmentation via entropy texture filters, and object classification via support vector machines (SVM) to detect roadway anomalies. Further to this brief introduction, the chapter includes a literature review on the state of knowledge in automated roadway anomaly detection, followed by brief discussions of the methods used in the analysis: data mining, ANN, texture segmentation, entropy filters, and SVMs. The section on methodology setup presents the developed data collection system and methods, while the results and discussion section discusses the processes and tools used to detect and classify pavement anomalies. The main characteristics of a case-study pothole-detection implementation are then presented; a patch detection analysis is performed for a case-study roadway; and finally the results and efficiency of the proposed methods are discussed.

2 BRIEF LITERATURE REVIEW

The automated detection of roadway pavement anomalies by use of low-cost technologies has been the focus of several research efforts in the past decade, with these efforts generally classified into two categories: vibration-based and vision-based methods. A brief summary of some of these approaches and of their findings follows.

2.1 Vibration-Based Methods

Vittorio et al. (2014) proposed a system based on a simple smartphone application that uses a GPS receiver and a three-axis accelerometer to collect acceleration data caused by vehicles’ motion over road anomalies. The high-energy events (anomalies) are identified by monitoring and measuring the vertical acceleration impulse. Seraj et al. (2014) proposed a system that detects road anomalies using mobile phones equipped with inertial accelerometers and gyroscope sensors. Alessandroni et al. (2014) proposed a system which combined a custom mobile application with a georeferenced database system; a roughness score was calculated and stored in a back-end geographic information system for visualizing road conditions. Mohamed et al. (2015), to avoid false-positive signals when there was a sudden change in motion acceleration, suggested the gyroscope rotation around gravity as the main indicator for road anomalies, in addition to the accelerometer sensor. Jang et al. (2016) proposed an automated method to obtain up-to-date information about potholes by using a vehicle-mounted mobile data collection kit. In each mobile data collection kit, a triaxial accelerometer and a GPS sensor collect data for the detection of


street defects. At a back-end server, a street-defect algorithm which relies on a supervised machine learning technique and a trajectory clustering algorithm enhances the performance of the proposed monitoring system. The above systems, despite hardware differences in terms of GPS accuracy and accelerometer sampling rate and noise, show that pothole detection is possible. Bridgelall (2015) developed theoretical precision bounds for a ride index called the road impact factor and demonstrated its relationship with vehicle suspension parameter variances. The 2014 Mercedes-Benz S-Class used a Light-Detection-and-Ranging (LiDAR) scanner to estimate pavement surface roughness as part of an active suspension system. Lately, Jaguar Land Rover has been testing a new connected vehicle technology which permits a vehicle to spot hazardous potholes in the roadway and then share this data in real time with other Jaguar Land Rover vehicles (O’Donnell, 2015). Kyriakou et al. (2016, 2017) explored the use of data collected by smartphone sensors and by vehicles’ OBD-II devices while vehicles were in motion, for the detection and classification of pavement surface anomalies. The proposed system architecture was complemented with artificial neural network techniques for classifying detected roadway anomalies. The proposed system was trained, validated, and tested against three types of common roadway anomalies, exhibiting an accuracy rate above 90%.

2.2 Vision-Based Methods

In the work by Nejad and Zakeri (2011) an automated imaging system was described for distress detection in asphalt pavements. The work focused on comparing the discriminating power of several multiresolution texture analysis techniques using wavelet-, ridgelet-, and curvelet-based texture descriptors, and concluded that curvelet-based signatures outperform all other multiresolution techniques for pothole distress (yielding accuracy rates of 97.9%), while ridgelet-based signatures outperform all other multiresolution techniques for cracking distress (accuracy rates of 93.6%–96.4%). A computer-vision approach was also the subject of the work by Koch and Brilakis (2011), who proposed a method for automated pothole detection by which an image is first segmented into defect and nondefect regions using histogram shape-based thresholding, and then the texture inside a potential defect shape is extracted and compared with the texture of the surrounding nondefect pavement to determine whether the region of interest represents an actual pothole. This camera-based pothole-detection method was subsequently extended by Koch et al. (2013) to assess the severity of potholes, by incrementally updating a representative texture template for intact pavement regions and using a vision tracker to reduce the computational effort. Related was also the work by Jog et al. (2012), who used vision-based data for both 2D recognition and 3D reconstruction based on the visual and spatial characteristics of potholes, with the measured properties used to assess pothole severity.


Citing limitations of camera-based methods, Yu and Salari (2011) proposed the use of laser imaging and described a method by which regions in captured images corresponding to potholes are represented by a matrix of square tiles and the estimated shape of the pothole is determined. The vertical and horizontal distress measures, the total number of distress tiles, and the depth index information are calculated, providing input to a three-layer feed-forward neural network for pothole severity and crack type classification. A vision approach was also employed by Murthy and Varaprasad (2014), who used images obtained from a camera mounted on top of a vehicle and custom MATLAB code to detect potholes. In the work by Ryu et al. (2015) a pothole detection method was proposed using various features in two-dimensional images. The proposed method first uses a histogram and the closing operation of a morphology filter to extract dark regions for pothole detection; candidate pothole regions are then extracted using features such as size and compactness; finally, a decision is made on whether candidate regions are potholes by comparing pothole and background features. Radopoulou and Brilakis (2015) presented an application of the Semantic Texton Forests (STF) algorithm for automatically detecting patches, potholes, and three types of cracks in video frames captured by a common parking camera, reporting over 70% accuracy in all of the tests performed, and over 75% precision for most of the defects. Subsequently, Radopoulou et al. (2016) utilized video data collected from a car’s parking camera to detect defects in frames and classify them, reporting that the initial identification of frames including defects produced an accuracy of 96% and approximately 97% precision. A vision-based approach was, finally, employed by Li et al. (2016), who proposed a method to integrate the processing of two-dimensional images and of ground penetrating radar (GPR) data for pothole detection. The images and GPR scans are first preprocessed and a pothole detector designed by investigating the patterns of GPR signals; the position and dimension of the detected potholes are then estimated from the GPR data and mapped to the image to enable a localized shape segmentation. The researchers reported a precision, recall, and accuracy of 94.7%, 90%, and 88%, respectively.
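For context on the precision, recall, and accuracy figures quoted throughout this review, these metrics derive from the binary confusion matrix in the standard way. A minimal sketch follows (pure Python; the function name and the example counts are illustrative, not from any of the cited works):

```python
# Illustrative helper: precision, recall, and accuracy from
# binary confusion-matrix counts (true/false positives/negatives).
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                  # of predicted defects, how many are real
    recall = tp / (tp + fn)                     # of real defects, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall fraction of correct decisions
    return precision, recall, accuracy

# e.g., 9 defects found, 1 false alarm, 1 miss, 9 correct rejections
print(classification_metrics(9, 1, 1, 9))  # → (0.9, 0.9, 0.9)
```

Note that precision and recall respond differently to class imbalance, which is why the cited works report them alongside accuracy.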

3 METHODOLOGY

3.1 Anomaly Detection Using ANNs and Time-Series Analysis of Vibration Signals

The described research work focuses on three types of common pavement surface anomalies (transverse depressions/patches, Fig. 1A; longitudinal depressions/patches, Fig. 1B; and potholes or manholes, Fig. 1C). Data on these types of pavement surface anomalies is collected in situ by use of a car equipped with a smartphone (mounted on the car’s windshield) with its sensors turned on, and with an OBD-II reader attached to it. The smartphone was also fitted with an Android application for recording (and exporting) sensor readings.


FIG. 1 Pavement surface anomaly types examined for detection and classification: (A) transverse defect/anomaly; (B) longitudinal defect/anomaly; (C) potholes/manholes.

Further, to visually verify the existence of a pavement surface anomaly (as detected by the sensor data), the smartphone also had its video camera active, recording the routes travelled. The sensor data, collected at intervals of 0.1 s, included a total of 14 unidimensional (e.g., X, Y, Z accelerations, speed, etc.) and two-dimensional indicators (e.g., the smartphone’s roll and pitch values). The smartphone was also fitted with the DashCommand application for recording sensor readings. Vehicle system data is transmitted through the OBD-II reader to the smartphone device and then transferred to a back-end server for processing and storage. At the back-end server, defect detection algorithms based on artificial neural network techniques, robust regression analysis, various algorithms, and a bagged-trees classification model enhance the performance of the proposed monitoring system, by integrating data collected from multiple sensors and deducing knowledge from these participatory sensors. Mathematically, the proposed method is based on rigid-body dynamics and in particular the roll, pitch, and yaw rotations about the object’s XYZ axes (Kyriakou et al., 2017). In essence, the roll metric refers to a car’s acceleration variation between its left and right front wheels, while the pitch metric refers to a car’s acceleration variation between its front and rear wheels. Concurrently, the roll and pitch values define the way in which the host car is off balance (sideways and front/back). The datasets are fed into an artificial neural network (ANN) consisting of 4 inputs (forward acceleration, lateral acceleration, vehicle pitch, vehicle roll), 10 hidden neurons, and 4 outputs (Class Type 0, Class Type 1, Class Type 2, Class Type 3). The ANN outputs are binary in nature (“0” for no defect, “1” for defect) and are used to classify data readings into classes of roadway anomalies.
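Roll and pitch values of the kind described above are commonly derived from a smartphone's triaxial accelerometer. The sketch below shows one textbook tilt estimate, assuming gravity dominates the measured signal; it is an illustration of the general principle, not necessarily the chapter's exact computation:

```python
import math

# Illustrative sketch: device roll and pitch (radians) from a triaxial
# accelerometer reading, assuming the dominant signal is gravity.
def roll_pitch(ax, ay, az):
    roll = math.atan2(ay, az)                    # sideways tilt, about the X axis
    pitch = math.atan2(-ax, math.hypot(ay, az))  # front/back tilt, about the Y axis
    return roll, pitch

# A phone lying flat (gravity entirely along Z) shows zero roll and pitch.
print(roll_pitch(0.0, 0.0, 9.81))
```

In practice the 0.1-s samples would be low-pass filtered before such a computation, since road-induced vibration adds high-frequency components on top of gravity.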
The ANN is first trained for each case of roadway anomaly (as given by Fig. 1), and then trained with all three roadway anomalies in tandem (Class Type 0, Class Type 1, Class Type 2, Class Type 3). The results of the ANN-based pattern recognition and pavement anomaly classification are shown in the confusion matrix of Fig. 2. In essence, the ANN classifier detects and accurately categorizes the three roadway anomalies (target classes “2”, “3”, and “4”) while also distinguishing the “no defect”


FIG. 2 ANN confusion matrix.

condition (target class “1”), thus separating normal and abnormal roadway pavement conditions. The vibration-based method, though, fails to efficiently cover the entire roadway and, more importantly, fails to detect non-vibration-inducing pavement anomalies; hence the need for a vision-based method to complement the aforementioned vibration-based approach.

3.2 Anomaly Detection Using Entropic-Filter Image Segmentation

The proposed vision-based approach makes use of image texture segmentation with entropy texture filters, and has been implemented with MATLAB’s computer vision toolbox. Entropy is a statistical measure of randomness, and an entropy filter can characterize the texture of an image by providing information about the local variability of the intensity values of pixels in an image. For example, in areas with smooth texture, the range of values in the neighborhood around a pixel will be small; in areas of rough texture, the range will be larger. Similarly, calculating the standard deviation of pixels in a neighborhood can indicate the degree of variability of pixel values in that region. The entropy (E) of a grayscale image (I) is defined as E = -sum(p .* log2(p)), where p contains the histogram counts of the intensity image. By default, entropy uses two bins for logical arrays and 256 bins for uint8, uint16, or double arrays. The entropy filter (J = entropyfilt(I)) of a grayscale image returns the array J, where each output pixel contains the entropy value of the 9-by-9 neighborhood around the corresponding pixel in the input image I (Fig. 3). Thus, the entropy filter creates a texture image. For pixels on the borders of I, entropyfilt uses symmetric padding, where the values of padding pixels are a mirror reflection of the border pixels in I.
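The entropy definition above fits in a few lines. The sketch below (pure Python; the function name is illustrative) computes E = -sum(p * log2(p)) over the intensity histogram of a flattened image, mirroring the global entropy computation; binning is implicit here since uint8 intensities are already 256 discrete values:

```python
import math
from collections import Counter

# Illustrative sketch of the Shannon entropy of a grayscale image,
# E = -sum(p * log2(p)), with p the normalized intensity histogram.
def image_entropy(pixels):
    n = len(pixels)
    counts = Counter(pixels)  # histogram: intensity value -> count
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Half black, half white: two equally likely intensities -> 1 bit of entropy.
print(image_entropy([0] * 8 + [255] * 8))  # → 1.0
```

The entropy *filter* applies the same formula to the 9-by-9 neighborhood of every pixel rather than to the whole image, which is what turns intensity data into a texture image.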


FIG. 3 Sample entropyfilt() calculations.

[Fig. 4 panels: (A) grayscale original image, 1080 rows by 1920 columns; (B) entropy-filtered grayscale image, 1080 rows by 1920 columns; (C) zero-count image, 8 rows by 14 columns; (D) texture-anomaly-marked image, 1080 rows by 1920 columns.]

FIG. 4 Sample texture segmentation and anomaly detection using entropy filters.

The steps used in the proposed entropy texture segmentation approach are as listed below, with Fig. 4 serving as a reference for the resulting image at each analysis step:

1. Read a video of the roadway pavement to be analyzed.
2. For each video frame, convert it to a grayscale image (Fig. 4A) and calculate the overall image entropy.
3. If the computed image entropy deviates from the running average, then presume that the image contains a pavement anomaly (manifested in the image as a texture anomaly) and isolate it for further analysis:
   - Create a texture image (Fig. 4B).
   - Threshold the image to segment the textures (a threshold value of 0.8 is used as the default value, for it is roughly the intensity value of pixels along the boundary between the textures). A function is also used to smooth the edges and to close any open holes in objects.
   - Partition the entropy image (Fig. 4B) into a grid (in this case from 1080 × 1920 pixels to 18 × 32 cells), and count the proportion of black-to-white pixels in each grid cell. Threshold this ratio (say at 80%) and output the resulting image (Fig. 4C). White regions indicate texture anomalies.
   - Rescale the thresholded texture image (Fig. 4C) back to the original image dimensions and display the segmentation results, marking the corresponding image areas as pavement anomalies (Fig. 4D).
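The grid-partition step above can be sketched as follows (pure Python; the function name, the cell arithmetic, and the convention that 1-pixels mark anomalous texture are illustrative assumptions, not the chapter's code):

```python
# Illustrative sketch of the grid-partition step: split a binary texture
# image into cells and mark a cell as anomalous when the proportion of
# "anomalous" (1) pixels in it meets the threshold (80% in the text).
def grid_anomaly_mask(binary_img, cell_h, cell_w, ratio_thresh=0.8):
    rows, cols = len(binary_img), len(binary_img[0])
    mask = []
    for r0 in range(0, rows, cell_h):
        row_cells = []
        for c0 in range(0, cols, cell_w):
            cell = [binary_img[r][c]
                    for r in range(r0, min(r0 + cell_h, rows))
                    for c in range(c0, min(c0 + cell_w, cols))]
            row_cells.append(1 if sum(cell) / len(cell) >= ratio_thresh else 0)
        mask.append(row_cells)
    return mask

# 4 x 4 image, 2 x 2 cells: only the top-left cell is fully anomalous.
img = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 1, 0]]
print(grid_anomaly_mask(img, 2, 2))  # → [[1, 0], [0, 0]]
```

With 60 × 60-pixel cells this maps a 1080 × 1920 entropy image onto the 18 × 32 grid mentioned in the text.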


3.3 Patch Detection and Measurement Using Support Vector Machines (SVM)

The texture segmentation approach has been complemented with support vector machines (SVM) and applied to patch detection and measurement. SVMs are supervised machine learning models which identify patterns after being given labeled training data. An SVM workflow is divided into two main phases (training and testing), and it can be used efficiently for two-group classification problems such as the presented research study (“patch” vs. “no-patch” classes). The key steps of the presented algorithm (Hadjidemetriou et al., 2016) are shown in Fig. 5 for both the training and testing stages, and the input consists of the pavement surface images extracted from the roadway videos. The SVM training phase begins with transforming collected pavement video frames into grayscale images; thus, every pixel has a grayscale value in the range from 0 (black) to 255 (white). The presented system uses only one SVM, which is trained by labeled data (ground truth) and feature vectors. The ground truth provides the algorithm with data regarding the pixels of each frame which are part of a pavement patch. Each feature vector is generated, and subsequently the SVM is trained, by extracting information from nonoverlapping areas within the frame, 20 × 20 pixels in width and height. The selection of block size is based on usual image resolutions, whose

FIG. 5 Training and testing stages of the proposed patch detection algorithm.


dimensions are multiples of 20 (e.g., 640 × 480), so that blocks cover the whole image. A number of block sizes which fulfill this criterion (e.g., 10 × 10) were tested, and a trial-and-error technique designated the final block size (20 × 20). One should also note that blocks comprising substantial proportions of both patch and nonpatch areas (i.e., the patch area is more than 5% and less than 95% of the block) are not used, to facilitate the training of the SVM and consequently its ability to distinguish “patch” from “no-patch” areas. Every feature vector, corresponding to a square block, is generated from the local intensity histogram and two texture descriptors, namely the two-dimensional Discrete Cosine Transform (DCT) and the Gray-Level Co-occurrence Matrix (GLCM). The DCT, which can be used efficiently for pattern recognition purposes, expresses a finite number of data points as a weighted sum of cosine functions oscillating at diverse frequencies. The GLCM is a statistical method that examines the spatial relationship of pixels; its functions can characterize the texture of a picture by creating a matrix which contains the estimated frequencies of occurrence of pixel pairs with definite values in a specific spatial relationship. The presented method extracts data from this matrix to calculate and then use the statistical measures of contrast, correlation, energy, and homogeneity. The SVM training stage is followed by a testing phase (Fig. 5B). Its flow is similar to that of the SVM training, starting with transforming RGB pavement frames into grayscale images and dividing them into square blocks of 20 × 20 pixels. A feature vector is formed from the local intensity histogram and the two texture descriptors for each square block. The flowchart continues with the feature vector used by the SVM to classify each block of the testing pavement picture into the “patch” (1) or “no-patch” (0) category. Fig. 6 depicts the way patch areas are identified by the algorithm, where yellow-colored blocks represent the “patch” class and blue-colored cells correspond to the “no-patch” group. Further, the morphological operation of closing is applied to fill and eliminate blocks which are classified differently from their surrounding blocks, by changing their label from 0 to 1 and vice versa (Fig. 7). Finally, a trial-and-error technique is used to define the number of connected “patch” blocks (50) which indicates the presence of a patch in an image. Fig. 7 presents an example of a frame correctly classified as “including patches,” as it has more than 50 connected blocks, and a second example correctly identified as “not including patches,” even though it has 46 false-positive connected blocks. If a patch has just appeared in the video view (thus covering a limited proportion of the frame) and the identified connected blocks are fewer than 50, it will not be detected by the algorithm. However, the next image extracted from the video will include a greater percentage of the patch, providing the opportunity for it to be detected. Consequently, the algorithm, after the block classification,

FIG. 6 Examples of processed images by the proposed algorithm.


FIG. 7 Application of the morphological operation.

discriminates between the images which include parts of patches and the frames which do not contain any patch parts (image classification). At this point, the difference between presence and detection should be clarified: the former answers with a “yes” or a “no” the question of whether an examined object occurs in an image, while the latter provides information regarding the location of the object in the image. Whereas a range of algorithms is restricted to identifying only the presence of a distress, classifying images into damaged and undamaged pavement, the proposed algorithm achieves both presence identification and detection of patches. The method has been successfully field-tested on case-study roadways, using either a generic dash cam or a smartphone camera, with the accuracy, precision, and recall rates shown in Table 1. Beyond the obtained accuracy levels, the presented method is characterized by some strong advantages, such as the identification of multiple patches in a single image, or the detection of portions of patches when their entire area is not included in the image.
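The connected-blocks decision rule described above can be sketched with a standard connected-component search. The sketch below assumes 4-connectivity between blocks (the chapter does not state which connectivity is used) and illustrative names:

```python
from collections import deque

# Illustrative sketch: size of the largest connected component of
# "patch"-labeled (1) blocks; a frame is flagged as "including patches"
# when this size reaches a minimum (50 in the chapter's tuning).
def largest_patch_component(blocks):
    rows, cols = len(blocks), len(blocks[0])
    seen = [[False] * cols for _ in range(rows)]
    best = 0
    for r in range(rows):
        for c in range(cols):
            if blocks[r][c] == 1 and not seen[r][c]:
                size, queue = 0, deque([(r, c)])
                seen[r][c] = True
                while queue:  # breadth-first flood fill over 4-neighbors
                    y, x = queue.popleft()
                    size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and blocks[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                best = max(best, size)
    return best

grid = [[1, 1, 0],
        [0, 1, 0],
        [0, 0, 1]]
print(largest_patch_component(grid))  # → 3 (so below a threshold of 50)
```

Requiring a sizable connected component, rather than a raw count of "patch" blocks, is what lets the algorithm ignore the scattered false-positive blocks mentioned above.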

4 CONCLUSIONS

The chapter presented a vibration-based and a vision-based method, working in tandem, for the detection and classification of roadway anomalies by use of “big-data” tools and low-cost smartphone technology. The popularity of smartphone technology in vehicles and the advancement of “big-data” technologies provide an opportunity to efficiently collect vehicle data and process it by use of connected and distributed systems. Even though vehicle data is not likely to directly provide traditional assessment metrics (such as IRI and PCI), new metrics might supplement and eventually supplant traditional metrics. The applied methodology is readily available, low-cost, and precise, and can be utilized in crowdsourced applications leading to roadway assessment and pavement management systems.


TABLE 1 The Performance of Blocks and Images Classification: (A) Generic Dash Camera; (B) Smartphone Camera

(A) Generic dash camera
            Blocks classification   Images classification
Accuracy    82.9%                   82.5%
Precision   65.6%                   77.8%
Recall      92.0%                   91.0%

(B) Smartphone camera
            Blocks classification   Images classification
Accuracy    80.5%                   80.0%
Precision   63.8%                   75.4%
Recall      89.4%                   89.0%


REFERENCES

Alessandroni, G., Klopfenstein, L., Delpriori, S., Dromedari, M., Luchetti, G., Paolini, B., Seraghiti, A., Lattanzi, E., Freschi, V., Carini, A., 2014. SmartRoadSense: collaborative road surface condition monitoring. In: UBICOMM, The Eighth International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, 24–28 August, Rome, Italy.
Bridgelall, R., 2015. Precision bounds of pavement deterioration forecasts from connected vehicles. J. Infrastruct. Syst. 21 (1), 1–7.
Chatti, K., Zaabar, I., 2012. Estimating the Effects of Pavement Condition on Vehicle Operating Costs. Vol. 720. Transportation Research Board.
Gleave, S.D., Frisoni, R., Dionori, F., Casullo, L., Vollath, C., Devenish, L., Spano, F., Sawicki, T., Carl, S., Lidia, R., Neri, J., Silaghi, R., Stanghellini, A., 2014. EU Road Surfaces: Economic and Safety Impact of the Lack of Regular Road Maintenance. European Parliament, Directorate General for Internal Policies, Policy Department B: Structural and Cohesion Policies, Transport and Tourism.
Hadjidemetriou, G.M., Christodoulou, S.E., Vela, P.A., 2016. Automated detection of pavement patches utilizing support vector machine classification. In: Proc., 18th Mediterranean Electrotechnical Conference (MELECON 2016). IEEE, pp. 1–5.
Jang, J., Yang, Y., Smyth, A., Cavalcanti, D., Kumar, R., 2016. Framework of data acquisition and integration for the detection of pavement distress via multiple vehicles. J. Comput. Civil Eng. 04016052.
Jog, G., Koch, C., Golparvar-Fard, M., Brilakis, I., 2012. Pothole properties measurement through visual 2D recognition and 3D reconstruction. In: Proc., ASCE International Conference on Computing in Civil Engineering, pp. 553–560.
Koch, C., Brilakis, I., 2011. Pothole detection in asphalt pavement images. Adv. Eng. Inform. 25 (3), 507–515.
Koch, C., Jog, G., Brilakis, I., 2013. Automated pothole distress assessment using asphalt pavement video data. J. Comput. Civil Eng. 27 (4), 370–378.
Kyriakou, C., Christodoulou, S.E., Dimitriou, L., 2016. Road anomaly detection and classification using smartphones and artificial neural networks. In: TRB, The Transportation Research Board 95th Annual Meeting, 10–14 January, Washington, DC, USA.
Kyriakou, C., Christodoulou, S.E., Dimitriou, L., 2017. Detecting and classifying roadway pavement anomalies utilizing smartphones, on-board diagnostic devices and classification models. In: TRB, The Transportation Research Board 96th Annual Meeting, 8–12 January, Washington, DC, USA.
Li, S., Yuan, C., Liu, D., Cai, H., 2016. Integrated processing of image and GPR data for automated pothole detection. J. Comput. Civil Eng. 04016015.
Mohamed, A., Fouad, M., Elhariri, E., El-Bendary, N., Zawbaa, H.M., Tahoun, M., Hassanien, A.E., 2015. RoadMonitor: an intelligent road surface condition monitoring system. In: Intelligent Systems’ 2014, vol. 323. Springer, pp. 377–387.
Murthy, S., Varaprasad, G., 2014. Detection of potholes in autonomous vehicle. IET Intell. Trans. Syst. 8 (6), 543–549.
National Economic Council: The President’s Council of Economic Advisers, 2014. An Economic Analysis of Transportation Infrastructure Investment. The White House, Washington, DC.
Nejad, F., Zakeri, H., 2011. A comparison of multi-resolution methods for detection and isolation of pavement distress. Expert Syst. Appl. 38 (3), 2857–2872.


O’Donnell, N., 2015. Jaguar Land Rover Announces Technology Research Project to Detect, Predict and Share Data on Potholes [online]. Available from: http://newsroom.jaguarlandrover.com/enin/jlr-corp/news/2015/06/jlr_pothole_alert_research_100615 (accessed 15.07.15).
Radopoulou, S., Brilakis, I., 2015. Detection of multiple road defects for pavement condition assessment. In: Proc., EG-ICE 2015, 22nd Workshop of the European Group of Intelligent Computing in Engineering.
Radopoulou, S., Brilakis, I., Doycheva, K., Koch, C., 2016. A framework for automated pavement condition monitoring. In: Proc., Construction Research Congress 2016: Old and New Construction Technologies Converge in Historic San Juan, CRC 2016, pp. 770–779.
Ryu, S.-K., Kim, T., Kim, Y.-R., 2015. Feature-based pothole detection in two-dimensional images. Transp. Res. Rec. 2528, 9–17.
Seraj, F., van der Zwaag, B.J., Dilo, A., Luarasi, T., Havinga, P., 2014. RoADS: a road pavement monitoring system for anomaly detection using smart phones. In: International Workshop on Modeling Social Media. Springer, pp. 128–146.
Vittorio, A., Rosolino, V., Teresa, I., Vittoria, C.M., Vincenzo, P.G., Francesco, D.M., 2014. Automated sensing system for monitoring of road surface quality by mobile devices. Procedia Soc. Behav. Sci. 111, 242–251.
Yu, X., Salari, E., 2011. Pavement pothole detection and severity measurement using laser imaging. In: Proc., 2011 IEEE International Conference on Electro/Information Technology (EIT). IEEE, pp. 1–5.

Chapter 15

Collaborative Positioning for Urban Intelligent Transportation Systems (ITS) and Personal Mobility (PM): Challenges and Perspectives

Vassilis Gikas*, Guenther Retscher† and Allison Kealy‡

*National Technical University of Athens, Athens, Greece; †Technical University of Vienna, Vienna, Austria; ‡RMIT University, Melbourne, VIC, Australia

Chapter Outline
1 Introduction 382
2 C-ITS in Support of the Smart Cities Concept 383
  2.1 Scientific and Policy Perspectives of Urban C-ITS 383
  2.2 Taxonomy of Urban C-ITS Applications 386
3 User Requirements for Urban C-ITS 387
  3.1 Requirements Overview 387
  3.2 Positioning Requirements and Parameters Definition 387
4 Positioning Technologies for Urban ITS 389
  4.1 Radio Frequency-Based (RF) Technologies 393
  4.2 MEMS-Based Inertial Navigation 397
  4.3 Optical Technologies 398
5 Measuring Types and Positioning Techniques 399
  5.1 Absolute Positioning Techniques 399
  5.2 Relative and Hybrid Positioning Techniques 401
6 CP for C-ITS 402
  6.1 From Single Sensor Positioning to CP 402
  6.2 Fusion Algorithms and Techniques for CP 403
7 Application Cases of Integrated Urban C-ITS 404
  7.1 Case 1: Smart-Bike Systems as a Component of Urban C-ITS 404
  7.2 Case 2: Smart Intersection for Traffic Control and Safety 406
8 Discussion, Perspectives, and Conclusions 407
References 409
Further Reading 413

Mobility Patterns, Big Data and Transport Analytics. https://doi.org/10.1016/B978-0-12-812970-8.00015-4
© 2019 Elsevier Inc. All rights reserved.


1 INTRODUCTION

In recent years, the fast growth in urban population and in the number of vehicles, together with the global trend towards megacities, has inspired new, smart mobility solutions. More sophisticated and integrated modes of transportation and environmentally friendly solutions are therefore required to accommodate the rising demands of high livability in modern cities. To this end, fast adaptation and exhaustive use of emerging information and communication technologies (ICT) is vital for enhancing intelligent transportation systems (ITS) (Barbaresso et al., 2015). Overall, four technological trends define the development framework of ITS; namely, cloud computing, big data and analytics, cybersecurity, and cooperative systems. Cloud computing facilitates automated and rapid management of the complete spectrum of data gathered within the urban mobility sector, from those collected by conventional road ITS terminals to crowd-sourced traffic information. Cloud computing results in new services that bring important benefits, reducing infrastructure costs while improving road user safety (Arutyunov, 2012). More recently, big data and analytics are being adopted gradually by the transportation sector, and are soon expected to make a profound economic and societal impact in the form of time and fuel savings, as well as carbon emission reduction. However, despite their high potential, big data solutions are still not an important part of traffic control operations (Borgi et al., 2017). Cybersecurity forms the third critical technological element of the ITS chain. As vehicles become increasingly connected via wireless networks towards ITS-5G, the potential for individuals to cause damage is ever more viable. To this effect, stakeholders and traffic operators need to enforce standards and establish partnerships to minimize evolving threats and keep road users and their data safe (Škorput et al., 2017).
Finally, cooperative systems allow vehicles and pedestrians to exchange both status and event information via reliable, high-connectivity data communication protocols, moving towards fully automated ITS. While cooperative systems rely on ICT, the development and efficient operation of an ITS service requires knowledge of the location and kinematics of the people or vehicles involved in a scenario. In effect, location awareness is a prerequisite for any ITS application in order to enable effective services to users, including faster, safer, and less costly operations. This paper focuses on the positioning and navigation aspects of urban Cooperative ITS (C-ITS). It examines the role and significance of positioning information for the development and deployment of sustainable, emerging C-ITS services. Furthermore, it discusses the positioning challenges associated with highly demanding, safety-critical ITS applications and offers future perspectives on the technologies, the techniques, and their application for C-ITS. The remainder of this paper is structured as follows. Section 2 provides an overview of C-ITS services in the smart cities concept, while Section 3 deals with the user requirements underlying C-ITS implementation. Sections 4 and 5 offer a review of the respective positioning technologies and techniques, followed by a review of collaborative positioning (CP) concepts, algorithms, and techniques. Section 7 presents key examples of emerging C-ITS applications, and finally Section 8 offers a critical discussion of the technical concerns and perspectives relating to positioning aspects for urban C-ITS.

Collaborative Positioning for Urban Intelligent Chapter 15

2 C-ITS IN SUPPORT OF THE SMART CITIES CONCEPT

2.1 Scientific and Policy Perspectives of Urban C-ITS

Thanks to recent developments in ICT and multisensor systems, ITS evolve constantly. However, while motorway ITS applications have reached a level of technological maturity, with some of them being commercially deployed by the automotive industry and gradually accepted by society, less progress has been made in ITS services for urban and public transport use cases. In fact, the dynamic character and high complexity of the megacity environment demand a level of automation and interoperability of ITS systems that, despite past efforts, is still not evident (CEN-CENELEC-ETSI, 2015; Abdel-Rahim, 2012). This is due to the high traffic volumes, the multifarious traffic conditions, and the mixture of multimodal transport means, including pedestrians, that broadly characterize urban scenarios. In response to this need, the EU Commission adopted an Action Plan for the Deployment of ITS in Europe (EU, 2008), followed by the establishment of the Expert Group on Urban ITS (EU, 2010). Also, coordinated actions have been taken between the EU and the USA on the interoperability of ITS deployment (Row and Jääskeläinen, 2012). However, despite the substantial progress achieved in various technological aspects of ITS, organizational principles to ensure interoperability, such as data management and ownership, still remain poorly developed at an international level (Vartis et al., 2016). Indisputably, the evolution of urban ITS relies on both a policy and a technological perspective. From a policy perspective, the successful deployment of future ITS rests on the integration of urban-related ITS services and sustainable urban mobility plans, such as car-sharing services, integrated public transport, and multimodal mobility.
Furthermore, the rapid increase of user-generated information, including activity/location data via smartphones, Personal Digital Assistants (PDAs), and other in-vehicle devices, will further support the future of urban mobility through large-scale data analyses and market demand studies (Antoniou et al., 2016; Gikas and Perakis, 2016). However, using such a multifaceted channel of information sources openly raises numerous legal issues, including data protection, privacy, and liability. Therefore, further work is required to establish common methodologies, data protocols, and standards for addressing data protection and security issues. Finally, new business models


and coordinated efforts on interoperability are required to ensure efficient deployment of ITS at an international level. From a technological perspective, on the other hand, the solution to efficient urban ITS services lies in Cooperative ITS (C-ITS). C-ITS is the natural evolution of overall ITS that "communicates and shares information between ITS stations to give advice or facilitate actions with the objective of improving safety, sustainability, efficiency and comfort beyond the scope of stand-alone systems" (Schade, 2012). As shown in Fig. 1, the development of C-ITS relies on the application of a number of scientific fields to the transport sector. Sensors provide diverse types of data and parameters, ranging from vehicle position and dynamics to road conditions and air quality. Communication technologies vary depending on the application type. Despite the maturity of wireless communication technologies for vehicular applications (IEEE 802.11p, European Telecommunications Standards Institute (ETSI) ITS-G5) (Rappaport et al., 2013), C-ITS still face challenges regarding communication capacity, data transmission delays, message formats, and contextual information. Regarding information systems and platforms, a major concern, discussed further in Section 8, relates to the limitations of manipulating and analyzing big data volumes in real time obtained from various information channels. Data protection, security, and human-machine interaction (HMI) aspects are also critical, ranging from liability issues to developing efficient interface systems used to inform, assist, and exchange information with drivers.

FIG. 1 Scientific aspects defining urban C-ITS: sensors, communication technologies, information systems, data protection and security, HMI, and the specific ITS application.


The origins of current C-ITS go back to the early days of connected vehicles some 15 years ago (Hartman, 2016). Today, C-ITS refers to vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) technologies that enable information exchange between vehicles and roadside infrastructure, facilitating a wide range of ITS services (Karedal et al., 2011; Jacobi et al., 2015; Mousumi and Gautam, 2015). In the urban environment, V2V and V2I technologies are collectively referred to as vehicle-to-everything (V2X) (Katriniok et al., 2017). V2X is a mesh network in which each vehicle is a node with the ability to transmit, receive, and retransmit messages to other ITS stations. Fig. 2 shows the different types of ITS stations and functionalities involved in urban C-ITS. That is, V2V communications are primarily concerned with collision-avoidance safety systems, while V2I systems focus on traffic signal timing and priority information services. Similarly, vehicle-to-pedestrian (V2P) systems aim at issuing safety alerts to pedestrians and bicyclists, whereas vehicle-to-network (V2N) services aim at providing real-time traffic, routing, and cloud services, respectively. Evidently, while much progress has been made on the technological aspects of urban C-ITS, there are still many open issues to face. Investigating the interoperability of different sensors to support efficient data fusion capabilities and integrating sensor systems with communication technologies are only two of them. Other areas in which there is a clear need for future research involve studying the effects of coupling between different sensor systems and investigating reliability and the level of adaptation to the traffic environment. Among other types of sensing technologies, positioning sensors used for navigation and geo-referencing purposes are integral to the efficient functioning of ITS and are examined in detail in the remainder of this paper.

FIG. 2 Relationship among ITS stations (V2V, V2I, V2P, V2N) present in V2X technologies for smart cities.
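The V2X mesh behavior described above — each station receiving, de-duplicating, and retransmitting messages to its radio neighbors — can be sketched in a few lines. This is an illustrative simplification in Python; the names (V2XMessage, ITSStation, hop_limit) are assumptions for the example, not terms from any ITS standard.

```python
from dataclasses import dataclass, field

@dataclass
class V2XMessage:
    msg_id: str          # unique identifier, used for de-duplication
    source: str          # originating ITS station (vehicle, roadside unit, ...)
    payload: dict        # e.g., position, speed, event type
    hop_limit: int = 3   # how many more times the message may be relayed

@dataclass
class ITSStation:
    station_id: str
    neighbors: list = field(default_factory=list)  # stations in radio range
    seen: set = field(default_factory=set)         # msg_ids already handled

    def receive(self, msg: V2XMessage) -> None:
        # Ignore duplicates so the mesh does not flood forever.
        if msg.msg_id in self.seen:
            return
        self.seen.add(msg.msg_id)
        # Relay to all neighbors while the hop budget allows it.
        if msg.hop_limit > 0:
            relayed = V2XMessage(msg.msg_id, msg.source,
                                 msg.payload, msg.hop_limit - 1)
            for n in self.neighbors:
                n.receive(relayed)
```

With three stations chained A–B–C, a message injected at A reaches C via B even though A and C are not in direct radio range, which is exactly the multi-hop property that distinguishes a V2X mesh from point-to-point V2V links.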

2.2 Taxonomy of Urban C-ITS Applications

ITS services can be classified in various ways, including by their field of development or by specific characteristics that vary with geographic region. A general classification based on their areas of application includes, as a minimum, the following categories: traffic and travel information systems, advanced vehicle safety systems, security and emergency systems, payment systems, and freight transport management. Another generic classification divides ITS applications into two categories: sustainability applications and all remaining ITS types (Row and Jääskeläinen, 2012). In this approach, safety-related and other applications are distinguished from use cases aiming to reduce energy consumption, vehicle emissions, and environmental impact in the urban environment. Other examples of classifying ITS applications are summarized in Table 1. These include the "type of support" they provide, the "driving conditions" under which the system operates with respect to a possible crash, their "human-machine interface," and the "type of criticality" with regard to certain aspects of interest (Clausen et al., 2015). For instance, relevant ITS systems concerning a possible crash are collision-avoidance and obstacle-warning systems at the precrash stage, smart restraint systems during the crash, and e-call facilities following the crash (Clausen et al., 2017). Particularly when the interest turns to the level of service a positioning terminal can offer to an ITS application, the classification shown in the last column of Table 1 becomes more relevant. For instance, positioning accuracy and integrity measures are more important for safety-critical systems (e.g., advanced driver assistance systems (ADAS)) than for security-critical systems (e.g., transportation of cash) and much less important for environment- and health-critical systems (e.g., air quality recording or transportation of dangerous goods).

TABLE 1 Ways of Classification of ITS Applications

Type of Support | Driving Conditions | HMI         | Type of Criticality
Driving task    | Normal driving     | Informative | Safety-critical
Information     | Crash system       | Warning     | Liability-critical
Monitoring      | Post-crash system  | Intervening | Security-critical
                |                    |             | Environment and health-critical
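The "type of criticality" column can be encoded directly; the sketch below maps the example applications mentioned in the text to their criticality class and flags which ones demand positioning integrity monitoring. The function and dictionary names are hypothetical, introduced only for illustration.

```python
from enum import Enum

class Criticality(Enum):
    SAFETY = "safety-critical"
    LIABILITY = "liability-critical"
    SECURITY = "security-critical"
    ENV_HEALTH = "environment and health-critical"

# Example applications from the text, mapped to their criticality class.
# Required positioning rigor decreases from safety- down to
# environment/health-critical use cases.
APP_CRITICALITY = {
    "ADAS": Criticality.SAFETY,
    "cash transport": Criticality.SECURITY,
    "air quality recording": Criticality.ENV_HEALTH,
    "dangerous goods transport": Criticality.ENV_HEALTH,
}

def needs_integrity_monitoring(app: str) -> bool:
    """Safety-critical services demand accuracy *and* integrity guarantees."""
    return APP_CRITICALITY.get(app) is Criticality.SAFETY
```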

3 USER REQUIREMENTS FOR URBAN C-ITS

3.1 Requirements Overview

User requirements for urban C-ITS can take several forms and can be analyzed at various levels of detail and rigor, ranging from relaxed requirements for noncritical applications (e.g., traffic and travel information systems) to more stringent ones for safety-critical and security-critical applications. User requirements can be organized into six generic categories following the classification of the scientific themes contributing to the performance of ITS services discussed in Section 2.1 (see Fig. 1). The relevant categories include requirements for sensors, communication technologies, information systems and platforms, HMI, and data protection and security, as well as requirements for specific use cases. Last but not least, cost is an important user requirement that can be assessed in several ways, including capital, maintenance, time, and space costs. Clearly, due to the large number of contributing factors in urban ITS, it is not always obvious how to prioritize user requirements, and therefore a unified approach can only be adopted when based on certified standards and common policies. As this paper concentrates on the positioning aspects of road C-ITS, the discussion on user requirements focuses on position-sensing technologies and data-processing requirements. Requirements concerning communication technologies and information systems also affect vehicle position and dynamics, and thus are briefly discussed in Section 3.2.

3.2 Positioning Requirements and Parameters Definition

In this study, user requirements concerning the positioning terminals found in urban C-ITS refer to the relevant performance features. Global Navigation Satellite System (GNSS) receivers and proximity sensors, used for computing longitudinal and lateral positions and inter-vehicle distances respectively, are typical cases of positioning terminals. Position availability, accuracy, and integrity are considered the most critical requirements, followed by coverage, continuity, update rate, system latency, and data output. Position availability refers to the percentage of time (measurement epochs) during which a positioning terminal is available for use at a required performance level of accuracy and/or integrity. Availability is usually affected by random factors, such as communication congestion and other failures. Accuracy should be regarded as a key driver of the performance of an ITS. It describes the conformance of an estimated (measured) position to the statistical figures of merit of position or velocity error. Accuracy is expressed for the horizontal or vertical component (horizontal position error, HPE; vertical position error, VPE), usually at the 95% confidence level. In particular, for road ITS applications a statistical characterization of HPE usually relies on the 50th, 75th, or 95th percentile of the error cumulative distribution function (CDF) (Fig. 3, left). For an accuracy measure to be meaningful, it is assumed that systematic measurement errors have been adequately modeled and sensors have been properly calibrated.

Integrity is a quality metric used to assess unambiguously the performance of a system even if high accuracy is observed, and it is thus important for safety-related applications. In particular, integrity refers to the trust a user can have in the delivered value of a computed position or velocity, and therefore corresponds to the confidence that can be placed in the output of a system. For civil aviation, integrity refers to the probability (risk) of failure of a system about which the user is not informed within a preset period of time (time-to-alarm). For road transport applications, two distinct features define integrity. First, integrity risk (IR) refers to the probability that a position or velocity error exceeds a protection level computed by the positioning terminal, which is supposed to overbound the actual error. Second, a horizontal protection level (HPL) is a statistical bound of the HPE computed so as to guarantee that the probability of the actual position error exceeding this number is smaller than or equal to the target integrity risk (Peyret, 2013; Peyret et al., 2017). Fig. 3 (right) shows the HPL as a function of HPE obtained for a kinematic GNSS data set recorded in an urban canyon, using the so-called "Stanford plot." On this plot, the bisecting line x = y divides the "safe" region from the "unsafe" one, assuming a 50th percentile HPL. The data in the bottom right part of the plot highlight a limited number of instants (eight epochs) during which the system failed to fulfill the preset target integrity risk; these data are called misleading information. Generally, defining positioning requirements for ITS is a rather complicated task. It requires thorough investigation of specific application requirements for different operating scenarios and extensive testing of sensors.

FIG. 3 Accuracy metrics realized via probability density and cumulative distribution functions of horizontal position GNSS data (left), and integrity metric representing protection level versus GNSS horizontal position error (right). (Source: Peyret, F., Bétaille, D., Engdahl, J., Gikas, V., et al., 2017. Assessment of positioning performance in ITS applications. In: COST Action TU1302-SaPPART Handbook. TMI, IFSTTAR.)
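Both metrics can be computed directly from logged test-drive data. The sketch below uses synthetic, randomly generated HPE and HPL series purely for illustration; in practice the HPE comes from comparing the terminal output against a ground-truth trajectory, and the HPL is reported by the terminal itself.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic horizontal position errors (m), standing in for a logged
# test drive compared against a ground-truth trajectory.
hpe = np.abs(rng.normal(0.0, 1.2, size=10_000))

# Accuracy: 50th/75th/95th percentiles of the HPE distribution (CDF).
p50, p75, p95 = np.percentile(hpe, [50, 75, 95])
print(f"HPE  50%: {p50:.2f} m  75%: {p75:.2f} m  95%: {p95:.2f} m")

# Integrity: the terminal also reports a horizontal protection level
# (HPL) per epoch (synthetic here).  Epochs with HPE > HPL are
# "misleading information" (MI) -- the unsafe region of a Stanford plot.
hpl = 2.0 + np.abs(rng.normal(0.0, 1.0, size=hpe.size))
mi_epochs = int(np.sum(hpe > hpl))
mi_rate = mi_epochs / hpe.size  # compare against the target integrity risk
print(f"MI epochs: {mi_epochs}  empirical integrity risk: {mi_rate:.2e}")
```

Comparing the empirical MI rate against the target integrity risk is the essence of the Stanford-plot assessment described above, reduced to a single per-epoch comparison.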
As an example, field and laboratory simulation testing and "record and replay" analysis for GNSS receiver assessment are mentioned here (Cristodaro et al., 2017). For this purpose, extensive collaboration between organizations and stakeholders is required to ensure interoperability at the development and deployment phases.

4 POSITIONING TECHNOLOGIES FOR URBAN ITS

Table 2 provides an overview of the major technologies currently applied for vehicle and pedestrian positioning in combined outdoor/indoor environments. These can be classified into four major categories based on their principle of operation: radio frequency (RF), inertial, optical, and other technologies. In the following subsections, the most important technologies used in ITS applications are examined in detail, with special emphasis placed on wireless options and low-cost inertial systems.

TABLE 2 Characteristics and Specifications of Current Localization Techniques and Systems: RF positioning systems — GPS (SPP), DGNSS, A-GNSS (assisted GNSS), pseudolites (e.g., Locata), cellular network, Wi-Fi (wireless fidelity), ZigBee, Bluetooth, UWB (ultrawideband), radio frequency identification (RFID), FM radio, and digital television — compared by signal measured (ToA, TDoA, AoA, CoO, RSSI), positioning method (proximity, lateration, hyperbolic lateration, angulation, fingerprinting), navigation information (X, Y, and optionally Z), cost (low to high), and typical accuracy, ranging from the sub-meter level (e.g., UWB) to tens of meters (e.g., cellular networks).
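Of the positioning methods listed in Table 2, lateration is the workhorse: ranges to anchor points at known coordinates are converted into a position fix. A minimal 2D sketch, assuming noise-free ranges and a linearized least-squares solution (the function name `trilaterate` is illustrative, not from any library):

```python
import numpy as np

def trilaterate(anchors, ranges):
    """2D lateration: solve for (x, y) from ranges to known anchors.

    Subtracting the first range equation from the others cancels the
    quadratic terms, leaving a linear system solved in the
    least-squares sense.
    """
    anchors = np.asarray(anchors, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (ranges[0] ** 2 - ranges[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1)
         - np.sum(anchors[0] ** 2))
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos

# Example: three roadside anchors, noise-free ranges to the point (2, 1).
anchor_xy = np.array([(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)])
truth = np.array([2.0, 1.0])
ranges = [float(np.hypot(*(truth - a))) for a in anchor_xy]
est = trilaterate(anchor_xy, ranges)
```

With four or more anchors the same least-squares form absorbs measurement noise; hyperbolic lateration (TDoA) and angulation (AoA) replace the range equations with range-difference and bearing equations but follow the same linearize-and-solve pattern.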

500 MHz) bandwidth communication. Nevertheless, today, it becomes a rapidly evolving ranging technology used for precise positioning and objecttracking applications. Position fixing is accomplished using lateration and hyperbolic lateration as well as angulation techniques (see Section 4.2.1). Compared to other radio-based technologies, such as Radio Frequency Identification (RFID) and Wi-Fi, which are subject to deep fading and inter-symbol interference, a major benefit of UWB is their immunity to multipath fading leading to low range uncertainty (